o11y-logging
JoelClaw Observability + Logging
Prevent silent failure by default. Observability is not optional polish: it is part of done.
Non-Negotiable Rules
- Use the canonical event contract only.
  - packages/system-bus/src/observability/otel-event.ts
  - packages/system-bus/src/observability/emit.ts
  - packages/system-bus/src/observability/store.ts
- Worker/Inngest code emits through emitOtelEvent or emitMeasuredOtelEvent.
- Gateway code emits through emitGatewayOtel.
- Internal ingestion goes through POST /observability/emit (packages/system-bus/src/serve.ts), not ad-hoc writes.
- Never treat console.log as primary observability. Keep structured events as source of truth.
- High-cardinality values go in metadata, not in facet fields (source, component, level, success).
- Failures must set success: false with a meaningful error.
- For warn/error/fatal, verify Convex mirror behavior (rolling window) in addition to Typesense write.
- In Inngest durable functions, any "emit once" telemetry must live inside step.run(...) to avoid replay duplication after resume.
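The replay rule can be illustrated with a small simulation. This is a sketch only: FakeStep is a hypothetical, synchronous stand-in that memoizes step results by id, which is the general idea behind durable-function replay. It is not the Inngest SDK (whose step.run is asynchronous), and emit is a stub that records actions instead of writing telemetry.

```typescript
// Hypothetical stand-in for a durable step runner: results are memoized by
// step id, so a replayed handler skips steps that already completed.
class FakeStep {
  private readonly memo = new Map<string, unknown>();

  run<T>(id: string, fn: () => T): T {
    if (this.memo.has(id)) {
      return this.memo.get(id) as T; // replay: reuse the stored result
    }
    const result = fn();
    this.memo.set(id, result);
    return result;
  }
}

// Stub emitter (assumption): records actions instead of writing telemetry.
const emitted: string[] = [];
const emit = (action: string): void => {
  emitted.push(action);
};

// The handler body re-executes from the top on every replay.
function handler(step: FakeStep): void {
  emit("outside.step"); // runs again on each replay
  step.run("emit-prereqs-passed", () => emit("inside.step")); // runs once
}

const step = new FakeStep();
for (let replay = 0; replay < 3; replay++) {
  handler(step);
}

const count = (action: string): number =>
  emitted.filter((a) => a === action).length;

console.log(count("outside.step"), count("inside.step")); // prints "3 1"
```

The emit placed directly in the handler body fires once per replay, while the one wrapped in a step fires once in total.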
Event Conventions
- source: subsystem (worker, gateway, webhook, memory, verification, etc.)
- component: stable module/service name (check-system-health, redis-channel, observe)
- action: stable dotted action (system.health.checked, events.immediate_telegram)
- metadata: request IDs, deployment IDs, function IDs, session IDs, payload identifiers
- duration_ms: include for timed operations
Use event-per-hop (wide event style): one context-rich event for each major boundary/operation, not scattered string logs.
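To make the conventions concrete, here is a sketch of a single wide event. The WideEvent type and the isDottedAction helper are illustrative assumptions inferred from the field list above; the canonical contract lives in otel-event.ts and may differ. The ids in metadata are placeholders.

```typescript
// Illustrative shape only (assumption); the real contract is defined in
// packages/system-bus/src/observability/otel-event.ts.
type Level = "debug" | "info" | "warn" | "error" | "fatal";

interface WideEvent {
  level: Level;
  source: string; // low-cardinality facet, e.g. "worker"
  component: string; // low-cardinality facet, e.g. "check-system-health"
  action: string; // stable dotted name, e.g. "system.health.checked"
  success: boolean;
  error?: string;
  duration_ms?: number;
  metadata: Record<string, unknown>; // high-cardinality context goes here
}

// Hypothetical check: two or more lowercase segments separated by dots.
const DOTTED_ACTION = /^[a-z0-9_-]+(\.[a-z0-9_-]+)+$/;
const isDottedAction = (action: string): boolean => DOTTED_ACTION.test(action);

// One context-rich event for the whole operation, not several string logs.
const event: WideEvent = {
  level: "info",
  source: "worker",
  component: "check-system-health",
  action: "system.health.checked",
  success: true,
  duration_ms: 42,
  metadata: { functionId: "fn-123", eventId: "evt-456" }, // placeholder ids
};

console.log(isDottedAction(event.action)); // prints "true"
console.log(isDottedAction("Health check done")); // prints "false"
```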
Implementation Workflow
- Identify the boundary being changed.
  - Inngest function, gateway channel, webhook route, API route, background job, sync step.
- Add success and failure envelopes.
  - Start + completion for long tasks, or a single completion event for short tasks.
- Include operational and business context in metadata.
  - Example: function id, event id, provider, queue depth, affected resource id.
- Keep severity useful.
  - debug/info for normal activity, warn for degraded but recoverable, error/fatal for failures.
- Run verification gates before finishing.
For full checklists and command recipes, read references/implementation-checklist.md.
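The success and failure envelopes from the workflow above can be sketched as one wrapper around a short task. Everything named here (withEnvelope, Envelope, Emit, Base, and the provider value) is hypothetical and exists only for illustration; in the real codebase emitMeasuredOtelEvent plays this role, and its actual behavior may differ.

```typescript
// Hypothetical types for illustration; not the real emit.ts API.
interface Envelope {
  level: "info" | "error";
  source: string;
  component: string;
  action: string;
  success: boolean;
  error?: string;
  duration_ms: number;
  metadata: Record<string, unknown>;
}

type Emit = (event: Envelope) => void;
type Base = Pick<Envelope, "source" | "component" | "action" | "metadata">;

// Runs `op`, then emits exactly one completion event: success or failure.
async function withEnvelope<T>(
  emit: Emit,
  base: Base,
  op: () => Promise<T>,
): Promise<T> {
  const started = Date.now();
  try {
    const result = await op();
    emit({ ...base, level: "info", success: true, duration_ms: Date.now() - started });
    return result;
  } catch (err) {
    emit({
      ...base,
      level: "error",
      success: false,
      // Keep the error string meaningful rather than a bare "failed".
      error: err instanceof Error ? err.message : String(err),
      duration_ms: Date.now() - started,
    });
    throw err; // the caller still sees the failure
  }
}

// Demo with an in-memory emitter standing in for the real one.
const seen: Envelope[] = [];
const base: Base = {
  source: "worker",
  component: "content-sync",
  action: "content_sync.run",
  metadata: { provider: "example" }, // placeholder context
};

async function demo(): Promise<void> {
  await withEnvelope((e) => seen.push(e), base, async () => "ok");
  await withEnvelope((e) => seen.push(e), base, async () => {
    throw new Error("provider_timeout");
  }).catch(() => undefined); // swallowed for the demo only
}

// Logs the two success flags: true, then false.
const done = demo().then(() => console.log(seen.map((e) => e.success)));
```

Rethrowing after the failure emit keeps the event and the caller's view of the outcome consistent.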
Quick Patterns
Worker / Inngest timed operation
import { emitMeasuredOtelEvent } from "../../observability/emit";

await emitMeasuredOtelEvent(
  {
    level: "info",
    source: "worker",
    component: "content-sync",
    action: "content_sync.run",
    metadata: { trigger: event.name },
  },
  async () => {
    await runSync();
  }
);
Gateway emission
import { emitGatewayOtel } from "../observability";

await emitGatewayOtel({
  level: "error",
  component: "redis-channel",
  action: "events.immediate_telegram",
  success: false,
  error: "telegram_send_failed",
  metadata: { sessionId, queueDepth },
});
Definition of Done
- Structured OTEL events added for the changed path.
- No direct feature-level writes to Typesense/Convex for observability data.
- Smoke probe passes (scripts/otel-smoke.sh).
- joelclaw otel list and joelclaw otel stats show expected behavior.
- New failure modes are queryable by source, component, and action.
Inngest Replay + Hang Triage
Use this when step code appears to run but runs remain RUNNING/CANCELLED with Finalization errors.
- Inspect run trace first.
joelclaw run <run-id>
Look for errors.Finalization.stack containing "Unable to reach SDK URL".
- Confirm whether this is true network reachability or worker-side blocking.
joelclaw inngest status
joelclaw logs worker --lines 200
joelclaw logs errors --lines 200
- Check for replay-noise in OTEL.
If an action that should emit once (for example manifest.archive.prereqs-passed) appears hundreds of times in one run window, move that emit into its own step.run.
joelclaw otel search "manifest.archive.prereqs-passed" --hours 1
- Treat "Unable to reach SDK URL" as an ambiguous symptom.
It can indicate ingress problems, but in practice it can also happen when a function handler blocks on local IO/dependencies long enough that finalization cannot complete.
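The replay-noise check above amounts to counting how often an emit-once action appears within one run. A minimal sketch, assuming the events have already been loaded into memory and each carries a run id in its metadata. The EmittedEvent shape, the runId field name, the threshold, and the second sample action are assumptions for illustration, not the output format of joelclaw otel search.

```typescript
// Assumed minimal shape; real events carry more fields.
interface EmittedEvent {
  action: string;
  metadata: { runId: string };
}

// Returns "action@runId" keys that were seen more than `maxPerRun` times.
function findReplayNoise(events: EmittedEvent[], maxPerRun = 1): string[] {
  const counts = new Map<string, number>();
  for (const e of events) {
    const key = `${e.action}@${e.metadata.runId}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .filter(([, n]) => n > maxPerRun)
    .map(([key]) => key);
}

const sample: EmittedEvent[] = [
  // 200 copies of an action that should have been emitted once.
  ...Array.from({ length: 200 }, () => ({
    action: "manifest.archive.prereqs-passed",
    metadata: { runId: "run-a" },
  })),
  // Hypothetical well-behaved action, emitted once in the same run.
  { action: "manifest.archive.completed", metadata: { runId: "run-a" } },
];

console.log(findReplayNoise(sample));
// prints [ 'manifest.archive.prereqs-passed@run-a' ]
```

Any key this returns is a candidate for moving its emit into its own step.run.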
Helper Script
Use scripts/otel-smoke.sh for a fast end-to-end probe:
./skills/o11y-logging/scripts/otel-smoke.sh verification o11y-skill probe.emit
Key Files
- packages/system-bus/src/observability/otel-event.ts
- packages/system-bus/src/observability/emit.ts
- packages/system-bus/src/observability/store.ts
- packages/system-bus/src/serve.ts
- packages/gateway/src/observability.ts
- packages/system-bus/src/inngest/functions/check-system-health.ts
- packages/cli/src/commands/otel.ts
- apps/web/app/api/otel/route.ts