system-architecture
System Architecture (Canonical Topology)
This skill is the single source of truth for joelclaw system wiring. Use it for:
- "why did this run / not run"
- "which worker handles this function"
- "what is listening on port X"
- "how does event Y flow"
- full-stack routing/debug across CLI → Inngest → workers → gateway → telemetry
Ground-Truth Scope + Evidence Snapshot
This document is grounded in direct reads of:
- `apps/docs-api/src/index.ts`
- `packages/restate/Dockerfile`
- `packages/restate/src/index.ts`
- `packages/restate/src/workflows/dag-orchestrator.ts`
- `packages/agent-execution/src/microvm.ts`
- `packages/system-bus/src/serve.ts`
- `packages/system-bus/src/inngest/functions/index.host.ts`
- `packages/system-bus/src/inngest/functions/index.cluster.ts`
- `packages/system-bus/src/inngest/client.ts`
- `infra/worker-supervisor/src/main.rs`
- `~/Library/LaunchAgents/com.joel*.plist`
- `k8s/*` (all files)
- `infra/pds/values.yaml`
- `packages/gateway/src/daemon.ts`
- `packages/gateway/src/channels/*.ts`
- `~/.joelclaw/gateway/AGENTS.md`
- `~/.joelclaw/gateway/.pi/settings.json`
- `~/.local/caddy/Caddyfile`
- `~/.colima/default/colima.yaml` + `colima status --json`
- `packages/cli/src/cli.ts`, `packages/cli/src/config.ts`, `packages/cli/src/inngest.ts`
- `packages/system-bus/src/observability/*` (key files: `emit.ts`, `otel-event.ts`, `store.ts`)
- `packages/telemetry/src/emitter.ts`
- `packages/system-bus/src/lib/langfuse.ts`
- `packages/inference-router/src/tracing.ts`
- ADRs in `~/Vault/docs/decisions/` (required + topology-adjacent)
- last 50 lines of `~/Vault/system/system-log.jsonl`
Related docs verified
- `docs/architecture.md` — Restate/Firecracker runtime + workload execution flow
- `docs/deploy.md` — Restate worker deploy + auth/identity/PVC procedures
- `docs/cli.md` — workload command tree + runtime bridge
- `docs/observability.md` — not inspected in this update
1) Physical Topology
```
Mac Mini "Panda" (host macOS)
├─ launchd services (gateway, worker supervisor, caddy, talon, agent-mail, etc.)
├─ Colima VM (driver: VZ, arch: aarch64, runtime: docker, VM IP: 192.168.64.2)
│  └─ Talos node: joelclaw-controlplane-1 (k8s v1.35.0, internal IP 10.5.0.2)
│     ├─ namespace: joelclaw
│     │  ├─ inngest (StatefulSet + NodePort 8288/8289)
│     │  ├─ redis (StatefulSet + NodePort 6379)
│     │  ├─ typesense (StatefulSet + ClusterIP 8108)
│     │  ├─ restate (StatefulSet + NodePort 8080/9070/9071)
│     │  ├─ system-bus-worker (Deployment + ClusterIP 3111)
│     │  ├─ restate-worker (Deployment + ClusterIP 9080; full agent image + Firecracker)
│     │  ├─ dkron (StatefulSet + ClusterIP 8080)
│     │  ├─ docs-api (Deployment + NodePort 3838)
│     │  ├─ livekit-server (Deployment + NodePort 7880/7881)
│     │  ├─ bluesky-pds (Deployment + NodePort 3000)
│     │  └─ minio (StatefulSet + NodePort 30900/30901)
│     └─ namespace: aistor
│        ├─ aistor operator (Deployments: adminjob-operator, object-store-operator)
│        └─ aistor-s3 object store (StatefulSet + NodePort 31000/31001)
├─ Caddy reverse proxy (tailnet HTTPS fan-in)
├─ Gateway daemon (embedded pi session)
├─ Firecracker substrate (requires Colima nestedVirtualization=true for /dev/kvm; OFF by default — unstable under load)
└─ NAS "three-body" (NFS tiers per ADR-0088)
```
Known runtime endpoints
- Colima VM IP: `192.168.64.2` (`colima status --json`)
- Kubernetes API (local forward): `https://127.0.0.1:64784` (`kubectl cluster-info`)
- Tailnet hostnames seen in config: `panda.tail7af24.ts.net` (Caddy routes), `pds.panda.tail7af24.ts.net` (PDS values)
Tailscale mesh state
`tailscale status --json` failed in this environment: UNKNOWN — needs manual verification
2) Process Inventory (Long-Running)
Host launchd inventory (snapshot)
Snapshot source: `launchctl print gui/$(id -u)/<label>` and plist inspection.
| Launchd label | State | PID (snapshot) | Role | Ports / endpoints |
|---|---|---|---|---|
| `com.joel.system-bus-worker` | running | 75292 | Host worker supervisor (`worker-supervisor`) | supervises child bun on 3111 |
| `com.joel.restate-worker` | retired / rollback-only | — | Historical host Restate wrapper (`scripts/restate/start.sh`) | superseded by `deployment/restate-worker` on 9080 |
| `com.joel.gateway` | running | 81275 | Gateway daemon (`packages/gateway/src/daemon.ts`) | WS :3018, Redis bridge |
| `com.joel.caddy` | running | 9347 | Reverse proxy | 3443, 5443, 6443, 7443, 8290, 8443, 9443 |
| `com.joel.talon` | running | 96359 | Infra watchdog | health 127.0.0.1:9999 |
| `com.joel.agent-secrets` | running | 98048 | Secret lease daemon | no public port |
| `com.joel.imsg-rpc` | running | 61110 | iMessage JSON-RPC socket daemon | Unix socket `/tmp/imsg.sock` |
| `com.joel.typesense-portforward` | running | 32095 | `kubectl port-forward svc/typesense 8108:8108` | local 8108 |
| `com.joel.voice-agent` | running | 71887 | voice agent runtime | local 8081 |
| `com.joel.local-sandbox-janitor` | scheduled | (launchd timer) | ADR-0221 local sandbox janitor (`scripts/local-sandbox-janitor.sh` → `joelclaw workload sandboxes janitor`) | logs in `/tmp/joelclaw/local-sandbox-janitor.{log,err}` |
| `com.joelclaw.agent-mail` | spawn scheduled | (none in launchctl snapshot) | agent-mail MCP HTTP service | observed listener 127.0.0.1:8765 (python process) |
| `com.joel.colima` | not running | — | startup helper for Colima | n/a |
| `com.joel.k8s-reboot-heal` | not running | — | periodic k8s heal script | n/a |
| `com.joel.system-bus-sync` | not running | — | sync guard watcher | n/a |
| `com.joel.gateway-tripwire` | not running | — | gateway tripwire script | n/a |
| `com.joel.content-sync-watcher` | not running | — | fs watch -> `content/updated` event | n/a |
| `com.joel.vault-log-sync` | not running | — | Vault log sync watcher | n/a |
Process supervision behavior: worker-supervisor
Source: `infra/worker-supervisor/src/main.rs`
- Default config:
  - worker dir: `~/Code/joelhooks/joelclaw/packages/system-bus`
  - command: `bun run src/serve.ts`
  - port: `3111`
  - health endpoint: `/api/inngest`
  - sync endpoint: `/api/inngest` (PUT)
  - health interval: 30s
  - restart after 3 consecutive health failures
  - restart backoff: 1s → 30s max (policy sketched below)
- Pre-start kills any stale process on port 3111.
- Runs a host import preflight before spawn: `bun --eval "await import('./src/inngest/functions/index.host.ts');"`
  - on failure, skips the spawn and retries with exponential backoff
- Loads env from `~/.config/system-bus.env` plus leased secrets.
- Forces `WORKER_ROLE=host` for the supervised host worker.
- Emits OTEL events via the CLI on supervisor failures/restarts: `worker.supervisor.preflight.failed`, `worker.supervisor.worker_exit`, `worker.supervisor.health_check.restart`
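For orientation, a minimal sketch of that health/restart policy translated to TypeScript. The real implementation is the Rust binary in `infra/worker-supervisor/src/main.rs`; `restartWorker` here is a hypothetical stand-in for the respawn path.

```typescript
// Sketch only: mirrors the documented policy (30s probe, 3 failures,
// 1s -> 30s exponential backoff), not the actual Rust implementation.
const HEALTH_INTERVAL_MS = 30_000;
const MAX_FAILURES = 3;

let failures = 0;
let backoffMs = 1_000;

async function restartWorker(): Promise<void> {
  // Hypothetical: kill the stale process on 3111, respawn `bun run src/serve.ts`,
  // then re-sync with PUT /api/inngest.
}

setInterval(async () => {
  const healthy = await fetch("http://localhost:3111/api/inngest")
    .then((r) => r.ok)
    .catch(() => false);

  if (healthy) {
    failures = 0;
    backoffMs = 1_000; // reset backoff once the worker is healthy again
    return;
  }
  if (++failures >= MAX_FAILURES) {
    failures = 0;
    await new Promise((r) => setTimeout(r, backoffMs));
    backoffMs = Math.min(backoffMs * 2, 30_000); // 1s -> 30s cap
    await restartWorker();
  }
}, HEALTH_INTERVAL_MS);
```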
Worker supervision split note
- Talon is running (`com.joel.talon`), but the host worker is still launched via `com.joel.system-bus-worker` -> `worker-supervisor`.
- ADR + system-log indicate Talon can defer worker supervision during coexistence.
Kubernetes process inventory
Node: `joelclaw-controlplane-1` (Talos v1.12.4, k8s v1.35.0, internal IP `10.5.0.2`)
Core services
| Service | Workload kind | Service type | Service port(s) | NodePort(s) / exposure | Role |
|---|---|---|---|---|---|
| Inngest | StatefulSet `inngest` | NodePort (`inngest-svc`) | 8288, 8289 | 8288, 8289 | Event API + connect ws |
| Redis | StatefulSet `redis` | NodePort | 6379 | 6379 | Queue/state/pubsub |
| Typesense | StatefulSet `typesense` | ClusterIP | 8108 | host via launchd port-forward 8108 | Search + telemetry store |
| Restate | StatefulSet `restate` | NodePort | 8080, 9070, 9071 | 8080, 9070, 9071 | Durable workflow ingress + admin + metrics |
| system-bus-worker | Deployment | ClusterIP | 3111 | in-cluster only | Cluster-role worker (12 functions) |
| restate-worker | Deployment | ClusterIP | 9080 | in-cluster only | dagOrchestrator + dagWorker + queue drainer in full agent image |
| docs-api | Deployment | NodePort | 3838 | 3838 | PDF/docs API + agentic search + taxonomy graph |
| dkron | StatefulSet | ClusterIP (`dkron-svc`) + headless peer svc (`dkron-peer`) | 8080, 8946, 6868 | in-cluster only; operator access via short-lived CLI-managed tunnel | Distributed cron scheduler for Restate pipelines |
| livekit-server | Deployment (Helm) | NodePort | 80, 7881 | 7880 (for svc port 80), 7881 | LiveKit signaling + RTC TCP |
| bluesky-pds | Deployment (Helm-managed) | NodePort | 3000 | 3000 | AT Proto PDS |
| minio | StatefulSet | ClusterIP + NodePort | 9000, 9001 | 30900, 30901 | Legacy local S3-compatible runtime |
| `aistor-s3-api` (aistor ns) | NodePort service (operator-managed) | NodePort | 443, 9000 | 31000 (+ dynamic management NodePort) | AIStor S3 API (TLS + management) |
| `aistor-s3-console` (aistor ns) | NodePort service (operator-managed) | NodePort | 9443 | 31001 | AIStor web console |
Restate / Firecracker runtime note
- `deployment/restate-worker` is the current durable execution worker. The image bundles Bun + Node + `pi` + `codex`, the full repo checkout, and 76 symlinked skills.
- Runtime auth/identity come from `secret/pi-auth` and `configmap/agent-identity`, which recreate `/root/.pi/agent/auth.json` plus the joelclaw identity chain inside the pod.
- Firecracker is enabled in-pod via privileged access to `/dev/kvm` on Colima VZ. The `/dev/kvm` hostPath mount uses type `""` (optional), so the pod starts without it when nestedVirtualization is off.
- Persistent microVM assets live on PVC `firecracker-images`, mounted at `/tmp/firecracker-test` for kernel, rootfs, and snapshot files.
- Retry caps (2026-03-17): dagWorker `maxAttempts=5`, dagOrchestrator `maxAttempts=3`. These prevent Restate journal poisoning from infinite retries after code changes or infrastructure failures.
- Colima stability: nestedVirtualization is OFF by default (it crashes the VM under Docker build load). Toggle it ON only for Firecracker testing sessions, then toggle it OFF. See the k8s skill for recovery procedures.
Control-plane access
- kube API exposed locally at `127.0.0.1:64784` (forwarded)
- additional forwarded control ports observed: `64785`, `9627` (exact ownership mapping UNKNOWN — needs manual verification)
3) Worker Architecture (Role Split + Registration)
Source files:
- `packages/system-bus/src/serve.ts`
- `packages/system-bus/src/inngest/functions/index.host.ts`
- `packages/system-bus/src/inngest/functions/index.cluster.ts`
- `packages/system-bus/src/inngest/client.ts`
Role model
- `WORKER_ROLE` is parsed as `host` (default) or `cluster`.
- The registered function set is role-dependent:
  - host uses `hostFunctionDefinitions`
  - cluster uses `clusterFunctionDefinitions`
Ground-truth counts
- Host function set: 101
- Cluster function set: 12
- Cluster subset functions: `approvalRequest`, `approvalResolve`, `todoistCommentAdded`, `todoistTaskCompleted`, `todoistTaskCreated`, `frontMessageReceived`, `frontMessageSent`, `frontAssigneeChanged`, `todoistMemoryReviewBridge`, `githubWorkflowRunCompleted`, `githubPackagePublished`, `webhookSubscriptionDispatchGithubWorkflowRunCompleted`
App registration isolation
From `inngest/client.ts`:
- app id resolves to `system-bus-host` when the role is host, `system-bus-cluster` when the role is cluster
- an explicit `INNGEST_APP_ID` overrides the role-derived id
This prevents host and cluster workers from overwriting each other’s function graphs.
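A sketch of that derivation (behavior as described above; the real handling in `client.ts` may differ in detail):

```typescript
// WORKER_ROLE defaults to "host"; INNGEST_APP_ID wins when set explicitly.
const role = process.env.WORKER_ROLE === "cluster" ? "cluster" : "host";
const appId = process.env.INNGEST_APP_ID ?? `system-bus-${role}`;
// -> "system-bus-host" or "system-bus-cluster"
```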
serveHost behavior
From `serve.ts`:
- host role default `serveHost`: `http://host.docker.internal:3111`
- cluster role default `serveHost`: unset (connect-mode default)
- `INNGEST_SERVE_HOST` overrides either role.

The Kubernetes cluster worker manifest sets:
- `INNGEST_BASE_URL=http://inngest-svc:8288`
- `INNGEST_SERVE_HOST=http://system-bus-worker:3111`
Registration mechanics
- The worker exposes `GET|POST|PUT /api/inngest`.
- The worker sends a delayed self-sync `PUT /api/inngest` ~5s after startup.
- `worker-supervisor` also performs a startup PUT sync.
Host is primary today
From index comments + function lists:
- ADR-0089 transition: host remains authoritative for broad function ownership.
- Cluster is intentionally limited to cluster-safe subset (12 functions).
4) Event Flow (CLI → Inngest → Worker → Completion)
Canonical flow: `joelclaw send`
- CLI `joelclaw send <event>` calls `Inngest.send()`.
- `Inngest.send()` POSTs event JSON to `${INNGEST_URL}/e/${INNGEST_EVENT_KEY}` (default: `http://localhost:8288/e/<key>`; see the sketch below).
- The Inngest server persists the event and resolves matching function triggers.
- Inngest dispatches function steps to the worker app graph that owns that function ID:
  - host app (`system-bus-host`) for the 101-function host set
  - cluster app (`system-bus-cluster`) for the 12-function cluster subset
- The worker handles callbacks via `/api/inngest` (Hono + `inngest/hono` handler).
- Each `step.run` result is memoized by Inngest; the next step executes when the prior one completes.
- Completion/failure is queryable via GraphQL (`/v0/gql`) and CLI commands (`runs`, `run`, `event`, `events`).
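The `Inngest.send()` hop is an ordinary HTTP POST. A minimal sketch of that wire call, with a hypothetical event name and payload:

```typescript
// Hypothetical event; mirrors what Inngest.send() POSTs to the Event API.
const INNGEST_URL = process.env.INNGEST_URL ?? "http://localhost:8288";
const eventKey = process.env.INNGEST_EVENT_KEY ?? "local";

const res = await fetch(`${INNGEST_URL}/e/${eventKey}`, {
  method: "POST",
  headers: { "content-type": "application/json" },
  body: JSON.stringify({
    name: "content/updated",          // resolved against function triggers
    data: { path: "posts/hello.md" }, // arbitrary JSON payload
  }),
});
if (!res.ok) throw new Error(`event rejected: ${res.status}`);
```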
Queue flow: `joelclaw queue emit` → Restate drainer → durable dispatch
- CLI `joelclaw queue emit <event>` persists a `QueueEventEnvelope` into the Redis stream `joelclaw:queue:events` and indexes it in the sorted set `joelclaw:queue:priority`.
- The `restate-worker` k8s deployment (`packages/restate/src/index.ts`) starts a deterministic queue drainer beside the channel callback listener.
- On startup, the drainer claims pending + never-delivered entries via `@joelclaw/queue#getUnacked()`, reindexes replayable entries, and emits OTEL replay evidence.
- Each drain tick selects the next priority candidate from the sorted set, resolves its static registry target from `packages/queue/src/registry.ts`, and POSTs a one-node DAG request to Restate `/dagOrchestrator/{workflowId}/run/send` (sketched below).
- When backlog remains and a dispatch slot frees, the drainer self-pulses immediately instead of waiting for the next `QUEUE_DRAIN_INTERVAL_MS` heartbeat. That interval is the idle poll / retry cadence, not a mandatory 2-second tax between successful sends.
- The current Story-3 bridge re-emits the queue item to its registered Inngest event target inside that one-node DAG request. This is deliberate: the deterministic queue/drainer is proven first; per-family Restate cutovers remain Story 4 work.
- On an accepted Restate dispatch, the drainer acks the queue message; on failure it leaves the message in Redis, applies a retry cooldown, and emits `queue.dispatch.failed` OTEL evidence.
- If backlog remains in Redis but the drainer stops making progress past `QUEUE_DRAIN_STALL_AFTER_MS`, it emits `queue.drainer.stalled` and exits non-zero so k8s restarts `deployment/restate-worker`. That is the self-heal path for a wedged drainer inside an otherwise-running Bun process.
- Crash recovery comes from the Redis stream + consumer-group replay path, not from vibes: restart the `restate-worker` pod, let `getUnacked()` reclaim the in-flight entries, and draining resumes.
Workload flow: `joelclaw workload run` → Redis → Restate DAG → execution
- `joelclaw workload plan ... --stages-from <file>` can load an explicit stage DAG, validate unknown deps/self-deps/duplicates/cycles, and preserve per-stage acceptance gates.
- `joelclaw workload run <plan-artifact>` normalizes the selected stage into the canonical `workload/requested` runtime request.
- Queue admission writes the request into Redis, where the deterministic drainer forwards it into Restate as a `dagOrchestrator/{workflowId}/run/send` request.
- `dagOrchestrator` executes dependency waves: ready nodes in parallel, chained nodes only after every `dependsOn` node has terminal output.
- `dagWorker` executes the node handler:
  - `shell` → subprocess work inside the `restate-worker` pod
  - `infer` → `pi -p --no-session --no-extensions` inside the pod, using the mounted auth + identity + skill set
  - `microvm` → Firecracker boot/restore through `/dev/kvm` with kernel/rootfs/snapshot files on PVC `firecracker-images`
- Each node emits OTEL (`dag.node.*`), and the workflow emits `dag.workflow.*`, so queue → Restate → execution remains observable.
- Current truthful limit: the microVM runtime boots and restores snapshots in-cluster, but the broader exec-in-VM workspace drive protocol is still incomplete for general coding slices.
Webhook flow
- An external service posts to `/webhooks/:provider`.
- Caddy routes `/webhooks/*` on `localhost:8443` to the worker at `localhost:3111`.
- `webhookApp` verifies the signature, normalizes the payload, and emits Inngest events (`provider/event`).
- Inngest executes subscribed functions.
"Why did this run / not run" trace recipe
- `joelclaw send <event> -d '<payload>'`
- `joelclaw events --prefix <event-prefix> --hours 1`
- `joelclaw event <event-id>` (fan-out to function runs)
- `joelclaw run <run-id>` (step trace + errors)
- `joelclaw runs --count 20 --hours 1`
- `joelclaw otel search "<component/action>" --hours 1`
- Validate function ownership in `index.host.ts` / `index.cluster.ts`.
5) Port Map (Canonical)
Exposure sources: k8s service manifests, Caddyfile, `kubectl get svc`, `lsof` listeners.
| Port | Listener / owner | What it is | Exposure path |
|---|---|---|---|
| 3111 | host bun worker | host system-bus worker HTTP (`/`, `/api/inngest`, `/webhooks`, `/observability/emit`) | local host; proxied via Caddy 3443 + webhook path via 8443 |
| 8080 | ssh forward (Colima) -> restate | Restate ingress / workflow API | NodePort + host forward |
| 8288 | ssh forward (Colima) -> Inngest svc | Inngest API + dashboard backend | NodePort + host forward; proxied via Caddy 9443 |
| 8289 | ssh forward (Colima) -> Inngest ws | Inngest connect websocket | NodePort + host forward; proxied via Caddy 8290 |
| 6379 | ssh forward (Colima) -> Redis | Redis | NodePort + host forward |
| 8108 | ssh forward / kubectl port-forward | Typesense API | ClusterIP; exposed locally by port-forward |
| 9070 | ssh forward (Colima) -> restate | Restate admin API | NodePort + host forward |
| 9071 | ssh forward (Colima) -> restate | Restate metrics | NodePort + host forward |
| 9080 | k8s `restate-worker` service | Restate worker HTTP (dagOrchestrator, dagWorker, queue drainer) | ClusterIP only |
| random high local port | transient kubectl port-forward (CLI-managed) -> `svc/dkron-svc:8080` | Dkron HTTP API | ClusterIP only; short-lived operator tunnel |
| 3838 | ssh forward (Colima) -> docs-api | docs-api HTTP | NodePort + host forward; proxied via Caddy 5443 |
| 7880 | ssh forward (Colima) -> livekit-server | LiveKit signaling | NodePort 7880; proxied via Caddy 7443 |
| 7881 | ssh forward (Colima) -> livekit-server | LiveKit RTC TCP | NodePort 7881 |
| 3000 | k8s bluesky-pds NodePort | Bluesky PDS HTTP | NodePort 3000 |
| 30900 | k8s minio-nodeport | Legacy MinIO S3 API (HTTP) | NodePort 30900 |
| 30901 | k8s minio-nodeport | Legacy MinIO console (HTTP) | NodePort 30901 |
| 31000 | k8s `aistor-s3-api` (aistor ns) | AIStor S3 API (TLS) | NodePort 31000 |
| 31001 | k8s `aistor-s3-console` (aistor ns) | AIStor console (TLS) | NodePort 31001 |
| 3443 | Caddy | HTTPS reverse proxy to `localhost:3111` | tailnet HTTPS |
| 5443 | Caddy | HTTPS reverse proxy to `localhost:3838` | tailnet HTTPS |
| 7443 | Caddy | HTTPS reverse proxy to `localhost:7880` | tailnet HTTPS |
| 9443 | Caddy | HTTPS reverse proxy to `localhost:8288` | tailnet HTTPS |
| 8290 | Caddy | HTTPS reverse proxy to `localhost:8289` | tailnet HTTPS |
| 8443 | Caddy (HTTP) | webhook/public ingress router | expected Funnel target |
| 6443 | Caddy | reverse proxy to local 6333 (Qdrant) | tailnet HTTPS |
| 3018 | gateway daemon | gateway websocket stream port | local |
| 9999 | talon | Talon health endpoint | local 127.0.0.1 |
| 8765 | agent-mail HTTP service | MCP agent-mail API | local 127.0.0.1 |
| 64784 | ssh forward | Kubernetes API | local kubectl endpoint |
Notes
- Host NodePort exposure appears through an `ssh` listener process (Colima `portForwarder=ssh`).
- The exact per-port ssh forward command line is UNKNOWN — needs manual verification (process introspection restricted in this environment).
6) Storage Topology
Redis
- Runtime: k8s StatefulSet (`redis:7-alpine`, appendonly enabled).
- Primary uses:
  - gateway queue/session keys (`joelclaw:events:*`, `joelclaw:notify:*`, `joelclaw:gateway:sessions`)
  - webhook subscriptions (`joelclaw:webhook:*`)
  - gateway health mute/streak keys (`gateway:health:*`)
Typesense
From observability code:
- `otel_events` collection (canonical telemetry event store)
- `memory_observations` collection (vector-aware memory index; schema validated at startup)
- docs-api also points at `http://typesense:8108` for docs search/index surfaces.
Firecracker runtime storage
- PVC: `firecracker-images`
- Mounted in `deployment/restate-worker` at `/tmp/firecracker-test`
- Stores:
  - kernel (`vmlinux`)
  - rootfs (`agent-rootfs.ext4`)
  - snapshots (`snapshots/vm.snap`, `snapshots/vm.mem`)
- Firecracker snapshot restore is currently operator-proven at ~9ms on the Colima VZ nested-virt path.
Inngest state
- StatefulSet PVC mounted at `/data`
- `INNGEST_SQLITE_DIR=/data`
docs-api surface
- Deployment: `docs-api` on NodePort `3838`
- Route count: 11 endpoints including `/health`
- Key routes:
  - `GET /search` — hybrid chunk search with `concept`, `concepts`, `doc_id`, `expand`, and `assemble` (example below)
  - `GET /docs/search`
  - `GET /docs`
  - `GET /docs/:id`
  - `GET /docs/:id/toc`
  - `GET /docs/:id/chunks`
  - `GET /chunks/:id`
  - `GET /concepts`
  - `GET /concepts/:id`
  - `GET /concepts/:id/docs`
- Taxonomy surface: 21-concept SKOS graph (10 parents + 11 sub-concepts) with `broader`, `narrower`, and `related` edges.
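A hedged example of hitting the hybrid search route. The routes and filter parameter names come from the list above; the `q` parameter name and all values are assumptions.

```typescript
// Hypothetical query; docs-api listens on NodePort 3838.
const base = "http://localhost:3838";
const params = new URLSearchParams({
  q: "durable execution", // assumed query-string name for the search text
  concept: "restate",     // filter to a taxonomy concept
  expand: "true",         // pull neighboring chunks
  assemble: "true",       // assemble chunks into a readable passage
});
const results = await fetch(`${base}/search?${params}`).then((r) => r.json());
```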
NAS (ADR-0088 + ADR-0187)
Tiering policy:
- Tier 1: local SSD (hot runtime state)
- Tier 2: NAS NVMe (`/Volumes/nas-nvme` ↔ `/volume2/data`)
- Tier 3: NAS HDD (`/Volumes/three-body`)
Degradation contract (ADR-0187):
- writes must fall back `local -> remote -> queued` (sketched below)
- queue spool default: `/tmp/joelclaw/nas-queue`
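A minimal sketch of that contract. Only the fallback order and the spool default come from this doc; the Tier-1 local path and helper naming are hypothetical.

```typescript
import { mkdir, writeFile } from "node:fs/promises";
import { dirname, join } from "node:path";

// ADR-0187 order: local -> remote -> queued.
async function writeWithFallback(relPath: string, data: Uint8Array) {
  const tiers = [
    join("/usr/local/var/joelclaw", relPath), // hypothetical Tier-1 local SSD path
    join("/Volumes/nas-nvme", relPath),       // Tier-2 NAS NVMe
  ];
  for (const target of tiers) {
    try {
      await writeFile(target, data);
      return { target, queued: false };
    } catch {
      // tier unavailable; degrade to the next one
    }
  }
  // Final fallback: spool for later replay (queue spool default from above).
  const spooled = join("/tmp/joelclaw/nas-queue", relPath);
  await mkdir(dirname(spooled), { recursive: true });
  await writeFile(spooled, data);
  return { target: spooled, queued: true };
}
```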
Vault
- Obsidian vault at `/Users/joel/Vault`
- system log file: `/Users/joel/Vault/system/system-log.jsonl`
7) Networking Topology
Caddy reverse proxy routes (from `~/.local/caddy/Caddyfile`)
- `https://panda.tail7af24.ts.net:9443` -> `localhost:8288` (Inngest)
- `https://panda.tail7af24.ts.net:8290` -> `localhost:8289` (Inngest connect)
- `https://panda.tail7af24.ts.net:3443` -> `localhost:3111` (worker)
- `https://panda.tail7af24.ts.net:5443` -> `localhost:3838` (docs-api)
- `https://panda.tail7af24.ts.net:7443` -> `localhost:7880` (LiveKit)
- `https://panda.tail7af24.ts.net:6443` -> `localhost:6333` (Qdrant)
- `http://localhost:8443` path router: `/webhooks/*` -> `localhost:3111`; fallback -> `localhost:8288`
Tailscale + Funnel
- Config comments and ADR-0051 describe the Funnel path `:443 -> localhost:8443`.
- Runtime `tailscale status` is unavailable here: UNKNOWN — needs manual verification.
External webhook ingress
Expected path:
- Internet provider -> Tailscale Funnel `:443`
- Funnel -> local `:8443`
- Caddy path route `/webhooks/*` -> worker `:3111`
- worker `/webhooks/:provider` verifies + emits Inngest event
8) CLI Wiring (Command Tree → Endpoint Surface)
Primary command tree root: `packages/cli/src/cli.ts`.
Endpoint map by command family
| Command family | Primary backend |
|---|---|
| `send` | Inngest Event API `POST /e/<event-key>` |
| `runs`, `run`, `functions`, `event`, `events` | Inngest GraphQL `POST /v0/gql` |
| `status` | Inngest/worker health probes + k8s checks + agent-mail liveness |
| `gateway *` | Redis keys/channels + launchd/system ops |
| `workload *` | workload planner + Redis queue admission + Restate dagOrchestrator / dagWorker runtime |
| `docs *` | docs-api REST API (`/search`, `/docs/*`, `/chunks/*`, `/concepts*`) |
| `restate cron *` | Dkron REST API via direct `--base-url` or short-lived kubectl port-forward to `svc/dkron-svc` |
| `otel *` | Typesense `otel_events` via capability adapter |
| `recall *` | Typesense recall adapter |
| `mail *` | Agent-mail MCP HTTP (`127.0.0.1:8765`) via CLI adapter wrappers |
| `inngest *` | worker launchd + Talon + k8s + Typesense diagnostics |
Config source:
- `~/.config/system-bus.env` (plus env overrides)
- defaults: `INNGEST_URL=http://localhost:8288`, `INNGEST_WORKER_URL=http://localhost:3111`
9) Observability + Tracing Topology
OTEL event pipeline
- The worker emits via `emitOtelEvent()` / `emitMeasuredOtelEvent()`.
- The gateway emits via `@joelclaw/telemetry` (`emitGatewayOtel`) to the default `OTEL_EMIT_URL=http://localhost:3111/observability/emit` (example below).
- The worker endpoint `/observability/emit` validates a token (`x-otel-emit-token`) if configured.
- Store path (`storeOtelEvent`):
  - Typesense `otel_events` (primary)
  - optional Convex mirror for a high-severity recent window
  - optional Sentry forward for `warn`/`error`/`fatal`
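A sketch of the emit call. The endpoint and token header are documented above; the event body field names are assumptions, since the canonical schema lives in `packages/system-bus/src/observability/`.

```typescript
// Hypothetical event body; the token header only matters when the worker
// is configured to validate x-otel-emit-token.
const emitUrl =
  process.env.OTEL_EMIT_URL ?? "http://localhost:3111/observability/emit";

await fetch(emitUrl, {
  method: "POST",
  headers: {
    "content-type": "application/json",
    ...(process.env.OTEL_EMIT_TOKEN
      ? { "x-otel-emit-token": process.env.OTEL_EMIT_TOKEN }
      : {}),
  },
  body: JSON.stringify({
    component: "gateway",   // assumed field names
    action: "health_check",
    severity: "info",
    timestamp: Date.now(),
  }),
});
```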
Langfuse integration points
- Gateway boot: `packages/gateway/src/daemon.ts` calls `initTracing({})` from inference-router.
- The inference router traces model-route decisions: `packages/inference-router/src/tracing.ts`, used from `packages/inference-router/src/router.ts`.
- System-bus LLM traces: `packages/system-bus/src/lib/langfuse.ts` (`traceLlmGeneration`), called by `packages/system-bus/src/lib/inference.ts` and `channel-message-classify.ts`.
10) Key ADR Topology Decisions
| ADR | Title | Status | Topology impact |
|---|---|---|---|
| ADR-0048 | Webhook gateway | shipped | /webhooks/:provider normalization + signature verification + Inngest emission |
| ADR-0088 | NAS-backed storage tiering | shipped | Defines SSD/NAS NVMe/NAS HDD storage contract |
| ADR-0089 | Single-source worker deployment | shipped | Host/cluster role split + single canonical source |
| ADR-0144 | Gateway hexagonal architecture | shipped | Gateway as composition root; heavy logic in @joelclaw/* |
| ADR-0155 | Three-stage story pipeline | shipped | Simplified story function flow through Inngest durable steps |
| ADR-0156 | Graceful worker restart | superseded | Historical restart strategy; superseded by Talon ADR |
| ADR-0159 | Talon watchdog daemon | shipped | Compiled watchdog + infra supervision model |
| ADR-0038 | Embedded pi gateway daemon | shipped | Always-on gateway session architecture |
| ADR-0051 | Tailscale Funnel ingress | shipped | Public webhook ingress via Funnel/Caddy pattern |
| ADR-0148 | k8s resilience policy | accepted | NodePort-first exposure, probe requirements, restart recovery checklist |
| ADR-0158 | worker-supervisor binary | superseded | Legacy supervisor ADR now superseded, but binary remains in active launchd path |
| ADR-0182 | node-0 localhost resilience | shipped | endpoint class fallback (localhost -> vm -> svc_dns) |
| ADR-0187 | NAS degradation fallback contract | accepted | mandatory local/remote/queued write fallback |
| ADR-0212 | AIStor as local S3 runtime | accepted | maintained local S3 runtime in aistor namespace; legacy MinIO retained for rollback |
10.1) Sandbox Execution Contract (@joelclaw/agent-execution)
Package: `packages/agent-execution/`
Purpose: the canonical contract for sandboxed story execution, shared between Restate workflows, system-bus Inngest functions, and the k8s Job launcher.
Contract Types
Request: `SandboxExecutionRequest`
- `workflowId`, `requestId`, `storyId`: identifiers
- `task`: story prompt/task to execute
- `agent`: `{ name, variant?, model?, program? }`
- `sandbox`: `"workspace-write" | "danger-full-access"`
- `baseSha`: git SHA before execution
- `cwd?`: working directory
- `timeoutSeconds?`: timeout
- `verificationCommands?`: post-execution verification
- `sessionId?`: tracking identifier
Result: `SandboxExecutionResult`
- `requestId`: correlation ID
- `state`: `"pending" | "running" | "completed" | "failed" | "cancelled"`
- `startedAt`, `completedAt?`, `durationMs?`: timing
- `artifacts?`: execution artifacts (see below)
- `error?`: error message (failed state)
- `output?`: stdout/stderr output
Artifacts: `ExecutionArtifacts` (all three contracts are sketched as interfaces below)
- `headSha`: git SHA after execution
- `touchedFiles`: list of modified/untracked files from `git status --porcelain`
- `patch?`: git patch content (format-patch or diff)
- `verification?`: `{ commands, success, output }`
- `logs?`: `{ executionLog?, verificationLog? }`
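The three contracts, reconstructed as TypeScript interfaces from the field lists above. This is a sketch; the canonical definitions live in `packages/agent-execution/`, and exact field types or optionality may differ.

```typescript
interface SandboxExecutionRequest {
  workflowId: string;
  requestId: string;
  storyId: string;
  task: string;
  agent: { name: string; variant?: string; model?: string; program?: string };
  sandbox: "workspace-write" | "danger-full-access";
  baseSha: string;   // git SHA before execution
  cwd?: string;
  timeoutSeconds?: number;
  verificationCommands?: string[];
  sessionId?: string;
}

interface ExecutionArtifacts {
  headSha: string;        // git SHA after execution
  touchedFiles: string[]; // from `git status --porcelain`
  patch?: string;         // format-patch or diff content
  verification?: { commands: string[]; success: boolean; output: string };
  logs?: { executionLog?: string; verificationLog?: string };
}

interface SandboxExecutionResult {
  requestId: string; // correlation ID
  state: "pending" | "running" | "completed" | "failed" | "cancelled";
  startedAt: string;
  completedAt?: string;
  durationMs?: number;
  artifacts?: ExecutionArtifacts;
  error?: string;  // present in the failed state
  output?: string; // stdout/stderr
}
```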
Repo Materialization (Story 3)
Function: `materializeRepo(targetPath, baseSha, options)`
Behavior:
- Clones the repo if the target path doesn't exist (requires `remoteUrl`)
- Fetches + checks out if the target path exists
- Verifies the SHA after checkout
- Automatically unshallows if the SHA is not in a shallow clone
- Isolated sandbox-local workspace (host worktree untouched)
Returns: `{ path, sha, freshClone, durationMs }` (usage example below)
Key options:
- `remoteUrl?`: remote URL for a fresh clone
- `branch?`: branch/ref to fetch (default: `"main"`)
- `depth?`: shallow clone depth (default: `1`)
- `includeSubmodules?`: include submodules
- `timeoutSeconds?`: timeout (default: `300`)
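A hedged usage sketch. The signature, options, and return shape come from above; the import path, URL, and SHA are assumptions.

```typescript
import { materializeRepo } from "@joelclaw/agent-execution"; // assumed export path

const repo = await materializeRepo("/workspace/repo", "abc123d", {
  remoteUrl: "https://github.com/joelhooks/joelclaw.git", // required for fresh clones
  branch: "main",
  depth: 1,
  timeoutSeconds: 300,
});
// Returns { path, sha, freshClone, durationMs }
console.log(repo.path, repo.sha, repo.freshClone, repo.durationMs);
```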
Artifact Export (Story 3)
Function: `generatePatchArtifact(options)`
Behavior:
- Captures the touched-file inventory via `getTouchedFiles()`
- Generates a git patch from `baseSha..headSha`:
  - uses `git format-patch` if commits exist in the range
  - uses `git diff` if only uncommitted changes exist
- Optionally includes untracked files as patch content
- Embeds a verification summary and log references
- Serializable to JSON via `writeArtifactBundle()`
Key options:
- `repoPath`: path to the git repo
- `baseSha`: base SHA (start of diff range)
- `headSha?`: head SHA (default: HEAD)
- `includeUntracked?`: include untracked files (default: `true`)
- `verificationCommands?`, `verificationSuccess?`, `verificationOutput?`: verification data
- `executionLogPath?`, `verificationLogPath?`: log references
- `timeoutSeconds?`: timeout (default: `60`)
Returns: `ExecutionArtifacts` (usage example below)
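A matching sketch for artifact export, with the same caveats: option names come from the list above, and the import path and values are assumptions.

```typescript
import { generatePatchArtifact } from "@joelclaw/agent-execution"; // assumed export path

const artifacts = await generatePatchArtifact({
  repoPath: "/workspace/repo",
  baseSha: "abc123d",       // start of the diff range (hypothetical SHA)
  includeUntracked: true,   // default
  verificationCommands: ["bun test"],
  verificationSuccess: true,
  verificationOutput: "all green",
  timeoutSeconds: 60,
});
console.log(artifacts.touchedFiles.length, artifacts.patch?.length);
```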
Promotion Boundary (Phase 1)
Authoritative output is patch bundle + metadata.
Sandbox runs do not merge to main or push to remote. The runtime:
- Materializes the repo at `baseSha` in a sandbox-local workspace
- Executes the agent task
- Runs verification commands
- Exports a patch artifact with touched files and verification results
- Emits a `SandboxExecutionResult` event with `ExecutionArtifacts`
Promotion is a separate operator decision:
- The Restate workflow receives `ExecutionArtifacts`
- The operator reviews the patch + verification summary
- The operator applies the patch to the host repo (or discards it)
- The operator commits and pushes (if approved)
This keeps sandbox runs isolated and reversible.
k8s Job Integration
Job spec generation: `generateJobSpec(request, options)`
Cold k8s Jobs for isolated story execution:
- Deterministic Job naming keyed by `requestId`
- Runtime image contract: Git, Bun, agent tooling, `/workspace` directory
- Environment-driven config: `WORKFLOW_ID`, `REQUEST_ID`, `STORY_ID`, `TASK_PROMPT_B64`, `BASE_SHA`, etc.
- Resource limits: `500m`-`2` CPU, `1`-`4Gi` memory (configurable)
- TTL cleanup: auto-delete after 5 minutes (default)
- Active deadline: 1 hour max runtime (default)
- No automatic retries (`backoffLimit: 0`)
- Security: non-root (UID 1000), no privilege escalation, capabilities dropped
Runtime contract:
- Decode `TASK_PROMPT_B64` from env
- Call `materializeRepo()` at `BASE_SHA`
- Execute the agent with the task
- Run verification commands (if `VERIFICATION_COMMANDS_B64` is set)
- Call `generatePatchArtifact()` with the results
- Emit a `SandboxExecutionResult` event with `ExecutionArtifacts`
- Exit 0 (success) or non-zero (failure)
Cancellation: Delete Job resource (SIGTERM to container)
Job deletion: `generateJobDeletion(requestId)` -> `{ name, namespace, propagationPolicy }`
See `k8s/agent-runner.yaml` for the full runtime contract specification; a sketch of the in-Job entrypoint follows.
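Putting the runtime contract together: env var names come from the contract above; `runAgent` and `emitResult` are hypothetical stand-ins for agent execution and event emission.

```typescript
import { materializeRepo, generatePatchArtifact } from "@joelclaw/agent-execution";

declare function runAgent(task: string, cwd: string): Promise<void>; // hypothetical
declare function emitResult(result: unknown): Promise<void>;         // hypothetical

// 1. Decode the task prompt from env.
const task = Buffer.from(process.env.TASK_PROMPT_B64!, "base64").toString("utf8");
const baseSha = process.env.BASE_SHA!;

// 2. Materialize the repo at BASE_SHA inside /workspace.
const repo = await materializeRepo("/workspace/repo", baseSha, {
  remoteUrl: process.env.REPO_URL, // hypothetical env var for fresh clones
});

// 3. Execute the agent, then export the patch artifact.
await runAgent(task, repo.path);
const artifacts = await generatePatchArtifact({ repoPath: repo.path, baseSha });

// 4. Emit the terminal result and exit (non-zero on failure).
await emitResult({
  requestId: process.env.REQUEST_ID!,
  state: "completed",
  artifacts,
});
process.exit(0);
```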
Topology Impact
- Story 2: added contract types and Job spec generation
- Story 3: added repo materialization and artifact export helpers
- ADR-0221 phase 1: added explicit local sandbox isolation primitives — deterministic sandbox identity, deterministic local sandbox paths, per-sandbox env materialization, minimal/full mode vocabulary, and a JSON registry helper for host-worker sandboxes
- ADR-0221 phase 2: wired those local helpers into the real host-worker `system/agent-dispatch` local backend, so sandbox runs now allocate deterministic paths under `~/.joelclaw/sandboxes/`, materialize `.sandbox.env`, persist registry state, and carry `localSandbox` metadata in inbox snapshots
- ADR-0221 phase 3: terminal retention/cleanup policy (`cleanupAfter` + registry metadata), opportunistic pruning of expired local sandboxes on new-run startup, copy-first `.devcontainer` materialization helpers with exclusion rules for env/secret junk, live sandbox env injection so the agent process actually sees the reserved runtime identity, a hash-preserving sandbox identity fix after live dogfood exposed path collisions from long shared requestId prefixes, abbreviated-`baseSha` acceptance during repo materialization, truthful failed inbox snapshots when dispatch crashes before normal terminal writeback, and a repeatable operator probe at `bun scripts/verify-local-sandbox-dispatch.ts`
- ADR-0221 phase 4: `sandboxMode=minimal|full` through the workload front door, requested-cwd mapping inside the cloned checkout, compose-backed full local mode startup, the documented reality that stale Restate workers can reject `workload/requested` until restarted and reloaded, a recursion guard (sandboxed stage runs could call `scripts/verify-workload-full-mode.ts` / `joelclaw workload run` from inside the sandbox and spawn nested canaries instead of terminating honestly), and a guarded workflow-rig proof run (WR_20260310_013158) that completes terminally with healthy compose startup plus clean teardown
- ADR-0221 phase 5: the operator-facing CLI surface `joelclaw workload sandboxes list|cleanup|janitor`, so retained sandboxes can be inspected and janitored on demand instead of only during startup opportunistic pruning; the operator surfaces now reconcile registry entries against per-sandbox metadata before reporting or deleting, so old partial writeback residue stops lying about terminal state
- ADR-0221 phase 6: scheduled janitoring via the repo-managed launchd service `com.joel.local-sandbox-janitor`, which runs `scripts/local-sandbox-janitor.sh` → `joelclaw workload sandboxes janitor` at load and every 30 minutes
- Future: runtime image build, hot-image CronJob, warm-pool scheduler, Restate integration

Current state: the host-worker local sandbox path now uses the local-isolation helpers in production code; the package has a concurrent proof that two local sandboxes keep distinct compose identity plus copied devcontainer state; guarded full-mode workflow-rig dogfood closes terminally; and cleanup now has both on-demand CLI surfaces and scheduled launchd janitoring. Follow-on work is about deeper runtime ergonomics and debugging any remaining non-terminal stale residues, not missing basic cleanup automation.
11) Verification Commands (Health + Wiring)
Core topology
```bash
# Colima + VM IP
colima status --json

# Kubernetes control plane + node
kubectl cluster-info
kubectl get nodes -o wide

# Core workloads
kubectl get pods -n joelclaw -o wide
kubectl get svc -n joelclaw -o wide
```
Host supervision
```bash
# Worker supervisor launchd state
launchctl print gui/$(id -u)/com.joel.system-bus-worker | rg "state =|pid =|last exit code"

# Gateway / Caddy / Talon
launchctl print gui/$(id -u)/com.joel.gateway | rg "state =|pid ="
launchctl print gui/$(id -u)/com.joel.caddy | rg "state =|pid ="
launchctl print gui/$(id -u)/com.joel.talon | rg "state =|pid ="

# Talon health
curl -s http://127.0.0.1:9999/health
```
Worker role split
```bash
# Parse role counts directly from source lists
python - <<'PY'
import re
from pathlib import Path

for f, name in [
    ('packages/system-bus/src/inngest/functions/index.host.ts', 'host'),
    ('packages/system-bus/src/inngest/functions/index.cluster.ts', 'cluster'),
]:
    txt = Path(f).read_text()
    body = re.search(rf'export const {name}FunctionDefinitions = \[(.*?)\];', txt, re.S).group(1)
    count = sum(1 for line in body.splitlines() if line.strip() and not line.strip().startswith('//'))
    print(name, count)
PY

# Inngest app ID derivation logic
rg -n "INNGEST_APP_ID|system-bus-host|system-bus-cluster|WORKER_ROLE" packages/system-bus/src/inngest/client.ts
```
Event flow trace
```bash
# Send event
joelclaw send <event> -d '<json>'

# Trace event and resulting runs
joelclaw events --prefix <event-prefix> --hours 1 --count 20
joelclaw event <event-id>
joelclaw runs --hours 1 --count 20
joelclaw run <run-id>

# Telemetry correlation
joelclaw otel search "<component_or_action>" --hours 1
```
Networking
```bash
# Caddy route config
caddy validate --config ~/.local/caddy/Caddyfile

# Listening ports snapshot
/usr/sbin/lsof -iTCP -sTCP:LISTEN -n -P

# Tailscale runtime (if daemon available)
tailscale status --json
```
12) Known Unknowns (Do Not Guess)
- Tailscale daemon state is not readable in this environment: `tailscale status --json` failed to connect. UNKNOWN — needs manual verification.
- `docs/architecture.md`, `docs/deploy.md`, `docs/observability.md` are absent in-repo. UNKNOWN — needs manual verification.
- Exact command-line ownership of all Colima ssh forwarding ports (`64784`, `64785`, `9627`, etc.). UNKNOWN — needs manual verification.
- Ingress controller runtime status for `k8s/docs-api-ingress.yaml`. UNKNOWN — needs manual verification.
13) Mandatory Update Policy (Non-Optional)
Update this skill in the same change whenever any of these change:
- Worker runtime wiring
  - `serve.ts`, `client.ts`, `index.host.ts`, `index.cluster.ts`
  - `WORKER_ROLE`, app IDs, serveHost behavior, registration path
- Supervision/process topology
  - any `~/Library/LaunchAgents/com.joel*.plist`
  - `infra/worker-supervisor/*`, Talon behavior, gateway launch script/label
- Kubernetes topology
  - any file under `k8s/`
  - Helm values affecting core services (`livekit`, `pds`, etc.)
  - Service type/port changes (NodePort/ClusterIP)
- Networking/ingress
  - Caddyfile route/port changes
  - Tailscale/Funnel hostnames or ingress path changes
  - Colima/VM networking model changes
- Storage topology
  - Redis keyspace contracts for gateway/webhook routing
  - Typesense telemetry collection/schema changes
  - NAS mount/fallback/queue contract changes
- Observability/tracing
  - OTEL emit endpoint/token behavior
  - telemetry storage path changes (Typesense/Convex/Sentry)
  - Langfuse integration points
- CLI control-plane routing
  - command families moved to different endpoints/services
- ADR status changes affecting topology
  - especially ADR-0048, 0088, 0089, 0144, 0155, 0156, 0159, 0182, 0187
If any item above changed and this skill was not updated, this skill is stale and non-canonical.