# Debug Production Renders
Telecine renders flow through a queue-based pipeline backed by Valkey (Redis-compatible). Each render is a workflow that progresses through multiple queues, each served by a dedicated Cloud Run worker service. Debugging a render means checking its state at three layers: the Postgres database, the Valkey queue state, and the Cloud Run worker logs.
## Quick Start
All debug scripts run inside Docker. From the monorepo root:

```shell
# Full render status: DB row, fragment breakdown, error detail
telecine/scripts/debug-render <render-id>

# Add live Valkey state (queued/claimed/completed/failed jobs)
telecine/scripts/debug-render <render-id> --redis

# Add docker compose log grep (local dev only)
telecine/scripts/debug-render <render-id> --logs

# All three
telecine/scripts/debug-render <render-id> --redis --logs
```
## Render Pipeline Flow
A render progresses through queues in this order:
- `process-html-initializer` -- Preprocesses raw HTML from the API submission.
- `process-html-finalizer` -- Sets workflow data, then enqueues the `render-initializer` job. This is the bridge from the HTML pipeline into the render pipeline.
- `render-initializer` -- Checks out the render source, extracts render info (dimensions, duration, fps) via Electron RPC, creates an assets metadata bundle, then fans out N fragment jobs. For still images (png/jpeg/webp), renders the image directly and skips fragments.
- `render-fragment` (N parallel) -- Each job renders one time slice of the video via Electron RPC, writes the fragment file to storage, and reports progress.
- `render-finalizer` -- Auto-triggered when all fragment jobs complete. Merges the fragment files into the final output and marks the render as complete.
The pipeline is triggered by a Hasura event on INSERT into `video2.renders`. The workflow system automatically routes the finalizer queue job when all child jobs complete or when any job fails with exhausted retries.
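The fan-out step above can be pictured with a small sketch: the initializer splits the render's duration into fixed-size time slices, one per `render-fragment` job. This is an illustrative model, not the actual Telecine implementation; the function name, field names, and the idea of a fixed slice size are assumptions (the real logic lives in `telecine/lib/queues/units-of-work/Render/`).

```typescript
// Hypothetical sketch of the initializer's fragment fan-out.
// Names and the fixed-slice strategy are illustrative assumptions.
interface FragmentSlice {
  segmentId: number;
  startMs: number;
  endMs: number; // exclusive
}

function planFragments(durationMs: number, sliceMs: number): FragmentSlice[] {
  const slices: FragmentSlice[] = [];
  for (let start = 0, id = 0; start < durationMs; start += sliceMs, id++) {
    slices.push({
      segmentId: id,
      startMs: start,
      endMs: Math.min(start + sliceMs, durationMs),
    });
  }
  return slices;
}

// A 10 s render with 4 s slices fans out to three fragment jobs:
// [0, 4000), [4000, 8000), [8000, 10000)
```

Each slice would then become one `render-fragment` job, and the finalizer fires once all of them report completion.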
Queue names, worker service names, and resource allocations are defined in source files -- see "Key Source Files" below.
## Debug Scripts
| Script | Purpose |
|---|---|
| `telecine/scripts/debug-render <id> [--redis] [--logs]` | Primary debug tool: DB state, fragments, errors, optional Valkey + logs |
| `telecine/scripts/inspect-render.ts` | Lower-level Valkey inspection: workflow data, claimed jobs with ages, all workflow keys |
| `telecine/scripts/check-queue.ts` | Queue-level Valkey state: queued/claimed/failed counts, org membership |
| `telecine/scripts/restart-render.ts` | Restart a failed render: resets DB status, re-enqueues the initializer job |
| `telecine/scripts/create-render.ts` | Create a test render from an existing render's org context |
| `telecine/scripts/render-logs [-f] <id>` | Grep docker compose logs for initializer/fragment/finalizer services |
| `worktree smoke [branch]` (runs `telecine/scripts/smoke-test.ts`) | End-to-end smoke tests via the public API |
| `telecine/scripts/smoke-test-waveform.ts` | Waveform-specific smoke test: inserts renders directly into the DB, exercises all ef-waveform modes concurrently |
| `telecine/scripts/console` | Node REPL with project imports (`db`, `valkey`, `queues` available) |
Run `.ts` scripts via `telecine/scripts/run tsx scripts/<script>.ts <args>`.
## Querying Cloud Run Logs
There are no dedicated log-querying scripts; use `gcloud` directly. Workers log structured JSON via pino, so filter on `jsonPayload` fields:
```shell
# All render workers for a specific render ID
gcloud logging read \
  'resource.type="cloud_run_revision"
   AND resource.labels.service_name=~"telecine-worker-render"
   AND jsonPayload.renderId="<RENDER-ID>"' \
  --project=editframe --limit=100 --format=json

# Specific worker stage
gcloud logging read \
  'resource.type="cloud_run_revision"
   AND resource.labels.service_name="telecine-worker-render-initializer"
   AND jsonPayload.renderId="<RENDER-ID>"' \
  --project=editframe --limit=100

# Errors only across all workers
gcloud logging read \
  'resource.type="cloud_run_revision"
   AND resource.labels.service_name=~"telecine-worker"
   AND severity>=ERROR
   AND jsonPayload.renderId="<RENDER-ID>"' \
  --project=editframe --limit=50
```
Cloud Run service names follow the pattern `telecine-worker-{queue-name}`. See `telecine/deploy/resources/queues/configs.ts` for the full list of queue names.
## Valkey Key Schema
Queues and workflows use predictable Valkey key patterns. Understanding these lets you query state directly when scripts don't cover your case.
Queue-level keys (per queue name):
- `queues:{queueName}:queued` -- zset of job keys waiting to be claimed
- `queues:{queueName}:claimed` -- zset of job keys being processed (score = claim timestamp)
- `queues:{queueName}:completed` -- zset of completed job keys
- `queues:{queueName}:failed` -- zset of failed job keys
- `queues:{queueName}:jobs:{jobId}` -- serialized job data (SuperJSON)
- `queues:{queueName}:orgs` -- zset of org keys with active jobs
Workflow-level keys (per render ID):
- `workflows:{renderId}:queued` -- zset of workflow-level queued jobs
- `workflows:{renderId}:claimed` -- zset of claimed jobs
- `workflows:{renderId}:completed` -- zset of completed jobs
- `workflows:{renderId}:failed` -- zset of failed jobs
- `workflows:{renderId}:data` -- SuperJSON workflow payload (render config)
- `workflows:{renderId}:status` -- workflow status string
Progress tracking:
- `render:{renderId}` -- Redis stream for fragment completion progress
Stalled jobs are detected by checking claimed jobs whose score (claim timestamp) is older than 10 seconds.
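As a sketch of how the key patterns and the stall check above compose (illustrative helpers; the real implementations live in `telecine/lib/queues/Queue.ts` and `Job.ts`):

```typescript
// Illustrative helpers mirroring the Valkey key patterns documented above.
// Not the actual Queue.ts/Job.ts code.
const queueKey = (queueName: string, suffix: string) =>
  `queues:${queueName}:${suffix}`;
const workflowKey = (renderId: string, suffix: string) =>
  `workflows:${renderId}:${suffix}`;

// Stall threshold from the docs: a claimed job is stalled when its
// claim timestamp (the zset score) is older than 10 seconds.
const STALL_THRESHOLD_MS = 10_000;

interface ClaimedEntry {
  jobKey: string;      // zset member
  claimedAtMs: number; // zset score
}

function findStalled(entries: ClaimedEntry[], nowMs: number): string[] {
  return entries
    .filter((e) => nowMs - e.claimedAtMs > STALL_THRESHOLD_MS)
    .map((e) => e.jobKey);
}
```

With a Valkey client, the claimed entries would come from something like `ZRANGE queues:render-fragment:claimed 0 -1 WITHSCORES`; the filter above is then a pure function over that result.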
## Database Tables
Render state is persisted in Postgres via Kysely (not Prisma). The `db` client is imported from `@/sql-client.server`.
- `video2.renders` -- Main render record: status, html, org_id, dimensions, fps, duration_ms, failure_detail, timestamps
- `video2.render_fragments` -- Per-segment fragment records: render_id, segment_id, attempt_number, timestamps, last_error
- `video2.process_html` -- HTML processing records
- `video2.files` -- File records (used by process-isobmff and ingest-image)
Render status values: `created` -> `queued` -> `rendering` -> `complete` | `failed`
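The status progression can be captured as a small state machine (a sketch for orientation; the authoritative status handling lives in the worker code, and the assumption that `failed` is reachable from every non-terminal state is mine):

```typescript
// Sketch of the render status state machine described above.
// Assumption: a render can fail from any non-terminal state.
type RenderStatus = "created" | "queued" | "rendering" | "complete" | "failed";

const NEXT: Record<RenderStatus, RenderStatus[]> = {
  created: ["queued", "failed"],
  queued: ["rendering", "failed"],
  rendering: ["complete", "failed"],
  complete: [], // terminal
  failed: [],   // terminal (until restart-render resets it)
};

function canTransition(from: RenderStatus, to: RenderStatus): boolean {
  return NEXT[from].includes(to);
}
```

A render stuck in `rendering` with no valid transition firing is exactly the case the `--redis` flag is for: the DB row says `rendering` while the truth is in the Valkey claimed/failed sets.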
## Debugging Workflow
1. Start with `debug-render` -- get the DB status, error detail, and fragment breakdown.
2. If stuck in "rendering" -- add `--redis` to see whether jobs are queued, claimed (possibly stalled), or silently failed in Valkey.
3. If the Valkey state is unclear -- use `inspect-render.ts` for detailed claimed-job ages and `check-queue.ts` for queue-level counts.
4. If you need worker logs -- query Cloud Run logs with `gcloud` using the render ID (see the commands above). In local dev, use the `--logs` flag or the `render-logs` script.
5. To retry -- use `restart-render.ts` to reset the DB state and re-enqueue the initializer.
6. For production DB access -- use `telecine/scripts/debug-prod-web --use-prod-db --shell` to get a container with production database connectivity.
## Diagnosing Electron RPC Failures
Fragment renders fail via RPC timeout. The stack trace tells you how far rendering got:
- `RPC.ts:182` -- the initial 5 s timer fired and no keepalives were received at all. Electron never started rendering. Causes: the scheduler opened more connections than the single local container can handle (see below), Electron failed to load the page, or the render context couldn't be created.
- `RPC.ts:153` -- the keepalive-reset timer fired mid-render. At least one frame rendered before the hang. Causes: a race condition aborted an in-flight fetch (the `AbortError` case), a frame took too long, or Electron crashed mid-segment.
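The two timer paths can be modeled as a classifier over keepalive activity. This is a simplified sketch, not the real `RPC.ts` code: the 5 s initial timeout comes from the description above, but the keepalive-reset window duration and the function shape are assumptions for illustration.

```typescript
// Simplified model of the two RPC timeout modes described above.
// INITIAL_TIMEOUT_MS matches the documented 5 s initial timer;
// KEEPALIVE_WINDOW_MS is an assumed value for the reset timer.
const INITIAL_TIMEOUT_MS = 5_000;
const KEEPALIVE_WINDOW_MS = 5_000;

type TimeoutMode =
  | "never-started"      // corresponds to the RPC.ts:182 path
  | "stalled-mid-render" // corresponds to the RPC.ts:153 path
  | "ok";

function classifyTimeout(
  keepaliveTimesMs: number[],
  nowMs: number,
  startMs: number,
): TimeoutMode {
  if (keepaliveTimesMs.length === 0) {
    // No keepalives at all: Electron never started rendering.
    return nowMs - startMs > INITIAL_TIMEOUT_MS ? "never-started" : "ok";
  }
  // At least one frame rendered; check the gap since the last keepalive.
  const last = Math.max(...keepaliveTimesMs);
  return nowMs - last > KEEPALIVE_WINDOW_MS ? "stalled-mid-render" : "ok";
}
```

The diagnostic value is the branch, not the constants: zero keepalives means look at connection count and page load; a late last keepalive means look at the aborted-fetch race or a slow frame.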
### AbortError / FrameController race
If Electron logs show `[EF_FRAMEGEN.beginFrame] error: [object DOMException]` / `AbortError: The user aborted a request`, a fetch started during `seekForRender` was cancelled by an autonomous re-render firing concurrently. This is a timing-dependent race: `EFTemporal.updated()` or `EFTimegroup.updated()` fires when media loads and calls `FrameController.abort()`, killing the in-flight GCS fetch.
Fix: set `data-no-playback-controller` on the timegroup before `seekForRender` to suppress autonomous re-renders -- the same attribute used on render clones. Check `EF_FRAMEGEN.ts` `initialize()` and `EFTemporal.ts`/`EFTimegroup.ts` `updated()`.
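A minimal sketch of applying the fix, assuming a DOM-like element: the attribute name comes from the fix above, but the helper itself, its scoping behavior, and the cleanup logic are hypothetical, not the actual `EF_FRAMEGEN.ts` code.

```typescript
// Hypothetical helper: hold data-no-playback-controller for the duration
// of a render seek so autonomous re-renders cannot abort in-flight fetches.
interface AttrHost {
  hasAttribute(name: string): boolean;
  setAttribute(name: string, value: string): void;
  removeAttribute(name: string): void;
}

const SUPPRESS_ATTR = "data-no-playback-controller";

async function withPlaybackSuppressed<T>(
  el: AttrHost,
  seek: () => Promise<T>,
): Promise<T> {
  const alreadySet = el.hasAttribute(SUPPRESS_ATTR);
  el.setAttribute(SUPPRESS_ATTR, "");
  try {
    return await seek();
  } finally {
    // Only remove the attribute if this call added it
    // (render clones carry it permanently).
    if (!alreadySet) el.removeAttribute(SUPPRESS_ATTR);
  }
}
```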
### Scheduler over-scaling in local dev
In production, `MAX_WORKER_COUNT` controls how many Cloud Run instances the scheduler spins up. Locally there is one container per queue. If the scheduler opens more WebSocket connections than the single container expects (e.g. 30 connections for 30 queued jobs), every concurrent `renderFragment` RPC call beyond the worker's `WORKER_CONCURRENCY` quota times out at `RPC.ts:182` before Electron starts processing it.
The two dials and how they interact:
| Dial | Production | Local dev |
|---|---|---|
| `MAX_WORKER_COUNT` | Scales the container count | Must match `scale:` in docker-compose (usually 1) |
| `WORKER_CONCURRENCY` | Jobs per container | Effective parallelism when `MAX_WORKER_COUNT=1` |
Both are read from `telecine/.env` (via `env_file` in the worker containers and `${VAR:-default}` substitution in `scheduler-go/docker-compose.yaml`). Changing them requires `telecine/scripts/docker-compose up -d` (not just `restart`) to force recreation with the new env.
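The interaction between the two dials reduces to a capacity check. This is a simplification (it ignores retries and claim timing), and the function names are illustrative, but it makes the local-dev failure mode concrete:

```typescript
// Sketch: effective RPC capacity is worker count x per-worker concurrency.
// In local dev, where nothing scales up, queued fragment jobs beyond this
// capacity hit the initial RPC timer (the RPC.ts:182 path) unserved.
function effectiveCapacity(
  maxWorkerCount: number,
  workerConcurrency: number,
): number {
  return maxWorkerCount * workerConcurrency;
}

function predictedInitialTimeouts(
  queuedJobs: number,
  maxWorkerCount: number,
  workerConcurrency: number,
): number {
  return Math.max(0, queuedJobs - effectiveCapacity(maxWorkerCount, workerConcurrency));
}

// Local dev example: 30 queued jobs against one container with
// concurrency 4 leaves 26 RPC calls with no worker slot.
```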
## Key Source Files
- `telecine/scripts/debug-render.ts` -- Primary debug tool implementation
- `telecine/scripts/inspect-render.ts` -- Low-level Valkey render inspection
- `telecine/scripts/check-queue.ts` -- Queue-level Valkey state checker
- `telecine/scripts/restart-render.ts` -- Render restart/retry tool
- `telecine/lib/queues/Queue.ts` -- Queue base class, key patterns
- `telecine/lib/queues/Workflow.ts` -- Workflow base class, workflow key patterns
- `telecine/lib/queues/Job.ts` -- Job serialization, enqueue, stall detection
- `telecine/lib/queues/units-of-work/Render/` -- Render pipeline queue/worker definitions
- `telecine/lib/queues/units-of-work/ProcessHtml/` -- HTML pipeline queue/worker definitions
- `telecine/deploy/resources/queues/configs.ts` -- Production queue names and scaling config
- `telecine/deploy/resources/queues/defineWorker.ts` -- Cloud Run service definition template
- `telecine/deploy/resources/constants.ts` -- GCP project/region constants
- `telecine/lib/valkey/valkey.ts` -- Valkey connection setup
## When to Use This Skill
Use this skill when:
- A production render is stuck, failed, or producing unexpected results
- You need to trace a render through the pipeline to find where it stalled
- You need to query Cloud Run logs for a specific render
- You need to inspect or manipulate Valkey queue state
- You need to restart a failed render
- You need to understand the render pipeline architecture