diagnosing-failed-warehouse-syncs
# Diagnosing failed data warehouse syncs
Work top-down when a data warehouse source or table is failing, stuck, or producing bad data: source → schema → recovery action. Do not jump straight to "resync from scratch" — that discards synced data and restarts from zero, which is rarely the right first step.
## When to use this skill
- The user reports a specific sync is failing (e.g. "my Stripe source is red")
- A table has been in `Running` state far longer than expected
- Data in a warehouse table is stale, missing rows, or looks corrupt
- Latest rows aren't appearing despite the schema being marked `Completed`
- The user is choosing between cancel / reload / resync / delete-data and isn't sure which
- Another skill — typically `auditing-warehouse-data-health` — has surfaced a failing source or schema and the user wants to dig into it
Both entry points (user-reported and audit-handoff) use the same workflow; the audit just means you already know which item to diagnose and can skip Step 1's discovery search.
## Available tools

| Tool | Purpose |
|---|---|
| `external-data-sources-list` | List all sources with connection status and latest error |
| `external-data-sources-retrieve` | Full details for one source including all its schemas |
| `external-data-schemas-list` | All table schemas across all sources, with per-table status + `latest_error` |
| `external-data-schemas-retrieve` | Full details for one schema including `sync_type_config` |
| `external-data-schemas-cancel` | Cancel a sync currently in `Running` state |
| `external-data-schemas-reload` | Trigger a sync using the configured sync method (respects incremental) |
| `external-data-schemas-resync` | Full resync — wipes synced data and restarts. Destructive |
| `external-data-schemas-delete-data` | Delete the synced table but keep the schema entry |
| `external-data-schemas-partial-update` | Change `sync_type` / `incremental_field` / `cdc_table_mode` |
| `external-data-sources-partial-update` | Update a source's credentials (`job_inputs`) after rotation |
| `external-data-sources-reload` | Retrigger syncs for every enabled schema on a source |
| `external-data-sources-refresh-schemas` | Re-fetch the source's table list to pick up new tables |
| `external-data-sources-check-cdc-prerequisites-create` | Verify Postgres CDC setup for a source |
| `external-data-schemas-incremental-fields-create` | Refresh candidate incremental fields when the source schema has changed |
| `external-data-sources-webhook-info-retrieve` | Check webhook registration state and external service status |
| `external-data-sources-create-webhook-create` | Re-register a webhook that was lost or never registered |
| `external-data-sources-update-webhook-inputs-create` | Update the signing secret after rotation on the source side |
| `external-data-sources-delete-webhook-create` | Remove a broken webhook before re-registering |
## Workflow

### Step 1 — Locate the failing item
If the user named a source, go straight to `external-data-sources-retrieve`. Otherwise start with `external-data-sources-list` and `external-data-schemas-list` to find what's red.

Two kinds of failure:

- **Source-level** (`ExternalDataSource.status = "Error"`): the connection itself is broken — credentials expired, host unreachable, account disabled. Affects every table.
- **Schema-level**: the source connects fine but one or more tables are failing. In the serialized API response from `external-data-schemas-list`, look for `status` values `"Failed"`, `"Billing limits"`, or `"Billing limits too low"`. (The underlying model enum values are `BillingLimitReached` and `BillingLimitTooLow`, but the serializer rewrites them — match on both the human-readable and enum forms to be safe.)

A source can look `Completed` at the top level while one of its schemas is `Failed` — always check both.
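The two-level check above can be sketched in Python. This is a hypothetical helper, not a real client API: the response shapes and field names (`status`, `latest_error`, `name`) are assumptions taken from this guide.

```python
SCHEMA_FAILURE_STATUSES = {
    # Match both the serializer (human-readable) and model enum forms
    "Failed",
    "Billing limits", "BillingLimitReached",
    "Billing limits too low", "BillingLimitTooLow",
}

def classify_failure(source, schemas):
    """Classify a failure as source-level, schema-level, or neither."""
    if source.get("status") == "Error":
        # Connection itself is broken; every table under it is affected
        return {"level": "source", "error": source.get("latest_error")}
    failing = [s["name"] for s in schemas if s.get("status") in SCHEMA_FAILURE_STATUSES]
    if failing:
        # Source connects fine, but specific tables are failing
        return {"level": "schema", "schemas": failing}
    return {"level": "healthy"}
```

Note that a `Completed` source with a `Failed` schema correctly lands in the schema-level branch — the top-level status alone is not enough.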
### Step 2 — Classify the schema status

From `external-data-schemas-list`, each schema has a `status`:

| Status | Meaning | Usually means |
|---|---|---|
| `Running` | Sync currently executing | Normal, unless stuck for hours |
| `Completed` | Last sync finished successfully | Healthy |
| `Failed` | Last sync errored — see `latest_error` | Needs diagnosis |
| `Paused` | User disabled sync (`should_sync = false`) | Intentional |
| `Billing limits` (serializer) / `BillingLimitReached` (enum) | Team hit its warehouse row quota | Billing issue, not a technical failure |
| `Billing limits too low` (serializer) / `BillingLimitTooLow` (enum) | Team has insufficient credit | Billing issue |

Always check `last_synced_at` alongside `status`. A schema in `Running` with `last_synced_at` from 12 hours ago is almost certainly stuck, even though the status isn't `Failed`.
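A stuck-sync check could look like the sketch below. The field names are assumptions from this guide, and the 6-hour threshold is arbitrary — tune it to the source's normal sync cadence.

```python
from datetime import datetime, timedelta, timezone

STUCK_AFTER = timedelta(hours=6)  # arbitrary cutoff; adjust per source

def is_stuck(schema, now=None):
    """True when a schema sits in Running long past its last completed sync."""
    if schema.get("status") != "Running":
        return False
    last = schema.get("last_synced_at")
    if not last:
        return False  # first-ever sync: no baseline to compare against
    now = now or datetime.now(timezone.utc)
    synced_at = datetime.fromisoformat(last.replace("Z", "+00:00"))
    return now - synced_at > STUCK_AFTER
```

A schema flagged here calls for `cancel` then `reload` (scenario F below in the original workflow), not a resync.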
### Step 3 — Interpret latest_error

Map the `latest_error` string to a root cause. Common patterns:

| Error substring | Root cause | Fix |
|---|---|---|
| `authentication failed`, `401`, `403`, `invalid credentials` | Credentials expired or rotated | User rotates creds, then `external-data-sources-partial-update` with new `job_inputs` |
| `Could not establish session to SSH gateway` | SSH tunnel misconfigured or remote host down | User checks SSH host/key/bastion |
| `Primary key required for incremental syncs` | Table has no PK and `sync_type` is incremental/cdc | Either add PK in source, or switch schema to `full_refresh` |
| `primary keys for this table are not unique` | Declared PK columns aren't actually unique | Pick different PK columns via `partial-update` |
| `Integration matching query does not exist` | Source's saved integration was deleted | Recreate the source |
| `column "X" does not exist`, `does not have a column named` | Schema drift — incremental field or tracked column removed | Use `incremental-fields-create` to re-detect, then `partial-update` |
| `relation "..." does not exist` | Source table was dropped/renamed | Remove schema or rename source-side |
| `SSL`, `connection refused`, `timeout`, `unreachable` | Network / firewall / host reachability | User side — check host/port/allowlist |
| `replication slot`, `publication`, `wal_level` | CDC prerequisites broken | Run `check-cdc-prerequisites-create`; may need slot recreate |
| `Schema exceeds row limit`, `billing` | Billing limit | Upgrade plan or disable the schema |

If `latest_error` is null but the schema is `Failed`, retrieve the schema directly — the error may only be populated on the detail view.
### Step 4 — Pick the recovery action
The recovery action depends on root cause, not just status. Match the user's situation to one of these:
**A. Transient failure (network blip, temporary API outage)**

- Data synced so far is still valid.
- Action: `external-data-schemas-reload` to retry using the configured sync method.
- Incremental/append syncs pick up where they left off.
**B. Credentials expired or rotated**

- Every schema under the source is failing with an auth error.
- Action: user rotates creds → `external-data-sources-partial-update` with the new `job_inputs` → the reload happens automatically when the source status flips back to running, or trigger manually with `external-data-sources-reload`.
**C. Schema drift — column renamed, dropped, or type changed**

- Error mentions a specific column that no longer matches the source.
- Action: `external-data-schemas-incremental-fields-create` to get the current fields, then `external-data-schemas-partial-update` with the corrected `incremental_field` / `incremental_field_type` / `primary_key_columns`. Usually no need to wipe data.
**C2. Added / renamed tables in the source database**

- User mentions "I added a new table to Postgres but it isn't appearing", or a source table was renamed.
- Action: `external-data-sources-refresh-schemas` to pick up the new table list, then configure sync on any new schemas.
**D. Incremental state is wrong (duplicates, missing rows, data looks corrupt)**

- Schema status may be `Completed` — this isn't a "failure" per se, it's bad data.
- Action: `external-data-schemas-resync` to wipe synced data and re-import from source. Destructive but often the right call for data-quality issues.
**E. CDC pipeline broken on Postgres**

- Error mentions replication slot, publication, WAL.
- Action: `external-data-sources-check-cdc-prerequisites-create` to enumerate what's broken, fix on the Postgres side, then `external-data-schemas-reload`. If the WAL position was lost, a `resync` is sometimes unavoidable.
**F. Sync is stuck in Running for hours**

- Check `last_synced_at`. If it's hours old and still `Running`, the job is orphaned.
- Action: `external-data-schemas-cancel` to stop it, then `external-data-schemas-reload`.
**G. Table data is corrupt but you want to keep the schema config**

- Action: `external-data-schemas-delete-data` to drop the synced table but preserve the schema entry. Next reload re-imports from scratch without losing the configured `sync_type` / `incremental_field`.
**H. Billing limit**

- Action isn't technical. Explain the limit, recommend upgrading the plan or disabling lower-priority schemas so the important ones fit under quota.
**I. Webhook-backed schema isn't receiving events**

- Symptoms: schema has `sync_type: "webhook"`, the initial bulk sync finished, but no new rows arrive despite activity on the source side. Status may still read `Completed` because the bulk sync (the safety-net cadence) is succeeding — the problem is the push path.
- Action: `external-data-sources-webhook-info-retrieve({source_id})`.
  - If `exists: false` → the webhook was never registered, or was deleted. Call `create-webhook-create` to register it.
  - If `exists: true` but `external_status.error` is set → typically "API key doesn't have permission to read webhooks" or similar. The webhook may have been deleted on the source's dashboard. Re-create it.
  - If `external_status.status` isn't `"enabled"` → the source disabled the webhook (usually after repeated delivery failures). Re-enable or re-register.
  - If payloads are arriving but failing signature verification → the signing secret was rotated. Get the new one from the source's dashboard and call `update-webhook-inputs-create({source_id}, {inputs: {signing_secret: "..."}})`.
- After any fix, check the source's webhook logs (on their side) to confirm PostHog is now responding 2xx.
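The webhook triage order can be sketched as a small decision function. The response shape (`exists`, `external_status.status`, `external_status.error`) follows this guide; treat the exact field names, and the returned action labels, as assumptions to verify against the live API.

```python
def webhook_next_action(info, signature_failures=False):
    """Walk the webhook checks in order and return the next tool/action to take."""
    if not info.get("exists"):
        return "create-webhook-create"           # never registered, or deleted
    ext = info.get("external_status") or {}
    if ext.get("error"):
        return "recreate-webhook"                # broken/deleted on the source side
    if ext.get("status") != "enabled":
        return "re-enable-or-re-register"        # source disabled it after failures
    if signature_failures:
        return "update-webhook-inputs-create"    # signing secret was rotated
    return "check-delivery-logs"                 # registration looks healthy
```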
### Step 5 — Confirm before destructive actions

Three recovery actions discard data and cannot be undone:

- `external-data-schemas-resync` — wipes synced rows, re-imports from scratch
- `external-data-schemas-delete-data` — drops the synced table
- `external-data-sources-destroy` — deletes the source and all its schemas
Always present the fix you're proposing and wait for explicit approval before calling any of these. "Just try resync" is rarely the right default.
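An agent implementation might enforce this with a hard gate rather than convention. A minimal sketch, assuming a hypothetical `call_tool` dispatcher (not a real client API):

```python
DESTRUCTIVE_TOOLS = {
    "external-data-schemas-resync",
    "external-data-schemas-delete-data",
    "external-data-sources-destroy",
}

def call_tool(tool, confirmed=False):
    """Refuse destructive tools unless the user has explicitly approved."""
    if tool in DESTRUCTIVE_TOOLS and not confirmed:
        raise PermissionError(f"{tool} discards data; get explicit user approval first")
    return f"called {tool}"
```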
## Example interaction
User: "Our Stripe sync is broken, can you check?"
Agent:
- `external-data-sources-list` → find Stripe source, `status = Error`
- `external-data-sources-retrieve({id})` → `latest_error: "authentication failed: 401 Unauthorized"`
- Report: "Your Stripe source's API key is no longer authenticating.
All 8 tables under it are failing with 401s. This usually means the key was rotated on the Stripe side.
To fix:
1. Grab a fresh restricted API key from the Stripe dashboard.
2. I'll update the source with the new key.
3. Syncs will resume automatically — no data loss.
Paste the new key here when ready."
User: "sk_live_..."
Agent:
- `external-data-sources-partial-update({id}, {job_inputs: {stripe_secret_key: "sk_live_..."}})`
- `external-data-sources-reload({id})` to trigger retry
- Report: "Updated and re-triggered. Check back in a few minutes — latest_error should clear."
## Important notes

- **Source status overrides schema status for diagnosis.** If the source is `Error`, nothing under it will work; fixing the source usually fixes all its schemas at once.
- **`Running` isn't always healthy.** Cross-check `last_synced_at`. A sync stuck in `Running` needs `cancel` then `reload`, not `resync`.
- **Resync is destructive.** It discards synced data. Only recommend it when the data itself is bad (duplicates, missing rows, corrupt) or when recovery genuinely requires a clean slate (lost WAL position on CDC). Never use it as a first try for transient errors.
- **Delete-data preserves config.** When a user says "I just want to start this table over from scratch", prefer `delete-data` + `reload` over `resync` + a new schema entry — it keeps the configured `sync_type` / `incremental_field` / PK setup.
- **Billing limits aren't technical failures.** Don't try to retry or reconfigure your way out. Route to billing.
- **Webhook failures can hide behind a green status.** A webhook-type schema whose bulk fallback sync succeeded looks `Completed` even when the push channel is broken. When users say "my data is hours behind" on a webhook schema, call `webhook-info-retrieve` before looking at schema status. Webhook issues don't surface on `external-data-schemas-list`.