skills/posthog/ai-plugin/diagnosing-failed-warehouse-syncs

diagnosing-failed-warehouse-syncs

Installation
SKILL.md

Diagnosing failed data warehouse syncs

Work top-down when a data warehouse source or table is failing, stuck, or producing bad data: source → schema → recovery action. Do not jump straight to "resync from scratch" — that discards synced data and restarts from zero, which is rarely the right first step.

When to use this skill

  • The user reports a specific sync is failing (e.g. "my Stripe source is red")
  • A table has been in Running state far longer than expected
  • Data in a warehouse table is stale, missing rows, or looks corrupt
  • Latest rows aren't appearing despite the schema being marked Completed
  • The user is choosing between cancel / reload / resync / delete-data and isn't sure which
  • Another skill — typically auditing-warehouse-data-health — has surfaced a failing source or schema and the user wants to dig into it

Both entry points (user-reported and audit-handoff) use the same workflow; the audit just means you already know which item to diagnose and can skip Step 1's discovery search.

Available tools

Tool Purpose
external-data-sources-list List all sources with connection status and latest error
external-data-sources-retrieve Full details for one source including all its schemas
external-data-schemas-list All table schemas across all sources, with per-table status + latest_error
external-data-schemas-retrieve Full details for one schema including sync_type_config
external-data-schemas-cancel Cancel a sync currently in Running state
external-data-schemas-reload Trigger a sync using the configured sync method (respects incremental)
external-data-schemas-resync Full resync — wipes synced data and restarts. Destructive
external-data-schemas-delete-data Delete the synced table but keep the schema entry
external-data-schemas-partial-update Change sync_type / incremental_field / cdc_table_mode
external-data-sources-partial-update Update a source's credentials (job_inputs) after rotation
external-data-sources-reload Retrigger syncs for every enabled schema on a source
external-data-sources-refresh-schemas Re-fetch the source's table list to pick up new tables
external-data-sources-check-cdc-prerequisites-create Verify Postgres CDC setup for a source
external-data-schemas-incremental-fields-create Refresh candidate incremental fields when the source schema has changed
external-data-sources-webhook-info-retrieve Check webhook registration state and external service status
external-data-sources-create-webhook-create Re-register a webhook that was lost or never registered
external-data-sources-update-webhook-inputs-create Update the signing secret after rotation on the source side
external-data-sources-delete-webhook-create Remove a broken webhook before re-registering

Workflow

Step 1 — Locate the failing item

If the user named a source, go straight to external-data-sources-retrieve. Otherwise start with external-data-sources-list and external-data-schemas-list to find what's red.

Two kinds of failure:

  • Source-level (ExternalDataSource.status = "Error"): the connection itself is broken — credentials expired, host unreachable, account disabled. Affects every table.
  • Schema-level — the source connects fine but one or more tables are failing. In the serialized API response from external-data-schemas-list, look for status values "Failed", "Billing limits", or "Billing limits too low". (The underlying model enum values are BillingLimitReached and BillingLimitTooLow, but the serializer rewrites them — match on both the human-readable and enum forms to be safe.)

A source can look Completed at the top level while one of its schemas is Failed — always check both.

Step 2 — Classify the schema status

From external-data-schemas-list, each schema has a status:

Status Meaning Usually means
Running Sync currently executing Normal, unless stuck for hours
Completed Last sync finished successfully Healthy
Failed Last sync errored — see latest_error Needs diagnosis
Paused User disabled sync (should_sync = false) Intentional
Billing limits (serializer) / BillingLimitReached (enum) Team hit its warehouse row quota Billing issue, not a technical failure
Billing limits too low (serializer) / BillingLimitTooLow (enum) Team has insufficient credit Billing issue

Always check last_synced_at alongside status. A schema in Running with last_synced_at from 12 hours ago is almost certainly stuck, even though the status isn't Failed.

Step 3 — Interpret latest_error

Map the latest_error string to a root cause. Common patterns:

Error substring Root cause Fix
authentication failed, 401, 403, invalid credentials Credentials expired or rotated User rotates creds, then external-data-sources-partial-update with new job_inputs
Could not establish session to SSH gateway SSH tunnel misconfigured or remote host down User checks SSH host/key/bastion
Primary key required for incremental syncs Table has no PK and sync_type is incremental/cdc Either add PK in source, or switch schema to full_refresh
primary keys for this table are not unique Declared PK columns aren't actually unique Pick different PK columns via partial-update
Integration matching query does not exist Source's saved integration was deleted Recreate the source
column "X" does not exist, does not have a column named Schema drift — incremental field or tracked column removed Use incremental-fields-create to re-detect, then partial-update
relation "..." does not exist Source table was dropped/renamed Remove schema or rename source-side
SSL, connection refused, timeout, unreachable Network / firewall / host reachability User side — check host/port/allowlist
replication slot, publication, wal_level CDC prerequisites broken Run check-cdc-prerequisites-create; may need slot recreate
Schema exceeds row limit, billing Billing limit Upgrade plan or disable the schema

If latest_error is null but the schema is Failed, retrieve the schema directly — the error may only be populated on the detail view.

Step 4 — Pick the recovery action

The recovery action depends on root cause, not just status. Match the user's situation to one of these:

A. Transient failure (network blip, temporary API outage)

  • Data synced so far is still valid.
  • Action: external-data-schemas-reload to retry using the configured sync method.
  • Incremental/append syncs pick up where they left off.

B. Credentials expired or rotated

  • Every schema under the source is failing with an auth error.
  • Action: user rotates creds → external-data-sources-partial-update with the new job_inputs → the reload happens automatically when the source status flips back to running, or trigger manually with external-data-sources-reload.

C. Schema drift — column renamed, dropped, or type changed

  • Error mentions a specific column that no longer matches the source.
  • Action: external-data-schemas-incremental-fields-create to get the current fields, then external-data-schemas-partial-update with the corrected incremental_field / incremental_field_type / primary_key_columns. Usually no need to wipe data.

C2. Added / renamed tables in the source database

  • User mentions "I added a new table to Postgres but it isn't appearing", or a source table was renamed.
  • Action: external-data-sources-refresh-schemas to pick up the new table list, then configure sync on any new schemas.

D. Incremental state is wrong (duplicates, missing rows, data looks corrupt)

  • Schema status may be Completed — this isn't a "failure" per se, it's bad data.
  • Action: external-data-schemas-resync to wipe synced data and re-import from source. Destructive but often the right call for data-quality issues.

E. CDC pipeline broken on Postgres

  • Error mentions replication slot, publication, WAL.
  • Action: external-data-sources-check-cdc-prerequisites-create to enumerate what's broken, fix on the Postgres side, then external-data-schemas-reload. If the WAL position was lost, a resync is sometimes unavoidable.

F. Sync is stuck in Running for hours

  • Check last_synced_at. If it's hours old and still Running, the job is orphaned.
  • Action: external-data-schemas-cancel to stop it, then external-data-schemas-reload.

G. Table data is corrupt but you want to keep the schema config

  • Action: external-data-schemas-delete-data to drop the synced table but preserve the schema entry. Next reload re-imports from scratch without losing the configured sync_type/incremental_field.

H. Billing limit

  • Action isn't technical. Explain the limit, recommend upgrading the plan or disabling lower-priority schemas so the important ones fit under quota.

I. Webhook-backed schema isn't receiving events

  • Symptoms: schema has sync_type: "webhook", initial bulk sync finished, but no new rows arrive despite activity on the source side. Status may still read Completed because the bulk sync (the safety-net cadence) is succeeding — the problem is the push path.
  • Action:
    1. external-data-sources-webhook-info-retrieve({source_id}).
    2. If exists: false → the webhook was never registered, or was deleted. Call create-webhook-create to register it.
    3. If exists: true but external_status.error is set → typically "API key doesn't have permission to read webhooks" or similar. The webhook may have been deleted on the source's dashboard. Re-create it.
    4. If external_status.status isn't "enabled" → the source disabled the webhook (usually after repeated delivery failures). Re-enable or re-register.
    5. If payloads are arriving but failing signature verification → the signing secret was rotated. Get the new one from the source's dashboard and call update-webhook-inputs-create({source_id}, {inputs: {signing_secret: "..."}}).
  • After any fix, check the source's webhook logs (on their side) to confirm PostHog is now responding 2xx.

Step 5 — Confirm before destructive actions

Three recovery actions discard data and cannot be undone:

  • external-data-schemas-resync — wipes synced rows, re-imports from scratch
  • external-data-schemas-delete-data — drops the synced table
  • external-data-sources-destroy — deletes the source and all its schemas

Always present the fix you're proposing and wait for explicit approval before calling any of these. "Just try resync" is rarely the right default.

Example interaction

User: "Our Stripe sync is broken, can you check?"

Agent:
- external-data-sources-list → find Stripe source, status = Error
- external-data-sources-retrieve({id}) → latest_error: "authentication failed: 401 Unauthorized"
- Report: "Your Stripe source's API key is no longer authenticating.
   All 8 tables under it are failing with 401s. This usually means the key was rotated on the Stripe side.

   To fix:
   1. Grab a fresh restricted API key from the Stripe dashboard.
   2. I'll update the source with the new key.
   3. Syncs will resume automatically — no data loss.

   Paste the new key here when ready."

User: "sk_live_..."

Agent:
- external-data-sources-partial-update({id}, {job_inputs: {stripe_secret_key: "sk_live_..."}})
- external-data-sources-reload({id}) to trigger retry
- Report: "Updated and re-triggered. Check back in a few minutes — latest_error should clear."

Important notes

  • Source status overrides schema status for diagnosis. If the source is Error, nothing under it will work; fixing the source usually fixes all its schemas at once.
  • Running isn't always healthy. Cross-check last_synced_at. A sync stuck in Running needs cancel then reload, not resync.
  • Resync is destructive. It discards synced data. Only recommend it when the data itself is bad (duplicates, missing rows, corrupt) or when recovery genuinely requires a clean slate (lost WAL position on CDC). Never use it as a first-try for transient errors.
  • Delete-data preserves config. When a user says "I just want to start this table over from scratch", prefer delete-data + reload over resync + new schema entry — it keeps the configured sync_type / incremental_field / PK setup.
  • Billing limits aren't technical failures. Don't try to retry or reconfigure your way out. Route to billing.
  • Webhook failures can hide behind a green status. A webhook-type schema whose bulk fallback sync succeeded looks Completed even when the push channel is broken. When users say "my data is hours behind" on a webhook schema, call webhook-info-retrieve before looking at schema status. Webhook issues don't surface on external-data-schemas-list.
Weekly Installs
6
GitHub Stars
26
First Seen
Today