fix-buildkite-ci
Fix Buildkite CI
Overview
Diagnose Buildkite failures programmatically and avoid guessing from UI screenshots. Prefer structured build/job JSON plus artifact inspection to find the exact failing test case and mismatch, then implement the smallest correct fix.
Target Selection
Resolve triage target with this precedence:
- If user provides a Buildkite build URL, use that build directly.
- Else if user specifies a branch and/or a pipeline (for example pull-request, main-cron), use the specified scope.
- Else default to the current git branch and inspect the checks for the PR associated with that branch.
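The precedence above can be sketched as a small resolver. This is a minimal illustration, not part of the skill itself: the function name resolve_target, its parameters, and the returned dictionary keys are hypothetical, and the URL pattern assumes the usual https://buildkite.com/<org>/<pipeline>/builds/<N> shape.

```python
import re
from typing import Optional

# Assumed URL shape: https://buildkite.com/<org>/<pipeline>/builds/<N>
BUILD_URL_RE = re.compile(r"buildkite\.com/([^/\s]+)/([^/\s]+)/builds/(\d+)")


def resolve_target(build_url: Optional[str] = None,
                   branch: Optional[str] = None,
                   pipeline: Optional[str] = None,
                   current_branch: Optional[str] = None) -> dict:
    """Pick the triage target using the documented precedence (hypothetical helper)."""
    # 1. An explicit build URL wins over everything else.
    if build_url:
        match = BUILD_URL_RE.search(build_url)
        if match is None:
            raise ValueError(f"not a Buildkite build URL: {build_url!r}")
        org, pipe, number = match.groups()
        return {"mode": "build", "org": org, "pipeline": pipe,
                "build_number": int(number)}
    # 2. Otherwise honor a user-specified branch and/or pipeline.
    if branch or pipeline:
        return {"mode": "scope", "branch": branch, "pipeline": pipeline}
    # 3. Otherwise fall back to the PR checks of the current git branch.
    return {"mode": "pr_checks", "branch": current_branch}
```

Returning a tagged dictionary keeps the three cases explicit, so later steps can branch on "mode" instead of re-deriving the precedence.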
Workflow
- Identify the failing Buildkite build(s).
- Retrieve build JSON and list failed jobs.
- Pull job logs and extract the first concrete failure signal.
- Inspect artifacts when top-level logs are truncated.
- Map failure to root cause and apply a focused fix.
- Verify locally where feasible and summarize evidence.
Use bk CLI first. If auth is unavailable, use public Buildkite JSON/log/artifact endpoints via curl.
For exact commands and endpoint patterns, read references/buildkite-ci-triage.md.
Step 1: Identify Failing Buildkite Checks
When no explicit target is given, find the PR for the current branch first, then run gh pr checks <PR_NUMBER> to find failing checks and capture Buildkite URLs (.../builds/<N>).
If user specifies a branch/pipeline, list and filter builds with bk build list using those parameters.
If user provides a Buildkite build URL, skip discovery and start from that build number.
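As a rough sketch of the URL-capture step, the helper below pulls Buildkite build URLs out of captured check output. It assumes each failing check appears on its own line with the word fail as a separate whitespace-delimited field; that layout, the helper name, and the sample text in the usage note are assumptions, so adjust the filter to whatever the real output looks like.

```python
import re
from typing import List

# Matches the .../builds/<N> portion and ignores any trailing #job fragment.
BUILD_URL_IN_TEXT = re.compile(r"https://buildkite\.com/[^/\s]+/[^/\s]+/builds/\d+")


def failing_buildkite_urls(checks_output: str) -> List[str]:
    """Return unique Buildkite build URLs found on lines marked as failing."""
    urls: List[str] = []
    for line in checks_output.splitlines():
        # Assumption: a failing check has "fail" as its own field on the line.
        if "fail" not in line.split():
            continue
        for url in BUILD_URL_IN_TEXT.findall(line):
            if url not in urls:
                urls.append(url)
    return urls
```

For example, feeding it a line such as "e2e fail 9m https://buildkite.com/acme/pull-request/builds/101" (a made-up check row) yields that single build URL, even if several failing jobs point at the same build.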
Step 2: Pull Build JSON and Failed Jobs
Fetch builds/<N>.json, then list failed jobs by non-zero exit_status.
Capture at least:
- pipeline
- build number
- job id
- job name
- exit status
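A minimal sketch of the filtering step, assuming the build JSON exposes a top-level number and a jobs list whose entries carry id, name, and exit_status. Those field names are assumptions about the payload; verify them against an actual builds/<N>.json response before relying on them.

```python
from typing import Any, Dict, List


def failed_jobs(build: Dict[str, Any], pipeline: str) -> List[Dict[str, Any]]:
    """List jobs with a non-zero exit status, keeping the fields worth reporting."""
    rows: List[Dict[str, Any]] = []
    for job in build.get("jobs", []):
        status = job.get("exit_status")
        # Jobs that never ran (wait steps, skipped jobs) have no exit status.
        if status is None:
            continue
        # Tolerate exit statuses serialized as strings.
        code = int(status)
        if code == 0:
            continue
        rows.append({
            "pipeline": pipeline,
            "build_number": build.get("number"),
            "job_id": job.get("id"),
            "job_name": job.get("name"),
            "exit_status": code,
        })
    return rows
```

Keeping the result as plain dictionaries makes it easy to print as a table or carry into the final report.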
Step 3: Extract the Concrete Failure
Fetch each failed job log and search for high-signal patterns:
- query result mismatch
- [Diff] (-expected|+actual)
- query is expected to fail with error:
- panic/assertion lines
- deterministic simulation error markers
- OOM/timeout/cancellation markers
Stop once you have one concrete failing file/case and mismatch.
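The search can be sketched as a first-match scan over the log text. The three SQLLogicTest markers are taken from the list above; the panic and resource regexes are assumptions about typical log wording, and deterministic simulation markers are left out because their exact text is not given here.

```python
import re
from typing import Optional, Tuple

# Ordered list of (label, pattern). Only the first three come from the list above;
# the last two are guesses at common wording and should be tuned per pipeline.
PATTERNS = [
    ("slt_mismatch", re.compile(r"query result mismatch")),
    ("diff_header", re.compile(r"\[Diff\] \(-expected\|\+actual\)")),
    ("expected_error", re.compile(r"query is expected to fail with error:")),
    ("panic", re.compile(r"panicked at|assertion.*failed", re.IGNORECASE)),
    ("resource", re.compile(r"\bOOM\b|out of memory|timed out|timeout|cancel",
                            re.IGNORECASE)),
]


def first_failure_signal(log_text: str) -> Optional[Tuple[int, str, str]]:
    """Return (line_number, label, line) for the earliest matching line, or None."""
    for line_no, line in enumerate(log_text.splitlines(), start=1):
        for label, pattern in PATTERNS:
            if pattern.search(line):
                return line_no, label, line.strip()
    return None
```

Scanning line by line and returning on the first hit mirrors the "stop once you have one concrete failure" rule; the lines just after the reported line number usually hold the diff itself.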
Step 4: Fall Back to Artifacts
If logs only show wrapper errors (for example, command exited with status), inspect artifacts from the same job, especially:
- risedev-logs.zip
- risedev-logs/nodetype-*.log
Extract and search artifact logs for the exact mismatch.
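One way to search a downloaded log archive without unpacking it to disk is sketched below. It assumes the artifact is an ordinary zip whose members are text logs; the helper name, the default glob, and the idea of passing the archive as bytes are illustrative choices, not something the pipeline prescribes.

```python
import fnmatch
import io
import zipfile
from typing import List, Tuple


def search_artifact_logs(zip_bytes: bytes, needle: str,
                         name_glob: str = "*.log") -> List[Tuple[str, int, str]]:
    """Return (member_name, line_number, line) for every log line containing needle."""
    hits: List[Tuple[str, int, str]] = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            # fnmatchcase's "*" also matches "/", so nested members are included.
            if not fnmatch.fnmatchcase(name, name_glob):
                continue
            text = archive.read(name).decode("utf-8", errors="replace")
            for line_no, line in enumerate(text.splitlines(), start=1):
                if needle in line:
                    hits.append((name, line_no, line))
    return hits
```

Decoding with errors="replace" keeps the scan going even when a log contains stray binary output.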
Step 5: Apply Focused Fixes
Prefer minimal fixes tied to evidence:
- SQLLogicTest mismatch: update expected sections in the correct .slt/.slt.part file only when query output change is intentional.
- Wrong runtime behavior: fix source code and keep tests as-is.
- Flaky/cancellation-only signal (143): treat as infra/cancel unless corroborated by product errors.
Avoid broad "retry and hope" actions without root-cause evidence.
Step 6: Verify and Report
Run the narrowest local check that validates the fix when possible. If full validation is not feasible, state it explicitly.
Always report:
- failing check/build/job identifiers
- failing file/test/case
- exact mismatch/error evidence
- applied fix (files changed)
- verification status and remaining risk
Buildkite-Specific Heuristics
- Exit code 105: often wrapper failure from docker-compose/plugin; inspect SLT/e2e logs for true mismatch.
- Exit code 4: common in simulation/recovery steps; inspect uploaded simulation logs.
- Exit code 143: usually cancellation/termination, not a deterministic product regression.
- raw_log_url may be null in JSON; use explicit job log endpoints by job id.
- Prefer JSON endpoints plus jq; avoid scraping large HTML pages.
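The exit-code heuristics can be kept as a small lookup so every failed job gets a consistent first hint. The table below only restates the heuristics above; the function name and the fallback wording are hypothetical. For context, 143 is 128 + 15, the conventional shell exit status for a process terminated by SIGTERM, which is why it usually indicates cancellation.

```python
from typing import Dict

# Hints restate the heuristics above; they are starting points, not verdicts.
EXIT_CODE_HINTS: Dict[int, str] = {
    105: "likely wrapper failure (docker-compose/plugin); inspect SLT/e2e logs",
    4: "common in simulation/recovery steps; inspect uploaded simulation logs",
    143: "usually cancellation/termination (128 + SIGTERM); not a product regression",
}


def triage_hint(exit_status: int) -> str:
    """Map a job exit status to a first-pass triage hint."""
    return EXIT_CODE_HINTS.get(
        exit_status, "no heuristic for this code; read the job log directly")
```

A hint never replaces evidence: even for 143, confirm that the log shows no product error before classifying the failure as infra-only.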