dyad:deflake-e2e-from-run
Deflake E2E Tests from a CI Run
Use this skill when the user points you at a specific failing CI run (e.g. `https://github.com/dyad-sh/dyad/actions/runs/<id>`) and asks you to root-cause the E2E failures. Unlike `deflake-e2e`, this skill starts by reading the already-recorded Playwright report from the run's artifacts, which is faster and gives you the exact failure state CI saw. After making fixes, always rebuild and rerun the affected E2E tests locally before committing or pushing.
Arguments
`$ARGUMENTS`: The GitHub Actions run URL or run ID. If absent, ask the user.
Phase 1 — Get the report
- Extract `run_id` from the URL (`/actions/runs/<run_id>` or `/actions/runs/<run_id>/job/<job_id>`).
- List artifacts and find the `html-report` (merged across shards):

      gh api repos/dyad-sh/dyad/actions/runs/<run_id>/artifacts --jq '.artifacts[] | {name, size_in_bytes}'

- Download it into a scratch dir (use `-R dyad-sh/dyad`; `gh run download` does not auto-detect the repo from an arbitrary cwd):

      mkdir -p /tmp/pw-report
      gh run download <run_id> -R dyad-sh/dyad -n html-report -D /tmp/pw-report

- Confirm layout: `index.html`, `results.json`, `data/*.zip` (trace archives), `data/*.png` (screenshots), `data/*.markdown` (error-context files).
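
The `run_id` extraction in the first step can be sketched in Python as follows. This is a minimal illustration, assuming run IDs are purely numeric; the function name `extract_run_id` is hypothetical and not part of the repository.

```python
import re

# Matches ".../actions/runs/<run_id>", optionally followed by "/job/<job_id>".
_RUN_URL = re.compile(r"/actions/runs/(\d+)(?:/job/\d+)?")


def extract_run_id(arg: str) -> str:
    """Return the run ID from a GitHub Actions run URL or a bare numeric ID."""
    arg = arg.strip()
    if arg.isdigit():
        # The argument is already a bare run ID.
        return arg
    match = _RUN_URL.search(arg)
    if match is None:
        raise ValueError(f"no run ID found in {arg!r}")
    return match.group(1)
```

Raising on unrecognized input, rather than guessing, matches the instruction to ask the user when the argument is missing or unusable.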
Phase 2 — Enumerate failures
Use `jq` on `results.json`. The schema has `suites[].specs[]`, with each spec's `tests[].results[]` holding one result per attempt.
- Stats headline: `jq '.stats' results.json` → `{expected, skipped, unexpected, flaky}`.
- Unexpected (all attempts failed):

      jq '[.suites[].specs[]? | select(.ok == false) | {title, file, err: [.tests[].results[] | {status, error: .error.message}]}]' results.json

- Flaky (some attempt failed but final passed):

      jq '[.suites[].specs[]? | select(.tests[].status == "flaky") | {title, file}]' results.json

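
If the report nests `describe` blocks as suites inside suites, a filter that only reads top-level `specs[]` can miss failures. The sketch below walks the tree recursively. It assumes each suite may carry its own `suites` array alongside `specs`, and that each test's `status` is one of the values shown in the stats headline. The helper names `iter_specs` and `classify` are hypothetical.

```python
from typing import Iterator


def iter_specs(suite: dict) -> Iterator[dict]:
    """Yield every spec in a suite, descending into nested suites."""
    yield from suite.get("specs", [])
    for child in suite.get("suites", []):
        yield from iter_specs(child)


def classify(report: dict) -> dict:
    """Split specs into 'unexpected' and 'flaky' buckets of (file, title)."""
    buckets = {"unexpected": [], "flaky": []}
    for top in report.get("suites", []):
        for spec in iter_specs(top):
            statuses = {t.get("status") for t in spec.get("tests", [])}
            entry = (spec.get("file"), spec.get("title"))
            if "unexpected" in statuses:
                buckets["unexpected"].append(entry)
            elif "flaky" in statuses:
                buckets["flaky"].append(entry)
    return buckets
```

To use it, load the report with `json.load(open("/tmp/pw-report/results.json"))` and pass the resulting dict to `classify`.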
Group by error shape. If every failure shares the same locator / error ("element is not enabled", "locator.click timeout", etc.) you're probably looking at one root cause across multiple tests. Don't investigate them all — pick one representative trace.
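
One way to group by error shape is to normalize each message before comparing. The sketch below is an illustration under assumptions: it treats the first line of the message as the shape, strips ANSI color codes that may appear in recorded messages, and masks digits so differing timeouts collapse together. The names `error_shape` and `group_by_shape` are hypothetical.

```python
import re
from collections import defaultdict


def error_shape(message: str) -> str:
    """Reduce an error message to a coarse, comparable shape."""
    # Drop ANSI color escape sequences if the message contains any.
    text = re.sub(r"\x1b\[[0-9;]*m", "", message).strip()
    first_line = text.splitlines()[0] if text else ""
    # Mask numbers (timeouts, line numbers) so near-identical errors match.
    return re.sub(r"\d+", "N", first_line)


def group_by_shape(failures):
    """Group (title, message) pairs by normalized error shape."""
    groups = defaultdict(list)
    for title, message in failures:
        groups[error_shape(message)].append(title)
    return dict(groups)
```

A group with many titles under one shape is the candidate for a single shared root cause; pick one title from it and analyze only that trace.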
Phase 3 — Analyze a specific failure
- Find the trace zip. The `attachments[].path` in `results.json` points at `all-blob-reports/resources/<hash>.zip`; those are CI-side paths, not local. The file actually lives at `/tmp/pw-report/data/<hash>.zip`. Match by hash, or grep the trace for the test title / spec file:

      for f in /tmp/pw-report/data/*.zip; do
        hit=$(unzip -p "$f" test.trace | grep -c "chat_tabs\.spec\.ts:68")
        [ "$hit" -gt 0 ] && echo "$f"
      done

- Extract: `unzip -o <zip> -d /tmp/trace-extract`.
- Read the step-by-step actions (`test.trace` is JSONL):

      import json

      for line in open('/tmp/trace-extract/test.trace'):
          obj = json.loads(line)
          if obj.get('type') == 'before' and obj.get('class') == 'Test':
              print(round(obj['startTime']/1000, 2), obj.get('method'), obj.get('title','')[:200])

  Look for the last few actions before the timeout; that tells you which call hung and what its locator resolved to.
- Correlate with app logs. Electron `console.log`/`console.error` lands in `stderr`/`stdout` trace events:

      import json

      for line in open('/tmp/trace-extract/test.trace'):
          obj = json.loads(line)
          if obj.get('type') in ('stderr','stdout'):
              text = obj.get('text','')
              if 'proposal' in text or 'chatId' in text or 'stream' in text.lower():
                  print(text[:300])

  IPC log lines like `(proposal_handlers) › IPC: get-proposal returned: …` reveal what state the backend was in at failure time, which is gold for race-condition root-causing.
- View the failure screenshot. Trace resources are stored unhashed; PNG files in `/tmp/trace-extract/resources/` are screenshots. Resize before Read (Claude's image limit is ~1.5MB):

      sips -Z 800 /tmp/trace-extract/resources/<hash> --out /tmp/fail.png

  Then `Read /tmp/fail.png`. This is often the single most useful artifact: e.g. an "empty input, disabled Send button" screenshot is a dead giveaway for a `fill()` race.
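
The "match by hash" step above amounts to keeping only the file name of the CI-side path and looking it up under the downloaded report's `data/` directory. A minimal sketch, assuming the report was downloaded to `/tmp/pw-report` as in Phase 1; the function name `local_trace_path` is hypothetical.

```python
from pathlib import Path, PurePosixPath


def local_trace_path(ci_path: str, report_dir: str = "/tmp/pw-report") -> Path:
    """Map a CI-side attachment path to its location in the downloaded report."""
    # Only the file name (<hash>.zip) carries over; the CI directory does not.
    name = PurePosixPath(ci_path).name
    return Path(report_dir) / "data" / name
```

The returned path is only a candidate; check `path.exists()` before unzipping, since not every attachment is necessarily present in the merged report.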
Phase 4 — Root-cause playbook
Common patterns and what they mean:
- "element is not enabled" on a button after `fill()` → React render race between URL/atom state updates and the editor's `onChange`. The fill runs, `onChange` writes under the old key, and the next render clears the editor for the new context. Fix: wrap fill+click in `expect(...).toPass()` and assert editor content + button enabled before clicking. See `ChatActions.sendPrompt()`.
- "locator.click timeout" with multiple matching elements → stale component still in DOM during a transition. Fix: scope the locator tighter (`getChatInputContainer().locator(...)`) or add a visibility assertion on the stable target first.
- Assertion flakes right after navigation → atom/URL mismatch during a single render cycle. Either wait for a post-navigation signal (e.g. a data-loaded state) or wrap the assertion in `toPass` with a bounded timeout.
- Different error on retry vs. first attempt → test is mutating shared state. Look for missing teardown or cross-test singletons.
Prefer fixing the test over the app unless the race would actually bite a real user. A real user can't type at 2ms after clicking a button; Playwright can. A retry wrapper is the correct contract there.
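
The retry contract described above can be illustrated with a generic, language-neutral sketch. This is not Playwright's implementation of `toPass`, and the real fix would live in the TypeScript helpers. It only shows the shape of the contract: re-run a block of "act, then assert" steps until the assertions hold or a bounded timeout elapses. The name `retry_until_pass` and its parameters are hypothetical.

```python
import time
from typing import Callable


def retry_until_pass(block: Callable[[], None], timeout: float = 5.0,
                     interval: float = 0.1) -> None:
    """Re-run block until it stops raising AssertionError or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            block()
            return
        except AssertionError:
            # Give up and surface the last failure once the budget is spent.
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)
```

The key design point is that the whole action sequence (fill, assert content, assert enabled, click) sits inside the retried block, so a render that wipes the input simply triggers another full attempt instead of a click on a disabled button.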
Phase 5 — Fix, verify, PR
- Make the minimal change, usually in `e2e-tests/helpers/page-objects/` since many specs share the same helper.
- `npm run fmt && npm run lint && npm run ts`.
- Rebuild the app locally before running E2E. E2E tests run against the built app, so use the repository's standard build command:

      npm run build

  If the known Homebrew Python 3.14 `pyexpat` native rebuild issue occurs, rerun with:

      PYTHON=/usr/bin/python3 npm run build

- Rerun the affected E2E test files locally after the rebuild. Prefer the narrowest set that covers the CI failures you fixed:

      PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<affected-file>.spec.ts

  If the fix is in a shared helper that affected several failing specs, run all representative affected specs in one command or separate commands.
- Use `/dyad:pr-push` or commit + `gh pr create` directly. The PR body MUST include:
  - A link to the failing run.
  - The root-cause narrative (what raced, in concrete terms, not "timing issue").
  - Why the fix is correct (what the retry loop is doing that the original flow wasn't).
  - The local build and affected E2E commands you ran.
Gotchas
- `gh run download` needs `-R <owner>/<repo>` if you're not in a cwd with matching origin.
- `results.json` paths inside `attachments[]` are CI-side; only use them to match hashes, never to read files.
- A fork PR's artifacts live on the fork's run, not the upstream's. Make sure `run_id` is on the right repo.
- Many traces unpack to the same `/tmp/trace-extract/`; clean between extractions or use unique subdirs.
- The `html-report` is the merged report across shards. Individual shard artifacts (`blob-report-*`, `flakiness-report-*`) are usually unnecessary for root-causing.
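
To avoid traces overwriting each other in a shared directory, each archive can be unpacked into its own fresh subdirectory. A minimal sketch using only the Python standard library; the function name `extract_trace` and the `trace-<stem>-` directory prefix are hypothetical choices.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_trace(zip_path: str, root: str = "/tmp") -> Path:
    """Unpack one trace archive into its own newly created directory."""
    stem = Path(zip_path).stem
    # mkdtemp guarantees a unique directory, even for repeated extractions.
    dest = Path(tempfile.mkdtemp(prefix=f"trace-{stem}-", dir=root))
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return dest
```

The returned directory then stands in for `/tmp/trace-extract` in the Phase 3 snippets.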