Deflake E2E Tests from a CI Run

Use this skill when the user points you at a specific failing CI run (e.g. https://github.com/dyad-sh/dyad/actions/runs/<id>) and asks you to root-cause the E2E failures. Unlike deflake-e2e, this skill starts by reading the already-recorded Playwright report from the run's artifacts, which is faster and gives you the exact failure state CI saw. After making fixes, always rebuild and rerun the affected E2E tests locally before committing/pushing.

Arguments

$ARGUMENTS: The GitHub Actions run URL or run ID. If absent, ask the user.

Phase 1 — Get the report

Extract run_id from the URL (/actions/runs/<run_id> or /actions/runs/<run_id>/job/<job_id>).

List artifacts and find the html-report (merged across shards):

gh api repos/dyad-sh/dyad/actions/runs/<run_id>/artifacts --jq '.artifacts[] | {name, size_in_bytes}'

Download it into a scratch dir (use -R dyad-sh/dyad — gh run download does not auto-detect the repo from arbitrary cwd):
```
mkdir -p /tmp/pw-report
gh run download <run_id> -R dyad-sh/dyad -n html-report -D /tmp/pw-report
```
Confirm layout: index.html, results.json, data/*.zip (trace archives), data/*.png (screenshots), data/*.markdown (error-context files).

Phase 2 — Enumerate failures

Use jq on results.json. The schema has suites[].specs[], with each spec's tests[].results[] holding one result per attempt.

Stats headline: jq '.stats' results.json → {expected, skipped, unexpected, flaky}.

Unexpected (all attempts failed):

jq '[.suites[].specs[]? | select(.ok == false) | {title, file,
    err: [.tests[].results[] | {status, error: .error.message}]}]' results.json

Flaky (some attempt failed but final passed):

jq '[.suites[].specs[]? | select(.tests[].status == "flaky") | {title, file}]' results.json

Group by error shape. If every failure shares the same locator / error ("element is not enabled", "locator.click timeout", etc.) you're probably looking at one root cause across multiple tests. Don't investigate them all — pick one representative trace.

Phase 3 — Analyze a specific failure

Find the trace zip. The attachments[].path in results.json points at all-blob-reports/resources/<hash>.zip — those are CI-side paths, not local. The file actually lives at /tmp/pw-report/data/<hash>.zip. Match by hash, or grep the trace for the test title / spec file:
```
for f in /tmp/pw-report/data/*.zip; do
  hit=$(unzip -p "$f" test.trace | grep -c "chat_tabs\.spec\.ts:68")
  [ "$hit" -gt 0 ] && echo "$f"
done
```
Extract: unzip -o <zip> -d /tmp/trace-extract.

Read the step-by-step actions (test.trace is JSONL):

import json
for line in open('/tmp/trace-extract/test.trace'):
    obj = json.loads(line)
    if obj.get('type') == 'before' and obj.get('class') == 'Test':
        print(round(obj['startTime']/1000, 2), obj.get('method'), obj.get('title','')[:200])

Look for the last few actions before the timeout — that tells you which call hung and what its locator resolved to.

Correlate with app logs. Electron console.log/console.error lands in stderr/stdout trace events:

for line in open('/tmp/trace-extract/test.trace'):
    obj = json.loads(line)
    if obj.get('type') in ('stderr','stdout'):
        text = obj.get('text','')
        if 'proposal' in text or 'chatId' in text or 'stream' in text.lower():
            print(text[:300])

IPC log lines like (proposal_handlers) › IPC: get-proposal returned: … reveal what state the backend was in at failure time — gold for race-condition root-causing.

View the failure screenshot. Trace resources are stored unhashed; PNG files in /tmp/trace-extract/resources/ are screenshots. Resize before Read (Claude's image limit is ~1.5MB):
```
sips -Z 800 /tmp/trace-extract/resources/<hash> --out /tmp/fail.png
```
Then Read /tmp/fail.png. This is often the single most useful artifact — e.g. an "empty input, disabled Send button" screenshot is a dead giveaway for a fill() race.

Phase 4 — Root-cause playbook

Common patterns and what they mean:

"element is not enabled" on a button after fill() → React render race between URL/atom state updates and the editor's onChange. The fill runs, onChange writes under the old key, next render clears the editor for the new context. Fix: wrap fill+click in expect.toPass() and assert editor content + button enabled before clicking. See ChatActions.sendPrompt().
"locator.click timeout" with multiple matching elements → stale component still in DOM during a transition. Fix: scope the locator tighter (getChatInputContainer().locator(...)) or add a visibility assertion on the stable target first.
Assertion flakes right after navigation → atom/URL mismatch during a single render cycle. Either wait for a post-navigation signal (e.g. a data-loaded state) or wrap the assertion in toPass with a bounded timeout.
Different error on retry vs. first attempt → test is mutating shared state. Look for missing teardown or cross-test singletons.

Prefer fixing the test over the app unless the race would actually bite a real user. A real user can't type at 2ms after clicking a button; Playwright can. A retry wrapper is the correct contract there.

Phase 5 — Fix, verify, PR

Make the minimal change — usually in e2e-tests/helpers/page-objects/ since many specs share the same helper.
npm run fmt && npm run lint && npm run ts.
Rebuild the app locally before running E2E. E2E tests run against the built app, so use the repository's standard build command:
```
npm run build
```
If the known Homebrew Python 3.14 pyexpat native rebuild issue occurs, rerun with:
```
PYTHON=/usr/bin/python3 npm run build
```
Rerun the affected E2E test files locally after the rebuild. Prefer the narrowest set that covers the CI failures you fixed:
```
PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<affected-file>.spec.ts
```
If the fix is in a shared helper that affected several failing specs, run all representative affected specs in one command or separate commands.
Use /dyad:pr-push or commit + gh pr create directly. The PR body MUST include:
- A link to the failing run.
- The root-cause narrative (what raced, in concrete terms — not "timing issue").
- Why the fix is correct (what the retry loop is doing that the original flow wasn't).
- The local build and affected E2E commands you ran.

Gotchas

gh run download needs -R <owner>/<repo> if you're not in a cwd with matching origin.
results.json paths inside attachments[] are CI-side; only use them to match hashes, never to read files.
A fork PR's artifacts live on the fork's run, not the upstream's. Make sure run_id is on the right repo.
Many traces unpack to the same /tmp/trace-extract/ — clean between extractions or use unique subdirs.
The html-report is the merged report across shards. Individual shard artifacts (blob-report-*, flakiness-report-*) are usually unnecessary for root-causing.

dyad:deflake-e2e-from-run

Deflake E2E Tests from a CI Run

Arguments

Phase 1 — Get the report

Phase 2 — Enumerate failures

Phase 3 — Analyze a specific failure

Phase 4 — Root-cause playbook

Phase 5 — Fix, verify, PR

Gotchas

More from dyad-sh/dyad

dyad:deflake-e2e-recent-commits

dyad:fix-issue

dyad:pr-fix:actions

dyad:fast-push

dyad:pr-rebase

dyad:pr-fix:comments