ux-audit
UX Audit
Exhaustively audit a web app from the perspective of a real user. Two lenses, combined:
- Flow lens — walk the 3–5 real tasks a user would do ("threads"). Count clicks, mark dead ends, screenshot every state change.
- Element lens — after threads, sweep every interactive element not already hit. Forms, menus, toggles, edge data volumes.
Then run an eight-scenario battery (first contact, interrupted workflow, wrong turn recovery, returning user, keyboard only, heavy data, destructive confidence, second user) and close with a fix-and-verify loop for critical findings.
This is the thorough one. There is no quick mode. If the user wants a targeted check ("audit the dashboard"), scope the audit to that area — but within that area, still go exhaustive. Runtime is unbounded: assume a long-running session, write findings to the report incrementally, don't batch.
Setup
Five things before any browsing. Stop if any fail.
1. Browser tool
| Target | Tool | Why |
|---|---|---|
| Authenticated app (internal tools, client apps) | Chrome MCP | Uses your real logged-in Chrome session — OAuth, cookies, RBAC just work |
| Public site (marketing, competitor, unauthed) | Playwright MCP | No login needed |
| Neither available | Stop | Ask the user to connect Chrome MCP or install Playwright |
Do not silently fall back to a fresh Playwright session for an authenticated app — the audit is worthless if you can't log in. If Chrome MCP isn't connected, stop and say: "Open Chrome, click Connect in the Claude extension, then rerun."
See references/browser-tools.md for commands.
2. URL
Prefer the deployed/live version — that's what real users see.
- Check `wrangler.jsonc` for `pattern` or `custom_domain`
- Check CLAUDE.md, README, and the `package.json` `homepage` field
- Fall back to `lsof -i :5173 -i :3000 -i :8787 -t` for a running dev server
- Ask the user
Live site has real auth, real latency, real CDN and CORS behaviour. Local misses deployment-specific bugs (missing env vars, broken asset paths, CORS, slow APIs). Only use local if the user asks or the feature isn't deployed.
3. Viewport
Pin the window at the start:
width: 1440, height: 900
This is the standard MacBook resolution and the audit's baseline. Responsive sweep (below) also tests 1280, 1024, 768, 375. Do not go above 2000px wide — it breaks the harness.
4. Persona
The audit needs a persona. Without one, findings drift toward generic "looks fine". Source in order:
- Argument — if the user provided one ("ux audit as a busy insurance broker")
- Project personas — read `.jez/personas/default.md` and `.jez/personas/<app-name>.md` if they exist
- Ask — if neither, ask one question: "Who uses this app and what are they trying to get done?"
Capture: role, tech comfort, time pressure, emotional state, device context. A good persona predicts what they'd miss ("A receptionist between phone calls won't scroll below the fold").
5. Screenshot post-processing
On Retina Macs, Chrome captures screenshots at 2× the logical viewport — a 1440×900 window produces a 2880×1800 PNG. That's larger than the vision pipeline needs and wastes context. Downsize screenshots after each capture phase:
img-process batch ./screenshots --action optimise --max-width 1440
The optimise action is idempotent — no-op if already ≤ 1440px, downsize otherwise. Safe to run unconditionally, on every folder, at any point. At minimum, run it after each thread traversal and after the responsive sweep.
Playwright MCP users can set deviceScaleFactor: 1 in the browser context to sidestep this entirely — screenshots match viewport 1:1. Chrome MCP has no equivalent flag, so post-process instead.
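If you are driving Playwright through its own API rather than only the MCP tools, a minimal sketch of pinning the baseline viewport together with the scale factor looks like this (URL and paths are placeholders):

```ts
import { chromium } from 'playwright';

// Sketch: pin the audit baseline and force 1:1 pixel capture so screenshots
// match the logical viewport and need no downsizing pass.
const browser = await chromium.launch();
const context = await browser.newContext({
  viewport: { width: 1440, height: 900 },
  deviceScaleFactor: 1,
});
const page = await context.newPage();
await page.goto('https://app.example.com'); // placeholder URL
await page.screenshot({ path: 'screenshots/00-baseline.png' });
```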
Discovery
Sitemap crawl
Build the complete page inventory before auditing any page.
- Router config — read the app's route definitions (React Router, TanStack Router, Next.js app dir)
- Nav crawl — click through every section and sub-section of the sidebar/menu
- Deep links — URLs in CLAUDE.md, docs, or a prior audit report
One line per route, with purpose: /app/clients — list of clients, search, add new.
Thread inventory
Identify 3–5 real tasks that make up a user's day. These are the spines of the audit.
Examples:
- Insurance broker app: renew a policy, create a new client, work through today's queue
- Project management: morning triage, update a task, send a client summary
- SEO tool: check latest crawl, diagnose a regression, export a report
How to find them:
- Ask the user if they know the app's use cases
- Read CLAUDE.md / README for "what this app does"
- Infer from top-level nav — each major section usually maps to a thread
Write them down. They structure the traversal phase.
Element inventory
For each route as you reach it, list every interactive element:
/app/clients — 31 elements
Buttons: Add Client, Import, Export, Filter, Sort (5)
Inputs: Search
Table: row-click ×N, star toggle ×N, action menu ×N
Pagination: Next, Prev, page numbers, items-per-page
This inventory powers the coverage metric. After the audit you can say "tested 29 of 31 elements on /app/clients; 2 pagination controls not reached due to data volume". Coverage becomes arithmetic, not a vibe.
Build inventories lazily — per-page as you traverse, not all up-front.
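The inventory shape is not prescribed anywhere; a hedged sketch of one that makes the coverage arithmetic trivial (names are illustrative only):

```ts
// Illustrative only: any structure that lets you divide tested by inventoried works.
interface InventoriedElement {
  label: string;                     // e.g. "Add Client"
  kind: 'button' | 'link' | 'input' | 'select' | 'toggle' | 'menu' | 'pagination' | 'other';
  tested: boolean;
  note?: string;                     // why it wasn't reached, if untested
}

type RouteInventory = Record<string, InventoriedElement[]>; // keyed by route path

function coverage(inventory: RouteInventory): string {
  const all = Object.values(inventory).flat();
  const tested = all.filter((el) => el.tested).length;
  const pct = all.length ? Math.round((tested / all.length) * 100) : 0;
  return `${tested} of ${all.length} inventoried elements tested (${pct}%)`;
}
```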
Thread Traversal (primary phase)
For each thread:
- Start from the app entry point — not mid-thread. Real users land at `/` or `/dashboard`.
- Walk as the persona — if they'd skim, skim. If they'd misread a label, misread it. Note the hesitations.
- Screenshot every state change — default → hover/focus → active → after click → after load → confirmation. Meaningful transitions, not every pixel movement. With modern context windows and sub-agent review, don't scrimp — the filmstrip is the evidence.
- Track the cost:
- Click count from entry to completion
- Decision points (moments you stopped to think)
- Dead ends (wrong paths and backtracks)
- Interrupt recovery (close the tab at step 3, return at step 4 — did state survive?)
- Hand screenshots to a sub-agent for review — at the end of each thread, dispatch: "Here are 42 screenshots from Thread 2 (Create Client). Flag every moment of: unclear next action, broken visual hierarchy, lost context, confusing copy, missing feedback, anxiety, or 'did that work?'. Return structured findings." The sub-agent returns a list; you stay focused on driving the browser.
At the end of each thread, answer three questions (as the persona):
- Did it end clearly? Did I know the task was done?
- Would I come back? Confidence, trust, momentum.
- One thing to make this twice as easy? Often the most actionable insight in the report.
See references/walkthrough-checklist.md for per-screen evaluation prompts and references/workflow-comprehension.md for wayfinding, mental model, and page-to-page continuity.
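A hedged sketch of what each thread's cost tracking could look like in the report; field names are illustrative, not taken from the references:

```ts
// Illustrative per-thread record: one of these per thread, written incrementally.
interface ThreadResult {
  name: string;                  // e.g. "Create Client"
  clicksToComplete: number;
  decisionPoints: string[];      // moments the persona stopped to think
  deadEnds: string[];            // wrong paths and backtracks
  interruptRecovered: boolean;   // did state survive closing the tab mid-thread?
  endedClearly: boolean;         // did the persona know the task was done?
  wouldComeBack: boolean;        // confidence, trust, momentum
  twiceAsEasy: string;           // the one change that would halve the effort
  screenshotDir: string;         // filmstrip handed to the review sub-agent
}
```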
Element Exhaustion (completeness phase)
For each route, work the inventory. Skip elements already exercised by thread traversal.
| Element type | Test |
|---|---|
| Button | Click, screenshot result. If it opens a modal → test the modal. If it triggers an action → verify feedback. |
| Link | Follow it. Destination matches the label? |
| Input | Valid, invalid, empty, very long, special characters (O'Brien, café, emoji, <script>). |
| Select / dropdown | Every option. Verify behaviour per option. |
| Toggle / checkbox | On, off. Persistence on refresh? |
| Tab | Every tab. URL updates? Back button works? |
| Accordion | Expand, collapse. State saves? |
| Menu / context menu | Every item. Right-click for context menus. |
| Drag handle | Drag to new position. Drag to invalid target. |
| Pagination | Next, prev, jump, items-per-page. |
For every form:
- Valid data → verify creation/update/confirmation
- Invalid data → field-level error, form state preserved
- Empty submit → required-field marking
- Tab through fields → focus order logical
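The special-character and validity sweeps above benefit from a fixed set of hostile-but-realistic values. A sketch of one such set (the values are suggestions, not a spec):

```ts
// Starting set of edge-case input values for the element and form sweeps.
const inputEdgeCases: string[] = [
  "O'Brien",                      // apostrophe: escaping and search issues
  'café',                         // non-ASCII characters
  '👩‍👩‍👧 family emoji',              // multi-byte grapheme clusters
  '<script>alert(1)</script>',    // HTML injection / sanitisation
  '',                             // empty submit: required-field marking
  ' leading and trailing  ',      // whitespace trimming
  'x'.repeat(500),                // very long value: truncation, layout overflow
];
```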
For every list/table, test at these volumes if data permits:
- 0 items (empty state)
- 1 item (pluralisation edge case)
- 100 items (pagination)
- 1000+ items (virtualisation, search, filter performance)
Write findings to the report as you go. Don't hold them in memory.
Responsive Sweep
For each route:
| Width | Mode | Why |
|---|---|---|
| 1440px | light | Baseline (matches audit viewport) |
| 1280px | light | Standard desktop |
| 1024px | light | Tablet landscape / nav collapse point |
| 768px | light | Tablet portrait / stacked layout |
| 375px | light | Mobile / touch targets |
| 1440px | dark | Dark mode baseline |
| 375px | dark | Dark mode mobile |
On any page that breaks at an intermediate width, screenshot the transition point.
Run the automated layout-detection JS (overflow, clipping, invisible text) at each width — snippets in references/walkthrough-checklist.md. Read console output after injection; every warning is a potential finding.
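The real layout-detection snippets live in references/walkthrough-checklist.md; the sketch below only shows the shape of the sweep, assuming direct Playwright access, with a minimal horizontal-overflow check standing in for the full set:

```ts
// Responsive pass: resize, screenshot, run a crude overflow check at each width.
const widths = [1440, 1280, 1024, 768, 375];
for (const width of widths) {
  await page.setViewportSize({ width, height: 900 });
  await page.screenshot({ path: `screenshots/responsive/${width}.png`, fullPage: true });

  // Any element extending past the right edge of the viewport is a candidate finding.
  const overflowing = await page.evaluate(() =>
    Array.from(document.querySelectorAll('*'))
      .filter((el) => el.getBoundingClientRect().right > window.innerWidth + 1)
      .slice(0, 10)
      .map((el) => `${el.tagName.toLowerCase()}${el.id ? '#' + el.id : ''}`)
  );
  if (overflowing.length) console.warn(`${width}px overflow:`, overflowing);
}
```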
Scenario Battery
Eight scenarios. All eight, always. They catch what screen-by-screen testing misses. Full protocols in references/scenario-tests.md.
- First Contact — figure out the app with zero prior knowledge, then write a 2-minute plain-English guide to each thread. Gaps in the guide are UX gaps.
- Interrupted Workflow — start a task, close the tab, refresh, navigate away mid-form. Does state survive?
- Wrong Turn Recovery — deliberately click the wrong thing. How many clicks to get back on track? What context is lost?
- Returning User — repeat a thread already done. Faster the second time? Shortcuts available? On the dashboard, can you tell what changed since last visit?
- Keyboard Only — unplug the mouse. Can every thread complete keyboard-only? Focus visible? Tab order logical? Escape closes?
- Heavy Data — seed with 500+ records. Lists virtualise? Search returns the right thing? Filters narrow meaningfully?
- Destructive Confidence — for every delete, send, publish, pay, share: is consent clear? Does confirm copy tell you what's about to happen? Undo available?
- Second User — log in as a restricted role (viewer not editor, client not staff). Read-only views coherent? Permissions errors make sense?
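As one illustration of how a scenario protocol turns into browser commands, a Keyboard Only probe might look like the sketch below (full protocol in references/scenario-tests.md; the 25-tab limit is arbitrary):

```ts
// Tab through the page and record whether each focused element shows a visible ring.
const focusTrail: string[] = [];
for (let i = 0; i < 25; i++) {
  await page.keyboard.press('Tab');
  focusTrail.push(
    await page.evaluate(() => {
      const el = document.activeElement as HTMLElement | null;
      if (!el || el === document.body) return '(focus lost)';
      const style = getComputedStyle(el);
      const ringVisible = style.outlineStyle !== 'none' || style.boxShadow !== 'none';
      return `${el.tagName.toLowerCase()}${el.id ? '#' + el.id : ''} ring=${ringVisible}`;
    })
  );
}
console.log(focusTrail.join('\n')); // gaps, loops, and ring=false entries are findings
```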
Live Interaction Smoke
Code reading verifies a button exists and has an onClick. It does not verify that clicking the button actually does something observable. Bugs of this shape — "the handler runs, fires a call into an SDK, but the flow never completes" — are invisible to static analysis and require a live click + network check.
Run these for every interactive control on every page:
1. Click it. Pointer moves, element highlights, click lands.
2. Watch the Network tab. Did a request fire? To the right URL? Correct method + body shape?
3. Watch the DOM. Did something visibly change — new element, removed element, state transition (loading spinner, toast, route change)?
4. If nothing changed in (2) or (3), that's a bug. The control LOOKS alive but isn't doing its job. Log it.
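A minimal Playwright sketch of checks 1 to 3 for a single control; the button name, request filter, and timeout are placeholders, and comparing serialised HTML is a blunt stand-in for a proper DOM-change observation:

```ts
// Click one control, then ask: did a request fire, and did the DOM visibly change?
const before = await page.content();

const [response] = await Promise.all([
  page
    .waitForResponse((res) => res.request().method() !== 'GET', { timeout: 5_000 })
    .catch(() => null), // null means no mutating request fired at all
  page.getByRole('button', { name: 'Save' }).click(), // placeholder control
]);

const after = await page.content();

if (!response) console.warn('Check 2 failed: no network request fired');
else if (!response.ok()) console.warn(`Check 2: request failed ${response.status()} ${response.url()}`);
if (before === after) console.warn('Check 3 failed: no visible DOM change');
```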
Known control categories that silently fail
| Control category | Silent-failure mode to watch for |
|---|---|
| Approve / Deny buttons on tool-call cards | Handler fires but server never hears about it (SDK needs a separate "send on state change" callback). See rules/ai-sdk-tool-approval-autosubmit.md. |
| "Connect X" OAuth buttons inside dialogs | window.open() silently popup-blocked when click originates in a modal. Must use window.location.href. See rules/oauth-popup-blocked-in-dialogs.md. |
| Save / update buttons on forms with async validation | Button disables during mutation but the mutation itself silently 5xx'd. No toast, no error state, form just sits there. |
| Delete / archive actions | Optimistic UI removes the row but server rejected — after refresh, the row is back. |
| Pagination / "Load more" buttons | Fires request but response empty due to off-by-one offset. |
| Filter chips on list views | Query param updates but query key doesn't — TanStack Query serves stale cached results. |
| "Reply" / "Forward" in email-style UIs | Opens compose pane but Message-ID headers not set — reply threads orphan in recipient's inbox. |
SDK contract check
When the page uses a third-party SDK with its own state model, verify the SDK's required options are passed. Silent failures usually trace to an undocumented-but-required option.
Common SDK contracts to verify:
| SDK | Option that silently breaks behaviour if missing |
|---|---|
| @ai-sdk/react useChat with needsApproval: true tools | sendAutomaticallyWhen: lastAssistantMessageIsCompleteWithApprovalResponses |
| @ai-sdk/react useChat with custom transport | prepareSendMessagesRequest reading latest refs (otherwise pinned to initial values) |
| better-auth createAuthClient | sessionOptions.refetchOnWindowFocus: false for SPAs that route on session state |
| TanStack Query QueryClient | refetchOnWindowFocus: false if your app redirects on empty query results |
| React Router v7 createBrowserRouter | loader / action defined for routes that need data (not just component) |
| Radix Dialog | modal: true + onEscapeKeyDown handler if Escape should do more than close |
| zodResolver | as any around schema if using Zod v4 and resolver is older — silent validation miss |
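As one concrete instance, pinning the TanStack Query row from the table above is a one-line default (standard TanStack Query API; whether your app needs it depends on how it reacts to empty or refetched results):

```ts
import { QueryClient } from '@tanstack/react-query';

// Disable focus-triggered refetches globally so returning to the tab can't race
// a redirect-on-empty-results effect. Individual queries can still override this.
const queryClient = new QueryClient({
  defaultOptions: {
    queries: {
      refetchOnWindowFocus: false,
    },
  },
});
```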
If the page uses an SDK not on this list, spend 2 minutes reading its "useX" export's options. Anything named *On*Change, *On*Finish, *SendAutomatically*, *RefetchOn*, or *Configure* is a prime suspect for "silent failure because it's undefined."
Passive Error Sweeps
Run these throughout, not as a dedicated phase.
Console errors
On every page load, check the browser console:
- JS errors (TypeError, ReferenceError) — High
- Framework warnings (key props, deprecated APIs) — Medium
- CSP violations — Medium
- Deprecation warnings — Low
Network errors
Monitor network responses across the entire session. Visual browsing misses API failures — TanStack Query and SWR swallow errors and show empty/loading states instead. A "No results found" might actually be a 403.
| Status | Severity | Usual cause |
|---|---|---|
| 500+ | Critical | Server error |
| 403 | High | Permissions or route collision (static shadowed by /:param) |
| 404 | Medium | Missing endpoint |
| 401 | Low | Expected unauthenticated; flag on authenticated pages |
| CORS | High | Missing headers — production-only failure |
Chrome MCP: use read_network_requests after each page. Playwright: attach page.on('response') before browsing starts.
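For Playwright, "attach before browsing starts" might look like the sketch below; Chrome MCP users rely on read_network_requests instead, as noted above.

```ts
// Attach once, before any navigation; findings accumulate for the whole session.
const networkFindings: string[] = [];
const consoleFindings: string[] = [];

page.on('response', (res) => {
  if (res.status() >= 400) {
    networkFindings.push(`${res.status()} ${res.request().method()} ${res.url()}`);
  }
});
page.on('requestfailed', (req) => {
  // CORS and DNS failures surface here rather than as HTTP status codes.
  networkFindings.push(`FAILED ${req.method()} ${req.url()}: ${req.failure()?.errorText ?? 'unknown'}`);
});
page.on('console', (msg) => {
  if (msg.type() === 'error' || msg.type() === 'warning') {
    consoleFindings.push(`${msg.type()}: ${msg.text()}`);
  }
});
page.on('pageerror', (err) => consoleFindings.push(`uncaught: ${err.message}`));
```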
Report
Write to docs/ux-audit-YYYY-MM-DD.md as you go. If .jez/artifacts/ exists, use .jez/artifacts/ux-audit-YYYY-MM-DD.md instead.
Structure in references/report-template.md:
- Summary
- Coverage metrics (threads walked, elements tested, inventoried coverage %)
- Findings by severity (Critical → Low) with screenshot refs
- Per-thread results
- Scenario battery results
- Network + console error inventory
- What works well
- Priority recommendations
Severity:
- Critical — user cannot complete the task
- High — confusion, friction, trust damage
- Medium — suboptimal, workable
- Low — polish
Fix-and-Verify Loop
After writing the report, offer the loop:
"Found N critical and M high issues. Fix them now and re-verify?"
If yes:
- Group findings by file/area
- Patch each one
- Re-walk just the affected slice (not the whole app)
- Update the report: mark each finding ✓ fixed, ✗ still present, or ⚠ new issue found
- Close with a "fixed in this session" summary
Closes the loop inside one session instead of waiting for a second audit pass tomorrow.
Cross-reference with ux-extract
If a pattern library exists at .jez/artifacts/ux-extracts/<ref>.md or docs/ux-extracts/<ref>.md, read it before starting and use it as the bar. Findings can then say:
"Empty state on /app/clients shows no CTA. Reference (claude.ai) shows 3 keyboard shortcuts + 'New chat' button in the same position."
Graceful if absent — the audit still works without a reference.
Autonomy
- Just do it: Navigate, screenshot, read pages, inject layout-detection JS, submit forms with obviously-fake test data, write the report file, dispatch screenshot-review sub-agents
- Ask first: Destructive actions (delete, send, publish, pay). For Destructive Confidence testing, ask once before running that scenario
- Stop and confirm: Anything that emails/notifies external people
Execution discipline
Three principles for audit runs that catch the most bugs:
- Drive the audit from the main session, not a sub-agent. Sub-agents are fine for bounded analysis tasks (reviewing a batch of captured screenshots, summarising a thread). But the audit itself — the navigation, the noticing, the "this button looks off after the previous interaction" kind of judgement — must happen in the session that's been watching the app. Cross-interaction state lives in that session's working memory; a fresh sub-agent starts cold and misses second-order findings.
- Use the browser tool directly. Chrome MCP or Playwright MCP via the main session's tool calls. Don't hand screenshots to a fresh agent and ask for opinions — that loses the state-dependence that makes audits catch logic errors, business-logic issues, and odd edge cases.
- Loop to exhaustion with variations. Don't stop after one pass through the checklist. After each pass, generate a new angle — different persona, different workflow, different input volume, different starting point, different validation lens — and re-walk. Stop only when a full pass produces no new findings. Single-pass audits by definition miss the second-order issues.
For audits expected to run longer than 30 minutes, set up a 15-min /loop check-in alongside the main session — it journals findings, grounds the session, and provides a natural termination signal. See references/long-running-check-in-pattern.md.
Reference files
| When | Read |
|---|---|
| Per-screen evaluation questions | references/walkthrough-checklist.md |
| Workflow threads, wayfinding, mental model alignment, page-to-page continuity | references/workflow-comprehension.md |
| Full protocol for each of the 8 scenarios | references/scenario-tests.md |
| Report format and severity rubrics | references/report-template.md |
| Browser tool commands and viewport notes | references/browser-tools.md |
| Long-running audit supervision via 15-min /loop check-in | references/long-running-check-in-pattern.md |
Tips
- Thread first, elements second. If time is short, threads tell you more.
- Sub-agents for screenshot review. Don't drive the browser and analyse 200 screenshots in one loop.
- Write findings incrementally. The report file is cheaper memory than your context.
- Stay in persona. If you catch yourself thinking "a developer would know…" stop. Your persona doesn't.
- Every hesitation is a finding. If you paused to figure out what to click, that's friction worth reporting.
- Coverage is arithmetic. Tested ÷ inventoried. Publish the ratio in the report.