ai-bug-triage by petrkindlmann/qa-skills

Key reframe: The LLM is best at explaining and routing, not deduplication. Teach agents to DESIGN the pipeline, not BE the pipeline.

Before starting: Check for .agents/qa-project-context.md in the project root. It contains tech stack, component mapping, and known flaky areas that improve classification accuracy.

Discovery Questions

Before building or using a triage pipeline, clarify:

What is the failure source?
- CI pipeline logs (GitHub Actions, GitLab CI, Jenkins, CircleCI)
- Test framework output (Playwright, Jest, pytest, Vitest)
- Production error monitoring (Sentry, Datadog, Bugsnag)
- Manual bug reports from QA or users
What is the ticket destination?
- Jira, Linear, GitHub Issues, Azure DevOps, Shortcut
- What fields are required? (component, severity, priority, labels)
- What workflows exist? (triage board, auto-assignment rules)
What is the deduplication scope?
- Same test run? Same sprint? Same release? All time?
- Do you already have fingerprinting? What is the current duplicate rate?
What approval workflow is needed?
- Auto-create tickets, then have a human review them?
- Suggest tickets for human approval before creation?
- Auto-close duplicates? (dangerous -- require approval)
What historical data exists?
- Past bug reports with resolution data?
- Flaky test history? Known environment issues?
- Component ownership mapping?

Core Principles

Deterministic first, LLM second. Use stable, reproducible fingerprinting for deduplication and clustering. Use the LLM only for tasks that require understanding: severity classification, root cause hypothesis, and human-readable ticket writing.
Normalize before comparing. Raw CI logs are full of timestamps, port numbers, process IDs, and random suffixes that make identical failures look different. Strip all noise before fingerprinting.
Fingerprints are anchored to stable elements. Exception type, top stack frames, test name, error message template, and URL pattern are stable. Timestamps, request IDs, and ephemeral ports are not.
Human approval before destructive actions. Auto-closing a ticket as duplicate or auto-merging reports requires human confirmation. False deduplication wastes more time than manual triage.
Classification drives routing. The value of triage is not the label itself but the routing decision it enables: which team, what priority, what SLA.
Track triage accuracy. Measure how often auto-classification matches human judgment. Below 85% accuracy, the pipeline needs tuning.
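
The accuracy principle can be made measurable with very little code. The following is a minimal sketch, assuming reviewed triage decisions are logged as (automatic label, human label) pairs. The function names and the log shape are hypothetical; the 0.85 threshold is the one stated above.

```python
from typing import Iterable, Tuple

ACCURACY_THRESHOLD = 0.85  # threshold from the principle above


def triage_accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Share of failures where the automatic label equals the human label."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no reviewed triage decisions to score")
    matches = sum(1 for auto, human in pairs if auto == human)
    return matches / len(pairs)


def needs_tuning(accuracy: float) -> bool:
    """True when accuracy falls below the threshold."""
    return accuracy < ACCURACY_THRESHOLD
```

Scoring each field separately (category, severity, component) would show which part of the pipeline drifts first.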

The Pipeline

CI Log / Error Report
  │
  ▼
Step 1: NORMALIZE
  Strip timestamps, process IDs, ports, random suffixes, ANSI codes
  │
  ▼
Step 2: EXTRACT STABLE ANCHORS
  Exception type, top N stack frames, test name, error message template, URL pattern
  │
  ▼
Step 3: HASH CANONICAL FORM
  Deterministic fingerprint from ordered anchors
  │
  ▼
Step 4: CLUSTER NEAR-DUPLICATES
  Similarity scoring for non-identical but related failures
  │
  ▼
Step 5: LLM CLASSIFY
  Severity, component, suspected root cause, failure category
  │
  ▼
Step 6: LLM GENERATE TICKET
  Title, description, repro steps, evidence, suggested assignee
  │
  ▼
Step 7: HUMAN APPROVAL
  Review before create/close/merge

Step 1: Normalize

Strip noise that makes identical failures look different.

Normalization rules (apply in order):

1. Strip ANSI color codes:        \x1b\[[0-9;]*m → ""
2. Strip timestamps:              \d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}[.\d]*Z? → "<TIMESTAMP>"
3. Strip UUIDs:                   [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12} → "<UUID>"
4. Strip process IDs:             pid[=: ]\d+ → "pid=<PID>"
5. Strip port numbers:            :\d{4,5}(?=[\s/]|$) → ":<PORT>"
6. Strip temp file paths:         /tmp/[^\s]+ → "<TMPPATH>"
7. Strip memory addresses:        0x[0-9a-fA-F]{8,16} → "<ADDR>"
8. Strip random suffixes:         [-_][a-z0-9]{6,8}(?=\.) → "<RAND>"
9. Strip request IDs:             (?:request[_-]?id|trace[_-]?id|correlation[_-]?id)[=: ]["']?[a-zA-Z0-9-]+ → "<REQ_ID>"
10. Collapse whitespace:          \s+ → " "

Example:

Before: 2025-03-22T14:32:01.456Z [pid=42891] Error: Connection refused at 127.0.0.1:54321
        request_id=abc-123-def-456
After:  <TIMESTAMP> [pid=<PID>] Error: Connection refused at 127.0.0.1:<PORT>
        <REQ_ID>
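
The rules above translate almost directly into an ordered list of regex substitutions. The sketch below is one possible implementation, not the pipeline's actual code. It assumes the rules are applied line by line so that line breaks survive for stack-frame extraction in Step 2, lets a port match at the end of a line, and matches UUIDs and addresses case-insensitively. The names `NORMALIZATION_RULES` and `normalize` are hypothetical.

```python
import re

# Ordered (pattern, replacement) pairs mirroring the rule list above.
# Order matters: timestamps are replaced before ports are considered.
NORMALIZATION_RULES = [
    (re.compile(r"\x1b\[[0-9;]*m"), ""),
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}[.\d]*Z?"), "<TIMESTAMP>"),
    (re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I), "<UUID>"),
    (re.compile(r"pid[=: ]\d+"), "pid=<PID>"),
    (re.compile(r":\d{4,5}(?=[\s/]|$)"), ":<PORT>"),
    (re.compile(r"/tmp/\S+"), "<TMPPATH>"),
    (re.compile(r"0x[0-9a-f]{8,16}", re.I), "<ADDR>"),
    (re.compile(r"[-_][a-z0-9]{6,8}(?=\.)"), "<RAND>"),
    (re.compile(r"(?:request[_-]?id|trace[_-]?id|correlation[_-]?id)[=: ][\"']?[a-zA-Z0-9-]+"), "<REQ_ID>"),
    (re.compile(r"\s+"), " "),
]


def normalize_line(line: str) -> str:
    """Apply every rule, in order, to a single log line."""
    for pattern, replacement in NORMALIZATION_RULES:
        line = pattern.sub(replacement, line)
    return line.strip()


def normalize(raw_log: str) -> str:
    """Normalize a log line by line, dropping lines that end up empty."""
    lines = (normalize_line(line) for line in raw_log.splitlines())
    return "\n".join(line for line in lines if line)
```

Keeping the rules as data makes it easy to add project-specific noise patterns without touching the loop.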

Step 2: Extract Stable Anchors

From the normalized log, extract elements that identify the failure regardless of environment or timing.

Anchor types (in priority order):

Anchor	Example	Stability
Exception type	`TypeError`, `AssertionError`, `HTTP 500`	Very high
Error message template	`Cannot read property 'X' of undefined`	High
Top 3 stack frames	`at processOrder (order.ts:142)`	High
Test name	`checkout.spec.ts > completes payment`	Very high
URL pattern	`POST /api/orders`	High
HTTP status code	`500`, `429`, `503`	Very high
Exit code	`exit code 1`, `SIGKILL`	High
Assertion diff	`Expected: 200, Received: 500`	Medium

Extraction rules:

Keep function names but strip line numbers (they change with edits)
Keep URL paths but strip query parameters and IDs in paths (/api/orders/<ID>)
Keep error message structure but replace dynamic values with placeholders
Keep test file and test name exactly as-is
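
A sketch of these extraction rules for JavaScript-style stack traces follows. It assumes frames look like `at fn (file:line)`, that the normalized log still has one frame per line, and that numbers with four or more digits are dynamic values (shorter ones, such as HTTP status codes, are kept). The regexes, the `Anchors` type, and the function names are illustrative assumptions, not part of the skill.

```python
import re
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Anchors:
    exception_type: str
    message_template: str
    top_frames: Tuple[str, ...]
    test_name: str


# "TypeError: message" or a bare "Error: message" anywhere in a line.
ERROR_LINE = re.compile(r"\b(?P<type>\w*(?:Error|Exception)): (?P<msg>.*)$", re.M)
# "    at processOrder (order.ts:142)" -> keeps only the function name.
FRAME_LINE = re.compile(r"^\s*at (?P<fn>[\w$.<>]+) \(", re.M)


def template_message(message: str) -> str:
    """Replace long numbers (assumed dynamic) with a placeholder."""
    return re.sub(r"\b\d{4,}\b", "<N>", message)


def url_pattern(method: str, url: str) -> str:
    """Drop the query string and replace numeric or UUID path segments."""
    path = url.split("?", 1)[0]
    segments = ["<ID>" if re.fullmatch(r"\d+|<UUID>", s) else s for s in path.split("/")]
    return f"{method} {'/'.join(segments)}"


def extract_anchors(normalized_log: str, test_name: str, max_frames: int = 3) -> Anchors:
    """Pull the stable anchors out of a normalized log; test name is kept as-is."""
    error = ERROR_LINE.search(normalized_log)
    exception_type = error.group("type") if error else "UnknownError"
    message = error.group("msg") if error else ""
    frames = tuple(m.group("fn") for m in FRAME_LINE.finditer(normalized_log))
    return Anchors(exception_type, template_message(message), frames[:max_frames], test_name)
```

Other runtimes (Python tracebacks, JVM stack traces) would need their own frame pattern; the anchor structure stays the same.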

Step 3: Hash Canonical Form

Create a deterministic fingerprint from the extracted anchors.

Algorithm:

1. Arrange anchors in a fixed order: exception type, message template, top frames, test name
2. Concatenate: exception_type + "|" + message_template + "|" + top_frames + "|" + test_name
3. SHA-256 hash the concatenated string
4. Take first 16 hex characters as fingerprint

Fingerprint properties:

Same failure always produces same fingerprint (deterministic)
Different failures almost always produce different fingerprints (16 hex characters = 64 bits, so collisions are improbable but not impossible)
Minor log format changes do not change fingerprint (stable)
Fingerprint is short enough for Jira labels and GitHub tags

Example:

Anchors:
  exception_type: "TypeError"
  message_template: "Cannot read property 'vendorId' of undefined"
  top_frames: "processOrder|groupByVendor|checkout"
  test_name: "checkout.spec.ts > multi-vendor checkout"

Canonical: "TypeError|Cannot read property 'vendorId' of undefined|processOrder|groupByVendor|checkout|checkout.spec.ts > multi-vendor checkout"
Fingerprint: a3f8b2c1e9d04567 (illustrative value)
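
The algorithm above fits in a few lines with the standard library. This is a minimal sketch; the function names are hypothetical, and it assumes anchors never contain the `|` separator in a way that would make two different anchor sets concatenate to the same string.

```python
import hashlib
from typing import Sequence


def canonical_form(exception_type: str, message_template: str,
                   top_frames: Sequence[str], test_name: str) -> str:
    """Join the anchors in a fixed order with '|' as the separator."""
    return "|".join([exception_type, message_template, "|".join(top_frames), test_name])


def fingerprint(canonical: str, length: int = 16) -> str:
    """First `length` hex characters of the SHA-256 digest of the canonical form."""
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return digest[:length]


canonical = canonical_form(
    "TypeError",
    "Cannot read property 'vendorId' of undefined",
    ["processOrder", "groupByVendor", "checkout"],
    "checkout.spec.ts > multi-vendor checkout",
)
print(canonical)
print(fingerprint(canonical))
```

Storing the canonical string alongside the fingerprint makes it possible to explain later why two failures did or did not match.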

Step 4: Cluster Near-Duplicates

Exact fingerprint matching catches identical failures. Similarity scoring catches related failures that differ slightly (same root cause, different manifestation).

Similarity dimensions:

Dimension	Weight	Match Criteria
Exception type	0.30	Exact match
Error message	0.25	Levenshtein distance < 20% of message length
Stack frames	0.25	Jaccard similarity of top 5 frames > 0.6
Component/file	0.10	Same directory or module
Test name	0.10	Same describe block or test file

Clustering threshold: similarity score > 0.75 = likely duplicate, suggest merge. Scores below 0.60 are treated as distinct failures.

Human review required for:

Scores between 0.60 and 0.75 (ambiguous)
First occurrence of a new fingerprint (no history to compare)
Failures in components with known intermittent issues
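
The weighted scoring can be sketched as follows. Each dimension contributes its full weight when its match criterion holds and nothing otherwise. Several details are assumptions for illustration: "message length" is taken as the longer of the two messages, "same module" is approximated by the same directory, "same test file" is the part of the test name before ` > `, and scores below 0.60 are treated as distinct. The `Failure` type and function names are hypothetical.

```python
import posixpath
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Failure:
    exception_type: str
    message: str
    frames: Tuple[str, ...]
    file_path: str
    test_name: str


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Size of the intersection over size of the union; 0.0 if both are empty."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 0.0


def similarity(x: Failure, y: Failure) -> float:
    """Sum the weights of every dimension whose match criterion holds."""
    score = 0.0
    if x.exception_type == y.exception_type:
        score += 0.30
    longest = max(len(x.message), len(y.message))
    if longest and levenshtein(x.message, y.message) < 0.20 * longest:
        score += 0.25
    if jaccard(x.frames[:5], y.frames[:5]) > 0.6:
        score += 0.25
    if posixpath.dirname(x.file_path) == posixpath.dirname(y.file_path):
        score += 0.10
    if x.test_name.split(" > ")[0] == y.test_name.split(" > ")[0]:
        score += 0.10
    return round(score, 2)


def verdict(score: float) -> str:
    """Map a score onto the thresholds described above."""
    if score > 0.75:
        return "likely duplicate"
    if score >= 0.60:
        return "needs human review"
    return "distinct"
```

Because every input is deterministic, the same pair of failures always receives the same score, which is the property the pipeline relies on for deduplication.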

Step 5: LLM Classify

After deterministic fingerprinting and clustering, use the LLM to classify the failure.

LLM classification prompt:

Given this normalized failure:

Exception: [TYPE]
Message: [MESSAGE]
Stack trace (top 5 frames): [FRAMES]
Test name: [TEST]
CI context: [branch, commit, runner OS]

Classify this failure:

1. **Failure category:** test bug | application bug | environment issue | flaky test | build failure
2. **Severity:** critical | major | minor | trivial (see severity matrix below)
3. **Component:** [infer from stack trace and file paths]
4. **Suspected root cause:** [1-2 sentence hypothesis]
5. **Confidence:** high | medium | low

If confidence is low, explain what additional information would help.

Failure categories (see references/ci-failure-analysis.md for detail):

Category	Description	Typical Action
Application bug	The app is broken	File bug ticket, assign to owning team
Test bug	The test is wrong	Fix the test, no app change needed
Environment issue	CI infra / network / service down	Retry, notify infra team
Flaky test	Intermittent, non-deterministic	Quarantine, investigate root cause
Build failure	Compilation, dependency, config	Fix build, usually blocking
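
Whatever model produces the classification, its reply should be validated before it drives routing. The sketch below assumes the prompt is extended to request a JSON object with the keys `category`, `severity`, `component`, `root_cause`, and `confidence`. That reply format and all names in the code are hypothetical; only the allowed values come from the prompt above. No model API is called here.

```python
import json
from dataclasses import dataclass

CATEGORIES = {"test bug", "application bug", "environment issue", "flaky test", "build failure"}
SEVERITIES = {"critical", "major", "minor", "trivial"}
CONFIDENCES = {"high", "medium", "low"}


@dataclass(frozen=True)
class Classification:
    category: str
    severity: str
    component: str
    root_cause: str
    confidence: str


def _clean(value: object) -> str:
    return str(value).strip().lower()


def parse_classification(raw_reply: str) -> Classification:
    """Parse a JSON reply and reject values outside the allowed sets."""
    data = json.loads(raw_reply)
    result = Classification(
        category=_clean(data.get("category", "")),
        severity=_clean(data.get("severity", "")),
        component=str(data.get("component", "")).strip(),
        root_cause=str(data.get("root_cause", "")).strip(),
        confidence=_clean(data.get("confidence", "")),
    )
    if result.category not in CATEGORIES:
        raise ValueError(f"unknown failure category: {result.category!r}")
    if result.severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {result.severity!r}")
    if result.confidence not in CONFIDENCES:
        raise ValueError(f"unknown confidence: {result.confidence!r}")
    return result


def needs_more_context(classification: Classification) -> bool:
    """Low-confidence results should go back for more information."""
    return classification.confidence == "low"
```

Rejecting out-of-vocabulary values keeps free-text drift (for example "app bug" or "sev1") out of tracker labels.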

Step 6: LLM Generate Ticket

Once classified, use the LLM to generate a human-quality bug ticket.

Ticket generation prompt:

Generate a bug ticket from this classified failure:

Failure category: [CATEGORY]
Severity: [SEVERITY]
Component: [COMPONENT]
Fingerprint: [HASH]
Suspected root cause: [HYPOTHESIS]

Normalized error:
[NORMALIZED ERROR WITH CONTEXT]

Original log excerpt (last 30 lines before failure):
[LOG EXCERPT]

Related failures (same cluster):
[LIST OF SIMILAR FINGERPRINTS WITH DATES]

Generate:
1. **Title:** concise, searchable, includes component name (under 80 chars)
2. **Description:** what happened, in plain language
3. **Steps to reproduce:** derived from test name and log context
4. **Evidence:** relevant log lines, assertion diffs, screenshots if available
5. **Suggested labels:** [component, severity, failure-category]
6. **Suggested assignee:** based on component ownership (if known)

Step 7: Human Approval

No automated action without review. The pipeline suggests; humans decide.

Approval decisions:

Create ticket — New failure, clear root cause, assign to team
Merge into existing — Duplicate of known issue, add evidence to existing ticket
Quarantine test — Flaky test, not an app bug, quarantine and schedule investigation
Retry and monitor — Environment issue, retry CI, alert if persists
Dismiss — Known issue already fixed in pending deploy, or test bug with obvious fix

Severity/Priority Matrix

Severity measures impact. Priority measures urgency. They are independent dimensions.

Severity Definitions

Severity	Definition	Examples
Critical	System unusable, data loss, security breach, no workaround	Payment processing fails, user data exposed, app crashes on launch
Major	Core feature broken, degraded experience, workaround exists	Search returns wrong results, checkout requires page reload, form data lost on back-button
Minor	Non-core feature affected, cosmetic with functional impact	Sorting does not persist, tooltip clipped on mobile, secondary action fails
Trivial	Cosmetic only, no functional impact	Typo in label, 1px alignment, inconsistent capitalization

Priority Definitions

Priority	Definition	SLA (example)
P0	Fix immediately, blocks release or production	Same day
P1	Fix this sprint, significant user impact	This sprint
P2	Fix next sprint, moderate impact	Next sprint
P3	Fix when convenient, low impact	Backlog

Severity x Priority Decision Guide

	Critical	Major	Minor	Trivial
Affects all users	P0	P0	P1	P2
Affects segment (≥10%)	P0	P1	P2	P3
Affects few users (<10%)	P1	P1	P2	P3
Edge case only	P1	P2	P3	P3
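
The decision guide is a plain lookup table, so it can be encoded as data. A minimal sketch, assuming the reach of a bug has already been assigned to one of the four rows; the key names and `suggest_priority` are hypothetical.

```python
# Rows: user reach. Columns: severity. Values copied from the guide above.
PRIORITY_MATRIX = {
    "all users": {"critical": "P0", "major": "P0", "minor": "P1", "trivial": "P2"},
    "segment":   {"critical": "P0", "major": "P1", "minor": "P2", "trivial": "P3"},
    "few users": {"critical": "P1", "major": "P1", "minor": "P2", "trivial": "P3"},
    "edge case": {"critical": "P1", "major": "P2", "minor": "P3", "trivial": "P3"},
}


def suggest_priority(severity: str, reach: str) -> str:
    """Look up the suggested priority; raises KeyError for unknown inputs."""
    return PRIORITY_MATRIX[reach.lower()][severity.lower()]


print(suggest_priority("Major", "segment"))  # P1
```

The result is a suggestion for the reviewer in Step 7, not a final priority.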

Bug Report Template

Use this template for any bug report, whether auto-generated or human-written.

## [Component] Brief description of the defect

**Severity:** Critical | Major | Minor | Trivial
**Priority:** P0 | P1 | P2 | P3
**Component:** [module/service/page]
**Environment:** [OS, browser, deploy environment]
**Fingerprint:** [if auto-generated: hash ID]
**Reporter:** [person or "auto-triage pipeline"]

### Description
[1-3 sentences: what is broken, who is affected, what is the business impact]

### Steps to Reproduce
1. [Precondition: user role, data state]
2. [Navigate to / call endpoint]
3. [Perform action]
4. [Observe failure]

### Expected Behavior
[What should happen]

### Actual Behavior
[What actually happens — include error messages verbatim]

### Evidence
- **Error log:** [relevant lines]
- **Screenshot:** [if applicable]
- **Assertion diff:** [expected vs actual values]
- **Trace/request ID:** [for distributed tracing]

### Frequency
- [Always | Intermittent (N/M runs) | Once observed]
- First seen: [date/commit]
- Last seen: [date/commit]

### Suggested Root Cause
[Hypothesis based on evidence — helps developer investigation]

### Related Issues
- [Links to similar/duplicate tickets]
- [Links to related PRs or deployments]

Deduplication Patterns

Pattern	Detection	Action
Exact duplicate	Same fingerprint	Merge into existing ticket, add evidence
Near-duplicate	Same cluster (similarity > 0.75)	Link tickets, suggest merge for human review
Same root cause, different symptom	Same exception type + overlapping frames in different tests	Create parent ticket linking symptom tickets
Regression of fixed bug	Fingerprint matches closed ticket	Reopen ticket, flag as regression, increase priority
Flaky recurrence	Same fingerprint intermittently across CI runs	Tag as flaky, quarantine if rate > 10%
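
Most rows of this table can be expressed as a small decision function. The following sketch assumes a lookup from fingerprint to ticket status ("open" or "closed") and a precomputed best similarity score. It leaves out the "same root cause, different symptom" row, which needs frame-overlap analysis. Every return value is a suggestion for the human approval step, and all names are hypothetical.

```python
from typing import Dict

NEAR_DUPLICATE_THRESHOLD = 0.75  # from the clustering step
FLAKY_QUARANTINE_RATE = 0.10     # from the table above


def suggest_dedup_action(
    fingerprint: str,
    tickets_by_fingerprint: Dict[str, str],
    best_similarity: float = 0.0,
    intermittent: bool = False,
    failure_rate: float = 0.0,
) -> str:
    """Pick the table row that applies and return its suggested action."""
    status = tickets_by_fingerprint.get(fingerprint)
    if status == "closed":
        return "reopen as regression"
    if status == "open":
        if intermittent:
            if failure_rate > FLAKY_QUARANTINE_RATE:
                return "tag flaky and quarantine"
            return "tag flaky"
        return "merge into existing ticket"
    if best_similarity > NEAR_DUPLICATE_THRESHOLD:
        return "link tickets and suggest merge"
    return "create new ticket"
```

Checking for a closed ticket first means a regression is never silently merged into an old report.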

CI Failure Analysis

See references/ci-failure-analysis.md for comprehensive patterns. Key decision rules:
- Consistent failure: test bug or app bug
- Intermittent failure: flaky test or environment issue
- Multiple failures at once: environment or shared component
- Build failure: code or dependency issue

Integration Patterns

GitHub Issues

# Check for an existing issue with the same fingerprint first
gh issue list --label "fingerprint:a3f8b2c1e9d04567" --state all

# If none exists, create the issue with labels from pipeline output
# (gh rejects labels that do not exist yet in the repository)
gh issue create \
  --title "[Checkout] Payment fails for multi-vendor carts" \
  --body "$(cat ticket-body.md)" \
  --label "bug,severity:critical,component:checkout,fingerprint:a3f8b2c1e9d04567" \
  --assignee "@me"

CI Pipeline Integration

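# GitHub Actions: run triage on test failure
- name: Triage failures
  if: failure()
  env:
    GH_TOKEN: ${{ github.token }}  # gh reads this token; the job needs issues: write permission
  run: |
    node scripts/extract-failures.js test-results/
    node scripts/triage-pipeline.js --input failures.json --output tickets/
    # Issues are opened for human review (Step 7); nothing is closed or merged here
    for ticket in tickets/*.json; do
      [ -e "$ticket" ] || continue  # skip when the glob matched nothing
      gh issue create --title "$(jq -r .title "$ticket")" \
        --body "$(jq -r .body "$ticket")" \
        --label "$(jq -r '.labels | join(",")' "$ticket")"
    done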

For Jira, Linear, and Azure DevOps integration, use their respective REST/GraphQL APIs with the same ticket data generated by Step 6. The pipeline output is tracker-agnostic -- it produces title, description, labels, severity, and component that map to any tracker's fields.

Anti-Patterns

1. Using LLM for Deduplication

LLMs are non-deterministic. The same two errors compared twice may get different similarity scores. Use deterministic fingerprinting for deduplication; use LLM only for explaining and classifying.

2. Auto-Closing Without Review

Automatically closing a ticket as "duplicate" based on fingerprint matching can merge distinct issues. Always require human confirmation for close/merge actions.

3. Over-Classifying Severity

If everything is "critical," nothing is. Follow the severity matrix strictly. A cosmetic typo is trivial even if it annoys someone.

4. Ignoring Environment Failures

Labeling all failures as "app bug" when many are CI infrastructure issues (Docker OOM, network timeout, disk full). Classify environment issues separately -- they need different remediation.

5. No Feedback Loop

Building the pipeline once and never measuring accuracy. Track: auto-classification accuracy, false duplicate rate, ticket quality ratings from developers.

6. Raw Logs in Tickets

Pasting 500 lines of raw CI output into a bug ticket. Normalize, extract relevant lines, and present the 5-10 lines that matter.

7. Fingerprinting Without Normalization

Hashing raw log lines produces unstable fingerprints that change every run. Normalization (Step 1) is mandatory before fingerprinting.

8. No Component Ownership Mapping

Classification without routing is useless. Maintain a component-to-team mapping so that classified bugs reach the right people.
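
A component-to-team mapping can start as a simple path-prefix table. This is a minimal sketch with made-up paths and team names; the longest matching prefix wins so that a nested component can override its parent.

```python
from typing import Dict

# Hypothetical ownership map: path prefix -> owning team.
COMPONENT_OWNERS: Dict[str, str] = {
    "src/checkout/": "payments-team",
    "src/checkout/vendors/": "marketplace-team",
    "src/search/": "search-team",
}


def owning_team(file_path: str, owners: Dict[str, str],
                fallback: str = "triage-rotation") -> str:
    """Return the team for the longest matching prefix, or the fallback."""
    matches = [prefix for prefix in owners if file_path.startswith(prefix)]
    if not matches:
        return fallback
    return owners[max(matches, key=len)]


print(owning_team("src/checkout/vendors/group.ts", COMPONENT_OWNERS))  # marketplace-team
```

A fallback owner ensures that failures in unmapped code still land in front of someone.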

Done When

Each triaged bug has severity, component, and root cause labels assigned
Duplicates merged or linked with references to the canonical ticket
CI failure analysis report generated summarizing failure categories and counts
Actionable tickets created for all P0 and P1 issues with assigned owners
Triage session findings summarized and shared with the team

Related Skills

qa-metrics — Track triage accuracy, duplicate rates, mean time to classification, and defect escape rates.
ci-cd-integration — Pipeline configuration for running triage on test failures, parallel execution, and reporting.
test-reliability — Flaky test classification, quarantine management, and root cause analysis.
qa-project-context — Project context that improves classification accuracy: component map, known issues, ownership.
ai-test-generation — Generate regression tests from triaged bug reports.

References

references/classification-taxonomy.md — Bug categories, severity definitions, component mapping rules, and root cause categories.
references/ci-failure-analysis.md — CI log parsing patterns, failure category decision tree, fingerprinting algorithm detail.