ai-bug-triage


Key reframe: The LLM is best at explaining and routing, not deduplication. Teach agents to DESIGN the pipeline, not BE the pipeline.

Before starting: Check for .agents/qa-project-context.md in the project root. It contains tech stack, component mapping, and known flaky areas that improve classification accuracy.


Discovery Questions

Before building or using a triage pipeline, clarify:

  1. What is the failure source?

    • CI pipeline logs (GitHub Actions, GitLab CI, Jenkins, CircleCI)
    • Test framework output (Playwright, Jest, pytest, Vitest)
    • Production error monitoring (Sentry, Datadog, Bugsnag)
    • Manual bug reports from QA or users
  2. What is the ticket destination?

    • Jira, Linear, GitHub Issues, Azure DevOps, Shortcut
    • What fields are required? (component, severity, priority, labels)
    • What workflows exist? (triage board, auto-assignment rules)
  3. What is the deduplication scope?

    • Same test run? Same sprint? Same release? All time?
    • Do you already have fingerprinting? What is the current duplicate rate?
  4. What approval workflow is needed?

    • Auto-create tickets with human review?
    • Suggest tickets for human approval before creation?
    • Auto-close duplicates? (dangerous -- require approval)
  5. What historical data exists?

    • Past bug reports with resolution data?
    • Flaky test history? Known environment issues?
    • Component ownership mapping?

Core Principles

  1. Deterministic first, LLM second. Use stable, reproducible fingerprinting for deduplication and clustering. Use LLM only for tasks requiring understanding: severity classification, root cause hypothesis, and human-readable ticket writing.

  2. Normalize before comparing. Raw CI logs are full of timestamps, port numbers, process IDs, and random suffixes that make identical failures look different. Strip all noise before fingerprinting.

  3. Fingerprints are anchored to stable elements. Exception type, top stack frames, test name, error message template, and URL pattern are stable. Timestamps, request IDs, and ephemeral ports are not.

  4. Human approval before destructive actions. Auto-closing a ticket as duplicate or auto-merging reports requires human confirmation. False deduplication wastes more time than manual triage.

  5. Classification drives routing. The value of triage is not the label itself but the routing decision it enables: which team, what priority, what SLA.

  6. Track triage accuracy. Measure how often auto-classification matches human judgment. Below 85% accuracy, the pipeline needs tuning.
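Principle 6 can be checked with a few lines of code. The following is a minimal sketch, assuming you log each auto-assigned label next to the label a human later confirmed; the function names and the pair format are hypothetical, and only the 85% threshold comes from the text above.

```python
# Assumption: each record is an (auto_label, human_label) pair collected
# after a human has reviewed the pipeline's suggestion.
ACCURACY_THRESHOLD = 0.85  # threshold stated in principle 6


def triage_accuracy(pairs):
    """Return the fraction of pairs where auto and human labels agree."""
    if not pairs:
        return None  # no reviewed samples yet, accuracy is undefined
    matches = sum(1 for auto, human in pairs if auto == human)
    return matches / len(pairs)


def needs_tuning(pairs, threshold=ACCURACY_THRESHOLD):
    """True when measured accuracy falls below the threshold."""
    accuracy = triage_accuracy(pairs)
    return accuracy is not None and accuracy < threshold
```

Run the same calculation separately for severity, component, and failure category, since one of them can drift while the others stay accurate.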


The Pipeline

CI Log / Error Report
Step 1: NORMALIZE
  Strip timestamps, process IDs, ports, random suffixes, ANSI codes
Step 2: EXTRACT STABLE ANCHORS
  Exception type, top N stack frames, test name, error message template, URL pattern
Step 3: HASH CANONICAL FORM
  Deterministic fingerprint from ordered anchors
Step 4: CLUSTER NEAR-DUPLICATES
  Similarity scoring for non-identical but related failures
Step 5: LLM CLASSIFY
  Severity, component, suspected root cause, failure category
Step 6: LLM GENERATE TICKET
  Title, description, repro steps, evidence, suggested assignee
Step 7: HUMAN APPROVAL
  Review before create/close/merge

Step 1: Normalize

Strip noise that makes identical failures look different.

Normalization rules (apply in order):

1. Strip ANSI color codes:        \x1b\[[0-9;]*m → ""
2. Strip timestamps:              \d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}[.\d]*Z? → "<TIMESTAMP>"
3. Strip UUIDs:                   [0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12} → "<UUID>"
4. Strip process IDs:             pid[=: ]\d+ → "pid=<PID>"
5. Strip port numbers:            :\d{4,5}(?=[\s/]|$) → ":<PORT>"
6. Strip temp file paths:         /tmp/[^\s]+ → "<TMPPATH>"
7. Strip memory addresses:        0x[0-9a-fA-F]{8,16} → "<ADDR>"
8. Strip random suffixes:         [-_][a-z0-9]{6,8}(?=\.) → "<RAND>"
9. Strip request IDs:             (?:request[_-]?id|trace[_-]?id|correlation[_-]?id)[=: ]["']?[a-zA-Z0-9-]+ → "<REQ_ID>"
10. Collapse whitespace:          \s+ → " "

Example:

Before: 2025-03-22T14:32:01.456Z [pid=42891] Error: Connection refused at 127.0.0.1:54321
        request_id=abc-123-def-456
After:  <TIMESTAMP> [pid=<PID>] Error: Connection refused at 127.0.0.1:<PORT>
        <REQ_ID>
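The rules above can be sketched as an ordered list of regex substitutions. This is a minimal illustration, not a reference implementation: the `RULES` and `normalize` names are hypothetical, hex digits are matched case-insensitively, and the port rule also accepts end-of-input. Both of those are assumptions marked in the comments. Because the last rule collapses all whitespace, the result is a single line.

```python
import re

# (pattern, replacement) pairs, applied in the order listed in Step 1.
RULES = [
    (re.compile(r"\x1b\[[0-9;]*m"), ""),                                   # ANSI colors
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}[.\d]*Z?"), "<TIMESTAMP>"),
    # Assumption: match upper-case hex too (re.I).
    (re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I), "<UUID>"),
    (re.compile(r"pid[=: ]\d+"), "pid=<PID>"),
    # Assumption: also match a port at the very end of the input (|$).
    (re.compile(r":\d{4,5}(?=[\s/]|$)"), ":<PORT>"),
    (re.compile(r"/tmp/[^\s]+"), "<TMPPATH>"),
    (re.compile(r"0x[0-9a-f]{8,16}", re.I), "<ADDR>"),
    (re.compile(r"[-_][a-z0-9]{6,8}(?=\.)"), "<RAND>"),
    (re.compile(r"(?:request[_-]?id|trace[_-]?id|correlation[_-]?id)[=: ][\"']?[a-zA-Z0-9-]+"), "<REQ_ID>"),
    (re.compile(r"\s+"), " "),                                             # collapse whitespace
]


def normalize(raw: str) -> str:
    """Apply every rule in order and trim the result."""
    text = raw
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.strip()


before = (
    "2025-03-22T14:32:01.456Z [pid=42891] Error: Connection refused at 127.0.0.1:54321\n"
    "        request_id=abc-123-def-456"
)
print(normalize(before))
# <TIMESTAMP> [pid=<PID>] Error: Connection refused at 127.0.0.1:<PORT> <REQ_ID>
```

Order matters: timestamps are replaced before the port rule runs, so the `:32:01` inside a timestamp is never mistaken for a port. Rule 8 is the most aggressive pattern and can match ordinary identifiers such as `_payment.`, so test it against real logs before relying on it.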

Step 2: Extract Stable Anchors

From the normalized log, extract elements that identify the failure regardless of environment or timing.

Anchor types (in priority order):

| Anchor | Example | Stability |
| --- | --- | --- |
| Exception type | TypeError, AssertionError, HTTP 500 | Very high |
| Error message template | Cannot read property 'X' of undefined | High |
| Top 3 stack frames | at processOrder (order.ts:142) | High |
| Test name | checkout.spec.ts > completes payment | Very high |
| URL pattern | POST /api/orders | High |
| HTTP status code | 500, 429, 503 | Very high |
| Exit code | exit code 1, SIGKILL | High |
| Assertion diff | Expected: 200, Received: 500 | Medium |

Extraction rules:

  • Keep function names but strip line numbers (they change with edits)
  • Keep URL paths but strip query parameters and IDs in paths (/api/orders/<ID>)
  • Keep error message structure but replace dynamic values with placeholders
  • Keep test file and test name exactly as-is
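Two of the extraction rules above (function names without line numbers, URL paths without query strings or IDs) can be sketched as follows. This is an illustrative sketch that assumes JavaScript-style `at fn (file:line)` stack frames and purely numeric path IDs; `top_frames` and `url_pattern` are hypothetical helper names, and other languages or ID formats need their own patterns.

```python
import re

# Assumption: frames look like "at processOrder (order.ts:142)".
# Frames without a function name (e.g. "at /srv/app.js:10:5") are skipped.
FRAME_RE = re.compile(r"^\s*at\s+([\w.$<>]+)")


def top_frames(stack_trace: str, n: int = 3) -> list:
    """Return the first n function names, dropping file names and line numbers."""
    frames = []
    for line in stack_trace.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append(match.group(1))
    return frames[:n]


def url_pattern(method: str, url: str) -> str:
    """Drop the query string and replace numeric path segments with <ID>."""
    path = url.split("?", 1)[0]
    segments = ["<ID>" if segment.isdigit() else segment for segment in path.split("/")]
    return f"{method.upper()} {'/'.join(segments)}"


trace = """TypeError: Cannot read property 'vendorId' of undefined
    at processOrder (order.ts:142)
    at groupByVendor (vendor.ts:88:12)
    at checkout (checkout.ts:10)
    at main (index.ts:1)"""

print(top_frames(trace))                                   # ['processOrder', 'groupByVendor', 'checkout']
print(url_pattern("post", "/api/orders/12345?expand=items"))  # POST /api/orders/<ID>
```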

Step 3: Hash Canonical Form

Create a deterministic fingerprint from the extracted anchors.

Algorithm:

1. Order anchors in a fixed sequence: exception type, message template, top frames, test name
2. Concatenate with "|" as the separator: exception_type + "|" + message_template + "|" + top_frames + "|" + test_name
3. SHA-256 hash the concatenated string (UTF-8 encoded)
4. Take first 16 hex characters (64 bits) as fingerprint

Fingerprint properties:

  • Same failure always produces same fingerprint (deterministic)
  • Different failures produce different fingerprints (64 bits of SHA-256 make accidental collisions negligible at typical ticket volumes)
  • Log noise removed in Step 1 (timestamps, IDs, ports) does not change the fingerprint (stable)
  • Fingerprint is short enough for Jira labels and GitHub labels

Example:

Anchors:
  exception_type: "TypeError"
  message_template: "Cannot read property 'vendorId' of undefined"
  top_frames: "processOrder|groupByVendor|checkout"
  test_name: "checkout.spec.ts > multi-vendor checkout"

Canonical: "TypeError|Cannot read property 'vendorId' of undefined|processOrder|groupByVendor|checkout|checkout.spec.ts > multi-vendor checkout"
Fingerprint: a3f8b2c1e9d04567 (illustrative value)
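The hashing step is short enough to sketch with the standard library. The function names below are hypothetical, and the anchors are joined in the order shown in the canonical string of the example. The fingerprint printed for this input is whatever SHA-256 produces; it is not claimed to equal the value shown in the example.

```python
import hashlib


def canonical_form(exception_type, message_template, top_frames, test_name):
    """Join the anchors in a fixed order with '|' as the separator."""
    return "|".join([exception_type, message_template, "|".join(top_frames), test_name])


def fingerprint(exception_type, message_template, top_frames, test_name):
    """First 16 hex characters of the SHA-256 of the canonical form."""
    canonical = canonical_form(exception_type, message_template, top_frames, test_name)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


anchors = dict(
    exception_type="TypeError",
    message_template="Cannot read property 'vendorId' of undefined",
    top_frames=["processOrder", "groupByVendor", "checkout"],
    test_name="checkout.spec.ts > multi-vendor checkout",
)
print(canonical_form(**anchors))
print(fingerprint(**anchors))
```

Because "|" also separates the individual frames, an anchor value that itself contains "|" could make two different anchor sets produce the same canonical string. Escaping the separator inside values, or hashing a JSON array instead, avoids that.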

Step 4: Cluster Near-Duplicates

Exact fingerprint matching catches identical failures. Similarity scoring catches related failures that differ slightly (same root cause, different manifestation).

Similarity dimensions:

| Dimension | Weight | Match Criteria |
| --- | --- | --- |
| Exception type | 0.30 | Exact match |
| Error message | 0.25 | Levenshtein distance < 20% of message length |
| Stack frames | 0.25 | Jaccard similarity of top 5 frames > 0.6 |
| Component/file | 0.10 | Same directory or module |
| Test name | 0.10 | Same describe block or test file |

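Similarity score: the sum of the weights of the dimensions whose match criteria are met (range 0 to 1). Clustering threshold: similarity score > 0.75 = likely duplicate, suggest merge.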

Human review required for:

  • Scores between 0.60 and 0.75 (ambiguous)
  • First occurrence of a new fingerprint (no history to compare)
  • Failures in components with known intermittent issues
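The scoring table and thresholds can be sketched with no dependencies. Several details here are assumptions rather than part of the specification above: each dimension contributes its full weight when its criterion is met and nothing otherwise, "message length" is taken as the longer of the two messages, and the component and test dimensions are reduced to simple equality on hypothetical `component` and `test_file` fields.

```python
WEIGHTS = {"exception": 0.30, "message": 0.25, "frames": 0.25, "component": 0.10, "test": 0.10}


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def jaccard(a, b) -> float:
    """Set overlap; two empty sets count as no evidence (0.0)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 0.0


def similarity(x: dict, y: dict) -> float:
    score = 0.0
    if x["exception"] == y["exception"]:
        score += WEIGHTS["exception"]
    longest = max(len(x["message"]), len(y["message"]), 1)
    if levenshtein(x["message"], y["message"]) < 0.20 * longest:
        score += WEIGHTS["message"]
    if jaccard(x["frames"][:5], y["frames"][:5]) > 0.6:
        score += WEIGHTS["frames"]
    if x["component"] == y["component"]:
        score += WEIGHTS["component"]
    if x["test_file"] == y["test_file"]:
        score += WEIGHTS["test"]
    return round(score, 2)


def decide(score: float) -> str:
    """Map a score to the bands described above."""
    if score > 0.75:
        return "suggest_merge"
    if score >= 0.60:
        return "human_review"
    return "distinct"
```

Both outcomes above the lower band are suggestions only. Nothing in this sketch merges or closes a ticket.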

Step 5: LLM Classify

After deterministic fingerprinting and clustering, use LLM to classify the failure.

LLM classification prompt:

Given this normalized failure:

Exception: [TYPE]
Message: [MESSAGE]
Stack trace (top 5 frames): [FRAMES]
Test name: [TEST]
CI context: [branch, commit, runner OS]

Classify this failure:

1. **Failure category:** test bug | application bug | environment issue | flaky test | build failure
2. **Severity:** critical | major | minor | trivial (see severity matrix below)
3. **Component:** [infer from stack trace and file paths]
4. **Suspected root cause:** [1-2 sentence hypothesis]
5. **Confidence:** high | medium | low

If confidence is low, explain what additional information would help.

Failure categories (see references/ci-failure-analysis.md for detail):

| Category | Description | Typical Action |
| --- | --- | --- |
| Application bug | The app is broken | File bug ticket, assign to owning team |
| Test bug | The test is wrong | Fix the test, no app change needed |
| Environment issue | CI infra / network / service down | Retry, notify infra team |
| Flaky test | Intermittent, non-deterministic | Quarantine, investigate root cause |
| Build failure | Compilation, dependency, config | Fix build, usually blocking |
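LLM output should be validated before it drives routing. The sketch below assumes the model was additionally asked to answer as a JSON object with the keys `category`, `severity`, `component`, `root_cause`, and `confidence`. That reply format and the function name are assumptions; only the allowed values come from the prompt and table above. The call to the model itself is left out.

```python
import json

ALLOWED = {
    "category": {"test bug", "application bug", "environment issue", "flaky test", "build failure"},
    "severity": {"critical", "major", "minor", "trivial"},
    "confidence": {"high", "medium", "low"},
}
REQUIRED_TEXT = ("component", "root_cause")


def parse_classification(raw_reply: str) -> dict:
    """Parse and validate a JSON classification; raise ValueError if it is unusable."""
    data = json.loads(raw_reply)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("classification must be a JSON object")
    problems = []
    for field, allowed in ALLOWED.items():
        value = str(data.get(field, "")).strip().lower()
        if value not in allowed:
            problems.append(f"{field}={data.get(field)!r} is not an allowed value")
        data[field] = value
    for field in REQUIRED_TEXT:
        if not str(data.get(field, "")).strip():
            problems.append(f"{field} is empty")
    if problems:
        raise ValueError("invalid classification: " + "; ".join(problems))
    # Low-confidence results are flagged for a human instead of being routed automatically.
    data["needs_human_review"] = data["confidence"] == "low"
    return data
```

Rejecting out-of-vocabulary labels keeps an invented severity such as "blocker" from reaching the tracker.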

Step 6: LLM Generate Ticket

Once classified, use LLM to generate a human-quality bug ticket.

Ticket generation prompt:

Generate a bug ticket from this classified failure:

Failure category: [CATEGORY]
Severity: [SEVERITY]
Component: [COMPONENT]
Fingerprint: [HASH]
Suspected root cause: [HYPOTHESIS]

Normalized error:
[NORMALIZED ERROR WITH CONTEXT]

Original log excerpt (last 30 lines before failure):
[LOG EXCERPT]

Related failures (same cluster):
[LIST OF SIMILAR FINGERPRINTS WITH DATES]

Generate:
1. **Title:** concise, searchable, includes component name (under 80 chars)
2. **Description:** what happened, in plain language
3. **Steps to reproduce:** derived from test name and log context
4. **Evidence:** relevant log lines, assertion diffs, screenshots if available
5. **Suggested labels:** [component, severity, failure-category]
6. **Suggested assignee:** based on component ownership (if known)

Step 7: Human Approval

No automated action without review. The pipeline suggests; humans decide.

Approval decisions:

  • Create ticket — New failure, clear root cause, assign to team
  • Merge into existing — Duplicate of known issue, add evidence to existing ticket
  • Quarantine test — Flaky test, not an app bug, quarantine and schedule investigation
  • Retry and monitor — Environment issue, retry CI, alert if persists
  • Dismiss — Known issue already fixed in pending deploy, or test bug with obvious fix

Severity/Priority Matrix

Severity measures impact. Priority measures urgency. They are independent dimensions.

Severity Definitions

| Severity | Definition | Examples |
| --- | --- | --- |
| Critical | System unusable, data loss, security breach, no workaround | Payment processing fails, user data exposed, app crashes on launch |
| Major | Core feature broken, degraded experience, workaround exists | Search returns wrong results, checkout requires page reload, form data lost on back-button |
| Minor | Non-core feature affected, cosmetic with functional impact | Sorting does not persist, tooltip clipped on mobile, secondary action fails |
| Trivial | Cosmetic only, no functional impact | Typo in label, 1px alignment, inconsistent capitalization |

Priority Definitions

| Priority | Definition | SLA (example) |
| --- | --- | --- |
| P0 | Fix immediately, blocks release or production | Same day |
| P1 | Fix this sprint, significant user impact | This sprint |
| P2 | Fix next sprint, moderate impact | Next sprint |
| P3 | Fix when convenient, low impact | Backlog |

Severity x Priority Decision Guide

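| User impact | Critical | Major | Minor | Trivial |
| --- | --- | --- | --- | --- |
| Affects all users | P0 | P0 | P1 | P2 |
| Affects segment (>10%) | P0 | P1 | P2 | P3 |
| Affects few users (<10%) | P1 | P1 | P2 | P3 |
| Edge case only | P1 | P2 | P3 | P3 |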
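The decision guide is a plain lookup, so encoding it as data keeps suggested priorities consistent between humans and the pipeline. The reach keys (`all_users`, `segment`, `few_users`, `edge_case`) and the function name are hypothetical; the cell values are copied from the guide above.

```python
# Rows: user reach. Columns: severity. Values copied from the decision guide.
PRIORITY_MATRIX = {
    "all_users": {"critical": "P0", "major": "P0", "minor": "P1", "trivial": "P2"},
    "segment":   {"critical": "P0", "major": "P1", "minor": "P2", "trivial": "P3"},
    "few_users": {"critical": "P1", "major": "P1", "minor": "P2", "trivial": "P3"},
    "edge_case": {"critical": "P1", "major": "P2", "minor": "P3", "trivial": "P3"},
}


def suggest_priority(severity: str, reach: str) -> str:
    """Look up a suggested priority; raises KeyError for unknown inputs."""
    return PRIORITY_MATRIX[reach][severity.strip().lower()]


print(suggest_priority("Critical", "few_users"))  # P1
print(suggest_priority("minor", "all_users"))     # P1
```

The result is a starting suggestion. Release timing or contractual commitments can still justify a different priority.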

Bug Report Template

Use this template for any bug report, whether auto-generated or human-written.

## [Component] Brief description of the defect

**Severity:** Critical | Major | Minor | Trivial
**Priority:** P0 | P1 | P2 | P3
**Component:** [module/service/page]
**Environment:** [OS, browser, deploy environment]
**Fingerprint:** [if auto-generated: hash ID]
**Reporter:** [person or "auto-triage pipeline"]

### Description
[1-3 sentences: what is broken, who is affected, what is the business impact]

### Steps to Reproduce
1. [Precondition: user role, data state]
2. [Navigate to / call endpoint]
3. [Perform action]
4. [Observe failure]

### Expected Behavior
[What should happen]

### Actual Behavior
[What actually happens — include error messages verbatim]

### Evidence
- **Error log:** [relevant lines]
- **Screenshot:** [if applicable]
- **Assertion diff:** [expected vs actual values]
- **Trace/request ID:** [for distributed tracing]

### Frequency
- [Always | Intermittent (N/M runs) | Once observed]
- First seen: [date/commit]
- Last seen: [date/commit]

### Suggested Root Cause
[Hypothesis based on evidence — helps developer investigation]

### Related Issues
- [Links to similar/duplicate tickets]
- [Links to related PRs or deployments]

Deduplication Patterns

| Pattern | Detection | Action |
| --- | --- | --- |
| Exact duplicate | Same fingerprint | Merge into existing ticket, add evidence |
| Near-duplicate | Same cluster (similarity > 0.75) | Link tickets, suggest merge for human review |
| Same root cause, different symptom | Same exception type + overlapping frames in different tests | Create parent ticket linking symptom tickets |
| Regression of fixed bug | Fingerprint matches closed ticket | Reopen ticket, flag as regression, increase priority |
| Flaky recurrence | Same fingerprint intermittently across CI runs | Tag as flaky, quarantine if rate > 10% |
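Most of these patterns reduce to a small decision function. The sketch below is one possible ordering, not a prescribed one. It assumes sets of fingerprints for open and closed tickets, an optional failure rate across recent CI runs, and hypothetical action names. The "same root cause, different symptom" pattern is left out because it needs frame-level comparison.

```python
def dedup_action(fp, open_fps, closed_fps, best_similarity=0.0, failure_rate=None):
    """Return a proposed action; every result still goes through human approval."""
    if fp in open_fps:
        # Same fingerprint failing only some of the time: treat as flaky recurrence.
        if failure_rate is not None and 0 < failure_rate < 1:
            return "quarantine" if failure_rate > 0.10 else "tag_flaky"
        return "merge_into_existing"
    if fp in closed_fps:
        return "reopen_as_regression"
    if best_similarity > 0.75:
        return "suggest_merge"
    return "create_ticket"


print(dedup_action("a1", {"a1"}, set()))                     # merge_into_existing
print(dedup_action("b2", set(), {"b2"}))                     # reopen_as_regression
print(dedup_action("c3", set(), set(), best_similarity=0.8))  # suggest_merge
```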

CI Failure Analysis

See references/ci-failure-analysis.md for comprehensive patterns. Key decision:

  • Consistent failure: test bug or app bug
  • Intermittent failure: flaky test or environment
  • Multiple failures at once: environment or shared component
  • Build failure: code or dependency issue


Integration Patterns

GitHub Issues

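# Check for an existing issue with the same fingerprint before creating a new one
gh issue list --label "fingerprint:a3f8b2c1e9d04567" --state all

# Create issue with labels from pipeline output (labels must already exist in the repo).
# The fingerprint label is what makes the duplicate check above work on the next run.
gh issue create \
  --title "[Checkout] Payment fails for multi-vendor carts" \
  --body-file ticket-body.md \
  --label "bug,severity:critical,component:checkout,fingerprint:a3f8b2c1e9d04567" \
  --assignee "@me"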

CI Pipeline Integration

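# GitHub Actions: run triage on test failure
# The job needs "permissions: issues: write"; both scripts are project-specific.
- name: Triage failures
  if: failure()
  env:
    GH_TOKEN: ${{ github.token }}  # gh CLI reads its token from GH_TOKEN
  run: |
    node scripts/extract-failures.js test-results/
    node scripts/triage-pipeline.js --input failures.json --output tickets/
    for ticket in tickets/*.json; do
      [ -e "$ticket" ] || continue  # skip when no tickets were produced
      gh issue create --title "$(jq -r .title "$ticket")" \
        --body "$(jq -r .body "$ticket")" \
        --label "$(jq -r '.labels | join(",")' "$ticket")"
    done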

For Jira, Linear, and Azure DevOps integration, use their respective REST/GraphQL APIs with the same ticket data generated by Step 6. The pipeline output is tracker-agnostic -- it produces title, description, labels, severity, and component that map to any tracker's fields.
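To illustrate the tracker-agnostic shape, here is a minimal sketch that maps one ticket record to two targets: an argument list for `gh issue create` and a request body in the shape accepted by Jira's REST API v2 create-issue endpoint. The `Ticket` class, function names, project key, and the "Bug" issue type are assumptions for illustration. Jira Cloud's v3 API expects the description in Atlassian Document Format rather than a plain string, so this body would need adapting there.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    """Hypothetical tracker-agnostic ticket produced by Step 6."""
    title: str
    body: str
    labels: list = field(default_factory=list)


def to_gh_args(ticket: Ticket) -> list:
    """Argument list for `gh issue create` (pass to subprocess.run)."""
    return ["gh", "issue", "create",
            "--title", ticket.title,
            "--body", ticket.body,
            "--label", ",".join(ticket.labels)]


def to_jira_fields(ticket: Ticket, project_key: str) -> dict:
    """JSON body in the shape used by Jira REST API v2 issue creation."""
    return {
        "fields": {
            "project": {"key": project_key},
            "summary": ticket.title,
            "description": ticket.body,
            "issuetype": {"name": "Bug"},  # assumption: the project has a "Bug" type
            # Jira labels cannot contain spaces.
            "labels": [label.replace(" ", "-") for label in ticket.labels],
        }
    }


ticket = Ticket(
    title="[Checkout] Payment fails for multi-vendor carts",
    body="TypeError in processOrder when a cart contains items from several vendors.",
    labels=["bug", "severity:critical", "component:checkout"],
)
print(to_gh_args(ticket))
print(to_jira_fields(ticket, "SHOP"))
```

Keeping the mapping in small pure functions makes each tracker's payload easy to test without network access.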


Anti-Patterns

1. Using LLM for Deduplication

LLMs are non-deterministic. The same two errors compared twice may get different similarity scores. Use deterministic fingerprinting for deduplication; use LLM only for explaining and classifying.

2. Auto-Closing Without Review

Automatically closing a ticket as "duplicate" based on fingerprint matching can merge distinct issues. Always require human confirmation for close/merge actions.

3. Over-Classifying Severity

If everything is "critical," nothing is. Follow the severity matrix strictly. A cosmetic typo is trivial even if it annoys someone.

4. Ignoring Environment Failures

Labeling all failures as "app bug" when many are CI infrastructure issues (Docker OOM, network timeout, disk full). Classify environment issues separately -- they need different remediation.

5. No Feedback Loop

Building the pipeline once and never measuring accuracy. Track: auto-classification accuracy, false duplicate rate, ticket quality ratings from developers.

6. Raw Logs in Tickets

Pasting 500 lines of raw CI output into a bug ticket. Normalize, extract relevant lines, and present the 5-10 lines that matter.

7. Fingerprinting Without Normalization

Hashing raw log lines produces unstable fingerprints that change every run. Normalization (Step 1) is mandatory before fingerprinting.

8. No Component Ownership Mapping

Classification without routing is useless. Maintain a component-to-team mapping so that classified bugs reach the right people.


Done When

  • Each triaged bug has severity, component, and root cause labels assigned
  • Duplicates merged or linked with references to the canonical ticket
  • CI failure analysis report generated summarizing failure categories and counts
  • Actionable tickets created for all P0 and P1 issues with assigned owners
  • Triage session findings summarized and shared with the team

Related Skills

  • qa-metrics — Track triage accuracy, duplicate rates, mean time to classification, and defect escape rates.
  • ci-cd-integration — Pipeline configuration for running triage on test failures, parallel execution, and reporting.
  • test-reliability — Flaky test classification, quarantine management, and root cause analysis.
  • qa-project-context — Project context that improves classification accuracy: component map, known issues, ownership.
  • ai-test-generation — Generate regression tests from triaged bug reports.

References

  • references/classification-taxonomy.md — Bug categories, severity definitions, component mapping rules, and root cause categories.
  • references/ci-failure-analysis.md — CI log parsing patterns, failure category decision tree, fingerprinting algorithm detail.