SZZ Bug-Introducing Commit Identifier

Overview

This skill applies the SZZ (Śliwerski-Zimmermann-Zeller) algorithm to identify bug-introducing commits in Git repositories. Given a bug-fixing commit, it takes the lines that the fix modified and traces them back through version history with git blame to find the candidate commits that last changed the buggy code.
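The core tracing step can be sketched as follows. This is a minimal illustration, not the code in scripts/szz_analyzer.py: the function names `removed_ranges` and `blame_commits` are hypothetical, and the sketch assumes the diff was produced with `git diff -U0 <fix>^ <fix>` so that hunks carry no context lines.

```python
import re
import subprocess

# Unified-diff hunk header: "@@ -old_start[,old_count] +new_start[,new_count] @@"
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+\d+(?:,\d+)? @@")
# Porcelain blame header line: "<commit hash> <orig line> <final line> [<count>]"
PORCELAIN_HEADER = re.compile(r"^([0-9a-f]{40,64}) \d+ \d+")


def removed_ranges(diff_text):
    """Map each file to the pre-fix line ranges that the fix deleted or changed."""
    ranges = {}
    current = None
    in_header = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            in_header, current = True, None
        elif in_header and line.startswith("--- "):
            path = line[4:]
            if path == "/dev/null":
                current = None  # file was added by the fix: nothing to blame
            else:
                current = path[2:] if path.startswith("a/") else path
        elif line.startswith("@@ "):
            in_header = False
            match = HUNK_RE.match(line)
            if current is None or match is None:
                continue
            start = int(match.group(1))
            count = int(match.group(2)) if match.group(2) is not None else 1
            if count > 0:  # count == 0 means the hunk only adds lines
                ranges.setdefault(current, []).append((start, start + count - 1))
    return ranges


def blame_commits(repo, fix_commit, path, start, end):
    """Return the commits that last changed lines start..end of path before the fix."""
    out = subprocess.run(
        ["git", "-C", repo, "blame", "--porcelain",
         "-L", f"{start},{end}", f"{fix_commit}^", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return {m.group(1) for line in out.splitlines()
            if (m := PORCELAIN_HEADER.match(line))}
```

Blaming the parent of the fix (`<fix>^`) rather than the fix itself matters here: the fix has already rewritten the lines in question, so only the parent still attributes them to the commits that wrote them.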

Workflow

1. Identify the Bug-Fixing Commit

Start by identifying the commit that fixes the bug. This can be obtained from:

  • Commit hash provided by the user
  • Issue tracker references (e.g., "fixes #123")
  • Commit message analysis (e.g., "fix:", "bug:")
  • Manual identification by the user
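When no hash is given, commit messages can be screened for the conventions listed above. The following is a rough sketch under stated assumptions: the patterns and the name `looks_like_fix` are illustrative, and real projects may follow different conventions.

```python
import re

# Illustrative patterns only; adjust them to the project's own conventions.
FIX_PATTERNS = [
    # Issue references such as "fixes #123" or "Closes #45"
    re.compile(r"\b(?:fix(?:es|ed)?|close[sd]?|resolve[sd]?)\s+#\d+", re.IGNORECASE),
    # Subject prefixes such as "fix:", "bug:", or "fix(auth):"
    re.compile(r"^(?:fix|bug)(?:\([^)]*\))?:", re.IGNORECASE),
]


def looks_like_fix(message):
    """Return True if the commit message resembles a bug-fix commit."""
    text = message.strip()
    return any(pattern.search(text) for pattern in FIX_PATTERNS)
```

A match is only a hint. A commit flagged this way still needs confirmation that it actually fixes the bug in question.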

2. Run the SZZ Analysis

Use the provided script to perform the analysis:

python scripts/szz_analyzer.py <fix-commit-hash>

Options:

  • --repo <path>: Specify repository path (default: current directory)
  • --json: Output results in JSON format for programmatic processing
  • --top <n>: Number of top candidates to show (default: 10)

Example:

python scripts/szz_analyzer.py abc123def --repo /path/to/repo --top 5

3. Interpret Results

The script outputs a ranked list of candidate bug-introducing commits with:

  • Commit hash: The candidate commit identifier
  • Author: Who made the commit
  • Date: When the commit was made
  • Message: The commit message
  • Confidence score: Likelihood this commit introduced the bug (0.0-1.0)
  • Reasons: Explanation for why this commit is a candidate

4. Manual Verification

Always manually review the top candidates:

  1. Examine the actual code changes in the candidate commit
  2. Check if the changes are functionally related to the bug
  3. Consider the context and purpose of the changes
  4. Verify against issue tracker history if available
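For the first review step, two standard git commands are usually enough to get started. The helper below is a small sketch: `review_commands` is a hypothetical name, and it only builds the argument lists so they can be inspected or passed to `subprocess.run`.

```python
def review_commands(candidate, fix_commit, path):
    """Build git commands (as argv lists) for inspecting one candidate commit."""
    return [
        # Full patch plus a per-file summary of what the candidate changed.
        ["git", "show", "--stat", "--patch", candidate],
        # Commits that touched the file between the candidate and the fix.
        ["git", "log", "--oneline", f"{candidate}..{fix_commit}", "--", path],
    ]
```

The second command helps with the context check: if many commits touched the file between the candidate and the fix, the buggy behavior may have emerged later than the candidate.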

Understanding Confidence Scores

High Confidence (0.8-1.0):

  • Multiple lines from the commit were fixed
  • Commit message doesn't suggest refactoring
  • Functional code changes (not just formatting)

Medium Confidence (0.5-0.8):

  • Single line modified, or
  • Some indicators of refactoring but functional changes present

Low Confidence (0.0-0.5):

  • Commit message suggests refactoring/formatting
  • Only structural changes (imports, comments, whitespace)
  • Likely a false positive
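The three bands can be expressed as a small lookup. The ranges above share their endpoints, so this sketch assumes a score of exactly 0.8 counts as high and exactly 0.5 as medium; the script itself may draw the boundaries differently.

```python
def confidence_band(score):
    """Map a confidence score in [0.0, 1.0] to 'high', 'medium', or 'low'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= 0.8:  # assumption: the shared boundary 0.8 belongs to 'high'
        return "high"
    if score >= 0.5:  # assumption: the shared boundary 0.5 belongs to 'medium'
        return "medium"
    return "low"
```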

False Positive Filtering

The script automatically filters common false positives:

Automatically Filtered Lines:

  • Empty lines and whitespace-only changes
  • Comment additions/modifications
  • Import/include statements
  • Braces and structural elements

Reduced Confidence for:

  • Commits with refactoring keywords in messages
  • Single-line changes
  • Formatting-related commits
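A filter of this kind might look like the sketch below. It is an approximation for illustration, not the script's actual rules: the keyword list, the comment and import prefixes, and both function names are assumptions.

```python
import re

# Hypothetical keyword list; the script's own list may differ.
REFACTOR_KEYWORDS = ("refactor", "rename", "reformat", "formatting",
                     "cleanup", "whitespace", "style")


def is_trivial_line(line):
    """Return True for lines that are unlikely to carry a bug on their own."""
    stripped = line.strip()
    if not stripped:
        return True  # empty or whitespace-only
    if stripped.startswith(("#", "//", "/*")):
        return True  # comments (also catches C-style #include lines)
    if re.match(r"(?:import|from|using)\b", stripped):
        return True  # import-style statements
    if re.fullmatch(r"[{}()\[\];,]+", stripped):
        return True  # braces and other purely structural tokens
    return False


def message_suggests_refactoring(message):
    """Return True if the commit message contains a refactoring keyword."""
    lowered = message.lower()
    return any(keyword in lowered for keyword in REFACTOR_KEYWORDS)
```

Prefix checks like these are deliberately crude. They will misclassify some lines, for example a string literal that begins with `#`, which is one reason manual review of the top candidates remains necessary.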

Common Use Cases

Use Case 1: Bug Root Cause Analysis

User: "Find which commit introduced the bug fixed in commit abc123"
→ Run: python scripts/szz_analyzer.py abc123
→ Review top candidates and examine their changes

Use Case 2: Developer Accountability

User: "Who introduced the authentication bug?"
→ First identify the fix commit
→ Run SZZ analysis
→ Check the author field of top candidates

Use Case 3: Bug Pattern Analysis

User: "Analyze all bug-introducing commits from the last release"
→ Identify all bug-fix commits
→ Run SZZ analysis on each
→ Aggregate results to find patterns
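One simple aggregation is to count how many distinct fixes point back to the same candidate commit. The sketch below assumes the per-fix JSON results have already been loaded into a dict keyed by fix hash; `recurring_candidates` and the `min_confidence` threshold are hypothetical, while the `commit_hash` and `confidence_score` keys follow the JSON fields used under Programmatic Integration.

```python
from collections import Counter


def recurring_candidates(results, min_confidence=0.5):
    """Count, per candidate commit, how many fixes blame it.

    results maps each fix commit hash to its list of candidate dicts.
    """
    counts = Counter()
    for candidates in results.values():
        # Use a set so one fix counts a given candidate at most once.
        blamed = {c["commit_hash"] for c in candidates
                  if c["confidence_score"] >= min_confidence}
        counts.update(blamed)
    return counts.most_common()
```

Commits that several independent fixes trace back to are natural starting points when looking for patterns.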

Use Case 4: Empirical Software Engineering Research

User: "Generate dataset of bug-introducing commits for analysis"
→ Run SZZ analysis with --json flag
→ Process JSON output for statistical analysis

Limitations and Considerations

  1. Tangled Changes: If a commit mixes bug-introducing code with unrelated changes, the entire commit is flagged

  2. Refactoring Breaks Chains: Heavy refactoring can make it difficult to trace back to the original introduction

  3. Indirect Bugs: Bugs caused by missing code or incorrect assumptions may not be detected, because a fix that only adds lines leaves no pre-existing lines to blame

  4. Multi-Commit Bugs: When a bug is introduced across multiple commits, the analysis may identify only the most recent contributor

  5. False Fixes: If the "fix" commit doesn't actually fix the bug, the analysis will be incorrect

Advanced Usage

Programmatic Integration

Use JSON output for integration with other tools:

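import json
import subprocess
import sys

# Run the analyzer with the current interpreter; check=True raises
# CalledProcessError if the script exits with a non-zero status.
result = subprocess.run(
    [sys.executable, 'scripts/szz_analyzer.py', 'abc123', '--json'],
    capture_output=True,
    text=True,
    check=True,
)
# The --json output is expected to be a list of candidate objects.
candidates = json.loads(result.stdout)

for candidate in candidates:
    print(f"{candidate['commit_hash']}: {candidate['confidence_score']}")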

Batch Analysis

Analyze multiple bug fixes:

# Analyze every commit whose message contains "fix:".
for commit in $(git log --grep="fix:" --format="%H"); do
    echo "Analyzing fix: $commit"
    python scripts/szz_analyzer.py "$commit" --top 3
done

Resources

scripts/szz_analyzer.py

The main analysis script, which implements the SZZ algorithm. It:

  • Extracts modified lines from bug-fixing commits
  • Uses git blame to trace lines back through history
  • Applies filtering heuristics to reduce false positives
  • Ranks candidates by confidence score

references/szz_algorithm.md

Comprehensive documentation on the SZZ algorithm including:

  • Detailed algorithm steps and theory
  • False positive patterns and filtering strategies
  • Confidence scoring methodology
  • Limitations and best practices
  • Algorithm variants and extensions

Read this reference when you need a deeper understanding of the algorithm, want to customize the filtering heuristics, or need to explain the methodology to users.
