szz-bug-introducing-commit-identifier
SZZ Bug-Introducing Commit Identifier
Overview
This skill performs SZZ (Śliwerski-Zimmermann-Zeller) algorithm analysis to identify bug-introducing commits in git repositories. Given a bug-fixing commit, it traces modified lines back through version history using git blame to find candidate commits that originally introduced the buggy code.
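As a rough illustration of the first half of that tracing step, the sketch below extracts the pre-fix line numbers that a fix removed or changed from a unified diff. It is not the code in scripts/szz_analyzer.py; the function name `removed_line_numbers` is hypothetical, and the sketch assumes the diff text covers a single file.

```python
import re

# Matches a unified-diff hunk header such as "@@ -10,3 +10,4 @@".
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def removed_line_numbers(diff_text):
    """Return old-file line numbers removed by a single-file unified diff."""
    removed = []
    old_line = None  # current line number in the pre-fix version of the file
    for line in diff_text.splitlines():
        match = HUNK_HEADER.match(line)
        if match:
            old_line = int(match.group(1))
            continue
        if old_line is None:
            continue  # still in the "---"/"+++" file header
        if line.startswith("-"):
            removed.append(old_line)
            old_line += 1
        elif line.startswith("+") or line.startswith("\\"):
            continue  # added lines and "\ No newline" markers are not in the old file
        else:
            old_line += 1  # context line present in both versions
    return removed
```

Each returned line number could then be passed to `git blame -L <n>,<n> <fix-commit>^ -- <path>` to find the commit that last touched that line before the fix.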
Workflow
1. Identify the Bug-Fixing Commit
Start by identifying the commit that fixes the bug. This can be obtained from:
- Commit hash provided by the user
- Issue tracker references (e.g., "fixes #123")
- Commit message analysis (e.g., "fix:", "bug:")
- Manual identification by the user
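Commit message analysis can be approximated with a small pattern match. The sketch below is only an assumption about what such a heuristic might look like: the name `looks_like_fix` is hypothetical, and the keywords "closes" and "resolves" are added for illustration beyond the "fix:", "bug:", and "fixes #123" forms mentioned above.

```python
import re

# Hypothetical heuristic, not taken from scripts/szz_analyzer.py.
FIX_PATTERN = re.compile(
    r"^(fix|bug)(\([^)]*\))?:"               # prefixes such as "fix:" or "fix(auth):"
    r"|\b(fixes|closes|resolves)\s+#\d+",    # issue references such as "fixes #123"
    re.IGNORECASE,
)


def looks_like_fix(message):
    """Return True if a commit message resembles a bug-fix commit."""
    return bool(FIX_PATTERN.search(message))
```

A match is only a hint; messages such as "fix: typo in README" would also match, so the result still needs a human check before running the analysis.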
2. Run the SZZ Analysis
Use the provided script to perform the analysis:
python scripts/szz_analyzer.py <fix-commit-hash>
Options:
- --repo <path>: Specify repository path (default: current directory)
- --json: Output results in JSON format for programmatic processing
- --top <n>: Number of top candidates to show (default: 10)
Example:
python scripts/szz_analyzer.py abc123def --repo /path/to/repo --top 5
3. Interpret Results
The script outputs a ranked list of candidate bug-introducing commits with:
- Commit hash: The candidate commit identifier
- Author: Who made the commit
- Date: When the commit was made
- Message: The commit message
- Confidence score: Likelihood this commit introduced the bug (0.0-1.0)
- Reasons: Explanation for why this commit is a candidate
4. Manual Verification
Always manually review the top candidates:
- Examine the actual code changes in the candidate commit
- Check if the changes are functionally related to the bug
- Consider the context and purpose of the changes
- Verify against issue tracker history if available
Understanding Confidence Scores
High Confidence (0.8-1.0):
- Multiple lines from the commit were fixed
- Commit message doesn't suggest refactoring
- Functional code changes (not just formatting)
Medium Confidence (0.5-0.8):
- Single line modified, or
- Some indicators of refactoring but functional changes present
Low Confidence (0.0-0.5):
- Commit message suggests refactoring/formatting
- Only structural changes (imports, comments, whitespace)
- Likely a false positive
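When post-processing scores, the bands above can be mapped to labels with a small helper. The sketch below is illustrative; the function name `confidence_band` is hypothetical, and because the listed ranges share their endpoints, assigning 0.8 to "high" and 0.5 to "medium" is an assumption rather than documented behavior of the script.

```python
def confidence_band(score):
    """Map a confidence score in [0.0, 1.0] to 'high', 'medium', or 'low'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= 0.8:   # assumption: the boundary value 0.8 counts as high
        return "high"
    if score >= 0.5:   # assumption: the boundary value 0.5 counts as medium
        return "medium"
    return "low"
```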
False Positive Filtering
The script automatically filters common false positives:
Automatically Filtered Lines:
- Empty lines and whitespace-only changes
- Comment additions/modifications
- Import/include statements
- Braces and structural elements
Reduced Confidence for:
- Commits with refactoring keywords in messages
- Single-line changes
- Formatting-related commits
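To make the line-level filter concrete, here is a minimal, language-agnostic sketch of such a check. It is not the script's actual implementation: the name `is_ignorable` and the specific patterns are assumptions chosen for illustration.

```python
import re

# Hypothetical patterns; a real filter would likely be language-aware.
COMMENT_PREFIXES = ("#", "//", "/*", "*/")
IMPORT_PATTERN = re.compile(r"^(import\s|from\s+\S+\s+import\s|#include\b)")
STRUCTURAL_PATTERN = re.compile(r"^[{}()\[\];,]+$")


def is_ignorable(line):
    """Return True if a changed line is unlikely to carry bug-relevant logic."""
    stripped = line.strip()
    if not stripped:
        return True  # empty or whitespace-only
    if IMPORT_PATTERN.match(stripped):
        return True  # import/include statement
    if stripped.startswith(COMMENT_PREFIXES):
        return True  # comment line (also catches C preprocessor lines)
    if STRUCTURAL_PATTERN.match(stripped):
        return True  # lone braces, brackets, or punctuation
    return False
```

A prefix check like this is deliberately coarse. For example, it treats any line starting with "#" as a comment, which is wrong for C preprocessor directives such as `#define`.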
Common Use Cases
Use Case 1: Bug Root Cause Analysis
User: "Find which commit introduced the bug fixed in commit abc123"
→ Run: python scripts/szz_analyzer.py abc123
→ Review top candidates and examine their changes
Use Case 2: Developer Accountability
User: "Who introduced the authentication bug?"
→ First identify the fix commit
→ Run SZZ analysis
→ Check the author field of top candidates
Use Case 3: Bug Pattern Analysis
User: "Analyze all bug-introducing commits from the last release"
→ Identify all bug-fix commits
→ Run SZZ analysis on each
→ Aggregate results to find patterns
Use Case 4: Empirical Software Engineering Research
User: "Generate dataset of bug-introducing commits for analysis"
→ Run SZZ analysis with --json flag
→ Process JSON output for statistical analysis
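For the aggregation steps in Use Cases 3 and 4, one possible approach is to count how often the same candidate commit appears across several fixes. The sketch below assumes the `commit_hash` and `confidence_score` keys used elsewhere in this document; the function name `count_recurring_candidates` and the 0.5 threshold are hypothetical.

```python
from collections import Counter


def count_recurring_candidates(results_by_fix, min_confidence=0.5):
    """Count candidate commits that appear across multiple fix analyses.

    results_by_fix maps a fix commit hash to the list of candidate dicts
    parsed from that fix's JSON output.
    """
    counts = Counter()
    for candidates in results_by_fix.values():
        for candidate in candidates:
            if candidate["confidence_score"] >= min_confidence:
                counts[candidate["commit_hash"]] += 1
    # Most frequently implicated commits first.
    return counts.most_common()
```

Commits that rank high here were implicated by more than one fix, which may point to a change that deserves closer review.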
Limitations and Considerations
- Tangled Changes: If a commit mixes bug-introducing code with unrelated changes, the entire commit is flagged
- Refactoring Breaks Chains: Heavy refactoring can make it difficult to trace back to the original introduction
- Indirect Bugs: Bugs caused by missing code or incorrect assumptions may not be detected
- Multi-Commit Bugs: For bugs introduced across multiple commits, the analysis may identify only the most recent contributor
- False Fixes: If the "fix" commit doesn't actually fix the bug, the analysis will be incorrect
Advanced Usage
Programmatic Integration
Use JSON output for integration with other tools:
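import json
import subprocess

# Run the analyzer and raise CalledProcessError if it exits with a non-zero status.
result = subprocess.run(
    ['python', 'scripts/szz_analyzer.py', 'abc123', '--json'],
    capture_output=True,
    text=True,
    check=True,
)

# Parse the JSON output and print each candidate with its score.
candidates = json.loads(result.stdout)
for candidate in candidates:
    print(f"{candidate['commit_hash']}: {candidate['confidence_score']}")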
Batch Analysis
Analyze multiple bug fixes:
for commit in $(git log --grep="fix:" --format="%H"); do
    echo "Analyzing fix: $commit"
    python scripts/szz_analyzer.py "$commit" --top 3
done
Resources
scripts/szz_analyzer.py
The main analysis script, which implements the SZZ algorithm. It:
- Extracts modified lines from bug-fixing commits
- Uses git blame to trace lines back through history
- Applies filtering heuristics to reduce false positives
- Ranks candidates by confidence score
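The blame step listed above can be illustrated by tallying commits from `git blame --porcelain` output, where each blamed line is preceded by a header of the form `<sha> <orig-line> <final-line> [<count>]` and the file content is prefixed with a tab. This is a sketch under that assumption, not the script's own parser; the name `blamed_commits` is hypothetical.

```python
import re
from collections import Counter

# Header line for one blamed line; SHA-1 hashes are 40 hex digits, SHA-256 are 64.
BLAME_HEADER = re.compile(r"^([0-9a-f]{40,64}) \d+ \d+(?: \d+)?$")


def blamed_commits(porcelain_text):
    """Count how many blamed lines each commit is responsible for."""
    counts = Counter()
    for line in porcelain_text.splitlines():
        if line.startswith("\t"):
            continue  # file content line, not metadata
        match = BLAME_HEADER.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Commits that own more of the fixed lines would be natural candidates for a higher rank, although how the script actually weights them is not specified here.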
references/szz_algorithm.md
Comprehensive documentation on the SZZ algorithm including:
- Detailed algorithm steps and theory
- False positive patterns and filtering strategies
- Confidence scoring methodology
- Limitations and best practices
- Algorithm variants and extensions
Read this reference when you need deeper understanding of the algorithm, want to customize filtering heuristics, or need to explain the methodology to users.