code-clone-assistant
Code Clone Assistant
Detect code clones and guide refactoring using PMD CPD (exact duplicates) + Semgrep (patterns).
Tools
- PMD CPD v7.17.0+: Exact duplicate detection
- Semgrep v1.140.0+: Pattern-based detection
Tested: October 2025 - 30 violations detected across 3 sample files
Coverage: ~3x more violations than using either tool alone
When to Use This Skill
Use this skill when:
- Finding duplicate code in a codebase
- Detecting DRY violations
- Refactoring similar code patterns
- Identifying copy-paste code
Why Two Tools?
PMD CPD and Semgrep detect different clone types:
| Aspect | PMD CPD | Semgrep |
|---|---|---|
| Detects | Exact copy-paste duplicates | Similar patterns with variations |
| Scope | Across files ✅ | Within a file; cross-file analysis requires Semgrep Pro |
| Matching | Token-based (ignores formatting) | Pattern-based (AST matching) |
| Rules | ❌ No custom rules | ✅ Custom rules |
Result: In the October 2025 test, using both tools found ~3x more DRY violations than either tool alone.
Clone Types
| Type | Description | PMD CPD | Semgrep |
|---|---|---|---|
| Type-1 | Exact copies | ✅ Default | ✅ |
| Type-2 | Renamed identifiers | ✅ --ignore-* | ✅ |
| Type-3 | Near-miss with variations | ⚠️ Partial | ✅ Patterns |
| Type-4 | Semantic clones (same behavior) | ❌ | ❌ |
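To make the clone types in the table concrete, here is a small Python sketch with one function and a Type-2, Type-3, and Type-4 variant of it. All function names and bodies are invented for illustration and do not come from the tested sample files. A Type-1 clone would simply be a verbatim copy of `total_price`, differing at most in whitespace or comments.

```python
# Hypothetical functions illustrating clone types; every name here is invented.

def total_price(items):
    """Original function."""
    total = 0
    for item in items:
        total += item["price"] * item["qty"]
    return total


def order_sum(rows):
    """Type-2 clone: same structure, identifiers renamed."""
    acc = 0
    for row in rows:
        acc += row["price"] * row["qty"]
    return acc


def total_price_skip_empty(items):
    """Type-3 clone: near-miss copy with one added statement."""
    total = 0
    for item in items:
        if item["qty"] == 0:  # added relative to the original
            continue
        total += item["price"] * item["qty"]
    return total


def total_price_functional(items):
    """Type-4 clone: same behavior, different implementation."""
    return sum(item["price"] * item["qty"] for item in items)
```

Per the table, a token-based tool would be expected to catch `order_sum` only when identifiers are ignored, while `total_price_functional` is out of reach for both tools.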
Quick Start Workflow
# Step 1: Detect exact duplicates (PMD CPD)
pmd cpd -d . -l python --minimum-tokens 20 -f markdown > pmd-results.md
# Step 2: Detect pattern violations (Semgrep)
semgrep --config=clone-rules.yaml --sarif --quiet > semgrep-results.sarif
# Step 3: Analyze combined results (Claude Code)
# Parse both outputs, prioritize by severity
# Step 4: Refactor (Claude Code with user approval)
# Extract shared functions, consolidate patterns, verify tests
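Step 3 can be sketched in Python for the Semgrep half of the results. This is a minimal sketch, not the skill's actual parser: it assumes the standard SARIF 2.1.0 layout (`runs[].results[]` with `ruleId`, `level`, `message.text`, and `locations[].physicalLocation`), and the function names `load_sarif_findings` and `count_by_rule` are hypothetical. Parsing the PMD CPD markdown report is not shown.

```python
import json
from collections import Counter

# SARIF result levels ordered from most to least severe.
LEVEL_ORDER = {"error": 0, "warning": 1, "note": 2, "none": 3}


def load_sarif_findings(sarif_text):
    """Flatten SARIF results into a list of dicts sorted by severity."""
    sarif = json.loads(sarif_text)
    findings = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            locations = result.get("locations") or [{}]
            physical = locations[0].get("physicalLocation", {})
            findings.append({
                "rule": result.get("ruleId", "unknown"),
                # Simplification: a result without a level is treated as "warning".
                "level": result.get("level", "warning"),
                "file": physical.get("artifactLocation", {}).get("uri", "?"),
                "line": physical.get("region", {}).get("startLine", 0),
                "message": result.get("message", {}).get("text", ""),
            })
    # Most severe first, then by file and line for stable output.
    findings.sort(
        key=lambda f: (LEVEL_ORDER.get(f["level"], 99), f["file"], f["line"])
    )
    return findings


def count_by_rule(findings):
    """Count findings per rule ID to show which patterns repeat most."""
    return Counter(f["rule"] for f in findings)
```

Typical use would be `load_sarif_findings(open("semgrep-results.sarif").read())`, followed by `count_by_rule(...)` to see which rules fire most often.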
Accepted Exceptions (Known Intentional Duplication)
Not all code duplication is a problem. Some codebases deliberately use copy-and-adapt patterns where refactoring would be harmful. When running clone detection, always check for accepted exceptions before recommending refactoring.
When Duplication Is Acceptable
| Pattern | Why Acceptable | Example |
|---|---|---|
| Generation-per-directory experiments | Each generation is an immutable, self-contained experiment. Sharing code across generations would break provenance and make past experiments non-reproducible. | SQL templates, sweep scripts where each gen{NNN}/ is independent |
| SQL templates with placeholder substitution | SQL has no import/include mechanism. Templates use sed placeholder replacement (__PLACEHOLDER__), not function calls. Extracting shared CTEs into separate files would break the single-file execution model. | ClickHouse sweep templates sharing signal detection + metrics CTEs |
| Protocol/schema boilerplate | Serialization formats, API contracts, and wire protocols require exact structure in each location. Abstracting them hides the contract. | NDJSON telemetry line construction in wrapper scripts |
| Test fixtures and golden files | Test data intentionally duplicates production patterns to verify behavior. Sharing fixtures creates brittle cross-test dependencies. | Test setup code, expected output snapshots |
How to Report Accepted Exceptions
When clone detection finds duplication that matches an accepted exception pattern:
- Report it — always show the user what was found (lines, tokens, files)
- Flag as accepted — explicitly state it matches a known exception pattern
- Explain why — cite the specific reason refactoring is not recommended
- Do NOT recommend refactoring — this is the key difference from actionable findings
Example output format:
Code Clone Analysis Results
PMD CPD Findings:
Clone 1: 115 lines (575 tokens) — base_bars → signals CTEs
gen610_template.sql:33 ↔ gen710_template.sql:38
Status: ACCEPTED EXCEPTION (generation-per-directory experiment)
Reason: Each generation is immutable. Shared CTEs would break
experiment provenance and reproducibility.
Clone 2: 36 lines (478 tokens) — metrics aggregation
gen610_template.sql:207 ↔ gen710_template.sql:244
Status: ACCEPTED EXCEPTION (SQL template without include mechanism)
Actionable Findings: 0
Accepted Exceptions: 2
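The reporting rules above can be sketched as a small renderer. This is an illustrative sketch only: the `CloneFinding` dataclass, its field names, and the `ACTIONABLE` status label are assumptions made for this example, and the output uses plain ASCII separators rather than the exact characters shown in the sample report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CloneFinding:
    """One clone pair (hypothetical structure for this sketch)."""
    description: str
    lines: int
    tokens: int
    location_a: str
    location_b: str
    exception: Optional[str] = None  # matched exception pattern, if any
    reason: Optional[str] = None     # why refactoring is not recommended


def render_report(findings):
    """Always report every clone; flag accepted exceptions instead of hiding them."""
    out = ["PMD CPD Findings:"]
    for i, f in enumerate(findings, start=1):
        out.append(f"Clone {i}: {f.lines} lines ({f.tokens} tokens) - {f.description}")
        out.append(f"  {f.location_a} <-> {f.location_b}")
        if f.exception:
            out.append(f"  Status: ACCEPTED EXCEPTION ({f.exception})")
            if f.reason:
                out.append(f"  Reason: {f.reason}")
        else:
            # Assumed label; only these findings get a refactoring recommendation.
            out.append("  Status: ACTIONABLE")
    accepted = sum(1 for f in findings if f.exception)
    out.append(f"Actionable Findings: {len(findings) - accepted}")
    out.append(f"Accepted Exceptions: {accepted}")
    return "\n".join(out)
```

Keeping accepted exceptions in the same list as actionable findings means the user still sees every clone, while the summary counts make clear which ones call for refactoring.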
Project-Level Exception Configuration
Projects can declare accepted exception patterns in their CLAUDE.md:
## Code Clone Exceptions
- `sql/gen*_template.sql` — generation-per-directory experiments (immutable)
- `scripts/gen*/` — copy-and-adapt sweep scripts (no shared infrastructure)
- `tests/fixtures/` — intentional duplication for test isolation
When this section exists in a project's CLAUDE.md, the code-clone-assistant should check it before classifying findings.
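One possible way to perform that check is sketched below. It assumes the section uses the bullet format shown above (a backticked glob, a dash, then a reason) and that glob matching with Python's `fnmatch` is an acceptable approximation. The function names `parse_exceptions` and `match_exception` are hypothetical, as is the rule that a trailing `/` covers everything beneath that directory.

```python
import fnmatch
import re

# Bullet of the form: - `glob` <dash> reason  (the dash may be an em dash or a hyphen).
_BULLET = re.compile(r"^-\s+`([^`]+)`\s*(?:\u2014|-)?\s*(.*)$")


def parse_exceptions(claude_md_text, heading="## Code Clone Exceptions"):
    """Return (glob, reason) pairs listed under the exceptions heading."""
    patterns = []
    in_section = False
    for raw in claude_md_text.splitlines():
        line = raw.strip()
        if line == heading:
            in_section = True
            continue
        if in_section and line.startswith("#"):
            break  # the next heading ends the section
        if in_section:
            m = _BULLET.match(line)
            if m:
                patterns.append((m.group(1), m.group(2)))
    return patterns


def match_exception(path, patterns):
    """Return the first (glob, reason) matching path, or None."""
    for glob, reason in patterns:
        # Assumption: a trailing slash means "everything under this directory".
        effective = glob + "*" if glob.endswith("/") else glob
        if fnmatch.fnmatchcase(path, effective):
            return glob, reason
    return None
```

A finding whose files both match a declared pattern could then be reported as an accepted exception, with the parsed reason cited in the output.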
Reference Documentation
For detailed information, see:
- Detection Commands - PMD CPD and Semgrep command details
- Complete Workflow - Detection, analysis, and presentation phases
- Refactoring Strategies - Approaches for addressing violations
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| PMD CPD not found | Not installed or not in PATH | brew install pmd or download from PMD releases |
| Semgrep timeout | Large codebase scan | Use --exclude to limit scope |
| No duplicates detected | minimum-tokens too high | Lower --minimum-tokens value (try 15) |
| Too many false positives | minimum-tokens too low | Increase --minimum-tokens (try 30+) |
| Language not recognized | Wrong -l flag | Check PMD CPD supported languages list |
| SARIF parse error | Semgrep output malformed | Upgrade Semgrep to latest version |
| Memory error on large repo | Java heap too small | Set PMD_JAVA_OPTS=-Xmx4g |
| Missing clone rules file | Custom rules not created | Create clone-rules.yaml or use default config |