Code Clone Assistant

Detect code clones and guide refactoring using PMD CPD (exact duplicates) + Semgrep (patterns).

Tools

  • PMD CPD v7.17.0+: Exact duplicate detection
  • Semgrep v1.140.0+: Pattern-based detection

Tested: October 2025 - 30 violations detected across 3 sample files

Coverage: ~3x more violations than using either tool alone


When to Use This Skill

Use this skill when:

  • Finding duplicate code in a codebase
  • Detecting DRY violations
  • Refactoring similar code patterns
  • Identifying copy-paste code

Why Two Tools?

PMD CPD and Semgrep detect different clone types:

| Aspect | PMD CPD | Semgrep |
|---|---|---|
| Detects | Exact copy-paste duplicates | Similar patterns with variations |
| Scope | Across files ✅ | Within/across files (Pro only) |
| Matching | Token-based (ignores formatting) | Pattern-based (AST matching) |
| Rules | ❌ No custom rules | ✅ Custom rules |

Result: In the October 2025 test, using both tools together found ~3x more DRY violations than either tool alone.

Clone Types

| Type | Description | PMD CPD | Semgrep |
|---|---|---|---|
| Type-1 | Exact copies | ✅ Default | |
| Type-2 | Renamed identifiers | --ignore-* | |
| Type-3 | Near-miss with variations | ⚠️ Partial | ✅ Patterns |
| Type-4 | Semantic clones (same behavior) | | |
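
To make the difference between Type-1 and Type-2 clones concrete, here is a toy sketch of token-based comparison using Python's standard `tokenize` module. It is only an illustration of the idea, not how PMD CPD is implemented; the function name `token_signature` and the three sample snippets are hypothetical.

```python
import io
import keyword
import tokenize

# Token kinds that carry layout or comments rather than program content.
_LAYOUT = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
           tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}

def token_signature(source: str, ignore_identifiers: bool = False) -> list[str]:
    """Return a formatting-insensitive token list for a Python snippet."""
    signature = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _LAYOUT:
            continue  # whitespace, comments, and indentation do not count
        if (ignore_identifiers and tok.type == tokenize.NAME
                and not keyword.iskeyword(tok.string)):
            signature.append("ID")  # collapse every identifier to one placeholder
        else:
            signature.append(tok.string)
    return signature

# Hypothetical snippets: the same function, reformatted, then renamed.
original = "def total(prices):\n    return sum(prices) * 1.2\n"
reformatted = "def total( prices ):\n    # add tax\n    return sum(prices)*1.2\n"
renamed = "def gross(costs):\n    return sum(costs) * 1.2\n"

# Type-1: identical once formatting and comments are ignored.
print(token_signature(original) == token_signature(reformatted))
# Renamed identifiers break an exact token match...
print(token_signature(original) == token_signature(renamed))
# ...but match again when identifiers are normalized (Type-2).
print(token_signature(original, True) == token_signature(renamed, True))
```

Type-3 and Type-4 clones would not be caught by this kind of comparison, since inserted statements or a different algorithm change the token sequence itself.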

Quick Start Workflow

# Step 1: Detect exact duplicates (PMD CPD)
pmd cpd -d . -l python --minimum-tokens 20 -f markdown > pmd-results.md

# Step 2: Detect pattern violations (Semgrep)
semgrep --config=clone-rules.yaml --sarif --quiet > semgrep-results.sarif

# Step 3: Analyze combined results (Claude Code)
# Parse both outputs, prioritize by severity

# Step 4: Refactor (Claude Code with user approval)
# Extract shared functions, consolidate patterns, verify tests
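
Step 3 could look roughly like the following sketch, which flattens the Semgrep SARIF file into a list sorted by severity. It relies only on standard SARIF 2.1.0 fields (`runs`, `results`, `ruleId`, `level`, `locations`). The function name `load_sarif_findings`, the rule IDs, and the file paths in the sample are hypothetical. When a result carries no `level`, the sketch falls back to `warning`. Parsing the PMD CPD markdown report is not shown.

```python
import json

# SARIF "level" values ordered from most to least severe.
LEVEL_RANK = {"error": 0, "warning": 1, "note": 2, "none": 3}

def load_sarif_findings(sarif_text: str) -> list[dict]:
    """Flatten SARIF results into dicts sorted by severity, then location."""
    findings = []
    for run in json.loads(sarif_text).get("runs", []):
        for result in run.get("results", []):
            locations = result.get("locations") or [{}]
            physical = locations[0].get("physicalLocation", {})
            findings.append({
                "rule": result.get("ruleId", "unknown"),
                "level": result.get("level", "warning"),
                "file": physical.get("artifactLocation", {}).get("uri", "?"),
                "line": physical.get("region", {}).get("startLine", 0),
                "message": result.get("message", {}).get("text", ""),
            })
    findings.sort(key=lambda f: (LEVEL_RANK.get(f["level"], 99), f["file"], f["line"]))
    return findings

# Hypothetical sample standing in for the contents of semgrep-results.sarif.
sample = json.dumps({
    "version": "2.1.0",
    "runs": [{"results": [
        {"ruleId": "dup-validation", "level": "warning",
         "message": {"text": "Repeated validation block"},
         "locations": [{"physicalLocation": {
             "artifactLocation": {"uri": "app/forms.py"},
             "region": {"startLine": 42}}}]},
        {"ruleId": "dup-error-handling", "level": "error",
         "message": {"text": "Repeated try/except block"},
         "locations": [{"physicalLocation": {
             "artifactLocation": {"uri": "app/api.py"},
             "region": {"startLine": 10}}}]},
    ]}],
})

for f in load_sarif_findings(sample):
    print(f"{f['level']:8} {f['file']}:{f['line']}  {f['rule']}")
```

In practice the text would come from reading `semgrep-results.sarif` rather than from an inline sample.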


Accepted Exceptions (Known Intentional Duplication)

Not all code duplication is a problem. Some codebases deliberately use copy-and-adapt patterns where refactoring would be harmful. When running clone detection, always check for accepted exceptions before recommending refactoring.

When Duplication Is Acceptable

| Pattern | Why Acceptable | Example |
|---|---|---|
| Generation-per-directory experiments | Each generation is an immutable, self-contained experiment. Sharing code across generations would break provenance and make past experiments non-reproducible. | SQL templates, sweep scripts where each gen{NNN}/ is independent |
| SQL templates with placeholder substitution | SQL has no import/include mechanism. Templates use sed placeholder replacement (__PLACEHOLDER__), not function calls. Extracting shared CTEs into separate files would break the single-file execution model. | ClickHouse sweep templates sharing signal detection + metrics CTEs |
| Protocol/schema boilerplate | Serialization formats, API contracts, and wire protocols require exact structure in each location. Abstracting them hides the contract. | NDJSON telemetry line construction in wrapper scripts |
| Test fixtures and golden files | Test data intentionally duplicates production patterns to verify behavior. Sharing fixtures creates brittle cross-test dependencies. | Test setup code, expected output snapshots |

How to Report Accepted Exceptions

When clone detection finds duplication that matches an accepted exception pattern:

  1. Report it — always show the user what was found (lines, tokens, files)
  2. Flag as accepted — explicitly state it matches a known exception pattern
  3. Explain why — cite the specific reason refactoring is not recommended
  4. Do NOT recommend refactoring — this is the key difference from actionable findings

Example output format:

Code Clone Analysis Results

PMD CPD Findings:
  Clone 1: 115 lines (575 tokens) — base_bars → signals CTEs
    gen610_template.sql:33 ↔ gen710_template.sql:38
    Status: ACCEPTED EXCEPTION (generation-per-directory experiment)
    Reason: Each generation is immutable. Shared CTEs would break
            experiment provenance and reproducibility.

  Clone 2: 36 lines (478 tokens) — metrics aggregation
    gen610_template.sql:207 ↔ gen710_template.sql:244
    Status: ACCEPTED EXCEPTION (SQL template without include mechanism)

Actionable Findings: 0
Accepted Exceptions: 2
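
A report in this shape could be produced by something like the sketch below. The `Clone` dataclass, the `render_report` function, and the `ACTIONABLE` status label are hypothetical names chosen for illustration. The sample values are taken from the example output above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Clone:
    lines: int
    tokens: int
    left: str                        # "file:line" of the first occurrence
    right: str                       # "file:line" of the second occurrence
    exception: Optional[str] = None  # matched exception pattern, if any
    reason: str = ""                 # why refactoring is not recommended

def render_report(clones: list[Clone]) -> str:
    """Report every clone, flagging accepted exceptions instead of hiding them."""
    out = ["PMD CPD Findings:"]
    for i, c in enumerate(clones, start=1):
        out.append(f"  Clone {i}: {c.lines} lines ({c.tokens} tokens)")
        out.append(f"    {c.left} ↔ {c.right}")
        if c.exception:
            out.append(f"    Status: ACCEPTED EXCEPTION ({c.exception})")
            if c.reason:
                out.append(f"    Reason: {c.reason}")
        else:
            out.append("    Status: ACTIONABLE (refactoring candidate)")
    accepted = sum(1 for c in clones if c.exception)
    out.append("")
    out.append(f"Actionable Findings: {len(clones) - accepted}")
    out.append(f"Accepted Exceptions: {accepted}")
    return "\n".join(out)

clones = [
    Clone(115, 575, "gen610_template.sql:33", "gen710_template.sql:38",
          exception="generation-per-directory experiment",
          reason="Each generation is immutable."),
    Clone(36, 478, "gen610_template.sql:207", "gen710_template.sql:244",
          exception="SQL template without include mechanism"),
]
print(render_report(clones))
```

Keeping the exception name on the finding itself means accepted clones are still reported (step 1) while the refactoring recommendation is withheld (step 4).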

Project-Level Exception Configuration

Projects can declare accepted exception patterns in their CLAUDE.md:

## Code Clone Exceptions

- `sql/gen*_template.sql` — generation-per-directory experiments (immutable)
- `scripts/gen*/` — copy-and-adapt sweep scripts (no shared infrastructure)
- `tests/fixtures/` — intentional duplication for test isolation

When this section exists in a project's CLAUDE.md, the code-clone-assistant should check it before classifying findings.
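
One way that check might work is sketched below: read the bullet entries under the `## Code Clone Exceptions` heading and match file paths against them as glob patterns. The function names are hypothetical. Treating a trailing `/` as "everything under this directory" is an assumption of this sketch, not documented behavior of the skill.

```python
import fnmatch
import re
from typing import Optional

# Matches lines like: - `glob` <dash> reason   (the dash may be "-" or U+2014)
_ENTRY = re.compile(r"^- `([^`]+)`\s*[\u2014\-]?\s*(.*)$")

def load_exception_patterns(claude_md: str) -> list[tuple[str, str]]:
    """Return (glob, reason) pairs from the '## Code Clone Exceptions' section."""
    patterns, in_section = [], False
    for line in claude_md.splitlines():
        if line.startswith("## "):
            in_section = line.strip() == "## Code Clone Exceptions"
            continue
        match = _ENTRY.match(line) if in_section else None
        if match:
            patterns.append((match.group(1), match.group(2)))
    return patterns

def match_exception(path: str, patterns: list[tuple[str, str]]) -> Optional[str]:
    """Return the reason of the first pattern covering path, or None."""
    for glob, reason in patterns:
        # Assumption: a trailing slash means "anything inside this directory".
        candidate = glob + "*" if glob.endswith("/") else glob
        if fnmatch.fnmatchcase(path, candidate):
            return reason
    return None

# Hypothetical CLAUDE.md content reusing the entries shown above.
claude_md = "\n".join([
    "# Project notes",
    "",
    "## Code Clone Exceptions",
    "",
    "- `sql/gen*_template.sql` \u2014 generation-per-directory experiments (immutable)",
    "- `scripts/gen*/` \u2014 copy-and-adapt sweep scripts (no shared infrastructure)",
    "- `tests/fixtures/` \u2014 intentional duplication for test isolation",
    "",
    "## Other Section",
    "- `ignored/*` \u2014 not a clone exception",
])

patterns = load_exception_patterns(claude_md)
for path in ["sql/gen610_template.sql", "scripts/gen710/sweep.sh", "src/app.py"]:
    print(path, "->", match_exception(path, patterns) or "no exception")
```

A finding would be classified as an accepted exception only when its files match one of these patterns; everything else stays actionable.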


Reference Documentation

For detailed information, see:


Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| PMD CPD not found | Not installed or not in PATH | brew install pmd or download from PMD releases |
| Semgrep timeout | Large codebase scan | Use --exclude to limit scope |
| No duplicates detected | minimum-tokens too high | Lower --minimum-tokens value (try 15) |
| Too many false positives | minimum-tokens too low | Increase --minimum-tokens (try 30+) |
| Language not recognized | Wrong -l flag | Check PMD CPD supported languages list |
| SARIF parse error | Semgrep output malformed | Upgrade Semgrep to latest version |
| Memory error on large repo | Java heap too small | Set PMD_JAVA_OPTS=-Xmx4g |
| Missing clone rules file | Custom rules not created | Create clone-rules.yaml or use default config |