# devpilot-scanning-repos

**Repo Scan (Security / Edge Cases / Coverage → GitHub Issues)**
## Files in this skill

| File | When to load |
|---|---|
| `agents/security-scanner.md` | Step 3 — sub-agent prompt for the security scanner. |
| `agents/edge-case-hunter.md` | Step 3 — sub-agent prompt for edge-case hunting (no business logic). |
| `agents/coverage-auditor.md` | Step 3 — sub-agent prompt for test-coverage gap detection. |
| `agents/doc-consistency-auditor.md` | Step 3 — sub-agent prompt for doc/code drift detection (CLAUDE.md, AGENTS.md, README.md and linked docs). |
| `references/scoring.md` | Step 4 — full 0/25/50/75/100 rubric + false-positive classes. |
| `references/issue-template.md` | Step 7 — exact `gh issue create` body and label contract. |
| `references/labels.md` | Step 2 — one-shot `gh label create` commands. |
| `scripts/check-findings.py` | Step 3.5 — validates each scanner's JSON output against the schema. |
| `evals/evals.json` | Test scenarios for skill behavior (not loaded at runtime). |
## Overview
A whole-repo sweep that dispatches four parallel specialist sub-agents (security, edge-case, coverage, doc-drift), scores every finding 0–100 for confidence, filters below threshold, then files each surviving finding as a labeled GitHub issue. Business logic is out of scope — scanners only catch mistakes a reasonable reader could flag without domain knowledge. The doc-drift scanner audits the entry-point docs (CLAUDE.md, AGENTS.md, README.md) and every doc file they link to, checking falsifiable claims against the current code.
Core principle: coverage during scan, filtering during scoring, noise-free issues at the end. The sub-agents are told to surface everything they notice; a separate scoring pass kills the noise so the human only sees load-bearing issues.
## When NOT to Use

- Single PR / diff review → `devpilot-pr-review`.
- Pure style / lint / formatting → the relevant style skill (`devpilot-google-go-style`, etc.) + the project's linter.
- Business-logic correctness ("does this function compute the right tax rate?") → a human with domain context.
- Repo without a GitHub-style issue tracker, or user doesn't want issues created → ask first; print findings to terminal instead.
## Workflow

1. **Resolve target.** Accept `owner/repo`, a clone URL, or "this repo" (use `gh repo view --json nameWithOwner`). Verify with `gh repo view`.

2. **Reconcile labels** — rename non-canonical matches, reuse exact matches, create what's missing. Do NOT blindly paste a `gh label create` block. The procedure:
   - Snapshot existing labels: `gh label list --limit 200 --json name,description > /tmp/devpilot-existing-labels.json`. If the count returned equals the limit, raise `--limit` and re-run until the result is shorter than the limit.
   - For each label the skill needs (the full taxonomy lives in `references/labels.md`), classify the closest existing repo label into one of three buckets:
     - **Exact match** — the repo label name already equals the canonical `type:value` (e.g. repo has `scan:security`). Reuse as-is.
     - **Semantic match, non-canonical name** — same intent, different name (e.g. repo has `security` or `type-security` for `scan:security`; `nil-deref` for `edge:nil-deref`). Rename to the canonical name with `gh label edit "<old-name>" --name "<canonical-name>" --description "<canonical-desc>"`. Renaming preserves all existing issue associations, so this is safe and brings the repo onto our taxonomy. Confirm the rename plan with the user before executing if more than 3 labels are being renamed at once.
     - **Too generic / wrong intent** — e.g. `bug` for `edge:nil-deref`, or `enhancement` for anything scan-related. Do NOT rename and do NOT reuse. Create the canonical label fresh; leave the generic label alone.
   - Build a name-mapping table for this run: `canonical_name → label_to_apply`. After renames, every entry should be the canonical `type:value` name. The orchestrator uses this mapping when filing issues in step 7.
   - Create the labels that had no match. Use the canonical `type:value` form. See `references/labels.md` for the exact `gh label create` invocation per label (colors and descriptions).
   - Print the reconciliation summary to the user before continuing — `N reused, R renamed (old → new), M created` — so they can spot a wrong rename or reuse before issues get filed.
   - `area:*` labels follow the same three-bucket rule but are reconciled lazily at filing time (step 7): for each finding, check the snapshot for an existing area-ish label covering the same top-level dir; rename it to `area:<dir>` if it's a semantic match with a non-canonical name, otherwise create.

2.5. **Build the file manifest AND the doc manifest.**
   - Source manifest (`/tmp/devpilot-scan-manifest.txt`): one walk, sampled path list of production source files for the security / edge-case / coverage scanners. They MUST NOT re-walk — they may only read paths in this manifest. See "Scaling for large repos" below. Default cap: 800 files (`--full` raises to 2000, `--scope <dir>` constrains to a subtree).
   - Doc manifest (`/tmp/devpilot-doc-manifest.txt`): built independently for the doc-drift scanner. Steps:
     - Find every entry-point file, case-insensitive, anywhere in the repo: `fd -HI -t f -i '^(claude|agents|readme)\.md$'` (fallback: `find . -type f -iregex '.*/\(claude\|agents\|readme\)\.md'`).
     - Parse markdown links from each — both inline `[text](path)` and reference `[text]: path` forms — keeping only relative targets that resolve to existing files with doc-ish extensions (`.md`, `.mdx`, `.txt`, `.rst`) or living under a `docs/`-style directory. Strip `#anchor` for resolution but keep it for the scanner's reporting.
     - Recurse one hop at a time, dedupe by absolute path, cap at depth 3 and at 200 doc files total. Write the resulting list to `/tmp/devpilot-doc-manifest.txt`. Print to the user: total entry points found, total docs in the manifest, and a tree showing which entry point pulled in which linked doc.
3. **Dispatch scanners in parallel.** In ONE message, launch four sub-agents using the prompts in `agents/`:
   - `agents/security-scanner.md` — pass `/tmp/devpilot-scan-manifest.txt`
   - `agents/edge-case-hunter.md` — pass `/tmp/devpilot-scan-manifest.txt`
   - `agents/coverage-auditor.md` — pass `/tmp/devpilot-scan-manifest.txt`
   - `agents/doc-consistency-auditor.md` — pass `/tmp/devpilot-doc-manifest.txt`

   Each returns a list of `Finding` objects (see format below). Scanners are told to emit everything they notice — including low-severity — because filtering happens in step 4, not in the scanner.

3.5. **Validate scanner output.** Pipe each scanner's JSON array through `python3 scripts/check-findings.py`. The `--manifest` flag is mandatory and chooses the manifest the scanner was dispatched against:
   - security / edge-case / coverage → `--manifest /tmp/devpilot-scan-manifest.txt`
   - doc-drift → `--manifest /tmp/devpilot-doc-manifest.txt`

   The script rejects findings whose `file` is not on the relevant manifest, findings with missing required fields, invalid `category`/`subcategory`/`severity` enum values, or empty `evidence`. Fix (or ask the scanner to re-emit) before scoring.
4. **Score every finding, in batches.** Group findings by category and dispatch ONE scoring sub-agent per batch of up to 25 findings (not one per finding — the per-finding fan-out doesn't scale past ~50). The scoring agent returns a JSON array of `{index, score, reason}` aligned with the input order. See `references/scoring.md` for the rubric and the batched prompt.

5. **Filter.** Drop every finding with score < 75. If zero survive, stop — report "no high-confidence issues found" to the user and do not create issues.

6. **Deduplicate against existing issues.** Before filing, query existing scan issues. Use a search that covers BOTH the new taxonomy and the legacy `repo-scan` label (so re-runs against repos scanned under the old label set still dedupe correctly):

   ```
   gh issue list --search 'label:scan:security,scan:edge-case,scan:coverage,scan:doc-drift,repo-scan in:title' \
     --state all --limit 1000 --json title,number,state
   ```

   Normalize titles by lower-casing, stripping the `[scan:<category>]` / `[repo-scan:<category>]` prefix, and collapsing whitespace before comparing. Skip findings whose normalized title matches an existing issue. If `--limit 1000` returns exactly 1000, paginate with `--search "... created:<<date-of-oldest>"` until empty.

7. **File issues.** One `gh issue create` per surviving finding, using the template in `references/issue-template.md`. Labels: always exactly five — apply the labels from the step-2 mapping table for `scan:<category>`, the matching subcategory, `severity:<level>`, `confidence:<score>`, plus an `area:<top-level-dir>` resolved lazily here. For the area label: first check `/tmp/devpilot-existing-labels.json` for a suitable existing label (e.g. repo already has `area-cmd` or `cmd` covering the dir). If it's an exact match, reuse; if it's a semantic match with a non-canonical name, rename it to `area:<dir>` via `gh label edit`; otherwise run `gh label create area:<dir>`.

8. **Summarize.** Print a compact table to the user: `[category] [severity] title → #issue-number`.
## Finding format
Every scanner returns a JSON array of objects with exactly these fields:
```json
{
  "category": "security | edge-case | coverage | doc-drift",
  "subcategory": "sec:injection | sec:authn-authz | sec:secrets | sec:crypto | sec:path-traversal | sec:ssrf-csrf | sec:deserialization | sec:tls-misconfig | edge:nil-deref | edge:bounds-overflow | edge:error-swallowed | edge:concurrency | edge:resource-leak | edge:input-validation | cov:no-test-file | cov:error-paths | cov:integration-seam | cov:stale-test | doc:broken-link | doc:missing-file | doc:command-mismatch | doc:stale-claim | doc:cross-doc-conflict",
  "title": "<≤80 chars, imperative — e.g. 'Sanitize shell input in cmd/devpilot/run.go'>",
  "severity": "high | medium | low",
  "file": "<path relative to repo root>",
  "line_range": "L42-L58",
  "evidence": "<2–5 lines quoted from the file, with line numbers>",
  "why_it_matters": "<1–3 sentences, no business-logic claims>",
  "suggested_fix": "<1–3 sentences; null if scanner can't confidently propose one>"
}
```
`subcategory` must match `category` (`sec:*` for security, `edge:*` for edge-case, `cov:*` for coverage, `doc:*` for doc-drift). See `references/labels.md` for the fixed enum — scanners do NOT invent new subcategory values. If a finding doesn't fit any subcategory, the scanner picks the closest fit OR drops the finding.

For doc-drift, the `file` field is the doc containing the wrong claim (e.g. `README.md`, `docs/cli-reference.md`), not the source file the claim is about. The source file (or its absence) goes in `evidence`.

`evidence` is mandatory. A finding without quotable code is speculation — drop it at the scanner level.
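The category/subcategory contract can be illustrated with a minimal checker. The authoritative validator is `scripts/check-findings.py` (which additionally checks the manifest); this sketch only shows the shape of the rules, and `finding_errors` is a hypothetical name.

```python
# Prefix each category's subcategories must carry, per the finding format above.
PREFIX_FOR = {"security": "sec:", "edge-case": "edge:",
              "coverage": "cov:", "doc-drift": "doc:"}
REQUIRED = {"category", "subcategory", "title", "severity", "file",
            "line_range", "evidence", "why_it_matters", "suggested_fix"}

def finding_errors(finding: dict) -> list[str]:
    """Return a list of schema violations; empty list means the finding is well-formed."""
    errors = []
    missing = REQUIRED - finding.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    cat = finding.get("category")
    sub = finding.get("subcategory", "")
    if cat not in PREFIX_FOR:
        errors.append(f"unknown category: {cat!r}")
    elif not sub.startswith(PREFIX_FOR[cat]):
        errors.append(f"subcategory {sub!r} does not match category {cat!r}")
    if not finding.get("evidence"):
        errors.append("empty evidence — drop at scanner level")
    return errors
```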
## Scoring rubric (summary — full rubric in `references/scoring.md`)
| Score | Meaning | Action |
|---|---|---|
| 0 | False positive / pre-existing / would be caught by CI | drop |
| 25 | Might be real; scanner couldn't verify | drop |
| 50 | Real but minor nit | drop |
| 75 | Real, meaningful, verified in code | file |
| 100 | Certain, load-bearing, reproducible | file |
Threshold is 75, matching the confidence bar of a senior reviewer who doesn't cry wolf.
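Merging the scoring agent's output back onto the findings (steps 4–5) is simple bookkeeping, sketched below. The `{index, score, reason}` row shape comes from step 4; `apply_scores` is an illustrative name, not a shipped function.

```python
def apply_scores(findings: list[dict], scores: list[dict],
                 threshold: int = 75) -> list[dict]:
    """Merge {index, score, reason} rows back onto findings (aligned by input
    order) and keep only those at or above the threshold."""
    surviving = []
    for row in scores:
        finding = dict(findings[row["index"]])
        finding["confidence"] = row["score"]       # becomes the confidence:<score> label
        finding["score_reason"] = row["reason"]
        if row["score"] >= threshold:
            surviving.append(finding)
    return surviving
```

If `apply_scores` returns an empty list, the run stops here and reports "no high-confidence issues found".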
## What scanners MUST NOT flag
(These are pre-filtered at the scanner level — do not let findings like this through.)
- Anything a linter, formatter, type-checker, or compiler catches.
- Business-logic correctness — "this discount calculation is wrong" requires domain context the scanner doesn't have.
- Style / naming / readability nits — defer to style skills.
- "Missing tests" for files that are pure types, generated code, or obviously trivial (constants, one-line accessors).
- Dependency CVEs — that's Dependabot's job.
- Pre-existing issues on lines the scanner can see were untouched in recent history (the scanner can `git log` to check if desired).
See each agent prompt for category-specific false-positive classes.
## Quick reference
| I want to… | Do this |
|---|---|
| Scan current repo | `gh repo view --json nameWithOwner` → pass to step 1 |
| Scan a specific path only | Tell each scanner: "scope = `internal/auth/` only" |
| Preview without filing issues | Skip steps 2 and 7; print the table |
| Re-scan without duplicate issues | Step 6 handles this; don't delete existing repo-scan issues |
| Change threshold | Edit step 5 — keep the 0/25/50/75/100 scale, move only the cutoff |
## Acceptance criteria (the "test" this skill is written against)
A correct run of this skill produces:
- Exactly four scanner sub-agent dispatches, in parallel (security, edge-case, coverage, doc-drift).
- Every filed issue has exactly five labels: `scan:<category>`, a matching `sec:*`/`edge:*`/`cov:*`/`doc:*` subcategory, `severity:<level>`, `confidence:<score>` (75 or 100), and an auto-derived `area:<top-level-dir>` (for doc-drift findings, derived from the doc file's path — e.g. a finding in `README.md` → `area:root`, in `docs/cli-reference.md` → `area:docs`).
- No issue is filed whose scoring-agent score is below 75.
- No duplicate of an existing open scan issue (under either the new `scan:*` labels or the legacy `repo-scan` label).
- No finding cites business-logic correctness as the sole reason.
- Every filed issue body quotes ≥2 lines of actual code with a `<file>#L<start>-L<end>` link.
- If zero findings survive scoring, no issues are filed and the user is told so explicitly.
If any of these is violated, the skill failed — stop and correct before continuing.
## Scaling for large repos
A real-world stress test on kubernetes/kubernetes (16,942 source files, ~268k tokens just for the path list) surfaced three failure modes the workflow MUST defend against. This section is the contract — it is not optional.
### Failure modes (measured, not theoretical)
- **Path-list explosion.** The full file list of a large monorepo can consume more tokens than a sub-agent's entire window, before a single file is read. Each scanner re-walking the tree triplicates the cost.
- **Hardcoded directory allowlist misses.** The naive fallback `cmd/, internal/, api/, handlers/, controllers/, auth/, crypto/, utils/` matched 3% of code on kubernetes (which has none of `internal/`, `api/`, `handlers/`, `controllers/`, `auth/`, `crypto/`, `utils/` at the top level). Hardcoded allowlists cannot be the fallback.
- **Sink-grep volume swamps verification.** `exec.Command` alone returns 252 hits, `math/rand` 214, `InsecureSkipVerify` 93 — totalling 647 hits across just 6 patterns. A scanner told to "read the surrounding function" for each will either skip verification (and emit raw greps) or stop after a tiny fraction.
### The manifest algorithm (step 2.5)
The orchestrator builds the manifest BEFORE dispatching scanners:
1. **Inventory.** Run `find . -type f \( -name '*.go' -o -name '*.py' -o -name '*.js' -o -name '*.ts' -o -name '*.rb' -o -name '*.java' -o -name '*.rs' \) -not -path './vendor/*' -not -path './node_modules/*' -not -path '*/gen/*' -not -path '*/.git/*' | wc -l`. Call this `N`.
2. If `N` ≤ 800, the manifest is the full list — no sampling needed.
3. If `N` > 800, sample down to 800 with this priority order (concatenate, dedupe, truncate at 800):
   - **Hot-path churn.** Files changed in the last 90 days, ranked by commit count: `git log --since=90.days.ago --name-only --pretty=format: -- '*.go' '*.py' '*.js' '*.ts' '*.rb' '*.java' '*.rs' | sort | uniq -c | sort -rn | awk '$2 != "" {print $2}'`. Take the top 400.
   - **Top-level high-signal dirs by file count.** Dynamically discover (do NOT use the hardcoded allowlist): `find . -mindepth 1 -maxdepth 1 -type d -not -name '.*' -not -name 'vendor' -not -name 'node_modules' -not -name 'test' -not -name 'tests' -not -name 'docs' -not -name 'examples'`, count source files in each, take the top 4 dirs by count, sample evenly from them up to 300 files. This is the part that fixes the kubernetes-allowlist mismatch — a repo with `pkg/` and `staging/` at the top will have those picked, not ignored.
   - **Security-sensitive name patterns** as a final fill: any path containing `auth`, `crypto`, `secret`, `cred`, `token`, `password`, `cert`, `tls`, `oauth`, `session`, `jwt`, `signin`, `login`. Take up to 100.
4. Write `/tmp/devpilot-scan-manifest.txt`, one path per line. Print to the user: total `N`, manifest size, and the top 5 dirs represented (so they can sanity-check that the sampling caught the right slice).

- `--full` mode raises the cap to 2000 but still goes through the same priority sampling. Above 2000, refuse and ask the user to scope.
- `--scope <dir>` mode skips churn ranking — the manifest is `find <dir>` capped at 800.
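The concatenate/dedupe/truncate merge of the three priority tiers can be sketched as follows. The tier lists are assumed to come from the shell commands above; `build_manifest` is a hypothetical helper name, and the per-tier caps (400/300/100) mirror the numbers in the algorithm.

```python
def build_manifest(churn: list[str], dir_sample: list[str],
                   sensitive: list[str], cap: int = 800) -> list[str]:
    """Merge the three priority tiers in order — churn first, then top dirs,
    then security-sensitive names — dedupe preserving order, truncate at cap."""
    ordered = churn[:400] + dir_sample[:300] + sensitive[:100]
    return list(dict.fromkeys(ordered))[:cap]
```

Because `dict.fromkeys` preserves first-seen order, a file that is both hot-churn and security-sensitive occupies only its higher-priority slot.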
### Sink-grep budget (per scanner)
Each scanner runs its dangerous-sink greps ONLY against files in the manifest, not the whole tree:
```
grep -nE 'exec\.Command|InsecureSkipVerify|...' $(cat /tmp/devpilot-scan-manifest.txt)
```
Cap per pattern: 40 hits. If a pattern returns more, take the top 40 by file (prefer files that also appear in the churn list — recently-modified hits beat ancient ones). The remainder is logged as "skipped: N additional <pattern> hits not verified" in the scanner's output, so the user sees coverage gaps explicitly rather than silently.
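The per-pattern cap with churn preference can be sketched like this. Hits are modeled as `(file, line, text)` tuples from the grep output; `cap_hits` is an illustrative name, and returning the skipped count is what lets the scanner emit the "skipped: N additional hits" note.

```python
def cap_hits(hits: list[tuple[str, int, str]], churn_files: set[str],
             cap: int = 40) -> tuple[list[tuple[str, int, str]], int]:
    """Keep at most `cap` hits for one pattern, preferring files on the churn
    list (recently-modified hits beat ancient ones). Returns (kept, skipped)."""
    # Sort churn-list files first, then by file path and line for determinism.
    ranked = sorted(hits, key=lambda h: (h[0] not in churn_files, h[0], h[1]))
    return ranked[:cap], max(0, len(hits) - cap)
```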
### Scoring batching (step 4)
Group findings by category, send up to 25 findings per scoring sub-agent dispatch in one message. For 200 findings across four categories that's roughly 9–12 scoring dispatches, not 200. The scoring agent's batched prompt lives in `references/scoring.md`.
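The group-then-chunk batching is sketched below; `score_batches` is a hypothetical helper name. Each returned batch goes to one scoring sub-agent dispatch.

```python
def score_batches(findings: list[dict], batch_size: int = 25) -> list[list[dict]]:
    """Group findings by category, then chunk each group into batches of
    at most batch_size for the scoring sub-agents."""
    by_cat: dict[str, list[dict]] = {}
    for f in findings:
        by_cat.setdefault(f["category"], []).append(f)
    batches = []
    for group in by_cat.values():
        for i in range(0, len(group), batch_size):
            batches.append(group[i:i + batch_size])
    return batches
```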
### Dedupe pagination (step 6)
`gh issue list --limit 200` silently truncates — confirmed against kubernetes/kubernetes. Always use `--limit 1000`, check whether exactly 1000 returned, and paginate with `created:<<date>` until the page is short. See step 6.
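The title normalization that step 6 applies before comparing against existing issues can be sketched as pure string logic. `normalize_title` and `is_duplicate` are illustrative names; the prefix pattern covers both the new `[scan:<category>]` and legacy `[repo-scan:<category>]` forms.

```python
import re

# Matches a leading [scan:<category>] or [repo-scan:<category>] prefix.
PREFIX = re.compile(r"^\[(?:repo-)?scan:[a-z-]+\]\s*", re.IGNORECASE)

def normalize_title(title: str) -> str:
    """Lower-case, strip the scan prefix, collapse whitespace."""
    stripped = PREFIX.sub("", title.strip())
    return re.sub(r"\s+", " ", stripped).lower()

def is_duplicate(candidate: str, existing_titles: list[str]) -> bool:
    """True if the candidate finding title matches an existing issue title
    after normalization."""
    seen = {normalize_title(t) for t in existing_titles}
    return normalize_title(candidate) in seen
```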
## Common mistakes
- Letting scanners filter their own output. They should over-report. The scoring pass does the filtering. Merging the two loses calibration.
- Using the scanner agent to also create the issue. Don't — the scanner has too much context. File issues from the main agent after scoring.
- Dropping the evidence block in the issue body. Without it the human has to re-derive the finding. File a crap issue once and nobody trusts the skill.
- Mass-creating the full taxonomy upfront without checking the repo. The skill MUST snapshot existing labels (step 2) and reconcile against them. Blindly creating `scan:security` when the repo already has `security` doubles the label space and fragments triage queries — rename `security` to `scan:security` instead.
- Reusing a non-canonical name as-is instead of renaming it. If the repo has `security` and we file under `scan:security`, both labels now exist with overlapping intent. Rename the existing one to bring the repo onto the canonical taxonomy.
- Renaming a too-generic label. Don't rename `bug` to `edge:nil-deref` — it's attached to unrelated existing issues. Generic labels stay; create the canonical one alongside.
- Creating subcategory or severity labels inside the issue-creation loop. Reconcile in step 2, file in step 7. Only `area:*` is resolved lazily, and even then it goes through the same suitable-existing-label check.
- Asking scanners to rank severity and confidence. Confidence is the scoring pass's job; scanners assign severity only.
- Forgetting the dedupe step. Re-running the skill must be idempotent or the user will stop running it.
## Evaluation
Test scenarios for this skill live in `evals/evals.json`. Each eval gives a prompt, expected output shape, and machine-checkable assertions (e.g. `exactly_four_scanner_dispatches`, `no_business_logic_findings_filed`, `all_issues_have_five_labels`). Run before shipping any change to scanner prompts or the scoring rubric.