searching-codebases

Installation
SKILL.md

Searching Codebases

Find code in any codebase by pattern or concept. One entry point, two search strategies, automatic routing.

Prerequisites

uv tool install ripgrep

tree-sitting (for structural context expansion) installs automatically when the --expand flag is used.

Primary Command

SKILL_DIR=/mnt/skills/user/searching-codebases

python3 $SKILL_DIR/scripts/search.py SOURCE "query1" ["query2" ...] [OPTIONS]

SOURCE is any of:

  • Local directory path
  • GitHub URL (downloads tarball automatically)
  • uploads (uses /mnt/user-data/uploads/)
  • project (uses /mnt/project/)
  • Path to a .zip or .tar.gz archive

Search Modes

Regex mode (patterns, identifiers, literal text):

python3 $SKILL_DIR/scripts/search.py ./repo "def handle_error"
python3 $SKILL_DIR/scripts/search.py ./repo "class.*Exception" --regex
python3 $SKILL_DIR/scripts/search.py ./repo "TODO|FIXME|HACK"

Semantic mode (concepts, natural language):

python3 $SKILL_DIR/scripts/search.py ./repo "retry logic with backoff" --semantic
python3 $SKILL_DIR/scripts/search.py ./repo "authentication flow"
python3 $SKILL_DIR/scripts/search.py ./repo "error handling strategy"

Auto-detection: short queries and code-like tokens → regex. Multi-word natural language → semantic. Override with --regex or --semantic.

Options

  • --regex / --semantic: Force search mode
  • --expand: Return full function bodies via tree-sitting AST context
  • --benchmark: Compare indexed regex vs brute-force ripgrep
  • --branch NAME: Git branch for GitHub URLs (default: main)
  • --skip DIRS: Comma-separated directories to skip
  • --json: Machine-readable output
  • -v: Show index stats and query routing decisions

How It Works

Regex search builds a sparse n-gram inverted index over all files. Queries are decomposed into literal fragments, looked up in the index to identify candidate files (typically 90-99% reduction), then verified with ripgrep. Frequency-weighted n-grams make rare character sequences more selective.

Semantic search builds a TF-IDF index over code chunks (functions, classes, structural entries). Queries are ranked by cosine similarity.

Context expansion (--expand) uses tree-sitting's AST cache to identify function/class boundaries, returning complete structural units rather than line fragments. On first use, tree-sitting scans the repo (~700ms for 250 files); subsequent expansions are sub-millisecond.

Small codebases (< 20 files) skip indexing entirely — direct ripgrep is faster when there's nothing to narrow.

Mixed Queries

Multiple queries can use different modes in a single invocation. Each query is auto-routed independently, and indexes are built once per mode:

python3 $SKILL_DIR/scripts/search.py ./repo \
  "class.*Error" \
  "error recovery strategy" \
  "def retry"

Dependencies

  • tree-sitting: Provides AST-based context expansion for --expand. Not required — search works without it, just with less structural context in results.
  • ripgrep: Required for regex verification. Install via uv tool install ripgrep.
  • scikit-learn: Required for semantic mode. Installs automatically.

When to Use

  • Known target: "where is the retry logic?", "find all error handlers"
  • Pattern matching: regex across large codebases with indexed speedup
  • Concept search: "authentication flow", "database connection pooling"
  • Cross-reference: find all callers/users of a specific function

When NOT to Use

  • First encounter: "what does this repo do?" → use exploring-codebases
  • Repos under ~10 files: just read them directly
  • Exact symbol lookup: find_symbol('ClassName') via tree-sitting is simpler
  • Structural overview: use tree-sitting's tree_overview() / dir_overview()

Files

  • scripts/search.py — Entry point, query routing, output formatting
  • scripts/resolve.py — Input source resolution (GitHub, uploads, archives)
  • scripts/context.py — tree-sitting-based AST context expansion
  • scripts/ngram_index.py — Sparse n-gram inverted index, regex decomposition
  • scripts/sparse_ngrams.py — Core n-gram algorithms, frequency weights
  • scripts/code_rag.py — TF-IDF semantic search over code chunks
Related skills

More from oaustegard/claude-skills

Installs
5
GitHub Stars
119
First Seen
Mar 29, 2026