Searching Codebases
Find code in any codebase by pattern or concept. One entry point, two search strategies, automatic routing.
Prerequisites
uv tool install ripgrep
tree-sitting (for structural context expansion) installs automatically when
the --expand flag is used.
Primary Command
SKILL_DIR=/mnt/skills/user/searching-codebases
python3 $SKILL_DIR/scripts/search.py SOURCE "query1" ["query2" ...] [OPTIONS]
SOURCE is any of:
- Local directory path
- GitHub URL (downloads tarball automatically)
- uploads (uses /mnt/user-data/uploads/)
- project (uses /mnt/project/)
- Path to a .zip or .tar.gz archive
Search Modes
Regex mode (patterns, identifiers, literal text):
python3 $SKILL_DIR/scripts/search.py ./repo "def handle_error"
python3 $SKILL_DIR/scripts/search.py ./repo "class.*Exception" --regex
python3 $SKILL_DIR/scripts/search.py ./repo "TODO|FIXME|HACK"
Semantic mode (concepts, natural language):
python3 $SKILL_DIR/scripts/search.py ./repo "retry logic with backoff" --semantic
python3 $SKILL_DIR/scripts/search.py ./repo "authentication flow"
python3 $SKILL_DIR/scripts/search.py ./repo "error handling strategy"
Auto-detection: short queries and code-like tokens → regex. Multi-word
natural language → semantic. Override with --regex or --semantic.
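The routing rule above can be sketched roughly as follows. This is a minimal illustration, not the logic in scripts/search.py: the function name `guess_mode`, the character set, and the keyword list are all assumptions chosen so that the example queries in this document route the way the text describes.

```python
import re
from typing import Optional

# Characters common in regexes and identifiers but rare in plain-English
# queries. This set is an assumption, not the skill's actual list.
CODE_CHARS = re.compile(r"[_*|()\[\]{}\\^$+=<>]")
# Hypothetical list of leading tokens that suggest a code-like query.
CODE_KEYWORDS = {"def", "class", "function", "import", "return", "struct", "fn"}


def guess_mode(query: str, forced: Optional[str] = None) -> str:
    """Return "regex" or "semantic" for a single query."""
    if forced in ("regex", "semantic"):  # mirrors --regex / --semantic
        return forced
    words = query.split()
    if len(words) <= 1:                  # short query
        return "regex"
    if CODE_CHARS.search(query):         # regex metacharacters, snake_case
        return "regex"
    if words[0] in CODE_KEYWORDS:        # e.g. "def retry"
        return "regex"
    if any(re.search(r"[a-z][A-Z]", w) for w in words):  # camelCase
        return "regex"
    return "semantic"                    # multi-word natural language


if __name__ == "__main__":
    for q in ["def handle_error", "TODO|FIXME|HACK", "retry logic with backoff"]:
        print(f"{q!r} -> {guess_mode(q)}")
```

When a query sits on the boundary (for example a two-word phrase that is also a valid identifier fragment), passing --regex or --semantic explicitly avoids relying on any heuristic of this kind.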
Options
- --regex / --semantic: Force search mode
- --expand: Return full function bodies via tree-sitting AST context
- --benchmark: Compare indexed regex vs brute-force ripgrep
- --branch NAME: Git branch for GitHub URLs (default: main)
- --skip DIRS: Comma-separated directories to skip
- --json: Machine-readable output
- -v: Show index stats and query routing decisions
How It Works
Regex search builds a sparse n-gram inverted index over all files. Each query is decomposed into literal fragments, which are looked up in the index to identify candidate files (typically a 90-99% reduction in the files to scan); only those candidates are then verified with ripgrep. Frequency-weighted n-grams make rare character sequences more selective.
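The index-then-verify idea can be illustrated with a much simpler stand-in. The sketch below uses fixed-width trigrams rather than sparse, frequency-weighted n-grams, verifies with Python's `re` instead of ripgrep, and decomposes only very simple patterns (it gives up and scans everything when it sees alternation, escapes, classes, or groups). All function names are hypothetical and do not come from scripts/ngram_index.py.

```python
import re
from collections import defaultdict


def trigrams(text):
    """All 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(files):
    """files: dict of path -> content. Returns trigram -> set of paths."""
    index = defaultdict(set)
    for path, content in files.items():
        for gram in trigrams(content):
            index[gram].add(path)
    return index


def literal_fragments(pattern):
    """Literal runs every match must contain; [] means 'cannot narrow'."""
    if any(ch in pattern for ch in "|\\[({"):
        return []  # conservative: too complex for this sketch
    # Split on "char + * or ?" (that char is optional) and on lone metachars.
    parts = re.split(r".[*?]|[.+^$*?]", pattern)
    return [p for p in parts if len(p) >= 3]


def candidates(index, files, pattern):
    """Paths that contain every trigram of every required fragment."""
    result = set(files)
    for frag in literal_fragments(pattern):
        for gram in trigrams(frag):
            result &= index.get(gram, set())
    return result


def search(files, index, pattern):
    """Narrow with the index, then verify candidates with the real regex."""
    rx = re.compile(pattern)
    return sorted(p for p in candidates(index, files, pattern)
                  if rx.search(files[p]))


if __name__ == "__main__":
    files = {
        "a.py": "class ParseException(Exception):\n    pass\n",
        "b.py": "def handle_error(e):\n    return None\n",
        "c.py": "x = 1\n",
    }
    index = build_index(files)
    print(candidates(index, files, "class.*Exception"))
    print(search(files, index, "def handle_error"))
```

The important property is that narrowing never produces false negatives: a file is dropped only if it lacks a trigram that every match must contain, and the final regex pass removes the false positives.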
Semantic search builds a TF-IDF index over code chunks (functions, classes, structural entries). Chunks are ranked by cosine similarity to the query.
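As a rough picture of TF-IDF ranking, the sketch below scores a handful of code chunks against a query using only the standard library. The skill itself depends on scikit-learn for this step; the tokenizer, the smoothed IDF formula, and every function name here are illustrative assumptions rather than the contents of scripts/code_rag.py.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercased word parts; splits snake_case and camelCase identifiers."""
    return [t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", text)]


def tfidf_vectors(docs):
    """Return one sparse term->weight dict per document, plus the IDF table."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter(term for c in counts for term in c)
    # Smoothed IDF so that terms present in every chunk keep a small weight.
    idf = {t: math.log((1 + n) / (1 + k)) + 1.0 for t, k in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in c.items()} for c in counts]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank(query, chunks):
    """chunks: dict of name -> source text. Best-matching names first."""
    vectors, idf = tfidf_vectors(list(chunks.values()))
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    scored = [(cosine(q, v), name) for name, v in zip(chunks, vectors)]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


if __name__ == "__main__":
    chunks = {
        "retry_with_backoff": "def retry_with_backoff(fn, attempts):\n"
                              "    delay = 1  # exponential backoff\n",
        "login": "def login(user, password):\n"
                 "    return authenticate(user, password)\n",
    }
    print(rank("retry logic with backoff", chunks))
```

TF-IDF matches on shared terms, so a query scores well only when its words (or identifier parts) actually occur in a chunk. Phrasing queries with vocabulary likely to appear in names and comments tends to help.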
Context expansion (--expand) uses tree-sitting's AST cache to
identify function/class boundaries, returning complete structural units
rather than line fragments. On first use, tree-sitting scans the repo
(~700ms for 250 files); subsequent expansions are sub-millisecond.
Small codebases (< 20 files) skip indexing entirely — direct ripgrep is faster when there's nothing to narrow.
Mixed Queries
Multiple queries can use different modes in a single invocation. Each query is auto-routed independently, and indexes are built once per mode:
python3 $SKILL_DIR/scripts/search.py ./repo \
    "class.*Error" \
    "error recovery strategy" \
    "def retry"
Dependencies
- tree-sitting: Provides AST-based context expansion for --expand. Not required; search works without it, just with less structural context in results.
- ripgrep: Required for regex verification. Install via uv tool install ripgrep.
- scikit-learn: Required for semantic mode. Installs automatically.
When to Use
- Known target: "where is the retry logic?", "find all error handlers"
- Pattern matching: regex across large codebases with indexed speedup
- Concept search: "authentication flow", "database connection pooling"
- Cross-reference: find all callers/users of a specific function
When NOT to Use
- First encounter: "what does this repo do?" → use exploring-codebases
- Repos under ~10 files: just read them directly
- Exact symbol lookup: find_symbol('ClassName') via tree-sitting is simpler
- Structural overview: use tree-sitting's tree_overview() / dir_overview()
Files
- scripts/search.py: Entry point, query routing, output formatting
- scripts/resolve.py: Input source resolution (GitHub, uploads, archives)
- scripts/context.py: tree-sitting-based AST context expansion
- scripts/ngram_index.py: Sparse n-gram inverted index, regex decomposition
- scripts/sparse_ngrams.py: Core n-gram algorithms, frequency weights
- scripts/code_rag.py: TF-IDF semantic search over code chunks