Building GitHub Index
Create markdown indexes of GitHub repositories optimized for Claude project knowledge. Each index pairs a semantic description (used for relevance matching) with a file path that can be retrieved via the GitHub API.
Quick Start
# Documentation repos (markdown/notebooks)
python scripts/github_index.py owner/repo -o index.md
# Code repos (extract symbols via tree-sitter)
python scripts/github_index.py owner/repo --code-symbols -o index.md
# Multiple repos combined
python scripts/github_index.py owner/repo1 owner/repo2 -o combined.md
Script Options
| Flag | Description |
|---|---|
| -o, --output | Output file (default: github_index.md) |
| --token | GitHub PAT; also reads GITHUB_TOKEN env |
| --include-patterns | Only index matching globs: "docs/**" "src/**" |
| --exclude-patterns | Skip matching globs: "test/**" |
| --max-files | Cap files per repo (default: 200) |
| --skip-fetch | Tree only, no content fetch (fast, filename-only descriptions) |
| --code-symbols | Include code files, extract function/class names via tree-sitter |
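To make the interaction of --include-patterns, --exclude-patterns, and --max-files concrete, here is a minimal sketch of one plausible filtering order (include, then exclude, then cap). The function name `select_paths` is hypothetical, and using `fnmatch` for glob matching is an assumption; the script's actual matching semantics may differ.

```python
from fnmatch import fnmatch


def select_paths(paths, include=(), exclude=(), max_files=200):
    """Filter repo paths by include/exclude globs, then cap the count.

    Hypothetical helper: mirrors the flag descriptions, not the script itself.
    """
    selected = []
    for path in paths:
        # With no include patterns, every path is a candidate.
        if include and not any(fnmatch(path, pat) for pat in include):
            continue
        # Exclude patterns always win over include patterns in this sketch.
        if any(fnmatch(path, pat) for pat in exclude):
            continue
        selected.append(path)
    return selected[:max_files]


paths = ["docs/guide.md", "src/app.py", "test/test_app.py", "README.md"]
print(select_paths(paths, include=["docs/**", "src/**"], exclude=["test/**"]))
```

Note that `fnmatch` lets `*` match across `/`, so `docs/**` matches nested files under docs/ here.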
Description Extraction Priority
- YAML frontmatter - title: and description: fields
- Markdown headings - First h1/h2 as title, subsequent as topics
- Notebook cells - First markdown cell heading
- Code symbols - Public function/class names (with --code-symbols)
- Path-derived - Convert filename to words (fallback)
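The priority order above can be read as a fallback chain. The sketch below illustrates three of the five sources (frontmatter, headings, path-derived) and omits notebook cells and code symbols for brevity. The function name `describe`, the regex-based frontmatter parsing, and the output formatting are all assumptions made for illustration, not the script's implementation.

```python
import re
from pathlib import PurePosixPath


def describe(path, text=""):
    """Return a description for a file, trying sources in priority order."""
    # 1. YAML frontmatter: title:/description: between leading '---' lines.
    frontmatter = re.match(r"\A---\n(.*?)\n---\n", text, re.DOTALL)
    if frontmatter:
        fields = dict(
            re.findall(
                r"^(title|description):[ \t]*(.+)$",
                frontmatter.group(1),
                re.MULTILINE,
            )
        )
        if fields:
            parts = [fields[k].strip() for k in ("title", "description") if k in fields]
            return " - ".join(parts)

    # 2. Markdown headings: first h1/h2 is the title, the rest are topics.
    headings = re.findall(r"^#{1,2}[ \t]+(.+)$", text, re.MULTILINE)
    if headings:
        title, topics = headings[0], headings[1:]
        return f"{title}: {', '.join(topics)}" if topics else title

    # 3. Path-derived fallback: turn the filename stem into words.
    stem = PurePosixPath(path).stem
    return re.sub(r"[_\-]+", " ", stem).strip().capitalize()


print(describe("docs/setup.md", "# Guide\n\n## Install\n## Usage\n"))
print(describe("src/acc_wav_gen.c"))
```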
When Descriptions Fail
Some repos have stub files (links to external docs, empty readmes), so automatic extraction yields little. In these cases manual curation is recommended: combine the tree output with domain knowledge.
# Get tree structure only (fast)
python scripts/github_index.py owner/repo --skip-fetch -o skeleton.md
# Then manually enhance descriptions based on domain knowledge
For code-heavy repos with embedded apps:
- Directory names encode purpose:
acc_wav_gen→ "ACC waveform generation" - Peripheral acronyms map to functions: AFEC=ADC, MCAN=CAN, TWIHS=I2C
- Operation modes: blocking, interrupt, dma, polled
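Part of this curation can be scripted as a first pass. The sketch below expands the peripheral acronyms and operation modes listed above when they appear as tokens in a directory name. The function name `describe_dir` and the example directory names are hypothetical, and expansions such as "wav" to "waveform" still require domain knowledge that this sketch does not attempt.

```python
import re

# Hints taken from the list above; extend per repository.
PERIPHERALS = {"afec": "ADC", "mcan": "CAN", "twihs": "I2C"}
MODES = {"blocking", "interrupt", "dma", "polled"}


def describe_dir(name):
    """Build a rough description from an underscore/hyphen separated name."""
    tokens = [t for t in re.split(r"[_\-]+", name.lower()) if t]
    words, modes = [], []
    for tok in tokens:
        if tok in PERIPHERALS:
            # Keep the acronym and annotate it with its function.
            words.append(f"{tok.upper()} ({PERIPHERALS[tok]})")
        elif tok in MODES:
            modes.append(tok)
        else:
            words.append(tok)
    desc = " ".join(words)
    if modes:
        desc += f" [{', '.join(modes)} mode]"
    return desc


# Hypothetical directory names, for illustration only.
print(describe_dir("afec_dma_sample"))
print(describe_dir("twihs_eeprom_interrupt"))
```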
Output Format
# {Repo} - Content Index
**Repository:** {url}
**Branch:** `{branch}`
## Retrieval Method
{API curl commands}
---
## {Category}
| Description | Path |
|-------------|------|
| {What this covers} | `{path/file.md}` |
Description column leads (relevance matching), path follows (retrieval key).
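As an illustration of the description-first row layout, here is a minimal sketch that renders one category section. The function name `render_category` and the pipe-escaping step are assumptions; the real script may format rows differently.

```python
def render_category(category, entries):
    """Render one category as a markdown table.

    entries is a list of (description, path) pairs.
    """
    lines = [
        f"## {category}",
        "",
        "| Description | Path |",
        "|-------------|------|",
    ]
    for description, path in entries:
        # Escape pipes so a description cannot break the table row.
        safe = description.replace("|", "\\|")
        lines.append(f"| {safe} | `{path}` |")
    return "\n".join(lines)


print(render_category("Guides", [("Install steps", "docs/install.md")]))
```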
API Access
Enumerate files:
curl -sL "https://api.github.com/repos/OWNER/REPO/git/trees/BRANCH?recursive=1"
Fetch content:
curl -s "https://api.github.com/repos/OWNER/REPO/contents/PATH?ref=BRANCH" \
-H "Accept: application/vnd.github+json" | \
python3 -c "import sys,json,base64; print(base64.b64decode(json.load(sys.stdin)['content']).decode())"
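The same fetch can be done from Python with only the standard library. This is a sketch equivalent to the curl pipeline above; the function names `decode_contents` and `fetch_file` are hypothetical, and the optional `token` parameter assumes a PAT sent as a Bearer token.

```python
import base64
import json
import urllib.parse
import urllib.request


def decode_contents(payload):
    """Decode the base64 'content' field of a contents-API response."""
    # b64decode ignores the newlines GitHub inserts into the encoded content.
    return base64.b64decode(payload["content"]).decode("utf-8")


def fetch_file(owner, repo, path, branch, token=None):
    """Fetch one file via the contents endpoint and return its text."""
    quoted = urllib.parse.quote(path)
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{quoted}?ref={branch}"
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return decode_contents(json.load(response))


# Usage (performs a network request):
# text = fetch_file("OWNER", "REPO", "PATH", "BRANCH")
```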
Network
Both scripts (github_index.py and pk_index.py) download a repo tarball (single HTTP request, no per-file rate limits), then process files locally. Allowlist: api.github.com (tarball redirects via this endpoint)
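Processing a tarball locally might look like the sketch below, which walks a gzipped archive held in memory and yields repo-relative paths with their text. The function name `iter_tarball_files` and the suffix filter are hypothetical; the sketch relies on GitHub tarballs wrapping the repo in a single top-level directory, and it does not show the download step itself.

```python
import io
import tarfile


def iter_tarball_files(data, suffixes=(".md",)):
    """Yield (repo_relative_path, text) for matching files in a gzipped tarball."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        for member in tar:
            if not member.isfile() or not member.name.endswith(suffixes):
                continue
            # Drop the archive's single top-level directory to get the repo path.
            _, _, rel_path = member.name.partition("/")
            handle = tar.extractfile(member)
            if handle is None:
                continue
            yield rel_path, handle.read().decode("utf-8", errors="replace")
```

Because everything is read from one archive, no further API requests are needed per file.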
Related Skills
- accessing-github-repos - Private repos, PAT setup, tarball download
- mapping-codebases - Detailed code structure (methods, imports, line numbers)
Condensed Format (pk_index.py)
For token-constrained project knowledge, use the condensed script:
python scripts/pk_index.py owner/repo -o repo_pk.md
Produces ~80% smaller output:
- Single line per file: path — description
- Symbols only (no signatures)
- 15 files max per category
- No retrieval instructions section
Ideal when adding multiple repo indexes to project knowledge.