mapping-documents
Mapping Documents
Generate _MAP.md files providing hierarchical document structure with semantic annotations. Maps show section summaries, typed claims (result/definition/method/caveat/open-question), symbol definitions, and cross-section dependencies — all anchored to page numbers.
This skill is the document analog of mapping-codebases: where tree-sitter parses code via a grammar, docmap parses documents via font analysis plus LLM extraction.
Installation
pip install pdfplumber anthropic --break-system-packages -q
Generate Maps
# Full run (structure + semantic extraction via Claude API)
python /mnt/skills/user/mapping-documents/scripts/docmap.py paper.pdf \
--out docs/ --genre paper --workers 4
# Structure only (no API calls, no cost)
python /mnt/skills/user/mapping-documents/scripts/docmap.py paper.pdf \
--out docs/ --structure-only
API key resolution: --api-key flag > ANTHROPIC_API_KEY env > API_KEY env.
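The precedence above can be sketched as a small helper. This is only an illustration of the documented order, not docmap's actual code: the function name `resolve_api_key` and the choice to treat empty strings as unset are assumptions.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    cli_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first non-empty key in the documented precedence order.

    Hypothetical helper: docmap's real implementation may differ.
    """
    env = os.environ if env is None else env
    # Order matters: --api-key flag, then ANTHROPIC_API_KEY, then API_KEY.
    candidates = (cli_key, env.get("ANTHROPIC_API_KEY"), env.get("API_KEY"))
    for candidate in candidates:
        if candidate:  # assumption: an empty string counts as "not set"
            return candidate
    return None
```

If the helper returns `None`, no key was found; in that case a `--structure-only` run is the option that needs no API access.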
Output Artifacts
docmap writes four files. They slot into a three-layer progressive-disclosure stack:
CLAUDE.md / project instructions ← curated invariants (you write this)
↕ (_USAGE.md bridges the gap)
_MAP.md + JSON indexes ← navigable document map (docmap generates)
↕
raw PDF ← the source document
| File | Purpose | When to read |
|---|---|---|
| {stem}_USAGE.md | Snippet for pasting into CLAUDE.md / AGENTS.md / project knowledge. Describes the reading order and JSON query patterns. | Once, at setup |
| {stem}_MAP.md | Section map: TOC with summaries, typed claims, defined symbols, dependencies. All page-anchored. | Any question about what the document says |
| {stem}.symbols.json | Flat symbol index: where defined, where used, what it means. | "Where is X defined?" |
| {stem}.anchors.json | Every claim: section ID, type, text, page number. | "What caveats exist?" / "What does §3 claim?" |
After Generating: Wire It Up
Generating the map is step 1. Step 2 is telling the agent the map exists.
For a code repo (CLAUDE.md / AGENTS.md):
# Paste the generated usage snippet into your agent instructions
cat docs/paper_USAGE.md >> CLAUDE.md
For Claude.ai project knowledge:
Upload _MAP.md as a project knowledge file, or paste the _USAGE.md content into project instructions.
The _USAGE.md snippet includes copy-pasteable query commands for the JSON indexes. Replace QUERY and SECTION_ID placeholders with actual values.
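For the code-repo case, a plain `cat ... >> CLAUDE.md` appends the snippet again on every run. A minimal sketch of an append that skips the write when the snippet is already present follows; `append_usage_once` is a hypothetical helper, not part of docmap.

```python
from pathlib import Path


def append_usage_once(usage_path: str, instructions_path: str) -> bool:
    """Append the usage snippet to the instructions file unless already present.

    Returns True if the instructions file was modified.
    Hypothetical helper for illustration; not shipped with docmap.
    """
    snippet = Path(usage_path).read_text(encoding="utf-8").strip()
    target = Path(instructions_path)
    existing = target.read_text(encoding="utf-8") if target.exists() else ""
    if snippet in existing:
        return False  # already wired up; avoid duplicating the block
    # Separate the snippet from any earlier content with a blank line.
    separator = "\n\n" if existing and not existing.endswith("\n\n") else ""
    target.write_text(existing + separator + snippet + "\n", encoding="utf-8")
    return True
```

Calling `append_usage_once("docs/paper_USAGE.md", "CLAUDE.md")` after each regeneration leaves CLAUDE.md unchanged if the snippet text has not changed.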
Navigate Via Maps
After generating and wiring up, use the map for navigation — read _MAP.md, not the raw PDF.
Workflow:
- Read the _USAGE.md block in CLAUDE.md for orientation
- Read the top-level TOC in _MAP.md for structure and section summaries
- Drill into relevant sections for typed claims and symbol definitions
- Query .symbols.json for "where is X defined?" lookups
- Query .anchors.json for claim filtering by type or section
- Read the raw PDF only when exact wording or figures are needed
Querying the JSON indexes:
# Symbol lookup
python3 -c "import json; [print(f'§{s[\"defined_in\"]} p.{s[\"defined_at_page\"]}') \
for s in json.load(open('docs/paper.symbols.json')) if 'edl' in s['symbol']]"
# All caveats in the document
python3 -c "import json; [print(f'p.{c[\"page\"]} {c[\"text\"]}') \
for c in json.load(open('docs/paper.anchors.json')) if c['type'] == 'caveat']"
# All claims in a section
python3 -c "import json; [print(f'[{c[\"type\"]}] {c[\"text\"]}') \
for c in json.load(open('docs/paper.anchors.json')) if c['section'] == '4.3']"
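The same queries may be easier to reuse as small functions. The sketch below assumes only the field names that appear in the one-liners above (`symbol`, `defined_in`, `defined_at_page`, `type`, `section`, `text`, `page`) and that each JSON file holds a flat list of records. The function names are hypothetical, and the case-insensitive symbol match is a deliberate difference from the one-liner.

```python
import json
from typing import Iterable, List, Optional


def load_index(path: str) -> List[dict]:
    """Load a docmap JSON index (assumed to be a flat list of records)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def find_symbols(symbols: Iterable[dict], query: str) -> List[dict]:
    """Return symbol records whose name contains the query, ignoring case."""
    needle = query.lower()
    return [s for s in symbols if needle in s["symbol"].lower()]


def filter_claims(
    anchors: Iterable[dict],
    claim_type: Optional[str] = None,
    section: Optional[str] = None,
) -> List[dict]:
    """Filter claims by type and/or section; None means "do not filter"."""
    return [
        c
        for c in anchors
        if (claim_type is None or c["type"] == claim_type)
        and (section is None or c["section"] == section)
    ]
```

For example, `filter_claims(load_index("docs/paper.anchors.json"), claim_type="caveat", section="4.3")` would return only the caveats recorded for section 4.3.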
Genre Support
Genre controls the claim taxonomy used in semantic extraction.
| Genre | Claim types | Best for |
|---|---|---|
| paper (default) | definition, result, method, claim, caveat, open-question | Academic papers, arXiv preprints |
| spec | requirement, definition, constraint, example, note | RFCs, API specs, technical standards |
| legal | definition, obligation, right, exception, condition, reference | Contracts, policy documents, regulations |
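When post-processing .anchors.json, it can help to have the taxonomy above available as data, for instance to flag a claim whose type does not belong to the genre the map was generated with. The sketch below simply transcribes the table; `CLAIM_TYPES` and `is_valid_claim_type` are hypothetical names, not docmap internals.

```python
from typing import Dict, FrozenSet

# Transcribed from the genre table; not read from docmap itself.
CLAIM_TYPES: Dict[str, FrozenSet[str]] = {
    "paper": frozenset(
        {"definition", "result", "method", "claim", "caveat", "open-question"}
    ),
    "spec": frozenset(
        {"requirement", "definition", "constraint", "example", "note"}
    ),
    "legal": frozenset(
        {"definition", "obligation", "right", "exception", "condition", "reference"}
    ),
}


def is_valid_claim_type(genre: str, claim_type: str) -> bool:
    """Check whether a claim type belongs to the given genre's taxonomy."""
    if genre not in CLAIM_TYPES:
        raise ValueError(f"unknown genre: {genre!r}")
    return claim_type in CLAIM_TYPES[genre]
```

Note that `definition` is the only type shared by all three genres, so a map generated with the wrong genre will mostly carry types that fail this check.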
Limitations (v0.1.x)
- PDF-only. No DOCX, HTML, or plain text input yet.
- Single-column layout assumed. Two-column papers may mis-order text within sections.
- No caching. Re-running re-extracts everything.
- No citation cross-referencing.
- Genre must be specified manually.
- Semantic extraction can hallucinate. Every claim is page-anchored, but the page number comes from the LLM. Verify critical claims against the source.
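One rough way to triage claims for manual verification is to check how many of a claim's words actually occur on the page it is anchored to. The sketch below is a heuristic, not something docmap does: `overlap_score` and `read_page_text` are hypothetical helpers, the 1-based page numbering is an assumption about the indexes, and because the extracted claims may be paraphrased, a low score means "check this one by hand" and does not prove the claim is wrong.

```python
import re
from typing import Set


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(claim_text: str, page_text: str) -> float:
    """Fraction of the claim's distinct words that appear on the page."""
    claim_tokens = _tokens(claim_text)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & _tokens(page_text)) / len(claim_tokens)


def read_page_text(pdf_path: str, page_number: int) -> str:
    """Extract one page's text with pdfplumber."""
    import pdfplumber  # imported lazily so overlap_score needs no third-party package

    with pdfplumber.open(pdf_path) as pdf:
        # Assumption: page numbers in the JSON indexes are 1-based.
        return pdf.pages[page_number - 1].extract_text() or ""
```

A typical use would be to compute `overlap_score(c["text"], read_page_text("paper.pdf", c["page"]))` for each record in .anchors.json and review the lowest-scoring claims first.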
CLI Reference
python docmap.py paper.pdf [options]
Options:
--genre {paper,spec,legal} Claim taxonomy (default: paper)
--structure-only Skip LLM pass (free, fast)
--out DIR Output directory (default: .)
--api-key KEY Anthropic API key
--model MODEL Model (default: claude-sonnet-4-6)
--workers N Parallel workers (default: 4)
--no-usage-snippet Skip _USAGE.md generation
-v Verbose structural parsing