Local PDF Index Builder
Build a searchable Asta document index from a collection of local PDF files. Each PDF is converted to markdown, split into ~2000-character chunks, and written as documents to an asta-documents YAML index — enabling semantic search across the full text of the collection.
Installation
This skill requires the asta CLI:
# Install/reinstall at the correct version
PLUGIN_VERSION=0.16.0
if [ "$(asta --version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+\.[0-9]+')" != "$PLUGIN_VERSION" ]; then
uv tool install --force git+https://github.com/allenai/asta-plugins.git@v$PLUGIN_VERSION
fi
Prerequisites: Python 3.11+ and the uv package manager
Assets
This skill includes standalone scripts in the assets/ directory:
| Script | Purpose |
|---|---|
| `assets/extract-pdfs.sh` | Convert PDFs to markdown via `asta pdf-extraction remote` |
| `assets/chunk-and-index.py` | Chunk markdown files and write the YAML index directly |
| `assets/warm-cache.sh` | Run an initial search to build the search cache |
Locate the assets directory relative to this skill file. The scripts are self-contained and can be copied to the working directory or run in place.
Procedure
Step 0: Interview the user for paths and collection name
Before starting, ask the user for four things:
- PDF directory — Where are the PDFs?
- Markdown output directory — Where should extracted markdown go?
- Collection name — A short label for the collection (e.g., `my-papers`, `cs-reading-list`).
- Include images? — Whether to extract and save images embedded in the PDFs alongside the markdown. Images are useful for papers with figures/diagrams but increase storage. Default: no.
Suggest a directory layout where the PDFs and markdown live as siblings under a common parent, with the index file alongside them. For example, if the PDFs are at /data/papers/pdfs/:
/data/papers/ # DATASET_ROOT — parent of everything
├── pdfs/ # PDF_DIR (user already has this)
├── markdown/ # MARKDOWN_DIR (suggested: sibling of pdfs/)
│ ├── paper1.md # without --images: flat .md files
│ ├── paper2.md
│ ├── paper3/ # with --images: per-PDF subdirectories
│ │ ├── paper3.md
│ │ └── img-0.jpeg
│ └── ...
└── index.yaml # INDEX_PATH (auto-created here)
This layout matters because the index stores relative paths to the markdown files. Keeping everything under one root makes the index portable and git-friendly.
Key rules for the suggestion:
- The markdown directory should be adjacent to the PDF directory, not inside `.asta`.
- The `DATASET_ROOT` is the parent directory that contains both `PDF_DIR` and `MARKDOWN_DIR`.
- The index lives at `DATASET_ROOT/index.yaml`.
- If the user's PDFs are at `/home/user/research/pdfs`, suggest `/home/user/research/markdown` and `/home/user/research/index.yaml`.
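The relative-path rule can be illustrated with a short Python sketch. The paths below are hypothetical and this is not the actual chunk-and-index.py code, just an illustration of how a stored `url` relates to the index location:

```python
from pathlib import Path

# Hypothetical paths matching the suggested layout above
dataset_root = Path("/home/user/research")
index_path = dataset_root / "index.yaml"
markdown_file = dataset_root / "markdown" / "paper1.md"

# The url stored in the index is relative to the index file's directory,
# so the whole dataset stays portable if the root is moved or cloned.
stored_url = markdown_file.relative_to(index_path.parent)
print(stored_url)  # markdown/paper1.md
```

If the markdown directory were outside `DATASET_ROOT`, `relative_to` would fail, which is exactly why the layout rules above keep everything under one root.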
Once the user confirms (or provides their own paths), set these variables:
PDF_DIR="/data/papers/pdfs" # user-provided
MARKDOWN_DIR="/data/papers/markdown" # user-confirmed
DATASET_ROOT="/data/papers" # parent of both
INDEX_PATH="$DATASET_ROOT/index.yaml" # derived
COLLECTION="my-papers" # user-chosen
IMAGES=false # true if user wants images
Step 1: Discover PDFs and show estimates
PDF_COUNT=$(find "$PDF_DIR" -name "*.pdf" -type f | wc -l)
TOTAL_SIZE=$(find "$PDF_DIR" -name "*.pdf" -type f -exec du -ch {} + | tail -1 | awk '{print $1}')
echo "Found $PDF_COUNT PDFs ($TOTAL_SIZE total)"
Present this estimate to the user before proceeding:
| Metric | Estimate |
|---|---|
| PDFs found | N files |
| Total size on disk | X MB |
| Extraction time | ~2-5 min per 10-page PDF (remote API); faster with olmocr for batches >20 |
| Chunking + indexing | ~1-2 seconds per PDF |
| Index storage | ~2-3x the extracted text size (markdown files + YAML with chunk text) |
| Cache warm-up | 5-30 seconds (one-time, after indexing) |
| Total estimated time | Dominated by extraction: roughly N_papers x 3 min |
Ask the user to confirm before starting, especially for large collections (>20 PDFs).
Step 2: Extract PDFs to markdown
# Without images (flat layout: markdown/paper.md)
bash /path/to/assets/extract-pdfs.sh "$PDF_DIR" "$MARKDOWN_DIR"
# With images (per-PDF subdirectories: markdown/paper/paper.md + images)
bash /path/to/assets/extract-pdfs.sh --images "$PDF_DIR" "$MARKDOWN_DIR"
Pass --images only if the user opted in during Step 0. When --images is used, each PDF gets its own subdirectory under MARKDOWN_DIR to avoid image filename collisions across PDFs. The chunking script in Step 3 handles both layouts automatically.
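Supporting both layouts only requires looking one directory level deep in addition to the top level. A minimal sketch of that discovery step (illustrative; the real chunk-and-index.py logic may differ):

```python
from pathlib import Path

def find_markdown_files(markdown_dir: str) -> list[Path]:
    """Collect .md files from both the flat layout (markdown/paper.md)
    and the per-PDF subdirectory layout (markdown/paper/paper.md)."""
    root = Path(markdown_dir)
    flat = root.glob("*.md")        # layout without --images
    nested = root.glob("*/*.md")    # layout with --images
    return sorted(set(flat) | set(nested))
```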
The script:
- Skips PDFs whose markdown already exists (resumable)
- Handles large PDFs (>50 pages) by extracting in 50-page increments
- Reports progress and counts
For large batches (>20 PDFs), asta pdf-extraction olmocr with --workers is significantly faster. See the pdf-extraction skill for details.
Step 3: Chunk and build index
uv run --with pyyaml python3 /path/to/assets/chunk-and-index.py "$COLLECTION" "$MARKDOWN_DIR" --index-path "$INDEX_PATH"
The `--index-path` argument is required. The script:
- Computes paths relative to the index file's directory, storing relative paths in the `url` field — making the index portable across machines
- Reads each markdown file and splits it into ~2000-char chunks at paragraph/sentence boundaries
- Writes all documents to the index YAML in a single pass
- Preserves any existing documents in the index (appends, does not overwrite)
- Skips PDFs already indexed for this collection (safe to re-run)

Each document gets:
- Shared PDF metadata: `source_pdf`, `collection` (in `extra`)
- Per-chunk metadata: `chunk_index`, `total_chunks`, `chunk_chars`, `chunk_offset`, `file_chars` (in `extra`)
- Tags: `<collection-name>`, `pdf-index`

Options:
- `--chunk-size 2000` — adjust chunk size (default 2000 chars)
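For orientation, a single chunk document in the index might look roughly like the fragment below. The layout is inferred from the field names listed above, so treat it as a sketch and inspect the generated index.yaml for the authoritative schema:

```yaml
documents:
  - url: markdown/paper1.md
    summary: |
      ...first ~2000 characters of the paper's markdown...
    tags: [my-papers, pdf-index]
    extra:
      source_pdf: paper1.pdf
      collection: my-papers
      chunk_index: 0
      total_chunks: 12
      chunk_chars: 1987
      chunk_offset: 0
      file_chars: 23840
```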
Step 4: Warm the search cache
bash /path/to/assets/warm-cache.sh "$DATASET_ROOT"
The argument is required:
- `$DATASET_ROOT` — the root directory containing `index.yaml`
This step is required. The first search after indexing builds the internal BM25 + embedding indexes. Without warming, the user's first real search will be unexpectedly slow.
Step 5: Report results
asta documents --root "$DATASET_ROOT" list --tags="$COLLECTION"
asta documents --root "$DATASET_ROOT" show
Tell the user:
- Number of PDFs processed and chunks created
- Dataset root: the `DATASET_ROOT` path
- Index location: the `INDEX_PATH`
- Collection tag for filtering: the chosen collection name
- How to search: `asta documents --root "$DATASET_ROOT" search --summary="query" --tags="$COLLECTION"`
Searching the Index
After building, search across all indexed PDFs:
# Semantic search within the collection
asta documents --root "$DATASET_ROOT" search --summary="neural network architecture" --tags="my-papers"
# With relevance scores
asta documents --root "$DATASET_ROOT" search --summary="attention mechanism" --tags="my-papers" --show-scores
# Filter by source PDF
asta documents --root "$DATASET_ROOT" search --extra=".source_pdf contains some-paper"
# List all documents in the collection
asta documents --root "$DATASET_ROOT" list --tags="my-papers"
Storage Estimates
| Collection size | Approx. index size | Approx. markdown size |
|---|---|---|
| 10 PDFs (~10 pp each) | 2-5 MB | 1-3 MB |
| 50 PDFs (~10 pp each) | 10-25 MB | 5-15 MB |
| 100 PDFs (~10 pp each) | 20-50 MB | 10-30 MB |
Total storage is roughly 2-3x the extracted text (markdown files + index YAML with chunk text in the summary field).
Time Estimates
| Stage | Per PDF | Notes |
|---|---|---|
| Extraction (remote) | 2-5 min / 10 pages | API-bound; 50-page limit per call |
| Extraction (olmocr) | 10-20 sec / page, parallel | Better for >20 PDFs |
| Chunking + indexing | 1-2 seconds | Single YAML write, fast |
| Cache warming | 5-30 seconds total | One-time after indexing |
Important Notes
- Warm the cache. The first `asta documents search --summary=...` builds the search index. Always run the warm-cache script after indexing.
- Chunk size tradeoff. 2000 chars balances search precision with context. Smaller chunks = more precise hits, less context. Larger chunks = more context, diluted relevance.
- Resumable. Both extraction and indexing skip already-processed files. Safe to re-run after interruption.
- Index is append-only. The chunking script preserves existing documents in the index. To rebuild from scratch, delete `index.yaml` first.
- PyYAML required. The chunking script needs `pyyaml`. Install with `pip install pyyaml` or `uv pip install pyyaml` if not available.
- Relative paths. The index stores relative paths (e.g., `markdown/paper.md`) so the dataset is portable. This requires the markdown directory to be under the same directory as the index file.
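The chunking behavior described above can be sketched as follows. This simplified version packs whole paragraphs greedily and ignores sentence-level splitting, so it illustrates the approach rather than reproducing the actual chunk-and-index.py implementation:

```python
def chunk_text(text: str, chunk_size: int = 2000) -> list[str]:
    """Greedily pack paragraphs into chunks of roughly chunk_size chars.
    A single paragraph longer than chunk_size becomes its own oversized chunk."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para  # may exceed chunk_size; kept whole
    if current:
        chunks.append(current)
    return chunks
```

Splitting at paragraph boundaries rather than fixed offsets keeps each chunk self-contained, which is what makes the individual search hits readable.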