Library RAG: Semantic Search

Semantic search over your local library of markdown-converted papers using sentence-transformers embeddings and ChromaDB.

Prerequisites

uv installed (standard in this project)
Papers ingested via ingest.py (which converts to markdown, organizes files, and adds metadata to references.bib)
references.bib with md_path fields linking citation keys to markdown files

Important: Only files registered in references.bib are indexed. Loose markdown files in library/markdown/ without a bib entry will be flagged as "unlinked" during indexing. Run ingest.py to register them.

Quick Start

# Index your library (first time or after adding papers)
uv run plugins/sociology-skillset/scripts/rag.py index

# Search by meaning
uv run plugins/sociology-skillset/scripts/rag.py search "cultural capital and educational attainment"

Script Location

All commands use:

uv run plugins/sociology-skillset/scripts/rag.py <command>

Dependencies (sentence-transformers, chromadb) are auto-installed by uv on first run via PEP 723 inline metadata. No manual installation needed.

Commands

Index

Build or update the vector index from library/markdown/ files.

# Index all markdown files (incremental — skips unchanged files)
uv run plugins/sociology-skillset/scripts/rag.py index

# Index specific citation keys only
uv run plugins/sociology-skillset/scripts/rag.py index --keys Smith2020_Cultural Jones2019_Institutional

The index is stored at library/.rag-index/. First run downloads the all-MiniLM-L6-v2 embedding model (~80MB, cached by sentence-transformers).

Run this after adding new papers to keep the index current.

Search

Semantic search across all indexed documents. Returns JSON lines ranked by similarity.

uv run plugins/sociology-skillset/scripts/rag.py search "social movements and collective identity"
uv run plugins/sociology-skillset/scripts/rag.py search "interview methodology" --top-k 5
uv run plugins/sociology-skillset/scripts/rag.py search "Bourdieu field theory" --min-score 0.3

Each result includes: chunk_id, citation_key, section_title, score, text (truncated), plus title, author, year from references.bib.

Similar

Find passages similar to a given chunk (from search results).

uv run plugins/sociology-skillset/scripts/rag.py similar <chunk_id>
uv run plugins/sociology-skillset/scripts/rag.py similar abc123def456 --top-k 5

Use this to explore thematic connections: find a relevant passage via search, then use similar to discover related content across other papers.

Context

Show the full context around a chunk — the target chunk plus surrounding chunks from the same document.

uv run plugins/sociology-skillset/scripts/rag.py context <chunk_id>
uv run plugins/sociology-skillset/scripts/rag.py context abc123def456 --window 3

Returns the target chunk and neighboring chunks (default: 2 on each side), so you can read the passage in its original context.

Status

Show index statistics: number of documents, chunks, and last modified time.

uv run plugins/sociology-skillset/scripts/rag.py status

List

List all indexed documents with chunk counts.

uv run plugins/sociology-skillset/scripts/rag.py list

Remove

Remove a document from the index by citation key.

uv run plugins/sociology-skillset/scripts/rag.py remove Smith2020_Cultural

Typical Workflows

First-time setup

Ensure papers are in library/markdown/ (run ingest.py for each PDF/EPUB)
Run uv run rag.py index to build the index
Search with uv run rag.py search "your topic"

Adding new papers

Ingest the paper: uv run plugins/sociology-skillset/scripts/ingest.py --file paper.pdf
Update the index: uv run plugins/sociology-skillset/scripts/rag.py index

Adding a PDF for a paper already in references.bib

Ingest with update: uv run plugins/sociology-skillset/scripts/ingest.py --file paper.pdf --citekey ExistingKey2022 --update
Update the index: uv run plugins/sociology-skillset/scripts/rag.py index

Deep exploration

Search for a topic: search "concept or question"
Read context of a promising hit: context <chunk_id>
Find similar passages across other papers: similar <chunk_id>
Read the full paper if needed: open the source_file path from results

When to Use RAG vs. Grep

Need	Tool
Conceptual/semantic search (find passages about a concept even if they don't use the exact words)	`rag.py search`
Exact keyword/phrase search (find specific terms, author names, method names)	`grep library/markdown/`
Metadata search (by author, year, journal)	`grep references.bib`

Both approaches complement each other. Use semantic search for exploratory discovery and grep for precise retrieval.

Technical Details

Embedding model: all-MiniLM-L6-v2 (384 dimensions, same as old Zotero RAG)
Vector store: ChromaDB with file-based persistence at library/.rag-index/
Chunking: Split by ## headers (section-level); fallback to ~512-token fixed chunks for headerless documents
Incremental indexing: Content hashes stored in metadata; unchanged files are skipped on re-index
Output format: JSON lines for easy parsing by Claude or other tools

zotero-rag