Document RAG Pipeline Skill

Overview

This skill creates a complete Retrieval-Augmented Generation (RAG) system from a folder of documents. It handles:

  • Regular PDF text extraction
  • OCR for scanned/image-based PDFs
  • DRM-protected file detection
  • Text chunking with overlap
  • Vector embedding generation
  • SQLite storage with full-text search
  • Semantic similarity search

Quick Start

# Install dependencies
pip install PyMuPDF pytesseract Pillow sentence-transformers numpy tqdm

# Build knowledge base
python build_knowledge_base.py /path/to/documents --embed

# Search documents
python build_knowledge_base.py /path/to/documents --search "your query"

When to Use

  • Building searchable knowledge bases from document folders
  • Processing technical standards libraries (API, ISO, ASME, etc.)
  • Creating semantic search over engineering documents
  • OCR processing of scanned historical documents
  • Any collection of PDFs needing intelligent search

Architecture

Document Folder
┌─────────────────────┐
│ 1. Build Inventory  │  SQLite catalog of all files
└──────────┬──────────┘
┌─────────────────────┐
│ 2. Extract Text     │  PyMuPDF for regular PDFs
└──────────┬──────────┘
┌─────────────────────┐
│ 3. OCR Scanned PDFs │  Tesseract + pytesseract
└──────────┬──────────┘
┌─────────────────────┐
│ 4. Chunk Text       │  1000 chars, 200 overlap
└──────────┬──────────┘
┌─────────────────────┐
│ 5. Generate Embeds  │  sentence-transformers
└──────────┬──────────┘
┌─────────────────────┐
│ 6. Semantic Search  │  Cosine similarity
└─────────────────────┘

Prerequisites

System Dependencies

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y tesseract-ocr tesseract-ocr-eng poppler-utils

# macOS
brew install tesseract poppler

# Verify Tesseract
tesseract --version  # Should print the installed version (4.x or 5.x)

Python Dependencies

pip install PyMuPDF pytesseract Pillow sentence-transformers numpy tqdm

Or with UV:

uv pip install PyMuPDF pytesseract Pillow sentence-transformers numpy tqdm

Implementation

Step 1: Database Schema

import sqlite3
from pathlib import Path
from datetime import datetime

def create_database(db_path):
    """Create SQLite database with full schema."""
    conn = sqlite3.connect(db_path, timeout=30)
    cursor = conn.cursor()

    # Documents table
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS documents (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            filename TEXT NOT NULL,
            filepath TEXT UNIQUE NOT NULL,
            file_size INTEGER,
            file_type TEXT,
            page_count INTEGER,
            extraction_method TEXT,  -- 'text', 'ocr', 'drm_protected', or an error string
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    ''')

    # Text chunks table
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS text_chunks (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            document_id INTEGER NOT NULL,
            chunk_num INTEGER NOT NULL,
            chunk_text TEXT NOT NULL,
            char_count INTEGER,
            embedding BLOB,
            embedding_model TEXT,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            FOREIGN KEY (document_id) REFERENCES documents(id),
            UNIQUE(document_id, chunk_num)
        )
    ''')

    # Create indexes
    cursor.execute('CREATE INDEX IF NOT EXISTS idx_chunks_doc_id ON text_chunks(document_id)')
    cursor.execute('CREATE INDEX IF NOT EXISTS idx_docs_filepath ON documents(filepath)')

    conn.commit()
    return conn
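
A quick sanity check after creating the database (a minimal sketch; the database filename is just an example):

conn = create_database("_inventory.db")
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print([row[0] for row in cursor.fetchall()])  # should include 'documents' and 'text_chunks'
conn.close()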

Step 2: PDF Text Extraction

import fitz  # PyMuPDF

def extract_pdf_text(pdf_path):
    """Extract text from PDF using PyMuPDF."""
    try:
        doc = fitz.open(pdf_path)
        text_parts = []

        for page_num in range(len(doc)):
            page = doc[page_num]
            text = page.get_text()
            if text.strip():
                text_parts.append(text)

        doc.close()
        full_text = "\n".join(text_parts)

        # Check if meaningful text extracted
        if len(full_text.strip()) < 100:
            return None, "no_text"

        return full_text, "text"

    except Exception as e:
        if "encrypted" in str(e).lower() or "drm" in str(e).lower():
            return None, "drm_protected"
        return None, f"error: {str(e)}"

Step 3: OCR for Scanned PDFs

import fitz
import pytesseract
from PIL import Image
import io

def ocr_pdf(pdf_path, dpi=200):
    """OCR scanned PDF using Tesseract."""
    try:
        doc = fitz.open(pdf_path)
        text_parts = []

        for page_num in range(len(doc)):
            page = doc[page_num]

            # Convert page to image
            mat = fitz.Matrix(dpi/72, dpi/72)
            pix = page.get_pixmap(matrix=mat)

            # Convert to PIL Image
            img_data = pix.tobytes("png")
            img = Image.open(io.BytesIO(img_data))

            # OCR with Tesseract
            text = pytesseract.image_to_string(img, lang='eng')
            if text.strip():
                text_parts.append(text)

        doc.close()
        full_text = "\n".join(text_parts)

        if len(full_text.strip()) < 100:
            return None, "ocr_failed"

        return full_text, "ocr"

    except Exception as e:
        return None, f"ocr_error: {str(e)}"

Step 4: Text Chunking

def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping chunks."""
    chunks = []
    start = 0
    text_len = len(text)

    while start < text_len:
        end = start + chunk_size
        chunk = text[start:end]

        # Try to break at sentence boundary
        if end < text_len:
            last_period = chunk.rfind('.')
            last_newline = chunk.rfind('\n')
            break_point = max(last_period, last_newline)

            if break_point > chunk_size * 0.7:
                chunk = text[start:start + break_point + 1]
                end = start + break_point + 1

        chunks.append(chunk.strip())

        # Stop at the end of the text; otherwise the overlap step would emit a
        # final chunk that merely duplicates the tail of the previous one.
        if end >= text_len:
            break

        start = end - overlap

    return chunks
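
A small illustrative check of the chunker: it confirms that consecutive chunks share overlapping text (dummy input, not real document data):

sample = "A sentence. " * 500  # ~6,000 characters of dummy text
chunks = chunk_text(sample, chunk_size=1000, overlap=200)
print(f"{len(chunks)} chunks, first sizes: {[len(c) for c in chunks[:3]]}")
# The tail of one chunk is repeated inside the next (should print True here)
print(chunks[0][-100:] in chunks[1])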

Step 5: Embedding Generation

from sentence_transformers import SentenceTransformer
import numpy as np
import pickle
import sqlite3
import os

# Force CPU mode (for CUDA compatibility issues)
os.environ["CUDA_VISIBLE_DEVICES"] = ""

def create_embeddings(db_path, model_name='all-MiniLM-L6-v2', batch_size=100):
    """Generate embeddings for all chunks without embeddings."""

    model = SentenceTransformer(model_name)
    conn = sqlite3.connect(db_path, timeout=30)
    cursor = conn.cursor()

    # Get chunks needing embeddings
    cursor.execute('''
        SELECT id, chunk_text FROM text_chunks
        WHERE embedding IS NULL
    ''')
    chunks = cursor.fetchall()

    print(f"Generating embeddings for {len(chunks)} chunks...")

    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i+batch_size]
        ids = [c[0] for c in batch]
        texts = [c[1] for c in batch]

        # Generate embeddings
        embeddings = model.encode(texts, normalize_embeddings=True)

        # Store as pickled numpy arrays
        for chunk_id, emb in zip(ids, embeddings):
            emb_blob = pickle.dumps(emb.astype(np.float32))
            cursor.execute('''
                UPDATE text_chunks
                SET embedding = ?, embedding_model = ?
                WHERE id = ?
            ''', (emb_blob, model_name, chunk_id))

        conn.commit()
        print(f"  Embedded {min(i+batch_size, len(chunks))}/{len(chunks)}")

    conn.close()
    print("Embedding complete!")

Step 6: Semantic Search

import os
import pickle
import sqlite3

import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_search(db_path, query, top_k=10, sample_size=50000):
    """Search for similar chunks using cosine similarity."""

    # Force CPU mode
    os.environ["CUDA_VISIBLE_DEVICES"] = ""

    model = SentenceTransformer('all-MiniLM-L6-v2')
    query_emb = model.encode(query, normalize_embeddings=True)

    conn = sqlite3.connect(db_path, timeout=30)
    cursor = conn.cursor()

    # Get chunks with embeddings (sample if large)
    cursor.execute('SELECT COUNT(*) FROM text_chunks WHERE embedding IS NOT NULL')
    total = cursor.fetchone()[0]

    if total > sample_size:
        # Random sample for large databases
        cursor.execute(f'''
            SELECT tc.id, tc.chunk_text, tc.embedding, d.filename
            FROM text_chunks tc
            JOIN documents d ON tc.document_id = d.id
            WHERE tc.embedding IS NOT NULL
            ORDER BY RANDOM()
            LIMIT {sample_size}
        ''')
    else:
        cursor.execute('''
            SELECT tc.id, tc.chunk_text, tc.embedding, d.filename
            FROM text_chunks tc
            JOIN documents d ON tc.document_id = d.id
            WHERE tc.embedding IS NOT NULL
        ''')

    results = []
    for chunk_id, text, emb_blob, filename in cursor.fetchall():
        emb = pickle.loads(emb_blob)

        # Cosine similarity (embeddings are normalized)
        similarity = np.dot(query_emb, emb)

        results.append({
            'id': chunk_id,
            'text': text[:500],  # Truncate for display
            'filename': filename,
            'score': float(similarity)
        })

    conn.close()

    # Sort by similarity
    results.sort(key=lambda x: x['score'], reverse=True)
    return results[:top_k]
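
Calling the search function directly from Python (a sketch; the database path and query are placeholders):

hits = semantic_search("/path/to/documents/_inventory.db", "fatigue analysis", top_k=5)
for hit in hits:
    print(f"[{hit['score']:.3f}] {hit['filename']}")
    print(f"  {hit['text'][:120]}...")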

Complete Pipeline Script

#!/usr/bin/env python3
"""
Document RAG Pipeline - Build searchable knowledge base from PDF folder.

Usage:
    python build_knowledge_base.py /path/to/documents --db inventory.db
    python build_knowledge_base.py /path/to/documents --search "query text"
"""

import argparse
import os
import sqlite3
from pathlib import Path
from tqdm import tqdm

# The functions from Steps 1-6 (create_database, extract_pdf_text, ocr_pdf,
# chunk_text, create_embeddings, semantic_search) must be defined above or
# imported into this script.

def build_inventory(folder_path, db_path):
    """Build document inventory from folder."""
    conn = create_database(db_path)
    cursor = conn.cursor()

    pdf_files = list(Path(folder_path).rglob("*.pdf"))
    print(f"Found {len(pdf_files)} PDF files")

    for pdf_path in tqdm(pdf_files, desc="Building inventory"):
        # Check if already processed
        cursor.execute('SELECT id FROM documents WHERE filepath = ?',
                       (str(pdf_path),))
        if cursor.fetchone():
            continue

        file_size = pdf_path.stat().st_size

        cursor.execute('''
            INSERT INTO documents (filename, filepath, file_size, file_type)
            VALUES (?, ?, ?, 'pdf')
        ''', (pdf_path.name, str(pdf_path), file_size))

    conn.commit()
    conn.close()

def process_documents(db_path, use_ocr=True):
    """Extract text from all unprocessed documents."""
    conn = sqlite3.connect(db_path, timeout=30)
    cursor = conn.cursor()

    # Get unprocessed documents
    cursor.execute('''
        SELECT id, filepath FROM documents
        WHERE extraction_method IS NULL
    ''')
    docs = cursor.fetchall()

    stats = {'text': 0, 'ocr': 0, 'failed': 0, 'drm': 0}

    for doc_id, filepath in tqdm(docs, desc="Extracting text"):
        # Try regular extraction first
        text, method = extract_pdf_text(filepath)

        # Try OCR if no text and OCR enabled
        if text is None and use_ocr and method == "no_text":
            text, method = ocr_pdf(filepath)

        if text:
            # Chunk and store
            chunks = chunk_text(text)
            for i, chunk in enumerate(chunks):
                cursor.execute('''
                    INSERT OR IGNORE INTO text_chunks
                    (document_id, chunk_num, chunk_text, char_count)
                    VALUES (?, ?, ?, ?)
                ''', (doc_id, i, chunk, len(chunk)))

            stats['text' if method == 'text' else 'ocr'] += 1
        else:
            if 'drm' in method:
                stats['drm'] += 1
            else:
                stats['failed'] += 1

        # Update document status
        cursor.execute('''
            UPDATE documents SET extraction_method = ? WHERE id = ?
        ''', (method, doc_id))

        conn.commit()

    conn.close()
    return stats

def main():
    parser = argparse.ArgumentParser(description='Document RAG Pipeline')
    parser.add_argument('folder', help='Folder containing documents')
    parser.add_argument('--db', default='_inventory.db', help='Database path')
    parser.add_argument('--no-ocr', action='store_true', help='Skip OCR')
    parser.add_argument('--embed', action='store_true', help='Generate embeddings')
    parser.add_argument('--search', help='Search query')
    parser.add_argument('--top-k', type=int, default=10, help='Number of results')

    args = parser.parse_args()

    db_path = Path(args.folder) / args.db

    if args.search:
        # Search mode
        results = semantic_search(str(db_path), args.search, args.top_k)
        print(f"\nTop {len(results)} results for: '{args.search}'\n")
        for i, r in enumerate(results, 1):
            print(f"{i}. [{r['score']:.3f}] {r['filename']}")
            print(f"   {r['text'][:200]}...\n")
    else:
        # Build mode
        print("Step 1: Building inventory...")
        build_inventory(args.folder, str(db_path))

        print("\nStep 2: Extracting text...")
        stats = process_documents(str(db_path), use_ocr=not args.no_ocr)
        print(f"Results: {stats}")

        if args.embed:
            print("\nStep 3: Generating embeddings...")
            create_embeddings(str(db_path))

if __name__ == '__main__':
    main()

Usage Examples

Build Knowledge Base

# Full pipeline with OCR and embeddings
python build_knowledge_base.py /path/to/documents --embed

# Skip OCR (faster, text PDFs only)
python build_knowledge_base.py /path/to/documents --no-ocr --embed

# Just build inventory (no extraction)
python build_knowledge_base.py /path/to/documents

Search Documents

# Semantic search
python build_knowledge_base.py /path/to/documents --search "subsea wellhead design"

# More results
python build_knowledge_base.py /path/to/documents --search "fatigue analysis" --top-k 20

Quick Search Script

#!/bin/bash
# search_docs.sh - Quick semantic search

DB_PATH="${1:-/path/to/_inventory.db}"
QUERY="$2"  # note: avoid single quotes in the query; it is interpolated into the Python source below

CUDA_VISIBLE_DEVICES="" python3 -c "
import sqlite3, pickle, numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
query_emb = model.encode('$QUERY', normalize_embeddings=True)

conn = sqlite3.connect('$DB_PATH')
cursor = conn.cursor()
cursor.execute('''
    SELECT tc.chunk_text, tc.embedding, d.filename
    FROM text_chunks tc
    JOIN documents d ON tc.document_id = d.id
    WHERE tc.embedding IS NOT NULL
    ORDER BY RANDOM() LIMIT 50000
''')

results = []
for text, emb_blob, filename in cursor.fetchall():
    emb = pickle.loads(emb_blob)
    sim = float(np.dot(query_emb, emb))
    results.append((sim, filename, text[:200]))

for score, fname, text in sorted(results, reverse=True)[:10]:
    print(f'[{score:.3f}] {fname}')
    print(f'  {text}...\n')
"

Execution Checklist

  • Install system dependencies (Tesseract, Poppler)
  • Install Python dependencies
  • Verify document folder exists
  • Run inventory to catalog documents
  • Extract text (with or without OCR)
  • Generate embeddings
  • Test semantic search
  • Monitor for DRM-protected files

Error Handling

Common Errors

Error: CUDA not available

  • Cause: CUDA driver issues or incompatible GPU
  • Solution: Force CPU mode with CUDA_VISIBLE_DEVICES=""

Error: Tesseract not found

  • Cause: Tesseract OCR not installed
  • Solution: Install with apt-get install tesseract-ocr or brew install tesseract

Error: DRM-protected files

  • Cause: FileOpen or other DRM encryption
  • Solution: Skip these files; they are recorded in the documents table with extraction_method = 'drm_protected' so they can be listed later

Error: SQLite database locked

  • Cause: Concurrent access without timeout
  • Solution: Use timeout=30 in sqlite3.connect()
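
If lock errors persist with several processes writing at once, SQLite's WAL journal mode lets readers and a single writer proceed concurrently. A hedged sketch (an optional tweak, not part of the pipeline script above):

import sqlite3

conn = sqlite3.connect(db_path, timeout=30)   # db_path as used elsewhere in this skill
conn.execute("PRAGMA journal_mode=WAL")       # readers no longer block on the writer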

Error: Out of memory

  • Cause: Large batch sizes or too many embeddings
  • Solution: Reduce batch_size, use sampling for search
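
For example, a smaller batch keeps peak memory low at the cost of throughput (an illustrative value, not a tuned one):

create_embeddings(str(db_path), batch_size=32)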

Metrics

  • Text extraction: ~50 pages/second
  • OCR processing: ~2-5 pages/minute
  • Embedding generation: ~100 chunks/second (CPU)
  • Search latency: <2 seconds (50K chunks)
  • Memory usage: ~2 GB for embeddings

Performance Metrics (Real-World)

From O&G Standards processing (957 documents):

  • Total documents: 957
  • Text extraction: 811 PDFs
  • OCR processed: 96 PDFs
  • DRM protected: 50 PDFs
  • Total chunks: 1,043,616
  • Embedding time: ~4 hours (CPU)
  • Search latency: <2 seconds

Related Skills

  • pdf-text-extractor - Just text extraction
  • semantic-search-setup - Just embeddings/search
  • rag-system-builder - Add LLM Q&A layer
  • knowledge-base-builder - Simpler document catalog

Version History

  • 1.1.0 (2026-01-02): Added Quick Start, Execution Checklist, Error Handling, Metrics sections; updated frontmatter with version, category, related_skills
  • 1.0.0 (2024-10-15): Initial release with OCR support, chunking, vector embeddings, semantic search