skills/ajrlewis/ai-skills/addon-rag-ingestion-pipeline

addon-rag-ingestion-pipeline

SKILL.md

Add-on: Multi-Format RAG Ingestion Pipeline

Use this skill when an existing project needs RAG ingestion/retrieval across multiple document formats.

Compatibility

  • Works with architect-python-uv-batch.
  • Works with architect-python-uv-fastapi-sqlalchemy.
  • Can back a Next.js app via a Python worker service.

Inputs

Collect:

  • SOURCE_FORMATS: one or more of pdf, markdown, txt, html, csv.
  • EMBED_PROVIDER: openai or sentence-transformers.
  • VECTOR_STORE: pgvector, chroma, or existing vector layer.
  • CHUNK_SIZE: default 1000.
  • CHUNK_OVERLAP: default 150.
  • TOP_K: default 5.

Integration Workflow

  1. Add dependencies (Python worker path):
uv add pypdf markdown-it-py beautifulsoup4 pandas langchain-text-splitters
  • If EMBED_PROVIDER=openai: uv add openai
  • If EMBED_PROVIDER=sentence-transformers: uv add sentence-transformers
  • If VECTOR_STORE=chroma: uv add chromadb
  1. Add modules:
src/{{MODULE_NAME}}/rag/
  loaders/pdf_loader.py
  loaders/markdown_loader.py
  loaders/text_loader.py
  loaders/html_loader.py
  loaders/csv_loader.py
  normalize.py
  chunking.py
  embeddings.py
  indexer.py
  retriever.py
  1. Use a normalized document contract:
  • document_id
  • source_path
  • source_type
  • content
  • metadata (filename/page/section/checksum/ingested_at/model_version)
  1. Implement ingestion entrypoint:
uv run {{PROJECT_NAME}} rag-ingest --source ./data/inbox --formats pdf,markdown,txt
  1. Implement retrieval entrypoint:
uv run {{PROJECT_NAME}} rag-query --q "question" --top-k 5
  • Ensure both commands are wired in the project CLI/script entrypoint.
  • rag-query depends on an existing index from rag-ingest; do not run these validation commands in parallel.

Loader Notes

  • PDF: extract per page and keep page_number in metadata.
  • Markdown: keep heading hierarchy and section anchors in metadata.
  • Text: detect encoding fallback (utf-8, then latin-1).
  • HTML: strip script/style tags and preserve title/headings where possible.
  • CSV: convert rows into stable textual records with row identifiers.

Minimal Defaults

normalize.py

import re
import unicodedata


def normalize_text(raw: str) -> str:
    text = unicodedata.normalize("NFKC", raw)
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

chunking.py

from langchain_text_splitters import RecursiveCharacterTextSplitter


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    return splitter.split_text(text)

Guardrails

  • Documentation contract for generated code:

    • Python: write module docstrings and docstrings for public classes, methods, and functions.
    • Next.js/TypeScript: write JSDoc for exported components, hooks, utilities, and route handlers.
    • Add concise rationale comments only for non-obvious logic, invariants, or safety constraints.
    • Apply this contract even when using template snippets below; expand templates as needed.
  • Deduplicate ingestion by checksum to keep re-runs idempotent.

  • Store embedding model/version so re-indexing can be reasoned about.

  • Never interpolate user queries into raw SQL vector search.

  • Keep ingestion async/offline for large corpora; do not block request-response paths.

  • Preserve citation metadata for retrieval (source_path, section, page, row id).

Validation Checklist

  • Confirm generated code includes required docstrings/JSDoc and rationale comments for non-obvious logic.
uv run {{PROJECT_NAME}} rag-ingest --source ./data/inbox --formats pdf,markdown
uv run {{PROJECT_NAME}} rag-query --q "smoke test" --top-k 5
uv run pytest -q

Decision Justification Rule

  • Every non-trivial decision must include a concrete justification.
  • Capture the alternatives considered and why they were rejected.
  • State tradeoffs and residual risks for the chosen option.
  • If justification is missing, treat the task as incomplete and surface it as a blocker.
Weekly Installs
11
First Seen
Feb 27, 2026
Installed on
github-copilot11
codex11
kimi-cli11
gemini-cli11
cursor11
opencode11