addon-jsonl-chunking-citations
SKILL.md
Add-on: JSONL Chunking + Citation Metadata
Use this skill to create the canonical intermediate representation:
- JSONL records (archival artifact in object storage)
- document_chunks rows (queryable in Postgres)
This is the foundation for extraction, validation, and audit-ready citations.
Inputs
Collect:
- CHUNK_TARGET_CHARS: default 1200.
- CHUNK_MIN_CHARS: default 400.
- CHUNK_OVERLAP_CHARS: default 120 (only if using sliding windows).
- CHUNK_STRATEGY: paragraph (default) or window.
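One way to carry these inputs through the pipeline is a small frozen config object. The sketch below is illustrative only: the class name ChunkConfig, its field names, and the validation rules are assumptions, not part of the skill; only the four parameters and their defaults come from the list above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ChunkConfig:
    """Hypothetical container for the chunking inputs listed above."""

    target_chars: int = 1200      # CHUNK_TARGET_CHARS
    min_chars: int = 400          # CHUNK_MIN_CHARS
    overlap_chars: int = 120      # CHUNK_OVERLAP_CHARS (window strategy only)
    strategy: str = "paragraph"   # CHUNK_STRATEGY: "paragraph" or "window"

    def __post_init__(self) -> None:
        # These checks are assumptions about sensible values, not stated rules.
        if self.strategy not in ("paragraph", "window"):
            raise ValueError(f"unknown CHUNK_STRATEGY: {self.strategy!r}")
        if not 0 < self.min_chars <= self.target_chars:
            raise ValueError("need 0 < CHUNK_MIN_CHARS <= CHUNK_TARGET_CHARS")
        if self.strategy == "window" and not 0 <= self.overlap_chars < self.target_chars:
            raise ValueError("need 0 <= CHUNK_OVERLAP_CHARS < CHUNK_TARGET_CHARS")

    def as_metadata(self) -> dict:
        """Return the parameters as a plain dict, e.g. for a record's metadata."""
        return asdict(self)
```

Freezing the dataclass keeps the parameters from changing mid-run, which helps when the same values are later recorded as strategy params in each record's metadata.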
JSONL Record Shape (Required)
Each JSONL line must include:
- document_id
- page (1-based)
- chunk_id (stable within document; e.g., p{page}_c{chunk_index})
- chunk_index (0-based within page)
- text
- char_start, char_end (span markers relative to the chosen page artifact)
- heading (nullable)
- metadata (parser name/version, cleaned flag, strategy params)
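A record with these fields could be built and serialized roughly as follows. This is a minimal sketch: the helper names make_record and to_jsonl_line are hypothetical, and treating char_end as an exclusive offset is an assumption the skill does not state.

```python
import json
from typing import Any, Optional


def make_record(document_id: str, page: int, chunk_index: int, text: str,
                char_start: int, char_end: int,
                heading: Optional[str] = None,
                metadata: Optional[dict] = None) -> dict:
    """Build one record with the required fields (hypothetical helper)."""
    if page < 1:
        raise ValueError("page is 1-based")
    if chunk_index < 0:
        raise ValueError("chunk_index is 0-based")
    if not 0 <= char_start <= char_end:
        raise ValueError("need 0 <= char_start <= char_end")
    return {
        "document_id": document_id,
        "page": page,
        "chunk_id": f"p{page}_c{chunk_index}",  # example id format from the list above
        "chunk_index": chunk_index,
        "text": text,
        "char_start": char_start,  # offset into the page string used for chunking
        "char_end": char_end,      # assumed exclusive, so page[start:end] == text
        "heading": heading,        # nullable
        "metadata": metadata or {},
    }


def to_jsonl_line(record: dict) -> str:
    """Serialize to a single line; json.dumps escapes newlines inside strings."""
    return json.dumps(record, ensure_ascii=False, sort_keys=True)
```

For example, make_record("doc-1", 2, 0, "Hello", 0, 5, metadata={"cleaned": True}) yields a record whose chunk_id is p2_c0.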
Generation Workflow
- Input: document_pages.clean_markdown (preferred) or raw_markdown (only if cleanup not run yet).
- Chunk per page:
  - paragraph strategy: split on blank lines; merge/split to hit CHUNK_TARGET_CHARS
  - window strategy: sliding window over the page string
- For each chunk, compute deterministic span markers:
  - char_start/char_end in the page string used for chunking
- Emit JSONL line and insert/update a document_chunks row.
- Store the JSONL file in object storage at documents/jsonl/{document_id}.jsonl.
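The chunking and span steps above can be sketched as below. This is one possible reading, not the skill's implementation: all function names are hypothetical, the merge is greedy, oversized paragraphs are hard-split at the target length, and CHUNK_MIN_CHARS handling is left out for brevity. Spans are half-open offsets into the page string, so page_text[char_start:char_end] always equals the chunk text.

```python
import re


def _trim(text, start, end):
    """Shrink [start, end) so it neither starts nor ends with whitespace."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


def _paragraph_spans(page_text):
    """Return (start, end) offsets of blank-line-delimited paragraphs."""
    spans, pos = [], 0
    for sep in re.finditer(r"\n[ \t]*\n\s*", page_text):
        spans.append(_trim(page_text, pos, sep.start()))
        pos = sep.end()
    spans.append(_trim(page_text, pos, len(page_text)))
    return [(s, e) for s, e in spans if e > s]


def paragraph_chunks(page_text, target=1200):
    """Greedily merge paragraphs up to `target` chars; hard-split oversized ones."""
    spans, cur = [], None
    for s, e in _paragraph_spans(page_text):
        if cur is not None and e - cur[0] <= target:
            cur = (cur[0], e)  # merged span keeps the blank line in between
            continue
        if cur is not None:
            spans.append(cur)
        while e - s > target:  # paragraph alone exceeds the target
            spans.append((s, s + target))
            s += target
        cur = (s, e)
    if cur is not None:
        spans.append(cur)
    return spans


def window_chunks(page_text, target=1200, overlap=120):
    """Sliding window of `target` chars, stepping by target - overlap."""
    if not 0 <= overlap < target:
        raise ValueError("need 0 <= overlap < target")
    spans, start = [], 0
    while start < len(page_text):
        end = min(start + target, len(page_text))
        spans.append((start, end))
        if end == len(page_text):
            break
        start += target - overlap
    return spans


def chunk_page(page_text, strategy="paragraph", target=1200, overlap=120):
    """Return chunk dicts with 0-based chunk_index and span markers."""
    if strategy == "paragraph":
        spans = paragraph_chunks(page_text, target)
    elif strategy == "window":
        spans = window_chunks(page_text, target, overlap)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [
        {"chunk_index": i, "char_start": s, "char_end": e, "text": page_text[s:e]}
        for i, (s, e) in enumerate(spans)
    ]
```

Because each chunk is a slice of the page string rather than a re-joined copy, a citation can be verified later by re-slicing the stored page text with the recorded offsets.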
Guardrails
- Use cleaned text for chunking whenever available; keep provenance in metadata.cleaned=true.
- Ensure chunk ids are stable across reruns given the same page text + parameters.
- Do not store only the JSONL blob in Postgres; ingest rows for query use.
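The stability guardrail can be checked mechanically by fingerprinting the emitted JSONL and comparing reruns. The sketch below assumes records contain no run-specific values (timestamps, random ids); the helper names and the fixed-width splitter standing in for a real chunker are hypothetical.

```python
import hashlib
import json


def build_records(document_id, page, page_text, width=50):
    """Toy stand-in chunker: non-overlapping windows of `width` chars."""
    records = []
    for i, start in enumerate(range(0, len(page_text), width)):
        end = min(start + width, len(page_text))
        records.append({
            "document_id": document_id,
            "page": page,
            "chunk_id": f"p{page}_c{i}",  # derived only from page and index
            "chunk_index": i,
            "text": page_text[start:end],
            "char_start": start,
            "char_end": end,
            "heading": None,
            "metadata": {"cleaned": True, "strategy": "window",
                         "target_chars": width, "overlap_chars": 0},
        })
    return records


def render_jsonl(records) -> bytes:
    """Serialize with sorted keys and fixed separators so output is byte-stable."""
    lines = [
        json.dumps(r, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
        for r in records
    ]
    return ("\n".join(lines) + "\n").encode("utf-8") if lines else b""


def fingerprint(records) -> str:
    """SHA-256 of the rendered JSONL; equal inputs give equal digests."""
    return hashlib.sha256(render_jsonl(records)).hexdigest()
```

Running the chunker twice on the same page text and parameters should give identical fingerprints; a mismatch suggests something non-deterministic has leaked into the ids or metadata.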
Decision Justification Rule
- Every non-trivial decision must include a concrete justification.
- Capture the alternatives considered and why they were rejected.
- State tradeoffs and residual risks for the chosen option.
- If justification is missing, treat the task as incomplete and surface it as a blocker.
Repository
ajrlewis/ai-skills