# Spec Distill — Protocol Specification to Claude Code Skill
## Overview
Transforms a large protocol specification (markdown, 3,000–9,000 lines) into a compact Claude Code skill (25–40KB) through 5 phases: Survey, Extraction, Synthesis, Compression, and Package. The output is a ready-to-install skill under `.claude/skills/<spec-name>/`.

- Input: Path to a markdown specification file (e.g., `scripts/markdown/cesr-specification.md`)
- Output: `.claude/skills/<spec-name>/SKILL.md` + `references/*.md`
## Workflow
### Phase 1 — Survey
Goal: Understand the spec's structure, classify content, plan reference file organization.
1. Parse the input argument to get the spec markdown path. Derive `spec-name` from the filename (e.g., `cesr-specification.md` → `cesr`; `keri-specification.md` → `keri`).
2. Create staging directories: `scripts/staging/distill-<spec-name>/` with `extractions/` and `synthesis/` subdirectories.
3. Read the first 200 lines of the spec for title, version, front matter, and document overview.
4. Build a heading index by grepping for all headings with line numbers. Grep pattern: `^#{2,4}`. This gives the structural skeleton of the entire document.
5. Using the heading index, compute section boundaries (start line → next heading start line) and classify each section into one of these content types:

   | Content Type | What It Contains | Examples |
   |---|---|---|
   | data-structures | Message formats, field definitions, serialization layouts | Event fields, code tables, CESR primitives |
   | rules | Normative requirements (MUST/SHOULD/MAY) | Validation rules, processing requirements |
   | algorithms | Step-by-step computation procedures | Digest computation, key derivation |
   | state-machines | State transitions, protocol flows | KEL processing, escrow states |
   | terminology | Defined terms, abbreviations, glossary | Protocol-specific vocabulary |
   | skip | Boilerplate to exclude | Copyright, bibliography, terms of use, acknowledgments |

6. Group sections into chunks of 500–800 lines, aligned to section boundaries:
   - Never split a section across chunks
   - If a section exceeds 800 lines, split at H4/H5 sub-boundaries within it
   - Sections classified as `skip` are excluded from chunks entirely
   - Record each chunk as `{chunk_id, start_line, end_line, sections[], content_types[]}`
7. Propose reference file organization. Group related content types into implementation concerns — these will become the final reference files. Example groupings:
   - `encoding.md` — CESR encoding tables, code points, primitive formats
   - `validation.md` — All MUST/SHOULD/MAY rules, verification procedures
   - `event-processing.md` — State machines, event flows, escrow logic
   - `data-formats.md` — Message structures, field tables, serialization
   - `terminology.md` — Protocol terms and abbreviations (if substantial)
8. Write `staging/distill-<spec-name>/survey.md` containing:
   - Document title and overview (from step 3)
   - Full heading index with line numbers and content type classifications
   - Chunk plan table (chunk_id, lines, sections, types)
   - Proposed reference file groupings
9. INTERACTIVE CHECKPOINT: Present to the user:
   - Concept map: table of top-level sections with content type, line count, chunk assignment
   - Proposed reference files with what content maps to each
   - Skip list (sections being excluded)
   - Ask: "Does this grouping look right? Should any sections be reclassified or reference files renamed/split/merged?"
10. Incorporate user feedback and write the approved plan to `staging/distill-<spec-name>/plan.md`.
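The survey's mechanical core (heading index, section boundaries, greedy chunk grouping) can be sketched in Python. This is an illustrative sketch, not part of the skill: the name `plan_chunks` is invented here, content-type classification is omitted because it requires reading section content, and oversized single sections are left whole rather than split at H4/H5 sub-boundaries.

```python
import re

def plan_chunks(spec_text, target=800):
    """Index H2-H4 headings, derive section boundaries, and group
    sections into chunks of roughly `target` lines, never splitting
    a section across chunks."""
    lines = spec_text.splitlines()
    headings = [(i + 1, m.group(2)) for i, line in enumerate(lines)
                if (m := re.match(r"^(#{2,4})\s+(.*)", line))]
    # Section boundary: start line -> line before the next heading
    sections = [(start,
                 headings[i + 1][0] - 1 if i + 1 < len(headings) else len(lines),
                 title)
                for i, (start, title) in enumerate(headings)]
    chunks, current = [], []
    for start, end, title in sections:
        if current and end - current[0][0] > target:
            chunks.append(current)  # flush before the chunk grows too large
            current = []
        current.append((start, end, title))
    if current:
        chunks.append(current)
    return [{"chunk_id": n + 1, "start_line": c[0][0],
             "end_line": c[-1][1], "sections": [t for _, _, t in c]}
            for n, c in enumerate(chunks)]
```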
### Phase 2 — Extraction
Goal: Extract structured data from each chunk using parallel Task agents.
1. For each chunk in `plan.md`, launch a Task agent (`subagent_type: "general-purpose"`) with this prompt:

   ```
   You are extracting structured data from a protocol specification.

   Read the file: <spec-path>
   Line range: <start_line> to <end_line> (use Read tool with offset/limit)
   Content type hints: <content_types from survey>

   Extract into the 6 categories defined in the extraction template
   (read: .claude/skills/spec-distill/references/extraction-template.md).

   Write output to: <staging>/extractions/<NN>-<slug>.md

   Rules:
   - Preserve normative keywords (MUST, SHOULD, MAY) exactly
   - Use tables, not prose
   - Flag ambiguities with [AMBIGUOUS: ...]
   - Note cross-references with [XREF: section/concept]
   - If a category has no content in this chunk, omit it
   ```

   Launch chunks in parallel where possible (batches of 4–6 concurrent agents).
2. After all agents complete, verify every extraction file exists and is non-empty. Glob: `staging/distill-<spec-name>/extractions/*.md`
3. If any extraction is missing or empty, re-run that specific chunk.
4. No interactive checkpoint — extraction is mechanical.
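A minimal sketch of the completeness check, assuming the `<NN>-<slug>.md` naming convention above. `missing_chunks` is an illustrative helper, not a defined tool; its return value is the set of chunk IDs to re-run.

```python
from pathlib import Path

def missing_chunks(staging_dir, expected_ids):
    """Return chunk IDs whose extraction file is absent or empty.
    Files are expected to follow the <NN>-<slug>.md convention."""
    extractions = Path(staging_dir) / "extractions"
    present = {int(p.name.split("-", 1)[0])
               for p in extractions.glob("*.md")
               if p.stat().st_size > 0}  # empty files count as missing
    return sorted(set(expected_ids) - present)
```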
### Phase 3 — Synthesis
Goal: Merge extractions by implementation concern, deduplicate, resolve cross-references.
1. Read all extraction files from `staging/distill-<spec-name>/extractions/`.
2. For each implementation concern from `plan.md` (the approved reference file groupings):
   - Pull relevant items from all extraction files that match this concern
   - Merge items of the same category (e.g., all Rules tables into one)
   - Deduplicate: remove exact duplicates, merge near-duplicates keeping the more specific version
   - Resolve `[XREF:]` markers: replace with inline content or specific section references
   - Flag unresolved ambiguities: keep `[AMBIGUOUS:]` markers for user review
3. Write one synthesis file per concern: `staging/distill-<spec-name>/synthesis/<concern>.md`
4. INTERACTIVE CHECKPOINT: Present to the user:
   - Summary table: concern name, file size, count of rules/structures/algorithms/terms
   - Cross-reference resolution status (how many resolved vs. remaining)
   - Any `[AMBIGUOUS:]` markers found — list them for user decision
   - Ask: "Review the synthesis. Should anything be moved between files, split, or merged? How should the ambiguities be resolved?"
5. Incorporate user feedback and update synthesis files.
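The deduplication rule can be sketched as follows. Two simplifying assumptions are made here that the skill itself does not: "more specific" is approximated as "longer", and near-duplicates are bucketed by a normalized 20-character prefix.

```python
def dedup_rules(rules, prefix=20):
    """Drop exact duplicates; among near-duplicates (same normalized
    prefix), keep the longer, assumed-more-specific version."""
    kept = {}
    for rule in rules:
        # Collapse whitespace and case, then bucket by prefix
        key = " ".join(rule.lower().split())[:prefix]
        if key not in kept or len(rule) > len(kept[key]):
            kept[key] = rule
    return list(kept.values())
```

In practice near-duplicates are resolved by judgment during synthesis; this only illustrates the keep-the-more-specific rule.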
### Phase 4 — Compression
Goal: Transform verbose synthesis into compact, actionable reference files.
1. For each synthesis file, read the compression formats reference: `.claude/skills/spec-distill/references/compression-formats.md`
2. Apply the appropriate compression format based on content:

   | Content | Compression Format |
   |---|---|
   | Validation logic with conditionals | Decision trees (indented if/then/else) |
   | Field/parameter definitions | Compact tables (name, type, constraints per row) |
   | Multi-step procedures | Ordered checklists |
   | Data structures/messages | Annotated code templates |
   | Scattered warnings/gotchas | Anti-pattern DO/DON'T pairs |
   | State transitions | Compact `[state] --event--> [state]` notation |

3. Drop: boilerplate, motivation/rationale paragraphs, history, bibliography, and examples that don't add information beyond what the rule already states.
4. Keep: every MUST/SHOULD/MAY statement, every field name and constraint, every algorithm step, every state transition, every error condition.
5. Target: 5–15KB per reference file. If a file exceeds 15KB, split it by sub-concern.
6. Write compressed output directly to the final skill location: `.claude/skills/<spec-name>/references/<concern>.md`
7. No interactive checkpoint — compression is mechanical, but the final package review (Phase 5) covers quality.
### Phase 5 — Package
Goal: Generate the SKILL.md and validate the complete skill package.
1. Generate `.claude/skills/<spec-name>/SKILL.md` with this structure:

   YAML frontmatter:

   ```yaml
   ---
   name: <spec-name>
   description: <Specific description for auto-activation. State the protocol name, what kind of code it applies to, and key concepts. Model after style's description — specific enough to activate only when relevant.>
   ---
   ```

   Body sections:
   - Overview (2–3 sentences): What this spec defines, its scope, key concepts
   - Quick Reference: The most critical tables/decision trees inlined directly (the content a developer needs most often — field definitions, primary code tables, core validation rules)
   - Reference File Index: List each reference file with a 1–2 line summary of contents
   - Validation Checklist: Checkbox list of things to verify when implementing this spec
   - Anti-Patterns: Common mistakes when implementing this spec (DON'T/DO format)
2. Validate the package:
   - Total size across all files: must be 25–40KB
   - No single file exceeds 15KB
   - SKILL.md has valid YAML frontmatter
   - Every reference file listed in the index actually exists
   - Glob for all `.md` files in the skill directory and report sizes
3. INTERACTIVE CHECKPOINT: Present to the user:
   - File tree with sizes
   - Total size
   - SKILL.md frontmatter (name + description)
   - Any size violations
   - Ask: "Does this look complete? Ready to clean up staging files?"
4. On user approval, optionally clean up staging: `rm -rf scripts/staging/distill-<spec-name>/`
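The size checks in step 2 can be sketched as a small pass over the skill directory. This is a sketch with an invented name (`validate_package`); the budget numbers come from the Size Budget table, and frontmatter/index checks are left out for brevity.

```python
from pathlib import Path

def validate_package(skill_dir, total_range=(25_000, 40_000), max_file=15_000):
    """Check the size budget: total 25-40KB, no single file over 15KB.
    Returns (ok, report) where report lists each .md file and its size."""
    sizes = {str(p.relative_to(skill_dir)): p.stat().st_size
             for p in Path(skill_dir).rglob("*.md")}
    total = sum(sizes.values())
    oversized = [name for name, size in sizes.items() if size > max_file]
    ok = total_range[0] <= total <= total_range[1] and not oversized
    return ok, {"total": total, "files": sizes, "oversized": oversized}
```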
## Chunking Strategy
Specs in this repo range from 3,524 lines (CESR) to 8,891 lines (ACDC). The Read tool defaults to 2,000 lines per call. Strategy:
- Index pass: Use Grep for `^#{2,4}` headings to get the full skeleton without reading the entire file
- Target chunks: 500–800 lines, aligned to section boundaries from the heading index
- Large sections (>800 lines): Split at H4/H5 sub-heading boundaries within the section
- Skip list: Always exclude these sections from extraction:
  - Copyright / License / Terms of Use
  - Bibliography / References / Normative References / Informative References
  - Acknowledgments / Contributors
  - Table of Contents (auto-generated)
  - Change History / Revision Log
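The skip list can be expressed as a handful of case-insensitive patterns matched against heading titles. The pattern set below mirrors the bullets above as an illustration; it is not an exhaustive ruleset.

```python
import re

# One pattern per skip-list bullet, matched case-insensitively
SKIP_PATTERNS = [
    r"copyright|license|terms of use",
    r"bibliography|references",
    r"acknowledg|contributors",
    r"table of contents",
    r"change history|revision log",
]

def is_skippable(heading):
    """True if a section heading falls on the skip list."""
    title = heading.lower()
    return any(re.search(p, title) for p in SKIP_PATTERNS)
```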
## Size Budget
| Component | Target Size |
|---|---|
| SKILL.md | 4–8KB |
| Each reference file | 5–15KB |
| Total skill package | 25–40KB |
| Number of reference files | 3–6 |
If the total exceeds 40KB, go back to Phase 4 and compress more aggressively. If under 25KB, check that no normative content was dropped.
## Error Recovery
- Extraction agent fails: Re-read the chunk boundaries from `plan.md` and re-launch just that agent
- Synthesis too large: Split the concern into sub-concerns (e.g., `validation-events.md` + `validation-credentials.md`)
- Compression drops normative content: Grep the synthesis for MUST/SHOULD/MAY counts, then grep the compressed output — counts should match
- Size budget violated: Adjust by splitting large files or compressing further; never drop normative statements to meet size targets
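The normative-count check from the compression-recovery bullet can be sketched like this. Counting longer keywords first avoids double-counting "MUST NOT" as a bare "MUST".

```python
import re

# Longest phrases first, so MUST NOT is consumed before MUST
NORMATIVE = ("MUST NOT", "SHOULD NOT", "MUST", "SHOULD", "MAY")

def normative_counts(text):
    """Count normative keywords, removing each match so that shorter
    keywords are not counted inside longer ones."""
    counts, remaining = {}, text
    for kw in NORMATIVE:
        pattern = r"\b" + kw + r"\b"
        counts[kw] = len(re.findall(pattern, remaining))
        remaining = re.sub(pattern, "", remaining)
    return counts
```

Comparing `normative_counts(synthesis_text)` against `normative_counts(compressed_text)` keyword by keyword flags any lost normative content.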