Schema Normalizer (NO PROSE)

Purpose: prevent schema drift across JSONL artifacts, a common failure mode in skills-first pipelines.

When fields are inconsistent (missing ids/titles, mixed citation-key formats), downstream skills fall back on best-effort joins and fragile parsing. This skill makes the interface explicit and deterministic.

Inputs

outline/outline.yml (source of truth for section/subsection ids + titles)
Optional (for citation-key sanity): citations/ref.bib
Default JSONL artifacts to normalize (arxiv-survey(-latex) C4 bridge):
- outline/subsection_briefs.jsonl
- outline/chapter_briefs.jsonl
- outline/evidence_bindings.jsonl
- outline/evidence_drafts.jsonl
- outline/anchor_sheet.jsonl
Optional (run after writer packs are generated):
- outline/writer_context_packs.jsonl

Outputs

output/SCHEMA_NORMALIZATION_REPORT.md (always written; PASS/FAIL + what changed)
The processed JSONL files are normalized in place (a .bak.* backup is created if changes are applied).
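
The in-place rewrite with a backup could look roughly like the sketch below. It is illustrative only: the helper name write_with_backup and the Unix-timestamp suffix are assumptions, since the text above specifies only that the backup matches .bak.* and is created when changes are applied.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def write_with_backup(path: Path, new_text: str) -> Optional[Path]:
    """Overwrite path with new_text, keeping a backup only if content changes.

    Hypothetical helper: the timestamp suffix is an assumed naming scheme.
    Returns the backup path, or None when nothing needed to change.
    """
    old_text = path.read_text(encoding="utf-8")
    if old_text == new_text:
        return None  # already normalized: no backup, no write
    backup = path.with_name(f"{path.name}.bak.{int(time.time())}")
    shutil.copy2(path, backup)  # preserve the original before overwriting
    path.write_text(new_text, encoding="utf-8")
    return backup
```

Returning None for unchanged files keeps repeated runs from piling up backups.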

What gets normalized

1) IDs + titles (join keys)

For any record with sub_id: "<H2>.<H3>":

Ensure section_id exists (derived from the prefix before the dot)
Ensure title and section_title exist (filled from outline/outline.yml)

For any record with section_id: "<H2>":

Ensure section_title exists (filled from outline/outline.yml)
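
As a rough sketch of the two rules above, the function below fills the join keys on a single record. The function name and the two lookup dicts (section id to title, subsection id to title) are hypothetical; they stand in for whatever the skill builds after parsing outline/outline.yml. Treating an empty value the same as a missing one is also an assumption.

```python
from typing import Dict


def fill_ids_and_titles(
    record: dict,
    section_titles: Dict[str, str],
    subsection_titles: Dict[str, str],
) -> dict:
    """Return a copy of record with section_id, title, section_title filled.

    Existing non-empty values are never overwritten.
    """
    out = dict(record)
    sub_id = str(out.get("sub_id") or "")

    # Rule 1: derive section_id from the prefix before the dot in sub_id.
    if "." in sub_id and not out.get("section_id"):
        out["section_id"] = sub_id.split(".", 1)[0]

    # Rule 1: fill the subsection title from the outline lookup.
    if sub_id and not out.get("title") and sub_id in subsection_titles:
        out["title"] = subsection_titles[sub_id]

    # Rules 1 and 2: fill section_title for any record with a section_id.
    section_id = str(out.get("section_id") or "")
    if section_id and not out.get("section_title") and section_id in section_titles:
        out["section_title"] = section_titles[section_id]

    return out
```

For example, a record containing only sub_id "3.2" would gain section_id "3" plus whatever titles the outline lists for "3.2" and "3".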

2) Citation key format (reduce parsing drift)

Within these C2-C4 JSONL artifacts, normalize citation keys so they are raw BibTeX keys (no @ prefix):

"citations": ["smith2023", "jones2024"]

Notes:

Final prose still uses Markdown citations: [@smith2023].
This skill does not add/remove citations; it only normalizes formatting.
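
A minimal sketch of the key normalization follows. The function name is hypothetical, and stripping square brackets (for keys stored as "[@smith2023]") is an assumption beyond the @ prefix named above. The mapping is one-to-one, so no citation is added or removed.

```python
from typing import Iterable, List


def normalize_citation_keys(keys: Iterable[str]) -> List[str]:
    """Map each key to its raw BibTeX form, preserving order and count."""
    normalized = []
    for key in keys:
        cleaned = str(key).strip()
        # Assumed variants: "[@key]" and "@key" both reduce to "key".
        cleaned = cleaned.strip("[]").lstrip("@").strip()
        normalized.append(cleaned)
    return normalized
```

With this sketch, ["@smith2023", "[@jones2024]", "lee2022"] would become ["smith2023", "jones2024", "lee2022"].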

When to run

Recommended placement in arxiv-survey(-latex):

Run after evidence-draft + anchor-sheet and before writer-context-pack + evidence-selfloop.
This ensures outline/evidence_drafts.jsonl and outline/anchor_sheet.jsonl are schema-stable before drafting packs are built.

Failure modes

If outline/outline.yml is missing or cannot be parsed, the skill FAILs.
If any target JSONL contains invalid JSON lines, the skill reports them and FAILs (do not proceed on corrupted artifacts).
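
The invalid-line check could be sketched as below, using only the standard json module. The function name is hypothetical, and skipping blank lines rather than flagging them is an assumption about how the skill treats them.

```python
import json
from typing import List, Tuple


def find_invalid_jsonl_lines(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, error_message) for every line that is not valid JSON."""
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumed: blank lines are ignored, not reported
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            bad.append((lineno, exc.msg))
    return bad
```

A caller would report each entry and FAIL when the returned list is non-empty, instead of normalizing the remaining lines.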

Script (optional)

Quick Start

Show the available options:
- python .codex/skills/schema-normalizer/scripts/run.py --help
Normalize the C4 bridge artifacts:
- python .codex/skills/schema-normalizer/scripts/run.py --workspace workspaces/<ws>

All Options

--workspace <dir>
--unit-id <U###>
--inputs <semicolon-separated>
--outputs <semicolon-separated>
--checkpoint <C#>

Examples

Normalize the default C4 artifacts (ids/titles + citations format):
- python .codex/skills/schema-normalizer/scripts/run.py --workspace workspaces/<ws> --inputs "outline/outline.yml;citations/ref.bib;outline/subsection_briefs.jsonl;outline/chapter_briefs.jsonl;outline/evidence_bindings.jsonl;outline/evidence_drafts.jsonl;outline/anchor_sheet.jsonl" --outputs "output/SCHEMA_NORMALIZATION_REPORT.md"
Normalize writer packs too (if you are running this after writer-context-pack):
- python .codex/skills/schema-normalizer/scripts/run.py --workspace workspaces/<ws> --inputs "outline/outline.yml;citations/ref.bib;outline/writer_context_packs.jsonl" --outputs "output/SCHEMA_NORMALIZATION_REPORT.md"
