document-audit-extraction
Document Audit & Feature Extraction
Systematically inventory, evaluate, and extract structured data from document collections.
Audit Framework
Four-Phase Audit
Phase 1: Inventory → What exists?
Phase 2: Classify → What type is each document?
Phase 3: Evaluate → What's the quality?
Phase 4: Extract → What structured data can we pull?
Phase 1: Inventory
Automated Inventory
from pathlib import Path
from dataclasses import dataclass
@dataclass
class DocumentEntry:
path: str
name: str
extension: str
size_bytes: int
modified: str
word_count: int
has_frontmatter: bool
def inventory_documents(root: str, patterns: list[str] = ["*.md", "*.txt", "*.yaml"]) -> list[DocumentEntry]:
entries = []
for pattern in patterns:
for path in Path(root).rglob(pattern):
content = path.read_text(errors="ignore")
entries.append(DocumentEntry(
path=str(path.relative_to(root)),
name=path.stem,
extension=path.suffix,
size_bytes=path.stat().st_size,
modified=path.stat().st_mtime,
word_count=len(content.split()),
has_frontmatter=content.startswith("---"),
))
return entries
Inventory Report
## Document Inventory
| Path | Type | Words | Frontmatter | Modified |
|------|------|-------|-------------|----------|
| skills/dev/testing/SKILL.md | skill | 1,245 | Yes | 2026-03-20 |
| docs/CHANGELOG.md | changelog | 890 | No | 2026-03-19 |
| README.md | readme | 450 | No | 2026-03-18 |
**Total:** 142 documents | **With frontmatter:** 105 | **Total words:** 185,000
Phase 2: Classification
Document Type Taxonomy
| Type | Signal | Example |
|---|---|---|
| Skill | YAML frontmatter with name:, in skills/ |
SKILL.md |
| Configuration | YAML/JSON schema | seed.yaml, registry.json |
| Guide | Tutorial structure, step-by-step | getting-started.md |
| Reference | API docs, schema docs | api-spec.md |
| Decision | ADR format, options + decision | adr-001.md |
| Changelog | Date-ordered entries | CHANGELOG.md |
| Policy | Rules, constraints | CONTRIBUTING.md |
Automated Classification
def classify_document(path: str, content: str) -> str:
if "skills/" in path and content.startswith("---"):
return "skill"
if path.endswith("seed.yaml") or path.endswith("registry.json"):
return "configuration"
if "## Step" in content or "### Step" in content:
return "guide"
if "## Decision" in content or "## Alternatives" in content:
return "decision"
if "## [" in content and any(d in content for d in ["Added", "Fixed", "Changed"]):
return "changelog"
return "general"
Phase 3: Quality Evaluation
Quality Scorecard
| Criterion | Weight | Score (0-3) | Notes |
|---|---|---|---|
| Completeness | 30% | All required sections present? | |
| Accuracy | 25% | Information correct and current? | |
| Clarity | 20% | Understandable without prior context? | |
| Structure | 15% | Logical organization, headings, formatting? | |
| Maintenance | 10% | Updated date, versioned, no stale links? |
Automated Quality Checks
def quality_check(path: str, content: str) -> dict:
checks = {
"has_title": content.startswith("#") or content.startswith("---"),
"has_sections": content.count("\n##") >= 2,
"reasonable_length": 100 < len(content.split()) < 10000,
"no_todo_left": "TODO" not in content and "FIXME" not in content,
"no_broken_links": not re.search(r'\[.*?\]\(\s*\)', content),
"has_code_examples": "```" in content,
"frontmatter_complete": check_frontmatter_fields(content),
}
score = sum(checks.values()) / len(checks)
return {"checks": checks, "score": round(score, 2)}
Phase 4: Feature Extraction
Metadata Extraction
import yaml
import re
def extract_frontmatter(content: str) -> dict | None:
match = re.match(r'^---\n(.*?)\n---', content, re.DOTALL)
if match:
return yaml.safe_load(match.group(1))
return None
def extract_features(content: str) -> dict:
return {
"headings": re.findall(r'^#+\s+(.+)$', content, re.MULTILINE),
"code_blocks": len(re.findall(r'```', content)) // 2,
"links": re.findall(r'\[([^\]]+)\]\(([^)]+)\)', content),
"images": re.findall(r'!\[([^\]]*)\]\(([^)]+)\)', content),
"tables": content.count("\n|"),
"todos": re.findall(r'- \[ \]\s+(.+)', content),
}
Cross-Reference Mapping
def build_reference_graph(documents: list[dict]) -> dict:
graph = {}
for doc in documents:
links = extract_features(doc["content"])["links"]
graph[doc["path"]] = {
"outgoing": [link[1] for link in links if not link[1].startswith("http")],
"incoming": [],
}
# Build incoming links
for source, data in graph.items():
for target in data["outgoing"]:
if target in graph:
graph[target]["incoming"].append(source)
return graph
Gap Analysis
def gap_analysis(inventory: list[dict], expected: dict) -> dict:
existing = {doc["path"] for doc in inventory}
gaps = {
"missing_required": [p for p in expected.get("required", []) if p not in existing],
"missing_recommended": [p for p in expected.get("recommended", []) if p not in existing],
"orphaned": [p for p in existing if p not in expected.get("all_known", existing)],
"empty_files": [doc["path"] for doc in inventory if doc["word_count"] < 10],
}
return gaps
Anti-Patterns
- Manual-only audits — Automate what you can; reserve human judgment for quality assessment
- Audit without action — Every finding should map to a remediation action
- One-time audit — Build continuous monitoring, not point-in-time snapshots
- Counting without evaluating — Document count is vanity; quality score is actionable
- No baseline — Establish quality benchmarks before auditing
- Ignoring cross-references — Orphaned documents and broken links indicate structural problems
More from 4444j99/a-i--skills
interfaith-sacred-geometry
Generate sacred geometry patterns with interfaith symbolism for spiritual visualizations and art. Use when creating visual representations that honor multiple religious traditions, designing meditation aids, building soul journey visualizations, or producing art that bridges sacred traditions through geometric harmony. Triggers on sacred geometry requests, interfaith symbol design, spiritual visualization projects, or multi-tradition sacred art.
8three-js-interactive-builder
Scaffold and build interactive 3D visualizations using Three.js with emphasis on algorithmic art, sacred geometry, temporal animations, and modular architecture. Use when creating WebGL visualizations, generative art pieces, interactive 3D experiences, particle systems, flow fields, or projects like gravitational spirals, temporal perspective pieces, or illuminated visual narratives. Triggers on requests for Three.js projects, 3D web graphics, algorithmic visualizations, or sacred geometry renders.
6github-repo-curator
Organize GitHub repositories for professional presentation and maintainability. README templates, documentation standards, repo organization patterns, and profile optimization. Triggers on GitHub cleanup, repo organization, README writing, or open source presentation requests.
5ontological-renamer
Renames projects and content with dense, meaningful ontological titles that describe essence and function. Combines 3-4 words using separator conventions (- for compound/close words, -- for distant concepts). Provides translations to Latin and Greek. Use when naming projects, repositories, systems, or concepts.
5mobile-platform-architect
Architects cross-platform and native mobile applications, providing guidance on state management, navigation, and platform-specific best practices for React Native, Flutter, iOS, and Android.
5security-implementation-guide
Comprehensive security patterns for authentication, authorization, input validation, and common vulnerability prevention
5