# data-dictionary

**Data Dictionary Generator** (v1.0) — auto-generate comprehensive codebooks from .dta files.

Reads Stata .dta files and produces a structured markdown data dictionary with variable names, types, labels, value labels, summary statistics, and missingness. Outputs a ready-to-use codebook document.
**Argument:** `$ARGUMENTS`
- Path to a .dta file or directory containing .dta files
**Modes** (append to argument):
- `summary` (default) — One-page overview: variable list with types, labels, missingness
- `full` — Comprehensive codebook: summary plus value labels, summary stats, and distributions for key variables
- `analysis` — Analysis-ready: full plus notes on which variables are outcomes vs. controls, indices vs. components
**Flags:**
- `vars:consumption,assets` — Only document variables matching these patterns
- `output:path/to/output.md` — Custom output path (default: same directory as input, named `codebook_[filename].md`)
- `format:md` (default) | `format:csv` — Output format
**Examples:**
- `/data-dictionary data/for_analysis/endline_analysis.dta full`
- `/data-dictionary . vars:consumption,exp_`
- `/data-dictionary data/for_analysis/ summary`
## Instructions

### Step 0: Locate and Load Data
- Resolve `$ARGUMENTS` to find .dta file(s):
  - If a file path: read that file directly
  - If a directory: glob for `*.dta` files in it
  - If a bare name: search the current working directory and subdirectories (`data/`, `data/for_analysis/`, `output/`)
- For directories with many files (>10), generate an index page plus individual codebooks
- Parse mode and flags from `$ARGUMENTS`. Default to `summary`.
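The resolution logic above can be sketched as follows. This is an illustrative implementation, not part of the skill itself; the function names (`parse_arguments`, `resolve_dta_files`) and the exact search order for bare names are assumptions.

```python
from pathlib import Path

MODES = {"summary", "full", "analysis"}

def parse_arguments(raw: str):
    """Split $ARGUMENTS into (target path, mode, flags dict)."""
    target, mode, flags = None, "summary", {}
    for tok in raw.split():
        if tok in MODES:
            mode = tok
        elif ":" in tok:
            key, _, value = tok.partition(":")
            flags[key] = value
        else:
            target = tok
    return target, mode, flags

def resolve_dta_files(target):
    """Resolve a file path, a directory, or a bare name to .dta files."""
    path = Path(target or ".")
    if path.is_file():
        return [path]
    if path.is_dir():
        return sorted(path.glob("*.dta"))
    # Bare name: search common project subdirectories recursively.
    hits = []
    for base in (Path("."), Path("data"), Path("data/for_analysis"), Path("output")):
        if base.exists():
            hits.extend(base.rglob(f"{target}*.dta"))
    return sorted(set(hits))
```

For example, `parse_arguments("data/x.dta full vars:consumption,exp_")` yields the target path, the `full` mode, and a `{"vars": "consumption,exp_"}` flags dict.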
### Step 1: Extract Metadata

Write a temporary Python script (use `uv run python` or `python3`). Never use `python -c`.
```python
import pandas as pd
import pyreadstat

df, meta = pyreadstat.read_dta("path/to/file.dta", apply_value_formats=False)

# Available metadata:
# meta.column_names            — variable names (list)
# meta.column_names_to_labels  — variable labels (dict: name -> label)
# meta.variable_value_labels   — value label mappings (dict: name -> {code: label})
# meta.original_variable_types — Stata storage types (dict: name -> type string)
# meta.number_columns          — column count
# meta.number_rows             — row count
# meta.file_label              — dataset label
# meta.notes                   — dataset notes
```
For each variable, extract:
- **Name:** column name
- **Label:** from `meta.column_names_to_labels` (may be empty)
- **Type:** from `meta.original_variable_types` (e.g., `float`, `double`, `byte`, `int`, `long`, `str#`)
- **Value labels:** from `meta.variable_value_labels` (categorical mappings)
- **N non-missing:** count of non-null values
- **N missing:** count of null values
- **Missing %:** percentage missing
- **Unique values:** `nunique`
### Step 2: Summary Statistics (full and analysis modes)
For numeric variables:
- Mean, SD, Min, P25, Median, P75, Max
- Flag: negative values in typically-positive variables
For string variables:
- Number of unique values
- Top 5 most frequent values with counts
- Max string length
For binary (0/1) variables:
- Proportion = 1 (with count)
For categorical variables (those with value labels):
- Frequency table: code, label, count, percentage
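The per-type statistics above reduce to a few short pandas helpers. This is a hedged sketch — the helper names and return shapes are assumptions, and a binary variable is detected here simply as one whose non-missing values lie in {0, 1}.

```python
import pandas as pd

def numeric_stats(col: pd.Series) -> dict:
    """Mean/SD/quantiles for a numeric variable, as in the full-mode table."""
    s = col.dropna()
    return {
        "n": len(s),
        "mean": s.mean(), "sd": s.std(),
        "min": s.min(), "p25": s.quantile(0.25),
        "median": s.median(), "p75": s.quantile(0.75), "max": s.max(),
    }

def is_binary(col: pd.Series) -> bool:
    """True if all non-missing values are 0 or 1."""
    vals = set(col.dropna().unique())
    return len(vals) > 0 and vals <= {0, 1}

def string_stats(col: pd.Series) -> dict:
    """Unique count, max length, and top-5 values for a string variable."""
    s = col.dropna().astype(str)
    top = s.value_counts().head(5)
    return {
        "unique": s.nunique(),
        "max_length": int(s.str.len().max()) if len(s) else 0,
        "top_values": list(top.items()),
    }
```

Categorical frequency tables follow the same pattern: `value_counts()` on the raw codes, joined against `meta.variable_value_labels[name]` for the label column.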
### Step 3: Variable Classification (analysis mode only)
Attempt to classify variables by role based on naming patterns:
| Pattern | Classification |
|---|---|
| `hhid`, `id`, `*_id` | Identifier |
| `treat*`, `arm*`, `t_*` | Treatment |
| `strat*`, `block*`, `pair*` | Stratification |
| `i_*`, `idx_*`, `index_*` | Index/Composite |
| `z_*` | Z-score |
| `exp_*`, `cons_*`, `income_*` | Outcome (economic) |
| `is_*`, `has_*`, `any_*` | Binary indicator |
| `mi_*`, `miss_*` | Missing flag |
| `_wins`, `_w`, `_tr` | Winsorized/trimmed |
| `bl_*`, `base_*` | Baseline |
| `wt_*`, `weight*` | Sampling weight |
Present classifications as suggestions, not assertions. Include an "Unclassified" category.
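One way to implement the pattern table is a first-match-wins list of regexes, with anything unmatched falling through to "Unclassified". The regex translations and their ordering here are judgment calls (e.g., identifiers are checked before economic-outcome prefixes), not part of the skill itself.

```python
import re

# Ordered: first match wins.
CLASSIFICATION_PATTERNS = [
    (r"^(hhid|id)$|_id$",        "Identifier"),
    (r"^(treat|arm|t_)",         "Treatment"),
    (r"^(strat|block|pair)",     "Stratification"),
    (r"^(i_|idx_|index_)",       "Index/Composite"),
    (r"^z_",                     "Z-score"),
    (r"^(exp_|cons_|income_)",   "Outcome (economic)"),
    (r"^(is_|has_|any_)",        "Binary indicator"),
    (r"^(mi_|miss_)",            "Missing flag"),
    (r"(_wins|_w|_tr)$",         "Winsorized/trimmed"),
    (r"^(bl_|base_)",            "Baseline"),
    (r"^(wt_|weight)",           "Sampling weight"),
]

def classify(name: str) -> str:
    """Suggest a role for a variable from its name; may well be wrong."""
    for pattern, label in CLASSIFICATION_PATTERNS:
        if re.search(pattern, name):
            return label
    return "Unclassified"
```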
### Step 4: Cross-Variable Relationships (analysis mode only)
- **Index decomposition:** For index variables (`i_*`, `idx_*`), identify likely component variables and report correlations
- **Treatment balance:** If a treatment variable is detected, report the mean of key variables by treatment arm (first 10 numeric variables)
- **Missing patterns:** Identify clusters of variables with correlated missingness
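The missing-pattern check can be done by correlating missingness indicators. A sketch under stated assumptions — the function name and the 0.8 threshold are illustrative, and variables that are never or always missing are excluded since their indicator has zero variance.

```python
import pandas as pd

def correlated_missingness(df: pd.DataFrame, threshold: float = 0.8):
    """Pairs of variables whose missing-indicator correlation >= threshold."""
    miss = df.isna().astype(float)
    # Keep only variables with some, but not all, values missing.
    miss = miss.loc[:, (miss.mean() > 0) & (miss.mean() < 1)]
    corr = miss.corr()
    pairs = []
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] >= threshold:
                pairs.append((a, b, round(float(corr.loc[a, b]), 2)))
    return pairs
```

Variables that are always missing together usually come from the same survey module or skip pattern, which is worth surfacing in the codebook notes.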
### Step 5: Generate Output

Output location:
- Default: same directory as the input file, named `codebook_[filename]_[YYYY-MM-DD].md`
- Override with the `output:` flag
- For directory inputs: `codebook_index_[YYYY-MM-DD].md` plus individual codebooks
Tell the user the full absolute path to the output file.
### Output Format — Summary Mode
# Data Dictionary: [filename]
**Generated:** [YYYY-MM-DD]
**Source:** [full path]
**Dataset label:** [meta.file_label if available]
**Observations:** [N rows]
**Variables:** [N columns]
---
## Variable List
| # | Variable | Label | Type | Non-missing | Missing % | Unique |
|---|----------|-------|------|-------------|-----------|--------|
| 1 | hhid | Household ID | long | 5,000 | 0.0% | 5,000 |
| 2 | treat | Treatment arm | byte | 5,000 | 0.0% | 3 |
| ... | ... | ... | ... | ... | ... | ... |
---
## Variables Without Labels
[List any variables that have no label — these may need documentation]
---
## Notes
- [N] variables are entirely missing (0 non-missing values)
- [N] variables have >50% missing values
- [N] categorical variables have value labels defined
### Output Format — Full Mode
Adds to summary:
---
## Summary Statistics — Numeric Variables
| Variable | N | Mean | SD | Min | P25 | Median | P75 | Max |
|----------|---|------|-----|-----|-----|--------|-----|-----|
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
---
## Summary Statistics — Binary Variables
| Variable | Label | N | Prop = 1 | Count = 1 |
|----------|-------|---|----------|-----------|
| ... | ... | ... | ... | ... |
---
## Value Labels — Categorical Variables
### [variable_name]: [label]
| Code | Label | Count | Percent |
|------|-------|-------|---------|
| 0 | Control | 2,500 | 50.0% |
| 1 | Treatment | 2,500 | 50.0% |
[Repeat for each categorical variable with value labels]
---
## String Variables
| Variable | Label | N | Unique | Max Length | Top Values |
|----------|-------|---|--------|------------|------------|
| ... | ... | ... | ... | ... | [val1 (N), val2 (N), ...] |
---
## High Missingness Variables (>20%)
| Variable | Label | Missing % | Non-missing N |
|----------|-------|-----------|---------------|
| ... | ... | ... | ... |
### Output Format — Analysis Mode
Adds to full:
---
## Variable Classification (suggested)
### Identifiers
| Variable | Label |
|----------|-------|
| hhid | Household ID |
### Treatment & Stratification
| Variable | Label | Values |
|----------|-------|--------|
| treat | Treatment arm | 0: Control, 1: Treatment |
### Outcome Indices
| Variable | Label | Mean | SD | Likely Components |
|----------|-------|------|-----|-------------------|
| i_consumption | Consumption index | 0.00 | 1.00 | exp_food, exp_nonfood, ... |
### Outcome Variables
[table]
### Control Variables / Baseline
[table]
### Missing Flags
[table]
### Unclassified
[table]
---
## Treatment Balance (first 10 numeric variables)
| Variable | Control Mean | Treatment Mean | Diff | p-value |
|----------|-------------|----------------|------|---------|
| ... | ... | ... | ... | ... |
## Principles
- **Comprehensive but scannable.** The dictionary should work as both a reference document (ctrl+F for a variable) and a quick overview (scan the summary table).
- **Metadata-first.** Always use Stata's own metadata (labels, value labels, types) rather than inferring. Only infer when metadata is missing.
- **Flag gaps.** Unlabeled variables, undocumented value labels, and high missingness are all worth flagging — they're the most likely sources of confusion.
- **Analysis mode is suggestive, not prescriptive.** Variable classification is based on naming patterns and may be wrong. Present as suggestions.
- **Reproducible.** The output includes the source path and generation date so it's clear what version of the data was documented.