code-review
Code Review
v1.0 — Structured code review for research code, drawing on DIME, Gentzkow-Shapiro, AEA, and IPA standards
Review research code (Stata, R, or Python) against economics-specific quality standards. Catches silent failures, reproducibility risks, and style issues that generic linters miss.
Argument: $ARGUMENTS
- Path to a file (`.do`, `.R`, `.py`) or a directory
- Or a project name (will look in `~/Dropbox/Github/[project]/`)
Modes (append to argument):
- `quick` (default) — Single-file review: correctness, reproducibility risks, style
- `full` — Deep single-file review with project context (reads master do-file, config, related files)
- `pipeline` — Multi-file review: trace the full analysis pipeline, check dependencies and flow
- `replication` — AEA replication package audit (README, data citations, reproducibility, completeness)
Flags:
- `fix` — Also output a corrected version of the file (otherwise review-only)
- `severity:high` — Only report high-severity issues (skip style nitpicks)
Example: /code-review ~/Dropbox/Github/graduation-coaching/code/dofiles/01_clean.do
Example: /code-review graduation-coaching pipeline
Example: /code-review my_analysis.do full fix
Instructions
Step 0: Locate and Read the Code
- If `$ARGUMENTS` contains a file path, read that file directly
- If `$ARGUMENTS` is a project name, check these locations in order:
  - `~/Dropbox/Github/[project]/code/`
  - `~/Dropbox/Github/[project]/analysis/`
  - `~/Dropbox/Github/[project]/dofiles/`
  - Glob for `*.do`, `*.R`, `*.py` in the project repo
- If a directory is given:
  - For `quick` or `full`: review each code file individually
  - For `pipeline`: trace the execution order from the master file
  - For `replication`: review the full package structure
- Detect language from file extension: `.do` = Stata, `.R`/`.r` = R, `.py` = Python
If multiple files are found and no mode is specified, list them and ask which to review.
For `full` and `pipeline` modes, also read:
- Master do-file / main script (look for `master*.do`, `main*.do`, `run*.do`, `00_*.do`)
- Config file (look for `config*.do`, `profile.do`, `globals.do`, `paths.do`)
- Project's `CLAUDE.md` or `README.md` if available
- Project's `HUB.md` in eb-lab if available
Parse the mode from `$ARGUMENTS`. Default to `quick` if not specified.
Step 1: Correctness Checks
Review the code for errors that could produce wrong results silently. These are the most important findings.
1.1 Stata-Specific Correctness
- Merge diagnostics: Every `merge` must be followed by `assert _merge == 3` or explicit handling of `_merge` values (tabulate, keep/drop). Flag any merge without `_merge` inspection (see the sketch after this list).
- Sort stability: `sort` in Stata is not stable. Flag any `sort` followed by operations that depend on row order (e.g., `gen id = _n`, `by ... : gen x = x[_n-1]`). Recommend `isid` checks or `sort ..., stable`.
- Dropped observations: Flag any `drop if` or `keep if` without a preceding or following count/assertion. The reviewer should verify the number of dropped obs is expected.
- Missing values in comparisons: In Stata, missing values are greater than any number. Flag `if x > threshold` without `& !missing(x)`. Flag `drop if x > threshold` especially.
- String/numeric mismatch: Flag comparisons between string and numeric variables (e.g., merge keys where one side is string, the other numeric).
- Preserve/restore: Every `preserve` must have a matching `restore`. Flag unmatched pairs.
- Temporary files: `tempfile` and `tempvar` usage — flag any that are created but never used, or used after `clear`.
- Collapse without saving: Flag `collapse` without a preceding `preserve` or `save`/`tempfile` — the original data is destroyed.
- Destring/tostring issues: Flag `destring, force` without checking what was forced to missing.
- Factor variable traps: Flag regressions using `i.` on variables with many levels without checking for singletons or collinearity.
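A minimal Stata sketch of the patterns the merge, missing-value, and sort checks above look for (file and variable names are illustrative, not from any specific project):

```stata
* Merge with explicit diagnostics: assert the expected match pattern
use "data/clean/households.dta", clear
merge 1:1 hh_id using "data/clean/baseline_survey.dta"
assert _merge == 3                    // stop immediately if any observation fails to match
drop _merge

* Drop with a missing-aware condition and a count check
count if age > 65 & !missing(age)
drop if age > 65 & !missing(age)      // missings are not silently swept into the drop

* Make row-order-dependent operations deterministic
isid hh_id                            // confirm the key uniquely identifies observations
sort hh_id, stable
gen obs_id = _n
```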
1.2 R-Specific Correctness
- Unhandled NAs: Operations on vectors with NAs without `na.rm = TRUE` or explicit `filter(!is.na(...))`.
- Left joins dropping data: Flag `left_join` without checking for unexpected row count changes.
- Factor level issues: Implicit factor ordering in regressions.
- Package conflicts: Multiple packages loaded that mask each other's functions (e.g., `dplyr::filter` vs `stats::filter`).
1.3 Python-Specific Correctness
- Pandas merge issues: `pd.merge` without the `validate=` parameter. Missing `how=` specification.
- Silent type coercion: Operations that silently convert types (e.g., int to float due to NaN).
- Index alignment: Operations on DataFrames with misaligned indices.
1.4 Cross-Language Correctness
- Hardcoded values: Magic numbers without explanation (e.g., `drop if age > 65` — is 65 the right cutoff?). See the sketch after this list.
- Commented-out code that affects results: Large blocks of commented-out analysis that suggest the code was modified and may not reflect the intended specification.
- Off-by-one errors: Loop bounds, date ranges, age cutoffs.
- Inconsistent sample restrictions: Different `if` conditions across regressions that should use the same sample.
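A minimal sketch of how named constants and a shared sample condition address the hardcoded-value and inconsistent-restriction items (the cutoff and macro names are illustrative assumptions):

```stata
* Name magic numbers and define the estimation sample once
local retirement_age 65                                        // documented cutoff, set in one place
local est_sample "age <= `retirement_age' & !missing(age, income)"

* Reuse the same restriction across specifications
regress income i.treatment if `est_sample'
regress income i.treatment i.district if `est_sample'
```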
Step 2: Reproducibility Checks
Review for issues that would cause the code to fail or produce different results on another machine or in the future.
2.1 Path and Environment
- Hardcoded paths: Any absolute path that includes a username, machine name, or drive letter. Should use config files, globals, or relative paths.
- Missing version specification: For Stata: no `version` command. For R: no `sessionInfo()` or `renv.lock`. For Python: no `requirements.txt` or environment file (see the config sketch after this list).
- Platform-specific code: Forward vs. back slashes, OS-specific commands.
- Working directory assumptions: Code that assumes a specific working directory without setting it.
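A minimal sketch of the config-based setup these checks point toward (global names and directory layout are illustrative assumptions, not a required structure):

```stata
* config.do — run at the top of every script by the master do-file
version 17                          // pin the Stata version used for the project
set varabbrev off                   // disable partial variable-name matching

* One user-specific root; every other path is relative to it
global root   "~/Dropbox/Github/my-project"
global data   "$root/data"
global output "$root/output"

* Analysis scripts then reference globals rather than absolute paths:
* use "$data/clean/analysis_sample.dta", clear
```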
2.2 Randomness and Determinism
- Unseeded randomness: Any use of random numbers (`runiform()`, `sample()`, `np.random`) without a preceding `set seed` / `set.seed()` / `np.random.seed()`. See the sketch after this list.
- Sort-dependent operations: Results that depend on sort order where the sort is not unique (see 1.1).
- Floating point issues: Comparisons using `==` on floating-point numbers.
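A minimal sketch of reproducible random assignment covering the seed and sort items (the seed value and variable names are illustrative):

```stata
* Reproducible random assignment: a seed AND a unique sort order
isid hh_id                 // assignment is only replicable if the sort key is unique
sort hh_id
set seed 20240115          // document where the seed comes from
gen u = runiform()
gen treated = (u < 0.5)
```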
2.3 Dependencies
- Undocumented packages: Community-contributed commands (Stata: `ssc install`; R: `install.packages`; Python: `pip install`) that are used but not listed in a requirements file or package-installer script (see the installer sketch after this list).
- Version-sensitive commands: Commands whose behavior changed between versions (e.g., Stata's `reghdfe` updates, R package breaking changes).
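A minimal sketch of the package-installer do-file the first item asks for (the package list is illustrative; it should mirror what the project actually uses):

```stata
* 0_install_packages.do — run once per machine before the main pipeline
local packages reghdfe ftools estout
foreach pkg of local packages {
    capture which `pkg'
    if _rc ssc install `pkg', replace
}
```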
2.4 File Dependencies
- Input files not documented: Data files read by the code but not listed in README or data documentation.
- Output files not tracked: Files created by the code that aren't mentioned in documentation.
- Circular dependencies: File A reads output of File B which reads output of File A.
Step 3: Style and Readability
Review for issues that make code harder to understand, maintain, or review. Lower priority than correctness and reproducibility.
Skip this section if the `severity:high` flag is set.
3.1 Stata Style (based on DIME Analytics + Gentzkow-Shapiro)
- `#delimit ;`: Flag use of `#delimit` — prefer `///` for line continuation (see the sketch after this list).
- Abbreviations: Flag abbreviated command names (`gen` is OK, but flag `g`, `d` for `drop`, `ren` for `rename`, `ta` for `tab`, `su` for `sum`). Only flag genuinely ambiguous abbreviations — `gen`, `reg`, `tab`, `sum` are universally understood and acceptable.
- Variable abbreviation: Flag reliance on partial variable-name matching (Stata's dangerous default). Recommend `set varabbrev off`.
- Indentation: Inconsistent indentation, especially inside loops and if-blocks.
- Line length: Lines over 100 characters.
- Magic numbers: Unnamed numeric constants. Should be stored in locals/globals with descriptive names.
- Commenting: Major sections without header comments. Complex logic without inline explanation.
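A minimal sketch of the continuation and named-constant style these items prefer (the regression itself is illustrative):

```stata
* Use /// continuation instead of #delimit ; and name numeric constants
local max_age 65
regress log_income i.treatment age i.district          ///
    if age <= `max_age' & !missing(log_income, age),   ///
    vce(cluster district)
```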
3.2 R Style (based on tidyverse style guide)
- Long pipes (`%>%` or `|>`) without intermediate assignments.
- Functions over 50 lines without decomposition.
- Inconsistent naming (mixing `snake_case` and `camelCase`).
3.3 Python Style
- Defer to Ruff/PEP 8. Only flag issues a linter wouldn't catch (e.g., misleading variable names in an econometric context).
3.4 Cross-Language Style
- Dead code: Large commented-out blocks, unused variables, unreachable branches.
- Copy-paste code: Repeated blocks that should be a function/loop (see the sketch after this list).
- Naming: Variable names that don't convey meaning (e.g., `x1`, `temp2`, `var_new`).
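A minimal sketch of collapsing copy-pasted blocks into a loop (outcome names are illustrative):

```stata
* Replace three near-identical regression blocks with one loop
foreach outcome in income consumption savings {
    regress `outcome' i.treatment age i.district, vce(cluster district)
    estimates store `outcome'
}
```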
Step 4: Documentation Checks
4.1 File-Level
- Does the file have a header comment explaining: purpose, inputs, outputs, author, date? (See the template after this list.)
- Is it clear where this file fits in the pipeline (what runs before/after it)?
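A minimal sketch of the header this check looks for (fields and file names are illustrative, not a required template):

```stata
/*==============================================================================
  01_clean_baseline.do
  Purpose : Clean the raw baseline survey and build the household-level file
  Inputs  : data/raw/baseline_survey.dta
  Outputs : data/clean/households.dta
  Author  : [name], last updated 2024-01-15
  Order   : runs after 00_setup.do; feeds 02_construct_indices.do
==============================================================================*/
```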
4.2 Data Transformations
- Are merge ratios documented (e.g., "expect 1:1 merge, N = 5,000")?
- Are sample restrictions explained (why drop these observations)?
- Are variable constructions documented (how is this index built)?
4.3 Analysis
- Are regression specifications motivated (why these controls? why this functional form)?
- Are robustness checks documented (what is being tested and why)?
- Is it clear which tables/figures each code block produces?
Step 5: Pipeline-Specific Checks (pipeline mode only)
Skip unless mode = pipeline or replication.
5.1 Execution Order
- Is there a master file that runs everything in order? (See the sketch after this list.)
- Can the full pipeline run from a single command ("push-button replication")?
- Are there files that must be run manually or out of order?
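A minimal sketch of the push-button master file this subsection looks for (script names are illustrative):

```stata
* master.do — runs the full pipeline from raw data to tables and figures
clear all
do "code/config.do"                       // paths, version, settings

do "code/dofiles/01_clean.do"
do "code/dofiles/02_construct.do"
do "code/dofiles/03_analysis.do"
do "code/dofiles/04_tables_figures.do"
```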
5.2 Data Flow
- Trace the data from raw inputs to final outputs. Map: raw data → cleaning → construction → analysis → tables/figures.
- Flag any breaks in the chain (a file reads data that no previous file creates).
- Flag any data files that are created but never used downstream.
5.3 Runtime
- Estimate total runtime if possible (flag long-running operations).
- Are there expensive operations that could be cached or skipped on re-runs?
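A minimal sketch of caching an expensive intermediate step (the file name and rebuild flag are illustrative assumptions):

```stata
* Skip a slow construction step when its output already exists
capture confirm file "data/temp/matched_sample.dta"
if _rc | "$rebuild" == "1" {
    do "code/dofiles/02b_match_records.do"    // the slow step
}
use "data/temp/matched_sample.dta", clear
```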
Step 6: Replication Package Checks (replication mode only)
Skip unless mode = replication.
Run the AEA Data Editor checklist. For each item, assess: Met / Partial / Missing / Can't Assess.
6.1 README
- Follows AEA template structure (or equivalent)?
- Data availability statements for each data source?
- Computational requirements (software, hardware, runtime, storage)?
- Instructions for replicators (clear step-by-step)?
6.2 Data
- Data citations in standard format (author, title, distributor, date, DOI)?
- License/terms of use for each dataset?
- Access instructions for restricted data?
- PII check — any risk of identifiable information?
6.3 Code
- All code included and runnable?
- Package/dependency management (Stata: package-installer do-file; R: `renv.lock`; Python: `requirements.txt`)?
- Version pinning (Stata version, R version, Python version)?
- Output mapping: which script produces which table/figure?
6.4 Outputs
- All tables and figures in the paper reproducible from provided code + data?
- In-text statistics traceable to code?
- Appendix materials included?
6.5 Legal and Ethical
- LICENSE file present?
- IRB approval documented?
- RCT registration cited?
- Data use agreements acknowledged?
Step 7: Generate Output
Classify each finding by severity:
- CRITICAL — Will produce wrong results or prevent replication. Fix immediately.
- HIGH — Significant reproducibility risk or code quality issue. Fix before sharing.
- MEDIUM — Style or documentation issue that makes code harder to review. Fix when convenient.
- LOW — Nitpick. Optional improvement.
Save the report to the same directory as the reviewed file:
review_[filename]_[YYYY-MM-DD].md
For pipeline/replication reviews, save to the project root:
code_review_[project]_[YYYY-MM-DD].md
If the fix flag is set, also save a corrected version:
[filename]_reviewed.[ext]
Tell the user the full path to the output file(s).
Output Format
# Code Review: [filename or project name]
**Date:** [YYYY-MM-DD]
**Mode:** [quick / full / pipeline / replication]
**Language:** [Stata / R / Python]
**File(s) reviewed:** [path(s)]
**Reviewer:** /code-review skill v1.0
**Standards:** DIME Analytics, Gentzkow-Shapiro, AEA Data Editor
---
## Summary
**Overall assessment:** [Clean / Minor Issues / Needs Revision / Significant Problems]
**Findings:** [N] critical, [N] high, [N] medium, [N] low
[2-3 sentence summary of the most important findings.]
---
## Critical & High Findings
### F1: [Title]
- **Severity:** [CRITICAL / HIGH]
- **Category:** [Correctness / Reproducibility / Documentation / Pipeline / Replication]
- **Location:** [file:line_number or file:section]
- **Issue:** [What's wrong]
- **Risk:** [What could go wrong if unfixed]
- **Fix:** [Specific recommendation]
[Repeat for each critical/high finding]
---
## Medium Findings
### F[N]: [Title]
- **Severity:** MEDIUM
- **Category:** [category]
- **Location:** [location]
- **Issue:** [description]
- **Fix:** [recommendation]
[Repeat]
---
## Low Findings
[Brief list format — one line per finding]
- **F[N]:** [location] — [issue] → [fix]
---
## File Summary Table
| Check Category | Status | Issues Found |
|---------------|--------|-------------|
| Correctness | [pass/warn/fail] | [count] |
| Reproducibility | [pass/warn/fail] | [count] |
| Style | [pass/warn/fail] | [count] |
| Documentation | [pass/warn/fail] | [count] |
| Pipeline (if applicable) | [pass/warn/fail] | [count] |
| Replication (if applicable) | [pass/warn/fail] | [count] |
---
## Checklist (replication mode only)
| AEA Requirement | Status | Notes |
|----------------|--------|-------|
| [requirement] | [Met/Partial/Missing/Can't Assess] | [details] |
---
## Next Steps
1. [Highest-priority action]
2. [Second priority]
3. [Third priority]
Principles
- Correctness over style. A well-formatted file with a wrong merge is worse than ugly code that produces correct results. Always prioritize findings that affect results.
- Economics-aware. Understand that `reghdfe` is a regression, that `_merge` matters, that `collapse` destroys data, that missing values sort high in Stata. Generic code review advice is insufficient.
- Actionable findings. Every issue comes with a specific fix. "This could be improved" is not a finding — "Line 47: `merge` without `_merge` check; add `assert _merge == 3` or `tab _merge` after merge" is.
- Calibrate severity honestly. Not everything is critical. A missing comment is LOW. A merge without diagnostics is HIGH. A hardcoded path is HIGH. An abbreviated command is LOW.
- Do not fabricate. If you cannot determine whether a pattern is an error or intentional, flag it as a question, not a finding. Say "Verify: is this intentional?" rather than "Bug: this is wrong."
- Respect the author's intent. Unusual patterns may be intentional. Flag them, explain the risk, but don't assume they're mistakes.
- No scope creep. Review the code as requested. Do not rewrite the analysis, suggest different specifications, or critique the research design (that's what `/review-paper` is for).