code-review
Code Review
v1.0 — Structured code review for research code, drawing on DIME, Gentzkow-Shapiro, AEA, and IPA standards
Review research code (Stata, R, or Python) against economics-specific quality standards. Catches silent failures, reproducibility risks, and style issues that generic linters miss.
Argument: $ARGUMENTS
- Path to a file (`.do`, `.R`, `.py`) or a directory
- Or a project name (will look in `~/Dropbox/Github/[project]/`)
Modes (append to argument):
- `quick` (default) — Single-file review: correctness, reproducibility risks, style
- `full` — Deep single-file review with project context (reads master do-file, config, related files)
- `pipeline` — Multi-file review: trace the full analysis pipeline, check dependencies and flow
- `replication` — AEA replication package audit (README, data citations, reproducibility, completeness)
Flags:
- `fix` — Also output a corrected version of the file (otherwise review-only)
- `severity:high` — Only report high-severity issues (skip style nitpicks)
Example: /code-review ~/Dropbox/Github/graduation-coaching/code/dofiles/01_clean.do
Example: /code-review graduation-coaching pipeline
Example: /code-review my_analysis.do full fix
Instructions
Step 0: Locate and Read the Code
- If `$ARGUMENTS` contains a file path, read that file directly
- If `$ARGUMENTS` is a project name, check these locations in order:
  - `~/Dropbox/Github/[project]/code/`
  - `~/Dropbox/Github/[project]/analysis/`
  - `~/Dropbox/Github/[project]/dofiles/`
  - Glob for `*.do`, `*.R`, `*.py` in the project repo
- If a directory is given:
  - For `quick` or `full`: review each code file individually
  - For `pipeline`: trace the execution order from the master file
  - For `replication`: review the full package structure
- Detect language from file extension: `.do` = Stata, `.R`/`.r` = R, `.py` = Python
If multiple files are found and no mode is specified, list them and ask which to review.
For `full` and `pipeline` modes, also read:
- Master do-file / main script (look for `master*.do`, `main*.do`, `run*.do`, `00_*.do`)
- Config file (look for `config*.do`, `profile.do`, `globals.do`, `paths.do`)
- Project's `CLAUDE.md` or `README.md` if available
- Project's `HUB.md` in eb-lab if available
Parse the mode from `$ARGUMENTS`. Default to `quick` if not specified.
Step 1: Correctness Checks
Review the code for errors that could produce wrong results silently. These are the most important findings.
1.1 Stata-Specific Correctness
- Merge diagnostics: Every `merge` must be followed by `assert _merge == 3` or explicit handling of `_merge` values (tabulate, keep/drop). Flag any merge without `_merge` inspection (see the sketch after this list).
- Sort stability: `sort` in Stata is not stable. Flag any `sort` followed by operations that depend on row order (e.g., `gen id = _n`, `by ... : gen x = x[_n-1]`). Recommend `isid` checks or `sort ..., stable`.
- Dropped observations: Flag any `drop if` or `keep if` without a preceding or following count/assertion. The reviewer should verify the number of dropped obs is expected.
- Missing values in comparisons: In Stata, missing values are greater than any number. Flag `if x > threshold` without `& !missing(x)`. Flag `drop if x > threshold` especially.
- String/numeric mismatch: Flag comparisons between string and numeric variables (e.g., merge keys where one side is string, the other numeric).
- Preserve/restore: Every `preserve` must have a matching `restore`. Flag unmatched pairs.
- Temporary files: `tempfile` and `tempvar` usage — flag any that are created but never used, or used after `clear`.
- Collapse without saving: Flag `collapse` without a preceding `preserve` or `save`/`tempfile` — the original data is destroyed.
- Destring/tostring issues: Flag `destring, force` without checking what was forced to missing.
- Factor variable traps: Flag regressions using `i.` on variables with many levels without checking for singletons or collinearity.
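A minimal Stata sketch of the patterns the merge, missing-value, and sort checks above look for (file and variable names are illustrative, not from any specific project):

```stata
* Merge with explicit diagnostics: assert the expected match pattern
use "data/clean/households.dta", clear
merge 1:1 hh_id using "data/clean/baseline_survey.dta"
assert _merge == 3                    // stop immediately if any observation fails to match
drop _merge

* Drop with a missing-aware condition and a count check
count if age > 65 & !missing(age)
drop if age > 65 & !missing(age)      // missings are not silently swept into the drop

* Make row-order-dependent operations deterministic
isid hh_id                            // confirm the key uniquely identifies observations
sort hh_id, stable
gen obs_id = _n
```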
1.2 R-Specific Correctness
- Unhandled NAs: Operations on vectors with NAs without `na.rm = TRUE` or explicit `filter(!is.na(...))`.
- Left joins dropping data: Flag `left_join` without checking for unexpected row count changes.
- Factor level issues: Implicit factor ordering in regressions.
- Package conflicts: Multiple packages loaded that mask each other's functions (e.g., `dplyr::filter` vs `stats::filter`).
1.3 Python-Specific Correctness
- Pandas merge issues: `pd.merge` without the `validate=` parameter. Missing `how=` specification.
- Silent type coercion: Operations that silently convert types (e.g., int to float due to NaN).
- Index alignment: Operations on DataFrames with misaligned indices.
1.4 Cross-Language Correctness
- Hardcoded values: Magic numbers without explanation (e.g., `drop if age > 65` — is 65 the right cutoff?). See the sketch after this list.
- Commented-out code that affects results: Large blocks of commented-out analysis that suggest the code was modified and may not reflect the intended specification.
- Off-by-one errors: Loop bounds, date ranges, age cutoffs.
- Inconsistent sample restrictions: Different `if` conditions across regressions that should use the same sample.
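A minimal sketch of how named constants and a shared sample condition address the hardcoded-value and inconsistent-restriction items (the cutoff and macro names are illustrative assumptions):

```stata
* Name magic numbers and define the estimation sample once
local retirement_age 65                                        // documented cutoff, set in one place
local est_sample "age <= `retirement_age' & !missing(age, income)"

* Reuse the same restriction across specifications
regress income i.treatment if `est_sample'
regress income i.treatment i.district if `est_sample'
```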
Step 2: Reproducibility Checks
Review for issues that would cause the code to fail or produce different results on another machine or in the future.
2.1 Path and Environment
- Hardcoded paths: Any absolute path that includes a username, machine name, or drive letter. Should use config files, globals, or relative paths.
- Missing version specification: For Stata: no `version` command. For R: no `sessionInfo()` or `renv.lock`. For Python: no `requirements.txt` or environment file (see the config sketch after this list).
- Platform-specific code: Forward vs. back slashes, OS-specific commands.
- Working directory assumptions: Code that assumes a specific working directory without setting it.
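A minimal sketch of the config-based setup these checks point toward (global names and directory layout are illustrative assumptions, not a required structure):

```stata
* config.do — run at the top of every script by the master do-file
version 17                          // pin the Stata version used for the project
set varabbrev off                   // disable partial variable-name matching

* One user-specific root; every other path is relative to it
global root   "~/Dropbox/Github/my-project"
global data   "$root/data"
global output "$root/output"

* Analysis scripts then reference globals rather than absolute paths:
* use "$data/clean/analysis_sample.dta", clear
```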
2.2 Randomness and Determinism
- Unseeded randomness: Any use of random numbers (`runiform()`, `sample()`, `np.random`) without a preceding `set seed` / `set.seed()` / `np.random.seed()`. See the sketch after this list.
- Sort-dependent operations: Results that depend on sort order where the sort is not unique (see 1.1).
- Floating point issues: Comparisons using `==` on floating-point numbers.
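A minimal sketch of reproducible random assignment covering the seed and sort items (the seed value and variable names are illustrative):

```stata
* Reproducible random assignment: a seed AND a unique sort order
isid hh_id                 // assignment is only replicable if the sort key is unique
sort hh_id
set seed 20240115          // document where the seed comes from
gen u = runiform()
gen treated = (u < 0.5)
```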
2.3 Dependencies
- Undocumented packages: Community-contributed commands (Stata: `ssc install`; R: `install.packages`; Python: `pip install`) that are used but not listed in a requirements file or package-installer script (see the installer sketch after this list).
- Version-sensitive commands: Commands whose behavior changed between versions (e.g., Stata's `reghdfe` updates, R package breaking changes).
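A minimal sketch of the package-installer do-file the first item asks for (the package list is illustrative; it should mirror what the project actually uses):

```stata
* 0_install_packages.do — run once per machine before the main pipeline
local packages reghdfe ftools estout
foreach pkg of local packages {
    capture which `pkg'
    if _rc ssc install `pkg', replace
}
```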
2.4 File Dependencies
- Input files not documented: Data files read by the code but not listed in README or data documentation.
- Output files not tracked: Files created by the code that aren't mentioned in documentation.
- Circular dependencies: File A reads output of File B which reads output of File A.
Step 3: Style and Readability
Review for issues that make code harder to understand, maintain, or review. Lower priority than correctness and reproducibility.
Skip this section if the `severity:high` flag is set.
3.1 Stata Style (based on DIME Analytics + Gentzkow-Shapiro)
- `#delimit ;`: Flag use of `#delimit` — prefer `///` for line continuation (see the sketch after this list).
- Abbreviations: Flag abbreviated command names (`gen` is OK, but flag `g`, `d` for `drop`, `ren` for `rename`, `ta` for `tab`, `su` for `sum`). Only flag genuinely ambiguous abbreviations — `gen`, `reg`, `tab`, `sum` are universally understood and acceptable.
- Variable abbreviation: Flag reliance on partial variable-name matching (Stata's dangerous default). Recommend `set varabbrev off`.
- Indentation: Inconsistent indentation, especially inside loops and if-blocks.
- Line length: Lines over 100 characters.
- Magic numbers: Unnamed numeric constants. Should be stored in locals/globals with descriptive names.
- Commenting: Major sections without header comments. Complex logic without inline explanation.
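A minimal sketch of the continuation and named-constant style these items prefer (the regression itself is illustrative):

```stata
* Use /// continuation instead of #delimit ; and name numeric constants
local max_age 65
regress log_income i.treatment age i.district          ///
    if age <= `max_age' & !missing(log_income, age),   ///
    vce(cluster district)
```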
3.2 R Style (based on tidyverse style guide)
- Long pipes (`%>%` or `|>`) without intermediate assignments.
- Functions over 50 lines without decomposition.
- Inconsistent naming (mixing `snake_case` and `camelCase`).
3.3 Python Style
- Defer to Ruff/PEP 8. Only flag issues a linter wouldn't catch (e.g., misleading variable names in an econometric context).
3.4 Cross-Language Style
- Dead code: Large commented-out blocks, unused variables, unreachable branches.
- Copy-paste code: Repeated blocks that should be a function/loop (see the sketch after this list).
- Naming: Variable names that don't convey meaning (e.g., `x1`, `temp2`, `var_new`).
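A minimal sketch of collapsing copy-pasted blocks into a loop (outcome names are illustrative):

```stata
* Replace three near-identical regression blocks with one loop
foreach outcome in income consumption savings {
    regress `outcome' i.treatment age i.district, vce(cluster district)
    estimates store `outcome'
}
```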
Step 4: Documentation Checks
4.1 File-Level
- Does the file have a header comment explaining: purpose, inputs, outputs, author, date? (See the template after this list.)
- Is it clear where this file fits in the pipeline (what runs before/after it)?
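A minimal sketch of the header this check looks for (fields and file names are illustrative, not a required template):

```stata
/*==============================================================================
  01_clean_baseline.do
  Purpose : Clean the raw baseline survey and build the household-level file
  Inputs  : data/raw/baseline_survey.dta
  Outputs : data/clean/households.dta
  Author  : [name], last updated 2024-01-15
  Order   : runs after 00_setup.do; feeds 02_construct_indices.do
==============================================================================*/
```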
4.2 Data Transformations
- Are merge ratios documented (e.g., "expect 1:1 merge, N = 5,000")?
- Are sample restrictions explained (why drop these observations)?
- Are variable constructions documented (how is this index built)?
4.3 Analysis
- Are regression specifications motivated (why these controls? why this functional form)?
- Are robustness checks documented (what is being tested and why)?
- Is it clear which tables/figures each code block produces?
Step 5: Pipeline-Specific Checks (pipeline mode only)
Skip unless mode = pipeline or replication.
5.1 Execution Order
- Is there a master file that runs everything in order? (See the sketch after this list.)
- Can the full pipeline run from a single command ("push-button replication")?
- Are there files that must be run manually or out of order?
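A minimal sketch of the push-button master file this subsection looks for (script names are illustrative):

```stata
* master.do — runs the full pipeline from raw data to tables and figures
clear all
do "code/config.do"                       // paths, version, settings

do "code/dofiles/01_clean.do"
do "code/dofiles/02_construct.do"
do "code/dofiles/03_analysis.do"
do "code/dofiles/04_tables_figures.do"
```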
5.2 Data Flow
- Trace the data from raw inputs to final outputs. Map: raw data → cleaning → construction → analysis → tables/figures.
- Flag any breaks in the chain (a file reads data that no previous file creates).
- Flag any data files that are created but never used downstream.
5.3 Runtime
- Estimate total runtime if possible (flag long-running operations).
- Are there expensive operations that could be cached or skipped on re-runs?
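A minimal sketch of caching an expensive intermediate step (the file name and rebuild flag are illustrative assumptions):

```stata
* Skip a slow construction step when its output already exists
capture confirm file "data/temp/matched_sample.dta"
if _rc | "$rebuild" == "1" {
    do "code/dofiles/02b_match_records.do"    // the slow step
}
use "data/temp/matched_sample.dta", clear
```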
Step 6: Replication Package Checks (replication mode only)
Skip unless mode = replication.
Run the AEA Data Editor checklist. For each item, assess: Met / Partial / Missing / Can't Assess.
6.1 README
- Follows AEA template structure (or equivalent)?
- Data availability statements for each data source?
- Computational requirements (software, hardware, runtime, storage)?
- Instructions for replicators (clear step-by-step)?
6.2 Data
- Data citations in standard format (author, title, distributor, date, DOI)?
- License/terms of use for each dataset?
- Access instructions for restricted data?
- PII check — any risk of identifiable information?
6.3 Code
- All code included and runnable?
- Package/dependency management (Stata: package-installer do-file; R: `renv.lock`; Python: `requirements.txt`)?
- Version pinning (Stata version, R version, Python version)?
- Output mapping: which script produces which table/figure?
6.4 Outputs
- All tables and figures in the paper reproducible from provided code + data?
- In-text statistics traceable to code?
- Appendix materials included?
6.5 Legal and Ethical
- LICENSE file present?
- IRB approval documented?
- RCT registration cited?
- Data use agreements acknowledged?
Step 7: Generate Output
Classify each finding by severity:
- CRITICAL — Will produce wrong results or prevent replication. Fix immediately.
- HIGH — Significant reproducibility risk or code quality issue. Fix before sharing.
- MEDIUM — Style or documentation issue that makes code harder to review. Fix when convenient.
- LOW — Nitpick. Optional improvement.
Save the report to the same directory as the reviewed file:
review_[filename]_[YYYY-MM-DD].md
For pipeline/replication reviews, save to the project root:
code_review_[project]_[YYYY-MM-DD].md
If the fix flag is set, also save a corrected version:
[filename]_reviewed.[ext]
Tell the user the full path to the output file(s).
Output Format
# Code Review: [filename or project name]
**Date:** [YYYY-MM-DD]
**Mode:** [quick / full / pipeline / replication]
**Language:** [Stata / R / Python]
**File(s) reviewed:** [path(s)]
**Reviewer:** /code-review skill v1.0
**Standards:** DIME Analytics, Gentzkow-Shapiro, AEA Data Editor
---
## Summary
**Overall assessment:** [Clean / Minor Issues / Needs Revision / Significant Problems]
**Findings:** [N] critical, [N] high, [N] medium, [N] low
[2-3 sentence summary of the most important findings.]
---
## Critical & High Findings
### F1: [Title]
- **Severity:** [CRITICAL / HIGH]
- **Category:** [Correctness / Reproducibility / Documentation / Pipeline / Replication]
- **Location:** [file:line_number or file:section]
- **Issue:** [What's wrong]
- **Risk:** [What could go wrong if unfixed]
- **Fix:** [Specific recommendation]
[Repeat for each critical/high finding]
---
## Medium Findings
### F[N]: [Title]
- **Severity:** MEDIUM
- **Category:** [category]
- **Location:** [location]
- **Issue:** [description]
- **Fix:** [recommendation]
[Repeat]
---
## Low Findings
[Brief list format — one line per finding]
- **F[N]:** [location] — [issue] → [fix]
---
## File Summary Table
| Check Category | Status | Issues Found |
|---------------|--------|-------------|
| Correctness | [pass/warn/fail] | [count] |
| Reproducibility | [pass/warn/fail] | [count] |
| Style | [pass/warn/fail] | [count] |
| Documentation | [pass/warn/fail] | [count] |
| Pipeline (if applicable) | [pass/warn/fail] | [count] |
| Replication (if applicable) | [pass/warn/fail] | [count] |
---
## Checklist (replication mode only)
| AEA Requirement | Status | Notes |
|----------------|--------|-------|
| [requirement] | [Met/Partial/Missing/Can't Assess] | [details] |
---
## Next Steps
1. [Highest-priority action]
2. [Second priority]
3. [Third priority]
Principles
- Correctness over style. A well-formatted file with a wrong merge is worse than ugly code that produces correct results. Always prioritize findings that affect results.
- Economics-aware. Understand that `reghdfe` is a regression, that `_merge` matters, that `collapse` destroys data, that missing values sort high in Stata. Generic code review advice is insufficient.
- Actionable findings. Every issue comes with a specific fix. "This could be improved" is not a finding — "Line 47: `merge` without `_merge` check; add `assert _merge == 3` or `tab _merge` after merge" is.
- Calibrate severity honestly. Not everything is critical. A missing comment is LOW. A merge without diagnostics is HIGH. A hardcoded path is HIGH. An abbreviated command is LOW.
- Do not fabricate. If you cannot determine whether a pattern is an error or intentional, flag it as a question, not a finding. Say "Verify: is this intentional?" rather than "Bug: this is wrong."
- Respect the author's intent. Unusual patterns may be intentional. Flag them, explain the risk, but don't assume they're mistakes.
- No scope creep. Review the code as requested. Do not rewrite the analysis, suggest different specifications, or critique the research design (that's what `/review-paper` is for).