stata
SKILL.md
Stata Data Cleaning and Analysis Skill
Contents
- Core Principles
- Project Configuration
- Coding Standards Quick Reference
- Data Cleaning Workflow
- Missing Values
- Common Operations
- Quality Checks
- Troubleshooting
- References
Core Principles
| Principle | Description |
|---|---|
| Reproducible | Code produces identical outputs when run multiple times |
| Defensive | Assert statements verify data meets expected conditions |
| Documented | Comments explain why decisions were made, not just what |
| No PII | Never process personally identifiable information with AI tools |
Four-Stage Data Flow
- Import - Combine data into Stata format, apply corrections, remove duplicates
- Deidentify - Remove PII as early as possible
- Clean - Standardize formats, verify consistency
- Construct - Build analysis variables through merging/appending
Project Configuration
Do-File Header
* ==============================================================================
* Project: [Project Name]
* Purpose: [Brief description]
* Author: [Name]
* Created: [Date]
* ==============================================================================
clear all
set more off
version 17.0
set maxvar 5000 // Increase only if genuinely needed
Path Setup
* Define paths in master do-file (use forward slashes)
global data "$root/data"
global output "$root/output"
* Usage - always use globals, never cd
use "$data/raw/survey.dta", clear
save "$data/clean/survey_clean.dta", replace
Coding Standards Quick Reference
Variable Naming
| Prefix | Meaning | Example |
|---|---|---|
hh_ |
Household | hh_income |
ind_ |
Individual | ind_age |
bl_/el_ |
Baseline/Endline | bl_score |
d_ |
Dummy/indicator | d_employed |
n_ |
Count | n_children |
Command Abbreviations
| Safe to abbreviate | Never abbreviate |
|---|---|
gen, reg, lab, sum, tab |
local, global, save, merge |
bys, qui, noi, cap, forv |
append, sort, drop, keep |
Conditionals
* Good - explicit and clear
replace status = 1 if (employed == 1) & !missing(income)
drop if missing(respondent_id)
* Bad - implicit or unclear
replace status = 1 if employed & income
drop if respondent_id >= .
Line Breaking
regress income ///
age i.education i.region ///
if (sample == 1), ///
vce(cluster village_id)
Data Cleaning Workflow
1. Import and Inspect
import delimited "$data/raw/survey.csv", clear varnames(1)
describe
codebook, compact
2. Verify Identifiers
duplicates report respondent_id
duplicates tag respondent_id, gen(dup_flag)
* Investigate and resolve duplicates
isid respondent_id // Assert uniqueness
3. Clean Variables
* Rename to convention
rename (q1 q2 q3) (resp_age resp_gender resp_education)
* Validate ranges
assert inrange(age, 0, 120) if !missing(age)
* Clean strings
replace name = strtrim(strproper(name))
4. Document
label var resp_age "Respondent age in years"
label define gender_lbl 1 "Male" 2 "Female"
label values resp_gender gender_lbl
notes _dta: "Cleaned on `c(current_date)'"
5. Save and Verify
compress
save "$data/clean/survey_clean.dta", replace
Missing Values
IPA Extended Missing Conventions
| Raw Code | Stata | Meaning |
|---|---|---|
| -99 | .d |
Don't know |
| -98 | .r |
Refused |
| -97 | .n |
Not applicable |
| -96 | .s |
Skipped |
| -95 | .o |
Other missing |
Recoding
* Using mvdecode (efficient)
mvdecode _all, mv(-99=.d \ -98=.r \ -97=.n \ -96=.s)
* Check missing patterns
misstable summarize
Common Operations
Merging
use "$data/clean/household.dta", clear
count
local pre_merge = r(N)
merge 1:1 hhid using "$data/admin/treatment.dta"
tab _merge
assert _merge != 2 // No unmatched using expected
keep if _merge == 3
drop _merge
Appending
use "$data/clean/baseline.dta", clear
gen wave = 1
append using "$data/clean/endline.dta"
replace wave = 2 if missing(wave)
Reshaping
* Wide to long
reshape long income_, i(hhid) j(year)
rename income_ income
* Long to wide
reshape wide income, i(hhid) j(year)
Quality Checks
* Summary statistics
summarize, detail
tabstat income expenditure, stats(n mean sd min max)
* Outlier detection
egen income_std = std(income)
list hhid income if abs(income_std) > 3
* Cross-tabulation consistency
tab gender pregnant, missing
assert pregnant == . | pregnant == 0 if gender == 1
Troubleshooting
Assert Failures
- Examine failing observations:
list if !(condition) - Check for unexpected missing values
- Verify data source and transformations
- Document exceptions if valid
Merge Issues
- Check
_mergedistribution withtab _merge - Investigate unmatched:
list if _merge == 1or_merge == 2 - Verify key variable types match (string vs numeric)
- Check for leading/trailing spaces in string keys
Performance
- Load only needed variables:
use var1 var2 using "data.dta" - Reshape to long format before loops
- Use
quietlyto suppress output in loops - Increase
maxvaronly when necessary
References
Project References
- Coding Standards - Complete DIME Analytics Stata standards
- Data Cleaning Checklist - Step-by-step checklist
- Missing Values Guide - IPA missing value conventions
External Resources
Linting
just lint-stata # Lint all do-files
just lint-stata-file scripts/01.do # Lint specific file
Common Packages
ssc install ietoolkit // DIME tools
ssc install estout // Tables
ssc install fre // Frequencies
Weekly Installs
1
Repository
smithery/aiFirst Seen
3 days ago
Installed on
amp1
opencode1
cursor1
kimi-cli1
codex1
github-copilot1