Stata Data Cleaning and Analysis Skill
Contents
Core Principles
| Principle |
Description |
| Reproducible |
Code produces identical outputs when run multiple times |
| Defensive |
Assert statements verify data meets expected conditions |
| Documented |
Comments explain why decisions were made, not just what |
| No PII |
Never process personally identifiable information with AI tools |
Four-Stage Data Flow
- Import - Combine data into Stata format, apply corrections, remove duplicates
- Deidentify - Remove PII as early as possible
- Clean - Standardize formats, verify consistency
- Construct - Build analysis variables through merging/appending
Project Configuration
Do-File Header
clear all
set more off
version 17.0
set maxvar 5000
Path Setup
global data "$root/data"
global output "$root/output"
use "$data/raw/survey.dta", clear
save "$data/clean/survey_clean.dta", replace
Coding Standards Quick Reference
Variable Naming
| Prefix |
Meaning |
Example |
hh_ |
Household |
hh_income |
ind_ |
Individual |
ind_age |
bl_/el_ |
Baseline/Endline |
bl_score |
d_ |
Dummy/indicator |
d_employed |
n_ |
Count |
n_children |
Command Abbreviations
| Safe to abbreviate |
Never abbreviate |
gen, reg, lab, sum, tab |
local, global, save, merge |
bys, qui, noi, cap, forv |
append, sort, drop, keep |
Conditionals
replace status = 1 if (employed == 1) & !missing(income)
drop if missing(respondent_id)
replace status = 1 if employed & income
drop if respondent_id >= .
Line Breaking
regress income
age i.education i.region
if (sample == 1),
vce(cluster village_id)
Data Cleaning Workflow
1. Import and Inspect
import delimited "$data/raw/survey.csv", clear varnames(1)
describe
codebook, compact
2. Verify Identifiers
duplicates report respondent_id
duplicates tag respondent_id, gen(dup_flag)
isid respondent_id
3. Clean Variables
rename (q1 q2 q3) (resp_age resp_gender resp_education)
assert inrange(age, 0, 120) if !missing(age)
replace name = strtrim(strproper(name))
4. Document
label var resp_age "Respondent age in years"
label define gender_lbl 1 "Male" 2 "Female"
label values resp_gender gender_lbl
notes _dta: "Cleaned on `c(current_date)'"
5. Save and Verify
compress
save "$data/clean/survey_clean.dta", replace
Missing Values
IPA Extended Missing Conventions
| Raw Code |
Stata |
Meaning |
| -99 |
.d |
Don't know |
| -98 |
.r |
Refused |
| -97 |
.n |
Not applicable |
| -96 |
.s |
Skipped |
| -95 |
.o |
Other missing |
Recoding
mvdecode _all, mv(-99=.d \ -98=.r \ -97=.n \ -96=.s)
misstable summarize
Common Operations
Merging
use "$data/clean/household.dta", clear
count
local pre_merge = r(N)
merge 1:1 hhid using "$data/admin/treatment.dta"
tab _merge
assert _merge != 2
keep if _merge == 3
drop _merge
Appending
use "$data/clean/baseline.dta", clear
gen wave = 1
append using "$data/clean/endline.dta"
replace wave = 2 if missing(wave)
Reshaping
reshape long income_, i(hhid) j(year)
rename income_ income
reshape wide income, i(hhid) j(year)
Quality Checks
summarize, detail
tabstat income expenditure, stats(n mean sd min max)
egen income_std = std(income)
list hhid income if abs(income_std) > 3
tab gender pregnant, missing
assert pregnant == . | pregnant == 0 if gender == 1
Troubleshooting
Assert Failures
- Examine failing observations:
list if !(condition)
- Check for unexpected missing values
- Verify data source and transformations
- Document exceptions if valid
Merge Issues
- Check
_merge distribution with tab _merge
- Investigate unmatched:
list if _merge == 1 or _merge == 2
- Verify key variable types match (string vs numeric)
- Check for leading/trailing spaces in string keys
Performance
- Load only needed variables:
use var1 var2 using "data.dta"
- Reshape to long format before loops
- Use
quietly to suppress output in loops
- Increase
maxvar only when necessary
References
Project References
External Resources
Linting
just lint-stata
just lint-stata-file scripts/01.do
Common Packages
ssc install ietoolkit
ssc install estout
ssc install fre