# stat-eda
## Exploratory Data Analysis (EDA) Framework
### IRON LAW: Perform EDA Only AFTER Train/Test Split — Or You Leak the Future
Agents know "do EDA first." But they almost always do EDA on the FULL
dataset before splitting. This is information leakage: you've seen the
test set's distributions, outliers, and correlations, and your subsequent
modeling choices (feature scaling, outlier treatment, imputation strategy)
are now informed by data the model shouldn't see. Split first, then EDA
only on the training set. Apply the same transformations to the test set
without re-examining it.
**Exception:** data quality checks (nulls, dtypes, duplicates) CAN run on the full dataset, since they don't inform model hyperparameters.
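A minimal sketch of the split-first discipline (the helper name and signature are illustrative, not part of the skill; the split is done by hand with numpy to keep the example dependency-free, where a real pipeline would typically use `sklearn.model_selection.train_test_split`):

```python
import numpy as np
import pandas as pd

def split_then_describe(df: pd.DataFrame, test_frac: float = 0.2, seed: int = 0):
    """Split first, then compute summary statistics on the TRAIN set only.

    The point: .describe() never touches the held-out rows, so no test-set
    distribution can leak into later modeling choices.
    """
    rng = np.random.default_rng(seed)
    test_idx = rng.choice(len(df), size=int(len(df) * test_frac), replace=False)
    mask = np.zeros(len(df), dtype=bool)
    mask[test_idx] = True
    train, test = df[~mask], df[mask]
    # EDA runs on train only; test stays unexamined until final evaluation.
    return train, test, train.describe()

df = pd.DataFrame({"x": range(100), "y": range(100, 200)})
train, test, summary = split_then_describe(df)
```

Any transformation fitted during EDA (scaler parameters, imputation values, outlier caps) is then applied to `test` as-is, never re-fitted on it.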
### EDA Workflow
Standard five-phase flow (structure → quality → univariate → bivariate → findings summary). Assume the agent already knows these steps. Focus on the non-obvious traps below instead.
**Critical additions most EDA guides miss:**
- **Split BEFORE explore** (see IRON LAW above)
- **Missing data pattern matters more than count:** MCAR is safe to impute; MNAR (e.g. high-income respondents skip the income question) requires domain modeling, not mean-fill
- **Simpson's paradox check:** if a trend holds in the aggregate but reverses within subgroups, the aggregate trend is misleading. Always stratify by the most obvious confound before reporting a bivariate finding
- **Data leakage in features:** a feature that perfectly correlates with the target is usually derived FROM the target (e.g. "refund_amount" predicting churn — it's an effect, not a cause). Flag any feature with |r| > 0.95 for causal review
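The leakage flag in the last bullet can be sketched as a one-pass correlation screen (the helper name and the synthetic columns are hypothetical, chosen to mirror the "refund_amount" example):

```python
import numpy as np
import pandas as pd

def flag_leaky_features(df: pd.DataFrame, target: str, r_thresh: float = 0.95):
    """Return numeric features whose |Pearson r| with the target exceeds
    r_thresh, for manual causal review. A sketch: real leakage checks also
    need domain knowledge, since not every high-r feature is derived data."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr[corr.abs() > r_thresh].index.tolist()

# "refund_amount" is deliberately derived from churn, so it should be flagged.
rng = np.random.default_rng(1)
churn = rng.integers(0, 2, size=200)
df = pd.DataFrame({
    "tenure": rng.normal(24, 6, size=200),   # independent of churn
    "refund_amount": churn * 50.0,           # an effect of churn, not a cause
    "churn": churn,
})
leaky = flag_leaky_features(df, "churn")
```

Anything flagged goes to a human question: could this value exist before the target event happened? If not, drop it.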
For missing-data pattern diagnosis and handling strategies, see references/missing-data.md.
### Output Format
# EDA Report: {Dataset Name}
## Dataset Overview
- Rows: {N}, Columns: {N}
- Date range: {if applicable}
- Key columns: {description}
## Data Quality
| Issue | Columns Affected | Count/% | Action |
|-------|-----------------|---------|--------|
| Missing values | {cols} | {N / %} | {drop / impute / investigate} |
| Outliers | {cols} | {N} | {cap / remove / keep} |
| Duplicates | — | {N} | {remove} |
## Key Statistics
| Variable | Mean | Median | Std | Min | Max | Distribution |
|----------|------|--------|-----|-----|-----|-------------|
| {var} | ... | ... | ... | ... | ... | {normal/skewed/bimodal} |
## Key Findings
1. {insight with supporting data}
2. {insight}
3. {insight}
## Recommendations
- {next analysis step or data issue to resolve}
### Gotchas
- **Correlation ≠ causation:** EDA finds associations. Establishing causation requires controlled experiments or causal inference methods.
- **Outliers can be data errors OR real signal:** don't auto-remove; investigate. A transaction amount of $1M might be a typo or your biggest customer.
- **Missing data has meaning:** data missing from one column may be related to values in another. "Missing income" may mean "unemployed", not random. Check patterns.
- **Visualization lies:** truncated Y-axes, cherry-picked time ranges, and misleading scales can distort insights. Always use appropriate scales and note limitations.
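The "missing data has meaning" gotcha can be checked with a simple group comparison: compare another variable between rows where the suspect column is missing and rows where it is present. A large gap suggests the data is not MCAR. This is a heuristic sketch (the function name and simulated columns are illustrative; a formal check would use something like Little's MCAR test):

```python
import numpy as np
import pandas as pd

def missingness_vs_column(df: pd.DataFrame, missing_col: str, other_col: str):
    """Mean of `other_col` for rows where `missing_col` is missing vs present.
    A big difference hints the missingness is informative (MAR/MNAR)."""
    is_missing = df[missing_col].isna()
    return (df.loc[is_missing, other_col].mean(),
            df.loc[~is_missing, other_col].mean())

# Simulated MNAR: high earners skip the income question, and age tracks income.
rng = np.random.default_rng(2)
income = rng.normal(60_000, 15_000, size=1_000)
age = 30 + (income - 60_000) / 1_000 + rng.normal(0, 2, size=1_000)
reported = np.where(income > 75_000, np.nan, income)
df = pd.DataFrame({"income": reported, "age": age})

miss_mean, present_mean = missingness_vs_column(df, "income", "age")
# The groups differ sharply in age, so mean-filling income would bias it low.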
### References
- For missing data handling strategies, see references/missing-data.md