---
name: data-science-eda
description: "Use this skill for understanding datasets before modeling: profiling distributions, detecting anomalies, identifying relationships, and assessing data quality."
---

# Exploratory Data Analysis (EDA)
## When to use this skill
- New dataset — need orientation on structure, types, distributions
- Before feature engineering — understand variable relationships
- Data quality investigation — find anomalies, missing patterns, outliers
- Model preparation — validate assumptions about data
## Core EDA workflow
1. Profile structure (see the sketch after this list)
   - Schema, types, cardinality
   - Missing value patterns
2. Analyze distributions
   - Numerical: histograms, boxplots, skewness
   - Categorical: frequencies, rare categories
3. Explore relationships
   - Correlation matrix (numerical)
   - Cross-tabulations (categorical)
   - Target-variable relationships
4. Identify issues
   - Outliers, duplicates, inconsistencies
   - Class imbalance (classification)
   - Temporal patterns (time series)
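A minimal sketch of steps 1 and 2 with Polars; `data.parquet` is a placeholder path:

```python
import polars as pl

df = pl.read_parquet("data.parquet")  # placeholder path

# Profile structure: schema, per-column null counts, cardinality
print(df.schema)
print(df.null_count())
print(df.select(pl.all().n_unique()))

# Distributions at a glance: count, mean, std, min/max, quantiles
print(df.describe())
```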
## Quick tool selection
| Task | Default choice | Notes |
|---|---|---|
| Automated profiling | ydata-profiling (formerly pandas-profiling) | Fast, comprehensive reports |
| Interactive exploration | ipywidgets + plotly | Drill-down capability |
| Statistical tests | scipy.stats | Normality, correlations |
| Large datasets | Polars + lazy | Memory-efficient; sketch below |
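For the last row, a minimal lazy-scan sketch; the file name and the columns `amount` and `category` are placeholders:

```python
import polars as pl

# Lazy scan: nothing is read until .collect(), and only the
# columns the query needs are materialized
lazy = pl.scan_parquet("data.parquet")  # placeholder path

summary = (
    lazy
    .select(
        pl.col("amount").mean().alias("amount_mean"),
        pl.col("amount").std().alias("amount_std"),
        pl.col("category").n_unique().alias("category_cardinality"),
    )
    .collect()
)
print(summary)
```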
## Core implementation rules
1) Start with automated profiling
```python
import polars as pl
from ydata_profiling import ProfileReport

df = pl.read_parquet("data.parquet")

# ydata-profiling expects a pandas DataFrame, so convert before profiling
profile = ProfileReport(df.to_pandas(), title="Data Profile")
profile.to_file("profile_report.html")
```
2) Focus on actionable insights
- Document outliers worth investigating (not all outliers are problems)
- Flag features with high cardinality or rare categories
- Note strong correlations that may cause multicollinearity
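One way to turn these flags into code, as a sketch; the thresholds and placeholder file name are assumptions to tune per dataset, not fixed rules:

```python
import numpy as np
import polars as pl
import polars.selectors as cs

df = pl.read_parquet("data.parquet")  # placeholder path

# Flag high-cardinality string columns (threshold is an assumption)
CARDINALITY_THRESHOLD = 50
for name in df.select(cs.string()).columns:
    n = df[name].n_unique()
    if n > CARDINALITY_THRESHOLD:
        print(f"High cardinality: {name} ({n} levels)")

# Flag strong pairwise correlations among numeric columns;
# |r| > 0.9 is an assumed cutoff for multicollinearity concern
corr = df.select(cs.numeric()).to_pandas().corr()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strong = upper.stack()
print(strong[strong.abs() > 0.9])
```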
3) Visualize for communication
- Distribution plots for key variables
- Correlation heatmap
- Missing value patterns
- Target relationship plots
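A minimal sketch of two of these plots with Matplotlib and Seaborn, assuming the same placeholder Parquet file:

```python
import matplotlib.pyplot as plt
import polars as pl
import seaborn as sns

pdf = pl.read_parquet("data.parquet").to_pandas()  # placeholder path

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Correlation heatmap over numeric columns
sns.heatmap(pdf.select_dtypes(include="number").corr(),
            cmap="coolwarm", center=0, ax=ax1)
ax1.set_title("Correlation heatmap")

# Missing-value pattern: share of nulls per column
pdf.isna().mean().sort_values().plot.barh(ax=ax2)
ax2.set_title("Missing share by column")

fig.tight_layout()
fig.savefig("eda_overview.png")  # placeholder output path
```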
4) Validate assumptions
- Check for expected ranges/business rules
- Verify temporal consistency
- Confirm key relationships match domain knowledge
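A sketch of rule-based validation with Polars; the columns (`amount`, `age`, `event_time`) and ranges are hypothetical business rules:

```python
from datetime import datetime

import polars as pl

df = pl.read_parquet("data.parquet")  # placeholder path

# Count rule violations instead of asserting, so every issue
# surfaces in a single pass over the checks
rules = {
    "negative_amount": pl.col("amount") < 0,
    "age_out_of_range": ~pl.col("age").is_between(0, 120),
    "future_event_time": pl.col("event_time") > pl.lit(datetime.now()),
}
for name, rule in rules.items():
    n = df.filter(rule).height
    if n:
        print(f"Rule violated: {name} ({n} rows)")

# Temporal consistency: check that events arrive in order
# (an assumed expectation for this dataset)
print("event_time sorted:", df["event_time"].is_sorted())
```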
## Common anti-patterns
- ❌ Skipping EDA and jumping to modeling
- ❌ Treating all outliers as errors
- ❌ Ignoring missing value mechanisms (MCAR/MAR/MNAR)
- ❌ Over-plotting large datasets without sampling (see the sketch after this list)
- ❌ Not documenting findings for team
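For the over-plotting point, one simple remedy is to plot a fixed-size sample; this sketch assumes hypothetical columns `amount` and `duration`:

```python
import matplotlib.pyplot as plt
import polars as pl

df = pl.read_parquet("data.parquet")  # placeholder path

# Plot a bounded sample instead of millions of raw points
plot_df = df.sample(n=min(10_000, df.height), seed=42).to_pandas()
plot_df.plot.scatter(x="amount", y="duration", alpha=0.3)  # hypothetical columns
plt.show()
```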
## Progressive disclosure
- ../references/automated-profiling.md — ydata-profiling, Sweetviz, D-Tale
- ../references/visualization-patterns.md — Matplotlib, Seaborn, Plotly patterns
- ../references/statistical-tests.md — SciPy statistical tests guide
- ../references/large-dataset-eda.md — Sampling, Polars, Dask approaches
## Related skills
- @data-science-feature-engineering — Next step after EDA
- @data-science-model-evaluation — Validate modeling assumptions
- @data-engineering-quality — Data validation frameworks