data-science-eda

Exploratory Data Analysis (EDA)

Use this skill for understanding datasets before modeling: profiling distributions, detecting anomalies, identifying relationships, and assessing data quality.

When to use this skill

  • New dataset — need orientation on structure, types, distributions
  • Before feature engineering — understand variable relationships
  • Data quality investigation — find anomalies, missing patterns, outliers
  • Model preparation — validate assumptions about data

Core EDA workflow

  1. Profile structure
    • Schema, types, cardinality
    • Missing value patterns
  2. Analyze distributions
    • Numerical: histograms, boxplots, skewness
    • Categorical: frequencies, rare categories
  3. Explore relationships
    • Correlation matrix (numerical)
    • Cross-tabulations (categorical)
    • Target-variable relationships
  4. Identify issues
    • Outliers, duplicates, inconsistencies
    • Class imbalance (classification)
    • Temporal patterns (time series)
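
The first steps of the workflow above can be sketched with plain pandas. This is a minimal illustration, not a prescribed implementation: the `profile_structure` helper and the toy columns (`age`, `segment`, `spend`) are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd


def profile_structure(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, cardinality, and missing-value share."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_unique": df.nunique(dropna=True),
            "missing_frac": df.isna().mean(),
        }
    )


# Hypothetical toy data, for illustration only.
toy = pd.DataFrame(
    {
        "age": [34, 51, np.nan, 29, 51],
        "segment": ["a", "b", "b", None, "b"],
        "spend": [120.0, 80.5, 80.5, 300.0, 80.5],
    }
)

# Step 1: schema, cardinality, missing values.
summary = profile_structure(toy)

# Step 2: skewness for numerical columns, frequencies for categorical ones.
skewness = toy.select_dtypes("number").skew()
segment_freq = toy["segment"].value_counts(normalize=True, dropna=False)

# Step 4: fully duplicated rows.
n_duplicate_rows = int(toy.duplicated().sum())

print(summary)
print(skewness)
print(segment_freq)
print("duplicate rows:", n_duplicate_rows)
```

Keeping the summary as a DataFrame makes it easy to sort by `missing_frac` or `n_unique` and to paste into a findings document.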

Quick tool selection

| Task | Default choice | Notes |
| --- | --- | --- |
| Automated profiling | ydata-profiling (formerly pandas-profiling) | Fast, comprehensive reports |
| Interactive exploration | ipywidgets + plotly | Drill-down capability |
| Statistical tests | scipy.stats | Normality, correlations |
| Large datasets | Polars (lazy API) | Memory-efficient |
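
For the statistical tests entry above, a normality check and two correlation measures from `scipy.stats` might look like the sketch below. The data is synthetic and the variable names are illustrative assumptions. On real data the choice of test depends on sample size and on the question being asked.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=500)  # synthetic right-skewed sample
y = np.exp(x)  # strictly increasing in x, but far from linear

# Shapiro-Wilk: a small p-value is evidence against normality.
shapiro_stat, shapiro_p = stats.shapiro(x)

# Pearson measures linear association; Spearman measures monotone association.
pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)

print(f"Shapiro-Wilk p-value: {shapiro_p:.3g}")
print(f"Pearson r: {pearson_r:.3f}, Spearman rho: {spearman_rho:.3f}")
```

A large gap between the Pearson and Spearman values is a hint that a relationship is monotone but non-linear, which is worth noting before feature engineering.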

Core implementation rules

1) Start with automated profiling

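import polars as pl
from ydata_profiling import ProfileReport

# Load with Polars, then convert: ydata-profiling works on a pandas DataFrame
# (to_pandas() needs pandas and pyarrow installed).
df = pl.read_parquet("data.parquet")
profile = ProfileReport(df.to_pandas(), title="Data Profile")
profile.to_file("profile_report.html")  # writes a standalone HTML report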

2) Focus on actionable insights

  • Document outliers worth investigating (not all outliers are problems)
  • Flag features with high cardinality or rare categories
  • Note strong correlations that may cause multicollinearity
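
The three checks above can be turned into small helpers. The following is a sketch under stated assumptions: the function names, the thresholds (Tukey's 1.5 × IQR rule, a 5% rarity cutoff, |r| ≥ 0.9, a 0.5 unique-value ratio), and the synthetic columns are illustrative choices made here, not values the skill prescribes.

```python
import numpy as np
import pandas as pd


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Mark values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


def rare_categories(s: pd.Series, min_frac: float = 0.05) -> list:
    """Categories whose relative frequency falls below min_frac."""
    freq = s.value_counts(normalize=True)
    return sorted(freq[freq < min_frac].index)


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Numeric column pairs with |Pearson r| at or above threshold."""
    corr = df.select_dtypes("number").corr().abs()
    cols = list(corr.columns)
    return [
        (a, b)
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
        if corr.loc[a, b] >= threshold
    ]


# Synthetic data, for illustration only.
rng = np.random.default_rng(42)
n = 200
height_cm = rng.normal(170, 10, n)
income = rng.normal(50_000, 5_000, n)
income[0] = 1_000_000.0  # planted extreme value

df = pd.DataFrame(
    {
        "user_id": [f"u{i}" for i in range(n)],
        "city": ["A"] * 120 + ["B"] * 75 + ["C"] * 5,
        "height_cm": height_cm,
        "height_in": height_cm / 2.54,  # redundant with height_cm
        "income": income,
    }
)

outliers = iqr_outlier_mask(df["income"])
rare = rare_categories(df["city"])
pairs = correlated_pairs(df)
high_cardinality = [
    c for c in df.select_dtypes(exclude="number")
    if df[c].nunique() / len(df) > 0.5
]

print("rows to investigate:", df.index[outliers].tolist())
print("rare categories:", rare)
print("highly correlated pairs:", pairs)
print("high-cardinality columns:", high_cardinality)
```

The outlier mask only nominates rows for investigation. Whether a flagged value is an error or a legitimate extreme is a domain question.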

3) Visualize for communication

  • Distribution plots for key variables
  • Correlation heatmap
  • Missing value patterns
  • Target relationship plots
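
One possible way to put the four plot types above on a single overview figure with Matplotlib is sketched below. The data is synthetic, and the column names (`x1`, `x2`, `target`) are placeholders chosen for this example.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.lognormal(size=n)})
df["target"] = 2.0 * df["x1"] + rng.normal(scale=0.5, size=n)
df.loc[rng.choice(n, size=30, replace=False), "x2"] = np.nan

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Distribution plot for a key variable.
axes[0, 0].hist(df["x2"].dropna(), bins=30)
axes[0, 0].set_title("Distribution of x2")

# Correlation heatmap.
corr = df.corr()
im = axes[0, 1].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[0, 1].set_xticks(range(len(corr)))
axes[0, 1].set_xticklabels(corr.columns)
axes[0, 1].set_yticks(range(len(corr)))
axes[0, 1].set_yticklabels(corr.columns)
fig.colorbar(im, ax=axes[0, 1])
axes[0, 1].set_title("Correlation matrix")

# Missing-value share per column.
missing = df.isna().mean()
axes[1, 0].bar(missing.index, missing.values)
axes[1, 0].set_title("Share of missing values")

# Target relationship plot.
axes[1, 1].scatter(df["x1"], df["target"], s=8, alpha=0.5)
axes[1, 1].set_xlabel("x1")
axes[1, 1].set_ylabel("target")
axes[1, 1].set_title("target vs. x1")

fig.tight_layout()
```

Calling `fig.savefig(...)` afterwards exports the figure for a report. Seaborn or Plotly could be swapped in for any panel.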

4) Validate assumptions

  • Check for expected ranges/business rules
  • Verify temporal consistency
  • Confirm key relationships match domain knowledge
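
Range and temporal checks like those above can be written as named boolean rules, so that each violation traces back to a specific assumption. The `orders` table and both rules below are hypothetical examples. Real rules come from the domain.

```python
import pandas as pd

# Hypothetical order data, for illustration only.
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "quantity": [2, 0, 5, -1],
        "ordered_at": pd.to_datetime(
            ["2024-01-03", "2024-01-05", "2024-01-04", "2024-01-09"]
        ),
        "shipped_at": pd.to_datetime(
            ["2024-01-04", "2024-01-06", "2024-01-02", "2024-01-10"]
        ),
    }
)

# Each rule is a boolean Series: True where the row satisfies the assumption.
rules = {
    "quantity_positive": orders["quantity"] > 0,
    "shipped_after_ordered": orders["shipped_at"] >= orders["ordered_at"],
}

# Collect the ids of rows that break each rule.
violations = {
    name: orders.loc[~ok, "order_id"].tolist() for name, ok in rules.items()
}

# Temporal consistency: is the table in time order, and how large are the gaps?
is_chronological = bool(orders["ordered_at"].is_monotonic_increasing)
largest_gap = orders["ordered_at"].sort_values().diff().max()

print(violations)
print("chronological:", is_chronological, "| largest gap:", largest_gap)
```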

Common anti-patterns

  • ❌ Skipping EDA and jumping to modeling
  • ❌ Treating all outliers as errors
  • ❌ Ignoring missing value mechanisms (MCAR/MAR/MNAR)
  • ❌ Over-plotting large datasets without sampling
  • ❌ Not documenting findings for the team
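
Two of these anti-patterns have inexpensive countermeasures, sketched below on synthetic data. The first probe compares missingness rates across an observed variable. A clear difference argues against MCAR, but the probe cannot distinguish MAR from MNAR. The second step draws a fixed-size random sample before plotting. The column names, the age threshold, and the sample size are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data, for illustration only.
rng = np.random.default_rng(7)
n = 2_000
df = pd.DataFrame({"age": rng.integers(18, 80, size=n)})
df["income"] = rng.normal(50_000, 10_000, size=n)

# Planted pattern: income is missing far more often for respondents under 30.
p_missing = np.where(df["age"] < 30, 0.5, 0.05)
df.loc[rng.random(n) < p_missing, "income"] = np.nan

# Probe: does the missingness rate vary with another observed variable?
age_group = pd.Series(
    np.where(df["age"] < 30, "under_30", "30_plus"),
    index=df.index,
    name="age_group",
)
missing_rate = df["income"].isna().groupby(age_group).mean()

# Sample before plotting instead of drawing every point.
plot_sample = df.sample(n=500, random_state=0)

print(missing_rate)
print("rows used for plotting:", len(plot_sample))
```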

Progressive disclosure

  • ../references/automated-profiling.md — ydata-profiling, Sweetviz, D-Tale
  • ../references/visualization-patterns.md — Matplotlib, Seaborn, Plotly patterns
  • ../references/statistical-tests.md — Scipy statistical tests guide
  • ../references/large-dataset-eda.md — Sampling, Polars, Dask approaches

Related skills

  • @data-science-feature-engineering — Next step after EDA
  • @data-science-model-evaluation — Validate modeling assumptions
  • @data-engineering-quality — Data validation frameworks
