exploratory-data-analysis


Exploratory Data Analysis

This skill enables an AI agent to perform structured exploratory data analysis (EDA) on any tabular dataset. The agent systematically profiles the data's shape and types, examines distributions, computes correlations, detects outliers, and produces a summary of findings. EDA is the critical first step before any modeling or reporting, because it reveals what the data actually contains rather than what it is assumed to contain.

Workflow

  1. Load and inspect basic structure. Read the dataset and immediately report its shape (rows, columns), column names, data types, and memory footprint. Display the first 5 and last 5 rows to catch header issues, trailing garbage rows, or encoding artifacts. These checks are cheap to run and catch problems that would otherwise cause confusion downstream.

  2. Assess data quality. Count nulls per column as both an absolute count and a percentage of rows. Identify columns with zero variance (constant values), high-cardinality categoricals (e.g., a "notes" field with a unique value in every row), and mixed-type columns. Build a concise quality scorecard: columns with more than 5% missing values, columns with suspicious types, and the number of duplicate rows.

  3. Analyze distributions of individual variables. For numeric columns, compute mean, median, standard deviation, skewness, and kurtosis. Plot histograms or KDE plots. For categorical columns, show value counts and proportions for the top 10 categories. Flag highly imbalanced distributions (e.g., a binary target where one class is under 5%).

  4. Explore relationships between variables. Compute the full correlation matrix for numeric columns and visualize it as a heatmap. For categorical-vs-numeric relationships, use grouped box plots or violin plots. For categorical-vs-categorical, use contingency tables or mosaic plots. Highlight pairs with correlation above 0.7 or below -0.7.

  5. Detect outliers and anomalies. Apply the IQR method to every numeric column (conventionally, flagging values more than 1.5 × IQR below the first quartile or above the third) and report the count and percentage of outlier values. Visualize outliers with box plots. Cross-reference outliers across columns: a row that is an outlier in several columns at once often represents a data entry error or a genuinely unusual observation.

  6. Synthesize findings into an EDA report. Write a structured summary covering: dataset overview, quality issues found, key distribution characteristics, notable correlations, outlier summary, and recommended next steps (e.g., columns to drop, transformations to apply, features likely to be predictive).
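
Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not part of the skill itself: it assumes pandas is the tool in use (the workflow names no library), and `quality_scorecard`, its parameters, the 0.9 cardinality ratio, and the sample data are all hypothetical. Only the 5% missing-value threshold comes from the workflow.

```python
import pandas as pd


def quality_scorecard(df: pd.DataFrame,
                      missing_threshold_pct: float = 5.0,
                      cardinality_ratio: float = 0.9) -> dict:
    """Hypothetical helper: summarize structure and quality issues of a DataFrame."""
    n_rows = len(df)
    null_counts = df.isna().sum()
    # max(..., 1) guards against division by zero on an empty frame.
    null_pct = null_counts / max(n_rows, 1) * 100
    n_unique = df.nunique(dropna=True)
    non_null = df.notna().sum()

    # Non-numeric columns whose non-null values are (almost) all distinct,
    # e.g. free-text notes. The 0.9 default is an arbitrary assumption.
    non_numeric_cols = df.select_dtypes(exclude="number").columns
    high_cardinality = [
        col for col in non_numeric_cols
        if non_null[col] > 0 and n_unique[col] / non_null[col] > cardinality_ratio
    ]

    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "memory_bytes": int(df.memory_usage(deep=True).sum()),
        "null_counts": null_counts.to_dict(),
        "null_pct": null_pct.round(2).to_dict(),
        "high_missing": [c for c in df.columns if null_pct[c] > missing_threshold_pct],
        "constant": [c for c in df.columns if n_unique[c] <= 1],
        "high_cardinality": high_cardinality,
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Made-up sample data, for illustration only.
sample = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "region": ["north", "south", "south", "north"],
    "channel": ["web", "web", "web", "web"],
    "amount": [25.0, None, 40.0, 31.5],
    "notes": ["gift wrap", "call first", "leave at door", "fragile"],
})

# First and last rows help catch header issues and trailing garbage.
print(sample.head(5))
print(sample.tail(5))

report = quality_scorecard(sample)
for key, value in report.items():
    print(f"{key}: {value}")
```

On this sample the scorecard flags `amount` for missing values, `channel` as constant, and `notes` as high cardinality. Mixed-type detection is left out of the sketch because it depends on how the data was loaded.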

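The correlation and outlier checks in steps 4 and 5 might look like the sketch below. It again assumes pandas. The helper names `iqr_outlier_mask` and `strong_correlations` and the sample data are hypothetical. The 1.5 multiplier is the conventional choice for the IQR rule rather than something the workflow specifies, and Pearson correlation is assumed because it is the pandas default.

```python
import pandas as pd


def iqr_outlier_mask(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """True where a numeric value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column fences with the frame's columns.
    return (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)


def strong_correlations(df: pd.DataFrame, threshold: float = 0.7) -> list:
    """List (col_a, col_b, r) for numeric pairs with |r| above the threshold."""
    corr = df.select_dtypes(include="number").corr()  # Pearson by default
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, so each pair appears once
            r = corr.loc[a, b]
            if abs(r) > threshold:
                pairs.append((a, b, round(float(r), 3)))
    return pairs


# Made-up sample data; the last row is deliberately extreme.
sample = pd.DataFrame({
    "height_cm": [160, 165, 170, 175, 180, 185, 250],
    "weight_kg": [55, 60, 65, 70, 75, 80, 200],
    "age": [34, 29, 41, 38, 25, 47, 31],
})

mask = iqr_outlier_mask(sample)
print("outlier count per column:")
print(mask.sum())
print("outlier percentage per column:")
print((mask.mean() * 100).round(2))

# Rows flagged in two or more columns deserve a closer look.
multi_outlier_rows = sample[mask.sum(axis=1) >= 2]
print(multi_outlier_rows)

pairs = strong_correlations(sample)
print(pairs)
```

Note that a single extreme row can dominate a Pearson coefficient, so it is worth re-checking strong correlations after reviewing the flagged outliers.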
First Seen
Mar 19, 2026