data-analysis
Data Analysis
This skill enables an AI agent to perform rigorous statistical analysis on structured datasets. The agent loads data, computes descriptive and inferential statistics, identifies trends and correlations, tests hypotheses, and produces actionable insights. It supports CSV, Excel, Parquet, and JSON inputs and leverages pandas, scipy, and statsmodels for analysis.
Workflow
-
Load and profile the data. Read the dataset into a pandas DataFrame and inspect its shape, column types, and memory usage. Display the first and last rows to confirm the data loaded correctly. Check for obvious structural issues such as shifted columns or encoding problems.
-
Compute descriptive statistics. Generate summary statistics for all numeric columns including mean, median, standard deviation, skewness, and kurtosis. For categorical columns, compute value counts and mode. This step establishes a baseline understanding of each variable's distribution.
-
Identify trends and patterns. Apply rolling averages, percentage changes, and seasonal decomposition to time-indexed data. For non-temporal data, use group-by aggregations and pivot tables to surface patterns across categories. Flag any monotonic trends or cyclical behavior.
-
Perform correlation and hypothesis testing. Calculate Pearson and Spearman correlation matrices to quantify relationships between variables. Conduct hypothesis tests (t-tests, chi-square, ANOVA) where appropriate to determine statistical significance. Report p-values and confidence intervals alongside effect sizes.
-
Detect anomalies and outliers. Use the IQR method and z-scores to identify data points that deviate significantly from the norm. Cross-reference outliers with domain context to determine whether they represent errors, rare events, or meaningful signals.
-
Synthesize findings into a report. Summarize the key insights in plain language, supported by specific numbers. Rank findings by business impact or statistical significance. Include limitations and caveats such as sample size constraints or confounding variables.
Supported Technologies
- pandas — data loading, manipulation, and aggregation
- scipy.stats — hypothesis testing, statistical distributions
- statsmodels — time-series decomposition, regression analysis
- numpy — numerical computations
Usage
Provide the agent with a file path to the dataset and a description of the analysis goals. Optionally specify which columns to focus on, the significance level for hypothesis tests (default alpha=0.05), and whether time-series methods should be applied.
Examples
Example 1: Sales CSV analysis with pandas
import pandas as pd
from scipy import stats
# Load the dataset
df = pd.read_csv("sales_2024.csv", parse_dates=["order_date"])
# Descriptive statistics
print(df[["revenue", "units_sold", "discount"]].describe())
# revenue units_sold discount
# count 8450.00 8450.00 8450.00
# mean 312.45 4.12 0.08
# std 189.73 2.87 0.05
# min 12.00 1.00 0.00
# max 2450.00 47.00 0.35
# Correlation analysis
corr = df[["revenue", "units_sold", "discount"]].corr(method="pearson")
print(corr)
# revenue units_sold discount
# revenue 1.000 0.847 -0.213
# units_sold 0.847 1.000 -0.089
# discount -0.213 -0.089 1.000
# Hypothesis test: do discounted orders produce higher revenue?
discounted = df[df["discount"] > 0]["revenue"]
full_price = df[df["discount"] == 0]["revenue"]
t_stat, p_value = stats.ttest_ind(discounted, full_price)
print(f"t={t_stat:.3f}, p={p_value:.4f}")
# t=-3.217, p=0.0013 — discounted orders have significantly lower revenue per order
Example 2: Time-series analysis with seasonal decomposition
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
# Load monthly revenue data
df = pd.read_csv("monthly_revenue.csv", parse_dates=["month"], index_col="month")
# Decompose into trend, seasonal, and residual components
result = seasonal_decompose(df["revenue"], model="additive", period=12)
print("Trend (last 6 months):")
print(result.trend.dropna().tail(6))
# 2024-07 48230.12
# 2024-08 49012.45
# 2024-09 49780.33
# 2024-10 50234.10
# 2024-11 51002.88
# 2024-12 51890.67
print("\nSeasonal peaks:")
seasonal = result.seasonal.groupby(result.seasonal.index.month).mean()
print(seasonal.nlargest(3))
# month
# 11 8923.40 (November — holiday pre-orders)
# 12 7654.20 (December — holiday sales)
# 3 3210.15 (March — spring promotions)
# The upward trend of ~$600/month suggests 14.5% annualized growth.
# Strong Q4 seasonality accounts for roughly 18% of total annual revenue.
Best Practices
- Always inspect raw data before computing statistics — silent parsing errors (wrong delimiters, encoding issues) can invalidate every downstream result.
- Report effect sizes alongside p-values; statistical significance alone does not imply practical importance.
- Use non-parametric tests (Mann-Whitney, Kruskal-Wallis) when data distributions are heavily skewed or sample sizes are small.
- Segment analysis by meaningful categories (region, product line, customer tier) to avoid Simpson's paradox.
- Document assumptions explicitly — stationarity for time-series, normality for parametric tests, independence of observations.
- Validate surprising findings with a holdout sample or alternative methodology before presenting them as conclusions.
Edge Cases
- Missing values in key columns. If more than 30% of a target column is null, warn the user that imputation may introduce significant bias. Offer to analyze the complete-case subset instead.
- Extremely skewed distributions. Log-transform or use median-based statistics when skewness exceeds |2.0| to avoid misleading mean values.
- Multicollinearity. When two predictors correlate above 0.9, flag this and recommend dropping one or using regularized models to avoid inflated coefficients.
- Small sample sizes (n < 30). Switch to exact tests or bootstrap methods and widen confidence intervals accordingly.
- Mixed data types in a single column. Coerce carefully and report how many values could not be converted, rather than silently dropping them.
More from seb1n/awesome-ai-agent-skills
summarization
Summarize text using extractive, abstractive, hierarchical, and multi-document techniques, producing concise outputs at configurable detail levels.
23note-taking
Capture, organize, and retrieve notes efficiently using structured formats, tagging, and file management for meetings, ideas, research, and daily logs.
18proofreading
Proofread and correct text for grammar, spelling, punctuation, style, clarity, and consistency, with support for multiple style guides and readability analysis.
17knowledge-graph-creation
Build structured knowledge graphs from unstructured text by extracting entities, mapping relationships, generating graph triples, and visualizing the result.
16technical-writing
Write clear, concise, and accurate technical documentation including API references, user guides, tutorials, changelogs, and architecture docs, tailored to the target audience.
14data-visualization
Create clear, effective charts and dashboards from structured data using matplotlib, seaborn, and plotly.
13