# Data Analysis Skill (`data-analysis`)
## When to Use This Skill

Use this skill when:

- Analyzing datasets (CSV, Excel, JSON)
- Performing statistical analysis
- Creating data visualizations
- Identifying trends and patterns
- Cleaning and preprocessing data
- Users mention "analyze data", "statistics", "charts", "trends", or "insights"
## Analysis Process

### 1. Data Loading & Initial Exploration

Load the data:

```python
import pandas as pd
import numpy as np

# CSV files
df = pd.read_csv('data.csv')

# Excel files
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# JSON files
df = pd.read_json('data.json')

# From a database (keep real credentials in environment variables, not in code)
import sqlalchemy
engine = sqlalchemy.create_engine('postgresql://user:pass@localhost/db')
df = pd.read_sql('SELECT * FROM table', engine)
```
Initial exploration:

```python
# Basic information
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# First look at the data
print("\nFirst 5 rows:")
print(df.head())

# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())

# Basic statistics
print("\nDescriptive statistics:")
print(df.describe())
```
### 2. Data Cleaning

Handle missing values:

```python
# Check missing data patterns
missing_pct = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False)
print("Missing data percentage:")
print(missing_pct[missing_pct > 0])

# Drop columns with too many missing values (here, more than 50%)
df = df.drop(columns=missing_pct[missing_pct > 50].index)

# Fill missing values (assignment avoids pandas chained-assignment warnings)
df['numeric_column'] = df['numeric_column'].fillna(df['numeric_column'].median())
df['categorical_column'] = df['categorical_column'].fillna(df['categorical_column'].mode()[0])

# Or drop rows with any missing values
df = df.dropna()
```
Handle duplicates:

```python
# Check for duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")

# Remove exact duplicates
df = df.drop_duplicates()

# Deduplicate on specific columns, keeping the first occurrence
df = df.drop_duplicates(subset=['id'], keep='first')
```
Data type conversions:

```python
# Convert to datetime
df['date'] = pd.to_datetime(df['date'])

# Convert to numeric (invalid values become NaN)
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# Convert to category (saves memory for low-cardinality columns)
df['category'] = df['category'].astype('category')
```
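To verify the memory claim, you can compare a column's footprint before and after the conversion. A minimal sketch, assuming a repetitive string column named `category` as above:

```python
# Hypothetical check: compare the same column stored as object vs. category
before = df['category'].astype('object').memory_usage(deep=True)
after = df['category'].astype('category').memory_usage(deep=True)
print(f"object: {before / 1024:.1f} KB, category: {after / 1024:.1f} KB")
```

The savings grow with the ratio of rows to distinct values; a high-cardinality column may see little benefit.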
### 3. Statistical Analysis

See references/STATISTICS.md for detailed formulas.

Descriptive statistics:

```python
column = 'column'  # name of the column to summarize

# Central tendency
mean = df[column].mean()
median = df[column].median()
mode = df[column].mode()[0]

# Dispersion
std = df[column].std()
variance = df[column].var()
range_val = df[column].max() - df[column].min()
iqr = df[column].quantile(0.75) - df[column].quantile(0.25)

# Distribution shape
skewness = df[column].skew()
kurtosis = df[column].kurtosis()

print(f"""
Statistics for {column}:
  Mean: {mean:.2f}
  Median: {median:.2f}
  Std Dev: {std:.2f}
  Range: {range_val:.2f}
  IQR: {iqr:.2f}
  Skewness: {skewness:.2f}
""")
```
Correlation analysis:

```python
# Correlation matrix over numeric columns only
numeric_columns = df.select_dtypes(include='number').columns
correlation = df[numeric_columns].corr()
print(correlation)

# Find strong correlations (|r| > 0.7), excluding self-correlations;
# each pair appears twice because the matrix is symmetric
strong_corr = correlation[(correlation > 0.7) | (correlation < -0.7)]
strong_corr = strong_corr[strong_corr != 1.0].stack()
print("\nStrong correlations:")
print(strong_corr)
```
Group analysis:

```python
# Aggregate by a categorical variable
grouped = df.groupby('category').agg({
    'sales': ['sum', 'mean', 'count'],
    'profit': ['sum', 'mean'],
    'quantity': 'sum'
})
print(grouped)

# Group by multiple keys, then pivot one level into columns
df.groupby(['region', 'category'])['sales'].sum().unstack()
```
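The same region-by-category table can be produced with `pivot_table`, which some find more readable. A sketch using the same hypothetical columns:

```python
# Equivalent to groupby(...).sum().unstack(), with explicit fill for missing cells
pivot = df.pivot_table(index='region', columns='category',
                       values='sales', aggfunc='sum', fill_value=0)
print(pivot)
```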
### 4. Data Visualization

Distribution plots:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Histogram
plt.figure()
df['column'].hist(bins=30, edgecolor='black')
plt.title('Distribution of Column')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.savefig('histogram.png', dpi=300, bbox_inches='tight')
plt.close()

# Box plot
plt.figure()
df.boxplot(column='value', by='category')
plt.title('Value by Category')
plt.suptitle('')  # Remove the automatic "grouped by" title
plt.savefig('boxplot.png', dpi=300, bbox_inches='tight')
plt.close()

# Violin plot
plt.figure()
sns.violinplot(data=df, x='category', y='value')
plt.title('Value Distribution by Category')
plt.savefig('violin.png', dpi=300, bbox_inches='tight')
plt.close()
```
Relationship plots:

```python
# Scatter plot
plt.figure()
plt.scatter(df['x'], df['y'], alpha=0.5)
plt.xlabel('X Variable')
plt.ylabel('Y Variable')
plt.title('X vs Y')
plt.savefig('scatter.png', dpi=300, bbox_inches='tight')
plt.close()

# Correlation heatmap (numeric columns only)
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.savefig('correlation_heatmap.png', dpi=300, bbox_inches='tight')
plt.close()

# Pair plot for multiple variables
sns.pairplot(df[['var1', 'var2', 'var3', 'category']], hue='category')
plt.savefig('pairplot.png', dpi=300, bbox_inches='tight')
plt.close()
```
Time series plots:

```python
# Line plot
plt.figure()
df.set_index('date')['value'].plot()
plt.title('Value Over Time')
plt.xlabel('Date')
plt.ylabel('Value')
plt.savefig('timeseries.png', dpi=300, bbox_inches='tight')
plt.close()

# Multiple time series, one line per category
df.pivot(index='date', columns='category', values='value').plot()
plt.title('Values by Category Over Time')
plt.legend(title='Category')
plt.savefig('timeseries_multi.png', dpi=300, bbox_inches='tight')
plt.close()
```
Categorical plots:

```python
# Bar plot
category_counts = df['category'].value_counts()
plt.figure()
category_counts.plot(kind='bar')
plt.title('Count by Category')
plt.xlabel('Category')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.savefig('barplot.png', dpi=300, bbox_inches='tight')
plt.close()

# Stacked bar plot
df.groupby(['region', 'category'])['sales'].sum().unstack().plot(kind='bar', stacked=True)
plt.title('Sales by Region and Category')
plt.savefig('stacked_bar.png', dpi=300, bbox_inches='tight')
plt.close()
```
### 5. Advanced Analysis

Trend detection:

```python
from scipy import stats

# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(df['x'], df['y'])
print(f"Trend: slope={slope:.4f}, R²={r_value**2:.4f}, p={p_value:.4f}")

# Moving averages (assumes rows are sorted by date)
df['ma_7'] = df['value'].rolling(window=7).mean()
df['ma_30'] = df['value'].rolling(window=30).mean()
```
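`linregress` needs numeric inputs, so for a trend over time you must derive x from the row order or the dates first. A minimal sketch, assuming rows are sorted by date and continuing from the block above:

```python
# Simple integer time index; for irregular sampling use the actual dates,
# e.g. df['date'].map(pd.Timestamp.toordinal)
x = np.arange(len(df))
slope, intercept, r_value, p_value, std_err = stats.linregress(x, df['value'])
print(f"Time trend: slope={slope:.4f} per step, R²={r_value**2:.4f}")
```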
Outlier detection:

```python
# Z-score method (assumes the column has no missing values)
from scipy import stats
z_scores = np.abs(stats.zscore(df['column']))
outliers = df[z_scores > 3]
print(f"Outliers detected: {len(outliers)}")

# IQR method
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['column'] < Q1 - 1.5 * IQR) | (df['column'] > Q3 + 1.5 * IQR)]
print(f"Outliers by IQR: {len(outliers)}")
```
Statistical tests:

```python
from scipy import stats

# T-test (compare the means of two groups)
group1 = df[df['category'] == 'A']['value']
group2 = df[df['category'] == 'B']['value']
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"T-test: t={t_stat:.4f}, p={p_value:.4f}")

# ANOVA (compare means across multiple groups)
groups = [df[df['category'] == cat]['value'] for cat in df['category'].unique()]
f_stat, p_value = stats.f_oneway(*groups)
print(f"ANOVA: F={f_stat:.4f}, p={p_value:.4f}")

# Chi-square test (association between categorical variables)
contingency_table = pd.crosstab(df['category1'], df['category2'])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"Chi-square: χ²={chi2:.4f}, p={p_value:.4f}")
```
### 6. Generate Report

Use the analysis script:

```bash
python scripts/analyze.py --file data.csv --output report.html
```

Create a summary:

```python
# Fill in the bracketed titles and define the finding text before formatting;
# the placeholder below is illustrative.
description_of_finding = "..."

summary = f"""
# Data Analysis Report

## Dataset Overview
- Rows: {len(df):,}
- Columns: {len(df.columns)}
- Date range: {df['date'].min()} to {df['date'].max()}

## Key Findings

### 1. [Finding Title]
{description_of_finding}

### 2. [Finding Title]
{description_of_finding}

## Statistical Summary
{df.describe().to_markdown()}

## Recommendations
1. [Recommendation based on analysis]
2. [Recommendation based on analysis]
"""

with open('analysis_report.md', 'w') as f:
    f.write(summary)
```
## Best Practices

### Memory Optimization

```python
# Read large files in chunks
chunk_size = 10000
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    processed_chunk = process(chunk)  # replace with your own processing function
    chunks.append(processed_chunk)
df = pd.concat(chunks)

# Optimize data types
df['int_col'] = df['int_col'].astype('int32')        # instead of int64
df['float_col'] = df['float_col'].astype('float32')  # instead of float64
```
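Rather than picking widths by hand, pandas can also downcast numerics automatically. A sketch using `pd.to_numeric` with `downcast` (the column names are the placeholders from above):

```python
# Downcast to the smallest dtype that holds the data without overflow
df['int_col'] = pd.to_numeric(df['int_col'], downcast='integer')
df['float_col'] = pd.to_numeric(df['float_col'], downcast='float')
```

Note that float32 loses precision relative to float64, so avoid downcasting columns used in sensitive calculations.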
### Performance Tips

```python
# Use vectorized operations instead of loops

# Bad
result = []
for value in df['column']:
    result.append(value * 2)

# Good
result = df['column'] * 2

# Use .query() for readable filtering
df_filtered = df.query('age > 30 and city == "NYC"')

# Use .loc for conditional assignment
df.loc[df['age'] > 30, 'category'] = 'senior'
```
### Reproducibility

```python
# Set a random seed
np.random.seed(42)

# Save processed data
df.to_csv('processed_data.csv', index=False)
df.to_parquet('processed_data.parquet')  # better for large datasets

# Export analysis objects ('stats' and 'model' stand in for your own results)
import pickle
with open('analysis_results.pkl', 'wb') as f:
    pickle.dump({'stats': stats, 'model': model}, f)
```
## Common Analysis Types

### Sales Analysis

```python
# Total sales by period
sales_by_month = df.groupby(df['date'].dt.to_period('M'))['sales'].sum()

# Top products
top_products = df.groupby('product')['sales'].sum().sort_values(ascending=False).head(10)

# Growth rate (assumes one row per period, sorted by date;
# otherwise compute on an aggregated series such as sales_by_month)
df['growth_rate'] = df['sales'].pct_change() * 100
```
### Customer Analysis

```python
# Customer segmentation by total purchase value
df['segment'] = pd.cut(df['total_purchases'],
                       bins=[0, 100, 500, float('inf')],
                       labels=['Low', 'Medium', 'High'])

# Retention analysis: count customers per cohort and purchase month
cohort = df.groupby(['cohort_month', 'purchase_month']).size()
```
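Those raw counts don't show retention by themselves. A sketch of turning them into a retention-rate matrix, continuing from the block above and assuming `cohort_month` and `purchase_month` hold comparable period values:

```python
# Pivot counts: rows = cohort_month, columns = purchase_month
cohort_counts = cohort.unstack(fill_value=0)

# A cohort's size is its count in its own first month (the "diagonal" cell);
# this assumes every cohort month also appears as a purchase month
cohort_sizes = pd.Series({c: cohort_counts.loc[c, c] for c in cohort_counts.index})

# Fraction of each cohort still purchasing in each later month
retention = cohort_counts.div(cohort_sizes, axis=0)
print(retention.round(2))
```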
### Performance Analysis

```python
# Year-over-year comparison
df['year'] = df['date'].dt.year
yoy = df.groupby('year')['metric'].sum()
yoy_growth = yoy.pct_change() * 100
```
## Error Handling

```python
import sys

try:
    df = pd.read_csv('data.csv')
except FileNotFoundError:
    print("Error: File not found")
    sys.exit(1)
except pd.errors.EmptyDataError:
    print("Error: File is empty")
    sys.exit(1)
except Exception as e:
    print(f"Error loading data: {e}")
    sys.exit(1)

# Validate data
assert not df.empty, "DataFrame is empty"
assert 'required_column' in df.columns, "Missing required column"
assert df['date'].dtype == 'datetime64[ns]', "Date column not in datetime format"
```
## Output Guidelines

Always provide:

- Summary: High-level findings in plain language
- Statistics: Key numbers and metrics
- Visualizations: Charts that support the findings
- Insights: Actionable conclusions
- Recommendations: Next steps based on the analysis