root-cause-investigation
Root Cause Investigation
Quick Start
Systematically investigate why a metric changed, using structured drill-downs, statistical analysis, and hypothesis testing to identify root causes and provide actionable explanations.
Context Requirements
Before investigating, I need:
- The Metric: What changed and by how much
- Time Context: When the change occurred
- Historical Data: Metric values over time for comparison
- Drill-Down Dimensions: Ways to slice the metric (geography, product, channel, etc.)
- Known Events (optional): Product changes, campaigns, external factors
Context Gathering
For The Metric:
"Tell me about the metric change you're investigating:
What metric changed?
- Example: 'Daily Active Users', 'Revenue', 'Conversion Rate', 'Page Load Time'
What happened?
- Dropped from X to Y
- Increased unexpectedly
- Suddenly became volatile
- Plateaued after growing
When did you notice it?
- Specific date or date range
- Gradual change or sudden jump?
How big is the change?
- Absolute: +/- 1,000 users
- Percentage: -15%
- Is this normal variance or unusual?"
For Historical Data:
"To understand if this change is unusual, I need historical context:
Option 1 - Provide Data:
date | metric_value
2024-12-01 | 10,000
2024-12-02 | 10,200
...
2024-12-15 | 8,500 โ drop happened
Option 2 - Database Query:
SELECT date, SUM(metric) as metric_value
FROM metrics_table
WHERE date >= '2024-11-01'
GROUP BY date
ORDER BY date
Option 3 - Dashboard Export: Export your dashboard data to CSV
How much history should I analyze? Recommend at least 2-3x the investigation period."
For Drill-Down Dimensions:
"To find what's driving the change, which dimensions can we analyze?
Common Dimensions:
- Geographic: Country, region, city
- Product: SKU, category, brand
- Customer: Segment, cohort, plan tier
- Channel: Organic, paid, email, direct
- Platform: Mobile, desktop, iOS, Android
- Time: Day of week, hour, season
Which dimensions are available in your data and most likely to reveal insights?"
For Known Events:
"Were there any changes around the time of the metric change?
Product Changes:
- New feature launches
- UI/UX changes
- Bug fixes or incidents
- Algorithm updates
Marketing:
- Campaign starts/stops
- Budget changes
- New channels
External:
- Seasonality
- Competitor actions
- Market events
- Press coverage
Any of these apply?"
For Baseline Expectations:
"What's 'normal' for this metric?
- Historical average
- Expected growth rate
- Acceptable variance range
- Seasonal patterns
This helps determine if the change is truly anomalous."
Workflow
Step 1: Validate the Change
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from datetime import datetime, timedelta
# Load historical data
df = pd.read_csv('metric_data.csv')
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date')
print(f"๐ Data Loaded:")
print(f" Date Range: {df['date'].min()} to {df['date'].max()}")
print(f" Total Days: {len(df)}")
# Define investigation period
investigation_date = '2024-12-15'
baseline_days = 30 # Compare to previous 30 days
investigation_value = df[df['date'] == investigation_date]['metric_value'].values[0]
baseline = df[df['date'] < investigation_date].tail(baseline_days)
baseline_mean = baseline['metric_value'].mean()
baseline_std = baseline['metric_value'].std()
# Calculate change
absolute_change = investigation_value - baseline_mean
percent_change = (absolute_change / baseline_mean) * 100
z_score = (investigation_value - baseline_mean) / baseline_std
print(f"\n๐ Change Detection:")
print(f" Investigation Date: {investigation_date}")
print(f" Current Value: {investigation_value:,.0f}")
print(f" Baseline (30d avg): {baseline_mean:,.0f}")
print(f" Change: {absolute_change:+,.0f} ({percent_change:+.1f}%)")
print(f" Z-Score: {z_score:.2f}")
if abs(z_score) > 2:
print(f" โ ๏ธ SIGNIFICANT: Change is {abs(z_score):.1f} std deviations from normal")
else:
print(f" โน๏ธ MINOR: Change is within normal variance")
Checkpoint: "Confirmed the change is real and significant. Ready to investigate root cause?"
Step 2: Visualize the Trend
def plot_metric_trend(df, investigation_date, metric_col='metric_value'):
"""Visualize metric over time with investigation date marked"""
fig, ax = plt.subplots(figsize=(14, 6))
# Plot full trend
ax.plot(df['date'], df[metric_col], marker='o', linewidth=2, markersize=4)
# Mark investigation date
inv_date = pd.to_datetime(investigation_date)
ax.axvline(inv_date, color='red', linestyle='--', linewidth=2, label='Investigation Date')
# Add baseline mean line
baseline = df[df['date'] < inv_date].tail(30)
ax.axhline(baseline[metric_col].mean(), color='blue', linestyle=':',
alpha=0.5, label='30-Day Baseline')
# Highlight unusual period
unusual = df[df['date'] >= inv_date]
ax.scatter(unusual['date'], unusual[metric_col], color='red', s=100, zorder=5)
ax.set_xlabel('Date')
ax.set_ylabel('Metric Value')
ax.set_title(f'Metric Trend Analysis - Investigation Date: {investigation_date}')
ax.legend()
ax.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('metric_trend.png', dpi=300, bbox_inches='tight')
plt.show()
plot_metric_trend(df, investigation_date)
Step 3: Systematic Drill-Down
def drill_down_analysis(detailed_df, metric_col, dimension, investigation_date):
"""
Compare metric by dimension before vs after investigation date
"""
# Split data into before and after
before = detailed_df[detailed_df['date'] < investigation_date]
after = detailed_df[detailed_df['date'] >= investigation_date]
# Aggregate by dimension
before_agg = before.groupby(dimension)[metric_col].mean()
after_agg = after.groupby(dimension)[metric_col].mean()
# Calculate change
comparison = pd.DataFrame({
'before': before_agg,
'after': after_agg
})
comparison['change'] = comparison['after'] - comparison['before']
comparison['pct_change'] = (comparison['change'] / comparison['before']) * 100
comparison['contribution'] = comparison['change'].abs()
# Sort by contribution to find biggest drivers
comparison = comparison.sort_values('contribution', ascending=False)
return comparison
# Load detailed data (with dimensions)
detailed_df = pd.read_csv('detailed_metric_data.csv')
detailed_df['date'] = pd.to_datetime(detailed_df['date'])
# Drill down by each dimension
dimensions = ['country', 'product_category', 'channel', 'device_type']
print("\n๐ฌ Systematic Drill-Down Analysis:")
for dim in dimensions:
if dim not in detailed_df.columns:
continue
print(f"\n{'='*60}")
print(f"Dimension: {dim.upper()}")
print('='*60)
analysis = drill_down_analysis(detailed_df, 'metric_value', dim, investigation_date)
# Show top contributors to change
top_5 = analysis.head(5)
print(f"\nTop Contributors to Change:")
for idx, row in top_5.iterrows():
print(f"\n {idx}:")
print(f" Before: {row['before']:,.0f}")
print(f" After: {row['after']:,.0f}")
print(f" Change: {row['change']:+,.0f} ({row['pct_change']:+.1f}%)")
Step 4: Hypothesis Testing
def test_hypotheses(detailed_df, investigation_date):
"""
Test common hypotheses for metric changes
"""
before = detailed_df[detailed_df['date'] < investigation_date]
after = detailed_df[detailed_df['date'] >= investigation_date]
hypotheses = []
# Hypothesis 1: Change concentrated in specific segment
for dim in ['country', 'channel', 'device_type']:
if dim not in detailed_df.columns:
continue
before_dist = before.groupby(dim)['metric_value'].sum() / before['metric_value'].sum()
after_dist = after.groupby(dim)['metric_value'].sum() / after['metric_value'].sum()
# Check if distribution shifted significantly
max_shift = (after_dist - before_dist).abs().max()
if max_shift > 0.05: # >5% shift in any segment
shifted_segment = (after_dist - before_dist).abs().idxmax()
hypotheses.append({
'hypothesis': f'Mix shift in {dim}',
'evidence': f'{shifted_segment} changed by {max_shift*100:.1f}%',
'strength': 'Strong' if max_shift > 0.10 else 'Moderate'
})
# Hypothesis 2: Volume drop (fewer total events)
before_count = len(before)
after_count = len(after)
count_change = (after_count - before_count) / before_count
if abs(count_change) > 0.10:
hypotheses.append({
'hypothesis': 'Volume change',
'evidence': f'Event count {count_change:+.1%}',
'strength': 'Strong' if abs(count_change) > 0.20 else 'Moderate'
})
# Hypothesis 3: Quality drop (per-unit metric changed)
if 'user_count' in before.columns:
before_per_user = before['metric_value'].sum() / before['user_count'].sum()
after_per_user = after['metric_value'].sum() / after['user_count'].sum()
per_user_change = (after_per_user - before_per_user) / before_per_user
if abs(per_user_change) > 0.05:
hypotheses.append({
'hypothesis': 'Per-user metric changed',
'evidence': f'Metric per user {per_user_change:+.1%}',
'strength': 'Strong' if abs(per_user_change) > 0.15 else 'Moderate'
})
return hypotheses
hypotheses = test_hypotheses(detailed_df, investigation_date)
print(f"\n๐งช Hypothesis Testing Results:")
for i, hyp in enumerate(hypotheses, 1):
print(f"\n {i}. {hyp['hypothesis']} [{hyp['strength']}]")
print(f" Evidence: {hyp['evidence']}")
Step 5: Correlation Analysis
def find_correlations(df, metric_col, investigation_date):
"""
Find what else changed around the same time
"""
# Look for other metrics that changed
numeric_cols = df.select_dtypes(include=[np.number]).columns
correlations = []
for col in numeric_cols:
if col == metric_col:
continue
# Calculate change in this metric
before = df[df['date'] < investigation_date][col].mean()
after = df[df['date'] >= investigation_date][col].mean()
change = (after - before) / before if before != 0 else 0
if abs(change) > 0.05: # Changed by more than 5%
# Calculate correlation with target metric
corr = df[metric_col].corr(df[col])
correlations.append({
'metric': col,
'change': change,
'correlation': corr
})
# Sort by absolute correlation
correlations = sorted(correlations, key=lambda x: abs(x['correlation']), reverse=True)
return correlations[:5] # Top 5
correlated_metrics = find_correlations(detailed_df, 'metric_value', investigation_date)
print(f"\n๐ Correlated Metrics That Also Changed:")
for metric in correlated_metrics:
print(f"\n {metric['metric']}:")
print(f" Changed: {metric['change']:+.1%}")
print(f" Correlation: {metric['correlation']:.2f}")
Step 6: Identify Root Cause
def synthesize_root_cause(drill_down_results, hypotheses, correlations):
"""
Synthesize findings into root cause explanation
"""
print(f"\n{'='*60}")
print("๐ฏ ROOT CAUSE ANALYSIS")
print('='*60)
# Find the dimension with biggest contribution
biggest_contributor = None
biggest_impact = 0
for dim, results in drill_down_results.items():
top_impact = results['contribution'].iloc[0]
if top_impact > biggest_impact:
biggest_impact = top_impact
biggest_contributor = (dim, results.index[0], results.iloc[0])
if biggest_contributor:
dim, segment, data = biggest_contributor
print(f"\n PRIMARY DRIVER:")
print(f" Dimension: {dim}")
print(f" Segment: {segment}")
print(f" Impact: {data['change']:+,.0f} ({data['pct_change']:+.1f}%)")
print(f" This explains {biggest_impact / abs(absolute_change) * 100:.0f}% of total change")
# Summarize hypotheses
strong_hypotheses = [h for h in hypotheses if h['strength'] == 'Strong']
if strong_hypotheses:
print(f"\n SUPPORTING EVIDENCE:")
for hyp in strong_hypotheses:
print(f" โ {hyp['hypothesis']}: {hyp['evidence']}")
# Highlight correlations
if correlations:
print(f"\n RELATED CHANGES:")
for metric in correlations[:3]:
print(f" โข {metric['metric']}: {metric['change']:+.1%}")
return {
'primary_driver': biggest_contributor,
'supporting_hypotheses': strong_hypotheses,
'correlations': correlations
}
# Collect drill-down results
drill_down_results = {}
for dim in dimensions:
if dim in detailed_df.columns:
drill_down_results[dim] = drill_down_analysis(
detailed_df, 'metric_value', dim, investigation_date
)
root_cause = synthesize_root_cause(drill_down_results, hypotheses, correlated_metrics)
Step 7: Generate Investigation Report
def generate_investigation_report(
metric_name,
investigation_date,
baseline_mean,
investigation_value,
absolute_change,
percent_change,
root_cause,
known_events=None
):
"""Create comprehensive investigation report"""
report = []
report.append("=" * 70)
report.append(f"ROOT CAUSE INVESTIGATION: {metric_name}")
report.append("=" * 70)
report.append(f"\nDate: {investigation_date}")
report.append(f"Analyst: [Your Name]")
report.append(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
report.append(f"\n{'='*70}")
report.append("WHAT HAPPENED")
report.append("=" * 70)
report.append(f"\n{metric_name} changed significantly on {investigation_date}:")
report.append(f" โข Previous 30-day average: {baseline_mean:,.0f}")
report.append(f" โข Value on {investigation_date}: {investigation_value:,.0f}")
report.append(f" โข Change: {absolute_change:+,.0f} ({percent_change:+.1f}%)")
if root_cause['primary_driver']:
dim, segment, data = root_cause['primary_driver']
report.append(f"\n{'='*70}")
report.append("ROOT CAUSE")
report.append("=" * 70)
report.append(f"\nPrimary Driver: {segment} ({dim})")
report.append(f" โข Before: {data['before']:,.0f}")
report.append(f" โข After: {data['after']:,.0f}")
report.append(f" โข Impact: {data['change']:+,.0f} ({data['pct_change']:+.1f}%)")
if root_cause['supporting_hypotheses']:
report.append(f"\nSupporting Evidence:")
for hyp in root_cause['supporting_hypotheses']:
report.append(f" โข {hyp['hypothesis']}: {hyp['evidence']}")
if known_events:
report.append(f"\n{'='*70}")
report.append("KNOWN EVENTS")
report.append("=" * 70)
for event in known_events:
report.append(f" โข {event['date']}: {event['description']}")
report.append(f"\n{'='*70}")
report.append("RECOMMENDATIONS")
report.append("=" * 70)
report.append("\n1. [Specific action based on root cause]")
report.append("2. [Monitoring recommendation]")
report.append("3. [Prevention strategy]")
report.append(f"\n{'='*70}")
report.append("APPENDIX")
report.append("=" * 70)
report.append("\nFiles Generated:")
report.append(" โข metric_trend.png - Visual timeline")
report.append(" โข drill_down_analysis.csv - Detailed breakdowns")
report.append(" โข investigation_data.csv - Raw data used")
return "\n".join(report)
report_text = generate_investigation_report(
metric_name="Daily Active Users",
investigation_date=investigation_date,
baseline_mean=baseline_mean,
investigation_value=investigation_value,
absolute_change=absolute_change,
percent_change=percent_change,
root_cause=root_cause,
known_events=[
{'date': '2024-12-14', 'description': 'New app version released'},
{'date': '2024-12-15', 'description': 'Bug reported in Android app'}
]
)
print(report_text)
# Save report
with open('root_cause_investigation_report.txt', 'w') as f:
f.write(report_text)
Context Validation
Before proceeding, verify:
- Have sufficient historical data (at least 2x investigation period)
- Metric definition is consistent across time periods
- Drill-down dimensions are available and reliable
- Understand business context and recent changes
- Can access granular data if needed for deep dives
Output Template
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
ROOT CAUSE INVESTIGATION
Metric: Daily Active Users (DAU)
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
๐
Investigation Date: December 15, 2024
๐ Change Detected: -15% drop (25,000 โ 21,250)
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
๐ฏ ROOT CAUSE IDENTIFIED
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
PRIMARY DRIVER: Android Mobile App (Platform)
Before: 12,000 users
After: 8,000 users
Impact: -4,000 users (-33%)
This explains 107% of total drop (other segments up slightly)
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
๐ DRILL-DOWN ANALYSIS
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
By Platform:
โ
Desktop: +500 (+5%)
โ
iOS: +200 (+3%)
๐ด Android: -4,000 (-33%) โ Problem
โ
Web Mobile: +100 (+2%)
By Geography:
โ
US: +100
โ
UK: +50
โ
All regions stable (no geographic concentration)
By User Cohort:
โ
New users: Stable
๐ด Returning users: -4,200 (driven by Android)
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
๐งช HYPOTHESIS TESTING
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
CONFIRMED: Platform-specific issue
Evidence: Only Android affected
โ
CONFIRMED: Affects existing users
Evidence: New user acquisition stable
โ REJECTED: Marketing/Channel issue
Evidence: All channels show Android drop
โ REJECTED: Geographic issue
Evidence: Uniform across all regions
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
๐ CORRELATED CHANGES
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โข Android crash rate: +450% (correlation: -0.85)
โข App store rating: -0.5 stars (correlation: -0.62)
โข Support tickets: +120% (correlation: -0.71)
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
๐ TIMELINE CORRELATION
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Dec 14, 10:00 PM: Android app v3.2.0 released
Dec 15, 12:00 AM: First crash reports
Dec 15, 8:00 AM: DAU drop visible
Dec 15, 10:00 AM: Bug ticket filed (#4523)
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
๐ก ROOT CAUSE CONCLUSION
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
The 15% DAU drop was caused by Android app version 3.2.0
crash on launch, affecting existing users. Bug prevents
app from loading, causing immediate 33% drop in Android
DAU. Other platforms unaffected.
Confidence: 95%
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
๐ฏ RECOMMENDATIONS
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
IMMEDIATE (Today):
1. Roll back Android app to v3.1.9
2. Halt v3.2.0 rollout
3. Notify users via email about temporary fix
SHORT-TERM (This Week):
4. Root cause bug in v3.2.0 code
5. Add crash detection to pre-release testing
6. Implement gradual rollout (10% โ 50% โ 100%)
LONG-TERM (This Month):
7. Set up automated alerting for platform-specific drops
8. Improve QA process for mobile releases
9. Create rollback playbook
EXPECTED RECOVERY:
- Rollback deployed: +3,000 users within 24h
- Full recovery: 3-5 days as users update
- Estimated user loss: 5% churn from bad experience
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
๐ SUPPORTING FILES
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ metric_trend.png (timeline visualization)
โ drill_down_analysis.csv (full breakdown)
โ hypothesis_testing.csv (test results)
โ crash_logs.txt (Android error logs)
Common Scenarios
Scenario 1: "Conversion rate suddenly dropped"
โ Validate drop is real vs noise โ Drill down by funnel step to find which step broke โ Analyze by traffic source, device, geography โ Check for recent product/site changes โ Identify technical issue or user experience problem
Scenario 2: "Revenue is down but we don't know why"
โ Decompose revenue = volume ร price ร mix โ Determine if it's fewer customers, lower spending, or different product mix โ Drill down by customer segment and product category โ Compare to seasonal patterns โ Identify specific segment driving the decline
Scenario 3: "Metric improved but team doesn't trust it"
โ Validate improvement is real (not data bug) โ Determine what actually changed โ Connect to known initiatives or changes โ Rule out technical explanations (logging bug, etc.) โ Provide evidence-based explanation
Scenario 4: "Weekly metric is volatile, hard to understand"
โ Separate signal from noise with statistics โ Identify if volatility is new or normal โ Find patterns (day of week effects, etc.) โ Recommend smoothing or better baseline โ Set up proper thresholds for alerting
Scenario 5: "Metric plateaued after growing steadily"
โ Identify when growth rate changed โ Check if it's saturation (hitting ceiling) โ Analyze if new user growth slowed or retention dropped โ Compare to market benchmarks โ Provide strategic recommendations
Handling Missing Context
User reports "metric dropped" without specifics: "Let me help investigate. I need:
- What's the specific metric?
- What time period are you comparing?
- Do you have historical data I can analyze?
- What dimensions can we drill down by?"
User doesn't know what dimensions to analyze: "Let me examine the data to see what's available. Common patterns:
- Geographic (country, region)
- Platform (mobile/desktop, iOS/Android)
- Source (organic, paid, channels)
- Customer (new vs returning, segments)
Which of these are in your data?"
User suspects a cause but isn't sure: "Great hypothesis! Let's test it systematically:
- I'll compare the suspected segment to others
- Check if timing aligns
- Look for counter-evidence
- Quantify the impact
What's your hypothesis?"
Incomplete historical data: "I can work with what you have. If data is limited, I'll:
- Use shorter comparison periods
- Focus on largest changes
- Note limitations in conclusions
- Recommend data to collect going forward"
Advanced Options
After basic investigation, offer:
Statistical Significance Testing: "Want me to calculate if this change is statistically significant vs random variance?"
Decomposition Analysis: "I can break down the metric mathematically (e.g., revenue = users ร avg order value ร frequency)"
Synthetic Control: "I can create a 'what if nothing changed' baseline using historical patterns"
Anomaly Detection: "I can build a model to automatically detect future anomalies in this metric"
Impact Quantification: "Want me to estimate the business impact in $ or user terms?"
Prevention Playbook: "I can create monitoring alerts and investigation runbooks to catch this faster next time"