data-scientist

SKILL.md

Data Scientist

Expert in statistical analysis, experimentation, and business insights.

⚠️ Chunking Rule

Large analyses (EDA + modeling + visualization) = 800+ lines. Generate ONE phase per response: EDA → Feature Engineering → Modeling → Evaluation → Recommendations

Core Capabilities

Statistical Modeling

  • Hypothesis testing (t-test, chi-square, ANOVA)
  • Regression analysis (linear, logistic, GLMs)
  • Bayesian inference
  • Causal inference (propensity score matching, DiD)

Experimentation

  • A/B test design and analysis
  • Sample size calculation
  • Statistical power analysis
  • Multi-armed bandits

Customer Analytics

  • Customer Lifetime Value (CLV) prediction
  • Churn prediction and prevention
  • Cohort analysis
  • RFM segmentation

Anomaly Detection

  • Isolation Forest for outliers
  • DBSCAN clustering
  • Statistical process control
  • Time series anomaly detection

Experiment Tracking

  • MLflow integration for experiment logging
  • Weights & Biases (W&B) support
  • Experiment comparison and visualization
  • Model versioning and registry

Data Visualization

  • Exploratory data analysis (EDA)
  • Distribution plots and correlations
  • Time series visualization
  • Interactive dashboards (Plotly, Streamlit)

Best Practices

# A/B Test Analysis
from scipy import stats

def analyze_ab_test(control, treatment, metric='conversion'):
    # Check sample size
    n_control, n_treatment = len(control), len(treatment)

    # Statistical test
    t_stat, p_value = stats.ttest_ind(control[metric], treatment[metric])

    # Effect size (Cohen's d)
    pooled_std = np.sqrt((control[metric].var() + treatment[metric].var()) / 2)
    effect_size = (treatment[metric].mean() - control[metric].mean()) / pooled_std

    return {
        'p_value': p_value,
        'significant': p_value < 0.05,
        'effect_size': effect_size,
        'lift': (treatment[metric].mean() / control[metric].mean() - 1) * 100
    }
# Experiment Tracking with MLflow
import mlflow

with mlflow.start_run(run_name="experiment-001"):
    mlflow.log_param("model_type", "xgboost")
    mlflow.log_params(model.get_params())

    # Train and evaluate
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    # Log metrics
    mlflow.log_metric("accuracy", accuracy_score(y_test, predictions))
    mlflow.log_metric("f1", f1_score(y_test, predictions))

    # Log model
    mlflow.sklearn.log_model(model, "model")

When to Use

  • Business analytics and insights
  • A/B test design and analysis
  • Customer segmentation and CLV
  • Anomaly and fraud detection
  • Experiment tracking and comparison
  • Data visualization and EDA
Weekly Installs
11
GitHub Stars
82
First Seen
Jan 25, 2026
Installed on
claude-code9
opencode8
antigravity8
codex8
gemini-cli8
cursor7