
Senior Data Scientist

Expert data science for statistical modeling, experimentation, ML deployment, and data-driven decision making.

Keywords

data-science, machine-learning, statistics, a-b-testing, causal-inference, feature-engineering, mlops, experiment-design, model-deployment, python, scikit-learn, pytorch, tensorflow, spark, airflow


Quick Start

# Design an experiment with power analysis
python scripts/experiment_designer.py --input data/ --output results/

# Run feature engineering pipeline
python scripts/feature_engineering_pipeline.py --target project/ --analyze

# Evaluate model performance
python scripts/model_evaluation_suite.py --config config.yaml --deploy

# Statistical analysis
python scripts/statistical_analyzer.py --data input.csv --test ttest --output report.json

Tools

| Script | Purpose |
| --- | --- |
| scripts/experiment_designer.py | A/B test design, power analysis, sample size calculation |
| scripts/feature_engineering_pipeline.py | Automated feature generation, correlation analysis, feature selection |
| scripts/statistical_analyzer.py | Hypothesis testing, causal inference, regression analysis |
| scripts/model_evaluation_suite.py | Model comparison, cross-validation, deployment readiness checks |

Tech Stack

| Category | Tools |
| --- | --- |
| Languages | Python, SQL, R, Scala |
| ML Frameworks | PyTorch, TensorFlow, Scikit-learn, XGBoost |
| Data Processing | Spark, Airflow, dbt, Kafka, Databricks |
| Deployment | Docker, Kubernetes, AWS SageMaker, GCP Vertex AI |
| Experiment Tracking | MLflow, Weights & Biases |
| Databases | PostgreSQL, BigQuery, Snowflake, Pinecone |

Workflow 1: Design and Analyze an A/B Test

  1. Define hypothesis -- State the null and alternative hypotheses. Identify the primary metric (e.g., conversion rate, revenue per user).
  2. Calculate sample size -- python scripts/experiment_designer.py --input data/ --output results/
    • Specify minimum detectable effect (MDE), significance level (alpha=0.05), and power (0.80).
    • Example: For baseline conversion 5%, MDE 10% relative lift, need ~31,000 users per variant.
  3. Randomize assignment -- Use hash-based assignment on user ID for deterministic, reproducible splits.
  4. Run experiment -- Monitor for sample ratio mismatch (SRM) daily. Flag if observed ratio deviates >1% from expected.
  5. Analyze results:
    # proportions_ztest lives in statsmodels, not scipy
    from statsmodels.stats.proportion import proportions_ztest
    
    # Observed conversion rates (for reporting)
    control_conv = control_successes / control_total
    treatment_conv = treatment_successes / treatment_total
    
    # Two-proportion z-test for conversion rates
    z_stat, p_value = proportions_ztest(
        count=[treatment_successes, control_successes],
        nobs=[treatment_total, control_total],
        alternative='two-sided'
    )
    # Reject H0 if p_value < 0.05
    
  6. Validate -- Check for novelty effects, Simpson's paradox across segments, and pre-experiment balance on covariates.
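
The sample-size calculation in step 2 can be reproduced with statsmodels' power utilities. A minimal sketch, using the baseline and MDE from the example above:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05              # control conversion rate
treatment = baseline * 1.10  # 10% relative MDE -> 5.5%

# Cohen's h effect size for two proportions
effect_size = proportion_effectsize(treatment, baseline)

# Solve for per-variant sample size at alpha=0.05, power=0.80
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative='two-sided'
)
print(f"~{n_per_variant:,.0f} users per variant")  # roughly 31,000
```

`solve_power` works on the arcsine-stabilized effect size, so the result closely matches the ~31,000-per-variant figure from the standard two-proportion formula.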

Workflow 2: Build a Feature Engineering Pipeline

  1. Profile raw data -- python scripts/feature_engineering_pipeline.py --target project/ --analyze
    • Identify null rates, cardinality, distributions, and data types.
  2. Generate candidate features:
    • Temporal: day-of-week, hour, recency, frequency, monetary (RFM)
    • Aggregation: rolling means/sums over 7d/30d/90d windows
    • Interaction: ratio features, polynomial combinations
    • Text: TF-IDF, embedding vectors
  3. Select features -- Remove features with >95% null rate, near-zero variance, or >0.95 pairwise correlation. Use recursive feature elimination or SHAP importance.
  4. Validate -- Confirm no target leakage (no features derived from post-outcome data). Check train/test distribution alignment.
  5. Register -- Store features in feature store with versioning and lineage metadata.
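
The selection thresholds in step 3 can be applied directly with pandas. A minimal sketch; `select_features` is an illustrative helper, not part of the pipeline script:

```python
import numpy as np
import pandas as pd

def select_features(df: pd.DataFrame,
                    max_null_rate: float = 0.95,
                    min_variance: float = 1e-8,
                    max_corr: float = 0.95) -> pd.DataFrame:
    # Drop features with excessive null rates
    df = df.loc[:, df.isna().mean() <= max_null_rate]
    # Drop near-zero-variance numeric features
    numeric = df.select_dtypes(include=np.number)
    keep = numeric.var() > min_variance
    df = df.drop(columns=keep[~keep].index)
    # For each highly correlated pair, drop the second member
    corr = df.select_dtypes(include=np.number).corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > max_corr).any()]
    return df.drop(columns=to_drop)
```

Scanning only the upper triangle of the correlation matrix keeps the first feature of each correlated pair, so the surviving set is deterministic given the column order.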

Workflow 3: Train and Evaluate a Model

  1. Split data -- Stratified train/validation/test split (70/15/15). For time series, use temporal split (no future leakage).
  2. Train baseline -- Start with a simple model (logistic regression, gradient boosted trees) to establish a benchmark.
  3. Tune hyperparameters -- Use Optuna or cross-validated grid search. Log all runs to MLflow.
  4. Evaluate on held-out test set:
    from sklearn.metrics import classification_report, roc_auc_score
    
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    
    print(classification_report(y_test, y_pred))
    print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
    
  5. Validate -- Check calibration (predicted probabilities match observed rates). Evaluate fairness metrics across protected groups. Confirm no overfitting (train vs test gap <5%).
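
The calibration check in step 5 can be sketched with scikit-learn's reliability-curve helper; the synthetic arrays here stand in for the `y_test` and `y_prob` produced in step 4:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
# Synthetic example: probabilities that match the true outcome rate
y_prob = rng.uniform(0, 1, size=10_000)
y_test = (rng.uniform(0, 1, size=10_000) < y_prob).astype(int)

# Observed positive rate vs mean predicted probability per bin
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)
gap = np.abs(prob_true - prob_pred).max()
print(f"max per-bin calibration gap: {gap:.3f}")  # near 0 when calibrated
```

A large per-bin gap suggests recalibration (e.g. Platt scaling or isotonic regression) before the probabilities are used for decisions.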

Workflow 4: Deploy a Model to Production

  1. Containerize -- Package model with inference dependencies in Docker:
    docker build -t model-service:v1 .
    
  2. Set up serving -- Deploy behind a REST API with health check, input validation, and structured error responses.
  3. Configure monitoring:
    • Input drift: compare incoming feature distributions to training baseline (KS test, PSI)
    • Output drift: monitor prediction distribution shifts
    • Performance: track latency P50/P95/P99 targets (<50ms / <100ms / <200ms)
  4. Enable canary deployment -- Route 5% traffic to new model, compare metrics against baseline for 24-48 hours.
  5. Validate -- Run python scripts/model_evaluation_suite.py --config config.yaml --deploy to confirm serving latency targets, an error rate <0.1%, and that model outputs match offline evaluation.
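
The input-drift checks from step 3 can be sketched with a two-sample KS test and a Population Stability Index; the `psi` helper is illustrative, and the synthetic arrays stand in for real training/serving data:

```python
import numpy as np
from scipy import stats

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index of current data vs a training baseline."""
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    expected = np.histogram(baseline, bins=edges)[0] / len(baseline)
    actual = np.histogram(
        np.clip(current, edges[0], edges[-1]), bins=edges
    )[0] / len(current)
    expected = np.clip(expected, 1e-6, None)  # avoid log(0)
    actual = np.clip(actual, 1e-6, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

rng = np.random.default_rng(42)
train_feature = rng.normal(0.0, 1.0, 50_000)
live_feature = rng.normal(0.3, 1.0, 5_000)  # shifted serving distribution

ks_stat, ks_p = stats.ks_2samp(train_feature, live_feature)
psi_value = psi(train_feature, live_feature)
print(f"KS p-value: {ks_p:.2e}, PSI: {psi_value:.3f}")
# Common rule of thumb: PSI < 0.1 stable, 0.1-0.2 moderate, > 0.2 significant drift
```

Binning on baseline quantiles keeps the expected bins evenly populated, which makes the PSI estimate stable even for skewed features.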

Workflow 5: Perform Causal Inference

  1. Assess assignment mechanism -- Determine if treatment was randomized (use experiment analysis) or observational (use causal methods below).
  2. Select method based on data structure:
    • Propensity Score Matching: when treatment is binary, many covariates available
    • Difference-in-Differences: when pre/post data available for treatment and control groups
    • Regression Discontinuity: when treatment assigned by threshold on running variable
    • Instrumental Variables: when unobserved confounding present but valid instrument exists
  3. Check assumptions -- Parallel trends (DiD), overlap/positivity (PSM), continuity (RDD).
  4. Estimate treatment effect and compute confidence intervals.
  5. Validate -- Run placebo tests (apply method to pre-treatment period, expect null effect). Sensitivity analysis for unobserved confounding.
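
Difference-in-Differences from step 2 reduces to the interaction coefficient in an OLS regression. A minimal sketch on synthetic panel data with a known +2.0 treatment effect:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 2_000
df = pd.DataFrame({
    "treated": np.repeat([0, 1], n // 2),  # group indicator
    "post": np.tile([0, 1], n // 2),       # pre/post indicator
})
# Outcome with group effect, time effect, and a +2.0 treatment effect
df["y"] = (
    1.0 + 0.5 * df["treated"] + 1.0 * df["post"]
    + 2.0 * df["treated"] * df["post"]
    + rng.normal(0, 1, n)
)

# The DiD estimate is the coefficient on the treated:post interaction
model = smf.ols("y ~ treated * post", data=df).fit()
print(model.params["treated:post"])          # close to 2.0
print(model.conf_int().loc["treated:post"])  # its 95% CI
```

The group and time main effects absorb level differences, so the interaction term isolates the treatment effect under the parallel-trends assumption checked in step 3.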

Performance Targets

| Metric | Target |
| --- | --- |
| P50 latency | < 50ms |
| P95 latency | < 100ms |
| P99 latency | < 200ms |
| Throughput | > 1,000 req/s |
| Availability | 99.9% |
| Error rate | < 0.1% |

Common Commands

# Development
python -m pytest tests/ -v --cov
python -m black src/
python -m pylint src/

# Training
python scripts/train.py --config prod.yaml
python scripts/evaluate.py --model best.pth

# Deployment
docker build -t service:v1 .
kubectl apply -f k8s/
helm upgrade service ./charts/

# Monitoring
kubectl logs -f deployment/service
python scripts/health_check.py

Reference Documentation

| Document | Path |
| --- | --- |
| Statistical Methods | references/statistical_methods_advanced.md |
| Experiment Design Frameworks | references/experiment_design_frameworks.md |
| Feature Engineering Patterns | references/feature_engineering_patterns.md |
| Automation Scripts | scripts/ directory |