# Senior Data Scientist
Expert data science for statistical modeling, experimentation, ML deployment, and data-driven decision making.
## Keywords
data-science, machine-learning, statistics, a-b-testing, causal-inference, feature-engineering, mlops, experiment-design, model-deployment, python, scikit-learn, pytorch, tensorflow, spark, airflow
## Quick Start

```sh
# Design an experiment with power analysis
python scripts/experiment_designer.py --input data/ --output results/

# Run feature engineering pipeline
python scripts/feature_engineering_pipeline.py --target project/ --analyze

# Evaluate model performance
python scripts/model_evaluation_suite.py --config config.yaml --deploy

# Statistical analysis
python scripts/statistical_analyzer.py --data input.csv --test ttest --output report.json
```
## Tools

| Script | Purpose |
|---|---|
| `scripts/experiment_designer.py` | A/B test design, power analysis, sample size calculation |
| `scripts/feature_engineering_pipeline.py` | Automated feature generation, correlation analysis, feature selection |
| `scripts/statistical_analyzer.py` | Hypothesis testing, causal inference, regression analysis |
| `scripts/model_evaluation_suite.py` | Model comparison, cross-validation, deployment readiness checks |
## Tech Stack
| Category | Tools |
|---|---|
| Languages | Python, SQL, R, Scala |
| ML Frameworks | PyTorch, TensorFlow, Scikit-learn, XGBoost |
| Data Processing | Spark, Airflow, dbt, Kafka, Databricks |
| Deployment | Docker, Kubernetes, AWS SageMaker, GCP Vertex AI |
| Experiment Tracking | MLflow, Weights & Biases |
| Databases | PostgreSQL, BigQuery, Snowflake, Pinecone |
## Workflow 1: Design and Analyze an A/B Test
- Define hypothesis -- State the null and alternative hypotheses. Identify the primary metric (e.g., conversion rate, revenue per user).
- Calculate sample size:

  ```sh
  python scripts/experiment_designer.py --input data/ --output results/
  ```

  Specify the minimum detectable effect (MDE), significance level (alpha = 0.05), and power (0.80). Example: for a 5% baseline conversion rate and a 10% relative MDE, you need ~31,000 users per variant.
- Randomize assignment -- Use hash-based assignment on user ID for deterministic, reproducible splits.
- Run experiment -- Monitor for sample ratio mismatch (SRM) daily. Flag if observed ratio deviates >1% from expected.
- Analyze results:

  ```python
  # Two-proportion z-test for conversion rates
  # (proportions_ztest lives in statsmodels, not scipy)
  from statsmodels.stats.proportion import proportions_ztest

  control_conv = control_successes / control_total
  treatment_conv = treatment_successes / treatment_total

  z_stat, p_value = proportions_ztest(
      [treatment_successes, control_successes],
      [treatment_total, control_total],
      alternative='two-sided',
  )
  # Reject H0 if p_value < 0.05
  ```

- Validate -- Check for novelty effects, Simpson's paradox across segments, and pre-experiment balance on covariates.
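The sample-size figure in the example above can be reproduced with statsmodels; a sketch (the 5.5% treatment rate is the 10% relative lift over the 5% baseline from the text):

```python
# Sketch: sample size for the example above (5% baseline conversion,
# 10% relative MDE -> 5.5%, alpha = 0.05, power = 0.80).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.055, 0.05)  # Cohen's h for the two rates
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, ratio=1.0,
    alternative='two-sided',
)
print(round(n_per_variant))  # roughly 31,000 users per variant
```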
## Workflow 2: Build a Feature Engineering Pipeline
- Profile raw data:

  ```sh
  python scripts/feature_engineering_pipeline.py --target project/ --analyze
  ```

  Identify null rates, cardinality, distributions, and data types.
- Generate candidate features:
- Temporal: day-of-week, hour, recency, frequency, monetary (RFM)
- Aggregation: rolling means/sums over 7d/30d/90d windows
- Interaction: ratio features, polynomial combinations
- Text: TF-IDF, embedding vectors
- Select features -- Remove features with >95% null rate, near-zero variance, or >0.95 pairwise correlation. Use recursive feature elimination or SHAP importance.
- Validate -- Confirm no target leakage (no features derived from post-outcome data). Check train/test distribution alignment.
- Register -- Store features in feature store with versioning and lineage metadata.
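The selection thresholds above can be sketched in pandas; `select_features` is a hypothetical helper for illustration, not one of the bundled scripts:

```python
import numpy as np
import pandas as pd

def select_features(df: pd.DataFrame,
                    null_thresh: float = 0.95,
                    corr_thresh: float = 0.95) -> pd.DataFrame:
    # Drop features with > 95% null rate.
    df = df.loc[:, df.isna().mean() <= null_thresh]
    # Drop near-zero-variance (constant) features.
    df = df.loc[:, df.nunique(dropna=True) > 1]
    # Drop one feature from each pair with |correlation| > 0.95,
    # scanning only the upper triangle so each pair is seen once.
    corr = df.corr(numeric_only=True).abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > corr_thresh).any()]
    return df.drop(columns=to_drop)
```

SHAP importance or recursive feature elimination would then rank the survivors; the rules above only remove clearly redundant features.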
## Workflow 3: Train and Evaluate a Model
- Split data -- Stratified train/validation/test split (70/15/15). For time series, use temporal split (no future leakage).
- Train baseline -- Start with a simple model (logistic regression, gradient boosted trees) to establish a benchmark.
- Tune hyperparameters -- Use Optuna or cross-validated grid search. Log all runs to MLflow.
- Evaluate on held-out test set:

  ```python
  from sklearn.metrics import classification_report, roc_auc_score

  y_pred = model.predict(X_test)
  y_prob = model.predict_proba(X_test)[:, 1]

  print(classification_report(y_test, y_pred))
  print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
  ```

- Validate -- Check calibration (predicted probabilities match observed rates). Evaluate fairness metrics across protected groups. Confirm no overfitting (train vs. test gap < 5%).
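The split-and-evaluate steps can be sketched end to end on synthetic data; `make_classification` and `LogisticRegression` stand in for real data and a tuned model:

```python
# Minimal sketch: stratified 70/15/15 split, baseline model, and the
# train-vs-test overfitting check from the validation step above.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)

# 70/15/15: carve off 30%, then split it half-and-half into val/test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"train AUC {train_auc:.3f}, test AUC {test_auc:.3f}")
```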
## Workflow 4: Deploy a Model to Production
- Containerize -- Package the model with its inference dependencies in Docker:

  ```sh
  docker build -t model-service:v1 .
  ```

- Set up serving -- Deploy behind a REST API with health checks, input validation, and structured error responses.
- Configure monitoring:
- Input drift: compare incoming feature distributions to training baseline (KS test, PSI)
- Output drift: monitor prediction distribution shifts
- Performance: track latency P50/P95/P99 targets (<50ms / <100ms / <200ms)
- Enable canary deployment -- Route 5% traffic to new model, compare metrics against baseline for 24-48 hours.
- Validate:

  ```sh
  python scripts/model_evaluation_suite.py --config config.yaml --deploy
  ```

  This confirms serving latency targets, an error rate < 0.1%, and that model outputs match the offline evaluation.
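The PSI check named in the monitoring step can be sketched in NumPy; this `psi` helper is a hypothetical, minimal implementation, not one of the bundled scripts:

```python
import numpy as np

def psi(baseline: np.ndarray, incoming: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two samples of one feature."""
    # Bin edges from the baseline's quantiles (robust to skewed features).
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))

    def bucket_fractions(x: np.ndarray) -> np.ndarray:
        # Clip so out-of-range values land in the first/last bucket.
        idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, bins - 1)
        counts = np.bincount(idx, minlength=bins)
        # Floor fractions to avoid log(0) for empty buckets.
        return np.clip(counts / len(x), 1e-6, None)

    expected = bucket_fractions(baseline)
    actual = bucket_fractions(incoming)
    return float(np.sum((actual - expected) * np.log(actual / expected)))
```

A common rule of thumb: PSI < 0.1 is stable, 0.1-0.25 warrants investigation, and > 0.25 indicates significant drift.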
## Workflow 5: Perform Causal Inference
- Assess assignment mechanism -- Determine if treatment was randomized (use experiment analysis) or observational (use causal methods below).
- Select method based on data structure:
- Propensity Score Matching: when treatment is binary, many covariates available
- Difference-in-Differences: when pre/post data available for treatment and control groups
- Regression Discontinuity: when treatment assigned by threshold on running variable
- Instrumental Variables: when unobserved confounding present but valid instrument exists
- Check assumptions -- Parallel trends (DiD), overlap/positivity (PSM), continuity (RDD).
- Estimate treatment effect and compute confidence intervals.
- Validate -- Run placebo tests (apply method to pre-treatment period, expect null effect). Sensitivity analysis for unobserved confounding.
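Difference-in-differences from the method list above can be sketched on synthetic data using group means; in practice you would fit an OLS model with a treatment-by-period interaction to also get standard errors:

```python
# DiD estimate = (treated post - treated pre) - (control post - control pre).
# Synthetic setup: a +1.0 time trend affects both groups; only the treated
# group additionally receives the true effect of +2.0.
import numpy as np

rng = np.random.default_rng(42)
n = 5000
true_effect = 2.0

control_pre   = rng.normal(10.0, 1.0, n)
control_post  = rng.normal(11.0, 1.0, n)                      # trend only
treated_pre   = rng.normal(12.0, 1.0, n)                      # higher baseline
treated_post  = rng.normal(12.0 + 1.0 + true_effect, 1.0, n)  # trend + effect

did = ((treated_post.mean() - treated_pre.mean())
       - (control_post.mean() - control_pre.mean()))
print(f"DiD estimate: {did:.2f}")  # close to the true effect of 2.0
```

Note how the differing baseline levels (10 vs. 12) cancel out; only a violation of parallel trends, not a level difference, would bias this estimate.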
## Performance Targets
| Metric | Target |
|---|---|
| P50 latency | < 50ms |
| P95 latency | < 100ms |
| P99 latency | < 200ms |
| Throughput | > 1,000 req/s |
| Availability | 99.9% |
| Error rate | < 0.1% |
## Common Commands

```sh
# Development
python -m pytest tests/ -v --cov
python -m black src/
python -m pylint src/

# Training
python scripts/train.py --config prod.yaml
python scripts/evaluate.py --model best.pth

# Deployment
docker build -t service:v1 .
kubectl apply -f k8s/
helm upgrade service ./charts/

# Monitoring
kubectl logs -f deployment/service
python scripts/health_check.py
```
## Reference Documentation
| Document | Path |
|---|---|
| Statistical Methods | references/statistical_methods_advanced.md |
| Experiment Design Frameworks | references/experiment_design_frameworks.md |
| Feature Engineering Patterns | references/feature_engineering_patterns.md |
| Automation Scripts | scripts/ directory |