Scikit-learn Machine Learning
Industry-standard Python library for classical machine learning.
When to Use
- Classification or regression tasks
- Clustering or dimensionality reduction
- Preprocessing and feature engineering
- Model evaluation and cross-validation
- Hyperparameter tuning
- Building ML pipelines
Algorithm Selection
Classification
| Algorithm | Best For | Strengths |
|---|---|---|
| Logistic Regression | Baseline, interpretable | Fast, probabilistic |
| Random Forest | General purpose | Handles non-linearity, feature importance |
| Gradient Boosting | Best accuracy | State-of-the-art for tabular data |
| SVM | High-dimensional data | Works well with few samples |
| KNN | Simple problems | No training, instance-based |
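As a minimal sketch of the table's first two rows, here is a baseline logistic regression compared against a random forest on a synthetic toy dataset (the dataset and all parameter values are illustrative assumptions, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy dataset: 500 samples, 10 features, binary target
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline: fast, interpretable, gives calibrated-ish probabilities
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# General purpose: handles non-linearity, exposes feature_importances_
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print(f"LogisticRegression: {baseline.score(X_test, y_test):.3f}")
print(f"RandomForest:       {forest.score(X_test, y_test):.3f}")
```

Starting with the baseline makes it obvious whether the extra complexity of an ensemble actually buys accuracy on your data.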
Regression
| Algorithm | Best For | Notes |
|---|---|---|
| Linear Regression | Baseline | Interpretable coefficients |
| Ridge/Lasso | Regularization needed | L2 vs L1 penalty |
| Random Forest | Non-linear relationships | Robust to outliers |
| Gradient Boosting | Best accuracy | XGBoost and LightGBM offer scikit-learn-compatible wrappers |
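To illustrate the baseline-vs-regularization rows, a short sketch comparing plain least squares with its L2 (Ridge) and L1 (Lasso) variants on synthetic data (the `alpha` values here are arbitrary illustrations; in practice they should be tuned):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression problem with some noise
X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "LinearRegression": LinearRegression(),  # baseline, no penalty
    "Ridge (L2)": Ridge(alpha=1.0),          # shrinks all coefficients
    "Lasso (L1)": Lasso(alpha=1.0),          # can zero out coefficients entirely
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: R^2 = {model.score(X_test, y_test):.3f}")
```

The L1 penalty's tendency to produce exactly-zero coefficients is why Lasso doubles as a feature selector.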
Clustering
| Algorithm | Best For | Key Parameter |
|---|---|---|
| KMeans | Spherical clusters | n_clusters (must specify) |
| DBSCAN | Arbitrary shapes | eps (density) |
| Agglomerative | Hierarchical | n_clusters or distance threshold |
| Gaussian Mixture | Soft clustering | n_components |
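The key-parameter distinction above can be sketched on toy blobs: KMeans needs the cluster count up front, while DBSCAN discovers it from the density parameter `eps` (the `eps=1.5` value is an assumption tuned to this synthetic data, not a general default):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Three well-separated spherical blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# KMeans: the number of clusters must be specified in advance
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# DBSCAN: the cluster count emerges from the eps density threshold;
# points labeled -1 are treated as noise
db = DBSCAN(eps=1.5, min_samples=5).fit(X)

print("KMeans clusters:", len(set(km.labels_)))
print("DBSCAN clusters:", len(set(db.labels_) - {-1}))
```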
Dimensionality Reduction
| Method | Preserves | Use Case |
|---|---|---|
| PCA | Global variance | Feature reduction |
| t-SNE | Local structure | 2D/3D visualization |
| UMAP | Both local and global structure | Visualization + downstream tasks (separate umap-learn package) |
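PCA's variance-preserving behavior can be seen directly: passing a float to `n_components` keeps just enough components to explain that fraction of the global variance (the digits dataset is used here purely as a convenient built-in example):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # 1797 samples, 64 pixel features

# Keep the smallest number of components explaining >= 95% of variance
pca = PCA(n_components=0.95).fit(X)
X_reduced = pca.transform(X)

print(X.shape, "->", X_reduced.shape)
print(f"explained variance: {pca.explained_variance_ratio_.sum():.3f}")
```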
Pipeline Concepts
Key concept: Pipelines prevent data leakage by ensuring transformations are fit only on training data.
| Component | Purpose |
|---|---|
| Pipeline | Sequential steps (transform → model) |
| ColumnTransformer | Apply different transforms to different columns |
| FeatureUnion | Combine multiple feature extraction methods |
Common preprocessing flow:
- Impute missing values (SimpleImputer)
- Scale numeric features (StandardScaler, MinMaxScaler)
- Encode categoricals (OneHotEncoder, OrdinalEncoder)
- Optional: feature selection or polynomial features
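The preprocessing flow above can be wired together with `Pipeline` and `ColumnTransformer`; the column names and tiny DataFrame below are hypothetical placeholders for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names for illustration
numeric_cols = ["age", "income"]
categorical_cols = ["city"]

# Impute then scale numeric features
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Impute then one-hot encode categoricals
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, numeric_cols),
    ("cat", categorical, categorical_cols),
])

# Full pipeline: every transform is fit on training data only,
# which is what prevents leakage
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Tiny illustrative dataset with missing values
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0],
    "income": [40000.0, 60000.0, 52000.0, np.nan],
    "city": ["NY", "SF", np.nan, "NY"],
})
y = [0, 1, 1, 0]
model.fit(df, y)
print(model.predict(df))
```

Because the imputers, scaler, and encoder live inside the pipeline, calling `model.fit` on a training split never lets test-set statistics leak into preprocessing.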
Model Evaluation
Cross-Validation Strategies
| Strategy | Use Case |
|---|---|
| KFold | General purpose |
| StratifiedKFold | Imbalanced classification |
| TimeSeriesSplit | Temporal data |
| LeaveOneOut | Very small datasets |
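A brief sketch of stratified cross-validation on an imbalanced toy problem (the 90/10 class split and F1 scoring are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced toy problem: ~90% negatives, ~10% positives
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)

# Stratification keeps the class ratio consistent in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")

print(f"F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```

With plain `KFold`, a fold could end up with almost no minority samples, making its score meaningless; stratification rules that out.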
Metrics
| Task | Metric | When to Use |
|---|---|---|
| Classification | Accuracy | Balanced classes |
| | F1-score | Imbalanced classes |
| | ROC-AUC | Ranking, threshold tuning |
| | Precision/Recall | Domain-specific costs |
| Regression | RMSE | Penalize large errors |
| | MAE | Robust to outliers |
| | R² | Explained variance |
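To make the accuracy-vs-F1 distinction concrete, a sketch on an imbalanced synthetic problem (the 85/15 split is an assumption chosen to exaggerate the effect):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.85, 0.15], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)
proba = clf.predict_proba(X_test)[:, 1]  # ROC-AUC needs scores, not labels

# Accuracy can look good on imbalanced data even for a weak model;
# F1 and ROC-AUC give a fuller picture of minority-class performance
print(f"accuracy: {accuracy_score(y_test, pred):.3f}")
print(f"F1:       {f1_score(y_test, pred):.3f}")
print(f"ROC-AUC:  {roc_auc_score(y_test, proba):.3f}")
```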
Hyperparameter Tuning
| Method | Pros | Cons |
|---|---|---|
| GridSearchCV | Exhaustive | Slow for many parameters |
| RandomizedSearchCV | Faster | May miss the optimum |
| HalvingGridSearchCV | Efficient | Experimental: needs the enable_halving_search_cv import (sklearn 0.24+) |
Key concept: Tune hyperparameters on a validation set (or via cross-validation), then evaluate the final model once on a held-out test set.
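This split between tuning and final evaluation can be sketched with `GridSearchCV`, which runs the cross-validation internally on the training portion (the SVM pipeline and parameter grid are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
# Held-out test set: never touched during tuning
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
grid = GridSearchCV(
    pipe,
    # step-name__parameter syntax reaches into pipeline steps
    param_grid={"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.01]},
    cv=5,
)
grid.fit(X_train, y_train)  # tuning via internal CV on training data only

print(grid.best_params_)
print(f"test accuracy: {grid.score(X_test, y_test):.3f}")
```

Scoring `grid` on `X_test` only after the search finishes is what keeps the final estimate honest.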
Best Practices
| Practice | Why |
|---|---|
| Split data first | Prevents leakage |
| Use pipelines | Reproducible, no leakage |
| Scale for distance-based models | KNN, SVM, and PCA need scaled features |
| Stratify imbalanced splits | Preserves class distribution |
| Cross-validate | Reliable performance estimates |
| Check learning curves | Diagnose over-/underfitting |
Common Pitfalls
| Pitfall | Solution |
|---|---|
| Fitting the scaler on all data | Use a pipeline, or fit only on the training set |
| Using accuracy on imbalanced data | Use F1, ROC-AUC, or balanced accuracy |
| Tuning too many hyperparameters | Start simple, add complexity |
| Ignoring feature importance | Use feature_importances_ or permutation importance |
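The first pitfall's fix is worth spelling out: fitting the scaler inside a pipeline guarantees its statistics come from training data only (KNN is used here simply as a distance-based model that needs scaling):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Wrong (leaky): StandardScaler().fit(X) before splitting would let
# test-set means and variances influence the transform.
# Right: the scaler below is fit only when the pipeline sees X_train.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe.fit(X_train, y_train)
print(f"test accuracy: {pipe.score(X_test, y_test):.3f}")
```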
Resources