Scikit-learn Machine Learning
Industry-standard Python library for classical machine learning.
When to Use
- Classification or regression tasks
- Clustering or dimensionality reduction
- Preprocessing and feature engineering
- Model evaluation and cross-validation
- Hyperparameter tuning
- Building ML pipelines
Algorithm Selection
Classification
| Algorithm | Best For | Strengths |
|---|---|---|
| Logistic Regression | Baseline, interpretable | Fast, probabilistic |
| Random Forest | General purpose | Handles non-linearity, feature importance |
| Gradient Boosting | Best accuracy | State-of-the-art for tabular data |
| SVM | High-dimensional data | Works well with few samples |
| KNN | Simple problems | No training, instance-based |
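As a minimal sketch of the table's first two rows, here is a baseline logistic regression compared against a random forest on a synthetic toy dataset (the dataset and all parameter values are illustrative assumptions, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy dataset: 500 samples, 10 features, binary target
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline: fast, interpretable, gives calibrated-ish probabilities
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# General purpose: handles non-linearity, exposes feature_importances_
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print(f"LogisticRegression: {baseline.score(X_test, y_test):.3f}")
print(f"RandomForest:       {forest.score(X_test, y_test):.3f}")
```

Starting with the baseline makes it obvious whether the extra complexity of an ensemble actually buys accuracy on your data.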
Regression
| Algorithm | Best For | Notes |
|---|---|---|
| Linear Regression | Baseline | Interpretable coefficients |
| Ridge/Lasso | Regularization needed | L2 vs L1 penalty |
| Random Forest | Non-linear relationships | Robust to outliers |
| Gradient Boosting | Best accuracy | XGBoost and LightGBM offer scikit-learn-compatible wrappers |
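To illustrate the baseline-vs-regularization rows, a short sketch comparing plain least squares with its L2 (Ridge) and L1 (Lasso) variants on synthetic data (the `alpha` values here are arbitrary illustrations; in practice they should be tuned):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression problem with some noise
X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "LinearRegression": LinearRegression(),  # baseline, no penalty
    "Ridge (L2)": Ridge(alpha=1.0),          # shrinks all coefficients
    "Lasso (L1)": Lasso(alpha=1.0),          # can zero out coefficients entirely
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: R^2 = {model.score(X_test, y_test):.3f}")
```

The L1 penalty's tendency to produce exactly-zero coefficients is why Lasso doubles as a feature selector.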
Clustering
| Algorithm | Best For | Key Parameter |
|---|---|---|
| KMeans | Spherical clusters | n_clusters (must specify) |
| DBSCAN | Arbitrary shapes | eps (density) |
| Agglomerative | Hierarchical | n_clusters or distance threshold |
| Gaussian Mixture | Soft clustering | n_components |
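The key-parameter distinction above can be sketched on toy blobs: KMeans needs the cluster count up front, while DBSCAN discovers it from the density parameter `eps` (the `eps=1.5` value is an assumption tuned to this synthetic data, not a general default):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Three well-separated spherical blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# KMeans: the number of clusters must be specified in advance
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# DBSCAN: the cluster count emerges from the eps density threshold;
# points labeled -1 are treated as noise
db = DBSCAN(eps=1.5, min_samples=5).fit(X)

print("KMeans clusters:", len(set(km.labels_)))
print("DBSCAN clusters:", len(set(db.labels_) - {-1}))
```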
Dimensionality Reduction
| Method | Preserves | Use Case |
|---|---|---|
| PCA | Global variance | Feature reduction |
| t-SNE | Local structure | 2D/3D visualization |
| UMAP | Both local and global structure | Visualization + downstream tasks (separate umap-learn package) |
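PCA's variance-preserving behavior can be seen directly: passing a float to `n_components` keeps just enough components to explain that fraction of the global variance (the digits dataset is used here purely as a convenient built-in example):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # 1797 samples, 64 pixel features

# Keep the smallest number of components explaining >= 95% of variance
pca = PCA(n_components=0.95).fit(X)
X_reduced = pca.transform(X)

print(X.shape, "->", X_reduced.shape)
print(f"explained variance: {pca.explained_variance_ratio_.sum():.3f}")
```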
Pipeline Concepts
Key concept: Pipelines prevent data leakage by ensuring transformations are fit only on training data.
| Component | Purpose |
|---|---|
| Pipeline | Sequential steps (transform → model) |
| ColumnTransformer | Apply different transforms to different columns |
| FeatureUnion | Combine multiple feature extraction methods |
Common preprocessing flow:
- Impute missing values (SimpleImputer)
- Scale numeric features (StandardScaler, MinMaxScaler)
- Encode categoricals (OneHotEncoder, OrdinalEncoder)
- Optional: feature selection or polynomial features
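The preprocessing flow above can be wired together with `Pipeline` and `ColumnTransformer`; the column names and tiny DataFrame below are hypothetical placeholders for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names for illustration
numeric_cols = ["age", "income"]
categorical_cols = ["city"]

# Impute then scale numeric features
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Impute then one-hot encode categoricals
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, numeric_cols),
    ("cat", categorical, categorical_cols),
])

# Full pipeline: every transform is fit on training data only,
# which is what prevents leakage
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Tiny illustrative dataset with missing values
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0],
    "income": [40000.0, 60000.0, 52000.0, np.nan],
    "city": ["NY", "SF", np.nan, "NY"],
})
y = [0, 1, 1, 0]
model.fit(df, y)
print(model.predict(df))
```

Because the imputers, scaler, and encoder live inside the pipeline, calling `model.fit` on a training split never lets test-set statistics leak into preprocessing.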
Model Evaluation
Cross-Validation Strategies
| Strategy | Use Case |
|---|---|
| KFold | General purpose |
| StratifiedKFold | Imbalanced classification |
| TimeSeriesSplit | Temporal data |
| LeaveOneOut | Very small datasets |
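A brief sketch of stratified cross-validation on an imbalanced toy problem (the 90/10 class split and F1 scoring are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced toy problem: ~90% negatives, ~10% positives
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)

# Stratification keeps the class ratio consistent in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")

print(f"F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```

With plain `KFold`, a fold could end up with almost no minority samples, making its score meaningless; stratification rules that out.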
Metrics
| Task | Metric | When to Use |
|---|---|---|
| Classification | Accuracy | Balanced classes |
| | F1-score | Imbalanced classes |
| | ROC-AUC | Ranking, threshold tuning |
| | Precision/Recall | Domain-specific costs |
| Regression | RMSE | Penalize large errors |
| | MAE | Robust to outliers |
| | R² | Explained variance |
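To make the accuracy-vs-F1 distinction concrete, a sketch on an imbalanced synthetic problem (the 85/15 split is an assumption chosen to exaggerate the effect):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.85, 0.15], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)
proba = clf.predict_proba(X_test)[:, 1]  # ROC-AUC needs scores, not labels

# Accuracy can look good on imbalanced data even for a weak model;
# F1 and ROC-AUC give a fuller picture of minority-class performance
print(f"accuracy: {accuracy_score(y_test, pred):.3f}")
print(f"F1:       {f1_score(y_test, pred):.3f}")
print(f"ROC-AUC:  {roc_auc_score(y_test, proba):.3f}")
```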
Hyperparameter Tuning
| Method | Pros | Cons |
|---|---|---|
| GridSearchCV | Exhaustive | Slow for many parameters |
| RandomizedSearchCV | Faster | May miss the optimum |
| HalvingGridSearchCV | Efficient | Experimental: needs the enable_halving_search_cv import (sklearn 0.24+) |
Key concept: Tune hyperparameters on a validation set (or via cross-validation), then evaluate the final model once on a held-out test set.
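This split between tuning and final evaluation can be sketched with `GridSearchCV`, which runs the cross-validation internally on the training portion (the SVM pipeline and parameter grid are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
# Held-out test set: never touched during tuning
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
grid = GridSearchCV(
    pipe,
    # step-name__parameter syntax reaches into pipeline steps
    param_grid={"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.01]},
    cv=5,
)
grid.fit(X_train, y_train)  # tuning via internal CV on training data only

print(grid.best_params_)
print(f"test accuracy: {grid.score(X_test, y_test):.3f}")
```

Scoring `grid` on `X_test` only after the search finishes is what keeps the final estimate honest.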
Best Practices
| Practice | Why |
|---|---|
| Split data first | Prevents leakage |
| Use pipelines | Reproducible, no leakage |
| Scale for distance-based models | KNN, SVM, and PCA need scaled features |
| Stratify imbalanced splits | Preserves class distribution |
| Cross-validate | Reliable performance estimates |
| Check learning curves | Diagnose over-/underfitting |
Common Pitfalls
| Pitfall | Solution |
|---|---|
| Fitting the scaler on all data | Use a pipeline, or fit only on the training set |
| Using accuracy on imbalanced data | Use F1, ROC-AUC, or balanced accuracy |
| Tuning too many hyperparameters | Start simple, add complexity |
| Ignoring feature importance | Use feature_importances_ or permutation importance |
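The first pitfall's fix is worth spelling out: fitting the scaler inside a pipeline guarantees its statistics come from training data only (KNN is used here simply as a distance-based model that needs scaling):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Wrong (leaky): StandardScaler().fit(X) before splitting would let
# test-set means and variances influence the transform.
# Right: the scaler below is fit only when the pipeline sees X_train.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe.fit(X_train, y_train)
print(f"test accuracy: {pipe.score(X_test, y_test):.3f}")
```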
Resources