unsupervised
Unsupervised Learning — Clustering, Anomalies, and Dimensionality Reduction
When the data has no target column, you're in unsupervised territory. Three things you can do with it:
- Cluster to find segments / groups
- Detect anomalies to find outliers / fraud / failures
- Reduce dimensions to visualize, denoise, or compress
Each comes with its own gotchas. The biggest one: there's no single "accuracy" metric, so you have to be deliberate about how you evaluate.
When to use this skill
- You have tabular data without a target column
- You want to find groups (customer segments, anomalies, behavioral modes)
- You want to visualize high-dimensional data
- You want to denoise or compress features before downstream modeling
When NOT to use this skill
- You have labels — use a supervised skill (binary, multiclass, etc.)
- The data is text / images / audio / time series — different rules
- You need a single "best" cluster assignment with no validation loop — unsupervised always needs human judgment in the loop
Always run tabular-eda first
Profile the data before clustering. The unsupervised skill assumes:
- Numeric features (categoricals must be encoded first)
- No leakage / no target column
- Scaled features (clustering algorithms are distance-based)
- Missing values handled
If you skip EDA you'll find clusters that just reflect data quality
issues (e.g. one cluster = "rows with feature_3 missing").
The clustering pipeline
from sklearn.cluster import DBSCAN, KMeans
from sklearn.compose import ColumnTransformer
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
preprocess = ColumnTransformer([("num", StandardScaler(), numeric_cols)])
X = preprocess.fit_transform(df[numeric_cols])
# 1. Choose K via silhouette score (NOT just visual elbow)
k_values = list(range(2, 11))
silhouettes = []
for k in k_values:
km = KMeans(n_clusters=k, n_init=10, random_state=42)
silhouettes.append(silhouette_score(X, km.fit_predict(X)))
best_k = k_values[int(np.argmax(silhouettes))]
# 2. Stability check at best K
stability_runs = []
for seed in range(8):
km = KMeans(n_clusters=best_k, n_init=10, random_state=seed)
stability_runs.append(km.fit_predict(X))
ari_mean = np.mean([
adjusted_rand_score(stability_runs[i], stability_runs[j])
for i in range(8) for j in range(i + 1, 8)
])
# If ari_mean < 0.7, the clustering is unstable — try a different K or algorithm
# 3. Final clustering
labels = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit_predict(X)
The five things that separate this from a tutorial
1. Choose K with silhouette score, not just visual elbow
The classic "elbow method" plots inertia (within-cluster sum of squares) vs K and looks for the kink. It's subjective. Two people can disagree on where the elbow is. The silhouette score is quantifiable: it measures how well each point fits its assigned cluster vs the next nearest cluster, averaged across all points.
from sklearn.metrics import silhouette_score
silhouettes = []
for k in range(2, 11):
km = KMeans(n_clusters=k, n_init=10, random_state=42)
labels = km.fit_predict(X)
silhouettes.append(silhouette_score(X, labels))
best_k = 2 + int(np.argmax(silhouettes)) # +2 because k starts at 2
A silhouette of 1.0 = perfect separation; 0 = overlapping clusters; negative = points are closer to other clusters than their own. Below ~0.3 means your clustering is questionable regardless of which K you pick.
Plot both elbow and silhouette curves so you have both signals, but pick K from the silhouette.
2. Stability check across multiple random_states
KMeans is sensitive to initialization. Run it with several seeds and compute pairwise Adjusted Rand Index (ARI) between runs. If the mean off-diagonal ARI is below ~0.7, your clustering is unstable — the algorithm is picking different "clusters" each time, which means either K is wrong or the algorithm is wrong for your data.
from sklearn.metrics import adjusted_rand_score
import numpy as np
n_runs = 8
labels_per_run = []
for seed in range(n_runs):
km = KMeans(n_clusters=best_k, n_init=10, random_state=seed)
labels_per_run.append(km.fit_predict(X))
ari_matrix = np.array([
[adjusted_rand_score(labels_per_run[i], labels_per_run[j])
for j in range(n_runs)] for i in range(n_runs)
])
mean_ari = ari_matrix[~np.eye(n_runs, dtype=bool)].mean()
# Stable: mean_ari ≈ 1.0
# Unstable: mean_ari < 0.7 → reconsider K or algorithm
n_init=10 already runs KMeans 10 times internally and picks the best
by inertia. The stability check is on top of that — does the best
of those 10 differ between random seeds? If yes, no amount of n_init
will save you.
3. KMeans assumes spherical convex clusters
KMeans implicitly assumes:
- Clusters are convex (no concave shapes)
- Clusters are roughly the same size
- Clusters are roughly the same density
- Clusters are roughly spherical (isotropic)
When these don't hold, KMeans fails:
| Data shape | KMeans | DBSCAN | GMM |
|---|---|---|---|
| Spherical convex blobs | ✓ best | ✓ | ✓ |
| Elliptical clusters | ✗ (drags boundaries) | ✓ | ✓ best |
| Concentric circles / moons | ✗ | ✓ best | ✗ |
| Density-based clusters | ✗ | ✓ best | ✗ |
| Wildly varying cluster sizes | ✗ | ✓ | partial |
| Many noise points | ✗ | ✓ best (labels noise as -1) | ✗ |
Default to KMeans for blob-like data; switch to DBSCAN for non-convex or noisy data. Use GMM when clusters are elliptical (e.g. correlated features within each cluster).
4. IsolationForest for anomaly detection
For anomalies in tabular data, IsolationForest is the default. It
randomly partitions the feature space and measures how few partitions
are needed to isolate each point — outliers get isolated quickly.
Works on high-dim data, no distribution assumptions, fast.
from sklearn.ensemble import IsolationForest
iso = IsolationForest(
contamination=0.02, # expected fraction of outliers
random_state=42,
n_jobs=-1,
)
predictions = iso.fit_predict(X) # -1 = outlier, +1 = inlier
is_anomaly = (predictions == -1)
# Or get continuous scores:
scores = iso.score_samples(X) # higher = more normal
The contamination parameter is your prior on the outlier fraction.
Set it to your best guess; if you genuinely don't know, start at 0.05
and tune. Don't set it too high or you'll flag normal points as
outliers.
Alternatives:
OneClassSVM— older, slower, doesn't scale beyond ~10k rowsLocalOutlierFactor— density-based, good for local anomaliesEllipticEnvelope— assumes Gaussian, fast, breaks on non-Gaussian data
5. PCA before clustering for high-dim data
In high dimensions, all pairwise distances become similar — the "curse of dimensionality." Distance-based algorithms (KMeans, DBSCAN, LOF) lose discriminative power. The fix is PCA before clustering:
from sklearn.decomposition import PCA
# Reduce to enough components to explain ~90% of variance
pca = PCA(n_components=0.90, random_state=42)
X_reduced = pca.fit_transform(X)
km = KMeans(n_clusters=best_k, n_init=10, random_state=42)
labels = km.fit_predict(X_reduced)
Rule of thumb: if you have more than ~20 features, PCA first. Below that, decide based on whether your features are informative or noisy.
For visualization (always 2D or 3D), use PCA(n_components=2). For
non-linear structure, use UMAP or t-SNE — but those are visualization
tools, not preprocessing for clustering (UMAP changes inter-point
distances).
Evaluation metrics
Without ground-truth labels (the usual case):
| Metric | What it measures | Range | Higher is better |
|---|---|---|---|
| Silhouette score | How well points fit their cluster vs the next | [-1, 1] | yes |
| Calinski-Harabasz | Between/within cluster variance ratio | [0, ∞) | yes |
| Davies-Bouldin | Average cluster similarity | [0, ∞) | no (lower) |
With ground-truth labels (for validation on synthetic data):
| Metric | What it measures | Range | Higher is better |
|---|---|---|---|
| Adjusted Rand Index (ARI) | Pairwise agreement vs truth, corrected for chance | [-1, 1] | yes |
| Normalized Mutual Information (NMI) | MI between cluster labels and truth, normalized | [0, 1] | yes |
| Homogeneity / Completeness | Cluster purity / class recovery | [0, 1] | yes |
ARI is the most commonly reported "did clustering recover the true groups" metric. Use it on synthetic data to validate the pipeline, even though you won't have it in production.
Common pitfalls
- Skipping EDA. You'll cluster on data quality artifacts.
- Not standardizing features. Distance metrics will be dominated
by whichever feature has the largest scale. Always
StandardScalerfirst. - Trusting the elbow method alone. Use silhouette.
- Ignoring stability. A clustering that changes wildly with the random seed isn't a clustering — it's noise.
- Using KMeans on non-convex data. Use DBSCAN.
- Setting
contaminationtoo high in IsolationForest. You'll flag normal points. - PCA on already-clean low-dim data. Adds noise. Only PCA when you have many features.
- Treating cluster labels as meaningful by default. Cluster 0 doesn't mean anything until you look at its members and decide what it represents (e.g. "this cluster looks like 'high-value weekday users'"). The algorithm gives you groups; the interpretation is on you.
Worked example
See demo.py (marimo notebook). It generates two synthetic
datasets to make the algorithm-vs-shape lesson concrete:
make_blobs(4 well-separated convex clusters) — KMeans wins, DBSCAN also worksmake_moons(two interleaved half-moons) — KMeans fails (it can't represent non-convex clusters), DBSCAN nails it
Then it walks through K selection via silhouette, the stability check, and IsolationForest anomaly detection on injected outliers. The end shows the same data in PCA space colored by each method's labels so the difference is visible.
After unsupervised: what to do next
Cluster labels are typically used for one of:
- Customer segmentation: each cluster gets a persona, marketing uses it
- Feature engineering for supervised models: cluster ID becomes a categorical feature in a classifier
- Anomaly triage: flag outliers for human review
- Data exploration: confirm that your data has the structure you expected (or doesn't, which is more interesting)
Whatever you do with the labels, always look at example members of each cluster before believing the clustering. The metrics tell you "this clustering is internally consistent" — they don't tell you "these clusters are meaningful for your problem."
More from brojonat/llmsrules
ibis-data
Use Ibis for database-agnostic data access in Python. Use when writing data queries, connecting to databases (DuckDB, PostgreSQL, SQLite), or building portable data pipelines that should work across backends.
13go-service
Build Go microservices with stdlib HTTP handlers, sqlc, urfave/cli, and slog. Use when creating or modifying a Go HTTP server, adding routes, middleware, database queries, or CLI commands.
13temporal-go
Build Temporal workflow applications in Go. Use when creating or modifying Temporal workflows, activities, workers, clients, signals, queries, updates, retry policies, saga patterns, or writing Temporal tests.
13parquet-analysis
Analyze parquet files using Python and Ibis. Use when the user wants to explore, transform, or analyze parquet data files, perform aggregations, joins, or export results. Works with local parquet files and provides database-agnostic data operations.
12ducklake
Work with DuckLake, an open lakehouse format built on DuckDB. Use when creating or querying DuckLake tables, managing snapshots, time travel, schema evolution, partitioning, or lakehouse maintenance operations.
12temporal-python
Build Temporal applications in Python using the temporalio SDK. Use when creating workflows, activities, workers, clients, signals, queries, updates, child workflows, timers, retry policies, saga/compensation patterns, testing, or any durable execution pattern in Python.
12