Unsupervised Learning and Clustering
How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.
Unsupervised learning and clustering are machine learning approaches that discover structure in data without labeled examples. Rather than learning to predict a predefined output, unsupervised methods find natural groupings, patterns, and representations in the data itself. Clustering algorithms segment data into groups of similar items; dimensionality reduction algorithms find compact representations; density estimation models the underlying probability distribution. These methods are essential for exploratory data analysis, customer segmentation, anomaly detection, and learning representations for downstream tasks.
Remembering
- Unsupervised learning — Learning patterns from data without labels or target outputs.
- Clustering — Partitioning data into groups (clusters) where items within a group are more similar to each other than to items in other groups.
- K-means — A centroid-based clustering algorithm that iteratively assigns points to the nearest of K cluster centers and updates centers.
- Hierarchical clustering — Builds a tree (dendrogram) of nested clusters; agglomerative (bottom-up) or divisive (top-down).
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise) — Finds clusters as dense regions separated by low-density areas; can discover arbitrarily shaped clusters.
- Gaussian Mixture Model (GMM) — A probabilistic model representing the data as a mixture of K Gaussian distributions; fit by EM algorithm.
- Expectation-Maximization (EM) — An iterative algorithm for fitting latent variable models; alternates between expectation (compute responsibility) and maximization (update parameters).
- PCA (Principal Component Analysis) — Linear dimensionality reduction that finds the directions of maximum variance.
- t-SNE — A non-linear dimensionality reduction for visualization; preserves local neighborhood structure.
- UMAP — Faster, more scalable alternative to t-SNE for visualization; better preserves global structure.
- Autoencoder — A neural network trained to reconstruct its input through a bottleneck; the bottleneck gives a compressed representation.
- Silhouette score — A clustering quality metric measuring how similar a point is to its own cluster vs. other clusters; range [-1, 1].
- Elbow method — Heuristic for choosing K in K-means: plot inertia vs. K, choose the "elbow" where improvement slows.
- Inertia — Within-cluster sum of squared distances from each point to its cluster center; K-means minimizes this.
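The inertia and elbow-method entries above can be made concrete with a short scikit-learn sketch on synthetic data (the dataset and the range of K are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# Inertia (within-cluster sum of squares) always shrinks as K grows,
# so the elbow method looks for the K where the improvement slows.
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, v in inertias.items():
    print(f"K={k}: inertia={v:.1f}")
```

On this data the inertia drops sharply up to K=3 and flattens afterward, which is the elbow.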
Understanding
Unsupervised learning discovers **inherent structure** in data. The key challenge: without labels, how do we know if the discovered structure is meaningful? This is the fundamental evaluation problem of unsupervised learning.
- **K-means** is the simplest and most widely used clustering algorithm. Initialize K cluster centers (randomly or with K-means++). Assign each point to the nearest center. Recompute centers as the mean of assigned points. Repeat until convergence. Pros: simple, fast. Cons: assumes spherical clusters of roughly equal size, requires K to be specified, sensitive to initialization.
- **DBSCAN** doesn't require K, can find clusters of arbitrary shape, and identifies outliers as noise points not belonging to any cluster. How it works: a "core" point has ≥ MinPts neighbors within radius ε; points density-reachable from a core point form a cluster; unreachable points are noise. Critical for applications where cluster number and shape are unknown.
- **Dimensionality reduction** is essential before clustering in high-dimensional spaces (the curse of dimensionality makes all points nearly equidistant). PCA gives a linear compression preserving maximum variance. t-SNE/UMAP give non-linear compressions suited for visualization. Autoencoders give deep non-linear compressions suited for representation learning.
- **The curse of dimensionality**: In high-dimensional spaces, data becomes increasingly sparse and the notion of "nearest neighbor" breaks down because pairwise distances concentrate toward the same value. Distance-based methods like K-means typically degrade beyond a few dozen dimensions without dimensionality reduction. This motivates learning compact representations before clustering.
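The assign/update loop described for K-means above can be sketched from scratch in a few lines of NumPy. This is an illustrative toy, not a replacement for scikit-learn's implementation (which adds K-means++ initialization and multiple restarts):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Toy K-means: random init, assign to nearest center, recompute means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: label each point with its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, labels

# Two well-separated blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(5.0, 0.3, size=(50, 2))])
centers, labels = kmeans(X, k=2)
```

Even with purely random initialization, well-separated blobs like these are recovered in a handful of iterations; the sensitivity to initialization shows up on harder data.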
Applying
Customer segmentation pipeline:

<syntaxhighlight lang="python">
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
import umap

# Customer behavioral data
df = pd.read_csv("customer_features.csv")
features = ['recency_days', 'frequency', 'monetary_value', 'avg_session_duration',
            'pages_per_session', 'email_open_rate', 'support_tickets']
X = df[features].fillna(df[features].median())
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Dimensionality reduction for visualization
reducer = umap.UMAP(n_components=2, random_state=42)
X_2d = reducer.fit_transform(X_scaled)

# Find optimal K using silhouette score
scores = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)
optimal_k = max(scores, key=scores.get)
print(f"Optimal K: {optimal_k} (silhouette: {scores[optimal_k]:.3f})")

# Final clustering
kmeans = KMeans(n_clusters=optimal_k, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)

# Characterize each segment
for cluster_id in range(optimal_k):
    segment = df[df['cluster'] == cluster_id][features]
    print(f"\nCluster {cluster_id} ({len(segment)} customers):")
    print(segment.mean().round(2))
</syntaxhighlight>
- Algorithm selection guide
- Known K, spherical clusters → K-means (fast, interpretable)
- Unknown K, arbitrary shapes → DBSCAN (handles noise, arbitrary clusters)
- Probabilistic assignments needed → Gaussian Mixture Model (soft clustering)
- Hierarchical structure → Agglomerative clustering (Ward linkage)
- High-dimensional data → PCA/UMAP first, then K-means
- Visualization → UMAP (speed + quality) or t-SNE (quality at small scale)
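As a quick illustration of the "unknown K, arbitrary shapes" row in the guide, here is a sketch of DBSCAN on scikit-learn's two-moons dataset, a shape K-means cannot separate. The `eps` and `min_samples` values are illustrative and must be tuned per dataset:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: non-spherical clusters
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps = neighborhood radius, min_samples = MinPts from the definition above
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_  # label -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Found {n_clusters} clusters and {np.sum(labels == -1)} noise points")
```

DBSCAN recovers the two moons without being told K; running K-means with K=2 on the same data splits each moon roughly in half instead.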
Analyzing
| Algorithm | K Required | Cluster Shape | Handles Noise | Scale |
|---|---|---|---|---|
| K-means | Yes | Spherical only | No (noise → nearest cluster) | Very large |
| DBSCAN | No | Arbitrary | Yes | Large |
| GMM | Yes | Elliptical | No | Medium |
| Hierarchical | No (choose at cut) | Any | No | Small-medium |
| HDBSCAN | No | Arbitrary | Yes | Large |
Failure modes: K-means is sensitive to initialization (K-means++ mitigates this), to outliers (which pull centroids away from dense regions), and to unequal cluster sizes. DBSCAN is sensitive to its ε and MinPts parameters and is hard to tune when density varies across the data. GMM components can collapse onto a handful of points (singular covariances) if K is too large. All methods fail in very high dimensions without dimensionality reduction.
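The soft (probabilistic) assignments noted for GMM in the table can be seen directly via `predict_proba` in scikit-learn; a minimal sketch on synthetic blobs:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Each row of predict_proba is a responsibility vector over the K components,
# computed in the E-step of EM; the rows sum to 1.
gmm = GaussianMixture(n_components=3, covariance_type="full",
                      random_state=42).fit(X)
resp = gmm.predict_proba(X)
hard_labels = resp.argmax(axis=1)  # collapse to hard labels when needed
print(resp.shape)
```

Points deep inside a blob get a responsibility near 1 for one component; points between blobs get split responsibilities, which K-means cannot express.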
Evaluating
Unsupervised evaluation without ground truth: (1) **Silhouette score**: ranges from -1 (wrong cluster) to +1 (well-separated); target >0.5 for good clustering. (2) **Davies-Bouldin index**: lower is better; measures average similarity between each cluster and its most similar cluster. (3) **Calinski-Harabasz index**: higher is better; ratio of between-cluster to within-cluster dispersion. With ground truth: **ARI (Adjusted Rand Index)** measures agreement with true labels; perfect = 1.0. Expert practitioners combine quantitative metrics with domain expert evaluation of cluster interpretability.
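All four metrics are available in `sklearn.metrics`; a short sketch on synthetic data (the dataset and the resulting scores are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=7)
labels = KMeans(n_clusters=4, n_init=10, random_state=7).fit_predict(X)

print(f"Silhouette (higher is better):        {silhouette_score(X, labels):.3f}")
print(f"Davies-Bouldin (lower is better):     {davies_bouldin_score(X, labels):.3f}")
print(f"Calinski-Harabasz (higher is better): {calinski_harabasz_score(X, labels):.1f}")
# ARI needs ground truth; here we use the known blob labels
print(f"ARI vs. true labels:                  {adjusted_rand_score(y_true, labels):.3f}")
```

Note the different directions: silhouette and Calinski-Harabasz reward higher values, Davies-Bouldin rewards lower ones, so report them together rather than optimizing any single one.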
Creating
Designing a production segmentation system: (1) Feature engineering: create behavioral, transactional, and demographic features. (2) Preprocessing: standardize, handle missing values, remove collinear features. (3) Dimensionality reduction: PCA to 10-20 components retaining 90% variance. (4) Cluster search: evaluate K-means for K=2–15 using silhouette score and domain knowledge. (5) Stability check: run clustering 10× with different random seeds; stable segments persist. (6) Interpretation: characterize each cluster by mean feature values and example members; name segments (e.g., "High-value champions," "At-risk churners"). (7) Deployment: assign new users to nearest centroid in real time for personalization.
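Steps 3 and 5 of this pipeline (variance-based PCA and the seed-stability check) can be sketched as follows, using synthetic stand-in data in place of the engineered customer features; the thresholds are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Stand-in for engineered customer features (steps 1-2 would produce these)
X, _ = make_blobs(n_samples=500, centers=5, n_features=30, random_state=0)
X = StandardScaler().fit_transform(X)

# Step 3: a float n_components asks PCA for enough components
# to retain that fraction of the total variance
pca = PCA(n_components=0.90, random_state=0)
X_red = pca.fit_transform(X)
print(f"{pca.n_components_} components retain "
      f"{pca.explained_variance_ratio_.sum():.1%} of variance")

# Step 5: stability check, i.e. clusterings from different seeds should agree
runs = [KMeans(n_clusters=5, n_init=10, random_state=s).fit_predict(X_red)
        for s in range(10)]
aris = [adjusted_rand_score(runs[0], r) for r in runs[1:]]
print(f"Mean ARI vs. first run: {np.mean(aris):.3f}")
```

A mean pairwise ARI near 1.0 indicates the segments persist across seeds; unstable segments (low ARI) usually mean K is wrong or the features do not support the structure.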