Unsupervised Learning and Clustering
How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.
Unsupervised learning and clustering are machine learning approaches that discover structure in data without labeled examples. Rather than learning to predict a predefined output, unsupervised methods find natural groupings, patterns, and representations in the data itself. Clustering algorithms segment data into groups of similar items; dimensionality reduction algorithms find compact representations; density estimation models the underlying probability distribution. These methods are essential for exploratory data analysis, customer segmentation, anomaly detection, and learning representations for downstream tasks.
Remembering
- Unsupervised learning — Learning patterns from data without labels or target outputs.
- Clustering — Partitioning data into groups (clusters) where items within a group are more similar to each other than to items in other groups.
- K-means — A centroid-based clustering algorithm that iteratively assigns points to the nearest of K cluster centers and updates centers.
- Hierarchical clustering — Builds a tree (dendrogram) of nested clusters; agglomerative (bottom-up) or divisive (top-down).
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise) — Finds clusters as dense regions separated by low-density areas; can discover arbitrarily shaped clusters.
- Gaussian Mixture Model (GMM) — A probabilistic model representing the data as a mixture of K Gaussian distributions; fit by EM algorithm.
- Expectation-Maximization (EM) — An iterative algorithm for fitting latent variable models; alternates between expectation (compute responsibility) and maximization (update parameters).
- PCA (Principal Component Analysis) — Linear dimensionality reduction that finds the directions of maximum variance.
- t-SNE — A non-linear dimensionality reduction for visualization; preserves local neighborhood structure.
- UMAP — Faster, more scalable alternative to t-SNE for visualization; better preserves global structure.
- Autoencoder — A neural network trained to reconstruct its input through a bottleneck; the bottleneck gives a compressed representation.
- Silhouette score — A clustering quality metric measuring how similar a point is to its own cluster vs. other clusters; range [-1, 1].
- Elbow method — Heuristic for choosing K in K-means: plot inertia vs. K, choose the "elbow" where improvement slows.
- Inertia — Within-cluster sum of squared distances from each point to its cluster center; K-means minimizes this.
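The inertia and elbow-method entries above can be made concrete with a short scikit-learn sketch on synthetic data (the dataset and the range of K are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# Inertia (within-cluster sum of squares) always shrinks as K grows,
# so the elbow method looks for the K where the improvement slows.
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, v in inertias.items():
    print(f"K={k}: inertia={v:.1f}")
```

On this data the inertia drops sharply up to K=3 and flattens afterward, which is the elbow.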
Understanding
Unsupervised learning discovers **inherent structure** in data. The key challenge: without labels, how do we know if the discovered structure is meaningful? This is the fundamental evaluation problem of unsupervised learning.
- **K-means** is the simplest and most widely used clustering algorithm. Initialize K cluster centers (randomly or with K-means++). Assign each point to the nearest center. Recompute centers as the mean of assigned points. Repeat until convergence. Pros: simple, fast. Cons: assumes spherical clusters of roughly equal size, requires K to be specified, sensitive to initialization.
- **DBSCAN** doesn't require K, can find clusters of arbitrary shape, and identifies outliers as noise points not belonging to any cluster. How it works: a "core" point has ≥ MinPts neighbors within radius ε; points density-reachable from a core point form a cluster; unreachable points are noise. Critical for applications where cluster number and shape are unknown.
- **Dimensionality reduction** is essential before clustering in high-dimensional spaces (the curse of dimensionality makes all points nearly equidistant). PCA gives a linear compression preserving maximum variance. t-SNE/UMAP give non-linear compressions suited for visualization. Autoencoders give deep non-linear compressions suited for representation learning.
- **The curse of dimensionality**: In high-dimensional spaces, data becomes increasingly sparse and the notion of "nearest neighbor" breaks down because pairwise distances concentrate toward the same value. Distance-based methods like K-means typically degrade beyond a few dozen dimensions without dimensionality reduction. This motivates learning compact representations before clustering.
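The assign/update loop described for K-means above can be sketched from scratch in a few lines of NumPy. This is an illustrative toy, not a replacement for scikit-learn's implementation (which adds K-means++ initialization and multiple restarts):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Toy K-means: random init, assign to nearest center, recompute means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: label each point with its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, labels

# Two well-separated blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(5.0, 0.3, size=(50, 2))])
centers, labels = kmeans(X, k=2)
```

Even with purely random initialization, well-separated blobs like these are recovered in a handful of iterations; the sensitivity to initialization shows up on harder data.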
Applying
Customer segmentation pipeline:

<syntaxhighlight lang="python">
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
import umap

# Customer behavioral data
df = pd.read_csv("customer_features.csv")
features = ['recency_days', 'frequency', 'monetary_value', 'avg_session_duration',
            'pages_per_session', 'email_open_rate', 'support_tickets']
X = df[features].fillna(df[features].median())
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Dimensionality reduction for visualization
reducer = umap.UMAP(n_components=2, random_state=42)
X_2d = reducer.fit_transform(X_scaled)

# Find optimal K using silhouette score
scores = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)
optimal_k = max(scores, key=scores.get)
print(f"Optimal K: {optimal_k} (silhouette: {scores[optimal_k]:.3f})")

# Final clustering
kmeans = KMeans(n_clusters=optimal_k, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)

# Characterize each segment
for cluster_id in range(optimal_k):
    segment = df[df['cluster'] == cluster_id][features]
    print(f"\nCluster {cluster_id} ({len(segment)} customers):")
    print(segment.mean().round(2))
</syntaxhighlight>
- Algorithm selection guide
- Known K, spherical clusters → K-means (fast, interpretable)
- Unknown K, arbitrary shapes → DBSCAN (handles noise, arbitrary clusters)
- Probabilistic assignments needed → Gaussian Mixture Model (soft clustering)
- Hierarchical structure → Agglomerative clustering (Ward linkage)
- High-dimensional data → PCA/UMAP first, then K-means
- Visualization → UMAP (speed + quality) or t-SNE (quality at small scale)
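As a quick illustration of the "unknown K, arbitrary shapes" row in the guide, here is a sketch of DBSCAN on scikit-learn's two-moons dataset, a shape K-means cannot separate. The `eps` and `min_samples` values are illustrative and must be tuned per dataset:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: non-spherical clusters
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps = neighborhood radius, min_samples = MinPts from the definition above
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_  # label -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Found {n_clusters} clusters and {np.sum(labels == -1)} noise points")
```

DBSCAN recovers the two moons without being told K; running K-means with K=2 on the same data splits each moon roughly in half instead.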
Analyzing
| Algorithm | K Required | Cluster Shape | Handles Noise | Scale |
|---|---|---|---|---|
| K-means | Yes | Spherical only | No (noise → nearest cluster) | Very large |
| DBSCAN | No | Arbitrary | Yes | Large |
| GMM | Yes | Elliptical | No | Medium |
| Hierarchical | No (choose at cut) | Any | No | Small-medium |
| HDBSCAN | No | Arbitrary | Yes | Large |
Failure modes: K-means is sensitive to initialization (K-means++ mitigates this), to outliers (which pull centroids away from dense regions), and to unequal cluster sizes. DBSCAN is sensitive to its ε and MinPts parameters and is hard to tune when density varies across the data. GMM components can collapse onto a handful of points (singular covariances) if K is too large. All methods fail in very high dimensions without dimensionality reduction.
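The soft (probabilistic) assignments noted for GMM in the table can be seen directly via `predict_proba` in scikit-learn; a minimal sketch on synthetic blobs:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Each row of predict_proba is a responsibility vector over the K components,
# computed in the E-step of EM; the rows sum to 1.
gmm = GaussianMixture(n_components=3, covariance_type="full",
                      random_state=42).fit(X)
resp = gmm.predict_proba(X)
hard_labels = resp.argmax(axis=1)  # collapse to hard labels when needed
print(resp.shape)
```

Points deep inside a blob get a responsibility near 1 for one component; points between blobs get split responsibilities, which K-means cannot express.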
Evaluating
Unsupervised evaluation without ground truth: (1) **Silhouette score**: ranges from -1 (wrong cluster) to +1 (well-separated); target >0.5 for good clustering. (2) **Davies-Bouldin index**: lower is better; measures average similarity between each cluster and its most similar cluster. (3) **Calinski-Harabasz index**: higher is better; ratio of between-cluster to within-cluster dispersion. With ground truth: **ARI (Adjusted Rand Index)** measures agreement with true labels; perfect = 1.0. Expert practitioners combine quantitative metrics with domain expert evaluation of cluster interpretability.
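All four metrics are available in `sklearn.metrics`; a short sketch on synthetic data (the dataset and the resulting scores are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=7)
labels = KMeans(n_clusters=4, n_init=10, random_state=7).fit_predict(X)

print(f"Silhouette (higher is better):        {silhouette_score(X, labels):.3f}")
print(f"Davies-Bouldin (lower is better):     {davies_bouldin_score(X, labels):.3f}")
print(f"Calinski-Harabasz (higher is better): {calinski_harabasz_score(X, labels):.1f}")
# ARI needs ground truth; here we use the known blob labels
print(f"ARI vs. true labels:                  {adjusted_rand_score(y_true, labels):.3f}")
```

Note the different directions: silhouette and Calinski-Harabasz reward higher values, Davies-Bouldin rewards lower ones, so report them together rather than optimizing any single one.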
Creating
Designing a production segmentation system: (1) Feature engineering: create behavioral, transactional, and demographic features. (2) Preprocessing: standardize, handle missing values, remove collinear features. (3) Dimensionality reduction: PCA to 10-20 components retaining 90% variance. (4) Cluster search: evaluate K-means for K=2–15 using silhouette score and domain knowledge. (5) Stability check: run clustering 10× with different random seeds; stable segments persist. (6) Interpretation: characterize each cluster by mean feature values and example members; name segments (e.g., "High-value champions," "At-risk churners"). (7) Deployment: assign new users to nearest centroid in real time for personalization.
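Steps 3 and 5 of this pipeline (variance-based PCA and the seed-stability check) can be sketched as follows, using synthetic stand-in data in place of the engineered customer features; the thresholds are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Stand-in for engineered customer features (steps 1-2 would produce these)
X, _ = make_blobs(n_samples=500, centers=5, n_features=30, random_state=0)
X = StandardScaler().fit_transform(X)

# Step 3: a float n_components asks PCA for enough components
# to retain that fraction of the total variance
pca = PCA(n_components=0.90, random_state=0)
X_red = pca.fit_transform(X)
print(f"{pca.n_components_} components retain "
      f"{pca.explained_variance_ratio_.sum():.1%} of variance")

# Step 5: stability check, i.e. clusterings from different seeds should agree
runs = [KMeans(n_clusters=5, n_init=10, random_state=s).fit_predict(X_red)
        for s in range(10)]
aris = [adjusted_rand_score(runs[0], r) for r in runs[1:]]
print(f"Mean ARI vs. first run: {np.mean(aris):.3f}")
```

A mean pairwise ARI near 1.0 indicates the segments persist across seeds; unstable segments (low ARI) usually mean K is wrong or the features do not support the structure.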