Unsupervised Clustering

From BloomWiki
Revision as of 14:20, 23 April 2026 by Wordpad (talk | contribs) (BloomWiki: Unsupervised Clustering)

How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain. Learn more about how BloomWiki works.

Unsupervised learning and clustering are machine learning approaches that discover structure in data without labeled examples. Rather than learning to predict a predefined output, unsupervised methods find natural groupings, patterns, and representations in the data itself. Clustering algorithms segment data into groups of similar items; dimensionality reduction algorithms find compact representations; density estimation models the underlying probability distribution. These methods are essential for exploratory data analysis, customer segmentation, anomaly detection, and learning representations for downstream tasks.

Remembering

  • Unsupervised learning — Learning patterns from data without labels or target outputs.
  • Clustering — Partitioning data into groups (clusters) where items within a group are more similar to each other than to items in other groups.
  • K-means — A centroid-based clustering algorithm that iteratively assigns points to the nearest of K cluster centers and updates centers.
  • Hierarchical clustering — Builds a tree (dendrogram) of nested clusters; agglomerative (bottom-up) or divisive (top-down).
  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise) — Finds clusters as dense regions separated by low-density areas; can discover arbitrarily shaped clusters.
  • Gaussian Mixture Model (GMM) — A probabilistic model representing the data as a mixture of K Gaussian distributions; fit by EM algorithm.
  • Expectation-Maximization (EM) — An iterative algorithm for fitting latent variable models; alternates between expectation (compute responsibility) and maximization (update parameters).
  • PCA (Principal Component Analysis) — Linear dimensionality reduction that finds the directions of maximum variance.
  • t-SNE — A non-linear dimensionality reduction for visualization; preserves local neighborhood structure.
  • UMAP — Faster, more scalable alternative to t-SNE for visualization; better preserves global structure.
  • Autoencoder — A neural network trained to reconstruct its input through a bottleneck; the bottleneck gives a compressed representation.
  • Silhouette score — A clustering quality metric measuring how similar a point is to its own cluster vs. other clusters; range [-1, 1].
  • Elbow method — Heuristic for choosing K in K-means: plot inertia vs. K, choose the "elbow" where improvement slows.
  • Inertia — Within-cluster sum of squared distances from each point to its cluster center; K-means minimizes this.

Understanding

Unsupervised learning discovers inherent structure in data. The key challenge: without labels, how do we know if the discovered structure is meaningful? This is the fundamental evaluation problem of unsupervised learning.

K-means is the simplest and most widely used clustering algorithm. Initialize K cluster centers (randomly or with K-means++). Assign each point to the nearest center. Recompute centers as the mean of assigned points. Repeat until convergence. Pros: simple, fast. Cons: assumes spherical clusters of equal size, requires K to be specified, sensitive to initialization.
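The assign/update loop above can be sketched in a few lines of NumPy. This is a minimal illustration on toy two-blob data (the data and k=2 are assumptions); in practice you would use scikit-learn's KMeans.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centers as k distinct random data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving: converged
        centers = new_centers
    return centers, labels

# Two well-separated 2-D blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
centers, labels = kmeans(X, k=2)
print(np.round(centers, 2))
```

Real implementations add K-means++ initialization, empty-cluster handling, and multiple restarts (n_init), all of which scikit-learn provides.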

DBSCAN doesn't require K, can find clusters of arbitrary shape, and identifies outliers as noise points not belonging to any cluster. It works as follows: a "core" point has at least MinPts neighbors within radius ε; core points and the points density-reachable from them form a cluster; points reachable from no core point are labeled noise. This makes it valuable for applications where the number and shape of clusters are unknown.
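A short sketch on scikit-learn's two-moons dataset, a non-spherical shape that K-means cannot separate (the eps and min_samples values here are illustrative; on real data they need tuning, e.g. via a k-distance plot):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles with a little noise
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps = neighborhood radius, min_samples = MinPts
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Label -1 marks noise points; other labels are cluster ids
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_noise = int((db.labels_ == -1).sum())
print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```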

Dimensionality reduction is often essential before clustering in high-dimensional spaces, where the curse of dimensionality makes distances between points nearly uniform. PCA gives a linear compression preserving maximum variance. t-SNE and UMAP give non-linear compressions suited for visualization. Autoencoders give deep non-linear compressions suited for representation learning.
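As a sketch of the PCA route, scikit-learn's PCA accepts a float n_components and keeps just enough components to explain that fraction of the variance (the synthetic low-rank data below is an assumption for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 500 samples in 50 dimensions, but the signal lives in a 5-dim subspace
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 50))

# Keep the smallest number of components reaching 95% cumulative variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, round(pca.explained_variance_ratio_.sum(), 4))
```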

The curse of dimensionality: In high-dimensional spaces, data becomes increasingly sparse and the notion of "nearest neighbor" loses meaning, because the distances to the nearest and farthest points become nearly equal. Distance-based methods such as K-means often degrade noticeably beyond a few dozen dimensions unless dimensionality is reduced first. This motivates learning compact representations before clustering.
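The concentration effect is easy to demonstrate with a small simulation (uniform random data; the exact contrast values depend on the distribution, but the shrinking trend is the point):

```python
import numpy as np

rng = np.random.default_rng(0)
contrasts = {}
for d in (2, 20, 200, 2000):
    X = rng.random((1000, d))  # 1000 uniform points in the d-dim unit cube
    # Distances from the first point to all the others
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    # Relative contrast between the farthest and nearest neighbor
    contrasts[d] = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast = {contrasts[d]:.2f}")
```

As d grows, the contrast collapses toward zero, so "nearest" and "farthest" become nearly indistinguishable.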

Applying

Customer segmentation pipeline:

<syntaxhighlight lang="python">
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
import umap

# 1. Customer behavioral data
df = pd.read_csv("customer_features.csv")
features = ['recency_days', 'frequency', 'monetary_value', 'avg_session_duration',
            'pages_per_session', 'email_open_rate', 'support_tickets']
X = df[features].fillna(df[features].median())
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Dimensionality reduction for visualization
reducer = umap.UMAP(n_components=2, random_state=42)
X_2d = reducer.fit_transform(X_scaled)

# 3. Find optimal K using silhouette score
scores = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

optimal_k = max(scores, key=scores.get)
print(f"Optimal K: {optimal_k} (silhouette: {scores[optimal_k]:.3f})")

# 4. Final clustering
kmeans = KMeans(n_clusters=optimal_k, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)

# 5. Characterize each segment
for cluster_id in range(optimal_k):
    segment = df[df['cluster'] == cluster_id][features]
    print(f"\nCluster {cluster_id} ({len(segment)} customers):")
    print(segment.mean().round(2))
</syntaxhighlight>

Algorithm selection guide

  • Known K, spherical clusters → K-means (fast, interpretable)
  • Unknown K, arbitrary shapes → DBSCAN (handles noise and arbitrary cluster shapes)
  • Probabilistic assignments needed → Gaussian Mixture Model (soft clustering)
  • Hierarchical structure → Agglomerative clustering (Ward linkage)
  • High-dimensional data → PCA/UMAP first, then K-means
  • Visualization → UMAP (speed + quality) or t-SNE (quality at small scale)
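For the "probabilistic assignments" case above, a GMM reports per-component responsibilities via predict_proba rather than hard labels. A minimal sketch on overlapping synthetic Gaussians (the data is an assumption for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping 1-D Gaussians centered at 0 and 4
X = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 1, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Soft assignments: rows sum to 1, one column per component
probs = gmm.predict_proba(np.array([[0.0], [2.0], [4.0]]))
print(probs.round(3))
```

The point midway between the two means gets a near 50/50 responsibility, which a hard-assignment method would hide.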

Analyzing

Clustering Algorithm Comparison

{| class="wikitable"
! Algorithm !! K required !! Cluster shape !! Handles noise !! Scale
|-
| K-means || Yes || Spherical only || No (noise pulled into nearest cluster) || Very large
|-
| DBSCAN || No || Arbitrary || Yes || Large
|-
| GMM || Yes || Elliptical || No || Medium
|-
| Hierarchical || No (choose at dendrogram cut) || Any || No || Small to medium
|-
| HDBSCAN || No || Arbitrary || Yes || Large
|}

Failure modes: K-means is sensitive to initialization (K-means++ mitigates this), to outliers (which pull centroids away from dense regions), and to unequal cluster sizes. DBSCAN is sensitive to its ε and MinPts parameters, which are hard to tune when density varies across the data. GMM components can collapse onto single points (singular covariances) when K is too large. All methods struggle in very high dimensions without dimensionality reduction.

Evaluating

Unsupervised evaluation without ground truth: (1) Silhouette score: ranges from -1 (likely wrong cluster) to +1 (well-separated); scores above roughly 0.5 usually indicate good clustering. (2) Davies-Bouldin index: lower is better; measures the average similarity between each cluster and its most similar cluster. (3) Calinski-Harabasz index: higher is better; the ratio of between-cluster to within-cluster dispersion. With ground truth: ARI (Adjusted Rand Index) measures agreement with true labels; 1.0 is perfect, and values near 0 indicate chance-level agreement. Expert practitioners combine quantitative metrics with domain-expert evaluation of cluster interpretability.
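All four metrics are available in scikit-learn; a minimal sketch computing them on synthetic blobs (the dataset and K are assumptions for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

X, y_true = make_blobs(n_samples=600, centers=3, cluster_std=0.8, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

sil = silhouette_score(X, labels)          # higher is better, range [-1, 1]
dbi = davies_bouldin_score(X, labels)      # lower is better
ch = calinski_harabasz_score(X, labels)    # higher is better
ari = adjusted_rand_score(y_true, labels)  # 1.0 = perfect match with truth
print(f"silhouette={sil:.3f}  davies-bouldin={dbi:.3f}  "
      f"calinski-harabasz={ch:.1f}  ARI={ari:.3f}")
```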

Creating

Designing a production segmentation system: (1) Feature engineering: create behavioral, transactional, and demographic features. (2) Preprocessing: standardize, handle missing values, remove collinear features. (3) Dimensionality reduction: PCA to 10-20 components retaining 90% variance. (4) Cluster search: evaluate K-means for K=2–15 using silhouette score and domain knowledge. (5) Stability check: run clustering 10× with different random seeds; stable segments persist. (6) Interpretation: characterize each cluster by mean feature values and example members; name segments (e.g., "High-value champions," "At-risk churners"). (7) Deployment: assign new users to nearest centroid in real time for personalization.
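The stability check in step 5 can be sketched as pairwise ARI between clusterings from different random seeds (synthetic blobs and n_init=1 are assumptions; n_init=1 deliberately exposes initialization sensitivity):

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=7)

# Ten runs with different seeds; a stable clustering gives high pairwise ARI
runs = [KMeans(n_clusters=4, n_init=1, random_state=s).fit_predict(X)
        for s in range(10)]
aris = [adjusted_rand_score(a, b) for a, b in combinations(runs, 2)]
print(f"mean pairwise ARI: {np.mean(aris):.3f}  min: {np.min(aris):.3f}")
```

A low minimum ARI flags segments that depend on the random seed and should not be shipped as stable customer groups.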