Unsupervised Learning and Clustering
How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain. Learn more about how BloomWiki works.
Unsupervised learning and clustering are machine learning approaches that discover structure in data without labeled examples. Rather than learning to predict a predefined output, unsupervised methods find natural groupings, patterns, and representations in the data itself. Clustering algorithms segment data into groups of similar items; dimensionality reduction algorithms find compact representations; density estimation models the underlying probability distribution. These methods are essential for exploratory data analysis, customer segmentation, anomaly detection, and learning representations for downstream tasks.
Remembering
- Unsupervised learning — Learning patterns from data without labels or target outputs.
- Clustering — Partitioning data into groups (clusters) where items within a group are more similar to each other than to items in other groups.
- K-means — A centroid-based clustering algorithm that iteratively assigns points to the nearest of K cluster centers and updates centers.
- Hierarchical clustering — Builds a tree (dendrogram) of nested clusters; agglomerative (bottom-up) or divisive (top-down).
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise) — Finds clusters as dense regions separated by low-density areas; can discover arbitrarily shaped clusters.
- Gaussian Mixture Model (GMM) — A probabilistic model representing the data as a mixture of K Gaussian distributions; fit by EM algorithm.
- Expectation-Maximization (EM) — An iterative algorithm for fitting latent-variable models; alternates between an expectation step (computing each component's responsibility for each point) and a maximization step (updating the parameters).
- PCA (Principal Component Analysis) — Linear dimensionality reduction that finds the directions of maximum variance.
- t-SNE — Non-linear dimensionality reduction for visualization; preserves local neighborhood structure.
- UMAP — Faster, more scalable alternative to t-SNE for visualization; better preserves global structure.
- Autoencoder — A neural network trained to reconstruct its input through a bottleneck; the bottleneck gives a compressed representation.
- Silhouette score — A clustering quality metric measuring how similar a point is to its own cluster vs. other clusters; range [-1, 1].
- Elbow method — Heuristic for choosing K in K-means: plot inertia vs. K and choose the "elbow" where improvement slows (a short sketch follows this list).
- Inertia — Within-cluster sum of squared distances from each point to its cluster center; K-means minimizes this.
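The elbow method above is easy to make concrete. A minimal sketch, assuming a toy dataset from scikit-learn's make_blobs (the data and K range are illustrative, not from the article):

<syntaxhighlight lang="python">
# Elbow method sketch: plot inertia vs. K and look for the bend
# where adding more clusters stops paying off.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)  # toy data

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in ks]  # within-cluster sum of squared distances

plt.plot(ks, inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Inertia")
plt.title("Elbow method: choose K at the bend")
plt.show()
</syntaxhighlight>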
Understanding
Unsupervised learning discovers **inherent structure** in data. The key challenge: without labels, how do we know if the discovered structure is meaningful? This is the fundamental evaluation problem of unsupervised learning.
- **K-means** is the simplest and most widely used clustering algorithm: initialize K cluster centers (randomly or with K-means++), assign each point to the nearest center, recompute each center as the mean of its assigned points, and repeat until convergence (a minimal from-scratch sketch follows this list). Pros: simple, fast. Cons: assumes spherical clusters of roughly equal size, requires K to be specified in advance, sensitive to initialization.
- **DBSCAN** doesn't require K, can find clusters of arbitrary shape, and identifies outliers as noise points not belonging to any cluster. It works as follows: a "core" point has at least MinPts neighbors within radius ε; points density-reachable from a core point form a cluster; points reachable from no core point are noise. Critical for applications where the number and shape of clusters are unknown.
- **Dimensionality reduction** is essential before clustering in high-dimensional spaces (the curse of dimensionality makes all points roughly equidistant). PCA gives a linear compression preserving maximum variance. t-SNE and UMAP give non-linear compressions suited for visualization. Autoencoders give deep non-linear compressions suited for representation learning.
- **The curse of dimensionality**: in high-dimensional spaces, data becomes increasingly sparse and the notion of "nearest neighbor" breaks down because pairwise distances concentrate toward the same value. Distance-based methods such as K-means degrade sharply beyond a few dozen dimensions unless dimensionality is reduced first, which motivates learning compact representations before clustering.
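To make the K-means loop concrete, here is a minimal from-scratch sketch of the assign/update iteration (NumPy only; the toy data, plain random initialization, and empty-cluster handling are illustrative assumptions, not a production implementation):

<syntaxhighlight lang="python">
# Minimal K-means (Lloyd's algorithm): assign each point to its nearest
# center, recompute each center as the mean of its points, repeat.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Plain random initialization; K-means++ would give a better start.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: distance from every point to every center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of assigned points (keep old center if empty).
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged: assignments no longer move the centers
        centers = new_centers
    return labels, centers

# Toy example: two well-separated Gaussian blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(X, k=2)
</syntaxhighlight>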
Applying
Customer segmentation pipeline:

<syntaxhighlight lang="python">
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
import umap

# Customer behavioral data
df = pd.read_csv("customer_features.csv")
features = ['recency_days', 'frequency', 'monetary_value', 'avg_session_duration',
            'pages_per_session', 'email_open_rate', 'support_tickets']
X = df[features].fillna(df[features].median())
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Dimensionality reduction for visualization
reducer = umap.UMAP(n_components=2, random_state=42)
X_2d = reducer.fit_transform(X_scaled)

# Find optimal K using silhouette score
scores = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)
optimal_k = max(scores, key=scores.get)
print(f"Optimal K: {optimal_k} (silhouette: {scores[optimal_k]:.3f})")

# Final clustering
kmeans = KMeans(n_clusters=optimal_k, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)

# Characterize each segment
for cluster_id in range(optimal_k):
    segment = df[df['cluster'] == cluster_id][features]
    print(f"\nCluster {cluster_id} ({len(segment)} customers):")
    print(segment.mean().round(2))
</syntaxhighlight>
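The pipeline computes the 2-D UMAP embedding X_2d but never draws it; a short follow-on sketch for visualizing the final segments (matplotlib is an assumed dependency):

<syntaxhighlight lang="python">
# Plot the customer segments in the 2-D UMAP embedding computed above.
import matplotlib.pyplot as plt

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=df['cluster'], cmap='tab10', s=8)
plt.xlabel("UMAP dimension 1")
plt.ylabel("UMAP dimension 2")
plt.title("Customer segments in UMAP space")
plt.colorbar(label="cluster")
plt.show()
</syntaxhighlight>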
**Algorithm selection guide** (a usage sketch follows this list):
- Known K, spherical clusters → K-means (fast, interpretable)
- Unknown K, arbitrary shapes → DBSCAN (handles noise, arbitrary clusters)
- Probabilistic assignments needed → Gaussian Mixture Model (soft clustering)
- Hierarchical structure → Agglomerative clustering (Ward linkage)
- High-dimensional data → PCA/UMAP first, then K-means
- Visualization → UMAP (speed + quality) or t-SNE (quality at small scale)
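A short scikit-learn sketch of the non-K-means options in the guide above; the parameter values (eps, min_samples, n_components, n_clusters) are illustrative assumptions, not tuned settings:

<syntaxhighlight lang="python">
# Illustrative calls for the alternatives in the selection guide.
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

X = np.random.randn(300, 2)  # stand-in for scaled feature data

# Unknown K, arbitrary shapes: DBSCAN labels noise points as -1.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Probabilistic assignments: GMM gives per-cluster membership probabilities.
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)
soft = gmm.predict_proba(X)  # shape (n_samples, n_components)

# Hierarchical structure: agglomerative clustering with Ward linkage.
agg_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
</syntaxhighlight>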
Analyzing
Clustering Algorithm Comparison:

| Algorithm | K Required | Cluster Shape | Handles Noise | Scalability |
|---|---|---|---|---|
| K-means | Yes | Spherical only | No (noise → nearest cluster) | Very large |
| DBSCAN | No | Arbitrary | Yes | Large |
| GMM | Yes | Elliptical | No | Medium |
| Hierarchical | No (choose at dendrogram cut) | Any | No | Small-medium |
| HDBSCAN | No | Arbitrary | Yes | Large |
**Failure modes**: K-means is sensitive to initialization (K-means++ mitigates this), to outliers (which pull centroids away from dense regions), and to unequal cluster sizes. DBSCAN is sensitive to its ε and MinPts parameters and is hard to tune when density varies across the data. GMM components can collapse onto individual points (near-singular covariances) if K is too large. All methods struggle in very high dimensions without dimensionality reduction.
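The shape assumption is the easiest failure mode to demonstrate: K-means cannot recover the two interleaved half-moons that DBSCAN separates cleanly. A sketch on scikit-learn's make_moons toy data (the eps value is an assumption chosen for this noise level):

<syntaxhighlight lang="python">
# K-means vs. DBSCAN on non-spherical clusters (two interleaved moons).
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=500, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# ARI against the generator labels: DBSCAN should score near 1.0,
# K-means much lower because it bisects the moons with a straight cut.
print("K-means ARI:", adjusted_rand_score(y_true, km_labels))
print("DBSCAN ARI: ", adjusted_rand_score(y_true, db_labels))
</syntaxhighlight>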
Evaluating
Unsupervised evaluation without ground truth: (1) **Silhouette score**: ranges from -1 (wrong cluster) to +1 (well-separated); values above roughly 0.5 generally indicate good clustering. (2) **Davies-Bouldin index**: lower is better; measures the average similarity between each cluster and its most similar cluster. (3) **Calinski-Harabasz index**: higher is better; the ratio of between-cluster to within-cluster dispersion. With ground truth: the **Adjusted Rand Index (ARI)** measures agreement with true labels; a perfect match scores 1.0. In practice, quantitative metrics are combined with domain-expert judgment of cluster interpretability.
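All four metrics are one call each in scikit-learn; a minimal sketch, assuming a labeled toy dataset so the ARI line can run too:

<syntaxhighlight lang="python">
# Computing the evaluation metrics discussed above.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

X, y_true = make_blobs(n_samples=500, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

print("Silhouette (higher is better):       ", silhouette_score(X, labels))
print("Davies-Bouldin (lower is better):    ", davies_bouldin_score(X, labels))
print("Calinski-Harabasz (higher is better):", calinski_harabasz_score(X, labels))
# Only computable when ground-truth labels exist:
print("Adjusted Rand Index:                 ", adjusted_rand_score(y_true, labels))
</syntaxhighlight>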
Creating
Designing a production segmentation system: (1) Feature engineering: create behavioral, transactional, and demographic features. (2) Preprocessing: standardize, handle missing values, remove collinear features. (3) Dimensionality reduction: PCA to 10–20 components retaining ~90% of variance. (4) Cluster search: evaluate K-means for K=2–15 using silhouette score and domain knowledge. (5) Stability check: run clustering 10× with different random seeds; segments that persist across seeds are trustworthy. (6) Interpretation: characterize each cluster by mean feature values and example members; name segments (e.g., "High-value champions," "At-risk churners"). (7) Deployment: assign new users to the nearest centroid in real time for personalization.
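A condensed sketch of steps (3), (5), and (7): PCA retained at 90% variance, a seed-stability check via pairwise ARI, and nearest-centroid assignment for new users. The placeholder data, K=5, and the 10-seed loop are assumptions for illustration:

<syntaxhighlight lang="python">
# Sketch of pipeline steps 3, 5, and 7 from the design above.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

X_scaled = np.random.randn(1000, 40)  # placeholder for preprocessed features

# Step 3: keep just enough principal components for 90% explained variance.
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X_scaled)

# Step 5: stability check -- recluster under different seeds, compare via ARI.
runs = [KMeans(n_clusters=5, n_init=10, random_state=s).fit_predict(X_reduced)
        for s in range(10)]
aris = [adjusted_rand_score(runs[0], r) for r in runs[1:]]
print("Mean ARI vs. first run:", np.mean(aris))  # near 1.0 means stable

# Step 7: assign new users to the nearest centroid at serving time.
# (In production the new rows must pass through the same scaler first.)
final = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X_reduced)
new_users = np.random.randn(3, 40)
segments = final.predict(pca.transform(new_users))
</syntaxhighlight>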