Unsupervised Clustering

<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Unsupervised learning and clustering are machine learning approaches that discover structure in data without labeled examples. Rather than learning to predict a predefined output, unsupervised methods find natural groupings, patterns, and representations in the data itself. Clustering algorithms segment data into groups of similar items; dimensionality reduction algorithms find compact representations; density estimation models the underlying probability distribution. These methods are essential for exploratory data analysis, customer segmentation, anomaly detection, and learning representations for downstream tasks.
</div>


__TOC__
 
<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Unsupervised learning''' — Learning patterns from data without labels or target outputs.
* '''Clustering''' — Partitioning data into groups (clusters) where items within a group are more similar to each other than to items in other groups.
* '''K-means''' — A centroid-based clustering algorithm that iteratively assigns points to the nearest of K cluster centers and updates the centers.
* '''Hierarchical clustering''' — Builds a tree (dendrogram) of nested clusters; agglomerative (bottom-up) or divisive (top-down).
* '''DBSCAN (Density-Based Spatial Clustering of Applications with Noise)''' — Finds clusters as dense regions separated by low-density areas; can discover arbitrarily shaped clusters.
* '''Gaussian Mixture Model (GMM)''' — A probabilistic model representing the data as a mixture of K Gaussian distributions; fit by the EM algorithm.
* '''Expectation-Maximization (EM)''' — An iterative algorithm for fitting latent-variable models; alternates between an expectation step (compute responsibilities) and a maximization step (update parameters).
* '''PCA (Principal Component Analysis)''' — Linear dimensionality reduction that finds the directions of maximum variance.
* '''t-SNE''' — Non-linear dimensionality reduction for visualization; preserves local neighborhood structure.
* '''UMAP''' — A faster, more scalable alternative to t-SNE for visualization; better preserves global structure.
* '''Autoencoder''' — A neural network trained to reconstruct its input through a bottleneck; the bottleneck gives a compressed representation.
* '''Silhouette score''' — A clustering quality metric measuring how similar a point is to its own cluster vs. other clusters; range [-1, 1].
* '''Elbow method''' — Heuristic for choosing K in K-means: plot inertia vs. K and choose the "elbow" where improvement slows (see the sketch below).
* '''Inertia''' — Within-cluster sum of squared distances from each point to its cluster center; K-means minimizes this.
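
A minimal sketch of the elbow heuristic, assuming scikit-learn; the blob data stands in for whatever feature matrix is actually being clustered:
<syntaxhighlight lang="python">
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data; in practice X is the scaled feature matrix being clustered.
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Inertia = within-cluster sum of squared distances (KMeans exposes it as inertia_).
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 11)}

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
# Plot K vs. inertia and pick the "elbow" where the curve stops dropping sharply.
</syntaxhighlight>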
</div>


<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
Unsupervised learning discovers '''inherent structure''' in data. The key challenge: without labels, how do we know if the discovered structure is meaningful? This is the fundamental evaluation problem of unsupervised learning.


'''K-means''' is the simplest and most widely used clustering algorithm: initialize K cluster centers (randomly or with K-means++), assign each point to the nearest center, recompute each center as the mean of its assigned points, and repeat until convergence. Pros: simple and fast. Cons: assumes spherical clusters of roughly equal size, requires K to be specified, and is sensitive to initialization.

'''DBSCAN''' does not require K, can find clusters of arbitrary shape, and identifies outliers as noise points that belong to no cluster. It works as follows: a "core" point has at least MinPts neighbors within radius ε; points density-reachable from a core point form a cluster; unreachable points are noise. This makes it valuable for applications where the number and shape of clusters are unknown.
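
A minimal sketch contrasting the two behaviours on a toy two-moons dataset, assuming scikit-learn (the <code>eps</code> and <code>min_samples</code> values are illustrative, not tuned):
<syntaxhighlight lang="python">
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-spherical clusters that violate K-means' assumptions.
X, _ = make_moons(n_samples=400, noise=0.06, random_state=0)

# K-means must be given K and will split the moons with a roughly straight boundary.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN needs no K; eps is the neighborhood radius, min_samples plays the role of
# MinPts, and the label -1 marks points treated as noise.
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("K-means labels found:", sorted(set(km_labels)))
print("DBSCAN labels found: ", sorted(set(db_labels)))
</syntaxhighlight>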


'''Dimensionality reduction''' is essential before clustering in high-dimensional spaces. PCA gives a linear compression preserving maximum variance; t-SNE and UMAP give non-linear embeddings suited for visualization; autoencoders give deep non-linear compressions suited for representation learning.

'''The curse of dimensionality''': In high-dimensional spaces, data becomes increasingly sparse and the notion of "nearest neighbor" breaks down — pairwise distances concentrate, so all points become nearly equidistant. Distance-based methods such as K-means degrade badly beyond a few tens of dimensions unless the dimensionality is reduced first. This motivates learning compact representations before clustering.
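
A minimal sketch of the reduce-then-cluster pattern, assuming scikit-learn; the 100-dimensional blob data and the choice of 10 components are placeholders:
<syntaxhighlight lang="python">
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# High-dimensional toy data; a real feature matrix would replace this.
X, _ = make_blobs(n_samples=1000, n_features=100, centers=5, random_state=0)

# Standardize, project onto the leading principal components, then cluster.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=10, random_state=0),
    KMeans(n_clusters=5, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)
print("Cluster sizes:", sorted(int((labels == c).sum()) for c in set(labels)))
</syntaxhighlight>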
</div>


<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Customer segmentation pipeline:'''
<syntaxhighlight lang="python">
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
import umap

# Customer behavioral data
df = pd.read_csv("customer_features.csv")
features = ['recency_days', 'frequency', 'monetary_value', 'avg_session_duration',
            'pages_per_session', 'email_open_rate', 'support_tickets']
X = df[features].fillna(df[features].median())
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Dimensionality reduction for visualization
reducer = umap.UMAP(n_components=2, random_state=42)
X_2d = reducer.fit_transform(X_scaled)

# Find optimal K using silhouette score
scores = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)
optimal_k = max(scores, key=scores.get)
print(f"Optimal K: {optimal_k} (silhouette: {scores[optimal_k]:.3f})")

# Final clustering
kmeans = KMeans(n_clusters=optimal_k, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)

# Characterize each segment
for cluster_id in range(optimal_k):
    segment = df[df['cluster'] == cluster_id][features]
    print(f"\nCluster {cluster_id} ({len(segment)} customers):")
    print(segment.mean().round(2))
</syntaxhighlight>

'''Algorithm selection guide'''
: '''Known K, spherical clusters''' → K-means (fast, interpretable)
: '''Unknown K, arbitrary shapes''' → DBSCAN (handles noise, arbitrary clusters)
: '''Probabilistic assignments needed''' → Gaussian Mixture Model (soft clustering)
: '''Hierarchical structure''' → Agglomerative clustering (Ward linkage)
: '''High-dimensional data''' → PCA/UMAP first, then K-means
: '''Visualization''' → UMAP (speed + quality) or t-SNE (quality at small scale)
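
Expanding on the soft-clustering row above: a minimal sketch of probabilistic assignment with a Gaussian mixture, assuming scikit-learn (the data and component count are illustrative):
<syntaxhighlight lang="python">
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Toy data; in practice this is the scaled feature matrix.
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# Fit by EM; covariance_type="full" lets each component take its own elliptical shape.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

hard_labels = gmm.predict(X)        # hard assignment, comparable to K-means labels
soft_probs = gmm.predict_proba(X)   # per-component responsibilities for each point
print("First point responsibilities:", soft_probs[0].round(3))
</syntaxhighlight>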
</div>


<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Clustering Algorithm Comparison
! Algorithm !! K Required !! Cluster Shape !! Handles Noise !! Scale
|-
| K-means || Yes || Spherical only || No (noise → nearest cluster) || Very large
|-
| DBSCAN || No || Arbitrary || Yes || Large
|-
| GMM || Yes || Elliptical || No || Medium
|-
| Hierarchical || No (choose at cut) || Any || No || Small-medium
|-
| HDBSCAN || No || Arbitrary || Yes || Large
|}


'''Failure modes''': K-means is sensitive to initialization (K-means++ mitigates this), to outliers (which pull centroids away from dense regions), and to unequal cluster sizes. DBSCAN is sensitive to its ε and MinPts parameters, which are hard to tune when density varies across the data. GMM components can collapse onto individual points if K is too large. All of these methods degrade in very high dimensions without prior dimensionality reduction.
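
A small sketch of the initialization point, assuming scikit-learn; <code>n_init=1</code> is deliberate so a single (possibly poor) start stays visible, and the blob data is illustrative:
<syntaxhighlight lang="python">
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with several blobs; real features would replace this.
X, _ = make_blobs(n_samples=600, centers=6, cluster_std=1.5, random_state=0)

for init in ("random", "k-means++"):
    # One initialization per seed, so bad starts are not hidden by restarts.
    inertias = [
        KMeans(n_clusters=6, init=init, n_init=1, random_state=seed).fit(X).inertia_
        for seed in range(10)
    ]
    print(f"{init:>9}: best inertia {min(inertias):.0f}, worst {max(inertias):.0f}")
</syntaxhighlight>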
</div>


<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Unsupervised evaluation without ground truth:
# '''Silhouette score''': ranges from -1 (wrong cluster) to +1 (well-separated); target >0.5 for good clustering.
# '''Davies-Bouldin index''': lower is better; measures average similarity between each cluster and its most similar cluster.
# '''Calinski-Harabasz index''': higher is better; ratio of between-cluster to within-cluster dispersion.

With ground truth, the '''ARI (Adjusted Rand Index)''' measures agreement with the true labels; a perfect match scores 1.0. All four metrics can be computed as in the sketch below. Expert practitioners combine quantitative metrics with domain-expert evaluation of cluster interpretability.
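
A minimal sketch computing all four metrics with scikit-learn; the blob data and its generating labels stand in for a real dataset, where ground truth is usually unavailable:
<syntaxhighlight lang="python">
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

# Toy data with known generating labels (y_true); real data rarely has these.
X, y_true = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Internal metrics: no ground truth required.
print("Silhouette (higher better):        ", silhouette_score(X, labels))
print("Davies-Bouldin (lower better):     ", davies_bouldin_score(X, labels))
print("Calinski-Harabasz (higher better): ", calinski_harabasz_score(X, labels))

# External metric: requires ground-truth labels.
print("Adjusted Rand Index (1.0 = perfect):", adjusted_rand_score(y_true, labels))
</syntaxhighlight>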
</div>


<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Designing a production segmentation system:
# Feature engineering: create behavioral, transactional, and demographic features.
# Preprocessing: standardize, handle missing values, remove collinear features.
# Dimensionality reduction: PCA to 10-20 components retaining 90% variance.
# Cluster search: evaluate K-means for K=2–15 using silhouette score and domain knowledge.
# Stability check: run clustering 10× with different random seeds; stable segments persist across runs (see the sketch after this list).
# Interpretation: characterize each cluster by mean feature values and example members; name segments (e.g., "High-value champions," "At-risk churners").
# Deployment: assign new users to nearest centroid in real time for personalization.
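
A minimal sketch of the stability check in step 5 and the centroid assignment in step 7, assuming scikit-learn; the feature matrix, K, and number of seeds are placeholders:
<syntaxhighlight lang="python">
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Stand-in for the preprocessed, PCA-reduced feature matrix from steps 1-3.
X, _ = make_blobs(n_samples=1000, n_features=10, centers=5, random_state=0)
K = 5

# Step 5: refit with different seeds; high pairwise ARI means the segments are stable.
runs = [KMeans(n_clusters=K, n_init=10, random_state=seed).fit(X) for seed in range(10)]
pairwise_ari = [adjusted_rand_score(a.labels_, b.labels_)
                for a, b in combinations(runs, 2)]
print("Mean pairwise ARI across seeds:", round(float(np.mean(pairwise_ari)), 3))

# Step 7: assign a new, already-preprocessed user vector to its nearest centroid.
model = runs[0]
new_user = X[:1]  # placeholder for a freshly scored user
print("Assigned segment:", int(model.predict(new_user)[0]))
</syntaxhighlight>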


[[Category:Artificial Intelligence]]
[[Category:Machine Learning]]
[[Category:Unsupervised Learning]]
</div>
