Unsupervised Clustering
<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Unsupervised learning and clustering are machine learning approaches that discover structure in data without labeled examples. Rather than learning to predict a predefined output, unsupervised methods find natural groupings, patterns, and representations in the data itself. Clustering algorithms segment data into groups of similar items; dimensionality reduction algorithms find compact representations; density estimation models the underlying probability distribution. These methods are essential for exploratory data analysis, customer segmentation, anomaly detection, and learning representations for downstream tasks.
</div>

__TOC__

<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Unsupervised learning''' – Learning patterns from data without labels or target outputs.
* '''Clustering''' – Partitioning data into groups (clusters) where items within a group are more similar to each other than to items in other groups.
* '''K-means''' – A centroid-based clustering algorithm that iteratively assigns points to the nearest of K cluster centers and updates the centers.
* '''Hierarchical clustering''' – Builds a tree (dendrogram) of nested clusters; agglomerative (bottom-up) or divisive (top-down).
* '''DBSCAN (Density-Based Spatial Clustering of Applications with Noise)''' – Finds clusters as dense regions separated by low-density areas; can discover arbitrarily shaped clusters.
* '''Gaussian Mixture Model (GMM)''' – A probabilistic model representing the data as a mixture of K Gaussian distributions; fit with the EM algorithm.
* '''Expectation-Maximization (EM)''' – An iterative algorithm for fitting latent-variable models; alternates between an expectation step (compute responsibilities) and a maximization step (update parameters).
* '''PCA (Principal Component Analysis)''' – Linear dimensionality reduction that finds the directions of maximum variance.
* '''t-SNE''' – Non-linear dimensionality reduction for visualization; preserves local neighborhood structure.
* '''UMAP''' – A faster, more scalable alternative to t-SNE for visualization; better preserves global structure.
* '''Autoencoder''' – A neural network trained to reconstruct its input through a bottleneck; the bottleneck gives a compressed representation.
* '''Silhouette score''' – A clustering quality metric measuring how similar a point is to its own cluster vs. other clusters; range [-1, 1].
* '''Elbow method''' – A heuristic for choosing K in K-means: plot inertia vs. K and choose the "elbow" where improvement slows.
* '''Inertia''' – Within-cluster sum of squared distances from each point to its cluster center; K-means minimizes this.
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
Unsupervised learning discovers '''inherent structure''' in data. The key challenge: without labels, how do we know whether the discovered structure is meaningful? This is the fundamental evaluation problem of unsupervised learning.

'''K-means''' is the simplest and most widely used clustering algorithm. Initialize K cluster centers (randomly or with K-means++), assign each point to its nearest center, recompute each center as the mean of its assigned points, and repeat until convergence (see the sketch below). Pros: simple, fast. Cons: assumes spherical clusters of roughly equal size, requires K to be specified in advance, and is sensitive to initialization.
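A minimal NumPy sketch of this loop (Lloyd's algorithm), assuming Euclidean distance; the function name <code>kmeans</code> and the convergence tolerance are illustrative choices, not a standard API:

<syntaxhighlight lang="python">
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    """Lloyd's algorithm: assign points to centers, update centers, repeat."""
    rng = np.random.default_rng(seed)
    # Initialize centers as k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each point with its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (keep the old center if a cluster ends up empty)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.linalg.norm(new_centers - centers) < tol:  # converged
            return labels, new_centers
        centers = new_centers
    return labels, centers

# Toy usage: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
</syntaxhighlight>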
'''DBSCAN''' doesn't require K, can find clusters of arbitrary shape, and identifies outliers as noise points not belonging to any cluster. It works as follows: a "core" point has at least MinPts neighbors within radius ε; points density-reachable from a core point form a cluster; points reachable from no core point are noise. This makes it valuable for applications where the number and shape of clusters are unknown.

'''Dimensionality reduction''' is essential before clustering in high-dimensional spaces, where the curse of dimensionality makes all points nearly equidistant. PCA gives a linear compression preserving maximum variance. t-SNE and UMAP give non-linear compressions suited for visualization. Autoencoders give deep non-linear compressions suited for representation learning.

'''The curse of dimensionality''': in high-dimensional spaces, data becomes increasingly sparse and the notion of "nearest neighbor" breaks down, because distances between points concentrate and all points become nearly equidistant. K-means often degrades badly above roughly 20 dimensions without dimensionality reduction. This motivates learning compact representations before clustering.
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Customer segmentation pipeline:'''

<syntaxhighlight lang="python">
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import umap

# Customer behavioral data
df = pd.read_csv("customer_features.csv")
features = ['recency_days', 'frequency', 'monetary_value',
            'avg_session_duration', 'pages_per_session',
            'email_open_rate', 'support_tickets']
X = df[features].fillna(df[features].median())

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Dimensionality reduction to 2-D for visualizing the segments
reducer = umap.UMAP(n_components=2, random_state=42)
X_2d = reducer.fit_transform(X_scaled)

# Find the best K by silhouette score
scores = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

optimal_k = max(scores, key=scores.get)
print(f"Optimal K: {optimal_k} (silhouette: {scores[optimal_k]:.3f})")

# Final clustering
kmeans = KMeans(n_clusters=optimal_k, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)

# Characterize each segment
for cluster_id in range(optimal_k):
    segment = df[df['cluster'] == cluster_id][features]
    print(f"\nCluster {cluster_id} ({len(segment)} customers):")
    print(segment.mean().round(2))
</syntaxhighlight>

; Algorithm selection guide
: '''Known K, spherical clusters''' → K-means (fast, interpretable)
: '''Unknown K, arbitrary shapes''' → DBSCAN (handles noise, arbitrary clusters)
: '''Probabilistic assignments needed''' → Gaussian Mixture Model (soft clustering)
: '''Hierarchical structure''' → Agglomerative clustering (Ward linkage)
: '''High-dimensional data''' → PCA/UMAP first, then K-means
: '''Visualization''' → UMAP (speed + quality) or t-SNE (quality at small scale)
</div>

<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Clustering Algorithm Comparison
! Algorithm !! K Required !! Cluster Shape !! Handles Noise !! Scale
|-
| K-means || Yes || Spherical only || No (noise → nearest cluster) || Very large
|-
| DBSCAN || No || Arbitrary || Yes || Large
|-
| GMM || Yes || Elliptical || No || Medium
|-
| Hierarchical || No (choose at cut) || Any || No || Small-medium
|-
| HDBSCAN || No || Arbitrary || Yes || Large
|}

'''Failure modes''': K-means is sensitive to initialization (K-means++ mitigates this), to outliers (which pull centroids away from dense regions), and to unequal cluster sizes. DBSCAN is sensitive to its ε and MinPts parameters, which are hard to tune when density varies across the data. GMM fitting can produce degenerate, collapsed components if K is too large. All methods fail in very high dimensions without dimensionality reduction.
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Unsupervised evaluation without ground truth:
# '''Silhouette score''': ranges from -1 (likely assigned to the wrong cluster) to +1 (well separated); values above 0.5 generally indicate good clustering.
# '''Davies-Bouldin index''': lower is better; measures the average similarity between each cluster and its most similar cluster.
# '''Calinski-Harabasz index''': higher is better; the ratio of between-cluster to within-cluster dispersion.

With ground truth: the '''ARI (Adjusted Rand Index)''' measures agreement with the true labels; 1.0 is perfect agreement and values near 0 indicate chance-level agreement. Expert practitioners combine quantitative metrics with domain-expert evaluation of cluster interpretability. A sketch of computing all four metrics follows below.
</div>
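A short sketch of computing these metrics with scikit-learn; the synthetic blobs from <code>make_blobs</code> stand in for real data where true labels happen to be known:

<syntaxhighlight lang="python">
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

# Synthetic data with known ground-truth labels
X, y_true = make_blobs(n_samples=500, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Internal metrics: need only the data and the cluster labels
print(f"Silhouette:        {silhouette_score(X, labels):.3f}")         # higher is better
print(f"Davies-Bouldin:    {davies_bouldin_score(X, labels):.3f}")     # lower is better
print(f"Calinski-Harabasz: {calinski_harabasz_score(X, labels):.1f}")  # higher is better

# External metric: requires ground-truth labels
print(f"ARI: {adjusted_rand_score(y_true, labels):.3f}")               # 1.0 = perfect
</syntaxhighlight>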
<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Designing a production segmentation system:
# Feature engineering: create behavioral, transactional, and demographic features.
# Preprocessing: standardize, handle missing values, remove collinear features.
# Dimensionality reduction: PCA to 10-20 components retaining ~90% of the variance.
# Cluster search: evaluate K-means for K = 2–15 using silhouette score and domain knowledge.
# Stability check: run the clustering 10× with different random seeds; stable segments persist across runs.
# Interpretation: characterize each cluster by mean feature values and example members; name the segments (e.g., "High-value champions," "At-risk churners").
# Deployment: assign new users to the nearest centroid in real time for personalization (a minimal sketch follows below).

[[Category:Artificial Intelligence]]
[[Category:Machine Learning]]
[[Category:Unsupervised Learning]]
</div>
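For the deployment step, a minimal sketch of real-time nearest-centroid assignment, assuming the fitted <code>scaler</code> and <code>kmeans</code> objects from the Applying pipeline above have been persisted and loaded; <code>assign_segment</code> and the example feature values are hypothetical:

<syntaxhighlight lang="python">
import numpy as np

def assign_segment(user_features, scaler, kmeans):
    """Assign one new user to the segment with the nearest centroid."""
    x = scaler.transform(np.asarray(user_features, dtype=float).reshape(1, -1))
    return int(kmeans.predict(x)[0])  # KMeans.predict returns nearest-centroid labels

# Hypothetical new user with the same 7 features used in training
new_user = [12, 8, 540.0, 6.2, 4.1, 0.35, 1]
segment_id = assign_segment(new_user, scaler, kmeans)
print(f"Assign to segment {segment_id}")
</syntaxhighlight>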