Unsupervised Clustering
<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Unsupervised learning and clustering are machine learning approaches that discover structure in data without labeled examples. Rather than learning to predict a predefined output, unsupervised methods find natural groupings, patterns, and representations in the data itself. Clustering algorithms segment data into groups of similar items; dimensionality reduction algorithms find compact representations; density estimation models the underlying probability distribution. These methods are essential for exploratory data analysis, customer segmentation, anomaly detection, and learning representations for downstream tasks.
</div>

__TOC__

<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Unsupervised learning''' – Learning patterns from data without labels or target outputs.
* '''Clustering''' – Partitioning data into groups (clusters) where items within a group are more similar to each other than to items in other groups.
* '''K-means''' – A centroid-based clustering algorithm that iteratively assigns points to the nearest of K cluster centers and updates the centers.
* '''Hierarchical clustering''' – Builds a tree (dendrogram) of nested clusters; agglomerative (bottom-up) or divisive (top-down).
* '''DBSCAN (Density-Based Spatial Clustering of Applications with Noise)''' – Finds clusters as dense regions separated by low-density areas; can discover arbitrarily shaped clusters.
* '''Gaussian Mixture Model (GMM)''' – A probabilistic model representing the data as a mixture of K Gaussian distributions; fit with the EM algorithm.
* '''Expectation-Maximization (EM)''' – An iterative algorithm for fitting latent-variable models; alternates between an expectation step (compute responsibilities) and a maximization step (update parameters).
* '''PCA (Principal Component Analysis)''' – Linear dimensionality reduction that finds the directions of maximum variance.
* '''t-SNE''' – Non-linear dimensionality reduction for visualization; preserves local neighborhood structure.
* '''UMAP''' – A faster, more scalable alternative to t-SNE for visualization; better preserves global structure.
* '''Autoencoder''' – A neural network trained to reconstruct its input through a bottleneck; the bottleneck gives a compressed representation.
* '''Silhouette score''' – A clustering quality metric measuring how similar a point is to its own cluster vs. other clusters; range [-1, 1].
* '''Elbow method''' – A heuristic for choosing K in K-means: plot inertia vs. K and choose the "elbow" where improvement slows.
* '''Inertia''' – Within-cluster sum of squared distances from each point to its cluster center; K-means minimizes this.
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
Unsupervised learning discovers '''inherent structure''' in data. The key challenge: without labels, how do we know whether the discovered structure is meaningful? This is the fundamental evaluation problem of unsupervised learning.

'''K-means''' is the simplest and most widely used clustering algorithm. Initialize K cluster centers (randomly or with K-means++), assign each point to its nearest center, recompute each center as the mean of its assigned points, and repeat until convergence (see the sketch below). Pros: simple, fast. Cons: assumes spherical clusters of roughly equal size, requires K to be specified in advance, and is sensitive to initialization.
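A minimal NumPy sketch of this loop (Lloyd's algorithm), assuming Euclidean distance; the function name <code>kmeans</code> and the convergence tolerance are illustrative choices, not a standard API:

<syntaxhighlight lang="python">
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    """Lloyd's algorithm: assign points to centers, update centers, repeat."""
    rng = np.random.default_rng(seed)
    # Initialize centers as k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each point with its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (keep the old center if a cluster ends up empty)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.linalg.norm(new_centers - centers) < tol:  # converged
            return labels, new_centers
        centers = new_centers
    return labels, centers

# Toy usage: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
</syntaxhighlight>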
'''DBSCAN''' doesn't require K, can find clusters of arbitrary shape, and identifies outliers as noise points not belonging to any cluster. It works as follows: a "core" point has at least MinPts neighbors within radius ε; points density-reachable from a core point form a cluster; points reachable from no core point are noise. This makes it valuable for applications where the number and shape of clusters are unknown.

'''Dimensionality reduction''' is essential before clustering in high-dimensional spaces, where the curse of dimensionality makes all points nearly equidistant. PCA gives a linear compression preserving maximum variance. t-SNE and UMAP give non-linear compressions suited for visualization. Autoencoders give deep non-linear compressions suited for representation learning.

'''The curse of dimensionality''': in high-dimensional spaces, data becomes increasingly sparse and the notion of "nearest neighbor" breaks down, because distances between points concentrate and all points become nearly equidistant. K-means often degrades badly above roughly 20 dimensions without dimensionality reduction. This motivates learning compact representations before clustering.
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Customer segmentation pipeline:'''

<syntaxhighlight lang="python">
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import umap

# Customer behavioral data
df = pd.read_csv("customer_features.csv")
features = ['recency_days', 'frequency', 'monetary_value',
            'avg_session_duration', 'pages_per_session',
            'email_open_rate', 'support_tickets']
X = df[features].fillna(df[features].median())

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Dimensionality reduction to 2-D for visualizing the segments
reducer = umap.UMAP(n_components=2, random_state=42)
X_2d = reducer.fit_transform(X_scaled)

# Find the best K by silhouette score
scores = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

optimal_k = max(scores, key=scores.get)
print(f"Optimal K: {optimal_k} (silhouette: {scores[optimal_k]:.3f})")

# Final clustering
kmeans = KMeans(n_clusters=optimal_k, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)

# Characterize each segment
for cluster_id in range(optimal_k):
    segment = df[df['cluster'] == cluster_id][features]
    print(f"\nCluster {cluster_id} ({len(segment)} customers):")
    print(segment.mean().round(2))
</syntaxhighlight>

; Algorithm selection guide
: '''Known K, spherical clusters''' → K-means (fast, interpretable)
: '''Unknown K, arbitrary shapes''' → DBSCAN (handles noise, arbitrary clusters)
: '''Probabilistic assignments needed''' → Gaussian Mixture Model (soft clustering)
: '''Hierarchical structure''' → Agglomerative clustering (Ward linkage)
: '''High-dimensional data''' → PCA/UMAP first, then K-means
: '''Visualization''' → UMAP (speed + quality) or t-SNE (quality at small scale)
</div>

<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Clustering Algorithm Comparison
! Algorithm !! K Required !! Cluster Shape !! Handles Noise !! Scale
|-
| K-means || Yes || Spherical only || No (noise → nearest cluster) || Very large
|-
| DBSCAN || No || Arbitrary || Yes || Large
|-
| GMM || Yes || Elliptical || No || Medium
|-
| Hierarchical || No (choose at cut) || Any || No || Small-medium
|-
| HDBSCAN || No || Arbitrary || Yes || Large
|}

'''Failure modes''': K-means is sensitive to initialization (K-means++ mitigates this), to outliers (which pull centroids away from dense regions), and to unequal cluster sizes. DBSCAN is sensitive to its ε and MinPts parameters, which are hard to tune when density varies across the data. GMM fitting can produce degenerate, collapsed components if K is too large. All methods fail in very high dimensions without dimensionality reduction.
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Unsupervised evaluation without ground truth:
# '''Silhouette score''': ranges from -1 (likely assigned to the wrong cluster) to +1 (well separated); values above 0.5 generally indicate good clustering.
# '''Davies-Bouldin index''': lower is better; measures the average similarity between each cluster and its most similar cluster.
# '''Calinski-Harabasz index''': higher is better; the ratio of between-cluster to within-cluster dispersion.

With ground truth: the '''ARI (Adjusted Rand Index)''' measures agreement with the true labels; 1.0 is perfect agreement and values near 0 indicate chance-level agreement. Expert practitioners combine quantitative metrics with domain-expert evaluation of cluster interpretability. A sketch of computing all four metrics follows below.
</div>
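A short sketch of computing these metrics with scikit-learn; the synthetic blobs from <code>make_blobs</code> stand in for real data where true labels happen to be known:

<syntaxhighlight lang="python">
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

# Synthetic data with known ground-truth labels
X, y_true = make_blobs(n_samples=500, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Internal metrics: need only the data and the cluster labels
print(f"Silhouette:        {silhouette_score(X, labels):.3f}")         # higher is better
print(f"Davies-Bouldin:    {davies_bouldin_score(X, labels):.3f}")     # lower is better
print(f"Calinski-Harabasz: {calinski_harabasz_score(X, labels):.1f}")  # higher is better

# External metric: requires ground-truth labels
print(f"ARI: {adjusted_rand_score(y_true, labels):.3f}")               # 1.0 = perfect
</syntaxhighlight>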
<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Designing a production segmentation system:
# Feature engineering: create behavioral, transactional, and demographic features.
# Preprocessing: standardize, handle missing values, remove collinear features.
# Dimensionality reduction: PCA to 10-20 components retaining ~90% of the variance.
# Cluster search: evaluate K-means for K = 2–15 using silhouette score and domain knowledge.
# Stability check: run the clustering 10× with different random seeds; stable segments persist across runs.
# Interpretation: characterize each cluster by mean feature values and example members; name the segments (e.g., "High-value champions," "At-risk churners").
# Deployment: assign new users to the nearest centroid in real time for personalization (a minimal sketch follows below).

[[Category:Artificial Intelligence]]
[[Category:Machine Learning]]
[[Category:Unsupervised Learning]]
</div>
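For the deployment step, a minimal sketch of real-time nearest-centroid assignment, assuming the fitted <code>scaler</code> and <code>kmeans</code> objects from the Applying pipeline above have been persisted and loaded; <code>assign_segment</code> and the example feature values are hypothetical:

<syntaxhighlight lang="python">
import numpy as np

def assign_segment(user_features, scaler, kmeans):
    """Assign one new user to the segment with the nearest centroid."""
    x = scaler.transform(np.asarray(user_features, dtype=float).reshape(1, -1))
    return int(kmeans.predict(x)[0])  # KMeans.predict returns nearest-centroid labels

# Hypothetical new user with the same 7 features used in training
new_user = [12, 8, 540.0, 6.2, 4.1, 0.35, 1]
segment_id = assign_segment(new_user, scaler, kmeans)
print(f"Assign to segment {segment_id}")
</syntaxhighlight>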