Anomaly Detection

From BloomWiki

How to read this page: This article maps the topic from beginner to expert across six levels (Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating). Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.

Anomaly detection is the task of identifying data points, patterns, or behaviors that deviate significantly from what is considered "normal." Also called outlier detection or novelty detection, it is one of the most broadly applicable AI problems: detecting fraudulent transactions, machine failures before they happen, cyber intrusions, manufacturing defects, rare medical conditions, and astronomical events all reduce to finding the unusual in a sea of normal. Anomaly detection is especially challenging because anomalies are rare, diverse, and often not known in advance — making purely supervised approaches impractical.

Remembering

  • Anomaly — A data point or pattern that deviates significantly from the expected distribution; also called outlier or novelty.
  • Outlier — A data point that lies far from the majority of the data distribution.
  • Novelty detection — Detecting new types of data not seen during training; the training data is assumed clean (normal).
  • Outlier detection — Identifying anomalies when the training data itself may contain outliers (contaminated training).
  • Point anomaly — A single data instance that is anomalous with respect to the rest of the data.
  • Contextual anomaly — A data point that is anomalous only in a specific context (e.g., temperature of 30°C is normal in summer but anomalous in winter).
  • Collective anomaly — A collection of data points that is collectively anomalous even though individual points may not be.
  • Isolation Forest — An anomaly detection algorithm that isolates anomalies by randomly partitioning feature space; anomalies are isolated quickly.
  • One-Class SVM — A support vector machine trained on normal data only, learning a boundary around it.
  • Autoencoder for anomaly detection — A neural network trained to reconstruct normal data; anomalies have high reconstruction error.
  • Local Outlier Factor (LOF) — Measures the local density of each point relative to its neighbors; anomalies have lower local density.
  • DBSCAN — A clustering algorithm that identifies noise points (potential anomalies) as points not belonging to any cluster.
  • Reconstruction error — In autoencoder-based detection, the error between input and reconstruction; high error indicates anomaly.
  • Threshold — The score above which a data point is flagged as anomalous; setting this is a key tuning challenge.
  • False positive rate (FPR) — Fraction of normal points incorrectly flagged as anomalous; must be kept low for operator usability.

Understanding

Anomaly detection is fundamentally different from classification: in classification, you have labeled examples of each class. In anomaly detection, you typically only have normal data (or very few labeled anomalies), and anomalies can take any form not seen before.

The assumption underlying most anomaly detection: normal data occupies a compact, well-defined region of the feature space. Anomalies lie outside this region. The challenge is defining "outside" in a meaningful, threshold-able way.

Unsupervised approaches:

  • Isolation Forest — trees that randomly split feature space. Anomalies are isolated in fewer splits than normal points (they are easier to separate). The anomaly score is derived from the average path length across all trees: shorter paths mean more anomalous.
  • Autoencoders — train on normal data only. A model that can reconstruct normal patterns will fail on anomalies; high reconstruction error = anomaly.
  • DBSCAN — points not in any dense cluster are noise/potential anomalies (a minimal sketch follows this list).
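
For the DBSCAN route, here is a minimal sketch on synthetic 2-D data, assuming scikit-learn; the eps and min_samples values are illustrative, not tuned defaults:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
normal = rng.normal(size=(500, 2))            # dense "normal" cloud
outliers = rng.uniform(-8, 8, size=(10, 2))   # scattered anomalies
X = np.vstack([normal, outliers])

# Points DBSCAN cannot assign to any dense cluster get label -1 (noise)
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)
print(f"{(labels == -1).sum()} noise points flagged as potential anomalies")
</syntaxhighlight>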

Statistical methods: Fit a statistical model to normal data (Gaussian, GMM, KDE) and flag points with low probability under the model. This works well in low dimensions but breaks down in high-dimensional spaces, where the curse of dimensionality makes pairwise distances concentrate and density estimates uninformative.
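
A minimal sketch of the single-Gaussian variant using SciPy; the stand-in data and the bottom-1% cutoff are illustrative assumptions:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 3))          # assumed-normal training data

# Fit a single Gaussian to the normal data
dist = multivariate_normal(mean=X_train.mean(axis=0),
                           cov=np.cov(X_train, rowvar=False))

# Threshold at a low percentile of the training log-likelihoods
threshold = np.percentile(dist.logpdf(X_train), 1)   # flag bottom 1%

x_new = np.array([4.0, -4.0, 4.0])
print(dist.logpdf(x_new) < threshold)         # True => anomalous
</syntaxhighlight>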

Supervised approaches (when labels exist): Treat as extreme class imbalance classification. Use focal loss, class weighting, or oversampling (SMOTE). Better precision/recall but requires labeled anomalies and fails on unseen anomaly types.
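
A sketch of the class-weighting option on synthetic imbalanced data, using a random forest as a stand-in classifier (scikit-learn assumed); SMOTE or focal loss would slot into the same place:

<syntaxhighlight lang="python">
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# ~1% positives to mimic the scarcity of labeled anomalies
X, y = make_classification(n_samples=20000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' upweights the rare anomaly class during training
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0)
clf.fit(X_tr, y_tr)

scores = clf.predict_proba(X_te)[:, 1]        # anomaly probability
print(f"AUC-PR: {average_precision_score(y_te, scores):.3f}")
</syntaxhighlight>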

Temporal anomaly detection adds complexity: what's anomalous is often contextual (day of week, trend, seasonality). LSTM autoencoders learn expected sequences; anomalies produce high sequence reconstruction error.
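
One common shape for an LSTM autoencoder, sketched in PyTorch; the window length, hidden size, and feature count are illustrative assumptions:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    def __init__(self, n_features, hidden_dim=64):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, n_features)

    def forward(self, x):                  # x: (batch, seq_len, n_features)
        _, (h, _) = self.encoder(x)        # final hidden state summarizes the window
        z = h[-1].unsqueeze(1).repeat(1, x.size(1), 1)   # repeat per timestep
        out, _ = self.decoder(z)
        return self.output(out)            # reconstructed sequence

model = LSTMAutoencoder(n_features=4)
x = torch.randn(8, 50, 4)                  # 8 windows of 50 timesteps
err = ((model(x) - x) ** 2).mean(dim=(1, 2))   # one anomaly score per window
</syntaxhighlight>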

Applying

Isolation Forest for tabular anomaly detection:

<syntaxhighlight lang="python">
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Load sensor/transaction/log data (assumed mostly normal)
df = pd.read_csv("sensor_data.csv")
X = df[['temperature', 'pressure', 'vibration', 'current']].values

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Isolation Forest: contamination = expected fraction of anomalies
iso = IsolationForest(
    n_estimators=200,
    contamination=0.02,   # Expect ~2% anomalies
    random_state=42,
    n_jobs=-1
)
iso.fit(X_scaled)

# Anomaly scores: lower (more negative) = more anomalous
scores = iso.score_samples(X_scaled)   # Higher = more normal
predictions = iso.predict(X_scaled)    # 1 = normal, -1 = anomaly

print(f"Detected {(predictions == -1).sum()} anomalies out of {len(X)} samples")
df['anomaly_score'] = scores
df['is_anomaly'] = (predictions == -1)
print(df[df['is_anomaly']].head())
</syntaxhighlight>

Autoencoder for high-dimensional anomaly detection:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class AnomalyAutoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, input_dim)
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Train on normal data only with MSE loss
# At inference: reconstruction_error = MSE(x, model(x))
# Threshold: flag samples with error > percentile_95(train_errors)
</syntaxhighlight>
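
A minimal training-and-thresholding loop for the class above, following the three trailing comments; the batch size, epoch count, learning rate, and stand-in data are illustrative assumptions:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

model = AnomalyAutoencoder(input_dim=64)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
X_train = torch.randn(2048, 64)            # stand-in for normal training data

for epoch in range(20):                    # train on normal data only
    for i in range(0, len(X_train), 256):
        batch = X_train[i:i + 256]
        opt.zero_grad()
        loss = loss_fn(model(batch), batch)
        loss.backward()
        opt.step()

with torch.no_grad():                      # per-sample reconstruction error
    train_err = ((model(X_train) - X_train) ** 2).mean(dim=1)
threshold = torch.quantile(train_err, 0.95)   # 95th percentile of normal errors

def is_anomaly(x):                         # x: (n, input_dim)
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=1) > threshold
</syntaxhighlight>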

Method selection guide:

  • Tabular, low-dimensional → Isolation Forest (fast, robust)
  • High-dimensional features → Autoencoder, One-Class SVM with RBF kernel
  • Time series → LSTM autoencoder, Prophet residuals, seasonal decomposition
  • Images → CNN autoencoder, PatchCore (nearest neighbors in feature space)
  • Labeled anomalies available → XGBoost (scale_pos_weight) or LightGBM (class_weight='balanced')
  • Streaming, real-time → Half-Space Trees, adaptive Isolation Forest

Analyzing

{| class="wikitable"
|+ Anomaly Detection Method Comparison
! Method !! Handles high dimensions !! Speed !! Interpretability !! Requires labels
|-
| Isolation Forest || Good || Very fast || Low || No
|-
| One-Class SVM || Poor || Slow (large data) || Low || No
|-
| Autoencoder || Excellent || Moderate || Low (error only) || No
|-
| LOF || Poor (high-dim) || Slow || Medium || No
|-
| Supervised (XGBoost) || Good || Fast || High (SHAP) || Yes
|-
| DBSCAN || Poor || Moderate || Medium || No
|}

Failure modes:

  • Masking — if training data contains many anomalies, the model learns them as "normal."
  • Concept drift — as normal behavior evolves, fixed thresholds produce more false positives.
  • Feature selection — anomalies may be visible only in specific feature subsets.
  • Threshold sensitivity — too low = alert fatigue; too high = missed detections.
  • Distribution shift — training and production environments can differ.

Evaluating

When ground truth labels exist: AUC-ROC (measures ranking quality across all thresholds), AUC-PR (precision-recall; better for imbalanced data), and F1 at operating threshold. When labels don't exist (most unsupervised settings): evaluate by analyst validation rate (fraction of flagged alerts confirmed as true anomalies by human review), time-to-detect (how quickly after onset is an anomaly flagged), and false positive rate at desired threshold. Expert practitioners set thresholds on a labeled validation set, not by eyeballing scores.
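
When labels exist, these metrics are one-liners in scikit-learn; the sketch below also picks the F1-maximizing threshold on a validation set (the arrays are stand-ins for real labels and detector scores):

<syntaxhighlight lang="python">
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score)

rng = np.random.default_rng(0)
y_val = (rng.random(5000) < 0.02).astype(int)    # ~2% true anomalies
scores = rng.normal(size=5000) + 3 * y_val        # higher = more anomalous

print(f"AUC-ROC: {roc_auc_score(y_val, scores):.3f}")
print(f"AUC-PR:  {average_precision_score(y_val, scores):.3f}")

# Choose the operating threshold that maximizes F1 on validation data
prec, rec, thresholds = precision_recall_curve(y_val, scores)
f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
best = np.argmax(f1[:-1])                         # last point has no threshold
print(f"Best F1 = {f1[best]:.3f} at threshold = {thresholds[best]:.3f}")
</syntaxhighlight>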

Creating

Designing a production anomaly detection pipeline:

  1. Data collection: identify all relevant signals for the domain (sensor readings, transaction features, log events).
  2. Baseline modeling: fit Isolation Forest on 30 days of normal data.
  3. Threshold setting: use 99th percentile of normal data scores as initial threshold.
  4. Monitoring: track alert rate daily; alert if the rate changes by more than 3× (suggests concept drift or a system issue); a minimal sketch of steps 3 and 4 follows this list.
  5. Continuous learning: retrain model monthly on sliding window of data.
  6. Human-in-the-loop: all alerts reviewed by analyst; verdicts feed back into labeled dataset for supervised upgrade.
  7. Alert deduplication: suppress repeated alerts for the same entity within a time window to reduce fatigue.
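
A minimal sketch of steps 3 and 4; the function names and stand-in numbers are illustrative, not part of any established pipeline:

<syntaxhighlight lang="python">
import numpy as np

def initial_threshold(train_scores, q=99.0):
    """Step 3: threshold at the 99th percentile of normal-data scores."""
    return np.percentile(train_scores, q)

def alert_rate_drifted(todays_rate, baseline_rate, factor=3.0):
    """Step 4: flag a >3x change in daily alert rate (drift or system issue)."""
    return todays_rate > factor * baseline_rate or todays_rate < baseline_rate / factor

train_scores = np.random.default_rng(0).normal(size=10000)   # stand-in scores
print(initial_threshold(train_scores))
print(alert_rate_drifted(todays_rate=0.09, baseline_rate=0.02))   # True
</syntaxhighlight>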