Anomaly Detection
<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Anomaly detection is the task of identifying data points, patterns, or behaviors that deviate significantly from what is considered "normal." Also called outlier detection or novelty detection, it is one of the most broadly applicable AI problems: detecting fraudulent transactions, machine failures before they happen, cyber intrusions, manufacturing defects, rare medical conditions, and astronomical events all reduce to finding the unusual in a sea of normal. Anomaly detection is especially challenging because anomalies are rare, diverse, and often not known in advance – making purely supervised approaches impractical.
</div>

__TOC__

<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Anomaly''' – A data point or pattern that deviates significantly from the expected distribution; also called an outlier or novelty.
* '''Outlier''' – A data point that lies far from the majority of the data distribution.
* '''Novelty detection''' – Detecting new types of data not seen during training; the training data is assumed clean (all normal).
* '''Outlier detection''' – Identifying anomalies when the training data itself may contain outliers (contaminated training).
* '''Point anomaly''' – A single data instance that is anomalous with respect to the rest of the data.
* '''Contextual anomaly''' – A data point that is anomalous only in a specific context (e.g., a temperature of 30°C is normal in summer but anomalous in winter).
* '''Collective anomaly''' – A collection of data points that is collectively anomalous even though the individual points may not be.
* '''Isolation Forest''' – An anomaly detection algorithm that isolates anomalies by randomly partitioning the feature space; anomalies are isolated quickly.
* '''One-Class SVM''' – A support vector machine trained on normal data only, learning a boundary around it.
* '''Autoencoder for anomaly detection''' – A neural network trained to reconstruct normal data; anomalies have high reconstruction error.
* '''Local Outlier Factor (LOF)''' – Measures the local density of each point relative to its neighbors; anomalies have lower local density.
* '''DBSCAN''' – A clustering algorithm that identifies noise points (potential anomalies) as points not belonging to any cluster.
* '''Reconstruction error''' – In autoencoder-based detection, the error between the input and its reconstruction; high error indicates an anomaly.
* '''Threshold''' – The score above which a data point is flagged as anomalous; setting it is a key tuning challenge.
* '''False positive rate (FPR)''' – The fraction of normal points incorrectly flagged as anomalous; it must be kept low for operator usability.
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
Anomaly detection is fundamentally different from classification. In classification, you have labeled examples of each class; in anomaly detection, you typically have only normal data (or very few labeled anomalies), and anomalies can take any form not seen before.

'''The assumption underlying most anomaly detection''': normal data occupies a compact, well-defined region of the feature space, and anomalies lie outside it. The challenge is defining "outside" in a meaningful, threshold-able way.

'''Unsupervised approaches''':
* '''Isolation Forest''': Trees that randomly split the feature space. Anomalies are isolated by fewer splits than normal points (they are easier to isolate); the anomaly score is based on the average path length across all trees, with shorter paths indicating anomalies.
* '''Autoencoders''': Train on normal data only. A model that can reconstruct normal patterns will fail on anomalies, so high reconstruction error signals an anomaly.
* '''DBSCAN''': Points not assigned to any dense cluster are noise – potential anomalies.

'''Statistical methods''': Fit a statistical model (Gaussian, GMM, KDE) to the normal data and flag points with low probability under it. This works well in low dimensions but degrades in high-dimensional spaces, where the curse of dimensionality makes all points nearly equidistant.

'''Supervised approaches''' (when labels exist): Treat the task as classification under extreme class imbalance, using focal loss, class weighting, or oversampling (SMOTE). This yields better precision/recall but requires labeled anomalies and fails on unseen anomaly types.

'''Temporal anomaly detection''' adds complexity: what counts as anomalous is often contextual (day of week, trend, seasonality). LSTM autoencoders learn expected sequences; anomalies produce high sequence reconstruction error.
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Isolation Forest for tabular anomaly detection:'''
<syntaxhighlight lang="python">
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Load sensor/transaction/log data (assumed mostly normal)
df = pd.read_csv("sensor_data.csv")
X = df[['temperature', 'pressure', 'vibration', 'current']].values

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Isolation Forest: contamination = expected fraction of anomalies
iso = IsolationForest(
    n_estimators=200,
    contamination=0.02,  # Expect ~2% anomalies
    random_state=42,
    n_jobs=-1
)
iso.fit(X_scaled)

# Anomaly scores: lower (more negative) = more anomalous
scores = iso.score_samples(X_scaled)  # Higher = more normal
predictions = iso.predict(X_scaled)   # 1 = normal, -1 = anomaly

print(f"Detected {(predictions == -1).sum()} anomalies out of {len(X)} samples")

df['anomaly_score'] = scores
df['is_anomaly'] = (predictions == -1)
print(df[df['is_anomaly']].head())
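# --- Optional: percentile-based thresholding on the raw scores ---
# Illustrative sketch, not part of the sklearn API: instead of relying on
# `contamination` alone, flag the lowest-scoring `pct` percent of samples.
# `percentile_threshold` and `pct` are our own names; tune `pct` like a
# contamination estimate.
def percentile_threshold(raw_scores, pct=2.0):
    cut = np.percentile(raw_scores, pct)
    return raw_scores < cut  # True = flagged (score_samples: lower = more anomalous)

# e.g. df['is_anomaly_pct'] = percentile_threshold(scores, pct=2.0)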
</syntaxhighlight>

'''Autoencoder for high-dimensional anomaly detection:'''
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class AnomalyAutoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, input_dim)
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Train on normal data only with MSE loss.
# At inference: reconstruction_error = MSE(x, model(x))
# Threshold: flag samples with error > percentile_95(train_errors)
</syntaxhighlight>

; Method selection guide
: '''Tabular, low-dimensional''' – Isolation Forest (fast, robust)
: '''High-dimensional features''' – Autoencoder, or One-Class SVM with an RBF kernel
: '''Time series''' – LSTM autoencoder, Prophet residuals, seasonal decomposition
: '''Images''' – CNN autoencoder, PatchCore (nearest neighbors in feature space)
: '''Labeled anomalies available''' – XGBoost/LightGBM with class weighting
: '''Streaming, real-time''' – Half-Space Trees, adaptive Isolation Forest
</div>

<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Anomaly Detection Method Comparison
! Method !! Handles High Dimensions !! Speed !! Interpretability !! Requires Labels
|-
| Isolation Forest || Good || Very fast || Low || No
|-
| One-Class SVM || Poor || Slow (large data) || Low || No
|-
| Autoencoder || Excellent || Moderate || Low (error only) || No
|-
| LOF || Poor (high-dim) || Slow || Medium || No
|-
| Supervised (XGBoost) || Good || Fast || High (SHAP) || Yes
|-
| DBSCAN || Poor || Moderate || Medium || No
|}

'''Failure modes''':
* '''Masking''' – if the training data contains many anomalies, the model learns them as "normal."
* '''Concept drift''' – as normal behavior evolves, fixed thresholds produce more false positives.
* '''Feature selection''' – anomalies may be visible only in specific feature subsets.
* '''Threshold sensitivity''' – too low causes alert fatigue; too high causes missed detections.
* '''Distribution shift''' – training and production environments may differ.
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
When ground-truth labels exist, use AUC-ROC (ranking quality across all thresholds), AUC-PR (precision-recall; better for imbalanced data), and F1 at the operating threshold.

When labels do not exist (most unsupervised settings), evaluate by analyst validation rate (the fraction of flagged alerts confirmed as true anomalies by human review), time-to-detect (how quickly after onset an anomaly is flagged), and the false positive rate at the chosen threshold. Expert practitioners set thresholds on a labeled validation set, not by eyeballing scores.
</div>

<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Designing a production anomaly detection pipeline:
# Data collection: identify all relevant signals for the domain (sensor readings, transaction features, log events).
# Baseline modeling: fit an Isolation Forest on 30 days of normal data.
# Threshold setting: use the 99th percentile of normal-data scores as the initial threshold.
# Monitoring: track the alert rate daily; alert if it changes by more than 3× (suggesting concept drift or a system issue).
# Continuous learning: retrain the model monthly on a sliding window of data.
# Human-in-the-loop: have every alert reviewed by an analyst; verdicts feed back into a labeled dataset for a later supervised upgrade.
# Alert deduplication: suppress repeated alerts for the same entity within a time window to reduce fatigue.

[[Category:Artificial Intelligence]]
[[Category:Machine Learning]]
[[Category:Anomaly Detection]]
</div>
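The threshold-setting and monitoring steps of the pipeline can be sketched end to end. This is a minimal illustration on synthetic data: a simple Gaussian z-score stands in for the Isolation Forest to keep the sketch self-contained, and the names <code>daily_alert_rate</code> and <code>drift_alarm</code> are our own, not any library's API.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(42)

# Baseline modeling stand-in: "fit" on 30 days of hourly normal readings.
# (A real pipeline would fit an Isolation Forest here.)
train = rng.normal(loc=20.0, scale=2.0, size=30 * 24)
mu, sigma = train.mean(), train.std()
train_scores = np.abs(train - mu) / sigma  # higher = more anomalous

# Threshold setting: 99th percentile of normal-data scores.
threshold = np.percentile(train_scores, 99)
baseline_rate = float((train_scores > threshold).mean())  # ~1% by construction

def daily_alert_rate(readings):
    """Fraction of a day's readings whose score exceeds the threshold."""
    scores = np.abs(readings - mu) / sigma
    return float((scores > threshold).mean())

def drift_alarm(rate, baseline=baseline_rate, factor=3.0):
    """Monitoring rule: flag if the daily alert rate grows more than factor-fold."""
    return rate > factor * baseline

normal_day = rng.normal(20.0, 2.0, size=1440)   # one day of minutely readings
drifted_day = rng.normal(28.0, 2.0, size=1440)  # mean shifted by 4 sigma

print(drift_alarm(daily_alert_rate(normal_day)))   # stays quiet
print(drift_alarm(daily_alert_rate(drifted_day)))  # fires: nearly every point alerts
</syntaxhighlight>

Note that this alarm only fires on increases; a production version would also treat a collapse in the alert rate as suspicious, and would periodically refit <code>mu</code>, <code>sigma</code>, and the threshold on a sliding window rather than keeping them fixed.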