Anomaly Detection
<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Anomaly detection is the task of identifying data points, patterns, or behaviors that deviate significantly from what is considered "normal." Also called outlier detection or novelty detection, it is one of the most broadly applicable AI problems: detecting fraudulent transactions, machine failures before they happen, cyber intrusions, manufacturing defects, rare medical conditions, and astronomical events all reduce to finding the unusual in a sea of normal. Anomaly detection is especially challenging because anomalies are rare, diverse, and often not known in advance – making purely supervised approaches impractical.
</div>

__TOC__

<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Anomaly''' – A data point or pattern that deviates significantly from the expected distribution; also called an outlier or novelty.
* '''Outlier''' – A data point that lies far from the majority of the data distribution.
* '''Novelty detection''' – Detecting new types of data not seen during training; the training data is assumed clean (all normal).
* '''Outlier detection''' – Identifying anomalies when the training data itself may contain outliers (contaminated training).
* '''Point anomaly''' – A single data instance that is anomalous with respect to the rest of the data.
* '''Contextual anomaly''' – A data point that is anomalous only in a specific context (e.g., a temperature of 30°C is normal in summer but anomalous in winter).
* '''Collective anomaly''' – A collection of data points that is collectively anomalous even though the individual points may not be.
* '''Isolation Forest''' – An anomaly detection algorithm that isolates anomalies by randomly partitioning the feature space; anomalies are isolated quickly.
* '''One-Class SVM''' – A support vector machine trained on normal data only, learning a boundary around it.
* '''Autoencoder for anomaly detection''' – A neural network trained to reconstruct normal data; anomalies have high reconstruction error.
* '''Local Outlier Factor (LOF)''' – Measures the local density of each point relative to its neighbors; anomalies have lower local density.
* '''DBSCAN''' – A clustering algorithm that identifies noise points (potential anomalies) as points not belonging to any cluster.
* '''Reconstruction error''' – In autoencoder-based detection, the error between the input and its reconstruction; high error indicates an anomaly.
* '''Threshold''' – The score above which a data point is flagged as anomalous; setting it is a key tuning challenge.
* '''False positive rate (FPR)''' – The fraction of normal points incorrectly flagged as anomalous; it must be kept low for operator usability.
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
Anomaly detection is fundamentally different from classification. In classification, you have labeled examples of each class; in anomaly detection, you typically have only normal data (or very few labeled anomalies), and anomalies can take any form not seen before.

'''The assumption underlying most anomaly detection''': normal data occupies a compact, well-defined region of the feature space, and anomalies lie outside it. The challenge is defining "outside" in a meaningful, threshold-able way.

'''Unsupervised approaches''':
* '''Isolation Forest''': Trees that randomly split the feature space. Anomalies are isolated by fewer splits than normal points (they are easier to isolate); the anomaly score is based on the average path length across all trees, with shorter paths indicating anomalies.
* '''Autoencoders''': Train on normal data only. A model that can reconstruct normal patterns will fail on anomalies, so high reconstruction error signals an anomaly.
* '''DBSCAN''': Points not assigned to any dense cluster are noise – potential anomalies.

'''Statistical methods''': Fit a statistical model (Gaussian, GMM, KDE) to the normal data and flag points with low probability under it. This works well in low dimensions but degrades in high-dimensional spaces, where the curse of dimensionality makes all points nearly equidistant.

'''Supervised approaches''' (when labels exist): Treat the task as classification under extreme class imbalance, using focal loss, class weighting, or oversampling (SMOTE). This yields better precision/recall but requires labeled anomalies and fails on unseen anomaly types.

'''Temporal anomaly detection''' adds complexity: what counts as anomalous is often contextual (day of week, trend, seasonality). LSTM autoencoders learn expected sequences; anomalies produce high sequence reconstruction error.
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Isolation Forest for tabular anomaly detection:'''
<syntaxhighlight lang="python">
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Load sensor/transaction/log data (assumed mostly normal)
df = pd.read_csv("sensor_data.csv")
X = df[['temperature', 'pressure', 'vibration', 'current']].values

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Isolation Forest: contamination = expected fraction of anomalies
iso = IsolationForest(
    n_estimators=200,
    contamination=0.02,  # Expect ~2% anomalies
    random_state=42,
    n_jobs=-1
)
iso.fit(X_scaled)

# Anomaly scores: lower (more negative) = more anomalous
scores = iso.score_samples(X_scaled)  # Higher = more normal
predictions = iso.predict(X_scaled)   # 1 = normal, -1 = anomaly

print(f"Detected {(predictions == -1).sum()} anomalies out of {len(X)} samples")

df['anomaly_score'] = scores
df['is_anomaly'] = (predictions == -1)
print(df[df['is_anomaly']].head())
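# --- Optional: percentile-based thresholding on the raw scores ---
# Illustrative sketch, not part of the sklearn API: instead of relying on
# `contamination` alone, flag the lowest-scoring `pct` percent of samples.
# `percentile_threshold` and `pct` are our own names; tune `pct` like a
# contamination estimate.
def percentile_threshold(raw_scores, pct=2.0):
    cut = np.percentile(raw_scores, pct)
    return raw_scores < cut  # True = flagged (score_samples: lower = more anomalous)

# e.g. df['is_anomaly_pct'] = percentile_threshold(scores, pct=2.0)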
</syntaxhighlight>

'''Autoencoder for high-dimensional anomaly detection:'''
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class AnomalyAutoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, input_dim)
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Train on normal data only with MSE loss.
# At inference: reconstruction_error = MSE(x, model(x))
# Threshold: flag samples with error > percentile_95(train_errors)
</syntaxhighlight>

; Method selection guide
: '''Tabular, low-dimensional''' – Isolation Forest (fast, robust)
: '''High-dimensional features''' – Autoencoder, or One-Class SVM with an RBF kernel
: '''Time series''' – LSTM autoencoder, Prophet residuals, seasonal decomposition
: '''Images''' – CNN autoencoder, PatchCore (nearest neighbors in feature space)
: '''Labeled anomalies available''' – XGBoost/LightGBM with class weighting
: '''Streaming, real-time''' – Half-Space Trees, adaptive Isolation Forest
</div>

<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Anomaly Detection Method Comparison
! Method !! Handles High Dimensions !! Speed !! Interpretability !! Requires Labels
|-
| Isolation Forest || Good || Very fast || Low || No
|-
| One-Class SVM || Poor || Slow (large data) || Low || No
|-
| Autoencoder || Excellent || Moderate || Low (error only) || No
|-
| LOF || Poor (high-dim) || Slow || Medium || No
|-
| Supervised (XGBoost) || Good || Fast || High (SHAP) || Yes
|-
| DBSCAN || Poor || Moderate || Medium || No
|}

'''Failure modes''':
* '''Masking''' – if the training data contains many anomalies, the model learns them as "normal."
* '''Concept drift''' – as normal behavior evolves, fixed thresholds produce more false positives.
* '''Feature selection''' – anomalies may be visible only in specific feature subsets.
* '''Threshold sensitivity''' – too low causes alert fatigue; too high causes missed detections.
* '''Distribution shift''' – training and production environments may differ.
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
When ground-truth labels exist, use AUC-ROC (ranking quality across all thresholds), AUC-PR (precision-recall; better for imbalanced data), and F1 at the operating threshold.

When labels do not exist (most unsupervised settings), evaluate by analyst validation rate (the fraction of flagged alerts confirmed as true anomalies by human review), time-to-detect (how quickly after onset an anomaly is flagged), and the false positive rate at the chosen threshold. Expert practitioners set thresholds on a labeled validation set, not by eyeballing scores.
</div>

<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Designing a production anomaly detection pipeline:
# Data collection: identify all relevant signals for the domain (sensor readings, transaction features, log events).
# Baseline modeling: fit an Isolation Forest on 30 days of normal data.
# Threshold setting: use the 99th percentile of normal-data scores as the initial threshold.
# Monitoring: track the alert rate daily; alert if it changes by more than 3× (suggesting concept drift or a system issue).
# Continuous learning: retrain the model monthly on a sliding window of data.
# Human-in-the-loop: have every alert reviewed by an analyst; verdicts feed back into a labeled dataset for a later supervised upgrade.
# Alert deduplication: suppress repeated alerts for the same entity within a time window to reduce fatigue.

[[Category:Artificial Intelligence]]
[[Category:Machine Learning]]
[[Category:Anomaly Detection]]
</div>
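The threshold-setting and monitoring steps of the pipeline can be sketched end to end. This is a minimal illustration on synthetic data: a simple Gaussian z-score stands in for the Isolation Forest to keep the sketch self-contained, and the names <code>daily_alert_rate</code> and <code>drift_alarm</code> are our own, not any library's API.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(42)

# Baseline modeling stand-in: "fit" on 30 days of hourly normal readings.
# (A real pipeline would fit an Isolation Forest here.)
train = rng.normal(loc=20.0, scale=2.0, size=30 * 24)
mu, sigma = train.mean(), train.std()
train_scores = np.abs(train - mu) / sigma  # higher = more anomalous

# Threshold setting: 99th percentile of normal-data scores.
threshold = np.percentile(train_scores, 99)
baseline_rate = float((train_scores > threshold).mean())  # ~1% by construction

def daily_alert_rate(readings):
    """Fraction of a day's readings whose score exceeds the threshold."""
    scores = np.abs(readings - mu) / sigma
    return float((scores > threshold).mean())

def drift_alarm(rate, baseline=baseline_rate, factor=3.0):
    """Monitoring rule: flag if the daily alert rate grows more than factor-fold."""
    return rate > factor * baseline

normal_day = rng.normal(20.0, 2.0, size=1440)   # one day of minutely readings
drifted_day = rng.normal(28.0, 2.0, size=1440)  # mean shifted by 4 sigma

print(drift_alarm(daily_alert_rate(normal_day)))   # stays quiet
print(drift_alarm(daily_alert_rate(drifted_day)))  # fires: nearly every point alerts
</syntaxhighlight>

Note that this alarm only fires on increases; a production version would also treat a collapse in the alert rate as suspicious, and would periodically refit <code>mu</code>, <code>sigma</code>, and the threshold on a sliding window rather than keeping them fixed.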