Comprehensive Overview of Common Anomaly Detection Methods with Python Code Examples
This article compiles common anomaly (outlier) detection techniques, organized by category: distribution‑based (3‑sigma, Z‑score, boxplot, Grubbs' test), distance‑based (K‑Nearest Neighbors), density‑based (Local Outlier Factor, Connectivity‑Based Outlier Factor), clustering‑based (DBSCAN), tree‑based (Isolation Forest), dimensionality‑reduction (PCA, AutoEncoder), classification‑based (One‑Class SVM), and prediction‑based approaches. For each method it provides a theoretical description, algorithmic steps, advantages, limitations, and Python code examples.
Distribution‑based methods
import numpy as np

def three_sigma(s):
    mu, std = np.mean(s), np.std(s)
    lower, upper = mu - 3*std, mu + 3*std
    return lower, upper

def z_score(s):
    return (s - np.mean(s)) / np.std(s)

def boxplot(s):
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr
    return lower, upper

Grubbs' test is described with its hypothesis and iterative removal steps.
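A single Grubbs iteration can be sketched as follows; the toy data, alpha level, and helper name `grubbs_test` are illustrative assumptions, and repeating the test after removing each flagged point gives the iterative variant described above.

```python
import numpy as np
from scipy import stats

def grubbs_test(s, alpha=0.05):
    """One Grubbs iteration: return the index of the outlier, or None."""
    s = np.asarray(s, dtype=float)
    n = len(s)
    mean, std = s.mean(), s.std(ddof=1)
    # G statistic: largest absolute deviation from the mean, in std units
    idx = np.argmax(np.abs(s - mean))
    G = abs(s[idx] - mean) / std
    # Critical value derived from the t distribution
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return idx if G > G_crit else None

data = [5.1, 4.9, 5.0, 5.2, 4.8, 12.0]
print(grubbs_test(data))  # flags index 5 (the value 12.0)
```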
Distance‑based method (KNN)
from pyod.models.knn import KNN
clf = KNN(method='mean', n_neighbors=3)
clf.fit(X_train)
labels = clf.labels_  # 0: normal, 1: outlier

Density‑based methods
from sklearn.neighbors import LocalOutlierFactor as LOF
clf = LOF(n_neighbors=2)
labels = clf.fit_predict(X)  # -1: outlier, 1: normal

Connectivity‑Based Outlier Factor (COF) is introduced with its set‑based nearest path concept.
Clustering‑based method (DBSCAN)
DBSCAN treats points that cannot belong to any dense cluster as noise (outliers).
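A minimal scikit-learn sketch of this idea; the toy data and the `eps`/`min_samples` values are chosen purely for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1], [1.0, 0.9],   # cluster 1
              [8.0, 8.0], [8.1, 8.0], [7.9, 8.1], [8.0, 7.9],   # cluster 2
              [4.5, 12.0]])                                     # isolated point

# Points that belong to no dense cluster are labeled -1 (noise/outliers)
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
```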
Tree‑based method (Isolation Forest)
from sklearn.ensemble import IsolationForest
iforest = IsolationForest(n_estimators=100, contamination=0.05, random_state=1)
iforest.fit(X)
labels = iforest.predict(X)  # -1: outlier, 1: normal

The algorithm isolates anomalies with fewer random splits than normal points, so shorter average path lengths translate into higher anomaly scores.
Dimensionality‑reduction methods
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(data)
transformed = pca.transform(data)

PCA can detect outliers by large deviations in principal‑component space or by high reconstruction error after projecting onto the leading components and mapping back. AutoEncoder, a non‑linear counterpart, is trained on normal data and flags samples with large reconstruction loss.
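The reconstruction‑error route can be sketched as follows; the synthetic near‑planar data and the choice of two components are illustrative assumptions, and the same score (per‑sample reconstruction loss) is what an AutoEncoder would produce non‑linearly.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
# Make the data nearly planar: third feature is a noisy sum of the first two
X[:, 2] = X[:, 0] + X[:, 1] + rng.normal(scale=0.05, size=100)
X[0] = [5.0, -5.0, 10.0]  # one sample that breaks the linear structure

pca = PCA(n_components=2).fit(X)
X_rec = pca.inverse_transform(pca.transform(X))       # project and map back
errors = np.sum((X - X_rec) ** 2, axis=1)             # per-sample reconstruction error
outlier_idx = np.argmax(errors)                       # largest error -> most anomalous
```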
Classification‑based method (One‑Class SVM)
from sklearn import svm
clf = svm.OneClassSVM(nu=0.1, kernel='rbf', gamma=0.1)
clf.fit(X)
pred = clf.predict(X)  # -1: outlier, 1: normal

One‑Class SVM (or SVDD) learns a boundary that encloses the majority of the data and treats points outside it as anomalies.
Prediction‑based method
For time‑series, a forecasting model predicts future values; residuals are then analyzed (e.g., using K‑sigma) to identify abnormal points.
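A minimal sketch of this idea, using a naive rolling‑mean forecast as a stand‑in for a real forecasting model; the synthetic series, the window size, and K = 3 are all illustrative assumptions.

```python
import numpy as np

rng = np.random.RandomState(42)
series = np.sin(np.linspace(0, 6 * np.pi, 200)) + rng.normal(scale=0.1, size=200)
series[120] += 3.0  # inject an anomalous spike

window = 10
# Naive forecast: each point predicted as the mean of the previous `window` points
pred = np.array([series[i - window:i].mean() for i in range(window, len(series))])
resid = series[window:] - pred

# K-sigma rule on the residuals (K = 3)
mu, sigma = resid.mean(), resid.std()
anomalies = np.where(np.abs(resid - mu) > 3 * sigma)[0] + window
```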
The article also discusses the strengths and weaknesses of each technique, providing guidance on when to choose a particular method based on data dimensionality, distribution assumptions, and computational cost.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.