
Mastering Outlier Detection: Techniques, Algorithms, and PyOD Implementation

Outlier detection identifies data points far from the norm, using methods such as the 3‑sigma rule, boxplots, K‑Nearest Neighbors, and numerous probabilistic and proximity‑based algorithms, with practical PyOD code examples for training, evaluating, and visualizing models across various techniques.


Outlier detection (also called anomaly detection) identifies observations that lie far from the majority of the data. Outliers can significantly distort models such as linear and logistic regression, as well as ensemble methods like AdaBoost.

Common Outlier Detection Methods

One simple approach assumes the data follow a known distribution (e.g., Gaussian) and visualizes them with scatter plots or boxplots; this works for small datasets (up to roughly 10,000 observations and 100 features). For high-dimensional data, visualization is far less effective.
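To make the visualization route concrete, here is a minimal sketch of a boxplot on synthetic Gaussian data with two injected outliers. The data, the injected values, and the output filename are illustrative choices, not from the original article.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=1000)
data = np.append(data, [6.5, -7.0])  # inject two obvious outliers

fig, ax = plt.subplots()
ax.boxplot(data)  # points beyond the whiskers are drawn as outliers
ax.set_title("Boxplot of synthetic data with injected outliers")
fig.savefig("boxplot_outliers.png")
```

The injected values fall well beyond the whiskers and show up as isolated points in the plot.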

3‑Sigma Rule

Based on the normal distribution, the 3‑sigma rule treats points beyond three standard deviations as outliers.

<code>import numpy as np

def three_sigma(s):
    # bounds at three standard deviations from the mean
    mu, std = np.mean(s), np.std(s)
    lower, upper = mu - 3 * std, mu + 3 * std
    return lower, upper
</code>
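A quick sanity check of the rule on synthetic data (the helper is repeated here so the snippet runs standalone; the injected value 10.0 is an illustrative choice):

```python
import numpy as np

def three_sigma(s):
    # bounds at three standard deviations from the mean
    mu, std = np.mean(s), np.std(s)
    return mu - 3 * std, mu + 3 * std

rng = np.random.default_rng(0)
s = np.append(rng.normal(0.0, 1.0, 1000), 10.0)  # one injected outlier
lower, upper = three_sigma(s)
outliers = s[(s < lower) | (s > upper)]  # includes the injected 10.0
```

Note that with 1,000 Gaussian samples a handful of legitimate points can also exceed the 3-sigma bounds, so the flagged set may contain more than just the injected value.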

Boxplot (IQR Method)

The boxplot uses the interquartile range (IQR) to define lower and upper bounds; values outside are considered outliers.

<code>def boxplot(s):
    # s is a pandas Series; bounds lie 1.5 * IQR beyond the quartiles
    q1, q3 = s.quantile(.25), s.quantile(.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return lower, upper
</code>
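Applying the IQR rule to a small pandas Series (the helper is restated so the snippet is self-contained; the sample values are made up for illustration):

```python
import pandas as pd

def iqr_bounds(s):
    # bounds lie 1.5 * IQR beyond the first and third quartiles
    q1, q3 = s.quantile(.25), s.quantile(.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

s = pd.Series([2, 3, 3, 4, 4, 5, 5, 6, 30])  # 30 is an obvious outlier
lower, upper = iqr_bounds(s)          # bounds: (0.0, 8.0)
outliers = s[(s < lower) | (s > upper)]  # flags only the value 30
```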

K‑Nearest Neighbors (KNN)

KNN computes each sample's average distance to its K nearest neighbors and flags points whose distance exceeds a threshold. It makes no assumption about the data distribution, but it detects only global outliers.
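The mechanism can be sketched in plain NumPy. This is not PyOD's implementation, just a brute-force illustration of the average-kNN-distance score; the cluster, the injected point, and `k=5` are all illustrative choices.

```python
import numpy as np

def knn_scores(X, k=5):
    # brute-force pairwise Euclidean distances
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)  # exclude each point's self-distance
    # outlier score = average distance to the k nearest neighbours
    nearest = np.sort(dist, axis=1)[:, :k]
    return nearest.mean(axis=1)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), [[8.0, 8.0]]])  # one far-away point
scores = knn_scores(X, k=5)
# the injected point (index 100) receives the largest score
```

Points inside the cluster have small average neighbor distances, while the isolated point's nearest neighbors are all far away, so it scores highest.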

The PyOD library provides implementations of many outlier detection algorithms, including probabilistic methods (ECOD, ABOD, FastABOD, etc.), linear models (PCA, MCD, OCSVM, etc.), proximity‑based methods (LOF, HBOS, kNN, etc.), ensemble methods (Isolation Forest, Feature Bagging, etc.), and neural‑network approaches (AutoEncoder, VAE, GAN‑based models).

PyOD Quick Start (KNN Example)

<code>from pyod.models.knn import KNN   # kNN detector
from pyod.utils.data import generate_data

# generate sample data with 10% outliers
X_train, X_test, y_train, y_test = generate_data(
    n_train=200, n_test=100, contamination=0.1)

# train kNN detector
clf_name = 'KNN'
clf = KNN()
clf.fit(X_train)

# get prediction labels and outlier scores for training data
y_train_pred = clf.labels_          # 0: inlier, 1: outlier
y_train_scores = clf.decision_scores_

# predict on test data
y_test_pred = clf.predict(X_test)          # 0 or 1
y_test_scores = clf.decision_function(X_test)

# optionally obtain confidence
y_test_pred, y_test_pred_confidence = clf.predict(X_test, return_confidence=True)
</code>
<code>from pyod.utils.data import evaluate_print

print("\nOn Training Data:")
evaluate_print(clf_name, y_train, y_train_scores)
print("\nOn Test Data:")
evaluate_print(clf_name, y_test, y_test_scores)
</code>
<code>from pyod.utils.example import visualize

visualize(clf_name, X_train, y_train, X_test, y_test,
          y_train_pred, y_test_pred,
          show_figure=True, save_figure=False)
</code>

References: original article and the PyOD GitHub repository.

Tags: machine learning, anomaly detection, outlier detection, PyOD
Written by Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".
