
Anomaly Detection with Partially Observed Anomalies: A Two‑Stage Semi‑Supervised Approach

This article summarizes a two‑stage method for anomaly detection when only a few labeled anomalies and many unlabeled instances are available, detailing problem formulation, isolation‑forest‑based scoring, clustering of anomalies, weighted multiclass modeling, experimental validation, and real‑world URL attack applications.


At The Web Conference 2018 (WWW), the Ant Financial AI team presented a paper titled "Anomaly Detection with Partially Observed Anomalies," authored by Zhang Yalin, Li Longfei, Zhou Jun, Li Xiaolong, and Zhou Zhihua.

Introduction – Anomaly detection is widely needed in e‑commerce scenarios such as fraud transaction mining and abnormal user discovery. Traditional settings assume either fully unsupervised data or fully supervised data, while the paper focuses on the realistic case where only a tiny set of known anomalies and a large pool of unlabeled data are available.

Problem Statement – Given a dataset of m samples, the first l samples are known anomalies (y = 1) and the remaining m − l samples are unlabeled (their labels are unknown, not assumed normal). Typically l ≪ m. The goal is to learn a model that can reliably detect anomalies on future data.

Problem Analysis – Treating the task as pure unsupervised learning discards useful labeled information, while treating all unlabeled data as normal introduces heavy noise. The setting resembles Positive‑and‑Unlabeled (PU) learning, but anomalies are far less homogeneous than typical PU positives, limiting direct PU methods.

Proposed Method – A two‑stage framework:

Stage 1: Cluster the few known anomalies (using k‑means) to capture their diversity. For each unlabeled instance, compute an Isolation Score IS(x) via Isolation Forest and a Similarity Score SS(x) to the nearest anomaly cluster center. Combine the two into an overall anomaly score, with a parameter α balancing their contributions. Instances with the highest scores are selected as Potential Anomalies; those with the lowest scores are selected as Reliable Normals.
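The Stage 1 scoring described above can be sketched as follows. This is a minimal illustration with scikit-learn's IsolationForest and KMeans on synthetic toy data; the exp(−distance) form of the similarity score, the min-max scaling, and α = 0.5 are assumptions made for the sketch, not details taken from the paper.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans

# Toy data (hypothetical): a dense normal cloud plus two small anomaly groups.
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(300, 2))
anom_a = rng.normal(6.0, 0.3, size=(10, 2))
anom_b = rng.normal(-6.0, 0.3, size=(10, 2))
labeled_anomalies = np.vstack([anom_a[:3], anom_b[:3]])  # the few observed anomalies
unlabeled = np.vstack([normal, anom_a[3:], anom_b[3:]])  # everything else, no labels

# Isolation Score IS(x): Isolation Forest scores, flipped and min-max scaled
# so that larger means "easier to isolate" (more anomalous).
iso = IsolationForest(random_state=0).fit(unlabeled)
is_score = -iso.score_samples(unlabeled)
is_score = (is_score - is_score.min()) / (is_score.max() - is_score.min())

# Similarity Score SS(x): cluster the observed anomalies with k-means, then
# score closeness to the nearest cluster center (exp(-distance) is an assumed
# form; the method only requires a similarity to the nearest center).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(labeled_anomalies)
dist = np.linalg.norm(unlabeled[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
ss_score = np.exp(-dist.min(axis=1))

# Overall anomaly score: alpha balances the two components.
alpha = 0.5
total = alpha * is_score + (1.0 - alpha) * ss_score

# Highest scores -> Potential Anomalies; lowest scores -> Reliable Normals.
order = np.argsort(total)
potential_anomalies = order[-10:]
reliable_normals = order[:10]
```

On this toy data the top-scoring instances come almost entirely from the two planted anomaly groups, since those points are both easy to isolate and close to an anomaly cluster center.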

Stage 2: Train a weighted multiclass classifier on three groups – observed anomalies, potential anomalies, and reliable normals. Weights are set to 1 for observed anomalies; for potential anomalies the weight grows with their confidence score TS(x); for reliable normals the weight grows as TS(x) decreases.

The final model predicts the class of a new sample; any assignment to an anomaly class marks the sample as anomalous.
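Stage 2 and the final prediction rule can be sketched like this. The group data, the TS(x) values, the sign-based cluster assignment, and the choice of a decision tree as the multiclass learner are all illustrative assumptions; the essential pieces from the paper are the per-group instance weights and the "any anomaly class means anomalous" decision rule.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical Stage 1 output: feature vectors per group, plus a confidence
# score TS(x) in [0, 1] for each unlabeled point retained by Stage 1.
observed_anomalies = np.array([[6.0, 6.1], [-6.2, -5.9]])
potential_anomalies = np.array([[5.8, 6.3], [-5.7, -6.1]])
potential_ts = np.array([0.90, 0.85])
reliable_normals = np.array([[0.1, -0.2], [0.3, 0.2], [-0.1, 0.0]])
normal_ts = np.array([0.05, 0.10, 0.08])

# Multiclass labels: one class per anomaly cluster (0 and 1), class 2 = normal.
X = np.vstack([observed_anomalies, potential_anomalies, reliable_normals])
y = np.array([0, 1, 0, 1, 2, 2, 2])

# Instance weights per the scheme above: 1 for observed anomalies,
# TS(x) for potential anomalies, 1 - TS(x) for reliable normals.
w = np.concatenate([np.ones(2), potential_ts, 1.0 - normal_ts])

clf = DecisionTreeClassifier(random_state=0).fit(X, y, sample_weight=w)

# Final rule: a new sample is anomalous if it falls into any anomaly class.
pred = clf.predict([[6.0, 6.0]])[0]
is_anomalous = pred != 2
```

A query point near one of the anomaly clusters is assigned to that cluster's class and flagged as anomalous, while a point near the origin is assigned to the normal class.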

Experimental Results – Extensive experiments on multiple datasets compare the proposed method against pure unsupervised, supervised, and PU‑learning baselines. The new approach consistently achieves superior detection performance, especially when labeled anomalies are scarce.

Business Application – The algorithm was deployed for URL attack detection, handling diverse threats such as XXE, XSS, and SQL injection. Real‑world traffic logs with many unlabeled URLs and few known attacks were processed, and the method outperformed existing solutions in manual verification of the top‑1000 scored URLs.

Conclusion – The paper presents a practical solution for anomaly detection under partially observed conditions, leveraging isolation characteristics, similarity to known anomalies, and a weighted multiclass model. Experiments and a production URL‑security use case demonstrate its effectiveness and applicability.

Tags: machine learning, anomaly detection, semi-supervised learning, PU learning, partial observations, weighted multiclass
Written by

AntTech

Technology is the core driver of Ant's future creation.
