
Ad Traffic Anti‑Fraud: Algorithms, System Architecture, and Case Studies

The article explains how ad traffic fraud—ranging from simulated impressions to click farms—can be combated using a four‑layer risk‑control system that leverages unsupervised (DBSCAN, Isolation Forest) and supervised (Logistic Regression, Random Forest) algorithms, detailing data pipelines, model training, monitoring, and real‑world case studies.


Author: vivo Internet Security Team – Duan Yunxin

Commercial ad traffic monetization suffers from severe cheating on both the media side and the advertiser side, harming all parties. Strategy‑ and algorithm‑based risk control can effectively protect interests. This article first introduces ad anti‑fraud, then presents common algorithms used in risk‑control systems, and finally provides concrete application cases.

1. Ad Anti‑Fraud Overview

1.1 Definition

Ad traffic fraud occurs when media-side actors use cheating techniques to illegitimately capture advertiser spend.

Typical sources of fraudulent traffic include:

Simulator‑generated or tampered‑device traffic;

Real devices controlled by bot farms;

Real devices induced to generate invalid traffic.

1.2 Common Cheating Behaviors

Machine behavior: repeated impressions from the same IP, IP switching, traffic hijacking, IMEI switching, etc.

Human behavior: deceptive creative elements that induce clicks, media‑rendered copy that lures clicks, accidental pop‑up clicks, etc.

1.3 Typical Fraud Types

Impression fraud: multiple ads displayed simultaneously in the same slot, charging the advertiser multiple times.

Click fraud: scripts or programs simulate real users, or incentivized users generate massive useless clicks, draining CPC budgets.

Installation fraud: simulated downloads via test devices or emulators, or device/SDK manipulation to send virtual install signals.

2. Algorithmic System for Ad Anti‑Fraud

2.1 Background of Algorithm Models in Business Risk Control

Intelligent risk control leverages massive behavioral data to build models that detect and monitor risks, offering higher accuracy, coverage, and stability than rule‑based strategies.

Common unsupervised algorithms:

Density‑based clustering (DBSCAN)

Isolation Forest

K‑means

Common supervised algorithms:

Logistic Regression

Random Forest

2.2 Four‑Layer Architecture

Platform layer: based on Spark‑ML / TensorFlow / PyTorch frameworks, integrating open‑source and custom algorithms for risk modeling.

Data layer: constructs multi‑granularity profiles (IP, media, ad slot, request, impression, click, download, activation) to feed models.

Business model layer: builds click‑fraud audit models, request‑click risk estimation, media‑behavior similarity groups, and media‑level anomaly perception models.

Access layer: applies model outputs to offline audit results, synchronizes downstream penalties, and feeds anomaly lists to inspection platforms.

3. Algorithm Model Application Cases

3.1 Material Interaction Induced Fraud Perception

Background: Some ad creatives embed a fake “X” close button. Users tapping the visual “X” generate invalid clicks, harming the user experience. A click heat‑map of such creatives shows click coordinates tightly concentrated on the fake close button rather than spread over the creative.

Model perception:

1. DBSCAN

Key concepts:

Neighborhood: for a sample x and distance ε, the ε‑neighborhood includes all samples within ε.

Core point: a sample whose ε‑neighborhood contains at least minPts points.

Direct density reachability: a point b is directly density‑reachable from a core point a if b lies in a’s ε‑neighborhood.

Density reachability: b is density‑reachable from a if there is a chain of points from a to b in which each point is directly density‑reachable from the previous one.

Density connectivity: two points are density‑connected if there exists a third point from which both are density‑reachable.

Cluster: the maximal set of density‑connected points.

After defining the concepts, DBSCAN is applied to click coordinate data to isolate dense clusters representing fraudulent clicks.

2. Application to induced accidental clicks

Steps:

Group click data by resolution and ad slot, filter out low‑volume groups.

Apply DBSCAN with ε=5 and minimum points=10 to each group.

Discard clusters covering too small an area to represent a genuine hot spot; the training code (omitted here) implements this filtering.

Monitor and act on identified clusters, linking them to downstream metrics and taking remediation actions.
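The steps above can be sketched with scikit‑learn’s DBSCAN. The click coordinates below are synthetic (a dense blob near a hypothetical fake close button plus scattered normal clicks); ε = 5 and the minimum point count of 10 follow the parameters stated above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Simulated click coordinates for one (resolution, ad slot) group:
# 200 clicks packed around a fake "X" button at (300, 40), plus
# 100 ordinary clicks spread over a 320x480 screen.
button_clicks = rng.normal(loc=(300.0, 40.0), scale=1.5, size=(200, 2))
normal_clicks = rng.uniform(low=0.0, high=(320.0, 480.0), size=(100, 2))
coords = np.vstack([button_clicks, normal_clicks])

# eps = 5 px neighborhood, at least 10 points to form a core point.
labels = DBSCAN(eps=5, min_samples=10).fit_predict(coords)

# Labels >= 0 are dense clusters (candidate induced-click hot spots);
# -1 marks noise, i.e. ordinary scattered clicks.
clusters = set(labels) - {-1}
```

Each surviving cluster’s area and click share can then be checked against the small‑area filter described in the steps.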

3.2 Click Fraud Model

3.2.1 Background

Build a model to detect fraudulent clicks, enhancing audit coverage and uncovering hidden high‑dimensional cheating behaviors.

3.2.2 Construction Process

Feature Engineering

Token‑level features: device, IP, media, ad slot attributes before each event.

Frequency features: statistics (mean, variance, dispersion) over multiple time windows (1 min, 5 min, 30 min, 1 h, 1 day, 7 days) for impressions, clicks, installs.

Basic attributes: media type, ad type, device legitimacy, IP type, network type, device value tier, etc.
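As a sketch of the frequency features, trailing time‑window click counts per device can be computed with pandas. The click log below is made up, and only one window (5 min) and one statistic (count) are shown; the same pattern extends to the other windows and to mean/variance/dispersion.

```python
import pandas as pd

# Hypothetical click log: one row per click event.
clicks = pd.DataFrame({
    "device_id": ["d1", "d1", "d1", "d1", "d2", "d2"],
    "ts": pd.to_datetime([
        "2022-06-01 10:00:00", "2022-06-01 10:00:30",
        "2022-06-01 10:02:00", "2022-06-01 10:40:00",
        "2022-06-01 10:05:00", "2022-06-01 11:00:00",
    ]),
})

def trailing_click_counts(df, window):
    """For each click, count the device's clicks in the trailing window."""
    return (
        df.assign(one=1)
          .sort_values("ts")
          .set_index("ts")                 # time-based rolling needs a datetime index
          .groupby("device_id")["one"]
          .rolling(window)                 # trailing window ending at each event
          .sum()
          .rename(f"clicks_{window}")
          .reset_index()
    )

feats = trailing_click_counts(clicks, "5min")
```

For device d1, the click at 10:02:00 sees three clicks in its trailing 5‑minute window, while the isolated click at 10:40:00 sees only itself.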

Sample Balancing

Down‑sample non‑fraud samples to achieve a 1:1 ratio.

Use K‑means to cluster non‑fraud samples and then down‑sample each cluster proportionally.
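A minimal sketch of the cluster‑then‑downsample step, assuming a synthetic non‑fraud feature matrix and a fraud count of 200; each K‑means cluster is sampled in proportion to its size so the down‑sample preserves the non‑fraud distribution.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Hypothetical feature matrix for the non-fraud population (synthetic values).
nonfraud = rng.normal(size=(1000, 5))
n_fraud = 200  # target size to reach a 1:1 fraud/non-fraud ratio

# Cluster the non-fraud samples, then draw from each cluster in
# proportion to its share of the population.
k = 10
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(nonfraud)

sampled_idx = []
for c in range(k):
    idx = np.flatnonzero(labels == c)
    take = max(1, round(len(idx) / len(nonfraud) * n_fraud))
    sampled_idx.extend(rng.choice(idx, size=min(take, len(idx)), replace=False))

sampled = nonfraud[sampled_idx]  # balanced non-fraud sample, roughly n_fraud rows
```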

Feature Pre‑processing

Remove features with >50 % missing rate.

Filter out features with contribution < 0.001 to the target.

Drop features with Population Stability Index (PSI) > 0.2 across time periods.
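The PSI screen can be sketched as follows. The decile binning on the expected (training) period and the small epsilon for empty bins are implementation assumptions; the common rule of thumb treats PSI > 0.2 as an unstable feature.

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population Stability Index between two samples of one feature.

    Bin edges come from the expected (training-period) distribution;
    a small epsilon avoids log(0) when a bin is empty.
    """
    eps = 1e-6
    edges = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range values
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(1)
train   = rng.normal(0.0, 1.0, 10_000)   # training-period feature values
stable  = rng.normal(0.0, 1.0, 10_000)   # same distribution: low PSI
shifted = rng.normal(1.0, 1.0, 10_000)   # drifted distribution: high PSI
```

A feature behaving like `shifted` across time periods would be dropped under the PSI > 0.2 rule.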

Model Training

Random Forest is employed for classification because it handles high‑dimensional data without extensive feature selection, provides unbiased generalization error estimates, trains quickly in parallel, and resists over‑fitting. Hyper‑parameter tuning (max_depth, numTrees, etc.) is performed via a parameter grid.
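The tuning step can be sketched with scikit‑learn. The article mentions Spark‑ML’s `numTrees`, which corresponds to `n_estimators` here; the data is synthetic and the grid values are illustrative, not the production configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the balanced click feature matrix.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           weights=[0.5, 0.5], random_state=0)

# Parameter grid over the hyper-parameters named above.
grid = {"max_depth": [4, 8], "n_estimators": [50, 100]}

search = GridSearchCV(RandomForestClassifier(random_state=0), grid,
                      cv=3, scoring="roc_auc")
search.fit(X, y)

best = search.best_estimator_  # Random Forest with the best grid combination
```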

Model Monitoring

Stability monitoring: compare PSI of features between training and inference periods, visualizing daily alerts.

Interpretability: compute Shapley values for each prediction to understand feature impact, visualized on a monitoring dashboard.

3.3 Click Sequence Anomaly Detection

3.3.1 Background

Analyze hourly click sequences per device to uncover malicious behaviors, such as users who only click during 0‑6 am or exhibit unusually balanced hourly activity.

3.3.2 Construction Process

Feature construction: for each device, aggregate hourly click counts over the past 1, 7, and 30 days, forming 1×24, 7×24, and 30×24 vectors.

Model selection: Isolation Forest, based on the assumptions that anomalies are few and their feature values lie far from normal points.

Training (example parameters):

Whole‑platform traffic: contamination=0.05

Per‑media‑type traffic: contamination=0.1

Per‑ad‑slot‑type traffic: contamination=0.1

Scoring and Filtering

Anomaly score ≈ 1 indicates a definite anomaly; < 0.5 indicates normal.

Score > 0.7 → high‑risk users; 0.5–0.7 → medium‑risk, sent for manual review.
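A sketch of the hourly‑vector setup with scikit‑learn’s IsolationForest, on synthetic 1×24 vectors (normal devices click mostly in daytime; the anomalies mimic the midnight‑only pattern described above). Note a sign convention: sklearn’s `score_samples` is the negative of the original paper’s anomaly score, so its negation lies in (0, 1) with values near 1 indicating anomalies, matching the thresholds above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)

# Normal devices: low click rate at night (hours 0-7), higher in daytime.
day_profile = np.concatenate([np.full(8, 1.0), np.full(16, 5.0)])
normal = rng.poisson(day_profile, size=(500, 24)).astype(float)

# Anomalous devices: clicks only between 0-6 am, none during the day.
night = np.zeros(24)
night[:6] = 40.0
anomalies = rng.poisson(night, size=(10, 24)).astype(float)

X = np.vstack([normal, anomalies])
clf = IsolationForest(contamination=0.05, random_state=0).fit(X)

# -score_samples recovers the paper-style anomaly score in (0, 1).
scores = -clf.score_samples(X)
high_risk = scores > 0.7  # high-risk cut from the thresholds above
flagged = clf.predict(X) == -1  # sklearn's own contamination-based flag
```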

Case Studies

Case ① (2022‑XX‑XX): Device A shows consistently high hourly clicks across 7 days, far exceeding normal patterns.

Case ② (2022‑XX‑XX): Device B clicks only during midnight hours, with no activity during the day.

4. Summary

In the ad traffic anti‑fraud field, evolving adversarial techniques demand robust algorithmic models. Both supervised and unsupervised methods have been explored to detect fraudulent impressions, clicks, and installations, significantly improving detection capability and uncovering complex abnormal behavior patterns. Future work will continue to expand model applications for machine‑generated traffic identification.

Tags: advertising, anti‑fraud, anomaly detection, machine learning, ad fraud, risk detection