Ad Traffic Anti‑Fraud: Algorithms, System Architecture, and Case Studies
The article explains how ad traffic fraud—ranging from simulated impressions to click farms—can be combated using a four‑layer risk‑control system that leverages unsupervised (DBSCAN, Isolation Forest) and supervised (Logistic Regression, Random Forest) algorithms, detailing data pipelines, model training, monitoring, and real‑world case studies.
Author: vivo Internet Security Team – Duan Yunxin
Commercial ad traffic monetization suffers from severe cheating on both the media side and the advertiser side, harming all parties. Strategy‑ and algorithm‑based risk control can effectively protect interests. This article first introduces ad anti‑fraud, then presents common algorithms used in risk‑control systems, and finally provides concrete application cases.
1. Ad Anti‑Fraud Overview
1.1 Definition
Ad traffic fraud occurs when media or other parties use cheating techniques to fraudulently capture advertisers' budgets.
Typical sources of fraudulent traffic include:
Simulator‑generated or tampered‑device traffic;
Real devices controlled by bot farms;
Real devices induced to generate invalid traffic.
1.2 Common Cheating Behaviors
Machine behavior: repeated impressions from the same IP, IP switching, traffic hijacking, IMEI switching, etc.
Human behavior: deceptive creative elements that induce clicks, media‑rendered copy that lures clicks, accidental pop‑up clicks, etc.
1.3 Typical Fraud Types
Impression fraud: multiple ads displayed simultaneously in the same slot, charging the advertiser multiple times.
Click fraud: scripts or programs simulate real users, or incentivized users generate massive useless clicks, draining CPC budgets.
Installation fraud: simulated downloads via test devices or emulators, or device/SDK manipulation to send virtual install signals.
2. Algorithmic System for Ad Anti‑Fraud
2.1 Background of Algorithm Models in Business Risk Control
Intelligent risk control leverages massive behavioral data to build models that detect and monitor risks, offering higher accuracy, coverage, and stability than rule‑based strategies.
Common unsupervised algorithms:
Density‑based clustering (DBSCAN)
Isolation Forest
K‑means
Common supervised algorithms:
Logistic Regression
Random Forest
2.2 Four‑Layer Architecture
Platform layer: based on Spark‑ML / TensorFlow / PyTorch frameworks, integrating open‑source and custom algorithms for risk modeling.
Data layer: constructs multi‑granularity profiles (IP, media, ad slot, request, impression, click, download, activation) to feed models.
Business model layer: builds click‑fraud audit models, request‑click risk estimation, media‑behavior similarity groups, and media‑level anomaly perception models.
Access layer: applies model outputs to offline audit results, synchronizes downstream penalties, and feeds anomaly lists to inspection platforms.
3. Algorithm Model Application Cases
3.1 Material Interaction Induced Fraud Perception
Background: Some ad creatives embed a fake “X” close button. Users tapping the visual “X” generate invalid clicks, harming user experience; a heat map of click coordinates shows clicks concentrated on the fake button. [Figure: original creative (left) and click-coordinate heat map (right)]
Model perception:
1. DBSCAN
Key concepts:
Neighborhood: for a sample x and distance ε, the ε‑neighborhood includes all samples within ε.
Core point: a sample whose ε‑neighborhood contains at least minPts points.
Direct density reachability: a point b is directly reachable from a core point a if b lies in a’s ε‑neighborhood.
Density reachability: a chain of directly reachable points from a to b.
Density connectivity: two points are density‑connected if there exists a third point that is density‑reachable from both.
Cluster: the maximal set of density‑connected points.
After defining the concepts, DBSCAN is applied to click coordinate data to isolate dense clusters representing fraudulent clicks.
2. Application to induced accidental clicks
Steps:
Group click data by resolution and ad slot, filter out low‑volume groups.
Apply DBSCAN with ε=5 and minPts=10 to each group.
Keep the small-area, high-density clusters: clicks concentrated in a tiny region point to an induced accidental-click hotspot (the training code is omitted here).
Monitor and act on identified clusters, linking them to downstream metrics and taking remediation actions.
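The steps above can be sketched with scikit-learn's DBSCAN. The click coordinates, screen size, and filtering thresholds (cluster bounding-box area and minimum click count) below are all made up for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Synthetic click coordinates for one (resolution, ad-slot) group:
# a dense blob around a fake "X" close button plus scattered normal clicks.
fake_button = rng.normal(loc=(300, 40), scale=2.0, size=(200, 2))
normal = rng.uniform(low=(0, 0), high=(360, 640), size=(300, 2))
clicks = np.vstack([fake_button, normal])

# Parameters from the article: eps=5 px, at least 10 points per cluster.
labels = DBSCAN(eps=5, min_samples=10).fit_predict(clicks)

suspicious = []
for label in set(labels) - {-1}:          # -1 marks noise points
    pts = clicks[labels == label]
    # Bounding-box area of the cluster; many clicks packed into a tiny
    # area suggest an induced accidental-click hotspot.
    area = np.ptp(pts[:, 0]) * np.ptp(pts[:, 1])
    if area < 500 and len(pts) >= 50:     # illustrative thresholds
        suspicious.append((label, len(pts), round(area, 1)))

print(suspicious)
```

In production the same loop would run per (resolution, ad slot) group after the low-volume filter, with thresholds tuned against audited samples.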
3.2 Click Fraud Model
3.2.1 Background
Build a model to detect fraudulent clicks, enhancing audit coverage and uncovering hidden high‑dimensional cheating behaviors.
3.2.2 Construction Process
Feature Engineering
Token‑level features: device, IP, media, ad slot attributes before each event.
Frequency features: statistics (mean, variance, dispersion) over multiple time windows (1 min, 5 min, 30 min, 1 h, 1 day, 7 days) for impressions, clicks, installs.
Basic attributes: media type, ad type, device legitimacy, IP type, network type, device value tier, etc.
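As an illustration of the frequency features, a time-windowed click count per device can be computed with pandas rolling windows. The (device_id, ts) click-log schema and the timestamps below are hypothetical:

```python
import pandas as pd

# Toy click log: one row per click event.
clicks = pd.DataFrame({
    "device_id": ["d1"] * 6 + ["d2"] * 2,
    "ts": pd.to_datetime([
        "2022-06-01 10:00:01", "2022-06-01 10:00:30", "2022-06-01 10:03:00",
        "2022-06-01 10:25:00", "2022-06-01 11:00:00", "2022-06-02 10:00:00",
        "2022-06-01 10:00:05", "2022-06-01 12:00:00",
    ]),
    "one": 1,   # helper column so rolling sums act as event counts
})

def window_counts(df, window):
    """Clicks per device in the time window ending at each event."""
    return (df.set_index("ts")
              .groupby("device_id")["one"]
              .rolling(window).sum())

# Click counts in 1-minute and 30-minute windows before each event.
for w in ["1min", "30min"]:
    print(w)
    print(window_counts(clicks, w))
```

The mean, variance, and dispersion statistics named above follow the same pattern, swapping `.sum()` for `.mean()` or `.var()` over the corresponding windows.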
Sample Balancing
Down‑sample non‑fraud samples to achieve a 1:1 ratio.
Use K‑means to cluster non‑fraud samples and then down‑sample each cluster proportionally.
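A minimal sketch of the K-means-based down-sampling, using a made-up feature matrix and cluster count; sampling each cluster in proportion to its size keeps the down-sampled majority class close to its original distribution:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Hypothetical setup: 1,000 non-fraud samples to be shrunk to roughly
# the fraud-sample count (here 100) for a 1:1 ratio.
nonfraud = rng.normal(size=(1000, 8))
n_target = 100

# Cluster the majority class, then draw from each cluster in proportion
# to its share of the data.
k = 10
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(nonfraud)

picked = []
for c in range(k):
    idx = np.flatnonzero(labels == c)
    n_c = max(1, round(n_target * len(idx) / len(nonfraud)))
    picked.extend(rng.choice(idx, size=min(n_c, len(idx)), replace=False))

balanced_nonfraud = nonfraud[picked]
print(len(balanced_nonfraud))
```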
Feature Pre‑processing
Remove features with >50 % missing rate.
Filter out features with contribution < 0.001 to the target.
Drop features with Population Stability Index (PSI) > 0.2 across time periods.
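The PSI check can be implemented by bucketing a feature on the baseline period's quantiles and comparing bucket frequencies; this is one common formulation, with synthetic data standing in for real feature values:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index of one feature between two periods."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Clip the comparison sample into the baseline's range so every
    # value falls into some bucket.
    actual = np.clip(actual, edges[0], edges[-1])
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(1)
stable = psi(rng.normal(size=5000), rng.normal(size=5000))
shifted = psi(rng.normal(size=5000), rng.normal(loc=1.0, size=5000))
print(stable, shifted)
```

A stable feature stays well under the 0.2 threshold; one whose distribution drifts by a full standard deviation exceeds it and would be dropped.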
Model Training
Random Forest is employed for classification because it handles high‑dimensional data without extensive feature selection, provides out‑of‑bag estimates of generalization error, trains quickly in parallel, and resists over‑fitting. Hyper‑parameter tuning (max_depth, numTrees, etc.) is performed over a parameter grid.
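The grid search can be sketched with scikit-learn (which spells Spark-ML's `numTrees` as `n_estimators`); the dataset and grid values are placeholders for the real pipeline features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in dataset; in production the features come from the
# engineering and pre-processing steps above.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

# Grid over the hyper-parameters named in the text.
grid = {"max_depth": [5, 10], "n_estimators": [50, 100]}
search = GridSearchCV(RandomForestClassifier(random_state=0, n_jobs=-1),
                      grid, cv=3, scoring="roc_auc")
search.fit(X_tr, y_tr)

print(search.best_params_)
print(round(search.score(X_te, y_te), 3))   # held-out ROC AUC
```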
Model Monitoring
Stability monitoring: compare PSI of features between training and inference periods, visualizing daily alerts.
Interpretability: compute Shapley values for each prediction to understand feature impact, visualized on a monitoring dashboard.
3.3 Click Sequence Anomaly Detection
3.3.1 Background
Analyze hourly click sequences per device to uncover malicious behaviors, such as users who only click during 0‑6 am or exhibit unusually balanced hourly activity.
3.3.2 Construction Process
Feature Construction: for each device, aggregate hourly click counts over the past 1, 7, and 30 days, forming 1×24, 7×24, and 30×24 vectors.
Model Selection: Isolation Forest, based on the assumptions that anomalies are rare and have feature values far from those of normal points.
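The 7×24 aggregation can be sketched with pandas, assuming a minimal (device_id, ts) click-log schema and synthetic timestamps:

```python
import numpy as np
import pandas as pd

# Hypothetical click log for one device over a 7-day span.
rng = np.random.default_rng(3)
ts = pd.to_datetime("2022-06-01") + pd.to_timedelta(
    rng.integers(0, 7 * 24 * 3600, size=500), unit="s")
log = pd.DataFrame({"device_id": "d1", "ts": ts})

# 7x24 matrix: rows = days, columns = hours 0..23.
counts = (log.assign(day=log.ts.dt.date, hour=log.ts.dt.hour)
             .groupby(["day", "hour"]).size()
             .unstack(fill_value=0)
             .reindex(columns=range(24), fill_value=0))
matrix = counts.to_numpy()    # shape (7, 24)
flat = matrix.flatten()       # 1x168 feature vector fed to the model
print(matrix.shape, flat.sum())
```

The 1×24 and 30×24 vectors come from the same aggregation over 1- and 30-day spans.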
Training (example parameters):
Whole‑platform traffic: contamination=0.05
Per‑media‑type traffic: contamination=0.1
Per‑ad‑slot‑type traffic: contamination=0.1
Scoring and Filtering
Anomaly score ≈ 1 indicates a definite anomaly; < 0.5 indicates normal.
Score > 0.7 → high‑risk users; 0.5–0.7 → medium‑risk, sent for manual review.
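Training and scoring can be sketched with scikit-learn's IsolationForest on synthetic hourly-click vectors; note that sklearn's `score_samples` returns the opposite of the anomaly score defined in the original Isolation Forest paper, so it is negated here to match the 0-to-1 thresholds above. The traffic patterns and Poisson rates are invented:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)

# Hypothetical 24-dim hourly click vectors: mostly daytime-active
# devices, plus a few devices clicking only between 0 and 6 am.
day_lam = np.r_[np.zeros(6), 5 * np.ones(18)]     # active 6am-midnight
night_lam = np.r_[50 * np.ones(6), np.zeros(18)]  # active 0-6 am only
X = np.vstack([rng.poisson(lam=day_lam, size=(500, 24)),
               rng.poisson(lam=night_lam, size=(10, 24))])

# contamination=0.05, as used for whole-platform traffic in the text.
clf = IsolationForest(contamination=0.05, random_state=0).fit(X)

# Negate score_samples to recover the paper's 0-to-1 anomaly score.
scores = -clf.score_samples(X)
high_risk = np.flatnonzero(scores > 0.7)             # auto-penalize
medium_risk = np.flatnonzero((scores > 0.5) & (scores <= 0.7))  # review
print(len(high_risk), len(medium_risk))
```

The night-only devices score well above the bulk of daytime traffic, landing in the high- or medium-risk buckets for penalty or manual review.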
Case Studies
Case ① (2022‑XX‑XX): Device A shows consistently high hourly clicks across 7 days, far exceeding normal patterns.
Case ② (2022‑XX‑XX): Device B clicks only during midnight hours, with no activity during the day.
4. Summary
In the ad traffic anti‑fraud field, evolving adversarial techniques demand robust algorithmic models. Both supervised and unsupervised methods have been explored to detect fraudulent impressions, clicks, and installations, significantly improving detection capability and uncovering complex abnormal behavior patterns. Future work will continue to expand model applications for machine‑generated traffic identification.