
Graph-Based Anti-Fraud: Gang Mining and Node Representation Using Graph Neural Networks

To curb large‑scale, organized fraud on Baidu’s platform, the Account Security team built a scalable heterogeneous graph framework that links accounts, feature factors, and devices across billions of nodes and edges. The team trains GraphSAGE‑based node embeddings with a link‑prediction task and uses them to uncover fraud gangs; adding these embeddings to an XGBoost classifier raised gang‑classification accuracy above 90%.

Baidu Geek Talk

The rapid expansion of the internet black market in China has led to large‑scale, industrialized cheating, much of it aimed at account systems. To strengthen Baidu account security and protect the user experience, the Baidu Account Security Strategy team built a scalable, extensible graph‑based anti‑fraud framework that uses graph neural networks (GNNs) for risk control.

According to the "China Internet Development Report," China had 1.011 billion internet users by June 2021. This massive user base fuels a black and gray market worth over a trillion yuan, which relies on tactics such as fake orders, coupon harvesting, traffic redirection, fraud, and money laundering. These activities cause financial loss, degrade user experience, and threaten business sustainability.

To combat organized fraud, the team constructed a graph‑based architecture focused on account‑level data. Traditional methods rely on statistical feature filtering, which can only identify isolated suspicious accounts and fails to reveal the full fraud gang. The new approach builds a heterogeneous graph that connects accounts, feature factors, and devices, enabling comprehensive gang mining.
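The graph construction can be sketched in a few lines. This is a minimal illustration, not the team's implementation: the record layout, the node and edge type names, and the use of union‑find connected components as "candidate gangs" are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical records: each account lists the feature factors and devices it was seen with.
RECORDS = {
    "acc_1": {"factors": ["f_ip_a"], "devices": ["dev_x"]},
    "acc_2": {"factors": ["f_ip_a"], "devices": ["dev_y"]},
    "acc_3": {"factors": ["f_ip_b"], "devices": ["dev_y"]},
    "acc_4": {"factors": ["f_ip_c"], "devices": ["dev_z"]},
}


def build_hetero_edges(records):
    """Return typed edges (source, target, edge_type); nodes are (node_type, id) pairs."""
    edges = []
    for account, attrs in records.items():
        for factor in attrs["factors"]:
            edges.append((("account", account), ("factor", factor), "uses_factor"))
        for device in attrs["devices"]:
            edges.append((("account", account), ("device", device), "uses_device"))
    return edges


def find(parent, node):
    """Union-find lookup with path halving."""
    parent.setdefault(node, node)
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def candidate_gangs(edges):
    """Group accounts that are connected through any shared factor or device."""
    parent = {}
    for src, dst, _edge_type in edges:
        root_src, root_dst = find(parent, src), find(parent, dst)
        if root_src != root_dst:
            parent[root_src] = root_dst
    groups = defaultdict(set)
    for node in list(parent):
        if node[0] == "account":
            groups[find(parent, node)].add(node[1])
    return sorted(sorted(group) for group in groups.values())


GANGS = candidate_gangs(build_hetero_edges(RECORDS))
```

Here `acc_1` and `acc_2` share a feature factor, and `acc_2` and `acc_3` share a device, so all three land in one component even though `acc_1` and `acc_3` have nothing directly in common. That transitive linking is what per‑account feature filtering cannot see.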

Figure‑1 (left) shows a case where an account is linked to many feature factors and a device; the right side visualizes the corresponding graph structure (accounts in blue, feature factors in red, devices in green). While this reveals a set of suspicious accounts, extracting the entire gang still requires substantial effort. Figure‑2 demonstrates that the examined account is merely the tip of an iceberg within a much larger fraudulent network.

The team designed a multi‑scenario graph framework (Figure‑3) that supports daily, weekly, and monthly granularity, and combines homogeneous and heterogeneous sub‑graphs. It handles billions of nodes and edges, thanks to a redesigned algorithmic pipeline that offers high scalability and easy configuration for new business scenarios.
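One way to picture the daily, weekly, and monthly granularity is as a time‑window filter over timestamped edges. The sketch below is an assumption for illustration only: the window lengths (in particular 30 days for "monthly"), the edge tuple layout, and the function name are hypothetical and not taken from the framework.

```python
from datetime import date, timedelta

# Hypothetical window lengths for the three granularities named in the text.
WINDOW_DAYS = {"daily": 1, "weekly": 7, "monthly": 30}


def edges_in_window(edges, granularity, end_day):
    """Keep edges observed within the window of `granularity` ending on `end_day` (inclusive)."""
    start_day = end_day - timedelta(days=WINDOW_DAYS[granularity] - 1)
    return [(u, v) for u, v, day in edges if start_day <= day <= end_day]


# Toy edges: (account, device, day the relation was observed).
EDGES = [
    ("acc_1", "dev_x", date(2021, 6, 30)),
    ("acc_2", "dev_x", date(2021, 6, 26)),
    ("acc_3", "dev_y", date(2021, 6, 5)),
]
END_DAY = date(2021, 6, 30)
SIZES = {g: len(edges_in_window(EDGES, g, END_DAY)) for g in WINDOW_DAYS}
```

Longer windows pull in more relations, which is why the monthly graph is both the most complete and the most likely to mix unrelated accounts.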

Despite its power, the graph approach faces three challenges. Hard relationships (e.g., shared devices) can produce false positives. Dirty data and long time spans can produce massive graphs that contain both malicious and benign accounts. Finally, gangs often share resources with one another, which blurs the boundaries between them and complicates analysis.

To address these issues, the team introduced node representation learning. Each account node is encoded into a fixed‑dimensional vector that captures both its intrinsic features and its structural context (neighbors and edge types). Various models were evaluated, including DeepWalk, LINE, node2vec, GCN, GAT, GraphSAGE, and PinSAGE.

Because account features are sparse and the graph is massive and lacks explicit labels, the team trained a customized GraphSAGE model on a link‑prediction task. The workflow samples two‑hop neighbors via random walks, aggregates them through a two‑layer GraphSAGE, and combines the two target‑node embeddings via an inner product to predict whether a link exists. Training uses mini‑batch stochastic gradient descent with a binary cross‑entropy loss. The architecture is illustrated in Figure‑4.
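The forward pass of this workflow can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the team's customized model: the toy graph, feature values, layer sizes, random weights, the sum aggregator, and the choice to leave the final layer linear are all hypothetical, and training is omitted.

```python
import random


def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def sum_vectors(vectors, dim):
    """Element-wise sum of vectors (a 'sum' aggregator)."""
    total = [0.0] * dim
    for vec in vectors:
        total = [t + x for t, x in zip(total, vec)]
    return total


def sample_two_hop(adj, node, num_walks, rng):
    """Run length-2 random walks from `node`; returns {hop1: [hop2, ...]}.
    A walk may step back to `node` itself on its second hop."""
    tree = {}
    for _ in range(num_walks):
        first = rng.choice(adj[node])
        second = rng.choice(adj[first])
        tree.setdefault(first, []).append(second)
    return tree


def sage_layer(self_vec, neighbor_vecs, w_self, w_neigh, activate=True):
    """One GraphSAGE-style layer: W_self * h_v + W_neigh * sum(h_u), optional ReLU."""
    agg = sum_vectors(neighbor_vecs, len(self_vec))
    out = [a + b for a, b in zip(matvec(w_self, self_vec), matvec(w_neigh, agg))]
    return [max(0.0, x) for x in out] if activate else out


def embed(adj, feats, node, params, num_walks, rng):
    """Two-layer embedding of `node` from its sampled two-hop neighborhood."""
    (w1_self, w1_neigh), (w2_self, w2_neigh) = params
    tree = sample_two_hop(adj, node, num_walks, rng)
    # Layer 1: hidden state of each sampled 1-hop neighbor and of the target.
    hidden = [
        sage_layer(feats[h1], [feats[h2] for h2 in hop2], w1_self, w1_neigh)
        for h1, hop2 in tree.items()
    ]
    target_hidden = sage_layer(feats[node], [feats[h1] for h1 in tree], w1_self, w1_neigh)
    # Layer 2: combine the target's hidden state with its neighbors' hidden states.
    return sage_layer(target_hidden, hidden, w2_self, w2_neigh, activate=False)


def init_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def demo_embedding(seed):
    """Embed account a1 in a toy undirected account/factor/device graph."""
    rng = random.Random(seed)
    adj = {"a1": ["f1", "d1"], "a2": ["f1"], "a3": ["d1"],
           "f1": ["a1", "a2"], "d1": ["a1", "a3"]}
    feats = {"a1": [1.0, 0.0], "a2": [0.9, 0.1], "a3": [0.0, 1.0],
             "f1": [0.5, 0.5], "d1": [0.2, 0.8]}
    params = [(init_matrix(3, 2, rng), init_matrix(3, 2, rng)),   # 2 -> 3
              (init_matrix(2, 3, rng), init_matrix(2, 3, rng))]   # 3 -> 2
    return embed(adj, feats, "a1", params, num_walks=4, rng=rng)
```

The final layer is kept linear in this sketch so that embeddings can take negative values; with a ReLU output, every inner product would be non‑negative and a sigmoid score could never drop below 0.5.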

The final link‑prediction score is computed as score = σ(e_i · e_j), where e_i and e_j are the embeddings of the two endpoint nodes, · is the inner product, and σ denotes the sigmoid function.
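As a worked illustration of the score and the binary cross‑entropy loss it is trained with, the following sketch evaluates both on hand‑picked vectors. The example embeddings, the labeling of pairs as positive (1) or negative (0), and the clipping constant are assumptions for the example; how the team samples negative pairs is not described in the source.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def link_score(e_i, e_j):
    """score = sigmoid(e_i . e_j)"""
    return sigmoid(sum(a * b for a, b in zip(e_i, e_j)))


def bce_loss(batch, eps=1e-12):
    """Mean binary cross-entropy over (e_i, e_j, label) triples; label is 1 for a real edge."""
    total = 0.0
    for e_i, e_j, label in batch:
        p = min(max(link_score(e_i, e_j), eps), 1.0 - eps)
        total += -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
    return total / len(batch)


# Toy mini-batch: one observed edge (aligned vectors) and one non-edge (opposed vectors).
BATCH = [
    ([2.0, 0.0], [2.0, 0.0], 1),
    ([2.0, 0.0], [-2.0, 0.0], 0),
]
LOSS_GOOD = bce_loss(BATCH)
LOSS_FLIPPED = bce_loss([(e_i, e_j, 1 - label) for e_i, e_j, label in BATCH])
```

Orthogonal embeddings give a score of exactly 0.5, aligned embeddings push the score toward 1, and opposed embeddings push it toward 0, so the loss is small when the labels agree with the geometry (`LOSS_GOOD`) and large when they are flipped (`LOSS_FLIPPED`).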

For benchmarking, the team also trained MLP and vanilla GCN models under identical hyper‑parameters. Visualizations using t‑SNE and UMAP (Figures 5‑7) show that embeddings from GraphSAGE with the sum aggregator separate the top‑25 gangs into markedly cleaner clusters, indicating superior discriminative power.

With the learned node embeddings, downstream tasks such as link prediction, node classification, clustering, and gang representation become feasible. In a production gang‑characterization scenario (deciding the nature of a mined gang), augmenting an XGBoost classifier with node‑embedding features raised classification accuracy above 90%.
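How embeddings might enter a gang‑level classifier can be sketched as a feature‑building step. The source does not say how member embeddings are combined, so the mean pooling, the statistical features, and all names below are assumptions; the resulting rows would then be passed to a classifier such as XGBoost, which is left out here to keep the example self‑contained.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(vec[k] for vec in vectors) / len(vectors) for k in range(dim)]


def gang_feature_row(member_ids, embeddings, stats):
    """Concatenate hand-crafted gang statistics with the pooled member embedding."""
    pooled = mean_pool([embeddings[m] for m in member_ids])
    return list(stats) + pooled


# Hypothetical learned account embeddings.
EMBEDDINGS = {"acc_1": [0.2, 0.8], "acc_2": [0.4, 0.6], "acc_3": [0.9, 0.1]}

# Hypothetical statistics: [member count, share of members registered in the last 7 days].
ROW = gang_feature_row(["acc_1", "acc_2"], EMBEDDINGS, [2.0, 0.5])
```

Each gang becomes one fixed‑length row regardless of its size, which is what a tree‑based classifier needs as input.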

Future work includes designing downstream tasks for massive gangs, improving GPU‑efficient embedding generation, enhancing model generalization, and advancing graph sampling, visualization, and real‑time processing techniques.

References: [1] Perozzi et al., DeepWalk (KDD 2014). [2] Tang et al., LINE (WWW 2015). [3] Grover & Leskovec, node2vec (KDD 2016). [4] Kipf & Welling, GCN (arXiv 2016). [5] Veličković et al., GAT (arXiv 2017). [6] Hamilton et al., GraphSAGE (NeurIPS 2017). [7] Ying et al., PinSAGE (KDD 2018). [8] Chen et al., XGBoost (R package 2015).

Tags: Machine Learning, anti‑fraud, Graph Neural Networks, risk control, graph mining, node embedding