
Unsupervised Algorithms for Fraud Detection in Huya's Risk Control System

This article presents Huya's exploration of unsupervised learning techniques for risk control, detailing business risk scenarios, black‑market attack vectors, limitations of traditional defenses, and the design, implementation, and evaluation of graph‑based and density‑based clustering methods to automatically discover and mitigate fraudulent user groups.

DataFunTalk

Huya's risk control faces multiple business risks such as marketing‑activity cheating, content violations, fake traffic, and recharge fraud, which are exploited by black‑market attackers using large‑scale account registration, device farms, IP spoofing, and automated tools.

Traditional defenses—expert rules, black/white lists, and supervised models—are reactive, struggle with group attacks, and generate high false‑positive rates.

To achieve proactive detection, Huya applies unsupervised algorithms that do not rely on labeled data, instead mining global data associations to uncover new attack patterns.

The system framework collects structured and unstructured data within a time window, performs preprocessing and feature engineering, and feeds the data into an unsupervised learning engine. The engine outputs identified fraud gangs, which are merged with existing gang records, scored for risk, and used for three purposes: automated rule generation, human review for false‑positive reduction, and as features for downstream supervised scoring models.

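Similarity between users is computed with a weighted Jaccard measure over their features (the corresponding distance is one minus the similarity). Feature weights are derived automatically from each feature's frequency and distribution, so rare, discriminating features such as IP address receive higher weights than common ones such as the city associated with a phone number.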
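The weighting idea can be illustrated with a minimal sketch. It assumes each user is represented as a set of (field, value) pairs and that a value's weight is a simple inverse-frequency term; the exact weighting formula used at Huya is not given in the source, so `feature_weights` and the sample data below are hypothetical.

```python
import math
from collections import Counter

def feature_weights(users):
    """Assumed weighting: log inverse frequency, so rarer values weigh more."""
    n = len(users)
    counts = Counter(item for feats in users.values() for item in feats)
    return {item: math.log(1 + n / c) for item, c in counts.items()}

def weighted_jaccard(a, b, weights):
    """Weighted Jaccard similarity: weight of shared items / weight of all items."""
    inter = sum(weights[f] for f in a & b)
    union = sum(weights[f] for f in a | b)
    return inter / union if union else 0.0

# Hypothetical users: everyone shares the city, only u1 and u2 share an IP.
users = {
    "u1": {("ip", "1.2.3.4"), ("city", "Guangzhou")},
    "u2": {("ip", "1.2.3.4"), ("city", "Guangzhou")},
    "u3": {("ip", "5.6.7.8"), ("city", "Guangzhou")},
    "u4": {("ip", "9.9.9.9"), ("city", "Guangzhou")},
}
w = feature_weights(users)
sim_12 = weighted_jaccard(users["u1"], users["u2"], w)  # share IP and city
sim_13 = weighted_jaccard(users["u1"], users["u3"], w)  # share only the city
dist_13 = 1.0 - sim_13  # distance form, usable by a clustering algorithm
```

Because the city is shared by every user, it contributes little weight, so users who share only the city end up far less similar than users who also share an IP.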

Two families of clustering algorithms are evaluated:

Graph‑based methods: Louvain (greedy modularity optimization) and Infomap (an information‑theoretic method that minimizes the description length of random walks on the graph).

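Density‑based methods: DBSCAN, which clusters points using a neighborhood radius ε and a minimum number of points n per neighborhood, and OPTICS, which reduces DBSCAN's sensitivity to ε by producing a reachability ordering from which clusters at different density levels can be extracted.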
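To make the roles of ε and the minimum point count concrete, here is a minimal pure-Python sketch of DBSCAN over a precomputed distance matrix. It is an illustration under assumptions, not the production implementation: the function name `dbscan`, the toy 1-D points, and the parameter values are hypothetical. In the setting described above, the matrix would hold one minus the weighted Jaccard similarity between users.

```python
def dbscan(dist, eps, min_pts):
    """Cluster from a square distance matrix. Returns labels; -1 means noise.

    A point is a core point if at least min_pts points (itself included)
    lie within distance eps of it.
    """
    n = len(dist)
    labels = [None] * n  # None = not yet visited
    cluster = -1

    def neighbors(i):
        return [j for j in range(n) if dist[i][j] <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue  # already assigned; border points are not expanded
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:  # j is itself a core point: keep expanding
                queue.extend(nb)
    return labels

# Toy data: two tight groups and one isolated point.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 9.0]
dist = [[abs(a - b) for b in points] for a in points]
labels = dbscan(dist, eps=0.15, min_pts=2)
```

The isolated point is labeled -1 (noise), which in a risk-control setting would correspond to a user who does not belong to any dense group. A single fixed ε is the limitation that OPTICS is meant to relax.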

Gang scores combine aggregation degree (tightness of member features) and risk level (presence of high‑risk attributes such as IP or device fingerprints).
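One way such a score could be assembled is sketched below. The source does not give the formula, so everything here is an assumption for illustration: aggregation degree is taken as the mean share of the most common value per field, risk level as the highest risk weight among fields that most members share, and the two are blended linearly. `HIGH_RISK_FIELDS`, `DEFAULT_RISK`, `alpha`, and `min_share` are hypothetical names and values.

```python
from collections import Counter

HIGH_RISK_FIELDS = {"ip": 1.0, "device_id": 1.0}  # hypothetical risk weights
DEFAULT_RISK = 0.2                                 # hypothetical weight for other fields

def top_share(members, field):
    """Fraction of members holding the most common value of a field."""
    values = Counter(m[field] for m in members if field in m)
    return values.most_common(1)[0][1] / len(members)

def aggregation_degree(members):
    """Mean top-value share across fields (1.0 = all members identical)."""
    fields = sorted({f for m in members for f in m})
    if not fields:
        return 0.0
    return sum(top_share(members, f) for f in fields) / len(fields)

def risk_level(members, min_share=0.5):
    """Highest risk weight among fields shared by at least min_share of members."""
    best = 0.0
    for f in {f for m in members for f in m}:
        if top_share(members, f) >= min_share:
            best = max(best, HIGH_RISK_FIELDS.get(f, DEFAULT_RISK))
    return best

def gang_score(members, alpha=0.5):
    """Assumed linear blend of tightness and risk."""
    return alpha * aggregation_degree(members) + (1 - alpha) * risk_level(members)

# Hypothetical gangs: one shares an IP, the other shares only a city.
gang_a = [{"ip": "1.2.3.4", "city": c} for c in ("A", "B", "C")]
gang_b = [{"ip": ip, "city": "A"} for ip in ("1.1.1.1", "2.2.2.2", "3.3.3.3")]
score_a = gang_score(gang_a)
score_b = gang_score(gang_b)
```

Both toy gangs are equally tight, but the one bound together by a shared IP scores higher because the shared attribute is a high-risk one.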

Automatic interception leverages gang risk level and size to block high‑risk, large‑scale registrations, while lower‑risk groups are monitored; business behavior (e.g., mass coupon harvesting) further guides actions.

Explainability is achieved through frequent‑itemset mining, revealing the most common attribute combinations that characterize each gang.
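The idea can be sketched with a brute-force frequent-itemset count, assuming each gang member is a set of "field=value" strings. This is not Apriori or FP-Growth, just exhaustive enumeration of small combinations, which is adequate for a toy example; `frequent_itemsets`, `min_support`, `max_size`, and the sample gang are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(members, min_support=0.75, max_size=3):
    """Return (itemset, support) pairs with support >= min_support.

    Results are ordered with larger itemsets first, then by higher support,
    so the most specific common pattern appears at the top.
    """
    n = len(members)
    counts = Counter()
    for feats in members:
        items = sorted(feats)
        for k in range(1, max_size + 1):
            counts.update(combinations(items, k))
    frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
    return sorted(frequent.items(), key=lambda kv: (-len(kv[0]), -kv[1], kv[0]))

# Hypothetical gang resembling the device-tampering pattern described in the article.
gang = [
    {"os=Android 6", "model=X1", "charging=yes", "city=A"},
    {"os=Android 6", "model=X1", "charging=yes", "city=B"},
    {"os=Android 6", "model=X1", "charging=yes", "city=C"},
    {"os=Android 6", "model=X2", "charging=yes", "city=A"},
]
patterns = frequent_itemsets(gang)
top_itemset, top_support = patterns[0]
```

Here the top pattern is the combination of charging state, device model, and OS version shared by three of the four members, which is the kind of summary a reviewer could read to understand why the group was flagged.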

Empirical results show two real‑world fraud gangs discovered by the unsupervised pipeline. The first has geographically dispersed IPs but phone numbers that share a common country code. The second consists of devices that uniformly report new device models, old OS versions, and batteries that are constantly charging, a combination that indicates device‑tampering attacks.

The presentation concludes with thanks to the audience.

Clustering, fraud detection, AI, unsupervised learning, risk control, similarity, Huya
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
