
Risk Control System for Detecting Game Account Fraud Using Feature Engineering and Graph Database

The article describes a risk‑control pipeline for detecting high‑volume fraudulent game accounts, detailing data collection from game logs, extensive feature engineering and statistical tests, enrichment via a Neo4j knowledge graph, and a hybrid RandomForest‑GBDT model combined with methods to filter personal accounts.

37 Interactive Technology Team

In many of our games there is a large amount of "small-account" (刷小号) farming: players create auxiliary accounts, acquire in‑game currency (元宝) through rebates or sign‑in bonuses, and then sell that currency to main accounts at a markup. This behavior severely damages the game ecosystem, making a risk‑control system for player accounts a pressing need.

The primary offenders are organized studios, with a few individual accounts. (Sensitive business details are omitted; the focus is on methodology.)

Key Definitions

• 刷量用户 (high‑volume user) : At least 90% of the user's currency outflow over the following 45 days occurs through marked‑up sales.

• Input : Behavioral features of a user over the past week.

• Output : Probability that the user is a high‑volume (fraud) user.
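The label definition above can be sketched in code. This is a minimal pandas sketch; the column names (`user_id`, `ts`, `amount`, `is_markup_sale`) are assumptions for illustration, not the article's actual schema:

```python
from datetime import timedelta

import pandas as pd


def label_high_volume(outflows: pd.DataFrame, start, window_days: int = 45,
                      threshold: float = 0.90) -> pd.Series:
    """Label a user 1 if marked-up sales account for >= `threshold` of
    currency outflow within `window_days` of `start`.

    Assumed columns: user_id, ts (timestamp), amount (currency moved),
    is_markup_sale (1 if the outflow was a marked-up sale).
    """
    end = start + timedelta(days=window_days)
    win = outflows[(outflows["ts"] >= start) & (outflows["ts"] < end)].copy()
    # Zero out non-markup rows so a plain groupby-sum gives the numerator.
    win["markup"] = win["amount"].where(win["is_markup_sale"] == 1, 0.0)
    agg = win.groupby("user_id")[["markup", "amount"]].sum()
    return (agg["markup"] / agg["amount"] >= threshold).astype(int)
```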

1. Data Collection

To build an effective risk‑control system, data must first be gathered.

• Game basic data : Low‑sensitivity features such as online duration, activity level, sign‑in reward collection, etc.

• Game SDK data : Activation, login, server selection, character creation, etc., stored in Hive. Example extraction includes weekday login counts and holiday login counts via Hive SQL.
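The weekday/holiday login-count split described above can equally be computed downstream of Hive in pandas. A sketch, assuming a login table with `user_id` and a date column `dt`, and treating weekends as holidays purely for illustration (the real pipeline would use a statutory-holiday calendar):

```python
import pandas as pd


def login_counts(logins: pd.DataFrame, holidays: set) -> pd.DataFrame:
    """Per-user weekday vs holiday login counts.

    Assumed columns: user_id, dt (datetime of login). `holidays` is a set
    of pd.Timestamp dates; weekends are also counted as holidays here.
    """
    df = logins.copy()
    is_holiday = df["dt"].isin(holidays) | (df["dt"].dt.dayofweek >= 5)
    df["bucket"] = is_holiday.map({True: "holiday_logins",
                                   False: "weekday_logins"})
    # One row per (user, bucket); unstack into two count columns.
    return df.groupby(["user_id", "bucket"]).size().unstack(fill_value=0)
```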

2. Feature Engineering

2.1 Feature Derivation

Example: "number of distinct recharge amounts". Since most fraud users obtain currency via monthly‑card rebates, they typically have few distinct recharge amounts. The following query extracts this feature (the distribution plot from the original article is omitted):

-- Example Hive SQL (illustrative)
SELECT user_id, COUNT(DISTINCT recharge_amount) AS recharge_cnt
FROM recharge_table
GROUP BY user_id;

2.2 Feature Crossing

Example: Ratio of weekday to holiday online duration. Normal players tend to play longer on holidays, whereas studios concentrate activity on weekdays.
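A minimal sketch of this feature cross, assuming per-user duration columns named `weekday_minutes` and `holiday_minutes` (names are illustrative). The smoothing constant guards against division by zero for users with no holiday playtime:

```python
import pandas as pd


def weekday_holiday_ratio(df: pd.DataFrame, eps: float = 1.0) -> pd.Series:
    """Cross two duration features into one ratio. Studios concentrate
    activity on weekdays, so large values hint at fraud. `eps` smooths
    users with zero holiday duration."""
    return df["weekday_minutes"] / (df["holiday_minutes"] + eps)
```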

2.3 Chi‑square test for discrete features

Using the "signed in last week" feature, a contingency table against the fraud label is built and a chi‑square test applied (no Yates correction is needed given the large sample size). The statistic follows a χ² distribution with (r−1)(c−1) degrees of freedom (1 for a 2×2 table), yielding a p‑value for significance.
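The test can be reproduced with SciPy. The contingency counts below are made up for illustration; they are not the article's data:

```python
from scipy.stats import chi2_contingency

# Rows: normal users, fraud users.
# Columns: signed in last week, did not sign in.
table = [[800, 200],
         [ 50, 450]]

# correction=False mirrors the article: no Yates correction at this
# sample size. Returns the statistic, p-value, dof, and expected counts.
chi2, p, dof, expected = chi2_contingency(table, correction=False)
```

A small p-value here would justify keeping "signed in last week" as a discriminative feature.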

2.4 Correlation analysis for continuous features

Highly correlated feature pairs are pruned to avoid multicollinearity (near rank deficiency in the design matrix), and features with negligible correlation to the label are discarded.
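One possible implementation of this two-step filter; the thresholds are illustrative defaults, not values from the article:

```python
import numpy as np
import pandas as pd


def drop_correlated(X: pd.DataFrame, y: pd.Series,
                    feat_thresh: float = 0.95,
                    label_thresh: float = 0.02) -> pd.DataFrame:
    """Drop one feature of each near-duplicate pair, then drop features
    with negligible absolute Pearson correlation to the label."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [c for c in upper.columns if (upper[c] > feat_thresh).any()]
    X = X.drop(columns=redundant)
    weak = [c for c in X.columns if abs(X[c].corr(y)) < label_thresh]
    return X.drop(columns=weak)
```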

2.5 Enriching data when features are insufficient

When the feature set is limited and business partners cannot provide more data, a graph database (Neo4j) is introduced to build a user knowledge graph. Nodes such as "User" and "Device" are connected by "Used" edges, enabling rapid queries over billions of records.

Example Cypher query to find fraud users linked to a specific device:

MATCH (u:User)-[r:Used]-(d:Device) WHERE d.name = "设备2" AND u.is_fraud = 1 RETURN u, r, d
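Offline, the same device-sharing signal can be turned into a model feature without querying Neo4j. A networkx sketch, under the assumption that nodes carry a `kind` attribute (`"user"` or `"device"`) and user nodes an `is_fraud` flag:

```python
import networkx as nx


def fraud_neighbors_via_device(g: nx.Graph, user: str) -> int:
    """Count distinct fraud users reachable from `user` through a shared
    device node, mirroring the Cypher query in one hop each way."""
    hits = set()
    for dev in g.neighbors(user):
        if g.nodes[dev].get("kind") != "device":
            continue
        for other in g.neighbors(dev):
            if other != user and g.nodes[other].get("is_fraud") == 1:
                hits.add(other)
    return len(hits)
```

A high count is itself a strong feature: sharing a device with known fraud accounts is rare for ordinary players.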

3. Model Construction

A hybrid model combining RandomForest and GBDT is employed. Tree‑based models are chosen for their interpretability and ability to separate samples efficiently.
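The article does not specify how the two ensembles are combined; one plausible scheme is stacking their predicted probabilities under a logistic-regression blender, sketched here on synthetic data as a stand-in for the behavioral feature table:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the user behavioral features and fraud label.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stack RandomForest and GBDT; the final estimator blends their
# out-of-fold probabilities. This is one possible hybridization, not
# necessarily the article's exact scheme.
model = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100,
                                              random_state=0)),
                ("gbdt", GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression(),
)
model.fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]  # fraud probability per user
```

Per-tree feature importances from either base model keep the pipeline explainable to business partners.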

4. Project Workflow

The overall pipeline includes data collection, feature engineering, model training, and deployment, as illustrated in the project flowchart (image omitted).

5. Summary and Outlook

The biggest challenges lie in feature engineering and data acquisition. To ensure model interpretability and high precision, knowledge‑graph‑derived features become essential. Future work may incorporate identity documents, payment card numbers, and additional in‑game relational features to further raise the cost of fraudulent account operations.

6. Methods to Filter Personal Small Accounts

Negative samples provided by the business contain many personal accounts whose features are not clearly distinguishable from normal players. Before model training, personal accounts should be removed, retaining only studio accounts as negative samples. Three simple approaches are introduced (details omitted).

7. Additional Analyses

• Distribution of total accumulated currency to differentiate personal from studio accounts (coarse).

• Clustering methods to group features and discard clusters with low negative‑sample ratios.

• Community‑detection algorithms to identify user groups; the size of a user's community helps infer whether the account is personal or studio‑controlled.
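As a rough stand-in for the community-detection step (the article's algorithm is unspecified), connected components of the user-device graph already yield a usable community-size feature; the `kind` node attribute is the same assumption as in the earlier sketch:

```python
import networkx as nx


def community_sizes(g: nx.Graph) -> dict:
    """Map each user to the number of users in its connected component.
    Large components suggest studio clusters; size-1 components suggest
    personal accounts."""
    sizes = {}
    for comp in nx.connected_components(g):
        users = [n for n in comp if g.nodes[n].get("kind") == "user"]
        for u in users:
            sizes[u] = len(users)
    return sizes
```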

Tags: machine learning, data mining, feature engineering, graph database, Neo4j, risk control, game fraud detection
Written by 37 Interactive Technology Team, 37 Interactive Technology Center