How Tencent Secures Game Operations: Real Cases, Challenges, and Data‑Driven Solutions
This article shares a comprehensive overview of game operation security at Tencent, covering personal background, real‑world incident cases, the inherent challenges of large‑scale game services, past monitoring efforts, and a new data‑driven alerting framework that dramatically reduces false alarms while protecting game economies.
1. Personal Introduction
I joined Tencent in 2008, initially working on DNF operations. As player concurrency grew, I built a Ruby‑based configuration management tool that generated server configuration and start/stop scripts, embodying early CMDB, automation, and auto‑generation concepts.
Over the years I have also been responsible for operations and management of multiple PC and mobile games such as DNF, Yulong, Feiche, and Huoying.
Currently I lead operation security in the Operations Department, covering application operation security, game economy security, and technical support for the audit team across all Tencent games.
2. Topic
Developing a successful game involves countless evaluations, data analyses, and optimizations. Once a game launches, the real challenge lies in the dynamic operation phase, where issues such as planning bugs, client‑side exploits, and internal mis‑operations can disrupt the game economy, cause public‑relations crises, or even force a re‑launch.
A healthy game economy requires stable producers, consumers, and well‑defined rules.
3. Operation Security Cases
3.1 Case 1
A shop allowed the client‑supplied purchase quantity to be set to a small negative number; the resulting integer underflow let players mass‑extract any item, piece of equipment, or gem.
Attackers can capture client/server packets, replay them, and repeatedly modify game data.
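To make the failure mode concrete, here is a minimal sketch of the kind of bug described in Case 1. The function names and numbers are illustrative, not Tencent's code: the point is that a server which trusts a client‑supplied quantity lets a negative value flip the cost calculation, crediting currency instead of charging it.

```python
def buy_unchecked(balance: int, price: int, qty: int) -> tuple[int, int]:
    """Vulnerable version: trusts the client-supplied qty."""
    cost = price * qty          # negative qty -> negative cost
    return balance - cost, qty  # balance INCREASES; items still granted


def buy_checked(balance: int, price: int, qty: int, max_qty: int = 99) -> tuple[int, int]:
    """Hardened version: validate the range server-side before charging."""
    if not 1 <= qty <= max_qty:
        raise ValueError(f"invalid purchase quantity: {qty}")
    cost = price * qty
    if cost > balance:
        raise ValueError("insufficient balance")
    return balance - cost, qty
```

With the unchecked version, `buy_unchecked(100, 10, -5)` returns a balance of 150: the player is paid to "buy" a negative quantity. The hardened version rejects the request outright, which is why quantity validation must live on the server, never only in the client UI.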
3.2 Case 2
A recharge‑rebate event was mistakenly configured with thousands of gift packs instead of the intended 100, letting players obtain a hundred times the intended value for the same payment.
3.3 Case 3
Two backend Redis clusters failed during a failover; one recovered but the other did not, causing partial data loss and letting players repeatedly claim recharge rewards that should have been claimable only once.
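The repeated‑claim symptom in Case 3 is ultimately an idempotency failure. Below is a hedged sketch of a "claim once" guard; a real deployment would use an atomic primitive such as Redis `SET key value NX` backed by durable replication, but here a plain dict stands in for the keyspace so the logic is self‑contained.

```python
class RewardClaims:
    """Idempotent once-only reward claims (in-memory stand-in for Redis)."""

    def __init__(self) -> None:
        self._claimed: dict[str, bool] = {}  # stand-in for the Redis keyspace

    def claim(self, player_id: str, reward_id: str) -> bool:
        """Return True only on the first claim; all repeats are rejected."""
        key = f"claim:{reward_id}:{player_id}"
        if key in self._claimed:      # SET ... NX would return nil here
            return False
        self._claimed[key] = True     # SET ... NX succeeded: grant the reward
        return True
```

The design point: if the claim record itself is lost (as in the failed cluster), the guard silently re‑grants rewards, so the claim store needs the same durability guarantees as the reward it protects.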
3.4 Case 4
A well‑known New Year bug let players complete a dungeon without consuming tickets, repeatedly earning high‑value items and experience, resulting in millions of RMB worth of imbalance.
3.5 A Thoughtful Reflection
Community members have discussed testing, mis‑operation prevention, early warning, and rapid response mechanisms.
3.6 Small Goal
Undisclosed bugs stay hidden until someone exploits them; because attackers can monetize these weaknesses, they pose the greatest threat to game‑economy security.
4. Challenges
4.1 Game Operation Security Challenges
The two main challenges are complexity and scale: a handful of problems is manageable, but at large scale many unknowns emerge.
4.2 Lengthy Operation Process
Every release passes many stages; issues can arise from code bugs, testing gaps, or operational mistakes such as failed failover causing unintended diamond generation.
4.3 Dynamic Operation
Games must continuously adjust to player behavior and monetization demands, leading to frequent version updates (about 400 per year) and new bugs.
4.4 Human “Carelessness”
Both accidental and intentional mis‑operations by internal staff can create exploitable vulnerabilities.
4.5 Massive Business Scale
Our CMDB contains over 1,000 entries, supporting games with record‑breaking PCU such as League of Legends and Honor of Kings.
Each title's service stack spans several tiers:
- Access layer
- Logic layer
- Storage layer
- Log platform
- Big data platform
5. Past Efforts
Since 2010 we built basic guarantees and monitoring alerts: standardized logs, added trace IDs, and established a rapid‑response incident handling process to avoid PR crises.
However, fixed‑threshold alerts generated massive noise (more than a thousand alerts per service per week), and operators stopped trusting them.
6. New Solution and Effects
6.1 New Idea
We treat the game as a society where most players follow normal statistical patterns. By continuously learning the dominant patterns (item flow per channel), we can detect outliers that indicate abuse.
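This "most players are statistically normal" idea can be illustrated with a small outlier check on per‑channel item flow. This is an assumed, simplified sketch (not Tencent's model): it uses a robust z‑score built from the median and the median absolute deviation, so a few abusers cannot drag the learned baseline toward themselves, and the threshold of 6.0 is an arbitrary illustrative value.

```python
import statistics


def flag_outliers(flows: dict[str, float], threshold: float = 6.0) -> list[str]:
    """Flag players whose item flow is far outside the population pattern.

    flows maps player id -> items produced through one channel in a period.
    Uses median + MAD so the baseline is robust to the abusers themselves.
    """
    values = list(flows.values())
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values) or 1.0  # avoid /0
    return [player for player, v in flows.items()
            if abs(v - med) / (1.4826 * mad) > threshold]
```

For example, if most characters produce around 100 items through a channel and one produces 10,000, only that character is flagged; the baseline itself barely moves because it is learned from the median, not the mean.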
6.2 Stage 1 – Monitoring Capability
We rebuilt the architecture, adding many monitoring models on top of a big‑data stack (Kafka, Spark, Elasticsearch, etc.) and a custom algorithm layer.
Typical workloads: 500 billion log entries per hour processed on 300 CPU cores, each running above 80 % utilization.
Key models:
Frequency anomaly: detects high‑frequency repeated requests that may indicate client‑side hacks.
Trend anomaly: monitors per‑character maximum production values that shift with events and versions.
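As a rough illustration of the frequency model, the sketch below counts one character's requests to an interface inside a sliding time window and flags rates no human player could produce. The window size and limit here are assumed values for illustration, not the production configuration.

```python
from collections import defaultdict, deque


class FrequencyMonitor:
    """Sliding-window request counter for frequency-anomaly detection."""

    def __init__(self, window_s: float = 10.0, max_hits: int = 50) -> None:
        self.window_s = window_s
        self.max_hits = max_hits
        self._hits: dict[str, deque] = defaultdict(deque)

    def record(self, key: str, now: float) -> bool:
        """Record one request; return True if the key is now anomalous."""
        q = self._hits[key]
        q.append(now)
        while q and now - q[0] > self.window_s:  # evict events outside window
            q.popleft()
        return len(q) > self.max_hits
```

A client‑side hack replaying a captured packet in a tight loop trips the limit within seconds, while normal play never approaches it; the trend model complements this by tracking values that legitimately shift with events and versions.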
6.3 Stage 2 – Alert Analysis Capability
We built tools to triage alerts, filter false positives, and provide detailed user‑level context for operators.
6.4 Stage 3 – Fine‑Grained Operations
By continuously refining alert rules and adding white‑list mechanisms, weekly alert counts dropped from thousands to single digits, earning trust from product and operation teams.
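A whitelist mechanism of the kind mentioned above can be sketched as a filter at the front of the alert pipeline: alerts whose (rule, subject) pair matches a known‑benign entry are suppressed before they reach operators. The rule and subject names below are hypothetical examples, not real entries.

```python
# Known-benign (rule, subject) pairs; in practice this would be maintained
# by operators as false positives are triaged.
WHITELIST = {
    ("trend_anomaly", "event_npc_shop"),    # e.g. a planned promotion
    ("frequency_anomaly", "gm_tool"),       # e.g. internal GM operations
}


def should_page(alert: dict) -> bool:
    """Suppress whitelisted alerts; page operators on everything else."""
    return (alert["rule"], alert["subject"]) not in WHITELIST
```

Each suppressed false positive both cuts noise immediately and encodes the triage decision, which is how weekly alert counts can fall from thousands to single digits without losing real detections.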
Detection rate now exceeds 88 % and continues to improve as new models are added.
The solution requires only minutes to integrate a new game, regardless of revenue or genre, and has been deployed across dozens of titles.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely read original technical articles. We focus on operations transformation and hope to accompany you throughout your operations career, growing together.