How Tencent Scaled QQ Red Packet to 100k QPS: Architecture & Lessons
This article details how Tencent analyzed its AMS system, estimated traffic, and redesigned the system for high availability during the QQ Spring Festival Red Packet event. It covers architecture mapping, scaling strategies, overload protection, flexible availability, disaster recovery, monitoring, and practical lessons learned.
Preface
The author, a senior engineer in Tencent's SNG Value-Added Products Department, introduces their work on QQ Membership privileges and the AMS system, referencing a previous article on scaling from millions to hundreds of millions of daily requests.
1. Business System Chain Analysis and Traffic Estimation
1.1 What is an operation activity
AMS is an activity‑operation platform. An activity consists of four pages shown in the diagram. While the pages themselves are simple, handling them at a scale of 150,000 requests per second becomes a challenging problem.
1.2 AMS System Overview
The AMS platform serves QQ Membership activities with a daily request volume of 4-12 billion and involves over 120 services. Peak traffic can reach 150,000 QPS. The system supports major Tencent businesses and has backed the QQ Spring Festival Red Packet activity for two years.
1.3 Challenges of the QQ Spring Festival Red Packet Activity
First: Estimating the traffic scale. Is it 100k QPS, 200k QPS, or more?
Second: Complex service-to-service call chains make capacity planning difficult.
Third: Ensuring the activity runs smoothly on the day despite unknown issues.
1.4 Architecture Mapping
A simplified diagram shows user interactions with sub‑systems, focusing on business‑level relationships rather than server‑level details.
For example, when a user clicks the red-packet claim button, the request flows through STGW (reverse proxy), the WEB server, Cmem, IDIP, and many downstream services. This mapping helps identify which components need to be scaled together.
1.5 Business Function Chain Decomposition
Simply buying more machines is insufficient; each stateful service must be fully understood and scaled.
1.6 Traffic Scale Estimation
Traffic estimation combines promotion exposure, conversion rates, and per-user request multipliers. For instance, if 20% of 100 exposed users click, each click may generate 3-4 static and 3-4 dynamic requests, giving a multiplier of about 3. The final estimate for the activity was stated as 120 QPS per user, multiplied by the number of concurrent users.
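The estimation method above can be sketched as simple arithmetic. This is a minimal illustration, not the team's actual model: the function name, the input figures (10 million exposures, a 60-second peak window), and the assumption that all clicks land inside one window are hypothetical.

```python
def estimate_peak_qps(exposed_users, click_rate,
                      dynamic_requests_per_click, peak_window_seconds):
    """Peak dynamic QPS if every click lands inside one peak window."""
    if peak_window_seconds <= 0:
        raise ValueError("peak_window_seconds must be positive")
    clicking_users = exposed_users * click_rate
    total_requests = clicking_users * dynamic_requests_per_click
    return total_requests / peak_window_seconds


# Illustrative inputs only: 10 million exposures, 20% click-through,
# a multiplier of 3 dynamic requests per click, and a 60-second peak.
peak_qps = estimate_peak_qps(10_000_000, 0.20, 3, 60)
print(f"estimated peak dynamic QPS: {peak_qps:,.0f}")
```

Static requests can be estimated the same way with their own multiplier, since they usually hit a different tier (CDN or static servers) than dynamic requests.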
1.7 AMS Scaling Assessment
Estimated required capacity was around 96k QPS (≈100k). However, some third‑party or cross‑BG services could not be scaled, requiring alternative solutions.
2. AMS High‑Availability Architecture Practices
2.1 Architecture Refactoring Based on the Red Packet Activity
Three main approaches were applied:
Asynchrony: For unscalable interfaces, produce requests quickly and consume them asynchronously.
Cache-First: Load large user lists (e.g., game users) into memory to avoid costly external queries.
Service Degradation: Non-critical modules (e.g., real-time reporting) can be skipped when overloaded.
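The three approaches above can be combined in one small sketch. Everything here is a stand-in chosen for illustration: the in-process queue replaces a real message queue, `game_users` is a hypothetical preloaded list, and `reporting_degraded` is an assumed manual switch.

```python
import queue
import threading

# Asynchrony: accept requests quickly, drain them at the downstream's pace.
delivery_queue = queue.Queue()
delivered = []


def slow_downstream_deliver(user_id):
    """Stand-in for an interface that cannot be scaled."""
    delivered.append(user_id)


def consumer():
    while True:
        user_id = delivery_queue.get()
        if user_id is None:  # sentinel: stop the worker
            break
        slow_downstream_deliver(user_id)


# Cache-first: a user list preloaded into memory replaces a remote query.
game_users = {"u1", "u3"}  # hypothetical preloaded data

# Degradation: skip a non-critical module while this switch is on.
reporting_degraded = True
reports = []


def handle_claim(user_id):
    if user_id not in game_users:
        return "not eligible"
    delivery_queue.put(user_id)  # returns immediately
    if not reporting_degraded:
        reports.append(user_id)  # real-time reporting is optional
    return "accepted"


worker = threading.Thread(target=consumer)
worker.start()
results = [handle_claim(u) for u in ("u1", "u2", "u3")]
delivery_queue.put(None)
worker.join()
print(results, delivered, reports)
```

The request handler never waits on the slow interface; the cost is that delivery becomes eventually consistent, which is why reconciliation matters later.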
2.2 Key Elements of Scaling Assessment
Assess QPS of each link, storage size, memory, and bandwidth. Static resources also need bandwidth calculation; a single page may trigger multiple static requests, multiplying the load.
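As a rough illustration of the bandwidth part of this assessment, the sketch below multiplies page views by static requests per page and by average response size. The function name and all input figures are assumptions made for the example, not numbers from the event.

```python
def required_bandwidth_gbps(page_views_per_second, requests_per_page,
                            avg_response_bytes):
    """Outbound bandwidth needed for static resources, in Gbit/s."""
    bytes_per_second = (page_views_per_second * requests_per_page
                        * avg_response_bytes)
    return bytes_per_second * 8 / 1e9


# Illustrative inputs: 50,000 page views/s, 4 static requests per page,
# 20,000 bytes per response on average.
static_gbps = required_bandwidth_gbps(50_000, 4, 20_000)
print(f"static bandwidth needed: {static_gbps:.1f} Gbit/s")
```

The same multiplication applies to storage and cache links: a modest per-request payload becomes a large aggregate once it is multiplied by peak QPS.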
2.3 Critical Points of High‑Availability Construction
Overload Protection: Ensure the system never crashes; partial degradation is acceptable.
Flexible Availability: Business logic can fall back (e.g., treat all users as new) when certain services fail.
Disaster Recovery: Separate critical and non-critical paths, and isolate them across clusters.
Monitoring & Data Reconciliation: Multi-dimensional monitoring of response codes, traffic spikes, and queue backlogs, plus automated reconciliation scripts.
2.4 Overload Protection Details
Promotion Layer: Reduce or stop ad traffic when the system is stressed.
Frontend Layer: Introduce deliberate request delays to smooth spikes.
CGI Layer: Filter and throttle backend requests.
Backend Service Layer: Limit calls to third-party APIs that cannot handle high QPS.
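Throttling at the CGI or backend layer is often done with a token bucket. The article does not say which algorithm AMS uses, so the sketch below is one common option under assumed limits (100 requests per second, burst of 100); `TokenBucket` and `call_third_party` are hypothetical names.

```python
import time


class TokenBucket:
    """Allow up to `rate_per_second` calls on average, with bounded bursts."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


limiter = TokenBucket(rate_per_second=100, capacity=100)


def call_third_party(request_id):
    if not limiter.allow():
        return "rejected"  # shed load rather than overload the dependency
    return f"forwarded:{request_id}"


print(call_third_party("r1"))
```

Rejected requests can be answered with a friendly "try again" page, which matches the goal that partial degradation is acceptable while a crash is not.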
2.5 Light‑Heavy Separation
Critical requests (e.g., order fulfillment) are isolated from lightweight ones (e.g., user info queries) to prevent interference.
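One way to picture this isolation inside a single process is to give each class of request its own worker pool, so a flood of lightweight queries cannot exhaust the workers reserved for critical ones. The article describes isolation at the service and cluster level; the thread pools, pool sizes, and function names below are illustrative assumptions only.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Pool sizes are illustrative assumptions, not figures from the article.
critical_pool = ThreadPoolExecutor(max_workers=8,
                                   thread_name_prefix="critical")
light_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="light")


def fulfill_order(order_id):
    return f"fulfilled:{order_id}", threading.current_thread().name


def query_user_info(user_id):
    return f"info:{user_id}", threading.current_thread().name


def dispatch(kind, arg):
    """Route each request type to its own pool."""
    if kind == "order":
        return critical_pool.submit(fulfill_order, arg)
    return light_pool.submit(query_user_info, arg)


order_result, order_thread = dispatch("order", 42).result()
info_result, info_thread = dispatch("user_info", "u1").result()
critical_pool.shutdown()
light_pool.shutdown()
print(order_result, order_thread, info_result, info_thread)
```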
2.6 Flexible Availability Techniques
Set ultra‑short timeout thresholds (e.g., 20 ms) and skip slow paths.
Use UDP for non‑critical logging where loss is acceptable.
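Both techniques can be sketched in a few lines. This is a minimal illustration: the fallback of treating the user as new follows the example given earlier in the article, while the helper names, the simulated slow dependency, and the log address `127.0.0.1:9999` are assumptions.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor

lookup_pool = ThreadPoolExecutor(max_workers=4)


def is_new_user(user_id, lookup, timeout_s=0.02):
    """Ask `lookup` whether the user is new; on timeout or error, assume yes."""
    future = lookup_pool.submit(lookup, user_id)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Timeout or lookup failure: fall back instead of blocking the
        # request. The abandoned lookup keeps running in its worker thread.
        return True


log_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)


def log_event(message, address=("127.0.0.1", 9999)):
    """Fire-and-forget UDP log line; delivery is not guaranteed."""
    try:
        log_socket.sendto(message.encode("utf-8"), address)
    except OSError:
        pass  # losing a log line is acceptable here


def slow_lookup(user_id):
    time.sleep(0.2)  # simulates a dependency slower than the 20 ms budget
    return False


fallback_answer = is_new_user("u42", slow_lookup)
log_event(f"claim user=u42 treated_as_new={fallback_answer}")
print(fallback_answer)
```

The trade-off is deliberate: a wrong answer for a few users or a lost log line is cheaper than a request that hangs while the whole chain backs up.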
2.7 Disaster Recovery: Critical Path
Routine DR: Persist messages to disk, Kafka, and an internal data center for redundancy.
Catastrophic DR: Cross-region deployment and network redundancy to survive data-center failures.
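The routine DR idea of writing each message to several independent destinations can be sketched as below. The sinks are in-memory stand-ins, not real disk, Kafka, or data-center clients, and the rule that one acknowledgement is enough (`min_acks=1`) is an assumption made for the example.

```python
def persist_redundantly(message, sinks, min_acks=1):
    """Write to every sink; succeed if at least `min_acks` accept."""
    acks = []
    for name, write in sinks.items():
        try:
            write(message)
            acks.append(name)
        except Exception:
            continue  # one failed sink must not block the others
    if len(acks) < min_acks:
        raise RuntimeError("message was not stored durably anywhere")
    return acks


# In-memory stand-ins for the three destinations named in the text.
local_disk, internal_store = [], []


def failing_kafka_write(message):
    raise ConnectionError("broker unavailable (simulated)")


sinks = {
    "local_disk": local_disk.append,
    "kafka": failing_kafka_write,
    "internal_store": internal_store.append,
}
acks = persist_redundantly({"user": "u1", "gift": "g7"}, sinks)
print(acks)
```

Any surviving copy can later be replayed to rebuild the missing deliveries, which is what makes the reconciliation step possible.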
2.8 Monitoring and Reconciliation
Monitor front‑end success rates, traffic fluctuations, gift‑delivery volumes, queue depths, and other dimensions. Use automated scripts to reconcile and re‑issue missed deliveries.
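A reconciliation script of this kind boils down to a set difference between what users claimed and what was actually delivered. The sketch below assumes claim IDs are unique and that the re-issue call is idempotent; `reissue` and the sample IDs are hypothetical.

```python
def find_missed_deliveries(claimed_ids, delivered_ids):
    """Claims with no matching delivery record, in claim order."""
    delivered = set(delivered_ids)
    return [claim_id for claim_id in claimed_ids if claim_id not in delivered]


def reconcile(claimed_ids, delivered_ids, reissue):
    """Re-issue every missed delivery; return the IDs that succeeded."""
    reissued = []
    for claim_id in find_missed_deliveries(claimed_ids, delivered_ids):
        if reissue(claim_id):  # assumed idempotent on the delivery side
            reissued.append(claim_id)
    return reissued


claimed = ["c1", "c2", "c3", "c4"]
delivered_log = ["c1", "c3"]
reissued = reconcile(claimed, delivered_log, reissue=lambda claim_id: True)
print(reissued)
```

Idempotency matters because the script may run repeatedly, and a user must not receive the same gift twice.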
2.9 Operations Platform – Zhiyun System
The SNG operations team provides a standardized, intelligent platform that enables rapid cross‑region deployment, scaling, and on‑the‑fly adjustments during the event.
3. Practical Experience and Summary of the QQ Spring Festival Red Packet Activity
3.1 Real‑World Situation
Despite extensive preparation, several issues occurred:
Traffic exceeded estimates (peak 200k QPS vs. estimated 96k QPS).
Real‑time reporting server crashed; storage bandwidth saturated; message queues piled up to 12 million requests.
Multiple pre‑planned mitigations were activated, preventing total failure.
3.2 Activation of Emergency Plans
Sample non‑critical traffic when QPS > 200k.
Accept loss of real‑time reporting as it does not affect core user flow.
Scale Cmem bandwidth by provisioning higher‑speed NICs.
Use long‑term storage for massive message backlogs.
Compensate missed deliveries via reconciliation scripts.
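The first plan item, sampling non-critical traffic above 200k QPS, could look like the sketch below. Only the 200k threshold comes from the article; the keep ratio (threshold divided by current QPS) and the hash-based selection are assumptions chosen so that the same request ID always gets the same decision.

```python
import hashlib


def keep_request(request_id, current_qps, threshold_qps=200_000,
                 is_critical=False):
    """Decide whether to process a request under the sampling plan."""
    if is_critical or current_qps <= threshold_qps:
        return True
    keep_ratio = threshold_qps / current_qps  # assumed sampling policy
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return bucket < keep_ratio * 10_000


# At 400k QPS roughly half of the non-critical requests are kept.
kept = sum(keep_request(f"req-{i}", 400_000) for i in range(10_000))
print(f"kept {kept} of 10000 non-critical requests")
```

Critical requests bypass sampling entirely, so the core claim flow is unaffected while reporting and similar traffic is thinned out.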
3.3 Reflections
First, traffic was underestimated. Second, business cluster isolation was insufficient. Third, scaling was conservative: capacity was expanded from the 96k QPS estimate to only about 146k QPS, well short of the 200k QPS peak. Finally, the resulting overload left parts of the system running under high load.
3.4 Summary of High‑Availability Practices for Large‑Scale Activities
Key takeaways:
Draw detailed architecture diagrams and decompose functional chains.
Estimate traffic from promotion exposure, app login peaks, and conversion rates.
Perform scaling assessments and apply asynchrony, caching, and degradation where direct scaling is impossible.
Implement overload protection, flexible availability, disaster recovery, and comprehensive monitoring.
Prepare and rehearse multiple emergency playbooks with one‑click activation.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.