How Tencent Scaled QQ Red Packet to 100k QPS: Architecture & Lessons

This article describes how Tencent's AMS system was analyzed, sized through traffic estimation, and redesigned for high availability ahead of the QQ Spring Festival Red Packet event. It covers architecture mapping, scaling strategies, overload protection, flexible availability, disaster recovery, monitoring, and practical lessons learned.

Efficient Ops

May 9, 2017

Preface

The author, a senior engineer in Tencent's SNG Value‑Added Products Department, introduces the QQ Membership privilege and AMS system work, referencing a previous article on scaling from millions to hundreds of millions of daily requests.

1. Business System Chain Analysis and Traffic Estimation

1.1 What is an operation activity

AMS is an activity‑operation platform. An activity consists of four pages shown in the diagram. While the pages themselves are simple, handling them at a scale of 150,000 requests per second becomes a challenging problem.

1.2 AMS System Overview

The AMS platform serves QQ Membership activities with a daily request volume of 4-12 billion across more than 120 services. Peak traffic can reach 150,000 QPS. The system supports major Tencent businesses and has carried the QQ Spring Festival Red Packet activity for two years.

1.3 Challenges of the QQ Spring Festival Red Packet Activity

First: Estimating traffic scale. Is it 100k QPS, 200k QPS, or more?

Second: Complex service-to-service call chains make capacity planning difficult.

Third: Ensuring the activity runs smoothly on the day despite unknown issues.

1.4 Architecture Mapping

A simplified diagram shows user interactions with sub‑systems, focusing on business‑level relationships rather than server‑level details.

For example, when a user clicks the red-packet claim button, the request flows through STGW (reverse proxy), the web server, Cmem, IDIP, and many downstream services. This mapping helps identify which components need to be scaled together.

1.5 Business Function Chain Decomposition

Simply buying more machines is insufficient; each stateful service must be fully understood and scaled.

1.6 Traffic Scale Estimation

Traffic estimation combines promotion exposure, conversion rates, and per-user request multipliers. For instance, if 20 of every 100 exposed users click through, each click may generate 3-4 static and 3-4 dynamic requests, which the source rounds to a multiplier of about 3. The final estimate for the activity was stated as 120 requests per second per user, multiplied by the number of concurrent users.
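The estimation approach above can be written as a short calculation. This is a minimal sketch: the function name and every input figure below are illustrative assumptions, not the numbers Tencent used.

```python
def estimate_peak_qps(exposures_per_second: float,
                      click_rate: float,
                      requests_per_click: float) -> float:
    """Peak QPS = exposure rate x conversion rate x per-user request multiplier."""
    if not 0.0 <= click_rate <= 1.0:
        raise ValueError("click_rate must be between 0 and 1")
    return exposures_per_second * click_rate * requests_per_click


# Hypothetical inputs: 100,000 exposures per second, 20% click-through,
# and about 3 backend requests per click.
peak = estimate_peak_qps(100_000, 0.20, 3)
print(f"estimated peak: {peak:,.0f} QPS")
```

Keeping the three factors separate makes it easy to re-run the estimate when the promotion plan or the page design changes.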

1.7 AMS Scaling Assessment

Estimated required capacity was around 96k QPS (≈100k). However, some third‑party or cross‑BG services could not be scaled, requiring alternative solutions.
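A capacity target like this is usually converted into a machine count per service. The sketch below assumes a hypothetical per-machine throughput and a utilization ceiling that leaves headroom; neither figure comes from the source.

```python
def machines_needed(target_qps: int,
                    per_machine_qps: int,
                    max_utilization_pct: int = 60) -> int:
    """Machines required so that each one stays under the utilization ceiling."""
    if per_machine_qps <= 0 or not 0 < max_utilization_pct <= 100:
        raise ValueError("invalid capacity parameters")
    usable_qps_x100 = per_machine_qps * max_utilization_pct
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-(target_qps * 100) // usable_qps_x100)


# Hypothetical: 96k QPS target, 2,000 QPS per machine, run at no more than 60%.
print(machines_needed(96_000, 2_000, 60))
```

Services that cannot be scaled this way (third-party or cross-BG dependencies) fall outside the calculation and need the alternative approaches described in the next section.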

2. AMS High‑Availability Architecture Practices

2.1 Architecture Refactoring Based on the Red Packet Activity

Three main approaches were applied:

Asynchrony: For unscalable interfaces, produce requests quickly and consume them asynchronously.

Cache-First: Load large user lists (e.g., game users) into memory to avoid costly external queries.

Service Degradation: Non-critical modules (e.g., real-time reporting) can be skipped when overloaded.
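The asynchrony approach can be illustrated with an in-process queue. This is only a sketch under stated assumptions: the real system would use a durable message queue, and `slow_downstream`, `accept_request`, and the queue size are hypothetical names and values.

```python
import queue
import threading


def slow_downstream(request_id, delivered):
    """Stand-in for an interface that cannot be scaled."""
    delivered.append(request_id)


def worker(pending, delivered):
    """Consume queued requests at whatever pace the downstream tolerates."""
    while True:
        item = pending.get()
        try:
            if item is None:  # sentinel: stop the worker
                return
            slow_downstream(item, delivered)
        finally:
            pending.task_done()


def accept_request(pending, request_id):
    """Enqueue and return immediately; False means the queue is full."""
    try:
        pending.put_nowait(request_id)
        return True
    except queue.Full:
        return False  # the caller can degrade or ask the user to retry


pending = queue.Queue(maxsize=1000)
delivered = []
consumer = threading.Thread(target=worker, args=(pending, delivered), daemon=True)
consumer.start()

accepted = [accept_request(pending, i) for i in range(100)]
pending.join()      # wait until the backlog is drained
pending.put(None)   # stop the worker
consumer.join()
```

The user-facing path only pays the cost of an enqueue, while the unscalable interface sees a steady, bounded rate of calls.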

2.2 Key Elements of Scaling Assessment

Assess QPS of each link, storage size, memory, and bandwidth. Static resources also need bandwidth calculation; a single page may trigger multiple static requests, multiplying the load.
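The bandwidth side of the assessment is simple arithmetic. The sketch below uses hypothetical page-view rates and asset sizes purely to show how static requests multiply the load.

```python
def static_bandwidth_gbps(page_views_per_second: float,
                          static_requests_per_page: float,
                          avg_asset_bytes: float) -> float:
    """Outbound bandwidth for static resources, in gigabits per second."""
    bits_per_second = (page_views_per_second
                       * static_requests_per_page
                       * avg_asset_bytes
                       * 8)
    return bits_per_second / 1e9


# Hypothetical: 20,000 page views/s, 4 static requests each, 50 KB per asset.
print(static_bandwidth_gbps(20_000, 4, 50_000), "Gbps")
```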

2.3 Critical Points of High‑Availability Construction

Overload Protection: Ensure the system never crashes; partial degradation is acceptable.

Flexible Availability: Business logic can fall back (e.g., treat all users as new) when certain services fail.

Disaster Recovery: Separate critical and non-critical paths, and isolate them across clusters.

Monitoring & Data Reconciliation: Multi-dimensional monitoring of response codes, traffic spikes, and queue backlogs, plus automated reconciliation scripts.

2.4 Overload Protection Details

Promotion Layer: Reduce or stop ad traffic when the system is stressed.

Frontend Layer: Introduce deliberate request delays to smooth spikes.

CGI Layer: Filter and throttle backend requests.

Backend Service Layer: Limit calls to third-party APIs that cannot handle high QPS.

2.5 Light‑Heavy Separation

Critical requests (e.g., order fulfillment) are isolated from lightweight ones (e.g., user info queries) to prevent interference.
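The same idea can be shown in miniature with separate worker pools inside one process. In the article the isolation is between services and clusters, so treat this as an analogy only; the request kinds and pool sizes are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical classification of request kinds.
HEAVY_KINDS = {"fulfill_order"}

# Separate pools: a flood of light queries cannot use up the heavy workers.
heavy_pool = ThreadPoolExecutor(max_workers=8, thread_name_prefix="heavy")
light_pool = ThreadPoolExecutor(max_workers=32, thread_name_prefix="light")


def submit(kind, fn, *args):
    """Route a request to the pool reserved for its class."""
    pool = heavy_pool if kind in HEAVY_KINDS else light_pool
    return pool.submit(fn, *args)


def current_pool_thread():
    return threading.current_thread().name


print(submit("fulfill_order", current_pool_thread).result())
print(submit("query_user_info", current_pool_thread).result())
```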

2.6 Flexible Availability Techniques

Set ultra‑short timeout thresholds (e.g., 20 ms) and skip slow paths.

Use UDP for non‑critical logging where loss is acceptable.
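Both techniques can be sketched with the standard library. The helper names, the fallback value, and the log address are assumptions for illustration; note that in this sketch a timed-out call keeps running in the background and its result is simply ignored.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(fn, timeout_s, fallback):
    """Return fn() if it finishes within timeout_s, otherwise the fallback."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback


def log_udp(message, addr=("127.0.0.1", 9999)):
    """Fire-and-forget log line over UDP; loss is acceptable."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.sendto(message.encode("utf-8"), addr)
        except OSError:
            pass  # never let logging affect the request path


def slow_user_lookup():
    time.sleep(0.2)  # simulates a degraded dependency
    return "returning_user"


# 20 ms budget; on timeout, treat the user as new.
user_type = call_with_fallback(slow_user_lookup, 0.02, "new_user")
log_udp(f"user_type={user_type}")
print(user_type)
```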

2.7 Disaster Recovery: Critical Path

Routine DR: Persist messages to disk, Kafka, and an internal data center for redundancy.

Catastrophic DR: Cross-region deployment and network redundancy to survive data-center failures.

2.8 Monitoring and Reconciliation

Monitor front‑end success rates, traffic fluctuations, gift‑delivery volumes, queue depths, and other dimensions. Use automated scripts to reconcile and re‑issue missed deliveries.
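A reconciliation script of this kind boils down to comparing what was claimed with what was delivered. The sketch below assumes hypothetical `(user_id, gift_id)` records and a caller-supplied re-issue function; the real scripts would read from logs or storage.

```python
def find_missed_deliveries(claims, deliveries):
    """Return claims that have no matching delivery record, in claim order."""
    delivered = set(deliveries)
    return [claim for claim in claims if claim not in delivered]


def reconcile(claims, deliveries, reissue):
    """Re-issue every missed delivery and return how many were re-issued."""
    missed = find_missed_deliveries(claims, deliveries)
    for user_id, gift_id in missed:
        reissue(user_id, gift_id)
    return len(missed)


# Hypothetical records: (user_id, gift_id).
claims = [(1001, "gift_a"), (1002, "gift_a"), (1003, "gift_b")]
deliveries = [(1001, "gift_a"), (1003, "gift_b")]

reissued = []
count = reconcile(claims, deliveries, lambda u, g: reissued.append((u, g)))
print(count, reissued)
```

The re-issue call should be idempotent in practice, so that running the script twice does not deliver a gift twice.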

2.9 Operations Platform – Zhiyun System

The SNG operations team provides a standardized, intelligent platform that enables rapid cross‑region deployment, scaling, and on‑the‑fly adjustments during the event.

3. Practical Experience and Summary of the QQ Spring Festival Red Packet Activity

3.1 Real‑World Situation

Despite extensive preparation, several issues occurred:

Traffic exceeded estimates (peak 200k QPS vs. estimated 96k QPS).

Real‑time reporting server crashed; storage bandwidth saturated; message queues piled up to 12 million requests.

Multiple pre‑planned mitigations were activated, preventing total failure.

3.2 Activation of Emergency Plans

Sample non‑critical traffic when QPS > 200k.

Accept loss of real‑time reporting as it does not affect core user flow.

Scale Cmem bandwidth by provisioning higher‑speed NICs.

Use long‑term storage for massive message backlogs.

Compensate missed deliveries via reconciliation scripts.
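The first plan above, sampling non-critical traffic past a QPS threshold, can be sketched as a small decision function. The 200k threshold comes from the text; the 10% sample rate and the function name are assumptions.

```python
import random


def should_process(is_critical, current_qps,
                   threshold_qps=200_000, sample_rate=0.1,
                   rng=random.random):
    """Critical requests always pass; others are sampled once over threshold."""
    if is_critical or current_qps <= threshold_qps:
        return True
    return rng() < sample_rate  # keep only a fraction of non-critical traffic


# Under the threshold everything is processed.
print(should_process(False, 150_000))
# Over the threshold, critical traffic is still always processed.
print(should_process(True, 250_000))
```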

3.3 Reflections

First: traffic was underestimated.

Second: business cluster isolation was insufficient.

Third: scaling was too conservative (capacity was expanded from the 96k QPS estimate to only about 146k QPS, below the 200k QPS peak).

Finally: the resulting overload left parts of the system running under high load.

3.4 Summary of High‑Availability Practices for Large‑Scale Activities

Key takeaways:

Draw detailed architecture diagrams and decompose functional chains.

Estimate traffic from promotion exposure, app login peaks, and conversion rates.

Perform scaling assessments and apply asynchrony, caching, and degradation where direct scaling is impossible.

Implement overload protection, flexible availability, disaster recovery, and comprehensive monitoring.

Prepare and rehearse multiple emergency playbooks with one‑click activation.
