WeChat Operational Practices: Elastic Scaling, Cloud Management, Capacity Management, and Automated Scheduling
This article describes WeChat's operational standards, cloud‑native management, capacity planning, and automated scheduling techniques, covering configuration file conventions, name‑service design, cloud migration decisions, hardware‑metric based capacity evaluation, stress‑testing methods, and dynamic resource allocation to ensure efficient, reliable service scaling.
1. Operational Standards
1.1 Configuration File Standards
We define a directory‑structure standard for service deployment, manage cross‑service shared configuration through an automated gray‑release mechanism, factor per‑instance differences out of the configuration files, and separate development, testing, and production configurations so that the same image deploys unchanged across environments.
As a result, all instances of the same service version have an identical configuration‑file MD5, regardless of environment.
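The MD5 consistency check can be sketched as follows. This is a minimal illustration that compares configurations as raw bytes; the function names are ours, not WeChat's tooling:

```python
import hashlib

def config_md5(data: bytes) -> str:
    """MD5 digest of a configuration file's raw bytes."""
    return hashlib.md5(data).hexdigest()

def identical_across_instances(configs: list[bytes]) -> bool:
    """A service version is compliant iff every instance's config hashes alike."""
    return len({config_md5(c) for c in configs}) == 1
```

A network‑wide scan then only needs to collect one digest per instance and compare, rather than diffing whole files.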
1.2 Name‑Service Standards
The name service is organized into three layers: an access layer implemented with LVS, a logical layer using etcd, and a storage layer with automated routing configuration. Service scaling is treated as an operations task independent of development releases.
1.3 Data Storage Standards
The access layer carries no data; the logical layer holds short‑lived caches and static data but no persistent state; the storage layer provides long‑lived caches and Paxos‑based durable storage. Scaling the access and logical layers therefore requires no data migration.
Because no layer requires data migration during scaling, service scaling remains an operations engineering problem, isolated from development change releases.
1.4 Operational Standards Summary
Goal: services are fully operable without manual intervention during scaling.
Measures: change‑system interception and network‑wide scans for non‑compliant services.
2. Cloud Management
We migrated logical services to a private cloud using Cgroup‑based resource isolation while keeping access and storage layers on dedicated physical machines due to their high traffic and stability requirements.
2.1 Why Move to Cloud
With nearly 5,000 micro‑services, resource contention on shared physical machines became a bottleneck, prompting a shift to cloud resources for the logical layer.
2.2 Parts Moved to Cloud
Access layer: exclusive physical machines, ample capacity, low change frequency – not cloud‑migrated.
Logical layer: mixed deployment, unpredictable capacity, frequent changes – migrated to cloud.
Storage layer: exclusive physical machines, controllable capacity – not migrated.
2.3 Cgroup‑Based Cloud
We use kernel Cgroup for isolation, defining custom VM types (e.g., VC11 = 1 CPU + 1 GB RAM, VC24 = 2 CPU + 4 GB RAM) and physical machine slicing.
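The VM types map mechanically onto Cgroup limits. In this sketch the VC&lt;cpus&gt;&lt;mem_gb&gt; naming scheme is inferred from the two published examples, and the cgroup v2 keys (`cpu.max`, `memory.max`) are an illustrative translation, not necessarily WeChat's interface:

```python
import re

def parse_vm_type(name: str) -> dict:
    """Parse a custom VM type such as 'VC11' (1 CPU, 1 GB) or 'VC24' (2 CPU, 4 GB).
    Single-digit fields are assumed from the two published examples."""
    m = re.fullmatch(r"VC(\d)(\d)", name)
    if not m:
        raise ValueError(f"unknown VM type: {name}")
    return {"cpus": int(m.group(1)), "mem_gb": int(m.group(2))}

def cgroup_limits(vm: dict, period_us: int = 100_000) -> dict:
    """Translate a VM spec into cgroup-v2-style limits (illustrative)."""
    return {
        "cpu.max": f"{vm['cpus'] * period_us} {period_us}",   # quota per period
        "memory.max": str(vm["mem_gb"] * 1024**3),            # bytes
    }
```

Physical‑machine slicing is then a bin‑packing of these fixed shapes onto hosts.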
2.4 Docker Not Used in Production
Our in‑house svrkit framework covers 100 % of production services and relies heavily on IPC mechanisms that would be disrupted by Docker; therefore we avoid Docker for now.
2.5 Private Cloud Scheduling System
We built a private scheduler inspired by Borg, YARN, Kubernetes, and Mesos, covering about 80 % of micro‑services and integrating tightly with svrkit.
2.6 Cloud Management Summary
Goal: isolate resources between services and provide page‑based scaling operations.
Measures: deployment system blocks non‑cloud services and actively refactors core services.
3. Capacity Management
3.1 Supporting Business Growth
We aim to match capacity expansion precisely with business growth, detecting insufficient capacity within minutes and performing rapid scaling.
3.2 Evaluating Capacity with Hardware Metrics
CPU usage, disk space, network bandwidth, and memory are primary indicators, though they have limitations.
3.3 CPU‑Based Capacity Calculation
Service capacity = current peak load / empirical CPU ceiling. For example, a service whose peak CPU utilization is 40 % against an empirical ceiling of 80 % is running at half of its capacity.
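The formula translates directly into code. In this sketch, `min_instances` assumes load spreads evenly across instances, which idealizes real routing:

```python
import math

def capacity_ratio(peak_cpu: float, cpu_ceiling: float) -> float:
    """Fraction of total capacity the current peak consumes.
    Both arguments are utilization fractions in (0, 1]."""
    return peak_cpu / cpu_ceiling

def min_instances(current: int, peak_cpu: float, cpu_ceiling: float) -> int:
    """Fewest instances that keep per-instance peak CPU under the ceiling,
    assuming the load spreads evenly (an idealization)."""
    return math.ceil(current * peak_cpu / cpu_ceiling)
```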
3.4 Limitations of Hardware Metrics
Different services are constrained by CPU or memory, and performance near critical thresholds can be unpredictable.
3.5 Stress‑Testing Approach
We stress‑test along two axes, simulated versus real traffic and test versus production environments, yielding four testing scenarios.
3.6 Real‑World Stress Test
Standardized weight‑adjustment loops allow us to safely increase load while monitoring for failures.
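The weight‑adjustment loop might look like the sketch below. The `set_weight`/`error_rate`/`latency_ok` interface is an assumption for illustration, not svrkit's actual API:

```python
def real_traffic_stress(instance, step: float = 0.5, max_weight: float = 5.0) -> float:
    """Gradually raise an instance's routing weight, rolling back on failure.

    `instance` is any object exposing set_weight(w), error_rate(), and
    latency_ok() -- an illustrative interface. Returns the highest weight
    held without breaching the failure thresholds.
    """
    weight = 1.0
    while weight < max_weight:
        candidate = weight + step
        instance.set_weight(candidate)
        if instance.error_rate() > 0.01 or not instance.latency_ok():
            instance.set_weight(weight)   # roll back one step and stop
            break
        weight = candidate
    return weight
```

The highest safe weight, multiplied by the baseline per‑weight load, gives the instance's measured capacity under real traffic.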
3.7 Potential Issues During Stress Test
Stress tests risk causing real failures, so anomalies must be detected promptly.
Sustained load can expose hidden low‑level problems.
Rapid recovery procedures must be ready before testing begins.
3.8 Service Self‑Protection
Services implement fast‑reject mechanisms to drop excess requests when overloaded.
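A queue‑length‑based fast‑reject can be sketched as follows; the threshold and counters are illustrative, not WeChat's actual overload‑control parameters:

```python
class FastReject:
    """Drop requests once the pending queue exceeds a limit.

    A minimal sketch of overload self-protection: a cheap, immediate
    rejection is better for the caller than a slow timeout.
    """
    def __init__(self, max_pending: int = 100):
        self.max_pending = max_pending
        self.pending = 0
        self.rejected = 0

    def try_accept(self) -> bool:
        if self.pending >= self.max_pending:
            self.rejected += 1    # fail fast instead of queuing into timeout
            return False
        self.pending += 1
        return True

    def done(self) -> None:
        self.pending -= 1
```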
3.9 Upstream Retry Protection
Upstream services automatically retry on alternative instances when a particular instance rejects requests.
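The retry‑on‑alternative behavior pairs naturally with fast‑reject. In this sketch each instance is a callable returning `(ok, result)`, an assumed interface for illustration:

```python
def call_with_retry(instances, request, max_tries: int = 3):
    """Try a request against alternative instances after a fast-reject.

    `instances` is a list of callables returning (ok, result). Because a
    fast-reject returns immediately, retrying elsewhere is cheap.
    """
    last_error = None
    for instance in instances[:max_tries]:
        ok, result = instance(request)
        if ok:
            return result
        last_error = result
    raise RuntimeError(f"all instances rejected: {last_error}")
```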
3.10 Multi‑Dimensional Monitoring
We monitor hardware metrics, fast‑reject events, request latency, and failure rates across the entire call chain.
3.11 Second‑Level Monitoring
Metrics are collected every six seconds instead of once per minute, enabling sub‑10‑second anomaly detection.
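Six‑second aggregation can be sketched like this. The anomaly rule (a window mean exceeding a multiple of baseline) is our illustrative choice, not WeChat's published detector:

```python
def bucketize(samples, window: int = 6):
    """Aggregate (timestamp_sec, value) samples into fixed windows,
    returning {window_start: mean_value} in time order."""
    buckets = {}
    for ts, value in samples:
        buckets.setdefault(ts // window * window, []).append(value)
    return {start: sum(vs) / len(vs) for start, vs in sorted(buckets.items())}

def first_anomaly(buckets, baseline: float, factor: float = 2.0):
    """Start of the first window whose mean exceeds baseline * factor,
    or None if every window looks healthy."""
    for start, mean in buckets.items():
        if mean > baseline * factor:
            return start
    return None
```

With one‑minute windows the same spike would surface up to ten times later, which is the whole argument for second‑level collection.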
3.12 Dynamic Rate Control
During stress testing we adjust load based on queue backlog and latency, achieving rapid convergence to the service’s capacity limit.
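One way to realize this convergence is an additive‑increase/multiplicative‑decrease probe. AIMD is our illustrative choice here; the article does not specify WeChat's exact controller:

```python
def converge_rate(service, start_rate: float, step: float = 100.0,
                  backoff: float = 0.5, rounds: int = 50) -> float:
    """Probe toward the capacity limit: raise load while backlog and
    latency stay healthy, back off multiplicatively otherwise.

    `service(rate) -> (backlog_ok, latency_ok)` is an assumed interface.
    Returns the highest rate the service sustained.
    """
    rate = start_rate
    best = 0.0
    for _ in range(rounds):
        backlog_ok, latency_ok = service(rate)
        if backlog_ok and latency_ok:
            best = max(best, rate)
            rate += step          # additive increase
        else:
            rate *= backoff       # multiplicative decrease
    return best
```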
3.13 Capacity Management Summary
Accurately quantify each service’s resource needs.
Identify the optimal machine type for each service based on automated testing.
4. Automated Scheduling
4.1 Automatic Scaling for Business Growth
Using performance models derived from stress tests, we keep services operating at 50‑60 % utilization, under a 66 % ceiling that reserves one‑third of capacity as disaster‑recovery headroom, and automatically provision resources as demand rises.
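The scaling decision can be sketched as follows, using the 50‑60 % band and 66 % ceiling from the text; the even‑load assumption and parameter names are ours:

```python
import math

def target_instances(current: int, utilization: float,
                     target_low: float = 0.5, target_high: float = 0.6,
                     dr_ceiling: float = 0.66) -> int:
    """Instances needed to keep per-instance utilization inside the target
    band and below the disaster-recovery ceiling, assuming even load spread."""
    total_load = current * utilization
    if utilization > min(target_high, dr_ceiling):      # too hot: scale out
        return math.ceil(total_load / target_high)
    if utilization < target_low:                        # wasteful: scale in
        return max(1, math.floor(total_load / target_low))
    return current                                      # inside the band
```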
4.2 Automatic Scaling for Anomalies
Sudden traffic spikes trigger CPU‑threshold‑based scaling.
Program performance regressions are detected via daily stress‑test curves, prompting proactive scaling.
4.3 Performance Degradation Evaluation
Daily stress tests reveal performance drops after major releases, allowing timely fixes.
4.4 Performance Management Loop
New instances are stress‑tested immediately after gray‑release; if performance degrades, the rollout is halted.
4.5 Business Patterns
We handle peak‑heavy, work‑hour, and event‑driven traffic patterns, adjusting resources accordingly.
4.6 Peak‑Shaving and Valley‑Filling
Resources released after peak periods are reallocated to lower‑load services, reducing waste.
4.7 Online Service Peak‑Shaving
Regular services release resources after peak.
Off‑peak services acquire the freed resources.
4.8 Offline Compute Valley‑Filling
Offline tasks run between 01:00‑08:00 unrestricted, with controlled queuing from 08:00‑20:00, using cgroup limits for CPU, memory, and blkio, and lowest priority to avoid impacting online services.
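The time windows translate into a simple admission policy. Note that the source leaves 20:00‑01:00 unspecified, so this sketch conservatively keeps those hours under controlled queuing:

```python
def offline_policy(hour: int) -> str:
    """Admission policy for offline compute by hour of day (0-23),
    following the windows in the text."""
    if 1 <= hour < 8:
        return "unrestricted"   # online valley: offline tasks run freely
    # 08:00-20:00 per the text; 20:00-01:00 assumed queued (online peak hours)
    return "queued"             # cgroup-limited (cpu/memory/blkio), lowest priority
```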
4.9 Automated Scheduling Summary
Full control of all online services to maximize resource utilization.
Offline tasks share online resources without separate provisioning.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.