WeChat Operational Practices: Elastic Scaling, Cloud Management, Capacity Management, and Automated Scheduling
This article describes WeChat's operational standards, cloud‑native management, capacity planning, and automated scheduling techniques, covering configuration file conventions, name‑service design, cloud migration decisions, hardware‑metric based capacity evaluation, stress‑testing methods, and dynamic resource allocation to ensure efficient, reliable service scaling.
1. Operational Standards
1.1 Configuration File Standards
We define a directory‑structure standard for service deployment, manage cross‑service shared configuration through an automated gray‑release mechanism, factor per‑instance differences out of the configuration files, and separate development, testing, and production configurations so that the same image deploys unchanged across environments.
As a result, all instances of the same service version have an identical configuration‑file MD5, regardless of environment.
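The MD5 consistency check can be sketched as follows. This is a minimal illustration that compares configurations as raw bytes; the function names are ours, not WeChat's tooling:

```python
import hashlib

def config_md5(data: bytes) -> str:
    """MD5 digest of a configuration file's raw bytes."""
    return hashlib.md5(data).hexdigest()

def identical_across_instances(configs: list[bytes]) -> bool:
    """A service version is compliant iff every instance's config hashes alike."""
    return len({config_md5(c) for c in configs}) == 1
```

A network‑wide scan then only needs to collect one digest per instance and compare, rather than diffing whole files.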
1.2 Name‑Service Standards
The name service is organized into three layers: an access layer implemented with LVS, a logical layer using etcd, and a storage layer with automated routing configuration. Service scaling is treated as an operations task independent of development releases.
1.3 Data Storage Standards
The access layer carries no data; the logical layer holds short‑lived caches and static data but no persistent state; the storage layer provides long‑lived caches and Paxos‑based durable storage. Scaling the access and logical layers therefore requires no data migration.
Because no layer requires data migration during scaling, service scaling remains an operations engineering problem, isolated from development change releases.
1.4 Operational Standards Summary
Goal: services are fully operable without manual intervention during scaling.
Measures: change‑system interception and network‑wide scans for non‑compliant services.
2. Cloud Management
We migrated logical services to a private cloud using Cgroup‑based resource isolation while keeping access and storage layers on dedicated physical machines due to their high traffic and stability requirements.
2.1 Why Move to Cloud
With nearly 5,000 micro‑services, resource contention on shared physical machines became a bottleneck, prompting a shift to cloud resources for the logical layer.
2.2 Parts Moved to Cloud
Access layer: exclusive physical machines, ample capacity, low change frequency – not cloud‑migrated.
Logical layer: mixed deployment, unpredictable capacity, frequent changes – migrated to cloud.
Storage layer: exclusive physical machines, controllable capacity – not migrated.
2.3 Cgroup‑Based Cloud
We use kernel Cgroup for isolation, defining custom VM types (e.g., VC11 = 1 CPU + 1 GB RAM, VC24 = 2 CPU + 4 GB RAM) and physical machine slicing.
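The VM types map mechanically onto Cgroup limits. In this sketch the VC&lt;cpus&gt;&lt;mem_gb&gt; naming scheme is inferred from the two published examples, and the cgroup v2 keys (`cpu.max`, `memory.max`) are an illustrative translation, not necessarily WeChat's interface:

```python
import re

def parse_vm_type(name: str) -> dict:
    """Parse a custom VM type such as 'VC11' (1 CPU, 1 GB) or 'VC24' (2 CPU, 4 GB).
    Single-digit fields are assumed from the two published examples."""
    m = re.fullmatch(r"VC(\d)(\d)", name)
    if not m:
        raise ValueError(f"unknown VM type: {name}")
    return {"cpus": int(m.group(1)), "mem_gb": int(m.group(2))}

def cgroup_limits(vm: dict, period_us: int = 100_000) -> dict:
    """Translate a VM spec into cgroup-v2-style limits (illustrative)."""
    return {
        "cpu.max": f"{vm['cpus'] * period_us} {period_us}",   # quota per period
        "memory.max": str(vm["mem_gb"] * 1024**3),            # bytes
    }
```

Physical‑machine slicing is then a bin‑packing of these fixed shapes onto hosts.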
2.4 Docker Not Used in Production
Our in‑house svrkit framework covers 100 % of production services and relies heavily on IPC mechanisms that would be disrupted by Docker; therefore we avoid Docker for now.
2.5 Private Cloud Scheduling System
We built a private scheduler inspired by Borg, YARN, Kubernetes, and Mesos, covering about 80 % of micro‑services and integrating tightly with svrkit.
2.6 Cloud Management Summary
Goal: isolate resources between services and provide page‑based scaling operations.
Measures: deployment system blocks non‑cloud services and actively refactors core services.
3. Capacity Management
3.1 Supporting Business Growth
We aim to match capacity expansion precisely with business growth, detecting insufficient capacity within minutes and performing rapid scaling.
3.2 Evaluating Capacity with Hardware Metrics
CPU usage, disk space, network bandwidth, and memory are primary indicators, though they have limitations.
3.3 CPU‑Based Capacity Calculation
Service capacity = current peak load / empirical CPU ceiling. For example, a service whose peak CPU utilization is 40 % against an empirical ceiling of 80 % is running at half of its capacity.
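The formula translates directly into code. In this sketch, `min_instances` assumes load spreads evenly across instances, which idealizes real routing:

```python
import math

def capacity_ratio(peak_cpu: float, cpu_ceiling: float) -> float:
    """Fraction of total capacity the current peak consumes.
    Both arguments are utilization fractions in (0, 1]."""
    return peak_cpu / cpu_ceiling

def min_instances(current: int, peak_cpu: float, cpu_ceiling: float) -> int:
    """Fewest instances that keep per-instance peak CPU under the ceiling,
    assuming the load spreads evenly (an idealization)."""
    return math.ceil(current * peak_cpu / cpu_ceiling)
```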
3.4 Limitations of Hardware Metrics
Different services are constrained by CPU or memory, and performance near critical thresholds can be unpredictable.
3.5 Stress‑Testing Approach
We stress‑test along two axes, simulated versus real traffic and test versus production environments, yielding four testing scenarios.
3.6 Real‑World Stress Test
Standardized weight‑adjustment loops allow us to safely increase load while monitoring for failures.
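The weight‑adjustment loop might look like the sketch below. The `set_weight`/`error_rate`/`latency_ok` interface is an assumption for illustration, not svrkit's actual API:

```python
def real_traffic_stress(instance, step: float = 0.5, max_weight: float = 5.0) -> float:
    """Gradually raise an instance's routing weight, rolling back on failure.

    `instance` is any object exposing set_weight(w), error_rate(), and
    latency_ok() -- an illustrative interface. Returns the highest weight
    held without breaching the failure thresholds.
    """
    weight = 1.0
    while weight < max_weight:
        candidate = weight + step
        instance.set_weight(candidate)
        if instance.error_rate() > 0.01 or not instance.latency_ok():
            instance.set_weight(weight)   # roll back one step and stop
            break
        weight = candidate
    return weight
```

The highest safe weight, multiplied by the baseline per‑weight load, gives the instance's measured capacity under real traffic.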
3.7 Potential Issues During Stress Test
Stress tests risk causing real failures, so anomalies must be detected promptly.
Sustained load can expose hidden low‑level problems.
Rapid recovery procedures must be ready before testing begins.
3.8 Service Self‑Protection
Services implement fast‑reject mechanisms to drop excess requests when overloaded.
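A queue‑length‑based fast‑reject can be sketched as follows; the threshold and counters are illustrative, not WeChat's actual overload‑control parameters:

```python
class FastReject:
    """Drop requests once the pending queue exceeds a limit.

    A minimal sketch of overload self-protection: a cheap, immediate
    rejection is better for the caller than a slow timeout.
    """
    def __init__(self, max_pending: int = 100):
        self.max_pending = max_pending
        self.pending = 0
        self.rejected = 0

    def try_accept(self) -> bool:
        if self.pending >= self.max_pending:
            self.rejected += 1    # fail fast instead of queuing into timeout
            return False
        self.pending += 1
        return True

    def done(self) -> None:
        self.pending -= 1
```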
3.9 Upstream Retry Protection
Upstream services automatically retry on alternative instances when a particular instance rejects requests.
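The retry‑on‑alternative behavior pairs naturally with fast‑reject. In this sketch each instance is a callable returning `(ok, result)`, an assumed interface for illustration:

```python
def call_with_retry(instances, request, max_tries: int = 3):
    """Try a request against alternative instances after a fast-reject.

    `instances` is a list of callables returning (ok, result). Because a
    fast-reject returns immediately, retrying elsewhere is cheap.
    """
    last_error = None
    for instance in instances[:max_tries]:
        ok, result = instance(request)
        if ok:
            return result
        last_error = result
    raise RuntimeError(f"all instances rejected: {last_error}")
```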
3.10 Multi‑Dimensional Monitoring
We monitor hardware metrics, fast‑reject events, request latency, and failure rates across the entire call chain.
3.11 Second‑Level Monitoring
Metrics are collected every six seconds instead of once per minute, enabling sub‑10‑second anomaly detection.
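Six‑second aggregation can be sketched like this. The anomaly rule (a window mean exceeding a multiple of baseline) is our illustrative choice, not WeChat's published detector:

```python
def bucketize(samples, window: int = 6):
    """Aggregate (timestamp_sec, value) samples into fixed windows,
    returning {window_start: mean_value} in time order."""
    buckets = {}
    for ts, value in samples:
        buckets.setdefault(ts // window * window, []).append(value)
    return {start: sum(vs) / len(vs) for start, vs in sorted(buckets.items())}

def first_anomaly(buckets, baseline: float, factor: float = 2.0):
    """Start of the first window whose mean exceeds baseline * factor,
    or None if every window looks healthy."""
    for start, mean in buckets.items():
        if mean > baseline * factor:
            return start
    return None
```

With one‑minute windows the same spike would surface up to ten times later, which is the whole argument for second‑level collection.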
3.12 Dynamic Rate Control
During stress testing we adjust load based on queue backlog and latency, achieving rapid convergence to the service’s capacity limit.
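One way to realize this convergence is an additive‑increase/multiplicative‑decrease probe. AIMD is our illustrative choice here; the article does not specify WeChat's exact controller:

```python
def converge_rate(service, start_rate: float, step: float = 100.0,
                  backoff: float = 0.5, rounds: int = 50) -> float:
    """Probe toward the capacity limit: raise load while backlog and
    latency stay healthy, back off multiplicatively otherwise.

    `service(rate) -> (backlog_ok, latency_ok)` is an assumed interface.
    Returns the highest rate the service sustained.
    """
    rate = start_rate
    best = 0.0
    for _ in range(rounds):
        backlog_ok, latency_ok = service(rate)
        if backlog_ok and latency_ok:
            best = max(best, rate)
            rate += step          # additive increase
        else:
            rate *= backoff       # multiplicative decrease
    return best
```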
3.13 Capacity Management Summary
Accurately quantify each service’s resource needs.
Identify the optimal machine type for each service based on automated testing.
4. Automated Scheduling
4.1 Automatic Scaling for Business Growth
Using performance models derived from stress tests, we keep services operating at 50‑60 % utilization, under a 66 % ceiling that reserves one‑third of capacity as disaster‑recovery headroom, and automatically provision resources as demand rises.
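The scaling decision can be sketched as follows, using the 50‑60 % band and 66 % ceiling from the text; the even‑load assumption and parameter names are ours:

```python
import math

def target_instances(current: int, utilization: float,
                     target_low: float = 0.5, target_high: float = 0.6,
                     dr_ceiling: float = 0.66) -> int:
    """Instances needed to keep per-instance utilization inside the target
    band and below the disaster-recovery ceiling, assuming even load spread."""
    total_load = current * utilization
    if utilization > min(target_high, dr_ceiling):      # too hot: scale out
        return math.ceil(total_load / target_high)
    if utilization < target_low:                        # wasteful: scale in
        return max(1, math.floor(total_load / target_low))
    return current                                      # inside the band
```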
4.2 Automatic Scaling for Anomalies
Sudden traffic spikes trigger CPU‑threshold‑based scaling.
Program performance regressions are detected via daily stress‑test curves, prompting proactive scaling.
4.3 Performance Degradation Evaluation
Daily stress tests reveal performance drops after major releases, allowing timely fixes.
4.4 Performance Management Loop
New instances are stress‑tested immediately after gray‑release; if performance degrades, the rollout is halted.
4.5 Business Patterns
We handle peak‑heavy, work‑hour, and event‑driven traffic patterns, adjusting resources accordingly.
4.6 Peak‑Shaving and Valley‑Filling
Resources released after peak periods are reallocated to lower‑load services, reducing waste.
4.7 Online Service Peak‑Shaving
Regular services release resources after peak.
Off‑peak services acquire the freed resources.
4.8 Offline Compute Valley‑Filling
Offline tasks run between 01:00‑08:00 unrestricted, with controlled queuing from 08:00‑20:00, using cgroup limits for CPU, memory, and blkio, and lowest priority to avoid impacting online services.
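The time windows translate into a simple admission policy. Note that the source leaves 20:00‑01:00 unspecified, so this sketch conservatively keeps those hours under controlled queuing:

```python
def offline_policy(hour: int) -> str:
    """Admission policy for offline compute by hour of day (0-23),
    following the windows in the text."""
    if 1 <= hour < 8:
        return "unrestricted"   # online valley: offline tasks run freely
    # 08:00-20:00 per the text; 20:00-01:00 assumed queued (online peak hours)
    return "queued"             # cgroup-limited (cpu/memory/blkio), lowest priority
```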
4.9 Automated Scheduling Summary
Full control of all online services to maximize resource utilization.
Offline tasks share online resources without separate provisioning.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.