
How QQ Music Achieves High Availability: Architecture, Toolchain, and Observability

This article explains how QQ Music builds a high‑availability system by combining redundant architecture, a comprehensive toolchain—including chaos engineering and full‑link pressure testing—and deep observability to gracefully handle failures in a large‑scale microservice environment.


1. QQ Music High‑Availability Architecture Overview

Failures are inevitable in distributed systems, so the focus is on embracing them. QQ Music builds high availability on three pillars: redundant architecture, a supporting toolchain, and observability.

Architecture

Redundant architecture eliminates single points of failure through cluster, multi‑datacenter, and multi‑region deployments, supporting horizontal scaling, load balancing, and automatic failover. Stability strategies such as distributed rate limiting, circuit breaking, and dynamic timeouts further improve availability.

Toolchain

The toolchain integrates experiments and tests to enhance architecture reliability, including chaos engineering and full‑link pressure testing. Chaos engineering injects faults to discover weak points, while full‑link pressure testing applies realistic traffic to identify performance bottlenecks.

Observability

Observability improves fault detection and resolution by collecting logs, metrics, tracing, profiling, and dumps, enabling end‑to‑end visibility of service health.

2. Disaster‑Recovery Architecture

Common DR solutions include remote cold standby, same‑city active‑active, two‑region three‑center, and remote active‑active/multi‑active. QQ Music adopts a dual‑center active‑active model with a write‑to‑one, read‑from‑both approach to balance cost and risk.

1) Dual‑Center Deployment

Two centers (Shenzhen and Shanghai) host identical STGW and API gateways. Global Server Load Balancing (GSLB) directs traffic based on proximity, ensuring isolation between centers.

The logical layer separates reads and writes: Shenzhen handles both reads and writes, Shanghai serves reads only, and write requests arriving in Shanghai are routed to Shenzhen.
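
The write-to-one, read-from-both routing can be sketched as follows; the `Request` type, function name, and center identifiers are illustrative, since QQ Music's actual gateway API is not public:

```python
from dataclasses import dataclass

@dataclass
class Request:
    is_write: bool

WRITE_CENTER = "shenzhen"  # the single center that accepts writes

def route(req: Request, local_center: str) -> str:
    """Reads are served in the local center; writes that arrive in the
    read-only center are forwarded to the write center."""
    if req.is_write and local_center != WRITE_CENTER:
        return WRITE_CENTER
    return local_center
```

This keeps read latency low in both regions while confining write conflicts to a single center.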

Storage is duplicated in both centers; synchronization components keep data consistent across regions, using native cross‑region sync where available.

2) Automatic Failover

Initial client‑side dynamic IP scoring proved unstable, so the solution shifted to API‑gateway‑side failover, reducing client involvement.

Two failover mechanisms:

API‑gateway failover: When a local API fails (including circuit break or rate limit), the gateway routes the request to the remote center.

Client failover: if the gateway times out or returns a 5xx response, the client retries against the remote center; otherwise it does not retry.

The gateway‑side retry is more controllable and, combined with adaptive rate‑limit and circuit‑break strategies, prevents traffic amplification.
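
The client-side rule above reduces to a small predicate; this is a minimal sketch (function name and signature are illustrative):

```python
def should_retry_remote(timed_out: bool, status: int = 200) -> bool:
    """Retry against the remote center only on a gateway timeout or a
    5xx response; any other outcome is returned to the caller as-is."""
    return timed_out or 500 <= status <= 599
```

Keeping the client rule this narrow is what lets the gateway-side retry, with its adaptive rate limits, stay in control of cross-center traffic.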

3) Adaptive Retry Algorithm

<code>
# Adaptive retry window update, rewritten as runnable Python.
# f: current probe window size, g: probes actually sent in the window,
# s: probe success rate within the window, t: total local request count.
def next_window(f, g, s, t):
    if s >= 0.98:                       # probes healthy
        if g >= f:                      # window fully used: grow it,
            return f + max(min(t // 100, f), 1)  # capped at min(1% of t, f)
        return f                        # under-used: hold steady
    return max(1, f // 2)               # unhealthy: halve and back off

# Initial window size: f(0) = 1.
</code>

The algorithm adjusts the retry window based on probe success, with detection and back‑off phases.

3. Stability Strategies

Distributed Rate Limiting

QQ Music uses a sliding‑window counter for distributed rate limiting, discarding excess requests at the service level without introducing global dependencies.
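
A per-instance sketch of the sliding-window counter idea; the production version shards the budget across instances to stay distributed, and this class (names included) is illustrative:

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Admit at most `limit` requests per sliding window; excess is discarded."""

    def __init__(self, limit: int, window_s: float = 1.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.admitted = deque()  # timestamps of admitted requests

    def allow(self) -> bool:
        now = self.clock()
        # Evict timestamps that have slid out of the window.
        while self.admitted and now - self.admitted[0] >= self.window_s:
            self.admitted.popleft()
        if len(self.admitted) >= self.limit:
            return False  # over budget: discard the request
        self.admitted.append(now)
        return True
```

Unlike a fixed-window counter, the deque of timestamps avoids the burst of up to 2× the limit at window boundaries.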

Adaptive Rate Limiting

Server‑side adaptive limiting bounds inflight requests using Little’s Law (inflight = latency × QPS) and triggers limiting when CPU usage exceeds 800‰ (80%) and inflight exceeds the computed optimum.
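
The trigger condition can be sketched directly from Little's Law; the parameter names are illustrative, and the 800‰ CPU threshold is the one stated above:

```python
def should_shed(inflight: int, min_latency_s: float,
                max_qps: float, cpu_permille: int) -> bool:
    """Shed load only when the CPU is hot (> 800 per mille) AND inflight
    exceeds the Little's-Law optimum: best_inflight = latency * QPS."""
    best_inflight = min_latency_s * max_qps
    return cpu_permille > 800 and inflight > best_inflight
```

Requiring both conditions prevents limiting a service that is merely busy but still within its latency budget.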

Circuit Breaking

Adopts an SRE‑style circuit breaker with only Closed and Half‑Open states, discarding requests based on a dynamic success‑rate threshold (requests > K × accepts).
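
The SRE-style breaker's rejection probability follows the client-side throttling formula from the Google SRE book; this sketch assumes that formula is the one in use:

```python
import random

def drop_probability(requests: int, accepts: int, k: float = 2.0) -> float:
    """Probability of rejecting locally once requests exceed K x accepts.
    There is no fully-Open state: some traffic always probes the backend."""
    return max(0.0, (requests - k * accepts) / (requests + 1))

def should_drop(requests: int, accepts: int, k: float = 2.0,
                rng=random.random) -> bool:
    return rng() < drop_probability(requests, accepts, k)
```

A larger K tolerates more failures before rejecting; K = 2 means the breaker stays silent until fewer than half of requests succeed.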

Dynamic Timeout

Uses an EMA‑based algorithm to adjust timeout thresholds dynamically, expanding timeout when average latency is low and shrinking it when latency spikes.
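
A minimal sketch of the EMA shape described above; the smoothing factor and the expand/shrink multipliers are illustrative constants, not QQ Music's actual tuning:

```python
class EmaTimeout:
    """Track latency with an exponential moving average; relax the
    deadline while the average is low, tighten it to fail fast on spikes."""

    def __init__(self, base_s: float, alpha: float = 0.2,
                 expand: float = 2.0, shrink: float = 0.5):
        self.base = base_s
        self.alpha = alpha
        self.expand = expand    # deadline multiplier while healthy
        self.shrink = shrink    # deadline multiplier while degraded
        self.ema = base_s / 2   # optimistic start below the base latency

    def observe(self, latency_s: float) -> None:
        self.ema = self.alpha * latency_s + (1 - self.alpha) * self.ema

    def timeout_s(self) -> float:
        if self.ema < self.base:            # average low: expand deadline
            return self.base * self.expand
        return self.base * self.shrink      # latency spiking: shrink it
```

Shrinking under load is the counterintuitive half: slow requests are cut off early so they stop occupying threads and amplifying the spike.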

Service Grading

Services are classified into four grades (1‑critical, 2‑important, 3‑minor, 4‑trivial) to prioritize traffic and SLA commitments.

API‑Gateway Graded Rate Limiting

The gateway applies graded rate limiting, ensuring that during high load only grade‑1 services remain available.

4. Toolchain

Chaos Engineering

TMEChaos, built on ChaosMesh, provides a cloud‑native chaos platform with experiment orchestration, dashboards, and integration with TME microservice architecture.

Full‑Link Pressure Testing

Generates realistic traffic by sampling production API calls, applies traffic coloring to isolate test traffic, and uses a pressure engine to drive requests while smart monitoring detects and aborts unhealthy experiments.

5. Observability

Metrics

Prometheus federation collects millions of metrics with 3‑second scrape intervals, providing real‑time monitoring of QPS, latency, error rate, and saturation.
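
A federation setup of this kind is configured with a scrape job against each shard's `/federate` endpoint; this fragment is a generic sketch (shard hostnames and the `match[]` selector are placeholders, with the 3-second interval from the text):

```yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 3s
    honor_labels: true          # keep the shards' original job/instance labels
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'         # placeholder: select which series to pull up
    static_configs:
      - targets:
          - 'prom-shard-a:9090'
          - 'prom-shard-b:9090'
```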

Logging

ELK stack (Filebeat → Kafka → Logstash → Elasticsearch → Kibana) centralizes log collection and enables fast query and analysis.

Tracing

Jaeger captures distributed traces, linking spans across services to reconstruct call chains for fault isolation.

Profiles

Conprof continuously collects CPU/heap profiles in production, storing them for later analysis via a unified UI.

Dumps

Panic dumps are captured via RPC interceptors and reported to Sentry for post‑mortem analysis.

6. Summary

The article presents QQ Music’s high‑availability practice across architecture, toolchain, and observability. Redundant dual‑center design, adaptive failover, and comprehensive stability strategies form the backbone, while chaos engineering, full‑link testing, and deep observability continuously improve resilience.

Tags: Cloud Native, microservices, observability, High Availability, fault tolerance
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
