
Baidu's Internal Service Mesh Practice: Architecture, Challenges, and Performance Optimizations

Baidu built an internal, Istio‑based service mesh that decouples service governance from language‑specific RPC frameworks. It offers low‑intrusion integration, ultra‑low latency through a brpc coroutine data plane, advanced fault tolerance, and fine‑grained traffic scheduling. It now covers over 80 % of Baidu's core microservices and handles more than a trillion requests per day.

Baidu Geek Talk

Baidu’s existing RPC‑based service governance suffered from inconsistent framework capabilities, low efficiency, and insufficient global observability. To address these problems, Baidu introduced an internally built service mesh that decouples governance capabilities from language‑specific RPC frameworks and pushes them down to sidecars.

The mesh aims to provide two core capabilities: (1) fundamental stability functions such as fault tolerance, detection, and unified intervention interfaces; (2) traffic governance, including global traffic observability and fine‑grained scheduling.

Technical challenges included achieving low‑intrusion integration for thousands of services, maintaining ultra‑low latency for latency‑sensitive products (search, recommendation), supporting heterogeneous language ecosystems, and ensuring mesh reliability at massive scale.

The overall architecture is based on open‑source Istio and Envoy, extended with Baidu‑specific components: a Mesh Control Center (access, configuration, and operations), a control plane (istio‑pilot), a data plane (Envoy), and surrounding governance ecosystems (service discovery, RPC adapters, monitoring, PaaS support).

There are two access methods: a transparent traffic‑hijacking solution that uses a local loopback IP and Envoy, and a proxy‑less approach that adapts various RPC frameworks to the Istio xDS API. Both methods allow services to join the mesh without code changes.
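The loopback‑based hijacking idea can be sketched as a name‑resolution shim: when a service is meshed, the caller is handed a loopback address on which the local sidecar listens, instead of the real instance list. This is a minimal illustration, not Baidu's implementation. The names `SIDECAR_LISTENERS`, `REGISTRY`, and `resolve`, and all addresses and ports, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    host: str
    port: int


# Hypothetical table: service name -> port of the local sidecar listener.
SIDECAR_LISTENERS = {"search-backend": 15001}

# Hypothetical registry of real instances, used when a service is not meshed.
REGISTRY = {
    "search-backend": [Endpoint("10.0.0.5", 8000)],
    "legacy-svc": [Endpoint("10.0.0.9", 9000)],
}


def resolve(service: str, mesh_enabled: bool) -> list[Endpoint]:
    """Return the endpoints a caller should connect to for `service`.

    With the mesh enabled, the caller only ever sees a loopback address,
    so its outbound traffic lands on the local sidecar without code changes.
    """
    if mesh_enabled and service in SIDECAR_LISTENERS:
        return [Endpoint("127.0.0.1", SIDECAR_LISTENERS[service])]
    # Not meshed (or mesh switched off): use the real instances directly.
    return REGISTRY[service]
```

In this sketch, switching `mesh_enabled` off returns the caller to direct connections, which is one way a low‑intrusion rollout could be reverted.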

To optimize performance, the team replaced the default Envoy event‑loop model with a high‑performance brpc‑based coroutine model, reducing CPU usage by over 60 % and average latency by more than 70 % compared with community versions. Ongoing research on eBPF and DPDK is expected to bring further latency and resource gains.

Stability governance introduced advanced fault‑tolerance strategies (dynamic retries, circuit breaking), rapid fault detection (at minute‑level granularity via Prometheus integration), and unified intervention and degradation mechanisms. Together these dramatically improved availability (e.g., from 99 % to 99.99 %).
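As a rough illustration of how retries and circuit breaking interact, the sketch below retries a failed call a bounded number of times and stops sending traffic once consecutive failures reach a threshold. It is a generic textbook pattern, not the mesh's actual policy. The class and function names, the thresholds, and the use of `ConnectionError` as the retryable error are all assumptions.

```python
class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures, then allows
    a probe request once `reset_after` seconds have passed."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 5.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # time at which the breaker opened, if open

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        # After the cooldown, let a probe through (half-open state).
        return now - self.opened_at >= self.reset_after

    def record(self, success: bool, now: float) -> None:
        if success:
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now


def call_with_retries(send, breaker: CircuitBreaker, now: float,
                      max_attempts: int = 3):
    """Call `send()` up to `max_attempts` times, respecting the breaker."""
    for _ in range(max_attempts):
        if not breaker.allow(now):
            raise RuntimeError("circuit open")
        try:
            result = send()
        except ConnectionError:
            breaker.record(False, now)
            continue
        breaker.record(True, now)
        return result
    raise RuntimeError("retries exhausted")
```

The clock is passed in as `now` rather than read inside the class so that the behaviour is easy to test. A sidecar would typically apply such logic per upstream instance or per service.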

Traffic governance provides global service‑graph observability, standardized golden‑metric collection, and fine‑grained traffic scheduling at the instance level, enabling precise traffic shaping, canary releases, and load testing.
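Instance‑level scheduling of the kind described above is often expressed as per‑instance weights. The following sketch shows weighted selection, where a canary instance receives a small share of traffic. The function name, the instance names, and the weights are hypothetical, and the source does not say which load‑balancing algorithm the mesh uses.

```python
import bisect
import itertools


def pick_instance(weights: dict[str, float], point: float) -> str:
    """Pick an instance in proportion to its weight.

    `point` is a number in [0, 1), e.g. from random.random() or a
    normalized hash of the request ID for sticky routing.
    """
    names = list(weights)
    cumulative = list(itertools.accumulate(weights[n] for n in names))
    total = cumulative[-1]
    if total <= 0:
        raise ValueError("at least one instance needs a positive weight")
    # Find the first cumulative bucket that lies strictly above the target.
    index = bisect.bisect_right(cumulative, point * total)
    return names[index]


# Hypothetical canary release: 10% of traffic goes to the new instance.
weights = {"stable-1": 45, "stable-2": 45, "canary-1": 10}
```

Setting an instance's weight to zero drains it without removing it from the list, and raising the canary weight step by step gives a gradual rollout.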

The mesh ecosystem also includes automatic parameter tuning, fault‑auto‑mitigation systems, and a unified protocol (xDS) to coordinate surrounding services. Stability is further ensured through multi‑level fallback mechanisms, controlled rollout, continuous health checks, and chaos‑engineering injections.
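The article does not spell out what the fallback levels are. One plausible reading, shown here purely as an assumption, is that a request first goes through the sidecar and falls back to a direct connection if the sidecar is unavailable. The function and path names are hypothetical.

```python
def send_with_fallback(paths, request):
    """Try each (name, send) path in order and return the first success.

    `paths` is an ordered list such as
    [("sidecar", via_sidecar), ("direct", via_direct)].
    Only ConnectionError triggers a fallback; other errors propagate.
    """
    errors = []
    for name, send in paths:
        try:
            return name, send(request)
        except ConnectionError as exc:
            errors.append(f"{name}: {exc}")
    # Every level failed: surface all collected errors to the caller.
    raise ConnectionError("; ".join(errors))
```

Returning the name of the path that served the request makes it possible to count how often each fallback level is used, which is the kind of signal continuous health checks can monitor.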

Since its launch in late 2019, the mesh has been deployed across dozens of product lines, covering over 80 % of core modules and handling traffic exceeding a trillion requests per day, delivering low‑cost, low‑intrusion, and standardized service governance for Baidu’s large‑scale microservice environment.
