
Baidu's Internal Service Mesh Practice: Architecture, Challenges, and Performance Optimizations

Baidu built an in‑house, Istio‑based service mesh that decouples service governance from language‑specific RPC frameworks. It offers low‑intrusion integration, ultra‑low latency through a brpc coroutine data plane, advanced fault tolerance, and fine‑grained traffic scheduling, and it now covers over 80 % of Baidu's core microservices, handling more than a trillion requests per day.

Baidu Geek Talk

Jun 9, 2021

Baidu’s existing RPC‑based service governance suffered from inconsistent framework capabilities, low efficiency, and insufficient global observability. To address these problems, Baidu introduced an internally built service mesh that decouples governance capabilities from language‑specific RPC frameworks and pushes them down to sidecars.

The mesh aims to provide two core capabilities: (1) fundamental stability functions such as fault tolerance, detection, and unified intervention interfaces; (2) traffic governance, including global traffic observability and fine‑grained scheduling.

Technical challenges included achieving low‑intrusion integration for thousands of services, maintaining ultra‑low latency for latency‑sensitive products (search, recommendation), supporting heterogeneous language ecosystems, and ensuring mesh reliability at massive scale.

The overall architecture is based on open‑source Istio and Envoy, extended with Baidu‑specific components: a Mesh Control Center (access, configuration, and operations), a control plane (istio‑pilot), a data plane (Envoy), and surrounding governance ecosystems (service discovery, RPC adapters, monitoring, PaaS support).

There are two access methods: transparent traffic hijacking, which routes traffic through a local loopback IP to Envoy, and a proxy‑less approach, which adapts the various RPC frameworks to the Istio xDS API. Both methods allow services to join the mesh without code changes.
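The loopback‑based hijacking can be pictured with a small Python sketch. It assumes one plausible reading of the approach: when a service is on the mesh, name resolution hands the caller a loopback address where the local Envoy listens, instead of the real backend instances. The registry contents, the listener port, and the `resolve` helper are all hypothetical and not taken from Baidu's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Endpoint:
    host: str
    port: int


# Hypothetical registry: service name -> real backend instances.
REAL_INSTANCES: Dict[str, List[Endpoint]] = {
    "search-backend": [Endpoint("10.0.0.11", 8000), Endpoint("10.0.0.12", 8000)],
}

# Hypothetical mapping: service name -> port of the local sidecar listener.
SIDECAR_LISTENERS: Dict[str, int] = {"search-backend": 15001}


def resolve(service: str, mesh_enabled: bool) -> List[Endpoint]:
    """Return the endpoints the caller should connect to.

    With the mesh enabled, the caller only ever sees the loopback address of
    its sidecar; the sidecar is then responsible for picking a real backend.
    """
    if mesh_enabled and service in SIDECAR_LISTENERS:
        return [Endpoint("127.0.0.1", SIDECAR_LISTENERS[service])]
    return REAL_INSTANCES[service]
```

Because the switch happens in name resolution rather than in business code, the application keeps calling the same service name, which is consistent with the "no code changes" claim above.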

Performance optimization replaced the default Envoy event‑loop model with a high‑performance brpc‑based coroutine model, reducing CPU usage by over 60 % and average latency by more than 70 % compared with community versions. Additional research on eBPF and DPDK promises further latency and resource gains.

Stability governance introduced advanced fault‑tolerance strategies (dynamic retries, circuit breaking), minute‑level fault detection through Prometheus integration, and unified intervention and degradation mechanisms. Together these dramatically improved availability (e.g., from 99 % to 99.99 %).
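To make retries and circuit breaking concrete, here is a minimal Python sketch of how a sidecar might combine them. It is an illustration under stated assumptions, not Baidu's or Envoy's implementation: the `CircuitBreaker` class, the `call_with_retries` helper, and the thresholds are hypothetical, and the retry count is fixed because the article does not describe how its "dynamic" retry policy adapts.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Consecutive-failure circuit breaker with a cool-down (illustrative only)."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cool-down, let a probe request through (half-open state).
        return self.clock() - self.opened_at >= self.reset_after_s

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()


def call_with_retries(send: Callable[[], str], breaker: CircuitBreaker,
                      max_attempts: int = 3) -> str:
    """Retry `send` on ConnectionError, but stop as soon as the breaker opens."""
    for _ in range(max_attempts):
        if not breaker.allow():
            raise RuntimeError("circuit open")
        try:
            result = send()
        except ConnectionError:
            breaker.record(False)
            continue
        breaker.record(True)
        return result
    raise RuntimeError("retries exhausted")
```

The breaker is checked before every attempt, so retries cannot keep hammering a backend that has already been judged unhealthy; that interaction is the main point of pairing the two mechanisms.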

Traffic governance provides global service‑graph observability, standardized golden‑metric collection, and fine‑grained traffic scheduling at the instance level, enabling precise traffic shaping, canary releases, and load testing.
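Instance‑level scheduling can be sketched as weighted selection over individual instances. The Python example below is a hypothetical illustration: the addresses, the weights, and the `pick_instance` helper are assumptions, and the real mesh would receive such weights from its control plane rather than from a local dictionary.

```python
import random
from typing import Dict, Optional


def pick_instance(weights: Dict[str, float],
                  rng: Optional[random.Random] = None) -> str:
    """Pick one instance address with probability proportional to its weight.

    A weight of 0 drains an instance completely, and a small weight sends it
    only a trickle of traffic, as in a canary release.
    """
    chooser = rng or random
    live = {addr: w for addr, w in weights.items() if w > 0}
    if not live:
        raise ValueError("no instance has positive weight")
    addrs = list(live)
    return chooser.choices(addrs, weights=[live[a] for a in addrs], k=1)[0]


# Hypothetical weights: 95 % stable instance, 5 % canary, one instance drained.
example_weights = {
    "10.0.0.1:8000": 95,
    "10.0.0.2:8000": 5,
    "10.0.0.3:8000": 0,
}
```

Changing only the weight table is enough to shift traffic, which is why per‑instance weights are a convenient primitive for canary releases and load testing.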

The mesh ecosystem also includes automatic parameter tuning, fault‑auto‑mitigation systems, and a unified protocol (xDS) to coordinate surrounding services. Stability is further ensured through multi‑level fallback mechanisms, controlled rollout, continuous health checks, and chaos‑engineering injections.

Since its launch in late 2019, the mesh has been deployed across dozens of product lines, covering over 80 % of core modules and handling traffic exceeding a trillion requests per day, delivering low‑cost, low‑intrusion, and standardized service governance for Baidu’s large‑scale microservice environment.
