Stability Governance and Observability in Baidu Search: From Kepler 1.0 to Kepler 2.0
This article examines how Baidu Search sustains availability at or above five nines. It analyzes the stability challenges of a massive micro-service environment, introduces the Kepler 1.0 observability stack, and follows its evolution into Kepler 2.0, which adds full-trace collection and custom compression. Practical use cases show how these changes improve fault diagnosis and capacity management.
Baidu Search, one of the world's largest online services, must maintain five-nines availability (99.999%) despite a complex micro-service architecture that processes millions of queries per second across hundreds of services and petabytes of data.
Chapter 1 – Challenges: At this scale, faults are frequent. They fall into three classes: PV loss, search-effectiveness failures, and capacity failures. Diagnosing any of them requires comprehensive data collection and automated analysis, because manual debugging is too slow and inefficient.
Chapter 2 – Introducing Kepler 1.0: Early observability relied on sparse logs and metrics. Kepler 1.0 combined Zipkin-style tracing with Prometheus-compatible metrics in a query-sampling system: for each sampled query it generated a call chain and enriched logs, which sped up root-cause analysis of rejection and performance issues.
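The sampling decision can be illustrated with a short sketch. The source does not describe how Kepler 1.0 chooses which queries to sample, so everything here is an assumption: the function name `is_sampled`, the use of SHA-256, and the idea of hashing the query ID so that every service reaches the same decision for the same query without coordination.

```python
import hashlib


def is_sampled(query_id: str, rate: float) -> bool:
    """Decide whether a query is traced (hypothetical helper, not Kepler's API).

    The decision depends only on the query ID, so any service that sees the
    same ID makes the same choice and the call chain stays complete.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Compare as integers to avoid floating-point rounding at the boundaries.
    return bucket < int(rate * 2**64)
```

With a scheme like this, any query outside the sampled fraction leaves no call chain at all, which is the limitation Kepler 2.0 sets out to remove.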
Chapter 3 – Innovation with Kepler 2.0: To overcome the limits of sampling, Kepler 2.0 decouples tracing from logging, implements a deterministic span-ID generation algorithm, and applies domain-specific compression (timestamp deltas, IP truncation, and protobuf varint and packed encoding). Together these reduce storage by about 60% while supporting full-trace and full-log indexing.
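Two of the compression ideas named above, timestamp deltas and varint encoding, combine naturally, and a minimal sketch shows why they save space. The function names are hypothetical and the layout (first timestamp in full, then successive differences) is an assumption; the varint itself follows the standard protobuf base-128 scheme. The sketch assumes timestamps are non-decreasing integers.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf-style base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F          # low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def compress_timestamps(timestamps: list[int]) -> bytes:
    """Store the first timestamp in full, then each difference, as varints."""
    out = bytearray()
    previous = 0
    for ts in timestamps:
        out += encode_varint(ts - previous)  # small deltas need few bytes
        previous = ts
    return bytes(out)


def decompress_timestamps(data: bytes) -> list[int]:
    """Invert compress_timestamps by decoding varints and summing deltas."""
    timestamps = []
    current = value = shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7               # varint continues in the next byte
        else:
            current += value         # varint complete: apply the delta
            timestamps.append(current)
            value = shift = 0
    return timestamps
```

Spans within one query tend to start within microseconds or milliseconds of each other, so most deltas fit in one or two bytes instead of a fixed eight.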
Key innovations include a location‑based log index (inode+offset+length) for O(1) log retrieval, a full‑trace call‑graph that captures every query, and flexible secondary indexes for queries lacking IDs. These capabilities enable concrete use‑cases such as retroactive query reconstruction, cache‑state debugging, and capacity‑aware container monitoring.
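The location-based index can be sketched in a few lines. This is an illustration of the inode + offset + length idea only, not Kepler's implementation: `LogLocation`, `append_log`, `read_log`, and the `paths_by_inode` lookup table are all hypothetical names, and how the real system resolves an inode to a file is not described in the source.

```python
import os
import tempfile
from typing import NamedTuple


class LogLocation(NamedTuple):
    inode: int   # identifies the log file
    offset: int  # byte position where the entry starts
    length: int  # entry size in bytes


def append_log(path: str, message: bytes) -> LogLocation:
    """Append one entry and return where it was written."""
    with open(path, "ab") as f:
        offset = f.seek(0, os.SEEK_END)  # current end of file
        f.write(message)
        inode = os.fstat(f.fileno()).st_ino
    return LogLocation(inode, offset, len(message))


def read_log(paths_by_inode: dict[int, str], loc: LogLocation) -> bytes:
    """Fetch one entry with a single seek and read, with no scanning."""
    with open(paths_by_inode[loc.inode], "rb") as f:
        f.seek(loc.offset)
        return f.read(loc.length)


def demo() -> list[bytes]:
    """Write two entries to a temporary log and read them back by location."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "service.log")
        messages = (b"query=1 status=ok\n", b"query=2 status=rejected\n")
        locations = [append_log(path, m) for m in messages]
        paths_by_inode = {locations[0].inode: path}
        return [read_log(paths_by_inode, loc) for loc in locations]
```

Because the index stores the exact byte range, the cost of a lookup does not grow with the size of the log file, which is what makes indexing every log line affordable.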
The article concludes that exhaustive data collection eliminates blind spots in fault analysis, paving the way for automated, intelligent stability solutions in Baidu Search.
Baidu Intelligent Testing