OCTO 2.0: Architecture and Implementation of Meituan’s Next‑Generation Service Governance System
This article introduces OCTO 2.0, Meituan’s next‑generation distributed service‑governance platform, detailing its overall architecture, mesh‑related features such as traffic hijacking, service subscription, lossless hot‑restart, data‑plane operations, and future cloud‑native evolution plans.
1. Overall Architecture
OCTO 2.0 builds on the existing OCTO 1.0 platform and integrates a Service Mesh layer, comprising four main components: infrastructure, control plane, data plane, and operation system. The infrastructure reuses services from OCTO 1.0 (MNS, KMS, MCC, Rhino) and connects them to the new control plane to avoid costly rewrites.
The control plane is a fully self‑developed alternative to Istio, while the data plane is based on a customized Envoy. Operational tasks such as component upgrades and releases are handled by the operation system.
2. Mesh Features
2.1 Traffic Hijacking
Istio’s native approach intercepts traffic transparently via iptables. OCTO 2.0 instead forwards traffic between business processes and OCTO‑Proxy over a Unix Domain Socket (UDS) direct connection, trading iptables transparency for better performance and lower operational cost.
2.2 Service Subscription
To avoid the overhead of Envoy’s full‑state CDS/EDS pushes, OCTO 2.0 implements on‑demand service discovery: the business process sends an HTTP subscription request to OCTO‑Proxy listing the backend services it depends on, and the proxy then fetches only those services’ cluster and endpoint data over xDS. This keeps per‑proxy state proportional to actual dependencies and improves scalability.
2.3 Lossless Hot‑Restart
For short‑lived connections, OCTO‑Proxy drains old connections before exiting, ensuring zero traffic loss. For long‑lived connections, a coordinated protocol between the client SDK and OCTO‑Proxy signals a “hot‑restart” flag, prompting the SDK to switch to a new connection and retry pending requests.
Both client‑side and server‑side strategies are employed: the client SDK reacts to hot‑restart responses, while the server‑side OCTO‑Proxy proactively notifies peers via a ProxyRestart message, allowing graceful transition.
2.4 Data‑Plane Operations
2.4.1 LEGO Operation Scheme
In Meituan’s single‑container runtime, the LEGO platform manages OCTO‑Proxy at the process level, providing health checks, fault‑restart, monitoring, and version rollout, which is faster than container‑level restarts.
2.4.2 Cloud‑Native Operation Scheme
An operator‑driven approach aims to manage OCTO‑Proxy as a container, but Kubernetes does not support dynamic container addition/removal within a running Pod. The final solution uses a dual‑container standby model: one active container and one standby container, enabling seamless hot‑restart without violating Kubernetes constraints.
3. Future Plans
As OCTO 2.0 scales to thousands of services and billions of daily requests, the roadmap focuses on cloud‑native operation and release automation, mesh‑enabled HTTP services to reduce gateway reliance, and full‑link mTLS support.
4. Author Information
Shu Chao, Shi Peng, and Lai Jun are engineers from Meituan’s Infrastructure Development team, responsible for OCTO 2.0 development.
Published via the High Availability Architecture official account.