
OCTO 2.0: Architecture and Implementation of Meituan’s Next‑Generation Service Governance System

This article introduces OCTO 2.0, Meituan’s next‑generation distributed service‑governance platform, detailing its overall architecture, mesh‑related features such as traffic hijacking, service subscription, lossless hot‑restart, data‑plane operations, and future cloud‑native evolution plans.

1. Overall Architecture

OCTO 2.0 builds on the existing OCTO 1.0 platform and integrates a Service Mesh layer, comprising four main components: infrastructure, control plane, data plane, and operation system. The infrastructure reuses services from OCTO 1.0 (MNS, KMS, MCC, Rhino) and connects them to the new control plane to avoid costly rewrites.

The control plane is a fully self‑developed alternative to Istio, while the data plane is based on a customized Envoy. Operational tasks such as component upgrades and releases are handled by the operation system.

2. Mesh Features

2.1 Traffic Hijacking

Instead of Istio’s iptables‑based traffic interception, OCTO 2.0 adopts a Unix Domain Socket (UDS) direct‑connection method to forward traffic between business processes and OCTO‑Proxy, offering better performance and lower operational cost.

2.2 Service Subscription

To avoid the overhead of Envoy’s full‑state CDS/EDS, OCTO 2.0 implements on‑demand service discovery: the business process sends an HTTP subscription request to OCTO‑Proxy listing the backends it actually calls, and the proxy fetches only that subset through xDS, improving scalability.

2.3 Lossless Hot‑Restart

For short‑lived connections, OCTO‑Proxy drains old connections before exiting, ensuring zero traffic loss. For long‑lived connections, a coordinated protocol between the client SDK and OCTO‑Proxy signals a “hot‑restart” flag, prompting the SDK to switch to a new connection and retry pending requests.

Both client‑side and server‑side strategies are employed: the client SDK reacts to hot‑restart responses, while the server‑side OCTO‑Proxy proactively notifies peers via a ProxyRestart message, allowing graceful transition.

2.4 Data‑Plane Operations

2.4.1 LEGO Operation Scheme

In Meituan’s single‑container runtime, the LEGO platform manages OCTO‑Proxy at the process level, providing health checks, fault‑restart, monitoring, and version rollout, which is faster than container‑level restarts.

2.4.2 Cloud‑Native Operation Scheme

An operator‑driven approach aims to manage OCTO‑Proxy as a container, but Kubernetes does not support dynamic container addition/removal within a running Pod. The final solution uses a dual‑container standby model: one active container and one standby container, enabling seamless hot‑restart without violating Kubernetes constraints.
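The dual‑container model can be pictured with a Pod spec along the following lines. All names, images, and the shared‑socket mechanism shown here are hypothetical; they only illustrate how two proxy containers can coexist in one Pod so that roles swap during an upgrade without adding or removing containers.

```yaml
# Hypothetical Pod layout: an active proxy plus a warm standby.
# During an upgrade the standby runs the new version, takes over the
# shared UDS socket directory, and the active/standby roles swap.
apiVersion: v1
kind: Pod
metadata:
  name: app-with-octo-proxy
spec:
  containers:
    - name: business-app
      image: example/business-app:1.0      # placeholder image
      volumeMounts:
        - name: proxy-sock
          mountPath: /var/run/octo
    - name: octo-proxy-active
      image: example/octo-proxy:2.3        # placeholder image
      volumeMounts:
        - name: proxy-sock
          mountPath: /var/run/octo
    - name: octo-proxy-standby
      image: example/octo-proxy:2.4        # new version, idle until the swap
      volumeMounts:
        - name: proxy-sock
          mountPath: /var/run/octo
  volumes:
    - name: proxy-sock
      emptyDir: {}                         # socket directory shared by all three
```

Because both proxy containers exist for the Pod's whole lifetime, the scheme stays within Kubernetes' immutable‑container constraint while still allowing a hot swap.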

3. Future Plans

As OCTO 2.0 scales to thousands of services and billions of daily requests, the roadmap focuses on cloud‑native operation and release automation, mesh‑enabled HTTP services to reduce gateway reliance, and full‑link mTLS support.

4. Author Information

Shu Chao, Shi Peng, and Lai Jun are engineers from Meituan’s Infrastructure Development team, responsible for OCTO 2.0 development.

Tags: distributed systems, cloud-native, operations, service mesh, service governance, hot restart
Published by the High Availability Architecture official account.