
OCTO 2.0: Architecture and Implementation of Meituan’s Next‑Generation Service Governance System

This article introduces OCTO 2.0, Meituan’s next‑generation distributed service‑governance platform, detailing its overall architecture, mesh‑related features such as traffic hijacking, service subscription, lossless hot‑restart, data‑plane operations, and future cloud‑native evolution plans.

1. Overall Architecture

OCTO 2.0 builds on the existing OCTO 1.0 platform and integrates a Service Mesh layer, comprising four main components: infrastructure, control plane, data plane, and operation system. The infrastructure reuses services from OCTO 1.0 (MNS, KMS, MCC, Rhino) and connects them to the new control plane to avoid costly rewrites.

The control plane is a fully self‑developed alternative to Istio, while the data plane is based on a customized Envoy. Operational tasks such as component upgrades and releases are handled by the operation system.

2. Mesh Features

2.1 Traffic Hijacking

Instead of Istio’s iptables‑based traffic interception, OCTO 2.0 adopts a Unix Domain Socket (UDS) direct‑connection method to forward traffic between business processes and OCTO‑Proxy, offering better performance and lower operational cost.

2.2 Service Subscription

To avoid the overhead of Envoy’s full‑state CDS/EDS, OCTO 2.0 implements on‑demand service discovery: the business process sends an HTTP subscription request to OCTO‑Proxy listing the backends it actually calls, and the proxy fetches only that subset through xDS, improving scalability.

2.3 Lossless Hot‑Restart

For short‑lived connections, OCTO‑Proxy drains old connections before exiting, ensuring zero traffic loss. For long‑lived connections, a coordinated protocol between the client SDK and OCTO‑Proxy signals a “hot‑restart” flag, prompting the SDK to switch to a new connection and retry pending requests.

Both client‑side and server‑side strategies are employed: the client SDK reacts to hot‑restart responses, while the server‑side OCTO‑Proxy proactively notifies peers via a ProxyRestart message, allowing graceful transition.

2.4 Data‑Plane Operations

2.4.1 LEGO Operation Scheme

In Meituan’s single‑container runtime, the LEGO platform manages OCTO‑Proxy at the process level, providing health checks, fault‑restart, monitoring, and version rollout, which is faster than container‑level restarts.

2.4.2 Cloud‑Native Operation Scheme

An operator‑driven approach aims to manage OCTO‑Proxy as a container, but Kubernetes does not support dynamic container addition/removal within a running Pod. The final solution uses a dual‑container standby model: one active container and one standby container, enabling seamless hot‑restart without violating Kubernetes constraints.
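The dual‑container model can be pictured with a Pod spec along the following lines. All names, images, and the shared‑socket mechanism shown here are hypothetical; they only illustrate how two proxy containers can coexist in one Pod so that roles swap during an upgrade without adding or removing containers.

```yaml
# Hypothetical Pod layout: an active proxy plus a warm standby.
# During an upgrade the standby runs the new version, takes over the
# shared UDS socket directory, and the active/standby roles swap.
apiVersion: v1
kind: Pod
metadata:
  name: app-with-octo-proxy
spec:
  containers:
    - name: business-app
      image: example/business-app:1.0      # placeholder image
      volumeMounts:
        - name: proxy-sock
          mountPath: /var/run/octo
    - name: octo-proxy-active
      image: example/octo-proxy:2.3        # placeholder image
      volumeMounts:
        - name: proxy-sock
          mountPath: /var/run/octo
    - name: octo-proxy-standby
      image: example/octo-proxy:2.4        # new version, idle until the swap
      volumeMounts:
        - name: proxy-sock
          mountPath: /var/run/octo
  volumes:
    - name: proxy-sock
      emptyDir: {}                         # socket directory shared by all three
```

Because both proxy containers exist for the Pod's whole lifetime, the scheme stays within Kubernetes' immutable‑container constraint while still allowing a hot swap.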

3. Future Plans

As OCTO 2.0 scales to thousands of services and billions of daily requests, the roadmap focuses on cloud‑native operation and release automation, mesh‑enabled HTTP services to reduce gateway reliance, and full‑link mTLS support.

4. Author Information

Shu Chao, Shi Peng, and Lai Jun are engineers from Meituan’s Infrastructure Development team, responsible for OCTO 2.0 development.

Tags: distributed systems, cloud-native, operations, service mesh, service governance, hot restart
Published by the High Availability Architecture official account.