
Deep Practice of Service Mesh at Ant Financial: Large‑Scale Deployment, Challenges, and Recommendations

This article presents Ant Financial's Service Mesh journey in depth: the multi‑stage large‑scale rollout, performance measurements from the 618 promotion and preparations for Double‑11, the technical challenges encountered, optimizations applied to the data plane and control plane, and practical advice for organizations considering Service Mesh adoption.

AntTech

In 2019 Ant Financial entered the deep‑water stage of Service Mesh deployment, achieving massive scale across hundreds of applications and over 100,000 pods, with the system supporting major traffic spikes such as the 618 promotion and preparing for Double‑11.

The rollout progressed through five phases: technical research (2017), exploration with Golang sidecar SOFAMosn and open‑source SOFAMesh (2018), small‑scale internal pilots, large‑scale internal adoption in early 2019, and full‑scale deployment in the second half of 2019.

Performance tests comparing workloads with and without the SOFAMosn sidecar showed modest CPU increase (≈2 % average), memory overhead of about 15 MB per node, and latency rise of roughly 0.2 ms, with some scenarios even showing latency reductions due to optimizations such as routing cache, writev batching, and protocol improvements.
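One of the latency optimizations named above, the routing cache, can be sketched in Go (SOFAMosn's implementation language). The types and function names below are illustrative, not SOFAMosn's actual API: the idea is simply that repeated lookups for the same service skip the full routing computation.

```go
package main

import (
	"fmt"
	"sync"
)

// routeCache memoizes route lookups so hot request paths skip the
// full routing pass. Hypothetical sketch of the "routing cache" idea.
type routeCache struct {
	mu    sync.RWMutex
	table map[string]string // service name -> upstream cluster
}

func newRouteCache() *routeCache {
	return &routeCache{table: make(map[string]string)}
}

// lookup returns the cached destination for service, computing and
// caching it on first use via the (expensive) compute function.
func (c *routeCache) lookup(service string, compute func(string) string) string {
	c.mu.RLock()
	if dst, ok := c.table[service]; ok {
		c.mu.RUnlock()
		return dst
	}
	c.mu.RUnlock()
	dst := compute(service) // full routing computation, run once per service
	c.mu.Lock()
	c.table[service] = dst
	c.mu.Unlock()
	return dst
}

func main() {
	calls := 0
	compute := func(s string) string { calls++; return "cluster-" + s }
	c := newRouteCache()
	c.lookup("payments", compute)
	c.lookup("payments", compute) // second call is served from cache
	fmt.Println(calls)            // the expensive computation ran once
}
```

In a sidecar handling every request on a node, shaving a routing pass per request is exactly the kind of change that can turn a small latency overhead into a net reduction.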

Key challenges at this scale included CPU, memory, latency, routing complexity, and operational concerns. Ant Financial addressed them by optimizing the data‑plane (SOFAMosn) with writev, memory reuse, and protocol changes, and by improving the control‑plane components Pilot and Mixer through serialization, pre‑computation, and push‑optimizations.
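The memory‑reuse technique mentioned for the data plane typically means pooling per‑request buffers instead of allocating fresh ones, which cuts allocation rate and GC pressure. A minimal Go sketch using the standard library's `sync.Pool` (the `handle` function and pool setup are illustrative assumptions, not SOFAMosn code):

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so each request does not
// allocate a new one. Illustrative sketch of the memory-reuse idea.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// handle copies a request payload through a pooled buffer and
// returns its length, returning the buffer to the pool afterwards.
func handle(payload []byte) int {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // clear any data left by a previous request
	buf.Write(payload)
	n := buf.Len()
	bufPool.Put(buf) // make the buffer available for the next request
	return n
}

func main() {
	fmt.Println(handle([]byte("hello"))) // 5
}
```

At 100,000+ pods, reducing per‑request allocations in the sidecar compounds into meaningful CPU and memory savings cluster‑wide.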

To enable smooth migration from traditional SDK‑based microservices to Service Mesh, a “dual‑mode” approach was proposed, combining SDK and sidecar models, leveraging MCP and xDS/UDPA to fuse control‑plane and registry capabilities, and allowing gradual, gray‑scale upgrades without application code changes.
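The dual‑mode idea can be pictured as a single invocation interface behind which either the in‑process SDK path or the sidecar path is selected per application during a gray‑scale rollout. The sketch below is an assumption about the shape of such an abstraction; all names are hypothetical and the sidecar call is stubbed out rather than forwarding to a real local proxy.

```go
package main

import "fmt"

// Invoker abstracts where service routing happens, so application
// code stays unchanged while infrastructure flips the mode.
type Invoker interface {
	Call(service, payload string) string
}

// sdkInvoker is the classic in-process SDK path.
type sdkInvoker struct{}

// meshInvoker stands in for delegation to the local sidecar
// (in reality it would forward to a loopback proxy port).
type meshInvoker struct{}

func (sdkInvoker) Call(service, payload string) string {
	return "sdk->" + service + ":" + payload
}

func (meshInvoker) Call(service, payload string) string {
	return "mesh->" + service + ":" + payload
}

// pick selects the mode from configuration, enabling gray-scale
// migration with no application code changes.
func pick(mode string) Invoker {
	if mode == "mesh" {
		return meshInvoker{}
	}
	return sdkInvoker{}
}

func main() {
	fmt.Println(pick("mesh").Call("payments", "ping"))
}
```

The key property is that the switch lives in configuration, not in application code, which is what makes per‑cluster gradual upgrades safe to roll back.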

Four practical recommendations are offered: identify pain points such as multi‑language support and library upgrade difficulty; consider Service Mesh for legacy applications needing non‑intrusive enhancements; maintain a unified technology stack for organizations with limited engineering resources; and align Service Mesh adoption with broader cloud‑native strategies involving Kubernetes and serverless.

The article concludes by reaffirming the core value of Service Mesh—separating business logic from cross‑cutting concerns—while highlighting Ant Financial's ongoing plans for further scaling, open‑source contributions, and community collaboration.

Tags: cloud-native, Performance Optimization, Microservices, service mesh, Sidecar, ant financial, Large Scale Deployment
Written by

AntTech

Technology is the core driver of Ant's future creation.
