
Deep Practice of Service Mesh at Ant Financial: Large‑Scale Deployment, Challenges, and Recommendations

This article presents Ant Financial's Service Mesh journey in depth: the multi‑stage large‑scale rollout, performance measurements from the 618 promotion and preparations for Double‑11, the technical challenges encountered, optimizations applied to the data plane and control plane, and practical advice for organizations considering Service Mesh adoption.

AntTech

In 2019 Ant Financial entered the deep‑water stage of Service Mesh deployment, achieving massive scale across hundreds of applications and over 100,000 pods, with the system supporting major traffic spikes such as the 618 promotion and preparing for Double‑11.

The rollout progressed through five phases: technical research (2017), exploration with Golang sidecar SOFAMosn and open‑source SOFAMesh (2018), small‑scale internal pilots, large‑scale internal adoption in early 2019, and full‑scale deployment in the second half of 2019.

Performance tests comparing workloads with and without the SOFAMosn sidecar showed modest CPU increase (≈2 % average), memory overhead of about 15 MB per node, and latency rise of roughly 0.2 ms, with some scenarios even showing latency reductions due to optimizations such as routing cache, writev batching, and protocol improvements.
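One of the latency optimizations named above, the routing cache, can be sketched in Go (SOFAMosn's implementation language). The types and function names below are illustrative, not SOFAMosn's actual API: the idea is simply that repeated lookups for the same service skip the full routing computation.

```go
package main

import (
	"fmt"
	"sync"
)

// routeCache memoizes route lookups so hot request paths skip the
// full routing pass. Hypothetical sketch of the "routing cache" idea.
type routeCache struct {
	mu    sync.RWMutex
	table map[string]string // service name -> upstream cluster
}

func newRouteCache() *routeCache {
	return &routeCache{table: make(map[string]string)}
}

// lookup returns the cached destination for service, computing and
// caching it on first use via the (expensive) compute function.
func (c *routeCache) lookup(service string, compute func(string) string) string {
	c.mu.RLock()
	if dst, ok := c.table[service]; ok {
		c.mu.RUnlock()
		return dst
	}
	c.mu.RUnlock()
	dst := compute(service) // full routing computation, run once per service
	c.mu.Lock()
	c.table[service] = dst
	c.mu.Unlock()
	return dst
}

func main() {
	calls := 0
	compute := func(s string) string { calls++; return "cluster-" + s }
	c := newRouteCache()
	c.lookup("payments", compute)
	c.lookup("payments", compute) // second call is served from cache
	fmt.Println(calls)            // the expensive computation ran once
}
```

In a sidecar handling every request on a node, shaving a routing pass per request is exactly the kind of change that can turn a small latency overhead into a net reduction.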

Key challenges at this scale included CPU, memory, latency, routing complexity, and operational concerns. Ant Financial addressed them by optimizing the data‑plane (SOFAMosn) with writev, memory reuse, and protocol changes, and by improving the control‑plane components Pilot and Mixer through serialization, pre‑computation, and push‑optimizations.
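The memory‑reuse technique mentioned for the data plane typically means pooling per‑request buffers instead of allocating fresh ones, which cuts allocation rate and GC pressure. A minimal Go sketch using the standard library's `sync.Pool` (the `handle` function and pool setup are illustrative assumptions, not SOFAMosn code):

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so each request does not
// allocate a new one. Illustrative sketch of the memory-reuse idea.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// handle copies a request payload through a pooled buffer and
// returns its length, returning the buffer to the pool afterwards.
func handle(payload []byte) int {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // clear any data left by a previous request
	buf.Write(payload)
	n := buf.Len()
	bufPool.Put(buf) // make the buffer available for the next request
	return n
}

func main() {
	fmt.Println(handle([]byte("hello"))) // 5
}
```

At 100,000+ pods, reducing per‑request allocations in the sidecar compounds into meaningful CPU and memory savings cluster‑wide.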

To enable smooth migration from traditional SDK‑based microservices to Service Mesh, a “dual‑mode” approach was proposed, combining SDK and sidecar models, leveraging MCP and xDS/UDPA to fuse control‑plane and registry capabilities, and allowing gradual, gray‑scale upgrades without application code changes.
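The dual‑mode idea can be pictured as a single invocation interface behind which either the in‑process SDK path or the sidecar path is selected per application during a gray‑scale rollout. The sketch below is an assumption about the shape of such an abstraction; all names are hypothetical and the sidecar call is stubbed out rather than forwarding to a real local proxy.

```go
package main

import "fmt"

// Invoker abstracts where service routing happens, so application
// code stays unchanged while infrastructure flips the mode.
type Invoker interface {
	Call(service, payload string) string
}

// sdkInvoker is the classic in-process SDK path.
type sdkInvoker struct{}

// meshInvoker stands in for delegation to the local sidecar
// (in reality it would forward to a loopback proxy port).
type meshInvoker struct{}

func (sdkInvoker) Call(service, payload string) string {
	return "sdk->" + service + ":" + payload
}

func (meshInvoker) Call(service, payload string) string {
	return "mesh->" + service + ":" + payload
}

// pick selects the mode from configuration, enabling gray-scale
// migration with no application code changes.
func pick(mode string) Invoker {
	if mode == "mesh" {
		return meshInvoker{}
	}
	return sdkInvoker{}
}

func main() {
	fmt.Println(pick("mesh").Call("payments", "ping"))
}
```

The key property is that the switch lives in configuration, not in application code, which is what makes per‑cluster gradual upgrades safe to roll back.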

Four practical recommendations are offered: identify pain points such as multi‑language support and library upgrade difficulty; consider Service Mesh for legacy applications needing non‑intrusive enhancements; maintain a unified technology stack for organizations with limited engineering resources; and align Service Mesh adoption with broader cloud‑native strategies involving Kubernetes and serverless.

The article concludes by reaffirming the core value of Service Mesh—separating business logic from cross‑cutting concerns—while highlighting Ant Financial's ongoing plans for further scaling, open‑source contributions, and community collaboration.

Tags: cloud-native, Performance Optimization, Microservices, service mesh, Sidecar, ant financial, Large Scale Deployment
Written by

AntTech

Technology is the core driver of Ant's future creation.
