
Ctrip Service Mesh Performance Optimization and Practices

This article details Ctrip's large‑scale Service Mesh deployment on Istio. It analyzes control‑plane performance bottlenecks such as limited pilot concurrency and configuration push latency, and presents a series of optimizations (namespace isolation, concurrent ServiceEntry processing, incremental EnvoyFilter handling, and pilot configuration tuning) that substantially reduce initContext and push latency while supporting thousands of services.

Ctrip Technology

The authors (Zuo Si, Shao Yu, Shirley Bo) from Ctrip's Cloud Container team introduce their work on deploying and optimizing Service Mesh (based on Istio) to support Ctrip's rapid business growth and migration to Kubernetes.

Since 2017, Ctrip has been containerizing applications and moving from monolithic to microservice architectures. Service Mesh gained popularity because it provides richer governance, security, and observability: a sidecar‑based data plane handles load balancing, circuit breaking, and rate limiting, while the control plane manages configuration centrally.

Starting in mid‑2020, Ctrip researched and customized Istio and piloted it on a small scale. By the end of 2021, more than 600 non‑core applications had been onboarded; today more than 2,000 applications (about 10k pods) run on the mesh.

Initial performance measurements revealed two main issues: insufficient concurrency in pilot's object handling and high configuration‑push latency. After optimization, ServiceEntry processing became about 15× faster, initContext P99 latency dropped from about 30 s to 5‑10 s, overall push latency fell from more than 30 s to about 15 s, and incremental pushes of instance changes dropped to about 2.5 s.

The challenges identified include insufficient pilot concurrency, lack of namespace isolation for Istio CRs, and full‑push overhead for large numbers of objects.

Optimization efforts covered:

Object handling: replacing linear queues with concurrent processing, adding namespace‑level isolation for Istio CRs, and redesigning ServiceEntryStore to use concurrent controllers.
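To illustrate the idea of replacing a single linear queue with concurrent processing, here is a minimal sketch. It assumes that events are sharded by namespace/name, so that all updates to one object are handled by the same worker in arrival order while different objects proceed in parallel. Event, shardFor and ProcessConcurrently are hypothetical names; this is not Ctrip's code or Istio's ServiceEntryStore API.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// Event is a hypothetical stand-in for a ServiceEntry add/update/delete notification.
type Event struct {
	Namespace string
	Name      string
	Version   int
}

// key identifies the object an event refers to.
func (e Event) key() string { return e.Namespace + "/" + e.Name }

// shardFor maps an object key to a worker index. The same key always lands on
// the same worker, which preserves per-object ordering.
func shardFor(key string, workers int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(workers))
}

// ProcessConcurrently fans events out to a fixed pool of workers (one queue
// each) and returns the last version applied for every object.
func ProcessConcurrently(events []Event, workers int, handle func(Event)) map[string]int {
	queues := make([]chan Event, workers)
	var wg sync.WaitGroup
	var mu sync.Mutex
	latest := make(map[string]int)

	for i := range queues {
		queues[i] = make(chan Event, 64)
		wg.Add(1)
		go func(q <-chan Event) {
			defer wg.Done()
			for ev := range q {
				handle(ev) // e.g. convert the ServiceEntry into service/instance models
				mu.Lock()
				latest[ev.key()] = ev.Version
				mu.Unlock()
			}
		}(queues[i])
	}
	for _, ev := range events {
		queues[shardFor(ev.key(), workers)] <- ev
	}
	for _, q := range queues {
		close(q) // lets each worker drain its queue and exit
	}
	wg.Wait()
	return latest
}

func main() {
	events := []Event{
		{"ns-a", "svc-1", 1}, {"ns-b", "svc-2", 1},
		{"ns-a", "svc-1", 2}, {"ns-b", "svc-2", 2}, {"ns-a", "svc-1", 3},
	}
	latest := ProcessConcurrently(events, 4, func(Event) {})
	fmt.Println(latest["ns-a/svc-1"], latest["ns-b/svc-2"]) // 3 2
}
```

Hashing by object key rather than by namespace is a design choice made for this sketch: it spreads a single large namespace across all workers, at the cost of not ordering events across different objects.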

Push performance: enabling incremental XDS pushes, reducing push volume by filtering gateway‑only services, disabling headless service listeners, increasing pilot push concurrency, and tuning debounce parameters.
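The debounce step can be pictured as follows. Change events are merged into one push request until either no new event arrives for a quiet period or the batch has stayed open for a maximum wait. A merged request is a full push if any of its inputs was full, and otherwise carries the union of the changed names. This is a minimal offline simulation under those assumptions; PushRequest, Arrival and Debounce are hypothetical names. In Istio itself, the corresponding knobs are, to our knowledge, the PILOT_DEBOUNCE_AFTER and PILOT_DEBOUNCE_MAX environment variables, with PILOT_PUSH_THROTTLE limiting push concurrency. The code does not call Istio.

```go
package main

import (
	"fmt"
	"time"
)

// PushRequest is a simplified, hypothetical stand-in for pilot's push request.
type PushRequest struct {
	Full    bool            // a full push is required
	Updated map[string]bool // names of changed configs or services
}

// Merge returns a new request: full if either input is full, with the union of updated names.
func (r PushRequest) Merge(o PushRequest) PushRequest {
	out := PushRequest{Full: r.Full || o.Full, Updated: map[string]bool{}}
	for k := range r.Updated {
		out.Updated[k] = true
	}
	for k := range o.Updated {
		out.Updated[k] = true
	}
	return out
}

// Arrival is a request tagged with the time it reached the debouncer.
type Arrival struct {
	At  time.Duration
	Req PushRequest
}

// Debounce replays time-ordered arrivals and returns the merged pushes that
// would be sent. A batch is flushed once no request has arrived for `quiet`,
// or once it has been open for `maxWait`, whichever comes first.
func Debounce(arrivals []Arrival, quiet, maxWait time.Duration) []PushRequest {
	var out []PushRequest
	var batch PushRequest
	var start, last time.Duration
	open := false
	for _, a := range arrivals {
		if open && (a.At-last >= quiet || a.At-start >= maxWait) {
			out = append(out, batch)
			open = false
		}
		if !open {
			// Start a new batch with a copy of this request.
			batch, start, open = a.Req.Merge(PushRequest{}), a.At, true
		} else {
			batch = batch.Merge(a.Req)
		}
		last = a.At
	}
	if open {
		out = append(out, batch)
	}
	return out
}

func main() {
	ms := time.Millisecond
	arrivals := []Arrival{
		{0, PushRequest{Updated: map[string]bool{"svc-a": true}}},
		{50 * ms, PushRequest{Updated: map[string]bool{"svc-b": true}}},
		{80 * ms, PushRequest{Full: true}},
		{500 * ms, PushRequest{Updated: map[string]bool{"svc-c": true}}},
	}
	for _, p := range Debounce(arrivals, 100*ms, time.Second) {
		fmt.Println(p.Full, len(p.Updated))
	}
	// Output:
	// true 2
	// false 1
}
```

The trade-off is visible in the example: a longer quiet period collapses more events into one push but delays every change, while maxWait bounds that delay under a continuous stream of events.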

Code changes: adding a namespaceFilter to the Istio CR client, refactoring the full‑push path (the if pushRequest.Full branch), and implementing concurrent updateProxy and setProxyState logic.
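The article does not show the namespaceFilter itself. The sketch below illustrates the general idea under one assumption: custom resources from namespaces that a pilot instance does not own are dropped before they reach its config store. An empty filter admits everything, which matches the unfiltered default. Config, NamespaceFilter, NewNamespaceFilter and FilteredList are hypothetical names and not part of Istio's client API.

```go
package main

import (
	"fmt"
	"sort"
)

// Config is a hypothetical, trimmed-down Istio custom resource (e.g. a VirtualService).
type Config struct {
	Kind, Namespace, Name string
}

// NamespaceFilter admits objects only from an allowed set of namespaces.
type NamespaceFilter struct {
	allowed map[string]struct{}
}

// NewNamespaceFilter builds a filter; with no arguments it admits every namespace.
func NewNamespaceFilter(namespaces ...string) NamespaceFilter {
	f := NamespaceFilter{allowed: make(map[string]struct{}, len(namespaces))}
	for _, ns := range namespaces {
		f.allowed[ns] = struct{}{}
	}
	return f
}

// Admit reports whether an object in namespace ns should be processed.
func (f NamespaceFilter) Admit(ns string) bool {
	if len(f.allowed) == 0 {
		return true
	}
	_, ok := f.allowed[ns]
	return ok
}

// FilteredList returns sorted "Kind:namespace/name" keys for the admitted configs.
func FilteredList(all []Config, f NamespaceFilter) []string {
	var keys []string
	for _, c := range all {
		if f.Admit(c.Namespace) {
			keys = append(keys, c.Kind+":"+c.Namespace+"/"+c.Name)
		}
	}
	sort.Strings(keys)
	return keys
}

func main() {
	all := []Config{
		{"VirtualService", "team-a", "checkout"},
		{"VirtualService", "team-b", "search"},
		{"DestinationRule", "team-a", "checkout"},
	}
	fmt.Println(FilteredList(all, NewNamespaceFilter("team-a")))
	// [DestinationRule:team-a/checkout VirtualService:team-a/checkout]
}
```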

Sidecar calculations: moving initSidecarScopes to the push phase, computing sidecar scopes on demand, and creating per‑application sidecars so that Envoy does not run out of memory (OOM).
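A per‑application sidecar scope computed on demand can be sketched as a lazily filled cache. The scope of an application is computed the first time a push needs it, reused until its dependencies change, and limited to the hosts that application actually calls. This keeps each Envoy's configuration small. The names ScopeCache, Scope and SetDeps, and the idea that each application declares its dependencies, are assumptions made for illustration. Ctrip's actual implementation is not shown in the article.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// ScopeCache lazily computes and caches per-application sidecar scopes.
type ScopeCache struct {
	mu       sync.Mutex
	deps     map[string][]string // app -> hosts it depends on (hypothetical input)
	scopes   map[string][]string // app -> computed scope, filled on demand
	Computes int                 // how many scopes were actually computed
}

func NewScopeCache(deps map[string][]string) *ScopeCache {
	if deps == nil {
		deps = map[string][]string{}
	}
	return &ScopeCache{deps: deps, scopes: map[string][]string{}}
}

// Scope returns the app's own host plus its dependencies (sorted),
// computing it only on the first request after creation or invalidation.
func (c *ScopeCache) Scope(app string) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if s, ok := c.scopes[app]; ok {
		return s
	}
	c.Computes++
	hosts := append([]string{app}, c.deps[app]...) // new slice; input is not mutated
	sort.Strings(hosts)
	c.scopes[app] = hosts
	return hosts
}

// SetDeps records new dependencies and drops the stale cached scope.
func (c *ScopeCache) SetDeps(app string, deps []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.deps[app] = deps
	delete(c.scopes, app)
}

func main() {
	c := NewScopeCache(map[string][]string{"order": {"payment", "inventory"}})
	fmt.Println(c.Scope("order")) // [inventory order payment]
	c.Scope("order")              // served from the cache
	fmt.Println(c.Computes)       // 1
}
```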

Key code snippets:

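```go
// Excerpts from Istio pilot's DiscoveryServer.
// Package aliases: model = istio.io/istio/pilot/pkg/model,
// util = istio.io/istio/pilot/pkg/networking/util.

// In the per-connection push path: on a full push, first refresh the
// proxy's cached state from the new PushContext.
if pushRequest.Full {
    // Update Proxy with current information.
    s.updateProxy(con.proxy, pushRequest.Push)
}

// updateProxy recomputes the proxy's state and, if the proxy reported no
// locality, falls back to the locality label of its first service instance.
func (s *DiscoveryServer) updateProxy(proxy *model.Proxy, push *model.PushContext) {
    s.setProxyState(proxy, push)
    if util.IsLocalityEmpty(proxy.Locality) {
        if len(proxy.ServiceInstances) > 0 {
            proxy.Locality = util.ConvertLocality(proxy.ServiceInstances[0].Endpoint.Locality.Label)
        }
    }
}

// setProxyState refreshes the per-proxy data that config generation depends on.
func (s *DiscoveryServer) setProxyState(proxy *model.Proxy, push *model.PushContext) {
    proxy.SetWorkloadLabels(s.Env)
    proxy.SetServiceInstances(push.ServiceDiscovery)
    // ... (further state updates elided in the original)
    proxy.SetSidecarScope(push)
    proxy.SetGatewaysForProxy(push)
}
```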
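The excerpt refreshes one proxy's state on its own connection. The article reports making updateProxy and setProxyState concurrent, but it does not show how. One common way to do this is to run the per‑proxy updates in goroutines bounded by a semaphore, so that a burst of full pushes cannot start thousands of computations at once. The sketch below shows that pattern only. Proxy and UpdateAll are hypothetical, and the update callback stands in for setProxyState plus the locality fallback.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Proxy is a hypothetical stand-in for model.Proxy with only the fields used here.
type Proxy struct {
	ID      string
	Version string // push context version the proxy's state was computed from
}

// UpdateAll runs update on every proxy with at most limit updates in flight
// and returns the peak number of concurrent updates observed.
func UpdateAll(proxies []*Proxy, limit int, update func(*Proxy)) int {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	var inFlight, peak int64
	for _, p := range proxies {
		wg.Add(1)
		sem <- struct{}{} // blocks while `limit` updates are already running
		go func(p *Proxy) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when this update finishes
			n := atomic.AddInt64(&inFlight, 1)
			for { // record the highest concurrency seen so far
				old := atomic.LoadInt64(&peak)
				if n <= old || atomic.CompareAndSwapInt64(&peak, old, n) {
					break
				}
			}
			update(p) // each proxy is written by exactly one goroutine
			atomic.AddInt64(&inFlight, -1)
		}(p)
	}
	wg.Wait()
	return int(atomic.LoadInt64(&peak))
}

func main() {
	proxies := make([]*Proxy, 100)
	for i := range proxies {
		proxies[i] = &Proxy{ID: fmt.Sprintf("sidecar-%d", i)}
	}
	peak := UpdateAll(proxies, 8, func(p *Proxy) { p.Version = "push-42" })
	fmt.Println(peak <= 8, proxies[99].Version) // true push-42
}
```

Bounding concurrency instead of spawning one goroutine per proxy keeps CPU and memory use on pilot predictable. The limit plays a role similar to a push throttle.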

Future directions include removing Kubernetes resource dependencies from the control plane to achieve sub‑second push times, adopting NDS (Istio's Name Discovery Service, which backs its DNS proxy) for DNS resolution, and further stability work such as connection throttling, circuit breaking for pilot, and more advanced debounce tuning.

Overall, Ctrip's two‑year journey demonstrates that Service Mesh can effectively manage traffic, improve system scalability, and decouple middleware from business logic, providing a solid foundation for large‑scale microservice environments.
