From VLAN to Cloud‑Native: Ctrip’s Multi‑Generation Network Evolution
This article chronicles Ctrip’s network architecture evolution—from early VLAN‑based private‑cloud designs, through SDN‑enabled large‑layer‑2 solutions, to container‑aware hybrid‑cloud and cloud‑native approaches like Cilium—offering practical insights and lessons for large‑scale network engineers and teams.
0 Ctrip Cloud Platform Overview
The Ctrip Cloud team, founded around 2013, initially built a private cloud on OpenStack, later added a bare‑metal system, Mesos and Kubernetes platforms, and finally integrated public‑cloud resources. All services are unified under CDOS – Ctrip Data Center Operating System – which manages compute, network and storage across private and public clouds.
In the private cloud we run VMs, bare‑metal servers and containers; in the public cloud we consume resources from AWS, Tencent Cloud, UCloud, etc., all accessed via a unified CDOS API.
Network Evolution Timeline
We started with a simple VLAN‑based L2 network on OpenStack, then moved to an SDN‑enabled large‑layer‑2 design in 2016, expanded to support containers and hybrid cloud in 2017, and finally explored cloud‑native solutions in 2019.
1 VLAN‑Based L2 Network
1.1 Requirements
In 2013 we provided VMs and bare‑metal resources on OpenStack. The main requirements were low latency, sufficient isolation, routable instance IPs, and a willingness to trade some security for performance.
1.2 Solution: OpenStack Provider Network
We selected the OpenStack provider network model, which uses a physical VLAN for isolation, OVS or Linux Bridge as the soft switch, and places the gateway on hardware switches, eliminating the need for overlay encapsulation.
Key characteristics:
Gateway resides on hardware, requiring hardware network support.
Instance IPs are routable, no tunneling needed.
Better performance because traffic is switched in hardware.
Simplified feature set: VLAN isolation, OVS via the ML2 driver, no L3 agent, no DHCP, no floating IPs, and security groups disabled for performance.
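In ML2 terms, this setup maps to a configuration along these lines; a minimal sketch, where the physical network name, bridge name and VLAN range are illustrative placeholders rather than Ctrip's actual values:

```ini
# ml2_conf.ini (fragment) -- provider VLAN networks, OVS mechanism driver
[ml2]
type_drivers = vlan
tenant_network_types = vlan
mechanism_drivers = openvswitch

[ml2_type_vlan]
# physnet1 and the VLAN range are placeholders
network_vlan_ranges = physnet1:100:2000

[ovs]
# map the physical network onto the host bridge carrying the bonded NICs
bridge_mappings = physnet1:br-bond
```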
1.3 Hardware Network Topology
The physical topology follows a classic access‑aggregation‑core three‑tier design.
Each server has two NICs connected to two top‑of‑rack switches for high availability.
Access and aggregation layers use L2 switching; core layer uses L3 routing.
OpenStack gateways are configured on core routers.
Firewalls connect directly to core routers for additional security.
1.4 Host Internal Network Topology
Inside a compute node, two OVS bridges, `br-int` and `br-bond`, are linked together. The two physical NICs are bonded into `br-bond`, which also carries the host's management IP. All VM/instance ports attach to `br-int`. Traffic between instances on different subnets traverses `br-int` → `br-bond` → physical NIC → switch → router and back, totaling 18 hops (versus 24 hops in the legacy OpenStack model).
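The bridge layout above can be reproduced by hand with a few OVS commands; a sketch with illustrative NIC and port names (in practice the Neutron OVS agent manages most of this):

```sh
# Bridge carrying the bonded physical NICs (names illustrative)
ovs-vsctl add-br br-bond
ovs-vsctl add-bond br-bond bond0 eth0 eth1

# Integration bridge where instance ports attach
ovs-vsctl add-br br-int

# Link the two bridges with an OVS patch-port pair
ovs-vsctl add-port br-int patch-to-bond \
    -- set interface patch-to-bond type=patch options:peer=patch-to-int
ovs-vsctl add-port br-bond patch-to-int \
    -- set interface patch-to-int type=patch options:peer=patch-to-bond
```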
1.5 Summary
Advantages
Simplified architecture by removing L3 agent, DHCP, floating IP and security groups, reducing operational cost for teams new to OpenStack.
Shorter host‑internal paths improve latency.
Hardware‑based gateway yields better performance.
Routable instance IPs simplify monitoring and tracing.
Disadvantages
Security groups disabled, reducing host‑level firewall protection (partially compensated by external hardware firewalls).
Network provisioning still required manual configuration on core switches, introducing operational risk.
2 SDN‑Based Large L2 Network
2.1 Requirements
As the network grew, the three‑tier architecture became a scaling bottleneck, VLAN broadcast storms grew more frequent, and 1 Gbps NICs limited throughput. New business needs also demanded multi‑tenant VPC support and automated network provisioning.
2.2 Solution: OpenStack + SDN
We designed a second‑generation solution that combines OpenStack with a custom SDN controller (Ctrip Network Controller, CNC). The hardware topology switched to a spine‑leaf architecture, providing shorter three‑hop paths, better horizontal scalability and active‑active redundancy.
The data plane uses VxLAN, while the control plane relies on MP‑BGP EVPN. Each leaf acts as a distributed gateway, and VxLAN encapsulation/de‑encapsulation occurs at the leaf.
2.3 SDN Components
Custom SDN controller (CNC) integrates with Neutron via an ML2 plugin.
Port state machine extended to model both underlay and overlay.
New APIs added for CNC interaction.
2.4 Instance Creation Flow
Nova requests instance creation and selects a network.
Nova‑compute asks Neutron to create a port.
Neutron creates the port and triggers CNC via the ML2 plugin.
CNC stores the port and synchronises configuration to the relevant leaf switches.
Nova‑compute attaches the virtual NIC to OVS; OVS agent installs the underlay flows.
CNC configures the overlay on the leaf, completing connectivity.
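Steps 2–4 of this flow can be sketched as a toy ML2 mechanism driver; `create_port_postcommit` is a real ML2 hook name, but the CNC client, the host‑to‑leaf mapping and all identifiers here are hypothetical stand‑ins:

```python
# Toy sketch of the port-binding flow: a hypothetical ML2 mechanism driver
# notifies a CNC client, which pushes config to the leaf pair serving the
# compute host. All names and the data model are illustrative.

class CncClient:
    """Stands in for the real Ctrip Network Controller API."""
    def __init__(self):
        self.leaf_config = {}          # leaf switch -> list of (port_id, vni)

    def sync_port(self, port_id, vni, leaves):
        for leaf in leaves:
            self.leaf_config.setdefault(leaf, []).append((port_id, vni))

class CncMechanismDriver:
    """Hypothetical Neutron ML2 mechanism driver."""
    # Static view of which leaf pair serves which host, for the sketch only.
    HOST_TO_LEAVES = {"compute-01": ("leaf-1a", "leaf-1b")}

    def __init__(self, cnc):
        self.cnc = cnc

    def create_port_postcommit(self, port):
        # Called by Neutron after the port is persisted (step 3 above);
        # step 4: CNC stores the port and syncs the relevant leaf switches.
        leaves = self.HOST_TO_LEAVES[port["binding:host_id"]]
        self.cnc.sync_port(port["id"], port["vni"], leaves)

cnc = CncClient()
driver = CncMechanismDriver(cnc)
driver.create_port_postcommit(
    {"id": "port-1", "vni": 10001, "binding:host_id": "compute-01"})
print(cnc.leaf_config["leaf-1a"])   # [('port-1', 10001)]
```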
2.5 Summary
The spine‑leaf hardware plus SDN software delivers lower latency, better fault tolerance, distributed gateways and a unified network for VMs, bare‑metal and containers.
3 Container and Hybrid‑Cloud Network
Starting in 2017 we introduced container platforms (Kubernetes, Mesos) on both private and public clouds. Containers bring massive scale, high churn and the need for IP stability during pod migration.
3.1 Private‑Cloud K8s Network
We extended the existing SDN solution to manage container networking via a Neutron CNI plugin. The plugin creates a veth pair, attaches it to OVS, and requests IP allocation from Neutron’s global IPAM, preserving the same IP across pod drifts.
Key Neutron changes:
Added a `label` attribute to networks; CNI can request an IP from any network sharing the same label.
Implemented bulk IP allocation, asynchronous APIs and performance optimisations.
Back‑ported features such as graceful OVS agent restart.
The container drift workflow updates the port's `host_id` in Neutron, prompting CNC to delete the old leaf configuration and apply a new one, keeping the IP unchanged.
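The drift flow can be illustrated with a small simulation; the CNC class, leaf mapping and identifiers are hypothetical stand‑ins, the point being that only `host_id` and the leaf‑side binding change while the IP stays fixed:

```python
# Toy sketch of container drift: updating a port's host_id re-homes the
# leaf-side config while the IP stays unchanged. Names are illustrative.

class Cnc:
    def __init__(self):
        self.leaf_ports = {}                    # leaf -> set of port ids

    def unbind(self, leaf, port_id):
        self.leaf_ports.get(leaf, set()).discard(port_id)

    def bind(self, leaf, port_id):
        self.leaf_ports.setdefault(leaf, set()).add(port_id)

HOST_TO_LEAF = {"node-a": "leaf-a", "node-b": "leaf-b"}  # sketch-only mapping

def drift_port(cnc, port, new_host):
    """Pod moved to new_host: delete old leaf config, apply new, keep IP."""
    cnc.unbind(HOST_TO_LEAF[port["host_id"]], port["id"])
    port["host_id"] = new_host
    cnc.bind(HOST_TO_LEAF[new_host], port["id"])
    return port

cnc = Cnc()
port = {"id": "port-9", "ip": "10.0.0.9", "host_id": "node-a"}
cnc.bind("leaf-a", "port-9")
drift_port(cnc, port, "node-b")
print(port["ip"], port["host_id"])   # 10.0.0.9 node-b
```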
Deployment scale includes 4 availability zones, >500 nodes per zone, >20 000 total instances, and up to 500 pods per node.
3.2 Public‑Cloud K8s
For overseas deployments we provisioned EC2 instances as K8s nodes and developed a CNI plugin that dynamically attaches ENIs (Elastic Network Interfaces) to containers, an approach inspired by Lyft and Netflix. A global IPAM manages VPC‑wide IP allocation, and the plugin also supports floating IP attachment.
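The core idea behind such ENI‑based plugins is a warm pool of pre‑attached VPC IPs, so pod startup does not block on cloud API calls; a toy sketch, where the `EniPool` name, sizes and addresses are illustrative rather than the real plugin's behavior:

```python
# Toy sketch of the ENI/secondary-IP warm-pool idea: grow capacity one ENI
# at a time and hand out pre-allocated VPC IPs to new pods.

class EniPool:
    def __init__(self, ips_per_eni=10):
        self.ips_per_eni = ips_per_eni
        self.free_ips = []                 # warm pool of secondary IPs
        self.enis = 0

    def _attach_eni(self):
        # Stands in for the cloud API calls a real plugin would make
        # (attach an ENI, assign a batch of private IPs to it).
        self.enis += 1
        self.free_ips += [f"10.0.{self.enis}.{i}"
                          for i in range(self.ips_per_eni)]

    def allocate(self):
        if not self.free_ips:              # pool empty: grow by one ENI
            self._attach_eni()
        return self.free_ips.pop(0)

pool = EniPool(ips_per_eni=2)
ips = [pool.allocate() for _ in range(3)]  # third IP forces a second ENI
print(ips, pool.enis)
```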
Our VPC topology spans Shanghai, Nanjing, Seoul, Moscow, Frankfurt, California, Hong Kong, Melbourne and other regions, with non‑overlapping IP ranges that become routable after dedicated inter‑connects.
4 Cloud‑Native Solution Exploration
Centralised Neutron IPAM becomes a performance bottleneck for high‑frequency container operations. We evaluated next‑generation solutions such as Calico and Cilium, ultimately focusing on Cilium for its eBPF‑based data plane.
4.1 Cilium Overview
Cilium leverages BPF/eBPF to implement L3‑L7 security policies, requiring Linux kernel 4.8+. It provides a CLI, an etcd‑backed policy repository, plugins for orchestrators, and an agent with local IPAM.
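The agent CLI mentioned above is useful for day‑to‑day inspection; a few commands from the Cilium agent CLI, run on a node where the agent is installed:

```sh
cilium status            # agent health, datapath mode, IPAM usage
cilium endpoint list     # local endpoints, their IPs and policy state
cilium policy get        # policies currently loaded into the agent
cilium monitor           # live view of datapath events (drops, traces)
```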
4.2 Host Networking
The agent creates a `cilium_host` <-> `cilium_net` veth pair; the first IP of the node's assigned CIDR becomes the gateway address on `cilium_host`. For each container the CNI plugin creates a veth pair, assigns an IP and installs BPF rules. Communication between containers on the same host uses the kernel's L2 forwarding plus BPF programs, while host‑to‑container traffic passes through the veth pair and BPF filtering.
4.3 Multi‑Host Networking
Cilium supports either VxLAN tunnels (via a `cilium_vxlan` device) or BGP direct routing. VxLAN offers simplicity but lower performance and non‑routable pod IPs, while BGP provides higher throughput and routable IPs at the cost of additional routing infrastructure.
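In deployment terms the choice is a configuration switch; the keys below are real `cilium-config` ConfigMap options, though the BGP side (e.g. running bird or kube‑router to advertise per‑node pod CIDRs) is deployment‑specific:

```yaml
# Fragment of the cilium-config ConfigMap. Pick one mode:

# (a) VxLAN overlay -- no fabric changes needed, lower throughput:
tunnel: "vxlan"

# (b) Native/BGP routing -- routable pod IPs, but the network must carry
#     per-node pod CIDR routes:
# tunnel: "disabled"
# auto-direct-node-routes: "true"
```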
4.4 Pros & Cons
Pros
Native L4‑L7 security policies expressed in Kubernetes YAML.
O(1) policy propagation, far faster than iptables‑based solutions.
High‑performance data plane (veth, IPVLAN).
Dual‑stack IPv4/IPv6 support.
Can run on top of Flannel for connectivity.
Active open‑source community backed by a commercial company.
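As an example of the first point, an L7 rule restricting one service to HTTP GETs from another looks like this; a sketch using the real `CiliumNetworkPolicy` CRD schema, with all labels, names and paths illustrative:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-get-orders        # illustrative name
spec:
  endpointSelector:
    matchLabels:
      app: order-service        # the workload being protected
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: web-frontend       # only this caller is allowed
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http:                   # L7: only GET /orders... passes
        - method: "GET"
          path: "/orders.*"
```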
Cons
Requires Linux kernel 4.8+ (preferably 4.14+).
Relatively new; few large‑scale production case studies.
Higher operational complexity: developers need C/BPF expertise to customise the data plane.
References
OpenStack Doc: Networking Concepts
Cisco Data Center Spine‑and‑Leaf Architecture: Design Overview
ovs‑vswitchd: Fix high CPU utilization when acquire idle lock fails
Open vSwitch port mirroring only mirrors egress traffic
Lyft CNI plugin
Netflix: run container at scale
Cilium Project
Cilium Cheat Sheet
Cilium Code Walk Through: CNI Create Network
Amazon EKS – Managed Kubernetes Service
Cilium: API Aware Networking & Network Security for Microservices using BPF & XDP
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.