
Design and Implementation of Ctrip Call Center's Active‑Active Architecture and Unified Login

The article details Ctrip's call‑center architecture evolution, describing the multi‑layer active‑active design, public access, application and client layers, unified login mechanisms, operational challenges, disaster‑recovery drills, and future plans for software‑only and mobile agents, illustrating practical SRE principles in a large‑scale telephony system.

Ctrip Technology

Shen Qiang, senior manager of Ctrip's Communication Technology Center, shares his extensive experience in building and operating a call‑center system that has evolved from a single‑site solution to a geographically distributed, active‑active architecture supporting over ten thousand agents.

Inspired by Google SRE, the presentation is divided into three parts: an overview of Ctrip's call‑center architecture, a description of the active‑active design across three layers, and the implementation of active‑active client access.

1. The Call Center at Ctrip

The core components are the PBX (voice and media processing), CTI (computer‑telephony integration, including IVR and recording), and CRM (order management). An architecture diagram shows how remote and home‑office agents are integrated and illustrates the shift from a single system to a multi‑site, redundant design.

Three major upgrades are highlighted: the 2007 building relocation (causing a two‑hour outage), the 2010 move to Nantong with the first active‑active redesign using SIP‑based modular routing, and the 2016 client‑side overhaul to achieve active‑active agent endpoints.

2. Active‑Active Architecture Overview

The system is divided into three layers:

Public Access Layer – works with telecom operators, employing dual‑site voice trunks, intelligent routing, percentage‑based or caller‑area routing, and SIP trunking to enable rapid capacity scaling and seamless failover.

Application Layer – mirrors typical web‑application routing; local clusters are preferred, with static routing rules that fall back to the remote cluster on failure.

Agent (Client) Access Layer – implements dual‑center connections, polling, and load‑balancing for the agent software.

The public access layer uses SIP trunks to expand line capacity quickly and to switch between primary and backup automatically without additional hardware, which reduces cost and eliminates single points of failure.
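The percentage‑based routing with failover described for this layer can be sketched as follows. This is a minimal illustration, not Ctrip's or any operator's implementation: the function `pick_site`, the site names `site_a` and `site_b`, and the 70/30 split are all assumptions made for the example.

```python
import hashlib

def pick_site(call_id: str, weights: dict, healthy: set) -> str:
    """Choose an inbound site by percentage, skipping sites marked unhealthy."""
    # Keep only healthy sites that have a positive share of traffic.
    candidates = {site: w for site, w in weights.items() if site in healthy and w > 0}
    if not candidates:
        raise RuntimeError("no healthy site available")
    total = sum(candidates.values())
    # A stable hash keeps a given call_id on the same site across retries.
    bucket = int(hashlib.sha256(call_id.encode("utf-8")).hexdigest(), 16) % total
    for site, weight in sorted(candidates.items()):
        if bucket < weight:
            return site
        bucket -= weight
    raise AssertionError("unreachable: bucket is always below total")

# Hypothetical 70/30 split between two sites.
WEIGHTS = {"site_a": 70, "site_b": 30}
print(pick_site("call-0001", WEIGHTS, {"site_b"}))  # site_a is down, prints site_b
```

When one site is removed from the healthy set, its share is redistributed to the remaining site, which is the behavior the primary/backup switching relies on.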

The application layer relies on static routing policies that prioritize local clusters but can route traffic to the remote site when needed, with full‑mapping core clusters deployed in both locations for true redundancy.
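A static "local first, remote on failure" rule of this kind might look like the sketch below. The `StaticRoute` type, the `resolve` function, and the cluster names are hypothetical; the source does not describe how the routing rules are actually stored or evaluated.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class StaticRoute:
    service: str
    local_cluster: str   # preferred cluster in the caller's own site
    remote_cluster: str  # fully mapped counterpart in the other site

def resolve(route: StaticRoute, is_up: Callable[[str], bool]) -> str:
    """Return the local cluster if it is up, otherwise the remote one."""
    for cluster in (route.local_cluster, route.remote_cluster):
        if is_up(cluster):
            return cluster
    raise RuntimeError(f"no cluster available for {route.service}")

# Hypothetical route for a CTI service with the local cluster down.
route = StaticRoute("cti", "cti.site_a", "cti.site_b")
down = {"cti.site_a"}
print(resolve(route, lambda cluster: cluster not in down))  # prints cti.site_b
```

Because the rule is static, the fallback only works if the remote site carries a full copy of every core cluster, which is why the full mapping matters.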

The client layer combines three techniques in the agent software: dual‑center connections (each agent registers to both sites), polling (the client automatically fails over to the second server when the first is unreachable), and load balancing.
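One way to combine polling and load balancing on the client side is sketched below, under the assumption that each agent has a numeric ID and a configured list of servers. The names `server_order`, `connect`, and `try_connect` are illustrative and do not come from the agent software itself.

```python
from typing import Callable, List, Optional

def server_order(agent_id: int, servers: List[str]) -> List[str]:
    """Rotate the server list by agent ID so agents spread across sites."""
    start = agent_id % len(servers)
    return servers[start:] + servers[:start]

def connect(agent_id: int, servers: List[str],
            try_connect: Callable[[str], bool]) -> Optional[str]:
    """Poll the servers in order and return the first one that accepts."""
    for server in server_order(agent_id, servers):
        if try_connect(server):
            return server
    return None  # every site refused; the caller decides how to retry

servers = ["cti.site_a", "cti.site_b"]
# Agent 7 would start at site_b; with site_b down it falls back to site_a.
print(connect(7, servers, lambda s: s != "cti.site_b"))  # prints cti.site_a
```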

3. Necessity of Active‑Active Agent Access

Real incidents (power failure, typhoon‑induced shutdown) demonstrated that without active‑active client access, a single‑site outage could halt thousands of agents for hours, causing severe business impact.

To address this, the team introduced IP phones with dual‑line capability, a unified login system, and dynamic virtual IDs to decouple agents from fixed operator numbers.

The unified login architecture links ITDB (device inventory), MAC‑to‑extension mapping, virtual agent IDs, and domain accounts, allowing an agent to log in from any location with a single credential while the system automatically assigns a suitable operator ID.

Resource configuration is centralized: virtual IDs are stored in a shared pool, IP‑phone MAC addresses are synchronized across sites, and extensions remain independent, enabling seamless cross‑site agent login.
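The relationship between domain accounts, the shared virtual ID pool, and the MAC‑to‑extension mapping can be illustrated with a small in‑memory model. Everything here is an assumption made for the example: the `UnifiedLogin` class, the `Session` fields, the sample IDs, and the rule that a re‑login keeps the same virtual ID.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Session:
    domain_account: str
    virtual_id: str   # drawn from the shared pool, not tied to a site
    extension: str    # resolved from the IP phone's MAC address

class UnifiedLogin:
    def __init__(self, virtual_ids, mac_to_extension):
        self._free_ids = deque(virtual_ids)
        self._mac_to_extension = dict(mac_to_extension)
        self._sessions = {}

    def login(self, domain_account: str, phone_mac: str) -> Session:
        extension = self._mac_to_extension.get(phone_mac)
        if extension is None:
            raise LookupError(f"unknown phone MAC: {phone_mac}")
        session = self._sessions.get(domain_account)
        if session is not None:
            # Assumed rule: logging in from another phone keeps the
            # virtual ID and only changes the extension.
            session.extension = extension
            return session
        if not self._free_ids:
            raise RuntimeError("virtual ID pool exhausted")
        session = Session(domain_account, self._free_ids.popleft(), extension)
        self._sessions[domain_account] = session
        return session

    def logout(self, domain_account: str) -> None:
        session = self._sessions.pop(domain_account, None)
        if session is not None:
            self._free_ids.append(session.virtual_id)  # return ID to the pool

svc = UnifiedLogin(["V1", "V2"], {"aa:01": "8001", "bb:01": "9001"})
print(svc.login("alice", "aa:01"))
```

The point of the model is that the agent's identity (the domain account) never references a site‑specific extension directly, so the same credential works from a phone in either site.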

A heartbeat‑driven monitoring and failover workflow coordinates the status of the client, CTI, PBX, and IP phone, double‑confirms each suspected failure to avoid false alarms, and then triggers an automatic cross‑site login that the agent does not notice.
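The double‑confirmation step can be sketched as a stale heartbeat followed by an independent probe. The source does not specify timeouts or what the second check consists of, so `HeartbeatMonitor`, `should_fail_over`, the `probe_alive` callback, and the 10‑second timeout are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class HeartbeatMonitor:
    timeout_s: float
    last_seen: float = 0.0

    def beat(self, now: float) -> None:
        """Record a heartbeat received at time `now` (seconds)."""
        self.last_seen = now

    def is_stale(self, now: float) -> bool:
        return now - self.last_seen > self.timeout_s

def should_fail_over(monitor: HeartbeatMonitor, now: float,
                     probe_alive: Callable[[], bool]) -> bool:
    """Fail over only when a stale heartbeat is confirmed by a second check."""
    if not monitor.is_stale(now):
        return False  # first check passed; no need to probe
    # Second confirmation: an independent probe of the same site.
    return not probe_alive()

monitor = HeartbeatMonitor(timeout_s=10.0)
monitor.beat(now=0.0)
print(should_fail_over(monitor, now=25.0, probe_alive=lambda: False))  # prints True
```

Requiring both signals means a single lost heartbeat, for example from a brief network glitch on the client side, does not move the agent to the other site.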

Technical Highlights

Automatic active‑active switch for online agents during failures.

Planned manual switch based on system, region, or skill‑group.

Support for 1,000+ concurrent agents with full failover within two minutes.

Drill Results

Regular disaster‑recovery drills revealed minor gaps that were subsequently fixed; performance testing confirmed that 1,000+ agents can be switched automatically in under two minutes.

Future Directions

Fully software‑based client to eliminate hardware phone constraints.

Mobile client enabling agents to work from any location, currently piloted for outbound calls.

The article concludes with a call for community engagement and provides links to related technical articles.
