Improving Optical Transport Network Reliability at Ctrip: Architecture, Issue Analysis, and Optimization Strategies
This article describes Ctrip's optical transport network (TOTN) architecture, analyzes frequent fiber‑cut incidents and resulting device port flapping, presents technical research on fast optical switching and alarm delay, and details an optimization plan that achieved sub‑100 ms fault‑free switchover and stable Redis performance.
Background
An Optical Transport Network (OTN) uses fiber as the transmission medium. Combined with DWDM and protection switching, it provides high‑bandwidth, low‑latency, highly reliable data transfer, which makes it popular for inter‑data‑center connections. Ctrip operates its own OTN, called TOTN, for backbone traffic and office Internet access.
Because TOTN runs directly over carrier fibers, it frequently suffers fiber cuts caused by construction activities. Statistics show roughly one cut per 1,000 km per year in the US, over 50 cuts per year for China Telecom, and multiple daily cuts in India. Ctrip itself records about 20 fiber cuts annually, so an automatic switchover mechanism is essential to keep bandwidth unaffected.
Overall Architecture
TOTN adopts a dual‑plane protected design: each IDC deploys two independent sets of transmission equipment, each connected to a different fiber route, forming two completely independent transmission planes.
Under normal conditions traffic follows the primary path; when the primary fiber is cut, the system switches traffic to the backup path. The primary‑to‑backup switching time follows the ITU‑T G.783 and ITU‑T G.841 standards and is less than 50 ms.
Although this protection mechanism prevents bandwidth loss, it introduces a problem: during the switchover, network device ports experience flapping (down‑up transitions), causing error reports from latency‑sensitive services such as Redis.
Problem Analysis
The time a port takes to go from down back to up varies by device and optical module, and layer‑2/3 convergence after a flap can take seconds, leading to brief service interruptions. Real incidents on March 17 and September 11 showed Redis errors coinciding with fiber cuts and switchover events.
Industry practice often configures a link‑delay on network device interfaces so that the device waits before marking the link down after a loss of physical signal, which suppresses short flaps. However, many devices do not support this non‑standard feature, and a long delay (e.g., 2 s) has a cost: when a real fault occurs, the device also waits out the full delay before it reacts, wasting valuable recovery time.
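The trade-off of link‑delay can be illustrated with a minimal sketch. It is a simplified model, not any vendor's implementation: the `SignalLoss` type, the `reported_down_intervals` function, and the assumption that recovery is reported immediately are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SignalLoss:
    start_ms: float     # time at which the physical signal disappears
    duration_ms: float  # how long the signal stays lost


def reported_down_intervals(
    losses: List[SignalLoss], link_delay_ms: float
) -> List[Tuple[float, float]]:
    """Return the (down_at, up_at) intervals that upper layers would see.

    Assumed model: the interface is marked down only if the signal is
    still lost after link_delay_ms, and it is marked up again as soon as
    the signal returns. Losses are assumed not to overlap.
    """
    intervals = []
    for loss in losses:
        if loss.duration_ms > link_delay_ms:
            down_at = loss.start_ms + link_delay_ms
            up_at = loss.start_ms + loss.duration_ms
            intervals.append((down_at, up_at))
    return intervals


if __name__ == "__main__":
    events = [
        SignalLoss(start_ms=0, duration_ms=50),           # protection switchover
        SignalLoss(start_ms=10_000, duration_ms=30_000),  # real, lasting fault
    ]
    for delay in (0, 2_000):
        print(f"link-delay {delay} ms ->", reported_down_intervals(events, delay))
```

With no delay, both events take the port down. With a 2 s delay, the 50 ms switchover is hidden, but the lasting fault is only reported 2 s after it begins, which is the lost recovery time described above.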
Technical Research
In 2023 TOTN introduced a DCI product capable of 5 ms switchover. It achieves this with a magneto‑optic switch (based on the Faraday effect) and by pre‑loading the backup channel parameters into a DSP chip. Despite the fast optical switch, port flapping persisted: the optical layer still sent AIS signals to the electrical cards, which generated Local_Fault alarms that forced the ports down (IEEE 802.3ae). Delaying the AIS signal transmission (default 4 × 50 ms) lets the switchover complete before the alarm reaches the device, which prevents flapping.
Since the alarm‑delay mechanism is independent of the 5 ms optical switch, it can also be applied to legacy products to achieve “no‑perception” switching.
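The timing argument behind the alarm delay can be sketched as a simple comparison. This is an idealized model under stated assumptions: it treats the alarm as reaching the client port only if the switchover has not finished when the delay expires, and it ignores any signal re-lock time after the switch. The function names are hypothetical.

```python
def alarm_reaches_port(switchover_ms: float, alarm_delay_ms: float) -> bool:
    """True if the delayed alarm fires before the optical switchover completes.

    Assumed model: the fault alarm is suppressed for alarm_delay_ms and is
    cancelled if the signal has been restored by then.
    """
    return switchover_ms >= alarm_delay_ms


def safety_margin_ms(switchover_ms: float, alarm_delay_ms: float) -> float:
    """Time left between the end of the switchover and the alarm deadline."""
    return alarm_delay_ms - switchover_ms


if __name__ == "__main__":
    scenarios = [
        ("5 ms switch, no alarm delay", 5, 0),
        ("5 ms switch, 200 ms alarm delay", 5, 200),
        ("50 ms switch, 200 ms alarm delay", 50, 200),
        ("50 ms switch, 100 ms alarm delay", 50, 100),
    ]
    for label, switch_ms, delay_ms in scenarios:
        flaps = alarm_reaches_port(switch_ms, delay_ms)
        margin = safety_margin_ms(switch_ms, delay_ms)
        print(f"{label}: port flaps={flaps}, margin={margin} ms")
```

Under this model a faster switch alone does not help when the alarm is forwarded immediately, whereas a legacy switchover at the 50 ms bound still fits inside a 100 ms or 200 ms alarm delay.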
Optimization Plan
The solution is to change the 100 GE service mapping from BIT‑transparent to MAC‑transparent (the change itself briefly interrupts traffic) and to set the alarm delay to 200 ms. Laboratory tests showed identical throughput for both mappings, with negligible latency differences (BIT: 24 µs, MAC: up to 25 µs for 9600‑byte frames).
After a month of gradual (gray) rollout on transmission plane A (MAC‑transparent mapping + 200 ms alarm delay), tests on August 18 confirmed that fiber‑cut switchover no longer caused port flapping and that Redis remained error‑free. A real‑world fault on September 7 likewise produced no Redis error spikes.
Subsequently, on September 15, the same optimization was applied to transmission plane B with the alarm delay reduced further to 100 ms, and switchover again had no visible impact on Redis.
Future Plans
Ctrip will redefine its optical network equipment standards, requiring new OTN devices to support BIT‑transparent mapping with configurable alarm delay. The goal is to make this practice industry‑wide, complementing other reliability measures such as fault detection, performance monitoring, and fiber‑route identification.
Ctrip Technology
Official Ctrip Technology account, sharing and discussing growth.