Automated Network Failure Detection and Intelligent Switching System at Qunar
This article describes Qunar's automated network outage detection and intelligent traffic switching system, detailing the problem background, solution architecture, component functions, workflow, optimization steps, and future plans for more precise, multi‑level failover handling.
Background
When a data‑center's outbound network link fails, every service hosted there becomes unreachable and alarms fire all at once. The question is what operators can do in that situation.
Problem Solved
The goal is to detect such failures automatically and quickly, then switch traffic to redundant data‑centers using existing software and systems. Qunar's switching system serves as the example.
System Overview
The system detects outbound failures, automatically switches inbound (user) traffic by updating DNS records, and redirects outbound (service) traffic by changing proxy addresses, ensuring continuous access.
Intelligent Switching System
When IDC outbound anomalies are detected, the system automatically identifies the fault, switches traffic to redundant sites, and maintains service availability.
Inbound Traffic Switching
Users access services via DNS; upon detection of an outbound fault, the system modifies authoritative DNS to point to a backup data‑center.
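The inbound switch can be pictured as rewriting every record that points at the failed data‑center. The following is a minimal in-memory sketch, not Qunar's DNS manager: the `ARecord` type, the `SITE_ADDRESSES` table, the data‑center names, and the addresses (taken from documentation ranges) are all hypothetical, and a real system would push the result to its authoritative DNS through whatever interface that server exposes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ARecord:
    name: str      # DNS name users resolve
    idc: str       # data center currently serving this name
    address: str   # address published in the authoritative zone
    ttl: int = 60  # assumed short TTL so a switch propagates quickly


# Hypothetical table: the address each service has in each data center.
SITE_ADDRESSES = {
    "www.example.com": {"idc-a": "192.0.2.10", "idc-b": "198.51.100.10"},
}


def switch_inbound(records, failed_idc, backup_idc, site_addresses=SITE_ADDRESSES):
    """Return a new record list with records on failed_idc moved to backup_idc.

    Records with no known address in the backup data center are left
    untouched rather than being pointed at nothing.
    """
    updated = []
    for rec in records:
        backup_addr = site_addresses.get(rec.name, {}).get(backup_idc)
        if rec.idc == failed_idc and backup_addr is not None:
            rec = replace(rec, idc=backup_idc, address=backup_addr)
        updated.append(rec)
    return updated
```

Returning a new list instead of mutating in place keeps the previous record set available, which makes switching back after recovery straightforward.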
Outbound Traffic Switching
Internal services use proxy addresses; when a fault is detected, the proxy address is automatically updated to the backup site, redirecting outbound requests.
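The outbound side follows the same idea with a proxy address instead of a DNS record. The sketch below is an illustration under assumptions: the `ProxyManager` class here is a hypothetical stand-in for the article's proxy manager, and the proxy URLs and data‑center names are invented.

```python
class ProxyManager:
    """Tracks which data center's proxy internal services should use."""

    def __init__(self, proxies, active):
        if active not in proxies:
            raise ValueError(f"unknown data center: {active}")
        self._proxies = dict(proxies)  # data center -> proxy address
        self._active = active

    @property
    def address(self):
        """Proxy address that outbound requests should currently use."""
        return self._proxies[self._active]

    def on_fault(self, failed_idc):
        """Move to another data center if the active one has failed.

        Faults in a data center that is not active change nothing. If no
        alternative exists, the current address is kept.
        """
        if failed_idc == self._active:
            for idc in self._proxies:
                if idc != failed_idc:
                    self._active = idc
                    break
        return self.address
```

For example, with proxies registered for `idc-a` and `idc-b` and `idc-a` active, calling `on_fault("idc-a")` returns the `idc-b` proxy address, and later outbound requests read the new value from `address`.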
Requirements for Building the Switching System
Dynamic periodic testing
Effective aggregation and classification of detection data
Multi‑data‑center deployment for services
Comprehensive backend support
Component Description
Layer 1: Network detection using Smokeping alerts.
Layer 2: Data aggregation and analysis with an internal monitoring tool (Watcher).
Layer 3: Application switching layer, including the DNS manager and proxy manager, which performs the actual traffic redirection based on the analysis results.
System Workflow
Smokeping monitors nationwide points.
Abnormal packet loss or latency data is filtered.
Data is tagged and sent to Watcher.
Watcher classifies data by dimensions.
Network anomalies are identified after excluding host‑level issues.
Aggregated metrics per data‑center and ISP are calculated.
Automatic application-level switching is triggered.
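The steps above can be sketched as a small aggregation function. This is an illustration only: the `Sample` fields, the two thresholds, and the rule used to exclude host‑level issues (a single lossy host in a group is treated as a host problem, not a network problem) are assumptions, since the article does not give Watcher's actual rules.

```python
from collections import defaultdict
from typing import NamedTuple


class Sample(NamedTuple):
    probe: str       # monitoring point that took the measurement
    host: str        # target host inside the data center
    idc: str         # data center of the target
    isp: str         # ISP line the probe traffic came in on
    loss_pct: float  # measured packet loss, 0-100


LOSS_THRESHOLD = 10.0  # assumed: loss at or above this marks a sample abnormal
RATIO_THRESHOLD = 0.5  # assumed: share of abnormal hosts that triggers a switch


def failing_groups(samples, loss_threshold=LOSS_THRESHOLD,
                   ratio_threshold=RATIO_THRESHOLD):
    """Return {(idc, isp): abnormal-host ratio} for groups that should switch."""
    hosts = defaultdict(set)      # every host seen per (idc, isp)
    bad_hosts = defaultdict(set)  # hosts with abnormal loss per (idc, isp)
    for s in samples:
        key = (s.idc, s.isp)
        hosts[key].add(s.host)
        if s.loss_pct >= loss_threshold:
            bad_hosts[key].add(s.host)

    result = {}
    for key, all_hosts in hosts.items():
        bad = bad_hosts.get(key, set())
        if len(bad) < 2:
            continue  # zero or one bad host: treated as a host-level issue
        ratio = len(bad) / len(all_hosts)
        if ratio >= ratio_threshold:
            result[key] = ratio
    return result
```

Grouping by data‑center and ISP means a fault on one carrier's line can be handled without moving traffic that arrives over healthy lines.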
Visualization
Weathermap (a Cacti plugin) visualizes Smokeping detection results.
Component Analysis
This part defines how abnormal data is tagged, sets the packet‑loss thresholds (5% and 10%), and applies multi‑point detection when there are many monitoring points.
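A rough sketch of threshold tagging and multi‑point confirmation follows. The 5% and 10% values come from the article, but reading them as "warning" and "critical" levels is an assumption, as are the tag names, the function names, and the rule that at least half of the monitoring points must agree.

```python
def tag_loss(loss_pct, warn=5.0, critical=10.0):
    """Tag one packet-loss measurement (in percent) against two thresholds."""
    if not 0.0 <= loss_pct <= 100.0:
        raise ValueError(f"loss must be within 0-100, got {loss_pct}")
    if loss_pct >= critical:
        return "critical"
    if loss_pct >= warn:
        return "warning"
    return "normal"


def confirmed_by_points(tags_by_point, min_fraction=0.5):
    """Treat an anomaly as real only if enough monitoring points report it.

    tags_by_point maps a monitoring point name to the tag it produced.
    """
    if not tags_by_point:
        return False
    abnormal = sum(1 for tag in tags_by_point.values() if tag != "normal")
    return abnormal / len(tags_by_point) >= min_fraction
```

Requiring agreement across points keeps a problem local to one monitoring point from being mistaken for a data‑center fault.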
Optimization Process
Improvements include refining the ICMP anomaly detection thresholds, setting a 30‑second detection cycle, employing multi‑point detection, tuning the aggregation thresholds, and adding robust logging and pre‑alerting mechanisms.
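One way the detection cycle, logging, and pre‑alerting could fit together is sketched below. Only the 30‑second cycle comes from the article. The `CycleTracker` class and the policy of issuing a pre‑alert on the first abnormal cycle and switching after a set number of consecutive abnormal cycles are assumptions made for illustration.

```python
import logging

logger = logging.getLogger("switching")

CYCLE_SECONDS = 30  # detection cycle stated in the article


class CycleTracker:
    """Turns per-cycle detection results into ok / pre-alert / switch."""

    def __init__(self, confirm_cycles=2):
        self.confirm_cycles = confirm_cycles  # assumed confirmation count
        self.abnormal_streak = 0

    def observe(self, abnormal):
        """Record the result of one detection cycle and return the action."""
        if not abnormal:
            if self.abnormal_streak:
                logger.info("recovered after %d abnormal cycle(s)",
                            self.abnormal_streak)
            self.abnormal_streak = 0
            return "ok"

        self.abnormal_streak += 1
        if self.abnormal_streak >= self.confirm_cycles:
            logger.warning("anomaly confirmed over %d cycles, switching",
                           self.abnormal_streak)
            return "switch"
        logger.warning("pre-alert: abnormal cycle %d of %d",
                       self.abnormal_streak, self.confirm_cycles)
        return "pre-alert"
```

A scheduler would call `observe` once every `CYCLE_SECONDS` with the aggregated result for a data‑center. Under these assumptions, a switch happens no sooner than `confirm_cycles * CYCLE_SECONDS` seconds after a fault begins, which is the trade-off between reaction time and false switches.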
Future Plans
Three focus areas: more precise and broader detection covering more cities and ISPs; support for additional languages and service types; finer‑grained switching by province or region to reduce switch time and increase intelligence.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.