How Qunar Scaled Application Ops Automation from Hundreds to Tens of Thousands of Servers

This article details Qunar's journey of automating application operations, covering the evolution of their host‑management system, unified monitoring/alert platform, and data‑interchange mechanisms that enabled the company to grow from a few hundred to over ten thousand servers with a stable six‑person ops team.

Efficient Ops

Sep 25, 2017

Preface

In this article I share how application ops automation evolved at Qunar: the obstacles we faced along the way, the pitfalls we ran into, and how we solved them.

I joined Qunar in 2013 as a senior ops engineer and have been involved in ops development ever since. Our ops team works as full‑stack engineers, handling everything from PM to QA without separating front‑end and back‑end tasks.

The work mainly involves host management, application management, monitoring, and alarm platforms.

1. Qunar Application Ops Platform Overview

The platform consists of four parts:

Resource Management: hosts, images, files, object storage, network bandwidth, compute resources, etc.

Shared Middleware: log collection, configuration registration, monitoring/alert metrics, call tracing.

CI/CD: code management, build, test, and release pipelines.

Monitoring & Alerting: performance and business metrics collection, analysis, and alerting.

Qunar’s scale grew from dozens of machines to tens of thousands, prompting different solutions at each stage.

Four Evolution Stages

Stage 1 – Manual Ops: A small number of machines, ad‑hoc Linux commands, no scripts.

Stage 2 – Script‑Based Ops: Batch scripts for deployment and monitoring across hundreds of machines.

Stage 3 – Orchestrated Ops: We developed systems that chain scripts into discrete operations and introduced an “application tree” for hierarchical ownership and approval.

Stage 4 – Platform‑Level Automation: We designed a one‑stop service platform with data sharing, enabling automated host and account provisioning.

2. Three Key Points of the Ops Platform

2.1 Host Management

Our host management system is built on OpenStack (VM provisioning) and DNSDB (domain‑based host identification). We wrap scripts and tools into operations, assign permissions to them, and store their logs. The UI lets ops staff create and destroy hosts and view host details, and monitoring alerts are attached automatically.
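The idea of wrapping a script into a permission‑checked, logged operation can be sketched in a few lines of Python. This is an illustration only, not Qunar's code: the names `Operation`, `OperationRunner`, the role strings, and the `destroy_host` action are all hypothetical, and the action is a stand‑in for a real script or OpenStack call.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List, Set


@dataclass
class Operation:
    """A named action that wraps a script or tool (hypothetical model)."""
    name: str
    action: Callable[[str], str]  # takes a hostname, returns a result message
    allowed_roles: Set[str]


@dataclass
class OperationRunner:
    operations: Dict[str, Operation] = field(default_factory=dict)
    audit_log: List[dict] = field(default_factory=list)

    def register(self, op: Operation) -> None:
        self.operations[op.name] = op

    def run(self, op_name: str, user: str, role: str, host: str) -> str:
        op = self.operations[op_name]
        allowed = role in op.allowed_roles
        # Only execute the wrapped action when the role is permitted.
        result = op.action(host) if allowed else "denied"
        # Every attempt is logged, including denied ones.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "operation": op_name,
            "host": host,
            "allowed": allowed,
            "result": result,
        })
        if not allowed:
            raise PermissionError(f"{user} ({role}) may not run {op_name}")
        return result


runner = OperationRunner()
# The lambda stands in for a real script; hostnames are made up.
runner.register(Operation("destroy_host", lambda h: f"destroyed {h}", {"ops"}))
```

Calling `runner.run("destroy_host", "alice", "ops", "web1.example")` runs the action and records it, while the same call with a role outside `allowed_roles` raises `PermissionError` and still leaves an audit entry.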

To solve ownership tracking issues, we introduced an “application tree” where each node (BU → department → sub‑department → application) can bind hosts, owners, and approvers, ensuring accurate responsibility data.
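As a rough illustration, the application tree can be modeled as a node type that carries hosts, owners, and approvers. The sketch below is an assumption about the shape of the data, not the real schema: `TreeNode`, its field names, and the rule that a node without owners inherits them from its nearest ancestor are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class TreeNode:
    """One node of the application tree: BU, department, sub-department, or application."""
    name: str
    parent: Optional["TreeNode"] = field(default=None, repr=False)
    children: Dict[str, "TreeNode"] = field(default_factory=dict)
    hosts: List[str] = field(default_factory=list)
    owners: List[str] = field(default_factory=list)
    approvers: List[str] = field(default_factory=list)

    def add_child(self, name: str) -> "TreeNode":
        child = TreeNode(name=name, parent=self)
        self.children[name] = child
        return child

    def path(self) -> str:
        """Path from the root down to this node, e.g. 'bu/dept/app'."""
        parts = []
        node: Optional[TreeNode] = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))

    def effective_owners(self) -> List[str]:
        """Owners bound here, else those of the nearest ancestor (assumed rule)."""
        node: Optional[TreeNode] = self
        while node is not None:
            if node.owners:
                return node.owners
            node = node.parent
        return []

    def all_hosts(self) -> List[str]:
        """Hosts bound to this node and to everything below it."""
        hosts = list(self.hosts)
        for child in self.children.values():
            hosts.extend(child.all_hosts())
        return hosts


# Example tree with made-up names.
bu = TreeNode("flight")
dept = bu.add_child("booking")
dept.owners = ["alice"]
app = dept.add_child("search").add_child("search_api")
app.hosts = ["l-searchapi1.example"]
```

With this shape, "who is responsible for this host?" becomes a walk up the tree, and "which hosts does this department run?" becomes a walk down it.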

We also built a host‑request system and an account‑request system that leverage the application tree and an approval center, allowing developers to request resources without direct OPS involvement.
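A self‑service request flow of this kind might look like the following sketch. It assumes a simple three‑state request and a lookup table of approvers per application. The class names, the `approvers_by_app` mapping, and the generated hostnames are hypothetical, and `provision` is only a placeholder for the real provisioning step.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class HostRequest:
    requester: str
    app_path: str       # position of the application in the tree
    host_count: int
    status: Status = Status.PENDING
    history: List[str] = field(default_factory=list)


class ApprovalCenter:
    """Checks that a decision comes from an approver bound to the application."""

    def __init__(self, approvers_by_app: Dict[str, List[str]]) -> None:
        self.approvers_by_app = approvers_by_app

    def decide(self, request: HostRequest, approver: str, approve: bool) -> HostRequest:
        if request.status is not Status.PENDING:
            raise ValueError("request has already been decided")
        if approver not in self.approvers_by_app.get(request.app_path, []):
            raise PermissionError(f"{approver} cannot approve {request.app_path}")
        request.status = Status.APPROVED if approve else Status.REJECTED
        request.history.append(f"{approver}:{request.status.value}")
        return request


def provision(request: HostRequest) -> List[str]:
    """Placeholder for automated provisioning; returns made-up hostnames."""
    if request.status is not Status.APPROVED:
        raise ValueError("only approved requests can be provisioned")
    app = request.app_path.rsplit("/", 1)[-1]
    return [f"{app}-{i}" for i in range(1, request.host_count + 1)]


center = ApprovalCenter({"flight/booking/search_api": ["carol"]})
req = HostRequest(requester="dave", app_path="flight/booking/search_api", host_count=2)
center.decide(req, approver="carol", approve=True)
new_hosts = provision(req)
```

The point of the design is that the developer and the approver are the only people in the loop; no ops engineer has to touch the request by hand.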

2.2 Monitoring & Alerting

Qunar needed a reliable 24/7 monitoring solution. Early tools (Cacti, Nagios) were fragmented and non‑HA. We created a company‑wide platform called Watcher, built on Graphite, offering high availability, unified configuration, and a tree‑based permission model.

Watcher handles over 2,000,000 metrics and 40,000+ alerts, and monitors more than 4,000 hosts. It uses consistent hashing for metric routing and provides a single API for both infrastructure and business monitoring.
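For readers unfamiliar with consistent hashing, the sketch below shows the general technique of routing a metric name to one of several storage nodes. It is a generic hash ring, not Watcher's or Graphite's actual implementation: the class name, the use of SHA‑256, the number of virtual nodes, and the node and metric names are all assumptions.

```python
import bisect
import hashlib
from typing import List, Tuple


class HashRing:
    """Generic consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes: List[str], replicas: int = 100) -> None:
        ring: List[Tuple[int, str]] = []
        for node in nodes:
            # Place each node at many points on the ring to even out the load.
            for i in range(replicas):
                ring.append((self._hash(f"{node}:{i}"), node))
        ring.sort()
        self._ring = ring
        self._positions = [pos for pos, _ in ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, metric: str) -> str:
        """Return the node owning the first ring position after the metric's hash."""
        idx = bisect.bisect_right(self._positions, self._hash(metric))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["store-a", "store-b", "store-c"])
target = ring.node_for("s.flight.search_api.qps")
```

The property that matters for a metrics cluster is that adding or removing a storage node only moves the metrics that belonged to that node; everything else keeps its routing.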

2.3 Data Interconnection

We introduced a unique Appcode to identify any application (web service, GPU instance, MySQL, switch, etc.). Appcode is hierarchical (e.g., BU_department_application) and serves as a common key across systems.
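A small parser makes the hierarchical form concrete. The sketch assumes exactly three underscore‑separated segments, as in the BU_department_application example, and assumes segments are lowercase letters and digits; the real Appcode rules may differ, and the `Appcode` class and sample value are hypothetical.

```python
import re
from dataclasses import dataclass

# Assumed segment rule: lowercase letters and digits only.
_SEGMENT = re.compile(r"^[a-z0-9]+$")


@dataclass(frozen=True)
class Appcode:
    bu: str
    department: str
    application: str

    @classmethod
    def parse(cls, raw: str) -> "Appcode":
        parts = raw.split("_")
        if len(parts) != 3 or not all(_SEGMENT.match(p) for p in parts):
            raise ValueError(f"not a valid Appcode: {raw!r}")
        return cls(*parts)

    def __str__(self) -> str:
        return "_".join((self.bu, self.department, self.application))


code = Appcode.parse("flight_booking_searchapi")
```

Because the dataclass is frozen and hashable, a parsed `Appcode` can be used directly as a dictionary key when joining records from different systems.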

By embedding Appcode in host, storage, compute, monitoring, billing, and CI/CD systems, we achieved data sharing, enabling:

Accurate, shared ownership information.

One‑stop portal for all application‑related operations.

Automatic propagation of host expansions, account creation, whitelist configuration, and billing calculations.
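One way to picture this propagation is a small publish/subscribe loop in which every downstream system reacts to a "host added" event keyed by Appcode. This is a simplified sketch under that assumption, not a description of how Qunar's systems actually exchange data; the `EventBus` class, the event name, the handlers, and the sample Appcode and hostname are hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Set

Handler = Callable[[str, str], None]  # (appcode, hostname)


class EventBus:
    """Minimal in-process publish/subscribe bus (illustrative only)."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event: str, handler: Handler) -> None:
        self._handlers[event].append(handler)

    def publish(self, event: str, appcode: str, host: str) -> None:
        for handler in self._handlers[event]:
            handler(appcode, host)


# Stand-ins for the state of three downstream systems, all keyed by Appcode.
monitored: DefaultDict[str, Set[str]] = defaultdict(set)
whitelist: DefaultDict[str, Set[str]] = defaultdict(set)
billed_hosts: DefaultDict[str, int] = defaultdict(int)


def attach_monitoring(appcode: str, host: str) -> None:
    monitored[appcode].add(host)


def update_whitelist(appcode: str, host: str) -> None:
    whitelist[appcode].add(host)


def count_for_billing(appcode: str, host: str) -> None:
    billed_hosts[appcode] += 1


bus = EventBus()
for handler in (attach_monitoring, update_whitelist, count_for_billing):
    bus.subscribe("host_added", handler)

# A single expansion event updates every subscribed system.
bus.publish("host_added", "flight_booking_searchapi", "l-searchapi1.example")
```

The shared key is what makes this work: each system stores its own data, but all of them agree on which application a host belongs to.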

3. Summary

Qunar’s ops automation journey shows that as the number of applications grows, a unified platform dramatically reduces manual effort. With a team of six engineers, we now manage tens of thousands of servers, with a robot acting as the seventh member. The monitoring and alerting system keeps failures low (about two to three per day), and the Appcode‑driven data sharing creates a virtuous cycle of accuracy and efficiency.
