How Qunar Built an Automated Hardware Operations Platform to Boost Efficiency
This article details Qunar's end‑to‑end hardware automation system, covering background challenges, lifecycle management, automated testing, data collection, fault detection, and visualized monitoring, and explains how the integrated platform reduces manual effort, improves reliability, and cuts operational costs.
Preface
I am pleased to share Qunar's experience in hardware operations automation. The talk is divided into four parts: background overview, work description, specific implementation, and summary review.
1. Background Overview
Our hardware scope ranges from every rack in the data‑center to each individual device, including servers, network switches, routers, etc. The hardware can be grouped into four categories.
Drilling down to a single device such as a server, we also track its components: CPU, memory, power supplies, fans, and other peripherals. Before automation, Qunar faced several pain points: a single engineer had to manage tens of thousands of servers; operations such as rack‑mounting, migration, and provisioning were labor‑intensive; hardware quality was uncontrolled; fault handling was slow; and manual SSH access posed security risks.
To address these pain points we aimed for automation and intelligence, with four goals: ensure operational safety, guarantee hardware quality, improve efficiency, and ultimately reduce costs.
2. Work Description
The core concept is the hardware lifecycle, which covers five stages: selection, procurement, rack‑mounting, operation, and decommissioning.
We perform targeted work for each stage: selection testing, arrival inspection, monitoring and alerting, and disposal handling.
3. Specific Implementation
Suppliers provide reference data that often overstates performance; our tests reveal the real gain is much lower. Therefore we focus on cost‑performance and choose configurations that match our actual workload.
We standardize BIOS, RAID, and OS settings to obtain peak performance, and we score each hardware configuration on CPU, memory, and I/O metrics.
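The scoring step can be illustrated with a minimal sketch. The article does not say how the CPU, memory, and I/O results are combined, so the weights, the `Candidate` fields, and the normalization against a reference machine below are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical weights: the article does not say how the three dimensions are combined.
WEIGHTS = {"cpu": 0.4, "memory": 0.3, "io": 0.3}


@dataclass
class Candidate:
    name: str
    cpu: float     # benchmark result relative to a reference machine (1.0 = parity)
    memory: float
    io: float
    price: float   # unit price, in any consistent currency


def performance_score(c: Candidate) -> float:
    """Weighted sum of the normalized CPU, memory, and I/O results."""
    return WEIGHTS["cpu"] * c.cpu + WEIGHTS["memory"] * c.memory + WEIGHTS["io"] * c.io


def rank_by_cost_performance(candidates: list[Candidate]) -> list[Candidate]:
    """Order candidates by score per unit price, best value first."""
    return sorted(candidates, key=lambda c: performance_score(c) / c.price, reverse=True)
```

Under these assumed weights, a configuration that benchmarks moderately higher but costs twice as much ranks below the cheaper one, which is the cost‑performance trade‑off described above.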
When bulk shipments arrive, we encounter five common issues: missing components, defective parts, batch‑level defects, configuration mismatches, and damage during transport. Our platform verifies that the delivered configuration matches the tested baseline, performance meets standards, and no faults exist before deployment.
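An arrival check of this kind might look like the sketch below. The field names, the dictionary representation, and the 5% performance tolerance are assumptions for illustration; the article does not describe the platform's actual data model or thresholds.

```python
def config_mismatches(baseline: dict, delivered: dict) -> list[str]:
    """Compare a delivered machine against the tested baseline.

    Returns human-readable problems; an empty list means the configuration matches.
    """
    problems = []
    for key, expected in baseline.items():
        if key not in delivered:
            problems.append(f"missing: {key}")
        elif delivered[key] != expected:
            problems.append(f"mismatch: {key} expected {expected!r}, got {delivered[key]!r}")
    return problems


def performance_ok(baseline_score: float, measured_score: float, tolerance: float = 0.05) -> bool:
    """True if the measured score is no more than `tolerance` (a fraction) below the baseline."""
    return measured_score >= baseline_score * (1 - tolerance)
```

A machine would be released for deployment only when `config_mismatches` returns an empty list and `performance_ok` is true; anything else goes back to the supplier or to manual review.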
Data collection uses a hybrid approach: an agent on each machine pushes locally collected data to the backend, while remote agents retrieve metrics that the local agents cannot reach. Together they cover the varied data needs of data‑center operations: hardware inventory, performance baselines, fault records, and time‑series system metrics.
We maintain an internal CMDB that stores hardware configuration, status (online, under repair, faulty), and metadata such as rack location, serial numbers, RAID and SSD details. A second system, Watcher, aggregates real‑time time‑series metrics from servers, containers, databases, and cloud services, providing both monitoring and alert configuration.
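As a rough illustration of the kind of record such a CMDB holds, the sketch below models the three statuses and the metadata named above. The class and field names are hypothetical and do not reflect Qunar's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    ONLINE = "online"
    UNDER_REPAIR = "under repair"
    FAULTY = "faulty"


@dataclass
class HardwareRecord:
    serial_number: str
    rack_location: str
    status: Status
    details: dict = field(default_factory=dict)  # e.g. RAID and SSD details


def needing_attention(records: list[HardwareRecord]) -> list[str]:
    """Serial numbers of every device that is not currently online."""
    return [r.serial_number for r in records if r.status is not Status.ONLINE]
```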
Automation includes rule‑based formatting of raw hardware data into a unified schema, enabling consistent downstream processing regardless of vendor or batch variations.
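One way to express such rules is a table that maps each unified field to the vendor‑specific key names it may appear under, plus a converter. This is a sketch only: the key names, the two example fields, and the converters are hypothetical, not the platform's real rule set.

```python
import re

_UNIT_TO_GB = {"MB": 1 / 1024, "GB": 1.0, "TB": 1024.0}


def size_to_gb(value: str) -> float:
    """Convert strings such as '262144 MB' or '256 GB' to a number of gigabytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(MB|GB|TB)\s*", value, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized size: {value!r}")
    return float(match.group(1)) * _UNIT_TO_GB[match.group(2).upper()]


def clean_serial(value: str) -> str:
    """Trim whitespace and upper-case a serial number."""
    return value.strip().upper()


# Unified field -> (vendor-specific key names, converter). All names are hypothetical.
RULES = {
    "serial_number": (("SN", "Serial Number", "SerialNumber"), clean_serial),
    "memory_gb": (("TotalMemory", "Memory Size"), size_to_gb),
}


def normalize(raw: dict) -> dict:
    """Map one vendor's raw key/value output onto the unified schema."""
    unified = {}
    for field_name, (aliases, convert) in RULES.items():
        # Take the first alias present in the raw data; None if the vendor omits the field.
        value = next((raw[a] for a in aliases if a in raw), None)
        unified[field_name] = convert(value) if value is not None else None
    return unified
```

Supporting a new vendor or batch then means adding an alias or a converter to the table rather than changing downstream code.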
Fault handling is streamlined: alarms are classified into Critical, Warning, and OK; critical alerts trigger automated log collection, formatted email generation, and ticket creation. The system tracks repair progress and ensures closure.
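The critical‑alert path can be sketched as follows. The log collector, mailer, and ticketing calls are passed in as plain functions because the article does not name the systems involved; their signatures and the report layout are assumptions.

```python
from enum import Enum
from typing import Callable, Optional


class Severity(Enum):
    OK = "OK"
    WARNING = "Warning"
    CRITICAL = "Critical"


def format_report(host: str, message: str, logs: str) -> str:
    """Build the body shared by the notification email and the repair ticket."""
    return f"Host: {host}\nFault: {message}\n\nCollected logs:\n{logs}"


def handle_alarm(
    host: str,
    severity: Severity,
    message: str,
    collect_logs: Callable[[str], str],
    send_email: Callable[[str, str], None],
    create_ticket: Callable[[str, str], str],
) -> Optional[str]:
    """Run the automated steps for a critical alarm and return the ticket ID.

    Only the Critical path is described in the article, so other
    severities are left untouched here and return None.
    """
    if severity is not Severity.CRITICAL:
        return None
    report = format_report(host, message, collect_logs(host))
    send_email(f"[Critical] {host}: {message}", report)
    return create_ticket(host, report)
```

Returning the ticket ID gives the caller something to track until the repair is closed.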
Visualization tools display rack layouts, temperature, power consumption, and fault statistics, allowing operators to quickly assess the health of the data‑center.
4. Summary Review
By implementing this automated hardware operations platform, Qunar achieved unattended operation, integrated testing, and fault tracking, freeing engineers to focus on higher‑value tasks, improving reliability, reducing risk, and significantly lowering operational costs.