Open-falcon in Automotive Home: Application, Architecture, and Customizations
This article describes how the open‑falcon monitoring system is applied and customized at Automotive Home, covering its architecture, component roles, a comparison with other open‑source solutions, and the enhancements made for service‑tree based dynamic monitoring, alerting, self‑healing, and high‑availability deployment.
Open-falcon was originally developed at Xiaomi. The sections below explain where Automotive Home uses it and which improvements were made to meet the platform's needs.
Basic Monitoring Solution Comparison
Traditional monitoring tools like Zabbix no longer meet the performance and scalability requirements of fast-growing internet companies. Three open-source monitoring solutions (Zabbix, Prometheus, and Open-falcon) were compared on installation complexity, data collection support, data storage difficulty, and alarm support.
| Solution | Installation Complexity | Data Collection Support | Data Storage Difficulty | Alarm Support |
| --- | --- | --- | --- | --- |
| Zabbix | Medium | Low | High | High |
| Prometheus | Low | High | High | Medium |
| Open-falcon | High | Medium | Low | Medium |
The comparison shows that Open-falcon is neither the most feature-rich option nor the easiest to install, but it is rated lowest for data storage difficulty while offering adequate collection and alarm support, which makes it suitable for the company's scale and requirements.
Open-falcon Architecture Overview
Open-falcon is an open-source, highly available, and extensible monitoring system developed by Xiaomi's operations team. Its frontend and backend are decoupled: the backend is written in Go and the frontend in Python. An agent installed on each monitored machine pushes metrics to the Transfer component.
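To make the push model concrete, here is a minimal sketch of one data point in the JSON shape used by Open-falcon's push interface (endpoint, metric, timestamp, step, value, counterType, tags). The helper name `build_point` and the host name `web-01.example` are hypothetical and used only for illustration.

```python
import json
import time


def build_point(endpoint, metric, value, step=60, counter_type="GAUGE", tags="", ts=None):
    """Build one data point; build_point is a hypothetical helper, not part of Open-falcon."""
    return {
        "endpoint": endpoint,          # host or logical object being monitored
        "metric": metric,              # e.g. cpu.idle, load.1min
        "timestamp": int(ts if ts is not None else time.time()),
        "step": step,                  # reporting interval in seconds
        "value": value,
        "counterType": counter_type,   # GAUGE or COUNTER
        "tags": tags,                  # optional "k1=v1,k2=v2" string
    }


point = build_point("web-01.example", "cpu.idle", 87.5, ts=1700000000)
payload = json.dumps([point])  # points are submitted as a JSON array
```

In a default installation such a payload is typically POSTed to the local agent's HTTP push endpoint, which relays it to Transfer; the exact address depends on the deployment and is not stated in this article.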
Key components:
A) Agent: Collects metrics (e.g., cpu.idle, load.1min) every 60 seconds and pushes them to Transfer over a long-lived connection. Linux is supported natively, and Windows hosts are covered by a Windows-Agent released by Automotive Home.
B) Transfer: Receives data from agents, shards it by hash, and forwards it to Graph and Judge.
C) Judge: Evaluates incoming metrics against the configured strategies and expressions, and triggers alerts when they match.
D) Alarm: Persists alert events to MySQL, pushes them to Redis queues, and sends notifications asynchronously.
E) Graph: Stores time-series data in memory and in RRD files, and serves query requests for dashboards.
F) Query: Handles queries against the stored time-series data.
G) HBS (Heartbeat Server): Provides caching to accelerate data access for other systems.
H) Dashboard: User-facing interface for visualizing metrics and trends.
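The "shards it by hash" step in Transfer can be illustrated with a consistent-hash ring. The following is a minimal sketch under stated assumptions, not Open-falcon's actual Go implementation: the class `HashRing`, the helper `counter_key`, the node names, and the choice of SHA-1 with 100 virtual nodes are all illustrative.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, replicas=100):
        self._ring = []  # sorted list of (hash, node name)
        for node in nodes:
            for i in range(replicas):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)

    def get_node(self, key):
        # Pick the first virtual node clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]


def counter_key(endpoint, metric, tags=""):
    """Hypothetical key format: one key per endpoint/metric/tags combination."""
    return f"{endpoint}/{metric}/{tags}" if tags else f"{endpoint}/{metric}"


ring = HashRing(["graph-00", "graph-01", "graph-02"])
target = ring.get_node(counter_key("web-01.example", "cpu.idle"))
```

The benefit of this design is that the same counter always lands on the same Graph or Judge instance, and removing an instance only relocates the counters that instance owned.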
Customizations for Automotive Home
To integrate with the platform's CMDB, the dashboard and HBS components were rewritten to source monitoring objects from a service‑tree, enabling automatic binding of templates, inheritance of alert strategies, and reduction of manual configuration.
Monitoring templates are bound dynamically to service-tree nodes: a child node inherits its parent's templates but can still be configured independently, and a newly added node adopts the appropriate templates automatically.
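Template inheritance along the service tree can be sketched as a walk from the root to the node, with nodes closer to the leaf overriding their ancestors. The `Node` class, the node names, and the strategy strings are hypothetical; the article does not describe the real data model.

```python
class Node:
    """A hypothetical service-tree node that may carry its own templates."""

    def __init__(self, name, parent=None, templates=None):
        self.name = name
        self.parent = parent
        self.templates = dict(templates or {})  # template name -> {metric: strategy}

    def effective_templates(self):
        # Inherit everything from the ancestors, then apply this node's own overrides.
        inherited = self.parent.effective_templates() if self.parent else {}
        return {**inherited, **self.templates}


root = Node("company", templates={"base": {"cpu.idle": "all(#3) < 10"}})
service = Node(
    "search",
    parent=root,
    templates={"base": {"cpu.idle": "all(#3) < 5"}, "jvm": {"jvm.gc.time": "avg(#3) > 500"}},
)
new_host = Node("search-web-07", parent=service)  # no templates of its own
effective = new_host.effective_templates()
```

Here the new host needs no manual configuration: it picks up the service-level override of `base` and the `jvm` template from its parent.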
Alert targets are no longer limited to servers; they can be services, hosts, or container nodes, and subscription configurations can be set per metric to notify different stakeholders.
A self‑healing feature was added: when an alert fires, a predefined scenario (composed of Salt‑executed tasks, scripts, or callbacks) runs automatically, with the ability to cancel the scenario before execution.
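One way to model the cancel-before-execution behaviour is a grace period between the alert and the scenario run. The sketch below assumes such a delay and uses plain callables in place of Salt tasks, scripts, or callbacks; `SelfHealer`, the 60-second delay, and the sample metric are illustrative assumptions, not the platform's actual implementation.

```python
class SelfHealer:
    """Runs a scenario for an alert after a grace period unless it is cancelled."""

    def __init__(self, scenarios, delay=60):
        self.scenarios = scenarios  # metric -> list of callables (the scenario's tasks)
        self.delay = delay          # grace period in seconds (assumed)
        self.pending = {}           # alert id -> (due time, tasks)

    def on_alert(self, alert_id, metric, now):
        tasks = self.scenarios.get(metric)
        if tasks:
            self.pending[alert_id] = (now + self.delay, tasks)

    def cancel(self, alert_id):
        # Returns True if a pending scenario was actually cancelled.
        return self.pending.pop(alert_id, None) is not None

    def run_due(self, now):
        results = {}
        for alert_id, (due, tasks) in list(self.pending.items()):
            if now >= due:
                del self.pending[alert_id]
                results[alert_id] = [task() for task in tasks]
        return results


healer = SelfHealer({"df.bytes.used.percent": [lambda: "cleaned /tmp"]}, delay=60)
healer.on_alert("a1", "df.bytes.used.percent", now=0)
healer.on_alert("a2", "df.bytes.used.percent", now=0)
cancelled = healer.cancel("a2")   # an operator vetoes the second scenario
executed = healer.run_due(now=60)
```

Passing `now` explicitly keeps the sketch deterministic; a real service would drive `run_due` from a scheduler or timer.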
Alert components now support custom plugins, nodata handling, and multiple notification channels (DingTalk, SMS, phone) via internal notification interfaces.
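Nodata handling boils down to noticing that a counter has stopped reporting. A minimal sketch, assuming a counter is considered missing after three consecutive missed 60-second intervals (both numbers, the function name `find_nodata`, and the key format are assumptions):

```python
def find_nodata(last_seen, now, step=60, missed_steps=3):
    """Return the counters that have not reported for `missed_steps` intervals."""
    threshold = step * missed_steps
    return sorted(key for key, ts in last_seen.items() if now - ts >= threshold)


last_seen = {
    "web-01/agent.alive": 1000,  # reported 60 seconds ago
    "web-02/agent.alive": 880,   # silent for 180 seconds
}
stale = find_nodata(last_seen, now=1060)
```

The stale counters can then be fed into the normal alerting path, for example by injecting a placeholder value that a regular strategy matches.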
Global and service‑tree based alert silencing is implemented, automatically suppressing alerts for assets in non‑operational states (e.g., installation, decommissioning).
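The silencing rules described above can be expressed as a single predicate evaluated before a notification is sent. This is a sketch under assumptions: the state names, the slash-separated tree paths, and the function `is_silenced` are hypothetical.

```python
NON_OPERATIONAL = {"installing", "decommissioning"}  # assumed asset states


def is_silenced(node_path, asset_state, silenced_prefixes, global_silence=False):
    """Decide whether an alert for the given service-tree node should be suppressed."""
    if global_silence or asset_state in NON_OPERATIONAL:
        return True
    # Silencing a tree node also covers everything beneath it.
    return any(
        node_path == prefix or node_path.startswith(prefix + "/")
        for prefix in silenced_prefixes
    )


suppressed = is_silenced("company/search/web-01", "running", ["company/search"])
```

Matching on `prefix + "/"` avoids silencing a sibling such as `company/searchx` when only `company/search` is muted.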
High availability is achieved by deploying most components in active-active mode across data centers. Judge, Graph, and nodata instead rely on standby failover: when an instance fails, its standby takes over and Transfer and Query are reconfigured to point at it.
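The reconfiguration step can be pictured as swapping an address in a name-to-address map while keeping the logical node name stable, so that hash-based sharding is unaffected. The function `fail_over`, the node names, and the addresses are hypothetical; the article does not describe the actual mechanism in this detail.

```python
def fail_over(cluster, standbys, failed_name):
    """Return a new name -> address map with the failed node's standby swapped in."""
    if failed_name not in standbys:
        raise KeyError(f"no standby configured for {failed_name}")
    updated = dict(cluster)  # leave the original configuration untouched
    updated[failed_name] = standbys[failed_name]
    return updated


cluster = {"graph-00": "10.0.0.1:6070", "graph-01": "10.0.0.2:6070"}
standbys = {"graph-01": "10.0.1.2:6070"}
updated = fail_over(cluster, standbys, "graph-01")
```

The updated map would then be rendered into the Transfer and Query configuration and those components reloaded.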
Future Outlook
The current monitoring stack relies on open‑falcon‑v0.1, which has not been upgraded for years. Plans include migrating to Nightingale for better query performance and automatic fault isolation, as well as developing richer hardware monitoring agents in collaboration with server manufacturers.
HomeTech
HomeTech tech sharing