How OWL Redefines Enterprise Monitoring with Dynamic Alerts and Scalable Architecture
This article introduces OWL, a distributed, enterprise‑grade monitoring solution that combines infrastructure and business metrics, offers floating alert rules, customizable dashboards, visual asset management, a resilient Golang‑based agent, and a parallel‑scalable HBase storage backend.
What is OWL?
OWL is a new distributed, enterprise‑level monitoring solution that can monitor both IT infrastructure resources and business‑level data. Collection plugins can be written in languages familiar to operations engineers, such as Python and Shell, and developers can add their own custom business metrics.
OWL Architecture Overview
The architecture consists of three main parts:
Left side: the backend, built around OpenTSDB and MySQL. MySQL stores dictionary tables and basic data, while OpenTSDB stores collected metric data.
Middle: the server layer, which receives data. It operates in two modes: single mode (receives data directly from agents) and proxy mode (forwards data transparently). The server maintains an in‑memory message queue that buffers incoming data and raises an alert when the queue grows too large.
Right side: the client side, written in Go, with plugins supporting Shell, Python, and Go. The agent uses a twin‑process structure to ensure high availability.
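The server's buffering behavior can be illustrated with a minimal sketch. This is not OWL's actual code: the Point type, the Buffer type, and the high‑water mark parameter are all hypothetical names chosen for illustration, and the real queue implementation is not described in this article.

```go
package main

import (
	"errors"
	"fmt"
)

// Point is a hypothetical metric sample; OWL's real wire format is not shown here.
type Point struct {
	Metric string
	Value  float64
}

// ErrQueueFull is returned when the buffer cannot accept more points.
var ErrQueueFull = errors.New("queue full")

// Buffer is a bounded in-memory queue with a high-water mark (an assumption).
type Buffer struct {
	ch        chan Point
	highWater int
}

// NewBuffer creates a queue holding up to capacity points.
func NewBuffer(capacity, highWater int) *Buffer {
	return &Buffer{ch: make(chan Point, capacity), highWater: highWater}
}

// Push enqueues without blocking. alert is true once the backlog reaches
// the high-water mark, which is when an operator would be notified.
func (b *Buffer) Push(p Point) (alert bool, err error) {
	select {
	case b.ch <- p:
	default:
		return true, ErrQueueFull
	}
	return len(b.ch) >= b.highWater, nil
}

// Pop dequeues one point for the storage writer; ok is false when empty.
func (b *Buffer) Pop() (p Point, ok bool) {
	select {
	case p = <-b.ch:
		return p, true
	default:
		return Point{}, false
	}
}

func main() {
	buf := NewBuffer(4, 3)
	for i := 0; i < 5; i++ {
		alert, err := buf.Push(Point{Metric: "cpu.load", Value: float64(i)})
		fmt.Println(i, alert, err)
	}
}
```

A bounded channel keeps memory use predictable; signaling at a high‑water mark below full capacity gives operators time to react before data is dropped.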
Key Features of OWL
1. Floating Alert Rules Based on Complex Algorithms
Unlike simple fixed‑threshold alerts, OWL supports floating alerts that automatically adjust thresholds after a breach, reducing unnecessary notifications. It also incorporates similarity algorithms to handle noisy metric data.
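One way to picture a floating threshold is the sketch below. The article does not describe OWL's actual adjustment algorithm, so the FloatingRule type, the Step margin, and the reset‑on‑recovery behavior are assumptions made purely for illustration.

```go
package main

import "fmt"

// FloatingRule is a hypothetical rule: after each breach the threshold is
// raised above the breaching value, so only a further rise notifies again.
type FloatingRule struct {
	Base      float64 // configured threshold
	Step      float64 // fractional margin added after a breach, e.g. 0.1 for 10%
	threshold float64 // current floating threshold
}

// NewFloatingRule starts with the floating threshold equal to the base.
func NewFloatingRule(base, step float64) *FloatingRule {
	return &FloatingRule{Base: base, Step: step, threshold: base}
}

// Check reports whether v should raise a notification and updates the threshold.
func (r *FloatingRule) Check(v float64) bool {
	if v > r.threshold {
		r.threshold = v * (1 + r.Step)
		return true
	}
	if v <= r.Base {
		r.threshold = r.Base // metric recovered: fall back to the base threshold
	}
	return false
}

func main() {
	rule := NewFloatingRule(80, 0.1)
	for _, v := range []float64{85, 88, 95, 70, 85} {
		fmt.Println(v, rule.Check(v))
	}
}
```

With a fixed threshold of 80, the values 85, 88, and 95 would produce three notifications in a row; in this sketch the 88 is suppressed because it falls within the margin set by the earlier breach.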
Similarity Algorithm
The system uses time‑based similarity measures to filter noisy data and generate more accurate alerts.
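The article does not name the similarity measure OWL uses. As an illustration only, the sketch below compares the current window of samples against a reference window (for example, the same period on the previous day) using Pearson correlation, a common choice for comparing the shape of two series. The function names and the minSimilarity cutoff are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// Pearson returns the correlation between two equal-length series, in [-1, 1].
// It returns 0 when the inputs are empty, of different length, or constant.
func Pearson(a, b []float64) float64 {
	if len(a) == 0 || len(a) != len(b) {
		return 0
	}
	n := float64(len(a))
	var sumA, sumB float64
	for i := range a {
		sumA += a[i]
		sumB += b[i]
	}
	meanA, meanB := sumA/n, sumB/n
	var cov, varA, varB float64
	for i := range a {
		da, db := a[i]-meanA, b[i]-meanB
		cov += da * db
		varA += da * da
		varB += db * db
	}
	if varA == 0 || varB == 0 {
		return 0
	}
	return cov / math.Sqrt(varA*varB)
}

// Anomalous reports whether the current window's shape has drifted from the
// reference window by more than the (assumed) similarity cutoff allows.
func Anomalous(current, reference []float64, minSimilarity float64) bool {
	return Pearson(current, reference) < minSimilarity
}

func main() {
	yesterday := []float64{10, 20, 30, 20, 10}
	today := []float64{11, 21, 29, 19, 12}
	fmt.Println(Pearson(today, yesterday), Anomalous(today, yesterday, 0.8))
}
```

Comparing against the metric's own history means that small point‑to‑point jitter does not trigger an alert as long as the overall pattern still matches.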
2. Flexible Customizable Dashboards
Each user can create a personalized dashboard showing only the metrics relevant to their role (e.g., network engineer, DBA, DevOps). Users select hosts or devices, choose metrics, and the web UI renders the charts with time‑range selection.
Select hosts/devices and metrics
View and interact with the generated charts
3. Visual Asset Management
OWL provides a simulated cabinet view that displays physical assets together with their monitoring status, making asset location and health instantly visible.
4. Resilient Agent Built with Go
The agent is OS‑agnostic, supports multiple plugins, and uses a twin‑process design with socket communication and heartbeat monitoring. If one process fails, a watchdog automatically restarts it, ensuring continuous data collection.
Watchdog restarts the agent if it crashes.
Agent maintains heartbeat and recovers state.
Agent executes plugins via Go’s exec library.
Plugin results are returned in a strict KV format.
Agent sends data to the server using a private protocol.
Strict data format validation reduces noise.
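The plugin execution and validation steps above can be sketched as follows. The exact KV format is not given in the article, so this example assumes one key=value pair per line with a numeric value; ParseKV and RunPlugin are hypothetical names, while exec.CommandContext is the standard os/exec call.

```go
package main

import (
	"bufio"
	"context"
	"fmt"
	"os/exec"
	"strconv"
	"strings"
	"time"
)

// ParseKV parses plugin output, assuming one "key=value" pair per line with a
// numeric value. Any malformed line rejects the whole result.
func ParseKV(out string) (map[string]float64, error) {
	metrics := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue // blank lines are ignored
		}
		parts := strings.SplitN(line, "=", 2)
		if len(parts) != 2 || strings.TrimSpace(parts[0]) == "" {
			return nil, fmt.Errorf("malformed line %q", line)
		}
		v, err := strconv.ParseFloat(strings.TrimSpace(parts[1]), 64)
		if err != nil {
			return nil, fmt.Errorf("non-numeric value in %q: %w", line, err)
		}
		metrics[strings.TrimSpace(parts[0])] = v
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	return metrics, nil
}

// RunPlugin executes a plugin with a timeout and validates its output.
func RunPlugin(path string, timeout time.Duration) (map[string]float64, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	out, err := exec.CommandContext(ctx, path).Output()
	if err != nil {
		return nil, err
	}
	return ParseKV(string(out))
}

func main() {
	m, err := ParseKV("cpu.idle=93.5\nmem.used=2048\n")
	fmt.Println(m, err)
}
```

Rejecting the whole result on the first malformed line is one strict policy; a real agent might instead skip bad lines and count them, depending on how much noise is tolerable.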
5. Parallel‑Scalable Storage Backend
After evaluating relational, KV, and time‑series databases, OWL chose HBase for its column‑oriented, horizontally scalable storage, with OpenTSDB as a wrapper on top of it. The system currently ingests around 30 million data points per day at roughly 20% server load.
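For scale, 30 million points per day averages out to roughly 347 points per second (30,000,000 / 86,400). The article does not say how the OWL server writes to OpenTSDB; as one possibility, the sketch below formats a data point as a line for OpenTSDB's telnet‑style put command (put, metric, Unix timestamp, value, then tag pairs). The FormatPut helper and the sample metric and tag names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// FormatPut renders one data point as an OpenTSDB telnet-style "put" line.
// Tags are sorted so the output is deterministic. OpenTSDB expects at least
// one tag per data point; this helper does not enforce that.
func FormatPut(metric string, ts int64, value float64, tags map[string]string) string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	b.WriteString("put ")
	b.WriteString(metric)
	b.WriteString(" ")
	b.WriteString(strconv.FormatInt(ts, 10))
	b.WriteString(" ")
	// 'f' with precision -1 avoids exponent notation for large values.
	b.WriteString(strconv.FormatFloat(value, 'f', -1, 64))
	for _, k := range keys {
		b.WriteString(" " + k + "=" + tags[k])
	}
	return b.String()
}

func main() {
	line := FormatPut("sys.cpu.user", 1356998400, 42.5,
		map[string]string{"host": "web01", "cpu": "0"})
	fmt.Println(line)
}
```

Because each point carries its own metric name, timestamp, and tags, writers can be added in parallel without coordination, which fits the horizontally scalable design described above.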
Open‑Source Plan
OWL is planned to be open‑sourced by the end of 2015. The timeline depends on company scheduling, and contributions from the DevOps community are welcomed.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on the transformation of operations work and aim to accompany and grow with you throughout your operations career.