
Evolution and Architecture of the Hickwall Enterprise Monitoring Platform

The article details the background, challenges, multi‑year evolution, current architecture, and future roadmap of Hickwall, Ctrip's enterprise‑grade monitoring and observability platform, covering metrics, logs, traces, high‑cardinality handling, cloud‑native integration, alert governance, and storage engine migrations.

Ctrip Technology

Background – Modern observability relies on three pillars: Metrics, Tracing, and Logging. Logs capture arbitrary event data, Metrics are aggregated numeric series, and Traces represent structured call‑graphs. Observability extends traditional monitoring by providing white‑box insight into system behavior.

Problems Encountered – As Ctrip’s services grew, Hickwall faced high‑cardinality queries, lack of cloud‑native support, coarse data granularity, fragmented alert configurations, data latency, duplicate alerts, and metric explosion caused by large‑scale HPA deployments.

Major Evolutions

1. Cloud‑Native Monitoring – Upgraded the TSDB to VictoriaMetrics (fourth generation) with full Prometheus compatibility, added a custom Beacon container‑monitoring component tightly integrated with Kubernetes, and supported Prometheus SDK instrumentation.
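Supporting Prometheus SDK instrumentation ultimately means accepting the Prometheus text exposition format, which VictoriaMetrics can scrape natively. As a minimal sketch (the function name and label set are illustrative, not Hickwall's actual API), a single counter sample rendered in that format looks like this:

```python
def render_counter(name, help_text, value, labels=None):
    """Render one counter in Prometheus text exposition format:
    a # HELP line, a # TYPE line, then the sample itself."""
    label_str = ""
    if labels:
        # Label pairs are sorted for a stable, canonical series identity.
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} counter\n"
        f"{name}{label_str} {value}\n"
    )

print(render_counter("http_requests_total", "Total HTTP requests.",
                     42, {"method": "GET", "code": "200"}))
# Last line: http_requests_total{code="200",method="GET"} 42
```

In practice an application would use the official Prometheus client library rather than formatting by hand; the point is that any SDK emitting this format plugs into the same scrape path.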

2. High‑Cardinality Solutions – Introduced pre‑aggregation rules (166 rules, 209 generated metrics), implemented tag‑governance to block illegal writes, performed capacity planning, and allowed configurable ignore‑tags for volatile dimensions such as hostnames and IPs.
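The ignore-tags mechanism can be pictured as a filter applied before a point is written: volatile dimensions are dropped so that one logical series replaces thousands of per-host ones. A minimal sketch, assuming a simple dict-of-tags representation (the tag names in `IGNORE_TAGS` are illustrative):

```python
# Configured volatile dimensions to drop before ingestion (assumed names).
IGNORE_TAGS = {"hostname", "ip", "pod_name"}

def strip_volatile_tags(series_tags, ignore=IGNORE_TAGS):
    """Remove high-cardinality tags from a series' tag set, collapsing
    many per-instance series into one aggregate series."""
    return {k: v for k, v in series_tags.items() if k not in ignore}

tags = {"service": "order-api", "hostname": "host-1234", "ip": "10.0.0.7"}
print(strip_volatile_tags(tags))  # {'service': 'order-api'}
```

Pre-aggregation rules go one step further: instead of only dropping tags, they also combine the values of the collapsed series (sum, avg, etc.) into a new generated metric.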

3. Data Granularity Enhancement – Collected core system, order, and key application metrics at sub‑second resolution to meet the 1‑5‑10 incident‑response goal (discover a fault within 1 minute, locate it within 5, recover within 10).

4. Alert Middle‑Platform Integration – Developed a unified pull‑alert system handling over 100,000 rules and integrated with a central alert center for end‑to‑end alert governance.
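A pull-alert engine periodically queries the TSDB and evaluates each rule against the returned samples. One common guard against duplicate and flapping alerts is a "for" duration: the rule fires only after the condition holds for several consecutive evaluation cycles. A minimal sketch of that idea (the `AlertRule` shape and field names are assumptions, not Hickwall's schema):

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    name: str
    threshold: float
    for_cycles: int  # consecutive breaching samples required before firing

def evaluate(rule, recent_values):
    """Fire only when the last `for_cycles` samples all exceed the
    threshold, suppressing one-off spikes and alert flapping."""
    window = recent_values[-rule.for_cycles:]
    return len(window) == rule.for_cycles and all(
        v > rule.threshold for v in window
    )

rule = AlertRule("high_error_rate", threshold=0.05, for_cycles=3)
print(evaluate(rule, [0.02, 0.06, 0.07, 0.08]))  # True: last 3 samples breach
print(evaluate(rule, [0.06, 0.07, 0.02]))        # False: breach not sustained
```

At 100,000+ rules, the engine additionally has to shard rule evaluation across workers and batch its TSDB queries, which the pull model makes straightforward since each rule is just a scheduled query.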

5. Latency Reduction – Shortened the data path by writing directly from the gateway to the TSDB, eliminating intermediate Kafka consumption.

6. Time‑Series Storage Evolution – Progressed through four storage stages: ES + Graphite, InfluxDB + Incluster, ClickHouse + SQL, and finally VictoriaMetrics + PromQL, each improving write performance, query speed, and scalability.

Current Platform Status – The system now ingests tens of millions of data points per second, serves thousands of QPS, supports over 100,000 alert rules, maintains sub‑second query P99, stores data at 10‑second granularity for a year, and runs on a hundred‑plus node Kubernetes cluster.

Architecture Overview – Data flow: data → Proxy → TSDB; alert flow: data → Proxy → Trigger. Core components include Collector (SDK, agent, Beacon), Proxy (multi‑protocol ingestion, rate limiting), Trigger (pull‑alert engine), DownSample (configurable aggregation), Piper (notification service), Transfer (data export), Grafana portal (visualization), TSDB Cluster (VictoriaMetrics with vminsert/vmstorage/vmselect), ClickHouse Cluster (OLAP), and Hickwall Portal (one‑stop monitoring, logging, and alert governance UI).
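The DownSample component's job can be sketched in a few lines: group raw (timestamp, value) points into fixed-width time buckets and apply a configurable aggregation per bucket, which is how 10-second granularity is retained for a year without storing every raw point. The function below is an illustrative sketch, not the component's actual implementation:

```python
from collections import defaultdict

def downsample(points, interval=10, agg=lambda vs: sum(vs) / len(vs)):
    """Group (timestamp, value) points into fixed `interval`-second
    buckets and reduce each bucket with `agg` (average by default)."""
    buckets = defaultdict(list)
    for ts, v in points:
        buckets[ts - ts % interval].append(v)  # align to bucket start
    return sorted((ts, agg(vs)) for ts, vs in buckets.items())

points = [(0, 1.0), (3, 3.0), (9, 2.0), (12, 4.0)]
print(downsample(points))  # [(0, 2.0), (10, 4.0)]
```

Swapping `agg` for `max` or `min` gives the other common rollups; a real engine would also track sample counts so averages can be merged across further downsampling passes.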

Future Plans – Introduce metric tiering, expand eBPF‑based host observability, unify Logging‑Metrics‑Tracing pipelines, adopt OpenTelemetry standards, and enable edge‑agent pre‑aggregation for reduced upstream load.

Tags: monitoring, cloud-native, operations, observability, metrics, alerting, TSDB
Written by Ctrip Technology, the official Ctrip Technology account.