
Evolution of Ctrip's Log System: From Elasticsearch to ClickHouse and Log 3.0

This article details the evolution of Ctrip's log infrastructure, describing the shift from fragmented departmental logging to a unified Elasticsearch-based platform, the migration to ClickHouse for cost‑effective, high‑performance storage, and the subsequent Log 3.0 redesign that leverages Kubernetes, sharding, and a unified query governance layer to handle petabyte‑scale data.

Ctrip Technology

The author, Dongyu, a senior cloud‑native R&D engineer, introduces the motivation behind redesigning Ctrip's massive log system.

Before 2012 each department collected logs independently, leading to inconsistent standards and high operational overhead; in 2012 Ctrip adopted an Elasticsearch‑based platform that unified ingestion, ETL, storage, and query, but rapid growth pushed data volume to the 4 PB level, causing OOM, latency, load imbalance, and rising costs.

In early 2020 the team replaced Elasticsearch with ClickHouse, cutting storage costs by 52% and scaling the platform to over 20 PB; by the end of 2021 the cluster comprised dozens of nodes.

Starting in 2022, a log‑unification strategy merged the CLOG and UBT services, targeting 30 PB+ of data. This exposed operational challenges (rapid growth in the number of clusters, difficult data migrations, and anomalies during table changes) that prompted the development of Log 3.0.

Log 3.0’s architecture consists of six modules: data ingestion, ETL, storage, query/display, metadata management, and cluster management.

Data ingestion is realized either via the internal TripLog framework sending logs to Kafka (Hermes protocol) or through Filebeat/Logagent/Logstash or custom programs that also push to Kafka.
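The custom-producer path above amounts to assembling a structured log envelope and publishing it to Kafka. A minimal sketch in Python follows; the field names (`appId`, `level`, `message`, `timestamp`) and the topic name are assumptions for illustration, since the actual TripLog/Hermes schema is internal to Ctrip:

```python
import json
import time

def build_log_message(app_id: str, level: str, message: str) -> bytes:
    """Assemble a log envelope for the ingestion pipeline.

    Field names here are illustrative, not the real internal schema.
    """
    envelope = {
        "appId": app_id,
        "level": level,
        "message": message,
        "timestamp": int(time.time() * 1000),  # epoch milliseconds
    }
    return json.dumps(envelope).encode("utf-8")

# Publishing would then be an ordinary Kafka produce, e.g. with kafka-python:
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="kafka:9092")
# producer.send("app-logs", build_log_message("order-svc", "INFO", "order created"))
```

Filebeat/Logstash users get the same result declaratively; the custom path simply trades configuration for code.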

The ETL layer uses GoHangout, an open‑source Logstash‑like tool that consumes Kafka messages, applies JSON parsing, Grok regexes, and timestamp conversion, and extracts fields such as num for downstream storage.
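The transformations GoHangout applies can be sketched as a single pipeline pass in Python. The access-log pattern and field names below are assumptions for illustration; in practice they would be expressed as Grok rules in GoHangout's configuration:

```python
import json
import re
from datetime import datetime

# Illustrative access-log pattern (a GoHangout deployment would
# declare this as a Grok rule rather than inline code).
ACCESS_RE = re.compile(r"(?P<ip>\S+) (?P<method>\S+) (?P<path>\S+) (?P<num>\d+)")

def etl(raw: str) -> dict:
    """Mimic one pipeline pass: JSON parse, regex field extraction,
    timestamp normalization, and numeric casting."""
    event = json.loads(raw)
    # extract structured fields out of the free-text message
    m = ACCESS_RE.match(event.get("message", ""))
    if m:
        event.update(m.groupdict())
    # normalize ISO-8601 "Z" timestamps to epoch millis for storage
    ts = event.pop("timestamp", None)
    if ts:
        dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        event["@timestamp"] = int(dt.timestamp() * 1000)
    # cast the extracted num field so downstream storage sees a number
    if "num" in event:
        event["num"] = int(event["num"])
    return event
```

Each consumed Kafka message goes through this shape of transformation before being written to storage.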

Initially Elasticsearch stored logs across Master, Coordinator, and Data nodes, handling index creation, request routing, and heavy data storage.

For visualization the team employed Kibana from the Elastic Stack, providing real‑time histograms, line charts, pie charts, and tables.

A metadata management platform defines each index/table as a “Scenario”, configuring TTL, ownership, permissions, ETL rules, and monitoring.
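A "Scenario" is essentially a metadata record binding a table to its operational policy. A minimal sketch of what such a record might hold, with field names that are assumptions for illustration (the real platform is internal):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Scenario:
    """One index/table registration in the metadata platform (sketch)."""
    name: str                                           # index/table name
    ttl_days: int = 30                                  # retention before expiry
    owners: List[str] = field(default_factory=list)     # ownership / permissions
    etl_rules: List[str] = field(default_factory=list)  # e.g. Grok patterns
    monitored: bool = True                              # ingestion-lag/error alerts

# registering a hypothetical scenario
order_logs = Scenario(name="order_service_logs", ttl_days=7, owners=["dongyu"])
```

Centralizing these knobs in one record is what lets TTL, permissions, and ETL rules be managed per table rather than per cluster.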

ClickHouse, an open‑source columnar OLAP database, offers columnar storage, vectorization, high compression, and massive throughput, making it suitable for PB‑scale log analytics.

The migration solution automated table creation, modified GoHangout, deployed ClickHouse clusters, and adapted Kibana dashboards, achieving >95% automation and a migration transparent to users.

Table design optimizations include double‑list tag storage, daily partitioning, a tokenbf_v1 Bloom‑filter index for term queries, a global increment ID (_log_increment_id) for pagination, and ZSTD compression saving over 40% of space.
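The pagination role of the increment ID is worth spelling out: deep OFFSET pagination forces the database to scan and discard all skipped rows, whereas a monotonic ID allows keyset pagination that resumes from the last row seen. A minimal sketch of the query shape, assuming the `_log_increment_id` column described above:

```python
def next_page_sql(table: str, last_id: int, page_size: int = 100) -> str:
    """Build a keyset-pagination query over _log_increment_id.

    Instead of OFFSET (which rescans every skipped row), resume from
    the largest id returned by the previous page.
    """
    return (
        f"SELECT * FROM {table} "
        f"WHERE _log_increment_id > {last_id} "
        f"ORDER BY _log_increment_id ASC "
        f"LIMIT {page_size}"
    )

# page 1: next_page_sql("logs.app", 0)
# page 2: next_page_sql("logs.app", <max id from page 1>)
```

Each page costs roughly the same regardless of how deep the user scrolls, which matters for interactive log browsing over PB-scale tables.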

ClickHouse clusters consist of stateless query nodes, multiple data shards each running two peer (multi‑master) replicas, and a ZooKeeper ensemble for metadata consistency.

Data presentation was extended with ClickHouse‑specific panels (chhistogram, chhits, chpercentiles, chranges, chstats, chtable, chterms, chuniq) and Grafana‑based SQL dashboards.

A dedicated ClickHouse operation UI supports shard management, node provisioning, weight adjustment, DDL handling, and monitoring/alerting.

Results include >95% migration automation, >50% storage savings, and query speeds 4‑30× faster than Elasticsearch (P90 < 300 ms, P99 < 1.5 s).

Log 3.0 introduced ClickHouse on Kubernetes using StatefulSets, anti‑affinity, and ConfigMaps, reducing cluster provisioning from days to minutes, and enabling smaller, more manageable clusters.
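A hedged sketch of what such a deployment might look like; the names, image tag, and ConfigMap are assumptions for illustration, not Ctrip's actual manifests. The StatefulSet gives each replica a stable identity and volume, while the required pod anti‑affinity keeps two ClickHouse replicas off the same host:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: clickhouse-shard0          # hypothetical name
spec:
  serviceName: clickhouse
  replicas: 2                      # two peer replicas per shard
  selector:
    matchLabels:
      app: clickhouse
  template:
    metadata:
      labels:
        app: clickhouse
    spec:
      affinity:
        podAntiAffinity:           # never co-locate two replicas on one node
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: clickhouse
              topologyKey: kubernetes.io/hostname
      containers:
        - name: clickhouse
          image: clickhouse/clickhouse-server:23.8   # illustrative tag
          volumeMounts:
            - name: config
              mountPath: /etc/clickhouse-server/config.d
      volumes:
        - name: config
          configMap:
            name: clickhouse-config   # cluster/sharding settings injected here
```

Because cluster topology lives in a ConfigMap and pods in a StatefulSet, spinning up a new small cluster becomes an apply operation rather than a multi-day provisioning task.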

The sharding design allows tables to span multiple clusters with time‑based partitioning, enabling cross‑cluster reads/writes and seamless sorting‑key changes without data loss.
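Cross-cluster reads over a time-sliced table reduce to an overlap test against the metadata described above. A minimal sketch, with cluster names and the metadata shape as assumptions for illustration:

```python
from datetime import datetime

# Hypothetical metadata: each entry maps one time slice of a logical
# table to the physical cluster that stores it.
TABLE_VERSIONS = [
    {"cluster": "ck-a", "start": datetime(2023, 1, 1), "end": datetime(2023, 6, 1)},
    {"cluster": "ck-b", "start": datetime(2023, 6, 1), "end": datetime.max},
]

def clusters_for_range(start, end, versions=TABLE_VERSIONS):
    """Return the clusters whose time slice overlaps [start, end)."""
    return [
        v["cluster"]
        for v in versions
        if v["start"] < end and start < v["end"]
    ]
```

A query spanning May–July 2023 fans out to both clusters and the results are merged; a sorting-key change simply opens a new time slice on a fresh cluster, so old data never needs rewriting.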

Metadata management tracks table versions, cluster assignments, and time ranges, while a unified query governance layer parses SQL with ANTLR4, rewrites queries, enforces QPS and scan limits, and routes traffic via a query proxy.
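The admission side of such a governance layer can be sketched as a per-table check applied before a query reaches ClickHouse. The class and thresholds below are assumptions for illustration, not Ctrip's implementation:

```python
import time

class QueryGovernor:
    """Per-table QPS throttle plus a max-scan guard (sketch).

    A real proxy would estimate scan size from partition pruning;
    here the caller supplies the estimate directly.
    """

    def __init__(self, qps_limit: int, max_scan_rows: int):
        self.qps_limit = qps_limit
        self.max_scan_rows = max_scan_rows
        self.window_start = time.monotonic()
        self.count = 0

    def admit(self, estimated_rows: int):
        # reject queries whose estimated scan exceeds the cap outright
        if estimated_rows > self.max_scan_rows:
            return False, "scan limit exceeded"
        # fixed one-second window for the QPS count
        now = time.monotonic()
        if now - self.window_start >= 1.0:
            self.window_start, self.count = now, 0
        if self.count >= self.qps_limit:
            return False, "qps limit exceeded"
        self.count += 1
        return True, "ok"
```

Rejected queries never reach the storage layer, which is what protects a shared cluster from one runaway dashboard.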

Future plans aim to refine query governance, add pre‑aggregation, AI‑driven alerts, hybrid‑cloud elasticity, and broader product adoption across Ctrip’s ecosystem.

Tags: cloud-native, Big Data, Elasticsearch, Kubernetes, ClickHouse, ETL, log system
Written by

Ctrip Technology

Official Ctrip Technology account, sharing and discussing growth.
