
How eBay Scales Its Event Platform with ClickHouse and Kubernetes

This article details eBay's event platform architecture, explaining why a dedicated event system is needed, how ClickHouse provides high‑performance storage, the use of Kubernetes CRDs for cross‑region high availability, data routing, read/write separation, and query optimizations with LogQL.


Background

Before introducing the event platform itself, it helps to know that eBay's monitoring platform is built on four signal types: metrics, logs, traces, and events. Multi-dimensional analysis, alerting, and anomaly detection are layered on these signals, which in turn power solutions such as BCD, Groot, and Exemplar for root-cause analysis and rapid issue localization.

What Is an Event?

Events are non-periodic. They can be user-generated (deployments, scaling operations, configuration changes) or system-generated (alerts, access logs), and they often carry arbitrary key-value pairs with high cardinality — which makes metric-based solutions, designed around fixed label sets, unsuitable.
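As an illustration, a deployment event might look like the sketch below (the field names are hypothetical, not eBay's actual schema):

```json
{
  "timestamp": "2024-05-01T12:34:56Z",
  "type": "deployment",
  "source": "user",
  "attributes": {
    "service": "checkout-api",
    "version": "v2.113.0",
    "operator": "alice",
    "pool": "prod-east"
  }
}
```

Keys like `version` or `operator` can take effectively unbounded values, which is exactly the high-cardinality property that breaks metric-oriented storage.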

Event Platform

The platform ingests 200 billion events per day, handles over 5 million queries daily, and runs on more than 400 ClickHouse nodes with over 1 PB of storage.

ClickHouse was chosen for its column‑store architecture, high compression (10‑100×) using LowCardinality, Delta encoding, LZ4/ZSTD, and vectorized columnar computation that maximizes CPU cache hits and SIMD utilization.
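As a sketch of how these codecs combine in practice (a hypothetical schema, not eBay's actual DDL), ClickHouse lets each column declare its own encoding chain:

```sql
CREATE TABLE events
(
    ts      DateTime        CODEC(Delta, ZSTD),  -- delta-encode timestamps, then ZSTD-compress
    type    LowCardinality(String),              -- dictionary-encode repetitive values
    source  LowCardinality(String),
    message String          CODEC(ZSTD(3))       -- higher ZSTD level for bulky text
)
ENGINE = MergeTree
ORDER BY (type, ts);
```

Sorting by `(type, ts)` keeps similar rows adjacent on disk, which is what lets the column codecs reach the 10-100× compression ratios mentioned above.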

It also supports runtime code generation, vertical and horizontal query parallelism, and plans for per‑shard multi‑replica parallel processing.

The platform is fully containerized and orchestrated with Kubernetes, using custom resources (CRDs) such as FCHI and CHI to manage cross-region ClickHouse clusters. OTel-compatible data models are used for event collection, and both SQL and LogQL are provided for querying, integrating with Grafana and Prometheus.
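The open-source clickhouse-operator's `ClickHouseInstallation` (CHI) resource gives a flavor of this approach; the minimal sketch below uses illustrative names and sizing, while FCHI is eBay's own federation layer on top for the cross-region case:

```yaml
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "events-region-a"
spec:
  configuration:
    clusters:
      - name: "events"
        layout:
          shardsCount: 4      # horizontal partitions of the data
          replicasCount: 2    # copies of each shard for availability
```

Declaring cluster topology this way lets Kubernetes reconcile node failures and scaling changes instead of operators doing it by hand.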

Data routing is handled via WISB ("what it should be," the expected routing) and WIRI ("what it really is," the actual routing) records, enabling namespace-based virtual ClickHouse resources and lightweight migration using virtual clusters and distributed tables.
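Lightweight migration via distributed tables builds on ClickHouse's standard `Distributed` engine, which fans queries out across a named cluster; a minimal sketch (table and cluster names are illustrative):

```sql
-- local shard table, created on each node
CREATE TABLE events_local
(
    ts   DateTime,
    type String,
    body String
)
ENGINE = MergeTree
ORDER BY ts;

-- routing-level table: fans reads and writes out across the 'events' cluster
CREATE TABLE events_all AS events_local
ENGINE = Distributed('events', 'default', 'events_local', rand());
```

Because clients only ever touch the distributed table, the underlying cluster definition can be repointed during a migration without changing application queries.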

Read/write separation is achieved through the `readWriteMod` parameter in FCHI, which creates separate virtual sub-clusters for reads and writes.
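Since FCHI is eBay-internal, its exact schema is not public; a purely hypothetical fragment to illustrate the idea might look like:

```yaml
# Hypothetical FCHI fragment -- field names and values are illustrative,
# not eBay's actual spec; only readWriteMod is named in the source.
spec:
  readWriteMod: separate   # split into read and write virtual sub-clusters
```

The payoff of this split is that heavy analytical reads cannot starve the ingestion path, and each sub-cluster can be scaled independently.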

Typical Case

eBay migrated a service-mesh monitoring workload from Elasticsearch to ClickHouse, reducing storage to 30% of the original while extending retention from 9 to 30 days and improving anomaly-detection query performance tenfold.

Future Outlook

The platform currently supports only fully structured data; future work includes adding support for semi-structured and unstructured data using a Map-based free schema and ClickHouse's JSON column type, as well as optimizing cross-region aggregation to reduce network traffic.
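ClickHouse's `Map` column type already supports this direction; a sketch with an illustrative schema:

```sql
-- Map-based free schema: arbitrary key-value attributes per event.
-- Newer ClickHouse releases also offer a native JSON column type.
CREATE TABLE events_semi
(
    ts    DateTime,
    attrs Map(String, String)
)
ENGINE = MergeTree
ORDER BY ts;

-- filter on an individual key without declaring it up front
SELECT count()
FROM events_semi
WHERE attrs['region'] = 'us-east-1';
```

The trade-off is that keys buried in a `Map` lose per-column codecs and indexes, which is why the JSON column type is the longer-term target.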

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.
