Big Data · 26 min read

Building an Advertising Data Platform on ClickHouse: Architecture, Challenges, and Practices

This article details the design and implementation of an advertising data platform at eBay, explaining the business scenario, why ClickHouse was chosen over alternatives, the technical challenges faced, and the solutions involving lambda architecture, table engine choices, compression techniques, data ingestion pipelines, consistency guarantees, and deployment practices.

DataFunTalk

The talk introduces the eBay advertising business scenario, describing two ad models—cost-per-sale and cost-per-click—and the need for a robust reporting platform to monitor seller performance across various ad placements.

It then presents the overall system architecture, highlighting the use of a lambda architecture that combines offline Hadoop/Spark processing with real‑time Kafka/Flink ingestion, all feeding into ClickHouse for analytical queries.

Choosing ClickHouse over Druid is justified by superior columnar storage, compression, MPP query performance, simpler operations, and better handling of large‑scale data ingestion without the limitations of Druid’s time‑series design.

The article outlines major technical challenges: handling multi‑timezone reporting at massive scale, ingesting billions of daily events without overloading ClickHouse, meeting strict real‑time latency requirements, ensuring atomic offline‑online data updates, and managing versioned data and schema changes.
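To make the multi‑timezone requirement concrete, ClickHouse can roll a timestamp up into any reporting timezone at query time; the table and column names below are illustrative, not from the talk:

```sql
-- Roll up daily impressions in the seller's local timezone instead of UTC.
-- ad_events, event_time, impressions, seller_id are hypothetical names.
SELECT
    toDate(event_time, 'America/Los_Angeles') AS report_day,
    sum(impressions)                          AS daily_impressions
FROM ad_events
WHERE seller_id = 12345
GROUP BY report_day
ORDER BY report_day;
```

At billions of rows per day, computing this conversion on the fly for every supported timezone gets expensive, which is one reason the platform pre‑aggregates per‑timezone rollups offline rather than relying on query‑time conversion alone.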

Key design decisions include using ClickHouse's ReplicatedMergeTree for high‑availability storage, optimizing primary‑key and sorting‑key layouts for query efficiency, applying appropriate compression codecs (LZ4, LZ4HC, ZSTD, DoubleDelta, Gorilla) and the LowCardinality type to reduce the storage footprint, and employing ReplacingMergeTree for upsert‑style updates.
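A minimal sketch of how these choices combine in a single table definition; the schema, cluster macros, and codec assignments are illustrative assumptions, not the production DDL:

```sql
-- Hypothetical serving table: replicated storage, per-column codecs,
-- LowCardinality for low-distinct-value strings, and a version column
-- consumed by ReplacingMergeTree to collapse upserted rows at merge time.
CREATE TABLE ad_metrics
(
    event_date   Date,
    seller_id    UInt64,
    placement    LowCardinality(String),           -- few distinct values
    campaign_id  UInt64,
    impressions  UInt64  CODEC(DoubleDelta, LZ4),  -- slowly varying counters
    clicks       UInt64  CODEC(DoubleDelta, LZ4),
    spend        Float64 CODEC(Gorilla),           -- floating-point metric
    version      UInt64
)
ENGINE = ReplicatedReplacingMergeTree(
    '/clickhouse/tables/{shard}/ad_metrics', '{replica}', version)
PARTITION BY event_date
ORDER BY (seller_id, event_date, campaign_id);
```

The sorting key leads with `seller_id` because the dominant query pattern is a single seller's report over a date range, so the primary index prunes most granules before any column data is read.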

Data ingestion is split into real‑time pipelines (Kafka → Flink → ClickHouse via JDBC) and offline pipelines (Spark jobs scheduled via Livy, data written to HDFS, then imported into ClickHouse partitions), with careful partitioning by day and seller ID to support efficient queries.
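The partitioning scheme described above might look like the following sketch (names and the bucket count are assumptions); partitioning by day keeps offline replacement at whole-partition granularity, while a coarse seller bucket inside the partition key narrows the parts a per-seller query has to touch:

```sql
CREATE TABLE ad_metrics_local
(
    event_date  Date,
    seller_id   UInt64,
    impressions UInt64,
    clicks      UInt64
)
ENGINE = MergeTree
PARTITION BY (event_date, intHash64(seller_id) % 16)
ORDER BY (seller_id, event_date);

-- Real-time path: ClickHouse strongly prefers few large inserts over many
-- small ones, so the Flink job buffers rows and issues one bulk INSERT
-- per checkpoint interval rather than a statement per event.
INSERT INTO ad_metrics_local VALUES
    ('2023-06-01', 1001, 120, 3),
    ('2023-06-01', 1002,  80, 1);
```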

To guarantee consistency during large‑scale offline replacements, the system uses ClickHouse's partition‑level detach/attach operations, a version column resolved against an external dictionary, and PREWHERE filters that expose only the active version, while a periodic cleanup job removes obsolete versions.
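The same mechanics can be sketched in two statements; the table, dictionary, and partition names are hypothetical, and `REPLACE PARTITION` is shown here as the atomic form of the detach/attach swap:

```sql
-- Offline replacement: the regenerated day is imported into a staging
-- table, then swapped into the serving table in one atomic operation,
-- so readers never observe a half-replaced partition.
ALTER TABLE ad_metrics REPLACE PARTITION '2023-06-01' FROM ad_metrics_staging;

-- Reads filter on the active version via a dictionary lookup in PREWHERE,
-- so rows from superseded imports are discarded before the remaining
-- columns are read from disk.
SELECT seller_id, sum(impressions)
FROM ad_metrics
PREWHERE version = dictGetUInt64('active_versions', 'version',
                                 toUInt64(toYYYYMMDD(event_date)))
WHERE event_date = '2023-06-01'
GROUP BY seller_id;
```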

Operational issues such as Zookeeper overload from thousands of ClickHouse clients are mitigated by reducing global operations, splitting Zookeeper clusters per shard, and tuning operation timeouts.
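The per‑shard ZooKeeper split and timeout tuning are plain configuration changes; a fragment of ClickHouse's `config.xml` along these lines (hosts and values are illustrative) shows the shape:

```xml
<!-- Illustrative config.xml fragment: each shard points at its own
     ZooKeeper ensemble, with timeouts raised to tolerate bursty load. -->
<zookeeper>
    <node><host>zk-shard1-a</host><port>2181</port></node>
    <node><host>zk-shard1-b</host><port>2181</port></node>
    <node><host>zk-shard1-c</host><port>2181</port></node>
    <session_timeout_ms>30000</session_timeout_ms>
    <operation_timeout_ms>10000</operation_timeout_ms>
</zookeeper>
```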

Data quality is enforced at multiple stages: Spark output validation, ClickHouse import verification, and runtime monitoring dashboards that trigger alerts and version rollbacks when anomalies are detected.

Finally, the release process involves dual‑source deployment, mirroring queries to a new service instance, and comparing results to ensure correctness before cutting over traffic.

Tags: Performance Optimization · Advertising · Big Data · Data Platform · ClickHouse · Lambda Architecture · Data Ingestion
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
