Real‑time Data Ingestion and Optimization with ClickHouse at ByteDance
This article details ByteDance's engineering practices for ingesting, storing, and querying massive real‑time recommendation and advertising data with ClickHouse. It covers the early external‑transaction mechanism, the risks of direct INSERTs, the design and evaluation of the Kafka Engine versus a Flink pipeline, and a series of performance and reliability improvements implemented to support high‑frequency workloads.
ByteDance's technical team shares how ClickHouse was first adopted for offline analytics (user behavior and agile BI) using an external transaction layer to guarantee consistency across shards, and why direct INSERT queries were later prohibited due to file‑system pressure and data loss risks.
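The file‑system pressure behind that prohibition comes from how MergeTree writes work: every INSERT creates a new on‑disk data part, so many small, frequent INSERTs flood the merge machinery. A common mitigation is to buffer rows and flush them as large blocks. The sketch below is illustrative only; the thresholds and the flush callback are hypothetical, not ByteDance's actual values.

```python
import time

class InsertBuffer:
    """Batch rows so each flush becomes one large INSERT (one new part)."""

    def __init__(self, flush, max_rows: int = 100_000, max_age_s: float = 5.0):
        self._flush = flush          # callable taking a list of rows
        self._max_rows = max_rows    # hypothetical size threshold
        self._max_age_s = max_age_s  # hypothetical age threshold
        self._rows: list = []
        self._since = time.monotonic()

    def add(self, row) -> None:
        self._rows.append(row)
        # Flush when the batch is big enough or old enough.
        if (len(self._rows) >= self._max_rows
                or time.monotonic() - self._since >= self._max_age_s):
            self.flush()

    def flush(self) -> None:
        if self._rows:
            self._flush(self._rows)  # one big INSERT instead of many tiny ones
        self._rows = []
        self._since = time.monotonic()
```

With `max_rows=3`, feeding seven rows produces two full batches and one remainder after a final flush, rather than seven separate inserts.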
The team then describes the need for real‑time recommendation metrics, outlining business requirements such as debugging, high‑dimensional data, experiment‑ID filtering, and AUC calculations, and compares two ingestion architectures: a Flink‑to‑JDBC pipeline and ClickHouse's built‑in Kafka Engine.
After evaluating the trade‑offs, the team chose the Kafka Engine solution. The design rests on three cooperating objects: a target MergeTree table for storage, a Kafka engine table that consumes Kafka partitions directly inside ClickHouse, and a materialized view that moves each consumed block from the Kafka table into the MergeTree table.
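The three‑object pattern can be sketched as DDL. The table names, columns, broker address, and topic below are hypothetical placeholders, not ByteDance's actual schemas; the sketch only shows how the objects wire together.

```python
def kafka_engine_ddl(db: str = "metrics") -> list[str]:
    """Return the three DDL statements that wire Kafka into a MergeTree table."""
    # 1. Target MergeTree table: where the data is actually stored.
    storage = f"""
        CREATE TABLE {db}.events (
            event_date Date,
            experiment_id UInt32,
            score Float64
        ) ENGINE = MergeTree
        ORDER BY (event_date, experiment_id)"""
    # 2. Kafka engine table: consumes partitions inside ClickHouse.
    consumer = f"""
        CREATE TABLE {db}.events_kafka (
            event_date Date,
            experiment_id UInt32,
            score Float64
        ) ENGINE = Kafka
        SETTINGS kafka_broker_list = 'broker:9092',
                 kafka_topic_list = 'events',
                 kafka_group_name = 'ch_events',
                 kafka_format = 'JSONEachRow'"""
    # 3. Materialized view: moves each consumed block into storage.
    view = f"""
        CREATE MATERIALIZED VIEW {db}.events_mv TO {db}.events
        AS SELECT event_date, experiment_id, score FROM {db}.events_kafka"""
    return [s.strip() for s in (storage, consumer, view)]
```

Reading from the Kafka table directly would advance consumer offsets, so queries go against the MergeTree table; the materialized view is the only reader of the Kafka table.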
Subsequent improvements include asynchronous index construction to boost write throughput, multi‑threaded Kafka consumption to achieve near‑linear scaling, and enhanced fault‑tolerance using ZooKeeper‑based leader election for replica‑aware consumption.
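The replica‑aware consumption idea can be modeled simply: each replica registers a sequential ephemeral node for a partition, and the replica holding the lowest sequence number consumes it; if that replica's session drops, its node disappears and the next replica takes over. The class below simulates that election in memory, purely as an illustration of the mechanism; a real deployment would run against an actual ZooKeeper ensemble.

```python
class ElectionBoard:
    """In-memory stand-in for a ZooKeeper path of sequential ephemeral nodes."""

    def __init__(self) -> None:
        self._seq = 0
        self._nodes: dict[str, int] = {}  # replica name -> sequence number

    def register(self, replica: str) -> int:
        # Each registration gets the next sequence number, like a
        # ZooKeeper sequential node.
        self._seq += 1
        self._nodes[replica] = self._seq
        return self._nodes[replica]

    def leader(self) -> str:
        # The lowest sequence number wins, mirroring ZooKeeper leader election.
        return min(self._nodes, key=self._nodes.get)

    def drop(self, replica: str) -> None:
        # Session loss deletes the ephemeral node; leadership moves on.
        self._nodes.pop(replica, None)
```

First to register leads; when it drops, the next replica in sequence becomes the consumer, so exactly one replica consumes each partition at a time.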
Additional engineering work introduced a platform for easy creation and management of Kafka→ClickHouse tasks, diagnostic system tables (system.kafka_log, system.kafka_tables), and operational SQL extensions for start/stop/restart and schema evolution without rebuilding tables.
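The operational surface might look like the sketch below. The exact syntax of ByteDance's SQL extensions is not public, so the `SYSTEM … CONSUME` wording and the `system.kafka_log` column names here are assumptions for illustration; only the system table names come from the article.

```python
def kafka_task_sql(action: str, table: str) -> str:
    """Build an operational statement for a Kafka->ClickHouse ingestion task."""
    actions = {
        # Hypothetical start/stop/restart extensions for a consumption task.
        "start": f"SYSTEM START CONSUME {table}",
        "stop": f"SYSTEM STOP CONSUME {table}",
        "restart": f"SYSTEM RESTART CONSUME {table}",
        # Diagnostics via the system tables the article mentions
        # (column names assumed).
        "errors": (
            "SELECT * FROM system.kafka_log "
            f"WHERE table = '{table}' AND level = 'Error'"
        ),
    }
    if action not in actions:
        raise ValueError(f"unknown action: {action}")
    return actions[action]
```

Stopping consumption before an ALTER, then restarting, is what allows schema evolution without dropping and rebuilding the three underlying objects.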
Future plans focus on implementing distributed transactions, read/write separation with WAL and buffer layers, and eventually exposing direct INSERT capabilities while maintaining stability and consistency.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.