Real‑time Data Ingestion and Optimization with ClickHouse at ByteDance
This article details ByteDance's engineering practices for ingesting, storing, and querying massive real‑time recommendation and advertising data with ClickHouse. It covers the early external‑transaction mechanism, the risks of direct INSERTs, the design and evaluation of the Kafka Engine versus a Flink pipeline, and a series of performance and reliability improvements implemented to support high‑frequency workloads.
ByteDance's technical team shares how ClickHouse was first adopted for offline analytics (user behavior and agile BI) using an external transaction layer to guarantee consistency across shards, and why direct INSERT queries were later prohibited due to file‑system pressure and data loss risks.
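The file‑system pressure behind that prohibition comes from how MergeTree writes work: every INSERT creates a new on‑disk data part, so many small, frequent INSERTs flood the merge machinery. A common mitigation is to buffer rows and flush them as large blocks. The sketch below is illustrative only; the thresholds and the flush callback are hypothetical, not ByteDance's actual values.

```python
import time

class InsertBuffer:
    """Batch rows so each flush becomes one large INSERT (one new part)."""

    def __init__(self, flush, max_rows: int = 100_000, max_age_s: float = 5.0):
        self._flush = flush          # callable taking a list of rows
        self._max_rows = max_rows    # hypothetical size threshold
        self._max_age_s = max_age_s  # hypothetical age threshold
        self._rows: list = []
        self._since = time.monotonic()

    def add(self, row) -> None:
        self._rows.append(row)
        # Flush when the batch is big enough or old enough.
        if (len(self._rows) >= self._max_rows
                or time.monotonic() - self._since >= self._max_age_s):
            self.flush()

    def flush(self) -> None:
        if self._rows:
            self._flush(self._rows)  # one big INSERT instead of many tiny ones
        self._rows = []
        self._since = time.monotonic()
```

With `max_rows=3`, feeding seven rows produces two full batches and one remainder after a final flush, rather than seven separate inserts.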
The team then describes the need for real‑time recommendation metrics, outlining business requirements such as debugging, high‑dimensional data, experiment‑ID filtering, and AUC calculations, and compares two ingestion architectures: a Flink‑to‑JDBC pipeline and ClickHouse's built‑in Kafka Engine.
After evaluating the trade‑offs, the team chose the Kafka Engine solution. The design rests on three cooperating objects: a target MergeTree table for storage, a Kafka engine table that consumes Kafka partitions directly inside ClickHouse, and a materialized view that moves each consumed block from the Kafka table into the MergeTree table.
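The three‑object pattern can be sketched as DDL. The table names, columns, broker address, and topic below are hypothetical placeholders, not ByteDance's actual schemas; the sketch only shows how the objects wire together.

```python
def kafka_engine_ddl(db: str = "metrics") -> list[str]:
    """Return the three DDL statements that wire Kafka into a MergeTree table."""
    # 1. Target MergeTree table: where the data is actually stored.
    storage = f"""
        CREATE TABLE {db}.events (
            event_date Date,
            experiment_id UInt32,
            score Float64
        ) ENGINE = MergeTree
        ORDER BY (event_date, experiment_id)"""
    # 2. Kafka engine table: consumes partitions inside ClickHouse.
    consumer = f"""
        CREATE TABLE {db}.events_kafka (
            event_date Date,
            experiment_id UInt32,
            score Float64
        ) ENGINE = Kafka
        SETTINGS kafka_broker_list = 'broker:9092',
                 kafka_topic_list = 'events',
                 kafka_group_name = 'ch_events',
                 kafka_format = 'JSONEachRow'"""
    # 3. Materialized view: moves each consumed block into storage.
    view = f"""
        CREATE MATERIALIZED VIEW {db}.events_mv TO {db}.events
        AS SELECT event_date, experiment_id, score FROM {db}.events_kafka"""
    return [s.strip() for s in (storage, consumer, view)]
```

Reading from the Kafka table directly would advance consumer offsets, so queries go against the MergeTree table; the materialized view is the only reader of the Kafka table.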
Subsequent improvements include asynchronous index construction to boost write throughput, multi‑threaded Kafka consumption to achieve near‑linear scaling, and enhanced fault‑tolerance using ZooKeeper‑based leader election for replica‑aware consumption.
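The replica‑aware consumption idea can be modeled simply: each replica registers a sequential ephemeral node for a partition, and the replica holding the lowest sequence number consumes it; if that replica's session drops, its node disappears and the next replica takes over. The class below simulates that election in memory, purely as an illustration of the mechanism; a real deployment would run against an actual ZooKeeper ensemble.

```python
class ElectionBoard:
    """In-memory stand-in for a ZooKeeper path of sequential ephemeral nodes."""

    def __init__(self) -> None:
        self._seq = 0
        self._nodes: dict[str, int] = {}  # replica name -> sequence number

    def register(self, replica: str) -> int:
        # Each registration gets the next sequence number, like a
        # ZooKeeper sequential node.
        self._seq += 1
        self._nodes[replica] = self._seq
        return self._nodes[replica]

    def leader(self) -> str:
        # The lowest sequence number wins, mirroring ZooKeeper leader election.
        return min(self._nodes, key=self._nodes.get)

    def drop(self, replica: str) -> None:
        # Session loss deletes the ephemeral node; leadership moves on.
        self._nodes.pop(replica, None)
```

First to register leads; when it drops, the next replica in sequence becomes the consumer, so exactly one replica consumes each partition at a time.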
Additional engineering work introduced a platform for easy creation and management of Kafka→ClickHouse tasks, diagnostic system tables (system.kafka_log, system.kafka_tables), and operational SQL extensions for start/stop/restart and schema evolution without rebuilding tables.
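The operational surface might look like the sketch below. The exact syntax of ByteDance's SQL extensions is not public, so the `SYSTEM … CONSUME` wording and the `system.kafka_log` column names here are assumptions for illustration; only the system table names come from the article.

```python
def kafka_task_sql(action: str, table: str) -> str:
    """Build an operational statement for a Kafka->ClickHouse ingestion task."""
    actions = {
        # Hypothetical start/stop/restart extensions for a consumption task.
        "start": f"SYSTEM START CONSUME {table}",
        "stop": f"SYSTEM STOP CONSUME {table}",
        "restart": f"SYSTEM RESTART CONSUME {table}",
        # Diagnostics via the system tables the article mentions
        # (column names assumed).
        "errors": (
            "SELECT * FROM system.kafka_log "
            f"WHERE table = '{table}' AND level = 'Error'"
        ),
    }
    if action not in actions:
        raise ValueError(f"unknown action: {action}")
    return actions[action]
```

Stopping consumption before an ALTER, then restarting, is what allows schema evolution without dropping and rebuilding the three underlying objects.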
Future plans focus on implementing distributed transactions, read/write separation with WAL and buffer layers, and eventually exposing direct INSERT capabilities while maintaining stability and consistency.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.