
Evolution of ByteHouse Real‑Time Ingestion: From Internal Demands to a Cloud‑Native Architecture

This article details the motivation, architectural evolution, and technical implementation of ByteHouse's real‑time ingestion pipeline, covering internal business requirements, distributed‑system challenges, the custom HaKafka engine, memory‑table optimizations, and the transition to a cloud‑native design that delivers high availability, low latency, and exactly‑once semantics.

DataFunTalk

ByteHouse, a cloud‑native data warehouse on Volcano Engine, provides ultra‑fast analytics for both real‑time and massive offline data, offering elastic scaling and enterprise‑grade features to support digital transformation.

The need for real‑time ingestion originated from internal ByteDance workloads where Kafka was the primary source; users demanded high throughput, stability, scalability, and sub‑second latency, prompting custom optimizations.

Adopting ClickHouse’s community‑native distributed architecture introduced three inherent pain points: node failures causing data loss, read‑write resource conflicts under load, and costly resharding during scaling.

To address these, ByteHouse first evaluated the community's high‑level consumption model with two‑level concurrency, but it could not satisfy advanced requirements such as key‑based sharding and deterministic partition distribution.

ByteHouse therefore built a custom consumption engine, HaKafka, which adds high availability (leader election via ZooKeeper) and a low‑level consumption mode that guarantees ordered, balanced partition assignment while retaining two‑level concurrency.
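The deterministic assignment in low‑level mode can be illustrated with a minimal Python sketch (the function name and round‑robin scheme are illustrative assumptions, not ByteHouse's actual code): partitions are sorted and dealt out round‑robin, so the mapping is ordered, balanced, and reproducible across restarts.

```python
def assign_partitions(partitions, num_consumers):
    """Deterministically assign Kafka partitions to consumers.

    Sorting makes the assignment reproducible across restarts; round-robin
    keeps it balanced: consumer i owns every num_consumers-th partition.
    """
    assignment = {i: [] for i in range(num_consumers)}
    for idx, partition in enumerate(sorted(partitions)):
        assignment[idx % num_consumers].append(partition)
    return assignment
```

Because the mapping depends only on the partition list and consumer count, every replica computes the same assignment without coordination.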

HaKafka also introduces a Memory Table that buffers incoming data in memory and flushes it in batches, reducing I/O pressure and improving ingestion speed by up to threefold, while still supporting query access.
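The Memory Table idea reduces many small writes to a few large ones. A minimal sketch, assuming a size‑based flush trigger and a caller‑supplied flush function (both hypothetical stand‑ins for the real engine's part writer):

```python
class MemoryTable:
    """Buffer incoming rows in memory and flush them to storage in batches.

    Buffered rows remain visible to queries before they are flushed,
    mirroring the article's point that the memory table still supports
    query access. `flush_fn` stands in for the actual on-disk part write.
    """

    def __init__(self, flush_threshold, flush_fn):
        self.flush_threshold = flush_threshold
        self.flush_fn = flush_fn
        self.buffer = []

    def insert(self, rows):
        self.buffer.extend(rows)
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)  # one large write instead of many small ones
            self.buffer = []
```

A real implementation would also flush on a timer and bound memory usage; only the batching mechanism is shown here.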

Recognizing the limitations of the distributed architecture, ByteHouse migrated to a cloud‑native stack (released as ByConity) in early 2021, comprising three layers: Cloud Service (Server + Catalog), Virtual Warehouse (execution layer with isolated resources for query and write workloads), and VFS (cloud storage such as HDFS, S3, etc.).

In the cloud‑native design, the Server only orchestrates tasks; a Manager schedules consumption jobs to Virtual Warehouses, which execute them under transactional guarantees. The consumption flow includes RPC‑based transaction creation, rdkafka polling, block‑to‑part conversion, dumping to VFS, and transaction commit, ensuring data visibility only after successful commit.
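The five steps of the consumption flow can be sketched as one cycle; the interfaces below (`txn_mgr`, `consumer`, `vfs`) are illustrative stand‑ins, not ByteHouse's actual RPC or storage APIs:

```python
def run_consume_cycle(txn_mgr, consumer, vfs):
    """One ingestion cycle: begin txn, poll, convert, dump, commit."""
    txn = txn_mgr.begin()                 # 1. create a transaction (RPC to Server)
    records = consumer.poll()             # 2. poll a batch (rdkafka in the real engine)
    if not records:
        txn_mgr.rollback(txn)             # nothing consumed: abandon the transaction
        return 0
    part = {"rows": records}              # 3. convert the in-memory block to a part
    vfs.write(txn, part)                  # 4. dump the part to cloud storage (VFS)
    txn_mgr.commit(txn, consumer.position())  # 5. commit part metadata + offsets together
    return len(records)
```

Because the part is written under the transaction and only the commit makes it visible, a crash at any earlier step leaves no observable data.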

Fault tolerance is achieved through bidirectional heartbeats between Manager and Tasks, rapid task replacement on failure, and automatic leader re‑election, providing seconds‑level recovery.
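The Manager's side of the heartbeat protocol amounts to tracking the last beat per task and treating silence beyond a timeout as failure. A minimal sketch with hypothetical names (timestamps are passed in explicitly to keep it deterministic):

```python
class HeartbeatMonitor:
    """Manager-side view of task liveness: a task that misses heartbeats
    for longer than `timeout` seconds is considered dead and would be
    rescheduled onto another Virtual Warehouse node."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_beat = {}

    def beat(self, task_id, now):
        self.last_beat[task_id] = now

    def dead_tasks(self, now):
        return [t for t, last in self.last_beat.items()
                if now - last > self.timeout]
```

A short timeout is what gives the seconds‑level recovery mentioned above; the heartbeat is bidirectional in the real system so that orphaned tasks can also detect a lost Manager.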

The system supports configurable parallelism up to the number of Kafka partitions, load‑balanced scheduling via a Resource Manager, and exactly‑once semantics enabled by transactional commits of both part metadata and Kafka offsets.
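Why the atomic commit yields exactly‑once can be shown with a small simulation (the `state` dict is a hypothetical stand‑in for the catalog): parts and the consumer offset change together or not at all, so a crash before commit simply replays the same batch without duplicates.

```python
def ingest(records, state, fail_before_commit=False):
    """Consume from the last committed offset and commit parts + offset
    atomically. A crash before commit leaves both unchanged, so replay
    after recovery produces each record exactly once."""
    start = state["offset"]
    batch = records[start:]
    if not batch:
        return
    new_parts = state["parts"] + [batch]
    if fail_before_commit:
        raise RuntimeError("crash before commit: nothing becomes visible")
    # atomic commit: both fields change together (or neither, via the raise)
    state["parts"], state["offset"] = new_parts, start + len(batch)
```

If offsets were committed to Kafka separately from the data, a crash between the two commits would produce either loss or duplication; committing both in one transaction closes that window.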

Beyond Memory Table, the cloud‑native stack adds a generic Memory Buffer and a Write‑Ahead Log (WAL) to decouple buffering from Kafka, allowing broader use cases such as Flink batch imports.
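The WAL is what lets the buffer serve sources that, unlike Kafka, cannot re‑serve data on demand. A minimal sketch, with an in‑memory list standing in for the durable log:

```python
class BufferWithWAL:
    """Memory buffer backed by a write-ahead log: every batch is appended
    to the WAL before it enters the buffer, so buffered-but-unflushed data
    can be replayed after a crash even when the source cannot re-serve it."""

    def __init__(self, wal):
        self.wal = wal      # durable append-only log (a plain list stands in here)
        self.buffer = []

    def write(self, batch):
        self.wal.append(batch)   # durability first, then buffer
        self.buffer.extend(batch)

    @classmethod
    def recover(cls, wal):
        """Rebuild the in-memory buffer by replaying the surviving WAL."""
        restored = cls([])
        for batch in wal:
            restored.write(batch)
        return restored
```

A production WAL would also be truncated once buffered data is flushed to durable parts; only the append‑then‑replay contract is shown.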

The article concludes with a brief overview of ByteHouse’s production usage (PB‑scale daily throughput, 10‑20 MiB/s per consumer), support for additional sources (RocketMQ, Pulsar, MySQL, Flink), and future directions toward more universal ingestion pipelines and balanced latency‑performance trade‑offs.

Finally, readers are invited to try ByteHouse for free via the official website and QR‑code links.

Tags: cloud-native, distributed architecture, High Availability, Kafka, memory table, ByteHouse, Real-time Ingestion
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
