
How ByteHouse Redefines Cloud‑Native Data Warehousing for Real‑Time Analytics

This article details ByteHouse's evolution from a ClickHouse‑based OLAP engine to a cloud‑native, massively parallel data warehouse, highlighting its distributed and cloud‑native architectures, enhanced table engines, HaKafka and Materialized MySQL extensions, and real‑world use cases in short‑video, marketing and gaming analytics.

ByteDance Data Platform

Introduction

ByteHouse is a cloud‑native data warehouse derived from ClickHouse. This article summarizes a talk given at the VeDI meetup and covers its architecture evolution, the enhanced HaKafka engine, the improved Materialized MySQL engine, practical case studies, and the future outlook.

Evolution of ByteHouse

Since 2017 ByteHouse has grown from internal experiments with ClickHouse to a production‑grade service launched publicly in 2021. By March 2022 the system operated 18,000 nodes internally, with single clusters scaling up to 2,400 nodes.

Architecture Overview

ByteHouse offers two architectures:

Distributed Architecture: Supports over 2,000 nodes per cluster, leveraging MPP 1.0 features such as integrated storage‑compute, custom table engines (HaMergeTree, HaUniqueMergeTree), combined RBO and CBO optimizers, hot‑cold data separation, and visual management.

Cloud‑Native Architecture (MPP 2.0): Implements storage‑compute separation with shared‑everything storage and shared‑nothing compute, eliminating re‑sharding, providing resource isolation, elastic scaling, and support for HDFS and S3.
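To see why eliminating re‑sharding matters, the following minimal sketch estimates how much data changes owner when a cluster with coupled storage and compute is resized. It assumes a simple hash‑mod placement scheme, which is an illustrative choice and not necessarily what ByteHouse uses; `shard_of` and `fraction_moved` are hypothetical helper names.

```python
import hashlib


def shard_of(key: str, num_nodes: int) -> int:
    """Hash-mod placement: an assumed scheme for storage coupled to compute."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def fraction_moved(keys, old_nodes: int, new_nodes: int) -> float:
    """Share of keys whose owning node changes when the cluster is resized."""
    moved = sum(
        1 for k in keys if shard_of(k, old_nodes) != shard_of(k, new_nodes)
    )
    return moved / len(keys)


# Synthetic keys, used only to illustrate the effect.
keys = [f"user-{i}" for i in range(10_000)]
print(f"moved when growing 4 -> 5 nodes: {fraction_moved(keys, 4, 5):.0%}")
```

Under hash‑mod placement, growing from 4 to 5 nodes relocates about four in five keys in expectation. When data lives in shared storage, adding compute nodes only changes which node reads which data, so no such copy is required.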

Technical Advantages

ByteHouse enhances data ingestion through self‑developed table engines:

HaMergeTree: Reduces ZooKeeper load by decoupling metadata and data synchronization, enabling PB‑scale data handling.

HaUniqueMergeTree: Provides real‑time upserts on a unique key, so updated rows are queryable without a delay.

Bitmap Engine: Accelerates set operations by 10‑50× in large‑scale user segmentation.
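The bitmap idea behind user segmentation can be sketched in a few lines. This is a toy illustration using Python integers as uncompressed bitmaps, not ByteHouse's implementation; production systems generally rely on compressed bitmap formats. The user IDs and the names `to_bitmap` and `cardinality` are made up for the example.

```python
def to_bitmap(user_ids):
    """Encode a set of non-negative integer user IDs as bits of one integer."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << uid
    return bitmap


def cardinality(bitmap):
    """Number of users in the segment (count of set bits)."""
    return bin(bitmap).count("1")


# Two hypothetical audience segments.
watched = to_bitmap([1, 2, 3, 5, 8])
purchased = to_bitmap([2, 3, 7, 8])

both = watched & purchased           # intersection
either = watched | purchased         # union
watched_only = watched & ~purchased  # difference

print(cardinality(both), cardinality(either), cardinality(watched_only))
```

Each set operation becomes a bitwise operation over machine words instead of a join or a per‑row comparison, which is where the speedup in segmentation workloads comes from.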

Enhanced HaKafka Engine

The community Kafka engine lacks high availability, unique‑key handling, and fault tolerance. ByteHouse adds:

High‑availability standby consumers with automatic leader election via ZooKeeper.

Low‑level consumption ensuring the same key lands in the same partition for unique‑key scenarios.

Node replacement logic that guarantees at least one fully synchronized replica during data copy.

Memory Table bound to HaKafka for buffering wide tables before batch flushing.
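Two of the ideas above, routing by key and buffering before a batch flush, can be sketched as follows. This is a simplified model under stated assumptions: CRC32 is an illustrative hash, and `partition_for` and `MemoryBuffer` are hypothetical names rather than HaKafka APIs.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic key -> partition mapping, so one key has one consumer."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class MemoryBuffer:
    """Hypothetical stand-in for a memory table: buffer rows, flush in batches."""

    def __init__(self, flush_rows: int, sink):
        self.flush_rows = flush_rows
        self.sink = sink  # callable that receives a list of rows
        self.rows = []

    def append(self, row: dict) -> None:
        self.rows.append(row)
        if len(self.rows) >= self.flush_rows:
            self.flush()

    def flush(self) -> None:
        if self.rows:
            self.sink(list(self.rows))  # hand over a copy of the batch
            self.rows.clear()


batches = []
buffer = MemoryBuffer(flush_rows=3, sink=batches.append)
for i in range(7):
    buffer.append({"key": f"order-{i % 2}", "seq": i})
buffer.flush()  # flush the remainder

print([len(b) for b in batches])
print(partition_for("order-1", 8) == partition_for("order-1", 8))
```

Because the mapping depends only on the key, all versions of a row reach the same consumer in order, which is what unique‑key deduplication needs. The buffer trades a small delay for fewer, larger writes.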

Enhanced Materialized MySQL

Community Materialized MySQL lacks distributed sync, DDL skipping, and robust error handling. ByteHouse extends it by integrating HaUniqueMergeTree for real‑time deduplication, supporting distributed tables, adding parameters such as include/exclude tables and skip DDL, and providing visual operation monitoring.
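The behavior described above can be modeled as applying a stream of change events to tables keyed by a unique key. The sketch below is a conceptual illustration only: the event format and the names `should_sync`, `apply_events`, `include`, `exclude`, and `skip_ddl` are assumptions made for the example, not ByteHouse settings.

```python
def should_sync(table, include=None, exclude=()):
    """Apply include/exclude table filters; exclude wins over include."""
    if table in exclude:
        return False
    return include is None or table in include


def apply_events(events, include=None, exclude=(), skip_ddl=True):
    """Replay change events into {table: {unique_key: row}}."""
    tables = {}
    for ev in events:
        if ev["type"] == "ddl":
            if skip_ddl:
                continue  # ignore schema changes instead of failing the sync
            raise NotImplementedError("DDL replay is out of scope for this sketch")
        if not should_sync(ev["table"], include, exclude):
            continue
        rows = tables.setdefault(ev["table"], {})
        if ev["type"] == "delete":
            rows.pop(ev["key"], None)
        else:
            # insert and update are both upserts: the latest row per key wins
            rows[ev["key"]] = ev["row"]
    return tables


# Hypothetical change events from an upstream MySQL instance.
events = [
    {"type": "insert", "table": "orders", "key": 1, "row": {"status": "new"}},
    {"type": "update", "table": "orders", "key": 1, "row": {"status": "paid"}},
    {"type": "ddl", "table": "orders", "statement": "ALTER TABLE orders ADD COLUMN note TEXT"},
    {"type": "insert", "table": "audit_log", "key": 9, "row": {"msg": "x"}},
    {"type": "insert", "table": "orders", "key": 2, "row": {"status": "new"}},
    {"type": "delete", "table": "orders", "key": 2},
]

state = apply_events(events, exclude={"audit_log"})
print(state)
```

Keeping one row per unique key is what makes repeated or replayed events harmless: applying the same update twice leaves the table in the same state.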

Real‑World Use Cases

Short‑Video & Live Streaming: Handles billions of rows with a custom Unique engine for real‑time deduplication and a Memory Table for wide‑column buffering, achieving 30 MB/s per node and sub‑second query latency.

Marketing Real‑Time Monitoring: Uses HaKafka with exactly‑once semantics and the Unique engine to ensure precise reward distribution, delivering 30 MB/s per node ingestion and analytics with latency on the order of seconds.

Game Advertising Analytics: Combines Kafka ingestion, Unique engine deduplication, and Materialized MySQL for seamless MySQL‑to‑ByteHouse sync, boosting query speed threefold and reaching 20 MB/s single‑thread throughput.
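For a sense of scale, the per‑node ingestion figures can be converted to daily volume with a back‑of‑the‑envelope calculation. This assumes decimal units (1 MB = 10^6 bytes) and a rate sustained for a full day, neither of which the source states; the 10‑node cluster is a hypothetical example.

```python
MB = 10**6  # decimal megabyte, an assumption
TB = 10**12
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_tb(mb_per_second: float, nodes: int = 1) -> float:
    """Data ingested per day, in TB, at a sustained per-node rate."""
    return mb_per_second * MB * SECONDS_PER_DAY * nodes / TB


print(daily_volume_tb(30))            # one node at 30 MB/s
print(daily_volume_tb(30, nodes=10))  # hypothetical 10-node cluster
```

At 30 MB/s, one node ingests roughly 2.6 TB per day under these assumptions.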

Future Strategy

ByteHouse aims for end‑to‑end solutions covering syntax conversion, data migration, and validation; integration of TP/AP workloads via DES logical replication; resource isolation with shared or dedicated pools; and multi‑engine extensions beyond ClickHouse to further enhance data synchronization capabilities.

Written by

ByteDance Data Platform

The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.
