
How Baidu’s TDE‑ClickHouse Delivers Sub‑Second Analytics on Billion‑Row Datasets

This article explains how Baidu’s TDE‑ClickHouse, a core engine of the Turing 3.0 ecosystem, overcomes platform fragmentation, uneven quality, and poor usability. It combines the OneData+ development paradigm, multi‑level aggregation, projections, query caching, bulk‑load ingestion, and a cloud‑native architecture to deliver sub‑second query response over massive data volumes.

Architecture & Thinking

Background and Problems

Baidu MEG’s first‑generation big‑data products suffered from platform fragmentation, uneven quality, and poor usability, leading to low development efficiency, high learning costs, and slow business response.

Turing 3.0 Ecosystem Overview

The Turing 3.0 ecosystem was built to address these issues and consists of three core components:

TDE (Turing Data Engine) – the compute engine, including ClickHouse and Spark.

TDS (Turing Data Studio) – a one‑stop data development and governance platform.

TDA (Turing Data Analysis) – a next‑generation visual BI product.

These components form a unified data lifecycle platform that supports end‑to‑end data operations.

OneData+ Development Paradigm

OneData+ shifts from the traditional four‑layer data‑warehouse model (ODS, DWD, DWS, ADS) to a dataset‑centric approach. Data developers proactively create reusable datasets based on business themes, and business users perform self‑service drag‑and‑drop analysis on these datasets, dramatically improving development efficiency and reducing operational costs.

ClickHouse Challenges

In the OneData+ model, datasets replace the ADS layer, causing data volume to increase by several orders of magnitude while still requiring sub‑second query performance. Additional challenges include large‑scale data import stability, high resource consumption, and limited cluster‑level transaction support.

Query Performance Optimizations

1. Compute‑layer Decoupling and Aggregation Layer – A dedicated aggregation layer with higher CPU, memory, and network capacity handles query coordination, reducing the need for uniformly high‑spec nodes and lowering cost.

2. Multi‑Level Data Aggregation – Projection tables pre‑aggregate intermediate results, and final query results are cached in external memory to avoid repeated disk I/O.

```sql
SELECT keys, COUNT(DISTINCT cuid) FROM xxx GROUP BY keys
```

3. Projection Automation – SQL parsing identifies candidate projections, evaluates cost, and automatically creates or drops them based on hit rate and cluster load.
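The create/drop decision described above can be sketched as a small control loop. The stats fields, thresholds, and function names below are illustrative assumptions, not Baidu's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class ProjectionStats:
    """Bookkeeping for one candidate projection (names are illustrative)."""
    group_by_keys: tuple
    hits: int = 0      # queries that could use this projection
    misses: int = 0    # queries on the table that could not

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

def decide_projections(candidates, cluster_load, load_ceiling=0.7,
                       create_threshold=0.5, drop_threshold=0.1):
    """Return (to_create, to_drop) based on hit rate and cluster load.

    A projection is only created when the cluster has spare capacity,
    and dropped once queries stop matching it."""
    to_create, to_drop = [], []
    for cand in candidates:
        if cand.hit_rate >= create_threshold and cluster_load < load_ceiling:
            to_create.append(cand)
        elif cand.hit_rate < drop_threshold:
            to_drop.append(cand)
    return to_create, to_drop
```

In practice the candidates come from SQL parsing of the query log, and "cost" would also weigh build time and storage; this sketch keeps only the hit‑rate and load signals named in the text.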

4. Query Result Cache (Proxy QueryCache) – A global cache at the CHProxy layer stores final results, with version‑based invalidation ensuring consistency after data updates, reducing query latency by up to 50%.
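Version‑based invalidation can be shown in miniature: every dataset carries a version that is bumped on import, and a cached entry is served only while its version is still current. All names here are illustrative, not Baidu's actual proxy API:

```python
class VersionedQueryCache:
    """Minimal sketch of a proxy-level result cache with version-based
    invalidation, as described for the CHProxy layer."""

    def __init__(self):
        self._versions = {}   # dataset -> current data version
        self._cache = {}      # (dataset, sql) -> (version, result)

    def bump_version(self, dataset):
        """Called after a successful data import."""
        self._versions[dataset] = self._versions.get(dataset, 0) + 1

    def get(self, dataset, sql):
        entry = self._cache.get((dataset, sql))
        if entry and entry[0] == self._versions.get(dataset, 0):
            return entry[1]          # fresh hit
        return None                  # miss, or stale after an import

    def put(self, dataset, sql, result):
        self._cache[(dataset, sql)] = (self._versions.get(dataset, 0), result)
```

Because invalidation compares versions rather than deleting entries, a data update never serves stale results even if eviction lags behind.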

5. High‑Cardinality UV Queries – Two‑stage optimization:

NoMerge queries pre‑partition data by cuid so that each shard returns only the partial count, and the aggregation layer simply sums the counts, yielding 5‑10× speedup.

Projection + RoaringBitmap replaces hash‑set deduplication with bitmap operations, further shrinking intermediate state size.
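The first stage can be illustrated in miniature: because data is pre‑partitioned by cuid, each user lives on exactly one shard, so the per‑shard distinct counts are disjoint and their sum is the exact global UV. The shard count and hashing scheme below are illustrative assumptions:

```python
from zlib import crc32

NUM_SHARDS = 4

def shard_for(cuid: str) -> int:
    """Route each cuid to exactly one shard (the pre-partitioning step)."""
    return crc32(cuid.encode()) % NUM_SHARDS

def distributed_uv(events):
    """NoMerge-style UV: each shard deduplicates locally and returns only a
    partial count; since cuid sets are disjoint across shards, the
    aggregation layer simply sums the counts instead of merging hash sets."""
    shards = [set() for _ in range(NUM_SHARDS)]
    for cuid in events:
        shards[shard_for(cuid)].add(cuid)       # local dedup per shard
    partial_counts = [len(s) for s in shards]   # tiny payloads over the wire
    return sum(partial_counts)
```

The second stage swaps each shard's hash set for a RoaringBitmap, shrinking the local dedup state; the summation structure stays the same.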

6. Rule‑Based Optimizer (RBO) for CASE‑WHEN – AST rewriting removes redundant CASE logic and enables index usage, cutting query time by 20‑40%.
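One such rewrite can be sketched at the text level. The real optimizer operates on the parsed AST; the regex form below is a deliberately simplified illustration of a single rule, `CASE WHEN <cond> THEN 1 ELSE 0 END = 1` collapsing to `<cond>`, which exposes the underlying column to index usage:

```python
import re

# Illustrative RBO rule: the CASE expression adds no logic when compared
# against 1, so the whole construct reduces to its inner condition.
CASE_EQ_ONE = re.compile(
    r"CASE\s+WHEN\s+(.+?)\s+THEN\s+1\s+ELSE\s+0\s+END\s*=\s*1",
    re.IGNORECASE,
)

def rewrite_case_when(sql: str) -> str:
    """Replace the redundant CASE-WHEN pattern with its bare condition."""
    return CASE_EQ_ONE.sub(r"(\1)", sql)
```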

Data Import Optimizations

Native ClickHouse mini‑batch inserts caused entry‑node hotspots and write amplification due to LSM‑tree compaction. Baidu introduced a BulkLoad pipeline:

Data construction – offline jobs build CH parts, merge them, and push the final part to a distributed file system (AFS).

Data delivery – a two‑phase commit validates and routes the bulk data to the appropriate CH nodes.

This approach achieves sub‑two‑hour ingestion for hundred‑billion‑row datasets and provides cold‑backup data for disaster recovery. Real‑time streaming ingestion with zero loss and sub‑second latency is also supported.
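The two‑phase delivery step can be sketched as follows: every target node first validates and stages its part (prepare), and only if all succeed does the coordinator ask them to attach it atomically (commit); any failure rolls back every staged part. The `Shard` class and method names are toy stand‑ins, not the real protocol:

```python
class Shard:
    """Toy stand-in for a ClickHouse node in the delivery protocol."""
    def __init__(self, healthy=True):
        self.healthy = healthy
        self.state = "idle"

    def prepare(self, path):
        """Download the part from AFS, verify checksums, stage it."""
        if self.healthy:
            self.state = "staged"
            return True
        return False

    def rollback(self):
        self.state = "idle"

    def commit(self):
        self.state = "committed"   # attach staged part; now query-visible

def bulkload_commit(shards, part_paths):
    """Two-phase commit: all-or-nothing delivery of one bulk batch."""
    prepared = []
    for shard, path in zip(shards, part_paths):
        if shard.prepare(path):
            prepared.append(shard)
        else:
            for s in prepared:       # any failure aborts the whole batch
                s.rollback()
            return False
    for shard in prepared:
        shard.commit()
    return True
```

Because parts become visible only at commit, readers never observe a half‑delivered batch, which is what makes the bulk import atomic at cluster level.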

Cloud‑Native Migration

To improve scalability and operational efficiency, the CH cluster was refactored for cloud‑native deployment on Kubernetes (EKS) using the open‑source ClickHouse‑Operator. Key steps include:

Deploying StatefulSets, ConfigMaps, and custom CRDs to manage topology and shard anti‑affinity.

Implementing data persistence via local disks and AFS backups to survive pod recreation.

Integrating a lightweight meta service to replace ZooKeeper for cluster coordination, providing leader‑less stability.

Versioned metadata and MVCC mechanisms were added to guarantee strong consistency for imports, recovery, and DDL operations.
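The versioned‑metadata idea can be shown in a minimal MVCC sketch: writers publish whole new catalog versions atomically, while readers pin a snapshot and never observe a partial update. The class and method names are illustrative, not the real meta‑service API:

```python
class VersionedMeta:
    """Minimal sketch of versioned metadata with MVCC-style reads."""

    def __init__(self):
        self._versions = [{}]          # version 0: empty catalog

    @property
    def latest(self) -> int:
        return len(self._versions) - 1

    def commit(self, changes: dict) -> int:
        """Atomically publish a new version derived from the latest one."""
        new = {**self._versions[-1], **changes}
        self._versions.append(new)
        return self.latest

    def read(self, key, version=None):
        """Read at a pinned snapshot; defaults to the latest version."""
        snap = self._versions[version if version is not None else self.latest]
        return snap.get(key)
```

Imports, recovery, and DDL each commit a new version, so a failed operation simply never becomes the latest version rather than leaving the catalog half‑written.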

Cluster Coordination Upgrade

The new meta service decouples metadata storage from the service layer, allowing stateless scaling and global consistency. Combined with quorum protocols, it enables atomic imports and eliminates dirty reads/writes caused by the previous ZooKeeper‑based design.

Conclusion and Outlook

TDE‑ClickHouse now powers Baidu MEG’s data platform with over 350,000 CPU cores, 10 PB storage, and sub‑second query latency for billions of rows, supporting more than 30 business lines. Ongoing work focuses on further reducing resource consumption, enhancing lake‑warehouse integration, and preparing for future data‑driven AI workloads.

Distributed Systems · Cloud Native · Performance Optimization · Big Data · ClickHouse · Data Warehouse
Written by

Architecture & Thinking

🍭 Frontline tech director and chief architect at top-tier companies 🥝 Years of deep experience in internet, e‑commerce, social, and finance sectors 🌾 Committed to publishing high‑quality articles covering core technologies of leading internet firms, application architecture, and AI breakthroughs.
