
TDE-ClickHouse: Baidu MEG's High-Performance Big Data Analytics Engine

TDE‑ClickHouse, the core engine of Baidu MEG's Turing 3.0 ecosystem, delivers sub-second, self-service analytics on petabyte-scale data. It combines a decoupled compute layer, multi-level aggregation, high-cardinality and rule-based query optimizations, a two-phase bulk-load pipeline, cloud-native deployment, and a lightweight meta service. The platform now runs on more than 350,000 cores and 10 PB of storage, and serves over 150,000 daily BI queries with average response times under three seconds.

Baidu Tech Salon

This article introduces TDE-ClickHouse, a core engine in Baidu MEG's Turing 3.0 data ecosystem, designed to provide self-service sub-second analytics on massive datasets.

Background: Baidu MEG's previous generation of big data products faced issues including scattered platforms, inconsistent quality, and poor usability, leading to low development efficiency, high learning costs, and slow business response. To address these challenges, Baidu developed the Turing 3.0 ecosystem comprising TDE (Turing Data Engine), TDS (Turing Data Studio), and TDA (Turing Data Analysis).

Query Performance Optimizations:

1. Compute Layer Decoupling: Introduced an aggregation layer with higher CPU, memory, and network throughput to handle coordinator responsibilities, enabling cross-cluster queries.
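The coordinator role described above amounts to a scatter-gather pattern: fan the query out to shards (potentially across clusters) and merge the partial results on beefier aggregation nodes. A minimal sketch, with toy dicts standing in for shard-local partial aggregates (all names here are illustrative, not TDE-ClickHouse internals):

```python
from concurrent.futures import ThreadPoolExecutor

# Each "shard" holds a slice of the data; plain dicts of per-key partial
# sums stand in for ClickHouse shard-local aggregation states.
SHARDS = [
    {"cn": 10, "us": 3},
    {"cn": 5, "jp": 7},
    {"us": 2, "jp": 1},
]

def query_shard(shard):
    """Stand-in for running the partial aggregate on one shard."""
    return shard

def scatter_gather(shards):
    """Coordinator: fan out in parallel, then merge partial aggregates."""
    merged = {}
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        for partial in pool.map(query_shard, shards):
            for key, value in partial.items():
                merged[key] = merged.get(key, 0) + value
    return merged

print(scatter_gather(SHARDS))  # cn=15, us=5, jp=8
```

Because the merge step is CPU- and network-bound rather than disk-bound, moving it onto dedicated high-throughput nodes keeps the storage shards free to scan data.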

2. Multi-level Data Aggregation: Implemented automatic Projection creation for high-frequency BI queries with full lifecycle management, and added QueryCache at the CHProxy layer to cache final query results, reducing query latency by up to 50%.
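The proxy-layer result cache works because identical BI dashboard queries repeat constantly. A tiny sketch of the idea, assuming cache keys come from normalized query text with a TTL (the real CHProxy-layer implementation is not published in the article):

```python
import hashlib
import time

class QueryCache:
    """Toy result cache in the spirit of a proxy-layer QueryCache."""
    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expires_at, result)

    def _key(self, sql):
        # Normalize whitespace and case so trivially different query
        # texts share one cache entry.
        normalized = " ".join(sql.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get_or_run(self, sql, run):
        key = self._key(sql)
        now = time.monotonic()
        hit = self.store.get(key)
        if hit and hit[0] > now:
            return hit[1], True          # cache hit: skip the backend entirely
        result = run(sql)                # cache miss: execute and remember
        self.store[key] = (now + self.ttl, result)
        return result, False

cache = QueryCache(ttl_seconds=60)
calls = []
def backend(sql):
    calls.append(sql)
    return 42

r1, hit1 = cache.get_or_run("SELECT count() FROM hits", backend)
r2, hit2 = cache.get_or_run("select  COUNT() from HITS", backend)
print(r1, hit1, r2, hit2, len(calls))  # 42 False 42 True 1
```

Caching final results at the proxy, in front of the cluster, means a repeated dashboard query never touches ClickHouse at all, which is where the quoted latency reduction comes from.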

3. High-Cardinality UV Optimization: Used NoMerge queries to push deduplication down to individual shards, and combined Projections with RoaringBitmap (replacing HashSet) for better compression.
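The win is that each shard deduplicates locally into a compressed bitmap, and the merge step only ORs bitmaps and counts bits instead of unioning huge HashSets of raw ids. A minimal sketch using a Python integer as a toy bitset in place of RoaringBitmap:

```python
def shard_bitmap(user_ids):
    """Per-shard dedup: set one bit per user id (a toy stand-in for
    RoaringBitmap, which adds compression on top of the same idea)."""
    bits = 0
    for uid in user_ids:
        bits |= 1 << uid
    return bits

def merged_uv(shard_bitmaps):
    """Aggregation layer: OR the shard bitmaps, then count set bits.
    No raw ids ever leave the shards."""
    merged = 0
    for bits in shard_bitmaps:
        merged |= bits
    return bin(merged).count("1")

shards = [[1, 2, 3], [3, 4], [4, 5, 1]]
print(merged_uv([shard_bitmap(s) for s in shards]))  # 5 distinct users
```

Bitmaps over id ranges compress far better than hash tables of 64-bit ids, and bitwise OR is cheap, which is exactly what matters at high cardinality.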

4. RBO (Rule-Based Optimization): Applied rewrite rules for common query patterns such as Case-When aggregations, reducing query latency by 20-40%.
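The article does not publish the exact rule set, but one classic rewrite of this kind turns `sum(CASE WHEN <cond> THEN 1 ELSE 0 END)` into ClickHouse's `countIf(<cond>)`, avoiding per-row materialization of the CASE branch. A hedged sketch of a single regex-based rule (a real RBO would work on a parsed query tree, not strings):

```python
import re

# Illustrative rule: sum(CASE WHEN <cond> THEN 1 ELSE 0 END) -> countIf(<cond>).
# countIf is a real ClickHouse aggregate; the rule itself is our assumption
# about what a Case-When rewrite might look like.
CASE_WHEN_COUNT = re.compile(
    r"sum\s*\(\s*case\s+when\s+(.+?)\s+then\s+1\s+else\s+0\s+end\s*\)",
    re.IGNORECASE | re.DOTALL,
)

def rewrite(sql):
    """Apply the rewrite rule everywhere it matches; leave other SQL alone."""
    return CASE_WHEN_COUNT.sub(lambda m: f"countIf({m.group(1)})", sql)

sql = "SELECT sum(CASE WHEN status = 200 THEN 1 ELSE 0 END) FROM hits"
print(rewrite(sql))  # SELECT countIf(status = 200) FROM hits
```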

Data Import Optimization: Built a BulkLoad mechanism with two phases: data construction (building CH internal structures and merging parts to AFS) and data delivery (two-phase commit for validation and routing). This achieves billion-row-level imports in under 2 hours while ensuring read-write separation.
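The delivery phase's two-phase commit can be sketched as: stage the pre-built parts on every replica (invisible to readers), and only flip them visible when all replicas have validated, otherwise roll back. A minimal illustration with toy in-memory replicas (all names are hypothetical, not TDE-ClickHouse APIs):

```python
class Replica:
    """Toy replica for the delivery phase."""
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
        self.staged, self.visible = None, None

    def prepare(self, parts):
        # Phase 1: validate and stage the pre-built parts; readers still
        # see the old data, which is how read-write separation is kept.
        if not self.healthy:
            return False
        self.staged = parts
        return True

    def commit(self):
        # Phase 2: atomically switch readers to the new parts.
        self.visible, self.staged = self.staged, None

    def abort(self):
        self.staged = None

def deliver(replicas, parts):
    """Commit only if every replica prepared; otherwise abort everywhere."""
    if all(r.prepare(parts) for r in replicas):
        for r in replicas:
            r.commit()
        return True
    for r in replicas:
        r.abort()
    return False

group = [Replica("r1"), Replica("r2")]
print(deliver(group, ["part_0", "part_1"]))  # True: all replicas switch
```

Because the heavy work (building CH internal structures and merging parts to AFS) happens in the separate construction phase, the commit itself is a cheap metadata flip, so queries never observe a half-imported dataset.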

Distributed Architecture Upgrade:

1. Cloud-Native Transformation: Deployed ClickHouse clusters on EKS using ClickHouse-Operator, implementing automatic data recovery from replicas or AFS backups, and using BNS for service discovery.
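For reference, a declarative cluster under the Altinity clickhouse-operator is described by a `ClickHouseInstallation` custom resource. The sketch below shows the general shape of such a manifest; the cluster name and shard/replica counts are illustrative, not Baidu's actual topology:

```yaml
# Minimal ClickHouseInstallation sketch for the clickhouse-operator.
# Values are illustrative assumptions, not TDE-ClickHouse's real config.
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "tde-ch-demo"
spec:
  configuration:
    clusters:
      - name: "analytics"
        layout:
          shardsCount: 2
          replicasCount: 2
```

Driving the cluster from a declarative spec is what makes automatic recovery practical: when a pod is lost, the operator recreates it to match the spec, and the data is repopulated from a replica or an AFS backup.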

2. Cluster Coordination Service: Built a lightweight Meta service to replace ZooKeeper-dependent architecture, implementing global versioning with Quorum protocol and MVCC for strong data consistency.
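The combination of global versioning and MVCC means every accepted metadata write produces a new global version, and readers pin a version to get a consistent snapshot. A minimal single-process sketch (the majority rule and all names here are illustrative, not the TDE Meta service's wire protocol):

```python
class MetaStore:
    """Toy versioned metadata store: quorum-gated writes, MVCC reads."""
    def __init__(self, replica_count=3):
        self.replica_count = replica_count
        self.version = 0
        self.history = {0: {}}  # global version -> full key/value snapshot

    def write(self, key, value, acks):
        # Quorum rule: commit only if a majority of meta replicas acked.
        if acks <= self.replica_count // 2:
            return None
        self.version += 1
        snapshot = dict(self.history[self.version - 1])
        snapshot[key] = value
        self.history[self.version] = snapshot  # MVCC: old versions stay readable
        return self.version

    def read(self, key, at_version=None):
        v = self.version if at_version is None else at_version
        return self.history[v].get(key)

meta = MetaStore(replica_count=3)
v1 = meta.write("table_hits_schema", "v1", acks=2)   # 2/3 acks: committed
v2 = meta.write("table_hits_schema", "v2", acks=3)
print(meta.read("table_hits_schema"))                 # latest: "v2"
print(meta.read("table_hits_schema", at_version=v1))  # pinned snapshot: "v1"
```

Keeping old snapshots readable lets long-running queries see a stable metadata view while new imports commit, without the watch/session machinery that made ZooKeeper a bottleneck.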

Results: The platform now manages 350,000+ cores and 10PB storage, serving 30+ business lines with 150,000+ daily BI queries, achieving average query response under 3 seconds and P90 under 7 seconds.

Tags: Distributed Systems · Cloud Native · Query Optimization · Performance Tuning · ClickHouse · Database Architecture · Big Data Analytics · Data Import
Written by

Baidu Tech Salon

Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.
