
TDE-ClickHouse Optimization Practice at Baidu MEG: Query Performance, Data Import, and Distributed Architecture

Baidu MEG’s TDE‑ClickHouse optimization in the Turing 3.0 ecosystem boosts query speed up to 10×, halves latency, enables billion‑row bulk imports in under two hours, and migrates to a cloud‑native, ZooKeeper‑free architecture supporting 350 k CPU cores, 10 PB storage, and sub‑3‑second responses for 150 k daily BI queries.

Baidu Geek Talk
This article introduces Baidu MEG's TDE-ClickHouse optimization practices within the Turing 3.0 ecosystem. The previous generation of big data products faced challenges including scattered platforms, inconsistent quality, and poor usability, leading to low development efficiency and slow business response.

Query Performance Optimization: The team implemented four key optimizations:

1. Computing-layer decoupling: a dedicated aggregation layer with higher CPU, memory, and network throughput.
2. Multi-level data aggregation: automatic Projections pre-aggregate query intermediate states, and a QueryCache stores final results, cutting query latency by up to 50%.
3. High-cardinality UV query optimization: NoMerge queries (a 5-10x performance improvement) and RoaringBitmap in place of HashSet for better compression.
4. An RBO (Rule-Based Optimizer) for Case-When queries, reducing query time by 20-40%.
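The UV optimization rests on a property of bitmaps: two partial deduplicated-user states merge with a single bitwise OR, whereas hash sets must re-insert every element. The sketch below illustrates that property with a naive integer bitset; the `UVBitmap` class and shard names are illustrative assumptions, not Baidu's or RoaringBitmap's actual implementation (Roaring additionally chunks the ID space into compressed containers).

```python
# Illustrative sketch (not Baidu's implementation): why bitmap-backed UV
# intermediate states merge cheaply compared to hash sets.

class UVBitmap:
    """Deduplicated user-ID set backed by one arbitrary-precision int bitset."""

    def __init__(self) -> None:
        self.bits = 0

    def add(self, user_id: int) -> None:
        self.bits |= 1 << user_id          # set the bit for this user

    def merge(self, other: "UVBitmap") -> None:
        # Merging two partial UV states is a single bitwise OR; this is
        # what makes pre-aggregated intermediate states cheap to combine.
        self.bits |= other.bits

    def count(self) -> int:
        return bin(self.bits).count("1")   # popcount = distinct users

# Two shards build partial UV states independently...
shard_a, shard_b = UVBitmap(), UVBitmap()
for uid in (1, 5, 9):
    shard_a.add(uid)
for uid in (5, 9, 42):
    shard_b.add(uid)

# ...and the aggregation layer merges them without re-hashing any rows.
shard_a.merge(shard_b)
print(shard_a.count())  # 4 distinct users across both shards
```

A real RoaringBitmap gets its compression by switching container encodings (arrays, bitmaps, runs) per 16-bit chunk of the ID space, but the merge cost stays near-linear in compressed size rather than in element count.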

Data Import Optimization: The team built a BulkLoad import mechanism with two phases: data construction (building ClickHouse internal data structures and merging parts onto AFS) and data delivery (a two-phase commit for validation and routing). This achieves billion-row-level imports in under 2 hours while keeping reads and writes separated.
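The delivery phase's two-phase commit can be sketched as follows: no replica attaches a built part until every target replica has validated it, so a failed validation aborts cleanly with nothing visible to readers. The `Replica`/`deliver` names and the checks they stand in for are assumptions for illustration, not the actual TDE-ClickHouse API.

```python
# Illustrative sketch of two-phase-commit part delivery (names assumed).

class Replica:
    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name, self.healthy = name, healthy
        self.attached: list[str] = []

    def prepare(self, part: str) -> bool:
        # Phase 1 stand-in: validate checksums, schema, and routing.
        return self.healthy

    def commit(self, part: str) -> None:
        # Phase 2: atomically attach the part, making it readable.
        self.attached.append(part)

def deliver(part: str, replicas: list[Replica]) -> bool:
    # Phase 1: every replica must acknowledge the prepared part.
    if not all(r.prepare(part) for r in replicas):
        return False  # abort: no replica has attached anything
    # Phase 2: only after unanimous validation is the part made visible.
    for r in replicas:
        r.commit(part)
    return True

replicas = [Replica("r1"), Replica("r2"), Replica("r3")]
print(deliver("part_202401_0_0", replicas))  # True: attached on all replicas
```

Because construction happens offline (parts are built and merged on AFS first), the cluster's serving replicas only pay for this validate-and-attach step, which is how read-write separation is preserved during bulk imports.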

Distributed Architecture Upgrade:

1. Cloud-native transformation: ClickHouse-Operator on EKS across 3,000+ nodes, with automatic data recovery from replicas or AFS backups.
2. Cluster coordination upgrade: ZooKeeper replaced by a lightweight Meta service implementing a Quorum protocol and MVCC for strong data consistency.
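The Quorum-plus-MVCC combination mentioned above can be sketched in a few lines: a write commits once a strict majority of metadata nodes accept it, and a per-key version check rejects stale writers. The `quorum_write` function and its node representation are illustrative assumptions about how such a Meta service might behave, not its real protocol.

```python
# Illustrative sketch (assumed API): majority-quorum write with an
# MVCC-style version check, as a ZooKeeper-free Meta service might use.

def quorum_write(nodes: list[dict], key: str, value: str, version: int) -> bool:
    """Commit the write iff a strict majority of nodes accept it."""
    acks = 0
    for node in nodes:
        # Version check: apply only if the incoming version is newer than
        # the node's stored one, so stale writers cannot clobber state.
        _, stored_version = node.get(key, (None, -1))
        if version > stored_version:
            node[key] = (value, version)
            acks += 1
    return acks * 2 > len(nodes)  # strict majority required

nodes = [{}, {}, {}]
print(quorum_write(nodes, "table_meta", "v1", 1))  # True: 3/3 acks
print(quorum_write(nodes, "table_meta", "old", 0))  # False: stale version
```

A production protocol would also handle leader election, log replication, and read quorums, but the invariant is the same: overlapping majorities plus monotonically increasing versions yield a single consistent history.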

Current scale: 350,000+ CPU cores, 10PB storage, 30+ business lines, 150,000+ daily BI queries with average response time under 3 seconds.

Tags: Cloud Native, ClickHouse, Data Warehouse, Database Optimization, Query Performance, BulkLoad, Baidu MEG