
Tencent Real-Time Lakehouse Architecture and Intelligent Optimization Practices

This article presents Tencent's real-time lakehouse architecture, detailing its three-layer design; the Auto Optimize Service with compaction, indexing, clustering, and engine acceleration; and scenario capabilities such as multi-stream joins and in-place migration, before outlining future optimization directions.


Overview – The presentation, delivered by senior engineer Chen Liang, introduces Tencent's real-time lakehouse solution, focusing on four topics: lakehouse architecture, intelligent optimization services, scenario capabilities, and a summary with outlook.

Lakehouse Architecture – The architecture consists of three layers: data‑lake compute (Spark for batch ETL, Flink for near‑real‑time streaming, StarRocks and Presto for ad‑hoc OLAP), data‑lake management (Iceberg as the core with an open API and an Auto Optimize Service to improve query performance and reduce storage cost), and data‑lake storage (HDFS and Tencent Cloud Object Storage, with Alluxio providing a unified cache layer).

Intelligent Optimization Service – The service comprises six modules:

Compaction Service – merges small files using RowGroup‑level and Page‑level copy strategies, achieving over 5× reduction in merge time and resources.
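The article does not show the planning logic, but the idea of grouping many small files into right-sized merge tasks can be sketched in a few lines. This is an illustrative greedy bin-packing sketch, not Tencent's implementation; their RowGroup-level and Page-level copy strategies additionally avoid re-encoding data inside Parquet files, which this sketch does not model.

```python
def plan_compaction(file_sizes, target_bytes=128 * 1024 * 1024):
    """Greedy first-fit: pack small files into merge tasks near a target size.

    file_sizes: list of data-file sizes in bytes.
    Returns a list of groups; each group is one compaction task.
    """
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target_bytes:
            groups.append(current)          # close out the full task
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

Fewer, larger files mean fewer file opens and manifest entries per query, which is where most of the small-file cost comes from.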

Expiration Service – removes expired snapshots.

Cleaning Service – manages lifecycle and orphan file cleanup.

Clustering Service – re‑distributes data using Z‑order to improve data skipping, delivering more than 4× performance gains.
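Z-order works by interleaving the bits of several column values into a single sort key, so rows that are close in any of the clustered dimensions land close together on disk. A minimal two-column sketch of the key computation (the real service applies this during data redistribution, with per-column normalization this sketch omits):

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two non-negative ints into one Z-order key.

    Sorting rows by this key clusters both dimensions at once, so per-file
    min/max statistics prune well for filters on either column.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x occupies the even bits
        z |= ((y >> i) & 1) << (2 * i + 1)    # y occupies the odd bits
    return z

# Rows sorted by z_value stay spatially clustered in (x, y):
rows = [(3, 5), (0, 0), (7, 1), (2, 2)]
rows.sort(key=lambda r: z_value(*r))
```

A plain sort on one column only helps filters on that column; the interleaved key is what lets a single layout serve filters on multiple columns.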

Index Service – adds secondary indexes on Iceberg tables, provides min‑max and custom metrics, and offers an end‑to‑end index recommendation workflow based on query frequency and filter analysis.
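Min-max metrics enable data skipping: a file whose recorded [min, max] range for a column cannot overlap the query predicate is never opened. A simplified sketch of that pruning check, assuming a hypothetical `file_stats` mapping of file path to per-column (min, max) pairs:

```python
def prune_files(file_stats, column, lo, hi):
    """Keep only files whose [min, max] range for `column` can
    contain a value in [lo, hi]; all other files are skipped."""
    kept = []
    for path, stats in file_stats.items():
        fmin, fmax = stats[column]
        if fmax >= lo and fmin <= hi:   # ranges overlap -> must read
            kept.append(path)
    return kept
```

Secondary indexes extend the same idea beyond min/max: any statistic that can prove "no matching rows here" lets the engine skip the file entirely.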

Auto Engine Service – routes queries over hot partitions to StarRocks based on OLAP engine events, selecting the most suitable engine per workload for better query performance.
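The routing decision can be pictured as a simple frequency threshold over recent query events. This is a deliberately minimal sketch of the idea, not Tencent's actual heuristic, which is driven by OLAP engine events:

```python
from collections import Counter

def pick_hot_partitions(query_log, threshold=3):
    """Partitions queried at least `threshold` times recently are
    considered hot and routed to the accelerated engine (StarRocks);
    everything else stays on the lake path."""
    counts = Counter(query_log)
    return {partition for partition, n in counts.items() if n >= threshold}
```

In practice the hot set would be recomputed over a sliding window so partitions cool down and fall back to the cheaper lake path.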

Scenario Capabilities

Multi‑stream Join – combines data from multiple MQ streams by tagging and asynchronous compaction, allowing seamless merged reads.
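The effect of tagging can be sketched as a key-based merge at read time: each stream writes only its own columns plus the join key, and the merged read assembles full rows, so no stream blocks on another. An illustrative sketch (the tag names and `id` key are hypothetical; the real mechanism merges tagged data files asynchronously, not in-memory dicts):

```python
def merge_tagged_streams(streams):
    """Merge rows from several tagged streams on a shared primary key.

    streams: {tag: [row_dict, ...]} where every row carries an "id" key.
    Columns from late-arriving streams simply appear once their data lands.
    """
    merged = {}
    for tag, rows in streams.items():
        for row in rows:
            merged.setdefault(row["id"], {"id": row["id"]}).update(
                {k: v for k, v in row.items() if k != "id"}
            )
    return merged
```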

Primary‑Key Table – uses bucketed writes and rescaling to support row‑level updates and column‑family style storage.

In‑Place Migration – migrates legacy Thive/Hive data to Iceberg without moving original files, supporting STRICT, APPEND, and OVERWRITE modes, and enhances name‑mapping and partition pruning for compute tasks.

PyIceberg – provides a JVM‑free Python API for Iceberg metadata, enabling Pandas, TensorFlow, PyTorch integration and fast SQL push‑down via DuckDB.

Summary and Outlook – Future work includes further enhancements to the Auto Optimize Service (cold‑hot separation, materialized view acceleration, intelligent sensing), primary‑key table improvements (deletion vectors, predicate push‑down), and AI‑driven lakehouse innovations such as optimized formats for model training and distributed DataFrame support.

Thank you for reading.

Tags: Optimization, Big Data, real-time analytics, data lake, Tencent, Iceberg, Lakehouse
Written by DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.