Tencent Real‑Time Lakehouse Intelligent Optimization Practices

This article presents Tencent's end‑to‑end real‑time lakehouse architecture. It covers the three‑layer design; the Auto Optimize Service modules for compaction, indexing, clustering and engine acceleration; scenario‑driven capabilities such as multi‑stream joins, primary‑key tables, in‑place migration and PyIceberg support; and future optimization directions.

DataFunSummit

Jan 3, 2025

Overview

The presentation introduces Tencent's real‑time lakehouse solution, which consists of three parts: data‑lake compute (Spark for batch, Flink for near‑real‑time, StarRocks/Presto for ad‑hoc OLAP), data‑lake management (Iceberg with open APIs and an Auto Optimize Service), and data‑lake storage (HDFS and COS with Alluxio cache).

Intelligent Optimization Service

The Auto Optimize Service is divided into six modules:

Compaction Service – merges small files using RowGroup‑level or Page‑level copy strategies, cutting merge time and resource usage by more than 5×.

Expiration Service – removes expired snapshots.

Cleaning Service – manages lifecycle and orphan file cleanup.

Clustering Service – reorganizes data with Z‑order so that more files can be skipped at query time, delivering more than 4× query performance gains.

Index Service – adds secondary indexes on Iceberg tables, with an intelligent recommendation engine that analyzes scan metrics, query frequency, and filter conditions.

Auto Engine Service – uses OLAP engine events to identify hot partitions and route them to StarRocks, so that the storage and compute engine is selected automatically.
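
To make the Clustering Service's Z‑order idea concrete, here is a minimal Python sketch. It is not Tencent's implementation: the two integer columns, the helper names and the "file" model (a fixed number of rows per file with min/max statistics) are all assumptions chosen for illustration. It interleaves the bits of two column values into a single sort key and then counts how many files a point filter on either column would have to open.

```python
from typing import List, Tuple

Row = Tuple[int, int]  # (x, y): two hypothetical integer columns used in filters


def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Build a Z-order (Morton) key: x fills even bit positions, y fills odd ones."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def z_order_sort(rows: List[Row], bits: int = 16) -> List[Row]:
    """Sort rows by their interleaved key so nearby (x, y) values land together."""
    return sorted(rows, key=lambda r: interleave_bits(r[0], r[1], bits))


def split_into_files(rows: List[Row], rows_per_file: int) -> List[List[Row]]:
    """Model data files as consecutive chunks of the sorted rows."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


def files_to_scan(files: List[List[Row]], col: int, value: int) -> int:
    """Count files whose min/max range on column `col` could contain `value`."""
    return sum(
        1 for f in files
        if min(r[col] for r in f) <= value <= max(r[col] for r in f)
    )


if __name__ == "__main__":
    grid = [(x, y) for x in range(4) for y in range(4)]
    linear = split_into_files(sorted(grid), 4)              # sorted by x, then y
    clustered = split_into_files(z_order_sort(grid), 4)     # sorted by Z-order key
    for name, files in (("linear", linear), ("z-order", clustered)):
        print(name, files_to_scan(files, 0, 0), files_to_scan(files, 1, 0))
```

On this 4×4 toy grid, a plain sort by x opens one file for a filter on x but all four files for a filter on y, while the Z‑ordered layout opens two files for either filter. That balanced skipping across several filter columns is the effect the Clustering Service relies on; the gains reported in the talk come from production workloads, not from this sketch.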

Scenario‑Based Capabilities

1. Multi‑Stream Join: Two MQ streams that update different columns of the same table are merged by tagging branches in Iceberg and compacting them asynchronously.

2. Primary‑Key Tables: Row‑level updates are handled with bucketed primary‑key tables, which support bucket rescaling and a column‑family concept for efficient full‑outer joins.

3. In‑Place Migration: A metadata‑only path migrates legacy Hive/Thive tables to Iceberg, supports STRICT, APPEND, and OVERWRITE modes, and introduces a new name‑mapping mechanism to enhance partition pruning.

4. PyIceberg: A JVM‑free Python client offers fast native decoding and integrates with Pandas, TensorFlow, PyTorch, and DuckDB for data science and AI model training.
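
The multi‑stream join and bucketed primary‑key ideas above can be sketched in plain Python. This is an illustration under stated assumptions, not the talk's implementation: records are modelled as dictionaries, the primary‑key column is assumed to be called `id`, and `bucket_for` uses CRC32 purely as a stand‑in hash (the source does not say which hash function the bucketing uses). Each stream carries only the columns it owns, and merging by key yields the same rows a full‑outer join on the key would produce.

```python
import zlib
from typing import Any, Dict, Iterable

Record = Dict[str, Any]


def bucket_for(key: str, num_buckets: int) -> int:
    """Assign a primary key to a bucket (CRC32 is an assumed stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def merge_partial_updates(*streams: Iterable[Record]) -> Dict[str, Record]:
    """Merge streams that each update different columns of the same keyed rows.

    Later values for a column overwrite earlier ones; None means
    "this stream does not carry that column" and is ignored.
    """
    merged: Dict[str, Record] = {}
    for stream in streams:
        for record in stream:
            key = record["id"]  # hypothetical primary-key column
            row = merged.setdefault(key, {"id": key})
            row.update({k: v for k, v in record.items() if v is not None})
    return merged


if __name__ == "__main__":
    profile_stream = [{"id": "u1", "name": "Ann"}, {"id": "u2", "name": "Bob"}]
    location_stream = [{"id": "u1", "city": "Shenzhen"}]
    for key, row in merge_partial_updates(profile_stream, location_stream).items():
        print(bucket_for(key, 4), row)
```

Because all records for a key hash to the same bucket, the merge can run bucket by bucket, which is what makes it practical to defer to asynchronous compaction rather than performing it at write time.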

Summary and Outlook

Future work focuses on three areas: further optimizing the Auto Optimize Service (cold‑hot separation, materialized view acceleration, intelligent perception, compaction refinement, and advanced transform/UDF partition pruning); enhancing primary‑key tables with deletion vectors; and exploring AI‑driven lakehouse formats and distributed DataFrames that unify metadata and compute.

Overall, Tencent's real‑time lakehouse architecture and its intelligent optimization services deliver measurable gains for large‑scale data processing workloads, including the more than 5× reduction in compaction time and resources and the more than 4× query speedup from clustering reported above.

