
Tencent Real‑Time Lakehouse Intelligent Optimization Practices

This article presents Tencent's end‑to‑end real‑time lakehouse architecture. It covers the three‑layer design; the Auto Optimize Service modules, including compaction, indexing, clustering, and engine acceleration; scenario‑driven capabilities such as multi‑stream joins, primary‑key tables, in‑place migration, and PyIceberg support; and future optimization directions.

DataFunSummit

Overview

The presentation introduces Tencent's real‑time lakehouse solution, which consists of three layers:

Data‑lake compute – Spark for batch, Flink for near‑real‑time, and StarRocks/Presto for ad‑hoc OLAP.

Data‑lake management – Iceberg with open APIs and an Auto Optimize Service.

Data‑lake storage – HDFS and COS with an Alluxio cache.

Intelligent Optimization Service

The Auto Optimize Service is divided into six modules:

Compaction Service – merges small files using RowGroup‑level or Page‑level copy strategies, achieving over 5× reduction in merge time and resources.

Expiration Service – removes expired snapshots.

Cleaning Service – manages lifecycle and orphan file cleanup.

Clustering Service – reorganizes data with Z‑order clustering to improve data skipping, improving query performance by more than 4×.

Index Service – adds secondary indexes on Iceberg tables, with an intelligent recommendation engine that analyzes scan metrics, query frequency, and filter conditions.

Auto Engine Service – uses OLAP engine events to identify hot partitions and route them to StarRocks, so that the engine serving a given partition can be selected automatically.
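To make the data‑skipping claim for the Clustering Service concrete, the following is a minimal sketch of Z‑order (Morton) clustering on a toy two‑column dataset. It is not Tencent's implementation: the row shape, the helper names (`interleave_bits`, `split_into_files`, `files_to_scan`), and the 4‑rows‑per‑file layout are all assumptions made for illustration, and the per‑file min/max check only imitates the kind of column statistics a table format keeps per data file.

```python
from typing import List, Sequence, Tuple

Row = Tuple[int, int]  # hypothetical two-column row: (x, y)


def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x occupies the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y occupies the odd bit positions
    return z


def split_into_files(rows: Sequence[Row], rows_per_file: int) -> List[List[Row]]:
    """Cut an ordered sequence of rows into fixed-size 'files'."""
    return [list(rows[i:i + rows_per_file])
            for i in range(0, len(rows), rows_per_file)]


def files_to_scan(files: List[List[Row]], col: int, value: int) -> int:
    """Count files whose min/max range on column `col` could contain `value`."""
    count = 0
    for f in files:
        values = [r[col] for r in f]
        if min(values) <= value <= max(values):
            count += 1
    return count


# Toy data: every (x, y) pair on an 8 x 8 grid, 4 rows per file (16 files).
grid = [(x, y) for x in range(8) for y in range(8)]
linear_files = split_into_files(sorted(grid), 4)  # sorted by x, then y
z_files = split_into_files(
    sorted(grid, key=lambda r: interleave_bits(r[0], r[1])), 4
)

for col, name in ((0, "x"), (1, "y")):
    print(name,
          files_to_scan(linear_files, col, 5),
          files_to_scan(z_files, col, 5))
```

On this toy grid, a filter on `x = 5` touches 2 of 16 files under the linear sort but a filter on `y = 5` touches 8, whereas the Z‑ordered layout touches 4 files for either column. Z‑ordering gives up some selectivity on the leading sort column in exchange for balanced skipping across all clustered columns.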

Scenario‑Based Capabilities

1. Multi‑Stream Join: Two MQ streams that update different columns of the same table are merged by tagging branches in Iceberg and compacting them asynchronously.

2. Primary‑Key Tables: Row‑level updates are supported through bucketed primary‑key tables, bucket rescaling, and a column‑family concept that makes full‑outer joins efficient.

3. In‑Place Migration: Legacy Hive/Thive tables are migrated to Iceberg through a metadata‑only path that supports STRICT, APPEND, and OVERWRITE modes. A new name‑mapping mechanism is introduced to enhance partition pruning.

4. PyIceberg: A JVM‑free Python client provides fast native decoding and integrates with Pandas, TensorFlow, PyTorch, and DuckDB for data science and AI model training.
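The multi‑stream join and primary‑key scenarios both rest on merging partial rows that share a key. The sketch below illustrates only that merge step in plain Python. It does not model Iceberg branches or asynchronous compaction, and the rule used here (a later non‑null value overwrites an earlier one, applied per column) is an assumption for illustration rather than the behavior described in the talk. The function name `merge_partial_rows` and the sample column names are hypothetical.

```python
from typing import Dict, Iterable, Optional

Row = Dict[str, Optional[object]]


def merge_partial_rows(key: str, *streams: Iterable[Row]) -> Dict[object, Row]:
    """Merge partial rows from several streams into one row per primary key.

    Each stream carries the key column plus the subset of columns it owns.
    Assumed rule: a non-null value overwrites whatever was seen earlier for
    that column; a null never erases an existing value.
    """
    merged: Dict[object, Row] = {}
    for stream in streams:
        for row in stream:
            target = merged.setdefault(row[key], {key: row[key]})
            for col, val in row.items():
                if val is not None:
                    target[col] = val
    return merged


# Hypothetical streams: one updates `amount`, the other updates `status`.
order_stream = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 20},
    {"id": 1, "amount": 15},  # later update to the same key
]
payment_stream = [
    {"id": 1, "status": "paid"},
    {"id": 3, "status": "pending"},
]

result = merge_partial_rows("id", order_stream, payment_stream)
for pk in sorted(result):
    print(result[pk])
```

Keys that appear in only one stream (here `2` and `3`) still produce a row with the columns that stream supplied, which mirrors the full‑outer‑join semantics mentioned for primary‑key tables.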

Summary and Outlook

Future work focuses on three areas: further optimizing the Auto Optimize Service (cold‑hot separation, materialized view acceleration, intelligent perception, compaction refinement, and advanced transform/UDF partition pruning); enhancing primary‑key tables with deletion vectors; and exploring AI‑driven lakehouse formats and distributed DataFrames that unify metadata and compute.

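Overall, the reported results for Tencent's real‑time lakehouse architecture and its intelligent optimization services, including an over 5× reduction in compaction time and resources and more than 4× faster queries from clustering, point to significant performance improvements and cost reductions for large‑scale data processing workloads.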
