
Tencent Real-time Lakehouse Intelligent Optimization Practice

Tencent’s real‑time lakehouse combines Spark, Flink, StarRocks and Presto compute layers with Iceberg‑based management and HDFS/COS storage. Its Intelligent Optimize Service comprises Compaction, Expiration, Cleaning, Clustering, Index and Auto‑Engine modules, which automatically reduce merge time, improve query performance, enable secondary indexing, and dynamically route hot partitions. Future plans target cold/hot separation, materialized view acceleration, and AI‑driven optimizations.

Sohu Tech Products

The talk titled "Tencent Big Data Real-time Lakehouse Intelligent Optimization Practice" was presented by Chen Liang, a senior engineer at Tencent, and edited/compiled by the DataFun community.

Tencent's lakehouse architecture consists of three layers. The compute layer uses Spark for batch ETL, Flink for near‑real‑time streaming, and StarRocks and Presto for OLAP queries. The management layer is built on the Iceberg core with open APIs, with an Auto Optimize Service on top. The storage layer uses HDFS and Tencent Cloud Object Storage (COS), fronted by an Alluxio cache layer.

The core of the presentation is the Intelligent Optimize Service, which comprises six modules: Compaction Service (small‑file merging), Expiration Service (snapshot cleanup), Cleaning Service (lifecycle and orphan file removal), Clustering Service (data redistribution), Index Service (secondary index recommendation), and Auto Engine Service (automatic engine acceleration). Each module addresses specific performance and cost challenges in Iceberg‑based lakehouses.

Compaction Service optimizations include RowGroup‑level and Page‑level copy strategies, enhanced Delete Files merging via Left Anti Join and Bloom Index, and incremental Rewrite using Modify Time for partition‑level updates, resulting in more than a 5× reduction in merge time and resource consumption.
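To make two of these ideas concrete, here is a minimal pure-Python sketch, not the service's actual implementation. It assumes delete files can be reduced to a set of deleted keys, applies them to data rows as a left anti join, and uses a small Bloom filter as a cheap pre-check before the exact lookup. The names `BloomFilter`, `apply_deletes`, and `partitions_to_rewrite`, and the filter sizes, are hypothetical and chosen for illustration only.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; sizes are illustrative, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def apply_deletes(data_rows, deleted_keys, key="id"):
    """Left anti join: keep rows whose key has no match among deleted_keys."""
    deleted = set(deleted_keys)
    bloom = BloomFilter()
    for k in deleted:
        bloom.add(k)

    kept = []
    for row in data_rows:
        # A Bloom miss proves the key was never deleted, so the exact
        # lookup only runs for the (usually few) possible matches.
        if bloom.might_contain(row[key]) and row[key] in deleted:
            continue
        kept.append(row)
    return kept


def partitions_to_rewrite(partition_mtimes, last_compaction_time):
    """Incremental rewrite: pick only partitions modified since the last run."""
    return sorted(
        partition
        for partition, mtime in partition_mtimes.items()
        if mtime > last_compaction_time
    )
```

In a real engine the anti join would run distributed over data and delete files; the sketch only shows why a Bloom pre-check and a modify-time filter cut the work a compaction job has to do.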

Index Service extends Iceberg’s min‑max indexes with a secondary index framework. It collects scan and filter metrics, analyzes query frequency and column cardinality, and intelligently recommends indexes. The end‑to‑end flow involves SQL reconstruction, coarse filtering, incremental index construction, blind‑run validation, effect evaluation, and user‑facing output, with support for task‑ and table‑level synchronization.
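The recommendation step can be illustrated with a small scoring sketch. The talk summary does not give the actual scoring formula, so everything here is an assumption: the `ColumnStats` fields, the thresholds, and the score (filter frequency multiplied by cardinality ratio) are hypothetical stand-ins for "analyzes query frequency and column cardinality".

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    name: str
    filter_count: int     # queries that filtered on this column
    distinct_values: int  # estimated number of distinct values
    row_count: int        # rows in the table


def recommend_index_columns(stats, total_queries,
                            min_frequency=0.2,
                            min_cardinality_ratio=0.01,
                            top_n=2):
    """Rank columns that are filtered often and are selective enough."""
    if total_queries <= 0:
        return []

    candidates = []
    for col in stats:
        frequency = col.filter_count / total_queries
        cardinality_ratio = col.distinct_values / col.row_count
        # Rarely filtered columns and low-cardinality columns gain little
        # from a secondary index, so they are dropped (coarse filtering).
        if frequency < min_frequency or cardinality_ratio < min_cardinality_ratio:
            continue
        candidates.append((frequency * cardinality_ratio, col.name))

    candidates.sort(reverse=True)
    return [name for _, name in candidates[:top_n]]
```

A production recommender would follow this with the validation and effect-evaluation stages described above; the sketch covers only the ranking idea.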

Clustering Service improves Data Skipping by re‑ordering data using Z‑order curves on selected columns, preserving cross‑column ordering and achieving over 4× improvement in query performance for both single‑ and multi‑column workloads.
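The effect of Z‑ordering on Data Skipping can be shown with a small self-contained sketch. It assumes two non-negative integer columns and interleaves their bits to form the Z‑value; the helper names (`z_value`, `z_order_sort`, `split_into_files`, `files_to_scan`) are hypothetical, and real engines additionally handle other data types and value ranges.

```python
def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of x and y (x takes the even positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def z_order_sort(rows):
    """Sort (x, y) rows along the Z-order curve."""
    return sorted(rows, key=lambda row: z_value(row[0], row[1]))


def split_into_files(rows, rows_per_file):
    """Simulate writing sorted rows into fixed-size data files."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


def files_to_scan(files, column, value):
    """Count files whose min-max range on `column` could contain `value`."""
    count = 0
    for file_rows in files:
        values = [row[column] for row in file_rows]
        if min(values) <= value <= max(values):
            count += 1
    return count


# Compare a plain (x, y) sort with a Z-order sort on a 4x4 grid of points.
grid = [(x, y) for x in range(4) for y in range(4)]
linear_files = split_into_files(sorted(grid), 4)
z_files = split_into_files(z_order_sort(grid), 4)

for name, files in (("linear", linear_files), ("z-order", z_files)):
    print(name,
          "x == 0 scans", files_to_scan(files, 0, 0), "files;",
          "y == 0 scans", files_to_scan(files, 1, 0), "files")
```

On this toy grid the linear sort prunes well only on the leading column (1 file for `x == 0`, all 4 files for `y == 0`), while the Z‑order layout reads 2 files for either predicate. That is the trade-off behind preserving ordering across columns.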

Auto Engine Service monitors OLAP engine events, warms up frequently queried partitions by routing them to StarRocks/Doris clusters, and exposes the partition metadata so that upper‑layer engines can apply OLAP‑level optimizations when querying external Iceberg/Hudi tables.
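As a rough illustration of the routing idea, the sketch below counts partition accesses in a list of query events and marks partitions above a threshold as hot. The event shape, the `min_hits` threshold, and the target names `olap_cluster` and `lake_storage` are assumptions for the example, not details from the talk.

```python
from collections import Counter


def find_hot_partitions(events, min_hits=3):
    """Return (table, partition) pairs queried at least `min_hits` times."""
    hits = Counter((event["table"], event["partition"]) for event in events)
    return {key for key, count in hits.items() if count >= min_hits}


def route(table, partition, hot_partitions):
    """Send hot partitions to the OLAP cluster, everything else to the lake."""
    if (table, partition) in hot_partitions:
        return "olap_cluster"   # hypothetical target, e.g. a StarRocks cluster
    return "lake_storage"       # hypothetical target, e.g. an Iceberg table on COS
```

A real service would also age out partitions that cool down and account for load cost; here the threshold alone decides.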

Scenario‑specific capabilities discussed include the following. Multi‑stream concatenation uses Iceberg’s branch/tag mechanism, with asynchronous compaction merging branches for read‑side consistency. Primary‑key tables support row‑level updates, with bucket‑level rescalability and column‑family extension. In‑place migration from Hive (including Tencent’s self‑developed Hive variant) to Iceberg regenerates metadata only, offers STRICT/APPEND/OVERWRITE modes, and improves Name Mapping and partition pruning. PyIceberg integration provides JVM‑free, high‑performance access, enabling AI/ML workflows with Pandas, TensorFlow, and PyTorch.
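The multi‑stream concatenation case can be pictured with a toy merge. The sketch assumes each stream writes a subset of columns for the same primary key into its own branch, and that compaction joins them into wide rows. The function `merge_branches` and the row layout are hypothetical, and the code does not use the Iceberg API.

```python
def merge_branches(branches, key="id"):
    """Combine partial rows from several branches into one wide row per key."""
    merged = {}
    for rows in branches:
        for row in rows:
            # Later branches add their columns to the row for the same key;
            # on a column conflict the later branch wins.
            merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]


# Two streams writing different columns for the same primary key.
orders_branch = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
payments_branch = [{"id": 1, "paid": True}]

wide_rows = merge_branches([orders_branch, payments_branch])
```

Readers that query only the merged result see one consistent wide row per key, which is the read‑side consistency the asynchronous compaction is meant to provide.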

Future work outlined in the talk includes cold/hot separation for cost efficiency, materialized view acceleration, intelligent sensing for Auto Engine, further compaction refinement, expanded Transform UDF‑based partition pruning, primary‑key table enhancements using deletion vectors, and AI‑driven exploration of model‑ready lakehouse formats and distributed DataFrame designs that unify metadata and execution engines.
