Tencent Data Lake Metadata Governance Practice and Architecture

This article presents Tencent's data lake metadata governance practice, covering data lake fundamentals, the 3+2 architecture built around storage, compute, and unified metadata, multi-tenant design, the re-implemented Hive Metastore behind the online catalog, performance optimizations, and offline data-governance capabilities.

DataFunTalk

Mar 13, 2022

The talk introduces the concept of a data lake, contrasting it with traditional data warehouses: warehouses store structured data with predefined schemas, while data lakes accept raw structured and semi‑structured data, offering lower cost and higher scalability at the expense of data quality.

Using Tencent's DLC data lake as an example, the talk highlights three advantages of a data lake: high efficiency through Iceberg and Alluxio, low cost via COS object storage and serverless compute, and easy scalability through a compute-storage separation architecture.

The overall Tencent data lake architecture follows a "3+2" model: three core components (storage, compute, and unified metadata) and two supporting modules (heterogeneous data sources and lake ingestion). Storage relies on COS with Alluxio caching; compute provides serverless Presto, Spark, and SuperSQL services; unified metadata offers an online data catalog and offline governance.

Unified metadata consists of four parts: meta‑model definition, metadata collection, metadata storage (relational, index, and graph databases), and metadata applications (catalog, data dictionary, lineage, tagging, lifecycle). Two related modules handle heterogeneous data sources and lake‑ingestion using Iceberg tables and Flink.

Tenant design separates metadata tenants (isolated schema stores) from business tenants (bound to specific data sources). Each metadata tenant maps to many databases, while each business tenant maps to multiple data sources, enabling flexible multi-catalog management in both public and private cloud scenarios.
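The two tenant types can be pictured with a small data model. This is a minimal sketch, not Tencent's implementation: the class names, the `resolve` helper, and the rule that each data source points at exactly one metadata tenant are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MetadataTenant:
    """Hypothetical metadata tenant: an isolated schema store owning many databases."""
    name: str
    databases: set[str] = field(default_factory=set)


@dataclass
class BusinessTenant:
    """Hypothetical business tenant: bound to one or more data sources."""
    name: str
    # Assumption: each data source name maps to the metadata tenant holding its schema.
    data_sources: dict[str, MetadataTenant] = field(default_factory=dict)

    def resolve(self, data_source: str, database: str) -> str:
        """Return a qualified name, or raise KeyError if the database is not visible."""
        tenant = self.data_sources[data_source]
        if database not in tenant.databases:
            raise KeyError(f"{database!r} not found in metadata tenant {tenant.name!r}")
        return f"{tenant.name}.{data_source}.{database}"


# Illustrative names only.
hive_meta = MetadataTenant("meta_a", {"sales", "logs"})
mysql_meta = MetadataTenant("meta_b", {"orders"})
team = BusinessTenant("team_x", {"hive_prod": hive_meta, "mysql_crm": mysql_meta})

print(team.resolve("hive_prod", "sales"))  # meta_a.hive_prod.sales
```

A business tenant sees several catalogs at once, but a database that belongs to a different metadata tenant is simply not resolvable through the wrong data source.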

The online catalog re-implements the Hive Metastore RPC interfaces to add multi-tenancy, RPC authentication, and HTTP APIs, replacing the original single-tenant design. The new Metastore, built on Hive 2.3.7, implements 79 of the 167 RPC methods, works with Presto, Spark, Flink, Iceberg, and Alluxio, and uses MyBatis for a more efficient persistence layer.
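The idea of tenant-scoped, authenticated catalog calls can be sketched with a toy in-memory service. This does not reproduce the Hive Metastore Thrift API or the talk's actual authentication scheme: the class, method names, and the HMAC-based token are hypothetical, chosen only to show that every call carries a tenant identity and that table keys are scoped by tenant.

```python
from __future__ import annotations

import hashlib
import hmac


class TenantMetastore:
    """Toy stand-in for a multi-tenant metastore (hypothetical API)."""

    def __init__(self, secrets: dict[str, bytes]) -> None:
        self._secrets = secrets  # tenant -> shared secret (assumed auth scheme)
        self._tables: dict[tuple[str, str, str], dict[str, str]] = {}

    def issue_token(self, tenant: str) -> str:
        """Derive a token from the tenant's secret."""
        return hmac.new(self._secrets[tenant], tenant.encode(), hashlib.sha256).hexdigest()

    def _authenticate(self, tenant: str, token: str) -> None:
        secret = self._secrets.get(tenant)
        if secret is None:
            raise PermissionError(f"unknown tenant {tenant!r}")
        expected = hmac.new(secret, tenant.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, token):
            raise PermissionError(f"bad token for tenant {tenant!r}")

    def create_table(self, tenant: str, token: str, database: str,
                     table: str, columns: dict[str, str]) -> None:
        self._authenticate(tenant, token)
        # The tenant is part of the key, so tenants never see each other's tables.
        self._tables[(tenant, database, table)] = dict(columns)

    def get_table(self, tenant: str, token: str, database: str, table: str) -> dict[str, str]:
        self._authenticate(tenant, token)
        return self._tables[(tenant, database, table)]


store = TenantMetastore({"tenant_a": b"secret-a", "tenant_b": b"secret-b"})
token_a = store.issue_token("tenant_a")
token_b = store.issue_token("tenant_b")

# Both tenants own a table called sales.orders, with different schemas.
store.create_table("tenant_a", token_a, "sales", "orders", {"id": "bigint"})
store.create_table("tenant_b", token_b, "sales", "orders", {"id": "string", "amount": "double"})

print(store.get_table("tenant_a", token_a, "sales", "orders"))  # {'id': 'bigint'}
```

In a single-tenant design the key would be just `(database, table)`, so the second `create_table` call would collide with the first.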

Statistical metadata collected via ANALYZE statements is unified across engines to feed cost‑based optimizers (CBO), improving query planning without redundant analysis.
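A minimal sketch of why shared statistics avoid redundant analysis: two planners consult one store, and the expensive analysis step runs only once per table. The `StatsStore` class, the `pick_build_side` heuristic, and the row counts are illustrative assumptions, not the optimizers of Presto or Spark.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class TableStats:
    row_count: int


class StatsStore:
    """Hypothetical shared statistics store consulted by every engine."""

    def __init__(self) -> None:
        self._stats: Dict[str, TableStats] = {}
        self.analyze_runs = 0  # counts how often the expensive analysis ran

    def get_or_analyze(self, table: str, analyze: Callable[[str], TableStats]) -> TableStats:
        if table not in self._stats:
            self._stats[table] = analyze(table)
            self.analyze_runs += 1
        return self._stats[table]


def pick_build_side(store: StatsStore, left: str, right: str,
                    analyze: Callable[[str], TableStats]) -> str:
    """Simplified cost-based choice: build the hash table on the smaller input."""
    left_stats = store.get_or_analyze(left, analyze)
    right_stats = store.get_or_analyze(right, analyze)
    return left if left_stats.row_count <= right_stats.row_count else right


# Made-up row counts standing in for the result of an ANALYZE statement.
FAKE_ROW_COUNTS = {"orders": 1_000_000, "regions": 50}


def fake_analyze(table: str) -> TableStats:
    return TableStats(FAKE_ROW_COUNTS[table])


store = StatsStore()
first_engine_choice = pick_build_side(store, "orders", "regions", fake_analyze)
second_engine_choice = pick_build_side(store, "regions", "orders", fake_analyze)

print(first_engine_choice, second_engine_choice, store.analyze_runs)  # regions regions 2
```

The second planner reuses the statistics gathered for the first, so `analyze_runs` stays at two (one per table) rather than four.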

Offline data governance provides a full suite of capabilities: data maps, dictionaries, search, lineage, tagging, and lifecycle management. These are implemented with relational, index, and graph databases, message queues, and schedulers. Open-source solutions such as Apache Atlas, LinkedIn DataHub, and Lyft Amundsen are surveyed, but Tencent builds a custom solution (project Hybris) to meet specific business needs.
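Lineage is the capability that most naturally calls for a graph store. As a rough illustration, the sketch below keeps lineage edges in an in-memory adjacency map and walks them breadth-first to find everything downstream of a table. The class and table names are hypothetical, and a production system would presumably issue this traversal against a graph database instead.

```python
from collections import defaultdict, deque
from typing import DefaultDict, List, Set


class LineageGraph:
    """In-memory stand-in for a lineage graph (source table -> derived table)."""

    def __init__(self) -> None:
        self._downstream: DefaultDict[str, Set[str]] = defaultdict(set)

    def add_edge(self, source: str, target: str) -> None:
        self._downstream[source].add(target)

    def impacted_by(self, table: str) -> List[str]:
        """Return all tables reachable downstream of `table`, sorted by name."""
        seen: Set[str] = set()
        pending = deque([table])
        while pending:
            node = pending.popleft()
            # .get avoids creating empty entries for leaf tables.
            for child in self._downstream.get(node, ()):
                if child not in seen:
                    seen.add(child)
                    pending.append(child)
        return sorted(seen)


graph = LineageGraph()
graph.add_edge("ods.orders", "dwd.orders")
graph.add_edge("ods.orders", "dwd.refunds")
graph.add_edge("dwd.orders", "ads.revenue")

print(graph.impacted_by("ods.orders"))  # ['ads.revenue', 'dwd.orders', 'dwd.refunds']
```

An impact query like this is what lets a governance tool warn owners of downstream tables before a source table is changed or expired.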

In the public cloud, the complete governance workflow triggers scheduled extraction tasks, pulls metadata from heterogeneous sources, processes it through a message-driven pipeline, and persists it for downstream applications, keeping data across the lake high-quality, searchable, and governed.
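The workflow above can be sketched as an extract, enqueue, consume, and persist loop. This is a single-process approximation under stated assumptions: `queue.Queue` stands in for the message queue, a dictionary stands in for the metadata store, and the function names, event fields, and source names are all hypothetical.

```python
import queue
from typing import Dict, List


def extract(sources: Dict[str, List[str]], bus: "queue.Queue[dict]") -> None:
    """Scheduled extraction task: publish one event per table found in each source."""
    for source, tables in sources.items():
        for table in tables:
            bus.put({"source": source, "table": table})


def consume(bus: "queue.Queue[dict]", catalog: Dict[str, dict]) -> None:
    """Consumer: normalize each event and persist it for downstream applications."""
    while not bus.empty():
        event = bus.get()
        key = f"{event['source']}.{event['table'].lower()}"
        catalog[key] = {"source": event["source"], "table": event["table"].lower()}


def run_once(sources: Dict[str, List[str]]) -> Dict[str, dict]:
    """One scheduler tick: extract everything, then drain the queue."""
    bus: "queue.Queue[dict]" = queue.Queue()
    catalog: Dict[str, dict] = {}
    extract(sources, bus)
    consume(bus, catalog)
    return catalog


catalog = run_once({"hive_prod": ["Sales", "Logs"], "mysql_crm": ["Orders"]})
print(sorted(catalog))  # ['hive_prod.logs', 'hive_prod.sales', 'mysql_crm.orders']
```

Decoupling extraction from persistence through a queue means a slow or failing source does not block the consumers, which is one common reason to make such a pipeline message-driven.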

