Tencent Data Lake Metadata Governance Practice and Architecture
This article presents Tencent's data lake metadata governance practice, covering data lake fundamentals, the 3+2 architecture of storage, compute and unified metadata, multi‑tenant design, the re‑implemented Hive Metastore for online catalog, performance optimizations, and offline data‑governance capabilities.
The article first introduces the data lake concept by contrasting it with the traditional data warehouse: a warehouse stores structured data under predefined schemas, while a data lake accepts raw structured and semi‑structured data, trading weaker built‑in data‑quality guarantees for lower cost and higher scalability.
Using Tencent's DLC data lake as an example, the article highlights three advantages of a data lake: high efficiency through Iceberg and Alluxio, low cost via COS object storage and serverless compute, and easy scalability thanks to a compute‑storage separation architecture.
The overall Tencent data lake architecture follows a "3+2" model: three core components (storage, compute, unified metadata) and two supporting modules (heterogeneous data sources and lake‑ingestion). Storage relies on COS with Alluxio caching; compute provides serverless Presto, Spark, and SuperSQL services; unified metadata offers online data catalog and offline governance.
Unified metadata consists of four parts: meta‑model definition, metadata collection, metadata storage (relational, index, and graph databases), and metadata applications (catalog, data dictionary, lineage, tagging, lifecycle). Two related modules handle heterogeneous data sources and lake‑ingestion using Iceberg tables and Flink.
Tenant design separates metadata tenants (isolated schema stores) from business tenants (bound to specific data sources). Each metadata tenant owns one or more databases (a one‑to‑many mapping), and each business tenant can map to multiple data sources, which enables flexible multi‑catalog management in both public and private cloud scenarios.
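The tenant mapping described above can be sketched in a few lines of Python. This is only an illustration, not Tencent's implementation: the class names `MetadataTenant` and `BusinessTenant`, the `attach` and `resolve` methods, and the assumption that each data source is backed by exactly one metadata tenant are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MetadataTenant:
    """Hypothetical isolated schema store; owns many databases (one-to-many)."""
    name: str
    databases: set[str] = field(default_factory=set)


@dataclass
class BusinessTenant:
    """Hypothetical business-facing tenant that maps to several data sources."""
    name: str
    # Assumption: each data source name is backed by one metadata tenant.
    data_sources: dict[str, MetadataTenant] = field(default_factory=dict)

    def attach(self, source_name: str, tenant: MetadataTenant) -> None:
        self.data_sources[source_name] = tenant

    def resolve(self, source_name: str, database: str) -> tuple[str, str]:
        """Return (metadata tenant name, database), or raise KeyError."""
        tenant = self.data_sources[source_name]
        if database not in tenant.databases:
            raise KeyError(f"{database!r} not found in tenant {tenant.name!r}")
        return tenant.name, database
```

In this sketch a business tenant sees a database only through a data source it has attached, which is one simple way to obtain the multi‑catalog behaviour the article describes.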
The online catalog re‑implements the Hive Metastore RPC interfaces to add multi‑tenancy, RPC authentication, and HTTP APIs, replacing the original single‑tenant design. The new Metastore is based on Hive 2.3.7, implements 79 of the 167 RPC methods, works with multiple engines and components (Presto, Spark, Flink, Iceberg, Alluxio), and uses MyBatis for a more efficient persistence layer.
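To make the idea of a tenant‑aware, authenticated catalog concrete, here is a minimal in‑memory sketch. It does not reproduce the Hive Metastore Thrift API or the MyBatis persistence layer; the class `TenantCatalog`, its method names, and the token scheme are assumptions made purely for illustration.

```python
class TenantCatalog:
    """In-memory stand-in for a tenant-aware metastore (hypothetical API)."""

    def __init__(self):
        self._tokens = {}  # token -> tenant id
        self._tables = {}  # (tenant, database, table) -> schema dict

    def register_token(self, token, tenant):
        self._tokens[token] = tenant

    def _authenticate(self, token):
        # Every call is authenticated first; the tenant comes from the token,
        # never from a caller-supplied argument.
        try:
            return self._tokens[token]
        except KeyError:
            raise PermissionError("unknown token") from None

    def create_table(self, token, database, table, schema):
        tenant = self._authenticate(token)
        self._tables[(tenant, database, table)] = dict(schema)

    def get_table(self, token, database, table):
        tenant = self._authenticate(token)
        # Raises KeyError if the table does not exist for this tenant.
        return self._tables[(tenant, database, table)]
```

Because the tenant id is part of every storage key, two tenants can each own a table named `db.orders` without seeing one another's schema.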
Statistical metadata collected via ANALYZE statements is unified across engines to feed cost‑based optimizers (CBO), improving query planning without redundant analysis.
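The benefit of sharing statistics can be illustrated with a small sketch. Everything here is hypothetical: `StatsStore`, `TableStats`, and `pick_build_side` are illustrative names, and the `analyze` callback stands in for whatever an engine's ANALYZE statement would actually compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableStats:
    row_count: int
    size_bytes: int


class StatsStore:
    """Shared statistics cache: a table is analyzed once, then reused."""

    def __init__(self):
        self._stats = {}
        self.analyze_runs = 0  # counts how often real analysis was needed

    def get_or_analyze(self, table, analyze):
        if table not in self._stats:
            self._stats[table] = analyze(table)
            self.analyze_runs += 1
        return self._stats[table]


def pick_build_side(store, left, right, analyze):
    """Toy cost-based decision: build the hash table on the smaller input."""
    left_stats = store.get_or_analyze(left, analyze)
    right_stats = store.get_or_analyze(right, analyze)
    return left if left_stats.row_count <= right_stats.row_count else right
```

If two engines plan the same join against one `StatsStore`, the second reuses the first engine's statistics instead of analyzing the tables again.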
Offline data governance provides a full suite of capabilities (data maps, dictionaries, search, lineage, tagging, and lifecycle management), implemented with relational, index, and graph databases, message queues, and schedulers. Open‑source solutions such as Apache Atlas, LinkedIn DataHub, and Lyft Amundsen were surveyed, but Tencent built a custom solution (project Hybris) to meet its specific business needs.
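Lineage is the capability that most naturally maps to a graph store. As a rough sketch of the kind of query involved, the snippet below keeps table‑level lineage in memory and walks it breadth‑first. `LineageGraph` and its methods are hypothetical names, and a production system would run this traversal in a graph database rather than in a Python dictionary.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Table-level lineage stored as target -> set of direct sources."""

    def __init__(self):
        self._upstream = defaultdict(set)

    def add_edge(self, source, target):
        """Record that `target` is derived from `source`."""
        self._upstream[target].add(source)

    def upstream_of(self, table):
        """Return all direct and transitive sources of `table` (BFS)."""
        seen = set()
        pending = deque([table])
        while pending:
            node = pending.popleft()
            for parent in self._upstream.get(node, ()):
                if parent not in seen:
                    seen.add(parent)
                    pending.append(parent)
        return seen
```

For example, after recording `ods_orders -> dwd_orders -> ads_report`, calling `upstream_of("ads_report")` returns both upstream tables.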
In the public cloud, the complete governance workflow runs as follows: a scheduler triggers extraction tasks, the tasks pull metadata from heterogeneous sources, a message‑driven pipeline processes it, and the results are persisted for downstream applications, keeping data across the lake high‑quality, searchable, and governed.
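The extract, queue, and persist flow can be sketched with the standard library alone. This is a single‑process illustration under stated assumptions: `extract` and `consume` are hypothetical function names, `queue.Queue` stands in for a real message queue, a plain dictionary stands in for the metadata store, and the scheduler is replaced by a direct function call.

```python
import queue


def extract(sources, outbox):
    """Pull table names from each source and publish one event per table.

    `sources` maps a source name to a zero-argument callable that returns
    an iterable of table names (a stand-in for a real extractor).
    """
    for source_name, fetch in sources.items():
        for table in fetch():
            outbox.put({"source": source_name, "table": table})


def consume(inbox, store):
    """Drain the queue and upsert each event into the metadata store."""
    while True:
        try:
            event = inbox.get_nowait()
        except queue.Empty:
            break
        store[(event["source"], event["table"])] = event
```

Keying the store by `(source, table)` makes repeated extraction runs idempotent: a rerun overwrites existing entries instead of duplicating them.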
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.