Tencent's Multi-Engine Unified Metadata and Permission Management for Big Data
This article introduces Tencent's Big Data Processing Suite (TBDS) and the data‑silo challenges it faces. It then presents Gravitino, an open‑source unified metadata service with its own permission model, and details how Gravitino integrates Hadoop, MPP, and various catalog plugins to provide consistent access control across heterogeneous data platforms.
TBDS (Tencent Big Data Processing Suite) is a data platform built on both Hadoop and MPP ecosystems, supporting four main scenarios: batch‑stream processing, cloud‑native data lake, lake‑warehouse integration, and localized data middle‑platforms.
The platform serves customers in finance, industry, media, and government, each with very different data volumes, cluster sizes, and service requirements. This diversity leads to significant data‑silo problems.
One way to address data silos is to unify metadata through Hive Metastore, so that Hadoop and MPP engines share the same metadata and table formats such as Iceberg. However, Hive Metastore lacks governance capabilities and struggles with semi‑structured and unstructured data.
Existing permission solutions such as Apache Ranger provide limited, component‑centric authorization. Many cloud vendors offer lake‑formation products, but these are often locked to a single cloud and lack support for on‑premises or private‑cloud environments.
Gravitino, an Apache‑licensed open‑source project, offers a unified metadata service and an extensible permission framework that works across public cloud, private cloud, and non‑cloud deployments, exposing a standard SDK for engine integration.
The permission model follows RBAC with three core concepts: Metalake (the organization), Role (a set of permissions), and User (the subject that roles are assigned to). Metalake admins create roles and assign them to users, enabling fine‑grained access control across heterogeneous data stacks.
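The relationship between the three concepts can be sketched in a few lines of Python. This is a minimal in-memory illustration, not Gravitino's SDK: the class names, the `Privilege` pair, and the action and object strings are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Privilege:
    """Hypothetical (action, object) pair, e.g. SELECT on a table."""
    action: str
    securable: str


@dataclass
class Role:
    """A named set of privileges."""
    name: str
    privileges: set = field(default_factory=set)


@dataclass
class Metalake:
    """Organization-level container that owns roles and user assignments."""
    name: str
    roles: dict = field(default_factory=dict)       # role name -> Role
    user_roles: dict = field(default_factory=dict)  # user name -> set of role names

    def create_role(self, name, privileges):
        self.roles[name] = Role(name, set(privileges))

    def grant_role(self, user, role_name):
        if role_name not in self.roles:
            raise KeyError(f"unknown role: {role_name}")
        self.user_roles.setdefault(user, set()).add(role_name)

    def is_allowed(self, user, action, securable):
        # A user is allowed if any of their roles holds the privilege.
        wanted = Privilege(action, securable)
        return any(
            wanted in self.roles[role_name].privileges
            for role_name in self.user_roles.get(user, set())
        )


lake = Metalake("demo_org")
lake.create_role("analyst", [Privilege("SELECT", "hive.sales.orders")])
lake.grant_role("alice", "analyst")

print(lake.is_allowed("alice", "SELECT", "hive.sales.orders"))  # True
print(lake.is_allowed("alice", "DROP", "hive.sales.orders"))    # False
```

Users never hold privileges directly in this sketch; they only hold roles, which is the defining property of RBAC and keeps grants manageable when the same role spans several engines.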
Gravitino’s architecture includes built‑in authentication and a plugin system for external authorization. Four plugin categories are supported: Native Catalog plugins for Hadoop‑style permissions, Ranger plugins for Hadoop ecosystems, JDBC plugins for MPP and relational databases, and Cloud Catalog plugins for cloud IAM services.
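One way to picture the plugin system is as a dispatcher that turns a single unified grant into a backend-specific operation. The sketch below is illustrative only: the class names, the catalog-to-plugin mapping, and the returned strings are assumptions, and no real Ranger, JDBC, or cloud IAM call is made.

```python
from abc import ABC, abstractmethod


class AuthorizationPlugin(ABC):
    """Hypothetical interface: translate a unified grant for one backend."""

    @abstractmethod
    def grant(self, role, action, securable):
        """Return a description of the backend-specific operation."""


class NativeCatalogPlugin(AuthorizationPlugin):
    def grant(self, role, action, securable):
        return f"native: allow '{role}' to {action} on storage for {securable}"


class RangerPlugin(AuthorizationPlugin):
    def grant(self, role, action, securable):
        return f"ranger: policy(resource={securable}, access={action}, role={role})"


class JdbcPlugin(AuthorizationPlugin):
    def grant(self, role, action, securable):
        # Relational and MPP backends typically accept a SQL GRANT statement.
        return f"GRANT {action} ON {securable} TO {role}"


class CloudCatalogPlugin(AuthorizationPlugin):
    def grant(self, role, action, securable):
        return f"iam: attach {action} on {securable} to {role}"


# One plugin instance per category named in the text.
PLUGINS = {
    "native": NativeCatalogPlugin(),
    "ranger": RangerPlugin(),
    "jdbc": JdbcPlugin(),
    "cloud": CloudCatalogPlugin(),
}

# Assumed mapping from catalog name to the plugin category that serves it.
CATALOG_PLUGIN = {"hive_prod": "ranger", "mysql_crm": "jdbc"}


def push_down(catalog, role, action, securable):
    """Route a unified grant to the plugin configured for the catalog."""
    plugin = PLUGINS[CATALOG_PLUGIN[catalog]]
    return plugin.grant(role, action, securable)


print(push_down("hive_prod", "analyst", "SELECT", "sales.orders"))
print(push_down("mysql_crm", "analyst", "SELECT", "crm.customers"))
```

The caller states the grant once in the unified model; only the plugin knows how the underlying system expresses it, so adding a new backend means adding one class rather than changing the model.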
Authentication mechanisms include OAuth, Kerberos, and cloud IAM. There are three operational roles: the Service Admin creates metalakes, the Metalake Admin defines roles, and regular Users create data entities.
In an example workflow, the Service Admin creates a metalake, the Metalake Admin defines roles and users, and Users create catalogs (e.g., Hive, MySQL) and tables, with permissions propagated via the unified model.
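The workflow can be walked through with a small in-memory sketch that checks who is allowed to perform each step. The `Service` class, its method names, and the `CREATE_CATALOG` privilege string are hypothetical stand-ins chosen for this example, not Gravitino's actual API.

```python
class PermissionDenied(Exception):
    """Raised when an actor lacks the right to perform a step."""


class Service:
    def __init__(self, service_admins):
        self.service_admins = set(service_admins)
        self.metalakes = {}

    def create_metalake(self, actor, name, admin):
        # Step 1: only a service admin may create a metalake.
        if actor not in self.service_admins:
            raise PermissionDenied(f"{actor} is not a service admin")
        self.metalakes[name] = {
            "admins": {admin}, "roles": {}, "users": {}, "catalogs": {},
        }

    def _require_admin(self, actor, lake):
        if actor not in self.metalakes[lake]["admins"]:
            raise PermissionDenied(f"{actor} is not an admin of {lake}")

    def create_role(self, actor, lake, role, privileges):
        # Step 2a: the metalake admin defines roles.
        self._require_admin(actor, lake)
        self.metalakes[lake]["roles"][role] = set(privileges)

    def add_user(self, actor, lake, user, roles):
        # Step 2b: the metalake admin adds users and assigns roles.
        self._require_admin(actor, lake)
        self.metalakes[lake]["users"][user] = set(roles)

    def create_catalog(self, actor, lake, catalog, provider):
        # Step 3: a user needs a role carrying the (assumed) CREATE_CATALOG privilege.
        metalake = self.metalakes[lake]
        granted = set()
        for role in metalake["users"].get(actor, set()):
            granted |= metalake["roles"].get(role, set())
        if "CREATE_CATALOG" not in granted:
            raise PermissionDenied(f"{actor} may not create catalogs in {lake}")
        metalake["catalogs"][catalog] = provider


svc = Service(service_admins=["root"])
svc.create_metalake("root", "demo_org", admin="ops")
svc.create_role("ops", "demo_org", "data_engineer", ["CREATE_CATALOG"])
svc.add_user("ops", "demo_org", "alice", ["data_engineer"])
svc.add_user("ops", "demo_org", "bob", [])
svc.create_catalog("alice", "demo_org", "hive_prod", "hive")
svc.create_catalog("alice", "demo_org", "mysql_crm", "mysql")
print(svc.metalakes["demo_org"]["catalogs"])
```

Here `alice` can create both catalogs because her role carries the privilege, while a call such as `svc.create_catalog("bob", "demo_org", "x", "hive")` raises `PermissionDenied`.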
Since its open‑source launch in December 2023, Gravitino has attracted over 60 contributors and remains an active community project.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.