Design and Implementation of BanYu's Big Data Permission System
This article describes the background, design goals, authentication and authorization mechanisms, system architecture, policy configuration, and Metabase integration of BanYu's big data permission system, highlighting how it balances security and efficiency across Hive, Presto, HDFS, and other components.
Background
In BanYu's early days, the data warehouse had no permission checks or auditing because efficiency took priority. As the business grew, data security became critical, which led to building a big data permission system.
Data Access Methods
Data in the offline warehouse can be accessed via:
Hive (CLI, Beeline, HiveServer2 JDBC API)
Presto (JDBC API)
Hadoop (HDFS CLI)
Flink (HDFS Client API)
Data Access Channels
These methods are used through:
Client tools installed on cluster nodes
Metabase (BI platform) with Hive and Presto data sources
Offline development platform (DolphinScheduler) using Hive data sources and CLI scripts
Real‑time development platform integrated with Hive
Design Goals
The system must balance control and efficiency. The goals are to tighten permissions for teams outside the data warehouse team, gradually standardize how the warehouse itself is accessed (replacing Hive CLI with Beeline), and handle all permission operations in a single platform.
Research
User Authentication
Supported authentication methods:
HiveServer2: Kerberos, SASL, NOSASL, LDAP, PAM, Custom
Presto: Kerberos, LDAP, Password File
Hadoop components: Kerberos only
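Kerberos carries a high operational cost, so LDAP is used for user authentication on HiveServer2 and Presto. Hadoop components, which support only Kerberos, are left without user authentication and rely on authorization alone.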
Authorization
HiveServer2 uses SQL‑standard fine‑grained (column‑level) authorization via Apache Ranger. Presto and Hadoop also adopt Ranger plugins for authorization.
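As a rough sketch of how HiveServer2 is usually pointed at Ranger, the fragment below lists the properties that Ranger's Hive plugin setup typically writes to hiveserver2-site.xml. The article does not show BanYu's actual settings, so this is an illustration of the usual wiring, not the deployed configuration.

```xml
<!-- Illustrative hiveserver2-site.xml fragment; not taken from the article. -->
<property>
  <!-- Turn on authorization checks in HiveServer2. -->
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Delegate authorization decisions to the Ranger Hive plugin. -->
  <name>hive.security.authorization.manager</name>
  <value>org.apache.ranger.authorization.hive.authorizer.RangerHiveAuthorizerFactory</value>
</property>
<property>
  <!-- Use the authenticated session user as the identity Ranger evaluates. -->
  <name>hive.security.authenticator.manager</name>
  <value>org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator</value>
</property>
```

With these in place, queries submitted through Beeline or JDBC are checked against Ranger policies. Hive CLI does not go through HiveServer2 and so is not covered by this plugin, which fits the stated goal of replacing Hive CLI with Beeline.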
System Design
The data flow and permission control are illustrated in the diagram below.
Authentication and authorization are performed on HiveServer2 and Presto Coordinator, while HDFS NameNode only handles authorization. Ranger Plugin relies on Hadoop Group Mapping, which in turn uses LDAP.
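For the NameNode side, a minimal sketch of an hdfs-site.xml fragment that hands authorization to the Ranger HDFS plugin might look like the following. The property names are the standard ones used by the Ranger HDFS plugin. The article does not show BanYu's values, so these are assumptions.

```xml
<!-- Illustrative hdfs-site.xml fragment; not taken from the article. -->
<property>
  <!-- Keep HDFS permission checking enabled so the authorizer is consulted. -->
  <name>dfs.permissions.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Route NameNode access checks through the Ranger HDFS authorizer. -->
  <name>dfs.namenode.inode.attributes.provider.class</name>
  <value>org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer</value>
</property>
```

Because the NameNode does not authenticate users in this design, the user name presented by the client is taken as given, and Ranger only decides what that name may access.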
User Authentication Configuration
Example configuration for HiveServer2 to use LDAP:
<property>
  <name>hive.server2.authentication</name>
  <value>LDAP</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.baseDN</name>
  <value>ou=People,dc=ipalfish,dc=com</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.url</name>
  <value>ldap://*****:389</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.userDNPattern</name>
  <value>cn=%s,ou=bigdata_user,ou=People,dc=ipalfish,dc=com</value>
</property>
Authorization Details
User and Group
Ranger treats a User as an individual and a Group as a collection of users. Permissions can be granted to either, so a single grant to a group covers all of its members.
Group information is obtained via Hadoop's Group Mapping, which uses LDAP.
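A minimal sketch of LDAP-backed group mapping in core-site.xml is shown below, assuming Hadoop's built-in LdapGroupsMapping provider. The base DN reuses the suffix from the HiveServer2 example. The LDAP host, bind account, password file path, search filters, and attribute names are hypothetical placeholders that depend on the directory schema.

```xml
<!-- Illustrative core-site.xml fragment; host, bind user, and filters are assumptions. -->
<property>
  <!-- Resolve a user's groups from LDAP instead of the local OS. -->
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>
<property>
  <!-- Hypothetical LDAP endpoint. -->
  <name>hadoop.security.group.mapping.ldap.url</name>
  <value>ldap://ldap.example.com:389</value>
</property>
<property>
  <!-- Hypothetical read-only account used to run the searches. -->
  <name>hadoop.security.group.mapping.ldap.bind.user</name>
  <value>cn=readonly,dc=ipalfish,dc=com</value>
</property>
<property>
  <!-- Hypothetical path to a file holding the bind password. -->
  <name>hadoop.security.group.mapping.ldap.bind.password.file</name>
  <value>/etc/hadoop/conf/ldap-bind.password</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.base</name>
  <value>dc=ipalfish,dc=com</value>
</property>
<property>
  <!-- {0} is replaced with the user name; "&" must be written as "&amp;" in XML. -->
  <name>hadoop.security.group.mapping.ldap.search.filter.user</name>
  <value>(&amp;(objectClass=person)(cn={0}))</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.search.filter.group</name>
  <value>(objectClass=groupOfNames)</value>
</property>
<property>
  <!-- Group attribute that lists member DNs. -->
  <name>hadoop.security.group.mapping.ldap.search.attr.member</name>
  <value>member</value>
</property>
<property>
  <!-- Group attribute used as the group name in policies. -->
  <name>hadoop.security.group.mapping.ldap.search.attr.group.name</name>
  <value>cn</value>
</property>
```

Under this kind of setup, adding a user to an LDAP group is enough for that user to inherit every Ranger policy granted to the group, once the group cache refreshes.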
Permission Levels
Columns are classified into three sensitivity levels (P0, P1, P2). Higher levels have longer approval processes.
Policy Types
Access: Grants read/write rights; column‑level policies can be defined.
Mask: Provides data masking for columns without access rights, built on top of Access policies.
All policies configured in Ranger Admin are periodically pulled by the Ranger Plugin and enforced at runtime.
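The periodic pull is controlled on the plugin side. The fragment below is a sketch of ranger-hive-security.xml using the standard Ranger Hive plugin property names. The service name, Ranger Admin URL, cache directory, and the 30-second interval are hypothetical values, not BanYu's.

```xml
<!-- Illustrative ranger-hive-security.xml fragment; all values are assumptions. -->
<property>
  <!-- Name of the Hive service as registered in Ranger Admin (hypothetical). -->
  <name>ranger.plugin.hive.service.name</name>
  <value>hive_prod</value>
</property>
<property>
  <!-- Ranger Admin endpoint the plugin pulls policies from (hypothetical host). -->
  <name>ranger.plugin.hive.policy.rest.url</name>
  <value>http://ranger-admin.example.com:6080</value>
</property>
<property>
  <!-- How often the plugin polls Ranger Admin for policy changes, in milliseconds. -->
  <name>ranger.plugin.hive.policy.pollIntervalMs</name>
  <value>30000</value>
</property>
<property>
  <!-- Local cache so the plugin can keep enforcing if Ranger Admin is unreachable. -->
  <name>ranger.plugin.hive.policy.cache.dir</name>
  <value>/etc/ranger/hive_prod/policycache</value>
</property>
```

The poll interval bounds how long a newly approved or revoked permission can take to reach HiveServer2, so it is a trade-off between freshness and load on Ranger Admin.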
Policy Configuration
There are two ways to initialize policies when a Hive table is created:
Synchronous creation: Wrap table creation and policy initialization in a single program. This works for data-modeling platforms but not for direct CLI usage.
Asynchronous creation: Publish table-creation events to a message queue (via Apache Atlas hooks). The permission system listens to the queue and initializes policies, which covers every creation channel.
Field‑level permission changes are also captured via the same event pipeline.
Permission Integration with Metabase
Metabase reports are frequently accessed. The system integrates Metabase permissions as follows:
Report development: Users can develop reports once a data source has been created and synchronized with their Metabase account.
Report viewing: Access is granted to Metabase groups. A synchronized account operation adds users to the groups that hold the appropriate permissions.
Examples of Metabase permission request tickets are shown in the images below.
Summary
The article outlines the core design of BanYu's big data permission system, covering authentication, fine‑grained authorization, policy management, and Metabase integration, achieving high automation, reduced approval cost, and consistent security across Hive, Presto, HDFS, and BI tools.