
Design and Implementation of BanYu's Big Data Permission System

This article describes the background, design goals, authentication and authorization mechanisms, system architecture, policy configuration, and Metabase integration of BanYu's big data permission system, highlighting how it balances security and efficiency across Hive, Presto, HDFS, and other components.

Architecture Digest

Background

In BanYu's early stage, the data warehouse had no permission checks or auditing because efficiency took priority. As the business grew, data security became critical, which led to the creation of a big data permission system.

Data Access Methods

Data in the offline warehouse can be accessed via:

Hive (CLI, Beeline, HiveServer2 JDBC API)

Presto (JDBC API)

Hadoop (HDFS CLI)

Flink (HDFS Client API)

Data Access Channels

These methods are used through:

Client tools installed on cluster nodes

Metabase (BI platform) with Hive and Presto data sources

Offline development platform (DolphinScheduler) using Hive data sources and CLI scripts

Real‑time development platform integrated with Hive

Design Goals

The design balances control and efficiency. The goals are to tighten permissions for teams outside the warehouse team, gradually standardize how the warehouse itself is accessed (for example, replacing Hive CLI with Beeline), and unify all permission operations on a single platform.

Research

User Authentication

Supported authentication methods:

HiveServer2: Kerberos, NONE (plain SASL), NOSASL, LDAP, PAM, Custom

Presto: Kerberos, LDAP, Password File

Hadoop components: Kerberos only

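Kerberos carries a high operational cost, so LDAP is chosen for user authentication instead. Because the Hadoop components themselves support only Kerberos, authentication is enforced at HiveServer2 and the Presto Coordinator rather than inside HDFS.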

Authorization

HiveServer2 is authorized through Apache Ranger, which provides SQL‑standard‑style, fine‑grained authorization down to the column level. Presto and Hadoop also use Ranger plugins for authorization.

System Design

The data flow and permission control are illustrated in the diagram below.

Both authentication and authorization are performed on HiveServer2 and the Presto Coordinator, while the HDFS NameNode performs authorization only. The Ranger plugins resolve a user's groups through Hadoop Group Mapping, which in turn queries LDAP.

User Authentication Configuration

Example configuration for HiveServer2 to use LDAP:

<property>
  <name>hive.server2.authentication</name>
  <value>LDAP</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.baseDN</name>
  <value>ou=People,dc=ipalfish,dc=com</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.url</name>
  <value>ldap://*****:389</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.userDNPattern</name>
  <value>cn=%s,ou=bigdata_user,ou=People,dc=ipalfish,dc=com</value>
</property>
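
To make the userDNPattern setting concrete, the sketch below shows how a login name could be expanded into the DN used for the LDAP bind. It is a simplified Python illustration, not HiveServer2's actual code: the function name candidate_bind_dns and the user name "alice" are hypothetical. HiveServer2 accepts a colon-separated list of patterns, which the sketch mirrors.

```python
from typing import List


def candidate_bind_dns(user_dn_pattern: str, username: str) -> List[str]:
    """Expand each colon-separated pattern by replacing %s with the user name."""
    patterns = [p.strip() for p in user_dn_pattern.split(":") if p.strip()]
    return [p.replace("%s", username) for p in patterns]


# Pattern copied from the configuration above; "alice" is a made-up user.
PATTERN = "cn=%s,ou=bigdata_user,ou=People,dc=ipalfish,dc=com"
dns = candidate_bind_dns(PATTERN, "alice")
print(dns[0])
```

With a single pattern, exactly one DN is tried; with several patterns, each candidate would be tried in turn until a bind succeeds.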

Authorization Details

User and Group

Ranger treats a User as an individual and a Group as a collection of users. Permissions can be granted to either, so a single grant to a group applies to all of its members at once instead of requiring one grant per user.

Group information is obtained via Hadoop's Group Mapping, which uses LDAP.
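
The user/group model can be sketched as follows. This is a minimal illustration in Python, not Ranger's policy engine: the Grant class, the is_allowed function, and the sample users, groups, and table names are all hypothetical, and the ldap_groups dictionary stands in for the lookup that Hadoop Group Mapping would perform against LDAP.

```python
from dataclasses import dataclass, field


@dataclass
class Grant:
    resource: str                               # e.g. "dw.orders"
    permission: str                             # e.g. "select"
    users: set = field(default_factory=set)     # individually granted users
    groups: set = field(default_factory=set)    # granted groups


def is_allowed(user, user_groups, resource, permission, grants):
    """Return True if the user, or any of the user's groups, holds a matching grant."""
    for g in grants:
        if g.resource != resource or g.permission != permission:
            continue
        if user in g.users or g.groups & set(user_groups):
            return True
    return False


# One grant to the group covers every member of that group.
grants = [Grant("dw.orders", "select", groups={"analysts"})]
ldap_groups = {"alice": ["analysts"], "bob": ["marketing"]}  # stand-in for LDAP

alice_ok = is_allowed("alice", ldap_groups["alice"], "dw.orders", "select", grants)
bob_ok = is_allowed("bob", ldap_groups["bob"], "dw.orders", "select", grants)
```

Adding a new analyst then only requires a change in LDAP group membership; the grant itself stays untouched.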

Permission Levels

Columns are classified into three sensitivity levels (P0, P1, P2). Higher levels have longer approval processes.

Policy Types

Access: Grants read/write rights; column‑level policies can be defined.

Mask: Provides data masking for columns without access rights, built on top of Access policies.

All policies configured in Ranger Admin are periodically pulled by the Ranger Plugin and enforced at runtime.
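
The pull model described above can be sketched as a small version-checked cache. This is an illustrative Python sketch under stated assumptions, not the Ranger plugin's implementation: PolicyCache, refresh_once, make_fake_admin, and the policy strings are hypothetical, and the fake admin function replaces the real call to Ranger Admin.

```python
from typing import Callable, Optional


class PolicyCache:
    """Holds the last policy set pulled from the admin service."""

    def __init__(self, fetch: Callable[[int], Optional[dict]]):
        # fetch(last_known_version) returns None when nothing changed,
        # otherwise {"version": int, "policies": [...]}.
        self._fetch = fetch
        self.version = -1
        self.policies = []

    def refresh_once(self) -> bool:
        """Pull once; return True if the local copy was replaced."""
        try:
            update = self._fetch(self.version)
        except Exception:
            # Admin unreachable: keep enforcing the last known policies.
            return False
        if update is None:
            return False
        self.version = update["version"]
        self.policies = list(update["policies"])
        return True


def make_fake_admin(store: dict) -> Callable[[int], Optional[dict]]:
    """In-memory stand-in for the admin service (assumption for this sketch)."""
    def fetch(last_known_version: int) -> Optional[dict]:
        if store["version"] == last_known_version:
            return None
        return {"version": store["version"], "policies": store["policies"]}
    return fetch


store = {"version": 1, "policies": ["allow select on dw.orders for group analysts"]}
cache = PolicyCache(make_fake_admin(store))
cache.refresh_once()  # first pull loads version 1
```

In a real deployment a timer would call refresh_once at a fixed interval. Authorization decisions read only the local copy, so a policy change takes effect after the next successful pull rather than immediately.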

Policy Configuration

There are two ways to initialize policies when a Hive table is created:

Synchronous creation: Wrap table creation and policy initialization in a single program. This works for data‑modeling platforms but not for tables created directly through the CLI.

Asynchronous creation: Publish table‑creation events to a message queue (via Apache Atlas hooks). The permission system listens to the queue and initializes policies, which covers every creation channel.

Field‑level permission changes are also captured via the same event pipeline.
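
The asynchronous approach can be sketched as a consumer that turns table‑creation events into initial policies. This is a minimal Python illustration under stated assumptions: the event fields (type, db, table, owner), the TABLE_CREATE type name, the policy shape, and the choice to grant the owner select access are all hypothetical and do not reflect the actual Atlas notification format or BanYu's default policy. A standard-library queue stands in for the message queue.

```python
import json
import queue


def default_policy_for(event: dict) -> dict:
    """Build an initial policy for a newly created table (hypothetical shape)."""
    return {
        "name": f"init_{event['db']}_{event['table']}",
        "resource": {"database": event["db"], "table": event["table"], "columns": ["*"]},
        "grants": [{"user": event["owner"], "accesses": ["select"]}],
    }


def drain(events: queue.Queue, policies: dict) -> int:
    """Consume all pending messages; return how many policies were created."""
    created = 0
    while True:
        try:
            raw = events.get_nowait()
        except queue.Empty:
            return created
        event = json.loads(raw)
        if event.get("type") != "TABLE_CREATE":
            continue  # other event types are out of scope for this sketch
        policy = default_policy_for(event)
        # Idempotent: a redelivered message must not create a duplicate policy.
        if policy["name"] not in policies:
            policies[policy["name"]] = policy
            created += 1


events = queue.Queue()
created_event = json.dumps(
    {"type": "TABLE_CREATE", "db": "dw", "table": "orders", "owner": "alice"}
)
events.put(created_event)
events.put(created_event)  # simulated redelivery of the same message
events.put(json.dumps({"type": "TABLE_DROP", "db": "dw", "table": "tmp"}))

policies = {}
created = drain(events, policies)
```

Keying policies by a name derived from the table makes the handler safe to re-run, which matters because message queues commonly deliver a message more than once.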

Permission Integration with Metabase

Metabase reports are frequently accessed. The system integrates Metabase permissions as follows:

Report development: Users can develop reports once a data source has been created and synchronized with their Metabase account.

Report viewing: Access is granted to Metabase groups rather than to individuals. An account synchronization operation adds each user to the groups that carry the permissions they need.

Examples of Metabase permission request tickets are shown in the images below.

Summary

The article outlines the core design of BanYu's big data permission system, covering authentication, fine‑grained authorization, policy management, and Metabase integration, achieving high automation, reduced approval cost, and consistent security across Hive, Presto, HDFS, and BI tools.
