Design and Optimization of Ranger‑Based Access Control for HDFS and Hive in Bilibili's Data Platform

Bilibili’s data platform redesigns Ranger‑based access control by simplifying HDFS and Hive policy APIs, parallelizing policy loading, adding gray‑release and pre‑check mechanisms, integrating fine‑grained Hive authorization with data‑masking, extending support to Spark and Presto, and planning incremental loading, policy fusion, and a NameNode proxy to boost security and performance.

Bilibili Tech

1 Background

As cloud computing and big-data technologies mature, large-scale data brings significant economic and social value, while data security has become a major challenge for enterprises. Bilibili is committed to protecting users' private data.

2 Ranger Overview

2.1 User Authentication: Hadoop originally shipped without built-in authentication; Kerberos support was added in the Hadoop 1.x line. All clients of Bilibili's data platform are integrated with Kerberos.

2.2 Ranger Introduction: Kerberos authenticates users but does not control which operations they may perform. Ranger provides a centralized security framework for Hadoop ecosystem components (HDFS, YARN, Hive, HBase, etc.), with fine-grained access control managed through its console or REST API.

Ranger 1.2.0 is deployed as two write nodes and two read nodes behind a load balancer (LB) to separate reads from writes. Authorization requests from tools such as Shielder are routed through the LB to the Ranger Admin REST API, which persists the policies to the database; permission pre-checks are served by the read nodes.

3 HDFS Path Authorization

3.1 Refactoring the Authorization Interface: The original Ranger API required a complete RangerPolicy object, which was cumbersome when granting access to a single path or table. The new interface requires only four parameters:

service: Ranger service name

type: path or table

resources: the HDFS path or Hive table

access: read or write

If type is table, the Hive Metastore client is used to obtain the table location, and the interface then decides whether to create a new policy or update an existing one.
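The expansion from the four parameters to a full policy can be sketched as follows. This is a minimal illustration, not the platform's implementation: the resulting dict only loosely follows the shape of Ranger's policy JSON, and `GrantRequest`, `build_policy`, the `user` argument, the `lookup_table_location` callback, and the read/write mapping in `ACCESS_MAP` are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumed mapping from the simplified access level to HDFS access types.
ACCESS_MAP: Dict[str, List[str]] = {
    "read": ["read", "execute"],
    "write": ["read", "write", "execute"],
}


@dataclass
class GrantRequest:
    service: str    # Ranger service name
    type: str       # "path" or "table"
    resources: str  # HDFS path or Hive table
    access: str     # "read" or "write"


def build_policy(req: GrantRequest, user: str,
                 lookup_table_location: Callable[[str], str]) -> dict:
    """Expand a simplified grant request into a policy-shaped dict."""
    if req.access not in ACCESS_MAP:
        raise ValueError(f"unsupported access: {req.access}")
    if req.type == "table":
        # Hypothetical stand-in for the Hive Metastore client lookup.
        path = lookup_table_location(req.resources)
    elif req.type == "path":
        path = req.resources
    else:
        raise ValueError(f"unsupported type: {req.type}")
    return {
        "service": req.service,
        "name": f"{req.type}:{req.resources}",
        "resources": {"path": {"values": [path], "isRecursive": True}},
        "policyItems": [{
            "users": [user],
            "accesses": [{"type": a, "isAllowed": True}
                         for a in ACCESS_MAP[req.access]],
        }],
    }
```

A caller would pass something like `GrantRequest("hdfs_prod", "table", "db.t", "read")`; the create-or-update decision described above is left out of the sketch.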

3.2 Ranger Admin Refactoring: The original Ranger Admin loaded policies serially from the database into a List, and load times became unpredictable when policies changed during iteration. The new design loads policies in parallel into a Map, which eliminates the second database read and makes the costly ORDER BY clause unnecessary.

Original query:

select obj from XXPolicyItemAccess obj, XXPolicyItem item where obj.policyItemId = item.id and item.policyId in (select policy.id from XXPolicy policy where policy.service = :serviceId) order by item.policyId, obj.policyItemId, obj.order

Query with the ORDER BY removed:

select obj from XXPolicyItemAccess obj, XXPolicyItem item where obj.policyItemId = item.id and item.policyId in (select policy.id from XXPolicy policy where policy.service = :serviceId)

With ~180k policies, 700k policy items, and 2M policy item accesses, removing the ORDER BY saves about 4 seconds; total load time is now ~25 seconds.
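One way to see why a map removes the need for ordering: when rows are joined by key in memory, the order in which the database returns them no longer matters. The sketch below illustrates that idea under assumed names; `load_policies` and the three fetch callbacks are hypothetical stand-ins for the real queries, and the row fields are invented for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def load_policies(fetch_policies, fetch_items, fetch_accesses):
    """Run the three queries concurrently, then join the rows by id."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        f_policies = pool.submit(fetch_policies)
        f_items = pool.submit(fetch_items)
        f_accesses = pool.submit(fetch_accesses)
        policies = f_policies.result()
        items = f_items.result()
        accesses = f_accesses.result()

    # Group accesses by policy item id instead of relying on ORDER BY.
    accesses_by_item = defaultdict(list)
    for acc in accesses:
        accesses_by_item[acc["item_id"]].append(acc["type"])

    by_id = {p["id"]: {"id": p["id"], "name": p["name"], "items": []}
             for p in policies}
    for item in items:
        policy = by_id.get(item["policy_id"])
        if policy is None:
            continue  # item belongs to a policy that was not loaded
        policy["items"].append({
            "id": item["id"],
            "user": item["user"],
            "accesses": accesses_by_item[item["id"]],
        })
    return by_id
```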

3.3 Gray Release: A group-based gray-release mechanism with three settings (ALWAYS_ALLOW, ALWAYS_DENY, BY_HADOOP_ACL) is added to the HDFS plugin. The strict groups can be reconfigured without restarting the NameNode, so new policies can be rolled out gradually.
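The article does not spell out the exact semantics of the three settings, so the following sketch rests on an assumption: users in a strict group are bound by Ranger's decision, while for everyone else a Ranger denial falls through to the configured setting. The names `FallbackMode` and `check_access` are hypothetical.

```python
from enum import Enum
from typing import Callable, Set


class FallbackMode(Enum):
    ALWAYS_ALLOW = "ALWAYS_ALLOW"
    ALWAYS_DENY = "ALWAYS_DENY"
    BY_HADOOP_ACL = "BY_HADOOP_ACL"


def check_access(user_groups: Set[str], strict_groups: Set[str],
                 mode: FallbackMode,
                 ranger_allows: Callable[[], bool],
                 hadoop_acl_allows: Callable[[], bool]) -> bool:
    """Decide access under the assumed gray-release semantics."""
    if ranger_allows():
        return True
    if user_groups & strict_groups:
        return False  # assumed: Ranger's denial is final for strict groups
    if mode is FallbackMode.ALWAYS_ALLOW:
        return True
    if mode is FallbackMode.ALWAYS_DENY:
        return False
    return hadoop_acl_allows()  # BY_HADOOP_ACL: defer to native HDFS checks
```

Under this reading, a rollout would start with few strict groups and a permissive setting, then add groups as their policies are verified.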

3.4 Permission Pre-Check: After a permission is granted through the Ranger API, a pre-check interface verifies that it has taken effect. The interface builds a RangerAccessRequest from the HDFS path, user, and access type and evaluates it against the matching policies; the cached service policies are refreshed periodically.
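The pre-check flow can be sketched roughly as below. The classes are simplified stand-ins, not Ranger's own `RangerAccessRequest` or policy engine: `AccessRequest`, `PreChecker`, the flat policy dicts, and the prefix-based path matching are assumptions. The sketch also refreshes its cache lazily when an entry is older than `refresh_seconds`, whereas the article describes a periodic refresh.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    path: str
    user: str
    access: str  # e.g. "read" or "write"


def _covers(policy_path: str, path: str) -> bool:
    """True if path equals policy_path or lies underneath it."""
    base = policy_path.rstrip("/")
    return path == base or path.startswith(base + "/")


class PreChecker:
    def __init__(self, fetch_policies, refresh_seconds=30.0,
                 clock=time.monotonic):
        self._fetch = fetch_policies      # callable returning policy dicts
        self._refresh = refresh_seconds
        self._clock = clock
        self._policies = []
        self._loaded_at = None

    def _current_policies(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._refresh:
            self._policies = self._fetch()
            self._loaded_at = now
        return self._policies

    def is_effective(self, req: AccessRequest) -> bool:
        """Report whether some cached policy grants the requested access."""
        for policy in self._current_policies():
            if (_covers(policy["path"], req.path)
                    and req.user in policy["users"]
                    and req.access in policy["accesses"]):
                return True
        return False
```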

4 Hive Table Authorization

4.1 Pain Points of HDFS Authorization: Relying on HDFS authorization alone has several problems: table owners may lack permission to drop their own tables, control is limited to the path level, and data masking is not possible. These issues motivate Hive-level authorization.

4.2 Hive Authorization Interface: Hive and HDFS policies are kept independent, but Hive authorization is translated into the corresponding HDFS permissions to keep a unified enforcement point.

4.3 Hive Metastore Remote Authorization & Data Masking: A Hive Metastore plugin adds two interfaces: one for authorization, whose request is encapsulated in CheckPrivilegesBag, and one for data masking (row filtering and column masking).

struct CheckPrivilegesBag {
  1: HiveOperationObjectType hiveOperationObjectType, // operation type
  2: list<HiveObjectPrivileges> inputPrivileges, // input tables
  3: list<HiveObjectPrivileges> outputPrivileges, // output tables
  4: HiveAuthzContextObject hiveAuthzContext,
  5: string user,
}

Data-masking API:

list<HiveObjectPrivileges> apply_row_filter_and_column_masking(1:HiveAuthzContextObject hiveAuthzContextObject, 2: list<HiveObjectPrivileges> objectPrivileges, 3: string user) throws (1:MetaException o1)
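To make the effect of row filtering and column masking concrete, the sketch below rewrites a table reference into a subquery that applies both. It is an illustration under assumptions, not the plugin's code: `masked_subquery` is a hypothetical helper, and the mask expressions and filter are plain SQL strings supplied by the caller (the usage example borrows Hive's built-in `mask_show_last_n` UDF).

```python
from typing import Dict, List, Optional


def masked_subquery(table: str, columns: List[str],
                    masks: Dict[str, str],
                    row_filter: Optional[str] = None) -> str:
    """Build a subquery that exposes masked columns and filtered rows."""
    select_list = []
    for col in columns:
        expr = masks.get(col)
        # Masked columns keep their original name via an alias.
        select_list.append(f"{expr} AS {col}" if expr else col)
    sql = f"SELECT {', '.join(select_list)} FROM {table}"
    if row_filter:
        sql += f" WHERE {row_filter}"
    alias = table.split(".")[-1]
    return f"({sql}) {alias}"
```

For example, `masked_subquery("db.users", ["id", "phone"], {"phone": "mask_show_last_n(phone, 4)"}, "region = 'CN'")` yields a subquery that a query engine could substitute for `db.users`.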

4.4 Spark Ranger: Spark Ranger injects custom rules and strategies into Spark SQL to enforce authorization and masking. To avoid repeated checks, successful authorizations are cached, and only tables and columns not yet in the cache trigger a check against the Hive Metastore (HMS).
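The caching behavior can be sketched as follows. `AuthzCache` and the `check_with_hms` callback are hypothetical names; the callback stands in for the remote HMS authorization call and is assumed to return the subset of objects that were granted.

```python
from typing import Callable, Iterable, List, Set, Tuple

TableColumn = Tuple[str, str]  # (table, column)


class AuthzCache:
    """Remember successful authorizations; re-check only unseen objects."""

    def __init__(self, check_with_hms: Callable[[str, List[TableColumn]],
                                                Set[TableColumn]]):
        self._check = check_with_hms
        self._allowed: Set[Tuple[str, TableColumn]] = set()

    def authorize(self, user: str,
                  objects: Iterable[TableColumn]) -> List[TableColumn]:
        """Return the denied objects; an empty list means all are allowed."""
        # Deduplicate while keeping order, then drop cached successes.
        missing = [o for o in dict.fromkeys(objects)
                   if (user, o) not in self._allowed]
        if not missing:
            return []
        granted = self._check(user, missing)
        self._allowed.update((user, o) for o in granted)
        # Denials are not cached, so a later grant takes effect immediately.
        return [o for o in missing if o not in granted]
```

A production cache would also need an expiry or invalidation path so that revoked permissions stop being honored; the sketch omits that.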

4.5 Presto Ranger: Presto's native policy handling was replaced with policies derived from Hive, enabling table/column access control, row filtering, and column masking without maintaining separate Presto policies.

5 Future Plans

5.1 Incremental Policy Loading: Adopt the incremental policy loading available in Ranger 2.0 to reduce policy activation time from about 25 s to under a second.
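The idea behind incremental loading is that a plugin applies a short list of changes to its existing policy map rather than rebuilding the whole map. The sketch below shows that idea only; `apply_deltas` and the delta dict layout are assumptions made for illustration and do not reproduce Ranger's own delta format.

```python
from typing import Dict, List


def apply_deltas(policies: Dict[int, dict], deltas: List[dict]) -> Dict[int, dict]:
    """Return a new policy map with ordered deltas applied."""
    updated = dict(policies)  # leave the caller's map untouched
    for delta in deltas:
        change = delta["change"]
        if change in ("create", "update"):
            policy = delta["policy"]
            updated[policy["id"]] = policy
        elif change == "delete":
            updated.pop(delta["policy_id"], None)
        else:
            raise ValueError(f"unknown change type: {change}")
    return updated
```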

5.2 Fusion of HDFS and Hive Policies: Load Hive table location data into the HDFS plugin so that an HDFS path check first consults the policy of the corresponding Hive table.
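A possible shape for the fused check is sketched below. It assumes that the table whose location is the longest prefix of the path owns the decision, and that paths outside any table location fall back to the ordinary path policy; the article does not specify these rules. `find_table`, `check_path`, and the two policy callbacks are hypothetical.

```python
from typing import Callable, Dict, Optional


def find_table(path: str, table_locations: Dict[str, str]) -> Optional[str]:
    """Return the table whose location is the longest prefix of path."""
    best, best_len = None, -1
    for table, location in table_locations.items():
        base = location.rstrip("/")
        inside = path == base or path.startswith(base + "/")
        if inside and len(base) > best_len:
            best, best_len = table, len(base)
    return best


def check_path(user: str, path: str, access: str,
               table_locations: Dict[str, str],
               table_policy: Callable[[str, str, str], bool],
               path_policy: Callable[[str, str, str], bool]) -> bool:
    """Consult the Hive table policy first, else the HDFS path policy."""
    table = find_table(path, table_locations)
    if table is not None:
        return table_policy(user, table, access)
    return path_policy(user, path, access)
```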

5.3 Moving HDFS Authorization to NNProxy: Consolidate HDFS-Ranger interactions in a NameNode proxy (NNProxy) to reduce the number of NS groups contacting Ranger Admin and to lower NameNode latency.

