
Security Architecture and Data Masking in a Big Data Platform

This article outlines how the security architecture of Youzan’s big‑data platform evolved: first Apache Ranger‑based permission control, then a centralized permission‑management service that simplified requests and audits, and finally column‑level masking via SQL rewriting, which protects sensitive data while keeping the platform usable.

Youzan Coder

In the early stage of building a big data platform, security is often not a primary focus. The platform mainly serves internal data developers, and its goals are to improve development efficiency and support data warehouse construction. The data itself is already protected by network and physical isolation, so the question this article explores is which security issues still need to be considered while the platform is being built.

Key security considerations include:

Environment isolation – developers should only access data belonging to their business domain, reducing exposure and accidental operations.

Data masking – even internal developers may need restricted access to sensitive fields.

Clear responsibilities – each business domain has a data owner, and all data access and operations are audited.

Balancing security and convenience – security measures inevitably affect platform usability.

The security construction at Youzan’s big data platform evolved through three stages.

Stage 1: Ranger + Plugin Permission Control

Initially, only a few data‑warehouse engineers had read/write rights, while other business teams only had read access. As business volume grew, the need for domain‑level data isolation became urgent. After evaluating solutions, the team chose Apache Ranger combined with HiveServer2 (and Spark Thrift) plugins. The architecture allowed authentication via the company’s LDAP and mapping of organizational units to Ranger groups.
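The article does not describe how organizational units were mapped to Ranger groups. As a rough illustration only, the sketch below assumes that group names are taken from the `ou` components of a user's LDAP distinguished name. The function name and the sample DN are hypothetical, and the parsing deliberately ignores escaped commas, which a real LDAP library would handle.

```python
from typing import List


def groups_from_dn(dn: str) -> List[str]:
    """Return the OU values of an LDAP DN, most specific first.

    Assumption: each OU becomes one Ranger group name. This naive split
    does not handle escaped commas inside attribute values.
    """
    groups = []
    for rdn in dn.split(","):
        key, _, value = rdn.strip().partition("=")
        if key.lower() == "ou" and value:
            groups.append(value)
    return groups


# Hypothetical user entry: a developer in the "trade" business domain.
example_groups = groups_from_dn("uid=alice,ou=trade,ou=engineering,dc=example,dc=com")
```

With a mapping like this, a Ranger policy can be written once per group instead of once per user.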

However, this approach introduced friction. Developers had to request permissions through Hue, approvers had to log into the Ranger UI, and supporting another query engine (e.g., Presto) required custom plugin development, which limited engine extensibility.

Stage 2: Centralized Permission Management Service

To improve usability, the platform consolidated entry points, removed Hue, and built a permission‑management service that interacts with Ranger through REST APIs. This service enables one‑click permission requests and approvals, supports temporary permissions, and allows data administrators to configure business‑domain policies via a UI, which are then automatically translated into Ranger policies. A complete audit system was also established.
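To make the translation step concrete, here is a minimal sketch of how a business‑domain rule might be turned into a Ranger policy payload. The function name, the naming scheme for policies, and the sample values are assumptions, not Youzan's actual service. The field names follow the Ranger Hive policy model as commonly documented. In a typical deployment, such a payload would be sent to Ranger's public REST API (for example the `/service/public/v2/api/policy` endpoint); the sketch only builds the payload and makes no network call.

```python
from typing import Dict, List


def domain_rule_to_ranger_policy(
    service: str,
    domain: str,
    database: str,
    tables: List[str],
    group: str,
    access_types: List[str],
) -> Dict:
    """Build a Ranger-style policy dict for one business-domain rule.

    Assumption: one policy per (domain, database, group), covering all
    columns of the listed tables.
    """
    return {
        "service": service,
        "name": f"{domain}-{database}-{group}",  # hypothetical naming scheme
        "isEnabled": True,
        "resources": {
            "database": {"values": [database]},
            "table": {"values": list(tables)},
            "column": {"values": ["*"]},
        },
        "policyItems": [
            {
                "groups": [group],
                "accesses": [
                    {"type": access, "isAllowed": True} for access in access_types
                ],
            }
        ],
    }


# Hypothetical rule: the "trade" group may read the orders table in ods.
policy = domain_rule_to_ranger_policy(
    "hive_prod", "trade", "ods", ["orders"], "trade", ["select"]
)
```

Keeping the translation in one function means the approval UI never has to expose Ranger's policy format to data administrators.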

Stage 3: Column‑Masking Data Desensitization

As the business grew, the need to mask sensitive fields (e.g., phone numbers, email addresses) became more pressing. Although Ranger supports column masking, the team had already decoupled Ranger from the execution engines, so masking was implemented at the platform layer through SQL rewriting.

The workflow is:

The SQL Engine Proposer selects the most suitable execution engine based on query characteristics and cluster load.

Before submitting the query, the platform rewrites the SQL to mask sensitive columns according to policies stored in Ranger.

Example of an original query:

select acct_no from ods.xxx where par='20181128' limit 10;

After masking rewrite, the query becomes:

SELECT acct_no FROM (SELECT `par`, `id`, CAST(mask_show_last_n(acct_no, 4, 'x', 'x', 'x', -1, '1') AS bigint) AS `acct_no`, `kdt_id`, `water_no`, `target_id`, `remark`, `create_time`, `sub_target_id` FROM `ods`.`xxx`) `xxx` WHERE par = '20181128' LIMIT 10;
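The shape of this rewrite can be sketched in a few lines. The version below is a simplified illustration, not the platform's implementation: it replaces the table reference with a regular expression instead of a real SQL parser, so it only suits trivial single‑table queries. The function names and the `masks` mapping (column name to mask expression) are hypothetical.

```python
import re
from typing import Dict, List


def masked_subquery(database: str, table: str, columns: List[str],
                    masks: Dict[str, str]) -> str:
    """Build a subquery that exposes every column of the table, with
    masked columns replaced by their mask expression under the same name."""
    projected = []
    for col in columns:
        if col in masks:
            projected.append(f"{masks[col]} AS `{col}`")
        else:
            projected.append(f"`{col}`")
    select_list = ", ".join(projected)
    return f"(SELECT {select_list} FROM `{database}`.`{table}`) `{table}`"


def rewrite(query: str, database: str, table: str, columns: List[str],
            masks: Dict[str, str]) -> str:
    """Swap each `database.table` reference for the masked subquery.

    Naive: a regex stands in for the parser, so quoted identifiers,
    aliases, and string literals containing the table name are not handled.
    """
    pattern = rf"\b{re.escape(database)}\.{re.escape(table)}\b"
    replacement = masked_subquery(database, table, columns, masks)
    # A function replacement avoids backslash interpretation in the SQL text.
    return re.sub(pattern, lambda _match: replacement, query)


rewritten = rewrite(
    "select acct_no from ods.xxx where par='20181128' limit 10",
    "ods",
    "xxx",
    ["par", "id", "acct_no"],
    {"acct_no": "CAST(mask_show_last_n(acct_no, 4, 'x', 'x', 'x', -1, '1') AS bigint)"},
)
```

Because the subquery keeps the original table name as its alias and the original column names as output names, the outer query's SELECT list, WHERE clause, and LIMIT need no changes.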

The masking logic is implemented with ANTLR4 parsers for Hive, Spark, and Presto. Because all three grammars were ported to ANTLR4, the team could reuse the same visitor logic for both SQL rewriting and lineage tracking, which allows sensitivity levels to propagate to derived fields.
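The idea of propagating sensitivity through a visitor can be illustrated without ANTLR4. In the sketch below, a small hand‑written expression tree stands in for the parse tree, and the rule that a derived field inherits the highest sensitivity level among its input columns is an assumption made for illustration. All class names and levels are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Literal:
    value: object


@dataclass(frozen=True)
class Call:
    function: str
    args: tuple  # child expressions: Column, Literal, or Call


class SensitivityVisitor:
    """Compute the sensitivity level of an expression from its inputs."""

    def __init__(self, column_levels: Dict[str, int]):
        self.column_levels = column_levels

    def visit(self, node) -> int:
        # Dispatch on the node's class name, as generated visitors do.
        return getattr(self, "visit_" + type(node).__name__)(node)

    def visit_Column(self, node: Column) -> int:
        # Columns without a registered level are treated as non-sensitive.
        return self.column_levels.get(node.name, 0)

    def visit_Literal(self, node: Literal) -> int:
        return 0

    def visit_Call(self, node: Call) -> int:
        # Assumed rule: a derived value is as sensitive as its most
        # sensitive argument.
        return max((self.visit(arg) for arg in node.args), default=0)


# concat(acct_no, '-', remark) derives from a level-3 and a level-1 column.
derived = Call("concat", (Column("acct_no"), Literal("-"), Column("remark")))
level = SensitivityVisitor({"acct_no": 3, "remark": 1}).visit(derived)
```

The same traversal that computes `level` could also record which source columns feed each output field, which is why rewriting and lineage tracking can share one visitor.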

Future Outlook

Security construction is not a one‑off task; it evolves with the platform’s expanding business scope, including real‑time data warehouses, machine‑learning platforms, and more. The team anticipates continued work on security as new services emerge.

Finally, a brief invitation: the Youzan infrastructure team maintains the data platform (DP), real‑time computing (Storm, Spark Streaming, Flink), offline computing (HDFS, YARN, Hive, Spark SQL), online storage (HBase), and real‑time OLAP (Druid). Interested engineers can contact [email protected].
