Building a Scalable Data Masking and Mock Service for Warehouse Testing

This article explains how to design and implement a data‑masking service that also provides mock data generation for data‑warehouse testing, covering the architecture, pain points, masking principles, workflow, evolution into a warehouse mock service, practical scenarios, and the significant efficiency and cost benefits achieved.

NetEase Yanxuan Technology Product Team

Jun 28, 2022

Introduction

Effective data testing requires clear test boundaries. In addition to testing the data layer (metrics, models, warehouse tables), the data warehouse itself must be validated because it is a critical link in the end‑to‑end business chain. Data products originate from the business chain, and the data they produce should feed back to drive business growth.

Data Product Testing Landscape

Data products sit at the top of the architecture diagram as the presentation layer. Compared with ordinary application testing, data‑product testing adds an extra data‑warehouse step, making the testing chain longer and more complex because the warehouse link also needs verification.

Current Pain Points in Data‑Product Testing

Data‑quality testing is a high‑priority area but often receives too little attention.

Manual verification of metric correctness hampers overall data‑quality control; automated regression is lacking.

Because the data is sensitive, API tests must run on locally deployed frameworks, which prevents platform‑wide reuse.

The warehouse lacks a dedicated test environment; all tests run against production data, and model data cannot be displayed or queried in a test environment.

Masking Service Principles and Usage

Masking Fundamentals

SDK + independent masking service architecture.

Multiple customizable masking methods.

Whitelist/blacklist configuration for fine‑grained control.

Plug‑and‑play lightweight development.
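
The "multiple customizable masking methods" idea can be illustrated with a small registry of masking functions. This is only a sketch: the method names (keep_edges, hash, redact) and the registry itself are assumptions for illustration, not the service's actual methods.

```python
import hashlib
from typing import Callable, Dict

# Hypothetical registry: rule name -> masking function (names are illustrative).
MASKERS: Dict[str, Callable[[str], str]] = {}


def masker(name: str):
    """Decorator that registers a masking function under a rule name."""
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        MASKERS[name] = fn
        return fn
    return register


@masker("keep_edges")
def keep_edges(value: str) -> str:
    # Keep the first 3 and last 4 characters; star out everything between.
    if len(value) <= 7:
        return "*" * len(value)
    return value[:3] + "*" * (len(value) - 7) + value[-4:]


@masker("hash")
def sha256_prefix(value: str) -> str:
    # Replace the value with a short, stable digest so equal inputs stay equal.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


@masker("redact")
def redact(value: str) -> str:
    # Hide the value completely.
    return "***"


def apply_mask(method: str, value: str) -> str:
    """Look up a registered masking method by name and apply it."""
    return MASKERS[method](value)
```

With these definitions, apply_mask("keep_edges", "13812345678") returns "138****5678". A new method is added by decorating one more function, which is one way to keep development lightweight.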

Integration Workflow

Without masking service: Backend queries the warehouse via DQS, aggregates data, and returns it to the frontend.

With masking service: Backend calls DQS as before. An embedded SDK extracts the user ID (UID) and request URL and forwards them to the masking service. The service checks the UID against a whitelist; if the UID is on it, the service uses the configured rules to decide which API responses and which fields require masking, and applies any blacklist entries configured for them.
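
The decision flow described above might look roughly like the following. This is a sketch under stated assumptions: the class, the sdk_intercept helper, and the rule layout are hypothetical, and the source does not spell out what the blacklist does, so it is treated here as a set of fields exempt from masking for a given URL.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional, Set

Row = Dict[str, Any]
FieldRules = Dict[str, Callable[[Any], Any]]


class MaskingService:
    """Hypothetical stand-in for the independent masking service."""

    def __init__(
        self,
        whitelist: Iterable[str],
        rules: Dict[str, FieldRules],
        blacklist: Optional[Dict[str, Set[str]]] = None,
    ) -> None:
        self.whitelist = set(whitelist)   # UIDs for which rules are evaluated
        self.rules = rules                # request URL -> field -> masking function
        self.blacklist = blacklist or {}  # ASSUMPTION: URL -> fields exempt from masking

    def process(self, uid: str, url: str, rows: List[Row]) -> List[Row]:
        # UIDs that are not on the whitelist get the data back untouched.
        if uid not in self.whitelist:
            return rows
        field_rules = self.rules.get(url, {})
        exempt = self.blacklist.get(url, set())
        masked = []
        for row in rows:
            out = dict(row)  # copy so the caller's rows are not mutated
            for name, fn in field_rules.items():
                if name in out and name not in exempt:
                    out[name] = fn(out[name])
            masked.append(out)
        return masked


def sdk_intercept(service: MaskingService, uid: str, url: str, rows: List[Row]) -> List[Row]:
    """Stand-in for the embedded SDK: forwards the UID and URL with the DQS result."""
    return service.process(uid, url, rows)
```

Keeping the rules keyed by request URL means a new API can be covered by configuration alone, without touching backend code.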

Real‑World Masking Effect

The service is deployed in the “Fuxi” and “VIPAPP” projects, supporting both PC and app clients. It enables a testing layer for sensitive data products and can be adopted by other similar projects.

Evolution to Warehouse Mock Service

Why a Mock Service Was Needed

The warehouse has no unified test environment. Queries sent from test environments to the production warehouse return no data, so test data cannot be constructed.

Current Data‑Warehouse Query Scenarios

Full‑table queries without specific fields.

Model‑field queries (e.g., query a specific SKU ID).

Linked queries where the result of one query drives subsequent queries.
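
The three query shapes listed above can be sketched against an in-memory dictionary of mock rows. The function names and the model names used in the usage note (sku, sku_sales) are made up for illustration; real queries go through DQS.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]
MockData = Dict[str, List[Row]]


def query(data: MockData, model: str, filters: Optional[Row] = None) -> List[Row]:
    """Full-table query when filters is empty, model-field query otherwise."""
    filters = filters or {}
    return [
        row for row in data.get(model, [])
        if all(row.get(name) == value for name, value in filters.items())
    ]


def linked_query(
    data: MockData,
    parent: str,
    child: str,
    key: str,
    filters: Optional[Row] = None,
) -> List[Row]:
    """Query the parent model, then use each result's key value to query the child."""
    result = []
    for parent_row in query(data, parent, filters):
        children = query(data, child, {key: parent_row[key]})
        result.append({**parent_row, child: children})
    return result
```

For example, linked_query(data, "sku", "sku_sales", "sku_id") would attach to each sku row the sku_sales rows that share its sku_id, which is the case a mock service has to keep consistent across models.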

Desired Capabilities

Ensure test environments receive data when querying the production warehouse.

Modify returned data so that online and test environments are aligned.

Support orchestration of multi‑model queries with relationships.

Mock Service Workflow

Business users input model identifiers, fields, and record counts in the Data Factory.

The service selects all data for the specified model (caching results for repeated queries).

Online model data is fetched, masked, and returned together with request metadata.

Returned data and metadata appear in the Data Factory, where users can edit them to craft test scenarios.

The generated rule link replaces the DQS request URL in the business system’s Apollo configuration.

The business system displays the mock data according to the rule.
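
The workflow above can be condensed into a small in-memory sketch. All names here (MockService, create_rule, edit, serve, the rule-N identifiers) are hypothetical. The two callables passed to the constructor stand in for the online warehouse fetch and the masking step, and the Apollo configuration change is outside the scope of the code.

```python
import itertools
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class MockService:
    """Hypothetical in-memory model of the mock workflow."""

    def __init__(
        self,
        fetch_online: Callable[[str], List[Row]],
        mask: Callable[[Row], Row],
    ) -> None:
        self._fetch_online = fetch_online      # model id -> rows from the online warehouse
        self._mask = mask                      # row -> masked row
        self._cache: Dict[str, List[Row]] = {}  # masked rows per model, reused across rules
        self._rules: Dict[str, List[Row]] = {}  # rule id -> editable mock rows
        self._ids = itertools.count(1)

    def create_rule(self, model: str, fields: List[str], count: int) -> str:
        # Fetch and mask the model's data once, then serve repeats from the cache.
        if model not in self._cache:
            self._cache[model] = [self._mask(r) for r in self._fetch_online(model)]
        rows = [{f: r.get(f) for f in fields} for r in self._cache[model][:count]]
        rule_id = f"rule-{next(self._ids)}"
        self._rules[rule_id] = rows
        return rule_id

    def edit(self, rule_id: str, index: int, **changes: Any) -> None:
        # Users adjust returned rows to craft a specific test scenario.
        self._rules[rule_id][index].update(changes)

    def serve(self, rule_id: str) -> List[Row]:
        # What the business system would receive once its DQS URL points at the rule.
        return [dict(r) for r in self._rules[rule_id]]
```

Each rule holds its own copies of the rows, so editing one scenario does not alter the cached model data or any other rule.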

Applicable Scenarios

Single‑model query with data return.

Multiple independent models.

Multiple models with field relationships.

Benefits

Manual data‑generation time saved: Data for any model, in any quantity, can be fetched in seconds, and batch import/export is also supported. The estimated effort reduction is more than 1000× compared with the previous manual pipeline.

Test‑environment cost reduction: The production warehouse runs on ~330 machines (≈9.9 M CNY). A dedicated test environment at 1/10 of production scale would cost ~1 M CNY annually plus maintenance labor; the mock service avoids that spend.
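
The cost figure can be checked with simple arithmetic. The sketch below uses only the numbers quoted above and assumes that the 9.9 M CNY figure is on the same annual basis as the quoted saving, which the source does not state explicitly. The variable names are illustrative.

```python
# Figures quoted in the article (approximate).
production_machines = 330
production_cost_cny = 9_900_000

# Implied cost per machine, derived from the two figures above.
cost_per_machine_cny = production_cost_cny // production_machines

# A test environment sized at 1/10 of production (integer math avoids float noise).
test_env_machines = production_machines // 10
test_env_cost_cny = production_cost_cny // 10

print(cost_per_machine_cny)  # 30000
print(test_env_machines)     # 33
print(test_env_cost_cny)     # 990000, i.e. roughly 1 M CNY
```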

Conclusion

Complex business systems require clear test boundaries. Data testing must address both data‑layer verification and the warehouse’s role as a business‑chain link. QA, being most familiar with the end‑to‑end flow, plays an indispensable role in ensuring data quality and driving business development.
