
Data Quality Governance: From Compliance to Reasonableness and the Quality Review Tool System

This article explains how to assess and improve data quality by moving from simple compliance checks to deeper reasonableness analysis, using visual dashboards, a comprehensive quality‑review tool suite, intelligent judgement rules, self‑diagnosis utilities, and key technical components such as sample libraries and a three‑layer architecture.

DataFunTalk

Introduction

Data collection, in a broad sense, means digitizing and recording the real world. The depth, breadth, and accuracy of data collection determine the upper limit of data applications, so data governance must start at the source. The biggest current challenge in data‑collection governance is quality and efficiency; this article focuses on the quality aspect, deeply analyzing how to truly see and even control data quality.

The main content includes the following parts:

Seeing Data Quality

Quality Review Tool System

Key Technical Analysis

Summary and Outlook

Q&A

1. Seeing Data Quality

1.1 A Case to See Data Quality

Assume you are a data scientist who needs to use the terminal‑behavior log parameter login_type (login type). Apart from its Chinese and English names, you know nothing about it and are worried about its quality. How would you proceed?

Possible approaches include asking people, checking a metadata platform, or pulling a few rows of data.

The method proposed in this article is to use visual charts for quality inspection.

The resulting stacked percentage trend chart shows the distribution of login_type enumeration values over time. From the chart we can see:

The parameter is an enumeration with only a few values, whose meanings are easy to infer (e.g., WeChat, QQ, phone).

In the past two weeks there has been no major fluctuation; the trend is stable.

The distribution on both platforms is roughly the same, but Android uses uppercase while iOS uses lowercase, which needs attention.
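The chart behind these observations can be sketched with a small helper. This is a minimal, hypothetical reconstruction (field names like `date`, `platform`, and `login_type` are assumed): it groups raw log rows by date and platform and converts counts into percentage shares, exactly the data a stacked percentage trend chart plots.

```python
from collections import Counter, defaultdict

def enum_distribution(rows, key="login_type"):
    """Percentage distribution of an enum parameter per (date, platform) group.

    rows: hypothetical log records, dicts with 'date', 'platform', and the parameter.
    """
    buckets = defaultdict(Counter)
    for r in rows:
        buckets[(r["date"], r["platform"])][r[key]] += 1
    result = {}
    for grp, counts in buckets.items():
        total = sum(counts.values())
        # Each group's shares sum to ~100%, ready for a stacked-percentage chart.
        result[grp] = {v: round(100.0 * c / total, 2) for v, c in counts.items()}
    return result

logs = [
    {"date": "11-08", "platform": "Android", "login_type": "WECHAT"},
    {"date": "11-08", "platform": "Android", "login_type": "WECHAT"},
    {"date": "11-08", "platform": "Android", "login_type": "QQ"},
    {"date": "11-08", "platform": "iOS", "login_type": "wechat"},
    {"date": "11-08", "platform": "iOS", "login_type": "qq"},
]
dist = enum_distribution(logs)
# The Android/iOS case mismatch (WECHAT vs wechat) is visible
# immediately in the value sets of the two groups.
```

Even without a charting library, printing such a table per date already surfaces the case inconsistency the chart revealed.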

Consequently, we may conclude:

login_type is populated conscientiously, its semantics are clear, and its trend is stable, though letter case is inconsistent between platforms.

Whatever decision we then make, it rests on an objective, concrete understanding of the parameter.

The whole process from ignorance to decision often takes no more than two minutes.

The speed comes from the high information density of this simple chart, which implicitly performs a reasonableness analysis of the data's quality.

2. Data Quality: Compliance vs Reasonableness

In the data field, quality is usually evaluated by five dimensions: completeness, validity, accuracy, consistency, and timeliness.

This article re‑frames them into two cognitive dimensions: compliance and reasonableness.

Compliance

Compliance means the data conforms to the specifications set by designers.

Completeness and validity are essentially compliance checks (e.g., required fields, out‑of‑range values, enumeration violations).

Compliance results are absolute: either correct or wrong. Rules can be defined before data is even produced.
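Because compliance rules are absolute and can be written before any data exists, they translate directly into code. The sketch below is a hypothetical rule checker (the `spec` format and field names are assumptions, not a real platform's API) covering the three checks mentioned above: required fields, enumeration violations, and out-of-range values.

```python
def compliance_check(record, spec):
    """Return a list of compliance violations for one record against a spec.

    spec is a hypothetical rule format: {field: {"required": bool,
    "enum": set, "range": (lo, hi)}}, definable before data is produced.
    """
    issues = []
    for field, rule in spec.items():
        if field not in record or record[field] in (None, ""):
            if rule.get("required"):
                issues.append(f"{field}: missing required field")
            continue
        value = record[field]
        if "enum" in rule and value not in rule["enum"]:
            issues.append(f"{field}: '{value}' not in allowed enumeration")
        if "range" in rule:
            lo, hi = rule["range"]
            if not (lo <= value <= hi):
                issues.append(f"{field}: {value} out of range [{lo}, {hi}]")
    return issues

spec = {
    "login_type": {"required": True, "enum": {"wechat", "qq", "phone"}},
    "duration_ms": {"range": (0, 86_400_000)},
}
issues = compliance_check({"login_type": "WECHAT", "duration_ms": -5}, spec)
# Two violations: WECHAT breaks the (lowercase) enumeration, -5 is out of range.
```

Note that the uppercase `WECHAT` from Android fails the enumeration check here, which is exactly how a compliance rule would have caught the case inconsistency from the earlier example.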

Reasonableness

Reasonableness means the data makes sense from all angles and is self‑consistent.

Accuracy and consistency must be judged through data analysis, because we cannot compare the data directly against the real world.

Reasonableness lacks a unified standard; only experienced domain experts can spot unreasonable clues.

Compliance is the minimum requirement, but reasonableness is essential for trustworthy data.

3. Quality Review Tool System

The system consists of three parts: Quality Review, Intelligent Judgement, and Self‑Diagnosis.

3.1 Quality Review – Let People See Data Quality

(1) The Key Is Relativity

Reasonableness relies on comparison; all reasonableness comes from relative analysis.

Returning to the earlier case, the chart enables three kinds of comparison:

Internal comparison : Top‑10 enumeration values and their proportions.

Date comparison : Day‑over‑day and week‑over‑week trends.

Dual‑platform comparison : Android vs iOS distribution.

These comparisons give a comprehensive view of the parameter’s quality profile.
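All three comparisons reduce to the same primitive: a delta between two share distributions. A minimal sketch (the data values are invented for illustration):

```python
def share_delta(a, b):
    """Percentage-point delta between two enum share dicts (a minus b)."""
    keys = set(a) | set(b)
    return {k: a.get(k, 0.0) - b.get(k, 0.0) for k in keys}

today    = {"wechat": 58.0, "qq": 30.0, "phone": 12.0}
lastweek = {"wechat": 60.0, "qq": 31.0, "phone": 9.0}
android  = {"WECHAT": 58.0, "QQ": 30.0, "PHONE": 12.0}

# Date comparison: week-over-week shift in each value's share.
wow = share_delta(today, lastweek)
# Dual-platform comparison: normalize case first, then compare iOS vs Android.
xplat = share_delta(today, {k.lower(): v for k, v in android.items()})
```

After case normalization the cross-platform delta is zero everywhere, confirming that the two platforms' distributions match apart from letter case.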

(2) Types and Indicators of Quality Inspection

Quality inspection is like a health check report for data. Different data types require different inspection items.

Inspection types are defined from a quality perspective (as opposed to storage perspective) and can be manually labeled or auto‑generated.

Each inspection type maps to several inspection indicators, each producing at least one visualization chart. Examples include element click penetration rate, page entry/exit defect rate, etc.

Inspection indicators are essentially feature engineering for data quality.
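As a concrete (assumed) example of such an indicator, an "element click penetration rate" can be defined as the share of users exposed to an element who go on to click it; this definition is an illustration, not necessarily the production formula.

```python
def penetration_rate(exposed_users, clicked_users):
    """Assumed definition: fraction of exposed users who also clicked.

    Both arguments are sets of user/device ids.
    """
    if not exposed_users:
        return 0.0
    return len(clicked_users & exposed_users) / len(exposed_users)

rate = penetration_rate({"u1", "u2", "u3", "u4"}, {"u2", "u4", "u9"})
# 2 of 4 exposed users clicked -> 0.5; u9 clicked without exposure
# and is excluded, itself a clue worth inspecting.
```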

3.2 Intelligent Judgement – Let Machines Automatically Detect Problems

Examples of intelligent judgement results are shown below. Each row represents a detected problem, e.g., “Tab button exposure interval distribution on 11.09 Android gray‑release version was abnormal, discovered by gray‑main comparison and Manhattan distance operator.”

Two concrete examples are provided, illustrating how the system flags enumeration distribution anomalies and event penetration rate issues, and how the problems are classified (business feature, technical fault, etc.).

(1) Judgement Rules and Rule Library

Intelligent judgement converts the comparison ideas from quality review into executable code.

Typical judgement strategies include:

Main‑line date ring‑compare : Compare today’s main version with the average of the past N days.

Gray‑main comparison : Compare today’s gray version with today’s main version.

Gray‑dual‑platform comparison : Compare today’s gray Android with today’s main iOS (or vice versa).

Gray‑gray comparison (TBD) : Compare today’s gray with a previous similar gray.
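The first of these, the main-line date ring-compare, can be sketched as follows; the function and its N-day window are an illustrative assumption of how such a rule might be coded.

```python
def ring_compare(today, history, n=7):
    """Main-line date ring-compare: today's metric vs the mean of up to
    the past n days, returned as a relative change."""
    recent = history[-n:]
    baseline = sum(recent) / len(recent)
    return (today - baseline) / baseline

change = ring_compare(120.0, [100.0, 100.0, 100.0])
# baseline is 100.0, so today is a +20% relative change
```

The gray-main and dual-platform comparisons follow the same shape, differing only in which two series are fed to the operator.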

(2) Judgement Operators

Distance operators : Manhattan distance and weighted variants for distribution metrics.

Difference operators : Difference correction and weighted variants for ratio metrics.

Compliance operators : Various compliance checks for absolute guarantees.
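The Manhattan distance operator mentioned above is straightforward for distribution metrics: sum the absolute per-value differences between two share distributions (the example values are invented).

```python
def manhattan_distance(p, q):
    """L1 distance between two percentage distributions (dicts of value -> share).
    Values missing from one side are treated as 0%."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

d = manhattan_distance(
    {"wechat": 60.0, "qq": 40.0},
    {"wechat": 55.0, "qq": 35.0, "phone": 10.0},
)
# |60-55| + |40-35| + |0-10| = 20 percentage points of total shift
```

A weighted variant would simply multiply each term by a per-value weight before summing, which lets high-stakes enumeration values dominate the score.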

(3) Configuration, Thresholds, and Gates

Configuration : Increases flexibility and reusability of rules.

Thresholds : Three levels – warning, problem, fatal.

Gates : Pre‑conditions to control false positives.

The rule library stores reusable judgement rules, allowing users to select or extend them.
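Putting configuration, gates, and the three threshold levels together, a reusable rule entry might look like this. The structure and numbers are hypothetical, a sketch of how a rule-library entry could be evaluated:

```python
RULE = {  # hypothetical rule-library entry
    "name": "enum_distribution_shift",
    "operator": "manhattan_distance",
    "comparison": "gray_main",
    "gates": {"min_pv": 10_000},  # pre-condition to suppress false positives
    "thresholds": {"warning": 5.0, "problem": 15.0, "fatal": 40.0},
}

def grade(rule, score, pv):
    """Apply gates first, then map an operator score to a threshold level."""
    if pv < rule["gates"]["min_pv"]:
        return "skipped"  # gate not satisfied: too little traffic to judge
    for level in ("fatal", "problem", "warning"):
        if score >= rule["thresholds"][level]:
            return level
    return "ok"

# A 20-point distribution shift on healthy traffic is a "problem";
# the same shift on thin traffic is gated out entirely.
```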

(4) Limited Intelligence

Currently there is no AI or machine learning; the system relies on rule‑based logic for explainability. Future work may introduce AI to expand rule models and automate rule generation.

3.3 Self‑Diagnosis – Let People Quickly Diagnose Problems

Useful small tools include:

Comparison tool – users can freely compare any two inspection indicators.

Dimension drill‑down – drill into any parameter with 3D visualization.

Sample extraction – provide a few detailed rows for quick reference.

SQL templates – auto‑generate query templates for non‑technical users.
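The SQL-template idea in the list above can be as simple as string substitution; the table and column names below are hypothetical placeholders, not the real warehouse schema.

```python
# Hypothetical template: top enumeration values with PV/UV for one partition.
SQL_TEMPLATE = """\
SELECT {dim}, COUNT(*) AS pv, COUNT(DISTINCT device_id) AS uv
FROM {table}
WHERE ds = '{ds}'
GROUP BY {dim}
ORDER BY pv DESC
LIMIT 10"""

sql = SQL_TEMPLATE.format(dim="login_type", table="ods_terminal_log", ds="2023-11-09")
```

A non-technical user only fills in the parameter name and date; the template supplies the grouping and limits that make the query safe to run.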

For complex issues, a user‑behavior diagnosis tool visualizes a Gantt‑style timeline with linked statistics and details, helping trace single‑user actions.

4. Key Technical Analysis

4.1 Sample Library Technology

Principle: In the receiving gateway, 1% of devices are sampled and routed to a sample stream; the stream is persisted as a sample library.

Confidence is measured by combining PV and UV differences (each weighted 50%). Even with 99% confidence, the error margin is 1%, which is acceptable for high‑level quality monitoring.

Benefits: consumes only ~1% of warehouse resources while saving massive compute and query time; recommended for any sizable business.
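Device-level sampling of this kind is usually done by hashing a stable device id into buckets, so the same device is always in or out of the sample. A minimal sketch (the hashing scheme is an assumption; the actual gateway may differ):

```python
import hashlib

def in_sample(device_id, rate_percent=1):
    """Stable ~1% device sampling: hash the device id into 100 buckets
    and keep the first rate_percent of them. The same device always
    lands in the same bucket, so its full event stream is sampled."""
    h = int(hashlib.md5(device_id.encode("utf-8")).hexdigest(), 16)
    return h % 100 < rate_percent

# Deterministic per device; roughly 1 in 100 devices selected overall.
sampled = sum(in_sample(f"device-{i}") for i in range(10_000))
```

Sampling whole devices rather than individual events is what keeps UV-based indicators meaningful in the sample library.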

4.2 Three‑Layer Separation Architecture

Design splits traditional monitoring rules into three layers: Compute rules, Judgement rules, and Alarm rules, forming a three‑layer architecture.

Compute layer : homogenizes heterogeneous data into uniform inspection indicators, focusing on cost reduction via merging, scheduling, sampling, and edge computing.

Judgement layer : extracts data problems; humans perform quality review, machines perform intelligent judgement.

Alarm layer : human‑friendly notifications to producers and consumers, with metrics for receipt and resolution rates.
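The separation can be illustrated end to end with three tiny functions, one per layer; this is a didactic sketch under assumed data shapes, not the production pipeline.

```python
from collections import Counter

def compute_layer(events):
    """Compute layer: homogenize raw events into a uniform indicator
    (here, enum shares in percent)."""
    counts = Counter(e["login_type"] for e in events)
    total = sum(counts.values())
    return {k: 100.0 * v / total for k, v in counts.items()}

def judgement_layer(indicator, baseline, threshold=15.0):
    """Judgement layer: flag a problem when the L1 distance to the
    baseline distribution exceeds a threshold."""
    keys = set(indicator) | set(baseline)
    score = sum(abs(indicator.get(k, 0.0) - baseline.get(k, 0.0)) for k in keys)
    return [{"rule": "enum_shift", "score": score}] if score >= threshold else []

def alarm_layer(problems):
    """Alarm layer: render detected problems as human-readable messages."""
    return [f"[problem] {p['rule']} scored {p['score']:.1f}" for p in problems]

events = [{"login_type": t} for t in ["wechat"] * 9 + ["qq"]]
alarms = alarm_layer(judgement_layer(compute_layer(events), {"wechat": 60.0, "qq": 40.0}))
```

Because each layer has one concern, the compute layer can be optimized for cost and the judgement layer swapped or extended without touching alarm delivery.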

4.3 Decision Engine and Metric Storage

The first‑generation decision engine uses a DSL based on gengine, a Golang rule engine; it handles 600k+ daily tasks with good performance but limited extensibility.

The next generation may adopt Python DSL to support third‑party libraries and datasets, possibly using Jupyter as the development environment.

Metric storage uses Tencent Cloud CTSDB (ElasticSearch‑based) for its schema‑free design, providing stability and high performance for time‑series data.

5. Summary and Outlook

The quality‑review concepts and tools described have been applied in multiple Tencent businesses and are becoming increasingly important. Future work will cover efficiency aspects such as hierarchical reporting, columnar packaging, standardized reporting models, visual joint debugging, real‑time validation, lightweight processes, and intelligent dashboards.

Q&A

Q1: How effective is the sample library? Does it mainly address confidence issues? How to judge if sampled data is problematic?

A1: The sample library’s main value lies in its high confidence (generally >99%). Although probability sampling introduces uncertainty, the library provides reliable results, and no false‑positive cases have been observed so far. Specific business scenarios may require tailored confidence assessments.

Q2: How to improve the accuracy component of the precision‑recall trade‑off?

A2: Improving accuracy is challenging; it requires balancing generalization and precision. Common methods include adjusting thresholds, refining algorithms, expanding rule sets, and focusing on stable comparison signals. Ultimately, domain expertise combined with technical skill yields the best results.

Tags: Big Data · Data Quality · Visualization · Data Governance · Intelligent Detection
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
