Data Quality Review: From Compliance to Reasonableness and Toolchain Overview
This article explores data collection governance by distinguishing compliance from reasonableness. It introduces a quality review tool system comprising visual inspection, intelligent judgement, and self‑diagnosis; details key techniques such as comparison operators and sampling; and outlines a three‑layer architecture along with future directions for data quality assurance.
1. Seeing Data Quality
Using a case study of the login_type parameter, the article demonstrates how visual charts can quickly reveal data quality issues such as incomplete enumerations, unstable trends, and platform inconsistencies (e.g., case differences between Android and iOS values).
2. Data Quality: Compliance vs Reasonableness
Data quality is divided into five classic dimensions (completeness, validity, accuracy, consistency, timeliness). The article re‑frames these as Compliance (adhering to predefined rules) and Reasonableness (making sense from multiple perspectives). Compliance checks are absolute; reasonableness requires analytical judgment.
3. Quality Review Tool System
The system consists of three parts: Quality Review, Intelligent Judgement, and Self‑Diagnosis.
3.1 Quality Review
Emphasizes relative comparisons: internal distribution, date‑wise trends, and dual‑platform (Android/iOS) differences. Visualizations (stacked percentage charts, trend lines) enable rapid understanding of data health.
3.2 Intelligent Judgement
Transforms human comparison ideas into executable code. Uses comparison operators such as Manhattan distance for distribution metrics and delta adjustments for ratio metrics. Configurable rules, thresholds, and pre‑conditions reduce false alarms.
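A minimal sketch of what such comparison operators might look like in Python. The function names, example values, and thresholds here are illustrative assumptions, not the article's actual implementation; the Manhattan (L1) distance over two value distributions and the delta band around a baseline ratio follow the ideas described above.

```python
def manhattan_distance(dist_a: dict, dist_b: dict) -> float:
    """L1 distance between two value distributions (value -> share)."""
    keys = set(dist_a) | set(dist_b)
    return sum(abs(dist_a.get(k, 0.0) - dist_b.get(k, 0.0)) for k in keys)

def ratio_exceeds(today: float, baseline: float, delta: float = 0.05) -> bool:
    """Flag a ratio metric only when it moves beyond baseline +/- delta."""
    return abs(today - baseline) > delta

# Hypothetical login_type shares; a case split ("qq" vs "QQ") appears today.
yesterday = {"qq": 0.62, "wechat": 0.35, "guest": 0.03}
today     = {"qq": 0.30, "wechat": 0.35, "guest": 0.03, "QQ": 0.32}

alarm = manhattan_distance(yesterday, today) > 0.2  # configurable threshold
```

A pre‑condition (e.g., minimum sample size for the day) would typically gate the check before the operator runs, which is one way such rules reduce false alarms.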
3.3 Self‑Diagnosis
Provides tools for ad‑hoc metric comparison, dimension drill‑down, sample extraction, and auto‑generated SQL templates, allowing users to quickly pinpoint problematic data.
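The auto‑generated SQL templates could be as simple as filling a parameterized drill‑down query. The table and column names below (`app_events_sample`, `device_id`, `dt`) are hypothetical placeholders, since the article does not specify the underlying schema.

```python
# Assumed schema: an events table with device_id, event, dt, and dimension columns.
SQL_TEMPLATE = """
SELECT {dimension}, COUNT(*) AS pv, COUNT(DISTINCT device_id) AS uv
FROM {table}
WHERE event = '{event}' AND dt = '{date}'
GROUP BY {dimension}
ORDER BY pv DESC
LIMIT 50
"""

def drilldown_sql(table: str, event: str, date: str, dimension: str) -> str:
    """Render a drill-down query for one metric/dimension combination."""
    return SQL_TEMPLATE.format(
        table=table, event=event, date=date, dimension=dimension
    ).strip()
```

Generating the query per dimension lets a user pivot from an anomalous metric to the suspect dimension value in one click, then pull matching samples for inspection.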
4. Key Technical Analysis
4.1 Sample Library
Samples 1% of device data at the gateway, storing it for fast, low‑cost analysis. Confidence is evaluated as a weighted mix of PV and UV differences, typically exceeding 99%.
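One common way to implement stable 1% device sampling is to hash the device ID into buckets, so the same device is always in (or out of) the sample. The hashing scheme and the confidence weighting below are assumptions for illustration; the article only states that confidence is a weighted mix of PV and UV differences.

```python
import hashlib

def in_sample(device_id: str, rate: float = 0.01) -> bool:
    """Stable per-device sampling: hash the id into 10,000 buckets,
    keep the lowest rate * 10000 buckets."""
    bucket = int(hashlib.md5(device_id.encode()).hexdigest(), 16) % 10000
    return bucket < rate * 10000

def confidence(pv_full: float, pv_est: float,
               uv_full: float, uv_est: float,
               w_pv: float = 0.5, w_uv: float = 0.5) -> float:
    """Assumed formula: 1 minus the weighted relative PV/UV error
    between the scaled-up sample estimate and the full data."""
    err_pv = abs(pv_est - pv_full) / pv_full
    err_uv = abs(uv_est - uv_full) / uv_full
    return 1 - (w_pv * err_pv + w_uv * err_uv)
```

Hash‑based sampling costs nothing to look up and keeps per‑device behavior intact, which is what makes sample‑library analysis both fast and representative.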
4.2 Three‑Layer Separation Architecture
Divides traditional monitoring rules into Computation Layer (unified metric extraction), Judgement Layer (human review or automated detection), and Alert Layer (notification and follow‑up).
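The separation can be sketched as three decoupled stages, where the judgement stage accepts either an automated rule or a human‑review callback. All names here are illustrative; the article describes the layering, not this interface.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Metric:
    """Computation layer output: one extracted metric point."""
    name: str
    value: float

def judge(metric: Metric, rule: Callable[[Metric], bool]) -> bool:
    """Judgement layer: plug in an automated rule or a human-review hook."""
    return rule(metric)

def alert(metric: Metric) -> str:
    """Alert layer: notification and follow-up, decoupled from judgement."""
    return f"[ALERT] {metric.name}={metric.value}"

m = Metric("login_success_ratio", 0.41)
if judge(m, lambda x: x.value < 0.5):  # assumed threshold rule
    message = alert(m)
```

Because each layer only depends on the previous layer's output, the same computed metrics can feed both dashboards (human review) and automated detection without duplicating extraction logic.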
4.3 Decision Engine & Metric Storage
The first‑generation engine uses a Golang DSL (gengine) for high‑performance rule execution; future versions may adopt Python for richer libraries. Metrics are stored in Tencent Cloud CTSDB (ElasticSearch‑based) for schema‑free, scalable time‑series handling.
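A schema‑free, ElasticSearch‑style store lets each metric document carry its own fields, so new dimensions can appear without a migration. The document shape below is a hypothetical illustration, not CTSDB's actual schema.

```python
import json

# Hypothetical shape of one metric document in an ES-style time-series store;
# field names (metric, app, platform, values) are illustrative assumptions.
doc = {
    "metric": "login_type_distribution",
    "app": "demo_app",
    "platform": "android",
    "timestamp": "2023-05-01T00:00:00Z",
    "values": {"qq": 0.62, "wechat": 0.35, "guest": 0.03},
}
payload = json.dumps(doc)  # what would be indexed as one document
```

Since documents are indexed as JSON, a new enumeration value or an extra dimension simply becomes a new field in later documents, which suits evolving quality metrics.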
5. Summary and Outlook
The presented quality review concepts and tools have been deployed across multiple Tencent businesses, proving valuable for data collection governance. Future work will focus on cost‑optimal reporting, automated model generation, visual debugging, and streamlined processes.
Q&A
Q1: How effective is the sample library? A1: Confidence usually exceeds 99%, with no observed false‑positive cases so far.
Q2: How to improve precision in recall‑oriented scenarios? A2: Adjust thresholds, refine operators, and incorporate richer rule sets while balancing generalization and over‑fitting.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.