Data Quality Review: From Compliance to Reasonableness and Toolchain Overview
This article explores data collection governance by distinguishing compliance from reasonableness. It introduces a quality review tool system comprising visual inspection, intelligent judgement, and self‑diagnosis; details key techniques such as comparison operators and sampling; and outlines a three‑layer architecture along with future directions for data quality assurance.
1. Seeing Data Quality
Using a case study of the login_type parameter, the article demonstrates how visual charts can quickly reveal data quality issues such as incomplete enumerations, unstable trends, and platform inconsistencies (e.g., case differences between Android and iOS values).
2. Data Quality: Compliance vs Reasonableness
Data quality is divided into five classic dimensions (completeness, validity, accuracy, consistency, timeliness). The article re‑frames these as Compliance (adhering to predefined rules) and Reasonableness (making sense from multiple perspectives). Compliance checks are absolute; reasonableness requires analytical judgment.
3. Quality Review Tool System
The system consists of three parts: Quality Review, Intelligent Judgement, and Self‑Diagnosis.
3.1 Quality Review
Emphasizes relative comparisons: internal distribution, date‑wise trends, and dual‑platform (Android/iOS) differences. Visualizations (stacked percentage charts, trend lines) enable rapid understanding of data health.
3.2 Intelligent Judgement
Transforms human comparison ideas into executable code. Uses comparison operators such as Manhattan distance for distribution metrics and delta adjustments for ratio metrics. Configurable rules, thresholds, and pre‑conditions reduce false alarms.
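A minimal sketch of what such comparison operators might look like in Python. The function names, example values, and thresholds here are illustrative assumptions, not the article's actual implementation; the Manhattan (L1) distance over two value distributions and the delta band around a baseline ratio follow the ideas described above.

```python
def manhattan_distance(dist_a: dict, dist_b: dict) -> float:
    """L1 distance between two value distributions (value -> share)."""
    keys = set(dist_a) | set(dist_b)
    return sum(abs(dist_a.get(k, 0.0) - dist_b.get(k, 0.0)) for k in keys)

def ratio_exceeds(today: float, baseline: float, delta: float = 0.05) -> bool:
    """Flag a ratio metric only when it moves beyond baseline +/- delta."""
    return abs(today - baseline) > delta

# Hypothetical login_type shares; a case split ("qq" vs "QQ") appears today.
yesterday = {"qq": 0.62, "wechat": 0.35, "guest": 0.03}
today     = {"qq": 0.30, "wechat": 0.35, "guest": 0.03, "QQ": 0.32}

alarm = manhattan_distance(yesterday, today) > 0.2  # configurable threshold
```

A pre‑condition (e.g., minimum sample size for the day) would typically gate the check before the operator runs, which is one way such rules reduce false alarms.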
3.3 Self‑Diagnosis
Provides tools for ad‑hoc metric comparison, dimension drill‑down, sample extraction, and auto‑generated SQL templates, allowing users to quickly pinpoint problematic data.
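The auto‑generated SQL templates could be as simple as filling a parameterized drill‑down query. The table and column names below (`app_events_sample`, `device_id`, `dt`) are hypothetical placeholders, since the article does not specify the underlying schema.

```python
# Assumed schema: an events table with device_id, event, dt, and dimension columns.
SQL_TEMPLATE = """
SELECT {dimension}, COUNT(*) AS pv, COUNT(DISTINCT device_id) AS uv
FROM {table}
WHERE event = '{event}' AND dt = '{date}'
GROUP BY {dimension}
ORDER BY pv DESC
LIMIT 50
"""

def drilldown_sql(table: str, event: str, date: str, dimension: str) -> str:
    """Render a drill-down query for one metric/dimension combination."""
    return SQL_TEMPLATE.format(
        table=table, event=event, date=date, dimension=dimension
    ).strip()
```

Generating the query per dimension lets a user pivot from an anomalous metric to the suspect dimension value in one click, then pull matching samples for inspection.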
4. Key Technical Analysis
4.1 Sample Library
Samples 1% of device data at the gateway, storing it for fast, low‑cost analysis. Confidence is evaluated as a weighted mix of PV and UV differences, typically exceeding 99%.
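One common way to implement stable 1% device sampling is to hash the device ID into buckets, so the same device is always in (or out of) the sample. The hashing scheme and the confidence weighting below are assumptions for illustration; the article only states that confidence is a weighted mix of PV and UV differences.

```python
import hashlib

def in_sample(device_id: str, rate: float = 0.01) -> bool:
    """Stable per-device sampling: hash the id into 10,000 buckets,
    keep the lowest rate * 10000 buckets."""
    bucket = int(hashlib.md5(device_id.encode()).hexdigest(), 16) % 10000
    return bucket < rate * 10000

def confidence(pv_full: float, pv_est: float,
               uv_full: float, uv_est: float,
               w_pv: float = 0.5, w_uv: float = 0.5) -> float:
    """Assumed formula: 1 minus the weighted relative PV/UV error
    between the scaled-up sample estimate and the full data."""
    err_pv = abs(pv_est - pv_full) / pv_full
    err_uv = abs(uv_est - uv_full) / uv_full
    return 1 - (w_pv * err_pv + w_uv * err_uv)
```

Hash‑based sampling costs nothing to look up and keeps per‑device behavior intact, which is what makes sample‑library analysis both fast and representative.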
4.2 Three‑Layer Separation Architecture
Divides traditional monitoring rules into Computation Layer (unified metric extraction), Judgement Layer (human review or automated detection), and Alert Layer (notification and follow‑up).
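The separation can be sketched as three decoupled stages, where the judgement stage accepts either an automated rule or a human‑review callback. All names here are illustrative; the article describes the layering, not this interface.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Metric:
    """Computation layer output: one extracted metric point."""
    name: str
    value: float

def judge(metric: Metric, rule: Callable[[Metric], bool]) -> bool:
    """Judgement layer: plug in an automated rule or a human-review hook."""
    return rule(metric)

def alert(metric: Metric) -> str:
    """Alert layer: notification and follow-up, decoupled from judgement."""
    return f"[ALERT] {metric.name}={metric.value}"

m = Metric("login_success_ratio", 0.41)
if judge(m, lambda x: x.value < 0.5):  # assumed threshold rule
    message = alert(m)
```

Because each layer only depends on the previous layer's output, the same computed metrics can feed both dashboards (human review) and automated detection without duplicating extraction logic.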
4.3 Decision Engine & Metric Storage
The first‑generation engine uses a Golang DSL (gengine) for high‑performance rule execution; future versions may adopt Python for richer libraries. Metrics are stored in Tencent Cloud CTSDB (ElasticSearch‑based) for schema‑free, scalable time‑series handling.
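A schema‑free, ElasticSearch‑style store lets each metric document carry its own fields, so new dimensions can appear without a migration. The document shape below is a hypothetical illustration, not CTSDB's actual schema.

```python
import json

# Hypothetical shape of one metric document in an ES-style time-series store;
# field names (metric, app, platform, values) are illustrative assumptions.
doc = {
    "metric": "login_type_distribution",
    "app": "demo_app",
    "platform": "android",
    "timestamp": "2023-05-01T00:00:00Z",
    "values": {"qq": 0.62, "wechat": 0.35, "guest": 0.03},
}
payload = json.dumps(doc)  # what would be indexed as one document
```

Since documents are indexed as JSON, a new enumeration value or an extra dimension simply becomes a new field in later documents, which suits evolving quality metrics.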
5. Summary and Outlook
The presented quality review concepts and tools have been deployed across multiple Tencent businesses, proving valuable for data collection governance. Future work will focus on cost‑optimal reporting, automated model generation, visual debugging, and streamlined processes.
Q&A
Q1: How effective is the sample library? A1: Confidence usually exceeds 99%, with no observed false‑positive cases so far.
Q2: How to improve precision in recall‑oriented scenarios? A2: Adjust thresholds, refine operators, and incorporate richer rule sets while balancing generalization and over‑fitting.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.