
How to Ensure Data Quality During System Rebuild with Automated Data Comparison

This article explains common data‑quality challenges when rebuilding business systems, compares manual SQL‑based validation with a dedicated data‑comparison product, and walks through practical steps for configuring, executing, and reviewing automated data‑matching tasks in a big‑data environment.

Data Thinking Notes

1. Data Quality Scenarios

1.1 Data Transfer Result Validation

Business‑system data is often transferred or synchronized into a big‑data platform for further processing. After transfer, it is essential to verify that the data matches the source tables and that no records were lost. Traditionally this requires manual, costly side‑by‑side comparisons.

1.2 Business System Rebuild

The procurement system, a legacy supply‑chain application, struggled to keep up with growing business volume and evolving logic. Frequent issues forced the product‑research team to allocate multiple engineers to troubleshooting, leaving little capacity for new features. The rebuild was divided into five workstreams: business analysis, architecture design, QA, data migration, and external‑dependency coordination. Data migration directly impacts the data warehouse because historical and incremental data must be handled without retaining duplicate processing logic.

2. Two Processing Solutions

2.1 Manual Processing Solution

When a system is rebuilt without product support, manual validation becomes extremely labor‑intensive. The main tasks include:

Business analysis – the richness of data documentation greatly affects the difficulty of analysis; lacking documentation often means starting from scratch.

Data‑dependency analysis – identify source tables to prepare for data extraction and subsequent validation.

Source‑side data‑quality assurance – ensure accuracy, completeness, and timeliness of source data, which directly influences downstream processing reliability.

Result‑side data‑quality assurance – perform comprehensive checks such as total row counts, segment counts, purchase amounts, order numbers, primary‑key uniqueness, and enumeration counts, followed by full‑text field comparisons.

Overall comparison – custom SQL validates table‑level and field‑level metrics.

Full‑text comparison – after overall checks, core fields are compared in depth.

All these comparisons require hand‑crafted SQL scripts, and the workload multiplies when data reconstruction is added.
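The table-level checks above can be sketched as a small script. This is a minimal illustration, not the production setup: an in-memory SQLite database stands in for the Hive source and target, and the table name `src_orders`/`dst_orders` and columns are hypothetical.

```python
import sqlite3

# Hypothetical tables standing in for the real Hive source and target.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src_orders (order_id TEXT PRIMARY KEY, amount REAL);
CREATE TABLE dst_orders (order_id TEXT PRIMARY KEY, amount REAL);
INSERT INTO src_orders VALUES ('A1', 10.0), ('A2', 25.5);
INSERT INTO dst_orders VALUES ('A1', 10.0), ('A2', 25.5);
""")

checks = {
    # total row counts must match between source and target
    "row_count": "SELECT (SELECT COUNT(*) FROM src_orders)"
                 " = (SELECT COUNT(*) FROM dst_orders)",
    # aggregate purchase amount must match
    "amount_sum": "SELECT (SELECT SUM(amount) FROM src_orders)"
                  " = (SELECT SUM(amount) FROM dst_orders)",
    # primary key must remain unique on the target side
    "pk_unique": "SELECT COUNT(*) = COUNT(DISTINCT order_id) FROM dst_orders",
}
results = {name: bool(conn.execute(sql).fetchone()[0])
           for name, sql in checks.items()}
print(results)  # all True when the tables agree
```

In practice each check is one more hand-written query per table, which is exactly the workload the product feature removes.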

2.2 Product Processing Solution

The Data Quality Center provides a product feature for data comparison, supporting Hive‑to‑Hive and Hive‑to‑other‑source full‑text comparisons. This capability eliminates the need for manual SQL and dramatically reduces engineering effort.

3. Data Comparison Feature Practice

Step 1: Create a Data Comparison Task

Select source and target tables; the example uses two Hive databases for a full‑text comparison. The product auto‑detects partitioned tables and defaults to partition comparison, while also allowing full‑table comparison.

Step 2: Set Comparison Method, Association Mode, and Field Mapping

Comparison method: full‑volume or sampling (with a configurable sample ratio). Association mode: primary‑key or MD5. Field mapping pairs each source column with its target counterpart before the comparison runs.
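The MD5 association mode can be illustrated as follows: when no reliable primary key exists, each row is hashed over all of its field values, and the two tables are compared as multisets of hashes. This is a sketch of the idea only; the sample rows and the `\x01` separator are assumptions, not the product's internal format.

```python
import hashlib
from collections import Counter

def row_md5(row, sep="\x01"):
    """Hash all field values of a row so rows can be matched
    without a primary key (the idea behind MD5 association mode)."""
    joined = sep.join("" if v is None else str(v) for v in row)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

src = [("A1", "pen", 10.0), ("A2", "ink", 25.5)]
dst = [("A2", "ink", 25.5), ("A1", "pen", 10.0)]

# Order-independent comparison: the multisets of row hashes must be equal.
match = Counter(map(row_md5, src)) == Counter(map(row_md5, dst))
print(match)  # True
```

Using a non-printing separator between fields avoids false matches where two different rows concatenate to the same string.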

Step 3: Scheduling Strategy

Choose a temporary storage location for the comparison results. The task can be saved and executed immediately, or saved and triggered manually later. Automatic scheduling is planned for a future release.

Step 4: Execution

After creating the task, you may run it immediately or trigger it later from the task list. Execution respects data‑access permissions; only the task owner can run the job.

Step 5: View Results

Results appear in the execution instance list. Successful runs provide both table‑level and field‑level comparison outcomes.
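To make the two result levels concrete, here is a hedged sketch of what a field-level comparison computes: rows are joined on a primary key, disagreeing columns are reported per row, and keys present on only one side surface as a table-level discrepancy. The function `field_diffs` and the sample data are illustrative, not the product's actual output format.

```python
# Hypothetical rows keyed by primary key; one amount deliberately differs.
columns = ("order_id", "item", "amount")
src = {"A1": ("A1", "pen", 10.0), "A2": ("A2", "ink", 25.5)}
dst = {"A1": ("A1", "pen", 10.0), "A2": ("A2", "ink", 26.0)}

def field_diffs(src_rows, dst_rows, cols):
    """Return (field-level mismatches, keys missing on one side)."""
    diffs = []
    for key in sorted(src_rows.keys() & dst_rows.keys()):
        for col, a, b in zip(cols, src_rows[key], dst_rows[key]):
            if a != b:
                diffs.append((key, col, a, b))
    # rows present on only one side are a table-level discrepancy
    missing = src_rows.keys() ^ dst_rows.keys()
    return diffs, missing

diffs, missing = field_diffs(src, dst, columns)
print(diffs)    # [('A2', 'amount', 25.5, 26.0)]
print(missing)  # set()
```

A successful run with empty `diffs` and empty `missing` corresponds to the "both table-level and field-level comparisons passed" outcome described above.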

4. Quality Center Introduction

4.1 Data Comparison Summary

The product dramatically frees data‑development engineers from manual SQL work. Revisiting the procurement‑system rebuild scenario, the following four aspects are addressed:

Business analysis – documentation depth influences analysis difficulty.

Data‑dependency analysis – identifies source tables for downstream validation.

Source‑side data‑quality assurance – ensures accuracy, completeness, and timeliness.

Result‑side data‑quality assurance – performed via comprehensive table and field comparisons.

The last two points (source‑side and result‑side data‑quality assurance) can be fully addressed with automated data comparison.

4.2 Quality Center Overview

Data comparison and shape exploration are two core functions of the Quality Center, effectively tackling the two main data‑quality problems. Users request broader capabilities, such as MySQL‑to‑MySQL comparison, scheduling integration, and triggering comparison after data‑transfer tasks. These enhancements are planned for future product iterations.

Tags: data migration, big data, data quality, data validation, data comparison
Written by

Data Thinking Notes

Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.