Testing Plan and Efficiency Strategies for a Data Download Refactoring Project
This article outlines the testing plan for a data download refactoring project involving over 400 metrics, describes automated CSV comparison scripts, evaluates single‑process, multithreaded, and multiprocess approaches with shared memory, and provides practical recommendations for improving verification efficiency and performance.
The project required testing the accuracy of more than 400 data metrics by downloading them as CSV files and manually verifying each field, which proved inefficient due to large file sizes, slow opening, and the need for sampling.
To address these challenges, an automated verification solution based on the pytest framework was introduced, focusing on two validation stages: field‑level accuracy (comparing common records between new and old versions) and overall value comparison (summing numeric columns or counting frequencies of categorical values).
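The second stage (overall value comparison) can be sketched in plain Python. This is an illustrative sketch, not the project's actual code: rows are assumed to be plain dicts parsed from CSV, and the column names in the test are made up.

```python
from collections import Counter

def compare_totals(old_rows, new_rows, numeric_cols, categorical_cols, tol=1e-6):
    """Overall value comparison: sum each numeric column and count the
    frequency of each categorical value in both versions, returning the
    columns whose aggregates diverge."""
    diffs = {}
    for col in numeric_cols:
        old_sum = sum(float(r[col]) for r in old_rows)
        new_sum = sum(float(r[col]) for r in new_rows)
        if abs(old_sum - new_sum) > tol:
            diffs[col] = (old_sum, new_sum)
    for col in categorical_cols:
        old_counts = Counter(r[col] for r in old_rows)
        new_counts = Counter(r[col] for r in new_rows)
        if old_counts != new_counts:
            diffs[col] = (old_counts, new_counts)
    return diffs
```

Inside a pytest test, a single `assert not compare_totals(...)` then fails with the divergent columns whenever the two downloads disagree.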
The article then details the evolution of the comparison script:
Step 01: Convert both CSV files into dictionaries using the account ID as the key.
Step 02: Retrieve each record from the comparison dictionary and obtain its key.
Step 03: Look up the key in the base dictionary, compare each column, and log mismatches.
Step 04: If the key is missing, log an error.
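The four steps above can be sketched as follows. This is a minimal reconstruction, assuming the CSVs share an `account_id` column; the function names are illustrative, not the article's originals.

```python
import csv

def load_csv_as_dict(path, key_field="account_id"):
    """Step 01: read a CSV file into a dict keyed by the account ID."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[key_field]: row for row in csv.DictReader(f)}

def compare(base, comparison):
    """Steps 02-04: walk the comparison dict, look up each key in the
    base dict, compare every column, and collect mismatches."""
    mismatches = []
    for key, new_row in comparison.items():
        old_row = base.get(key)
        if old_row is None:
            # Step 04: key absent from the base version
            mismatches.append((key, "missing in base", None, None))
            continue
        for col, new_val in new_row.items():
            # Step 03: column-by-column comparison
            old_val = old_row.get(col)
            if old_val != new_val:
                mismatches.append((key, col, old_val, new_val))
    return mismatches
```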
Initial implementations logged every mismatch individually with loguru, causing a 40‑minute runtime. Consolidating the log output reduced I/O overhead and roughly halved execution time.
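One way to consolidate the logging is to buffer mismatch messages in memory and emit them as a single record at the end. The sketch below uses the stdlib `logging` module to stay self-contained (the article used loguru; the call pattern is the same):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("csv_compare")

def compare_with_buffered_logging(base, comparison):
    """Collect mismatch messages in a list and write them with a single
    log call at the end, instead of one I/O call per mismatch."""
    buffered = []
    for key, new_row in comparison.items():
        old_row = base.get(key)
        if old_row is None:
            buffered.append(f"{key}: missing in base version")
            continue
        for col, new_val in new_row.items():
            old_val = old_row.get(col)
            if old_val != new_val:
                buffered.append(f"{key}.{col}: {old_val!r} -> {new_val!r}")
    if buffered:
        # One write for the whole run rather than thousands of small ones
        logger.info("Found %d mismatches:\n%s", len(buffered), "\n".join(buffered))
    return buffered
```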
Subsequent attempts explored concurrency:
Multithreading: Threads were used to process chunks of the comparison dictionary, but Python’s Global Interpreter Lock (GIL) prevented any noticeable speedup because the workload was CPU‑bound with little I/O.
Multiprocessing: The dictionary was split using numpy and distributed across processes, halving the runtime to about 10 minutes for moderate file sizes.
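A multiprocess version along these lines might split the comparison keys with `numpy.array_split` and fan the chunks out over a `multiprocessing.Pool`. This is a sketch under stated assumptions, not the project's code; note that each task pickles both dicts into every worker, which is exactly the memory cost discussed next.

```python
import numpy as np
from multiprocessing import Pool

def compare_chunk(args):
    """Worker: compare one slice of comparison keys against the base dict."""
    base, comparison, keys = args
    mismatches = []
    for key in keys:
        old_row = base.get(key)
        if old_row is None:
            mismatches.append((key, "missing"))
            continue
        for col, new_val in comparison[key].items():
            if old_row.get(col) != new_val:
                mismatches.append((key, col))
    return mismatches

def parallel_compare(base, comparison, workers=4):
    """Split the comparison keys into roughly equal chunks with
    numpy.array_split and distribute them across a process pool."""
    chunks = np.array_split(list(comparison.keys()), workers)
    tasks = [(base, comparison, list(chunk)) for chunk in chunks]
    with Pool(workers) as pool:
        results = pool.map(compare_chunk, tasks)
    return [m for part in results for m in part]
```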
For very large files (e.g., 1 million rows per CSV), spawning separate processes became memory‑intensive and slowed startup. To mitigate this, a shared‑memory approach using multiprocessing.Manager was adopted, allowing all processes to access a common dictionary without duplicating memory, resulting in significant performance gains.
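A shared-dictionary variant might look like the sketch below, with worker and function names of my own choosing. One caveat worth hedging: a `Manager` dict is a proxy object, so it avoids duplicating the full structure in every process, but each access still goes through inter-process communication rather than true zero-copy shared memory.

```python
from multiprocessing import Manager, Process

def worker(shared_base, shared_comparison, keys, results, idx):
    """Compare a slice of keys; every process reads the same
    Manager-hosted dicts instead of receiving pickled copies."""
    mismatches = []
    for key in keys:
        old_row = shared_base.get(key)
        if old_row is None:
            mismatches.append((key, "missing"))
            continue
        for col, new_val in shared_comparison[key].items():
            if old_row.get(col) != new_val:
                mismatches.append((key, col))
    results[idx] = mismatches

def shared_memory_compare(base, comparison, workers=4):
    with Manager() as manager:
        shared_base = manager.dict(base)
        shared_comp = manager.dict(comparison)
        results = manager.dict()
        keys = list(comparison.keys())
        step = -(-len(keys) // workers)  # ceiling division
        procs = []
        for i in range(workers):
            p = Process(target=worker,
                        args=(shared_base, shared_comp,
                              keys[i * step:(i + 1) * step], results, i))
            p.start()
            procs.append(p)
        for p in procs:
            p.join()
        return [m for part in results.values() for m in part]
```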
The final recommendations emphasize choosing multiprocess solutions for compute‑intensive, low‑memory tasks, partitioning large datasets when possible, employing shared memory for massive data structures, and minimizing logging to improve overall efficiency.
360 Quality & Efficiency
360 Quality & Efficiency focuses on seamlessly integrating quality and efficiency in R&D, sharing 360’s internal best practices with industry peers to foster collaboration among Chinese enterprises and drive greater efficiency value.