
Efficient Deduplication of Large MySQL Tables Using Indexes, Variables, and Window Functions

This article demonstrates how to efficiently remove duplicate rows from a million-record MySQL table, where duplicates are identified by matching created_time and item_name values. It compares several approaches — correlated subqueries, joins against a derived table, user-defined variables, index optimization, window functions, and parallel execution via shell scripts and MySQL events — and measures the performance gain of each.


The article addresses the problem of deduplicating a MySQL table (t_source) containing 1,000,000 rows, of which 500,000 duplicate another row's created_time and item_name values; the goal is to write the 500,000 distinct rows into a target table (t_target).

First, a test environment is set up on a CentOS 6.4 VM with MySQL 8.0.16, and a stored procedure (sp_generate_data) is used to generate the source data, including intentionally duplicated rows.
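The summary does not reproduce the original DDL, but a schema consistent with its description might look like the following sketch (column types and lengths are assumptions, not taken from the article):

```sql
-- Hypothetical schema matching the article's description of t_source.
CREATE TABLE t_source (
    item_id      INT AUTO_INCREMENT PRIMARY KEY,
    created_time DATETIME,
    item_name    VARCHAR(32)
);

-- Target table with the same structure, to receive the distinct rows.
CREATE TABLE t_target LIKE t_source;
```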

Several deduplication methods are evaluated:

Correlated subquery: Uses SELECT DISTINCT t1.* FROM t_source t1 WHERE item_id IN (SELECT MIN(item_id) FROM t_source t2 WHERE t1.created_time = t2.created_time AND t1.item_name = t2.item_name). This method scans the full table twice and takes about 19 seconds.
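Filled out as a complete statement, this approach might look as follows (the INSERT INTO t_target wrapper is an assumption based on the stated goal):

```sql
-- Correlated subquery: keep a row only if its item_id is the minimum
-- among all rows sharing the same (created_time, item_name) pair.
INSERT INTO t_target
SELECT DISTINCT t1.*
  FROM t_source t1
 WHERE item_id IN (SELECT MIN(item_id)
                     FROM t_source t2
                    WHERE t1.created_time = t2.created_time
                      AND t1.item_name   = t2.item_name);
```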

Join with derived table: Joins t_source with a derived table that groups by created_time, item_name. Execution time improves to roughly 13–14 seconds.
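A sketch of this variant, assuming the derived table keeps MIN(item_id) per group as the survivor:

```sql
-- The derived table computes one surviving item_id per duplicate group;
-- the join then pulls back the corresponding full rows.
INSERT INTO t_target
SELECT t1.*
  FROM t_source t1
  JOIN (SELECT MIN(item_id) AS item_id
          FROM t_source
         GROUP BY created_time, item_name) t2
    ON t1.item_id = t2.item_id;
```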

User‑defined variables: Initializes variables @a and @b, then uses a single‑pass scan with conditions like (@a != created_time OR @b != item_name) AND (@a := created_time) IS NOT NULL AND (@b := item_name) IS NOT NULL. With an appropriate index, this method runs in about 12 seconds.
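Putting the pieces together, the single‑pass scan might be written as below. The initial values of @a and @b and the explicit column list are assumptions; note also that assigning user variables inside a SELECT is deprecated as of MySQL 8.0, which is one reason the window‑function version is attractive.

```sql
-- Seed the variables with values that cannot match any real row.
SET @a := '1000-01-01 00:00:00', @b := '';

-- Rows are read in (created_time, item_name) order; a row is emitted only
-- when its key differs from the previous row's key held in @a/@b, and the
-- two IS NOT NULL assignments update the variables as a side effect.
INSERT INTO t_target
SELECT item_id, created_time, item_name
  FROM t_source
 WHERE (@a != created_time OR @b != item_name)
   AND (@a := created_time) IS NOT NULL
   AND (@b := item_name) IS NOT NULL
 ORDER BY created_time, item_name;
```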

Window function (MySQL 8+): Applies ROW_NUMBER() OVER (PARTITION BY created_time, item_name) and selects rows where the row number equals 1. Although it still performs a full table scan, it simplifies the query and completes in roughly 12 seconds.
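The window‑function version is the most compact; a minimal sketch, again assuming the three-column schema:

```sql
-- Number the rows within each (created_time, item_name) group
-- and keep only the first row of each group.
INSERT INTO t_target
SELECT item_id, created_time, item_name
  FROM (SELECT item_id, created_time, item_name,
               ROW_NUMBER() OVER (PARTITION BY created_time, item_name) AS rn
          FROM t_source) t
 WHERE rn = 1;
```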

Index optimization is crucial. Creating a composite index on (created_time, item_name), named idx_sort, allows the variable method to perform a range scan and eliminates the need for a temporary table, achieving the best performance.
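The index described above can be created with:

```sql
-- Composite index on the deduplication key; it lets the variable method
-- read rows in sorted order without a filesort or temporary table.
CREATE INDEX idx_sort ON t_source (created_time, item_name);
ANALYZE TABLE t_source;  -- refresh statistics so the optimizer picks it up
```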

To further accelerate processing, the article explores parallel execution:

Shell background processes: A Bash script launches four concurrent calls to a stored procedure (sp_unique(i)) that processes distinct time ranges. The total runtime drops to about 5 seconds, roughly 2.5× faster than the single‑threaded approach.
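A minimal version of such a script, assuming a local MySQL login and that sp_unique(i) deduplicates the i-th of four time ranges (connection details are placeholders):

```bash
#!/bin/bash
# Launch four deduplication calls in parallel, one per time range.
date '+%H:%M:%S'
for i in 1 2 3 4; do
    mysql -u root test -e "CALL sp_unique($i);" &
done
wait  # block until all four background processes have finished
date '+%H:%M:%S'
```

The `&` puts each client in the background and `wait` joins them, so the elapsed time is governed by the slowest partition rather than the sum of all four.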

MySQL scheduled events: Four events (ev1–ev4) are created, each invoking sp_unique for a specific partition. Execution times are logged in a history table (t_event_history), and the overall duration is comparable to the shell method.
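One of the four events might be declared along these lines (ev2–ev4 would differ only in the argument; the exact schedule and logging wiring are assumptions based on the description):

```sql
-- The Event Scheduler must be running for events to fire.
SET GLOBAL event_scheduler = ON;

-- One of four one-shot events; each fires immediately and calls the
-- procedure for its partition of the time range.
CREATE EVENT ev1
ON SCHEDULE AT CURRENT_TIMESTAMP
ON COMPLETION PRESERVE
DO CALL sp_unique(1);
```

Because each event runs in its own connection, the four calls execute concurrently, mirroring the shell-script approach.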

All tests confirm that using the composite index with user‑defined variables or the window‑function approach provides the most efficient deduplication, and parallel execution can further reduce total processing time on multi‑core hardware.

Tags: MySQL, Deduplication, Indexes, Window Functions, Parallel Execution, SQL Performance
Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
