Databases 8 min read

Why MySQL IN Subquery with Semi‑Join Is Slow and How to Optimize It

The article analyzes a slow MySQL query that uses an IN subquery on a massive user table, explains how the optimizer’s semi‑join and materialized temporary tables cause full‑table scans, and demonstrates practical fixes such as disabling the semi‑join optimizer switch or rewriting the SQL to regain index usage.

Architecture Digest
Architecture Digest
Architecture Digest
Why MySQL IN Subquery with Semi‑Join Is Slow and How to Optimize It

A system needs to push promotional messages to millions of users stored in a single large users table and an auxiliary users_extent_info table; the initial query uses an IN subquery to filter recent logins, but the execution takes dozens of seconds.

The original SELECT and COUNT statements are:

SELECT id, name FROM users WHERE id IN (
  SELECT user_id FROM users_extent_info
  # 查询最近登录过的用户 WHERE latest_login_time < xx
);
SELECT COUNT(id) FROM users WHERE id IN (
    SELECT user_id FROM users_extent_info WHERE latest_login_time < xxxxx
);

Running EXPLAIN on the COUNT query shows that MySQL materializes the subquery result into a temporary table (4561 rows) and then performs a full table scan on users , joining each row with the materialized table, which leads to massive I/O and a filtered rate of only about 10%.

This behavior is caused by the optimizer’s automatic conversion of the IN clause into a semi‑join; the semi‑join forces a full scan of the temporary table for every row in users , effectively disabling index usage.

Disabling the semi‑join optimization with:

SET optimizer_switch='semijoin=off';

produces a normal execution plan: the subquery uses a range scan on the idx_login_time index, and the outer query uses the primary key index on id , reducing the query time to around 100 ms.

An alternative without changing server settings is to rewrite the SQL so that the optimizer does not apply the semi‑join, for example by adding a harmless OR condition:

SELECT COUNT(id) FROM users WHERE (
    id IN (SELECT user_id FROM users_extent_info WHERE latest_login_time < xxxxx)
    OR id IN (SELECT user_id FROM users_extent_info WHERE latest_login_time < -1)
);

The second predicate can never be true, but it changes the plan to a regular subquery with index usage, avoiding the costly materialized temporary table.

The key takeaway is to examine the execution plan, understand why full scans occur, and ensure that queries can leverage appropriate indexes rather than relying on automatic semi‑join optimizations that may degrade performance.

MySQLSQL Optimizationindexexecution planSemi-Join
Architecture Digest
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.