
Case Study of Slow SQL Governance in an Activity Middleware System

The case study details how an activity‑middleware platform tackled slow SQL problems by cleaning historic data, introducing sharding and Elasticsearch off‑loading, optimizing queries with proper indexes and simplified joins, and automating nightly deletions, ultimately cutting daily slow queries from thousands to double‑digit levels and boosting system stability.

vivo Internet Technology

The activity middleware system places great emphasis on performance and user experience. Database performance issues, especially slow SQL queries, can degrade application responsiveness, increase system load, and even cause crashes. This article presents a practical case study of how the system addresses slow SQL problems.

1. Definition of Slow SQL

Slow SQL refers to queries that take a long time to execute, often involving extensive row scans, temporary file sorting, or frequent disk flushes, leading to high I/O and timeouts. The slow‑query log records not only SELECT statements but also INSERT, UPDATE, and other DML statements whose execution time exceeds the long_query_time threshold.

2. Harms of Slow SQL

From a business perspective, slow queries degrade user experience and reduce product satisfaction. From a database perspective, each slow query consumes a portion of limited I/O resources, causing other queries to wait and potentially leading to queue buildup.

3. Causes of Slow SQL

Lack of appropriate indexes, forcing full‑table scans.

Poor query conditions, such as unnecessary JOINs or sub‑queries.

Excessive data volume.

Lock waiting.

Insufficient hardware resources.

Suboptimal database design.

Inaccurate statistics causing bad execution plans.

4. Governance Practices

4.1 Data Volume

When tables grow to tens of millions of rows or more, query latency can exceed one second. The team performed a manual cleanup, deleting historical data older than one year in batches to limit the impact on the production environment.

Example of the cleaning strategy:

DELETE FROM activity_table WHERE create_time < DATE_SUB(NOW(), INTERVAL 1 YEAR) LIMIT 10000;

They split the cleanup into five rounds, each handling 2–3 tables, and used primary‑key based deletions to improve efficiency.
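The primary-key-based approach can be sketched as follows; the column names, batch size, and use of session variables are assumptions, not the team's exact script:

```sql
-- Find the upper bound once: the largest id among expired rows.
SELECT MAX(id) INTO @max_id
FROM activity_table
WHERE create_time < DATE_SUB(NOW(), INTERVAL 1 YEAR);

-- Then delete in fixed-size primary-key ranges, advancing @start by
-- 10000 each round until it exceeds @max_id. Range deletes on the
-- primary key avoid re-scanning by create_time on every batch.
SET @start = 0;
DELETE FROM activity_table
WHERE id >= @start AND id < @start + 10000
  AND id <= @max_id;
```

Pausing briefly between rounds gives replication a chance to catch up and keeps lock pressure on the production tables low.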

4.2 Sharding

For tables that remain large after cleaning, the team introduced horizontal sharding. A routing table records the shard mapping, and the routing configuration is stored in a centralized config center, allowing dynamic addition of shards without redeploying.
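A minimal sketch of such a routing table, with hypothetical names (the actual schema is not given in the source):

```sql
-- Maps a business key to its physical shard index,
-- e.g. shard_index = 3 means the row lives in activity_table_3.
CREATE TABLE shard_route (
  activity_id BIGINT  NOT NULL PRIMARY KEY,
  shard_index TINYINT NOT NULL
);

-- The application resolves the shard first, then queries
-- the physical table activity_table_<shard_index>.
SELECT shard_index FROM shard_route WHERE activity_id = 12345;
```

Because the shard count and mapping live in the config center rather than in code, new shards can be registered without redeploying the application.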

4.3 Offloading Large Queries to ES

For latency‑critical scenarios where MySQL cannot meet performance requirements, data is migrated to Elasticsearch or cache layers.
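As a hedged illustration of the offloading idea, assuming the rows are mirrored into an Elasticsearch index named activity_record (the index name and field are hypothetical):

```json
POST /activity_record/_search
{
  "query": { "term": { "activity_id": 12345 } },
  "size": 20
}
```

The read path then hits Elasticsearch's inverted index instead of MySQL, keeping large analytical or fuzzy-match queries off the transactional database.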

5. SQL Statement Optimization

The logical execution order of a SQL statement is FROM → JOIN → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT. Common problems include selecting all columns with SELECT *, missing indexes, and overly complex joins.

Example of a non‑optimal query:

SELECT * FROM a WHERE id = 0;

Optimized to select only needed columns:

SELECT result FROM a WHERE id = 0;

5.1 Index Usage

Indexes dramatically improve query speed. In a benchmark on a test table, a query filtering on a non-indexed column ( result ) took 4286 ms, while the same query on an indexed column ( test ) took 16 ms. The EXPLAIN output confirms the access type changes from ALL (full table scan) to ref (index lookup).
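A minimal reproduction of that comparison; the table layout is assumed, since the original creation script is not reproduced here:

```sql
CREATE TABLE t (
  id     INT PRIMARY KEY AUTO_INCREMENT,
  result VARCHAR(64),          -- no index
  test   VARCHAR(64),          -- covered by idx_test
  KEY idx_test (test)
);

-- Full table scan: EXPLAIN reports type = ALL.
EXPLAIN SELECT * FROM t WHERE result = 'x';

-- Index lookup: EXPLAIN reports type = ref on idx_test.
EXPLAIN SELECT * FROM t WHERE test = 'x';
```

Checking EXPLAIN before and after adding an index is the quickest way to confirm the optimizer is actually using it.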

Typical MySQL access types:

ALL – full table scan.

index – full index scan.

range – range scan using part of an index.

index_merge – merging multiple indexes.

unique_subquery – using a unique index in a sub‑query.

const – constant table lookup.

ref – non‑unique index lookup.

5.2 Join Queries

Joins on large tables can be slow. The team split a LEFT JOIN into two separate SELECT statements, reducing execution time from 1432 ms to roughly 730 ms per query.

SELECT * FROM a WHERE id = 0;
SELECT * FROM b WHERE id = 0;
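For contrast, a sketch of the kind of LEFT JOIN the two statements above replaced (the join condition is an assumption, as the original query is not shown):

```sql
-- Single joined query, measured at 1432 ms on the large tables.
SELECT a.*, b.*
FROM a
LEFT JOIN b ON a.id = b.id
WHERE a.id = 0;
```

Splitting the join moves the row-matching work into the application, where each half runs as a cheap indexed lookup (~730 ms each) instead of one expensive join.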

5.3 Complex Conditions

Nested sub‑queries and IN clauses can cause severe latency (e.g., 12648 ms). The solution is to denormalize data into a dedicated table to avoid such complex queries.
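One way to realize that denormalization, with hypothetical table and column names:

```sql
-- A pre-joined summary table, refreshed whenever the source rows
-- change, so reads hit one indexed table instead of nested
-- sub-queries spanning several large tables.
CREATE TABLE activity_summary (
  activity_id BIGINT  NOT NULL PRIMARY KEY,
  user_count  INT     NOT NULL,
  status      TINYINT NOT NULL,
  KEY idx_status (status)
);

-- The former multi-second nested query becomes a simple lookup.
SELECT activity_id FROM activity_summary WHERE status = 1;
```

The trade-off is write amplification: the summary table must be kept in sync, but read latency becomes predictable.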

6. Overall Strategy

The system runs a nightly scheduled task that deletes invalid historical data based on configurable parameters (shard count, table name, conditions, batch size, time window). The original configuration (keys translated from Chinese):

{
  "shardCount": 7,
  "tableName": "table",
  "condition": "condition"
}

was enhanced to include batch size and explicit start/end times, allowing more flexible and targeted deletions.
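The enhanced configuration might look like the following; the field names and values are illustrative assumptions based on the parameters described above:

```json
{
  "shardCount": 7,
  "tableName": "table",
  "condition": "condition",
  "batchSize": 1000,
  "startTime": "02:00",
  "endTime": "05:00"
}
```

Explicit start and end times confine the deletions to an off-peak window, and the batch size caps how long any single delete holds locks.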

Key improvements:

Avoid join queries; prefer simple statements.

Base deletion on activity lifecycle rather than pure time range.

Leverage shard routing to quickly locate relevant partitions.

After applying these measures, the number of slow SQLs dropped from several thousand per day to double‑digit levels, significantly improving system stability.

7. Experience Summary

Prevent slow SQLs at the source by adhering to coding standards.

Do not ignore slow queries in offline databases; they can still affect production.

Design databases thoughtfully from the beginning to avoid costly refactoring.

Continuously review and share lessons learned to prevent repeat mistakes.

Written by vivo Internet Technology

Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.