Mastering Database Sharding: When and How to Split Databases and Tables
This article explains the concepts of database and table sharding, why they are needed, various vertical and horizontal splitting strategies, range and hash partitioning methods, their advantages and drawbacks, timing considerations, common challenges such as transactions and distributed IDs, and introduces popular sharding middleware.
Preface
Hello everyone, today we discuss database sharding (分库分表).
1. What Is Database Sharding
Sharding databases: splitting one database into multiple databases deployed on different machines.
Sharding tables: splitting a single table into multiple tables.
2. Why Sharding Is Needed
2.1 Why Split Databases?
When business volume surges, a single MySQL instance may hit performance bottlenecks. Splitting spreads storage across machines and relieves per-instance limits on disk space and concurrent connections, especially in micro‑service architectures where different modules (orders, users, products) are isolated into separate databases.
2.2 Why Split Tables?
Large tables slow down queries, especially when indexes are not used or when a table exceeds ten million rows, causing B+‑tree height to increase and query speed to degrade.
InnoDB stores data in 16 KB pages. Assuming an average row size of 1 KB, each leaf page holds about 16 records (16 KB / 1 KB). Assuming an 8‑byte bigint primary key and a 6‑byte page pointer, each internal (non‑leaf) page holds about 1 170 entries (16 KB / 14 B). Thus a height‑2 tree can store roughly 1 170 × 16 = 18 720 records, and a height‑3 tree about 1 170 × 1 170 × 16 = 21 902 400 records. When height reaches 4, the extra disk read per lookup makes queries noticeably slower.
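The capacity estimate above can be checked with a few lines of arithmetic. This is a sketch under the same assumptions as the text: 16 KB pages, ~1 KB rows, and 14 bytes per internal-node entry (8-byte bigint key plus 6-byte pointer).

```python
PAGE_SIZE = 16 * 1024           # bytes per InnoDB page
ROW_SIZE = 1024                 # assumed average row size
KEY_PLUS_POINTER = 8 + 6        # bigint key + page pointer

rows_per_leaf = PAGE_SIZE // ROW_SIZE                  # 16 rows per leaf page
entries_per_internal = PAGE_SIZE // KEY_PLUS_POINTER   # 1170 entries per internal page

def capacity(height: int) -> int:
    """Rough maximum row count for a B+-tree of the given height."""
    return entries_per_internal ** (height - 1) * rows_per_leaf

print(capacity(2))  # 18720
print(capacity(3))  # 21902400
```

This is why the common rule of thumb puts the practical single-table ceiling around twenty million rows: beyond that, the tree grows to height 4.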
3. How to Shard Databases and Tables
3.1 Vertical Splitting
3.1.1 Vertical Database Sharding
Early in development a single database may suffice. As the system grows, separate databases (user, order, points, product) can be created and deployed on different servers, reducing pressure on any single instance.
3.1.2 Vertical Table Sharding
If a table has many columns, frequently unused or large columns (e.g., email, address, user_desc) can be moved to a separate detail table, reducing I/O for common `SELECT *` queries.
3.2 Horizontal Splitting
3.2.1 Horizontal Database Sharding
Data is partitioned across multiple database servers, each holding the same schema but different row subsets, alleviating single‑node bottlenecks.
3.2.2 Horizontal Table Sharding
Rows are distributed to multiple tables based on a rule such as hash modulo or range. For example, an order table can be split by time range into several tables.
3.3 Sharding Strategies
3.3.1 Range Partitioning
Rows are divided by key ranges (e.g., 0‑10 million, 10‑20 million). This simplifies scaling because new tables can be added without moving existing data, but it creates hotspots: newly inserted rows all carry the highest IDs, so reads and writes concentrate on the newest table.
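Range routing reduces to integer division on the sharding key. A minimal sketch, assuming 10-million-row intervals and hypothetical table names `order_0`, `order_1`, ...:

```python
RANGE_SIZE = 10_000_000  # rows per table, as in the example above

def route_by_range(order_id: int) -> str:
    """Map an order id to its table by fixed key interval."""
    return f"order_{order_id // RANGE_SIZE}"

print(route_by_range(5_000_000))   # order_0
print(route_by_range(12_345_678))  # order_1
```

Adding capacity just means creating the next table (`order_2` for IDs 20‑30 million); no existing rows move.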
3.3.2 Hash Modulo Partitioning
Rows are assigned to tables by computing `key % N`. This distributes load evenly and avoids hotspots, but expanding from N to a larger number of tables requires data migration.
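The migration cost is easy to demonstrate: when the table count changes, most keys route to a different table. A sketch with hypothetical `order_*` table names:

```python
def route_by_hash(key: int, n_tables: int) -> str:
    """Map a key to a table by simple modulo."""
    return f"order_{key % n_tables}"

# Fraction of keys whose target table changes when growing from 4 to 5 tables.
moved = sum(1 for k in range(100_000)
            if route_by_hash(k, 4) != route_by_hash(k, 5))
print(moved / 100_000)  # 0.8 — 80% of rows would need to move
```

Consistent hashing can shrink that migration fraction, at the cost of a more complex routing layer.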
3.3.3 Combined Range + Hash
Use range partitioning to allocate large key intervals to different databases, then apply hash modulo within each database to spread rows across tables, balancing scalability and hotspot avoidance.
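The two strategies compose into one routing function: range selects the database, modulo selects the table within it. A minimal sketch with made-up names (`db_*`, `order_*`) and interval sizes:

```python
RANGE_SIZE = 20_000_000   # keys per database (assumed interval width)
TABLES_PER_DB = 4         # hash-modulo fan-out inside each database

def route(key: int) -> tuple[str, str]:
    """Return (database, table) for a sharding key."""
    db = f"db_{key // RANGE_SIZE}"           # range picks the database
    table = f"order_{key % TABLES_PER_DB}"   # modulo picks the table
    return db, table

print(route(25_000_001))  # ('db_1', 'order_1')
```

New databases can be appended as key ranges grow (no migration), while modulo keeps the write load inside each database spread across its tables.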
4. When to Consider Sharding
4.1 When to Shard Tables
If daily order volume reaches hundreds of thousands and query performance degrades, especially when table size approaches tens of millions (B+‑tree height > 3), sharding should be planned. Many practitioners start considering sharding around 5 million rows.
4.2 When to Shard Databases
When multiple services share a single monolithic database and it becomes a performance bottleneck, separate databases (order, user, etc.) can be created, often aligning with a micro‑service architecture.
5. Problems Introduced by Sharding
Transaction management – local transactions no longer work across databases; distributed transactions are required.
Cross‑database joins – need to be performed in multiple steps.
Sorting and aggregation – must be done per node and merged in the application.
Pagination – a global page cannot be read from one node; a common approach is to fetch the first offset + limit rows from every node, then merge and re‑sort in the application, which grows expensive for deep pages.
Distributed ID generation – cannot rely on auto‑increment; alternatives include UUID or Snowflake algorithms.
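A Snowflake-style generator illustrates the last point. This is a simplified sketch, not Twitter's exact implementation: a 41-bit millisecond timestamp, 10-bit worker id, and 12-bit per-millisecond sequence packed into one 64-bit integer.

```python
import threading
import time

class Snowflake:
    """Simplified Snowflake ID generator: time | worker | sequence."""
    EPOCH = 1288834974657  # custom epoch in ms (Twitter's original value)

    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024      # worker id fits in 10 bits
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ts = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            ts = int(time.time() * 1000)
            if ts == self.last_ts:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit sequence
                if self.sequence == 0:           # sequence exhausted this ms:
                    while ts <= self.last_ts:    # spin to the next millisecond
                        ts = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ts = ts
            return ((ts - self.EPOCH) << 22) | (self.worker_id << 12) | self.sequence

gen = Snowflake(worker_id=1)
ids = [gen.next_id() for _ in range(5)]
assert ids == sorted(set(ids))  # unique and monotonically increasing
```

Because the high bits are a timestamp, IDs generated on one machine sort roughly by creation time, which keeps B+‑tree inserts appended at the right edge rather than scattered like UUIDs.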
6. Popular Sharding Middleware
Cobar
Mycat
Sharding‑JDBC (now part of Apache ShardingSphere)
Atlas
TDDL (Taobao)
Vitess
macrozheng
Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.