Database Sharding: Horizontal Partitioning, Hash Modulo and Range Schemes, and a Hybrid Design
The article explains vertical and horizontal database sharding, compares hash-modulo and range partitioning schemes with their advantages and drawbacks, and proposes a hybrid group‑based design that balances load, avoids hotspots, and simplifies scaling without costly data migrations.
Database Sharding Overview
In large‑scale projects, when data volume grows, developers split data either vertically (by business domain) or horizontally (by rows). Vertical sharding creates separate databases for modules such as orders and users, while horizontal sharding distributes rows of a single table across multiple tables or databases.
Horizontal sharding is illustrated with a 40 million-row order table that exceeds the commonly recommended single-table row count for MySQL; the data can be split into four or more tables, or further into separate databases.
1. Hash Modulo Scheme
Each row is routed by applying id % N where N is the total number of tables. For example, with four tables, id=12 maps to table 0 and id=13 to table 1, ensuring an even distribution and eliminating hotspot issues.
Advantages: Uniform data distribution, no hotspot.
Disadvantages: When the number of tables changes (e.g., expanding from 4 to 8), the modulo base changes, causing existing rows to map to different tables and requiring painful data migration.
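The hash-modulo routing and its migration problem can be sketched in a few lines. This is a minimal illustration; the class and method names are hypothetical, not from the original article.

```java
public class HashModuloRouter {

    // Route an order ID to one of `tableCount` tables.
    static int tableFor(long id, int tableCount) {
        return (int) (id % tableCount);
    }

    public static void main(String[] args) {
        System.out.println(tableFor(12, 4)); // table 0, as in the article
        System.out.println(tableFor(13, 4)); // table 1
        // The scaling problem: once N changes from 4 to 8,
        // the same ID maps to a different table, forcing migration.
        System.out.println(tableFor(13, 8)); // table 5, no longer table 1
    }
}
```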
2. Range Scheme
Rows are assigned to tables based on predefined ID ranges (e.g., IDs 0–9,999,999 go to table 0 and IDs 10,000,000–19,999,999 to table 1). This design avoids data migration when new tables are added because existing ranges remain unchanged.
Advantages: No data migration needed during scaling.
Disadvantages: Hotspot risk, because sequential IDs concentrate the newest (and most frequently accessed) rows in the most recent table.
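Range routing is equally simple to sketch. The fixed range width of 10,000,000 rows per table matches the example above; the class name is hypothetical.

```java
public class RangeRouter {

    // Width of each table's ID range, matching the example above.
    static final long RANGE_WIDTH = 10_000_000L;

    // Route an ID to its table by integer division on the range width.
    // Adding table 2, 3, ... later never changes where existing IDs land.
    static int tableFor(long id) {
        return (int) (id / RANGE_WIDTH);
    }
}
```

Note the contrast with the modulo scheme: here the routing function depends only on the ID itself, not on the current table count, which is why scaling needs no migration.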
3. Hybrid Design Idea
The proposed solution combines the two schemes: use a range to assign an ID to a logical group, then within that group apply hash modulo across all of its tables. This keeps existing data stable while achieving uniform distribution.
Group Definition
A group contains several databases and tables. For example, IDs 0–40 million belong to group01, which has three databases (DB_0, DB_1, DB_2) and a total of ten tables.
Routing steps:
1. Determine the group by ID range.
2. Compute id % total_table_count (e.g., id % 10). Map the remainder to a specific database (0–3 → DB_0, 4–6 → DB_1, 7–9 → DB_2).
3. Within the selected database, route to the table identified by the remainder.
This approach lets servers with higher capacity store more tables (DB_0 holds four tables, DB_1 and DB_2 hold three each), achieving load balancing according to hardware capabilities.
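The in-group routing steps can be sketched as follows. The ten tables and the remainder-to-database mapping (0–3 → DB_0, 4–6 → DB_1, 7–9 → DB_2) follow the example above; the class and method names are hypothetical.

```java
public class GroupRouter {

    // Total number of tables in group01, per the example above.
    static final int TOTAL_TABLES = 10;

    // Step 2: map the remainder to a database. DB_0 holds four tables
    // because its server has higher capacity; the others hold three each.
    static String dbFor(long id) {
        int slot = (int) (id % TOTAL_TABLES);
        if (slot <= 3) return "DB_0"; // slots 0-3: four tables
        if (slot <= 6) return "DB_1"; // slots 4-6: three tables
        return "DB_2";                // slots 7-9: three tables
    }

    // Step 3: the same remainder identifies the table within the group.
    static int tableFor(long id) {
        return (int) (id % TOTAL_TABLES);
    }
}
```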
Core Process Flow
Scaling to New Groups
When the data volume exceeds the current group’s range, a new group (e.g., group02 ) is defined with its own ID range and the same routing logic, eliminating the need for data migration.
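Group selection itself is a range lookup, so adding group02 is purely additive. In this sketch, only group01's 0–40 million range comes from the article; group02's boundary and the map layout are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GroupSelector {

    // Ordered map: upper ID bound (exclusive) -> group name.
    static final Map<Long, String> GROUPS = new LinkedHashMap<>();
    static {
        GROUPS.put(40_000_000L, "group01");
        GROUPS.put(80_000_000L, "group02"); // added later; no migration needed
    }

    // Walk the ranges in order and return the first group covering the ID.
    static String groupFor(long id) {
        for (Map.Entry<Long, String> e : GROUPS.entrySet()) {
            if (id < e.getKey()) return e.getValue();
        }
        throw new IllegalArgumentException("no group covers id " + id);
    }
}
```

Because existing groups' ranges never change, rows already written under group01 stay exactly where they are when group02 comes online.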
4. System Design
The final architecture stores the relationships between groups, databases, and tables, and caches them locally (e.g., JVM cache) to avoid frequent lookups. For dynamic updates, a distributed configuration center (or Zookeeper) can refresh the cache without restarting services.
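A local cache of these mappings might look like the following sketch. The config source (e.g., Zookeeper or a config center) is stubbed out as a plain map push; all names here are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ShardConfigCache {

    // JVM-local cache of group -> database mappings.
    private final Map<String, String> groupToDb = new ConcurrentHashMap<>();

    // Called on startup, and again whenever the config center
    // pushes a change, so no service restart is required.
    void refresh(Map<String, String> latest) {
        groupToDb.clear();
        groupToDb.putAll(latest);
    }

    // Hot-path lookup hits only local memory, never the config store.
    String dbFor(String group) {
        return groupToDb.get(group);
    }
}
```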
Group-to-database and table-to-database mappings are stored as part of this configuration.
With this design, horizontal scaling, hotspot avoidance, and hardware‑aware data distribution are achieved without costly migrations.
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn