How to Implement Database Sharding: Horizontal & Vertical Partitioning with Sharding-JDBC
This article explains the fundamentals of database sharding, covering both vertical and horizontal partitioning concepts, routing algorithms, advantages and drawbacks, practical implementation steps using sharding-jdbc, and discusses related challenges such as distributed transactions, pagination, global IDs, and tool selection.
Previously, many Java beginners left comments asking for a systematic study of sharding, and since our company project used sharding-jdbc to refactor a MySQL architecture, I decided to write a series of practical articles on sharding as a summary of my architectural learning.
I have read many fragmented articles on sharding online, but most are repetitive and lack detailed real‑world cases. After purchasing a few paid courses, I found that while experienced developers could follow them, beginners still face a steep learning curve.
What is Sharding
In fact, sharding consists of two concepts: database sharding and table sharding. They are usually performed together, so we commonly refer to both as sharding.
Sharding solves performance degradation caused by excessively large databases or tables by splitting a massive database into multiple independent databases and a massive table into several smaller tables, thereby optimizing the performance of each individual database or table.
How to Shard
The core idea of sharding is to split data and then quickly locate and aggregate query results. Sharding can be performed along two dimensions: vertical (by business) and horizontal (by data range or hash).
Below we use an order‑related business example to illustrate vertical and horizontal splitting.
Vertical Split
Vertical split includes vertical database sharding and vertical table sharding.
1. Vertical Database Split
Vertical database split is easy to understand: each business gets its own dedicated database (专库专用, "a dedicated database for a dedicated purpose"). Tables related to orders, payments, coupons, points, and so on are placed in separate databases. Developers cannot directly access other business databases; instead, they must go through API interfaces, which is the initial form of microservices.
Vertical splitting improves performance to some extent but does not solve the problem of a single table becoming too large, so it often needs to be combined with horizontal splitting.
2. Vertical Table Split
Vertical table split is based on columns, i.e., splitting a large table into smaller tables by separating frequently accessed columns from large or rarely used columns.
For example, an order table can keep its frequently accessed fields (order amount, order number) in one table and move its large blob fields into a separate work_extend table. The split tables are then distributed across different databases.
Since databases load data by rows, after vertical splitting the core tables contain only high‑frequency, short‑length fields, allowing more data to be cached in memory, improving hit rate and reducing disk I/O.
Advantages:
- Business data is decoupled, enabling independent maintenance, monitoring, and scaling.
- In high-concurrency scenarios, database pressure is alleviated to some extent.

Disadvantages:
- Development complexity increases because cross-business data must be accessed via APIs.
- Distributed transaction management becomes harder.
- The single-table size problem remains; horizontal splitting is still required.
Horizontal Split
When vertical splitting cannot fully eliminate large‑scale data bottlenecks, horizontal splitting is needed to further improve performance.
1. Horizontal Database Split
Horizontal database split distributes the same table across multiple databases, each possibly on a different server, achieving horizontal scalability.
This approach solves single‑database storage and performance limits, but introduces routing complexity.
For example, three databases orderDB_1, orderDB_2, and orderDB_3 each contain an identical order table. An order ID modulo 3 determines which database stores a particular order.
2. Horizontal Table Split
Within the same database, a large table is split into multiple identical tables, each holding a portion of the data.
For an order table with 9 million rows, it can be divided into order_1, order_2, and order_3, each storing 3 million rows.
Although horizontal table split reduces the size of a single table, all sub‑tables still reside on the same database instance, competing for the same CPU, memory, and network resources. To achieve true distributed scaling, the split tables must be placed on different database instances.
Advantages:
- Solves the problem of a single database becoming a bottleneck under high concurrency, improving system stability and load capacity.
- The amount of refactoring work required is moderate.

Disadvantages:
- Cross-shard transaction consistency is difficult to guarantee.
- Cross-database join queries perform poorly.
- Scaling and maintenance become complex, especially when thousands of sub-tables are created.
What are the Rules
The sharding rule referred to above is a routing algorithm that decides which database and which table a given row should reside in.
Common algorithms include the modulo algorithm and the range algorithm.
1. Modulo Algorithm
Taking the hash of a field and applying modulo N (where N is the number of database instances or sub‑tables) is the most common method.
For the order table, hash the work_no (order number) field and take it modulo N: remainder 0 routes to the first database, remainder 1 to the second, and so on.
Advantage:
- Data is spread relatively uniformly across shards, avoiding hot-spot concentration.

Disadvantage:
- If a node is added or removed, the modulo base N changes, so most existing keys route to a different database; a large data migration is required to restore consistency.
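The migration cost of changing the modulo base is easy to quantify. This sketch (not tied to any framework) counts how many keys switch databases when the shard count drops from 3 to 2:

```java
public class ModuloRebalance {
    // Count how many of the first n keys route to a different shard
    // when the shard count changes from oldBase to newBase.
    public static int remappedKeys(int n, int oldBase, int newBase) {
        int moved = 0;
        for (int key = 0; key < n; key++) {
            if (key % oldBase != key % newBase) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        // With 3 shards reduced to 2, 4 of every 6 consecutive keys move.
        System.out.println(remappedKeys(600, 3, 2)); // 400
    }
}
```

Four of every six consecutive keys land on a different shard, so roughly two thirds of the data would have to be migrated.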
2. Range Algorithm
Data can be split by time intervals or ID ranges. For example, a User table may store IDs 1-9999 in the first database, 10000-19999 in the second, and so on.
Advantages:
- Each shard's data volume is controllable.
- Horizontal scaling is simple: new nodes can be added without migrating existing data.
- Data location can be determined quickly.

Disadvantage:
- Hot spots may arise when a particular range, such as the most recent time interval, receives a surge of orders.
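Range routing can be sketched similarly; the block size of 10,000 IDs mirrors the User example above, and the names are illustrative:

```java
public class RangeRouter {
    // Each shard holds a contiguous block of 10,000 IDs:
    // IDs 1-9999 -> shard 0, 10000-19999 -> shard 1, and so on.
    private static final long RANGE_SIZE = 10_000L;

    // Locate a shard with a single division; no hashing needed.
    public static int routeShard(long userId) {
        return (int) (userId / RANGE_SIZE);
    }

    public static void main(String[] args) {
        System.out.println(routeShard(9999L));  // last ID of the first shard -> 0
        System.out.println(routeShard(10000L)); // first ID of the second shard -> 1
    }
}
```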
Challenges of Sharding
1. Distributed Transactions
Cross-database transactions are inevitable. Traditional two-phase or three-phase commit solutions have poor performance and high development cost. Many systems adopt eventual consistency with compensation mechanisms, for example using Alibaba's Seata framework.
2. Pagination, Sorting, Cross‑DB Joins
These common operations become painful after sharding because data must be aggregated from multiple shards before being presented to the user.
3. Distributed Primary Keys
Auto‑increment IDs lose their global uniqueness across shards, so a distributed ID generator is required.
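A common choice is a snowflake-style generator: a timestamp, a worker ID, and a per-millisecond sequence packed into one 64-bit long. The sketch below is a simplified illustration (the custom epoch and bit layout are assumptions, and it does not handle clock rollback):

```java
public class SnowflakeIdGenerator {
    // Layout: 41 bits of timestamp, 10 bits of worker ID, 12 bits of sequence.
    private static final long EPOCH = 1600000000000L; // custom epoch, illustrative
    private final long workerId; // must be unique per node, range 0..1023
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeIdGenerator(long workerId) {
        if (workerId < 0 || workerId > 1023) {
            throw new IllegalArgumentException("workerId must be in [0, 1023]");
        }
        this.workerId = workerId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & 0xFFF; // 12-bit sequence within one millisecond
            if (sequence == 0) {
                // Sequence exhausted for this millisecond: spin until the next one.
                while (now <= lastTimestamp) {
                    now = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << 22) | (workerId << 12) | sequence;
    }
}
```

IDs produced this way are globally unique across shards (as long as worker IDs are unique) and roughly time-ordered, which keeps B-tree inserts append-friendly.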
4. Read‑Write Separation
Most relational databases support master-slave replication for high availability. When combined with sharding, each shard typically has its own master for writes and slaves for reads, so read-write routing must be shard-aware.
5. Data Masking
Sensitive information (ID numbers, phone numbers, etc.) must be masked according to privacy rules.
Sharding Tools
Instead of reinventing the wheel, many mature sharding middleware solutions exist. Common choices include:
- sharding-jdbc (Dangdang; now part of Apache ShardingSphere)
- TSharding (Mogujie)
- Atlas (Qihoo 360)
- Cobar (Alibaba)
- MyCAT (based on Cobar)
- Oceanus (58.com)
- Vitess (Google)
Why Choose sharding-jdbc
sharding-jdbc is a lightweight Java framework delivered as a jar. It works as an enhanced JDBC driver on the client side, requiring no separate server deployment. Its strong compatibility with any JDBC-based ORM (JPA, Hibernate, MyBatis, Spring JDBC Template) and with all major connection pools (DBCP, C3P0, BoneCP, Druid, HikariCP) makes it easy to integrate.
- Minimal intrusion: no changes to business code or SQL are needed; only the sharding rules must be configured.
- Broad compatibility with various databases and connection pools.
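To give a feel for the "table configuration" mentioned above, here is a sketch of Spring Boot properties in the style of ShardingSphere-JDBC 4.x; the data-source and table names are assumptions, and exact property keys vary by version:

```yaml
# Two logical data sources (connection details omitted)
spring.shardingsphere.datasource.names=ds0,ds1
# t_order is spread across 2 databases x 3 tables
spring.shardingsphere.sharding.tables.t_order.actual-data-nodes=ds$->{0..1}.t_order_$->{0..2}
# Route databases by user_id, tables by order_id
spring.shardingsphere.sharding.tables.t_order.database-strategy.inline.sharding-column=user_id
spring.shardingsphere.sharding.tables.t_order.database-strategy.inline.algorithm-expression=ds$->{user_id % 2}
spring.shardingsphere.sharding.tables.t_order.table-strategy.inline.sharding-column=order_id
spring.shardingsphere.sharding.tables.t_order.table-strategy.inline.algorithm-expression=t_order_$->{order_id % 3}
```

With rules like these in place, the application issues plain SQL against the logical t_order table, and the driver rewrites and routes it to the physical shards.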
Summary
This article reviewed the basic concepts of database sharding. In upcoming posts, I will demonstrate how sharding-jdbc implements each of these capabilities in a real-world project.
macrozheng
Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.