Comprehensive Guide to Database Sharding: Vertical & Horizontal Partitioning, Global ID Generation, and Distributed Transaction Challenges
This article explains the concepts, methods, advantages, and drawbacks of database sharding, covering vertical and horizontal partitioning, sharding rules, global primary key generation, distributed transaction issues, cross‑node queries, pagination, data migration, and practical case studies, with code examples.
1. Data Partitioning
Relational databases can become performance bottlenecks when a single table reaches massive size (e.g., 10 million rows or 100 GB). Sharding (data partitioning) reduces load by distributing data across multiple databases, shortening query time.
1.1 Vertical (Logical) Partitioning
Vertical partitioning comes in two forms: vertical database splitting and vertical table splitting.
Vertical database splitting places loosely coupled tables in separate databases, analogous to how services are split in a microservice architecture.
Vertical table splitting moves rarely used or large columns into an extension table, improving the memory cache hit rate and reducing disk I/O.
Advantages:
Reduces business coupling and clarifies responsibilities.
Enables independent scaling and monitoring of different business domains.
Improves I/O and connection limits under high concurrency.
Disadvantages:
Cross‑database joins become impossible; aggregation must be done via service calls.
Distributed transaction handling becomes more complex.
Large tables may still require horizontal splitting.
1.2 Horizontal (Physical) Partitioning
When vertical splitting is insufficient, horizontal partitioning splits a large table into multiple tables or databases based on a sharding key.
Two common rules:
Range based: Split by time or ID intervals (e.g., month, userId range).
Hash/modulo based: Use hash(key) % N to distribute rows evenly.
Advantages of range‑based sharding: easy to add new nodes without data migration. Drawbacks: uneven load if newer data is hotter.
Advantages of hash‑based sharding: balanced data and request distribution. Drawbacks: re‑hashing required for scaling.
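The two routing rules above can be sketched in a few lines. The shard counts, database names, and uid intervals below are illustrative assumptions, not a real deployment:

```python
# Hypothetical range table: each shard owns a half-open uid interval.
RANGE_SHARDS = [
    (0, 10_000_000, "db0"),
    (10_000_000, 20_000_000, "db1"),
    (20_000_000, 30_000_000, "db2"),
]

def route_by_range(uid: int) -> str:
    # Range-based rule: walk the interval table until one contains uid.
    for lo, hi, shard in RANGE_SHARDS:
        if lo <= uid < hi:
            return shard
    raise ValueError(f"uid {uid} outside configured ranges")

def route_by_hash(uid: int, n_shards: int = 4) -> str:
    # Hash/modulo rule: hash(key) % N; uid is already an integer,
    # so the modulo is applied directly.
    return f"db{uid % n_shards}"
```

Range routing needs only a new table entry to add a shard, while hash routing spreads load evenly but ties every row's location to N.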
2. Global Primary Key Generation
Per-table auto‑increment IDs collide across shards, so a globally unique scheme is needed. Common solutions:
UUID – simple but large and index‑unfriendly.
Dedicated sequence table (MyISAM) with a unique stub column to generate globally unique IDs.
Multiple ID‑generation servers with staggered auto‑increment steps.
Twitter Snowflake – 64‑bit IDs composed of timestamp, datacenter ID, worker ID, and sequence.
Meituan‑Dianping Leaf – a production‑grade ID service handling high availability and clock rollback.
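The Snowflake bit layout can be sketched as follows. The epoch constant is Twitter's published one; the clock-rollback handling that production services such as Leaf add is deliberately omitted here:

```python
import time

class Snowflake:
    """Sketch of the Snowflake 64-bit layout: 41-bit millisecond
    timestamp, 5-bit datacenter ID, 5-bit worker ID, 12-bit sequence.
    Clock-rollback protection is omitted for brevity."""

    EPOCH = 1288834974657  # Twitter's original epoch (2010-11-04)

    def __init__(self, datacenter_id: int, worker_id: int):
        assert 0 <= datacenter_id < 32 and 0 <= worker_id < 32
        self.datacenter_id = datacenter_id
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1

    def next_id(self) -> int:
        ms = int(time.time() * 1000)
        if ms == self.last_ms:
            self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit wrap
            if self.sequence == 0:           # 4096 IDs used this ms:
                while ms <= self.last_ms:    # spin to the next millisecond
                    ms = int(time.time() * 1000)
        else:
            self.sequence = 0
        self.last_ms = ms
        return ((ms - self.EPOCH) << 22) | (self.datacenter_id << 17) \
               | (self.worker_id << 12) | self.sequence
```

Because the timestamp occupies the high bits, IDs generated on one machine are roughly time-ordered, which keeps B-tree index inserts append-like.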
3. Problems Introduced by Sharding
3.1 Distributed Transaction Consistency
Updates spanning multiple shards require XA or two‑phase commit, increasing latency and conflict probability.
3.2 Cross‑Node Joins
Joins across shards are costly; solutions include global tables, field redundancy, data assembly in two steps, or ER‑based sharding that keeps related tables in the same shard.
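The "assemble in two steps" option can be sketched like this, with hypothetical in-memory tables standing in for the order shard and the user shard:

```python
# Illustrative stand-ins for two separate shards; the rows are made up.
order_db = [  # order shard stores only user_id, no user details
    {"order_id": 1, "user_id": 42, "amount": 30},
    {"order_id": 2, "user_id": 7,  "amount": 55},
]
user_db = {   # user shard, keyed by user_id
    42: {"name": "alice"},
    7:  {"name": "bob"},
}

def orders_with_names():
    # Step 1: read the orders from the order shard.
    orders = list(order_db)
    # Step 2: batch-fetch the referenced users and join in the
    # application, replacing the cross-shard JOIN.
    uids = {o["user_id"] for o in orders}
    users = {uid: user_db[uid] for uid in uids}
    return [dict(o, name=users[o["user_id"]]["name"]) for o in orders]
```

Batching the second lookup (one IN query per shard rather than one query per row) keeps the round-trip count bounded.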
3.3 Pagination, Sorting, and Aggregation
Cross‑shard pagination requires sorting each shard and merging results, which can be CPU‑ and memory‑intensive for large page numbers. Aggregation functions (MAX, SUM, COUNT) must be executed per shard and then combined.
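A minimal sketch of the merge step, assuming each shard already returns its rows sorted by the paging key (the sample rows are invented):

```python
import heapq

# Each shard's result set, pre-sorted by (create_time, ...) locally.
shard_a = [(1, "u1"), (4, "u4"), (9, "u9")]
shard_b = [(2, "u2"), (3, "u3"), (8, "u8")]

def cross_shard_page(shards, page: int, size: int):
    # To serve page N correctly, every shard must supply its first
    # (N+1)*size rows; the coordinator merges them and re-slices.
    # This is why deep pages get expensive: the fetch window grows
    # linearly with the page number on every shard.
    limit = (page + 1) * size
    per_shard = [s[:limit] for s in shards]
    merged = heapq.merge(*per_shard)  # streaming k-way merge
    return list(merged)[page * size : limit]
```

Aggregates follow the same shape: COUNT is summed across shards, MAX is the max of per-shard maxima, and AVG must be rebuilt from per-shard SUM and COUNT rather than averaging averages.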
3.4 Global Primary Key Collision
Strategies such as UUID, sequence tables, staggered auto‑increment, or Snowflake avoid collisions while providing scalability.
3.5 Data Migration & Scaling
When capacity limits are reached, historical data must be migrated to new shards. Range‑based sharding eases scaling; hash‑based sharding requires re‑hashing.
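The re-hashing cost can be made concrete: a row must move whenever its modulo assignment changes, so the fraction of rows that migrate depends on the old and new shard counts. A small sketch over a uniform key space (the key range is illustrative):

```python
def moved_fraction(keys, old_n: int, new_n: int) -> float:
    # A row migrates when k % old_n != k % new_n.
    moved = sum(1 for k in keys if k % old_n != k % new_n)
    return moved / len(keys)

keys = range(10_000)
# Going from 4 to 5 shards relocates 80% of the rows...
naive = moved_fraction(keys, 4, 5)      # 0.8
# ...while doubling (4 -> 8) moves exactly half, since k % 8 == k % 4
# for every key whose new high bit is zero.
doubled = moved_fraction(keys, 4, 8)    # 0.5
```

This is why hash-sharded systems usually scale by doubling the shard count (or adopt consistent hashing) rather than adding one node at a time.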
4. When to Consider Sharding
Table size exceeds practical limits (e.g., >10 GB, >10 M rows).
Backup or DDL operations cause unacceptable downtime.
High write/read concurrency leads to lock contention.
Specific fields become hot spots and benefit from vertical splitting.
Business growth demands horizontal scaling.
Separating workloads improves availability and isolates failures.
5. Sharding Middleware Recommendations
sharding‑jdbc (Dangdang)
TSharding (Mogujie)
Atlas (360)
Cobar (Alibaba)
MyCAT (based on Cobar)
Oceanus (58.com)
Vitess (Google)
6. Case Study: User Center
A typical user table (uid, login_name, passwd, sex, age, nickname, etc.) grows from 100 k to billions of rows. Vertical splitting isolates frequently updated fields (e.g., last_login_time) into a separate table, while horizontal sharding distributes users across databases using range or modulo on uid.
For non‑uid queries (login_name, email), a mapping table or hash‑based gene function maps the attribute to uid, enabling efficient routing.
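The gene-function idea can be sketched as follows. Everything here is an assumption chosen for illustration: the 3-bit gene width (supporting up to 8 shards), the MD5 hash, and the function names:

```python
import hashlib

GENE_BITS = 3  # supports up to 2**3 = 8 shards (illustrative choice)

def gene(login_name: str) -> int:
    # Stable hash of the login name; its low GENE_BITS bits are the gene.
    digest = hashlib.md5(login_name.encode()).digest()
    return digest[0] & ((1 << GENE_BITS) - 1)

def make_uid(seq: int, login_name: str) -> int:
    # Splice the gene into the uid's low bits at generation time,
    # so uid and login_name always route to the same shard.
    return (seq << GENE_BITS) | gene(login_name)

def shard_of_uid(uid: int) -> int:
    return uid & ((1 << GENE_BITS) - 1)

def shard_of_login(login_name: str) -> int:
    return gene(login_name)
```

Queries by uid and by login_name now resolve to the same shard without any extra lookup; the trade-off is that the gene width fixes the maximum shard count at uid-generation time.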
Operational analytics that require massive scans are off‑loaded to a separate backend service or data warehouse (e.g., Elasticsearch, Hive) to avoid impacting the user‑facing service.
Overall, sharding improves scalability, availability, and performance but introduces complexity in transaction management, query routing, and data migration.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.