Databases 17 min read

How We Scaled a High‑Volume Order System: Sharding, Migration, and Zero‑Downtime Strategies

This article details the end‑to‑end process of expanding a high‑traffic ride‑hailing order system on Alibaba Cloud, covering capacity planning, database sharding from 256 to 4096 tables, custom data‑sync middleware, validation, repair, and a gray‑release migration that achieved zero incidents.

Alibaba Cloud Developer

Mar 1, 2021

How We Scaled a High‑Volume Order System: Sharding, Migration, and Zero‑Downtime Strategies

Background

In 2020 the Gaode Taxi order system was migrated and sharded on Alibaba Cloud. Services ran on ECS, databases on RDS, configuration via ACM, and data sync used Alibaba DTS combined with custom sharding and distributed ID components.

Note: "弹内" and "弹外" refer to two independent elastic‑compute network environments, internal (Alibaba production network) and external (Alibaba public cloud).

Capacity Planning

The original setup had 4 instances, 4 databases, each with 64 tables (256 tables total). Some tables already exceeded ten million rows and were projected to reach over a hundred million rows per year, causing performance concerns.

Current resources: 4 instances (16C/64G/3T SSD), 4 databases, 64 tables per database. Table space usage was assessed via RDS diagnostics.

To handle growth, the plan was to double instances to 8 (still 16C/64G/3T SSD), increase each instance to 4 databases with 128 tables each (though best practice suggests 32‑64 tables per database). Future scaling could involve upgrading instance specs (e.g., 16C/128G) or adding more instances and migrating databases.

With 4096 tables the system can support 5 × 10⁷ orders per day for three years; 32 databases each with 128 tables allow scaling up to 32 instances without rehash.

Data Migration

Expanding sharding required rehashing from 256 to 4096 tables, which Alibaba DTS cannot handle directly because it only supports same‑schema migrations. Therefore a custom middleware was built that leveraged DTS binlog‑to‑Kafka capability to perform data synchronization.

The migration workflow includes preparation, data sync (full historical sync, real‑time incremental sync, rehash), data validation (full and incremental checks), and data repair.

Preparation

All tables were audited to ensure a unique business ID and proper unique indexes. Tables lacking unique indexes were updated (often with composite unique keys). Since DTS uses auto‑increment IDs for deduplication, a custom approach using even/odd IDs was previously employed, but the new bidirectional sync component avoids this issue.

Sharding Rule Review

Each table’s sharding strategy was examined (user‑ID based, non‑user‑ID, database‑only, etc.) to define rehash and validation logic.

Data Sync

Data sync is based on binlog streaming: DTS publishes binlog to Kafka, the data‑sync component consumes, filters, merges, rehashes, batches, and writes to the target database without touching business code.

Loop message filtering: removes duplicate binlog loops.

Data merging: retains only the latest operation per record.

Update‑to‑insert conversion: ensures batch inserts succeed.

Batching by target table improves insert efficiency.

Key challenges addressed:

Message order – ensured a single consumer per topic to preserve binlog order.

Message loss – offsets are stored in MySQL after successful writes; on restart consumption resumes from the stored offset.

Full‑field binlog guarantees no field loss.

Rehash

The order ID embeds the last four digits of the user ID. During migration, the middleware computes userId % 4096 to locate the new database/table and userId % 256 for the old layout.

To prevent infinite binlog loops, a data‑coloring technique is used: a transaction table tb_transaction records a status flag (0 = normal, 1 = sync‑generated). The middleware ignores binlog events generated while status=1.

# Begin transaction to ensure atomicity
start transaction;
set autocommit = 0;
-- Mark transaction start for coloring
update tb_transaction set status = 1 where tablename = ${tableName};
-- Business operations that generate binlog
insert xxx;
update xxx;
update xxx;
-- Mark transaction end
update tb_transaction set status = 0 where tablename = ${tableName};
commit;

Data Validation

The data‑check service compares rows between old and new databases.

Full validation: each row in the source is verified in the target (both directions).

Incremental validation: every five minutes the service checks recent changes and performs multi‑round diff; persistent mismatches trigger alerts.

Data Repair

Two repair strategies are employed:

For large‑scale inconsistencies, reset Kafka offsets and replay data to overwrite erroneous rows.

For small‑scale issues, the data‑check logs are parsed to generate corrective SQL statements.

Gray‑Release Switch

The final production cut‑over follows a gray‑release strategy:

Stop writes (seconds‑level) to avoid data lag.

Ensure all data has been synchronized.

Switch the data source.

Resume normal writes.

ABC Verification

Two additional database instances (B and C) are provisioned. DTS syncs from the original A to B (verification) while C serves as the new target. After successful validation, A and C are configured for bidirectional sync.

Gray‑Release Steps

Deploy two sets of sharding rules in code.

Configure gray‑ratio in ACM.

Intercept MyBatis requests, compute user‑ID modulo, compare with ACM ratio, and set a ThreadLocal flag for the new database.

Apply white‑list checks for privileged users.

The sharding component reads the flag and routes reads/writes accordingly.

During traffic shift, ACM‑driven gray users are paused for write‑stop.

Conclusion

The migration, spanning roughly two months, was completed without any production incidents, required no changes to business code, and demonstrated a reliable, zero‑downtime expansion of sharding and data migration for a high‑traffic order system.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Data Migration Sharding databases Zero Downtime cloud-computing capacity-planning

Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.