How We Scaled a High‑Volume Order System: Sharding, Migration, and Zero‑Downtime Strategies
This article details the end‑to‑end process of expanding a high‑traffic ride‑hailing order system on Alibaba Cloud, covering capacity planning, database sharding from 256 to 4096 tables, custom data‑sync middleware, validation, repair, and a gray‑release migration that achieved zero incidents.
Background
In 2020 the Gaode Taxi order system was migrated and sharded on Alibaba Cloud. Services ran on ECS, databases on RDS, configuration via ACM, and data sync used Alibaba DTS combined with custom sharding and distributed ID components.
Note: "弹内" and "弹外" refer to two independent elastic‑compute network environments, internal (Alibaba production network) and external (Alibaba public cloud).
Capacity Planning
The original setup had 4 instances, 4 databases, each with 64 tables (256 tables total). Some tables already exceeded ten million rows and were projected to reach over a hundred million rows per year, causing performance concerns.
Current resources: 4 instances (16C/64G/3T SSD), 4 databases, 64 tables per database. Table space usage was assessed via RDS diagnostics.
To handle growth, the plan was to double instances to 8 (still 16C/64G/3T SSD), increase each instance to 4 databases with 128 tables each (though best practice suggests 32‑64 tables per database). Future scaling could involve upgrading instance specs (e.g., 16C/128G) or adding more instances and migrating databases.
With 4096 tables the system can support 5 × 10⁷ orders per day for three years; 32 databases each with 128 tables allow scaling up to 32 instances without rehash.
Data Migration
Expanding sharding required rehashing from 256 to 4096 tables, which Alibaba DTS cannot handle directly because it only supports same‑schema migrations. Therefore a custom middleware was built that leveraged DTS binlog‑to‑Kafka capability to perform data synchronization.
The migration workflow includes preparation, data sync (full historical sync, real‑time incremental sync, rehash), data validation (full and incremental checks), and data repair.
Preparation
All tables were audited to ensure a unique business ID and proper unique indexes. Tables lacking unique indexes were updated (often with composite unique keys). Since DTS uses auto‑increment IDs for deduplication, a custom approach using even/odd IDs was previously employed, but the new bidirectional sync component avoids this issue.
Sharding Rule Review
Each table’s sharding strategy was examined (user‑ID based, non‑user‑ID, database‑only, etc.) to define rehash and validation logic.
Data Sync
Data sync is based on binlog streaming: DTS publishes binlog to Kafka, the data‑sync component consumes, filters, merges, rehashes, batches, and writes to the target database without touching business code.
Loop message filtering: removes duplicate binlog loops.
Data merging: retains only the latest operation per record.
Update‑to‑insert conversion: ensures batch inserts succeed.
Batching by target table improves insert efficiency.
Key challenges addressed:
Message order – ensured a single consumer per topic to preserve binlog order.
Message loss – offsets are stored in MySQL after successful writes; on restart consumption resumes from the stored offset.
Full‑field binlog guarantees no field loss.
Rehash
The order ID embeds the last four digits of the user ID. During migration, the middleware computes userId % 4096 to locate the new database/table and userId % 256 for the old layout.
To prevent infinite binlog loops, a data‑coloring technique is used: a transaction table tb_transaction records a status flag (0 = normal, 1 = sync‑generated). The middleware ignores binlog events generated while status=1.
# Begin transaction to ensure atomicity
start transaction;
set autocommit = 0;
-- Mark transaction start for coloring
update tb_transaction set status = 1 where tablename = ${tableName};
-- Business operations that generate binlog
insert xxx;
update xxx;
update xxx;
-- Mark transaction end
update tb_transaction set status = 0 where tablename = ${tableName};
commit;Data Validation
The data‑check service compares rows between old and new databases.
Full validation: each row in the source is verified in the target (both directions).
Incremental validation: every five minutes the service checks recent changes and performs multi‑round diff; persistent mismatches trigger alerts.
Data Repair
Two repair strategies are employed:
For large‑scale inconsistencies, reset Kafka offsets and replay data to overwrite erroneous rows.
For small‑scale issues, the data‑check logs are parsed to generate corrective SQL statements.
Gray‑Release Switch
The final production cut‑over follows a gray‑release strategy:
Stop writes (seconds‑level) to avoid data lag.
Ensure all data has been synchronized.
Switch the data source.
Resume normal writes.
ABC Verification
Two additional database instances (B and C) are provisioned. DTS syncs from the original A to B (verification) while C serves as the new target. After successful validation, A and C are configured for bidirectional sync.
Gray‑Release Steps
Deploy two sets of sharding rules in code.
Configure gray‑ratio in ACM.
Intercept MyBatis requests, compute user‑ID modulo, compare with ACM ratio, and set a ThreadLocal flag for the new database.
Apply white‑list checks for privileged users.
The sharding component reads the flag and routes reads/writes accordingly.
During traffic shift, ACM‑driven gray users are paused for write‑stop.
Conclusion
The migration, spanning roughly two months, was completed without any production incidents, required no changes to business code, and demonstrated a reliable, zero‑downtime expansion of sharding and data migration for a high‑traffic order system.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
