
How to Seamlessly Migrate Billions of Rows: A Practical Guide to Database Sharding and Sync

This article outlines a comprehensive, step‑by‑step approach for migrating massive tables—up to billions of rows—through sharding, dual‑write, data synchronization, validation, testing, and controlled switch‑over to ensure minimal impact on production services.

Qunhe Technology Quality Tech

Background

As business grows rapidly, the KuJiaLe database size has exploded, with some tables reaching the 20‑billion‑row scale. An int overflow incident highlighted the risks of a single massive table. Large single‑instance tables increase query pressure, lengthen read/write latency, and become bottlenecks for complex workflows. Sharding and, where necessary, integrating search engines like Elasticsearch are common mitigation strategies.

Solution Design

Data migration must address two core concerns: consistency of data synchronization and zero‑impact to online services. The migration process is divided into several stages: migration design, impact assessment, data sync & verification, switch operations, and code cleanup. Switches are used to control progress, allowing gradual rollout and quick rollback if issues arise.

1. Deploy code with switch support for dual‑write and read switching (already verified internally).

2. Enable the dual‑write switch to write to both old and new databases; the new DB continues auto‑increment IDs from the old DB.

3. Synchronize existing data from the old database to the new one.

4. Change the write order to prioritize the new database; old‑DB IDs are now sourced from the new DB.

5. Perform data‑sync verification and business testing.

6. Switch read operations to the new database.

7. After a stable observation period, stop writing to the old database, then remove the switch and migration code.
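The staged switch‑over above can be sketched as a small write router. This is a minimal illustration, not the internal implementation: `DualWriteRouter`, its `Stage` enum, and the in‑memory maps are hypothetical stand‑ins for the feature‑switch service and the two data stores.

```java
import java.util.HashMap;
import java.util.Map;

public class DualWriteRouter {
    /** Hypothetical switch stages driving the migration. */
    enum Stage { OLD_ONLY, DUAL_WRITE_OLD_FIRST, DUAL_WRITE_NEW_FIRST, NEW_ONLY }

    private final Map<Long, String> oldDb = new HashMap<>();
    private final Map<Long, String> newDb = new HashMap<>();
    private long oldSeq = 0; // stand-in for the old DB's auto-increment
    private long newSeq = 0; // stand-in for the new DB's sequence

    Stage stage = Stage.OLD_ONLY;

    /** Writes a row under the current switch stage; returns the row id. */
    public long write(String row) {
        switch (stage) {
            case DUAL_WRITE_OLD_FIRST: {
                long id = ++oldSeq;            // old DB assigns the id
                oldDb.put(id, row);
                newDb.put(id, row);            // new DB reuses it...
                newSeq = Math.max(newSeq, id); // ...and keeps its sequence aligned
                return id;
            }
            case DUAL_WRITE_NEW_FIRST: {
                long id = ++newSeq;            // new DB now assigns the id
                newDb.put(id, row);
                oldDb.put(id, row);            // old DB reuses it
                oldSeq = Math.max(oldSeq, id);
                return id;
            }
            case NEW_ONLY: {
                long id = ++newSeq;
                newDb.put(id, row);
                return id;
            }
            default: { // OLD_ONLY
                long id = ++oldSeq;
                oldDb.put(id, row);
                return id;
            }
        }
    }
}
```

Keeping the idle sequence aligned during dual‑write is what prevents primary‑key conflicts when the write order flips.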

Because the daily write volume is huge, incremental sync must already be in place before the bulk data migration starts.

It is crucial to deploy the migration code and switches first; the data sync itself is straightforward, but the ordering of change‑stream subscription and switch flips carries the highest risk.

Before migration, consolidate read/write traffic on the source tables to avoid inconsistencies, especially primary‑key conflicts during the switch.

Data Synchronization

Identify the primary key before syncing; for auto‑increment IDs, define the growth rule, preferably using a sequence table.
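One common way to implement such a sequence table is hi/lo‑style block allocation. The sketch below is an assumption about that scheme rather than the internal implementation: `tableNext` stands in for an atomic `UPDATE sequence SET next = next + :blockSize` against the sequence row, so only one database round trip is paid per block of ids.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Minimal sketch of hi/lo-style id allocation backed by a sequence table. */
public class SequenceAllocator {
    private final int blockSize;
    private final AtomicLong tableNext = new AtomicLong(1); // stand-in for the DB row
    private long current; // next id to hand out
    private long max;     // exclusive upper bound of the leased block

    public SequenceAllocator(int blockSize) {
        this.blockSize = blockSize;
        this.current = this.max = 0;
    }

    /** Returns the next id, leasing a fresh block from the "table" when exhausted. */
    public synchronized long nextId() {
        if (current >= max) {
            long start = tableNext.getAndAdd(blockSize); // one round trip per block
            current = start;
            max = start + blockSize;
        }
        return current++;
    }
}
```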

Two main sync approaches exist: custom code or framework‑based tools.

1. Bulk Data Sync via Code

Use cursors to batch‑select from the old DB, recompute the shard routing, and batch‑insert into the new DB. This method is slow for large volumes and unsuitable for tables with frequent updates. Example: MongoDB to HBase sync (illustrated below).

<code>/**
 * Save manual light configuration
 */
public Boolean saveLevelLightDataWithConfig(final String homeDataId, final String manualLightTemplateId, final LevelLightDataWithConfig levelLightDataWithConfig) {
    if (levelLightDataWithConfig == null) {
        return true;
    }
    final String rowKey = generateNewRowKey(homeDataId, manualLightTemplateId);
    LOG.message("saveLevelLightDataWithConfig - hbase log").with("rowKey", rowKey).info();
    try (Table table = connection.getTable(LEVEL_LIGHT_DATA_TABLE)) {
        Put put = new Put(Bytes.toBytes(rowKey));
        put.addColumn(LEVEL_LIGHT_DATA_COLUMN_FAMILY, LEVEL_LIGHT_DATA_COL_BYTES, MAPPER.writeValueAsBytes(levelLightDataWithConfig.getLevelLightData()));
        if (levelLightDataWithConfig.getManualLightConfig() != null) {
            put.addColumn(LEVEL_LIGHT_DATA_COLUMN_FAMILY, MANUAL_LIGHT_CONFIG_COL_BYTES, MAPPER.writeValueAsBytes(levelLightDataWithConfig.getManualLightConfig()));
        }
        table.put(put);
        return true;
    } catch (Exception e) {
        LOG.message("saveLevelLightDataWithConfig - failed", e).with("rowKey", rowKey).error();
        return false;
    }
}
</code>
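The save method above is typically driven by a cursor loop. Below is a minimal sketch of keyset‑cursor batching; `Source`, `Target`, and `Row` are hypothetical stand‑ins for the old store, the new store, and a row keyed by its primary key.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSync {
    /** A row keyed by an auto-increment primary key. */
    public static class Row {
        final long id;
        final String data;
        public Row(long id, String data) { this.id = id; this.data = data; }
    }

    /** Hypothetical paged access to the old table, ordered by primary key. */
    public interface Source { List<Row> selectAfter(long lastId, int limit); }

    /** Hypothetical bulk writer for the new table. */
    public interface Target { void batchInsert(List<Row> rows); }

    /**
     * Keyset-cursor copy: each page is the equivalent of
     * "SELECT ... WHERE id > :lastId ORDER BY id LIMIT :n",
     * so per-batch cost stays flat regardless of table size.
     * Returns the number of rows copied.
     */
    public static long syncAll(Source src, Target dst, int batchSize) {
        long lastId = 0, copied = 0;
        while (true) {
            List<Row> batch = src.selectAfter(lastId, batchSize);
            if (batch.isEmpty()) break;
            dst.batchInsert(batch);                  // recompute shard routing here if needed
            lastId = batch.get(batch.size() - 1).id; // advance the cursor
            copied += batch.size();
        }
        return copied;
    }
}
```

Paging by primary key rather than `OFFSET` is what keeps this viable at billions of rows, though it still cannot keep up with tables under heavy concurrent update.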

2. Framework‑Based Sync

Open‑source tools such as Canal, Maxwell, DataX, or internal solutions like Vimur + Oceanus enable high‑throughput, heterogeneous data migration. These frameworks capture binary logs (e.g., MySQL binlog) and forward changes to targets like MySQL, Kafka, or Elasticsearch.

KuJiaLe also built a real‑time data pipeline called Oceanus, which pulls full and incremental data from Vimur (a MySQL change‑capture service) and publishes to downstream consumers.

3. Incremental Data Sync

Two techniques: dual‑write (application‑level writes to both DBs) and Change Data Capture (CDC). Dual‑write offers full control but introduces distributed transaction complexity. CDC captures source transaction logs and propagates changes with minimal latency, suitable for massive data volumes.

<code>@Async("mongoExecutor")
public void insert(final RenderSnapshotData renderSnapshotData) {
    if (mFeatureConfiguration.isActive(FeatureSwitchConstants.RENDER_SNAPSHOT_WRITE_TO_HBASE, null)) {
        // Stage 1: async dual‑write to HBase
        RENDER_SNAPSHOT_WRITE_EXECUTOR.addSubTask(mRenderSnapshotDataHBase, "insert",
                new ThreadTaskInfo(renderSnapshotData.getTaskId(), true), renderSnapshotData);
    }
    if (mFeatureConfiguration.isActive(FeatureSwitchConstants.RENDER_SNAPSHOT_NO_LONGER_WRITE_TO_MONGO, null)) {
        // Stage 3: stop writing to MongoDB
        return;
    }
    insert(renderSnapshotData, WriteConcern.ACKNOWLEDGED);
}
</code>
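For the CDC path, the consumer side reduces to applying ordered change events idempotently. The sketch below is illustrative only and does not mirror Canal's or Vimur's actual APIs: `ChangeEvent` and the position checkpoint are assumptions, with the checkpoint making replay after a restart safe.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal sketch of applying change-data-capture events to a target store. */
public class CdcApplier {
    public enum Op { INSERT, UPDATE, DELETE }

    /** Hypothetical stand-in for a parsed binlog entry. */
    public static class ChangeEvent {
        final Op op; final long id; final String row; final long position;
        public ChangeEvent(Op op, long id, String row, long position) {
            this.op = op; this.id = id; this.row = row; this.position = position;
        }
    }

    private final Map<Long, String> target = new HashMap<>();
    private long appliedPosition = -1; // checkpoint for replay after restart

    /** Applies an event only if it is newer than the checkpoint (idempotent replay). */
    public void apply(ChangeEvent e) {
        if (e.position <= appliedPosition) return; // already applied, skip
        switch (e.op) {
            case DELETE: target.remove(e.id); break;
            default:     target.put(e.id, e.row); // INSERT/UPDATE both upsert
        }
        appliedPosition = e.position;
    }

    public String get(long id) { return target.get(id); }
}
```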

Data Validation

1. Bulk Data Validation – The data‑check module compares each field between old and new databases, retrying or alerting on mismatches. Validation can be performed from the perspective of either the old or new DB.

If sync is implemented via code, validation can be embedded in the sync process. For framework‑based sync, total record counts are compared post‑job; any significant discrepancy triggers a re‑sync.
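A field‑by‑field bulk check can be sketched as follows. This is a simplification of the real data‑check module, with rows modeled as field maps and mismatched ids collected for re‑sync or alerting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Minimal sketch of bulk validation from the old DB's perspective. */
public class BulkValidator {
    /** Returns ids whose field maps differ (or are missing) in the new store. */
    public static List<Long> diff(Map<Long, Map<String, String>> oldDb,
                                  Map<Long, Map<String, String>> newDb) {
        List<Long> bad = new ArrayList<>();
        for (Map.Entry<Long, Map<String, String>> e : oldDb.entrySet()) {
            Map<String, String> candidate = newDb.get(e.getKey());
            if (!e.getValue().equals(candidate)) { // field-by-field equality
                bad.add(e.getKey());
            }
        }
        return bad;
    }
}
```

Validating from the new DB's perspective instead would catch rows that exist in the target but not the source, e.g. deletes missed during sync.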

2. Incremental Data Validation – API calls verify dual‑write correctness. Nightly jobs compare recent updates (e.g., last 5‑minute window) between old and new DBs, performing multiple diff checks before raising alerts.
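The multiple‑diff‑before‑alert pattern can be sketched as a retry loop. `diffIsEmpty` is a hypothetical supplier that recomputes the old‑vs‑new diff over the recent window on each call; retrying lets rows still in flight under dual‑write settle before an alert fires.

```java
import java.util.function.Supplier;

/** Minimal sketch of incremental validation with retries before alerting. */
public class IncrementalChecker {
    /**
     * Re-runs the window diff up to maxAttempts times.
     * Returns true if the diff comes back clean, false if an alert should fire.
     */
    public static boolean verify(Supplier<Boolean> diffIsEmpty, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (diffIsEmpty.get()) return true; // consistent, stop early
            // in production: sleep/backoff here before the next diff pass
        }
        return false; // still inconsistent after retries -> alert
    }
}
```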

Testing and Verification

Testing occurs at each migration stage, aligned with switch status:

Data layer: verify consistency after DTS sync and during dual‑write.

Service layer: run API tests, possibly using traffic replay platforms.

Business layer: perform end‑to‑end regression on core workflows after each switch.

Performance testing: ensure response times improve (e.g., ES integration reduces latency to milliseconds) and conduct load tests after each major change.

Deployment and Switch Operations

In internal test environments, execute the full migration sequence to expose issues early:

Enable dual‑write switch → internal data sync → internal validation → switch read to new DB → write‑only to new DB → remove switches and migration code.

In pre‑production, proceed cautiously because the same storage is shared with production.

Production switch‑over is the riskiest step; gradual gray releases are preferred, though often only supported at the feature layer.

Conclusion

Because business stability is paramount, the migration spans a long period, with careful observation after each phase. Limited migration windows demand the most reliable plan, with every detail addressed, from preparation and sync testing to gray‑release strategies, to keep online services uninterrupted.

Tags: dual write · Sharding · Data Synchronization · Database Migration · CDC