
Engineering Wisdom Behind High‑Availability Architecture for E‑Commerce Storage Layers

The article analyzes how to design a high‑availability architecture for large‑scale e‑commerce systems, detailing layered risk isolation, stateful storage strategies for flow and state data, unified document‑ID routing, multi‑replica databases, multi‑datacenter synchronization, and real‑world JD case studies that demonstrate elastic scaling and disaster recovery.

JD Tech

This article examines the design of a high‑availability (HA) architecture for e‑commerce systems, emphasizing the construction of HA for the stateful storage layer.

HA Architecture Paradigm

The core goal of HA is to keep services running despite hardware failures, software bugs, or network interruptions, minimizing downtime while preserving business continuity and data consistency. Achieving this typically involves layered risk isolation, redundant copies of data for disaster recovery, and failover mechanisms.

Layered System Overview

Frontend layer: Uses CDN or edge caching to serve static, stateless resources; redundancy improves both performance and disaster recovery.

Gateway layer: Provides load balancing and request forwarding. It is stateless and needs rate limiting and circuit breaking to prevent cascading failures.

Service layer: Micro‑service architecture with multiple instances per service; services communicate synchronously or asynchronously and remain stateless.

Storage layer: Supplies relational, NoSQL, and search storage; employs sharding for throughput and master‑slave replication for disaster recovery. It is the only stateful layer and must address throughput, read/write performance, node‑failure isolation, replication lag, and rapid recovery from backup.

Characteristics of E‑Commerce Business Data

E‑commerce generates two main data categories:

Document‑type (flow) data: Orders, payment records, logistics records, etc. These are generated sequentially without inter‑record dependencies, forming a high‑throughput, flow‑type workload.

State data: User profiles, product information, inventory, coupons, etc. These are read‑heavy with occasional writes that must be strongly consistent.

In summary, document data is write‑dominant with a high creation‑to‑update ratio, while state data is read‑dominant and requires strong consistency in certain business scenarios.

3.1 Flow‑Data HA Upgrade

The primary objective is business‑transparent storage scaling and unified disaster recovery across the entire link. Because flow data has no dependencies, new records can be written directly to a newly provisioned database when capacity is insufficient or a failure occurs, enabling seamless scaling or failover.

Two key upgrades are required:

Unified document‑ID generation rule: Each document receives a unique ID that embeds routing information indicating the target database.

Routing databases based on document ID: The embedded routing info directs the record to the appropriate storage node.

During runtime, the system can dynamically change the ID generation strategy to route new flow records to a new database, achieving elastic scaling and disaster recovery without affecting the business.
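The two upgrades can be illustrated with a minimal sketch. The ID layout used here (millisecond timestamp, a four‑digit sequence, and a two‑digit database index as suffix) and the names `DocumentIdGenerator`, `switch_target`, and `route` are assumptions made for illustration; the article does not describe JD's actual ID format.

```python
import itertools
import time


class DocumentIdGenerator:
    """Generates IDs shaped as <timestamp_ms><seq:4 digits><db index:2 digits>.

    The layout is hypothetical; the only property that matters is that the
    target database can be read back from the ID itself.
    """

    def __init__(self, active_db: int) -> None:
        self._seq = itertools.count()
        self.switch_target(active_db)

    def switch_target(self, db_index: int) -> None:
        """Point all newly generated IDs at another database."""
        if not 0 <= db_index <= 99:
            raise ValueError("db_index must fit in two digits")
        self.active_db = db_index

    def next_id(self) -> str:
        ts = int(time.time() * 1000)
        seq = next(self._seq) % 10000
        return f"{ts}{seq:04d}{self.active_db:02d}"


def route(doc_id: str) -> int:
    """Recover the target database index from the last two digits of the ID."""
    return int(doc_id[-2:])


gen = DocumentIdGenerator(active_db=1)
old_id = gen.next_id()   # routed to database 1
gen.switch_target(2)     # scale-out or failover: new records go elsewhere
new_id = gen.next_id()   # routed to database 2
```

Because the route is carried inside the ID, records created before the switch still resolve to their original database, so no lookup table has to be migrated when the strategy changes.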

3.2 State‑Data HA Exploration

State data is divided into two sub‑categories:

Read‑many‑write‑few data (e.g., product, inventory, user info): Implement a one‑write‑many‑read architecture in which writes go to the database and reads are served primarily from a cache that is synchronized in real time. A cache node failure triggers master‑slave failover within the cache; a database failure triggers a master‑slave switchover.

Strongly consistent read‑write data (e.g., coupons, red packets): Both reads and writes must be strongly consistent. Use sharding plus isolation of undecided data, backed by master‑slave replication, and apply black‑list routing during failover to avoid dirty writes.
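The one‑write‑many‑read pattern for read‑many‑write‑few data can be sketched as follows. This is an in‑memory illustration only: the class `StateStore` is hypothetical, the dictionaries stand in for a real database and cache cluster, and the synchronous cache update stands in for whatever real‑time synchronization mechanism is used. Falling back to the database when the cache is unavailable is an assumption of this sketch, not something the article states.

```python
class StateStore:
    """One-write-many-read sketch: the database is the source of truth,
    and reads are served from the cache whenever possible."""

    def __init__(self) -> None:
        self.db = {}                 # stand-in for the master database
        self.cache = {}              # stand-in for the cache cluster
        self.cache_available = True  # flipped to False to simulate an outage

    def write(self, key, value) -> None:
        # Writes always go to the database first.
        self.db[key] = value
        # Synchronous stand-in for real-time database-to-cache sync.
        if self.cache_available:
            self.cache[key] = value

    def read(self, key):
        # Fast path: serve from cache.
        if self.cache_available and key in self.cache:
            return self.cache[key]
        # Fallback (assumption): read the database and repopulate the cache.
        value = self.db.get(key)
        if value is not None and self.cache_available:
            self.cache[key] = value
        return value


store = StateStore()
store.write("sku-1001:stock", 25)
cached = store.read("sku-1001:stock")     # served from cache
store.cache_available = False
fallback = store.read("sku-1001:stock")   # served from the database
```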

3.3 Database Multi‑Replica HA Assurance

Each storage node employs a three‑replica (one master, two slaves) setup distributed across three availability zones for risk isolation. Replication uses semi‑synchronous mode: a transaction is considered committed only after the log reaches at least one slave, ensuring data durability even if the master fails. HA components monitor health and perform rapid master election and topology reconstruction to achieve sub‑second failover.
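The semi‑synchronous commit rule and the election step can be modeled with a small simulation. The classes `Replica` and `Cluster` are hypothetical, the zone names are placeholders, and electing the healthy slave with the longest log is an assumption; the article does not name the HA component or its election policy.

```python
class Replica:
    def __init__(self, name: str, zone: str) -> None:
        self.name = name
        self.zone = zone
        self.log = []
        self.healthy = True

    def append(self, entry) -> bool:
        """Return True if this replica received and stored the entry."""
        if not self.healthy:
            return False
        self.log.append(entry)
        return True


class Cluster:
    def __init__(self, master: Replica, slaves: list) -> None:
        self.master = master
        self.slaves = slaves

    def commit(self, entry) -> bool:
        """Semi-sync: report success only if at least one slave has the entry."""
        self.master.log.append(entry)
        acks = sum(1 for s in self.slaves if s.append(entry))
        if acks < 1:
            # Simplification: undo the local write. Real systems differ in
            # how they handle a commit that no slave acknowledged.
            self.master.log.pop()
            return False
        return True

    def fail_over(self) -> Replica:
        """Promote the healthy slave with the longest log (assumed policy)."""
        candidates = [s for s in self.slaves if s.healthy]
        if not candidates:
            raise RuntimeError("no healthy slave to promote")
        new_master = max(candidates, key=lambda s: len(s.log))
        self.slaves = [s for s in self.slaves if s is not new_master]
        self.master = new_master
        return new_master


cluster = Cluster(
    Replica("node-a", "az-1"),
    [Replica("node-b", "az-2"), Replica("node-c", "az-3")],
)
cluster.commit("tx-1")
promoted = cluster.fail_over()  # the promoted node already holds tx-1
```

Since every acknowledged transaction exists on at least one slave, a slave promoted after the master fails can already hold all committed data, which is what makes the durability claim above possible.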

4 Multi‑Datacenter HA Construction

JD’s data centers in Beijing and Suqian operate a multi‑active architecture. Logical units are partitioned by user and deployed in both sites. Traffic for a unit stays entirely within the site that hosts it, and cross‑site switchover is performed one logical unit at a time.

Challenges include replication delay caused by cross‑city network latency and the risk of data loss or service interruption during cross‑site switchover. The solution routes newly created flow records to a brand‑new database in the target site, so writes do not wait for cross‑city replication, which guarantees 100% business continuity for new records. Updates to existing data account for less than 10% of traffic and replicate with second‑level latency, so their impact on continuity is minimal.
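A rough sketch of unit‑level switchover follows. The unit count, the CRC32 hash used to map users to units, the initial placement, and the database naming scheme are all assumptions for illustration; only the two site names and the idea of routing new flow records to a fresh database in the target site come from the text.

```python
import zlib

SITES = ("beijing", "suqian")


class UnitRouter:
    """Maps users to logical units and logical units to sites."""

    def __init__(self, unit_count: int = 4) -> None:
        self.unit_count = unit_count
        # Hypothetical initial placement: alternate units between the sites.
        self.unit_site = {u: SITES[u % 2] for u in range(unit_count)}
        # Incremented on every switch so that a fresh database is used.
        self.generation = {u: 0 for u in range(unit_count)}

    def unit_of(self, user_id: str) -> int:
        return zlib.crc32(user_id.encode("utf-8")) % self.unit_count

    def site_for(self, user_id: str) -> str:
        return self.unit_site[self.unit_of(user_id)]

    def db_for_new_record(self, user_id: str) -> str:
        unit = self.unit_of(user_id)
        return f"{self.unit_site[unit]}-flow-u{unit}-g{self.generation[unit]}"

    def switch_unit(self, unit: int, target_site: str) -> None:
        """Move one logical unit; new flow records get a brand-new database."""
        if target_site not in SITES:
            raise ValueError(f"unknown site: {target_site}")
        self.unit_site[unit] = target_site
        self.generation[unit] += 1


router = UnitRouter()
unit = router.unit_of("user-42")
other_site = "suqian" if router.site_for("user-42") == "beijing" else "beijing"
router.switch_unit(unit, other_site)
new_db = router.db_for_new_record("user-42")  # fresh database in other_site
```

Only the switched unit changes site; users in other units keep their existing routing, which matches the idea of switching at the logical‑unit level rather than moving a whole data center at once.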

5 Business HA Architecture Upgrade Cases

5.1 Delivery System Database Upgrade

In 2025 JD expanded into food delivery. The original system had a single‑point storage bottleneck. By redesigning document IDs and routing, and applying dual‑write with gray‑release, the storage was migrated to a distributed architecture within one month, achieving elastic scaling and disaster recovery without business impact.
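The article does not describe how dual‑write and gray release were combined, so the following is only one plausible shape for such a migration: every write goes to both stores, and a configurable percentage of keys is read from the new store. The class `DualWriteMigrator`, the CRC32 bucketing, and the `gray_percent` knob are hypothetical.

```python
import zlib


class DualWriteMigrator:
    """Writes to both stores; reads shift gradually to the new one."""

    def __init__(self, gray_percent: int = 0) -> None:
        self.old = {}                     # stand-in for the single-point store
        self.new = {}                     # stand-in for the distributed store
        self.gray_percent = gray_percent  # 0..100, share of keys read from new

    def write(self, key: str, value) -> None:
        # Dual-write keeps both stores populated during the migration.
        self.old[key] = value
        self.new[key] = value

    def read(self, key: str):
        # A stable hash bucket means a given key always takes the same path
        # for a given gray_percent.
        bucket = zlib.crc32(key.encode("utf-8")) % 100
        source = self.new if bucket < self.gray_percent else self.old
        return source.get(key)


migrator = DualWriteMigrator(gray_percent=0)
migrator.write("order-1", "created")
migrator.gray_percent = 100          # full cutover of reads
value_after_cutover = migrator.read("order-1")
```

Raising `gray_percent` in steps lets read traffic move to the new store gradually, and setting it back to 0 returns all reads to the old store if a problem appears.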

5.2 Core‑Link Document Data Unified Upgrade

Also in 2025, JD upgraded the core‑link document data architecture, unifying the document‑ID generation and routing rules, which enabled elastic expansion and unified disaster recovery for all core‑link services.

5.3 Payment System Multi‑Active Deployment

The payment system, with strict financial HA requirements, adopted the same flow‑data routing mechanism to achieve RPO = 0 and RTO < 10 s across Beijing and Suqian. New flow records are routed to a fresh database, ensuring uninterrupted service and zero data loss during cross‑city switchover.

Conclusion

In a full‑link HA architecture, the storage layer is the sole stateful component and thus determines overall business continuity, data reliability, and scalability. By applying unified document‑ID generation, ID‑based database routing, read‑many‑write‑few caching for state data, and multi‑replica, multi‑datacenter designs, JD has built a highly extensible, highly available distributed active‑active system.
