Designing High‑Reliability Storage Systems: Strategies from JD Cloud & Intel

An in‑depth look at how JD Cloud’s high‑reliability storage architecture tackles data reliability challenges. It covers replica management, redundancy, detection and repair mechanisms, tiered storage designs, and Intel Optane’s role in boosting performance, and it offers practical strategies for balancing cost and resilience.

JD Cloud Developers

Maintaining High Data Reliability: Challenges and Solutions

Data reliability is the baseline requirement for a storage system: once every copy of a piece of data is lost, it cannot be recovered. The main challenges in maintaining high reliability are replica management and disk failures.

Replica Issues

Key challenges include controlling the number of data replicas to tolerate failures while minimizing redundancy, and addressing replica data corruption caused by hardware faults, software bugs, or operational errors.

Hardware failures (disk, head, network) cause data corruption.

Software bugs in write/storage processes cause errors or loss.

Operational mistakes (user or admin) lead to accidental deletion.

Disk failures are inevitable; detecting failures promptly and repairing with healthy replicas is essential to avoid total data loss.
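The idea of repairing from healthy replicas can be illustrated with a minimal sketch. It assumes each block has a CRC32 recorded at write time and that replicas are plain byte strings held in a list; the function names `find_healthy` and `repair` are hypothetical and are not taken from JD Cloud's system.

```python
import zlib

def find_healthy(replicas, expected_crc):
    """Return indices of replicas whose bytes still match the recorded CRC32."""
    return [i for i, data in enumerate(replicas) if zlib.crc32(data) == expected_crc]

def repair(replicas, expected_crc):
    """Overwrite every corrupted replica with a copy of a healthy one.

    Returns the indices that were repaired; raises if no healthy copy remains.
    """
    healthy = find_healthy(replicas, expected_crc)
    if not healthy:
        raise RuntimeError("all replicas corrupted: data is lost")
    source = replicas[healthy[0]]
    repaired = []
    for i in range(len(replicas)):
        if i not in healthy:
            replicas[i] = source
            repaired.append(i)
    return repaired

original = b"block-0001 payload"
crc = zlib.crc32(original)
replicas = [original, b"block-0001 pay\x00oad", original]  # replica 1 is damaged
repaired = repair(replicas, crc)
```

The `RuntimeError` branch is the case the article warns about: if detection is too slow and the last healthy replica fails first, nothing is left to repair from.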

Redundancy Concepts

Two common redundancy forms:

Replication in the broad sense (e.g., RAID, three‑copy, erasure coding (EC)) keeps redundant data online, so it can serve reads and writes in real time.

Backup records all operations and data, allowing restoration to any point in time, but with slower read/restore performance.

Combining replication with backup reduces the risk of data loss, but shortening the time needed to detect and repair faults remains crucial.
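The point‑in‑time property of backup can be sketched as a replay of an operation log. This is only an illustration under assumed conventions: the record layout `(timestamp, op, key, value)` and the function name `restore` are hypothetical.

```python
def restore(log, as_of):
    """Rebuild state by replaying records with timestamp <= as_of, in time order."""
    state = {}
    for ts, op, key, value in sorted(log, key=lambda record: record[0]):
        if ts > as_of:
            break
        if op == "put":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state

log = [
    (1, "put", "a", "v1"),
    (2, "put", "b", "v1"),
    (3, "put", "a", "v2"),
    (4, "delete", "b", None),
]

before_delete = restore(log, as_of=3)  # "b" is still present here
latest = restore(log, as_of=4)
```

Because the restore walks the log record by record, its cost grows with the amount of history, which is consistent with backup being slower to read and restore than online replicas.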

Detection Methods

Use CRC checks to quickly identify corrupted data across client, network, and disk.

Perform regular consistency checks among local data and across replicas, including spot checks for newly written data.
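As a rough illustration of an end‑to‑end CRC check, the sketch below has the client attach a CRC32 to each payload and has any later stage (after the network hop, or after reading back from disk) recompute it. The 4‑byte big‑endian framing and the names `frame` and `verify` are assumptions made for this example, not a description of JD Cloud's wire format.

```python
import struct
import zlib

def frame(payload: bytes) -> bytes:
    """Client side: prefix the payload with its CRC32 (4 bytes, big-endian)."""
    return struct.pack(">I", zlib.crc32(payload)) + payload

def verify(framed: bytes) -> bytes:
    """Receiver or disk-read side: recompute the CRC and reject mismatches."""
    (stored,) = struct.unpack(">I", framed[:4])
    payload = framed[4:]
    if zlib.crc32(payload) != stored:
        raise ValueError("CRC mismatch: data corrupted in transit or at rest")
    return payload

message = frame(b"hello, blob")
payload = verify(message)  # passes: nothing changed between frame() and verify()
```

Carrying the checksum with the data means the same check can run at every hop, so a corruption is caught at the first stage that reads the damaged bytes.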

Repair Strategies

Fast recovery: use additional disks and bandwidth to restore failed replicas quickly.

Soft delete: a recycle‑bin mechanism that allows mistakenly deleted data to be recovered.
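A recycle‑bin style soft delete might look like the following sketch. The class name `SoftDeleteStore`, the retention window, and the explicit `now` arguments (used here instead of a real clock so the behaviour is deterministic) are all assumptions for illustration.

```python
class SoftDeleteStore:
    """Key-value store where delete() moves data to a recycle bin."""

    def __init__(self, retention_seconds):
        self.retention = retention_seconds
        self.live = {}
        self.trash = {}  # key -> (value, deleted_at)

    def put(self, key, value):
        self.live[key] = value

    def delete(self, key, now):
        """Remove the key from the live set but keep its value in the bin."""
        self.trash[key] = (self.live.pop(key), now)

    def undelete(self, key):
        """Move a binned value back to the live set."""
        value, _ = self.trash.pop(key)
        self.live[key] = value

    def purge(self, now):
        """Permanently drop entries that have exceeded the retention window."""
        expired = [k for k, (_, t) in self.trash.items() if now - t >= self.retention]
        for k in expired:
            del self.trash[k]
        return expired

store = SoftDeleteStore(retention_seconds=3600)
store.put("report.csv", b"rows")
store.delete("report.csv", now=0)
store.undelete("report.csv")  # recovered before any purge ran
```

Space is only reclaimed in `purge`, so the retention window is the trade‑off between storage cost and how long a mistaken delete stays recoverable.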

Cost‑Reliability Balance

JD Cloud classifies data into three categories to tailor reliability solutions:

Low‑frequency update data (e.g., object storage) – large volume, infrequent changes.

Hot data (e.g., cloud disk) – smaller volume, high performance demand.

Metadata – critical but small; requires maximum reliability regardless of cost.

Different storage architectures are applied to each class, with a three‑layer design: Blob storage (massive data, three‑copy or EC), metadata storage, and business data services (object, file, block storage).

Blob Storage Design

Goals: support massive capacity at ultra‑low cost while ensuring high availability and reliability. Key points:

Append‑only system – no in‑place modifications.

Backend selects write locations to avoid delayed replicas.

Large cluster scale.

Replication uses three copies; writes succeed with two copies. Replication groups of 50‑100 GB balance repair time and cluster size.
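The "three copies, succeed with two" rule can be sketched as a simple quorum write. Replicas are modelled as callables that either store the data or raise `OSError`; `quorum_write` and `make_replica` are hypothetical helpers, and the sequential loop is a simplification (a production system would presumably write in parallel and repair the lagging copy afterwards).

```python
def make_replica(store, healthy=True):
    """Build a write function that appends to `store`, or fails if unhealthy."""
    def write(data):
        if not healthy:
            raise OSError("replica unavailable")
        store.append(data)
    return write

def quorum_write(replicas, data, required_acks=2):
    """Send `data` to every replica; report success if enough of them confirm."""
    acks = 0
    for write in replicas:
        try:
            write(data)
            acks += 1
        except OSError:
            continue
    return acks >= required_acks

disk_a, disk_b, disk_c = [], [], []
replicas = [make_replica(disk_a), make_replica(disk_b, healthy=False), make_replica(disk_c)]
ok = quorum_write(replicas, b"chunk-42")  # two of three replicas accept the write
```

Requiring two acknowledgements rather than three keeps one slow or failed replica from blocking the write, at the price of a short window in which only two copies exist.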

Repair Time Optimization

Three measures:

The system tolerates two replica failures through intelligent write placement.

Repair uses 60 MB/s of I/O, letting disks devote high bandwidth to recovery.

Replication group management is decoupled into scalable Allotter services.
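To see why the 50‑100 GB group size and the 60 MB/s repair rate matter together, a back‑of‑the‑envelope calculation helps. The sketch assumes one replication group is re‑copied as a single stream at the full 60 MB/s and uses decimal units (1 GB = 1000 MB); the source does not say whether the figure is per disk or per group, so treat the numbers as illustrative.

```python
def repair_seconds(group_gb, bandwidth_mb_s=60):
    """Time to re-copy one replication group at a fixed repair bandwidth."""
    return group_gb * 1000 / bandwidth_mb_s

for size_gb in (50, 100):
    minutes = repair_seconds(size_gb) / 60
    print(f"{size_gb} GB at 60 MB/s: {minutes:.1f} min")
# prints:
# 50 GB at 60 MB/s: 13.9 min
# 100 GB at 60 MB/s: 27.8 min
```

Under these assumptions a single group is restored in well under half an hour, which bounds the window in which a second or third failure in the same group could cause data loss.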

Single‑node repair includes internal CRC checks, background consistency verification, and business‑level consistency checks between block data and metadata.

Meta and Cloud Disk Reliability

Metadata is split into immutable stored data (handled like Blob) and incremental data stored on higher‑grade hardware with backups. Cloud disks separate hot and cold data: hot data receives premium hardware, cold data follows Blob‑style handling.

From Design to Production

Key practices include gray releases (staged rollouts) that confine bugs to a small part of the system, and soft‑delete mechanisms that protect against accidental loss.

Intel Optane’s Role in Enhancing JD Cloud

Intel’s 3D XPoint technology underpins two product lines: Optane DC Persistent Memory (DCPMM), which bridges the memory capacity gap, and Optane SSDs, which address the storage performance gap.

In JD Cloud, Optane accelerates, caches, and tiers data (ACT):

Accelerate: speed up reads and writes of hot data.

Cache: add Optane SSDs as a cache for faster I/O.

Tier: place hot data on Optane and warm/cold data on traditional NAND SSDs.
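The tiering rule can be sketched as a placement decision driven by access counts. The class name `TierPlanner`, the use of a simple hit counter, and the threshold value are assumptions for illustration; the article does not describe how JD Cloud classifies data as hot.

```python
from collections import Counter

class TierPlanner:
    """Track accesses per block and suggest a device class for each one."""

    def __init__(self, hot_threshold):
        self.hot_threshold = hot_threshold
        self.hits = Counter()

    def record_access(self, block_id):
        self.hits[block_id] += 1

    def placement(self):
        """Map each seen block to 'optane' (hot) or 'nand' (warm/cold)."""
        return {
            block: ("optane" if count >= self.hot_threshold else "nand")
            for block, count in self.hits.items()
        }

planner = TierPlanner(hot_threshold=3)
for block in ["b1", "b1", "b1", "b2"]:
    planner.record_access(block)
plan = planner.placement()
```

Here `b1` reaches the threshold and is assigned to the Optane tier, while `b2` stays on NAND.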

Four benefits of Optane in ACT scenarios:

Higher IOPS – mixed read/write performance up to three times that of NAND SSDs.

Longer lifespan – up to 20× NAND SSD endurance.

Lower latency – over five times faster than comparable NAND SSDs.

Better QoS – no garbage collection, yielding consistent performance.

Deployments replace traditional NAND+SATA SSD combos with an Optane + NVMe NAND configuration, using Optane for logs and NAND for data, achieving >3× node performance, halving node count, and reducing total cost of ownership.
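One way to picture the "Optane for logs, NAND for data" split is a generic journaling pattern: writes are acknowledged once they reach the low‑latency log and are applied to the capacity device later. The article does not describe the actual write path, so the class `JournaledStore` below is only an assumed illustration, with in‑memory structures standing in for both devices.

```python
class JournaledStore:
    """Acknowledge writes from a fast journal, apply them to bulk storage later."""

    def __init__(self):
        self.journal = []  # stands in for the low-latency log device
        self.data = {}     # stands in for the high-capacity data device

    def write(self, key, value):
        """Return as soon as the record is appended to the journal."""
        self.journal.append((key, value))
        return True

    def flush(self):
        """Apply journaled records to the data device in order, then truncate."""
        for key, value in self.journal:
            self.data[key] = value
        applied = len(self.journal)
        self.journal.clear()
        return applied

    def read(self, key):
        """The newest journaled value wins over the flushed copy."""
        for k, v in reversed(self.journal):
            if k == key:
                return v
        return self.data.get(key)

node = JournaledStore()
node.write("vol1/blk7", b"old")
node.flush()
node.write("vol1/blk7", b"new")
latest = node.read("vol1/blk7")  # served from the journal before the next flush
```

In such a design the log device absorbs the small, latency‑sensitive writes, which is where a low‑latency, high‑endurance medium would matter most.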

Future collaborations aim to further integrate Optane and QLC NAND for performance gains and cost efficiency.

