Erasure Coding Practice in HDFS at Didi: Principles, Implementation, and Lessons Learned
Didi migrated HDFS to Hadoop 3.2 and implemented erasure coding—using XOR and Reed‑Solomon RS(6,3) striping—to replace three‑replica storage for cold data, building back‑ported clients, automated conversion tools, and cross‑datacenter backup pipelines, while addressing operational bugs and noting performance trade‑offs.
The article begins by noting that HDFS's default three‑replica scheme incurs 200% overhead in storage and write‑path network bandwidth (two extra copies of every block). For cold data, erasure coding (EC) offers a space‑efficient alternative while maintaining comparable fault tolerance.
It then explains the core EC algorithms: XOR codes, which use bitwise XOR to tolerate a single block loss, and Reed‑Solomon (RS) codes, which employ linear algebra to generate multiple parity blocks and tolerate several simultaneous failures. The RS(n,m) notation is introduced, along with the encoding and decoding process using a generator matrix.
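To make the XOR case concrete, here is a minimal sketch (not Didi's code, and much simpler than HDFS's actual codecs): the parity block is the bitwise XOR of all data blocks, and any single lost block is recovered by XOR‑ing the parity with the surviving blocks.

```java
// Minimal XOR-parity sketch: parity = d0 ^ d1 ^ d2 tolerates the loss
// of any ONE block, since the missing block equals the XOR of parity
// with the survivors. RS codes generalize this to multiple parities.
public class XorParityDemo {
    static byte[] xor(byte[] a, byte[] b) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = (byte) (a[i] ^ b[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] d0 = {1, 2, 3}, d1 = {4, 5, 6}, d2 = {7, 8, 9};
        byte[] parity = xor(xor(d0, d1), d2);        // encode
        byte[] rebuilt = xor(xor(parity, d0), d2);   // reconstruct lost d1
        System.out.println(java.util.Arrays.equals(rebuilt, d1)); // true
    }
}
```

RS codes replace the single XOR equation with a generator matrix over a Galois field, which is what lets RS(6,3) survive any three simultaneous block failures rather than one.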
Next, the HDFS EC block layout is contrasted with traditional contiguous block storage. EC adopts a striping layout where data and parity are distributed across blocks in a block group, illustrated with RS(6,3) examples.
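The striping arithmetic can be sketched as follows (an illustration, not HDFS source; it assumes the 1 MiB cell size of the built‑in RS‑6‑3‑1024k policy): logical bytes are written round‑robin in cell‑sized units across the six data blocks of a block group.

```java
// Sketch of RS(6,3) striping: which data block a logical offset lands
// in, and where inside that block. Assumes the 1 MiB default cell size.
public class StripeLayout {
    static final long CELL = 1024 * 1024; // 1 MiB cell (HDFS default)
    static final int DATA_BLOCKS = 6;     // RS(6,3): 6 data + 3 parity

    // Index (0..5) of the data block holding this logical offset:
    // cells rotate round-robin across the data blocks.
    static int blockIndex(long logicalOffset) {
        return (int) ((logicalOffset / CELL) % DATA_BLOCKS);
    }

    // Byte offset within that data block: one cell per completed stripe,
    // plus the position inside the current cell.
    static long offsetInBlock(long logicalOffset) {
        long stripe = logicalOffset / (CELL * DATA_BLOCKS);
        return stripe * CELL + (logicalOffset % CELL);
    }
}
```

This is why a striped read touches many DataNodes even for a modest range, in contrast to the contiguous layout where one block holds a long run of consecutive bytes.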
The article details Didi’s real‑world adoption: upgrading NameNode and DataNode to Hadoop 3.2, building a Hadoop 2‑compatible EC read/write client by back‑porting EC code, and developing the Anty system for automated conversion of replica files to EC, including a customized distcp that uses COMPOSITE_CRC for checksum consistency.
Additional tooling is described, such as an offline physical‑space calculator that handles the complex striping‑based size computation of EC files, and enhancements to FastCopy to support EC‑based inter‑cluster migration in a federated HDFS environment.
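The size computation such a calculator must perform can be sketched like this (an assumed model, not Didi's tool: RS(6,3) with 1 MiB cells, where each parity cell in a partial final stripe is as long as that stripe's largest data cell).

```java
// Sketch of the physical footprint of an RS(6,3) striped file.
// Full stripes contribute 6 data cells plus 3 full parity cells;
// a partial final stripe still needs 3 parity cells, each as long
// as its largest (first) data cell.
public class EcPhysicalSpace {
    static final long CELL = 1024 * 1024; // 1 MiB cell (HDFS default)
    static final int PARITY = 3;          // RS(6,3)
    static final long STRIPE = 6 * CELL;  // data bytes per full stripe

    static long physicalSize(long logicalSize) {
        long fullStripes = logicalSize / STRIPE;
        long remainder = logicalSize % STRIPE;
        long parityBytes = fullStripes * PARITY * CELL;
        if (remainder > 0) {
            parityBytes += PARITY * Math.min(remainder, CELL);
        }
        return logicalSize + parityBytes;
    }
}
```

For example, a 6 MiB file occupies 9 MiB physically (50% overhead), while a 512 KiB file occupies 2 MiB (300% overhead), which is why partial stripes make EC accounting trickier than simply multiplying by three.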
Two practice scenarios are presented: (1) converting cold offline data to EC storage, using a pipeline that schedules daily conversion via Anty, enforces a minimum average file size of 6 MB for RS(6,3), and tracks lifecycle in MySQL; (2) cross‑datacenter backup of core incremental data from machine room 01 to room 02, where the backup is written as EC files via Anty and custom distcp.
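A back‑of‑the‑envelope check (assuming the 1 MiB default cell size; this is my arithmetic, not a figure from the article) shows why a minimum average file size on the order of 6 MB matters for RS(6,3): a file that fills a full stripe pays only 50% parity overhead, while a 1 MiB file pays 300%, worse than three‑way replication's 200%.

```java
// Why small files are a poor fit for RS(6,3): parity overhead as a
// percentage of file size, assuming 1 MiB cells and parity cells sized
// to the largest data cell of a partial stripe.
public class EcOverhead {
    static final long MIB = 1024 * 1024;

    static double overheadPercent(long fileSize) {
        long fullStripes = fileSize / (6 * MIB);
        long rem = fileSize % (6 * MIB);
        long parity = fullStripes * 3 * MIB
                    + (rem > 0 ? 3 * Math.min(rem, MIB) : 0);
        return 100.0 * parity / fileSize;
    }

    public static void main(String[] args) {
        System.out.println(overheadPercent(6 * MIB)); // 50.0
        System.out.println(overheadPercent(MIB));     // 300.0
    }
}
```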
The article lists several operational issues encountered during the rollout—client hang‑ups, DataNode offline failures, dirty data during reconstruction, and standby NameNode memory leaks—referencing relevant JIRA issues and the patches or community fixes applied.
Finally, it concludes that while EC introduces some read/write performance trade‑offs and is unsuitable for small files, ongoing community optimizations and Didi’s alignment with upstream HDFS progress promise broader adoption and cost savings.
Didi Tech
Official Didi technology account