
Implementing Erasure Coding in HDFS: Migration Strategy, Testing Framework, and Data Lifecycle Management

This article details JD's practical experience bringing HDFS erasure coding into production: the decision between upgrading and porting, the staged upgrade and rollback procedure, automated testing, a custom data‑lifecycle management system for hot‑warm‑cold data, and the data‑integrity safeguards that allowed significant storage cost reductions without sacrificing production reliability.

JD Tech

To reduce storage costs and improve efficiency, JD's HDFS team brought the erasure coding (EC) feature into production, developing a data lifecycle management system for automated hot‑warm‑cold data handling and building a three‑pronged data validation mechanism to ensure EC data correctness.

Under the RS‑3‑2‑1024k EC policy, a 200 MB file is striped across three data blocks and two parity blocks in 1 MB cells, so it occupies 5/3 of its logical size instead of 3×, a raw‑storage saving of roughly 45% compared to three‑replica storage while preserving redundancy.
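The saving follows directly from the storage overhead of each scheme. A quick check of the arithmetic (the function name is illustrative):

```python
def storage_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per logical byte under striped erasure coding."""
    return (data_units + parity_units) / data_units

# RS-3-2: each block group holds 3 data units plus 2 parity units.
ec = storage_overhead(3, 2)     # 5/3 ~ 1.67x raw storage
replication = 3.0               # classic three-replica HDFS
saving = 1 - ec / replication   # 4/9 ~ 44.4%, rounded to ~45% above
print(f"EC overhead: {ec:.2f}x, saving vs 3x replication: {saving:.1%}")
```

The same formula shows why wider schemes such as RS‑6‑3 save even more (overhead 1.5× instead of 1.67×), at the cost of touching more DataNodes per read.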

Faced with the choice of upgrading the existing codebase or porting EC, the team selected porting due to extensive customizations on the 2.7.1 branch and the large number of patches required for a direct upgrade.

Porting principles included module‑wise migration, preserving community code style, omitting unnecessary code, maintaining interface compatibility, migrating all test cases, and clearly marking TODOs for future work.

Quality assurance involved extensive automated integration testing using Ansible for cluster provisioning, pytest for HDFS test cases, and a CI pipeline with Jenkins, Docker, and Makefile to ensure functional, regression, and performance testing.
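To give a flavor of the pytest layer, here is a hedged sketch of the kind of small, self‑contained helper such a suite might include (the helper and test names are illustrative, not JD's actual suite): it splits an EC policy name of the form `<codec>-<data>-<parity>-<cellsize>k`, the convention names like RS‑3‑2‑1024k follow, into its parameters so tests can assert on them.

```python
import re

def parse_policy_name(name: str):
    """Split an EC policy name like 'RS-3-2-1024k' into
    (codec, data_units, parity_units, cell_size_bytes)."""
    m = re.fullmatch(r"([A-Z-]+?)-(\d+)-(\d+)-(\d+)k", name)
    if not m:
        raise ValueError(f"unrecognized EC policy name: {name}")
    codec, data, parity, cell_k = m.groups()
    return codec, int(data), int(parity), int(cell_k) * 1024

def test_rs_3_2_policy():
    # The policy used in the migration described above.
    codec, d, p, cell = parse_policy_name("RS-3-2-1024k")
    assert (codec, d, p) == ("RS", 3, 2)
    assert cell == 1024 * 1024  # 1 MiB striping cells
```

In a CI pipeline like the one described, Ansible stands up a disposable cluster, pytest cases exercise HDFS end to end, and pure helpers like this keep the assertions readable.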

The upgrade and rollback process was performed in three stages: (1) upgrade a standby NameNode (NN) to LayoutVersion ‑63, (2) upgrade the standby to the EC‑enabled build with LayoutVersion ‑64, and (3) replace the original NN with the upgraded one, all while maintaining service continuity.
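The staging works because of how HDFS layout versions behave: they grow more negative as metadata features are added, and a build can read metadata written at its own layout version or an older one, but not a newer one. A minimal sketch of that invariant (a hypothetical helper for illustration, not Hadoop code):

```python
# -63 is the pre-EC metadata layout in this rollout; -64 adds EC support.
PRE_EC, EC = -63, -64

def readable_by(build_lv: int, metadata_lv: int) -> bool:
    """Can a build at `build_lv` read an fsimage written at `metadata_lv`?
    More-negative metadata is newer than the build and thus unreadable."""
    return metadata_lv >= build_lv

assert readable_by(PRE_EC, PRE_EC)   # stage 1: -63 build reads -63 image
assert readable_by(EC, PRE_EC)       # stage 2: EC build reads the older image
assert not readable_by(PRE_EC, EC)   # rollback needs the saved -63 image
```

The intermediate ‑63 stage therefore gives each step a known‑good image to fall back to: rolling back the EC build lands on a ‑63 build that can still read the ‑63 metadata saved before the EC upgrade.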

Data lifecycle management introduces a FileConvertCommand and ConvertTaskBalancer to schedule hot‑warm‑cold data conversion within the NN, enabling atomic swaps of original and EC files and ensuring metadata preservation.
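As an illustration of the atomic‑swap idea, here is a local‑filesystem stand‑in (not the actual NameNode‑side implementation, where the swap happens on INodes in metadata): the EC‑encoded copy takes over the original path in a single rename, so the logical path, and any metadata keyed on it, never dangles.

```python
import os

def atomic_swap(original: str, ec_copy: str) -> None:
    """Replace `original` with its EC-encoded copy in one atomic rename,
    keeping the path stable and restoring the original timestamps.
    (Illustrative POSIX stand-in for the NN-internal swap.)"""
    st = os.stat(original)                           # remember timestamps
    os.replace(ec_copy, original)                    # atomic on POSIX
    os.utime(original, (st.st_atime, st.st_mtime))   # preserve mtime/atime
```

A reader that holds the path sees either the old replicated file or the finished EC file, never a half‑converted state, which is the property the in‑NN swap must provide as well.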

Comprehensive data integrity protection combines file‑level and block‑level verification, using MD5 checksums and EC codec utilities to reconstruct and compare missing blocks, supplemented by a real‑time block‑level detection mechanism based on streaming computation.
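To make the reconstruct‑and‑compare step concrete, here is a toy sketch that substitutes single‑parity XOR for the Reed–Solomon codec (RS generalizes the same idea to multiple parity units): rebuild a lost block from the survivors, then compare its MD5 digest against the checksum recorded before conversion.

```python
import hashlib

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together (single-parity erasure code)."""
    out = bytearray(blocks[0])
    for b in blocks[1:]:
        for i, x in enumerate(b):
            out[i] ^= x
    return bytes(out)

# Toy block group: 3 data blocks + 1 XOR parity block. (The article's
# RS-3-2 codec tolerates two losses; XOR is shown here for brevity.)
data = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_blocks(*data)
expected_md5 = hashlib.md5(data[1]).hexdigest()  # recorded pre-conversion

# Simulate losing data[1]; rebuild it from the survivors plus parity,
# then compare digests -- the block-level check described above.
rebuilt = xor_blocks(data[0], data[2], parity)
assert hashlib.md5(rebuilt).hexdigest() == expected_md5
```

The file‑level check works the same way one level up: the checksum of the whole decoded file must match the checksum taken before the replicated original was discarded.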

The project contributed dozens of patches to the Hadoop community (e.g., HDFS‑14171, HDFS‑14353) and plans future work on native EC acceleration and further stability improvements.

Cluster Upgrade · Storage Optimization · Erasure Coding · HDFS · Data Lifecycle Management
Written by

JD Tech

Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.
