
How to Become a Spark Committer: The Journey of JD’s Zheng Ruifeng

The article chronicles JD engineer Zheng Ruifeng’s path to becoming a Spark Committer, highlighting his early involvement, key contributions to Spark’s ML and GraphX components, the community’s scale, and his vision for future improvements in the big‑data platform.

JD Retail Technology

Openness and sharing are at the core of the internet spirit, and open-source communities embody that principle. Apache Spark, a project of the Apache Software Foundation, is one of the most widely used distributed computing engines in the big-data field.

Today Spark is the de-facto standard for big-data computation, adopted by countless internet companies. Its community spans over 570 regions and counts more than 300,000 members, yet there are only 74 committers, roughly one core contributor for every 4,000 ordinary members, entrusted with safeguarding code quality.

Zheng Ruifeng, a developer from JD.com’s Retail Technology and Data Platform, became China’s fourth and JD’s first Spark Committer in August 2019 due to his outstanding contributions.

He began using Spark in 2012 and started contributing code in 2015, focusing on machine‑learning components, PySpark, and GraphX. For example, he resolved SPARK‑21690 by redesigning the Imputer algorithm, achieving a ten‑fold speedup for multi‑column data preprocessing.
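The speedup came from changing how the Imputer computes its surrogate statistics: instead of scanning the full dataset once per column, the redesigned version gathers statistics for all columns in a single pass. The sketch below illustrates that idea in plain Python; it is not Spark's actual implementation, just a minimal model of one-pass, multi-column mean imputation.

```python
# Illustrative sketch (plain Python, not Spark's code): compute surrogate
# means for every column in ONE scan of the data, then fill missing cells.

def impute_means_one_pass(rows, n_cols):
    """Mean-impute missing (None) cells, computing all column means in one pass."""
    sums = [0.0] * n_cols
    counts = [0] * n_cols
    for row in rows:                      # single scan over the data
        for j, v in enumerate(row):
            if v is not None:
                sums[j] += v
                counts[j] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    # Replace each missing cell with its column's mean.
    return [[v if v is not None else means[j] for j, v in enumerate(row)]
            for row in rows]

rows = [[1.0, None], [3.0, 4.0], [None, 8.0]]
print(impute_means_one_pass(rows, 2))
# -> [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
```

A per-column implementation would repeat the outer loop once for each column, which is what made the old behavior slow on wide datasets.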

He also introduced advanced algorithms such as the GBDT‑based feature‑combination method (SPARK‑13677), which had been missing from Spark but is now included in Spark 3.0.
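The GBDT feature-combination idea is to route each sample through the boosted trees and use the index of the leaf it lands in as a new categorical (one-hot) feature for a downstream linear model. The following is a hypothetical pure-Python sketch of just the encoding step; the two hand-built stumps stand in for trees that a real GBDT would learn.

```python
# Hypothetical sketch of GBDT feature combination: each "tree" maps a sample
# to a leaf index, and the leaf indices are one-hot encoded as new features.

def stump(feature_idx, threshold):
    """A depth-1 'tree': leaf 0 if x[feature_idx] <= threshold, else leaf 1."""
    return lambda x: 0 if x[feature_idx] <= threshold else 1

def gbdt_features(x, trees, leaves_per_tree=2):
    """Concatenate one-hot encodings of the leaf each tree routes x to."""
    out = []
    for t in trees:
        onehot = [0] * leaves_per_tree
        onehot[t(x)] = 1                 # mark the leaf this tree selects
        out.extend(onehot)
    return out

trees = [stump(0, 2.5), stump(1, 1.0)]
print(gbdt_features([3.0, 0.5], trees))  # -> [0, 1, 1, 0]
```

The combined feature vector captures tree-learned interactions between raw features, which is why the technique is popular for click-through-rate models.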

Beyond technical work, Zheng enjoys reading research papers and spending time with his child, viewing the study of papers as both a hobby and a necessary trait for a true geek and Spark Committer.

Looking ahead, he aims to bring more advanced techniques into Spark, improve model‑loading and callback functionalities, adopt more compact data structures, resolve inconsistencies between PySpark and the JVM, and promote broader use of the Barrier scheduler for GPU and deep‑learning workloads.

Within JD, Spark serves as the primary engine for offline data processing, supporting products such as “ShuFang”, “ShangZhi”, and “GoldEye”. Zheng’s team uses Spark‑GraphX for shop‑competition analysis and hopes to both import community innovations into JD and contribute JD’s practical models back to the Spark ecosystem.
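JD's actual GraphX pipeline is not described in the article, but the general shape of a shop-competition analysis can be sketched as a graph problem: connect two shops when they share customers, then group shops into competition clusters with connected components (an algorithm GraphX provides out of the box). A plain-Python illustration under those assumptions:

```python
# Illustrative sketch (plain Python, not GraphX): build a shop-competition
# graph from shared customers, then cluster shops via connected components.
from collections import defaultdict

def competition_edges(purchases):
    """purchases: {customer: set(shops)}. Return shop pairs sharing a customer."""
    edges = set()
    for shops in purchases.values():
        shops = sorted(shops)
        for i in range(len(shops)):
            for j in range(i + 1, len(shops)):
                edges.add((shops[i], shops[j]))
    return edges

def connected_components(nodes, edges):
    """Group nodes into components using union-find."""
    parent = {n: n for n in nodes}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n
    for a, b in edges:
        parent[find(a)] = find(b)
    comps = defaultdict(set)
    for n in nodes:
        comps[find(n)].add(n)
    return sorted(sorted(c) for c in comps.values())

purchases = {"u1": {"A", "B"}, "u2": {"B", "C"}, "u3": {"D"}}
print(connected_components({"A", "B", "C", "D"}, competition_edges(purchases)))
# -> [['A', 'B', 'C'], ['D']]
```

In Spark-GraphX the same structure would be an RDD of edges and a call to `Graph.connectedComponents`, letting the computation scale across a cluster.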

The Spark Committer community is diverse, comprising contributors from Databricks, top universities (UC Berkeley, Virginia, Wisconsin, Princeton, Michigan, Stanford), major tech firms (Facebook, Microsoft, Apple, Google, IBM, Intel, Nvidia, Uber, Oracle, eBay, Netflix), Chinese giants (JD, Alibaba, Tencent, Huawei), and various other companies.

Big Data · Machine Learning · Open Source · Apache Spark · Committer · JD.com
Written by

JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
