
Inside Baidu’s 8‑Year Evolution of Hadoop and Distributed Computing

This article chronicles Baidu’s eight‑year journey from early Hadoop adoption to advanced MPI, DAG engines, and real‑time streaming platforms, detailing architectural milestones, performance optimizations, and practical lessons for large‑scale offline and online data processing.

Efficient Ops

Guest Introduction

Zhu Guanyin, a 2008 master’s graduate of Beijing University of Posts and Telecommunications, is a senior technical manager in Baidu’s Infrastructure Department and one of the first Hadoop engineers in China, leading large‑scale offline model training and real‑time computing projects.

Topic Overview

Baidu introduced Hadoop in 2007 and now operates the world’s largest Hadoop clusters (single cluster over 13,000 nodes, total over 100,000 nodes) with daily CPU utilization exceeding 80%.

Typical Offline Computing Scenario

Jobs that can tolerate latency above five minutes are treated as offline workloads and run on the Hadoop and MPI platforms.

MapReduce Development Timeline

2003–2006: Google published the GFS (2003), MapReduce (2004), and Bigtable (2006) papers.

2006: Doug Cutting founded Hadoop.

October 2007: Hadoop 0.15.1 released.

2007 – Hadoop Journey Begins

First Hadoop trial in November 2007 with a 28‑node cluster built from idle servers; initial workloads included large‑scale search PV/UV analysis.

Key improvements: LZMA compression and a binary streaming interface (bistreaming) that lets streaming jobs handle non‑text data such as web index data.
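Text-oriented streaming splits records on newlines and tabs, which breaks when keys or values contain arbitrary bytes. The sketch below shows one common way a binary streaming interface can work: length-prefixed key/value records. The exact wire format of bistreaming is not described in the article, so the layout here (4-byte big-endian lengths) and the names `write_record` and `read_records` are assumptions for illustration only.

```python
import io
import struct


def write_record(stream, key: bytes, value: bytes) -> None:
    """Write <4-byte key length><key><4-byte value length><value> (assumed layout)."""
    stream.write(struct.pack(">I", len(key)))
    stream.write(key)
    stream.write(struct.pack(">I", len(value)))
    stream.write(value)


def _read_exact(stream, size: int) -> bytes:
    """Read exactly `size` bytes or fail loudly on a truncated record."""
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("truncated record")
    return data


def read_records(stream):
    """Yield (key, value) pairs until the stream is exhausted."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        if len(header) != 4:
            raise EOFError("truncated record")
        (key_len,) = struct.unpack(">I", header)
        key = _read_exact(stream, key_len)
        (value_len,) = struct.unpack(">I", _read_exact(stream, 4))
        value = _read_exact(stream, value_len)
        yield key, value


# Keys and values may contain newlines, tabs, and NUL bytes.
buffer = io.BytesIO()
write_record(buffer, b"doc\n1", b"\x00\xff\tindex-bytes")
write_record(buffer, b"doc2", b"")
buffer.seek(0)
records = list(read_records(buffer))
```

Because the reader never scans for delimiters, the payload can be any byte sequence, which is the property a non-text workload needs.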

2009 – MPI Journey Begins

MPI was introduced to address Hadoop’s limitations for iterative machine‑learning tasks: a single All‑Reduce call can do the work that would otherwise require an entire MapReduce job per iteration.

MPI’s All‑Reduce dramatically reduces job startup overhead and improves iteration efficiency.
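To make the comparison concrete, here is a single-process simulation of the pattern. It is not MPI code: `all_reduce_sum` is a hypothetical stand-in for what a real `MPI_Allreduce` call would do across processes, and the toy model (fitting `y = w * x`) is chosen only to keep the sketch short. The point is that each iteration needs one collective call, with the workers staying alive between iterations, rather than one freshly launched job.

```python
def all_reduce_sum(local_vectors):
    """Simulated All-Reduce (sum): every worker receives the element-wise total."""
    total = [sum(column) for column in zip(*local_vectors)]
    return [list(total) for _ in local_vectors]


def train(shards, steps=50, learning_rate=0.1):
    """Fit y = w * x by gradient descent over data sharded across simulated workers."""
    w = 0.0
    for _ in range(steps):
        # Each worker computes [gradient sum, example count] on its own shard.
        local = [
            [sum(2 * (w * x - y) * x for x, y in shard), float(len(shard))]
            for shard in shards
        ]
        # One collective call replaces a whole map + shuffle + reduce round.
        reduced = all_reduce_sum(local)
        grad_sum, count = reduced[0]  # identical on every worker
        w -= learning_rate * grad_sum / count
    return w


shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0)]]  # data generated from y = 3x
fitted_w = train(shards)
```

In a MapReduce formulation, each pass of the loop body would be a separate job with its own startup and shuffle cost, which is the overhead the text refers to.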

Optimizing PLSA on Hadoop and then migrating to MPI yielded an order‑of‑magnitude speedup.

2010 – Infrastructure Department Formation

Consolidation of infrastructure teams and integration of the Pyramid system with Hadoop.

Initial MPI scheduling was manual, later replaced by Torque and then Maui for more robust scheduling.

Development of Hadoop C++ Extension (hce) and extensive bug fixes and feature additions.

2011–2015 Milestones

2011: Single‑cluster MapReduce scaled to 5,000 nodes.

2012: Baidu’s Hadoop 2.0 cluster launched, a year ahead of the open‑source version.

2013: World’s largest Hadoop cluster (13,000+ nodes) with millions of concurrent jobs; introduced transparent LZMA compression based on file hotness.

2014: Native C++ DAG engine deployed, merging multiple MapReduce jobs into a single DAG to cut redundant I/O.
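The I/O saving from merging jobs can be illustrated with a toy model. This is a sketch, not Baidu's engine: `FakeDFS` is a hypothetical in-memory stand-in for a distributed file system that simply counts writes, and the two stages (count words, then group words by frequency) are invented for the example. Run as separate jobs, the intermediate result is written out and read back; run as one DAG, it is handed directly to the next stage.

```python
from collections import defaultdict


class FakeDFS:
    """Hypothetical stand-in for a distributed file system that counts writes."""

    def __init__(self):
        self.files = {}
        self.writes = 0

    def write(self, path, records):
        self.files[path] = list(records)
        self.writes += 1

    def read(self, path):
        return list(self.files[path])


def map_reduce(records, mapper, reducer):
    """Minimal in-memory map, group-by-key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


def split_words(line):
    return [(word, 1) for word in line.split()]


def sum_counts(word, ones):
    return (word, sum(ones))


def key_by_count(pair):
    word, count = pair
    return [(count, word)]


def collect_words(count, words):
    return (count, sorted(words))


def run_as_separate_jobs(dfs, src, dst):
    # Job 1 persists its output; job 2 reads it back.
    dfs.write("/tmp/stage1", map_reduce(dfs.read(src), split_words, sum_counts))
    dfs.write(dst, map_reduce(dfs.read("/tmp/stage1"), key_by_count, collect_words))


def run_as_dag(dfs, src, dst):
    # The intermediate result flows straight into the next stage.
    stage1 = map_reduce(dfs.read(src), split_words, sum_counts)
    dfs.write(dst, map_reduce(stage1, key_by_count, collect_words))


dfs = FakeDFS()
dfs.write("/in", ["a b a", "b c"])
dfs.writes = 0
run_as_separate_jobs(dfs, "/in", "/out_jobs")
writes_separate = dfs.writes
run_as_dag(dfs, "/in", "/out_dag")
writes_dag = dfs.writes - writes_separate
```

Both paths produce the same output, but the DAG version performs one write instead of two; with replicated storage and long job chains, that difference compounds.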

2015: In‑memory streaming shuffle implemented, pushing map output to reducers proactively.
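In a classic shuffle, reducers pull map output after it has been written to disk. A push-based, in-memory shuffle can be sketched with threads and queues, as below. This is only an illustration of the idea under simplifying assumptions: the function names, the `DONE` sentinel, and the byte-sum partitioner are hypothetical, and a real system would also need flow control and failure handling.

```python
import queue
import threading
from collections import Counter

DONE = object()  # sentinel each mapper sends to every reducer when it finishes


def run_mapper(lines, reducer_queues):
    for line in lines:
        for word in line.split():
            # Deterministic partitioning; push each record as soon as it exists.
            target = sum(word.encode("utf-8")) % len(reducer_queues)
            reducer_queues[target].put((word, 1))
    for inbox in reducer_queues:
        inbox.put(DONE)


def run_reducer(inbox, num_mappers, counts):
    finished = 0
    while finished < num_mappers:
        item = inbox.get()
        if item is DONE:
            finished += 1
        else:
            word, n = item
            counts[word] += n  # aggregate while mappers are still running


def streaming_word_count(splits, num_reducers=2):
    queues = [queue.Queue() for _ in range(num_reducers)]
    partials = [Counter() for _ in range(num_reducers)]
    threads = [
        threading.Thread(target=run_reducer, args=(q, len(splits), c))
        for q, c in zip(queues, partials)
    ]
    threads += [threading.Thread(target=run_mapper, args=(s, queues)) for s in splits]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return dict(merged)


word_counts = streaming_word_count([["a b a"], ["b c"]])
```

Reducers start consuming before any mapper has finished, so map and reduce work overlap instead of running as strictly separate phases.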

Real‑Time and Streaming Platforms

Baidu’s DStream platform achieves millisecond‑level latency, predating Storm.

TaskManager provides exactly‑once processing with 30 s–5 min latency, using a queue‑worker model and HDFS for durable storage.
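The article does not spell out how TaskManager achieves exactly-once semantics. One common approach that fits a queue-worker model with durable storage is to commit the worker's result and its input offset together, so a restarted worker resumes exactly where the last commit left off. The sketch below assumes that approach; `DurableLog`, `Worker`, and the `state` dictionary are hypothetical in-memory stand-ins for the durable queue and checkpoint.

```python
class DurableLog:
    """Hypothetical stand-in for a durable, replayable input queue."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

    def read_from(self, offset):
        return list(enumerate(self.records))[offset:]


class Worker:
    """Sums the input; `state` plays the role of a durable checkpoint."""

    def __init__(self, log, state):
        self.log = log
        self.state = state

    def run(self, crash_after=None):
        processed = 0
        for offset, value in self.log.read_from(self.state["offset"]):
            if crash_after is not None and processed == crash_after:
                raise RuntimeError("simulated crash")
            # Result and next offset are committed in a single step, so a
            # record is never counted twice and never skipped.
            self.state.update(total=self.state["total"] + value, offset=offset + 1)
            processed += 1


log = DurableLog()
for value in [1, 2, 3, 4, 5]:
    log.append(value)

state = {"offset": 0, "total": 0}
try:
    Worker(log, state).run(crash_after=2)  # dies after two records
except RuntimeError:
    pass
state_after_crash = dict(state)

Worker(log, state).run()  # a fresh worker resumes from the checkpoint
```

Decoupling the queue from the worker is what makes the replay possible: the input survives the crash, and the checkpoint says how much of it has already been applied.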

Q&A Highlights

Performance vs. Spark

Spark is used selectively; Baidu prefers SparkSQL for ad‑hoc queries but relies on its own platforms for large‑scale DAG jobs.

TaskManager Guarantees

Ensures no data loss or duplication through queue‑worker decoupling and HDFS‑backed streams.

Shuffle Reliability

Map side pushes data to HDFS; reducers read from HDFS with acknowledgment mechanisms to handle failures.

Security and Auditing

Data access requires owner approval; comprehensive logging enables traceability.

Compression Strategy

Transparent LZMA compression applied to cold files during idle periods, balancing CPU cost with storage savings.
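A hotness-based policy of this kind can be sketched with Python's standard `lzma` module. The details are assumptions: the article does not say how hotness is measured, so this example uses time since last access, and `StoredFile`, `compress_cold_files`, and `read_file` are hypothetical names. Reads decompress on the fly, which is what makes the compression transparent to callers.

```python
import lzma
from dataclasses import dataclass


@dataclass
class StoredFile:
    data: bytes
    last_access: float
    compressed: bool = False


def compress_cold_files(files, now, cold_after):
    """Compress files not accessed for `cold_after` seconds; return bytes saved."""
    saved = 0
    for stored in files.values():
        if stored.compressed or now - stored.last_access < cold_after:
            continue  # already compressed, or still hot
        packed = lzma.compress(stored.data)
        if len(packed) < len(stored.data):  # keep the original if no gain
            saved += len(stored.data) - len(packed)
            stored.data, stored.compressed = packed, True
    return saved


def read_file(files, name, now):
    """Return the original bytes whether or not the file is compressed."""
    stored = files[name]
    stored.last_access = now
    return lzma.decompress(stored.data) if stored.compressed else stored.data


payload = b"log line\n" * 1000
files = {
    "cold": StoredFile(payload, last_access=0.0),
    "hot": StoredFile(payload, last_access=990.0),
}
bytes_saved = compress_cold_files(files, now=1000.0, cold_after=100.0)
cold_contents = read_file(files, "cold", now=1001.0)
```

Running the pass only during idle periods, as the article describes, spends otherwise unused CPU on LZMA's relatively expensive compression while leaving frequently read files uncompressed.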

Resource Management

Monitoring dashboards track job counts, throughput, queue times, and cluster utilization.

Scheduling

Baidu’s self‑developed Normandy scheduler complements YARN, offering per‑queue concurrency limits and priority controls.
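Per-queue concurrency limits combined with priorities can be sketched as a small dispatcher. This is not Normandy's design, which the article does not describe; `QueueScheduler` and its methods are hypothetical, and the policy shown (highest priority first, first-come order within a priority, skip jobs whose queue is full) is just one reasonable choice.

```python
import heapq
import itertools


class QueueScheduler:
    def __init__(self, limits):
        self.limits = dict(limits)            # queue name -> max running jobs
        self.running = {name: 0 for name in limits}
        self.pending = []                     # heap of (-priority, seq, queue, job)
        self._seq = itertools.count()         # tie-breaker: submission order

    def submit(self, job, queue_name, priority):
        heapq.heappush(self.pending, (-priority, next(self._seq), queue_name, job))

    def dispatch(self):
        """Start every pending job whose queue still has a free slot."""
        started, deferred = [], []
        while self.pending:
            entry = heapq.heappop(self.pending)
            _, _, queue_name, job = entry
            if self.running[queue_name] < self.limits[queue_name]:
                self.running[queue_name] += 1
                started.append(job)
            else:
                deferred.append(entry)        # queue is full; try again later
        for entry in deferred:
            heapq.heappush(self.pending, entry)
        return started

    def finish(self, queue_name):
        self.running[queue_name] -= 1


scheduler = QueueScheduler({"ads": 1, "search": 2})
scheduler.submit("ads-low", "ads", priority=1)
scheduler.submit("ads-high", "ads", priority=5)
scheduler.submit("search-mid", "search", priority=3)
first_round = scheduler.dispatch()
scheduler.finish("ads")
second_round = scheduler.dispatch()
```

The low-priority job in the `ads` queue waits even though the cluster may have spare capacity, which is the isolation a per-queue limit is meant to provide.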
