Inside Baidu’s 8‑Year Evolution of Hadoop and Distributed Computing
This article chronicles Baidu’s eight‑year journey from early Hadoop adoption to advanced MPI, DAG engines, and real‑time streaming platforms, detailing architectural milestones, performance optimizations, and practical lessons for large‑scale offline and online data processing.
Guest Introduction
Zhu Guanyin, a 2008 master’s graduate of Beijing University of Posts and Telecommunications, is a senior technical manager in Baidu’s Infrastructure Department and one of the first Hadoop engineers in China, leading large‑scale offline model training and real‑time computing projects.
Topic Overview
Baidu introduced Hadoop in 2007 and now operates the world’s largest Hadoop clusters (single cluster over 13,000 nodes, total over 100,000 nodes) with daily CPU utilization exceeding 80%.
Typical Offline Computing Scenario
Offline jobs that can tolerate latencies above five minutes are handled by the Hadoop and MPI platforms.
MapReduce Development Timeline
Early 2000s: Publication of GFS, MapReduce, Bigtable papers.
2004: MapReduce paper published; 2006: Doug Cutting created the Hadoop project.
October 2007: Hadoop 0.15.1 released.
2007 – Hadoop Journey Begins
First Hadoop trial in November 2007 with a 28‑node cluster built from idle servers; initial workloads included large‑scale search PV/UV analysis.
Key improvements: LZMA compression and a binary streaming interface (bistreaming) to support non‑text data such as web indexing.
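A binary streaming interface has to frame records explicitly, because keys and values may contain arbitrary bytes (including newlines, which break text streaming). A minimal sketch of length-prefixed framing of the kind such an interface implies; the function names and wire format here are illustrative assumptions, not Baidu's actual bistreaming protocol:

```python
import struct

def write_record(buf, key: bytes, value: bytes) -> None:
    """Append one record as <key_len><key><value_len><value>."""
    buf.write(struct.pack(">I", len(key)))   # 4-byte big-endian length
    buf.write(key)
    buf.write(struct.pack(">I", len(value)))
    buf.write(value)

def read_records(buf):
    """Yield (key, value) pairs until the stream is exhausted."""
    while True:
        header = buf.read(4)
        if not header:
            return
        klen = struct.unpack(">I", header)[0]
        key = buf.read(klen)
        vlen = struct.unpack(">I", buf.read(4))[0]
        value = buf.read(vlen)
        yield key, value
```

Because lengths are carried out of band, records with embedded `\n` or `\t` bytes (e.g. serialized web-index entries) round-trip safely, which text-based Hadoop streaming cannot guarantee.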
2009 – MPI Journey Begins
MPI was introduced to address Hadoop’s limitations for iterative machine‑learning workloads: a single All‑Reduce call accomplishes what would otherwise require an entire MapReduce job.
MPI’s All‑Reduce dramatically reduces job startup overhead and improves iteration efficiency.
Optimizing PLSA on Hadoop and then migrating to MPI yielded an order‑of‑magnitude speedup.
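The advantage of All‑Reduce for iterative training can be seen from its semantics: after the call, every rank holds the element‑wise reduction of all ranks' vectors. The pure‑Python simulation below shows only that result (in real MPI this is one collective, `MPI_Allreduce`, with no job startup, shuffle, or HDFS write per iteration):

```python
def allreduce_sum(vectors):
    """Simulate All-Reduce: return the summed vector every rank receives."""
    total = [0.0] * len(vectors[0])
    for vec in vectors:                 # reduce phase: sum across ranks
        for i, x in enumerate(vec):
            total[i] += x
    return total                        # broadcast phase: all ranks get `total`

# One training iteration = compute local gradients + one all-reduce,
# instead of launching a full MapReduce job per iteration.
local_gradients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 ranks
print(allreduce_sum(local_gradients))  # each rank sees [9.0, 12.0]
```

This is why migrating iterative algorithms like PLSA from MapReduce to MPI yields order‑of‑magnitude speedups: the per‑iteration cost drops from a whole job lifecycle to one network collective.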
2010 – Infrastructure Department Formation
Consolidation of infrastructure teams and integration of the Pyramid system with Hadoop.
Initial MPI scheduling was manual, later replaced by Torque and then Maui for more robust scheduling.
Development of Hadoop C++ Extension (hce) and extensive bug fixes and feature additions.
2011–2015 Milestones
2011: Single‑cluster MapReduce scaled to 5,000 nodes.
2012: Baidu’s Hadoop 2.0 cluster launched, a year ahead of the open‑source version.
2013: World’s largest Hadoop cluster (13,000+ nodes) with millions of concurrent jobs; introduced transparent LZMA compression based on file hotness.
2014: Native C++ DAG engine deployed, merging multiple MapReduce jobs into a single DAG to cut redundant I/O.
2015: In‑memory streaming shuffle implemented, pushing map output to reducers proactively.
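The push model of a streaming shuffle can be sketched as follows: mappers hash‑partition each record and push it straight into an in‑memory queue per reducer, instead of spilling to local disk for reducers to pull later. Queue sizing, partitioning, and flow control are all simplified assumptions here, not the production design:

```python
from collections import defaultdict
from queue import Queue

NUM_REDUCERS = 2
reducer_queues = [Queue() for _ in range(NUM_REDUCERS)]

def map_and_push(records):
    """Mapper side: partition each record and push it to its reducer."""
    for key, value in records:
        part = hash(key) % NUM_REDUCERS          # choose target reducer
        reducer_queues[part].put((key, value))   # proactive in-memory push

def drain_reducer(part):
    """Reducer side: group the pushed records by key (the merge step)."""
    grouped = defaultdict(list)
    q = reducer_queues[part]
    while not q.empty():
        key, value = q.get()
        grouped[key].append(value)
    return dict(grouped)
```

Compared with the classic pull‑based shuffle, the push model removes the map‑side spill/merge round trip through disk, at the cost of needing memory and back‑pressure management on the reduce side.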
Real‑Time and Streaming Platforms
Baidu’s DStream platform achieves millisecond‑level latency, predating Storm.
TaskManager provides exactly‑once processing with 30 s–5 min latency, using a queue‑worker model and HDFS for durable storage.
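An exactly‑once effect on top of an at‑least‑once queue usually comes from deduplication: workers record the IDs of batches they have already committed and skip redeliveries. A minimal in‑memory sketch of that logic; in the real system the committed set and outputs would live in durable storage such as HDFS, and all names here are illustrative:

```python
class ExactlyOnceWorker:
    def __init__(self):
        self.committed = set()   # would be durable (e.g. on HDFS) in production
        self.output = []

    def process(self, batch_id, records):
        """Apply a batch once; drop duplicate deliveries of the same batch."""
        if batch_id in self.committed:
            return False                  # already applied: redelivery, skip
        self.output.extend(records)
        self.committed.add(batch_id)      # commit marker, atomic with output
        return True
```

The key invariant is that the output write and the commit marker move together; if the worker crashes between them, the queue redelivers the batch and the dedup check decides whether it was applied.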
Q&A Highlights
Performance vs. Spark
Spark is used selectively; Baidu prefers SparkSQL for ad‑hoc queries but relies on its own platforms for large‑scale DAG jobs.
TaskManager Guarantees
Ensures no data loss or duplication through queue‑worker decoupling and HDFS‑backed streams.
Shuffle Reliability
Map side pushes data to HDFS; reducers read from HDFS with acknowledgment mechanisms to handle failures.
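The read‑with‑acknowledgment pattern can be sketched as a retry loop: the reducer retries until the map output is visible in shared storage, then records an ack so the writer knows the data was consumed. Storage, timing, and the ack channel are simplified assumptions, not Baidu's actual protocol:

```python
import time

def fetch_with_retry(store, path, acks, retries=3, delay=0.0):
    """Read map output from a shared store, acking on success."""
    for _ in range(retries):
        if path in store:        # data has become visible
            acks.add(path)       # acknowledge the successful read
            return store[path]
        time.sleep(delay)        # back off before retrying
    raise IOError(f"map output {path} not available after {retries} tries")
```

Putting the intermediate data on HDFS rather than local disk means a lost map node does not force re‑running the map task; the reducer simply retries the read against replicated storage.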
Security and Auditing
Data access requires owner approval; comprehensive logging enables traceability.
Compression Strategy
Transparent LZMA compression applied to cold files during idle periods, balancing CPU cost with storage savings.
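A hotness‑based policy reduces, in essence, to a last‑access threshold: files untouched for longer than some cutoff become compression candidates, and the work is scheduled into idle windows. The threshold and file model below are illustrative assumptions; the real system applied LZMA transparently inside the storage layer:

```python
import time

COLD_AFTER = 30 * 24 * 3600  # 30 days; an illustrative cutoff, not Baidu's

def pick_cold_files(files, now=None):
    """files: list of (path, last_access_ts). Return paths cold enough to compress."""
    now = now if now is not None else time.time()
    return [path for path, atime in files if now - atime > COLD_AFTER]
```

Because LZMA trades high CPU cost for a high compression ratio, restricting it to cold files in idle periods spends otherwise‑wasted cycles on the data least likely to be decompressed soon.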
Resource Management
Monitoring dashboards track job counts, throughput, queue times, and cluster utilization.
Scheduling
Baidu’s self‑developed Normandy scheduler complements YARN, offering per‑queue concurrency limits and priority controls.
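Per‑queue concurrency limits with priority ordering, of the kind attributed to Normandy here, can be sketched with a priority heap gated by per‑queue counters. Everything in this sketch (class name, defaults, tie‑breaking by submission order) is an illustrative simplification, not Normandy's design:

```python
import heapq

class QueueScheduler:
    def __init__(self, limits):
        self.limits = limits                  # queue name -> max concurrent jobs
        self.running = {q: 0 for q in limits}
        self.pending = []                     # heap of (priority, seq, queue, job)
        self._seq = 0                         # FIFO tie-breaker

    def submit(self, queue, job, priority=10):
        """Lower priority number = runs first."""
        heapq.heappush(self.pending, (priority, self._seq, queue, job))
        self._seq += 1

    def dispatch(self):
        """Launch every pending job whose queue still has free slots."""
        launched, deferred = [], []
        while self.pending:
            prio, seq, queue, job = heapq.heappop(self.pending)
            if self.running[queue] < self.limits[queue]:
                self.running[queue] += 1      # consume a slot in that queue
                launched.append(job)
            else:
                deferred.append((prio, seq, queue, job))
        for item in deferred:                 # over-limit jobs wait for slots
            heapq.heappush(self.pending, item)
        return launched
```

The per‑queue cap keeps one team's burst from starving the cluster, while the priority heap decides ordering among jobs that are admissible.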