
Introduction to Hadoop: Architecture, HDFS, MapReduce, and YARN Overview

This article provides a comprehensive overview of Hadoop, covering its origins, core components such as HDFS, MapReduce, and YARN, their architectures, data storage and processing mechanisms, fault‑tolerance features, scheduling strategies, and practical optimization techniques for large‑scale distributed computing.


1. Hadoop Overview

In the era of big data, the volume, velocity, variety, and low value density of data require new storage and processing solutions; Hadoop was created to address these challenges.

Hadoop's core components trace back to three Google papers: the Google File System (GFS) inspired HDFS, Google's MapReduce inspired Hadoop MapReduce, and BigTable inspired HBase.

Today, "Hadoop" usually refers to the entire Hadoop ecosystem, which includes storage (HDFS), computation (MapReduce), resource management (YARN), and common utilities.

2. Hadoop Characteristics

High Availability: multiple replicas of each block ensure data is not lost if a node fails.

High Scalability: nodes can be added or removed dynamically; large clusters (e.g., 3K‑5K nodes) are supported.

High Efficiency: parallel execution under the MapReduce model accelerates processing.

High Fault Tolerance: slow or failed tasks are automatically retried and reassigned.

3. Hadoop Architecture Design

Hadoop 1.x tightly couples MapReduce with resource scheduling, while Hadoop 2.x separates scheduling into YARN, allowing independent management of resources.

3.1 HDFS (Hadoop Distributed File System)

HDFS stores files as blocks (default 128 MB in 2.x) replicated across DataNodes. It provides high fault tolerance and scalability but has limitations such as high latency, poor handling of many small files, and lack of random writes.
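As a concrete illustration of block splitting (a simplified sketch, not the HDFS implementation): a 300 MB file stored with the default 128 MB block size occupies three blocks, with the last block holding only the remaining bytes.

```java
public class BlockSplitExample {
    // Split a file of the given size into HDFS-style fixed-size blocks;
    // the final block holds only the leftover bytes.
    static long[] blockSizes(long fileSize, long blockSize) {
        int full = (int) (fileSize / blockSize);
        long rest = fileSize % blockSize;
        long[] blocks = new long[full + (rest > 0 ? 1 : 0)];
        for (int i = 0; i < full; i++) blocks[i] = blockSize;
        if (rest > 0) blocks[blocks.length - 1] = rest;
        return blocks;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        for (long b : blockSizes(300 * mb, 128 * mb))
            System.out.println(b / mb + " MB");
        // prints 128 MB, 128 MB, 44 MB
    }
}
```

Note that a block on disk only occupies as much space as it actually holds; a 44 MB final block does not waste the remaining 84 MB.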

3.1.1 Advantages

Automatic replication (default 3 copies) for fault tolerance.

Dynamic cluster expansion.

Supports GB/TB/PB‑scale data and million‑plus files.

Low‑cost commodity hardware.

Java API with language‑agnostic client interfaces.

3.1.2 Disadvantages

Not suitable for low‑latency access.

Small‑file handling is expensive: each file, directory, and block consumes roughly 150 bytes of NameNode memory, so large numbers of small files exhaust the NameNode heap.

Files are append‑only; no in‑place modification.

Only one writer at a time per file.
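The small-file cost is easy to quantify. Assuming roughly 150 bytes of NameNode heap per metadata object (file entries and block entries both count), ten million single-block files already consume about 3 GB of heap. A sketch of the arithmetic, not an exact accounting:

```java
public class NameNodeMemoryEstimate {
    // Rough NameNode heap estimate: ~150 bytes per metadata object,
    // counting one object per file plus one per block.
    static long estimateBytes(long files, long blocksPerFile, long bytesPerObject) {
        return files * (1 + blocksPerFile) * bytesPerObject;
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(10_000_000L, 1, 150);
        System.out.printf("~%.1f GB of NameNode heap%n", bytes / 1e9);
        // ten million single-block files -> ~3.0 GB
    }
}
```

This is why the same 1.28 TB of data is far cheaper to store as ten thousand 128 MB files than as ten million 128 KB files.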

3.1.3 Core Components

Client: splits files into blocks, communicates with the NameNode for metadata and with DataNodes for reads and writes.

NameNode: master metadata server; stores the namespace, block mapping, and replica policies.

DataNode: stores actual block data; sends periodic heartbeats and block reports.

Secondary NameNode: periodically merges the edit log into the FsImage checkpoint to shorten NameNode restart time; it is not a hot standby.

Block: physical storage unit; size is configurable (64 MB in 1.x, 128 MB in 2.x).

3.1.4 Write Process

Client requests upload permission from NameNode.

NameNode returns a list of three DataNodes for the first block.

Client streams data to the first DataNode, which forwards to the second and third (pipeline).

Each DataNode acknowledges receipt; the client proceeds with subsequent blocks.
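The pipeline ordering above can be sketched as a toy simulation (plain Java, with illustrative node names; the real protocol streams packets and acks concurrently): data flows forward through the DataNode chain, and acknowledgments flow back in reverse.

```java
import java.util.*;

public class WritePipelineSketch {
    // Forward one packet down the DataNode pipeline, then collect acks
    // in reverse order, mirroring the HDFS write pipeline's two directions.
    static List<String> sendPacket(List<String> pipeline) {
        List<String> events = new ArrayList<>();
        for (String dn : pipeline)
            events.add("store@" + dn);                    // forward pass
        for (int i = pipeline.size() - 1; i >= 0; i--)
            events.add("ack@" + pipeline.get(i));         // acks flow back
        return events;
    }

    public static void main(String[] args) {
        System.out.println(sendPacket(List.of("dn1", "dn2", "dn3")));
        // [store@dn1, store@dn2, store@dn3, ack@dn3, ack@dn2, ack@dn1]
    }
}
```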

3.1.5 Read Process

Client asks NameNode for block locations.

Client selects the nearest DataNode (or random if equal) and reads block data in packets.

3.1.6 High Availability (HA)

Active/Standby NameNode pair with ZooKeeper and ZKFC provides automatic failover when the active NameNode crashes.

4. MapReduce

MapReduce is Hadoop's distributed computation framework, consisting of a Map phase (parallel processing) and a Reduce phase (aggregation).
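The two phases can be sketched with an in-memory word count (plain Java, no Hadoop dependencies; the real framework distributes the same logic across a cluster and inserts a shuffle between the phases):

```java
import java.util.*;

public class WordCountSketch {
    static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: each line is independently turned into (word, 1) pairs.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    mapped.add(Map.entry(word, 1));

        // Shuffle: group pairs by key (Hadoop also sorts and partitions here).
        // Reduce phase: sum the values for each key.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapped)
            counts.merge(kv.getKey(), kv.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("hadoop stores data", "hadoop processes data")));
        // {data=2, hadoop=2, processes=1, stores=1}
    }
}
```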

4.1 Advantages

Easy programming model.

Linear scalability by adding nodes.

Built‑in fault tolerance.

Suitable for batch processing of PB‑scale data.

4.2 Disadvantages

Not suitable for real‑time or streaming workloads.

High I/O overhead due to intermediate data writes.

Limited to a single Map and Reduce stage; complex pipelines require chaining jobs.

4.3 Core Concepts

Serialization (Writable): Hadoop defines its own lightweight serialization types (e.g., IntWritable, Text).

InputFormat & OutputFormat: determine how data is split and written. Common implementations include TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, and CombineTextInputFormat for small‑file optimization.

Split Size Calculation: splitSize = Math.max(minSize, Math.min(maxSize, blockSize))
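A quick check of the formula: with default settings (minSize = 1, maxSize = Long.MAX_VALUE) it simply yields the block size; raising minSize above the block size makes splits larger, and lowering maxSize below it makes them smaller.

```java
public class SplitSizeExample {
    // The split-size rule: clamp the block size between minSize and maxSize.
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(splitSize(1, Long.MAX_VALUE, 128 * mb) / mb); // 128 (defaults)
        System.out.println(splitSize(256 * mb, Long.MAX_VALUE, 128 * mb) / mb); // 256 (minSize raised)
        System.out.println(splitSize(1, 64 * mb, 128 * mb) / mb); // 64 (maxSize lowered)
    }
}
```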

Combiner: optional local reducer to reduce network traffic; must not change final results (e.g., sum is safe, average is not).
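Why a sum is combiner-safe but an average is not: summing partial sums reproduces the global sum, but averaging partial averages weights every partition equally regardless of its size. A minimal demonstration with two unevenly sized partitions:

```java
import java.util.*;

public class CombinerSafety {
    // Sum of per-partition sums equals the global sum -> a sum combiner is safe.
    static int sumOfSums(List<int[]> partitions) {
        return partitions.stream().mapToInt(p -> Arrays.stream(p).sum()).sum();
    }

    // Average of per-partition averages is NOT the global average, because
    // partitions of different sizes get equal weight -> unsafe as a combiner.
    static double avgOfAvgs(List<int[]> partitions) {
        return partitions.stream()
                .mapToDouble(p -> Arrays.stream(p).average().orElse(0))
                .average().orElse(0);
    }

    public static void main(String[] args) {
        List<int[]> partitions = List.of(new int[]{1, 2, 3}, new int[]{10});
        System.out.println(sumOfSums(partitions)); // 16, same as summing all four values
        System.out.println(avgOfAvgs(partitions)); // 6.0, but the true average is 16/4 = 4.0
    }
}
```

A correct averaging combiner would instead emit (sum, count) pairs and divide only in the final reducer.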

Shuffle & Sort: Map outputs are partitioned, buffered in a circular memory buffer, spilled to disk, merged, and sorted before being fetched by the reducers.

Partitioner (the default hash partitioner):

```java
public int getPartition(K2 key, V2 value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
```
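The masking with Integer.MAX_VALUE matters because hashCode() can be negative, and a negative value modulo numReduceTasks would yield a negative (invalid) partition index. The same logic in isolation, using plain Java objects instead of Writable keys:

```java
public class PartitionExample {
    // Same logic as the default hash partitioner: clear the sign bit,
    // then take the remainder modulo the number of reducers.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String key = "polygenelubricants"; // a string whose hashCode is negative
        System.out.println(key.hashCode() < 0);   // true
        System.out.println(getPartition(key, 4)); // still lands in [0, 4)
        System.out.println(getPartition("hadoop", 4));
    }
}
```

Without the mask, Math.abs would also be tempting but is itself unsafe: Math.abs(Integer.MIN_VALUE) is still negative.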

MapJoin vs ReduceJoin: MapJoin caches the small table in the mapper, reducing shuffle traffic; ReduceJoin performs the join in the reducer and may cause data skew.
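A map-side join can be sketched in plain Java (table names and fields here are illustrative): the small table is loaded into an in-memory hash map (in Hadoop it is typically shipped to each mapper via the distributed cache), and every record of the big table is enriched by a simple lookup, so the join itself needs no shuffle.

```java
import java.util.*;

public class MapJoinSketch {
    // Each "mapper" record [key, value] is joined by an in-memory lookup
    // against the small table, avoiding the shuffle a reduce-side join needs.
    static List<String> mapJoin(Map<String, String> smallTable, List<String[]> bigTable) {
        List<String> joined = new ArrayList<>();
        for (String[] record : bigTable) {
            String dim = smallTable.get(record[0]);
            if (dim != null)
                joined.add(record[0] + "," + record[1] + "," + dim);
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> cities = Map.of("c1", "Beijing", "c2", "Shanghai"); // small table
        List<String[]> orders = List.of(
                new String[]{"c1", "order-9"},
                new String[]{"c3", "order-7"}); // c3 has no match and is dropped (inner join)
        System.out.println(mapJoin(cities, orders)); // [c1,order-9,Beijing]
    }
}
```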

5. YARN (Yet Another Resource Negotiator)

YARN is Hadoop's resource‑management layer that decouples scheduling from the data‑processing engine.

5.1 Main Components

ResourceManager: receives client requests, tracks NodeManagers, and allocates containers.

NodeManager: manages resources on a single node and launches containers on behalf of the ResourceManager.

ApplicationMaster: per‑application entity that negotiates resources, monitors tasks, and handles failures.

Container: abstract resource bundle (memory, CPU, disk, network) in which a task runs.

5.2 Job Scheduling Flow

Client submits a job; ResourceManager assigns a job ID and a staging directory.

Client uploads the job JAR, input splits, and configuration.

ResourceManager schedules the ApplicationMaster in an available NodeManager.

ApplicationMaster requests containers for Map tasks; ResourceManager allocates them based on locality and resource availability.

After all Map tasks finish, ApplicationMaster requests Reduce containers.

Progress and counters are reported back to the client.

Upon completion, containers are released and job history is stored.

5.3 Scheduler Types

FIFO: simple first‑in‑first‑out queue.

Capacity Scheduler: multiple queues with guaranteed capacities; jobs within a queue run FIFO.

Fair Scheduler: aims to give each user a fair share of resources; supports preemption.

5.4 Speculative Execution

When a small fraction of tasks run significantly slower than the average, YARN launches duplicate (speculative) tasks; the first to finish supplies the result, reducing overall job latency.

6. MapReduce Optimization Techniques

Input: combine small files using Hadoop Archives or SequenceFiles; enable JVM reuse.

Map Phase: increase the spill buffer size, reduce the number of merges, and use a combiner when possible.

Reduce Phase: choose appropriate numbers of map and reduce tasks; allow map and reduce execution to overlap.

I/O: enable compression (e.g., Snappy, LZO) and use binary formats such as SequenceFile.

Data Skew: sample data to set partition boundaries, implement custom partitioners, or use map‑side joins.

Written by Full-Stack Internet Architecture

Introducing full-stack Internet architecture technologies centered on Java
