
Understanding HDFS: Blocks, Packets, Chunks, and Read/Write Processes

This article explains the core concepts of HDFS—including its block, packet, and chunk structures, their roles in data streaming, the detailed write and read workflows, and how checksums ensure data integrity—providing a comprehensive overview for anyone working with Hadoop distributed storage.


Overview: HDFS (Hadoop Distributed File System) is an open‑source implementation modeled on the Google File System, designed to run on inexpensive hardware with high fault tolerance, stream‑oriented access, and scalability to massive data sets.

Key characteristics:

Runs on cheap machines; hardware failures are expected, so it offers strong fault tolerance.

Provides streaming data access rather than random read/write.

Targets large‑scale data sets, supports batch processing and horizontal scaling.

Uses a simple consistency model assuming write‑once, read‑many.

Drawbacks:

Does not support low‑latency data access.

Inefficient for storing many small files due to metadata overhead.

Only one writer per file; concurrent writes are not allowed.

No random file modifications; only append operations are supported.

HDFS data units:

Block: The largest unit, typically 128 MB. Files are split into blocks before upload; block size affects addressing overhead and parallelism.

Packet: The second‑largest unit, the basic data transfer unit between client and DataNode (or between DataNodes in a pipeline), default size 64 KB.

Chunk: The smallest unit, 512 bytes of actual data plus a 4‑byte checksum (516 bytes total). Chunks are the verification units inside packets.
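The size relationships between these three units can be checked with a little arithmetic. This is a minimal sketch using the default figures quoted above (128 MB blocks, 64 KB packets, 512‑byte chunks with 4‑byte checksums); the constant names are illustrative, not actual Hadoop configuration keys.

```python
BLOCK_SIZE = 128 * 1024 * 1024        # default block size (128 MB)
PACKET_SIZE = 64 * 1024               # default packet size (64 KB)
CHUNK_DATA = 512                      # data bytes per chunk
CHUNK_CRC = 4                         # checksum bytes per chunk
CHUNK_TOTAL = CHUNK_DATA + CHUNK_CRC  # 516 bytes per chunk on the wire

def blocks_for(file_size: int) -> int:
    """Number of blocks a file of file_size bytes is split into."""
    return -(-file_size // BLOCK_SIZE)  # ceiling division

# Full 516-byte chunks that fit in one 64 KB packet payload:
chunks_per_packet = PACKET_SIZE // CHUNK_TOTAL  # 127
```

So a 300 MB file occupies three blocks (two full 128 MB blocks plus a 44 MB remainder), and each packet carries up to 127 checksummed chunks.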

Write workflow:

Client sends a write request to the NameNode.

NameNode checks file existence and permissions, logs the operation in the EditLog, and returns an output stream.

Client splits the file into 128 MB blocks.

Client receives a list of writable DataNodes from the NameNode and streams data to the first DataNode; the data then flows through the pipeline of DataNodes as packets.

Each DataNode persists the packet it receives, forwards it to the next node in the pipeline, and sends an acknowledgment back upstream; a packet is considered written once acknowledgments return from all replicas.

Client closes the output stream after all data is sent.

Client notifies the NameNode that the write is complete; the NameNode commits the file once each block has reached its minimum replication.
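The pipeline step above can be sketched as a toy simulation: the client streams packets to the first DataNode, each node stores its copy and forwards the packet downstream, and acknowledgments flow back upstream. The node names and function are illustrative only, not Hadoop APIs.

```python
def pipeline_write(packets, datanodes):
    """Push each packet through the replication pipeline in order.

    Returns a map of DataNode -> list of packets it stored, mimicking
    how every node in the pipeline ends up with a full replica.
    """
    stored = {dn: [] for dn in datanodes}
    for pkt in packets:
        # Each node receives the packet, persists it, and forwards it
        # to the next node; the ack path back to the client is implicit.
        for dn in datanodes:
            stored[dn].append(pkt)
    return stored

replicas = pipeline_write([b"pkt0", b"pkt1"], ["dn1", "dn2", "dn3"])
```

After the run, every DataNode in the pipeline holds both packets, which is exactly the replication guarantee the acknowledgment chain enforces.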

Read workflow:

Client asks the NameNode for the file’s block locations and opens an input stream.

Client selects a nearby DataNode and establishes an input stream.

DataNode streams data to the client in packets, each containing checksummed chunks.

Client closes the input stream after reading.
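The “select a nearby DataNode” step can be sketched as picking the replica with the smallest network distance. Real HDFS computes distance from rack topology; here a hypothetical distance map stands in for that logic.

```python
def choose_replica(replicas, distance):
    """Pick the replica with the smallest distance from the client.

    Unknown nodes get infinite distance so they are chosen last.
    """
    return min(replicas, key=lambda dn: distance.get(dn, float("inf")))

# Hypothetical topology: dn2 sits on the same rack as the client.
dist = {"dn1": 4, "dn2": 2, "dn3": 6}
best = choose_replica(["dn1", "dn2", "dn3"], dist)
```

With these made-up distances the client would read the block from dn2, the same-rack replica.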

Data integrity: Each chunk carries a checksum; packets aggregate chunks, and blocks aggregate packets. The client computes checksums while writing, and each DataNode stores them in a separate metadata file alongside the block data. During reads, the client recomputes checksums and compares them with the stored values, fetching the block from another replica if a mismatch is found.
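The verify-on-read step looks roughly like the following sketch. HDFS uses a CRC per 512‑byte chunk (CRC32/CRC32C); here Python’s `zlib.crc32` stands in to show the mechanics.

```python
import zlib

CHUNK = 512  # data bytes per chunk, as in HDFS

def checksums(data: bytes) -> list:
    """One CRC32 per 512-byte chunk, computed at write time."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data: bytes, stored_sums: list) -> bool:
    """Recompute checksums on read and compare with the stored ones."""
    return checksums(data) == stored_sums

payload = bytes(1300)        # spans three chunks (512 + 512 + 276)
stored = checksums(payload)  # what the writer recorded
ok = verify(payload, stored)
corrupted_ok = verify(b"\x01" + payload[1:], stored)
```

A clean read verifies; flipping a single byte makes the first chunk’s CRC mismatch, which is the signal for the client to try another replica.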

Big Data · distributed file system · HDFS · block storage · Data Integrity
Written by

Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies
