Big Data · 10 min read

Key Concepts of Kafka, Hadoop Shuffle, Spark Cluster Modes, HDFS I/O, and Spark RDD Operations

This article explains Kafka message structure and offset retrieval, details Hadoop's map and reduce shuffle processes, outlines Spark's deployment modes, describes HDFS read/write mechanisms, compares reduceByKey and groupByKey performance, and discusses Spark streaming integration with Kafka and data loss prevention.

Big Data Technology Architecture

Kafka Message Structure: A Kafka message consists of a fixed-length header and a variable-length body. The header contains a 4-byte CRC32 checksum and a 1-byte magic field; when magic=1, an additional 1-byte attributes field is present. The body carries the key and value payloads.
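As a conceptual sketch of the layout described above (CRC first, then magic, then an attributes byte when magic=1; the length-prefixed key/value encoding is an illustrative assumption, not the exact wire format of modern Kafka), a message could be unpacked like this:

```python
import struct

def parse_message(buf: bytes) -> dict:
    """Parse a toy Kafka-style message: CRC32, magic, optional attributes,
    then length-prefixed key and value."""
    crc, magic = struct.unpack_from(">IB", buf, 0)
    offset = 5
    attributes = None
    if magic == 1:                      # magic=1 adds an attributes byte
        attributes = buf[offset]
        offset += 1
    key_len, = struct.unpack_from(">i", buf, offset)
    offset += 4
    key = buf[offset:offset + key_len] if key_len >= 0 else None
    offset += max(key_len, 0)
    val_len, = struct.unpack_from(">i", buf, offset)
    offset += 4
    value = buf[offset:offset + val_len] if val_len >= 0 else None
    return {"crc": crc, "magic": magic, "attributes": attributes,
            "key": key, "value": value}
```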

Viewing Kafka Offsets: With the new Consumer client (Kafka 0.9+), call consumer.seekToEnd() followed by consumer.position() to obtain the latest offset of a partition.
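To illustrate the semantics of that seekToEnd/position pairing without a live broker, here is a toy in-memory stand-in (not the real client; the class and its log are invented for illustration):

```python
class ToyConsumer:
    """In-memory stand-in for the seekToEnd()/position() idiom.

    `log` plays the role of one partition's message log; the latest offset
    is the log end offset, i.e. the offset of the next message to arrive.
    """
    def __init__(self, log):
        self._log = list(log)
        self._position = 0

    def seek_to_end(self):
        # Move the fetch position to the log end offset.
        self._position = len(self._log)

    def position(self):
        # Offset of the next record that would be fetched.
        return self._position

consumer = ToyConsumer(log=["m0", "m1", "m2"])
consumer.seek_to_end()
latest = consumer.position()  # log end offset: 3
```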

Hadoop Shuffle Process

Map side: Each map task writes its intermediate results to local disk. Records are first buffered in memory and spilled to disk when a threshold is reached (by default, 80% of the 100 MB sort buffer). Before each spill, data is sorted by partition and then by key and optionally passed through a combiner; the spill files are finally merged into a single partitioned, sorted file.

Reduce side: Each reducer fetches its assigned partition from every map task's output, merge-sorts the fetched runs, and then runs the reduce function, writing the final output to HDFS.

Spark Cluster Deployment Modes

Spark can run in standalone, YARN, Mesos, or cloud (e.g., AWS EC2) modes. Standalone follows a master/slave architecture with optional ZooKeeper HA; YARN and Mesos delegate resource management to their respective frameworks.
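As illustration, the mode is selected by the master URL passed to spark-submit (host names and application files below are placeholders):

```shell
# Standalone cluster (master/slave, optional ZooKeeper HA)
spark-submit --master spark://master-host:7077 app.jar
# YARN (resource management delegated to YARN)
spark-submit --master yarn --deploy-mode cluster app.jar
# Mesos
spark-submit --master mesos://mesos-host:5050 app.jar
# Local testing with 4 threads
spark-submit --master "local[4]" app.py
```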

HDFS Read/Write Process

Read: The client asks the NameNode for the block locations, selects the closest DataNode holding each block, opens a socket stream, receives data packets, and writes them locally.

Write: The client requests upload permission from the NameNode, receives a list of three DataNodes, establishes a pipeline (A→B→C), and streams packets through the pipeline while awaiting acknowledgments that propagate back upstream (C→B→A).
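The write pipeline can be modeled as a chain of nodes, each forwarding the packet downstream and acknowledging only after the rest of the pipeline has acked (a conceptual sketch, not HDFS code):

```python
class DataNode:
    """Toy model of one DataNode in an HDFS write pipeline."""
    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream
        self.blocks = []

    def write(self, packet):
        # Store the packet, forward it downstream, and ack only once the
        # downstream nodes have acked, so the ack propagates C -> B -> A.
        self.blocks.append(packet)
        if self.downstream is not None:
            assert self.downstream.write(packet) == "ack"
        return "ack"

# Pipeline A -> B -> C, as handed out by the NameNode.
c = DataNode("C")
b = DataNode("B", downstream=c)
a = DataNode("A", downstream=b)
ack = a.write(b"packet-0")   # client streams to A only
```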

RDD reduceByKey vs. groupByKey

reduceByKey performs a local merge on each mapper (much like a combiner), which shrinks the data shuffled to the reducers and improves both speed and memory usage. groupByKey shuffles every value for each key to the reducer side, causing a full data transfer and risking OOM errors on skewed keys.
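The difference can be made concrete by counting how many records cross the shuffle boundary (a pure-Python simulation, not the Spark API):

```python
from collections import defaultdict

def shuffled_records_group_by_key(mapper_outputs):
    # groupByKey: every (key, value) pair is shipped to the reducers as-is.
    return sum(len(out) for out in mapper_outputs)

def shuffled_records_reduce_by_key(mapper_outputs):
    # reduceByKey: each mapper merges values locally first, so it ships
    # only one record per distinct key.
    total = 0
    for out in mapper_outputs:
        local = defaultdict(int)
        for k, v in out:
            local[k] += v        # local combine, like a combiner
        total += len(local)
    return total

# 4 mappers, each emitting 1000 records for each of 2 keys.
mappers = [[("a", 1)] * 1000 + [("b", 1)] * 1000 for _ in range(4)]
# groupByKey ships 8000 records; reduceByKey ships 8 (2 keys x 4 mappers).
```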

Spark 2.0 Highlights

Provides ANSI SQL support, a more efficient unified DataFrame/Dataset API, faster execution via whole-stage code generation, and Structured Streaming for advanced streaming workloads.

RDD Dependencies

In a wide dependency (e.g., groupByKey, reduceByKey, sortByKey), a single parent partition may feed multiple child partitions, which triggers a shuffle. In a narrow dependency (e.g., map, filter, union), each parent partition is consumed by at most one child partition, so no shuffle is needed.

Spark Streaming Kafka Integration

Two approaches: Receiver-based (uses Kafka's high-level consumer API, stores data in executor memory, and requires a write-ahead log for fault tolerance) and Direct (introduced in Spark 1.3; the driver queries Kafka for the latest offsets each batch and executors read the data directly via the simple consumer API).
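The direct approach can be sketched as the driver computing a [from, until) offset range per partition for each batch and executors reading exactly that range (a toy model mirroring the idea of Spark's offset ranges, not its API):

```python
def plan_batch(last_offsets, latest_offsets):
    """Driver side: compute the [from, until) range to read per partition."""
    return {p: (last_offsets.get(p, 0), latest)
            for p, latest in latest_offsets.items()}

def read_batch(partition_log, offset_range):
    """Executor side: read exactly the planned range, no receiver needed."""
    frm, until = offset_range
    return partition_log[frm:until]

partition_log = ["m0", "m1", "m2", "m3"]      # one Kafka partition's messages
ranges = plan_batch({0: 1}, {0: len(partition_log)})
batch = read_batch(partition_log, ranges[0])  # ["m1", "m2", "m3"]
```

Because the consumed ranges are known exactly, a failed batch can simply be re-read from Kafka.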

Kafka Data Storage

Kafka primarily relies on sequential disk I/O, which can match or exceed memory speed due to Linux read‑ahead, write‑behind, and page cache, while avoiding JVM GC overhead.

Preventing Kafka Data Loss

Producer side: set acks=all (wait for all in-sync replicas) and enable retries. Broker side: use a replication factor of at least 2 with sufficient partitions spread across brokers, and set min.insync.replicas accordingly. Consumer side: disable auto-commit ( enable.auto.commit=false ) and commit offsets only after successful processing.
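The consumer-side pattern (commit only after processing, giving at-least-once delivery) can be sketched with a toy consumer (not the real client; names are illustrative):

```python
class AtLeastOnceConsumer:
    """Commits the offset only AFTER the record is processed, so a crash
    between processing and commit causes a re-read, never a loss."""
    def __init__(self, log):
        self._log = log
        self.committed = 0                    # last committed offset

    def poll_and_process(self, process):
        for offset in range(self.committed, len(self._log)):
            process(self._log[offset])        # handle the record first
            self.committed = offset + 1       # then commit, never before

processed = []
consumer = AtLeastOnceConsumer(["e0", "e1", "e2"])
consumer.poll_and_process(processed.append)
```

With auto-commit enabled, offsets could instead be committed before processing finishes, silently dropping records on a crash.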

Tags: Big Data, Streaming, Kafka, HDFS, Spark, Hadoop, RDD
Written by Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies