Key Concepts of Kafka, Hadoop Shuffle, Spark Cluster Modes, HDFS I/O, and Spark RDD Operations
This article explains Kafka message structure and offset retrieval, details Hadoop's map and reduce shuffle processes, outlines Spark's deployment modes, describes HDFS read/write mechanisms, compares reduceByKey and groupByKey performance, and discusses Spark streaming integration with Kafka and data loss prevention.
Kafka Message Structure: A Kafka message consists of a fixed-length header and a variable-length body. The header contains a 4-byte CRC32 checksum and a 1-byte magic field; when magic=1, a 1-byte attributes field follows. The body carries the key/value payload.
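As a rough illustration of that layout, here is a toy Python sketch that packs and parses a message in the legacy format described above (CRC32, magic, optional attributes byte, then length-prefixed key and value). The exact field widths for key/value lengths are assumptions for the example, not the authoritative wire format.

```python
import struct
import zlib

def build_message(key: bytes, value: bytes, magic: int = 1) -> bytes:
    """Pack a message in the sketched legacy layout:
    4-byte CRC32 | 1-byte magic | 1-byte attributes (only when magic=1) | body."""
    attributes = b"\x00" if magic == 1 else b""
    body = (struct.pack(">i", len(key)) + key +
            struct.pack(">i", len(value)) + value)
    payload = struct.pack(">b", magic) + attributes + body
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return struct.pack(">I", crc) + payload

def parse_message(raw: bytes):
    """Verify the CRC, then split the header fields from the key/value body."""
    crc, = struct.unpack(">I", raw[:4])
    payload = raw[4:]
    assert zlib.crc32(payload) & 0xFFFFFFFF == crc, "corrupt message"
    magic = payload[0]
    offset = 2 if magic == 1 else 1        # skip the attributes byte when magic=1
    body = payload[offset:]
    key_len, = struct.unpack(">i", body[:4])
    key = body[4:4 + key_len]
    rest = body[4 + key_len:]
    value_len, = struct.unpack(">i", rest[:4])
    value = rest[4:4 + value_len]
    return magic, key, value

msg = build_message(b"user-1", b"hello")
print(parse_message(msg))  # (1, b'user-1', b'hello')
```

The CRC check is what lets brokers and consumers detect corrupt messages before handing them to application code.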
Viewing Kafka Offsets: In versions 0.9 and later, the new consumer client provides consumer.seekToEnd() followed by consumer.position() to obtain the latest offset of a partition.
Hadoop Shuffle Process
Map side : Map tasks write intermediate results to local disk: output is first buffered in memory, then spilled to disk once a threshold is reached. Before each spill, records are sorted by partition and key and optionally run through a combiner; at the end, the spill files are merged into a single, partitioned output file.
Reduce side : Reducers copy partitioned map outputs, then sort/merge them, and finally run the reduce function to produce the final output written to HDFS.
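The two phases above can be sketched with a toy simulation: a map side that sorts by (partition, key) and optionally combines, and a reduce side that merges the sorted runs. The partitioner and combiner here are illustrative stand-ins (a byte-sum hash and a local sum), not Hadoop's actual implementations.

```python
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2

def map_side(records, combine=True):
    """Buffer map output, sort by (partition, key), optionally combine."""
    buffered = [(sum(k.encode()) % NUM_REDUCERS, k, v) for k, v in records]
    buffered.sort(key=itemgetter(0, 1))          # sort by partition, then key
    if not combine:
        return buffered
    spill = []
    for (part, key), group in groupby(buffered, key=lambda t: (t[0], t[1])):
        spill.append((part, key, sum(v for _, _, v in group)))  # combiner: local sum
    return spill

def reduce_side(spills):
    """Reducers merge the sorted map outputs and apply the reduce function."""
    output = {}
    for part, key, value in sorted(s for spill in spills for s in spill):
        output[key] = output.get(key, 0) + value
    return output

maps = [map_side([("a", 1), ("b", 1), ("a", 1)]),
        map_side([("b", 1), ("a", 1)])]
print(reduce_side(maps))  # {'b': 2, 'a': 3}
```

Note how the combiner shrinks each spill to one record per distinct key before anything crosses the network, which is exactly why reduceByKey-style operations (below) are cheaper than shipping raw records.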
Spark Cluster Deployment Modes
Spark can run in standalone, YARN, Mesos, or cloud (e.g., AWS EC2) modes. Standalone follows a master/slave architecture with optional ZooKeeper HA; YARN and Mesos delegate resource management to their respective frameworks.
HDFS Read/Write Process
Read : Client contacts the NameNode for block locations, selects a DataNode, opens a socket, receives data packets, and writes them locally.
Write : Client requests upload permission from the NameNode, receives a list of three DataNodes, establishes a pipeline (A→B→C), and streams packets through the pipeline while awaiting acknowledgments.
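The write pipeline can be modeled with three stand-in DataNodes, where each node persists a packet, forwards it downstream, and passes acknowledgments back up. This is a minimal sketch of the data flow, not HDFS's actual protocol.

```python
class DataNode:
    """Stand-in for a DataNode: store a packet, forward it, ack upstream."""
    def __init__(self, name, downstream=None):
        self.name, self.downstream, self.blocks = name, downstream, []

    def receive(self, packet):
        self.blocks.append(packet)                                 # persist locally
        acks = self.downstream.receive(packet) if self.downstream else []
        return acks + [self.name]      # downstream acks travel back through this node

# The NameNode hands back three DataNodes; the client builds the pipeline A -> B -> C.
c = DataNode("C")
b = DataNode("B", downstream=c)
a = DataNode("A", downstream=b)

for packet in ["pkt-0", "pkt-1"]:
    acks = a.receive(packet)           # the client streams into A only
    print(packet, acks)                # acks arrive in order C, B, A

assert a.blocks == b.blocks == c.blocks == ["pkt-0", "pkt-1"]
```

The key property the sketch shows is that the client talks to only the first DataNode, yet every replica sees every packet, and the client learns of success only once acknowledgments have propagated back through the whole chain.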
RDD reduceByKey vs. groupByKey
reduceByKey performs a local merge on each mapper (similar to a combiner), reducing data transferred to reducers and improving speed and memory usage. groupByKey aggregates all values for each key on the reducer side, causing full data shuffling and potential OOM errors.
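The performance gap comes down to how many records cross the shuffle boundary. The toy model below (not Spark itself) counts shuffled records with and without a map-side combine:

```python
from collections import defaultdict

def shuffle_volume(partitions, map_side_combine):
    """Count how many records each map partition sends across the shuffle."""
    shuffled = 0
    for part in partitions:
        if map_side_combine:                       # reduceByKey: merge locally first
            local = defaultdict(int)
            for k, v in part:
                local[k] += v
            shuffled += len(local)                 # one record per distinct key
        else:                                      # groupByKey: ship every record
            shuffled += len(part)
    return shuffled

parts = [[("a", 1)] * 1000 + [("b", 1)] * 1000,
         [("a", 1)] * 1000]
print(shuffle_volume(parts, map_side_combine=True))   # 3 records shuffled
print(shuffle_volume(parts, map_side_combine=False))  # 3000 records shuffled
```

With few distinct keys and many records, the combine step cuts shuffle traffic by orders of magnitude, which is also why groupByKey can OOM a reducer that must hold every value for a hot key.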
Spark 2.0 Highlights
Provides ANSI SQL support, unified DataFrame/Dataset APIs, faster execution through whole-stage code generation (the Tungsten engine), and Structured Streaming for advanced streaming workloads.
RDD Dependencies
Wide dependencies (e.g., groupByKey, reduceByKey, sortByKey) mean a parent partition may feed multiple child partitions, so computing a child partition requires a shuffle. Narrow dependencies (e.g., map, filter, union) mean each child partition depends on a fixed, small set of parent partitions (typically one-to-one), so no shuffle is needed.
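The distinction can be made concrete with a small sketch: a narrow map computes child partition i from parent partition i alone, while a groupByKey-style wide operation must pull matching keys from every parent partition. The byte-sum partitioner is an illustrative assumption.

```python
def narrow_map(parent, f):
    """map: child partition i is computed from parent partition i alone."""
    return [[f(x) for x in part] for part in parent]

def wide_group_by_key(parent, num_child):
    """groupByKey-style shuffle: every child partition may read every parent."""
    child = [dict() for _ in range(num_child)]
    for part in parent:                       # each parent partition contributes
        for k, v in part:
            dest = child[sum(k.encode()) % num_child]  # toy deterministic partitioner
            dest.setdefault(k, []).append(v)
    return child

parent = [[("a", 1), ("b", 2)], [("a", 3)]]
print(narrow_map(parent, lambda kv: (kv[0], kv[1] * 10)))
# [[('a', 10), ('b', 20)], [('a', 30)]]   -- partition boundaries unchanged
print(wide_group_by_key(parent, 2))
# [{'b': [2]}, {'a': [1, 3]}]             -- key 'a' gathered from both parents
```

Because narrow dependencies preserve partition boundaries, Spark can pipeline them within a stage; a wide dependency forces a stage boundary and a shuffle.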
Spark Streaming Kafka Integration
Two approaches: Receiver‑based (uses Kafka high‑level Consumer API, stores data in executor memory, requires Write‑Ahead Log for fault tolerance) and Direct (introduced in Spark 1.3, queries Kafka for offsets and reads data directly via the simple consumer API).
Kafka Data Storage
Kafka primarily relies on sequential disk I/O. Thanks to Linux read-ahead, write-behind, and the page cache, sequential access can approach the throughput of memory access, while keeping data out of the JVM heap avoids GC overhead.
Preventing Kafka Data Loss
Producer side: set acks=all so sends are acknowledged only after all in-sync replicas have the data, and enable retries. Broker side: use a topic replication factor of at least 2 and spread sufficient partitions across brokers. Consumer side: disable auto-commit (enable.auto.commit=false) and commit offsets only after successful processing.
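The consumer-side rule, commit only after processing, yields at-least-once delivery: a crash between processing and commit replays the batch rather than losing it. The sketch below uses an in-memory stub consumer (not the Kafka client API) to show the loop shape.

```python
class StubConsumer:
    """In-memory stand-in for a Kafka consumer with auto-commit disabled."""
    def __init__(self, records):
        self.records, self.committed = records, 0

    def poll(self, batch=2):
        start = self.committed                 # resume from the last committed offset
        return list(enumerate(self.records[start:start + batch], start))

    def commit(self, offset):
        self.committed = offset                # everything before `offset` is done

def run(consumer, process):
    while True:
        batch = consumer.poll()
        if not batch:
            break
        for offset, record in batch:
            process(record)                    # process first...
        consumer.commit(batch[-1][0] + 1)      # ...commit only after success

seen = []
run(StubConsumer(["m0", "m1", "m2"]), seen.append)
print(seen)  # ['m0', 'm1', 'm2']
```

If process() raised before commit(), the stub (like a real consumer restarting from its committed offset) would re-deliver the uncommitted records, trading possible duplicates for zero loss.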