
In-Depth Overview of Apache Kafka Architecture and Core Concepts

This article provides a comprehensive introduction to Apache Kafka, covering its distributed streaming platform features, message queue patterns, topic and partition design, broker and cluster roles, producer and consumer mechanics, partition assignment strategies, data storage, reliability guarantees, and performance optimizations such as zero‑copy and batch processing.


Apache Kafka is a distributed streaming platform built around a publish/subscribe message log, offering three key capabilities: publishing and subscribing to streams of records, storing those records durably with fault tolerance, and processing them in real time.

It supports two messaging models: point‑to‑point (queue) where each message is consumed by a single consumer, and publish/subscribe (topic) where messages are delivered to all subscribed consumers.

Kafka is suitable for building real‑time data pipelines and streaming applications, enabling reliable data transfer between systems and on‑the‑fly data transformation.

Core concepts include clusters of brokers, topics that act like tables, and partitions that are ordered logs; each partition guarantees order only within itself. Retention policies control how long records are kept.
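To make the per-partition ordering guarantee concrete, here is a minimal, self-contained sketch (plain Java, no Kafka dependency; the modulo hash is a stand-in for Kafka's murmur2-based default) showing that records sharing a key always land in the same partition, so their relative order is preserved there even though ordering across partitions is not:

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionOrdering {
    // Stand-in for Kafka's key hashing: the same key always maps to the same partition.
    static int partitionFor(String key, int numPartitions) {
        return Math.abs(key.hashCode() % numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) partitions.add(new ArrayList<>());

        // Interleaved writes for two independent keys.
        String[][] records = {{"order-1", "created"}, {"order-2", "created"},
                              {"order-1", "paid"},    {"order-2", "cancelled"}};
        for (String[] r : records) {
            partitions.get(partitionFor(r[0], numPartitions)).add(r[0] + ":" + r[1]);
        }
        // All events for order-1 sit in one partition, in publish order.
        System.out.println(partitions.get(partitionFor("order-1", numPartitions)));
    }
}
```

This is why choosing a good partitioning key matters: ordering is guaranteed per key only when all of that key's records route to one partition.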

Producers publish messages to topics, optionally specifying partitions via key hashing or round‑robin, while consumers read messages, track offsets, and belong to consumer groups that ensure each partition is processed by only one consumer in the group.

Kafka brokers handle requests using acceptor and processor threads, store data on disk in segment files with accompanying index files, and manage replication through leader‑follower mechanisms, maintaining an in‑sync replica (ISR) set for durability.
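To illustrate how a sparse .index file locates a record inside a segment, the sketch below (hypothetical index entries, plain Java) binary-searches for the greatest indexed offset not larger than the target; the returned byte position is where a broker would start its sequential scan. This is a simplified model of the lookup, not broker code:

```java
public class OffsetIndexLookup {
    // Sparse index entries: {relative offset, byte position in the segment file}.
    static final int[][] INDEX = {{0, 0}, {40, 4096}, {80, 8192}, {120, 12288}};

    // Returns the byte position of the greatest indexed offset <= targetOffset.
    static int floorPosition(int targetOffset) {
        int lo = 0, hi = INDEX.length - 1, pos = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (INDEX[mid][0] <= targetOffset) {
                pos = INDEX[mid][1]; // candidate; try to find a later entry
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return pos;
    }

    public static void main(String[] args) {
        // Offset 75 is not indexed; the scan starts at byte 4096 (entry for offset 40).
        System.out.println(floorPosition(75));
    }
}
```

Because the index is sparse, only a short stretch of the segment file is scanned after the jump, which keeps lookups cheap without indexing every record.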

Reliability is configured via acknowledgment levels (acks=0, 1, or -1/all) and replication settings; with acks=-1 the leader acknowledges a write only after all followers in the ISR have persisted the data.
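A producer configuration for the strictest setting might look like the following sketch (the property keys are standard Kafka client configuration names; the broker address is a placeholder):

```java
import java.util.Properties;

public class ReliableProducerConfig {
    static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        // acks=-1 (equivalently "all"): the leader acknowledges only after
        // every in-sync replica has persisted the record.
        props.put("acks", "all");
        props.put("retries", "3");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build());
    }
}
```

On the broker/topic side, min.insync.replicas additionally bounds how small the ISR may shrink before acks=all writes are rejected, which is what turns acks=-1 into a real durability guarantee.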

Consumer offset management has moved from ZooKeeper to an internal __consumer_offsets topic, enabling fault‑tolerant recovery.
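__consumer_offsets is a compacted topic keyed by (group, topic, partition), so recovery only needs the latest committed value per key. A minimal pure-Java sketch of that "last write wins" recovery (the key format here is illustrative, not Kafka's actual record schema):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OffsetRecovery {
    // Replays committed-offset records; a later commit for the same key
    // overwrites the earlier one, mimicking log compaction.
    static Map<String, Long> recover(String[][] commits) {
        Map<String, Long> latest = new LinkedHashMap<>();
        for (String[] c : commits) {
            latest.put(c[0], Long.parseLong(c[1])); // key = "group:topic:partition"
        }
        return latest;
    }

    public static void main(String[] args) {
        String[][] commits = {
            {"g1:orders:0", "10"}, {"g1:orders:1", "7"}, {"g1:orders:0", "25"}
        };
        // After a failover, group g1 resumes partition 0 at offset 25, not 10.
        System.out.println(recover(commits));
    }
}
```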

Partition assignment strategies include RangeAssignor, RoundRobinAssignor, and StickyAssignor, each balancing load differently while minimizing reassignment disruption.
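To see how the strategies differ, here is a small pure-Java sketch of RangeAssignor's rule for a single topic: partitions are split into contiguous ranges, and when the division is uneven the first consumers each receive one extra partition. This is a simplified model of the rule, not the client's implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class RangeAssignSketch {
    // Splits numPartitions of one topic across numConsumers, RangeAssignor-style:
    // each consumer gets a contiguous block, and the first
    // (numPartitions % numConsumers) consumers get one extra partition.
    static List<List<Integer>> assign(int numPartitions, int numConsumers) {
        List<List<Integer>> result = new ArrayList<>();
        int base = numPartitions / numConsumers;
        int extra = numPartitions % numConsumers;
        int next = 0;
        for (int c = 0; c < numConsumers; c++) {
            int count = base + (c < extra ? 1 : 0);
            List<Integer> mine = new ArrayList<>();
            for (int i = 0; i < count; i++) mine.add(next++);
            result.add(mine);
        }
        return result;
    }

    public static void main(String[] args) {
        // 7 partitions, 3 consumers -> [0, 1, 2], [3, 4], [5, 6]
        System.out.println(assign(7, 3));
    }
}
```

Because this rule is applied per topic, subscribing to many topics with RangeAssignor can pile the "extra" partitions onto the same leading consumers, which is the imbalance RoundRobinAssignor and StickyAssignor address.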

Performance optimizations such as sequential disk writes, zero‑copy transfer, batching, and configurable parameters (batch.size, linger.ms, max.in.flight.requests.per.connection) improve throughput and latency.
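The interplay of batch.size and linger.ms can be sketched with a simple flush predicate: a batch is sent when it reaches batch.size bytes or when linger.ms has elapsed since its first record, whichever comes first. This is a simplified model with explicit timestamps, not the client's actual record accumulator:

```java
public class BatchTrigger {
    // Returns true when the accumulator should flush the current batch.
    static boolean shouldFlush(int batchBytes, int batchSize,
                               long firstRecordMs, long nowMs, long lingerMs) {
        return batchBytes >= batchSize || (nowMs - firstRecordMs) >= lingerMs;
    }

    public static void main(String[] args) {
        int batchSize = 16384; // the client's default batch.size
        long lingerMs = 5;
        // A full batch flushes immediately...
        System.out.println(shouldFlush(16384, batchSize, 0, 1, lingerMs)); // true
        // ...while a small batch waits until linger.ms expires.
        System.out.println(shouldFlush(512, batchSize, 0, 1, lingerMs));   // false
        System.out.println(shouldFlush(512, batchSize, 0, 5, lingerMs));   // true
    }
}
```

Raising linger.ms trades a little latency for larger batches and better throughput; max.in.flight.requests.per.connection then controls how many such batches may be awaiting acknowledgment at once.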

Code example for partitioning logic, modeled on Kafka's built-in DefaultPartitioner:

public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
    List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
    int numPartitions = partitions.size();
    if (keyBytes == null) {
        // No key: spread records round-robin across the available partitions.
        int nextValue = nextValue(topic);
        List<PartitionInfo> availablePartitions = cluster.availablePartitionsForTopic(topic);
        if (availablePartitions.size() > 0) {
            int part = Utils.toPositive(nextValue) % availablePartitions.size();
            return availablePartitions.get(part).partition();
        } else {
            // No partition is currently available; return a possibly unavailable one.
            return Utils.toPositive(nextValue) % numPartitions;
        }
    } else {
        // Keyed record: murmur2-hash the key so the same key always maps
        // to the same partition.
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }
}

// topicCounterMap is a ConcurrentMap<String, AtomicInteger> field holding one
// round-robin counter per topic, seeded randomly so producers do not all
// start on partition 0.
private int nextValue(String topic) {
    AtomicInteger counter = topicCounterMap.get(topic);
    if (null == counter) {
        counter = new AtomicInteger(ThreadLocalRandom.current().nextInt());
        AtomicInteger currentCounter = topicCounterMap.putIfAbsent(topic, counter);
        if (currentCounter != null) {
            // Another thread registered a counter first; use theirs.
            counter = currentCounter;
        }
    }
    return counter.getAndIncrement();
}


Written by Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Welcomes exchanges with like‑minded architects who enjoy sharing and learning.
