
Kafka Concept Overview

This article provides a comprehensive introduction to Kafka, covering its definition, message‑queue models, architecture components, installation steps, configuration details, producer and consumer mechanisms, reliability guarantees, partition assignment strategies, offset management, and high‑performance read/write techniques.

1. Kafka Concept Overview
1.1 Definition

Kafka is a distributed, publish/subscribe‑based message queue primarily used for real‑time processing in big‑data scenarios.

1.2 Message Queue

1.2.1 Traditional vs. Modern Queue Models

In a traditional synchronous flow, downstream steps (e.g., sending an SMS) run inside the main transaction before it can return; a message queue decouples those steps, so the main transaction responds immediately and downstream processing happens asynchronously.

1.2.2 Benefits of Using a Message Queue

A. Decoupling
B. Recoverability
C. Buffering
D. Flexibility & peak handling
E. Asynchronous communication

1.2.3 Queue Patterns

A. Point‑to‑point: each message is consumed by a single consumer.
B. Publish/Subscribe: messages are delivered to all subscribed consumers. Delivery can be push‑based (broker‑initiated) or pull‑based (consumer‑initiated); Kafka follows the publish/subscribe model with topics and uses pull.

1.3 Kafka Basic Architecture

The core components are brokers, producers, consumer groups, and ZooKeeper (for coordination).

Producers send messages; brokers store messages and host topics. Each topic is divided into partitions, and each partition has replicas.

Consumer groups read messages; within a group, each partition is consumed by exactly one consumer at a time, which is how Kafka parallelizes consumption. Consumers beyond the partition count sit idle, so a group should have no more consumers than partitions.

Before version 0.9, consumer offsets are stored in ZooKeeper; from 0.9 onward they are stored in an internal Kafka topic (__consumer_offsets).

1.4 Kafka Installation

A. Install by extracting the tar package:

tar -zxvf kafka_2.11-2.1.1.tgz -C /usr/local/

B. View configuration files:

cd /usr/local/kafka/config
ls -l

C. Edit server.properties to set broker.id (unique per broker), log.dirs (data directories), delete.topic.enable (topic deletion policy), log.retention.hours (log retention time), log.segment.bytes (log file size), zookeeper.connect (ZooKeeper connection), and num.partitions (default partition count).

1.5 Starting Kafka

A. Start each broker manually (blocking mode):

bin/kafka-server-start.sh config/server.properties

B. Recommended: start in daemon (background) mode:

bin/kafka-server-start.sh -daemon config/server.properties

1.6 Kafka Operations

The commands below assume ZooKeeper at localhost:2181 and an example topic named first.

A. List existing topics (via ZooKeeper):

bin/kafka-topics.sh --zookeeper localhost:2181 --list

B. Create a topic with specified partitions and replication factor:

bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic first --partitions 3 --replication-factor 2

C. Delete a topic (takes effect when delete.topic.enable=true):

bin/kafka-topics.sh --zookeeper localhost:2181 --delete --topic first

D. View topic details:

bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic first

2. Kafka Architecture Deep Dive

Kafka guarantees ordering only within a partition, not across partitions.

2.1 Workflow

Producers write to topics; consumers read from topics. Each partition has its own log file and offset; consumers track the offset they have processed.

2.2 Internals

Each partition is split into segments, each consisting of an index file and a log file. The index maps offsets to physical positions, enabling fast seeks.
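As a sketch of how the sparse index enables fast seeks (toy data and a plain list of pairs, not Kafka's actual binary .index format): each entry maps a relative offset to a byte position, and a lookup binary‑searches for the greatest indexed offset at or below the target, then scans the log file forward from there.

```python
import bisect

# Toy sparse offset index: (relative_offset, byte_position) pairs, sorted.
INDEX = [(0, 0), (4, 512), (8, 1100), (12, 1700)]

def seek_position(target_offset):
    """Return the byte position in the .log file to start scanning from."""
    offsets = [o for o, _ in INDEX]
    # Greatest indexed offset <= target; the log is then walked forward.
    i = bisect.bisect_right(offsets, target_offset) - 1
    return INDEX[i][1]

print(seek_position(6))   # offset 6 -> start scanning at byte 512 (entry for offset 4)
```

Because the index is sparse, it stays small enough to memory‑map, at the cost of a short forward scan in the log after the seek.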

3. Producers and Consumers

3.1 Producers

Partitions improve concurrency. A producer can send to an explicit partition, hash a record key so the same key always lands on the same partition, or distribute keyless records round‑robin.
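The routing rules can be sketched as a toy function (an illustration of the semantics, not Kafka's DefaultPartitioner):

```python
import itertools

_rr = itertools.count()  # round-robin counter for keyless records

def choose_partition(num_partitions, key=None, explicit=None):
    """Toy partitioner mirroring producer routing rules."""
    if explicit is not None:
        return explicit                      # caller pinned a partition
    if key is not None:
        return hash(key) % num_partitions    # same key -> same partition
    return next(_rr) % num_partitions        # keyless: spread round-robin

print([choose_partition(3) for _ in range(4)])  # [0, 1, 2, 0]
```

Keyed routing is what gives Kafka per‑key ordering: all records with one key share one partition, and ordering is guaranteed only within a partition.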

3.2 Reliability (acks)

Three ack levels:

A. acks=0 – the producer does not wait for any acknowledgement (lowest latency, highest loss risk).
B. acks=1 – the leader acknowledges after writing the message to its local log (data can be lost if the leader fails before followers replicate).
C. acks=-1 (all) – the leader acknowledges only after all in‑sync replicas (ISR) have written the message (highest durability; retries can produce duplicates).
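The three levels reduce to a small decision rule, sketched here as a toy function (an illustration of the semantics, not broker code):

```python
def broker_acks(acks, leader_persisted, isr_persisted):
    """When does the producer get its acknowledgement? (toy model)"""
    if acks == 0:
        return True                            # never waits at all
    if acks == 1:
        return leader_persisted                # leader's log only
    return leader_persisted and isr_persisted  # acks=-1/'all': full ISR

# Leader has written the record, but an ISR follower has not yet:
print(broker_acks(1, True, False))    # True  -> acked; lost if leader dies now
print(broker_acks(-1, True, False))   # False -> no ack until the ISR catches up
```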

3.3 Consumer Consistency (HW)

HW (high watermark) is the smallest LEO (log end offset) among the ISR; it defines the maximum offset visible to consumers, keeping consumers consistent across a leader failover.
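Numerically, with made‑up LEOs for a three‑replica ISR:

```python
def high_watermark(leo_by_replica):
    """HW = smallest log end offset (LEO) across the ISR."""
    return min(leo_by_replica.values())

isr = {"leader": 10, "follower-1": 8, "follower-2": 9}
print(high_watermark(isr))  # 8 -> consumers may read offsets 0..7 only
```

Offsets 8 and 9 exist on the leader but are invisible until every ISR member has them; if the leader dies, no consumer has seen data the new leader might lack.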

3.4 Consumers

3.4.1 Consumption Model

Kafka uses pull‑based consumption, allowing consumers to control their own read speed. The drawback is that a consumer may poll repeatedly when no data is available, which is mitigated by passing a timeout to the poll.
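The pull model can be sketched as a poll loop over a toy in‑memory "partition" (hypothetical names, not the consumer API): the consumer takes records at its own pace and backs off briefly when nothing is available.

```python
import time

def poll(partition, max_records=2, timeout=0.01):
    """Toy pull: take up to max_records; sleep briefly if the log is empty."""
    batch, partition[:] = partition[:max_records], partition[max_records:]
    if not batch:
        time.sleep(timeout)   # avoid a tight spin when there is no data
    return batch

log = ["m1", "m2", "m3"]
print(poll(log))  # ['m1', 'm2']
print(poll(log))  # ['m3']
print(poll(log))  # []
```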

3.4.2 Partition Assignment

Two strategies:

• RoundRobin – interleaves all subscribed topic‑partitions evenly across the group; balanced only when every consumer in the group subscribes to the same set of topics.
• Range – the default; assigns contiguous partition ranges per topic, so with multiple topics the first consumers repeatedly take the extra partitions, which can cause imbalance.
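The difference is easiest to see in a small simulation (toy re‑implementations of the two strategies; 2 consumers, two topics with 3 partitions each):

```python
from itertools import cycle

def range_assign(consumers, topic_partitions):
    """Range: split each topic's partitions contiguously; the first
    consumers take the remainder for EVERY topic, compounding imbalance."""
    consumers = sorted(consumers)
    out = {c: [] for c in consumers}
    for topic, parts in sorted(topic_partitions.items()):
        per, extra = divmod(len(parts), len(consumers))
        start = 0
        for i, c in enumerate(consumers):
            take = per + (1 if i < extra else 0)
            out[c] += [f"{topic}-{p}" for p in parts[start:start + take]]
            start += take
    return out

def round_robin_assign(consumers, topic_partitions):
    """RoundRobin: interleave ALL topic-partitions across the group."""
    out = {c: [] for c in sorted(consumers)}
    parts = [f"{t}-{p}" for t, ps in sorted(topic_partitions.items()) for p in ps]
    for c, part in zip(cycle(sorted(out)), parts):
        out[c].append(part)
    return out

topics = {"t0": [0, 1, 2], "t1": [0, 1, 2]}
print(range_assign(["c1", "c2"], topics))        # c1 gets 4 partitions, c2 gets 2
print(round_robin_assign(["c1", "c2"], topics))  # 3 and 3
```

With Range, c1 takes the extra partition of both topics (4 vs. 2); RoundRobin stays balanced at 3 each, but only because both consumers subscribe to both topics.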

3.4.3 Offset Management

Offsets are stored either in ZooKeeper (legacy) or in a dedicated Kafka topic.

3.4.4 Consumer Group Example

Example: give several consumers the same group.id, start them, and observe that each message in the topic is delivered to only one consumer within the group.

4. High‑Performance Read/Write Mechanisms

4.1 Distributed Deployment

Multiple nodes operate in parallel.

4.2 Sequential Disk Writes

Producers append to log files strictly sequentially; the Kafka documentation cites sequential disk throughput on the order of 600 MB/s, orders of magnitude faster than random writes on the same disks.
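A toy append‑only segment writer shows the access pattern (illustrative length‑prefixed records, not Kafka's on‑disk record layout): every record lands at the tail of the file, so writes never seek.

```python
import os
import struct
import tempfile

seg = os.path.join(tempfile.mkdtemp(), "00000000000000000000.log")
with open(seg, "ab") as log:
    for payload in [b"m1", b"m2", b"m3"]:
        # Length-prefixed record appended at the tail: purely sequential I/O.
        log.write(struct.pack(">I", len(payload)) + payload)

print(os.path.getsize(seg))  # 3 records x (4-byte length + 2-byte body) = 18
```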

4.3 Zero‑Copy

Kafka transfers data directly between kernel buffers, avoiding user‑space copies and boosting performance.
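The same primitive is exposed as os.sendfile on Linux. A minimal sketch (file‑to‑file for simplicity; Kafka uses sendfile to ship log segment bytes straight to consumer sockets):

```python
import os
import tempfile

d = tempfile.mkdtemp()
src, dst = os.path.join(d, "segment.log"), os.path.join(d, "out.bin")
with open(src, "wb") as f:
    f.write(b"record-batch-bytes")

with open(src, "rb") as fin, open(dst, "wb") as fout:
    # Bytes move kernel buffer -> kernel buffer; user space never copies them.
    os.sendfile(fout.fileno(), fin.fileno(), 0, os.path.getsize(src))

print(open(dst, "rb").read())  # the transferred segment bytes
```

With a read()/write() loop, each byte crosses into user space and back; sendfile skips both copies and both extra context switches.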

5. Role of ZooKeeper in Kafka

ZooKeeper elects a controller broker that manages broker membership, partition‑replica allocation, and leader election.

Tags: distributed systems, big data, streaming, Kafka, message queue, consumer, producer
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
