Introduction to Apache Kafka: Concepts, Architecture, and Core APIs
This article provides a comprehensive overview of Apache Kafka: its role in real‑time data pipelines and stream processing, key concepts such as topics, partitions, logs, producers, consumers, and replication, the guarantees it offers, and how it functions as both a messaging system and a storage system.
Introduction
Kafka is used for two broad classes of applications: (1) building real‑time data pipelines that reliably move data between systems or applications, and (2) building real‑time streaming applications that transform or react to streams of data.
Key Concepts
Kafka runs as a cluster on one or more servers.
The cluster stores streams of records in categories called topics.
Each record contains a key, a value, and a timestamp.
Core APIs
Producer API – sends data streams to one or more Kafka topics.
Consumer API – subscribes to topics and processes incoming data.
Streams API – enables applications to act as stream processors, transforming input topics to output topics.
Connector API – builds reusable producers and consumers to connect Kafka topics with external systems, such as relational databases.
Kafka client‑server communication uses a simple, efficient, language‑agnostic TCP protocol that is versioned and backward compatible; a Java client is provided, and clients exist for many other languages.
Topics and Logs
A topic is a category for publishing records and can have multiple subscribers.
Each topic is divided into partitions, each being an ordered, immutable commit log. Records in a partition are identified by a sequential offset.
Kafka retains all published records, whether or not they have been consumed, for a configurable retention period (e.g., two days), after which they are discarded to free space. Kafka's performance is effectively constant with respect to data size, so retaining data for a long time is not a problem.
The only metadata that must be persisted is the consumer offset, which determines where a consumer resumes reading.
Consumers can reset offsets to reprocess past data or skip data, allowing consumption of both past and future records without affecting other consumers.
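The log and offset mechanics above can be sketched with a toy in-memory partition. This is an illustration of the concept only, assuming nothing about Kafka's real client APIs; all class and method names here are invented for the example.

```python
# Sketch of a Kafka-style partition: an append-only log where each record
# gets a sequential offset, and each consumer tracks only its own position.

class PartitionLog:
    def __init__(self):
        self._records = []                  # append-only; offsets are indices

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1       # offset of the new record

    def read(self, offset):
        return self._records[offset]

    def end_offset(self):
        return len(self._records)

class Consumer:
    def __init__(self, log):
        self.log = log
        self.offset = 0                     # the only metadata the consumer keeps

    def poll(self):
        if self.offset < self.log.end_offset():
            record = self.log.read(self.offset)
            self.offset += 1
            return record
        return None

    def seek(self, offset):
        self.offset = offset                # rewind to reprocess, or skip ahead

log = PartitionLog()
for value in ["a", "b", "c"]:
    log.append(value)

consumer = Consumer(log)
print(consumer.poll())   # "a"
consumer.seek(0)         # reset the offset: reprocess from the beginning
print(consumer.poll())   # "a" again — other consumers are unaffected
```

Because each consumer owns its offset, rewinding one consumer has no effect on the log or on any other consumer reading the same partition.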
Distribution
Partition logs are distributed across cluster servers; each partition has a leader and zero or more followers. The leader handles all read/write requests, while followers replicate the leader. If a leader fails, a follower is automatically promoted.
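The leader/follower arrangement can be sketched in a few lines. This is a simplified model, not Kafka's actual controller logic (which also tracks in-sync replicas); the class and method names are invented for illustration.

```python
# Toy model of partition replication: one leader serves reads and writes,
# followers hold copies, and a follower is promoted if the leader fails.

class Partition:
    def __init__(self, replicas):
        self.replicas = list(replicas)       # broker ids hosting a copy
        self.leader = self.replicas[0]       # leader handles all reads/writes

    def fail(self, broker):
        self.replicas.remove(broker)
        if broker == self.leader and self.replicas:
            self.leader = self.replicas[0]   # promote a surviving follower

p = Partition(replicas=[1, 2, 3])            # replication factor 3
p.fail(1)
print(p.leader)   # 2 — a follower was promoted automatically
```

With a replication factor of N, up to N−1 such failures can occur before the last copy of the data is lost, which is the guarantee stated later in this article.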
Producers
Producers send data to chosen topics and decide which partition to write to, either by round‑robin or based on semantic keys.
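The two partitioning strategies described here can be sketched as follows. Note this is an illustration, not Kafka's built-in partitioner (which hashes keys with murmur2); `md5` and the names below are stand-ins chosen for the example.

```python
# Sketch of producer partition selection: round-robin for keyless records,
# deterministic key hashing otherwise.
import hashlib
import itertools

NUM_PARTITIONS = 4
_round_robin = itertools.cycle(range(NUM_PARTITIONS))

def choose_partition(key):
    if key is None:
        return next(_round_robin)        # spread keyless records evenly
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Records with the same key always land in the same partition,
# which is what preserves per-key ordering.
assert choose_partition("user-42") == choose_partition("user-42")
```

Keying by a semantic field (such as a user id) keeps all of that entity's records in one partition, so consumers see them in order.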
Consumers
Consumers belong to a consumer group identified by a group name. Messages in a topic are distributed among the instances of the same group, providing load balancing and fault tolerance.
If all consumer instances belong to the same group, each record is delivered to exactly one instance, load‑balancing the work across them; if the instances belong to different groups, every record is broadcast to each group.
The consumer group model combines queue‑like load balancing with publish‑subscribe broadcasting.
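These delivery semantics can be simulated in a few lines. This is a conceptual sketch, not a Kafka API: real assignment happens per partition rather than per record, and the function and group names below are invented for the example.

```python
# Sketch of consumer-group semantics: within a group, each record goes to
# exactly one instance (queue); across groups, every group sees every
# record (publish-subscribe).
from collections import defaultdict
import itertools

def deliver(records, groups):
    """groups: {group_name: [instance, ...]} -> records seen per instance."""
    received = defaultdict(list)
    for group, instances in groups.items():
        assignment = itertools.cycle(instances)   # stand-in for partition assignment
        for record in records:
            received[next(assignment)].append(record)
    return received

out = deliver(["r1", "r2", "r3", "r4"],
              {"billing": ["b1", "b2"], "audit": ["a1"]})
print(out["b1"])   # ['r1', 'r3'] — load-balanced within the billing group
print(out["a1"])   # ['r1', 'r2', 'r3', 'r4'] — the audit group sees everything
```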
Guarantees
Messages sent by a producer to a specific partition are appended in send order, preserving offset order.
Consumers read messages in the order they are stored in the log.
With a replication factor of N, the system can tolerate N‑1 broker failures without losing committed data.
Kafka as a Messaging System
Kafka blends queue and publish‑subscribe models: each topic supports both semantics, offering stronger ordering guarantees than traditional messaging systems.
Consumer groups enable parallel processing (like queues) while still allowing broadcast to multiple groups (like pub‑sub).
Kafka as a Storage System
Kafka persists data to disk and replicates it for fault tolerance, allowing high‑performance, low‑latency log storage from small to petabyte‑scale datasets.
Clients can control read positions, making Kafka suitable as a durable, distributed log storage system.
Kafka for Stream Processing
Beyond read/write and storage, Kafka provides real‑time stream processing via the Streams API, allowing developers to build applications that transform, aggregate, and join streams.
Simple processing can be done with producers and consumers; complex transformations use the Streams API, which leverages Kafka’s core functions, state storage, and group mechanisms for fault‑tolerant processing.
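A classic Streams-style computation is a stateful word count: consume an input stream, update local state, and emit updated counts downstream. The sketch below simulates that idea in plain Python; it is not the Streams API itself (which is a Java library), and the function name is invented for the example.

```python
# Toy stateful stream transform: read records, maintain per-word counts in
# a local state store, and emit an updated count after each record.
from collections import Counter

def word_count(input_stream):
    state = Counter()                  # local state store, one count per word
    for line in input_stream:
        for word in line.lower().split():
            state[word] += 1
            yield (word, state[word])  # emit the updated count downstream

updates = list(word_count(["hello kafka", "hello streams"]))
print(updates)
# [('hello', 1), ('kafka', 1), ('hello', 2), ('streams', 1)]
```

In real Streams applications, the state store is fault-tolerant because it is backed by a changelog topic in Kafka itself, which is what the text means by leveraging Kafka's core functions for fault-tolerant processing.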
Putting the Pieces Together
Kafka combines messaging, storage, and stream processing into a unified platform, enabling applications to process both historical and future data continuously.
By integrating low‑latency subscription with durable storage, Kafka serves as a high‑performance pipeline for both real‑time and batch workloads.