
Introduction to Apache Kafka: Concepts, Architecture, and Core APIs

This article provides an overview of Apache Kafka, explaining its role in real‑time data pipelines and stream processing. It covers key concepts such as topics, partitions, logs, producers, consumers, replication, and delivery guarantees, and shows how Kafka functions as a messaging system, a storage system, and a stream processing platform.


Introduction

Kafka is used for two major types of applications: (1) building real‑time data pipelines that reliably move data between systems, and (2) building real‑time streaming applications that transform or react to streams of data.

Key Concepts

Kafka runs as a cluster on one or more servers.

The cluster stores streams of records in categories called topics.

Each record contains a key, a value, and a timestamp.

Core APIs

Producer API – sends data streams to one or more Kafka topics.

Consumer API – subscribes to topics and processes incoming data.

Streams API – enables applications to act as stream processors, transforming input topics to output topics.

Connector API – builds reusable producers and consumers to connect Kafka topics with external systems, such as relational databases.

Kafka client‑server communication uses a simple, efficient, language‑agnostic TCP protocol that is versioned and backward compatible. An official Java client is provided, and clients exist for many other languages.

Topics and Logs

A topic is a category for publishing records and can have multiple subscribers.

Each topic is divided into partitions, each being an ordered, immutable commit log. Records in a partition are identified by a sequential offset.

Kafka retains records based on a configurable retention time (e.g., two days), independent of data volume, and discards them afterward.

The only metadata retained on a per‑consumer basis is the consumer's offset, which determines where that consumer resumes reading.

Consumers can reset offsets to reprocess past data or skip data, allowing consumption of both past and future records without affecting other consumers.
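The partition-and-offset model above can be illustrated with a minimal sketch in plain Python (not Kafka's actual implementation): an append-only log assigns sequential offsets, and a consumer persists nothing but its own offset, which it is free to rewind.

```python
from dataclasses import dataclass

@dataclass
class Record:
    key: str
    value: str
    offset: int

class PartitionLog:
    """An ordered, immutable, append-only log; each record gets a sequential offset."""
    def __init__(self):
        self.records = []

    def append(self, key, value):
        offset = len(self.records)
        self.records.append(Record(key, value, offset))
        return offset

    def read(self, offset):
        """Return all records from the given offset onward."""
        return self.records[offset:]

log = PartitionLog()
for i in range(5):
    log.append("k", f"v{i}")

# A consumer only tracks its offset; resuming from offset 3 skips older data.
print([r.value for r in log.read(3)])   # -> ['v3', 'v4']

# Resetting the offset to 0 reprocesses the whole retained log,
# without affecting any other consumer's position.
print(len(log.read(0)))                 # -> 5
```

Because the log itself is immutable, rewinding an offset is cheap: no records are moved or copied, the consumer simply reads again from an earlier position.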

Distribution

Partition logs are distributed across cluster servers; each partition has a leader and zero or more followers. The leader handles all read/write requests, while followers replicate the leader. If a leader fails, a follower is automatically promoted.
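The leader/follower failover described above can be sketched as a toy model in Python (it ignores real details such as in‑sync replica tracking and the controller's role in election): each partition keeps an ordered replica list, the head of the list is the leader, and removing a failed leader promotes the next replica.

```python
class Partition:
    """One partition replicated across brokers: one leader, the rest followers."""
    def __init__(self, replicas):
        self.replicas = list(replicas)   # first replica acts as leader

    @property
    def leader(self):
        return self.replicas[0]

    def fail(self, broker):
        # Drop the failed broker; if it was the leader, the next
        # replica in the list is promoted automatically.
        self.replicas.remove(broker)

p = Partition(["broker-1", "broker-2", "broker-3"])
print(p.leader)          # -> broker-1
p.fail("broker-1")
print(p.leader)          # -> broker-2 (a follower was promoted)
```

In a real cluster, each server acts as leader for some partitions and follower for others, so load is balanced across the cluster rather than concentrated on one broker.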

Producers

Producers send data to chosen topics and decide which partition to write to, either by round‑robin or based on semantic keys.
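The two partitioning strategies can be sketched in Python. This is an illustration, not Kafka's actual partitioner (the real default hashes keys with murmur2; MD5 is used here only for a stable stand-in hash): keyless records cycle round-robin, while keyed records always hash to the same partition, which preserves per-key ordering.

```python
import itertools
import hashlib

NUM_PARTITIONS = 4
_round_robin = itertools.count()

def choose_partition(key=None):
    """Keyed records map to a stable partition; keyless records round-robin."""
    if key is None:
        return next(_round_robin) % NUM_PARTITIONS
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# The same key always lands on the same partition (per-key ordering).
assert choose_partition("user-42") == choose_partition("user-42")

# Keyless records are spread evenly across partitions.
print([choose_partition() for _ in range(6)])   # -> [0, 1, 2, 3, 0, 1]
```

Choosing a semantic key (for example, a user ID) is how an application guarantees that all records for one entity arrive in order on a single partition.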

Consumers

Consumers belong to a consumer group identified by a group name. Messages in a topic are distributed among the instances of the same group, providing load balancing and fault tolerance.

If all consumers share the same group, messages are evenly distributed; if they belong to different groups, each group receives a copy of every message (broadcast).

The consumer group model combines queue‑like load balancing with publish‑subscribe broadcasting.
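A small Python sketch (assuming a simplified round-robin assignment, not Kafka's actual rebalance protocol) shows both behaviors: within one group, partitions are divided among members; across groups, every group sees every partition.

```python
def assign(partitions, consumers):
    """Spread a topic's partitions across the consumers of one group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = ["p0", "p1", "p2", "p3"]

# One group of two consumers: partitions are divided (queue semantics).
print(assign(partitions, ["c1", "c2"]))
# -> {'c1': ['p0', 'p2'], 'c2': ['p1', 'p3']}

# Two independent groups: each group is assigned every partition,
# so each group receives a full copy of the stream (pub-sub semantics).
for group in (["a1"], ["b1", "b2"]):
    print(assign(partitions, group))
```

Note that a partition is consumed by at most one member of a given group, so running more consumers than partitions leaves the extras idle.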

Guarantees

Messages sent by a producer to a specific partition are appended in send order, preserving offset order.

Consumers read messages in the order they are stored in the log.

With a replication factor of N, the system can tolerate N‑1 broker failures without losing committed data.
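The arithmetic behind the last guarantee is simple enough to state as code (a sketch of the counting argument only, ignoring in‑sync replica semantics): a record is committed once all N replicas hold it, so after N‑1 broker failures at least one copy survives.

```python
def surviving_copies(replication_factor, failures):
    """Copies of a committed record remaining after broker failures."""
    return max(replication_factor - failures, 0)

N = 3
print(surviving_copies(N, N - 1))   # -> 1: one live copy remains, no data loss
print(surviving_copies(N, N))       # -> 0: losing all N replicas loses data
```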

Kafka as a Messaging System

Kafka blends queue and publish‑subscribe models: each topic supports both semantics, offering stronger ordering guarantees than traditional messaging systems.

Consumer groups enable parallel processing (like queues) while still allowing broadcast to multiple groups (like pub‑sub).

Kafka as a Storage System

Kafka persists data to disk and replicates it for fault tolerance, allowing high‑performance, low‑latency log storage from small to petabyte‑scale datasets.

Clients can control read positions, making Kafka suitable as a durable, distributed log storage system.

Kafka for Stream Processing

Beyond read/write and storage, Kafka provides real‑time stream processing via the Streams API, allowing developers to build applications that transform, aggregate, and join streams.

Simple processing can be done with producers and consumers; complex transformations use the Streams API, which leverages Kafka’s core functions, state storage, and group mechanisms for fault‑tolerant processing.
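The shape of such a processor, read from input topics, apply stateless transforms and stateful aggregations, write to output topics, can be sketched in plain Python (the real Streams API is a Java library; these function names are illustrative, not part of any Kafka API):

```python
from collections import Counter

def process_stream(records, transform):
    """Stateless step: map each input record to an output record."""
    return [transform(r) for r in records]

def count_values(records):
    """Stateful step: maintain a running count per record value."""
    counts = Counter()
    for r in records:
        counts[r] += 1
    return counts

input_topic = ["click", "view", "click"]

output_topic = process_stream(input_topic, str.upper)
print(output_topic)                       # -> ['CLICK', 'VIEW', 'CLICK']

print(count_values(input_topic)["click"])  # -> 2
```

In the actual Streams API, the aggregation state would live in a fault-tolerant state store backed by a Kafka topic, so a restarted processor can recover its counts.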

Putting the Pieces Together

Kafka combines messaging, storage, and stream processing into a unified platform, enabling applications to process both historical and future data continuously.

By integrating low‑latency subscription with durable storage, Kafka serves as a high‑performance pipeline for both real‑time and batch workloads.

Tags: Big Data, Stream Processing, Message Queues, Kafka, Distributed Streaming, Consumer API, Producer API
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
