Introduction to Apache Kafka: Concepts, Architecture, and Core APIs
This article provides a comprehensive overview of Apache Kafka: its role in real‑time data pipelines and stream processing, key concepts such as topics, partitions, logs, producers, consumers, and replication, the guarantees it offers, and how it functions as both a messaging system and a storage system.
Introduction
Kafka is used for two broad classes of applications: (1) building real‑time data pipelines that reliably move data between systems or applications, and (2) building real‑time streaming applications that transform or react to streams of data.
Key Concepts
Kafka runs as a cluster on one or more servers.
The cluster stores streams of records in categories called topics.
Each record contains a key, a value, and a timestamp.
Core APIs
Producer API – sends data streams to one or more Kafka topics.
Consumer API – subscribes to topics and processes incoming data.
Streams API – enables applications to act as stream processors, transforming input topics to output topics.
Connector API – builds reusable producers and consumers to connect Kafka topics with external systems, such as relational databases.
Kafka client‑server communication uses a simple, efficient, language‑agnostic TCP protocol that is versioned and backward compatible; a Java client is provided, and clients exist for many other languages.
Topics and Logs
A topic is a category for publishing records and can have multiple subscribers.
Each topic is divided into partitions, each being an ordered, immutable commit log. Records in a partition are identified by a sequential offset.
Kafka retains all published records, whether or not they have been consumed, for a configurable retention period (e.g., two days), after which they are discarded to free space. Kafka's performance is effectively constant with respect to data size, so retaining data for a long time is not a problem.
The only metadata that must be persisted is the consumer offset, which determines where a consumer resumes reading.
Consumers can reset offsets to reprocess past data or skip data, allowing consumption of both past and future records without affecting other consumers.
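The log and offset mechanics above can be sketched with a toy in-memory partition. This is an illustration of the concept only, assuming nothing about Kafka's real client APIs; all class and method names here are invented for the example.

```python
# Sketch of a Kafka-style partition: an append-only log where each record
# gets a sequential offset, and each consumer tracks only its own position.

class PartitionLog:
    def __init__(self):
        self._records = []                  # append-only; offsets are indices

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1       # offset of the new record

    def read(self, offset):
        return self._records[offset]

    def end_offset(self):
        return len(self._records)

class Consumer:
    def __init__(self, log):
        self.log = log
        self.offset = 0                     # the only metadata the consumer keeps

    def poll(self):
        if self.offset < self.log.end_offset():
            record = self.log.read(self.offset)
            self.offset += 1
            return record
        return None

    def seek(self, offset):
        self.offset = offset                # rewind to reprocess, or skip ahead

log = PartitionLog()
for value in ["a", "b", "c"]:
    log.append(value)

consumer = Consumer(log)
print(consumer.poll())   # "a"
consumer.seek(0)         # reset the offset: reprocess from the beginning
print(consumer.poll())   # "a" again — other consumers are unaffected
```

Because each consumer owns its offset, rewinding one consumer has no effect on the log or on any other consumer reading the same partition.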
Distribution
Partition logs are distributed across cluster servers; each partition has a leader and zero or more followers. The leader handles all read/write requests, while followers replicate the leader. If a leader fails, a follower is automatically promoted.
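The leader/follower arrangement can be sketched in a few lines. This is a simplified model, not Kafka's actual controller logic (which also tracks in-sync replicas); the class and method names are invented for illustration.

```python
# Toy model of partition replication: one leader serves reads and writes,
# followers hold copies, and a follower is promoted if the leader fails.

class Partition:
    def __init__(self, replicas):
        self.replicas = list(replicas)       # broker ids hosting a copy
        self.leader = self.replicas[0]       # leader handles all reads/writes

    def fail(self, broker):
        self.replicas.remove(broker)
        if broker == self.leader and self.replicas:
            self.leader = self.replicas[0]   # promote a surviving follower

p = Partition(replicas=[1, 2, 3])            # replication factor 3
p.fail(1)
print(p.leader)   # 2 — a follower was promoted automatically
```

With a replication factor of N, up to N−1 such failures can occur before the last copy of the data is lost, which is the guarantee stated later in this article.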
Producers
Producers send data to chosen topics and decide which partition to write to, either by round‑robin or based on semantic keys.
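The two partitioning strategies described here can be sketched as follows. Note this is an illustration, not Kafka's built-in partitioner (which hashes keys with murmur2); `md5` and the names below are stand-ins chosen for the example.

```python
# Sketch of producer partition selection: round-robin for keyless records,
# deterministic key hashing otherwise.
import hashlib
import itertools

NUM_PARTITIONS = 4
_round_robin = itertools.cycle(range(NUM_PARTITIONS))

def choose_partition(key):
    if key is None:
        return next(_round_robin)        # spread keyless records evenly
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Records with the same key always land in the same partition,
# which is what preserves per-key ordering.
assert choose_partition("user-42") == choose_partition("user-42")
```

Keying by a semantic field (such as a user id) keeps all of that entity's records in one partition, so consumers see them in order.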
Consumers
Consumers belong to a consumer group identified by a group name. Messages in a topic are distributed among the instances of the same group, providing load balancing and fault tolerance.
If all consumer instances belong to the same group, each record is delivered to exactly one instance, load‑balancing the work across them; if the instances belong to different groups, every record is broadcast to each group.
The consumer group model combines queue‑like load balancing with publish‑subscribe broadcasting.
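These delivery semantics can be simulated in a few lines. This is a conceptual sketch, not a Kafka API: real assignment happens per partition rather than per record, and the function and group names below are invented for the example.

```python
# Sketch of consumer-group semantics: within a group, each record goes to
# exactly one instance (queue); across groups, every group sees every
# record (publish-subscribe).
from collections import defaultdict
import itertools

def deliver(records, groups):
    """groups: {group_name: [instance, ...]} -> records seen per instance."""
    received = defaultdict(list)
    for group, instances in groups.items():
        assignment = itertools.cycle(instances)   # stand-in for partition assignment
        for record in records:
            received[next(assignment)].append(record)
    return received

out = deliver(["r1", "r2", "r3", "r4"],
              {"billing": ["b1", "b2"], "audit": ["a1"]})
print(out["b1"])   # ['r1', 'r3'] — load-balanced within the billing group
print(out["a1"])   # ['r1', 'r2', 'r3', 'r4'] — the audit group sees everything
```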
Guarantees
Messages sent by a producer to a specific partition are appended in send order, preserving offset order.
Consumers read messages in the order they are stored in the log.
With a replication factor of N, the system can tolerate N‑1 broker failures without losing committed data.
Kafka as a Messaging System
Kafka blends queue and publish‑subscribe models: each topic supports both semantics, offering stronger ordering guarantees than traditional messaging systems.
Consumer groups enable parallel processing (like queues) while still allowing broadcast to multiple groups (like pub‑sub).
Kafka as a Storage System
Kafka persists data to disk and replicates it for fault tolerance, allowing high‑performance, low‑latency log storage from small to petabyte‑scale datasets.
Clients can control read positions, making Kafka suitable as a durable, distributed log storage system.
Kafka for Stream Processing
Beyond read/write and storage, Kafka provides real‑time stream processing via the Streams API, allowing developers to build applications that transform, aggregate, and join streams.
Simple processing can be done with producers and consumers; complex transformations use the Streams API, which leverages Kafka’s core functions, state storage, and group mechanisms for fault‑tolerant processing.
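A classic Streams-style computation is a stateful word count: consume an input stream, update local state, and emit updated counts downstream. The sketch below simulates that idea in plain Python; it is not the Streams API itself (which is a Java library), and the function name is invented for the example.

```python
# Toy stateful stream transform: read records, maintain per-word counts in
# a local state store, and emit an updated count after each record.
from collections import Counter

def word_count(input_stream):
    state = Counter()                  # local state store, one count per word
    for line in input_stream:
        for word in line.lower().split():
            state[word] += 1
            yield (word, state[word])  # emit the updated count downstream

updates = list(word_count(["hello kafka", "hello streams"]))
print(updates)
# [('hello', 1), ('kafka', 1), ('hello', 2), ('streams', 1)]
```

In real Streams applications, the state store is fault-tolerant because it is backed by a changelog topic in Kafka itself, which is what the text means by leveraging Kafka's core functions for fault-tolerant processing.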
Putting the Pieces Together
Kafka combines messaging, storage, and stream processing into a unified platform, enabling applications to process both historical and future data continuously.
By integrating low‑latency subscription with durable storage, Kafka serves as a high‑performance pipeline for both real‑time and batch workloads.