Kafka Overview: Architecture, Core Concepts, and Comparison with Other Message Queues
This article provides a comprehensive overview of Kafka, covering its background, design goals, architecture, key terminology, message routing, consumer groups, delivery guarantees, and a comparison with other popular message queue systems such as RabbitMQ, Redis, ZeroMQ, and ActiveMQ.
Background
Kafka is a distributed publish/subscribe messaging system originally developed at LinkedIn for activity streams and operational data pipelines. It is now widely used by many companies as a core data pipeline and messaging platform.
Activity stream data (page views, content accesses, searches) and operational data (CPU, I/O, request latency, logs) are typically logged to files and periodically aggregated. Modern web services require more sophisticated infrastructure to handle these workloads.
Kafka Introduction
Kafka is designed to provide message persistence with O(1) time complexity, so that access time stays constant even with terabytes of data; high throughput (over 100K messages/second on commodity hardware); partitioned messaging with ordered delivery within each partition; support for both offline and real‑time processing; and horizontal scalability.
Why Use a Message System?
Decoupling: A message queue introduces an implicit data‑driven interface that allows producers and consumers to evolve independently as long as they adhere to the same contract.
Redundancy: Messages are persisted until explicitly consumed, preventing data loss even when processing fails.
Scalability: Increasing message production or consumption rates only requires adding more producers or consumers; no code changes or parameter tuning are needed.
Flexibility & Burst Handling: Queues absorb traffic spikes, protecting critical components from overload.
Recoverability: Failure of a single consumer does not affect the whole system; unprocessed messages remain in the queue for later processing.
Ordering Guarantees: Kafka guarantees order within each partition.
Buffering: Queues act as buffers, allowing faster producers to write while slower consumers read at their own pace.
Asynchronous Communication: Producers can fire‑and‑forget messages, letting consumers process them later.
Common Message Queue Comparison
RabbitMQ: Erlang‑based, supports many protocols (AMQP, XMPP, SMTP, STOMP), heavyweight, broker‑centric, good for routing, load‑balancing, and persistence.
Redis: Key‑value NoSQL store with lightweight MQ capabilities; excels at small payloads (<10 KB) for enqueue/dequeue performance.
ZeroMQ: Fast, broker‑less library offering advanced patterns; non‑persistent, suitable for high‑throughput scenarios but requires more custom wiring.
ActiveMQ: Apache project offering both broker and peer‑to‑peer models; relatively lightweight.
Kafka / Jafka: Kafka is an Apache project (Jafka is derived from it); high‑performance, O(1) persistence, high throughput, fully distributed, integrates with Hadoop for parallel loading, and supports both offline and real‑time processing.
Kafka Architecture
Terminology
Broker: A server in a Kafka cluster.
Topic: Logical category of messages; physically stored across one or more brokers.
Partition: A physical slice of a topic; each partition is an ordered log.
Producer: Publishes messages to brokers.
Consumer: Reads messages from brokers.
Consumer Group: A set of consumers that share a group name; each partition is consumed by only one member of the group.
Kafka Topology
A typical cluster contains multiple producers (e.g., page‑view emitters, server logs), several brokers, consumer groups, and a Zookeeper ensemble for configuration, leader election, and rebalancing.
Topic & Partition
Logically a topic behaves like a queue; physically it is split into multiple partitions, each stored in its own directory with log segment files and index files. Each message has a 64‑bit offset and is stored as a log entry consisting of a magic byte, CRC, and payload.
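To make the offset-to-segment relationship concrete, here is a minimal sketch of how a reader might locate the segment file that holds a given offset. It assumes each segment file is named after the offset of its first message, zero-padded and suffixed with .log; the function names and the sample offsets are hypothetical and are not taken from Kafka's source.

```python
import bisect

def find_segment(base_offsets, target_offset):
    """Return the base offset of the segment that should hold target_offset.

    base_offsets must be sorted ascending; each value is the offset of the
    first message stored in that segment. This sketch does not know where
    the last segment ends, so any offset past it maps to the last segment.
    """
    if not base_offsets or target_offset < base_offsets[0]:
        raise KeyError(f"offset {target_offset} is older than the earliest segment")
    # bisect_right returns the position of the first base offset greater than
    # the target; the segment just before that position covers the target.
    index = bisect.bisect_right(base_offsets, target_offset) - 1
    return base_offsets[index]

def segment_file_name(base_offset):
    # Assumed naming scheme: zero-padded base offset plus a .log suffix.
    return f"{base_offset:020d}.log"

segments = [0, 1000, 2500]  # hypothetical base offsets of three segments
print(segment_file_name(find_segment(segments, 1700)))
```

Because the base offsets are sorted, the lookup is a binary search over segment names rather than a scan through message data.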
Kafka retains all messages (subject to time‑ or size‑based retention policies) rather than deleting consumed messages, enabling replay and simplifying consumer state management.
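The retention behavior can be sketched as follows. This is an illustration under stated assumptions, not the broker's implementation: the Segment class, the apply_retention function, and its parameters are hypothetical, and the sketch assumes retention works on whole segments and always keeps the newest one. Note that nothing in it asks whether a message has been consumed.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    base_offset: int
    size_bytes: int
    last_modified: float  # seconds since the epoch

def apply_retention(segments, now, max_age_seconds=None, max_total_bytes=None):
    """Return the segments that survive, oldest first.

    Segments are dropped from the old end of the log only. The newest
    segment is always kept (an assumption of this sketch).
    """
    kept = list(segments)
    if max_age_seconds is not None:
        # Time-based policy: drop segments last written too long ago.
        while len(kept) > 1 and now - kept[0].last_modified > max_age_seconds:
            kept.pop(0)
    if max_total_bytes is not None:
        # Size-based policy: drop the oldest segments until the log fits.
        while len(kept) > 1 and sum(s.size_bytes for s in kept) > max_total_bytes:
            kept.pop(0)
    return kept

log_segments = [
    Segment(base_offset=0, size_bytes=100, last_modified=0.0),
    Segment(base_offset=10, size_bytes=100, last_modified=50.0),
    Segment(base_offset=20, size_bytes=100, last_modified=100.0),
]
survivors = apply_retention(log_segments, now=120.0, max_age_seconds=100)
print([s.base_offset for s in survivors])
```

Since deletion depends only on age and size, a consumer can rewind to any offset that is still retained and replay from there.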
Producer Message Routing
Producers select a partition using a configurable partitioner (e.g., kafka.producer.Partitioner). The default can be overridden; a common example takes the message key modulo the number of partitions.
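The key-modulo routing rule mentioned above might look like the sketch below. The function choose_partition is hypothetical, and the CRC32 hash used for non-integer keys is an assumption chosen only because it is stable across Python processes; it is not Kafka's hash function.

```python
import zlib

def choose_partition(key, num_partitions):
    """Map a message key to a partition index in [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    if isinstance(key, int):
        # Integer keys: plain modulo. Python's % never returns a negative
        # result for a positive divisor, so negative keys are safe.
        return key % num_partitions
    # Other keys: hash to a stable non-negative integer first (assumption:
    # CRC32 stands in for whatever hash a real partitioner would use).
    digest = zlib.crc32(str(key).encode("utf-8"))
    return digest % num_partitions

for key in (7, 8, 9, "user-42"):
    print(key, "->", choose_partition(key, 3))
```

The property that matters is determinism: every message with the same key lands in the same partition, which is what lets per-partition ordering translate into per-key ordering.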
Configuration example (default number of partitions): set num.partitions in $KAFKA_HOME/config/server.properties.
Consumer Group
Within a consumer group, each partition is consumed by only one consumer, but the same topic can be consumed by multiple groups simultaneously, enabling both broadcast and unicast semantics.
Example: given a topic with three partitions, the single consumer in group 1 receives all messages, while the three consumers in group 2 each receive the messages of one distinct partition.
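The three-partition example can be reproduced with a small sketch. The round-robin assignment below is an assumption made for illustration; it is not necessarily the strategy Kafka's rebalancing uses, and the consumer names are hypothetical. What it does preserve is the rule that each partition has exactly one owner per group.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin.

    Every partition is given to exactly one consumer in the group.
    """
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment = {consumer: [] for consumer in consumers}
    for position, partition in enumerate(sorted(partitions)):
        owner = consumers[position % len(consumers)]
        assignment[owner].append(partition)
    return assignment

partitions = [0, 1, 2]
group1 = assign_partitions(partitions, ["g1-c1"])
group2 = assign_partitions(partitions, ["g2-c1", "g2-c2", "g2-c3"])
print(group1)
print(group2)
```

Each group is assigned independently, so both groups see the whole topic (broadcast across groups) while the work inside a group is divided (unicast within a group).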
Push vs. Pull
Kafka combines the two models: producers push messages to brokers, and consumers pull messages from brokers. Pulling lets each consumer fetch at a rate that matches its own processing capacity, avoiding the overload that can occur when a broker pushes messages faster than a consumer can handle them.
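A minimal in-memory sketch of the pull side follows. PartitionLog and drain are hypothetical names, and the log is a plain Python list rather than anything resembling a broker; the point is only that the consumer, not the log, decides how much to fetch and when.

```python
class PartitionLog:
    """An append-only list standing in for one partition."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        # The producer pushes; the returned value is the message's offset.
        self._messages.append(message)
        return len(self._messages) - 1

    def fetch(self, offset, max_messages):
        # The consumer pulls up to max_messages starting at offset.
        return self._messages[offset:offset + max_messages]

def drain(log, batch_size):
    """Read the whole log in batches of the consumer's chosen size."""
    offset = 0
    batches = []
    while True:
        batch = log.fetch(offset, batch_size)
        if not batch:
            break
        batches.append(batch)
        offset += len(batch)  # the consumer tracks its own position
    return batches

log = PartitionLog()
for i in range(10):
    log.append(f"event-{i}")

fast = drain(log, batch_size=5)  # a consumer that can take large batches
slow = drain(log, batch_size=2)  # a consumer that takes small batches
print(len(fast), len(slow))
```

Both consumers end up with the same messages in the same order; only the number of fetches differs.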
Kafka Delivery Guarantees
Three delivery semantics are supported:
At most once: Messages may be lost but are never duplicated.
At least once: No loss, but duplicates may occur.
Exactly once: Each message is delivered once and only once (requires external coordination; not fully implemented in older versions).
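On the producer side, asynchronous sends that are not retried give at‑most‑once behavior; at‑least‑once is the default. On the consumer side, offsets are committed to Zookeeper in the Kafka versions described here, and the point at which the offset is committed determines what happens after a crash: committing before processing can lose a message (at most once), while committing after processing can cause a processed message to be re‑delivered (at least once).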
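The effect of the commit point can be shown with a small simulation. Everything here is hypothetical (run_consumer, consume_with_restart, the crash_at parameter); offsets are kept in a local variable rather than in Zookeeper, and a "crash" is simply an early return between the two steps of handling one message.

```python
def run_consumer(messages, commit_first, crash_at=None, committed=0):
    """Consume starting at `committed`; return (processed, committed offset).

    crash_at is the offset at which the consumer dies between its two steps.
    """
    processed = []
    offset = committed
    while offset < len(messages):
        if commit_first:
            committed = offset + 1              # step 1: commit the offset
            if offset == crash_at:
                return processed, committed     # dies before processing
            processed.append(messages[offset])  # step 2: process
        else:
            processed.append(messages[offset])  # step 1: process
            if offset == crash_at:
                return processed, committed     # dies before committing
            committed = offset + 1              # step 2: commit the offset
        offset += 1
    return processed, committed

def consume_with_restart(messages, commit_first, crash_at):
    """Run once with a crash, then restart from the committed offset."""
    before, committed = run_consumer(messages, commit_first, crash_at)
    after, _ = run_consumer(messages, commit_first, None, committed)
    return before + after

messages = ["a", "b", "c", "d"]
at_most_once = consume_with_restart(messages, commit_first=True, crash_at=1)
at_least_once = consume_with_restart(messages, commit_first=False, crash_at=1)
print(at_most_once)   # "b" is missing: committed but never processed
print(at_least_once)  # "b" appears twice: processed but not yet committed
```

Exactly-once behavior would additionally require the processing result and the offset to be recorded together, or the processing to be idempotent, which this sketch does not attempt.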
Author Bio
Jason Guo (郭俊): Master's graduate, works on big‑data platform development, proficient with Kafka, Storm, and other distributed streaming technologies. Contact: WeChat habren, Sina Weibo 郭俊_Jason, blog http://www.jasongj.com .
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.