Kafka Streams: Architecture, Configuration, and Monitoring Use Cases
Kafka Streams is a client library for low‑latency, fault‑tolerant, real‑time processing of Kafka data through configurable topologies, time semantics, and state stores. This article explains its architecture, essential configuration, a monitoring‑focused ETL example, performance tuning, and strategies for handling partition skew.
Background
In the era of big data, real‑time data processing is increasingly important, especially for monitoring scenarios where timeliness and reliability are critical. Kafka Streams, an open‑source stream processing framework built on top of Kafka, provides strong support for handling massive, continuous data flows with millisecond‑level latency.
Basic Concepts of Kafka Streams
Kafka Streams is a client library that models data as unbounded, ordered, replayable streams of immutable key‑value records. It supports real‑time transformations, aggregations, and filtering, and integrates seamlessly with Kafka Connect and the standard Kafka producer/consumer APIs.
Key components include:
Source Processor – reads records from one or more input topics.
Stream Processor – applies a transformation step (filter, map, aggregate, etc.) to each record in the topology.
Sink Processor – writes results to an output topic.
Kafka Streams offers two APIs for defining topologies: the high‑level DSL (e.g., map, filter) and the low‑level Processor API for custom processing and state store interaction.
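As a sketch of the second style, the low‑level Processor API assembles the same kind of topology node by node. The topic names, node names, and the upper‑casing logic below are illustrative, assuming Kafka Streams 2.7+ (which introduced the `processor.api` package):

```java
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

public class ProcessorApiSketch {
    public static Topology build() {
        Topology topology = new Topology();
        // Source node reading from an (illustrative) input topic
        topology.addSource("source", "input-topic");
        // Custom processor node that upper-cases each record value
        topology.addProcessor("uppercase",
            () -> new Processor<String, String, String, String>() {
                private ProcessorContext<String, String> context;

                @Override
                public void init(ProcessorContext<String, String> context) {
                    this.context = context;
                }

                @Override
                public void process(Record<String, String> record) {
                    context.forward(record.withValue(record.value().toUpperCase()));
                }
            },
            "source");
        // Sink node writing the transformed records to an (illustrative) output topic
        topology.addSink("sink", "output-topic", "uppercase");
        return topology;
    }
}
```

The resulting Topology is passed to the KafkaStreams constructor exactly like one built with the DSL's StreamsBuilder.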
Time Semantics
Three time notions are used:
Event time – the original timestamp of the event.
Processing time – the time when the record is processed by the application.
Ingestion time – the time the record is appended to a Kafka partition.
A TimestampExtractor (configured via default.timestamp.extractor) assigns a timestamp to each record, based on record content, record metadata, or wall‑clock time, which in turn drives windowing and join operations.
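As a sketch, a custom TimestampExtractor might pull an event time embedded in the record value, falling back to the partition time when parsing fails. The payload layout here ("user:action:epochMillis") is an assumption for illustration:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class PayloadTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        // Assumed payload layout: "user42:click:1700000000000"
        try {
            String value = (String) record.value();
            String[] parts = value.split(":");
            // Last field carries the embedded event time in epoch milliseconds
            return Long.parseLong(parts[parts.length - 1]);
        } catch (Exception e) {
            // Fall back to the highest timestamp seen so far on this partition
            return partitionTime;
        }
    }
}
```

It would be registered with props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, PayloadTimestampExtractor.class).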
Monitoring Scenario Application
Kafka Streams can serve as the core of a monitoring data ETL pipeline, consuming from a source topic (e.g., TopicA), performing real‑time analysis, and writing results to a sink topic (e.g., TopicB).
Example code:
// Create the configuration
Properties props = new Properties();
// Set the application id (also used as the consumer group id)
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stream-processing-service");
// Set the broker addresses
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker1:9092");
StreamsBuilder builder = new StreamsBuilder();
// Build the stream from the source topic
KStream<String, String> userActions = builder.stream("TopicA");
// Process the stream
KTable<String, Long> userClickCounts = userActions
    .filter((key, value) -> value.contains("click"))
    .groupBy((key, value) -> value.split(":")[0])
    .count();
// Write the result stream back to Kafka
userClickCounts.toStream().to("TopicB", Produced.with(Serdes.String(), Serdes.Long()));
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
This snippet demonstrates creating a StreamsBuilder, filtering click events, grouping by a derived key, counting occurrences, and writing the aggregated results back to Kafka.
Architecture and Execution Model
Kafka Streams leverages Kafka's producer and consumer clients, inheriting partitioning, fault‑tolerance, and coordination capabilities. The processing topology is decomposed into multiple tasks, each bound to a specific input partition. Tasks are executed by stream threads; the number of threads can be configured via num.stream.threads. Fault tolerance is achieved through Kafka's replication and changelog topics for state stores.
Key architectural diagrams (omitted here) illustrate the internal flow from input topics through the processor topology to output topics.
Creating Streams and Writing Back to Kafka
StreamsBuilder builder = new StreamsBuilder();
// Stream from one or more topics
KStream<String, String> source1 = builder.stream(Arrays.asList("topic1", "topic2"));
// Changelog-backed table materialized into a named state store
KTable<String, String> source2 = builder.table("topic3", Materialized.as("stateStoreName"));
// Global table fully replicated to every application instance
GlobalKTable<String, String> source3 = builder.globalTable("topic4", Materialized.as("globalStoreName"));
Streams can be written back to Kafka using KStream.to() (a KTable is first converted with toStream()), while through() — superseded by repartition() in recent releases — routes records via an intermediate topic for read‑modify‑write cycles.
Configuration and Tuning
Essential configuration parameters include:
application.id – unique identifier for the application; Kafka Streams derives the consumer group id from it.
bootstrap.servers – Kafka cluster addresses.
default.key.serde / default.value.serde – default SerDes for keys and values.
Common tuning parameters:
num.stream.threads – Number of processing threads.
state.dir – Local directory for state stores.
batch.size, linger.ms, compression.type – producer settings passed through to the embedded producer.
Adjusting these settings can improve throughput, latency, and resource utilization.
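These settings go into the same Properties object used to construct KafkaStreams; the values below are illustrative starting points, not recommendations:

```java
Properties props = new Properties();
// Parallelism within a single application instance
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
// Local directory for RocksDB-backed state stores
props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");
// Producer-level settings are passed through with the producer prefix
props.put(StreamsConfig.producerPrefix(ProducerConfig.BATCH_SIZE_CONFIG), 65536);
props.put(StreamsConfig.producerPrefix(ProducerConfig.LINGER_MS_CONFIG), 50);
props.put(StreamsConfig.producerPrefix(ProducerConfig.COMPRESSION_TYPE_CONFIG), "lz4");
```

Larger batch.size and linger.ms trade a little latency for throughput; compression reduces network and disk usage at some CPU cost.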
Partition Skew Problem and Solutions
Skew occurs when one partition receives disproportionately more data or processing capacity is uneven across consumer instances. Mitigation strategies include:
Designing balanced partitioning schemes (e.g., round‑robin).
Scaling out consumer instances.
Implementing load‑balancing within the consumer group.
Optimizing processing logic and adjusting batch/window sizes.
Splitting the stream (e.g., with KStream.split()) to route hot keys onto a dedicated processing path.
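One common balancing technique, key salting, can be sketched in plain Java: a hot key is spread across N sub‑keys so its records land on several partitions, at the cost of a second aggregation step that merges the salted partial results. The helper below is a hypothetical illustration, not part of the Kafka Streams API:

```java
import java.util.concurrent.ThreadLocalRandom;

public class KeySalting {
    // Spread a hot key across `buckets` sub-keys, e.g. "user42" -> "user42#3"
    public static String salt(String key, int buckets) {
        int bucket = ThreadLocalRandom.current().nextInt(buckets);
        return key + "#" + bucket;
    }

    // Recover the original key when merging partial aggregates
    public static String unsalt(String saltedKey) {
        int idx = saltedKey.lastIndexOf('#');
        return idx < 0 ? saltedKey : saltedKey.substring(0, idx);
    }
}
```

Records would be re-keyed with salt() before the first aggregation, then re-keyed with unsalt() and aggregated once more to produce per-key totals.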
Conclusion
The article provides a comprehensive overview of Kafka Streams, covering its core concepts, topology design, time handling, monitoring‑focused ETL use cases, architectural details, configuration, performance tuning, and common operational challenges such as partition skew. Readers should now have the knowledge to design and implement their own Kafka Streams applications for real‑time monitoring workloads.
vivo Internet Technology
Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.