
Real-Time Computing with Apache Storm: Architecture, Code Samples, and Fault Tolerance

This article explains the principles of real-time computing, compares it with offline batch processing, and demonstrates a practical solution using Kafka for ingestion, Apache Storm for continuous computation, and various storage options, while also covering streaming concepts and Storm's high‑availability mechanisms.

Qunar Tech Salon

Continuing from the previous discussion on offline computation, this article introduces real‑time computing, which processes data as soon as it is generated and delivers results within seconds, enabling use cases such as instant hot‑topic detection on large websites.

The proposed solution consists of three layers: (1) data collection via a message queue (e.g., Kafka) that continuously streams search logs; (2) a persistent computation framework (e.g., Apache Storm) that consumes the queue, applies business logic, and emits results; (3) real‑time presentation and storage using in‑memory stores (Redis, MongoDB) or disk‑based databases (HBase, MySQL, SQL Server).
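To make the three layers concrete, here is a minimal, self-contained sketch in plain Java. It uses stand-ins rather than the real systems: a BlockingQueue plays the role of Kafka, a consume loop plays the role of the Storm topology, and an in-memory map plays the role of Redis. All class and method names here are illustrative, not Kafka or Storm APIs.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/** Toy stand-in for the three-layer pipeline:
 *  queue (Kafka) -> continuous consumer (Storm) -> in-memory store (Redis). */
public class MiniPipeline {
    // Layer 3 stand-in: in-memory store of search-term counts
    static Map<String, Integer> counts = new ConcurrentHashMap<>();

    // Layer 2 stand-in: the "bolt" logic, applied to each message as it arrives
    static void process(String term) {
        counts.merge(term, 1, Integer::sum);
    }

    public static void main(String[] args) throws Exception {
        // Layer 1 stand-in: a queue continuously fed with search logs
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        for (String t : new String[] {"hotel", "flight", "hotel"}) {
            queue.put(t);
        }
        // Consume the queue message by message, like a Storm spout/bolt loop
        while (!queue.isEmpty()) {
            process(queue.take());
        }
        System.out.println(counts); // "hotel" counted twice, "flight" once
    }
}
```

In a real deployment each layer is a separate distributed system, but the shape is the same: messages flow in continuously, logic is applied per message, and results are written to a store for immediate presentation.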

Storm Overview

Storm is a distributed, fault‑tolerant real‑time computation system. Its core abstraction is a topology, analogous to a MapReduce job, which defines the complete data‑flow graph. Unlike Hadoop, Storm’s Spout component can ingest data from any source (Kafka, Redis, databases, etc.), and multiple spouts can be used to separate streams.

Example Spout implementation:

public class CalcPriceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void nextTuple() {
        // Read from various sources such as Redis, message queues, databases
        collector.emit(new Values("message"));
    }
}

Storm invokes the nextTuple method repeatedly; in this example it would read messages from Kafka and emit them to downstream bolts for immediate processing.

Stream Processing

Stream processing treats data like a flowing liquid: each node (Bolt) receives a tuple, performs computation, and forwards the result to the next node, forming a directed acyclic graph. This model is more flexible than the two‑stage MapReduce paradigm.

Example Bolt implementation:

public class CalcProductPriceBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void execute(Tuple input) {
        // Result = compute something from the input tuple
        collector.emit(input, new Values("Result")); // anchor to input and send to next node
        collector.ack(input); // tell the acker this tuple is done
    }
}

Images illustrate the spout‑bolt data flow and the overall topology.
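Assuming spout and bolt classes named as above, the data‑flow graph is declared with Storm's TopologyBuilder and submitted as a single unit. A minimal wiring sketch (the component names and parallelism hints here are illustrative choices, not from the original article):

```java
import org.apache.storm.generated.StormTopology;
import org.apache.storm.topology.TopologyBuilder;

public class PriceTopology {
    public static StormTopology build() {
        TopologyBuilder builder = new TopologyBuilder();
        // One spout feeding one bolt; shuffleGrouping distributes
        // tuples randomly across the bolt's instances
        builder.setSpout("price-spout", new CalcPriceSpout(), 2);
        builder.setBolt("price-bolt", new CalcProductPriceBolt(), 4)
               .shuffleGrouping("price-spout");
        return builder.createTopology();
    }
}
```

Chaining further setBolt calls extends the graph into the directed acyclic graph described above; this wiring fragment only declares the topology and requires a Storm cluster (or LocalCluster) to actually run.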

Fault Tolerance

Storm ensures reliability through an Acker component that tracks the completion of each tuple using unique IDs. If a bolt fails, the originating spout can re‑emit the tuple after a timeout. Similar recovery mechanisms exist for failed ackers, spouts, or entire Storm components, with the Nimbus master handling node restarts.
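Storm's acker tracks an entire tuple tree with a single 64‑bit value: every anchor ID is XORed in once when a tuple is emitted and once when it is acked, so the value returns to zero exactly when every tuple in the tree has been processed. A minimal sketch of that bookkeeping in plain Java (a simplified illustration, not Storm's actual acker code):

```java
/** Sketch of Storm's acker bookkeeping: XOR every anchor ID on emit
 *  and again on ack; the tuple tree is complete when the value is 0. */
public class AckerDemo {
    static long ackVal = 0;

    static void xor(long anchorId) {
        ackVal ^= anchorId;
    }

    public static void main(String[] args) {
        long t1 = 0x1234L, t2 = 0x5678L;
        xor(t1); xor(t2); // spout emits two anchored tuples
        xor(t1); xor(t2); // downstream bolts ack both
        // x ^ x == 0, so a fully acked tree always returns to zero
        System.out.println(ackVal == 0); // prints true
    }
}
```

If a tuple is never acked, the value stays non-zero and the spout re-emits the tuple after the timeout, which is why this scheme gives at-least-once processing.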

In summary, unlike Hadoop’s batch‑oriented model, where jobs run over static data, Storm keeps a fixed topology running continuously while data streams through it, enabling low‑latency computation on data from any source with robust fault tolerance.

Tags: Big Data, Stream Processing, Kafka, Fault Tolerance, Real-Time Computing, Apache Storm
Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
