Real-Time and Offline Integrated Solution for Channel Analysis Data Processing
This article presents an integrated real‑time and offline solution for a channel analysis system, covering the challenges, architecture, and implementation built on Flink, Spark Streaming, Kafka, Elasticsearch, and Hive, and demonstrating minute‑level latency and high accuracy in performance evaluations.
The channel analysis system is a multidimensional data analysis platform that supports channel operations and evaluation, and therefore requires timely and accurate data. The first generation relied on offline computation with hour‑level granularity, which delayed new‑user metrics and forced passive decision‑making.
This article describes a unified real‑time and offline solution that provides accurate, efficient, and timely data on channel new‑user growth, covering the challenges, the proposed design, and implementation details.
Challenges
Large data volume: 5–6 TB per day, peaks up to 100 MB/s.
High data complexity: multiple product lines, encrypted logs, and diverse sources.
Low‑latency requirement: real‑time decisions need minimal delay.
High accuracy demand: precise channel evaluation for fair settlement.
Solution Overview
The design adopts a dual‑write approach, storing raw data in both a real‑time streaming message queue and distributed storage, and combines real‑time computation with offline calibration to achieve real‑time updates, queries, and visualizations.
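The dual‑write idea can be sketched in a few lines. The in‑memory lists below stand in for the Kafka topic and the distributed store; the class name and fields are illustrative, not the system's actual components.

```python
import json

class DualWriter:
    """Minimal sketch of the dual-write pattern: every raw log record is
    written to both a streaming queue (real-time path) and a batch store
    (offline calibration path). Kafka/HDFS clients are replaced with
    in-memory lists purely for illustration."""

    def __init__(self):
        self.stream_queue = []   # stands in for the Kafka topic
        self.batch_store = []    # stands in for distributed storage

    def write(self, record: dict) -> None:
        payload = json.dumps(record)
        self.stream_queue.append(payload)  # consumed by Flink in real time
        self.batch_store.append(payload)   # replayed by the offline Hive job

writer = DualWriter()
writer.write({"uid": "u1", "channel": "app_store", "ts": 1700000000})
assert writer.stream_queue == writer.batch_store  # both paths see identical raw data
```

Because both paths receive the same raw records, the offline pipeline can later recompute any metric the real‑time pipeline produced.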
Overall Design
Real‑time stream processing is split into three stages—log parsing, product sharding, and new‑data calculation—using Flink for low‑latency parsing and Spark Streaming for micro‑batch aggregation. Kafka and Elasticsearch provide high‑throughput messaging and fast indexing, while an offline Hive pipeline offers calibration and disaster recovery.
Detailed Implementation
Log Parsing: Kafka streams are consumed by Flink, which leverages its dual‑stream capability to enrich logs with product identifiers, enabling dynamic, configurable data extraction.
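The enrichment logic can be illustrated with plain Python. In the real system this is a Flink connected stream; here a hypothetical config map and log format (uid|package|timestamp) show only the join idea.

```python
# Hypothetical sketch of dual-stream enrichment: a config stream refreshes
# a package-prefix -> product mapping, and each raw log line is parsed and
# tagged with its product id. Field names and formats are illustrative.
product_config = {}  # broadcast-style state fed by the config stream

def on_config(pkg_prefix: str, product_id: str) -> None:
    """Config-stream side: register/refresh a prefix -> product rule."""
    product_config[pkg_prefix] = product_id

def parse_log(raw: str):
    """Log-stream side: split the raw line and enrich with the product id."""
    try:
        uid, pkg, ts = raw.split("|")
    except ValueError:
        return None  # malformed record, dropped as dirty data
    for prefix, pid in product_config.items():
        if pkg.startswith(prefix):
            return {"uid": uid, "product": pid, "ts": int(ts)}
    return None  # no matching product rule yet

on_config("com.example.app", "browser")
print(parse_log("u42|com.example.app.main|1700000000"))
# -> {'uid': 'u42', 'product': 'browser', 'ts': 1700000000}
```

Keeping the mapping in updatable state is what makes the extraction dynamic and configurable: new product rules take effect without redeploying the parsing job.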
Product Sharding: Parsed logs are fed into Spark Streaming, which converts the stream into micro‑batches and updates Elasticsearch. Documents keyed by a unique primary key make the updates idempotent, so each product maintains a cumulative table of all historical records without double counting.
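The role of the unique primary key can be sketched as an idempotent upsert. A dict stands in for the Elasticsearch index; the "product:uid" key scheme and field names are assumptions for illustration.

```python
# Sketch of the idempotent-upsert idea behind the ES cumulative table:
# documents are keyed by "product:uid", so re-processing the same record
# (e.g. after a micro-batch retry) cannot double-count a user.
es_index = {}

def upsert_user(product: str, uid: str, first_seen_ts: int) -> bool:
    """Insert the user once; later writes keep the earliest timestamp.
    Returns True only when the user is genuinely new for this product."""
    doc_id = f"{product}:{uid}"
    if doc_id in es_index:
        es_index[doc_id]["first_seen"] = min(es_index[doc_id]["first_seen"],
                                             first_seen_ts)
        return False
    es_index[doc_id] = {"product": product, "uid": uid,
                        "first_seen": first_seen_ts}
    return True

assert upsert_user("browser", "u1", 100) is True    # new user, counted once
assert upsert_user("browser", "u1", 100) is False   # replayed batch: no double count
```

This is why micro‑batch retries are safe: replaying a batch rewrites the same document ids instead of inflating the counts.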
New‑Data Calculation: ES is queried periodically per product to compute hourly and cumulative new user counts, reducing latency from hour‑level to minute‑level and letting operators see campaign effects within minutes.
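The periodic query boils down to bucketing first‑appearance timestamps. The sketch below assumes a simple (product, uid, first_seen) record shape and hour buckets; both are illustrative, not the system's actual schema.

```python
from collections import Counter

# Sketch of the new-user query: given the cumulative table of
# (product, uid, first_seen) records, count new users per hour bucket
# and the running cumulative total per product.
HOUR = 3600

def new_user_stats(docs, product):
    hourly = Counter()
    for d in docs:
        if d["product"] == product:
            hourly[d["first_seen"] // HOUR] += 1  # hour bucket of first appearance
    return hourly, sum(hourly.values())          # per-hour counts + cumulative

docs = [
    {"product": "browser", "uid": "u1", "first_seen": 10},
    {"product": "browser", "uid": "u2", "first_seen": 3700},
    {"product": "search",  "uid": "u3", "first_seen": 20},
]
hourly, total = new_user_stats(docs, "browser")
assert hourly == Counter({0: 1, 1: 1}) and total == 2
```

Running this query on a short timer (rather than an hourly batch) is what moves the metric's freshness from hour level to minute level.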
Disaster Recovery: An offline calibration pipeline mirrors the real‑time flow. When real‑time failures occur, the offline job parses logs into Hive and recomputes new‑data metrics, achieving hour‑level correction with error rates below 0.5%.
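The calibration check itself is a simple relative‑error comparison. The threshold and counts below are illustrative; the article only reports that observed error stayed under 0.5%.

```python
# Sketch of the calibration check: the offline Hive job recomputes the
# new-user counts from raw logs, and the real-time figures are accepted
# when the relative error stays under a threshold.
ERROR_THRESHOLD = 0.005  # 0.5%, matching the reported bound

def calibration_error(realtime_count: int, offline_count: int) -> float:
    """Relative error of the real-time count vs. the offline recomputation."""
    if offline_count == 0:
        return 0.0 if realtime_count == 0 else 1.0
    return abs(realtime_count - offline_count) / offline_count

rt, off = 99_700, 100_000
err = calibration_error(rt, off)
assert err == 0.003 and err < ERROR_THRESHOLD  # within tolerance, no correction needed
```

When the error exceeds the threshold, the offline result replaces the real‑time figure, which is what gives the pipeline its hour‑level correction guarantee.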
Results
Performance tests show log‑parsing latency of ~1.43 ms per record and product‑sharding latency of ~1.39 s per micro‑batch, letting the pipeline keep pace with peak data production. Overall system latency stays at the second level—well under the 10‑minute business requirement—and new‑data computation and query latencies are at the millisecond level.
Key Challenges Overcome
Low Latency: By decomposing the pipeline into fine‑grained stages and adjusting parallelism, the system maintains minute‑level responsiveness even under peak loads.
Data Accuracy and Stability: Real‑time monitoring with alerts, combined with the offline calibration flow, ensures data correctness and supports precise channel evaluation for fair settlement.
Conclusion
The proposed architecture delivers a minute‑level cumulative new‑user calculation solution, transforming passive, hour‑delayed decisions into proactive, real‑time adjustments, thereby empowering channel operations and evaluation with reliable, low‑latency data.
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.