Real-Time and Offline Integrated Solution for Channel Analysis Data Processing
This article presents an integrated real‑time and offline solution for a channel analysis system, covering the challenges, architecture, and implementation built on Flink, Spark Streaming, Kafka, Elasticsearch, and Hive, and demonstrating minute‑level latency and high accuracy in performance evaluations.
The channel analysis system is a multidimensional data analysis platform that supports channel operations and evaluation, and therefore requires timely and accurate data. The first generation relied on offline computation with hour‑level granularity, which delayed new‑user metrics and forced passive decision‑making.
This article describes a unified real‑time and offline solution that provides accurate, efficient, and timely data on channel new‑user growth, covering the challenges, the proposed design, and implementation details.
Challenges
Large data volume: 5–6 TB per day, peaks up to 100 MB/s.
High data complexity: multiple product lines, encrypted logs, and diverse sources.
Low‑latency requirement: real‑time decisions need minimal delay.
High accuracy demand: precise channel evaluation for fair settlement.
Solution Overview
The design adopts a dual‑write approach, storing raw data in both a real‑time streaming message queue and distributed storage, and combines real‑time computation with offline calibration to achieve real‑time updates, queries, and visualizations.
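The dual‑write idea can be sketched in a few lines. The in‑memory lists below stand in for the Kafka topic and the distributed store; the class name and fields are illustrative, not the system's actual components.

```python
import json

class DualWriter:
    """Minimal sketch of the dual-write pattern: every raw log record is
    written to both a streaming queue (real-time path) and a batch store
    (offline calibration path). Kafka/HDFS clients are replaced with
    in-memory lists purely for illustration."""

    def __init__(self):
        self.stream_queue = []   # stands in for the Kafka topic
        self.batch_store = []    # stands in for distributed storage

    def write(self, record: dict) -> None:
        payload = json.dumps(record)
        self.stream_queue.append(payload)  # consumed by Flink in real time
        self.batch_store.append(payload)   # replayed by the offline Hive job

writer = DualWriter()
writer.write({"uid": "u1", "channel": "app_store", "ts": 1700000000})
assert writer.stream_queue == writer.batch_store  # both paths see identical raw data
```

Because both paths receive the same raw records, the offline pipeline can later recompute any metric the real‑time pipeline produced.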
Overall Design
Real‑time stream processing is split into three stages—log parsing, product sharding, and new‑data calculation—using Flink for low‑latency parsing and Spark Streaming for micro‑batch aggregation. Kafka and Elasticsearch provide high‑throughput messaging and fast indexing, while an offline Hive pipeline offers calibration and disaster recovery.
Detailed Implementation
Log Parsing: Kafka streams are consumed by Flink, which leverages its dual‑stream capability to enrich logs with product identifiers, enabling dynamic, configurable data extraction.
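The enrichment logic can be illustrated with plain Python. In the real system this is a Flink connected stream; here a hypothetical config map and log format (uid|package|timestamp) show only the join idea.

```python
# Hypothetical sketch of dual-stream enrichment: a config stream refreshes
# a package-prefix -> product mapping, and each raw log line is parsed and
# tagged with its product id. Field names and formats are illustrative.
product_config = {}  # broadcast-style state fed by the config stream

def on_config(pkg_prefix: str, product_id: str) -> None:
    """Config-stream side: register/refresh a prefix -> product rule."""
    product_config[pkg_prefix] = product_id

def parse_log(raw: str):
    """Log-stream side: split the raw line and enrich with the product id."""
    try:
        uid, pkg, ts = raw.split("|")
    except ValueError:
        return None  # malformed record, dropped as dirty data
    for prefix, pid in product_config.items():
        if pkg.startswith(prefix):
            return {"uid": uid, "product": pid, "ts": int(ts)}
    return None  # no matching product rule yet

on_config("com.example.app", "browser")
print(parse_log("u42|com.example.app.main|1700000000"))
# -> {'uid': 'u42', 'product': 'browser', 'ts': 1700000000}
```

Keeping the mapping in updatable state is what makes the extraction dynamic and configurable: new product rules take effect without redeploying the parsing job.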
Product Sharding: Parsed logs are fed into Spark Streaming, which converts the stream into micro‑batches and updates Elasticsearch. Documents keyed by a unique primary key make the updates idempotent, so each product maintains a cumulative table of all historical records without double counting.
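The role of the unique primary key can be sketched as an idempotent upsert. A dict stands in for the Elasticsearch index; the "product:uid" key scheme and field names are assumptions for illustration.

```python
# Sketch of the idempotent-upsert idea behind the ES cumulative table:
# documents are keyed by "product:uid", so re-processing the same record
# (e.g. after a micro-batch retry) cannot double-count a user.
es_index = {}

def upsert_user(product: str, uid: str, first_seen_ts: int) -> bool:
    """Insert the user once; later writes keep the earliest timestamp.
    Returns True only when the user is genuinely new for this product."""
    doc_id = f"{product}:{uid}"
    if doc_id in es_index:
        es_index[doc_id]["first_seen"] = min(es_index[doc_id]["first_seen"],
                                             first_seen_ts)
        return False
    es_index[doc_id] = {"product": product, "uid": uid,
                        "first_seen": first_seen_ts}
    return True

assert upsert_user("browser", "u1", 100) is True    # new user, counted once
assert upsert_user("browser", "u1", 100) is False   # replayed batch: no double count
```

This is why micro‑batch retries are safe: replaying a batch rewrites the same document ids instead of inflating the counts.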
New‑Data Calculation: ES is queried periodically per product to compute hourly and cumulative new user counts, reducing latency from hour‑level to minute‑level and letting operators see campaign effects within minutes.
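The periodic query boils down to bucketing first‑appearance timestamps. The sketch below assumes a simple (product, uid, first_seen) record shape and hour buckets; both are illustrative, not the system's actual schema.

```python
from collections import Counter

# Sketch of the new-user query: given the cumulative table of
# (product, uid, first_seen) records, count new users per hour bucket
# and the running cumulative total per product.
HOUR = 3600

def new_user_stats(docs, product):
    hourly = Counter()
    for d in docs:
        if d["product"] == product:
            hourly[d["first_seen"] // HOUR] += 1  # hour bucket of first appearance
    return hourly, sum(hourly.values())          # per-hour counts + cumulative

docs = [
    {"product": "browser", "uid": "u1", "first_seen": 10},
    {"product": "browser", "uid": "u2", "first_seen": 3700},
    {"product": "search",  "uid": "u3", "first_seen": 20},
]
hourly, total = new_user_stats(docs, "browser")
assert hourly == Counter({0: 1, 1: 1}) and total == 2
```

Running this query on a short timer (rather than an hourly batch) is what moves the metric's freshness from hour level to minute level.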
Disaster Recovery: An offline calibration pipeline mirrors the real‑time flow. When real‑time failures occur, the offline job parses logs into Hive and recomputes new‑data metrics, achieving hour‑level correction with error rates below 0.5%.
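The calibration check itself is a simple relative‑error comparison. The threshold and counts below are illustrative; the article only reports that observed error stayed under 0.5%.

```python
# Sketch of the calibration check: the offline Hive job recomputes the
# new-user counts from raw logs, and the real-time figures are accepted
# when the relative error stays under a threshold.
ERROR_THRESHOLD = 0.005  # 0.5%, matching the reported bound

def calibration_error(realtime_count: int, offline_count: int) -> float:
    """Relative error of the real-time count vs. the offline recomputation."""
    if offline_count == 0:
        return 0.0 if realtime_count == 0 else 1.0
    return abs(realtime_count - offline_count) / offline_count

rt, off = 99_700, 100_000
err = calibration_error(rt, off)
assert err == 0.003 and err < ERROR_THRESHOLD  # within tolerance, no correction needed
```

When the error exceeds the threshold, the offline result replaces the real‑time figure, which is what gives the pipeline its hour‑level correction guarantee.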
Results
Performance tests show log‑parsing latency of ~1.43 ms per record and product‑sharding latency of ~1.39 s per micro‑batch, letting the pipeline keep pace with peak data production. Overall system latency stays at the second level—well under the 10‑minute business requirement—and new‑data computation and query latencies are at the millisecond level.
Key Challenges Overcome
Low Latency: By decomposing the pipeline into fine‑grained stages and adjusting parallelism, the system maintains minute‑level responsiveness even under peak loads.
Data Accuracy and Stability: Real‑time monitoring with alerts, combined with the offline calibration flow, ensures data correctness and supports precise channel evaluation for fair settlement.
Conclusion
The proposed architecture delivers a minute‑level cumulative new‑user calculation solution, transforming passive, hour‑delayed decisions into proactive, real‑time adjustments, thereby empowering channel operations and evaluation with reliable, low‑latency data.
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.