Optimizing Real‑Time Feature Extraction at 58.com: Migrating from Spark Streaming to Flink
This article describes how 58.com’s commercial engineering team redesigned its real‑time feature‑mining pipeline for user‑profile generation in the second‑hand‑car recommendation scenario, replacing a minute‑level Spark Streaming framework with Flink to achieve sub‑second stream‑processing latency, higher throughput, stronger fault tolerance, and effectively exactly‑once feature updates via idempotent writes with timeout‑retry.
Background – 58.com operates a massive lifestyle‑information platform covering local services, recruitment, real estate, used cars, and second‑hand goods. As growth in mobile‑internet users slows, fine‑grained operations and real‑time user profiling have become critical to improving user experience, client effectiveness, and monetization efficiency. The existing feature‑mining pipeline generated and updated user profiles at minute‑level latency, which was insufficient for the second‑hand‑car recommendation use case.
Feature‑Mining Platform Overview – The platform (named “Dahui”) abstracts heterogeneous data sources and computation through Importer and Operator modules, allowing developers to focus on feature logic without dealing with underlying storage or execution details.
Real‑Time Computing Framework Upgrade – The team evaluated three mainstream streaming engines: Spark Streaming, Storm, and Flink. Comparative analysis showed that Flink and Storm support true stream processing with millisecond‑level latency, while Spark Streaming processes micro‑batches, yielding second‑ to minute‑level latency. Flink also offers incremental checkpointing, higher throughput, and exactly‑once state guarantees, making it the preferred choice.
Adopted Architecture – After migration, the new pipeline combines Spark for offline batch feature mining and Flink for both micro‑batch and pure‑stream processing. The resulting architecture (illustrated in Figure 1.4) routes data from Kafka through Flink operators, writes enriched features to the internally developed wtable KV store, and finally persists results to downstream storage.
Flink Cluster Structure – In Yarn mode, a JobManager coordinates tasks submitted by a client, while multiple TaskManagers execute operators. The three‑layer graph model (StreamGraph → JobGraph → ExecutionGraph) enables parallelism, fault‑tolerant checkpointing, and efficient data exchange via IntermediateResult and JobEdge constructs.
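The layering described above can be illustrated with a toy model (the class and method names here are illustrative stand‑ins, not Flink’s actual internals): a logical job vertex with parallelism n expands into n execution subtasks when the JobGraph is turned into an ExecutionGraph.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of the StreamGraph -> JobGraph -> ExecutionGraph idea:
// a logical vertex with parallelism n expands into n parallel subtasks
// at execution time. (Illustrative only; not Flink's real classes.)
public class GraphExpansion {
    public static List<String> expand(String jobVertexName, int parallelism) {
        List<String> subtasks = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            // Each subtask is one parallel instance of the logical vertex.
            subtasks.add(jobVertexName + "[" + i + "/" + parallelism + "]");
        }
        return subtasks;
    }
}
```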
Storage Selection – After benchmarking Redis, HBase, and the self‑developed wtable, the team chose wtable for its high QPS, large capacity, automatic scaling, and low latency (tp99 ≈ 26 ms). A timeout‑retry strategy keeps the write‑failure rate below 1 × 10⁻⁷.
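The arithmetic behind that target: with independent attempts and a per‑attempt failure probability p, the residual failure rate after n attempts is pⁿ; for example, p = 0.01 with 4 attempts gives 10⁻⁸, under the 1 × 10⁻⁷ threshold. A minimal sketch of the bounded timeout‑retry wrapper (the write callback shown here is a hypothetical stand‑in for the wtable client):

```java
import java.util.function.Supplier;

// Sketch of a bounded timeout-retry write wrapper. The Supplier stands in
// for a wtable client call that returns true when the write is acknowledged.
public class RetryWriter {
    // Retry a write up to maxAttempts times; returns true on success.
    public static boolean writeWithRetry(Supplier<Boolean> attempt, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            if (attempt.get()) {
                return true;          // write acknowledged
            }
            // In production, a backoff/sleep would go here before retrying.
        }
        return false;                 // all attempts timed out or failed
    }

    // Residual failure rate after n independent attempts, each failing
    // with probability p, is p^n.
    public static double residualFailureRate(double p, int n) {
        return Math.pow(p, n);
    }
}
```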
End‑to‑End Exactly‑Once – Achieving this requires idempotent writes and transactional updates. Although Flink supports two‑phase‑commit sinks with Kafka 0.11+, the team opted for a simpler approach: idempotent writes to wtable with strict timeout‑retry, which meets the reliability requirements of the feature‑mining workload.
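Idempotency is what makes retries safe: re‑applying the same write leaves the store unchanged, so a retry after a timed‑out‑but‑actually‑applied attempt causes no duplication. A minimal sketch, with a plain HashMap standing in for the wtable client and a hypothetical (userId, feature) key scheme:

```java
import java.util.HashMap;
import java.util.Map;

// Idempotent upsert keyed by (userId, feature): a retried write overwrites
// the entry with the same value instead of creating a duplicate, so any
// number of retries yields the same final state as one successful write.
public class IdempotentStore {
    private final Map<String, String> kv = new HashMap<>();

    public void put(String userId, String feature, String value) {
        kv.put(userId + ":" + feature, value);  // last-write-wins upsert
    }

    public String get(String userId, String feature) {
        return kv.get(userId + ":" + feature);
    }

    public int size() {
        return kv.size();
    }
}
```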
Monitoring & Performance Evaluation – Metrics such as QPS, latency, CPU/memory usage, and back‑pressure alerts are collected. After optimization, the overall pipeline latency (tp99) dropped from ~8 s to ~4 s, with Flink processing staying under 1 s. The main bottleneck shifted to the Flume transport layer, which was later mitigated by provisioning a dedicated transmission channel.
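For reference, tp99 is the 99th‑percentile latency: the value below which 99% of samples fall. A minimal nearest‑rank computation (one of several common percentile definitions):

```java
import java.util.Arrays;

public class Percentile {
    // Nearest-rank percentile: sort the samples, then take the
    // ceil(p * n)-th smallest value (1-based rank).
    public static double tp(double[] samplesMs, double p) {
        double[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }
}
```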
Practical Issues Encountered – The team documented challenges including task initialization, custom serialization, null‑pointer exceptions, the lack of a reduce‑by‑key operator in Flink, distributed‑cache failures on Yarn, and checkpoint loss caused by a bug. Work‑arounds included overriding open() for initialization, using Flink’s native serializers, replacing reduce with keyBy + map, and manually copying cache files to task nodes.
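The keyBy + map work‑around can be sketched outside Flink: each event updates a per‑key running aggregate held in operator state, the way a reduce‑by‑key would. Here a plain HashMap stands in for keyed state (in a real job this state would live in a rich function and be initialized in open()); the running count is an illustrative choice of aggregate:

```java
import java.util.HashMap;
import java.util.Map;

// Simulates keyBy(key) followed by a stateful map: each incoming event
// updates a per-key running count, expressed as a map over events with
// keyed state rather than a reduce-by-key operator.
public class KeyedRunningCount {
    private final Map<String, Long> state = new HashMap<>();  // keyed state

    // Called once per event; returns the updated aggregate for that key.
    public long map(String key) {
        long next = state.getOrDefault(key, 0L) + 1;
        state.put(key, next);
        return next;
    }
}
```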
Conclusion & Outlook – The upgraded framework now supports both batch and true streaming, improving latency, throughput, and reliability for real‑time user profiling. Future work includes implementing end‑to‑end exactly‑once semantics within Flink and extending the solution to other commercial scenarios.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.