Data Heterogeneity with BinLake, Binlog, and Flink: Approaches for Order, Subscription, and Product Data
This article explains how data heterogeneity is achieved with JD's BinLake, which captures MySQL binlogs, and with Flink, which handles sequential and parallel consumption for order, subscription, and product data. It also discusses challenges such as ordering guarantees, idempotency, I/O overhead, and the shift toward stream-processing architectures.
Data heterogeneity refers to storing the same data in multiple heterogeneous stores. In the service market, JD's BinLake (a real-time MySQL binlog collection, distribution, subscription, and monitoring service) achieves data heterogeneity by subscribing to MySQL binlogs and delivering them as JMQ messages, which downstream consumers use to build remote storage.
There are two consumption modes: sequential and parallel. Order-related data must preserve strict ordering, so parallel consumption is unsuitable: out-of-order writes can leave the downstream stores inconsistent.
Sequential consumption has its own challenges: a single consumer is a throughput bottleneck, and an error blocks everything behind it. Initially, a dedicated server with IP restrictions was used; later the workflow migrated to a streaming platform (Storm, then Flink). Flink's parallel writes to Elasticsearch and JimDB bring natural performance advantages, but because JMQ retries deliver messages at least once, failure handling must be idempotent.
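The idempotency requirement under JMQ retries can be sketched in a few lines. This is a minimal, hypothetical example (an in-memory dict stands in for Elasticsearch/JimDB, and all field names are assumptions): each message carries its binlog offset, and the handler applies a message only if it is newer than what is already stored, so a retried delivery becomes a no-op.

```python
# Minimal sketch of idempotent binlog message handling under at-least-once
# delivery (JMQ retries). All names are hypothetical; a real consumer would
# write to Elasticsearch/JimDB instead of an in-memory store.

store = {}  # simulated downstream store: order_id -> (binlog_offset, payload)

def handle(message):
    """Apply a binlog message only if it is newer than what is stored.

    A re-delivered (retried) message carries the same binlog offset as the
    original, so applying it again changes nothing -- the handler is idempotent.
    """
    order_id = message["order_id"]
    offset = message["binlog_offset"]
    current = store.get(order_id)
    if current is not None and offset <= current[0]:
        return False  # stale or duplicate delivery: skip it
    store[order_id] = (offset, message["payload"])
    return True

handle({"order_id": 1, "binlog_offset": 10, "payload": "created"})
handle({"order_id": 1, "binlog_offset": 11, "payload": "paid"})
# JMQ retry re-delivers the first message; the store is left unchanged.
duplicate_applied = handle({"order_id": 1, "binlog_offset": 10, "payload": "created"})
```

The offset comparison is what makes parallel Flink writers safe to retry: no matter how many times a message arrives, only the newest state per key survives.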
Order Data Heterogeneity
Order data uses sequential consumption: each order inserted into MySQL produces a binlog event, which Flink subscribes to and writes into Elasticsearch. Because records are processed one at a time with no concurrency, it is safe to subscribe directly to the master MySQL.
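The sequential flow amounts to a single consumer draining events in arrival order (parallelism of 1). A minimal sketch under assumed names, with a deque standing in for the binlog stream and a dict for the Elasticsearch index:

```python
# Sketch of strictly sequential consumption for order data: one consumer
# processes binlog events in arrival order, so a later status update can
# never be overwritten by an earlier one.
from collections import deque

binlog_queue = deque([
    {"order_id": 1, "status": "created"},
    {"order_id": 1, "status": "paid"},
    {"order_id": 1, "status": "shipped"},
])

es_orders = {}  # stand-in for the Elasticsearch order index

while binlog_queue:                # one event at a time, in order
    event = binlog_queue.popleft()
    es_orders[event["order_id"]] = event["status"]
```

The final indexed state is always the last event emitted for each order, which is exactly the guarantee parallel consumption would break.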
Subscription Data Heterogeneity
Subscription data consists of a main table and an extension table in MySQL that must be merged into a single record in Elasticsearch. Parallel consumption can lead to out‑of‑order receipt of binlog events (e.g., the extension table’s event arriving before the main table’s). The adopted solution subscribes only to the main table’s binlog, then performs a reverse lookup on MySQL to fetch the extension data, merges them, and writes the combined record. To avoid overloading the master MySQL, the reverse lookup queries a slave instance, which introduces additional I/O and latency.
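The "subscribe to the main table only, then reverse-lookup the extension table" pattern can be sketched as follows. Table and field names are hypothetical, and an in-memory dict stands in for the MySQL slave; in production the lookup is a SQL query against the replica, which is the source of the extra I/O and latency mentioned above.

```python
# Sketch of the subscription-data merge: Flink receives only main-table
# binlog events, fetches the matching extension row from a read replica,
# and emits one combined document for Elasticsearch.

slave_extension_table = {  # stand-in for the MySQL slave: subscription_id -> extension columns
    101: {"channel": "app", "coupon": "NEW10"},
}

def lookup_extension(subscription_id):
    # In production this is a query against a MySQL slave (never the master),
    # adding one extra round trip of I/O and latency per event.
    return slave_extension_table.get(subscription_id, {})

def merge_event(main_event):
    """Merge a main-table binlog event with its extension row into one document."""
    doc = dict(main_event)                               # columns from the main table
    doc.update(lookup_extension(main_event["subscription_id"]))  # columns from the extension table
    return doc                                           # combined record, ready to index

doc = merge_event({"subscription_id": 101, "user_id": 7, "status": "active"})
```

Because only one table's binlog drives the pipeline, the out-of-order-arrival problem between the two tables disappears: by the time the main-table event is processed, the lookup sees whatever the replica currently holds.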
Product Data Heterogeneity
Product data uses parallel consumption because strict ordering is not required. The architecture mirrors the subscription data flow but also writes to Redis for caching, in addition to Elasticsearch. This pattern is straightforward to implement with Flink.
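The product flow's dual write can be sketched as a sink that fans each event out to both stores. This is an assumed shape, not JD's actual code: in-memory dicts stand in for Elasticsearch and Redis, and the field names are hypothetical.

```python
# Sketch of the product-data dual write: each binlog event goes to both the
# search index and the cache. Since product data needs no strict ordering,
# many parallel instances of this sink can run at once (Flink parallelism > 1).

es_index = {}     # stand-in for an Elasticsearch index: sku_id -> full document
redis_cache = {}  # stand-in for a Redis cache: key -> cached value

def sink_product(event):
    """Fan one product binlog event out to both downstream stores."""
    sku_id = event["sku_id"]
    es_index[sku_id] = event                          # full document for search
    redis_cache[f"product:{sku_id}"] = event["name"]  # hot field for cache reads

sink_product({"sku_id": 42, "name": "keyboard", "price": 199})
```

Writing both stores from the same event keeps the cache and the index derived from a single source of truth (the binlog), rather than from each other.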
Conclusion
Flink is increasingly adopted across business scenarios as a high‑performance distributed stream‑processing platform. The evolution from a binlog‑centric processing model to a Flink‑driven streaming architecture reflects both technical improvements and a shift in mindset toward real‑time, scalable data pipelines.
The above represents only the author's personal reflections; I hope this article is helpful to you.
JD Retail Technology
Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.