Data Heterogeneity with BinLake, Binlog, and Flink: Approaches for Order, Subscription, and Product Data
This article explains how data heterogeneity is achieved with JD's BinLake, which captures MySQL binlogs, and with Flink, which handles sequential and parallel consumption for order, subscription, and product data. It also discusses challenges such as ordering guarantees, idempotency, I/O overhead, and the shift toward stream-processing architectures.
Data heterogeneity refers to storing the same data in multiple heterogeneous stores. In the service market, JD's BinLake (a real-time MySQL binlog collection, distribution, subscription, and monitoring service) achieves data heterogeneity by subscribing to MySQL binlogs and delivering them as JMQ messages, which downstream consumers use to build remote storage.
There are two consumption modes: sequential and parallel. Order-related data must preserve strict ordering, so parallel consumption is unsuitable: out-of-order writes can leave the downstream stores inconsistent.
Sequential consumption has its own challenges: a single consumer is a throughput bottleneck, and an error blocks everything behind it. Initially, a dedicated server with IP restrictions was used; later the workflow migrated to a streaming platform (Storm, then Flink). Flink's parallel writes to Elasticsearch and JimDB bring natural performance advantages, but because JMQ retries deliver messages at least once, failure handling must be idempotent.
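The idempotency requirement under JMQ retries can be sketched in a few lines. This is a minimal, hypothetical example (an in-memory dict stands in for Elasticsearch/JimDB, and all field names are assumptions): each message carries its binlog offset, and the handler applies a message only if it is newer than what is already stored, so a retried delivery becomes a no-op.

```python
# Minimal sketch of idempotent binlog message handling under at-least-once
# delivery (JMQ retries). All names are hypothetical; a real consumer would
# write to Elasticsearch/JimDB instead of an in-memory store.

store = {}  # simulated downstream store: order_id -> (binlog_offset, payload)

def handle(message):
    """Apply a binlog message only if it is newer than what is stored.

    A re-delivered (retried) message carries the same binlog offset as the
    original, so applying it again changes nothing -- the handler is idempotent.
    """
    order_id = message["order_id"]
    offset = message["binlog_offset"]
    current = store.get(order_id)
    if current is not None and offset <= current[0]:
        return False  # stale or duplicate delivery: skip it
    store[order_id] = (offset, message["payload"])
    return True

handle({"order_id": 1, "binlog_offset": 10, "payload": "created"})
handle({"order_id": 1, "binlog_offset": 11, "payload": "paid"})
# JMQ retry re-delivers the first message; the store is left unchanged.
duplicate_applied = handle({"order_id": 1, "binlog_offset": 10, "payload": "created"})
```

The offset comparison is what makes parallel Flink writers safe to retry: no matter how many times a message arrives, only the newest state per key survives.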
Order Data Heterogeneity
Order data uses sequential consumption: each order inserted into MySQL produces a binlog event, which Flink subscribes to and writes into Elasticsearch. Because records are processed one at a time with no concurrency, it is safe to subscribe directly to the master MySQL.
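The sequential flow amounts to a single consumer draining events in arrival order (parallelism of 1). A minimal sketch under assumed names, with a deque standing in for the binlog stream and a dict for the Elasticsearch index:

```python
# Sketch of strictly sequential consumption for order data: one consumer
# processes binlog events in arrival order, so a later status update can
# never be overwritten by an earlier one.
from collections import deque

binlog_queue = deque([
    {"order_id": 1, "status": "created"},
    {"order_id": 1, "status": "paid"},
    {"order_id": 1, "status": "shipped"},
])

es_orders = {}  # stand-in for the Elasticsearch order index

while binlog_queue:                # one event at a time, in order
    event = binlog_queue.popleft()
    es_orders[event["order_id"]] = event["status"]
```

The final indexed state is always the last event emitted for each order, which is exactly the guarantee parallel consumption would break.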
Subscription Data Heterogeneity
Subscription data consists of a main table and an extension table in MySQL that must be merged into a single record in Elasticsearch. Parallel consumption can lead to out‑of‑order receipt of binlog events (e.g., the extension table’s event arriving before the main table’s). The adopted solution subscribes only to the main table’s binlog, then performs a reverse lookup on MySQL to fetch the extension data, merges them, and writes the combined record. To avoid overloading the master MySQL, the reverse lookup queries a slave instance, which introduces additional I/O and latency.
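The "subscribe to the main table only, then reverse-lookup the extension table" pattern can be sketched as follows. Table and field names are hypothetical, and an in-memory dict stands in for the MySQL slave; in production the lookup is a SQL query against the replica, which is the source of the extra I/O and latency mentioned above.

```python
# Sketch of the subscription-data merge: Flink receives only main-table
# binlog events, fetches the matching extension row from a read replica,
# and emits one combined document for Elasticsearch.

slave_extension_table = {  # stand-in for the MySQL slave: subscription_id -> extension columns
    101: {"channel": "app", "coupon": "NEW10"},
}

def lookup_extension(subscription_id):
    # In production this is a query against a MySQL slave (never the master),
    # adding one extra round trip of I/O and latency per event.
    return slave_extension_table.get(subscription_id, {})

def merge_event(main_event):
    """Merge a main-table binlog event with its extension row into one document."""
    doc = dict(main_event)                               # columns from the main table
    doc.update(lookup_extension(main_event["subscription_id"]))  # columns from the extension table
    return doc                                           # combined record, ready to index

doc = merge_event({"subscription_id": 101, "user_id": 7, "status": "active"})
```

Because only one table's binlog drives the pipeline, the out-of-order-arrival problem between the two tables disappears: by the time the main-table event is processed, the lookup sees whatever the replica currently holds.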
Product Data Heterogeneity
Product data uses parallel consumption because strict ordering is not required. The architecture mirrors the subscription data flow but also writes to Redis for caching, in addition to Elasticsearch. This pattern is straightforward to implement with Flink.
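The product flow's dual write can be sketched as a sink that fans each event out to both stores. This is an assumed shape, not JD's actual code: in-memory dicts stand in for Elasticsearch and Redis, and the field names are hypothetical.

```python
# Sketch of the product-data dual write: each binlog event goes to both the
# search index and the cache. Since product data needs no strict ordering,
# many parallel instances of this sink can run at once (Flink parallelism > 1).

es_index = {}     # stand-in for an Elasticsearch index: sku_id -> full document
redis_cache = {}  # stand-in for a Redis cache: key -> cached value

def sink_product(event):
    """Fan one product binlog event out to both downstream stores."""
    sku_id = event["sku_id"]
    es_index[sku_id] = event                          # full document for search
    redis_cache[f"product:{sku_id}"] = event["name"]  # hot field for cache reads

sink_product({"sku_id": 42, "name": "keyboard", "price": 199})
```

Writing both stores from the same event keeps the cache and the index derived from a single source of truth (the binlog), rather than from each other.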
Conclusion
Flink is increasingly adopted across business scenarios as a high‑performance distributed stream‑processing platform. The evolution from a binlog‑centric processing model to a Flink‑driven streaming architecture reflects both technical improvements and a shift in mindset toward real‑time, scalable data pipelines.
The above represents only the author's personal reflections; I hope this article is helpful to you.
JD Retail Technology
Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.