Data Synchronization Strategies Between MySQL and Elasticsearch
This article explains why MySQL alone struggles with large‑scale queries, introduces Elasticsearch as a complementary search engine, and compares several synchronization approaches—including synchronous and asynchronous dual‑write, Logstash, Binlog, Canal, and Alibaba Cloud DTS—detailing their implementations, advantages, disadvantages, and suitable use cases.
In real project development and operation, MySQL often serves as the core business database, providing strong transaction processing and data integrity, but as data volume and query complexity increase, relying solely on MySQL for efficient retrieval becomes increasingly difficult.
To alleviate this challenge, a read‑write separation strategy introduces Elasticsearch (ES) as a dedicated query engine, leveraging its superior search performance, flexible schema, and scalability to handle complex queries and improve user experience.
Ensuring data synchronization between MySQL and ES is critical for real‑time accuracy and system stability. Various synchronization methods include Logstash, Kafka Connect, Debezium, scheduled jobs, etc., each with trade‑offs regarding latency, complexity, and incremental updates.
1. Synchronous Dual‑Write
Synchronous dual‑write is a data synchronization strategy that writes changes to MySQL and ES simultaneously, guaranteeing consistency but potentially increasing code complexity and risk.
Goal
The goal is to replicate business data from MySQL to ES in real time, enabling ES’s efficient query capabilities for complex search while reducing MySQL’s query load.
Implementation
Direct write: In business code, each write to MySQL is accompanied by a corresponding write to ES.
Middleware: Use message queues (e.g., Kafka), CDC tools (e.g., Debezium), or ETL tools (e.g., Logstash) to capture MySQL change events and forward them to ES, decoupling business logic from synchronization logic.
Triggers & stored procedures: Configure MySQL triggers or stored procedures to invoke ES writes automatically when data changes.
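The direct-write variant can be sketched as follows. This is a minimal illustration, not production code: plain dicts stand in for the MySQL table and the ES index, and the `save_product` helper and `DualWriteError` type are names invented for this example. Real code would use a MySQL driver and the official Elasticsearch client, with the same rollback shape.

```python
# Synchronous dual-write sketch: every primary (MySQL) write is paired with
# an ES write in the same code path. If the ES write fails, the MySQL write
# is rolled back so the two stores do not diverge -- this is both the
# consistency guarantee and the risk of the synchronous approach.

class DualWriteError(Exception):
    """Raised when the ES write fails after the MySQL write succeeded."""

def save_product(mysql, es, product):
    mysql[product["id"]] = product           # primary write (MySQL)
    try:
        es[product["id"]] = product          # mirrored write (ES index)
    except Exception as exc:
        del mysql[product["id"]]             # compensating rollback
        raise DualWriteError("ES write failed; MySQL write rolled back") from exc

mysql_table, es_index = {}, {}
save_product(mysql_table, es_index, {"id": 1, "name": "laptop"})
```

Note that the rollback itself can fail in a real system, which is one reason the article lists data-loss risk as a disadvantage of this approach.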
Pros & Cons
Advantages
Simple, easy-to-follow business logic
Strong query real-time guarantees: data is searchable in ES immediately after the write
Disadvantages
Hard-coded logic: every MySQL write point must also add ES write code
High coupling between business code and ES
Risk of data loss or inconsistency if one of the two writes fails
Performance degradation from the extra synchronous ES write
Application Scenarios
Suitable for scenarios requiring strong data consistency and optimized query performance, such as e‑commerce systems where product information and order data are stored in MySQL and synchronized to ES for complex search and analysis.
2. Asynchronous Dual‑Write
Asynchronous dual‑write allows MySQL write operations to be propagated to multiple data sources (e.g., ES) asynchronously, reducing primary database write latency and improving overall system performance, but may introduce temporary inconsistency.
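The pattern above can be sketched with a queue standing in for the message middleware. This is an illustrative sketch, not a reference implementation: dicts play the roles of the MySQL table and the ES index, `queue.Queue` plays the role of Kafka, and the function names are invented for this example.

```python
import queue
import threading

# Asynchronous dual-write sketch: the business thread writes to MySQL and
# enqueues a change event; a background consumer applies events to ES.
# The primary write returns without waiting for ES, which is where the
# latency win -- and the temporary inconsistency -- comes from.

mysql_table, es_index = {}, {}
events = queue.Queue()

def write_order(order):
    mysql_table[order["id"]] = order   # low-latency primary write
    events.put(order)                  # fire-and-forget to the queue

def es_consumer():
    while True:
        order = events.get()
        if order is None:              # shutdown sentinel for this demo
            events.task_done()
            break
        es_index[order["id"]] = order  # eventually-consistent ES write
        events.task_done()

worker = threading.Thread(target=es_consumer, daemon=True)
worker.start()
write_order({"id": 101, "status": "paid"})
events.put(None)
events.join()                          # drain the backlog before checking ES
worker.join()
```

Between `write_order` returning and the consumer catching up, a search against ES would miss the new order; that window is the "temporary inconsistency" the section describes.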
Pros
Higher system availability: failures in the secondary (ES) write path do not block primary writes
Reduced primary write latency, enabling faster writes
Multiple data source synchronization, facilitating future extensions
Cons
Each new data source requires its own consumer code to be written and maintained
Increased system complexity due to added middleware
Weaker real-time guarantees: asynchronous consumption delays when data becomes visible in ES
Potential temporary data inconsistency; additional measures needed to ensure eventual consistency
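One common "additional measure" for eventual consistency is a periodic reconciliation sweep that compares the primary store with ES and repairs drift. The sketch below assumes in-memory dicts for both stores; the `reconcile` function is a hypothetical name for this example, and a real sweep would page through rows and use the ES bulk API.

```python
# Reconciliation sketch: upsert documents that are missing or stale in ES,
# and delete ES documents whose source row no longer exists. Running this
# on a schedule bounds how long the two stores can stay inconsistent.

def reconcile(mysql, es):
    """Repair ES against MySQL; return the ids that were touched."""
    repaired = []
    for doc_id, row in mysql.items():
        if es.get(doc_id) != row:
            es[doc_id] = row            # re-index missing or stale document
            repaired.append(doc_id)
    for doc_id in list(es):
        if doc_id not in mysql:
            del es[doc_id]              # remove orphaned ES document
            repaired.append(doc_id)
    return repaired

mysql_table = {1: {"id": 1, "v": 2}, 2: {"id": 2, "v": 1}}
es_index = {1: {"id": 1, "v": 1}, 3: {"id": 3, "v": 1}}
fixed = reconcile(mysql_table, es_index)
```

In this run, document 1 is stale, document 2 is missing, and document 3 is an orphan, so all three are repaired.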
Application Scenarios
Suitable for scenarios where absolute consistency is not critical but performance is, such as synchronizing non‑critical data like user browsing logs or click counts while keeping critical order data strongly consistent.
3. Logstash Synchronization
Logstash is an open-source server-side data processing pipeline that can collect data from multiple sources, transform it, and send it to a data store of your choice. It can play an important role in syncing MySQL and Elasticsearch.
Pros
Non‑intrusive: no changes to original code
No strong coupling; does not affect original program performance
Cons
Potential latency because it relies on scheduled polling, even with second‑level intervals
Polling adds pressure on the database; can be mitigated by using a replica for polling
Cannot sync deletions automatically; manual ES delete commands are required
ES _id must match MySQL id
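The polling behavior behind the first and third cons can be sketched in a few lines. This is a simplified model of what Logstash's JDBC input does (tracking a high-water mark and upserting changed rows); the `poll_once` function and the row shape are inventions for this example, not Logstash internals.

```python
# Timestamp-based polling sketch: each cycle selects rows whose updated_at
# exceeds the last seen value and upserts them into ES. A deleted MySQL row
# simply stops appearing in the result set, so its ES document is never
# removed -- which is why deletions must be handled manually.

def poll_once(mysql_rows, es_index, last_seen):
    """Upsert rows changed since last_seen; return the new high-water mark."""
    for row in mysql_rows:
        if row["updated_at"] > last_seen:
            es_index[row["id"]] = row           # ES _id mirrors the MySQL id
            last_seen = max(last_seen, row["updated_at"])
    return last_seen

es_index = {}
rows = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]
mark = poll_once(rows, es_index, last_seen=0)   # first cycle: both rows sync
rows = [r for r in rows if r["id"] != 1]        # row 1 deleted in MySQL...
mark = poll_once(rows, es_index, mark)          # ...but it survives in ES
```

After the second cycle, row 1 is gone from MySQL yet still present in the ES index, demonstrating the deletion gap.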
4. Binlog Real‑Time Synchronization
Binlog (Binary Log) records all data‑changing SQL statements in MySQL. Tools such as Canal or Maxwell listen to Binlog changes and synchronize them to other databases or storage systems.
Advantages
Real‑time capture of database changes
Ensures data consistency between source and target
Flexibility to sync across various databases and storage systems
Scalable and can be extended/customized per business needs
No code intrusion; original system requires no changes
Disadvantages
Configuration and maintenance of sync tools can be complex
In high‑concurrency scenarios, Binlog writing and syncing may impact database performance
Sync tools depend on Binlog functionality; version or configuration changes may require re‑configuration
5. Canal Data Synchronization
Canal is an open‑source Alibaba product that parses MySQL binlog by masquerading as a slave, providing incremental data subscription and consumption. It can forward data to Alibaba Cloud ES or ES Serverless via RESTful API.
Sync Principle
Canal pretends to be a MySQL slave, subscribes to the master’s binlog, parses the binary data into JSON, and forwards it.
Sync Process
The Canal server connects to the MySQL master and issues a binlog dump request via the replication protocol.
MySQL master pushes binlog to Canal, which parses the byte stream into JSON.
Canal client receives data via TCP or MQ and writes it to ES.
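The client side of this flow can be sketched as follows. The JSON event shape here is an assumption chosen for illustration, not Canal's actual wire format, and real Canal clients decode a binary byte stream rather than pre-parsed dicts; the point is that binlog-based sync naturally carries inserts, updates, and deletes.

```python
import json

# Canal-client sketch: binlog row events arrive as JSON messages (over TCP
# or MQ) and are applied to an ES index, represented here by a dict.
# Unlike polling, the binlog captures DELETE operations, so deletions
# propagate to ES without manual intervention.

def apply_event(es_index, raw_json):
    event = json.loads(raw_json)            # hypothetical event shape
    doc_id, op = event["id"], event["type"]
    if op in ("INSERT", "UPDATE"):
        es_index[doc_id] = event["data"]    # upsert the document
    elif op == "DELETE":
        es_index.pop(doc_id, None)          # deletes flow through the binlog

es_index = {}
stream = [
    json.dumps({"type": "INSERT", "id": 1, "data": {"name": "phone"}}),
    json.dumps({"type": "UPDATE", "id": 1, "data": {"name": "phone 2"}}),
    json.dumps({"type": "DELETE", "id": 1, "data": None}),
]
for msg in stream:
    apply_event(es_index, msg)
```

After replaying the three events, the document has been created, updated, and removed, leaving the index empty.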
6. Alibaba Cloud DTS
Data Transmission Service (DTS) is Alibaba Cloud’s real‑time data flow service, supporting relational databases, NoSQL, and OLAP sources, integrating data sync, migration, subscription, integration, and processing.
System Architecture
DTS modules run in an active-standby architecture for high availability; a disaster-recovery module monitors node health and switches links automatically on failure.
Data Sync Workflow
Initialization: DTS loads the source schema and existing data into the target while simultaneously beginning to collect incremental changes.
Real‑time Sync : DTS continuously replicates ongoing data changes, keeping source and target databases synchronized.
DTS Serverless
Serverless instances adjust resources dynamically based on load metrics (RPS, CPU, memory, network), providing cost‑effective scaling without over‑provisioning.