Data Synchronization Strategies Between MySQL and Elasticsearch
This article explains why MySQL alone struggles with large‑scale queries, introduces Elasticsearch as a complementary search engine, and compares several synchronization approaches—including synchronous and asynchronous dual‑write, Logstash, Binlog, Canal, and Alibaba Cloud DTS—detailing their implementations, advantages, disadvantages, and suitable use cases.
In real project development and operation, MySQL often serves as the core business database, providing strong transaction processing and data integrity, but as data volume and query complexity increase, relying solely on MySQL for efficient retrieval becomes increasingly difficult.
To alleviate this challenge, a read‑write separation strategy introduces Elasticsearch (ES) as a dedicated query engine, leveraging its superior search performance, flexible schema, and scalability to handle complex queries and improve user experience.
Ensuring data synchronization between MySQL and ES is critical for real‑time accuracy and system stability. Various synchronization methods include Logstash, Kafka Connect, Debezium, scheduled jobs, etc., each with trade‑offs regarding latency, complexity, and incremental updates.
1. Synchronous Dual‑Write
Synchronous dual‑write is a data synchronization strategy that writes changes to MySQL and ES simultaneously, guaranteeing consistency but potentially increasing code complexity and risk.
Goal
The goal is to replicate business data from MySQL to ES in real time, enabling ES’s efficient query capabilities for complex search while reducing MySQL’s query load.
Implementation
Direct write: In business code, each write to MySQL is accompanied by a corresponding write to ES.
Middleware: Use message queues (e.g., Kafka), CDC tools (e.g., Debezium), or ETL tools (e.g., Logstash) to capture MySQL change events and forward them to ES, decoupling business logic from synchronization logic.
Triggers & stored procedures: Configure MySQL triggers or stored procedures to invoke ES writes automatically when data changes.
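The direct-write variant can be sketched as follows. This is a minimal illustration, not production code: plain dicts stand in for the MySQL table and the ES index, and the `save_product` helper and `DualWriteError` type are names invented for this example. Real code would use a MySQL driver and the official Elasticsearch client, with the same rollback shape.

```python
# Synchronous dual-write sketch: every primary (MySQL) write is paired with
# an ES write in the same code path. If the ES write fails, the MySQL write
# is rolled back so the two stores do not diverge -- this is both the
# consistency guarantee and the risk of the synchronous approach.

class DualWriteError(Exception):
    """Raised when the ES write fails after the MySQL write succeeded."""

def save_product(mysql, es, product):
    mysql[product["id"]] = product           # primary write (MySQL)
    try:
        es[product["id"]] = product          # mirrored write (ES index)
    except Exception as exc:
        del mysql[product["id"]]             # compensating rollback
        raise DualWriteError("ES write failed; MySQL write rolled back") from exc

mysql_table, es_index = {}, {}
save_product(mysql_table, es_index, {"id": 1, "name": "laptop"})
```

Note that the rollback itself can fail in a real system, which is one reason the article lists data-loss risk as a disadvantage of this approach.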
Pros & Cons
Advantages
Simple, easy-to-follow business logic
Strong query real-time guarantees: data is searchable in ES immediately after the write
Disadvantages
Hard-coded logic: every MySQL write point must also add ES write code
High coupling between business code and ES
Risk of data loss or inconsistency if one of the two writes fails
Performance degradation from the extra synchronous ES write
Application Scenarios
Suitable for scenarios requiring strong data consistency and optimized query performance, such as e‑commerce systems where product information and order data are stored in MySQL and synchronized to ES for complex search and analysis.
2. Asynchronous Dual‑Write
Asynchronous dual‑write allows MySQL write operations to be propagated to multiple data sources (e.g., ES) asynchronously, reducing primary database write latency and improving overall system performance, but may introduce temporary inconsistency.
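The pattern above can be sketched with a queue standing in for the message middleware. This is an illustrative sketch, not a reference implementation: dicts play the roles of the MySQL table and the ES index, `queue.Queue` plays the role of Kafka, and the function names are invented for this example.

```python
import queue
import threading

# Asynchronous dual-write sketch: the business thread writes to MySQL and
# enqueues a change event; a background consumer applies events to ES.
# The primary write returns without waiting for ES, which is where the
# latency win -- and the temporary inconsistency -- comes from.

mysql_table, es_index = {}, {}
events = queue.Queue()

def write_order(order):
    mysql_table[order["id"]] = order   # low-latency primary write
    events.put(order)                  # fire-and-forget to the queue

def es_consumer():
    while True:
        order = events.get()
        if order is None:              # shutdown sentinel for this demo
            events.task_done()
            break
        es_index[order["id"]] = order  # eventually-consistent ES write
        events.task_done()

worker = threading.Thread(target=es_consumer, daemon=True)
worker.start()
write_order({"id": 101, "status": "paid"})
events.put(None)
events.join()                          # drain the backlog before checking ES
worker.join()
```

Between `write_order` returning and the consumer catching up, a search against ES would miss the new order; that window is the "temporary inconsistency" the section describes.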
Pros
Higher system availability: failures in the secondary (ES) write path do not block primary writes
Reduced primary write latency, enabling faster writes
Multiple data source synchronization, facilitating future extensions
Cons
Each new data source requires its own consumer code to be written and maintained
Increased system complexity due to added middleware
Weaker real-time guarantees: asynchronous consumption delays when data becomes visible in ES
Potential temporary data inconsistency; additional measures needed to ensure eventual consistency
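One common "additional measure" for eventual consistency is a periodic reconciliation sweep that compares the primary store with ES and repairs drift. The sketch below assumes in-memory dicts for both stores; the `reconcile` function is a hypothetical name for this example, and a real sweep would page through rows and use the ES bulk API.

```python
# Reconciliation sketch: upsert documents that are missing or stale in ES,
# and delete ES documents whose source row no longer exists. Running this
# on a schedule bounds how long the two stores can stay inconsistent.

def reconcile(mysql, es):
    """Repair ES against MySQL; return the ids that were touched."""
    repaired = []
    for doc_id, row in mysql.items():
        if es.get(doc_id) != row:
            es[doc_id] = row            # re-index missing or stale document
            repaired.append(doc_id)
    for doc_id in list(es):
        if doc_id not in mysql:
            del es[doc_id]              # remove orphaned ES document
            repaired.append(doc_id)
    return repaired

mysql_table = {1: {"id": 1, "v": 2}, 2: {"id": 2, "v": 1}}
es_index = {1: {"id": 1, "v": 1}, 3: {"id": 3, "v": 1}}
fixed = reconcile(mysql_table, es_index)
```

In this run, document 1 is stale, document 2 is missing, and document 3 is an orphan, so all three are repaired.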
Application Scenarios
Suitable for scenarios where absolute consistency is not critical but performance is, such as synchronizing non‑critical data like user browsing logs or click counts while keeping critical order data strongly consistent.
3. Logstash Synchronization
Logstash is an open-source server-side data processing pipeline that can collect data from multiple sources, transform it, and send it to a data store of your choice. It can play an important role in syncing MySQL and Elasticsearch.
Pros
Non‑intrusive: no changes to original code
No strong coupling; does not affect original program performance
Cons
Potential latency because it relies on scheduled polling, even with second‑level intervals
Polling adds pressure on the database; can be mitigated by using a replica for polling
Cannot sync deletions automatically; manual ES delete commands are required
ES _id must match MySQL id
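The polling behavior behind the first and third cons can be sketched in a few lines. This is a simplified model of what Logstash's JDBC input does (tracking a high-water mark and upserting changed rows); the `poll_once` function and the row shape are inventions for this example, not Logstash internals.

```python
# Timestamp-based polling sketch: each cycle selects rows whose updated_at
# exceeds the last seen value and upserts them into ES. A deleted MySQL row
# simply stops appearing in the result set, so its ES document is never
# removed -- which is why deletions must be handled manually.

def poll_once(mysql_rows, es_index, last_seen):
    """Upsert rows changed since last_seen; return the new high-water mark."""
    for row in mysql_rows:
        if row["updated_at"] > last_seen:
            es_index[row["id"]] = row           # ES _id mirrors the MySQL id
            last_seen = max(last_seen, row["updated_at"])
    return last_seen

es_index = {}
rows = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]
mark = poll_once(rows, es_index, last_seen=0)   # first cycle: both rows sync
rows = [r for r in rows if r["id"] != 1]        # row 1 deleted in MySQL...
mark = poll_once(rows, es_index, mark)          # ...but it survives in ES
```

After the second cycle, row 1 is gone from MySQL yet still present in the ES index, demonstrating the deletion gap.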
4. Binlog Real‑Time Synchronization
Binlog (Binary Log) records all data‑changing SQL statements in MySQL. Tools such as Canal or Maxwell listen to Binlog changes and synchronize them to other databases or storage systems.
Advantages
Real‑time capture of database changes
Ensures data consistency between source and target
Flexibility to sync across various databases and storage systems
Scalable and can be extended/customized per business needs
No code intrusion; original system requires no changes
Disadvantages
Configuration and maintenance of sync tools can be complex
In high‑concurrency scenarios, Binlog writing and syncing may impact database performance
Sync tools depend on Binlog functionality; version or configuration changes may require re‑configuration
5. Canal Data Synchronization
Canal is an open‑source Alibaba product that parses MySQL binlog by masquerading as a slave, providing incremental data subscription and consumption. It can forward data to Alibaba Cloud ES or ES Serverless via RESTful API.
Sync Principle
Canal pretends to be a MySQL slave, subscribes to the master’s binlog, parses the binary data into JSON, and forwards it.
Sync Process
The Canal server connects to the MySQL master and issues a binlog dump request via the replication protocol.
MySQL master pushes binlog to Canal, which parses the byte stream into JSON.
Canal client receives data via TCP or MQ and writes it to ES.
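The client side of this flow can be sketched as follows. The JSON event shape here is an assumption chosen for illustration, not Canal's actual wire format, and real Canal clients decode a binary byte stream rather than pre-parsed dicts; the point is that binlog-based sync naturally carries inserts, updates, and deletes.

```python
import json

# Canal-client sketch: binlog row events arrive as JSON messages (over TCP
# or MQ) and are applied to an ES index, represented here by a dict.
# Unlike polling, the binlog captures DELETE operations, so deletions
# propagate to ES without manual intervention.

def apply_event(es_index, raw_json):
    event = json.loads(raw_json)            # hypothetical event shape
    doc_id, op = event["id"], event["type"]
    if op in ("INSERT", "UPDATE"):
        es_index[doc_id] = event["data"]    # upsert the document
    elif op == "DELETE":
        es_index.pop(doc_id, None)          # deletes flow through the binlog

es_index = {}
stream = [
    json.dumps({"type": "INSERT", "id": 1, "data": {"name": "phone"}}),
    json.dumps({"type": "UPDATE", "id": 1, "data": {"name": "phone 2"}}),
    json.dumps({"type": "DELETE", "id": 1, "data": None}),
]
for msg in stream:
    apply_event(es_index, msg)
```

After replaying the three events, the document has been created, updated, and removed, leaving the index empty.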
6. Alibaba Cloud DTS
Data Transmission Service (DTS) is Alibaba Cloud’s real‑time data flow service, supporting relational databases, NoSQL, and OLAP sources, integrating data sync, migration, subscription, integration, and processing.
System Architecture
DTS modules run in an active-standby architecture for high availability; a disaster-recovery module monitors node health and switches links automatically on failure.
Data Sync Workflow
Initialization: DTS loads the source schema and existing data into the target while simultaneously beginning to collect incremental changes.
Real‑time Sync : DTS continuously replicates ongoing data changes, keeping source and target databases synchronized.
DTS Serverless
Serverless instances adjust resources dynamically based on load metrics (RPS, CPU, memory, network), providing cost‑effective scaling without over‑provisioning.