Backend Development 17 min read

How to Sync MySQL Data to Elasticsearch: Strategies, Pros & Cons

This article explains why MySQL‑Elasticsearch synchronization is needed in large‑scale projects, compares several sync methods—including synchronous and asynchronous double‑write, Logstash, Binlog, Canal, and Alibaba Cloud DTS—detailing their implementation steps, advantages, disadvantages, and typical application scenarios.

Top Architect
Top Architect
Top Architect
How to Sync MySQL Data to Elasticsearch: Strategies, Pros & Cons

Overview

In many projects MySQL serves as the core business database, but as data volume and query complexity grow, relying solely on MySQL for fast retrieval becomes a bottleneck. Introducing Elasticsearch (ES) as a dedicated query engine improves search performance, flexibility, and scalability, making data synchronization between MySQL and ES a critical task.

Sync Schemes

1. Synchronous Double‑Write

Goal : Write the same data to MySQL and ES simultaneously to keep them consistent and offload query pressure from MySQL.

Implementation :

Direct Sync : In business code, after a MySQL write, execute an ES write immediately. Simple but increases code complexity and risk of failure.

Middleware : Use message queues (Kafka), CDC tools (Debezium), or ETL tools (Logstash) to capture MySQL changes and forward them to ES, decoupling business logic from sync logic.

Triggers & Stored Procedures : Define MySQL triggers or procedures that invoke ES writes on data changes, reducing code intrusion but adding load to MySQL.

Pros :

Simple business logic

High real‑time query capability

Cons :

Hard‑coded write paths everywhere

Strong coupling between services

Risk of data loss if one write fails

Additional write latency reduces overall performance

2. Asynchronous Double‑Write

This approach writes to MySQL first, then asynchronously propagates changes to ES, reducing write latency and improving system performance while tolerating temporary inconsistency.

Pros :

Higher system availability—backup failures don’t affect the primary

Reduced primary write latency

Easy to add more downstream data stores

Cons :

Hard‑coded integration for each new data source

Increased system complexity due to message middleware

Potential delay in data visibility

Eventual consistency risks require additional safeguards

3. Logstash Sync

Logstash is an open‑source data pipeline that can ingest data from multiple sources, transform it, and output to a target repository. It can capture MySQL changes and push them to ES.

Pros :

No code intrusion, no hard‑coding

Decouples sync logic, preserving original application performance

Cons :

Latency due to periodic polling, even at second‑level intervals

Polling adds load to the database (often mitigated by using a replica)

Cannot automatically delete documents in ES; manual cleanup required

ES _id must match MySQL primary key

4. Binlog Real‑Time Sync

Binlog records all data‑changing SQL statements in MySQL. Real‑time sync tools (e.g., Canal, Maxwell) listen to binlog events, parse them, and replicate changes to ES.

Advantages :

Real‑time capture

Strong data consistency

Supports multiple target systems

Scalable and flexible

No code changes required

Disadvantages :

Configuration and maintenance can be complex

High‑throughput environments may see performance impact on MySQL

Tooling depends on binlog availability; version changes may need reconfiguration

5. Canal Sync

Canal, an open‑source project from Alibaba, pretends to be a MySQL slave to subscribe to binlog events, providing near‑millisecond latency data replication.

Sync Process :

Canal client requests dump protocol from MySQL master.

MySQL streams binlog to Canal, which converts binary data to JSON.

Canal forwards parsed events to ES via TCP or MQ.

6. Alibaba Cloud DTS

Data Transmission Service (DTS) offers real‑time data flow between heterogeneous data sources (RDBMS, NoSQL, OLAP). It handles initial data load and continuous incremental sync, providing high availability, dynamic source address adaptation, and optional Serverless instances that auto‑scale with workload.

Key Features :

High‑availability architecture with automatic failover

Dynamic adaptation to source address changes

Two‑phase sync: initialization (full load) + real‑time incremental sync

Serverless Mode : Adjusts resources (CPU, memory, RPS) automatically, reducing cost during low‑traffic periods while ensuring capacity during spikes.

存储库
backendElasticsearchMySQLBinlogCanalData SynchronizationLogstash
Top Architect
Written by

Top Architect

Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.