Evolution of JD.com Order Center Elasticsearch Cluster Architecture and Lessons Learned
This article details the progressive evolution of JD.com’s order center Elasticsearch cluster—from its initial default setup through isolation, replica optimization, master‑slave adjustments, and real‑time dual‑cluster backup—highlighting architectural decisions, scaling strategies, synchronization methods, and operational challenges encountered.
In JD.com’s order‑to‑home business, the massive volume of order queries caused a read‑heavy workload that could not be efficiently handled by MySQL alone, prompting the adoption of Elasticsearch as the primary search engine for order data.
Initially, the ES cluster ran with default settings on shared elastic-cloud instances, resulting in a chaotic node layout and single-point-of-failure risk.
To improve stability, the cluster was isolated onto dedicated physical machines, eliminating resource contention with other services.
Next came replica tuning: the default one-primary-one-replica layout was expanded to one primary with two replicas, and additional nodes were added, improving both write throughput and query performance.
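For reference, raising the replica count on a live index is a one-line settings change: a PUT to the index's `_settings` endpoint with a body like the following (the index name and exact value here are illustrative, not JD's actual configuration):

```json
{
  "index": {
    "number_of_replicas": 2
  }
}
```

Replicas can be changed at any time without reindexing, which is why this was a low-risk first optimization.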
Later, a master‑slave architecture was introduced, with a standby cluster that receives writes synchronously while the primary handles most traffic; the standby stores recent hot data (≈10% of primary volume) and can take over instantly during primary failures.
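The routing logic described above can be sketched as follows. This is a minimal illustration, not JD's actual code: the class and method names are hypothetical, and in-memory maps stand in for the two ES clusters. Writes go to both clusters so the standby stays current; reads fall through to the standby only when the primary is marked down.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of read routing in a primary/standby setup.
// The standby receives every write synchronously but in practice
// retains only recent hot data (~10% of primary volume), so it can
// serve traffic instantly if the primary fails.
public class PrimaryStandbyRouter {
    private final Map<String, String> primary = new HashMap<>();
    private final Map<String, String> standby = new HashMap<>();
    private boolean primaryHealthy = true;

    /** Writes go to both clusters so the standby stays current. */
    public void write(String id, String doc) {
        primary.put(id, doc);
        standby.put(id, doc); // real system: only hot data kept here
    }

    /** Health checks would flip this flag automatically. */
    public void markPrimaryDown() {
        primaryHealthy = false;
    }

    /** Reads hit the primary; on failure they fall through to standby. */
    public String read(String id) {
        return primaryHealthy ? primary.get(id) : standby.get(id);
    }
}
```

Because recent orders dominate query traffic, a standby holding only hot data can absorb most reads during a failover despite its much smaller size.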
Finally, the system evolved into a real‑time dual‑cluster setup after upgrading the primary from ES 1.7 to ES 6.x, employing a seamless failover mechanism and a bi‑directional write strategy to ensure continuous service.
The article also compares two data‑sync approaches from MySQL to ES: (1) listening to binlog events and pushing changes to ES, which decouples the systems but adds a new service and maintenance overhead; (2) directly using the ES API in business code, which is simpler and lower‑latency but tightly couples the application to ES. JD.com chose the latter, supplemented by a compensation worker that retries failed writes.
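The direct-write-plus-compensation pattern can be sketched like this. All names here (`OrderEsWriter`, the `EsClient` interface, the in-memory retry queue) are illustrative stand-ins, assuming the real system persists failed writes somewhere durable rather than in memory:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch: business code writes to ES directly after the
// MySQL commit; failed writes land in a retry queue that a separate
// compensation worker drains later.
public class OrderEsWriter {
    /** Stand-in for the real ES client; may throw on transient failures. */
    public interface EsClient {
        void index(String id, String doc) throws Exception;
    }

    private final EsClient es;
    private final Queue<String[]> retryQueue = new ArrayDeque<>();

    public OrderEsWriter(EsClient es) {
        this.es = es;
    }

    /** Called from business code right after the database write commits. */
    public void writeOrder(String id, String doc) {
        try {
            es.index(id, doc);
        } catch (Exception e) {
            // Don't fail the business flow; record the write for compensation.
            retryQueue.add(new String[] { id, doc });
        }
    }

    /** Compensation worker: retry everything that failed earlier. */
    public void runCompensation() {
        int pending = retryQueue.size();
        for (int i = 0; i < pending; i++) {
            String[] entry = retryQueue.poll();
            try {
                es.index(entry[0], entry[1]);
            } catch (Exception e) {
                retryQueue.add(entry); // still failing; keep for next round
            }
        }
    }

    public int pendingRetries() {
        return retryQueue.size();
    }
}
```

The trade-off matches the article's point: the application is coupled to ES, but there is no extra binlog-listening service to operate, and the compensation worker bounds the window of inconsistency.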
Key operational pitfalls discussed include the near‑real‑time nature of ES refresh (making high‑freshness queries better served by the database), the performance impact of deep pagination (large from values cause heavy per‑shard processing), and the memory pressure of fielddata versus the more efficient doc‑values for sorting and aggregations.
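The deep-pagination cost is easy to quantify: with from/size paging, each shard must collect and sort `from + size` documents, and the coordinating node then merges `shards * (from + size)` candidates just to return `size` hits. A back-of-the-envelope calculation (the shard count and page depths are illustrative):

```java
public class DeepPagingCost {
    /** Documents each shard must collect for a from/size query. */
    static long perShardDocs(long from, long size) {
        return from + size;
    }

    /** Candidates the coordinating node must merge across all shards. */
    static long coordinatorDocs(int shards, long from, long size) {
        return (long) shards * (from + size);
    }

    public static void main(String[] args) {
        // Page 1 (from=0, size=10) on 5 shards: the coordinator
        // merges only 5 * 10 = 50 candidates.
        System.out.println(coordinatorDocs(5, 0, 10));       // prints 50
        // Page 10,001 (from=100000, size=10): each shard sorts
        // 100,010 docs and the coordinator merges 500,050 candidates,
        // all to return 10 hits.
        System.out.println(coordinatorDocs(5, 100_000, 10)); // prints 500050
    }
}
```

This is why deep pages are usually served with cursor-style scrolling or `search_after` instead of large `from` offsets.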
Overall, the rapid architectural iterations driven by business growth illustrate that there is no single “best” design—only the most suitable one for current scale and requirements, with continuous optimization needed to handle ever‑increasing throughput and stability demands.