Elasticsearch Performance Optimization for Billion‑Scale Data
The article explains how to improve Elasticsearch query speed on tens of billions of records by leveraging filesystem cache, limiting indexed fields, using hot‑cold data separation, designing efficient document models, and employing scroll or search_after APIs to avoid deep pagination bottlenecks.
Interview question: how do you improve query efficiency in Elasticsearch at a scale of tens of billions of records? The answer stresses that ES performance is not magic: a first search against a large, cold dataset may take 5–10 seconds, but subsequent queries get faster as caches warm up.
Filesystem cache as the key optimizer – ES writes data to disk, and the operating system caches index segment files in the filesystem cache. Allocating enough memory so that the cache can hold most or all index files moves searches from disk to memory, reducing latency from seconds to milliseconds.
A real‑world case: a 3‑node cluster with 64 GB RAM per node allocated 32 GB to the JVM heap on each node, leaving only 32 GB per node (96 GB cluster‑wide) for the filesystem cache, while the total index size was 1 TB. Since only ~10 % of the data could reside in cache, most queries hit disk and were slow.
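The arithmetic behind that case can be sketched in a few lines; the function name and parameters are illustrative, but the numbers are the ones from the cluster above.

```python
# Sketch: estimate what fraction of an index fits in the filesystem cache.
# cache_coverage is a hypothetical helper; the inputs mirror the case above.

def cache_coverage(nodes: int, ram_gb: int, heap_gb: int, index_gb: int) -> float:
    """Fraction of the index that the cluster's filesystem cache can hold."""
    cache_per_node = ram_gb - heap_gb          # OS-level cache left after the JVM heap
    total_cache = nodes * cache_per_node       # cluster-wide filesystem cache
    return total_cache / index_gb

# 3 nodes, 64 GB RAM each, 32 GB JVM heap, 1 TB (~1024 GB) of index data
coverage = cache_coverage(nodes=3, ram_gb=64, heap_gb=32, index_gb=1024)
print(f"{coverage:.0%}")  # roughly 9-10%: most segment reads fall through to disk
```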
The guideline is to keep the amount of data stored in ES no larger than the total filesystem cache (ideally half of the machine’s memory). Store only searchable fields (e.g., id, name, age) in ES and keep the rest in a backend store such as HBase, then retrieve full records by ID after the ES hit.
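The "search in ES, fetch the full record elsewhere" pattern can be sketched as below. The in-memory dicts stand in for real clients (e.g. an ES search call and an HBase row lookup); all names and sample data are illustrative.

```python
# Sketch of the pattern: ES indexes only searchable fields, the backend
# store (e.g. HBase) holds the full record, keyed by ID.

# Stand-in for the small ES index: only id, name, age are indexed.
SEARCH_INDEX = {
    "u1": {"id": "u1", "name": "alice", "age": 30},
    "u2": {"id": "u2", "name": "bob", "age": 25},
}

# Stand-in for the backend store with the complete documents.
FULL_STORE = {
    "u1": {"id": "u1", "name": "alice", "age": 30, "bio": "likes search engines"},
    "u2": {"id": "u2", "name": "bob", "age": 25, "bio": "writes batch jobs"},
}

def search_then_fetch(name: str) -> list:
    # Step 1: hit the small search index on an indexed field only.
    ids = [doc["id"] for doc in SEARCH_INDEX.values() if doc["name"] == name]
    # Step 2: multi-get the full records by ID from the backend store.
    return [FULL_STORE[i] for i in ids if i in FULL_STORE]
```

Because ES holds only a few small fields per document, far more of the index fits in the filesystem cache.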
Data warm‑up – Periodically query hot data (e.g., popular user profiles or top‑selling products) to preload it into the cache, ensuring fast response for subsequent user requests.
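A warm-up job can be as simple as periodically re-issuing cheap queries for known-hot entities. The sketch below only builds the query bodies; the hot-ID list, field names, and scheduling are assumptions (in production this would run on a scheduler such as cron).

```python
# Sketch of a cache warm-up job: re-query hot entities so their index
# segments stay resident in the filesystem cache. HOT_IDS is illustrative.

HOT_IDS = ["p1001", "p1002", "p1003"]  # e.g. top-selling product IDs

def warmup_queries(hot_ids: list) -> list:
    """One lightweight term query per hot ID; firing these periodically
    keeps the matching segments warm without fetching document bodies."""
    return [
        {"query": {"term": {"id": pid}}, "size": 1, "_source": False}
        for pid in hot_ids
    ]

bodies = warmup_queries(HOT_IDS)  # each body would be sent to the search endpoint
```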
Hot‑cold separation – Create separate indices for hot and cold data, placing hot indices on a subset of nodes so their cache is not polluted by cold data, thereby preserving high performance for frequent queries.
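One common way to pin hot and cold indices to different nodes is shard allocation filtering on a custom node attribute; the sketch below builds the settings body, assuming nodes are tagged with a `box_type` attribute (the attribute name and index names are illustrative).

```python
# Sketch of hot-cold separation via shard allocation filtering.
# Assumes nodes are started with a custom attribute, e.g.
#   node.attr.box_type: hot    (on the hot nodes)
#   node.attr.box_type: cold   (on the remaining nodes)

def index_settings(tier: str) -> dict:
    """Settings body pinning an index's shards to one tier of nodes."""
    return {
        "settings": {
            "index.routing.allocation.require.box_type": tier
        }
    }

hot_settings = index_settings("hot")    # for the frequently queried index
cold_settings = index_settings("cold")  # for the archive index
# These bodies would be applied when creating e.g. orders_hot / orders_cold.
```

With this split, queries against the hot index never evict its segments in favor of rarely read cold data.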
Document model design – Avoid complex joins or nested queries in ES; pre‑join data in the application layer and index only the fields needed for search.
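Pre-joining at write time can look like the following sketch: the user fields are copied into each order document before indexing, so no join is needed at query time. The field names and sample data are assumptions.

```python
# Sketch of application-layer pre-joining: denormalize user fields into
# each order document before it is indexed, instead of joining in ES.

USERS = {"u1": {"name": "alice", "city": "Beijing"}}

def denormalize(order: dict, users: dict) -> dict:
    """Return a flat, search-ready document with user fields copied in."""
    user = users[order["user_id"]]
    doc = dict(order)                 # copy; do not mutate the source order
    doc["user_name"] = user["name"]   # duplicated data buys join-free queries
    return doc

order_doc = denormalize({"order_id": 1, "user_id": "u1", "amount": 99}, USERS)
# order_doc now carries user_name, so a search can filter on it directly.
```

The trade-off is write amplification: when a user record changes, every denormalized document that copied its fields must be reindexed.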
Pagination performance – Deep pagination is costly because each shard must return many documents to the coordinating node. Recommended solutions include:
Disallow deep pagination.
Use the scroll API to retrieve results sequentially via a scroll_id, suitable for infinite‑scroll interfaces.
Use search_after with a unique sort field for efficient forward‑only pagination.
Both scroll and search_after provide millisecond‑level latency but do not support random page jumps.
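The search_after approach above can be sketched as follows. The query bodies mirror what would be sent to ES (sort on a unique key pair, resume after the last hit); `run_page` and the `DOCS` data are an in-memory stand-in for the cluster, not a real client call.

```python
# Sketch of search_after pagination against an in-memory stand-in.
# DOCS and the (ts, id) sort key are illustrative.

DOCS = sorted(
    [{"id": i, "ts": 1000 + i} for i in range(10)],
    key=lambda d: (d["ts"], d["id"]),
)

def page_body(size: int, search_after=None) -> dict:
    """Query body: sort on a unique key pair, resume after the last hit."""
    body = {"size": size, "sort": [{"ts": "asc"}, {"id": "asc"}]}
    if search_after is not None:
        body["search_after"] = search_after  # sort values of the last doc seen
    return body

def run_page(body: dict) -> list:
    """Stand-in for a search call: apply the body against DOCS."""
    after = body.get("search_after")
    hits = DOCS if after is None else [
        d for d in DOCS if (d["ts"], d["id"]) > tuple(after)
    ]
    return hits[: body["size"]]

# Walk forward page by page; no deep from/size offset anywhere.
first = run_page(page_body(size=3))
last = first[-1]
second = run_page(page_body(size=3, search_after=[last["ts"], last["id"]]))
```

Each page costs the same regardless of how deep you are, which is exactly why random page jumps are not supported: there is no offset to jump to, only a cursor to advance.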
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies