
Elasticsearch Index and Search Performance Optimization for Billion‑Scale Data

This article presents a comprehensive case study of optimizing Elasticsearch and its underlying Lucene structures to achieve sub‑second query responses on billions of records, covering architecture basics, index design, doc‑values tuning, bulk‑write strategies, and extensive performance testing.

Architect

Sep 23, 2022


The data platform has evolved through three versions, encountering common challenges that led to a consolidated set of documentation focusing on Elasticsearch (ES) optimization, while referring to other components like HBase and Hadoop.

Project Background: A business system stores over a hundred million rows per day in partitioned tables, limited to three months of data retention, making cross‑day queries and long‑term storage difficult.

Improvement Goals:

Enable cross‑month queries and support over a year of historical data export.

Achieve second‑level response times for conditional queries.

3.1 ES and Lucene Fundamentals: ES clusters consist of nodes, indices, shards, and replicas, with Lucene as the core storage engine. Key concepts such as clusters, nodes, indices, types, documents, shards, and doc‑values are explained.

Lucene stores data in segments, each containing multiple documents and fields, which are tokenized into terms. The Lucene index file structure includes dictionaries, inverted lists, forward files, and doc‑values.
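
To make the dictionary and inverted-list idea concrete, here is a toy sketch in Python. It is not Lucene's on-disk format: the whitespace tokenizer, the sample documents, and the function name `build_inverted_index` are all assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of doc IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenizer: lowercase and split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sorting the terms mimics a term dictionary; each value is a posting list.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


# Hypothetical sample documents, keyed by doc ID.
docs = {
    0: "disk full on node",
    1: "node restarted",
    2: "disk replaced",
}

index = build_inverted_index(docs)
print(index["disk"])  # doc IDs whose text contains "disk"
print(index["node"])  # doc IDs whose text contains "node"
```

A term lookup then becomes a dictionary access followed by a walk over a short list of doc IDs, which is why exact-match queries on indexed fields are cheap.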

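3.3 ES Index and Search Sharding: Documents are routed to shards using shard = hash(routing) % number_of_primary_shards, where routing defaults to the document ID. Choosing a routing value that matches the query pattern keeps related documents on the same shard, so a query that supplies the same routing value searches only that shard instead of all of them.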
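
The routing formula can be sketched in a few lines of Python. This is only an illustration: `zlib.crc32` is a deterministic stand-in chosen here, not the hash function Elasticsearch uses internally, and `pick_shard` is a hypothetical helper name.

```python
import zlib


def pick_shard(routing: str, number_of_primary_shards: int) -> int:
    """Return the shard number for a routing value.

    zlib.crc32 is a stand-in hash for illustration only; Elasticsearch
    uses its own hash function internally.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


# The same routing value always maps to the same shard, so documents
# written with it can later be searched on that single shard.
shard_a = pick_shard("customer-42", 5)
shard_b = pick_shard("customer-42", 5)
print(shard_a == shard_b)
```

Because the modulus is the number of primary shards, changing that number would remap existing documents, which is one reason the primary shard count is fixed when an index is created.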

Doc‑Values: Column‑store structures that enable fast sorting and aggregation. Unnecessary doc‑values should be disabled to reduce resource consumption.
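
A rough way to picture doc-values is one array per field, indexed by doc ID. The sketch below assumes that simplified layout (the real encoding is compressed and stored on disk) and borrows the field names `state` and `b` purely as sample data; `DOC_VALUES_ENABLED` and `sort_doc_ids` are hypothetical names.

```python
# Hypothetical sample documents; the list position is the doc ID.
docs = [
    {"state": "open", "b": 30},
    {"state": "closed", "b": 10},
    {"state": "open", "b": 20},
]

# Which fields keep a column. Disabling a field saves the space its
# column would occupy, at the cost of sorting/aggregating on it.
DOC_VALUES_ENABLED = {"state": False, "b": True}

# Column-oriented layout: one array per enabled field, indexed by doc ID.
columns = {
    field: [doc[field] for doc in docs]
    for field, enabled in DOC_VALUES_ENABLED.items()
    if enabled
}


def sort_doc_ids(doc_ids, field):
    """Sort matching doc IDs by a field, reading only that field's column."""
    if field not in columns:
        raise ValueError(f"doc values are disabled for field {field!r}")
    column = columns[field]
    return sorted(doc_ids, key=lambda doc_id: column[doc_id])


print(sort_doc_ids([0, 1, 2], "b"))  # doc IDs ordered by ascending b
```

Sorting touches only the `b` column, never the full documents. A field that is only ever filtered on, such as `state` in this sketch, does not need a column at all.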

Optimization Cases: The case study uses fixed query fields without full‑text search, stores only row keys in ES, and keeps actual data in HBase. Recommendations include batch writes, multi‑threaded ingestion, extending refresh_interval, allocating ample memory for Lucene caching, using SSDs, customizing IDs, and tuning merge thread counts.
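
The split between ES and HBase amounts to a two-step lookup: search ES for matching row keys, then fetch the full rows from HBase by key. The sketch below uses in-memory stand-ins rather than real client calls; `es_index`, `hbase_table`, the row-key format, and both function names are assumptions for illustration.

```python
# In-memory stand-ins for the two stores (not real client objects).
# ES holds only the searchable fields plus the row key.
es_index = [
    {"row_key": "20220901-0001", "state": "open"},
    {"row_key": "20220901-0002", "state": "closed"},
    {"row_key": "20220902-0001", "state": "open"},
]

# HBase holds the complete record, addressed by row key.
hbase_table = {
    "20220901-0001": {"state": "open", "payload": "first record"},
    "20220901-0002": {"state": "closed", "payload": "second record"},
    "20220902-0001": {"state": "open", "payload": "third record"},
}


def search_row_keys(state):
    """Step 1: a conditional query against the index returns row keys only."""
    return [doc["row_key"] for doc in es_index if doc["state"] == state]


def fetch_rows(row_keys):
    """Step 2: point lookups by row key return the full records."""
    return [hbase_table[key] for key in row_keys]


rows = fetch_rows(search_row_keys("open"))
print([row["payload"] for row in rows])
```

Keeping the index this small is what allows most of it to stay in the file cache, while the point lookups by row key are the access pattern HBase handles well.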

4.1 Index Performance Tuning:

Batch write sizes of hundreds to thousands of records.

Multi‑threaded ingestion matching the number of machines.

Set "refresh_interval": "-1" during bulk loads and refresh manually once the load completes.

Leave about 50% of each node's memory to the file system cache that Lucene relies on (e.g., 64 GB per node).

Prefer SSDs over HDDs for random I/O.

Use custom keys aligned with HBase row keys.

Configure merge throttling and thread counts based on disk type.
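
Several of the write-side points above (batching, custom IDs, and pausing refresh) can be sketched as follows. The code only builds request bodies and sends nothing. The index name `data_index`, the `row_key` field, and the helper names are assumptions, and older ES versions that still use mapping types would also need a `_type` in each action line.

```python
import json
from itertools import islice

# Bodies for the index settings API: pause refresh before a bulk load,
# then restore it afterwards ("1s" is the Elasticsearch default).
DISABLE_REFRESH = {"index": {"refresh_interval": "-1"}}
RESTORE_REFRESH = {"index": {"refresh_interval": "1s"}}


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def bulk_body(index_name, rows):
    """Build a newline-delimited _bulk body, using the row key as the doc ID."""
    lines = []
    for row in rows:
        action = {"index": {"_index": index_name, "_id": row["row_key"]}}
        source = {key: value for key, value in row.items() if key != "row_key"}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Hypothetical rows; in the case study the row key would match HBase.
rows = [{"row_key": f"rk-{i}", "state": "open", "b": i} for i in range(2500)]
batches = list(chunked(rows, 1000))
print([len(batch) for batch in batches])
print(bulk_body("data_index", batches[0]).splitlines()[0])
```

Each batch could then be posted from its own worker thread. The batch size of 1000 is just a starting point within the range suggested above and is worth tuning against the actual document size.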

4.2 Search Performance Tuning:

Disable unnecessary doc‑values.

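Prefer keyword fields over numeric types such as long or int for values that are only matched exactly; a term query on a keyword is cheaper than a range query on a number.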

Turn off _source storage for fields not needed in results.

Use filters or constant_score queries to avoid scoring overhead.

Handle pagination efficiently with from+size, search_after, or scroll as appropriate.

Introduce combined timestamp‑ID fields for sorting.

Allocate CPUs with 16 cores or more for sorting‑heavy workloads.

Set "merge.policy.expunge_deletes_allowed": "0" so that deleted records are purged promptly.
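
The filter, pagination, and sorting points above can be combined in one sketch. It builds search request bodies only. The field name `sort_key`, the 20-bit ID width, and both helper names are assumptions, and the bit-shift packing is just one plausible way to combine a timestamp and an ID, not necessarily the scheme used in the case study.

```python
def pack_sort_key(timestamp_ms: int, record_id: int, id_bits: int = 20) -> int:
    """Pack a timestamp and a record ID into one integer.

    The result sorts by time first and by ID second, so a single long
    field can serve as a unique sort key.
    """
    if not 0 <= record_id < (1 << id_bits):
        raise ValueError("record_id does not fit in id_bits")
    key = (timestamp_ms << id_bits) | record_id
    if key >= 1 << 63:
        raise ValueError("key does not fit in a signed 64-bit long")
    return key


def build_page_query(state, page_size, after_key=None):
    """Build a search body that filters without scoring and pages by sort key."""
    body = {
        "size": page_size,
        # constant_score wraps a filter, so no relevance score is computed.
        "query": {"constant_score": {"filter": {"term": {"state": state}}}},
        "sort": [{"sort_key": "asc"}],
    }
    if after_key is not None:
        # search_after resumes from the sort value of the previous page's last hit.
        body["search_after"] = [after_key]
    return body


first_page = build_page_query("open", 100)
last_key = pack_sort_key(1663891200000, 7)  # sort value of the last hit (made up)
next_page = build_page_query("open", 100, after_key=last_key)
print(next_page["search_after"])
```

Unlike from+size, a search_after request does not make each shard collect and discard all earlier hits, which is why it suits deep pagination over a unique sort key.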

Performance Testing: Benchmarks include single‑node tests with 50 M–100 M records, cluster tests up to 3 B records, varied query combinations, and SSD vs. HDD comparisons.

Production Results: The optimized platform handles tens of billions of records, returning 100‑row queries within 3 seconds, with fast pagination. Future bottlenecks can be addressed by scaling nodes.

Example index mapping that applies several of the settings above: dynamic mapping turned off, _source limited to selected fields, and doc‑values disabled on a keyword field (the field list and the settings body are elided in the source):

{
    "mappings": {
        "data": {
            "dynamic": "false",
            "_source": {
                "includes": ["XXX"]
            },
            "properties": {
                "state": {
                    "type": "keyword",
                    "doc_values": false
                },
                "b": {
                    "type": "long"
                }
            }
        }
    },
    "settings": {......}
}
