
Diagnosing and Optimizing Elasticsearch IO Bottlenecks for Billion-Scale Product Catalogs

As product data grew from tens of millions to billions, the search service hit severe I/O-wait and read bottlenecks. This article analyzes the root causes in the Elasticsearch cluster and presents a comprehensive solution involving index parameter tuning, merge settings, asynchronous translog writes, query optimizations, and hardware upgrades to restore performance and stability.


Problem Description

As the Gov Procurement Cloud search service expanded, the total number of products quickly grew from tens of millions to billions. The rapid growth caused index sizes to swell, leading to increasingly severe I/O read bottlenecks in the Elasticsearch cluster: high I/O wait and disk reads approaching hardware limits, which hurt query latency and cluster stability.

Root Cause Hypothesis

After encountering the I/O issue, we suspected the problem stemmed mainly from the following:

Rapid data expansion: the product catalog has grown from tens of millions to billions, consuming large amounts of memory and steadily driving up I/O usage.

Improper use of Elasticsearch: the cluster is not fully leveraging ES performance capabilities.

Insufficient hardware: the ES cluster hardware configuration needed upgrading.

Solution Exploration

Parameter Adjustment

In the early stage, we mainly tried to solve the problem by adjusting configuration parameters.

Slow Down Segment Generation Rate

Elasticsearch relies on Lucene for retrieval, and Lucene’s smallest execution unit is a Segment. Therefore, the number of Segments directly impacts query performance.

More Segments per shard mean more scans and higher I/O overhead. This can be mitigated by increasing the index refresh_interval (e.g., to 60 s) to reduce real‑time segment creation.

"refresh_interval": "60s"
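The setting can be applied dynamically via the index settings API — a minimal sketch, assuming an index named products:

```
PUT /products/_settings
{
  "index": {
    "refresh_interval": "60s"
  }
}
```

New documents then become searchable only after the next refresh, so this trades search freshness for fewer, larger Segments.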

Merge Parameter Adjustment

Besides extending the refresh interval, we can also control Segment merging. Elasticsearch provides dynamic index‑level configuration to manage how Segments are merged.

The merge parameters are mainly divided into three categories: rate limiting, new segment generation, and forcemerge.

Rate Limiting

index.merge.scheduler.max_thread_count : Maximum number of merge threads; can be increased on SSDs.

index.merge.scheduler.auto_throttle : Whether to enable adaptive I/O throttling for merges (enabled by default; the throttle starts at 20 MB/s and adjusts with merge load).

index.merge.scheduler.max_merge_count : Maximum concurrent merge tasks.

index.merge.policy.max_merge_at_once : Maximum number of Segments merged in a single operation.

New Segment Generation

index.merge.policy.floor_segment : Segments smaller than this are treated as this size when selecting merges (default 2 MB). Raising it reduces the total Segment count but may increase merge frequency.

index.merge.policy.max_merged_segment : Maximum size of a merged Segment (default 5 GB). Forcemerge can bypass this limit.

index.merge.policy.segments_per_tier : Allowed Segments per tier; smaller values trigger more merges.
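These are dynamic index settings and can be changed on a live index. A minimal sketch, assuming an index named products and with illustrative values rather than recommendations:

```
PUT /products/_settings
{
  "index": {
    "merge.scheduler.max_thread_count": 4,
    "merge.policy.floor_segment": "16mb",
    "merge.policy.segments_per_tier": 8
  }
}
```

Lowering segments_per_tier keeps fewer Segments per shard at the price of more merge I/O, so changes like these are best validated under production-like load.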

forcemerge Parameters

If automatic merge cannot meet requirements, the forcemerge API can force Segment merging, effectively reducing Segment count and physically removing deleted documents.

index.merge.policy.expunge_deletes_allowed : Works with only_expunge_deletes to merge only Segments whose delete ratio exceeds this threshold.

index.merge.policy.max_merge_at_once_explicit : Controls the maximum number of Segments merged in a single forcemerge operation.

Note: forcemerge incurs heavy I/O and CPU usage and invalidates caches; it should be run during low‑traffic periods to avoid cluster overload.
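Two common forcemerge invocations, sketched against a hypothetical products index:

```
POST /products/_forcemerge?max_num_segments=1

POST /products/_forcemerge?only_expunge_deletes=true
```

The first merges each shard down to a single Segment (best reserved for indices that no longer receive writes); the second rewrites only Segments whose deleted-document ratio exceeds expunge_deletes_allowed.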

Translog Asynchronous Write

{
  "translog": {
    "flush_threshold_size": "512mb",
    "sync_interval": "5s",
    "durability": "async"
  }
}

By default Elasticsearch fsyncs the translog after every indexing request ( durability: request ), guaranteeing data safety but adding I/O overhead. Switching durability to async fsyncs only once per sync_interval, reducing I/O and improving performance at the cost of potentially losing up to sync_interval of acknowledged writes if a node crashes.

Query Optimization

After parameter tuning and nightly forcemerge tasks, the cluster’s peak I/O read dropped from ~500 MB/s to ~400 MB/s.

However, as the product volume continues to expand, further measures are required. We therefore focus on query‑side optimizations, which fall into two main directions:

Reduce Data Scanned per Query

Elasticsearch normally queries all shards concurrently. Narrowing the scan range can dramatically improve retrieval efficiency and reduce the number of documents examined. Three approaches are used:

Routing Strategy : Specify a routing key during indexing and querying to target a particular shard, avoiding unnecessary shard scans. This may affect global uniqueness of _id and increase migration cost.

Scan Limits : Limit fetch time and the amount of data traversed per shard using parameters such as timeout and terminate_after :

{
  "timeout": "1s",
  "terminate_after": 1000
}

Here timeout bounds per-shard collection time, while terminate_after caps the number of documents collected on each shard before the query returns early.

Index Splitting : Similar to database sharding, split large indices by time, shop, or category. Combined with ILM and hot‑cold node allocation, this reduces the amount of data each query scans.
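A sketch of the routing approach — the index name, field, and routing value are illustrative. Writing and querying with the same routing key confines the search to a single shard:

```
PUT /products/_doc/1001?routing=shop_42
{
  "title": "Office chair",
  "shop_id": "shop_42"
}

GET /products/_search?routing=shop_42
{
  "query": {
    "term": { "shop_id": "shop_42" }
  }
}
```

Because routing determines shard placement, every write and read for the same logical partition must use the same key, and _id uniqueness is only enforced within a shard.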

Adopt More Efficient Query Fields

Avoid using Elasticsearch for relational joins. Nested and parent/child queries are significantly slower than flat documents and inflate the document count, consuming extra disk and memory; denormalize data at index time where possible.

Choose appropriate field types: use keyword for exact matches and numeric types for range queries.

Avoid wildcard queries; they are CPU‑intensive. Instead, employ N‑gram tokenizers for fuzzy matching.
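A minimal N-gram sketch (all names illustrative): substrings are generated at index time, so a fuzzy lookup becomes an ordinary match query instead of a wildcard scan:

```
PUT /products
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "trigrams": { "type": "ngram", "min_gram": 3, "max_gram": 3 }
      },
      "analyzer": {
        "trigram_analyzer": {
          "type": "custom",
          "tokenizer": "trigrams",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "trigram_analyzer" }
    }
  }
}
```

The trade-off is a larger index and slower writes, which is usually cheaper overall than the CPU cost of wildcard queries at search time.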

Hardware Upgrade (Scaling)

In parallel with software tuning, we pursued two hardware upgrades:

Upgrade disks to SSDs.

Increase memory capacity.

Upgrade to SSD

When budget permits, switching to SSDs is recommended. SSDs deliver far lower latency and higher IOPS; regular backups and redundancy (multiple racks, zones, replicas, or CCR) remain necessary to mitigate data-loss risk. Hot nodes can use SSDs while cold nodes keep HDDs for cost-effective storage.

Expand Memory

Adding RAM allows Lucene to cache more data in the OS page cache, dramatically reducing disk reads. For moderate data volumes, adding a few hundred GB of memory is cheaper than an SSD upgrade. We typically cap the JVM heap at 31 GB (staying below the compressed-oops threshold), leave the remaining RAM to the page cache for Lucene, and maintain a memory-to-disk ratio of roughly 1:10.
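As a sketch, the heap cap lives in the node's jvm.options file; setting -Xms equal to -Xmx avoids heap-resize pauses, and 31 GB keeps the JVM below the compressed-oops cutoff:

```
-Xms31g
-Xmx31g
```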

Conclusion

This exploration of I/O bottlenecks covered three aspects: parameter tuning, query refactoring, and hardware upgrades. Upgrading hardware yields the most immediate gains, while application‑level refactoring provides the most thorough, long‑term solution. Parameter adjustments can postpone issues but are not a final fix.


Tags: Elasticsearch, search performance, I/O optimization, index tuning, translog, cluster operations, merge settings
Written by

政采云技术

ZCY Technology Team (Zero), based in Hangzhou, is a growth-oriented team passionate about technology and craftsmanship. With around 500 members, we are building comprehensive engineering, project management, and talent development systems. We are committed to innovation and creating a cloud service ecosystem for government and enterprise procurement. We look forward to your joining us.
