
Performance Investigation and Optimization of an Elasticsearch Search Service

This article describes a performance bottleneck in a high‑traffic search service, details the investigation of hardware limits, long‑tail query impact, load‑testing methodology, and the subsequent optimizations—including SSD upgrade, data‑structure reduction, and Elasticsearch segment tuning—that reduced disk I/O and improved throughput.


1. Phenomenon Description

Rapid growth of business data caused frequent alerts in the search center during peak periods, and users reported unusually slow product search speeds. An initial quick scaling response did not address the root cause, prompting a thorough performance investigation and optimization effort.

2. Problem Investigation

Hardware configuration of the Elasticsearch data nodes was 32 vCPU, 128 GiB RAM, 1 TB storage with 350 MB/s disk I/O. Monitoring revealed a sudden surge in request volume that saturated disk I/O and later stressed CPU, creating an avalanche effect during business peaks.

Analysis of traffic showed a significant increase in long‑tail keyword requests, which further impacted search performance.

3. Load‑Testing Verification

A load test was conducted in the production environment by replaying real request bodies captured during the incident through the Dubbo interface, gradually ramping from 10 to 30 concurrent requests.
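The replay-and-ramp approach can be sketched roughly as follows. This is a minimal illustration, not the team's actual harness: `send_search_request` is a hypothetical stand-in for the real Dubbo client call, and the captured bodies are placeholder data.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def send_search_request(body):
    """Hypothetical stand-in for the real Dubbo search invocation."""
    time.sleep(0.001)  # simulate service latency
    return {"took_ms": 1}


def run_stage(bodies, concurrency):
    """Replay captured request bodies at a fixed concurrency level
    and return a rough throughput figure (requests per second)."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(send_search_request, bodies))
    elapsed = time.time() - start
    return len(results) / elapsed


# Ramp concurrency in stages, as in the incident replay (10 -> 30).
captured_bodies = [{"keyword": "office chair"}] * 200  # placeholder sample
for concurrency in (10, 20, 30):
    tps = run_stage(captured_bodies, concurrency)
```

At each stage, node-level CPU, disk I/O, and thread-pool queue metrics would be observed alongside the measured throughput to locate the bottleneck.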

Findings:

CPU usage on Elasticsearch nodes remained moderate as concurrency grew.

Disk I/O rose sharply, approaching the threshold at 20 concurrent requests.

Further concurrency increases caused thread‑pool queues to grow, reproducing the online issue and confirming the bottleneck was disk I/O.

Reducing the maximum keyword length lowered the number of terms produced by tokenization, which in turn decreased the number of term lookups Elasticsearch performed per query and relieved CPU pressure.
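A keyword-length cap of this kind can be applied at the service layer before the query reaches Elasticsearch. The sketch below is illustrative; the limit of 32 characters is a hypothetical value, not the one used in the article.

```python
MAX_KEYWORD_LEN = 32  # hypothetical cap; tune against real query logs


def normalize_keyword(raw: str, max_len: int = MAX_KEYWORD_LEN) -> str:
    """Trim whitespace and cap keyword length before building the ES query.
    Fewer input characters means fewer tokens after analysis, and therefore
    fewer term lookups per search."""
    return raw.strip()[:max_len]
```

A sensible cap preserves virtually all legitimate queries while cutting off pathological long-tail inputs.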

4. Further Optimization Directions

Hardware Resource Optimization

Replace cloud disks with local SSDs and adjust configurations:

New node: 16 vCPU, 128 GiB RAM, 2 × 1788 GiB SSD, 1 GB/s I/O.

CPU cores reduced from 32 to 16, since CPU was not the bottleneck.

Disk throughput increased from 350 MB/s to 1 GB/s.

SSD upgrade improves IOPS from ~52k to 300k and reduces network overhead.

Data Structure Optimization

Elasticsearch relies heavily on file‑system cache; at least half of available memory should be dedicated to it. Large fields were removed, shrinking index size from >900 GB to ~400 GB, allowing more data to reside in cache and dramatically lowering disk I/O.
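Removing large fields from the index can be done at mapping time. The fragment below is a hedged sketch: the index and field names (`product_v2`, `detail_html`) are hypothetical, and it shows one common approach, excluding a bulky field from `_source` so it is neither stored nor returned, while keeping the fields that searches actually need.

```json
PUT /product_v2
{
  "mappings": {
    "_source": {
      "excludes": ["detail_html"]
    },
    "properties": {
      "title": { "type": "text" }
    }
  }
}
```

Shrinking documents this way lets a far larger fraction of the index fit in the OS file-system cache, which is where the disk-I/O savings come from.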

Elasticsearch Parameter Optimization

By default, each shard refreshes every second, creating new segments; the more segments there are, the slower searches become. The maximum merged segment size was raised from the default 5 GB to 10 GB and then 15 GB, which improved TPS during load tests by reducing the segment count.
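These knobs correspond to dynamic index settings in Elasticsearch. A sketch of applying them (the index name `product_v2` is hypothetical, and a relaxed `refresh_interval` is shown as an additional option the article's one-second-refresh observation suggests):

```json
PUT /product_v2/_settings
{
  "index": {
    "refresh_interval": "30s",
    "merge.policy.max_merged_segment": "10gb"
  }
}
```

`index.merge.policy.max_merged_segment` defaults to 5 GB; raising it allows the merge policy to produce fewer, larger segments, trading some merge I/O for faster searches.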

Post‑Optimization Load Testing

After applying SSD upgrade, data‑structure reduction, and segment tuning, the same load‑testing scenario was rerun. Results showed:

Disk I/O reduced by 99% due to a ~60% data size reduction, with most reads served from OS page cache.

CPU utilization increased from 20% to 40% (more headroom).

Memory usage remained stable, dominated by page cache.

5. Summary

The performance tuning of the search service centered on Elasticsearch, involving hardware upgrades, data‑structure cleanup, and parameter adjustments. Comprehensive analysis and iterative optimization were essential to ensure stable, high‑performance search operations.

Big Data · Elasticsearch · Performance Tuning · Load Testing · Disk I/O · Segment Optimization
Written by

政采云技术

ZCY Technology Team (Zero), based in Hangzhou, is a growth-oriented team passionate about technology and craftsmanship. With around 500 members, we are building comprehensive engineering, project management, and talent development systems. We are committed to innovation and creating a cloud service ecosystem for government and enterprise procurement. We look forward to your joining us.
