Performance Investigation and Optimization of an Elasticsearch Search Service
This article describes a performance bottleneck in a high‑traffic search service, details the investigation of hardware limits, long‑tail query impact, load‑testing methodology, and the subsequent optimizations—including SSD upgrade, data‑structure reduction, and Elasticsearch segment tuning—that reduced disk I/O and improved throughput.
1. Phenomenon Description
Rapid growth of business data caused frequent alerts in the search center during peak periods, and users reported unusually slow product search speeds. An initial quick scaling response did not address the root cause, prompting a thorough performance investigation and optimization effort.
2. Problem Investigation
Hardware configuration of the Elasticsearch data nodes was 32 vCPU, 128 GiB RAM, 1 TB storage with 350 MB/s disk I/O. Monitoring revealed a sudden surge in request volume that saturated disk I/O and later stressed CPU, creating an avalanche effect during business peaks.
Analysis of traffic showed a significant increase in long‑tail keyword requests, which further impacted search performance.
3. Load‑Testing Verification
A production‑environment load test was conducted by replaying real request bodies captured during the incident through the Dubbo interface, ramping concurrency in steps from 10 up to 30 concurrent requests.
Findings:
CPU usage on Elasticsearch nodes remained moderate as concurrency grew.
Disk I/O rose sharply, approaching the 350 MB/s limit at around 20 concurrent requests.
Further concurrency increases caused thread‑pool queues to grow, reproducing the online issue and confirming the bottleneck was disk I/O.
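The concurrency ramp described above can be sketched as a small replay harness. This is a minimal illustration, not the team's actual tooling: the `search` stub and its 20 ms sleep stand in for the real Dubbo client call, and the captured request bodies are fabricated placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def search(request_body):
    """Stub for the real Dubbo search call; the sleep simulates latency."""
    time.sleep(0.02)
    return {"status": "ok", "echo": request_body}


def replay(requests, concurrency):
    """Replay captured request bodies at a fixed concurrency level
    and return the worst per-request latency observed (seconds)."""
    def timed(body):
        t0 = time.perf_counter()
        search(body)
        return time.perf_counter() - t0

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed, requests))
    return max(latencies)


# Ramp concurrency from 10 to 30 in steps of 10, as in the incident replay.
captured = [f"keyword-{i}" for i in range(60)]
for concurrency in (10, 20, 30):
    worst = replay(captured, concurrency)
    print(f"concurrency={concurrency} worst_latency={worst:.3f}s")
```

In a real replay the stub would be replaced by the service client, and disk I/O and thread‑pool queue depth on the data nodes would be watched alongside latency.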
As an immediate mitigation, reducing the maximum allowed keyword length lowered the number of tokenized terms per query, decreasing the number of term lookups Elasticsearch had to perform and easing CPU pressure.
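A minimal sketch of that keyword‑length cap, assuming a hypothetical `MAX_KEYWORD_LEN` limit (the real value is a service config) and a whitespace split standing in for the actual Elasticsearch analyzer:

```python
# Hypothetical cap; the production limit is a service configuration value.
MAX_KEYWORD_LEN = 32


def normalize_keyword(raw: str) -> str:
    """Truncate over-long search keywords before analysis so fewer
    terms are produced and fewer term queries hit Elasticsearch."""
    return raw[:MAX_KEYWORD_LEN]


def term_count(keyword: str) -> int:
    """Whitespace split stands in for the real analyzer."""
    return len(keyword.split())


long_tail = "stainless steel insulated vacuum water bottle large capacity outdoor"
print(term_count(long_tail), term_count(normalize_keyword(long_tail)))
```

Truncating the long‑tail query above cuts the term count roughly in half, and each dropped term is one less posting list to read from disk.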
4. Further Optimization Directions
Hardware Resource Optimization
Replace cloud disks with local SSDs and adjust configurations:
New node: 16 vCPU, 128 GiB RAM, 2 × 1788 GiB SSD, 1 GB/s I/O.
CPU cores reduced from 32 to 16 (CPU not the bottleneck).
Disk throughput increased from 350 MB/s to 1 GB/s.
SSD upgrade improves IOPS from ~52k to 300k and reduces network overhead.
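A back‑of‑envelope check shows why the IOPS jump matters. Assuming 4 KiB random reads (an assumption; real Lucene read sizes vary widely), sustained throughput is roughly IOPS × block size:

```python
KIB = 1024


def max_throughput_mb_s(iops: int, block_kib: int = 4) -> float:
    """Rough ceiling: sustained throughput = IOPS x block size.
    4 KiB random reads are assumed for illustration."""
    return iops * block_kib * KIB / 1_000_000


print(max_throughput_mb_s(52_000))   # cloud disk, ~213 MB/s
print(max_throughput_mb_s(300_000))  # local SSD, well over 1 GB/s
```

The ~52k‑IOPS cloud disk tops out near the observed 350 MB/s ceiling under small random reads, while 300k IOPS comfortably sustains the new 1 GB/s link.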
Data Structure Optimization
Elasticsearch relies heavily on the file‑system cache; common guidance is to leave at least half of available memory to the OS page cache rather than the JVM heap. Large fields were removed, shrinking the index from >900 GB to ~400 GB, allowing far more of the data to reside in cache and dramatically lowering disk I/O.
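The sizing guidance can be made concrete for the 128 GiB nodes. This is a sketch of the common community heuristic, not an official Elasticsearch API: give the JVM at most half of RAM, capped below ~32 GiB so compressed ordinary object pointers stay enabled, and leave everything else to the page cache Lucene reads from.

```python
def recommended_heap_gib(ram_gib: int, oops_cutoff_gib: int = 31) -> int:
    """Common Elasticsearch sizing heuristic: heap is at most half of RAM,
    capped below ~32 GiB to keep compressed oops; the remainder is left
    to the OS page cache that Lucene relies on."""
    return min(ram_gib // 2, oops_cutoff_gib)


ram = 128
heap = recommended_heap_gib(ram)
page_cache = ram - heap
print(heap, page_cache)  # 31 GiB heap, 97 GiB left for page cache
```

With ~97 GiB available for cache, a ~400 GB index keeps its hot segments memory‑resident far more often than a >900 GB one can.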
Elasticsearch Parameter Optimization
By default each shard refreshes every second, continuously creating new segments, and the more segments a shard has, the slower searches become. The maximum merged‑segment size was raised from the default 5 GB to 10 GB and then 15 GB, producing fewer, larger segments and improving TPS during load tests.
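The merged‑segment ceiling is controlled by the dynamic index setting `index.merge.policy.max_merged_segment` (default `5gb`). A sketch of the settings body one might send via `PUT /<index>/_settings` for the final tuning step described above (the index name and HTTP call are omitted; only the payload is shown):

```python
import json

# Settings payload mirroring the article's final value; default is "5gb".
settings = {
    "index": {
        "merge": {"policy": {"max_merged_segment": "15gb"}},
    }
}
print(json.dumps(settings, indent=2))
```

Because the setting is dynamic, it can be applied to an existing index without reindexing; already‑merged segments are unaffected until future merges run.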
Post‑Optimization Load Testing
After applying SSD upgrade, data‑structure reduction, and segment tuning, the same load‑testing scenario was rerun. Results showed:
Disk I/O dropped by ~99%: with the index roughly 55% smaller, most reads were served from the OS page cache.
CPU utilization rose from ~20% to ~40% as the disk bottleneck disappeared, still leaving ample headroom.
Memory usage remained stable, dominated by page cache.
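The data‑size reduction behind the I/O drop checks out from the figures given earlier (>900 GB down to ~400 GB):

```python
# Index size before and after removing large fields (GB, from the article).
before_gb, after_gb = 900, 400
reduction = (before_gb - after_gb) / before_gb
print(f"{reduction:.0%}")  # roughly 56% smaller index
```

At roughly 56% smaller, the working set fits far better in the ~97 GiB of page cache each node can offer, which is why disk reads nearly vanished.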
5. Summary
The performance tuning of the search service centered on Elasticsearch, involving hardware upgrades, data‑structure cleanup, and parameter adjustments. Comprehensive analysis and iterative optimization were essential to ensure stable, high‑performance search operations.
ZCY (Zhengcaiyun) Technology
ZCY Technology Team (Zero), based in Hangzhou, is a growth-oriented team passionate about technology and craftsmanship. With around 500 members, we are building comprehensive engineering, project management, and talent development systems. We are committed to innovation and creating a cloud service ecosystem for government and enterprise procurement. We look forward to your joining us.