Performance Optimization of Elasticsearch in an ELK Log Architecture
This article summarizes a year of performance tuning on an ELK logging system. It analyzes bottlenecks such as write thread-pool saturation, JVM heap and GC settings, refresh intervals, translog durability, merge threads, and shard and replica counts, and it gives the concrete configuration changes that reduced latency, eliminated data loss, and stabilized node resource usage.
The author reviews a year of performance optimization work on a company's ELK logging system, focusing on Elasticsearch (ES) as the log storage component.
Current environment: 3 virtual machines (16 CPUs, 32 GB RAM each) running ES 6.3.0 with a 20 GB heap per node, JDK 1.8, CMS + ParNew GC, CentOS 7.4.
Issues observed: About 230 indices and 30-50 million documents are created daily. Logs arrive 5-40 minutes late and are frequently lost because the write thread pool is saturated (all 16 threads active, 200-slot queue full).
Root-cause analysis: Data accumulates in the ES in-memory buffer and is not written out to the OS cache quickly enough; the write thread pool reaches its maximum capacity, so new tasks are rejected.
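The saturation described above can be observed through the _cat/thread_pool API, for example GET _cat/thread_pool/write?v, whose default columns are node_name, name, active, queue and rejected. The following is a minimal sketch of how that text output could be checked for saturated nodes. The sample output, node names and counters are hypothetical, and the parser assumes node names contain no spaces.

```python
def parse_cat_thread_pool(text):
    """Parse the whitespace-separated table from GET _cat/thread_pool/write?v."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    # One dict per node row, keyed by the column names in the header line.
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def saturated_nodes(rows, pool_size=16, queue_size=200):
    """Return nodes whose write pool is full or has already rejected tasks."""
    flagged = []
    for row in rows:
        full = int(row["active"]) >= pool_size and int(row["queue"]) >= queue_size
        if full or int(row["rejected"]) > 0:
            flagged.append(row["node_name"])
    return flagged


# Hypothetical sample output; these node names and numbers are made up.
SAMPLE = """\
node_name name  active queue rejected
es-node-1 write 16     200   1532
es-node-2 write 16     87    0
es-node-3 write 3      0     0
"""

rows = parse_cat_thread_pool(SAMPLE)
print(saturated_nodes(rows))  # ['es-node-1']
```

The defaults of 16 threads and a 200-slot queue match the environment described in this article; other clusters would pass their own values.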
Optimization directions:
JVM tuning – adjust heap size, set Xms equal to Xmx, configure NewSize/MaxNewSize, consider G1 GC.
ES tuning – modify refresh interval, translog durability, merge thread count, shard/replica numbers.
Thread-pool configuration – increase write pool size to 17 and enlarge the queue to 10000.
Disable swap – use sudo swapoff -a, set vm.swappiness=1, or enable bootstrap.memory_lock: true.
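The write pool size of 17 in the list above follows from a documented limit: in ES 6.x the fixed write pool defaults to the number of available processors and is capped at 1 + the number of available processors. A small sketch of that arithmetic for the 16-CPU nodes described here (the function names are hypothetical helpers, not part of any Elasticsearch client):

```python
def max_write_pool_size(available_processors):
    """Upper bound for thread_pool.write.size in ES 6.x: 1 + available processors."""
    return available_processors + 1


def check_write_pool_size(size, available_processors):
    """Raise if the requested size exceeds the documented bound, else return it."""
    limit = max_write_pool_size(available_processors)
    if size > limit:
        raise ValueError(f"write pool size {size} exceeds limit {limit}")
    return size


print(max_write_pool_size(16))        # 17
print(check_write_pool_size(17, 16))  # 17
```

With 16 CPUs per node, 17 is therefore the largest value the write pool can take, so further headroom has to come from the queue size rather than from more threads.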
JVM tuning examples:
-Xms16g
-Xmx16g
-XX:NewSize=8G
-XX:MaxNewSize=8G
These settings reduce Young GC frequency and shorten pause times.
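The values above are consistent with the common guidance of giving Elasticsearch no more than half of the machine's RAM, leaving the rest to the OS file cache, and keeping the heap below roughly 32 GB. The sketch below reproduces the flags from the RAM size under those assumptions. The 31 GB cap is a rule of thumb, and the young-generation share of one half is simply the ratio used in this article, not a general recommendation.

```python
def jvm_heap_flags(ram_gb, young_fraction=0.5):
    """Derive heap and young-generation flags from the node's RAM in GB."""
    # Assumption: heap is at most half of RAM, capped at 31 GB (rule of thumb).
    heap_gb = min(ram_gb // 2, 31)
    # Assumption: young generation gets a fixed share of the heap (half here).
    young_gb = int(heap_gb * young_fraction)
    return [
        f"-Xms{heap_gb}g",
        f"-Xmx{heap_gb}g",
        f"-XX:NewSize={young_gb}G",
        f"-XX:MaxNewSize={young_gb}G",
    ]


for flag in jvm_heap_flags(32):
    print(flag)
```

For the 32 GB nodes in this article the function yields a 16 GB heap with an 8 GB young generation, matching the settings listed above.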
ES configuration snippets:
"index.refresh_interval": "5s"
"index.translog.durability": "async"
"index.translog.flush_threshold_size": "1024mb"
"index.translog.sync_interval": "120s"
"index.merge.scheduler.max_thread_count": "1"
To apply these to existing indices, close them first with curl -XPOST 'http://localhost:9200/_all/_close', update the settings, then reopen the indices.
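The close, update, reopen sequence can be sketched as three REST calls. The example below is a minimal illustration using only the Python standard library. It assumes an unsecured cluster at http://localhost:9200 as in the curl command above, and the helper names build_plan and run_plan are hypothetical. Only the plan is printed here; run_plan is defined but not called, since it would modify a live cluster.

```python
import json
import urllib.request

# Settings taken from the snippets in this article.
SETTINGS = {
    "index.refresh_interval": "5s",
    "index.translog.durability": "async",
    "index.translog.flush_threshold_size": "1024mb",
    "index.translog.sync_interval": "120s",
    "index.merge.scheduler.max_thread_count": "1",
}


def build_plan(index="_all"):
    """Return the (method, path, body) steps: close, update settings, reopen."""
    return [
        ("POST", f"/{index}/_close", None),
        ("PUT", f"/{index}/_settings", SETTINGS),
        ("POST", f"/{index}/_open", None),
    ]


def run_plan(plan, base_url="http://localhost:9200"):
    """Send each step in order; urlopen raises on an HTTP error, stopping the run."""
    for method, path, body in plan:
        data = json.dumps(body).encode("utf-8") if body is not None else None
        req = urllib.request.Request(
            base_url + path,
            data=data,
            method=method,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            print(method, path, resp.status)


# Dry run: show the steps without contacting a cluster.
for method, path, _body in build_plan():
    print(method, path)
```

Note that _all targets every index on the cluster, and indexing into closed indices fails while the settings are being changed, so a narrower index pattern and a quiet window are safer choices.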
Template creation for future indices:
PUT _template/business_log
{
  "index_patterns": ["*202*.*.*"],
  "settings": {
    "index.merge.scheduler.max_thread_count": "1",
    "index.refresh_interval": "5s",
    "index.translog.durability": "async",
    "index.translog.flush_threshold_size": "1024mb",
    "index.translog.sync_interval": "120s"
  }
}
Thread-pool adjustment in elasticsearch.yml:
thread_pool:
  write:
    size: 17
    queue_size: 10000
After applying all optimizations, the system showed no write rejections, log loss was eliminated, latency dropped to under 10 seconds (typically 5 seconds), and node CPU/memory usage remained stable.
Final results:
No more data loss.
Log query delay within 5‑10 seconds.
Stable node load with moderate CPU and memory consumption.
References include several Chinese and English blog posts and official Elasticsearch documentation.
Full-Stack Internet Architecture
Introducing full-stack Internet architecture technologies centered on Java