Performance Optimization of Elasticsearch in an ELK Log Architecture

This article summarizes a year of performance tuning on an ELK logging system. It analyzes the main bottleneck, write thread pool saturation, reviews the settings that influence it (JVM heap and GC, refresh interval, translog durability, merge threads, shard and replica counts), and lists the concrete configuration changes that reduced latency, eliminated data loss, and stabilized node resource usage.

Full-Stack Internet Architecture

Jan 9, 2021

The author reviews a year of performance optimization work on a company's ELK logging system, focusing on Elasticsearch (ES) as the log storage component.

Current environment: 3 virtual machines (16 CPU, 32 GB RAM each) running ES 6.3.0 with 20 GB heap per node, JDK 1.8, CMS + ParNew GC, CentOS 7.4.

Issues observed: About 230 indices and 30-50 million documents were created daily. Logs arrived 5-40 minutes late and were frequently lost because the write thread pool was saturated (16 active threads, queue size of 200).

Root-cause analysis: Data accumulates in the ES in-memory indexing buffer and is not written out to the OS file-system cache quickly enough; the write thread pool then reaches its maximum capacity and starts rejecting tasks.
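One way to confirm this kind of saturation is the `_cat/thread_pool` API, which reports active, queued, and rejected tasks per node. The sketch below parses that plain-text output. The sample text, the node names in it, and the helper names are made up for illustration; they are not taken from the author's cluster.

```python
def parse_cat_thread_pool(text):
    """Parse the plain-text output of
    GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected
    into a list of dicts. Assumes node names contain no spaces."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        fields = dict(zip(header, ln.split()))
        for key in ("active", "queue", "rejected"):
            fields[key] = int(fields[key])
        rows.append(fields)
    return rows


def saturated_nodes(rows, pool_size, queue_size):
    """Return nodes that have rejected tasks, or whose pool and queue are both full."""
    return [
        r["node_name"]
        for r in rows
        if r["rejected"] > 0
        or (r["active"] >= pool_size and r["queue"] >= queue_size)
    ]


# Hypothetical sample output; real numbers will differ.
SAMPLE = """
node_name name  active queue rejected
es-node-1 write     16   200     1523
es-node-2 write      3     0        0
es-node-3 write     16   180        0
"""

rows = parse_cat_thread_pool(SAMPLE)
flagged = saturated_nodes(rows, pool_size=16, queue_size=200)
print(flagged)
```

A non-zero `rejected` count is the clearest signal, since it means bulk requests were dropped rather than merely delayed.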

Optimization directions:

JVM tuning – adjust heap size, align Xms/Xmx, configure NewSize/MaxNewSize, consider G1 GC.

ES tuning – modify refresh interval, translog durability, merge thread count, shard/replica numbers.

Thread-pool configuration – increase the write pool size to 17 (one more than the 16 CPU cores, which is the upper bound ES 6.x enforces for this pool) and enlarge the queue to 10 000.

Disable swap – run sudo swapoff -a, set vm.swappiness=1, or set bootstrap.memory_lock: true in elasticsearch.yml.

JVM tuning examples:

-Xms16g
-Xmx16g

-XX:NewSize=8G
-XX:MaxNewSize=8G

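A larger young generation (8 GB of the 16 GB heap) reduces Young GC frequency, and the author reports shorter pause times as well. Setting Xms equal to Xmx avoids heap resizing at runtime, and 16 GB keeps the heap at half of the 32 GB of RAM, leaving the rest for the OS file-system cache.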
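The arithmetic behind these flags can be checked with a short sketch. The half-of-RAM ceiling used here is the commonly cited Elasticsearch guideline rather than something the article states, and the helper name `jvm_flags` is hypothetical.

```python
def jvm_flags(ram_gb, heap_gb, young_gb):
    """Build the heap flags and sanity-check them against the host's RAM.

    Assumption: the heap should not exceed half of physical RAM, so the
    remainder is left for the OS file-system cache.
    """
    if heap_gb > ram_gb / 2:
        raise ValueError("heap exceeds half of RAM")
    if young_gb >= heap_gb:
        raise ValueError("young generation must be smaller than the heap")
    return [
        f"-Xms{heap_gb}g",
        f"-Xmx{heap_gb}g",
        f"-XX:NewSize={young_gb}G",
        f"-XX:MaxNewSize={young_gb}G",
    ]


flags = jvm_flags(ram_gb=32, heap_gb=16, young_gb=8)
old_gen_gb = 16 - 8  # what remains of the heap for the old generation
print("\n".join(flags))
print("old generation:", old_gen_gb, "GB")
```

Under this check the original 20 GB heap on a 32 GB host would be flagged, which is consistent with the move to 16 GB.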

ES configuration snippets:

"index.refresh_interval": "5s"
"index.translog.durability": "async"
"index.translog.flush_threshold_size": "1024mb"
"index.translog.sync_interval": "120s"
"index.merge.scheduler.max_thread_count": "1"

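Apply to existing indices by closing them with curl -XPOST 'http://localhost:9200/_all/_close', updating the settings with a PUT to _all/_settings, then reopening them with curl -XPOST 'http://localhost:9200/_all/_open'. Closing is needed because some of these settings (index.translog.sync_interval, for example) cannot be changed on an open index in ES 6.x.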
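The close, update, reopen sequence can be sketched with the Python standard library as follows. This is an illustration under stated assumptions, not the author's script: the helper names are hypothetical, the cluster is assumed to be reachable at `http://localhost:9200` without authentication, and `apply_steps` is defined but deliberately not called here. Note that closing `_all` makes every index unavailable until it is reopened.

```python
import json
import urllib.request

# The index settings listed above, as flat dotted keys.
SETTINGS = {
    "index.refresh_interval": "5s",
    "index.translog.durability": "async",
    "index.translog.flush_threshold_size": "1024mb",
    "index.translog.sync_interval": "120s",
    "index.merge.scheduler.max_thread_count": "1",
}


def settings_update_steps(target="_all"):
    """Return the ordered (method, path, body) steps: close, update, reopen."""
    return [
        ("POST", f"/{target}/_close", None),
        ("PUT", f"/{target}/_settings", SETTINGS),
        ("POST", f"/{target}/_open", None),
    ]


def build_request(base_url, method, path, body):
    """Build (but do not send) an HTTP request for one step."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return urllib.request.Request(
        base_url + path, data=data, headers=headers, method=method
    )


def apply_steps(base_url="http://localhost:9200", target="_all"):
    """Send the three steps in order; raises on any HTTP error."""
    for method, path, body in settings_update_steps(target):
        req = build_request(base_url, method, path, body)
        with urllib.request.urlopen(req) as resp:
            print(method, path, resp.status)


steps = settings_update_steps()
for method, path, _ in steps:
    print(method, path)
```

Running the steps against a single index pattern instead of `_all` (for example by passing a different `target`) limits the blast radius while testing.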

Template creation for future indices:

PUT _template/business_log
{
  "index_patterns": ["*202*.*.*"],
  "settings": {
    "index.merge.scheduler.max_thread_count": "1",
    "index.refresh_interval": "5s",
    "index.translog.durability": "async",
    "index.translog.flush_threshold_size": "1024mb",
    "index.translog.sync_interval": "120s"
  }
}
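Because the template pattern is broad, it can help to check which index names it would actually catch. The sketch below approximates Elasticsearch's `*` wildcard with Python's `fnmatchcase`, which is an assumption that holds for names without `?` or `[` characters. The index names are invented examples, not the author's.

```python
from fnmatch import fnmatchcase

PATTERN = "*202*.*.*"  # the index pattern from the template above


def template_applies(index_name, pattern=PATTERN):
    """Approximate whether the template's index pattern matches a name."""
    return fnmatchcase(index_name, pattern)


# Hypothetical index names for illustration.
names = [
    "order-service-2021.01.09",
    "gateway-2020.12.31",
    ".kibana",
    "metrics-2021-01",
]
matched = [n for n in names if template_applies(n)]
print(matched)
```

Only names that contain `202` followed by at least two dots match, so daily indices named with a `yyyy.MM.dd` suffix in the 2020s pick up the template while system indices such as `.kibana` do not.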

Thread‑pool adjustment in elasticsearch.yml:

thread_pool:
  write:
    size: 17
    queue_size: 10000
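The numbers in this block can be sanity-checked with a small sketch. It assumes the ES 6.x rule that the write pool size may not exceed the number of processors plus one; the helper name `write_pool_config` is hypothetical.

```python
def write_pool_config(size, queue_size, processors):
    """Validate the write pool size against the assumed ES 6.x bound
    (processors + 1) and return the settings as a nested dict."""
    max_size = processors + 1
    if size > max_size:
        raise ValueError(f"write pool size must be <= {max_size}")
    return {"thread_pool": {"write": {"size": size, "queue_size": queue_size}}}


config = write_pool_config(size=17, queue_size=10000, processors=16)

# Tasks a node can hold (running + queued) before it starts rejecting.
capacity_before = 16 + 200
capacity_after = 17 + 10000
print(config)
print(capacity_before, "->", capacity_after)
```

Almost all of the extra headroom comes from the queue rather than the extra thread. Queued bulk requests are held in heap, so a large queue trades rejections for memory pressure and should be watched alongside GC behavior.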

After applying all optimizations, the system showed no write rejections, log loss was eliminated, latency dropped to under 10 seconds (typically 5 seconds), and node CPU/memory usage remained stable.

Final results:

No more data loss.

Logs become searchable within 5-10 seconds of being written.

Stable node load with moderate CPU and memory consumption.

References include several Chinese and English blog posts and official Elasticsearch documentation.
