DiDi ElasticSearch Write Performance Optimization: Doubling Throughput and Reducing Write Rejections
By consolidating each bulk request onto a single shard, tuning node write paths, and pruning unnecessary fields, DiDi’s ElasticSearch team eliminated long-tail latency, more than doubled write throughput to over 1 million ops per second, slashed write rejections, and saved millions in infrastructure costs.
1. Background
DiDi’s ElasticSearch platform serves all internal ElasticSearch workloads, including core search, RDS replicas, log retrieval, security analytics, and metric analysis. The platform now exceeds 3,000 nodes, stores 5 PB of data, and holds over a trillion documents. Peak write throughput reaches 20 million writes per second, with nearly 1 billion queries per day. Operating at this scale requires constant attention to stability, usability, performance, and cost.
After data hot nodes were scaled down to save storage costs, the cluster began to hit write bottlenecks: frequent write rejections and growing Kafka-to-ES ingestion lag. The following sections describe how the bottleneck was identified and eliminated, ultimately more than doubling write throughput.
2. Write Bottleneck Analysis
2.1 Discovery
Initial performance tests showed ES node CPU usage above 80 % under load, yet production nodes ran below 50 % CPU, with overall cluster CPU under 40 % and no I/O or network pressure. Scaling an index from 10 to 16 nodes increased write speed from ~200 k/s to ~300 k/s, so adding nodes helped even though resources sat idle, indicating the bottleneck lay on the ES server side rather than with the writing clients.
2.2 ES Write Model Overview
Clients batch data via the Bulk API. A BulkRequest is split by each document’s routing into multiple BulkShardRequests, one per target shard, each sent to the DataNode hosting that shard. The BulkRequest returns only after all of its BulkShardRequests complete.
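The fan-out described above can be sketched as follows. This is an illustrative model, not ES internals: `shard_of` stands in for ES’s routing hash (ES actually uses Murmur3), and `bulk_latency` models only the aggregation behavior.

```python
import zlib


def shard_of(routing: str, number_of_shards: int) -> int:
    # Stand-in for ES's routing hash (ES really uses Murmur3).
    return zlib.crc32(routing.encode()) % number_of_shards


def split_bulk(docs, number_of_shards):
    """Partition a BulkRequest into per-shard BulkShardRequests."""
    shard_requests = {}
    for doc in docs:
        shard = shard_of(doc["_routing"], number_of_shards)
        shard_requests.setdefault(shard, []).append(doc)
    return shard_requests


def bulk_latency(per_shard_latency_ms):
    # The BulkRequest completes only when every BulkShardRequest has,
    # so its latency is the MAX across shards: one slow shard
    # dominates the whole request (the long-tail effect).
    return max(per_shard_latency_ms.values())
```

Because the overall latency is a maximum rather than an average, a single shard stuck behind a refresh or a GC pause stalls the entire batch.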
2.3 Root Cause Investigation
BulkRequest slow‑log entries were examined. Example log lines:
[xxx][INFO ][o.e.m.r.RequestTracker ] [log6-clientnode-sf-5aaae-10] bulkDetail||requestId=null||size=10486923||items=7014||totalMills=2206||max=2203||avg=37
[xxx][INFO ][o.e.m.r.RequestTracker ] [log6-clientnode-sf-5aaae-10] bulkDetail||requestId=null||size=210506||items=137||totalMills=2655||max=2655||avg=218
The large gap between avg and max suggested a long-tail effect: a single slow BulkShardRequest dominates the overall latency.
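A small parser makes the avg/max gap easy to spot across many of these `||`-delimited bulkDetail lines. The field names come from the log sample above; the parser itself is a sketch:

```python
def parse_bulk_detail(line: str) -> dict:
    """Extract key=value fields from a bulkDetail slow-log line."""
    _, _, detail = line.partition("bulkDetail||")
    fields = {}
    for pair in detail.split("||"):
        key, _, value = pair.partition("=")
        fields[key] = None if value == "null" else int(value)
    return fields


line = ("[xxx][INFO ][o.e.m.r.RequestTracker ] [log6-clientnode-sf-5aaae-10] "
        "bulkDetail||requestId=null||size=10486923||items=7014||"
        "totalMills=2206||max=2203||avg=37")
f = parse_bulk_detail(line)
# max >> avg: one BulkShardRequest took 2203 ms while the mean was 37 ms.
print(f["max"], f["avg"])
```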
Thread‑state analysis of write threads showed intermittent “waiting” states despite low CPU usage, confirming the long‑tail phenomenon.
2.4 Detailed Factors
Lucene refresh: Periodic refreshes run on write threads, sometimes blocking bulk writes.
Translog ReadWriteLock: The write lock taken during translog roll-over can pause real-time writes, occasionally for more than 100 ms.
Write queue saturation: DataNodes use a fixed write thread pool and a bounded write queue (size ≈ 1000). When all threads are busy, requests pile up and wait times grow.
JVM GC pauses: GC can add tens to hundreds of milliseconds to bulk latency.
2.5 Conclusions
The long‑tail originates from BulkRequests being distributed across many shards; a single slow shard drags down the whole request.
3. Performance Optimizations
3.1 Write Model Optimization
The goal is to route an entire BulkRequest to as few shards as possible—ideally a single shard—thereby eliminating the long‑tail effect.
3.1.1 No‑routing scenario
When users do not rely on custom routing, the client assigns a random routing value to the whole BulkRequest, causing all documents to land on the same physical shard.
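A minimal sketch of the no-routing strategy: the client draws one random value per BulkRequest and stamps it on every action, so ES hashes them all to the same shard. Using `uuid4` as the random source, and the action-dict shape, are assumptions of this sketch:

```python
import uuid


def build_bulk_actions(docs, index):
    """Give every document in this bulk the same random routing value."""
    bulk_routing = uuid.uuid4().hex  # one random value per BulkRequest
    actions = []
    for doc in docs:
        actions.append({
            "_index": index,
            "_routing": bulk_routing,  # identical routing -> same shard
            "_source": doc,
        })
    return actions
```

Every call produces a fresh routing value, so successive BulkRequests still spread evenly across shards over time; only the documents *within* one batch are co-located.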
3.1.2 Routing scenario
A logical-shard concept (number_of_routing_size) is introduced on top of physical shards (number_of_shards). Each logical shard (slot) maps to a group of physical shards. The placement formula is:
slot = hash(random_value) % (number_of_shards / number_of_routing_size)
shard_num = hash(_routing) % number_of_routing_size + number_of_routing_size * slot
The client SDK now generates a random value per BulkRequest and can optionally specify a logical-shard parameter, confining all documents in the batch to the same logical shard, i.e. the same small group of physical shards, instead of scattering them across every shard.
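The placement formula can be written out directly. `crc32` stands in for the hash actually used (an assumption); the modular arithmetic mirrors the two-step slot/shard computation above:

```python
import zlib


def h(value: str) -> int:
    # Placeholder hash; the real plugin's hash function may differ.
    return zlib.crc32(value.encode())


def place(request_random: str, routing: str,
          number_of_shards: int, number_of_routing_size: int) -> int:
    # One slot (logical shard) per group of number_of_routing_size
    # consecutive physical shards.
    slot = h(request_random) % (number_of_shards // number_of_routing_size)
    # The user's _routing only picks a shard *within* that slot.
    return h(routing) % number_of_routing_size + number_of_routing_size * slot
```

With number_of_shards = 32 and number_of_routing_size = 4, every document of one BulkRequest lands somewhere in the same 4-shard slot regardless of its _routing, so a batch touches at most 4 shards instead of 32.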
This optimization slightly impacts query performance because a logical shard may span several physical shards, but for write‑heavy log workloads the trade‑off is acceptable.
3.2 Single‑Node Write Capability Boost
Back-ported community improvements: disabling refresh on _flush/_force_merge, async translog sync before reader close, and sync before trimUnreferencedReaders (≈ 18 % gain).
Optional translog disable (dynamic switch) – 10‑30 % gain at the risk of data loss, mitigated by external checkpointing.
Lucene write‑path tweak to avoid synchronous segment flush – 7‑10 % gain.
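The dynamic translog switch is an internal modification, but stock Elasticsearch exposes a milder version of the same durability trade-off through the standard `index.translog.durability: async` setting, which acknowledges writes before fsync and syncs on an interval instead. The index name and interval below are illustrative; the payload is a sketch of a dynamic settings update (e.g. `PUT /log-index/_settings`):

```python
import json

# Dynamic index-settings payload trading translog durability for
# write throughput: a crash can lose up to sync_interval of writes,
# so it suits log-style data with an upstream source of truth (e.g. Kafka).
settings = {
    "index": {
        "translog": {
            "durability": "async",   # ack writes before fsync
            "sync_interval": "30s",  # fsync at most every 30 s
        }
    }
}

print(json.dumps(settings, indent=2))
```

Like the custom switch, this is only safe when lost writes can be replayed from an external checkpoint.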
3.3 Application‑Level Optimizations
Removing two large redundant fields (args, response) from indexed documents cut write latency by ~20 % and storage by ~10 %.
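Such pruning can be done on the client before documents ever reach the Bulk API. A sketch: the field names `args` and `response` come from the text above, while the helper itself is hypothetical:

```python
REDUNDANT_FIELDS = ("args", "response")


def prune(doc: dict) -> dict:
    """Drop large fields that are never queried before indexing."""
    return {k: v for k, v in doc.items() if k not in REDUNDANT_FIELDS}


# Example: only the queried fields survive.
prune({"ts": 1, "args": "<huge blob>", "response": "<huge blob>", "msg": "ok"})
```

An alternative with the same effect is to keep the fields in _source but disable indexing for them in the mapping; stripping them client-side also saves network and storage.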
4. Production Results
After applying the write‑model optimization, write TPS increased from ~500 k/s to ~1.2 M/s for the same index. Many indices saw a 2× throughput boost.
Write rejection rates dropped dramatically because bulk requests now target a single shard, reducing queue pressure.
Latency charts show a clear reduction in tail latency post‑optimization.
5. Summary
The DiDi ES team achieved a breakthrough in write performance, halving the required SSD fleet, enabling hot-cold data separation, and saving tens of millions of dollars annually. The comprehensive analysis and multi-layer optimizations (write model, single-node tuning, and application-level changes) provide a roadmap for large-scale ElasticSearch deployments.
Didi Tech
Official Didi technology account