DiDi ElasticSearch Write Performance Optimization: Doubling Throughput and Reducing Write Rejections
By consolidating each bulk request onto a single shard, tuning node write paths, and pruning unnecessary fields, DiDi’s ElasticSearch team eliminated long-tail latency, more than doubled write throughput to over 1 million ops per second, slashed write rejections, and saved millions in infrastructure costs.
1. Background
DiDi’s ElasticSearch platform serves all internal ElasticSearch workloads, including core search, RDS replicas, log retrieval, security analytics, and metric analysis. The platform now exceeds 3,000 nodes, stores 5 PB of data, and holds over a trillion documents. Peak write throughput reaches 20 million writes per second, with nearly 1 billion queries per day. Operating at this scale requires constant attention to stability, usability, performance, and cost.
After data hot nodes were scaled down to save storage costs, the cluster began to hit write bottlenecks: frequent write rejections and growing Kafka-to-ES ingestion lag. The following sections describe how the bottleneck was identified and eliminated, ultimately more than doubling write throughput.
2. Write Bottleneck Analysis
2.1 Discovery
Initial performance tests showed ES node CPU usage above 80 % under load, yet production nodes ran below 50 % CPU, with overall cluster CPU under 40 % and no I/O or network pressure. Scaling an index from 10 to 16 nodes increased write speed from ~200 k/s to ~300 k/s, so adding nodes helped even though resources sat idle, indicating the bottleneck lay on the ES server side rather than with the writing clients.
2.2 ES Write Model Overview
Clients batch data via the Bulk API. A BulkRequest is split by each document’s routing into multiple BulkShardRequests, one per target shard, each sent to the DataNode hosting that shard. The BulkRequest returns only after all of its BulkShardRequests complete.
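The fan-out described above can be sketched as follows. This is an illustrative model, not ES internals: `shard_of` stands in for ES’s routing hash (ES actually uses Murmur3), and `bulk_latency` models only the aggregation behavior.

```python
import zlib


def shard_of(routing: str, number_of_shards: int) -> int:
    # Stand-in for ES's routing hash (ES really uses Murmur3).
    return zlib.crc32(routing.encode()) % number_of_shards


def split_bulk(docs, number_of_shards):
    """Partition a BulkRequest into per-shard BulkShardRequests."""
    shard_requests = {}
    for doc in docs:
        shard = shard_of(doc["_routing"], number_of_shards)
        shard_requests.setdefault(shard, []).append(doc)
    return shard_requests


def bulk_latency(per_shard_latency_ms):
    # The BulkRequest completes only when every BulkShardRequest has,
    # so its latency is the MAX across shards: one slow shard
    # dominates the whole request (the long-tail effect).
    return max(per_shard_latency_ms.values())
```

Because the overall latency is a maximum rather than an average, a single shard stuck behind a refresh or a GC pause stalls the entire batch.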
2.3 Root Cause Investigation
BulkRequest slow‑log entries were examined. Example log lines:
[xxx][INFO ][o.e.m.r.RequestTracker ] [log6-clientnode-sf-5aaae-10] bulkDetail||requestId=null||size=10486923||items=7014||totalMills=2206||max=2203||avg=37
[xxx][INFO ][o.e.m.r.RequestTracker ] [log6-clientnode-sf-5aaae-10] bulkDetail||requestId=null||size=210506||items=137||totalMills=2655||max=2655||avg=218
The large gap between avg and max suggested a long-tail effect: a single slow BulkShardRequest dominates the overall latency.
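A small parser makes the avg/max gap easy to spot across many of these `||`-delimited bulkDetail lines. The field names come from the log sample above; the parser itself is a sketch:

```python
def parse_bulk_detail(line: str) -> dict:
    """Extract key=value fields from a bulkDetail slow-log line."""
    _, _, detail = line.partition("bulkDetail||")
    fields = {}
    for pair in detail.split("||"):
        key, _, value = pair.partition("=")
        fields[key] = None if value == "null" else int(value)
    return fields


line = ("[xxx][INFO ][o.e.m.r.RequestTracker ] [log6-clientnode-sf-5aaae-10] "
        "bulkDetail||requestId=null||size=10486923||items=7014||"
        "totalMills=2206||max=2203||avg=37")
f = parse_bulk_detail(line)
# max >> avg: one BulkShardRequest took 2203 ms while the mean was 37 ms.
print(f["max"], f["avg"])
```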
Thread‑state analysis of write threads showed intermittent “waiting” states despite low CPU usage, confirming the long‑tail phenomenon.
2.4 Detailed Factors
Lucene refresh: Periodic refreshes run on write threads, sometimes blocking bulk writes.
Translog ReadWriteLock: The write lock taken during translog roll-over can pause real-time writes, occasionally for more than 100 ms.
Write queue saturation: DataNodes use a fixed write thread pool and a bounded write queue (size ≈ 1000). When all threads are busy, requests pile up and wait times grow.
JVM GC pauses: GC can add tens to hundreds of milliseconds to bulk latency.
2.5 Conclusions
The long‑tail originates from BulkRequests being distributed across many shards; a single slow shard drags down the whole request.
3. Performance Optimizations
3.1 Write Model Optimization
The goal is to route an entire BulkRequest to as few shards as possible—ideally a single shard—thereby eliminating the long‑tail effect.
3.1.1 No‑routing scenario
When users do not rely on custom routing, the client assigns a random routing value to the whole BulkRequest, causing all documents to land on the same physical shard.
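A minimal sketch of the no-routing strategy: the client draws one random value per BulkRequest and stamps it on every action, so ES hashes them all to the same shard. Using `uuid4` as the random source, and the action-dict shape, are assumptions of this sketch:

```python
import uuid


def build_bulk_actions(docs, index):
    """Give every document in this bulk the same random routing value."""
    bulk_routing = uuid.uuid4().hex  # one random value per BulkRequest
    actions = []
    for doc in docs:
        actions.append({
            "_index": index,
            "_routing": bulk_routing,  # identical routing -> same shard
            "_source": doc,
        })
    return actions
```

Every call produces a fresh routing value, so successive BulkRequests still spread evenly across shards over time; only the documents *within* one batch are co-located.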
3.1.2 Routing scenario
A logical-shard concept (number_of_routing_size) is introduced on top of physical shards (number_of_shards). Each logical shard (slot) maps to a group of physical shards. The placement formula is:
slot = hash(random_value) % (number_of_shards / number_of_routing_size)
shard_num = hash(_routing) % number_of_routing_size + number_of_routing_size * slot
The client SDK now generates a random value per BulkRequest and can optionally specify a logical-shard parameter, confining all documents in the batch to the same logical shard, i.e. the same small group of physical shards, instead of scattering them across every shard.
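The placement formula can be written out directly. `crc32` stands in for the hash actually used (an assumption); the modular arithmetic mirrors the two-step slot/shard computation above:

```python
import zlib


def h(value: str) -> int:
    # Placeholder hash; the real plugin's hash function may differ.
    return zlib.crc32(value.encode())


def place(request_random: str, routing: str,
          number_of_shards: int, number_of_routing_size: int) -> int:
    # One slot (logical shard) per group of number_of_routing_size
    # consecutive physical shards.
    slot = h(request_random) % (number_of_shards // number_of_routing_size)
    # The user's _routing only picks a shard *within* that slot.
    return h(routing) % number_of_routing_size + number_of_routing_size * slot
```

With number_of_shards = 32 and number_of_routing_size = 4, every document of one BulkRequest lands somewhere in the same 4-shard slot regardless of its _routing, so a batch touches at most 4 shards instead of 32.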
This optimization slightly impacts query performance because a logical shard may span several physical shards, but for write‑heavy log workloads the trade‑off is acceptable.
3.2 Single‑Node Write Capability Boost
Back-ported community improvements: disabling refresh on _flush/_force_merge, async translog sync before reader close, and sync before trimUnreferencedReaders (≈ 18 % gain).
Optional translog disable (dynamic switch) – 10‑30 % gain at the risk of data loss, mitigated by external checkpointing.
Lucene write‑path tweak to avoid synchronous segment flush – 7‑10 % gain.
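The dynamic translog switch is an internal modification, but stock Elasticsearch exposes a milder version of the same durability trade-off through the standard `index.translog.durability: async` setting, which acknowledges writes before fsync and syncs on an interval instead. The index name and interval below are illustrative; the payload is a sketch of a dynamic settings update (e.g. `PUT /log-index/_settings`):

```python
import json

# Dynamic index-settings payload trading translog durability for
# write throughput: a crash can lose up to sync_interval of writes,
# so it suits log-style data with an upstream source of truth (e.g. Kafka).
settings = {
    "index": {
        "translog": {
            "durability": "async",   # ack writes before fsync
            "sync_interval": "30s",  # fsync at most every 30 s
        }
    }
}

print(json.dumps(settings, indent=2))
```

Like the custom switch, this is only safe when lost writes can be replayed from an external checkpoint.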
3.3 Application‑Level Optimizations
Removing two large redundant fields (args, response) from indexed documents cut write latency by ~20 % and storage by ~10 %.
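Such pruning can be done on the client before documents ever reach the Bulk API. A sketch: the field names `args` and `response` come from the text above, while the helper itself is hypothetical:

```python
REDUNDANT_FIELDS = ("args", "response")


def prune(doc: dict) -> dict:
    """Drop large fields that are never queried before indexing."""
    return {k: v for k, v in doc.items() if k not in REDUNDANT_FIELDS}


# Example: only the queried fields survive.
prune({"ts": 1, "args": "<huge blob>", "response": "<huge blob>", "msg": "ok"})
```

An alternative with the same effect is to keep the fields in _source but disable indexing for them in the mapping; stripping them client-side also saves network and storage.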
4. Production Results
After applying the write‑model optimization, write TPS increased from ~500 k/s to ~1.2 M/s for the same index. Many indices saw a 2× throughput boost.
Write rejection rates dropped dramatically because bulk requests now target a single shard, reducing queue pressure.
Latency charts show a clear reduction in tail latency post‑optimization.
5. Summary
The DiDi ES team achieved a breakthrough in write performance, halving the required SSD fleet, enabling hot-cold data separation, and saving tens of millions of dollars annually. The comprehensive analysis and multi-layer optimizations (write model, single-node tuning, and application-level changes) provide a roadmap for large-scale ElasticSearch deployments.
Didi Tech
Official Didi technology account