Implementing ZSTD Compression in Didi's Elasticsearch for High‑Performance Log Ingestion
By integrating ZSTD compression into its Elasticsearch 7.6 deployment, Didi's team cut peak CPU usage by about 15 %, reduced index storage by roughly 30 %, boosted write throughput by up to 25 %, and retired more than 20 servers, demonstrating a faster, more storage‑efficient approach to petabyte‑scale log ingestion.
The article introduces Didi's effort to improve Elasticsearch (ES) write performance for massive log ingestion (5‑10 PB per day) by adopting the ZSTD compression algorithm.
ES provides data retrieval through indexes, which are made up of shards; each shard contains segment files that store the inverted index and document data. The main segment file types are row‑store files (.fdt/.fdx) holding the original documents, column‑store doc‑values files (.dvd), and inverted‑index files (.tim/.doc).
Because the log clusters are write‑heavy, row‑store files dominate index storage (more than 30 % of index size). Didi's ES 7.6.0 (Lucene 8.4.0) offers two stored‑fields compression modes: BEST_SPEED (LZ4) and BEST_COMPRESSION (DEFLATE, the algorithm behind ZIP). BEST_COMPRESSION reduces storage by 20‑40 % compared with LZ4 but raises CPU usage, which can exceed 30 % of a node's capacity.
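For context, stock Elasticsearch exposes the choice between these two Lucene modes through the `index.codec` index setting, supplied at index creation with `PUT my-logs-index` (the index name here is a placeholder). Didi's work extends this mechanism with a ZSTD option; the article does not name the exact setting value.

```json
{
  "settings": {
    "index.codec": "best_compression"
  }
}
```

Omitting `index.codec` (or setting it to `default`) keeps the LZ4‑backed BEST_SPEED codec.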
ZSTD (Zstandard) combines Finite State Entropy (FSE) coding, SIMD optimizations, and dictionary compression, offering a strong balance of speed and compression ratio. Benchmarks on a 1 GB log file show ZSTD compressing about 4.5× faster and decompressing about 1.5× faster than ZIP, with a comparable compression ratio.
Implementation steps include:
Extending ES settings and engine to support a ZSTD compression format per shard.
Adding ZSTD support to Lucene via the zstd‑jni library and extending CompressionMode with custom ZStandardCompressor and ZStandardDecompressor classes.
Parameter tuning: adjusting Chunk Size (set to 60 KB) and selecting an appropriate ZSTD compression level (level 3 for a good speed‑ratio trade‑off, level 9 for higher storage savings).
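The steps above can be sketched end‑to‑end. The class below is an illustrative stand‑in, not Didi's code: it buffers documents into a 60 KB chunk and compresses the whole chunk in one call, the same shape as Lucene's stored‑fields writer. The stdlib `java.util.zip.Deflater` stands in for ZSTD so the sketch stays self‑contained; in the real integration the compress call would be zstd‑jni's `Zstd.compress(chunk, level)`, with level 3 or 9 as discussed (note that ZSTD levels span 1‑22, while Deflater's span 0‑9).

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Sketch of Lucene-style chunked stored-fields compression.
// CHUNK_SIZE mirrors the 60 KB value from the tuning step above;
// Deflater is a stdlib stand-in for zstd-jni's Zstd.compress(chunk, level).
public class ChunkedCompressor {
    static final int CHUNK_SIZE = 60 * 1024; // 60 KB chunk, per the article's tuning

    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final int level; // Deflater level here; with ZSTD, 3 (speed) or 9 (ratio)

    public ChunkedCompressor(int level) {
        this.level = level;
    }

    /** Buffer a document; once the chunk is full, compress and return it, else null. */
    public byte[] addDocument(byte[] doc) {
        buffer.write(doc, 0, doc.length);
        return buffer.size() >= CHUNK_SIZE ? flushChunk() : null;
    }

    /** Compress the buffered chunk in a single call, as Lucene does per chunk. */
    public byte[] flushChunk() {
        byte[] chunk = buffer.toByteArray();
        buffer.reset();
        Deflater deflater = new Deflater(level);
        deflater.setInput(chunk);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] tmp = new byte[8192];
        while (!deflater.finished()) {
            out.write(tmp, 0, deflater.deflate(tmp));
        }
        deflater.end();
        return out.toByteArray();
    }
}
```

Compressing one larger chunk instead of many small documents is what makes the chunk-size knob matter: a bigger chunk gives the compressor more context and a better ratio, at the cost of more work per flush.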
After three months of testing, the ZSTD‑enabled ES build was rolled out to 16 clusters covering more than 60,000 indexes. Results:
Average CPU usage during peak reduced by ~15 %.
Cluster A: CPU usage down 18 %, write‑reject rate down 50 %.
Large index M: CPU usage down 15 %, write throughput up 25 %.
Overall index storage reduced by ~30 % after switching from LZ4 to ZSTD.
Cluster resource reduction enabled the removal of more than 20 machines.
In summary, ZSTD compression provides higher performance and lower cost for Elasticsearch log services, and future work will include read/write separation and major ES version upgrades.
Didi Tech