Security Hardening and Architecture of Didi's Elasticsearch Deployment
Didi hardened its Elasticsearch deployment, which spans 66 clusters and 2,236 nodes, by adding a custom security plugin that authenticates requests at the ES cluster itself rather than only at the Gateway. A one‑click security toggle and staged rolling upgrades allowed authentication to be enabled on every cluster, significantly reducing the risk of data leaks.
The article first references two previous posts: "How Didi's self‑developed ES strong‑consistency multi‑active architecture works" and "How to improve ES performance potential". It then notes that Elasticsearch (ES) is widely used for search and analytics, making it a frequent target for attacks, and lists several high‑profile data‑leak incidents involving ES misconfigurations.
It describes Didi's own security issues, such as unauthenticated access to ES HTTP port 9200 and Kibana port 5601, prompting the ES team to fix these vulnerabilities.
Problem Description
Didi's ES architecture consists of five parts: ES cluster, Gateway cluster (provides authentication, rate‑limiting, routing, metrics), ES Admin platform (metadata, index lifecycle, DCDR verification), User console, and the user side (ES client accessing the Gateway).
The core issue is that while the overall ES service has authentication via the Gateway, the ES cluster itself lacks built‑in security. Anyone who can reach the ES IP and port can perform unrestricted operations. Therefore, security authentication must be added directly to the ES cluster, and the admin, gateway, and client components must be adapted to carry credentials.
Solution 1: ES X‑Pack Plugin
The official X‑Pack plugin provides security features such as authentication, authorization, and encryption. Enabling X‑Pack creates accounts stored in an ES index, and the AuthenticationService validates HTTP credentials before allowing operations. Advantages include no code changes for Kibana and native support for full auth/audit. Drawbacks are inability to perform rolling upgrades, DCDR data sync issues, slow rollback, and risk of losing credential indices.
Key modification points for X‑Pack: add a dynamic switch for security, remove TCP‑level TLS/SSL to keep DCDR functional, and ensure admin/gateway/client send credentials in request headers.
Solution 2: Custom ES Security Plugin
The custom plugin implements an HTTP interceptor that extracts credentials from request headers and validates them against a local configuration file. Advantages are simple architecture, support for rolling upgrades, one‑click security toggle for rapid rollback, and no required changes to Kibana. It also avoids accidental credential loss by storing passwords in elasticsearch.yml and making them immutable.
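The interceptor idea can be illustrated with a minimal sketch. This is not Didi's plugin code (a real ES plugin would be written in Java); it assumes HTTP Basic credentials and uses a hypothetical SETTINGS dictionary to stand in for values read from elasticsearch.yml, including the on/off toggle. The names parse_basic_auth and intercept are invented for illustration.

```python
import base64
import hmac

# Hypothetical settings, standing in for values loaded from elasticsearch.yml.
SETTINGS = {
    "security_enabled": True,  # assumed name for the one-click toggle
    "username": "es_admin",
    "password": "s3cret",
}


def parse_basic_auth(headers):
    """Return (user, password) from a Basic Authorization header, or None."""
    value = headers.get("Authorization", "")
    scheme, _, encoded = value.partition(" ")
    if scheme.lower() != "basic" or not encoded:
        return None
    try:
        decoded = base64.b64decode(encoded, validate=True).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return None  # malformed base64 or non-UTF-8 payload
    user, sep, password = decoded.partition(":")
    return (user, password) if sep else None


def intercept(headers, settings=SETTINGS):
    """Return 200 to let the request through, or 401 to reject it."""
    if not settings["security_enabled"]:
        return 200  # toggle off: behave as if the plugin were absent
    creds = parse_basic_auth(headers)
    if creds is None:
        return 401
    # Constant-time comparison avoids leaking information through timing.
    user_ok = hmac.compare_digest(creds[0].encode(), settings["username"].encode())
    pass_ok = hmac.compare_digest(creds[1].encode(), settings["password"].encode())
    return 200 if (user_ok and pass_ok) else 401
```

Because the check reads only local settings, a node running the plugin with the toggle off accepts the same traffic as a node without it, which is what makes a node-by-node rolling upgrade and a fast rollback possible in this sketch.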
Solution Choice
After comparing development effort, operability, stability, and usability, the team chose the custom security plugin (Solution 2). The post then outlines the query flow with the chosen solution:
1. ES client sends a query to the Gateway.
2. Gateway authenticates and authorizes the request, retrieves the target ES cluster address and credentials from Admin, and caches them.
3. Gateway forwards the query to the appropriate ES cluster using the obtained credentials.
4. ES executes the query and returns results to the Gateway, which forwards them back to the client.
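The query flow above can be sketched as a small Gateway class. This is an illustration under assumptions, not Didi's Gateway: the callables admin_lookup, send, and is_authorized are hypothetical stand-ins for the Admin platform, the HTTP call to ES, and the Gateway's own access check, and the use of HTTP Basic credentials and a time-based cache expiry are assumptions as well.

```python
import base64
import time


class Gateway:
    """Toy gateway: authorize, look up the route via Admin, cache, forward."""

    def __init__(self, admin_lookup, send, is_authorized,
                 ttl_seconds=60.0, clock=time.monotonic):
        self._admin_lookup = admin_lookup    # index -> (address, user, password)
        self._send = send                    # (address, headers, query) -> response
        self._is_authorized = is_authorized  # (app_id, index) -> bool
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}                     # index -> (expiry_time, route)

    def _route_for(self, index):
        now = self._clock()
        hit = self._cache.get(index)
        if hit is not None and hit[0] > now:
            return hit[1]                    # cached address and credentials
        route = self._admin_lookup(index)
        self._cache[index] = (now + self._ttl, route)
        return route

    def handle(self, app_id, index, query):
        # Step 2: the Gateway checks the caller before touching ES.
        if not self._is_authorized(app_id, index):
            raise PermissionError(f"{app_id} may not access {index}")
        address, user, password = self._route_for(index)
        # Step 3: attach the cluster credentials to the forwarded request.
        token = base64.b64encode(f"{user}:{password}".encode()).decode()
        headers = {"Authorization": f"Basic {token}"}
        # Step 4: return the ES response to the client unchanged.
        return self._send(address, headers, query)
```

In this sketch the client never sees the cluster credentials; only the Gateway and Admin hold them, so rotating them would not require client changes.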
Release Assurance
The security upgrade touches ES clusters, Gateways, Admin platforms, ES clients, Fastindex (Hive→ES), DataX (MySQL→ES), and Flink→ES. Scale details: 66 ES clusters (2,236 nodes), 28 Gateway clusters (492 nodes), 2 Admin clusters (12 nodes), 8,500+ Flink tasks, 3 Fastindex clusters, 3 DataX clusters.
Rolling upgrades of ES clusters are the most complex part, because each node restart can turn the cluster health yellow, and the cluster must recover to green before the next node is upgraded. In the worst case a single node upgrade can exceed one hour, leading to a total upgrade duration of over three months.
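The wait-for-green rule can be sketched as a loop. This is a minimal illustration, not Didi's tooling: upgrade_node and cluster_health are hypothetical callables that a real script would back with its deployment system and a cluster health check, and the polling interval and timeout values are arbitrary assumptions.

```python
import time


def rolling_upgrade(nodes, upgrade_node, cluster_health,
                    poll_seconds=30.0, timeout_seconds=7200.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Upgrade nodes one at a time, waiting for green health between nodes."""
    upgraded = []
    for node in nodes:
        upgrade_node(node)  # restart the node on the new build
        deadline = clock() + timeout_seconds
        # Do not touch the next node until the cluster has fully recovered.
        while cluster_health() != "green":
            if clock() >= deadline:
                raise TimeoutError(f"cluster not green after upgrading {node}")
            sleep(poll_seconds)
        upgraded.append(node)
    return upgraded
```

Raising on timeout stops the rollout with the remaining nodes untouched, which leaves room for a manual decision to wait longer or roll back.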
To ensure a safe rollout, the team implemented:
One‑click security toggle for the ES engine.
Priority‑based upgrade order (log clusters → public clusters → isolated clusters) with rollback capability.
Scheduled scripts to scan ES and Gateway clusters for the security version.
Verification of Flink task versions.
Additional security metrics to detect business impact, so that security can be temporarily disabled if it affects business traffic.
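The version scan in the list above might look like the following sketch. It assumes each node reports a dotted numeric version string and that the security-capable build is identified by a minimum version; the function names, cluster names, and version numbers are hypothetical, and how versions are actually collected is left out.

```python
def parse_version(text):
    """Turn a dotted version such as '7.6.1' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def find_outdated(node_versions, required):
    """Return the sorted names of nodes running a build older than `required`."""
    minimum = parse_version(required)
    return sorted(
        node for node, version in node_versions.items()
        if parse_version(version) < minimum
    )


def ready_to_enable(clusters, required):
    """Return (all_ready, report) where report maps cluster -> outdated nodes."""
    report = {
        name: find_outdated(nodes, required)
        for name, nodes in clusters.items()
    }
    all_ready = all(not stale for stale in report.values())
    return all_ready, report


# Example inventory with made-up names and versions.
clusters = {
    "log-cluster-1": {"node-a": "7.6.2", "node-b": "7.6.2"},
    "public-cluster-1": {"node-c": "7.6.2", "node-d": "7.6.1"},
}
ready, report = ready_to_enable(clusters, "7.6.2")
```

Running a check like this on a schedule gives a single gate before flipping the toggle: security is enabled for a cluster only once its list of outdated nodes is empty.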
Conclusion
Over more than three months, Didi's ES R&D and SRE teams, together with Flink and DataX owners, completed the upgrade of all ES components and tasks with minimal impact on business. All clusters now have security authentication enabled, significantly reducing the risk of data leakage and loss.
Didi Tech
Official Didi technology account