Building a Scalable Elasticsearch-as-a-Service Platform on Mesos, Marathon, and Docker at Qunar
This article describes how Qunar's operations team designed and implemented a cloud‑native Elasticsearch‑as‑a‑Service platform using Mesos, Marathon, and Docker, covering requirements analysis, technology selection, resource quota management, cluster isolation, service discovery, data reliability, monitoring, automated deployment, and future improvements.
Qunar's platform division faced rapidly growing demand for Elasticsearch (ES) that traditional VM‑based deployments could not meet efficiently. The team aimed to create a cloud‑native, self‑service ES platform that supports rapid cluster provisioning, automatic scaling, standardized operations, and a user‑friendly interface.
After evaluating Elastic Cloud, Amazon Elasticsearch Service, and Elasticsearch on Mesos, the team selected a Mesos + Marathon + Docker stack: Marathon provides mature scheduling for long‑running services, and it can use Mesos dynamic reservations and persistent volumes, which are essential for data durability.
Key implementation challenges included quota allocation, cluster isolation, service discovery, data reliability, monitoring, and automated deployment. Quota is enforced using Mesos roles, with each role assigned a dedicated Sub‑Marathon, ensuring resource isolation. Cluster isolation is achieved by nesting Marathon instances, each managing one or more ES clusters.
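The role-per-Sub‑Marathon arrangement can be sketched as a Marathon app definition that launches a nested Marathon bound to one Mesos role. This is a minimal illustration, not Qunar's actual configuration: the `./bin/start` launcher path, the app ID layout, the ZooKeeper paths, and the resource figures are all assumptions. The `--master`, `--zk`, `--framework_name`, and `--mesos_role` flags are standard Marathon command-line options.

```python
def sub_marathon_app(role, zk_hosts, cpus=1.0, mem=2048):
    """Build an app definition for a nested Marathon tied to one Mesos role.

    The launcher path, ID scheme, and default resources are hypothetical.
    """
    cmd = " ".join([
        "./bin/start",                              # assumed launcher script
        f"--master zk://{zk_hosts}/mesos",          # Mesos master discovery
        f"--zk zk://{zk_hosts}/marathon-{role}",    # separate state per Sub-Marathon
        f"--framework_name marathon-{role}",
        f"--mesos_role {role}",                     # only this role's resources are offered
    ])
    return {
        "id": f"/sub-marathon/{role}",
        "cmd": cmd,
        "cpus": cpus,
        "mem": mem,
        "instances": 1,
    }
```

Because each nested Marathon registers under its own role, it only receives offers for resources reserved to that role, which is what makes the quota boundary hold.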
Service discovery relies on a unicast ES configuration combined with Bamboo and HAProxy. Bamboo registers Marathon callbacks to dynamically update HAProxy, allowing ES nodes to discover each other even when their placement on Mesos slaves is unknown.
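One way to picture the unicast setup: every ES node is given the same list of stable proxy endpoints, and HAProxy forwards each endpoint to wherever the matching node currently runs. The sketch below renders such a list for `elasticsearch.yml`. The one-port-per-node layout, the host name, and the helper names are assumptions for illustration; `discovery.zen.ping.unicast.hosts` is the unicast setting used by Zen discovery in Elasticsearch versions before 7.

```python
def unicast_hosts(proxy_host, base_port, node_count):
    """Return one stable proxy endpoint per ES node (assumed port layout)."""
    return [f"{proxy_host}:{base_port + i}" for i in range(node_count)]


def render_discovery_config(proxy_host, base_port, node_count):
    """Render the unicast host list as an elasticsearch.yml fragment."""
    lines = ["discovery.zen.ping.unicast.hosts:"]
    lines += [f"  - {host}" for host in unicast_hosts(proxy_host, base_port, node_count)]
    return "\n".join(lines) + "\n"
```

Since the list names only proxy endpoints, it stays valid when Marathon moves a node to another slave; only the HAProxy backend changes.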
Data reliability is ensured by leveraging ES replication, enforcing at least one replica per index, limiting shard allocation per node, and using Mesos persistent volumes to bind containers to specific hosts. Snapshotting to HDFS and hot‑standby clusters provide additional protection.
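The replica and shard-allocation rules can be expressed as a small normalization step applied to index settings before they reach the cluster. This is a sketch under assumptions: the function name and the default cap of two shards per node are illustrative, not values from the article. The two setting keys, `index.number_of_replicas` and `index.routing.allocation.total_shards_per_node`, are standard Elasticsearch index settings.

```python
def enforce_reliability(settings, max_shards_per_node=2):
    """Return a copy of flat index settings with the reliability rules applied.

    The cap of 2 shards per node is an assumed default, not Qunar's value.
    """
    result = dict(settings)
    # Require at least one replica so losing a single node never loses data.
    replicas = int(result.get("index.number_of_replicas", 0))
    result["index.number_of_replicas"] = max(replicas, 1)
    # Cap shards per node unless the caller already chose a limit.
    result.setdefault(
        "index.routing.allocation.total_shards_per_node", max_shards_per_node
    )
    return result
```

For example, a request that asks for zero replicas comes back with one replica and the default per-node shard cap, while a request that already specifies two replicas is left at two.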
Monitoring is performed with es2graphite (collecting ES metrics) and a custom pyadvisor tool (collecting container metrics), both feeding into Qunar's Watcher monitoring system. Alerts trigger on non‑green ES status and excessive GC times.
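The two alert conditions can be sketched as a single check over collected metrics. This is not Watcher's rule syntax, which the article does not show; the function name, the per-node GC input format, and the 5000 ms threshold are assumptions. The `status` and `cluster_name` fields match what the Elasticsearch cluster health API returns.

```python
def evaluate_alerts(health, gc_millis_last_minute, gc_threshold_ms=5000):
    """Return alert messages for a non-green cluster or excessive GC time.

    health: dict shaped like the ES cluster health response.
    gc_millis_last_minute: assumed mapping of node name to GC time in ms.
    """
    alerts = []
    status = health.get("status")
    if status != "green":
        name = health.get("cluster_name", "unknown")
        alerts.append(f"cluster {name} status is {status}")
    # Sort node names so the alert order is deterministic.
    for node, gc_ms in sorted(gc_millis_last_minute.items()):
        if gc_ms > gc_threshold_ms:
            alerts.append(f"node {node} spent {gc_ms} ms in GC")
    return alerts
```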
Automation is driven by Jenkins pipelines that generate configuration scripts, commit them to GitLab, and invoke Marathon APIs to create the necessary applications. The ESAAS Console, modeled after Elastic Cloud, offers cluster overview, configuration management, plugin installation, and operation logs.
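The final pipeline step, creating an application through Marathon, might look like the sketch below. It builds a `POST /v2/apps` request, which is Marathon's REST endpoint for creating apps, but does not send it. The app ID scheme, image name, host networking choice, and resource values are hypothetical and not taken from Qunar's pipeline.

```python
import json
import urllib.request


def es_node_app(cluster, image, instances, cpus, mem_mb):
    """Build a Marathon app definition for the ES nodes of one cluster (assumed layout)."""
    return {
        "id": f"/{cluster}/es-node",
        "instances": instances,
        "cpus": cpus,
        "mem": mem_mb,
        "container": {
            "type": "DOCKER",
            "docker": {"image": image, "network": "HOST"},
        },
    }


def build_create_app_request(marathon_url, app):
    """Prepare, without sending, the HTTP request that creates the app."""
    return urllib.request.Request(
        url=marathon_url.rstrip("/") + "/v2/apps",
        data=json.dumps(app).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A pipeline job would pass the returned request to `urllib.request.urlopen` and treat any non-2xx response as a failed deployment.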
Since deployment, the ESAAS platform has operated stably for over six months, managing 44 ES clusters across 77 servers, storing roughly 120 TB of data, and serving 30 business lines. Ongoing challenges include billing, log collection integration, cross‑cluster data migration, and resource priority management.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.