Mesos Architecture and Its Deployment at Qunar: Framework Unification and Operational Strategies
This article explains the Mesos distributed system kernel, its master‑slave architecture, fine‑grained resource scheduling, and how Qunar leverages Mesos and Marathon for log processing, Spark, Alluxio, and multi‑tenant services while addressing framework unification, HA, service discovery, and operational challenges.
Mesos is described as a distributed system kernel that follows the same design principles as the Linux kernel but operates at a higher abstraction level, providing resource management and task scheduling for applications such as Hadoop, Spark, Kafka, and Elasticsearch across all servers in a data center.
Originally launched in 2009 as a Berkeley research project and later adopted by Twitter and Airbnb, Mesos consists of a Master that registers slaves and framework schedulers and allocates resources, and Slaves that execute tasks on behalf of frameworks.
The resource allocation workflow is illustrated with a step‑by‑step example: a slave reports free resources, the master offers them to a framework, the framework’s scheduler requests specific CPU and memory slices, and the master finally dispatches tasks to the slave’s executor.
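The two-level offer cycle described above can be sketched in a few lines of Python. This is an illustrative simulation, not the actual Mesos API (the real system exchanges protobuf messages over libmesos or the HTTP scheduler API); the class and field names here are assumptions chosen to mirror the step-by-step example.

```python
# Minimal sketch of Mesos' two-level scheduling: the master offers a
# slave's free resources to a framework, the framework's scheduler
# claims a slice, and the master dispatches the resulting task.
from dataclasses import dataclass

@dataclass
class Offer:
    slave_id: str
    cpus: float
    mem_mb: int

class Scheduler:
    """Framework scheduler: decides which slice of each offer to use."""
    def resource_offers(self, offers):
        tasks = []
        for offer in offers:
            # Accept part of the offer: 2 CPUs and 1 GB for one task,
            # leaving the remainder for the master to offer elsewhere.
            if offer.cpus >= 2 and offer.mem_mb >= 1024:
                tasks.append({"slave_id": offer.slave_id,
                              "cpus": 2, "mem_mb": 1024})
        return tasks

class Master:
    """Master: turns slave reports into offers and launches accepted tasks."""
    def __init__(self):
        self.launched = []

    def run_offer_cycle(self, scheduler, free_resources):
        offers = [Offer(sid, cpus, mem) for sid, cpus, mem in free_resources]
        for task in scheduler.resource_offers(offers):
            # Dispatch the task to the slave's executor (simulated here).
            self.launched.append(task)
        return self.launched

master = Master()
launched = master.run_offer_cycle(Scheduler(), [("slave-1", 4.0, 4096)])
```

The key design point this models is that the framework, not the master, decides which resources to consume; the master only mediates offers, which is what makes fine-grained sharing between frameworks possible.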
Mesos enables fine‑grained resource allocation, in contrast to the coarse‑grained static partitioning of traditional clusters, and Marathon is highlighted as a Mesos framework that runs long‑lived services, exposes a REST API, and integrates with HAProxy for service discovery and load balancing.
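A long-lived service under Marathon is declared as a JSON app definition posted to Marathon's REST API (`POST /v2/apps`). The sketch below builds such a payload; the service name is hypothetical, and the `HAPROXY_GROUP` label assumes the marathon-lb integration, which is the common way Marathon apps are exposed through HAProxy.

```python
# Build a Marathon app definition for a long-lived service.
# The id, command, and label values are illustrative assumptions.
import json

app = {
    "id": "/logging/collector",            # hypothetical service name
    "cmd": "python3 collector.py",
    "cpus": 0.5,
    "mem": 512,
    "instances": 3,                        # Marathon keeps 3 copies running
    "labels": {"HAPROXY_GROUP": "internal"},  # picked up by marathon-lb
}

payload = json.dumps(app)
# In a real deployment this payload would be POSTed to the Marathon
# master, e.g.:
#   curl -X POST http://marathon:8080/v2/apps \
#        -H 'Content-Type: application/json' -d @app.json
```

If an instance dies, Marathon relaunches it elsewhere and marathon-lb regenerates the HAProxy config, which is how service discovery stays current without manual intervention.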
Qunar’s experience is detailed: Mesos has been used since version 0.22 for data‑analysis workloads, with Marathon versions 0.8 through 0.11 deployed in production; version 1.1 is recommended because earlier releases contain a persistent‑volume bug. Spark, Alluxio, etcd, and HDFS all run on Mesos, with specific considerations for persistent storage and SSD‑aware scheduling.
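Both concerns from this section, persistent storage and SSD-aware placement, show up in a Marathon app definition for a stateful service. The sketch below is an assumption-laden example: the `disk_type` attribute is hypothetical (slaves would need to be started with a matching `--attributes` flag), while the `persistent` volume stanza is the Marathon feature whose pre-1.1 bug is mentioned above.

```python
# Sketch of pinning a stateful service (e.g. etcd) to SSD-backed slaves.
# "disk_type" is an assumed slave attribute; the CLUSTER constraint keeps
# all instances on slaves that share the same attribute value.
stateful_app = {
    "id": "/storage/etcd",                  # hypothetical app id
    "cpus": 1.0,
    "mem": 2048,
    "instances": 3,
    "constraints": [["disk_type", "CLUSTER", "ssd"]],
    "container": {
        "type": "MESOS",
        "volumes": [{
            "containerPath": "data",
            "mode": "RW",
            "persistent": {"size": 10240},  # local persistent volume, MB
        }],
    },
    # Keep the task's volume and wait for the same slave on task loss,
    # rather than rescheduling and losing local state.
    "residency": {"taskLostBehavior": "WAIT_FOREVER"},
}
```

The design trade-off is that persistent volumes tie a task to one slave's disk, trading scheduling flexibility for data locality, which is why the persistent-volume code path matters so much for HDFS- and Alluxio-style workloads.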
The article discusses two main operational questions: whether all frameworks can be unified under Marathon and whether framework nesting is worthwhile, exploring the trade‑offs of custom frameworks versus Marathon’s built‑in monitoring, HA, and API support.
Challenges such as abnormal task recovery, message control, service discovery, lack of monitoring in custom frameworks, and multi‑tenant resource allocation are examined, leading to the proposal of a “Root Framework” that centralizes framework management, improves HA, simplifies service discovery, and reduces operational overhead.
Finally, the piece outlines additional optimizations like fail‑over timeouts, dynamic resource reservations, and hierarchical framework deployment to support large‑scale, multi‑tenant environments.
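Two of these optimizations can be made concrete as the JSON-style structures used by the Mesos HTTP scheduler API. The field names below follow the `FrameworkInfo` and `Offer.Operation` protobufs, but the values, the framework name, and the role/principal strings are illustrative assumptions, not Qunar's actual configuration.

```python
# 1. Fail-over timeout: how long (in seconds) the master keeps a
#    framework's tasks running after its scheduler disconnects before
#    killing them, giving the scheduler time to re-register.
framework_info = {
    "name": "root-framework",            # hypothetical "Root Framework"
    "role": "multi-tenant",
    "failover_timeout": 7 * 24 * 3600,   # one week
}

# 2. Dynamic reservation: a RESERVE operation applied to an offer,
#    pinning resources to a role so a tenant's capacity is not offered
#    to other frameworks, even across framework restarts.
reserve_op = {
    "type": "RESERVE",
    "reserve": {
        "resources": [{
            "name": "cpus",
            "type": "SCALAR",
            "scalar": {"value": 8.0},
            "role": "multi-tenant",
            "reservation": {"principal": "qunar-ops"},  # assumed principal
        }],
    },
}
```

Together these give a multi-tenant setup both availability (tasks survive scheduler restarts) and capacity guarantees (reserved resources per tenant role).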
High Availability Architecture
Official account for High Availability Architecture.