
Design and Implementation of Project Eru: A Docker‑Based Cloud Native Scheduling Platform at Mango TV

The article recounts the evolution from Douban's App Engine to Mango TV's Nebulium Engine and finally Project Eru, describing how Docker, Redis Cluster, MacVLAN networking, and custom resource scheduling were combined to build a scalable, cloud‑native platform for heterogeneous workloads.

High Availability Architecture

This article was originally shared in the QCon High‑Availability Architecture group and compiled by volunteers. The speaker, Peng Zhefu, is the core technology lead of Mango TV's platform team and has extensive experience with Docker and Redis Cluster.

Douban Period

At Douban, the author developed Douban App Engine (DAE), a Python‑focused PaaS similar to Google App Engine that used Virtualenv for runtime isolation. Dependency conflicts arose when applications required multiple versions of libraries such as werkzeug, which pushed the team away from patching CPython and toward cleanly separating each application's dependencies.

In 2013 Docker was released, sparking interest in building a Docker‑based PaaS. After traveling across Asia, the author joined Mango TV and began experimenting with these ideas.

Mango TV's Nebulium Engine

Initially, a Docker‑isolated PaaS called Nebulium Engine (NBE) was built, mirroring the DAE architecture but moving control outside containers. However, lacking a unified language and facing resource management challenges, NBE did not meet operational expectations.

In late 2014 the team revisited Borg and Omega concepts, leading to the second‑generation NBE, now called Project Eru, which focuses on service orchestration and scheduling rather than a traditional PaaS.

Project Eru

Eru can run both offline and online services, allocates CPU resources with fine granularity (0.1, 0.01, 0.001 cores), uses Redis as a message bus, and leverages Docker image layers combined with Git for automated build and test pipelines. It also isolates runtimes to avoid contamination.

Eru introduces logical Pods (similar to Kubernetes) for business grouping, while actual isolation is handled by the network layer. Dockerfiles are generated from a standardized App.yaml, and a common entrypoint enables code reuse across roles.
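The article does not reproduce an App.yaml, but the scheme it describes — a per‑application spec from which Dockerfiles are generated, with multiple roles sharing one codebase behind a common entrypoint — might look roughly like this. Every field name below is a guess for illustration, not Eru's actual schema:

```yaml
# Hypothetical App.yaml sketch; field names are assumptions.
appname: sample-web
base: python:2.7          # base image the generated Dockerfile starts FROM
build:
  - pip install -r requirements.txt
entrypoints:              # multiple roles reuse the same built image
  web:
    cmd: python app.py
    ports: ["5000/tcp"]
  worker:
    cmd: python worker.py
```

Under this kind of scheme, a single build produces one image, and the role (web, worker, etc.) is selected at container start time by the common entrypoint.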

The architecture moves from a monolithic closed‑loop design to a set of stateless Core services that can be scaled out independently, improving reliability.

Details

Eru consists of Core and Agent components that communicate via Redis Cluster. Agents report container status and perform host‑level operations; Cores control Docker daemons and remain stateless.
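As a rough illustration of the Agent's reporting path, the sketch below builds a status payload that an Agent might publish onto the Redis bus. The field names and channel convention are assumptions for illustration; Eru's actual wire format is not documented in this article.

```python
import json
import time


def make_status_report(hostname, container_id, state, cpu_pct, mem_bytes):
    """Build a JSON payload an Agent might publish for one container.

    All field names here are illustrative, not Eru's real schema.
    """
    return json.dumps({
        "host": hostname,
        "container": container_id,
        "state": state,
        "cpu_percent": cpu_pct,
        "mem_bytes": mem_bytes,
        "ts": int(time.time()),
    })


# An Agent would then push this onto the Redis bus, e.g. with redis-py:
#   r = redis.StrictRedis(host="redis-bus")
#   r.publish("eru:status:%s" % hostname, make_status_report(...))
```

Because the Cores are stateless, any Core instance can consume these reports, which is what allows them to be scaled out independently.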

Storage uses DeviceMapper for most hosts and Overlay for a subset, with MooseFS providing shared volumes. For networking, the team chose MacVLAN over tunnel solutions for its simplicity, performance, and Layer 2 isolation, accepting its tighter coupling to the physical links.

Resource scheduling primarily uses CPU as the metric, with fine‑grained allocation (e.g., 0.1 cores). Whole cores are dedicated to a container, while the fractional remainder is placed on a designated “fragment” core whose capacity is shared among containers via CFS shares. Memory is allocated proportionally based on host capacity.
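The split between dedicated cores and a CFS‑throttled fragment can be sketched as a small planning function. The 1024‑share base mirrors Docker's default `--cpu-shares` weight, but the function and field names are assumptions, not Eru's actual scheduler:

```python
def plan_cpu(request, free_cores, shares_per_core=1024):
    """Split a fractional CPU request (in cores, e.g. 1.3) into
    dedicated cores plus a fragment share.

    Illustrative sketch: whole cores are pinned via a cpuset, and the
    fractional part is throttled with CFS shares on one extra core.
    """
    whole = int(request)
    frac = round(request - whole, 3)
    needed = whole + (1 if frac else 0)
    if needed > len(free_cores):
        raise ValueError("not enough free cores on this host")
    plan = {"cpuset": free_cores[:whole]}
    if frac:
        # the fractional remainder runs on a shared "fragment" core
        plan["fragment_core"] = free_cores[whole]
        plan["cpu_shares"] = int(frac * shares_per_core)
    return plan


# e.g. a 1.3-core request on a host with cores 0-3 free:
#   plan_cpu(1.3, [0, 1, 2, 3])
#   -> {"cpuset": [0], "fragment_core": 1, "cpu_shares": 307}
```

The resulting plan maps naturally onto Docker's `--cpuset-cpus` and `--cpu-shares` options when the container is created.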

Scaling decisions are delegated to business teams, who monitor metrics stored in InfluxDB and invoke Core APIs for auto‑scaling. Public Server instances provide a stateless layer for testing and image building.
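Since each business team picks its own policy against its InfluxDB metrics before calling the Core API, a minimal scaling decision might look like the sketch below. The thresholds, step size, and function name are all illustrative assumptions:

```python
def scale_decision(cpu_samples, current, high=0.8, low=0.2,
                   min_n=1, max_n=16):
    """Return a new replica count from recent CPU utilisation
    samples (each 0.0-1.0).

    Illustrative policy: step up one instance when average load is
    high, step down one when it is low, otherwise hold steady.
    """
    avg = sum(cpu_samples) / len(cpu_samples)
    if avg > high and current < max_n:
        return current + 1
    if avg < low and current > min_n:
        return current - 1
    return current


# A business team's loop would read samples from InfluxDB, call
# scale_decision(), and invoke the Core API when the count changes.
```

Keeping this loop on the business side, rather than inside Eru, is what the article means by delegating scaling decisions to the teams themselves.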

Service Discovery and Security

After deployment, services are discovered via an internal DNS built on Dnscache and Skydns, with network‑level firewalls enforcing isolation. Business teams interact with services through IPs or DNS names without needing to know underlying infrastructure.

Examples include a Redis Cluster of 400 instances across 10 clusters, where scaling is driven by monitoring data and automated via Eru APIs, achieving near‑instant provisioning.

Conclusion

Project Eru demonstrates a modular, message‑driven architecture built on Docker, Redis, and custom scheduling, emphasizing flexibility and low operational overhead while allowing business teams to control scaling and deployment policies.

Future work includes Dockerizing YARN executors, further sysctl tuning, and extending the platform to support PaaS capabilities for offline, online, and service workloads by year‑end.

Tags: Cloud Native, Docker, Redis, Infrastructure, Resource Scheduling, Container Orchestration, MacVLAN