
Mooncake: Open-Source KVCache-Centric Large Model Inference Architecture Co-Developed by Alibaba Cloud and Tsinghua University

In June 2024, Alibaba Cloud and Tsinghua University's MADSys Lab announced the open-source Mooncake architecture, a KVCache-centric large-model inference framework that boosts throughput, lowers cost, and standardizes resource-pooling techniques for high-performance AI inference across industry and academia.


In June 2024, Alibaba Cloud and Tsinghua University's MADSys Lab announced Mooncake, a KVCache‑centric large‑model inference architecture that significantly improves inference throughput while reducing costs, attracting broad industry attention.

As part of the Alibaba Innovative Research (AIR) program, the collaboration explores industrial applications of resource-pooling technology for large models: it standardizes a cache-pooling layer and implements an efficient, distributed, resource-decoupled architecture that optimizes long-context inference.
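The core idea behind a KVCache-centric pool is that requests sharing a token prefix can reuse the KV blocks already computed for that prefix, so only the new suffix needs fresh prefill. A minimal sketch of this prefix-matching lookup is below; the class name, block size, and dict-backed storage are illustrative assumptions, not Mooncake's actual implementation:

```python
import hashlib

BLOCK = 4  # tokens per KV block (illustrative; real systems use larger blocks)

class KVCachePool:
    """Toy shared pool keyed by prefix hash (hypothetical, for illustration)."""

    def __init__(self):
        self.blocks = {}  # prefix hash -> cached KV payload for that prefix

    @staticmethod
    def _key(tokens):
        # Hash the whole prefix so equal prefixes map to the same entry.
        return hashlib.sha256(str(tokens).encode("utf-8")).hexdigest()

    def put(self, tokens, kv):
        # Store KV for each block-aligned prefix so other requests (or
        # other inference instances) can reuse any leading portion.
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            self.blocks.setdefault(self._key(tokens[:end]), kv[:end])

    def longest_prefix(self, tokens):
        # Find the longest block-aligned cached prefix; the caller only
        # needs to run prefill for the remaining tail.
        matched = 0
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            if self._key(tokens[:end]) in self.blocks:
                matched = end
            else:
                break
        return matched

pool = KVCachePool()
prompt_a = list(range(12))
pool.put(prompt_a, ["kv%d" % t for t in prompt_a])
prompt_b = list(range(8)) + [99, 98, 97, 96]  # shares an 8-token prefix
hit = pool.longest_prefix(prompt_b)
print(hit)  # 8 tokens served from the pool; only 4 need fresh prefill
```

The longer the shared prefix (system prompts, few-shot examples, long documents), the larger the fraction of prefill compute this lookup avoids, which is why the approach pays off most for long-context workloads.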

Alibaba Cloud contributed code to key components such as the Transfer Engine, the peer-to-peer store (P2P Store), and the high-performance in-memory store, and integrated Mooncake with the widely used vLLM inference framework, achieving substantial performance gains and providing reference implementations for other frameworks. The Transfer Engine leverages Alibaba's self-developed eRDMA network, with CXL support planned, enabling rapid, scalable deployment in the cloud.
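The Transfer Engine's role, moving KV blocks between instances over RDMA-class transports such as eRDMA, can be sketched with a hypothetical interface. The names `register_segment` and `read_remote` below are illustrative assumptions, not Mooncake's actual API:

```python
class TransferEngine:
    """Toy stand-in for an RDMA-style transfer engine: instances register
    memory segments, and peers read them with one-sided operations that do
    not involve the remote CPU. Purely illustrative; a real engine uses
    eRDMA/verbs and pinned memory, not a Python dict."""

    _segments = {}  # (node_id, name) -> bytearray, simulating registered memory

    def __init__(self, node_id):
        self.node_id = node_id

    def register_segment(self, name, buf):
        # In a real engine this would pin the buffer and expose an RDMA key.
        TransferEngine._segments[(self.node_id, name)] = buf

    def read_remote(self, node_id, name, offset, length):
        # One-sided read: fetch bytes from a peer's registered segment.
        buf = TransferEngine._segments[(node_id, name)]
        return bytes(buf[offset:offset + length])

# A prefill instance publishes KV blocks; a decode instance pulls them.
prefill = TransferEngine("prefill-0")
prefill.register_segment("kv", bytearray(b"KVBLOCK0KVBLOCK1"))
decode = TransferEngine("decode-0")
print(decode.read_remote("prefill-0", "kv", 8, 8))  # b'KVBLOCK1'
```

One-sided reads are what make a decoupled prefill/decode architecture cheap: the decode side can fetch exactly the KV blocks it needs without scheduling work on the prefill node.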

Professor Zhang Mingxing of the MADSys Lab emphasized that Mooncake makes full use of the CPU, memory, and SSD resources in AI infrastructure and enables cache sharing across inference instances, reducing resource waste. Open-sourcing the project, he added, aims to foster collaboration between academia and industry and accelerate the development of large-model inference systems.
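Using CPU memory and SSD alongside GPU memory amounts to a tiered cache: hot KV blocks stay in the fast tier, and colder ones spill to a slower, larger tier instead of being discarded. A minimal two-tier sketch under that assumption (Mooncake's actual tiering and eviction policies are more sophisticated):

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier KV cache: a small LRU-ordered DRAM tier that spills
    evicted entries to a larger SSD tier. Illustrative only."""

    def __init__(self, dram_capacity):
        self.dram = OrderedDict()   # hot tier, LRU-ordered
        self.ssd = {}               # cold tier: larger, slower
        self.cap = dram_capacity

    def put(self, key, kv):
        self.dram[key] = kv
        self.dram.move_to_end(key)
        while len(self.dram) > self.cap:
            # Spill the least recently used entry instead of discarding it,
            # so a later hit avoids recomputing the prefill.
            cold_key, cold_kv = self.dram.popitem(last=False)
            self.ssd[cold_key] = cold_kv

    def get(self, key):
        if key in self.dram:
            self.dram.move_to_end(key)
            return self.dram[key]
        if key in self.ssd:
            kv = self.ssd.pop(key)
            self.put(key, kv)       # promote back to the hot tier on a hit
            return kv
        return None                 # miss: caller must recompute the prefill

cache = TieredKVCache(dram_capacity=2)
cache.put("a", "kv_a"); cache.put("b", "kv_b"); cache.put("c", "kv_c")
print(cache.get("a"))  # 'kv_a': spilled to SSD, then promoted back to DRAM
```

Even a slow-tier hit is typically far cheaper than recomputing prefill for a long prompt, which is the economic argument for pooling otherwise idle CPU memory and SSDs.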

Alibaba Cloud will deepen its participation in Mooncake, partnering with more enterprises, institutions, and universities to explore more efficient and advanced model-inference system architectures. The open-source repository is available at https://github.com/kvcache-ai/mooncake.

Tags: large-model inference, open source, Alibaba Cloud, AI infrastructure, Tsinghua University, KVCache
Written by

Alibaba Cloud Infrastructure

For uninterrupted computing services
