
Introducing ModelScope-Agent: An Open‑Source Multi‑Modal Multi‑Agent System

This article presents ModelScope‑Agent, an open‑source multi‑modal multi‑agent framework built on the ModelScope community. It explains the underlying agent concepts, outlines the framework's architecture and key features, showcases several real‑world applications such as ModelScope GPT, Story‑Agent, and Facechain‑Agent, and closes with a detailed Q&A on future directions and challenges.

DataFunSummit

ModelScope‑Agent is an open‑source multi‑modal multi‑agent system developed by the ModelScope community, which aggregates models, datasets, and demos. The system was recently awarded the SAIL Star at the World AI Conference, a mark of its industry recognition.

The presentation is organized into three parts: a brief recap of the basic concepts of agents, an introduction to the ModelScope‑Agent open‑source framework, and a showcase of several interesting applications built on the framework.

Agents have existed since the reinforcement‑learning era, with early work from DeepMind and OpenAI (e.g., AlphaGo, StarCraft). Traditional RL agents are limited to specific environments, whereas recent large‑model agents leverage massive knowledge, strong instruction‑following, and tool‑calling capabilities (e.g., code generation, information retrieval), making them more versatile.

ModelScope‑Agent is a customizable, full‑featured framework that provides dataset collection, tool registration, storage handling, custom model training, and application development. It uses open‑source LLMs as core components, supporting Alibaba’s Tongyi Qianwen and other popular text or multi‑modal models, and offers comprehensive APIs for developers.

The framework operates by planning with an LLM, invoking the appropriate API, receiving the result, and letting the LLM generate the final response. An example on ModelScope GPT demonstrates a workflow that retrieves a TTS tool, generates a short story with an LLM, and then reads it aloud.
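The plan → tool call → response loop described above can be sketched in a few lines. Note that the names here (`TOOLS`, `plan`, `run_agent`) are illustrative stand‑ins, not the actual ModelScope‑Agent API; in the real framework the planner is an LLM and the registry holds wrappers around services such as TTS.

```python
# Minimal sketch of the plan -> tool call -> respond loop, with stubs standing
# in for the LLM planner and registered tools. All names are hypothetical.

TOOLS = {
    # tool name -> callable; a real registry would wrap APIs (TTS, search, ...)
    "story_gen": lambda topic: f"Once upon a time, a tale about {topic}.",
    "tts": lambda text: f"<audio for: {text}>",
}

def plan(user_request):
    """Stub planner: a real system asks the LLM which tools to chain."""
    return [("story_gen", user_request), ("tts", None)]

def run_agent(user_request):
    result = user_request
    for tool_name, arg in plan(user_request):
        # None means "feed the previous tool's output into this tool"
        result = TOOLS[tool_name](arg if arg is not None else result)
    return result  # final response assembled from tool outputs

print(run_agent("a brave robot"))
```

This mirrors the ModelScope GPT example in the text: the planner selects the story‑generation and TTS tools, the framework invokes them in order, and the last output becomes the response.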

Several applications are highlighted:

ModelScope GPT – supports single‑round and multi‑round multi‑API calls for tasks such as text generation, voice synthesis, and video creation.

Story‑Agent – an interactive story‑book creator for autistic children, using prompt configuration, image generation, and TTS.

Facechain‑Agent – generates personalized portrait or ID photos with style‑specific LoRA fine‑tuning.

Multi‑role chatrooms – a multi‑agent chat platform where each role is defined by a profile, including a “surrounded by beauties” scenario and a Xiaomi SU7 agent.

α‑UMi collaborative framework – decomposes complex tasks into Planner, Caller, and Summarizer modules to improve efficiency and effectiveness over single‑agent setups.

Story‑book video generation – combines LLM‑driven outline creation, page‑wise content generation, StoryDiffusion for consistent images, and AudioLDM for sound effects.
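The α‑UMi decomposition mentioned above, splitting a task across Planner, Caller, and Summarizer modules, can be sketched as a simple control loop. The function names and the stub "search" tool are assumptions for illustration, not the framework's actual code; in α‑UMi each module is a separately tuned model.

```python
# Illustrative sketch of an α-UMi-style pipeline: the Planner decides the next
# step, the Caller executes a tool, and the Summarizer composes the final
# answer. All names and the stub tool are hypothetical.

def planner(task, history):
    """Decide the next action: call a tool, or hand off to the summarizer."""
    if not history:
        return ("call", "search", task)
    return ("summarize", None, None)

def caller(tool, arg):
    """Execute the chosen tool; a stub lookup stands in for a real API call."""
    return {"search": f"results for '{arg}'"}[tool]

def summarizer(task, history):
    """Compose the final response from the collected observations."""
    return f"Answer to '{task}' based on: " + "; ".join(history)

def alpha_umi(task):
    history = []
    while True:
        action, tool, arg = planner(task, history)
        if action == "summarize":
            return summarizer(task, history)
        history.append(caller(tool, arg))

print(alpha_umi("compare two phones"))
```

Separating the three roles lets each module stay small and specialized, which is the efficiency argument the talk makes against a single monolithic agent.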

The Q&A section addresses concerns about tool‑call reliability, controllability, the future of agents, decision‑making modules, and the integration of knowledge graphs and retrieval‑augmented generation (RAG) to enhance model accuracy.

Future goals include building the best open‑source text, speech, and video synthesis agent, leveraging Tongyi Qianwen for content creation, and encouraging developers to explore novel scenarios on the ModelScope platform.

The session concludes with thanks to the audience.

Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
