How AI Agents Like UFO, Mobile-Agent, and UI-TARS Are Shaping 2025 Smartphones
The article examines the underlying GUI‑Agent technologies behind the 2025 “Doubao” smartphone, comparing Microsoft’s UFO series, Alibaba’s Mobile‑Agent v2/v3, and ByteDance’s UI‑TARS, detailing their model foundations, input modalities, action spaces, planning mechanisms, learning strategies, open‑source status, and multi‑agent frameworks.
Key Dimensions Comparison
Core Positioning: Alibaba Mobile-Agent – mobile‑focused multi‑agent system; ByteDance UI‑TARS – cross‑platform native agent model; Microsoft UFO – heterogeneous cross‑platform framework.
Input Modality: Mobile‑Agent uses screenshots + OCR + icon detection; UI‑TARS relies on pure visual screenshots; UFO combines UI Automation (UIA), visual cues, and text.
Model Base: Mobile‑Agent builds on GUI‑Owl, a self‑trained multimodal model based on Qwen2.5‑VL; UI‑TARS uses a self‑trained Vision‑Language Model ranging from 2 B to 72 B parameters; UFO leverages GPT‑4 with vision capabilities.
Action Space: Mobile‑Agent issues Android ADB commands; UI‑TARS unifies GUI atomic actions, keyboard/mouse, terminal commands, and APIs; UFO supports UIA, Win32, COM, and generic GUI actions.
Planning Mechanism: Mobile‑Agent employs multi‑agent collaboration with ReAct‑style reflection; UI‑TARS follows a System‑2 reasoning chain (thought → action); UFO adopts a dual‑brain HostAgent + AppAgent architecture.
Continual Learning: Mobile‑Agent relies on manual rules and trajectory replay; UI‑TARS uses a multi‑turn reinforcement‑learning data flywheel; UFO incorporates Retrieval‑Augmented Generation (documents + Bing + experience).
Open‑Source Status: Mobile‑Agent’s model and demo are open; UI‑TARS models are fully released on HuggingFace; UFO is MIT‑licensed and completely open.
Key Strength: Mobile‑Agent excels at multi‑agent division of labor and self‑reflection; UI‑TARS offers an end‑to‑end VLM across platforms; UFO provides system‑level APIs and RAG knowledge.
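To make the Action Space row more concrete, here is a minimal sketch of how an agent's abstract actions could be translated into Android ADB commands. The action names (tap, swipe, text, back, home) and the adb_command helper are hypothetical and are not Mobile‑Agent's actual schema; only the underlying `adb shell input` subcommands are standard Android tooling. The function only builds the argument list and does not run anything.

```python
def adb_command(action: str, **args) -> list[str]:
    """Build an argv list for `adb shell input ...` (hypothetical action names)."""
    base = ["adb", "shell", "input"]
    if action == "tap":
        return base + ["tap", str(args["x"]), str(args["y"])]
    if action == "swipe":
        # The trailing value is the swipe duration in milliseconds.
        return base + [
            "swipe",
            str(args["x1"]), str(args["y1"]),
            str(args["x2"]), str(args["y2"]),
            str(args.get("duration_ms", 300)),
        ]
    if action == "text":
        # `input text` expects spaces to be written as %s.
        return base + ["text", args["text"].replace(" ", "%s")]
    if action == "back":
        return base + ["keyevent", "KEYCODE_BACK"]
    if action == "home":
        return base + ["keyevent", "KEYCODE_HOME"]
    raise ValueError(f"unsupported action: {action}")


print(" ".join(adb_command("tap", x=540, y=1200)))
```

To actually execute a command, the returned list could be passed to subprocess.run with a connected device; that step is left out so the sketch stays side‑effect free.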
Alibaba Mobile‑Agent
GUI‑Owl – Unified Multimodal Foundation Model
Positioning: First native end‑to‑end multimodal GUI‑agent model that unifies perception, localization, reasoning, planning, and execution.
Base Model: Built on Qwen2.5‑VL and further trained on large‑scale GUI interaction data.
Capabilities: Supports cross‑platform GUI automation (Android, Windows, macOS, Web) and both single‑agent autonomy and multi‑agent collaboration.
Multi‑Agent Framework
Manager: Strategic planner that decomposes user commands into sub‑goals and dynamically adjusts plans.
Worker: Executor that selects and performs actionable sub‑goals based on the current state.
Reflector: Self‑evaluation module that judges execution success and generates feedback.
Notetaker: Memory module that records key information (e.g., verification codes, order numbers) for reuse across steps.
RAG Module: Real‑time retrieval of external knowledge such as weather or tutorials.
State‑Driven Loop: Execute → Feedback → Update Plan → Continue.
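The roles and the state‑driven loop above can be sketched as follows. This is a toy illustration under stated assumptions, not the Mobile‑Agent implementation: every function name is hypothetical, the Manager splits commands on the word "then" instead of calling a model, and the environment is a dictionary of scripted observations so the loop is deterministic. The RAG module is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    plan: list[str]                                        # remaining sub-goals
    notes: dict[str, str] = field(default_factory=dict)    # Notetaker memory
    log: list[str] = field(default_factory=list)           # Reflector feedback


def manager_plan(command: str) -> list[str]:
    # Stand-in for the Manager: decompose the command into sub-goals.
    return [part.strip() for part in command.split(" then ")]


def worker_execute(subgoal: str, env: dict[str, list[str]]) -> str:
    # Stand-in for the Worker: pop the next scripted observation, default "ok".
    queue = env.get(subgoal)
    return queue.pop(0) if queue else "ok"


def reflector_judge(observation: str) -> tuple[bool, str]:
    # Stand-in for the Reflector: judge success and produce feedback.
    success = observation.startswith("ok")
    return success, "success" if success else f"retry ({observation})"


def notetaker_record(observation: str, notes: dict[str, str]) -> None:
    # Stand-in for the Notetaker: keep key=value facts for later steps.
    for token in observation.split():
        if "=" in token:
            key, value = token.split("=", 1)
            notes[key] = value


def run(command: str, env: dict[str, list[str]], max_steps: int = 10) -> State:
    state = State(plan=manager_plan(command))
    for _ in range(max_steps):
        if not state.plan:
            break
        subgoal = state.plan[0]
        observation = worker_execute(subgoal, env)          # Execute
        success, feedback = reflector_judge(observation)    # Feedback
        notetaker_record(observation, state.notes)
        state.log.append(f"{subgoal}: {feedback}")
        if success:                                         # Update Plan
            state.plan.pop(0)
        # On failure the sub-goal stays at the head of the plan and is retried.
    return state


env = {"request code": ["error timeout", "ok code=1234"]}
result = run("open app then request code then submit", env)
print(result.log, result.notes)
```

In this run the second sub‑goal fails once, the Reflector's feedback keeps it in the plan, and the retry succeeds while the Notetaker stores the verification code for later steps.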
https://arxiv.org/abs/2508.15144
Mobile-Agent‑v3: Fundamental Agents for GUI Automation
https://github.com/X-PLUG/MobileAgent
ByteDance UI‑TARS
UI‑TARS compresses perception, reasoning, memory, and action into a single Vision‑Language Model trained on 50 B tokens. Three model sizes are released on HuggingFace: 2 B (on‑device), 7 B (edge), and 72 B (cloud).
System‑2 Reasoning Chain: Generates an explicit “thought” draft before producing an action, enabling dynamic decomposition, reflection, and error correction.
Data Flywheel: Uses sandboxed task generation and reinforcement learning to continuously create new training data; model updates occur bi‑weekly.
Mixed Action Flow: A single task can invoke GUI clicks, terminal commands, and APIs; demo shows opening Notion, crawling data, running Python analysis, and writing results back to the page.
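The thought → action pattern and the mixed action flow can be illustrated with a small parser and router. The text format below ("Thought: ..." followed by "Action: name(key=value, ...)"), the action names, and the channel table are simplified assumptions made for this sketch; they are not the exact output format or action set of the released UI‑TARS models.

```python
import re

# Hypothetical step format: "Thought: <free text>\nAction: <name>(<key=value, ...>)"
STEP_RE = re.compile(
    r"Thought:\s*(?P<thought>.*?)\s*Action:\s*(?P<name>\w+)\((?P<args>.*)\)\s*$",
    re.DOTALL,
)

# Hypothetical mapping from action name to execution channel.
CHANNELS = {"click": "gui", "type": "gui", "shell": "terminal", "http_get": "api"}


def parse_step(text: str) -> tuple[str, str, dict[str, str]]:
    """Split one model output into (thought, action name, arguments)."""
    match = STEP_RE.match(text.strip())
    if match is None:
        raise ValueError("output does not follow the Thought/Action format")
    args: dict[str, str] = {}
    # Naive split: argument values must not contain commas in this sketch.
    for part in filter(None, (p.strip() for p in match["args"].split(","))):
        key, _, value = part.partition("=")
        args[key.strip()] = value.strip().strip("'\"")
    return match["thought"], match["name"], args


def route(name: str) -> str:
    """Return the channel (gui, terminal, or api) that should run the action."""
    return CHANNELS[name]


output = "Thought: The search box is at the top.\nAction: click(x=120, y=340)"
thought, name, args = parse_step(output)
print(thought, name, args, route(name))
```

Keeping the thought separate from the action makes it possible to log or inspect the reasoning before anything is executed, and the routing table is what lets one task mix GUI, terminal, and API steps.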
https://arxiv.org/pdf/2509.02544
https://github.com/bytedance/ui-tars
UI‑TARS‑2 Technical Report: Advancing GUI Agent with Multi‑Turn Reinforcement Learning
Microsoft UFO
The UFO series (UFO → UFO2 → UFO3) evolves from basic UI automation to a multi‑device orchestration framework called Galaxy, which coordinates agents across heterogeneous platforms.
Declarative DAG Decomposition: Requests are broken into a dynamic Directed Acyclic Graph of TaskStar nodes with dependencies for automatic scheduling and runtime rewriting.
Result‑Driven Graph Evolution: The DAG adapts continuously based on execution feedback.
Heterogeneous, Asynchronous, Secure Orchestration: Capability‑based device matching, asynchronous execution, safety locks, and formal verification ensure reliable cross‑platform operation.
Unified Agent Interaction Protocol (AIP): WebSocket‑based secure coordination layer with fault tolerance and auto‑reconnect.
Template‑Based MCP Toolkit: Lightweight SDK for rapid agent development, integrating the Model Context Protocol (MCP) to extend tool functionality.
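The DAG decomposition and capability‑based device matching described above can be sketched with Python's standard graphlib module. This is an illustration under assumptions, not UFO3 code: the task names, device names, capability labels, and the schedule function are all hypothetical, and runtime rewriting of the graph, asynchronous execution, and safety checks are left out.

```python
from graphlib import TopologicalSorter


def schedule(
    tasks: dict[str, tuple[str, set[str]]],
    devices: dict[str, set[str]],
) -> list[list[tuple[str, str]]]:
    """Group tasks into waves that respect dependencies and assign devices.

    tasks:   task name -> (required capability, names of tasks it depends on)
    devices: device name -> set of capabilities the device offers
    """
    sorter = TopologicalSorter({name: deps for name, (_, deps) in tasks.items()})
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        wave = []
        # Every task returned here has all dependencies completed, so the
        # tasks in one wave could run concurrently on their devices.
        for name in sorted(sorter.get_ready()):
            capability = tasks[name][0]
            device = next(
                (d for d, caps in sorted(devices.items()) if capability in caps),
                None,
            )
            if device is None:
                raise LookupError(f"no device offers {capability!r} for {name!r}")
            wave.append((name, device))
            sorter.done(name)  # pretend the task finished successfully
        waves.append(wave)
    return waves


tasks = {
    "fetch_logs": ("linux", set()),
    "export_sheet": ("windows", set()),
    "merge_report": ("windows", {"fetch_logs", "export_sheet"}),
    "send_to_phone": ("android", {"merge_report"}),
}
devices = {"laptop": {"windows"}, "server": {"linux"}, "phone": {"android"}}

for wave in schedule(tasks, devices):
    print(wave)
```

A result‑driven variant would inspect each task's outcome before calling done and rebuild the remaining graph when a task fails; that part is deliberately omitted here.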
https://arxiv.org/pdf/2511.11332
UFO3: Weaving the Digital Agent Galaxy
https://github.com/microsoft/UFO/
