Mobile-Agent: An Autonomous Multi‑Modal Mobile Device Agent with Visual Perception
The Mobile-Agent paper presents a vision‑only, autonomous multi‑modal AI system that interprets user commands, locates UI elements on a smartphone screen, and executes complex tasks such as browsing, commenting, and content creation. It combines a defined operation space with self‑planning and self‑reflection mechanisms, achieving high success rates across diverse Chinese and English scenarios.
Mobile-Agent is a vision‑only autonomous multi‑modal agent designed to operate a smartphone by converting natural‑language instructions into a sequence of screen actions.
01 Operation Space
The agent works within a predefined set of eight primitive actions: open app, click text, click icon, type text, scroll up/down, go back, exit app, and stop.
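The operation space can be sketched as a small enum plus a parser that maps model output strings onto structured actions. The `Op` values and the `parse_action` helper below are illustrative assumptions, not the paper's actual implementation:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Op(Enum):
    """The eight primitive actions in Mobile-Agent's operation space
    (scroll up/down is one operation with a direction argument)."""
    OPEN_APP = "open app"      # argument: app name
    CLICK_TEXT = "click text"  # argument: on-screen text
    CLICK_ICON = "click icon"  # argument: icon description
    TYPE_TEXT = "type text"    # argument: text to type
    SCROLL = "scroll"          # argument: "up" or "down"
    BACK = "go back"
    EXIT = "exit app"
    STOP = "stop"

@dataclass
class Action:
    op: Op
    arg: Optional[str] = None

def parse_action(raw: str) -> Action:
    """Parse a model output like 'click text (Login)' into an Action."""
    raw = raw.strip().lower()
    for op in Op:
        if raw.startswith(op.value):
            rest = raw[len(op.value):].strip(" ()")
            return Action(op, rest or None)
    raise ValueError(f"unrecognized action: {raw!r}")
```

Keeping the action space this small is what makes the GPT‑4V output easy to validate: anything that fails to parse can be rejected and re-requested.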
02 Operation Localization
To pinpoint where to click, the system uses an OCR module for text detection (handling multiple occurrences by cropping and re‑ranking) and an icon detection module that crops all icon regions, computes CLIP similarity with the provided description, and selects the highest‑scoring region as the click coordinate.
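The icon branch of this localization step is essentially "rank crops by CLIP similarity, click the winner's centre." A minimal sketch, with the scoring function injected (in the real agent it would crop the region and run CLIP; here it is a plain callable so the ranking logic stands alone):

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def locate_icon(
    icon_boxes: List[Box],
    description: str,
    clip_score: Callable[[Box, str], float],
) -> Tuple[float, float]:
    """Pick the detected icon region whose crop best matches the
    description, and return its centre as the click coordinate.

    clip_score(box, description) -> similarity; in Mobile-Agent this
    would be CLIP image-text similarity over the cropped region.
    """
    best = max(icon_boxes, key=lambda b: clip_score(b, description))
    x1, y1, x2, y2 = best
    return ((x1 + x2) / 2, (y1 + y2) / 2)
```

The text branch works analogously: when OCR finds several occurrences of the target text, each crop is re-ranked against the instruction before a single click point is chosen.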
03 Self‑Planning
Given a user command, Mobile-Agent generates a system prompt describing the overall task, then iteratively captures the current screen, reviews the history of actions, and predicts the next primitive operation. The loop continues until the agent outputs a termination signal.
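The planning loop above can be written in a few lines. This is a hedged sketch, not the paper's code: `model` stands in for the multimodal LLM call (GPT‑4V in the paper), and `screenshot_fn`/`act` are hypothetical hooks into the device:

```python
from typing import Callable, List

def run_agent(
    instruction: str,
    screenshot_fn: Callable[[], str],
    model: Callable[[str, str, List[str]], str],
    act: Callable[[str], None],
    max_steps: int = 20,
) -> List[str]:
    """Iteratively: capture the screen, show the action history to the
    model, execute the predicted primitive, until 'stop' is emitted."""
    history: List[str] = []
    for _ in range(max_steps):
        screen = screenshot_fn()                      # current screenshot
        action = model(instruction, screen, history).strip().lower()
        if action == "stop":                          # termination signal
            break
        act(action)                                   # execute on device
        history.append(action)
    return history
```

Feeding the full action history back at every step is what lets the model keep track of multi-stage tasks without any explicit task graph.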
04 Self‑Reflection
When an action fails to change the screen or leads to an error page, the agent revises the operation or its parameters. After completing the planned steps, it evaluates whether the original instruction has been satisfied; if not, it re‑enters the planning loop.
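The per-action half of this mechanism amounts to "if the screen didn't change, ask the model for a revised operation and retry." A minimal sketch under the same hypothetical hooks as before (`revise` stands in for the model proposing a corrected operation):

```python
from typing import Callable

def execute_with_reflection(
    action: str,
    act: Callable[[str], None],
    screenshot_fn: Callable[[], str],
    revise: Callable[[str], str],
    max_retries: int = 2,
) -> str:
    """Execute an action; if the screen is unchanged afterwards (the
    action had no effect or hit an error page), revise and retry."""
    before = screenshot_fn()
    for _ in range(max_retries + 1):
        act(action)
        after = screenshot_fn()
        if after != before:          # screen changed: action took effect
            return action
        action = revise(action)      # model proposes a corrected operation
    raise RuntimeError("action had no effect after retries")

```

The task-level half of self-reflection wraps the whole loop the same way: after the last step, the model judges whether the instruction is satisfied and, if not, re-enters planning.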
05 Experimental Results
Evaluation on 33 tasks shows success rates of 82%–91% across three instruction types, with average correct‑action precision around 85% and overall task completion comparable to human performance (≈90%). The self‑reflection component notably improves robustness.
06 Chinese Capability
The paper also demonstrates the agent's ability to handle Chinese commands, successfully completing simple Chinese scenarios despite current limitations of GPT‑4V in Chinese text recognition.
The work highlights the feasibility of pure‑visual, autonomous mobile agents and suggests future directions for improving multimodal perception and error correction on handheld devices.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.