
Midscene.js: AI‑Powered UI Automation Framework with Instant Actions and Deep Think

Midscene.js, an AI‑driven UI automation framework from the Web Infra team, introduces Instant Actions for stable interactions and Deep Think for precise element locating, providing developers with direct UI operation APIs and enhanced reliability across vision‑capable language models.

ByteDance Web Infra

Apr 7, 2025


Midscene.js is an AI-driven UI automation framework released by the Web Infra team. Version v0.14.0 adds two major features: Instant Actions and Deep Think.

Instant Actions – Making Interaction More Stable

The existing .ai interface automatically plans steps using an LLM to interact with web pages, e.g.:

await agent.ai('Type "Headphones" in the search box, then press Enter');

Behind the scenes, Midscene calls an LLM to generate and execute the plan, which is shown in the report. However, complex prompts can lead to incorrect steps or inaccurate element coordinates, causing frustration for test engineers.

To address this, Midscene introduces direct UI operation APIs such as aiTap(), aiHover(), aiInput(), aiKeyboardPress(), and aiScroll(). These functions execute the specified actions while the AI model only handles low‑level tasks like element locating, resulting in faster and more reliable execution.

For example, the previous search operation can be rewritten as:

await agent.aiInput('Headphones', 'search box');
await agent.aiKeyboardPress('Enter');

The report now shows no planning step, indicating a streamlined process.

Although scripts using these APIs may appear more verbose, they save time when the desired actions are clear.
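When the same sequence of instant actions recurs across tests, it can be wrapped in a small helper. The following is a minimal sketch, not Midscene's own code: the InstantActionAgent interface is a hypothetical subset that mirrors only the two calls used in this article, and createRecordingAgent is a stand-in that records calls instead of driving a browser, so the example runs without a page or a model.

```typescript
// Hypothetical subset of the agent used in this article; these are not
// Midscene's published type definitions.
interface InstantActionAgent {
  aiInput(text: string, locatePrompt: string): Promise<void>;
  aiKeyboardPress(key: string): Promise<void>;
}

// The search flow as explicit steps: the script decides what to do,
// and the model is only asked to locate the search box.
async function searchFor(agent: InstantActionAgent, keyword: string): Promise<void> {
  await agent.aiInput(keyword, 'search box');
  await agent.aiKeyboardPress('Enter');
}

// Stand-in agent (an assumption for illustration) that records each call
// in order instead of touching a real page.
function createRecordingAgent(): { agent: InstantActionAgent; calls: string[][] } {
  const calls: string[][] = [];
  const agent: InstantActionAgent = {
    async aiInput(text, locatePrompt) {
      calls.push(['aiInput', text, locatePrompt]);
    },
    async aiKeyboardPress(key) {
      calls.push(['aiKeyboardPress', key]);
    },
  };
  return { agent, calls };
}
```

In a real test, the agent passed to searchFor would be the Midscene agent bound to the page under test; the recording agent is only there to show the order of operations.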

Deep Think – Improving Element Positioning Accuracy

When interacting with complex UI controls, LLMs may struggle to locate target elements. Midscene adds a deepThink option to instant‑action functions to enhance visual locating.

deepThink is enabled by passing it in the options argument of an instant-action call, e.g.:

await agent.aiTap('target', { deepThink: true });

deepThink first identifies a region containing the target element, then refines the search within that region, yielding more accurate coordinates.
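The two-stage idea can be sketched roughly as follows. This illustrates coarse-to-fine locating in general and is not Midscene's internal implementation: the types, the function locateCoarseToFine, and the fake finders are all hypothetical, and the finders stand in for what would be model calls on a screenshot.

```typescript
// All names here are hypothetical and exist only for illustration.
interface Region { left: number; top: number; width: number; height: number; }
interface Point2D { x: number; y: number; }

// Stage-one finder: returns a region (in page coordinates) that contains the target.
type RegionFinder = (prompt: string, area: Region) => Region;
// Stage-two finder: returns a point relative to the top-left corner of `area`.
type PointFinder = (prompt: string, area: Region) => Point2D;

function locateCoarseToFine(
  prompt: string,
  fullPage: Region,
  findRegion: RegionFinder,
  findPoint: PointFinder,
): Point2D {
  // Stage 1: narrow the whole page down to a smaller region.
  const region = findRegion(prompt, fullPage);
  // Stage 2: search again, looking only inside that region.
  const local = findPoint(prompt, region);
  // Map the region-relative point back to page coordinates.
  return { x: region.left + local.x, y: region.top + local.y };
}

// Fake finders with fixed answers, standing in for model calls.
const fakeFindRegion: RegionFinder = () => ({ left: 800, top: 100, width: 200, height: 150 });
const fakeFindPoint: PointFinder = () => ({ x: 40, y: 30 });

const tapPoint = locateCoarseToFine(
  'settings icon',
  { left: 0, top: 0, width: 1280, height: 720 },
  fakeFindRegion,
  fakeFindPoint,
);
console.log(tapPoint); // { x: 840, y: 130 }
```

The second search works on a much smaller area, so small or visually similar controls occupy a larger share of what the model sees, which is one plausible reason the refined coordinates are more accurate.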

In a Coze.com workflow editor page with many custom icons, using deepThink allows the LLM to reliably find the intended element, as shown in the report screenshots.

Note that deepThink only works with models that support visual grounding, such as qwen2.5-vl; it will not work with general-purpose LLMs like gpt-4o.
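A script that runs against more than one model may want to decide at runtime whether to request deepThink. The sketch below is one possible approach under stated assumptions: the prefix list and the helper locateOptionsFor are hypothetical, and matching on a model-name prefix is a convention chosen for this example, not something Midscene does itself.

```typescript
// Hypothetical allow-list of model-name prefixes for which deepThink is
// requested. Only qwen2.5-vl is named in this article; extend as needed.
const DEEP_THINK_MODEL_PREFIXES: string[] = ['qwen2.5-vl'];

// Returns the options object to pass to an instant-action call,
// enabling deepThink only for models on the allow-list.
function locateOptionsFor(modelName: string): { deepThink: boolean } {
  const normalized = modelName.trim().toLowerCase();
  const supported = DEEP_THINK_MODEL_PREFIXES.some((prefix) =>
    normalized.startsWith(prefix),
  );
  return { deepThink: supported };
}
```

The result can then be passed as the second argument, for example agent.aiTap('target', locateOptionsFor(modelName)), where modelName is whatever the script already knows about its configured model.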

References

Prompting Tips: https://midscenejs.com/zh/prompting-tips.html

Midscene.js Official Site: https://midscenejs.com/zh

GitHub Repository: https://github.com/web-infra-dev/midscene


