
Midscene.js: AI‑Powered UI Automation Framework with Instant Actions and Deep Think

Midscene.js, an AI‑driven UI automation framework from the Web Infra team, introduces Instant Actions for stable interactions and Deep Think for precise element locating, providing developers with direct UI operation APIs and enhanced reliability across vision‑capable language models.

ByteDance Web Infra

Midscene.js is an AI × UI automation framework released by the Web Infra team. Starting with v0.14.0, it adds two major features: Instant Actions and Deep Think.

Instant Actions – Making Interaction More Stable

The existing .ai interface automatically plans steps using an LLM to interact with web pages, e.g.:

await agent.ai('Type "Headphones" in the search box, then press Enter');

Behind the scenes, Midscene calls an LLM to plan the steps and then executes them; the plan appears in the generated report. With complex prompts, however, the model may plan incorrect steps or return inaccurate element coordinates, which frustrates test engineers.

To address this, Midscene introduces direct UI operation APIs such as aiTap(), aiHover(), aiInput(), aiKeyboardPress(), and aiScroll(). These functions execute the specified action directly, while the AI model only handles low‑level tasks like element locating, resulting in faster and more reliable execution.

For example, the previous search operation can be rewritten as:

await agent.aiInput('Headphones', 'search box');
await agent.aiKeyboardPress('Enter');

The report now shows no planning step, indicating a streamlined process.

Although scripts using these APIs may appear more verbose, they save time when the desired actions are clear.
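As a rough illustration of how such a script can be organized, the sketch below wraps the search flow in a reusable function. The InstantActionAgent interface and the RecordingAgent stub are hypothetical and exist only so the example runs without a browser or a model; the method names and argument order follow the calls shown above and do not cover the full Midscene API.

```typescript
// Hypothetical minimal interface: only the instant-action calls used in this article.
// Argument order mirrors the examples above; this is not the full Midscene API.
interface InstantActionAgent {
  aiInput(text: string, locator: string): Promise<void>;
  aiKeyboardPress(key: string): Promise<void>;
  aiTap(locator: string, options?: { deepThink?: boolean }): Promise<void>;
}

// Stand-in agent (an assumption for this sketch) that records calls
// instead of driving a real page.
class RecordingAgent implements InstantActionAgent {
  readonly calls: string[] = [];

  async aiInput(text: string, locator: string): Promise<void> {
    this.calls.push(`input "${text}" into ${locator}`);
  }

  async aiKeyboardPress(key: string): Promise<void> {
    this.calls.push(`press ${key}`);
  }

  async aiTap(locator: string, options?: { deepThink?: boolean }): Promise<void> {
    this.calls.push(`tap ${locator}${options?.deepThink ? ' (deepThink)' : ''}`);
  }
}

// Each step is an explicit action, so no planning step is needed;
// a real agent would only ask the model to locate the described element.
async function searchFor(agent: InstantActionAgent, keyword: string): Promise<void> {
  await agent.aiInput(keyword, 'search box');
  await agent.aiKeyboardPress('Enter');
}
```

With a real Midscene agent, the same searchFor function could be called with the agent instance in place of the stub, provided its methods accept these arguments.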

Deep Think – Improving Element Positioning Accuracy

When interacting with complex UI controls, LLMs may struggle to locate target elements. Midscene adds a deepThink option to instant‑action functions to enhance visual locating.

To enable it, pass deepThink: true in the options object of the call, e.g.:

await agent.aiTap('target', { deepThink: true });

deepThink first identifies a region containing the target element, then refines the search within that region, yielding more accurate coordinates.
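The coarse-to-fine idea can be sketched as follows. This is not Midscene's implementation, which the article does not describe; the Rect and Point types, the two locator callbacks, and the function names are all assumptions made for illustration. The sketch only shows how a point found inside a cropped region would be mapped back to page coordinates.

```typescript
interface Point { x: number; y: number }
interface Rect { left: number; top: number; width: number; height: number }

// Hypothetical stand-ins for the two model calls:
// stage 1 returns a coarse region in page coordinates,
// stage 2 returns a point relative to the top-left corner of that region.
type LocateRegion = (prompt: string) => Rect;
type LocateInRegion = (prompt: string, region: Rect) => Point;

// Map a point found inside the cropped region back to page coordinates.
function toPageCoordinates(region: Rect, local: Point): Point {
  const outside =
    local.x < 0 || local.y < 0 || local.x > region.width || local.y > region.height;
  if (outside) {
    throw new RangeError('local point lies outside the region');
  }
  return { x: region.left + local.x, y: region.top + local.y };
}

// Two-stage locate: narrow down to a region first, then search within it.
function locateTwoStage(
  prompt: string,
  findRegion: LocateRegion,
  findInRegion: LocateInRegion,
): Point {
  const region = findRegion(prompt);
  const local = findInRegion(prompt, region);
  return toPageCoordinates(region, local);
}
```

The intuition is that the second pass works on a smaller image in which the target occupies a larger share of the pixels, so small icons are easier to tell apart than in a full-page screenshot.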

In a Coze.com workflow editor page with many custom icons, using deepThink allows the LLM to reliably find the intended element, as shown in the report screenshots.

Note that deepThink only works with models that support visual grounding, such as qwen2.5‑vl. It will not work with general‑purpose LLMs such as gpt‑4o, which accept images but do not provide this grounding capability.

References

Prompting Tips: https://midscenejs.com/zh/prompting-tips.html

Midscene.js Official Site: https://midscenejs.com/zh

GitHub Repository: https://github.com/web-infra-dev/midscene
