Midscene.js: AI‑Powered UI Automation Framework with Instant Actions and Deep Think
Midscene.js, an AI‑driven UI automation framework from the Web Infra team, introduces Instant Actions for stable interactions and Deep Think for precise element locating, providing developers with direct UI operation APIs and enhanced reliability across vision‑capable language models.
Midscene.js is an AI × UI automation framework released by the Web Infra team. Starting with v0.14.0, it adds two major features: Instant Actions and Deep Think.
Instant Actions – Making Interaction More Stable
The existing .ai interface uses an LLM to plan the steps for interacting with a web page automatically, for example:
await agent.ai('type "Headphones" in the search box, hit Enter');
Behind the scenes, Midscene calls the LLM to generate a plan and then executes it; the planning step is shown in the report. With complex prompts, however, the LLM may plan the wrong steps or return inaccurate element coordinates, which is frustrating for test engineers.
To address this, Midscene introduces APIs that perform a single UI operation directly, such as aiTap(), aiHover(), aiInput(), aiKeyboardPress(), and aiScroll(). Each function performs exactly the action its name states, and the AI model handles only lower-level tasks such as locating the element. Skipping the planning step makes execution faster and more reliable.
For example, the previous search operation can be rewritten as:
await agent.aiInput('Headphones', 'search box');
await agent.aiKeyboardPress('Enter');
The report for this run no longer contains a planning step.
Although scripts using these APIs may appear more verbose, they save time when the desired actions are clear.
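A minimal sketch of how such calls might be wrapped in a reusable helper. The InstantActionAgent interface is a local stand-in that mirrors only the two calls shown above, not Midscene's actual type definitions; searchFor and createRecordingAgent are hypothetical names introduced for illustration. The recording stub lets the call sequence be checked without a browser or a model.

```typescript
// Local stand-in for the agent; it mirrors only the two calls shown above.
// Assumption: this is not Midscene's real type definition.
interface InstantActionAgent {
  aiInput(text: string, locate: string): Promise<unknown>;
  aiKeyboardPress(key: string): Promise<unknown>;
}

// Hypothetical helper: type a keyword into the search box, then submit.
// No planning prompt is involved; the model is only asked to locate 'search box'.
async function searchFor(agent: InstantActionAgent, keyword: string): Promise<void> {
  await agent.aiInput(keyword, 'search box');
  await agent.aiKeyboardPress('Enter');
}

// Hypothetical recording stub for dry runs: logs each call instead of driving a page.
function createRecordingAgent(log: string[]): InstantActionAgent {
  return {
    async aiInput(text, locate) {
      log.push(`aiInput(${text}, ${locate})`);
    },
    async aiKeyboardPress(key) {
      log.push(`aiKeyboardPress(${key})`);
    },
  };
}
```

Given a real agent that exposes these two methods, calling await searchFor(agent, 'Headphones') would issue the same two instant actions as the snippet above, in the same order.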
Deep Think – Improving Element Positioning Accuracy
When interacting with complex UI controls, LLMs may struggle to locate target elements. Midscene adds a deepThink option to instant‑action functions to enhance visual locating.
To enable it, pass deepThink: true in the options argument, for example:
await agent.aiTap('target', { deepThink: true });
With deepThink enabled, Midscene first identifies a region containing the target element, then refines the search within that region, yielding more accurate coordinates.
In a Coze.com workflow editor page with many custom icons, using deepThink allows the LLM to reliably find the intended element, as shown in the report screenshots.
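The two-stage locating described above implies some simple coordinate arithmetic, sketched here. This illustrates the general coarse-to-fine technique and is not Midscene's implementation: the Rect and Point types, the toPageCoordinates function, and the scale parameter (how much the cropped region is enlarged before the second pass) are all assumptions made for illustration.

```typescript
interface Point { x: number; y: number; }
interface Rect { left: number; top: number; width: number; height: number; }

// Stage 1 (assumed): the model returns a coarse page region containing the target.
// Stage 2 (assumed): the model locates the target inside an enlarged crop of that region.
// This function maps the stage-2 point back to page coordinates.
function toPageCoordinates(region: Rect, pointInCrop: Point, scale: number = 1): Point {
  if (scale <= 0) throw new RangeError('scale must be positive');
  const x = region.left + pointInCrop.x / scale;
  const y = region.top + pointInCrop.y / scale;
  // A refined point outside the coarse region means the two stages disagree.
  const inside =
    x >= region.left && x <= region.left + region.width &&
    y >= region.top && y <= region.top + region.height;
  if (!inside) throw new RangeError('refined point lies outside the coarse region');
  return { x, y };
}

// Example: a 200x100 region at (640, 300), cropped and enlarged 2x.
// The target is found at (120, 60) inside the crop.
const target = toPageCoordinates(
  { left: 640, top: 300, width: 200, height: 100 },
  { x: 120, y: 60 },
  2,
);
console.log(target); // { x: 700, y: 330 }
```

When scale is greater than 1, an error of one pixel in the crop maps to less than one pixel on the page, which is one plausible reason a refinement pass can yield tighter coordinates on pages crowded with small icons.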
Note that deepThink only works with models that support visual grounding, such as qwen2.5-vl. With general-purpose LLMs such as gpt-4o, deepThink has no effect.
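One way to keep scripts portable across models is to decide the deepThink flag in a single place. The sketch below is hypothetical: supportsVisualGrounding and locateOptions are illustrative names, and the list contains only the model family named in this article. Which other models qualify should be checked against the Midscene documentation.

```typescript
// Model families assumed to support visual grounding.
// Only qwen2.5-vl is named in this article; extend the list with care.
const GROUNDING_MODEL_FAMILIES: string[] = ['qwen2.5-vl'];

// Hypothetical check based on a substring match against the configured model name.
function supportsVisualGrounding(modelName: string): boolean {
  const name = modelName.trim().toLowerCase();
  return GROUNDING_MODEL_FAMILIES.some((family) => name.includes(family));
}

// Builds the options object passed as the second argument of aiTap() and similar calls.
function locateOptions(modelName: string, wantDeepThink: boolean): { deepThink: boolean } {
  return { deepThink: wantDeepThink && supportsVisualGrounding(modelName) };
}

console.log(locateOptions('qwen2.5-vl-72b-instruct', true)); // { deepThink: true }
console.log(locateOptions('gpt-4o', true)); // { deepThink: false }
```

A call site could then read await agent.aiTap('target', locateOptions(modelName, true)), where modelName is whatever model the project is configured to use.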
References
Prompting Tips: https://midscenejs.com/zh/prompting-tips.html
Midscene.js Official Site: https://midscenejs.com/zh
GitHub Repository: https://github.com/web-infra-dev/midscene
ByteDance Web Infra
ByteDance Web Infra team, focused on delivering excellent technical solutions, building an open tech ecosystem, and advancing front-end technology within the company and the industry | The best way to predict the future is to create it