WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation

showlab/WorldGUI


Henry Hengyuan Zhao · Difei Gao · Mike Zheng Shou

Paper PDF Project Page

Show Lab, National University of Singapore

📢 Update

  • [23/02/2025] We will release the GUI-Thinker agent (Claude-3.5-Sonnet) for easy use before 21 Feb.
  • [13/02/2025] We released WorldGUI on arXiv.

Benchmark Overview

WorldGUI: An illustration of our proposed real-world GUI benchmark. Left: for each task, WorldGUI provides a user query, an instructional video, and pre-actions. The pre-actions lead to different initial states. The key characteristic of WorldGUI is that the same task starts from various initial states, which simulates the real-world testing process. Right: the software included in our benchmark and the interactions used to test agents in our GUI environment.
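The task structure described in the caption can be sketched as a small data model. This is only an illustration, not the benchmark's actual schema: the class names `PreAction` and `TaskVariant`, the helper `make_variants`, the example query, and the video path are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PreAction:
    """Hypothetical setup step applied to the GUI before the agent starts."""
    description: str


@dataclass
class TaskVariant:
    """One task instance: a shared query and video plus one initial state."""
    query: str
    video_path: str
    pre_actions: List[PreAction] = field(default_factory=list)


def make_variants(query: str, video_path: str,
                  pre_action_sets: List[List[PreAction]]) -> List[TaskVariant]:
    """Expand one task into several variants, one per set of pre-actions."""
    return [TaskVariant(query, video_path, list(pre)) for pre in pre_action_sets]


# Illustrative values only; none of these come from the benchmark data.
variants = make_variants(
    "Set the slide background to blue",
    "videos/example.mp4",
    [
        [],                                               # default initial state
        [PreAction("Open the Design tab")],               # partially advanced state
        [PreAction("Background is already set to blue")], # task already satisfied
    ],
)

for v in variants:
    print(v.query, "|", [p.description for p in v.pre_actions])
```

Each variant shares the same query and video, so any difference in agent behavior comes from the initial state alone.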

Agent Framework Overview

An overview of GUI-Thinker, which includes five proposed components: Planner, Planner-Critic, Step-Check, Actor, and Actor-Critic. The Planner module receives the user query and an instructional video as input and generates an initial plan, which the Planner-Critic then refines. The refined plan is executed step by step. Before each step is passed to the Actor module, it undergoes a Step-Check. After the Actor produces an action, the Actor-Critic module iteratively verifies the completion of the action and makes corrections if needed.
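The control flow described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the released GUI-Thinker code: the function `run_pipeline`, its callable signatures, the `max_rounds` limit, and the idea that Step-Check returns a boolean "execute or skip" decision are all hypothetical, and the five components are replaced by toy stubs.

```python
from typing import Callable, List, Tuple


def run_pipeline(
    query: str,
    video: str,
    planner: Callable[[str, str], List[str]],
    planner_critic: Callable[[str, List[str]], List[str]],
    step_check: Callable[[str], bool],
    actor: Callable[[str], str],
    actor_critic: Callable[[str, str], Tuple[bool, str]],
    max_rounds: int = 3,  # assumed cap on Actor-Critic correction rounds
) -> List[Tuple[str, str]]:
    """Plan, refine the plan, then check, act on, and verify each step."""
    plan = planner_critic(query, planner(query, video))
    trace = []
    for step in plan:
        # Assumption: Step-Check decides whether the step still needs executing.
        if not step_check(step):
            trace.append((step, "skipped"))
            continue
        action = actor(step)
        done = False
        for _ in range(max_rounds):
            # The critic returns (done, action); action is corrected if not done.
            done, action = actor_critic(step, action)
            if done:
                break
        trace.append((step, action if done else "failed"))
    return trace


# Toy stubs standing in for the model-backed components.
def toy_planner(query, video):
    return ["open menu", "open menu", "click save"]


def toy_planner_critic(query, plan):
    # Drop consecutive duplicate steps from the initial plan.
    return [s for i, s in enumerate(plan) if i == 0 or s != plan[i - 1]]


def toy_step_check(step):
    return step != "open menu"  # pretend the menu is already open


def toy_actor(step):
    return "click(0, 0)"  # deliberately wrong first guess


def toy_actor_critic(step, action):
    target = "click(10, 20)"
    return (True, action) if action == target else (False, target)


trace = run_pipeline("save the file", "demo.mp4", toy_planner,
                     toy_planner_critic, toy_step_check, toy_actor,
                     toy_actor_critic)
print(trace)
```

With these stubs, the duplicate plan step is removed, the first step is skipped by the check, and the second step is corrected once by the critic before being accepted.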

🙏 Acknowledgement

  • We express our great thanks to Kaiming Yang, Mingyi Yan, and Wendi Yu for their hard work on data annotation and baseline testing.

  • OOTB (Computer Use): Computer Use OOTB is an out-of-the-box (OOTB) solution for desktop GUI agents, supporting both API-based (Claude 3.5 Computer Use) and locally running models (ShowUI, UI-TARS).

  • ShowUI: Open-source, End-to-end, Lightweight, Vision-Language-Action model for GUI Agent & Computer Use.

  • AssistGUI: AssistGUI is the first work that focuses on desktop productivity software usage with over 100 realistic GUI tasks.

  • VideoGUI: A Benchmark for GUI Automation from Instructional Videos. Can a GUI agent behave like a human when given an image-style effect and a user query?

  • SWE-bench Multimodal: SWE-bench Multimodal is a dataset for evaluating AI systems on visual software engineering tasks.

🎓 Citation

If you find WorldGUI useful, please cite using this BibTeX:

@misc{zhao2025worldguidynamictestingcomprehensive,
      title={WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation}, 
      author={Henry Hengyuan Zhao and Difei Gao and Mike Zheng Shou},
      year={2025},
      eprint={2502.08047},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2502.08047}, 
}
