WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation

Henry Hengyuan Zhao · Difei Gao · Mike Zheng Shou

Show Lab, National University of Singapore

📢 Update

[23/02/2025] We will release agent GUI-Thinker (Claude-3.5-Sonnet) for easy use before 21 Feb.
[13/02/2025] We released WorldGUI on arXiv.

Benchmark Overview

WorldGUI: An illustration of our proposed real-world GUI benchmark. The left shows that for each task, WorldGUI provides a user query, an instructional video, and pre-actions. The pre-actions lead to different initial states. The key characteristic of WorldGUI is that the same task starts from various initial states, which simulates the real-world testing process. The right shows the software included in our benchmark and the interactions involved in testing agents in our GUI environment.
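To make the task structure concrete, here is a minimal Python sketch of how one task could be expanded into several variants that differ only in their pre-actions. Everything in it is an assumption made for illustration: the `GuiTask` class, its field names, the `expand_initial_states` helper, and the example query, video path, and pre-actions are hypothetical and are not taken from the WorldGUI data format.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class GuiTask:
    """One task variant (hypothetical schema, not the WorldGUI file format)."""
    user_query: str               # natural-language instruction for the agent
    video_path: str               # instructional video demonstrating the task
    pre_actions: Tuple[str, ...]  # actions applied before the agent starts


def expand_initial_states(
    user_query: str,
    video_path: str,
    pre_action_variants: Sequence[Sequence[str]],
) -> List[GuiTask]:
    """Create one task variant per pre-action sequence.

    The query and video stay fixed; only the initial state differs.
    """
    return [
        GuiTask(user_query, video_path, tuple(pre_actions))
        for pre_actions in pre_action_variants
    ]


# Made-up example: the same task starting from three different initial states.
variants = expand_initial_states(
    "Set the slide background to blue",
    "videos/set_background.mp4",
    [
        [],                         # default initial state
        ["open the Design tab"],    # part of the task is already done
        ["minimize the window"],    # the agent must first recover the window
    ],
)
```

An evaluation loop would then apply each variant's `pre_actions` to the environment before handing the query and video to the agent, so that success is measured across starting states rather than from a single fixed one.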

Agent Framework Overview

An overview of GUI-Thinker, which includes five proposed components: Planner, Planner-Critic, Step-Check, Actor, and Actor-Critic. The Planner module receives the user query and an instructional video as input and generates an initial plan for the Planner-Critic process. This plan is then refined and executed step by step. Before each step is passed to the Actor module, it undergoes a Step-Check. After the Actor produces an action, the Actor-Critic module iteratively verifies the completion of the action and makes corrections if needed.
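The control flow described above can be sketched roughly as follows. This is not the GUI-Thinker implementation: the function name, the callable signatures, and the `max_retries` parameter are hypothetical. Two behaviors are assumptions as well, since the overview does not specify them: Step-Check is modeled as a yes/no decision on whether a step still needs to run, and Actor-Critic correction is modeled as a simple retry of the Actor.

```python
from typing import Callable, List


def run_gui_thinker_sketch(
    query: str,
    video: str,
    planner: Callable[[str, str], List[str]],
    planner_critic: Callable[[List[str]], List[str]],
    step_check: Callable[[str], bool],
    actor: Callable[[str], str],
    actor_critic: Callable[[str, str], bool],
    max_retries: int = 3,  # hypothetical cap on correction attempts
) -> List[str]:
    """Run the five components in order and return every action produced."""
    # Planner drafts a plan from the query and video; Planner-Critic refines it.
    plan = planner_critic(planner(query, video))

    actions: List[str] = []
    for step in plan:
        # Assumption: Step-Check returns False when the step can be skipped,
        # for example because the current GUI state already satisfies it.
        if not step_check(step):
            continue
        for _ in range(max_retries):
            action = actor(step)
            actions.append(action)
            # Actor-Critic verifies completion; on failure the Actor tries again.
            if actor_critic(step, action):
                break
    return actions
```

Passing the components as plain callables keeps the sketch independent of any particular model or GUI backend; stubs can stand in for all five when testing the loop itself.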

🙏 Acknowledgement

We express our great thanks to Kaiming Yang, Mingyi Yan, and Wendi Yu for their hard work on data annotation and baseline testing.
OOTB (Computer Use): Computer Use OOTB is an out-of-the-box (OOTB) solution for Desktop GUI Agent, including API-based (Claude 3.5 Computer Use) and locally-running models (ShowUI, UI-TARS).
ShowUI: Open-source, End-to-end, Lightweight, Vision-Language-Action model for GUI Agent & Computer Use.
AssistGUI: AssistGUI is the first work that focuses on desktop productivity software usage with over 100 realistic GUI tasks.
VideoGUI: A Benchmark for GUI Automation from Instructional Videos. Can a GUI agent behave like a human when given an image-style effect and a user query?
SWE-bench Multimodal: SWE-bench Multimodal is a dataset for evaluating AI systems on visual software engineering tasks.

🎓 Citation

If you find WorldGUI useful, please cite using this BibTeX:

@misc{zhao2025worldguidynamictestingcomprehensive,
      title={WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation}, 
      author={Henry Hengyuan Zhao and Difei Gao and Mike Zheng Shou},
      year={2025},
      eprint={2502.08047},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2502.08047}, 
}
