Commit 9b7a46c: init

0 parents, commit 9b7a46c
File tree

378 files changed: +134337 -0 lines
@@ -0,0 +1,28 @@
---
name: Bug Report & Assistance Request
about: Create a report to help us improve
title: "[Bug/Assistance] "
labels: bug, help wanted
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error

**Screenshots or Terminal Copy&Paste**
If applicable, add screenshots to help explain your problem.

**Desktop (please complete the following information):**
- OS: [e.g. Ubuntu 22.04]
- Python: [e.g. 3.9]

**Additional context**
Add any other context about the problem here.
+17

@@ -0,0 +1,17 @@
---
name: Feature Request
about: Suggest an idea for this project
title: "[Feature] "
labels: enhancement
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Additional context**
Add any other context or screenshots about the feature request here.

.gitignore

+19
@@ -0,0 +1,19 @@
__pycache__
%*
.idea
.vscode
src/tasks/humaneval_x/env/vendor
logs
outputs
data/full
results
config.sh
download
.DS_Store
# local*
*.ipynb
.cache
src/server/tasks/card_game/result
.dockerfile
.dockerfile-cache
analysis

README.md

+180
@@ -0,0 +1,180 @@
# AgentBench

![](./assets/cover.jpg)

<p align="center">
   <a href="https://llmbench.ai" target="_blank">🌐 Website</a> | <a href="https://twitter.com/thukeg" target="_blank">🐦 Twitter</a> | <a href="mailto:[email protected]">✉️ Google Group</a> | <a href="https://arxiv.org/abs/2308.03688" target="_blank">📃 Paper</a>
</p>

<p align="center">
   👋 Join our <a href="https://join.slack.com/t/agentbenchcol-huw1944/shared_invite/zt-20ixabcuv-31cFLBAkqGQxQkJqrWVEVg" target="_blank">Slack</a> for <i>Q & A</i> or <i><b>collaboration</b> on the next version of AgentBench</i>!
</p>

## 📌Introducing AgentBench v0.2🎉

You are now browsing AgentBench v0.2. If you wish to use the older version, you can revert to [v0.1](https://github.com/THUDM/AgentBench/tree/v0.1).

Based on [v0.1](https://github.com/THUDM/AgentBench/tree/v0.1), we:

- Updated the framework architecture for easier use and extension
- Adjusted some task settings
- Added test results for more models
- Released the full data for the Dev and Test sets

# AgentBench: Evaluating LLMs as Agents

https://github.com/THUDM/AgentBench/assets/129033897/656eed6e-d9d9-4d07-b568-f43f5a451f04

**AgentBench** is the first benchmark designed to evaluate **LLM-as-Agent** across a diverse spectrum of environments. It encompasses 8 distinct environments to provide a more comprehensive evaluation of LLMs' ability to operate as autonomous agents in various scenarios. These environments include 5 freshly created domains, namely

- Operating System (OS)
- Database (DB)
- Knowledge Graph (KG)
- Digital Card Game (DCG)
- Lateral Thinking Puzzles (LTP)

as well as 3 recompiled from published datasets:

- House-Holding (HH) ([ALFWorld](https://github.com/alfworld/alfworld))
- Web Shopping (WS) ([WebShop](https://github.com/princeton-nlp/webshop))
- Web Browsing (WB) ([Mind2Web](https://github.com/OSU-NLP-Group/Mind2Web))

![](./assets/agentbench.png)

## Table of Contents

- [Dataset Summary](#dataset-summary)
- [Leaderboard](#leaderboard)
- [Quick Start](#quick-start)
- [Next Steps](#next-steps)
- [Citation](#citation)
## Dataset Summary

We offer two splits for each dataset: Dev and Test. Evaluating the multi-turn interactions requires an LLM to generate roughly 4k and 13k responses on the Dev and Test splits, respectively.

![](./assets/statistics.png)

## Leaderboard

Here are the scores on the test set (standard) of AgentBench.

![](./assets/leaderboard.png)

While LLMs are beginning to manifest their proficiency as agents, the gaps between models and the distance towards practical usability remain significant.

![](./assets/intro.png)
## Quick Start

This section will guide you through quickly using gpt-3.5-turbo-0613 as an agent to launch the `dbbench-std` and `os-std` tasks.
For the specific framework structure, please refer to [Framework Introduction](docs/Introduction_en.md).
For more detailed configuration and launch methods, please check [Configuration Guide](docs/Config_en.md)
and [Program Entrance Guide](docs/Entrance_en.md).

### Step 1. Prerequisites

Clone this repo and install the dependencies.

```bash
cd AgentBench
conda create -n agent-bench python=3.9
conda activate agent-bench
pip install -r requirements.txt
```

Ensure that [Docker](https://www.docker.com/) is properly installed.

```bash
docker ps
```

Build the required images for `dbbench-std` and `os-std`.

```bash
docker pull mysql
docker pull ubuntu
docker build -f data/os_interaction/res/dockerfiles/default data/os_interaction/res/dockerfiles --tag local-os/default
docker build -f data/os_interaction/res/dockerfiles/packages data/os_interaction/res/dockerfiles --tag local-os/packages
docker build -f data/os_interaction/res/dockerfiles/ubuntu data/os_interaction/res/dockerfiles --tag local-os/ubuntu
```
### Step 2. Configure the Agent

Fill in your OpenAI API key at the appropriate location in `configs/agents/openai-chat.yaml` (used by agents such as `gpt-3.5-turbo-0613`).

You can run `python -m src.client.agent_test` to check whether your agent is configured correctly.

By default, `gpt-3.5-turbo-0613` is started. You can replace it with other agents by modifying the parameters:

```bash
python -m src.client.agent_test --config configs/agents/api_agents.yaml --agent gpt-3.5-turbo-0613
```
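If you prefer to sanity-check your key outside the framework first, the request below mirrors what `configs/agents/openai-chat.yaml` describes (same URL, headers, and `temperature`). This is only an illustrative sketch, not part of the repo; the `OPENAI_API_KEY` environment variable and the test prompt are arbitrary choices.

```python
# Minimal sketch: one chat completion request using the same URL/headers/body style
# as configs/agents/openai-chat.yaml. Assumes `requests` is installed and the key is
# exported as OPENAI_API_KEY (an illustrative convention, not required by AgentBench).
import os
import requests

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    },
    json={
        "model": "gpt-3.5-turbo-0613",
        "temperature": 0,
        "max_tokens": 512,
        "messages": [{"role": "user", "content": "Say hello in one word."}],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```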
### Step 3. Start the task server

Task workers are started per task, and starting them manually can be cumbersome, so we provide an automated script.

This step assumes that ports 5000 through 5015 are available.

```bash
python -m src.start_task -a
```

This will launch five task_workers each for the `dbbench-std`, `os-std`, and `kg-std` tasks and automatically connect them to the controller on port 5000. **After executing this command, please allow approximately 1 minute for the task setup to complete.**
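If you want to verify the port assumption above before launching, a quick check like the following (an illustrative helper, not part of the repo) reports any port in the 5000-5015 range that is already bound on localhost:

```python
# Illustrative check (not AgentBench source): report which of ports 5000-5015
# are already in use on localhost before starting the controller and workers.
import socket

for port in range(5000, 5016):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        if s.connect_ex(("127.0.0.1", port)) == 0:
            print(f"port {port} is already in use")
```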
### Step 4. Start the assigner

This step actually starts the tasks.

If everything is correctly configured so far, you can now initiate the task tests.

```bash
python -m src.assigner
```

## Next Steps

If you wish to launch more tasks or use other models, refer to [Configuration Guide](docs/Config_en.md) and [Program Entrance Guide](docs/Entrance_en.md).

For the environments of the remaining five tasks, you will need to download the Docker images we provide.

```
longinyu/agentbench-ltp
longinyu/agentbench-webshop
longinyu/agentbench-mind2web
longinyu/agentbench-card_game
longinyu/agentbench-alfworld
```
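To fetch all five at once, a small convenience sketch like the one below loops over the list; plain `docker pull <image>` commands work just as well.

```python
# Convenience sketch: pull the five prebuilt task images listed above.
# Equivalent to running `docker pull <image>` for each entry.
import subprocess

IMAGES = [
    "longinyu/agentbench-ltp",
    "longinyu/agentbench-webshop",
    "longinyu/agentbench-mind2web",
    "longinyu/agentbench-card_game",
    "longinyu/agentbench-alfworld",
]

for image in IMAGES:
    subprocess.run(["docker", "pull", image], check=True)
```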
The resource consumption of a single task_worker for the eight tasks is roughly as follows; take this into account when launching:

| Task Name | Start-up Speed | Memory Consumption |
| --------- | -------------- | ------------------ |
| webshop   | ~3 min         | ~15 GB             |
| mind2web  | ~5 min         | ~1 GB              |
| db        | ~20 s          | < 500 MB           |
| alfworld  | ~10 s          | < 500 MB           |
| card_game | ~5 s           | < 500 MB           |
| ltp       | ~5 s           | < 500 MB           |
| os        | ~5 s           | < 500 MB           |
| kg        | ~5 s           | < 500 MB           |

## Citation

```
@article{liu2023agentbench,
  title   = {AgentBench: Evaluating LLMs as Agents},
  author  = {Xiao Liu and Hao Yu and Hanchen Zhang and Yifan Xu and Xuanyu Lei and Hanyu Lai and Yu Gu and Hangliang Ding and Kaiwen Men and Kejuan Yang and Shudan Zhang and Xiang Deng and Aohan Zeng and Zhengxiao Du and Chenhui Zhang and Sheng Shen and Tianjun Zhang and Yu Su and Huan Sun and Minlie Huang and Yuxiao Dong and Jie Tang},
  year    = {2023},
  journal = {arXiv preprint arXiv:2308.03688}
}
```

assets/agentbench.png (6.01 MB)
assets/architecture.png (58.2 KB)
assets/cover.jpg (521 KB)
assets/intro.png (114 KB)
assets/leaderboard.png (163 KB)
assets/logo.png (32.1 KB)
assets/statistics.png (79.8 KB)

configs/agents/api_agents.yaml

+23
@@ -0,0 +1,23 @@
gpt-3.5-turbo-0613:
  import: "./openai-chat.yaml"
  parameters:
    name: "gpt-3.5-turbo-0613"
    body:
      model: "gpt-3.5-turbo-0613"
      max_tokens: 512

text-davinci-003:
  import: "./openai-text.yaml"
  parameters:
    name: "text-davinci-003"
    body:
      model: "text-davinci-003"
      max_tokens: 512

text-davinci-002:
  import: "./openai-text.yaml"
  parameters:
    name: "text-davinci-002"
    body:
      model: "text-davinci-002"
      max_tokens: 512
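The `import` key suggests that each agent entry is layered on top of the referenced YAML file, with the local keys overriding the imported ones. A rough sketch of such a merge (an assumption about the framework's behavior, not code from this repo) might look like:

```python
# Rough sketch (assumption, not AgentBench source): resolve an `import:` key inside a
# config entry by loading the referenced YAML (via PyYAML) and recursively overlaying
# the entry's own keys on top of it.
import os
import yaml

def deep_merge(base: dict, override: dict) -> dict:
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def resolve_entry(entry: dict, base_dir: str) -> dict:
    entry = dict(entry)
    imported = entry.pop("import", None)
    if imported is None:
        return entry
    with open(os.path.join(base_dir, imported)) as f:
        base = yaml.safe_load(f) or {}
    return deep_merge(resolve_entry(base, base_dir), entry)

# e.g. resolve_entry(api_agents["gpt-3.5-turbo-0613"], "configs/agents") would yield
# the HTTPAgent module from openai-chat.yaml plus the merged model parameters.
```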

configs/agents/fs_agent.yaml

+23
@@ -0,0 +1,23 @@
default:
  module: "src.client.agents.FastChatAgent"
  parameters:
    name: "FastChat"
    controller_address: "http://localhost:55555"
    max_new_tokens: 512
    temperature: 0

vicuna-33b:
  parameters:
    model_name: "vicuna-33b-v1.3"

wizard-30b:
  parameters:
    model_name: "WizardLM-30B-V1.0-merged"

vicuna-13b:
  parameters:
    model_name: "vicuna-13b-v1.5"

vicuna-7b:
  parameters:
    model_name: "vicuna-7b-v1.5"

configs/agents/openai-chat.yaml

+13
@@ -0,0 +1,13 @@
module: src.client.agents.HTTPAgent
parameters:
  url: https://api.openai.com/v1/chat/completions
  headers:
    Content-Type: application/json
    Authorization: Bearer <% PUT-YOUR-OPENAI-KEY-HERE %>
  body:
    temperature: 0
  prompter:
    name: role_content_dict
    args:
      agent_role: assistant
  return_format: "{response[choices][0][message][content]}"
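The `return_format` value looks like a Python `str.format` template indexed into the parsed JSON response. A small illustration of that reading (an assumption about how `HTTPAgent` consumes it, not its actual source) is:

```python
# Illustration (assumption, not AgentBench source): a return_format string such as
# "{response[choices][0][message][content]}" can be applied with str.format, which
# indexes into the parsed JSON body of the HTTP response.
return_format = "{response[choices][0][message][content]}"

response = {
    "choices": [
        {"message": {"role": "assistant", "content": "Hello!"}}
    ]
}

print(return_format.format(response=response))  # -> Hello!
```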

configs/agents/openai-text.yaml

+14
@@ -0,0 +1,14 @@
module: src.client.agents.HTTPAgent
parameters:
  name: <% NAME %>
  url: https://api.openai.com/v1/chat/completions
  headers:
    Content-Type: application/json
    Authorization: Bearer <% PUT-YOUR-OPENAI-KEY-HERE %>
  body:
    model: <% NAME %>
    temperature: 0
  prompter:
    name: prompt_string
  return_format: "{response[choices][0][text]}"

configs/assignments/default.yaml

+17
@@ -0,0 +1,17 @@
import: definition.yaml

concurrency:
  task:
    dbbench-std: 5
    os-std: 5
  agent:
    gpt-3.5-turbo-0613: 5

assignments: # List[Assignment] | Assignment
  - agent: # "task": List[str] | str , "agent": List[str] | str
      - gpt-3.5-turbo-0613
    task:
      - dbbench-std
      - os-std

output: "outputs/{TIMESTAMP}"
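Judging from the inline comments, an assignment pairs one or more agents with one or more tasks, and each agent/task combination presumably becomes a run. A sketch of that cross-product reading (an interpretation of the config, not the assigner's actual code) is:

```python
# Interpretation sketch (not AgentBench source): expand an assignment entry, where
# "agent" and "task" may each be a single string or a list, into (agent, task) pairs.
from itertools import product

def expand(assignment: dict) -> list[tuple[str, str]]:
    agents = assignment["agent"]
    tasks = assignment["task"]
    agents = [agents] if isinstance(agents, str) else agents
    tasks = [tasks] if isinstance(tasks, str) else tasks
    return list(product(agents, tasks))

print(expand({"agent": ["gpt-3.5-turbo-0613"], "task": ["dbbench-std", "os-std"]}))
# -> [('gpt-3.5-turbo-0613', 'dbbench-std'), ('gpt-3.5-turbo-0613', 'os-std')]
```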

configs/assignments/definition.yaml

+11
@@ -0,0 +1,11 @@
definition:
  task:
    overwrite:
      module: src.client.TaskClient
      parameters:
        controller_address: "http://localhost:5000/api"
    import: ../tasks/task_assembly.yaml
  agent:
    import:
      - ../agents/api_agents.yaml
      - ../agents/fs_agent.yaml

configs/start_task.yaml

+6
@@ -0,0 +1,6 @@
definition:
  import: tasks/task_assembly.yaml

start:
  dbbench-std: 5
  os-std: 5

configs/tasks/alfworld.yaml

+22
@@ -0,0 +1,22 @@
default:
  module: src.server.tasks.alfworld.ALFWorld
  docker:
    image: longinyu/agentbench-alfworld
    command: umask 0; [ -f /root/.setup.sh ] && bash /root/.setup.sh;
  parameters:
    name: alfworld-std
    data_path: "/AgentBench/data/alfworld"
    config_path: "src/server/tasks/alfworld/configs/base_config.yaml"
    prompts_path: "src/server/tasks/alfworld/prompts/alfworld_multiturn_plan_first.json"
    split: "standard"
    max_step: 35

alfworld-dev:
  parameters:
    name: alfworld-dev
    split: "dev"

alfworld-std:
  parameters:
    name: alfworld-std
    split: "standard"
