Platform • Paper • Dataset • Discord • Twitter • WeChat
Existing benchmarks for web agent tasks are either offline and static, or operate within a fully reproducible environment with limited Internet dynamics. The WebCanvas project aims to pioneer the online evaluation of web agents. Additionally, we offer a suite of toolkits for scaling and maintaining web agent data to support this endeavor. We welcome any constructive feedback on the project and look forward to partnering with you in developing agents for web tasks!
- [2024, July 13] We've just released v0.0.2 of WebCanvas! This update adds support for calling different base model services, including OpenAI, Claude, Gemini, and together.ai, so you can choose any of them for testing on our platform. We've also launched a new repository, WebCanvas Showcase, which demonstrates how different agent frameworks can be integrated with the WebCanvas framework for online evaluation. We're kicking things off with the integration of SeeAct [1] and WebCanvas. Play with it and explore the possibilities!
- [2024, June 18] Our paper will be presented at the Agentic Markets Workshop at ICML 2024 and the Natural Language Reasoning and Structured Explanations Workshop at ACL 2024. See you in Vienna and Bangkok!
- [2024, June 18] Our pre-print paper "WebCanvas: Benchmarking Web Agents in Online Environments" is available!
- [2024, June 6] We've released WebCanvas, including Data, Platform, Toolkits, and Web agents (in this repo)!
- Base Agent Framework: Includes a universal agent framework with four key modules: Planning, Observation, Memory, and Reward, designed to perform complex tasks within real-world online web environments effectively.
- Dynamic and Real-time Web Environment Interaction: Utilizes live web environments to provide a realistic assessment and feedback of web agents.
- Key Nodes Annotation: Introduces the concept of "key nodes" to offer in-progress feedback and a granular, phase-based assessment system that adapts to frequent changes in real-world web navigation.
- Enhanced Granularity of Progress Reward: Allows for a thorough assessment of the reward module within the framework of autonomous web agents, focusing on the pivotal influence of reward signal quality.
- Easy to Scale with Online Web Environment: Connected to a comprehensive suite of toolkits with accurate observation capture and rich action space to define demonstration trajectories and intermediate states for real-time, open-ended web tasks, allowing for robust evaluation in dynamic web environments. Check out our browser plugin and data platform.
- Mind2Web-Live Dataset: Presents a refined version of the original static Mind2Web [2] dataset, containing 542 tasks with 2439 intermediate evaluation states, serving as the foundational general-purpose benchmark.
- Better Modularity and More Flexible Integration: Make it easier to integrate WebCanvas evaluation and to connect offline agents to the online environment.
- Better Observation: Faster to compute, more accurate, and combining more modalities (text, code, vision, conversation, etc.).
- Broader Action Space: Add actions such as caching in memory, outputting a final answer, and code execution to develop a better interface for web agents, which may differ from a human's.
- Dynamic Evaluation Function: Provide a toolkit for the community to define dynamic evaluation functions (for example, model-based evaluation) as a supplement to the current static evaluation functions.
- More Dataset Coverage: Introduce more datasets in different domains that address key capabilities in online web tasks.
- Accumulate Knowledge on Agent Experiences: Develop better algorithms to handle errors encountered during inference in live environments, and accumulate knowledge of agent experiences on different websites.
- Statistics on Agent Cost other than Performance: Enable calculation of the token or GPU consumption of an agent framework or agent model, to serve as another optimization goal for truly practical web agents.
- Cloud Version to Mitigate Environment Discrepancy: We are working on a cloud version for more reliable evaluation.
- Support calling more base models (Claude, Gemini, open-source models from together.ai, etc.). (Done)
- Add more web agent benchmarking data as showcases: WebArena [3], GAIA [4], WorkArena [5], etc. (in progress)
- Enable token consumption calculation. (in progress)
- Better modularity to ease integration. (in progress)
- Add vision as an extra observation and implement various grounding strategies.
- Keep updating the error handling module.
- Develop up-to-date visualizations of current agent performance on live websites.
- Add information-extraction actions and related evaluation metrics.
- Enable script-based actions and evaluation.
First, ensure your environment is ready by installing the necessary dependencies:
conda create -n webcanvas python=3.11
conda activate webcanvas
pip install -r requirements.txt
In our experiments, the environment plays a crucial role in agent performance. We recommend experimenting on a Windows server using the Chrome or Firefox browser engine, preferably on servers located in the United States. Below are the experiment results on the Mind2Web-Live test set.
Planning Model | IP Region | System | Browser | Completion Rate | Task Success Rate | Efficiency Score |
---|---|---|---|---|---|---|
gpt-3.5-turbo-0125 | United States | Windows | Chrome | 40.2% | 16.5% | 3.03 |
gpt-3.5-turbo-0125 | United States | Windows | Firefox | 42.1% | 20.2% | 2.79 |
gpt-3.5-turbo-0125 | United States | Linux | Chrome | 36.5% | 15.4% | 3.33 |
gpt-3.5-turbo-0125 | United Kingdom | Windows | Chrome | 23.6% | 8.65% | 7.78 |
gpt-3.5-turbo-0125 | Singapore | Windows | Chrome | 42.3% | 21.2% | 2.95 |
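The three metric columns in the table can be read through a small sketch. It rests on assumptions that this README does not state: Completion Rate is taken to be the share of annotated key nodes the agent reached, Task Success Rate the share of tasks in which every key node was reached, and Efficiency Score the average number of steps per completed key node (lower is better). The `TaskResult` structure is hypothetical and is not the repository's result format.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one task (hypothetical structure, not the repo's format)."""
    key_nodes_total: int      # annotated key nodes for the task
    key_nodes_completed: int  # key nodes the agent actually reached
    steps_taken: int          # actions the agent executed


def completion_rate(results):
    """Assumed definition: completed key nodes / all key nodes, over all tasks."""
    total = sum(r.key_nodes_total for r in results)
    done = sum(r.key_nodes_completed for r in results)
    return done / total if total else 0.0


def task_success_rate(results):
    """Assumed definition: share of tasks with every key node completed."""
    if not results:
        return 0.0
    solved = sum(r.key_nodes_completed == r.key_nodes_total for r in results)
    return solved / len(results)


def efficiency_score(results):
    """Assumed definition: steps per completed key node (lower is better)."""
    done = sum(r.key_nodes_completed for r in results)
    steps = sum(r.steps_taken for r in results)
    return steps / done if done else float("inf")


# Made-up example data, only to exercise the functions.
results = [
    TaskResult(key_nodes_total=4, key_nodes_completed=4, steps_taken=8),
    TaskResult(key_nodes_total=4, key_nodes_completed=2, steps_taken=10),
    TaskResult(key_nodes_total=2, key_nodes_completed=0, steps_taken=6),
]
print(f"Completion Rate:   {completion_rate(results):.1%}")
print(f"Task Success Rate: {task_success_rate(results):.1%}")
print(f"Efficiency Score:  {efficiency_score(results):.2f}")
```

Under these assumed definitions, a lower Efficiency Score means fewer steps per key node, which would be consistent with the United Kingdom row combining the lowest completion rate with the highest score.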
Before running the repository, set up the API keys for the external model services you plan to use.
For setting up OpenAI API keys, add your API key to your environment variables:
MacOS/Linux:
export OPENAI_API_KEY='your-api-key-here'
Windows:
setx OPENAI_API_KEY "your-api-key-here"
Visit Quickstart tutorial - OpenAI API for more details.
For setting up Claude API keys, add your API key to your environment variables:
MacOS/Linux:
export ANTHROPIC_API_KEY='your-api-key-here'
Windows:
setx ANTHROPIC_API_KEY "your-api-key-here"
For setting up Gemini API keys, add your API key to your environment variables:
MacOS/Linux:
export GOOGLE_API_KEY='your-api-key-here'
Windows:
setx GOOGLE_API_KEY "your-api-key-here"
For setting up Together AI API keys, add your API key to your environment variables:
MacOS/Linux:
export TOGETHER_API_KEY='your-api-key-here'
Windows:
setx TOGETHER_API_KEY "your-api-key-here"
Make sure to replace your-api-key-here with your actual API keys. This ensures that the necessary APIs are accessible for the features you intend to use in the repository.
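Before a long evaluation run, it can help to confirm that the keys are actually visible to Python. The following is a minimal sketch, not part of the repository: the environment variable names come from the commands above, while `PROVIDER_KEYS` and `missing_api_keys` are hypothetical helpers introduced here for illustration.

```python
import os

# Environment variable names taken from the setup commands above.
PROVIDER_KEYS = {
    "OpenAI": "OPENAI_API_KEY",
    "Claude": "ANTHROPIC_API_KEY",
    "Gemini": "GOOGLE_API_KEY",
    "Together AI": "TOGETHER_API_KEY",
}


def missing_api_keys(providers, env=None):
    """Return the variable names that are unset or blank for the given providers."""
    env = os.environ if env is None else env
    return [
        PROVIDER_KEYS[p]
        for p in providers
        if not env.get(PROVIDER_KEYS[p], "").strip()
    ]


# Demo with a fake environment so the result does not depend on your machine.
demo_env = {"OPENAI_API_KEY": "your-api-key-here", "GOOGLE_API_KEY": "  "}
print(missing_api_keys(["OpenAI", "Gemini"], env=demo_env))  # ['GOOGLE_API_KEY']
```

Calling `missing_api_keys(["OpenAI"])` without the `env` argument checks the real process environment. Note that on Windows, `setx` only affects terminals opened after the command was run.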
Register on the platform here.
First, ensure your environment variables are correctly set so that the code can access the necessary credentials and URL.
export GRAPHQL_USERNAME=your_username
export GRAPHQL_PASSWORD=your_password
To download a file, use the following command:
python data/dataset_io.py download \
--challenge-id your_challenge_id \
--save-path /path/to/save/file
- your_challenge_id: The ID of the challenge to download. For now, obtain this ID from the URL of the challenge. For example, the ID of Mind2Web-Live Test is "WjVIjPfpa-psiltU3oD2W".
- /path/to/save/file: The path where the downloaded file will be saved.
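If you prefer to drive the download from a script, the command can be assembled in Python. This is a sketch under stated assumptions: it only uses the `download` subcommand and flags shown above, `build_download_command` is a hypothetical helper, and the save path in the demo is made up. It checks the two GraphQL credentials first, since the download command relies on them.

```python
import os
import sys


def build_download_command(challenge_id, save_path, env=None):
    """Build the argv list for the documented download command.

    Raises RuntimeError if the GraphQL credentials are not set.
    """
    env = os.environ if env is None else env
    missing = [
        name
        for name in ("GRAPHQL_USERNAME", "GRAPHQL_PASSWORD")
        if not env.get(name)
    ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return [
        sys.executable,          # the current Python interpreter
        "data/dataset_io.py",
        "download",
        "--challenge-id", challenge_id,
        "--save-path", str(save_path),
    ]


# Demo with fake credentials; the challenge ID is the Mind2Web-Live Test
# example from the text, and the file name is a made-up placeholder.
demo_env = {"GRAPHQL_USERNAME": "your_username", "GRAPHQL_PASSWORD": "your_password"}
cmd = build_download_command("WjVIjPfpa-psiltU3oD2W", "mind2web_live_test.json", env=demo_env)
print(" ".join(cmd[1:]))
```

To actually run the download from the repository root, pass the list to `subprocess.run(cmd, check=True)`.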
The raw data contains rich step-level information intended to inspire future research. However, it is not used directly in our evaluation.
To process the raw data, run the following command:
python data/raw_data_processor.py \
--input-file path/to/input/file \
--output-file path/to/output/file
You can run the evaluation with the following command:
python evaluate.py \
--global_reward_mode dom_reward \
--index -1 \
--single_task_name "Find Dota 2 game and add all DLC to cart in steam." \
--planning_text_model gpt-3.5-turbo \
--global_reward_text_model gpt-3.5-turbo
This command runs the script with DOM-based self-reward, processing the default task "Find Dota 2 game and add all DLC to cart in steam." and the default data index -1. It uses the GPT-3.5 Turbo model for both planning and global reward processing. The evaluation mode is controlled by the task_mode parameter in configs/setting.toml, which lets you choose between batch mode and single mode (without automatic evaluation). Remember to specify the path to your test file in configs/setting.toml.
This program supports several command-line arguments to customize its behavior:
- --global_reward_mode: Selects the method for getting global rewards.
  - Options: dom_vision_reward, dom_reward, vision_reward, no_global_reward
  - Default: dom_reward
  - Description: Defines how rewards are obtained based on the interaction mode:
    - dom_vision_reward: Rewards are calculated using both DOM and vision data.
    - dom_reward: Rewards are based solely on DOM interactions.
    - vision_reward: Rewards are derived from vision-based interactions only.
    - no_global_reward: No global rewards are calculated.
- --index: Decides which data index to start with.
  - Type: String
  - Default: -1
  - Description: Use this parameter to specify a range or specific index for data processing. For example, 0,5 will process data from index 0 to 5.
- --single_task_name: Defines the name of the single task to execute.
  - Type: String
  - Default: "Find Dota 2 game and add all DLC to cart in steam."
  - Description: Use this parameter to specify the task that the agent should perform.
- --planning_text_model: Specifies the model used for the planning module.
  - Type: String
  - Default: gpt-3.5-turbo
  - Description: Use this parameter to specify which text model to use for the planning module.
- --global_reward_text_model: Specifies the model used for global reward reasoning.
  - Type: String
  - Default: gpt-3.5-turbo
  - Description: Use this parameter to specify which text model to use for global reward reasoning.
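The documented options can be mirrored in a small `argparse` sketch. This is not the parser used by evaluate.py; it only restates the flags, options, and defaults listed above. `build_parser` and `parse_index` are hypothetical helpers, and the reading of `--index` (-1 as "no explicit range", a single number as one index, `0,5` as a start and end pair) is an assumption.

```python
import argparse

REWARD_MODES = ["dom_vision_reward", "dom_reward", "vision_reward", "no_global_reward"]


def build_parser():
    """Parser that mirrors the documented flags and defaults (a sketch)."""
    parser = argparse.ArgumentParser(description="Sketch of the documented options")
    parser.add_argument("--global_reward_mode", choices=REWARD_MODES, default="dom_reward")
    parser.add_argument("--index", type=str, default="-1")
    parser.add_argument(
        "--single_task_name",
        type=str,
        default="Find Dota 2 game and add all DLC to cart in steam.",
    )
    parser.add_argument("--planning_text_model", type=str, default="gpt-3.5-turbo")
    parser.add_argument("--global_reward_text_model", type=str, default="gpt-3.5-turbo")
    return parser


def parse_index(index):
    """Assumed reading of --index: "-1" -> None, "3" -> (3, 3), "0,5" -> (0, 5)."""
    if index.strip() == "-1":
        return None
    parts = [int(p) for p in index.split(",")]
    if len(parts) == 1:
        return (parts[0], parts[0])
    if len(parts) == 2 and parts[0] <= parts[1]:
        return (parts[0], parts[1])
    raise ValueError(f"Unsupported --index value: {index!r}")


args = build_parser().parse_args(["--index", "0,5", "--global_reward_mode", "vision_reward"])
print(args.global_reward_mode, parse_index(args.index))
```

Keeping `--index` as a string, as the documentation specifies, is what lets one flag carry either a single index or a comma-separated pair.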
Evaluating web agents in an online environment can be painful due to issues such as network problems or bot checks on certain websites. An evaluation method that accommodates these issues allows for an accurate assessment of an agent's performance under the current conditions. We therefore provide a more flexible interaction mode that lets users manually resolve environmental issues and obtain the best performance from their web agents. To enable this feature, set the interaction_mode parameter in configs/setting.toml. We will keep building up our error handling for online agent inference, and in a future version we will try to minimize human effort by asking for intervention only when exceptions occur.
IMPORTANT: You should upload the generated out.json file to participate in a challenge. To upload your result, use the following command:
python data/dataset_io.py upload \
--file-path /path/to/your/file \
--challenge-id your_challenge_id \
--name your_agent_name \
--base-model your_agent_base_model
Replace the placeholders with your actual values:
- /path/to/your/file: The path to the result you want to upload.
- your_challenge_id: The ID of the challenge you want to participate in.
- your_agent_name: The agent name for the upload.
- your_agent_base_model: The agent base model information for the upload.
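A quick sanity check before uploading can catch a missing or truncated result file. The sketch below is an illustration, not part of the repository: it only verifies that the file parses as JSON (the expected schema of out.json is not described here) and then assembles the documented upload command. `build_upload_command` and the demo values are hypothetical.

```python
import json
import sys
import tempfile
from pathlib import Path


def build_upload_command(file_path, challenge_id, name, base_model):
    """Check that the result file is valid JSON, then build the upload argv."""
    path = Path(file_path)
    with path.open(encoding="utf-8") as f:
        json.load(f)  # raises if the file is missing or not valid JSON
    return [
        sys.executable,          # the current Python interpreter
        "data/dataset_io.py",
        "upload",
        "--file-path", str(path),
        "--challenge-id", challenge_id,
        "--name", name,
        "--base-model", base_model,
    ]


# Demo with a throwaway file standing in for the generated out.json.
with tempfile.TemporaryDirectory() as tmp:
    result_file = Path(tmp) / "out.json"
    result_file.write_text("[]", encoding="utf-8")
    cmd = build_upload_command(
        result_file, "your_challenge_id", "your_agent_name", "your_agent_base_model"
    )
    print(cmd[2], cmd[5:])
```

As with the download, the resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root, with GRAPHQL_USERNAME and GRAPHQL_PASSWORD set.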
You can also submit through our platform. We will conduct an official check on your submission to prevent cheating.
Follow the instructions in this documentation to create your own challenging benchmark for web agents.
We welcome contributions to WebCanvas!
In the coming updates, we will provide detailed guidelines on how to contribute to our project, including our coding standards, the process for submitting pull requests, how to report issues, and more. Stay tuned for more information!
Thank you for your interest in improving WebCanvas. Your contributions are greatly appreciated and essential to the growth and success of our project.
We are building a vibrant and inclusive community around WebCanvas! Join our community to stay up-to-date with the latest developments and to contribute to the project:
We value your feedback and suggestions!
We will be providing a detailed guide on how to give feedback in the upcoming documentation. This will include information on how to submit feedback, the types of feedback we are looking for, and how we plan to address and incorporate your suggestions. Stay tuned for more updates!
- Talk to Founder, we welcome any discussion and feedback on the future of live agent evaluation!
If you use this project in your research, please cite our paper:
@article{pan2024webcanvas,
title={WebCanvas: Benchmarking Web Agents in Online Environments},
author={Pan, Yichen and Kong, Dehan and Zhou, Sida and Cui, Cheng and Leng, Yifei and Jiang, Bing and Liu, Hangyu and Shang, Yanyi and Zhou, Shuyan and Wu, Tongshuang and others},
journal={arXiv preprint arXiv:2406.12373},
year={2024}
}
Footnotes
1. Zheng, Boyuan, et al. "GPT-4V(ision) is a generalist web agent, if grounded." arXiv preprint arXiv:2401.01614 (2024).
2. Deng, Xiang, et al. "Mind2Web: Towards a generalist agent for the web." Advances in Neural Information Processing Systems 36 (2024).
3. Zhou, Shuyan, et al. "WebArena: A realistic web environment for building autonomous agents." arXiv preprint arXiv:2307.13854 (2023).
4. Mialon, Grégoire, et al. "GAIA: A benchmark for general AI assistants." arXiv preprint arXiv:2311.12983 (2023).
5. Drouin, Alexandre, et al. "WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks?" arXiv preprint arXiv:2403.07718 (2024).