WARC-Bench: Web Archive Based Benchmark for GUI Subtask Executions
We present Orby Web Agent, a framework for developing and evaluating vision-based web agents, built on WARC (Web ARChive) file servers and BrowserGym environments. The system serves archived web pages as live websites, so agents can be benchmarked on realistic pages in a reproducible, controlled environment.
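The idea of serving archived pages as live websites can be sketched with Python's standard library alone. The example below is a minimal illustration, not the framework's WARC server: the `ARCHIVE` dictionary stands in for records that would normally be parsed from a WARC file, and the names `ArchivedResponse`, `replay`, `ReplayHandler`, and `make_server` are hypothetical.

```python
from dataclasses import dataclass
from http.server import BaseHTTPRequestHandler, HTTPServer


@dataclass(frozen=True)
class ArchivedResponse:
    """One captured HTTP response (hypothetical stand-in for a WARC record)."""
    status: int
    content_type: str
    body: bytes


# Assumption: a real setup would load these entries from a WARC file.
ARCHIVE = {
    "/tickets": ArchivedResponse(200, "text/html", b"<h1>Open tickets</h1>"),
}

NOT_ARCHIVED = ArchivedResponse(404, "text/plain", b"not in archive")


def replay(archive, path):
    """Look up the captured response for a path; unknown paths get a 404."""
    return archive.get(path, NOT_ARCHIVED)


class ReplayHandler(BaseHTTPRequestHandler):
    """Answers GET requests from the archive instead of the live web."""

    def do_GET(self):
        response = replay(ARCHIVE, self.path)
        self.send_response(response.status)
        self.send_header("Content-Type", response.content_type)
        self.send_header("Content-Length", str(len(response.body)))
        self.end_headers()
        self.wfile.write(response.body)


def make_server(port=0):
    """Build a local replay server; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), ReplayHandler)
```

Calling `make_server(8000).serve_forever()` would expose the archived page at `http://127.0.0.1:8000/tickets`. Because every response comes from the fixed archive, repeated runs see identical pages, which is the property that makes the benchmark reproducible.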
The framework features SvaV4, a pure-vision agent optimized for short-horizon tasks: a single model call both evaluates whether the task is complete and generates the action. The framework supports web automation scenarios in real-world environments (ZenDesk, GitHub) as well as synthetic test cases, so agent capabilities can be evaluated across different interaction patterns.
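Folding the completion check and the action into one model call can be pictured as a single reply that carries both fields, which the agent loop then parses. The sketch below assumes a JSON reply with `task_complete` and `action` keys; that reply format, the `Decision` type, and `parse_decision` are hypothetical and are not taken from SvaV4.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Decision:
    """Result of one model call: a completion verdict plus an optional action."""
    task_complete: bool
    action: Optional[str]


def parse_decision(reply: str) -> Decision:
    """Parse one model reply (assumed JSON) into a verdict and an action."""
    data = json.loads(reply)
    done = data["task_complete"] is True
    # A finished task needs no further action, so any action is dropped.
    action = None if done else data.get("action")
    if not done and not action:
        raise ValueError("an unfinished task needs an action")
    return Decision(task_complete=done, action=action)
```

With this shape, the agent loop stops when `task_complete` is true and otherwise executes `action`, so each step costs one model call rather than separate calls for evaluation and for acting.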
This work gives researchers and practitioners a unified toolkit for developing, testing, and benchmarking web agents in reproducible environments, with a focus on automated web interaction and multi-step task completion.
Serve archived web pages as live websites for reproducible benchmarking with controlled, deterministic environments.
The SvaV4 agent uses pure vision for web interaction, with efficient single-call execution for short-horizon tasks.
Seamless integration with BrowserGym for standardized action spaces including click, type, scroll, and more.
Built-in evaluation framework with trajectory recording, visualization, and task completion metrics.
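As a rough picture of trajectory recording and task completion metrics, the sketch below stores each step an agent takes and computes a completion rate over a set of runs. The `Step` and `Trajectory` types and the two metric functions are hypothetical names chosen for illustration; the framework's own data model may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One agent step: the action taken and the screenshot it was based on."""
    action: str
    screenshot_path: str


@dataclass
class Trajectory:
    """All steps recorded for one task, plus the final outcome."""
    task_id: str
    steps: List[Step] = field(default_factory=list)
    completed: bool = False

    def record(self, action: str, screenshot_path: str) -> None:
        self.steps.append(Step(action, screenshot_path))


def completion_rate(trajectories: List[Trajectory]) -> float:
    """Fraction of trajectories whose task was completed (0.0 if empty)."""
    if not trajectories:
        return 0.0
    return sum(t.completed for t in trajectories) / len(trajectories)


def mean_steps(trajectories: List[Trajectory]) -> float:
    """Average number of recorded steps per trajectory (0.0 if empty)."""
    if not trajectories:
        return 0.0
    return sum(len(t.steps) for t in trajectories) / len(trajectories)
```

Keeping the screenshot path alongside each action is what would let a viewer replay a run step by step, while the aggregate functions summarize many runs in a single number.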
The names and data portrayed in these demonstrations are either synthetic or sourced from openly available real websites. They hold absolutely no connection to the authors, direct or indirect.
Agent navigating a real-world customer support interface from ZenDesk, demonstrating multi-step task completion.
Agent interacting with a public GitHub repository, showcasing navigation and information extraction capabilities.
Agent completing complex form interactions in a controlled synthetic environment designed for evaluation.
Agent performing multi-step navigation tasks in a synthetic environment with dynamic content.
@misc{srivastava2025warcbenchwebarchivebased,
title={WARC-Bench: Web Archive Based Benchmark for GUI Subtask Executions},
author={Sanjari Srivastava and Gang Li and Cheng Chang and Rishu Garg and Manpreet Kaur and Charlene Y. Lee and Yuezhang Li and Yining Mao and Ignacio Cases and Yanan Xie and Peng Qi},
year={2025},
eprint={2510.09872},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.09872},
}