Odysseys

Benchmarking Web Agents on Realistic Long Horizon Tasks

Lawrence Jang*, Jing Yu Koh*, Daniel Fried, Ruslan Salakhutdinov

Carnegie Mellon University · *Equal contribution

Existing web-agent benchmarks have largely converged on short, single-site tasks on which frontier models are approaching saturation. Real-world web use, however, consists of long-horizon, multi-site workflows: comparing products across retailers, planning travel across booking platforms, or synthesizing information from many search queries.

Odysseys is a benchmark of 120 long-horizon web tasks derived from real browsing sessions and evaluated on the live internet. Binary pass/fail evaluation is inadequate for long-horizon settings, so we annotate each task with an average of 5.8 graded rubric checkpoints. The strongest model we test achieves a perfect-task rate of 53%, leaving substantial room for future progress.

120 tasks · 699 rubric items · 53% best model · 21 domains

Example task

Each task is a first-person request that a user might realistically give a computer-use agent. Tasks are paired with rubric items that decompose progress into verifiable checkpoints.

Figure 1. An Odysseys task (with simplified task description), the agent's browsing trajectory, and the rubric-based evaluation that grades partial progress across the full workflow.
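To make the grading scheme concrete, below is a minimal Python sketch of rubric-based evaluation. The `RubricItem` data model and function names are illustrative assumptions, not the benchmark's published schema; the point is how partial credit (fraction of checkpoints passed) differs from the all-or-nothing perfect-task rate reported above.

```python
from dataclasses import dataclass

# Hypothetical data model (assumption, not the benchmark's actual schema).
# Each Odysseys task carries graded rubric checkpoints (avg. 5.8 per task).
@dataclass
class RubricItem:
    description: str
    passed: bool  # judged against the agent's browsing trajectory

def rubric_score(items: list[RubricItem]) -> float:
    """Fraction of rubric checkpoints satisfied: credits partial progress."""
    return sum(item.passed for item in items) / len(items)

def perfect_task(items: list[RubricItem]) -> bool:
    """A task counts toward the perfect-task rate only if every checkpoint passes."""
    return all(item.passed for item in items)

# Illustrative example: a travel-planning task graded on three checkpoints.
items = [
    RubricItem("Searched flights on at least two booking sites", True),
    RubricItem("Filtered results to the requested dates", True),
    RubricItem("Reported the cheapest nonstop option", False),
]
print(rubric_score(items))   # 0.666... -> partial progress is still visible
print(perfect_task(items))   # False   -> excluded from the perfect-task rate
```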

Headline result

Performance scales with step budget but plateaus well short of full completion. All models show a broadly sigmoidal curve; frontier API models climb more steeply and plateau higher, but none approaches the ceiling.

Figure 2. Perfect rubric rate as a function of step budget, for each evaluated model. Rates stay near zero for the first ~15 steps, rise through the 20–70 range, and taper past ~80 as models approach their practical ceiling.
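As a rough illustration of how a curve like Figure 2 can be aggregated, the sketch below computes the perfect rubric rate at a given step budget. It assumes each trajectory records the first step at which the full rubric was satisfied (`None` if never); this bookkeeping is our assumption for the example, not a published detail of the evaluation.

```python
from typing import Optional

def perfect_rate_at_budget(
    completion_steps: list[Optional[int]], budget: int
) -> float:
    """Share of tasks whose full rubric was satisfied within `budget` steps.

    `completion_steps[i]` is the first step at which task i's rubric was
    fully passed, or None if the agent never perfectly completed it.
    (Illustrative bookkeeping, assumed for this sketch.)
    """
    done = sum(1 for s in completion_steps if s is not None and s <= budget)
    return done / len(completion_steps)

# Toy example: 5 tasks; two are never perfectly completed.
steps = [23, 41, None, 67, None]
for budget in (15, 40, 80):
    print(budget, perfect_rate_at_budget(steps, budget))
# Rates stay near zero at small budgets and plateau once every solvable
# task has completed, matching the sigmoidal shape described above.
```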

Contents