Odysseys
Benchmarking Web Agents on Realistic Long Horizon Tasks
Carnegie Mellon University · *Equal contribution
Existing web-agent benchmarks have largely converged on short, single-site tasks on which frontier models are approaching saturation. Real-world web use, however, consists of long-horizon, multi-site workflows: comparing products across retailers, planning travel across booking platforms, or synthesizing information from many search queries.
Odysseys is a benchmark of 120 long-horizon web tasks derived from real browsing sessions and evaluated on the live internet. Binary pass/fail evaluation is inadequate for long-horizon settings, so we annotate each task with an average of 5.8 graded rubric checkpoints. The strongest model we test achieves a perfect-task rate of 53%, leaving substantial room for future progress.
Example task
Each task is a first-person request that a user might realistically give a computer-use agent. Tasks are paired with rubric items that decompose progress into verifiable checkpoints.
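A rubric-based score of this kind can be sketched as a weighted average over checkpoint credits. The structure below (`RubricItem`, `task_score`, `is_perfect`) is a hypothetical illustration, not the benchmark's actual implementation; the paper reports only that each task averages 5.8 graded checkpoints and that "perfect" tasks satisfy every one.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    # Hypothetical structure: each checkpoint carries a description,
    # a weight, and a graded credit in [0, 1] assigned by a judge.
    description: str
    weight: float = 1.0
    credit: float = 0.0  # 0 = unmet, 1 = fully met, fractions allowed

def task_score(items: list[RubricItem]) -> float:
    """Weighted average of checkpoint credits for one task."""
    total = sum(i.weight for i in items)
    return sum(i.weight * i.credit for i in items) / total if total else 0.0

def is_perfect(items: list[RubricItem]) -> bool:
    """A task counts as fully solved only if every checkpoint is fully met."""
    return bool(items) and all(i.credit == 1.0 for i in items)
```

Graded credits let partial progress (e.g. finding a flight but not comparing prices) register in the score, whereas a binary pass/fail metric would collapse it to zero.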
Headline result
Performance scales with step budget but plateaus well short of full completion. All models show a broadly sigmoidal curve; frontier API models climb steeper and higher, but none approach the ceiling.
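The scaling behavior described above can be modeled with a logistic curve whose ceiling sits below 1.0. The function and parameter values here are illustrative assumptions for visualizing the trend, not fitted values from the benchmark.

```python
import math

def modeled_score(steps: int, ceiling: float, k: float, midpoint: float) -> float:
    # Hypothetical logistic model of rubric score vs. step budget:
    # score rises with more steps, climbs fastest near `midpoint`,
    # and saturates at `ceiling` (< 1.0, i.e. short of full completion).
    return ceiling / (1.0 + math.exp(-k * (steps - midpoint)))

# Illustrative parameterizations: a frontier model climbs steeper (larger k)
# and higher (larger ceiling) than a weaker model, but neither reaches 1.0.
frontier = lambda s: modeled_score(s, ceiling=0.75, k=0.08, midpoint=30)
weaker = lambda s: modeled_score(s, ceiling=0.45, k=0.04, midpoint=50)
```

Under this model, extra step budget yields diminishing returns once the curve nears its ceiling, matching the plateau seen in the headline result.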