Odysseys

Benchmarking Web Agents on Realistic Long Horizon Tasks

Lawrence Jang*, Jing Yu Koh*, Daniel Fried, Ruslan Salakhutdinov

Carnegie Mellon University · *Equal contribution

Abstract

Most web agent benchmarks test short, single-site tasks — the kind frontier models are already close to saturating. Real browsing rarely looks like that. It is long and cross-site: comparing products across retailers, planning a trip across several booking platforms, or synthesising answers from many searches.

We introduce Odysseys: 120 long-horizon web tasks derived from real browsing sessions and run on the live internet. Binary pass/fail does not work well at this length, so every task comes with a set of graded rubric checkpoints — 5.8 per task on average.

Rubric scoring agrees with human judges more often than the usual trajectory-level LLM judge. On our leaderboard, the strongest model we tested reaches 53% perfect task success, leaving substantial headroom.

  • 120 long-horizon tasks
  • 699 rubric checkpoints
  • 53% best-model perfect rate
  • 21 top-level domains

1 · Introduction

Large language models have started to function as computer-use agents. They browse websites, read screenshots, click interface elements, and carry out multi-step instructions in real software.

Most current benchmarks test these abilities in short, tightly scoped episodes. That leaves a key regime underexplored: web workflows that stretch across many pages, tabs, and domains.

Real browsing isn't confined to one site. People compare products across retailers, plan trips across booking platforms, and synthesise information from search results into a deliverable. Solving those tasks takes more than clicking the right buttons. Agents have to hold context over long horizons, reason across heterogeneous sites, break open-ended goals into sub-goals, and decide when to stop exploring.

That's what Odysseys targets: a benchmark of 120 long-horizon web tasks derived from real user browsing, evaluated on the live internet. Each task unfolds across multiple sites, and tasks are grounded in annotated human browsing journeys rather than synthetic templates.

Along the way we noticed a related problem. Trajectory-level LLM-as-a-judge scoring, common in computer-use benchmarks, gets noisier as trajectories grow longer. We replace it with rubric-based evaluation: each task is decomposed into a set of verifiable checkpoints.

When we run leading frontier and open-weight agents on Odysseys, the best model we tested (Opus 4.6) reaches 53% perfect task success. Current agents clearly struggle with sustained planning, cross-site coordination, and deciding when to stop gathering information and start producing the deliverable the user asked for.

Figure. Interactive demo: rubric satisfaction over agent steps for an example easy task (Opus 4.6), shown alongside the full task prompt given to the agent.

2 · Dataset

Odysseys is 120 long-horizon, multi-site web tasks designed to measure web agents on realistic browsing workflows. Every task starts from a Google search page and requires the agent to navigate across several websites — comparing products across e-commerce sites, planning travel itineraries, setting up a playlist and watching lectures, and so on.

Collection process

We recruited 248 participants through Prolific. Each one ran a desktop app that reads their Chrome history and lets them annotate their own browsing into web agent tasks. Chrome's Journey algorithm first segments activity into coherent clusters.

For each journey, participants recorded four things:

  • a key URL that represents the success state,
  • an automation preference — whether they'd actually want this task automated,
  • a task label: a natural-language description, written the way they'd prompt an AI tool,
  • a feasibility judgment: is this task actually doable?
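The four annotation fields above can be sketched as a simple record; the field names here are ours, not the paper's actual schema, and the example values are invented:

```python
from dataclasses import dataclass

@dataclass
class JourneyLabel:
    """One annotated browsing journey (illustrative field names)."""
    key_url: str      # URL representing the success state
    automate: bool    # would the participant want this automated?
    task_label: str   # natural-language description, phrased as a prompt
    feasible: bool    # is the task actually doable?

label = JourneyLabel(
    key_url="https://example.com/checkout",
    automate=True,
    task_label="Find the cheapest 55-inch TV with free shipping.",
    feasible=True,
)
```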

The result: 2,380 labelled journeys, spanning comparison shopping, travel, media, research, and more.

Annotation interface
Figure. The annotation interface used by participants to label their Chrome browsing journeys. The interface guides participants through four steps: selecting the key URL, indicating an automation preference, writing a descriptive task label, and judging feasibility.

From journeys to Odysseys

Raw journey labels are noisy. Participants sometimes wrote labels that didn't match the URLs, or described the task too vaguely. We cleaned these up in two stages: an LLM screening pass, then a manual review by the authors. After filtering for label accuracy, feasibility, login requirements, and overall quality, 696 journeys (29.2%) remained.

These surviving journeys are real but short — a quirk of Chrome's segmentation. To get long-horizon tasks, we cluster related journeys by embedding similarity and use an LLM to stitch compatible subsets into coherent multi-step workflows.
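A minimal sketch of the clustering step, assuming precomputed journey embeddings; the actual embedding model, similarity threshold, and linkage rule are not specified in the text, so this greedy single-link grouping is only one plausible reading:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster_journeys(embeddings, threshold=0.8):
    """Greedily group journeys whose embeddings are similar enough.
    (Illustrative; the paper does not specify the exact algorithm.)"""
    clusters = []  # each cluster is a list of journey indices
    for i, emb in enumerate(embeddings):
        for cluster in clusters:
            if any(cosine(emb, embeddings[j]) >= threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

The resulting clusters are what the LLM then stitches into multi-step workflows.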

For each composed task, the LLM generates a natural-language prompt, a step plan, a rubric with verification procedures, and a coherence score. Every source journey is used at most 3 times, so no single journey dominates.

Composed tasks are then filtered: we drop any that span fewer than 2 sites, have incomplete rubrics, or score below 2/5 on coherence. Difficulty is decided by step count and domain spread:

  • easy — ≤ 5 steps, ≤ 3 domains
  • medium — 6–8 steps, or 4+ domains
  • hard — exceeds both thresholds (> 8 steps and ≥ 4 domains)
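Under one reading of these tiering rules (hard when both thresholds are exceeded, medium as the residual), the assignment can be written as:

```python
def difficulty(steps: int, domains: int) -> str:
    """Assign a difficulty tier from step count and domain spread,
    following the thresholds described in the text."""
    if steps > 8 and domains >= 4:
        return "hard"
    if steps <= 5 and domains <= 3:
        return "easy"
    return "medium"  # 6-8 steps, or 4+ domains
```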

On top of the 90 composed tasks, the authors hand-wrote 30 more from personal queries, bringing the total to 120 tasks.

Dataset statistics

  • 120 tasks across three tiers — 45 easy, 46 medium, 29 hard.
  • 265.8 words per instruction on average (median 264.5, range 76–387).
  • 699 rubric items, averaging 5.8 per task (median 6, range 3–12).
  • 21 top-level domains and 71 fine-grained SimilarWeb categories.
  • Largest domains: Travel & Tourism (34), Science & Education (33), E-commerce (25), Tech (25).

3 · Rubric evaluation

Most computer-use benchmarks use one of two evaluation styles. Execution-based verification relies on handcrafted, task-specific rewards. LLM-as-a-judge over trajectories feeds the screenshots and actions of a run to an LLM and asks it for a verdict.

Neither fits Odysseys well. Writing rules-based rewards for every task is infeasible, and LLM judges lose reliability as trajectories get longer and more complex.

So we generate a set of rubrics for each task instead. Rubrics are produced alongside the task during LLM composition. Each rubric item consists of a requirement — a single verifiable checkpoint — and a verification description telling a grader how to check it. Tasks contain 3–12 rubric items, averaging 5.8 per task.

Every item is reviewed by an author. We also run an LLM augmented with text-based web search to cross-check verification criteria against the live web.

QA interface
Figure. The Odysseys QA interface for manual review of chained tasks. Reviewers can expand each task to inspect the full prompt and individual rubric items with their requirements and verification criteria.

SPL (Success weighted by Path Length)

Raw rubric scores measure what an agent accomplished but not how efficiently it did so. Two agents may both complete a task, yet one may take 30 steps while the other takes 100. To capture this, we report SPL, adapted from the navigation metric of Anderson et al. (2018):

SPL = (1/N) · Σᵢ (sᵢ / nᵢ)

where sᵢ is the rubric score on task i (either averaged or perfect), nᵢ is the number of agent steps, and N is the number of tasks. Unlike Vision-Language Navigation, we do not know the oracle number of steps required to complete a task, so we do not normalize by it. Higher SPL indicates stronger outcomes in fewer steps, penalizing inefficient trajectories.
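The metric is a one-liner in practice; this sketch follows the definition above (per-task score divided by steps, averaged over tasks, with no oracle-length normalization):

```python
def spl(scores, steps):
    """Success weighted by Path Length: mean over tasks of s_i / n_i.

    scores: per-task rubric scores (averaged or perfect, in [0, 1])
    steps:  per-task agent step counts
    """
    assert len(scores) == len(steps) and len(scores) > 0
    return sum(s / n for s, n in zip(scores, steps)) / len(scores)

# Two agents with the same scores but different step counts:
# the slower one is penalized.
value = spl([1.0, 0.5], [10, 50])  # (0.1 + 0.01) / 2 = 0.055
```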

How well do rubrics agree with humans?

We measured agreement between automated LLM judges and human annotations at three granularities:

  • Rubric average. Treat each of the 699 task–rubric pairs as an independent observation.
  • Perfect. A task counts as passing only if every one of its rubrics is satisfied.
  • Trajectory judge. The holistic Online-Mind2Web LLM judge, which issues one pass/fail per trajectory.
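The first two granularities can be computed directly from a per-task matrix of rubric outcomes; a sketch, with variable names of our own choosing:

```python
def rubric_average(results):
    """results: one list of booleans per task (one entry per rubric item).
    Treats every task-rubric pair as an independent observation."""
    flat = [ok for task in results for ok in task]
    return sum(flat) / len(flat)

def perfect_rate(results):
    """A task counts as passing only if every rubric item is satisfied."""
    return sum(all(task) for task in results) / len(results)
```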

We collected human annotations of all 120 Opus 4.6 trajectories, with annotators marking each rubric as satisfied or not.

Rubric-based evaluation substantially beats the trajectory judge at every level:

From the paper

Our rubric-based evaluation — decomposing each task into fine-grained, state-grounded checkpoints — yields more reliable automated judgments than holistic trajectory-based assessment, especially for the long-horizon tasks in Odysseys. Rubrics also give partial credit to weaker models, which might otherwise only receive zero scores under the trajectory judge.

4 · Results

We benchmark several frontier API models and open-weight models on Odysseys. Each model is implemented with the recommended settings for computer-use, and we launch them in the same virtual Ubuntu environments from OSWorld.

The primary application the models use is Google Chrome. Some occasionally reach for others — for instance LibreOffice to generate a report.

All experiments run under the same settings: at most 100 steps, with the maximum reasoning effort each model supports (if any). At 100 steps, Opus 4.6 takes about 30 minutes and $2.50 per task. We use the OSWorld runner, and each model starts with a Google Chrome window at a provided starting URL (if any).

Table 2 · Model performance

The best frontier computer-use models, Opus 4.6 and GPT-5.4, lead across most metrics, with Opus 4.6 slightly ahead on rubric quality. GPT-5.4 (and GPT-5.4 Mini) post higher SPLavg scores, however, meaning they are more efficient per step. Open-weight models trail far behind; Qwen-3.5-30B-A3B (the largest we were able to run locally) is the strongest open-weight model we tested.

4.1 · Scaling with step budget

Perfect rubric rate vs step budget

All models follow a broadly sigmoidal scaling curve.

Perfect scores stay near zero for the first ~15 steps. That's the minimum number of interactions needed to complete even the simplest Odysseys tasks. Scores then climb steadily through the 20–70 step range as progressively harder tasks get fully solved. Past ~80 steps, improvements taper off.

In other words, most achievable tasks sit inside a moderate difficulty band, and each model approaches a practical ceiling where extra steps don't help much.

Claude Opus 4.6 and GPT-5.4 plateau highest, at ~53% and ~48% perfect respectively, and climb more steeply through the middle, reflecting both more capability and better per-step efficiency. Claude Sonnet 4.6 is similar in shape but plateaus at ~44%. GPT-5.4 Mini (~27%) and Qwen 3.5 VL (~13%) climb much more slowly and level off much lower; their shortfall looks like a capability gap, not a step-budget problem.

4.2 · Difficulty levels

Performance by difficulty

Performance drops sharply at the hard tier: Opus 4.6 averages a rubric score of 57.7, substantially outperforming GPT-5.4 at 43.0.

Opus 4.6 and Sonnet 4.6 also take more steps at every difficulty level, particularly on hard tasks — they explore longer before giving up or returning a result.

4.3 · Failure modes

All the frontier models we tested have distinct failure patterns on long-horizon tasks.

Opus 4.6 most commonly fails by over-investing in research. It keeps gathering information but never transitions to producing the deliverable — the final document, the spreadsheet — and eventually exhausts the step budget with an empty artifact. This pattern shows up in 6 of its 12 zero-score runs. Opus stays productive but incomplete until termination, hitting the 100-step cap on 39% of tasks; its partial-credit runs average 97.4 steps.

GPT-5.4, by contrast, more often fails through inaction despite correct high-level reasoning. On 4 of its 7 zero-score tasks, it generates long, detailed plans that correctly identify which pages to visit and what information to collect, then terminates with an empty action after only a few steps — sometimes without any browser interaction at all.

Opus 4.6 · research over-investment

Produces but never delivers

Keeps gathering information but never transitions to producing the required deliverable. Exhausts the step budget with an empty output artifact.

6 / 12 zero-score runs · 39% hit 100-step cap · partial-credit runs average 97.4 steps.

GPT-5.4 · inaction after planning

Plans correctly, then stops

Generates long, detailed plans that correctly identify which pages to visit, then terminates with an empty action after only a few steps — sometimes without any browser interaction at all.

4 / 7 zero-score tasks.

Both models share a common failure mode on especially broad tasks with many parallel subtasks. Tasks that require visiting 10 to 30 venues — like planning a trip to every MLB stadium — often leave both models stuck in the first phase, collecting schedules, without progressing to flights, hotels, or final compilation. All such tasks receive zero scores from both models.

The takeaway: current agents struggle not only with long sequential horizons, but also with high-fanout task structure, where effort has to be allocated across many related subtasks. This is one setting where models would likely benefit from sub-agents or a multi-agent setup.

4.4 · Surprising capabilities

Both GPT-5.4 and Opus 4.6 also exhibit some notably sophisticated behaviours. GPT-5.4 in particular develops a few strategies that repurpose browser and system primitives for information extraction.

GPT-5.4 · base64 clipboard paste for spreadsheets

We observed GPT-5.4 using an unprompted strategy for bulk spreadsheet entry. Rather than pasting cell contents directly, it encodes the full table as a base64 string inside its Python action, decodes it at runtime, and pastes the result through the clipboard so tab and newline delimiters are preserved. We see this in 8 runs.

Base64 paste trajectory
Figure. After researching 10 surgeon profiles, GPT-5.4 encoded the entire data table as a base64 string inside the Python action, then decoded it at runtime with pyperclip.copy(_text); pyautogui.hotkey('ctrl', 'v'). This preserves exact tab and newline characters and populates the spreadsheet in a single paste with the correct column mapping.
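The core of the trick, a base64 round-trip that keeps tab and newline delimiters intact, can be reproduced with the standard library. The table contents below are invented; the final clipboard paste goes through pyperclip and pyautogui as in the trajectory, which we leave as comments since they require a GUI session:

```python
import base64

# A small table with tab-separated columns and newline-separated rows.
table = "Name\tClinic\tRating\nDr. A\tMidtown\t4.8\nDr. B\tUptown\t4.6"

# Encode the whole table inside the action string...
encoded = base64.b64encode(table.encode("utf-8")).decode("ascii")

# ...and decode it at runtime; the delimiters survive exactly.
decoded = base64.b64decode(encoded).decode("utf-8")
assert decoded == table

# The agent then pastes via the clipboard, e.g.:
# pyperclip.copy(decoded)
# pyautogui.hotkey('ctrl', 'v')
```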

GPT-5.4 · view-source extraction

When a product page fails to render correctly due to JavaScript errors, the model navigates to the raw HTML through the view-source: protocol, searches within the source using ctrl+f, and extracts structured product metadata from embedded JSON-LD markup — variant identifiers, stock status, size mappings. We see this in 23 runs.

View source trajectory
Figure. When margauxny.com rendered blank due to JavaScript failures, GPT-5.4 typed view-source:https://margauxny.com/products/... directly in the address bar, then used ctrl+f to search for variants, InStock, 40.5, and US 9.5 in the raw HTML. From the embedded JSON-LD schema it decoded EU 40.0 = US 9.5, confirmed the SKU was in stock, and reconstructed the direct variant URL — all without the product page ever rendering visually.
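Extracting structured product data from raw HTML without ever rendering it can be sketched with the standard library alone. The HTML snippet and field values here are invented for illustration; real JSON-LD markup follows the same `application/ld+json` script-tag pattern:

```python
import json
import re

html = """
<script type="application/ld+json">
{"@type": "Product", "name": "Suede Pump",
 "offers": [{"sku": "EU-40", "availability": "https://schema.org/InStock"}]}
</script>
"""

# Grab every embedded JSON-LD block and parse the first one.
blocks = re.findall(
    r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL
)
product = json.loads(blocks[0])

# Filter offers down to in-stock SKUs, as the agent did for variants.
in_stock = [
    offer["sku"] for offer in product["offers"]
    if offer["availability"].endswith("InStock")
]
```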

Opus 4.6 · Wayback Machine fallback

Opus 4.6 has a different set of strengths. When target websites return 403 errors, it sometimes falls back to the Wayback Machine — constructing archived URLs and retrieving cached versions of otherwise inaccessible pages. That's how it recovers class schedule information that wouldn't be available through direct browsing.

Wayback trajectory
Figure. Both yoga studio websites returned 403 Forbidden. Opus immediately navigated to https://web.archive.org/web/2024/https://coilyoga.com/classes/ and repeated the same strategy for Tower Yoga, successfully retrieving archived pages from June 2024 that contained the full class schedule.
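The fallback amounts to rewriting a blocked URL into a Wayback Machine lookup using the archive's `web/<timestamp>/<url>` convention, where a partial timestamp like `2024` requests the closest snapshot:

```python
def wayback_url(url: str, timestamp: str = "2024") -> str:
    """Rewrite a blocked URL into a Wayback Machine lookup URL."""
    return f"https://web.archive.org/web/{timestamp}/{url}"

archived = wayback_url("https://coilyoga.com/classes/")
# → "https://web.archive.org/web/2024/https://coilyoga.com/classes/"
```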

Opus 4.6 · middle-click and ctrl-F as hypothesis test

Opus makes substantially heavier use of middle-click to open links in background tabs without losing its place — 131 times across all runs, compared to just 7 for GPT-5.4.

It also uses ctrl+f not just to locate information, but to falsify hypotheses about page relevance. After searching for a keyword and observing zero matches, it often treats the absence as evidence that the page is unproductive and moves on immediately.

Ctrl-F probe trajectory
Figure. After navigating to brave.com/linux, Opus used ctrl+f to search for Chromebook and observed 0/0 matches, confirming the page did not cover the topic. Rather than scrolling to verify, it treated the absence of a match as decisive evidence and pivoted immediately.

5 · Conclusion

We introduced Odysseys, a benchmark for evaluating web agents on realistic long-horizon tasks pulled from real browsing behaviour. Where most existing benchmarks stay within short, single-site episodes, Odysseys focuses on long workflows that cross multiple sites and demand sustained planning.

We also found that trajectory-level LLM judges are insufficient for long-horizon runs. Our rubric-based evaluation, which decomposes each task into verifiable checkpoints, agrees with human judgment substantially more often.

Even so, long-horizon web interaction is far from solved. The best frontier models only reach about 53% perfect-task success. Performance drops sharply on harder tasks, and gains flatten as step budgets grow.

Those limitations point to deeper challenges: long-context planning, staying coherent across sites, and executing extended workflows without drifting. Improving agents here — for example through reinforcement learning or inference-time search — is a promising direction for future work.