Leaderboard

Seven models were evaluated on all 120 Odysseys tasks under identical settings: a 100-step budget, maximum reasoning effort, and a Google Chrome window in an OSWorld Ubuntu VM.

Model | Type | O-M2W Judge | Rubric Avg | Perfect | SPL (avg) | SPL (perf)

Notes. Rubric Avg treats each of the 699 task–rubric pairs as an independent observation and averages them. Perfect marks a task as passing only if every rubric is satisfied. O-M2W Judge is the trajectory-level holistic LLM judge from Online-Mind2Web. SPL (Success weighted by Path Length) is (1/N) · Σᵢ sᵢ / nᵢ, where N is the number of tasks, sᵢ is either the averaged or perfect rubric score on task i, and nᵢ is the number of agent steps taken on that task; a higher SPL means strong outcomes were achieved in fewer steps. All scores are reproduced from Table 2 of the paper.
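The metric definitions in the notes can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own scoring code: the `TaskResult` record and the function names are hypothetical, and the "averaged" sᵢ is assumed to be the fraction of a task's rubric items that passed.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record; not the benchmark's real schema."""
    rubric_passed: list[bool]  # one flag per rubric item (assumed non-empty)
    steps: int                 # agent steps taken on this task (n_i, assumed > 0)


def rubric_avg(results: list[TaskResult]) -> float:
    # Pool every task-rubric pair and average over the pooled flags.
    flags = [flag for r in results for flag in r.rubric_passed]
    return sum(flags) / len(flags)


def perfect_rate(results: list[TaskResult]) -> float:
    # A task passes only if every one of its rubric items is satisfied.
    return sum(all(r.rubric_passed) for r in results) / len(results)


def spl(results: list[TaskResult], perfect: bool = False) -> float:
    # (1/N) * sum_i s_i / n_i, with s_i either the per-task rubric
    # fraction (assumption) or the 0/1 perfect score.
    total = 0.0
    for r in results:
        if perfect:
            s = 1.0 if all(r.rubric_passed) else 0.0
        else:
            s = sum(r.rubric_passed) / len(r.rubric_passed)
        total += s / r.steps
    return total / len(results)


# Made-up example data, for illustration only.
example = [
    TaskResult(rubric_passed=[True, True], steps=10),
    TaskResult(rubric_passed=[True, False, False, False], steps=20),
]
print(rubric_avg(example), perfect_rate(example))
print(spl(example), spl(example, perfect=True))
```

Note that the pooled Rubric Avg weights tasks with more rubric items more heavily, whereas the per-task fraction used inside SPL in this sketch weights every task equally.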

Breakdown by difficulty

Tasks come in three tiers: easy (≤5 steps, ≤3 domains), medium (6–8 steps or 4+ domains), and hard (exceeding both). Each bar shows the perfect rubric rate — the share of tasks a model solves with every rubric item satisfied. Rubric average and average steps taken sit just underneath.
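As a rough sketch of how the per-tier bars could be computed, the snippet below groups tasks by tier and takes the share solved perfectly. The function name and the `(tier, rubric_flags)` input shape are assumptions made for illustration; tier labels are taken as given rather than derived from step or domain counts.

```python
from collections import defaultdict


def perfect_rate_by_tier(tasks):
    """tasks: iterable of (tier, rubric_flags) pairs (hypothetical shape)."""
    solved = defaultdict(int)
    total = defaultdict(int)
    for tier, flags in tasks:
        total[tier] += 1
        # A task counts as solved only if every rubric item is satisfied.
        solved[tier] += all(flags)
    return {tier: solved[tier] / total[tier] for tier in total}


# Made-up example data, for illustration only.
sample = [
    ("easy", [True, True]),
    ("easy", [True, False]),
    ("hard", [False, False, True]),
]
print(perfect_rate_by_tier(sample))
```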