Leaderboard

Nine entries on all 200 Odysseys tasks. We evaluated eight CUA models under identical settings: a 100-step budget, maximum reasoning effort, and Google Chrome in an OSWorld Ubuntu VM.

Model	Type	Agent	O-M2W Judge	Rubric Avg	Perfect	Avg. Steps	Traj. Eff.

Notes. Rubric Avg treats each task-rubric pair as an independent observation and averages over all pairs. Perfect marks a task as passing only if every rubric is satisfied. O-M2W Judge is the trajectory-level holistic LLM judge from Online-Mind2Web. Avg. Steps is the mean number of interaction steps the agent takes per task, where lower is more efficient. Traj. Eff. (Trajectory Efficiency) is (1/N) · Σᵢ (sᵢ / nᵢ), summed over the N tasks, where sᵢ is the averaged rubric score on task i and nᵢ is the number of agent steps on that task. Higher values mean stronger outcomes achieved in fewer steps. Agent distinguishes CUA (computer-use agents that consume screenshots and emit GUI actions) from Terminal agents that drive the browser by writing code (e.g. Playwright). Cells marked — are metrics not reported by the source. CUA scores are reproduced from Table 2 of the paper; WebWright is reproduced from the WebWright article.
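The metric definitions in the notes can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's scoring code: the `TaskResult` class and its field names are hypothetical, and it assumes each rubric item is scored in [0, 1], with 1.0 meaning satisfied.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical container: one score per rubric item (1.0 = satisfied).
    rubric_scores: list[float]
    # Number of interaction steps the agent took on this task.
    steps: int


def rubric_avg(results: list[TaskResult]) -> float:
    """Pool every task-rubric pair and average them as independent observations."""
    pairs = [score for r in results for score in r.rubric_scores]
    return sum(pairs) / len(pairs)


def perfect_rate(results: list[TaskResult]) -> float:
    """Share of tasks on which every rubric item is satisfied."""
    perfect = sum(all(s == 1.0 for s in r.rubric_scores) for r in results)
    return perfect / len(results)


def avg_steps(results: list[TaskResult]) -> float:
    """Mean number of agent steps per task."""
    return sum(r.steps for r in results) / len(results)


def trajectory_efficiency(results: list[TaskResult]) -> float:
    """(1/N) * sum_i(s_i / n_i), with s_i the mean rubric score on task i."""
    total = 0.0
    for r in results:
        s_i = sum(r.rubric_scores) / len(r.rubric_scores)
        total += s_i / r.steps
    return total / len(results)


# Made-up data for illustration only; not leaderboard results.
example = [
    TaskResult(rubric_scores=[1.0, 1.0], steps=10),
    TaskResult(rubric_scores=[1.0, 0.0, 0.0, 0.0], steps=20),
]
print(rubric_avg(example), perfect_rate(example), avg_steps(example))
print(trajectory_efficiency(example))
```

On this toy input the pooled Rubric Avg is 3/6 = 0.5, whereas a per-task mean would be (1.0 + 0.25) / 2 = 0.625. Pooling gives tasks with more rubric items more weight, which is what treating each pair as an independent observation implies.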

Breakdown by difficulty

Tasks come in three tiers. Easy tasks use at most 5 steps and at most 3 domains, medium tasks use 6 to 8 steps or 4 or more domains, and hard tasks exceed both thresholds. Each bar shows the perfect rubric rate, the share of tasks a model solves with every rubric item satisfied.

Steps vs perfect score

Each model's perfect rubric rate is plotted against its average number of steps per task, with the Pareto frontier overlaid. Opus 4.6 sits at the capability end of the frontier, with GPT-5.4 and GPT-5.4 Mini trading off step budget for perfect rate. Each additional step of compute buys progressively smaller gains.

Figure: Scatter of perfect rubric rate vs. average steps per task, with the Pareto frontier. Sonnet 4.6 is Pareto-dominated by both Opus 4.6 and GPT-5.4, which achieve higher perfect rates with fewer steps. Open-weight models sit well inside the frontier, spending more steps for substantially lower perfect rates, pointing to capability ceilings rather than step-budget shortfalls.
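For readers who want to recompute the frontier from their own runs, the sketch below shows one way to find the non-dominated models. It assumes the usual definition: a model is dominated if another model uses no more steps and reaches at least as high a perfect rate, with at least one of the two strictly better. The function name and the placeholder model names and values are hypothetical, not leaderboard data.

```python
def pareto_frontier(points: dict[str, tuple[float, float]]) -> list[str]:
    """Return the names of non-dominated models, ordered by average steps.

    `points` maps a model name to (avg_steps, perfect_rate).
    Fewer steps and a higher perfect rate are both better.
    """
    frontier = []
    for name, (steps, rate) in points.items():
        dominated = any(
            o_steps <= steps
            and o_rate >= rate
            and (o_steps < steps or o_rate > rate)
            for other, (o_steps, o_rate) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: points[n][0])


# Placeholder values for illustration only.
models = {
    "model_a": (10.0, 0.20),
    "model_b": (20.0, 0.50),
    "model_c": (30.0, 0.40),  # dominated by model_b: more steps, lower rate
    "model_d": (40.0, 0.60),
}
print(pareto_frontier(models))  # ['model_a', 'model_b', 'model_d']
```

The pairwise check is quadratic in the number of models, which is fine for a leaderboard with a handful of entries.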