TAAFT

OSWorld

369 real desktop tasks across Ubuntu, Windows and macOS apps. Agents act through screenshots + mouse/keyboard.

Agentic · Multimodal · Accuracy · Max 100.0% · Released Apr 2024

Results: 7
Models scored: 6
Top: GPT 5.4 (75.0%)
Median: 66.3%

Best results

Top primary scores; one row per model.

Frontier over time

Each dot is one model result; the line traces the running best score.
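
As a rough illustration of what "running best" means here, the sketch below recomputes it from the rows flagged Primary in the results table. It assumes the chart is built from primary results only (the axis range of Sep 2025 to Mar 2026 is consistent with that, but the page does not say so), and the function name `running_best` is hypothetical.

```python
from datetime import date
from itertools import accumulate

# Primary results copied from the "All results" table: (eval date, model, score %).
# Assumption: the chart plots primary results only.
results = [
    (date(2025, 9, 29), "Claude Sonnet 4.5", 61.4),
    (date(2025, 10, 15), "Claude Haiku 4.5", 50.7),
    (date(2025, 11, 24), "Claude Opus 4.5", 66.3),
    (date(2026, 2, 17), "Claude Sonnet 4.6", 72.5),
    (date(2026, 3, 5), "GPT 5.3 Codex", 74.0),
    (date(2026, 3, 5), "GPT 5.4", 75.0),
]


def running_best(rows):
    """Return (date, best score so far) pairs in chronological order."""
    # Sort by date, then by score so same-day results are taken low to high.
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    # accumulate(..., max) keeps the highest score seen up to each point.
    best = list(accumulate((r[2] for r in ordered), max))
    return [(d, b) for (d, _, _), b in zip(ordered, best)]


frontier = running_best(results)
for day, best in frontier:
    print(day.isoformat(), best)
```

A lower score that arrives later (Claude Haiku 4.5 at 50.7% after Claude Sonnet 4.5 at 61.4%) adds a dot but leaves the line flat.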
[Chart: best score over time. Y-axis: score, 0.0 to 100.0. X-axis: Sep 2025 to Mar 2026.]

All results

Showing all configurations, including non-primary alternates.
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.8 | 83.4% | 0-shot · CoT · agentic | 28 May 2026 | Self-reported | |
| 2 | Gemini 3.5 Flash | 78.4% | 0-shot · CoT · agentic | 19 May 2026 | Self-reported | |
| 3 | GPT 5.4 | 75.0% | | 05 Mar 2026 | Self-reported | Primary |
| 4 | GPT 5.3 Codex | 74.0% | | 05 Mar 2026 | Self-reported | Primary |
| 5 | Claude Sonnet 4.6 | 72.5% | | 17 Feb 2026 | Self-reported | Primary |
| 6 | MiniMax M3 | 70.1% | 0-shot · CoT · agentic | 01 Jun 2026 | Self-reported | |
| 7 | Claude Opus 4.5 | 66.3% | | 24 Nov 2025 | Self-reported | Primary |
| 8 | Claude Sonnet 4.5 | 61.4% | | 29 Sep 2025 | Self-reported | Primary |
| 9 | Claude Haiku 4.5 | 50.7% | | 15 Oct 2025 | Self-reported | Primary |
| 10 | Claude Haiku 4.5 | 50.7% | | 15 Oct 2025 | Self-reported | Primary |
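
The summary figures near the top of the page (7 results, 6 models scored, top 75.0%, median 66.3%) appear to be derived from the rows flagged Primary rather than from all ten rows. The page does not state this, so the sketch below is only a consistency check under that assumption; the variable names are illustrative.

```python
from statistics import median

# Rows flagged "Primary" in the table above: (model, score %).
# Claude Haiku 4.5 is listed twice on the page and is kept twice here.
primary = [
    ("GPT 5.4", 75.0),
    ("GPT 5.3 Codex", 74.0),
    ("Claude Sonnet 4.6", 72.5),
    ("Claude Opus 4.5", 66.3),
    ("Claude Sonnet 4.5", 61.4),
    ("Claude Haiku 4.5", 50.7),
    ("Claude Haiku 4.5", 50.7),
]

scores = [score for _, score in primary]

summary = {
    "results": len(primary),                            # every primary row
    "models_scored": len({name for name, _ in primary}),  # distinct model names
    "top": max(primary, key=lambda row: row[1]),        # highest primary score
    "median": median(scores),                           # middle of 7 sorted scores
}

print(summary)
```

Under this reading, the higher-scoring non-primary rows (Claude Opus 4.8, Gemini 3.5 Flash) do not affect the headline "Top" figure.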