AI benchmarks

Browse benchmarks tracking how AI models perform on reasoning, coding, multimodal and knowledge tasks. Each benchmark has its own leaderboard with the latest results from frontier models.

56 Benchmarks
8 Categories
538 Results recorded
113 Models scored

Cross-benchmark model leaderboard

Compare how every tracked model ranks across the headline benchmarks in one matrix.

Open leaderboard →
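
As a rough illustration of what a cross-benchmark matrix involves, the sketch below ranks models within each benchmark and lays the ranks out per model. The scores are copied from two of the "Top results" lists on this page; the function name `rank_matrix` and the data layout are assumptions for illustration, not the site's actual implementation. A value of `None` only means the model is not among the rows included here. Ties are not given special treatment.

```python
# Scores (percent) copied from the "Top results" lists on this page.
RESULTS = {
    "ARC-AGI-2": {"GPT 5.5": 85.0, "GPT 5.4 Pro": 83.3, "Gemini 3.1 Pro": 77.1},
    "GPQA Diamond": {"GPT 5.4 Pro": 94.4, "Gemini 3.1 Pro": 94.3, "Claude Opus 4.7": 94.2},
}


def rank_matrix(results):
    """Return {model: {benchmark: rank or None}}; rank 1 is the highest score.

    Hypothetical helper: the real leaderboard may rank or weight differently.
    """
    models = sorted({m for scores in results.values() for m in scores})
    matrix = {m: {} for m in models}
    for bench, scores in results.items():
        # Order models on this benchmark from best to worst score.
        ordered = sorted(scores, key=scores.get, reverse=True)
        for m in models:
            matrix[m][bench] = ordered.index(m) + 1 if m in scores else None
    return matrix


matrix = rank_matrix(RESULTS)
for model, ranks in matrix.items():
    print(model, ranks)
```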

ARC-AGI-2

Abstraction and Reasoning Corpus — AGI v2

Abstract visual-grid puzzles designed to resist memorisation. Each task can be solved by humans from a few examples; LLMs typically struggle without test-time adaptation.

Reasoning Text 17 results
Top results
1
GPT 5.5
85.0%
2
GPT 5.4 Pro
83.3%
3
Gemini 3.1 Pro
77.1%
Last eval Apr 23, 2026 View leaderboard →

BIG-Bench Hard

The 23-task subset of BIG-Bench where prior LLMs underperformed humans. Mix of logical, algorithmic, and language-understanding tasks.

Reasoning Text 6 results
Top results
1
Seed 1.5
91.6%
2
Nova Pro
86.9%
3
Claude Opus 3
86.8%
Last eval Jan 22, 2025 View leaderboard →

DROP Saturated

Discrete Reasoning Over Paragraphs

Reading-comprehension benchmark requiring discrete operations (addition, counting, sorting) over passages. Mostly saturated by frontier models.

Reasoning Text 8 results
Top results
1
Seed 1.5
93.0%
2
Command A
91.1%
3
Nova Pro
85.4%
Last eval Apr 16, 2025 View leaderboard →

LiveBench

LiveBench (Reasoning Avg)

Continuously refreshed benchmark across reasoning, coding, math, data analysis, language and instruction following. New questions every month to stay contamination-free.

Reasoning Text 2 results
Top results
1
Qwen3 235B A22B
77.1
2
Qwen3 30B A3B
74.3
Last eval Apr 28, 2025 View leaderboard →

Aider Polyglot

Aider Polyglot Coding Benchmark

225 hard Exercism programming exercises across 6 languages (C++, Go, Java, JavaScript, Python, Rust). Measures whole-file edit accuracy under a realistic agentic-coding harness.

Coding Text 12 results
Top results
1
Claude Opus 4.5
89.4%
2
GPT 5.1
88.0%
3
GPT 5 (Thinking)
88.0%
Last eval Nov 24, 2025 View leaderboard →

CodeForces ELO

CodeForces Live Contest ELO

Live competitive-programming ELO inferred from a model's performance on recent Codeforces rounds. Reported as a Codeforces rating.

Coding Text 12 results
Top results
1
Deepseek V4 Pro
3,206.0 rating
2
o4 mini
2,719.0 rating
3
o3
2,706.0 rating
Last eval Apr 24, 2026 View leaderboard →
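
To give a feel for what a gap between two CodeForces ELO ratings means, here is a minimal sketch of the standard Elo expected-score formula. This page does not say how the ratings are inferred from contest results, so the formula is shown only as the usual way to interpret a rating difference; the function name `expected_score` is an assumption.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (between 0 and 1).

    Assumption: the conventional 400-point logistic scale applies.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings taken from the top two entries listed above.
top = 3206.0
runner_up = 2719.0
print(round(expected_score(top, runner_up), 3))
```

Equal ratings give an expected score of 0.5, and the two expected scores of any pairing sum to 1.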

HumanEval Saturated

HumanEval (pass@1)

OpenAI's 164 hand-written Python programming problems with unit tests. The original code-LLM benchmark; now saturated and broadly considered contaminated in modern training corpo…

Coding Text 14 results
Top results
1
GPT-4o
90.2%
2
Llama 3.3
88.4%
3
Claude Haiku 3.5
88.1%
Last eval Apr 3, 2026 View leaderboard →
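
For readers unfamiliar with the pass@1 metric in the HumanEval entry, the sketch below shows the commonly used unbiased pass@k estimator: with n sampled solutions per problem, of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). With k = 1 this reduces to c / n. Whether every result listed above was computed this way is not stated on this page; the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(per_problem, k):
    """Average pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)
```

A benchmark-level score is then the mean over all problems, for example `benchmark_pass_at_k([(10, 3), (10, 10)], k=1)`.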

HumanEval+ Saturated

HumanEval+ (EvalPlus)

HumanEval with substantially expanded test cases (~80x more) to catch wrong-but-passing solutions.

Coding Text 2 results
Top results
1
Phi 4 reasoning plus
92.3%
2
WizardCoder
64.6%
Last eval Jul 8, 2025 View leaderboard →

LiveCodeBench

Continuously refreshed competitive-programming problems sourced from LeetCode, AtCoder, and Codeforces after the model's knowledge cutoff. Designed to stay contamination-free.

Coding Text 26 results
Top results
1
Deepseek V4 Pro
93.5%
2
Kimi K2.6
89.6%
3
Kimi K2.5
85.0%
Last eval Apr 24, 2026 View leaderboard →

MBPP Saturated

Mostly Basic Python Problems

974 short Python problems with test cases. One of the original code-LLM benchmarks; saturated by frontier models.

Coding Text 5 results
Top results
1
Nemotron 3 Super
78.4%
2
Code Llama
66.7%
3
Mixtral 8x7B
60.7%
Last eval Apr 3, 2026 View leaderboard →

SWE-bench Multimodal

Variant of SWE-bench where issues include screenshots, diagrams and other visual context. Tests multimodal software-engineering ability.

Coding Multimodal 1 result
Top results
1
Deepseek 3.2
70.2%
Last eval Dec 1, 2025 View leaderboard →

SWE-bench Verified

500 manually validated GitHub issues from popular Python repos. Models must produce a patch that passes the hidden test suite. The current standard for "real software engineering…

Coding Text 49 results
Top results
1
Claude Opus 4.7
87.6%
2
Claude Opus 4.5
80.9%
3
Claude Opus 4.6
80.8%
Last eval Apr 27, 2026 View leaderboard →

AIME 2024

American Invitational Mathematics Examination 2024

30 problems from the 2024 AIME I and II contests. The standard high-school competition math eval before AIME 2025 superseded it as the primary signal.

Math Text 19 results
Top results
1
o4 mini
93.4%
2
o3
91.6%
3
Qwen3 235B A22B
85.7%
Last eval Apr 3, 2026 View leaderboard →

AIME 2025

American Invitational Mathematics Examination 2025

30 problems from the 2025 AIME I and II contests. High-school competition math with integer answers 0-999; valuable post-cutoff signal for 2024-trained models.

Math Text 37 results
Top results
1
Grok 4 Heavy
100.0%
2
GPT 5.2 Thinking
100.0%
3
DeepSeek 3.2 Speciale
96.0%
Last eval May 5, 2026 View leaderboard →
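
Because AIME answers are integers from 0 to 999, grading can be done by exact match. The sketch below is one way such a grader might look; the function name `grade_aime` and the tiny answer key in the usage lines are hypothetical, and real harnesses may also average over repeated runs.

```python
def grade_aime(predictions, answer_key):
    """Return percent of exact-match answers.

    A prediction counts only if it is an int in the range 0..999
    and equals the reference answer. Hypothetical helper.
    """
    if len(predictions) != len(answer_key):
        raise ValueError("predictions and answer_key must have equal length")
    correct = 0
    for pred, truth in zip(predictions, answer_key):
        if type(pred) is int and 0 <= pred <= 999 and pred == truth:
            correct += 1
    return 100.0 * correct / len(answer_key)


# Made-up four-question key, for illustration only.
demo_key = [1, 2, 3, 4]
demo_predictions = [1, 2, 0, None]
print(grade_aime(demo_predictions, demo_key))
```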

GSM8K Saturated

Grade School Math 8K

8.5k grade-school math word problems requiring 2-8 step arithmetic reasoning. Saturated by all frontier models; mostly useful as a smoke test today.

Math Text 12 results
Top results
1
Claude Opus 3
95.0%
2
Nova Pro
94.8%
3
Nova Lite
94.5%
Last eval Apr 3, 2026 View leaderboard →

MATH Saturated

MATH (Hendrycks)

12.5k competition mathematics problems (AMC, AIME, USAMO style). Reported as overall % or split by Level 1-5 difficulty. The "easy" levels are now saturated; Level 5 still discri…

Math Text 17 results
Top results
1
Seed 1.5
88.6%
2
Nemotron 3 Super
84.8%
3
Command A
80.0%
Last eval Apr 3, 2026 View leaderboard →

MATH-500

MATH-500 (OpenAI subset)

500-question subset of MATH popularised by OpenAI's o-series releases. Reported widely as the standard 'MATH' number on modern leaderboards.

Math Text 6 results
Top results
1
DeepSeek-R1
97.3%
2
Claude Sonnet 3.7 (Thinking)
96.2%
3
Llama 4 Behemoth
95.0%
Last eval Apr 30, 2025 View leaderboard →

USAMO 2025

USA Mathematical Olympiad 2025

Six proof-based problems from the 2025 USAMO. Graded out of 42 (7 points per problem) by expert judges.

Math Text 2 results
Top results
1
Grok 4 Heavy
61.9 points
2
Grok 4
37.5 points
Last eval Sep 7, 2025 View leaderboard →

ARC Challenge Saturated

AI2 Reasoning Challenge (Challenge set)

Grade-school science multiple-choice, hard subset. Saturated by frontier models but still in many evaluation harnesses.

Knowledge Text 10 results
Top results
1
Gemma 2
554.0%
2
Claude Opus 3
96.4%
3
Nemotron 3 Super
96.1%
Last eval Apr 3, 2026 View leaderboard →

GPQA Diamond

Graduate-Level Google-Proof Q&A — Diamond subset

PhD-level multiple-choice questions in biology, physics, and chemistry, written by domain experts so non-experts cannot answer them even with web search. Diamond is the hardest c…

Knowledge Text 82 results
Top results
1
GPT 5.4 Pro
94.4%
2
Gemini 3.1 Pro
94.3%
3
Claude Opus 4.7
94.2%
Last eval Jul 8, 2026 View leaderboard →

Humanity's Last Exam

Humanity's Last Exam (HLE)

A 2,500-question exam crowdsourced from subject-matter experts across hundreds of disciplines. Designed to remain unsaturated by frontier models for as long as possible.

Knowledge Text 24 results
Top results
1
Kimi K2.6
54.0%
2
Claude Opus 4.7
46.9%
3
Grok 4 Heavy
44.4%
Last eval Apr 23, 2026 View leaderboard →

MMLU Saturated

Massive Multitask Language Understanding

Multiple-choice questions across 57 academic subjects (humanities, STEM, social sciences, professional). Standard 5-shot accuracy. Largely saturated by frontier models.

Knowledge Text 22 results
Top results
1
GPT 4.1
90.2%
2
GPT-4o
88.7%
3
Seed 1.5
88.6%
Last eval Apr 3, 2026 View leaderboard →

MMLU-Pro

A harder, more reasoning-focused replacement for MMLU. 10 answer choices instead of 4 and curated to remove trivially answerable items.

Knowledge Text 23 results
Top results
1
GPT OSS 120B
90.0%
2
Qwen 3.5 27B
86.1%
3
Gemma 4
85.2%
Last eval Apr 3, 2026 View leaderboard →

SimpleQA Verified

SimpleQA Verified (Epoch curation)

Short-form factual questions with single, unambiguous answers. Tests world knowledge and (critically) hallucination — refusing or hedging counts as not-correct.

Knowledge Text 2 results
Top results
1
Gemini 3 Pro (Thinking)
72.1%
2
Gemini 2.5 Flash-Lite
10.7%
Last eval Nov 18, 2025 View leaderboard →

TriviaQA Saturated

Open-domain question answering over 650k trivia question/answer pairs. Long-tail world knowledge.

Knowledge Text 3 results
Top results
1
LLaMA 2
85.0%
2
Mixtral 8x7B
71.5%
3
Gemma 2
59.4%
Last eval Feb 25, 2025 View leaderboard →

TruthfulQA

817 questions designed to elicit imitative falsehoods. Measures whether models repeat common misconceptions.

Knowledge Text 1 result
Top results
1
Mistral NeMo
50.3%
Last eval Jul 18, 2024 View leaderboard →

ChartQA

Question answering over charts and plots, mixing extraction and visual-reasoning questions.

Multimodal Multimodal 5 results
Top results
1
Llama 4 Maverick
90.0%
2
Llama 4 Scout
88.8%
3
Pixtral Large
88.1%
Last eval Apr 16, 2025 View leaderboard →

DocVQA

Document VQA over scanned business documents. Tests OCR-grounded reading.

Multimodal Multimodal 5 results
Top results
1
Llama 4 Maverick
94.4%
2
Llama 4 Scout
94.4%
3
Pixtral Large
93.3%
Last eval Apr 16, 2025 View leaderboard →

MathVista

MathVista (testmini)

Mathematical reasoning over visual contexts: figures, charts, diagrams, geometric drawings.

Multimodal Multimodal 8 results
Top results
1
o3
86.8%
2
o4 mini
84.3%
3
Llama 4 Maverick
73.7%
Last eval Apr 16, 2025 View leaderboard →

MMMU

Massive Multi-discipline Multimodal Understanding

11.5k college-level questions across 30 subjects requiring image + text reasoning (charts, diagrams, medical scans, music notation, …).

Multimodal Multimodal 26 results
Top results
1
GPT 5.1
84.2%
2
GPT 5 (Thinking)
84.2%
3
o3
82.9%
Last eval Apr 24, 2026 View leaderboard →

MMMU-Pro

Harder MMMU variant: filters out text-only-solvable items and adds a vision-only setting where the question itself is rendered into the image.

Multimodal Multimodal 14 results
Top results
1
GPT 5.4
81.2%
2
Gemini 3 Flash (Thinking)
81.2%
3
Gemini 3 Pro
81.0%
Last eval May 5, 2026 View leaderboard →

BFCL v3

Berkeley Function-Calling Leaderboard v3

Evaluates function/tool-calling correctness across single, parallel, multi-turn and irrelevance-detection scenarios.

Agentic Text 2 results
Top results
1
Qwen3 235B A22B
70.8%
2
Qwen3 30B A3B
69.1%
Last eval Apr 28, 2025 View leaderboard →

OSWorld

369 real desktop tasks across Ubuntu, Windows and macOS apps. Agents act through screenshots + mouse/keyboard.

Agentic Multimodal 7 results
Top results
1
GPT 5.4
75.0%
2
GPT 5.3 Codex
74.0%
3
Claude Sonnet 4.6
72.5%
Last eval Mar 5, 2026 View leaderboard →

Tau2-Bench Telecom

τ²-Bench Telecom

Multi-turn customer-service agent benchmark in a telecom domain: the model must take real tool actions while a simulated customer pushes back on incomplete or wrong answers.

Agentic Text 12 results
Top results
1
Gemini 3.1 Pro
99.3%
2
GPT 5.4
98.9%
3
Claude Opus 4.5
98.2%
Last eval Apr 23, 2026 View leaderboard →

Terminal-Bench Hard

The hardest split of Terminal-Bench: agents must complete real CLI tasks (debugging, system admin, multi-step automation) inside a sandboxed terminal.

Agentic Text 25 results
Top results
1
GPT 5.5
82.7%
2
GPT 5.3 Codex
77.3%
3
GLM 4.6
75.9%
Last eval Apr 24, 2026 View leaderboard →

IFBench

Instruction Following Benchmark

Measures how reliably a model follows complex multi-constraint instructions, a known weak spot for many otherwise strong models.

Language Text 2 results
Top results
1
Qwen 3.5 122B A10B
76.1%
2
Qwen 3.5 35B A3B
70.2%
Last eval Apr 24, 2026 View leaderboard →

IFEval

Instruction-Following Eval

Verifiable instruction-following: ~25 instruction types whose compliance can be checked deterministically (e.g. word counts, formats).

Language Text 12 results
Top results
1
Claude Sonnet 3.7 (Thinking)
93.2%
2
Nova Pro
92.1%
3
Llama 3.3
92.1%
Last eval Feb 24, 2026 View leaderboard →
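
The IFEval entry describes instructions whose compliance can be checked deterministically, such as word counts and formats. A minimal sketch of that idea follows; the specific checks and function names are assumptions chosen for illustration and are not taken from the IFEval codebase.

```python
import json


def has_at_most_words(response: str, limit: int) -> bool:
    """Word-count constraint: whitespace-separated tokens <= limit."""
    return len(response.split()) <= limit


def is_valid_json(response: str) -> bool:
    """Format constraint: the whole response must parse as JSON."""
    try:
        json.loads(response)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


def follows_all(response: str, checks) -> bool:
    """A response complies only if every check passes."""
    return all(check(response) for check in checks)


# Hypothetical prompt: "Reply in JSON using at most 10 words."
checks = [is_valid_json, lambda r: has_at_most_words(r, 10)]
print(follows_all('{"answer": "yes"}', checks))
```

No judge model is needed for checks like these, which is what makes the scoring reproducible.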

MGSM

Multilingual Grade School Math

GSM8K translated into 10 typologically diverse languages. Tests cross-lingual mathematical reasoning.

Language Text 5 results
Top results
1
Llama 3.3
91.1%
2
Claude Opus 3
90.7%
3
GPT-4o
90.5%
Last eval Apr 3, 2026 View leaderboard →

RULER 128k

RULER (128k context)

Synthetic long-context evaluation suite measuring needle-in-a-haystack, multi-key retrieval and tracing across 128k token contexts.

Language Text 1 result
Top results
1
Nemotron 3 Super
88.3%
Last eval Apr 3, 2026 View leaderboard →