AI benchmarks
Browse benchmarks tracking how AI models perform on reasoning, coding, multimodal and knowledge tasks. Each benchmark has its own leaderboard with the latest results from frontier models.
Cross-benchmark model leaderboard
Compare how every tracked model ranks across the headline benchmarks in one matrix.
ARC-AGI-2
Abstract visual-grid puzzles designed to resist memorisation. Each task can be solved by humans from a few examples; LLMs typically struggle without test-time adaptation.
BIG-Bench Hard
The 23-task subset of BIG-Bench where prior LLMs underperformed humans. Mix of logical, algorithmic, and language-understanding tasks.
DROP (Saturated)
Reading-comprehension benchmark requiring discrete operations (addition, counting, sorting) over passages. Mostly saturated by frontier models.
LiveBench
Continuously refreshed benchmark across reasoning, coding, math, data analysis, language and instruction following. New questions every month to stay contamination-free.
Aider Polyglot
225 hard Exercism programming exercises across 6 languages (C++, Go, Java, JavaScript, Python, Rust). Measures whole-file edit accuracy under a realistic agentic-coding harness.
CodeForces ELO
Live competitive-programming Elo rating inferred from a model's performance on recent Codeforces rounds. Reported on the Codeforces rating scale.
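To give a feel for what a rating gap means, here is a minimal sketch of the textbook Elo expected-score formula. It is an illustration only: the function name and the ratings are hypothetical, and the exact procedure used to infer a model's rating from contest results is not described here and may differ.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Textbook Elo expected score of player A against player B (0 to 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Hypothetical ratings, chosen only for illustration.
equal = elo_expected_score(1500, 1500)     # equal ratings give 0.5
stronger = elo_expected_score(1900, 1500)  # a 400-point edge gives 10/11
```

Under this formula a 400-point rating advantage corresponds to an expected score of about 0.91, which is why differences of a few hundred points separate models clearly.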
HumanEval (Saturated)
OpenAI's 164 hand-written Python programming problems with unit tests. The original code-LLM benchmark; now saturated and broadly considered contaminated in modern training corpo…
HumanEval+ (Saturated)
HumanEval with substantially expanded test cases (~80x more) to catch wrong-but-passing solutions.
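A small sketch of why expanded tests matter: the toy task, the buggy solution and both test lists below are hypothetical and are not taken from HumanEval or HumanEval+. They only show how a wrong solution can pass a sparse test set and fail a larger one.

```python
from typing import Callable, List, Tuple

TestCase = Tuple[List[int], int]  # (input list, expected maximum)


def buggy_max(nums: List[int]) -> int:
    # Wrong on purpose: only correct when the input is sorted ascending.
    return nums[-1]


def passes(solution: Callable[[List[int]], int], tests: List[TestCase]) -> bool:
    """Return True only if the solution matches every expected output."""
    return all(solution(list(inp)) == expected for inp, expected in tests)


# Sparse tests happen to use sorted inputs, so the bug goes unnoticed.
sparse_tests: List[TestCase] = [([1, 2, 3], 3), ([5], 5)]
# Expanded tests add unsorted and negative inputs that expose it.
expanded_tests: List[TestCase] = sparse_tests + [([3, 1, 2], 3), ([-1, -5], -1)]

passes_sparse = passes(buggy_max, sparse_tests)      # True
passes_expanded = passes(buggy_max, expanded_tests)  # False
```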
LiveCodeBench
Continuously refreshed competitive-programming problems sourced from LeetCode, AtCoder, and Codeforces after the model's knowledge cutoff. Designed to stay contamination-free.
MBPP (Saturated)
974 short Python problems with test cases. One of the original code-LLM benchmarks; saturated by frontier models.
SWE-bench Multimodal
Variant of SWE-bench where issues include screenshots, diagrams and other visual context. Tests multimodal software-engineering ability.
SWE-bench Verified
500 manually validated GitHub issues from popular Python repos. Models must produce a patch that passes the hidden test suite. The current standard for "real software engineering…
AIME 2024
30 problems from the 2024 AIME I and II contests. The standard high-school competition math eval until AIME 2025 superseded it as the primary signal.
AIME 2025
30 problems from the 2025 AIME I and II contests. High-school competition math with integer answers 0-999; valuable post-cutoff signal for 2024-trained models.
GSM8K (Saturated)
8.5k grade-school math word problems requiring 2-8 step arithmetic reasoning. Saturated by all frontier models; mostly useful as a smoke test today.
MATH (Saturated)
12.5k competition mathematics problems (AMC, AIME, USAMO style). Reported as overall % or split by Level 1-5 difficulty. The "easy" levels are now saturated; Level 5 still discri…
MATH-500
500-question subset of MATH popularised by OpenAI's o-series releases. Reported widely as the standard 'MATH' number on modern leaderboards.
USAMO 2025
Six proof-based problems from the 2025 USAMO. Graded out of 42 (7 points per problem) by expert judges.
ARC Challenge (Saturated)
Grade-school science multiple-choice, hard subset. Saturated by frontier models but still in many evaluation harnesses.
GPQA Diamond
PhD-level multiple-choice questions in biology, physics, and chemistry, written by domain experts so non-experts cannot answer them even with web search. Diamond is the hardest c…
Humanity's Last Exam
A 2,500-question exam crowdsourced from subject-matter experts across hundreds of disciplines. Designed to remain unsaturated by frontier models for as long as possible.
MMLU (Saturated)
Multiple-choice questions across 57 academic subjects (humanities, STEM, social sciences, professional). Standard 5-shot accuracy. Largely saturated by frontier models.
MMLU-Pro
A harder, more reasoning-focused replacement for MMLU. 10 answer choices instead of 4 and curated to remove trivially answerable items.
SimpleQA Verified
Short-form factual questions with single, unambiguous answers. Tests world knowledge and (critically) hallucination — refusing or hedging counts as not-correct.
TriviaQA (Saturated)
Open-domain question answering over roughly 95k trivia question/answer pairs (about 650k question-answer-evidence triples). Tests long-tail world knowledge.
TruthfulQA
817 questions designed to elicit imitative falsehoods. Measures whether models repeat common misconceptions.
ChartQA
Question answering over charts and plots, mixing extraction and visual-reasoning questions.
DocVQA
Document VQA over scanned business documents. Tests OCR-grounded reading.
MathVista
Mathematical reasoning over visual contexts: figures, charts, diagrams, geometric drawings.
MMMU
11.5k college-level questions across 30 subjects requiring image + text reasoning (charts, diagrams, medical scans, music notation, …).
MMMU-Pro
Harder MMMU variant: filters out text-only-solvable items and adds a vision-only setting where the question itself is rendered into the image.
BFCL v3
Evaluates function/tool-calling correctness across single, parallel, multi-turn and irrelevance-detection scenarios.
OSWorld
369 real desktop tasks across Ubuntu, Windows and macOS apps. Agents act through screenshots + mouse/keyboard.
Tau2-Bench Telecom
Multi-turn customer-service agent benchmark in a telecom domain: the model must take real tool actions while a simulated customer pushes back on incomplete or wrong answers.
Terminal-Bench Hard
The hardest split of Terminal-Bench: agents must complete real CLI tasks (debugging, system admin, multi-step automation) inside a sandboxed terminal.
IFBench
Measures how reliably a model follows complex multi-constraint instructions, a known weak spot for many otherwise strong models.
IFEval
Verifiable instruction-following: ~25 instruction types whose compliance can be checked deterministically (e.g. word counts, formats).
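As an illustration of what "checked deterministically" can look like, here is a minimal sketch of two such checks. The function names and the specific rules are hypothetical and are not IFEval's actual implementation; they only show that compliance can be decided by code rather than by a judge model.

```python
import json


def check_max_words(response: str, limit: int) -> bool:
    """Instruction: 'answer in at most `limit` words'."""
    return len(response.split()) <= limit


def check_valid_json(response: str) -> bool:
    """Instruction: 'respond with valid JSON only'."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical model responses, for illustration only.
within_limit = check_max_words("one two three", 3)
over_limit = check_max_words("one two three four", 3)
is_json = check_valid_json('{"answer": 42}')
not_json = check_valid_json("The answer is 42.")
```

Because each check is a pure function of the response text, the same response always receives the same verdict.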
MGSM
GSM8K translated into 10 typologically diverse languages. Tests cross-lingual mathematical reasoning.
RULER 128k
Synthetic long-context evaluation suite measuring needle-in-a-haystack retrieval, multi-key retrieval and tracing across 128k-token contexts.
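The needle-in-a-haystack idea can be sketched in a few lines. This is a simplified illustration, not RULER's generator: the function name, the filler sentence and the needle are hypothetical, and a real 128k-token test would measure length in tokens rather than sentences.

```python
def build_haystack(filler: str, needle: str, n_sentences: int, depth: float) -> str:
    """Repeat a filler sentence and insert one needle at a relative depth (0 to 1)."""
    sentences = [filler] * n_sentences
    position = int(depth * n_sentences)
    sentences.insert(position, needle)
    return " ".join(sentences)


# Hypothetical filler and needle, for illustration only.
needle = "The secret code is 7421."
context = build_haystack("The sky was grey that day.", needle, n_sentences=10, depth=0.5)
question = "What is the secret code?"
expected_answer = "7421"
```

A model is then given `context` followed by `question`, and its reply is checked for `expected_answer`; sweeping `depth` and the context length shows where retrieval starts to fail.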
