AI benchmarks
Browse benchmarks tracking how AI models perform on reasoning, coding, multimodal and knowledge tasks. Each benchmark has its own leaderboard with the latest results from frontier models.
Cross-benchmark model leaderboard
Compare how every tracked model ranks across the headline benchmarks in one matrix.
ARC Challenge (Saturated)
Hard subset of grade-school science multiple-choice questions. Saturated by frontier models but still included in many evaluation harnesses.
GPQA Diamond
PhD-level multiple-choice questions in biology, physics, and chemistry, written by domain experts so non-experts cannot answer them even with web search. Diamond is the hardest c…
Humanity's Last Exam
A 2,500-question exam crowdsourced from subject-matter experts across hundreds of disciplines. Designed to remain unsaturated by frontier models for as long as possible.
MMLU (Saturated)
Multiple-choice questions across 57 academic subjects (humanities, STEM, social sciences, professional). Standard 5-shot accuracy. Largely saturated by frontier models.
MMLU-Pro
A harder, more reasoning-focused replacement for MMLU. Uses 10 answer choices instead of 4 and is curated to remove trivially answerable items.
SimpleQA Verified
Short-form factual questions with single, unambiguous answers. Tests world knowledge and, critically, hallucination: refusing or hedging does not count as correct.
TriviaQA (Saturated)
Open-domain question answering over 650k question-answer-evidence triples, built from roughly 95k trivia question/answer pairs. Tests long-tail world knowledge.
TruthfulQA
817 questions designed to elicit imitative falsehoods. Measures whether models repeat common misconceptions.
