AI benchmarks

Browse benchmarks tracking how AI models perform on reasoning, coding, multimodal and knowledge tasks. Each benchmark has its own leaderboard with the latest results from frontier models.

Benchmarks

Cross-benchmark model leaderboard

Compare how every tracked model ranks across the headline benchmarks in one matrix.


ARC Challenge (Saturated)

AI2 Reasoning Challenge (Challenge set)

Hard subset of grade-school science multiple-choice questions. Saturated by frontier models but still included in many evaluation harnesses.

Knowledge · Text · 10 results

Last eval: 03 Apr 2026

GPQA Diamond

Graduate-Level Google-Proof Q&A — Diamond subset

PhD-level multiple-choice questions in biology, physics, and chemistry, written by domain experts so non-experts cannot answer them even with web search. Diamond is the hardest c…

Knowledge · Text · 90 results

Last eval: 09 Jul 2026

Humanity's Last Exam

Humanity's Last Exam (HLE)

A 2,500-question exam crowdsourced from subject-matter experts across hundreds of disciplines. Designed to remain unsaturated by frontier models for as long as possible.

Knowledge · Text · 30 results

Last eval: 30 Jun 2026

MMLU (Saturated)

Massive Multitask Language Understanding

Multiple-choice questions across 57 academic subjects (humanities, STEM, social sciences, professional). Reported as standard 5-shot accuracy. Largely saturated by frontier models.

Knowledge · Text · 22 results

Last eval: 03 Apr 2026
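
For readers unfamiliar with the "5-shot" setup mentioned above, here is a minimal sketch of how such a prompt might be assembled: five solved examples followed by the unsolved target question. The function name `build_few_shot_prompt`, the prompt layout, and the placeholder items are all assumptions for illustration, not the format used by any particular harness or by this leaderboard.

```python
LETTERS = "ABCDEFGHIJ"

def format_item(question, choices, answer=None):
    """Render one multiple-choice item; leave the answer blank for the target."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_few_shot_prompt(examples, question, choices, k=5):
    """Prepend the first k solved examples to the unsolved target question."""
    shots = [format_item(q, opts, ans) for q, opts, ans in examples[:k]]
    return "\n\n".join(shots + [format_item(question, choices)])

# Placeholder items only (not real benchmark questions).
examples = [
    (f"Example question {i}", ["w", "x", "y", "z"], "A") for i in range(1, 7)
]
prompt = build_few_shot_prompt(
    examples, "Target question", ["w", "x", "y", "z"]
)
print(prompt)
```

The model's completion after the final "Answer:" is compared with the gold letter, and accuracy is the fraction of target questions answered correctly.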

MMLU-Pro

A harder, more reasoning-focused replacement for MMLU. It has 10 answer choices instead of 4 and is curated to remove trivially answerable items.

Knowledge · Text · 26 results

Last eval: 03 Apr 2026
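
One practical effect of moving from 4 to 10 answer choices is a lower random-guessing floor. The short sketch below works through that arithmetic, assuming every item has exactly the stated number of choices and that guesses are uniform; `chance_accuracy` is a hypothetical helper name.

```python
def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing on one multiple-choice item."""
    if num_choices < 1:
        raise ValueError("num_choices must be at least 1")
    return 1.0 / num_choices

mmlu_floor = chance_accuracy(4)       # 4 answer choices
mmlu_pro_floor = chance_accuracy(10)  # 10 answer choices

print(f"MMLU guessing floor: {mmlu_floor:.0%}")
print(f"MMLU-Pro guessing floor: {mmlu_pro_floor:.0%}")
```

Under these assumptions the floor drops from 25% to 10%, so a given score on the 10-choice format sits further above chance than the same score on the 4-choice format.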

SimpleQA Verified

SimpleQA Verified (Epoch curation)

Short-form factual questions with single, unambiguous answers. Tests world knowledge and, critically, hallucination: refusing or hedging counts as not correct.

Knowledge · Text · 2 results

Top results

Gemini 3 Pro (Thinking): 72.1%

Gemini 2.5 Flash-Lite: 10.7%

Last eval: 18 Nov 2025
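
The scoring rule described above, where a refusal or hedge is not counted as correct, can be sketched as follows. The grade labels (`correct`, `incorrect`, `not_attempted`) and both function names are assumptions for illustration; the actual grader and the exact metric shown on the leaderboard may differ.

```python
from collections import Counter

# Hypothetical grade labels; the real grader's label names may differ.
CORRECT = "correct"
INCORRECT = "incorrect"
NOT_ATTEMPTED = "not_attempted"  # refusals and hedged non-answers

def accuracy(grades):
    """Fraction of ALL questions graded correct; non-attempts earn no credit."""
    if not grades:
        raise ValueError("grades must be non-empty")
    return Counter(grades)[CORRECT] / len(grades)

def attempted_error_rate(grades):
    """Fraction of attempted questions answered wrongly (a rough hallucination signal)."""
    counts = Counter(grades)
    attempted = counts[CORRECT] + counts[INCORRECT]
    return counts[INCORRECT] / attempted if attempted else 0.0

# Made-up grades for four questions, for illustration only.
grades = [CORRECT, INCORRECT, NOT_ATTEMPTED, CORRECT]
print(accuracy(grades))
print(attempted_error_rate(grades))
```

Because the denominator in `accuracy` is every question, a model cannot raise its score by declining to answer; the second function is an optional extra view that separates wrong answers from abstentions.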