TAAFT

AI benchmarks

Browse benchmarks tracking how AI models perform on reasoning, coding, multimodal and knowledge tasks. Each benchmark has its own leaderboard with the latest results from frontier models.

56 benchmarks · 8 categories · 538 results recorded · 113 models scored

Cross-benchmark model leaderboard

Compare how every tracked model ranks across the headline benchmarks in one matrix.
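
As a rough illustration of what such a matrix looks like, the sketch below pivots a flat list of results into a model-by-benchmark grid. The `build_matrix` helper and the `RESULTS` list are hypothetical names introduced here; the scores are a few of the values listed on this page, and a missing entry is shown as `None` rather than guessed.

```python
from typing import List, Optional, Tuple

# (model, benchmark, score in percent); a small sample of scores from this page
Result = Tuple[str, str, float]

RESULTS: List[Result] = [
    ("GPT 5.4 Pro", "GPQA Diamond", 94.4),
    ("Claude Opus 4.7", "GPQA Diamond", 94.2),
    ("Claude Opus 4.7", "Humanity's Last Exam", 46.9),
    ("Kimi K2.6", "Humanity's Last Exam", 54.0),
]


def build_matrix(
    results: List[Result],
) -> Tuple[List[str], List[str], List[List[Optional[float]]]]:
    """Pivot flat results into rows (models) by columns (benchmarks)."""
    models = sorted({model for model, _, _ in results})
    benchmarks = sorted({bench for _, bench, _ in results})
    lookup = {(model, bench): score for model, bench, score in results}
    # A model with no recorded result on a benchmark gets None in that cell.
    rows = [[lookup.get((m, b)) for b in benchmarks] for m in models]
    return models, benchmarks, rows


if __name__ == "__main__":
    models, benchmarks, rows = build_matrix(RESULTS)
    print("model", *benchmarks, sep=" | ")
    for model, row in zip(models, rows):
        print(model, *["-" if s is None else f"{s:.1f}%" for s in row], sep=" | ")
```

Leaving gaps as `None` keeps the matrix honest: a model that was never evaluated on a benchmark is not the same as a model that scored zero.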

ARC Challenge (Saturated)

AI2 Reasoning Challenge (Challenge set)

Grade-school science multiple-choice questions, hard subset. Saturated by frontier models but still included in many evaluation harnesses.

Knowledge · Text · 10 results
Top results
1. Gemma 2: 554.0%
2. Claude Opus 3: 96.4%
3. Nemotron 3 Super: 96.1%
Last eval: Apr 3, 2026

GPQA Diamond

Graduate-Level Google-Proof Q&A — Diamond subset

PhD-level multiple-choice questions in biology, physics, and chemistry, written by domain experts so non-experts cannot answer them even with web search. Diamond is the hardest c…

Knowledge · Text · 82 results
Top results
1. GPT 5.4 Pro: 94.4%
2. Gemini 3.1 Pro: 94.3%
3. Claude Opus 4.7: 94.2%
Last eval: Jul 8, 2026

Humanity's Last Exam

Humanity's Last Exam (HLE)

A 2,500-question exam crowdsourced from subject-matter experts across hundreds of disciplines. Designed to remain unsaturated by frontier models for as long as possible.

Knowledge · Text · 24 results
Top results
1. Kimi K2.6: 54.0%
2. Claude Opus 4.7: 46.9%
3. Grok 4 Heavy: 44.4%
Last eval: Apr 23, 2026

MMLU (Saturated)

Massive Multitask Language Understanding

Multiple-choice questions across 57 academic subjects (humanities, STEM, social sciences, professional). Standard 5-shot accuracy. Largely saturated by frontier models.

Knowledge · Text · 22 results
Top results
1. GPT 4.1: 90.2%
2. GPT-4o: 88.7%
3. Seed 1.5: 88.6%
Last eval: Apr 3, 2026
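
To make "5-shot accuracy" concrete, here is a minimal sketch of how a few-shot multiple-choice prompt might be assembled and how exact-match accuracy could be computed. The item layout (`question`, `choices`, `answer` as an index) and the helper names are assumptions made for illustration, not the format of any particular evaluation harness.

```python
from typing import Dict, List

LETTERS = "ABCD"  # MMLU items have four answer choices


def format_item(item: Dict, include_answer: bool) -> str:
    """Render one question; the answer letter is shown only for exemplars."""
    lines = [item["question"]]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(item["choices"])]
    lines.append(f"Answer: {LETTERS[item['answer']]}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(exemplars: List[Dict], target: Dict, k: int = 5) -> str:
    """Prepend k solved exemplars to the unsolved target question."""
    shots = [format_item(ex, include_answer=True) for ex in exemplars[:k]]
    return "\n\n".join(shots + [format_item(target, include_answer=False)])


def accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of predicted letters that exactly match the gold letters."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

The model's completion after the final `Answer:` is compared with the gold letter, so the reported score is simply the share of questions answered with the right letter.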

MMLU-Pro

A harder, more reasoning-focused replacement for MMLU. Uses 10 answer choices instead of 4 and is curated to remove trivially answerable items.

Knowledge · Text · 23 results
Top results
1. GPT OSS 120B: 90.0%
2. Qwen 3.5 27B: 86.1%
3. Gemma 4: 85.2%
Last eval: Apr 3, 2026
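
One practical effect of moving from 4 to 10 answer choices is a lower random-guess floor. The sketch below works through that arithmetic, assuming every item has exactly the stated number of choices. The `above_chance` rescaling is an illustrative normalization added here, not part of how the leaderboard scores are reported.

```python
def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing on a multiple-choice item."""
    if num_choices < 2:
        raise ValueError("a multiple-choice item needs at least two choices")
    return 1.0 / num_choices


def above_chance(accuracy: float, num_choices: int) -> float:
    """Rescale accuracy so random guessing maps to 0.0 and a perfect score to 1.0."""
    chance = chance_accuracy(num_choices)
    return (accuracy - chance) / (1.0 - chance)


if __name__ == "__main__":
    print(f"4 choices:  chance = {chance_accuracy(4):.0%}")
    print(f"10 choices: chance = {chance_accuracy(10):.0%}")
```

With 4 choices a model that guesses blindly still lands near 25%, while with 10 choices the same strategy lands near 10%, so less of a raw score can be attributed to luck.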

SimpleQA Verified

SimpleQA Verified (Epoch curation)

Short-form factual questions with single, unambiguous answers. Tests world knowledge and (critically) hallucination — refusing or hedging counts as not-correct.

Knowledge · Text · 2 results
Top results
1. Gemini 3 Pro (Thinking): 72.1%
2. Gemini 2.5 Flash-Lite: 10.7%
Last eval: Nov 18, 2025
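
The rule that refusing or hedging counts as not-correct can be sketched as follows. The three grade labels and the `summarize` helper are assumptions chosen for illustration; the key point is that the headline accuracy divides by all questions, not only by the ones the model attempted.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical grade labels: a refusal or hedge is graded "not_attempted"
GRADES = ("correct", "incorrect", "not_attempted")


def summarize(grades: Iterable[str]) -> Dict[str, float]:
    """Aggregate per-question grades into headline and diagnostic rates."""
    counts = Counter(grades)
    unknown = set(counts) - set(GRADES)
    if unknown:
        raise ValueError(f"unknown grade labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no grades supplied")
    attempted = counts["correct"] + counts["incorrect"]
    return {
        # Headline score: unattempted questions stay in the denominator.
        "accuracy": counts["correct"] / total,
        "attempted_rate": attempted / total,
        # Diagnostic only: how often the model is right when it does answer.
        "accuracy_given_attempted": counts["correct"] / attempted if attempted else 0.0,
    }
```

Reporting the attempted-only rate alongside the headline number helps separate a cautious model that rarely answers from one that answers often but hallucinates.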

TriviaQA (Saturated)

Open-domain question answering over 650k trivia question/answer pairs. Long-tail world knowledge.

Knowledge · Text · 3 results
Top results
1. LLaMA 2: 85.0%
2. Mixtral 8x7B: 71.5%
3. Gemma 2: 59.4%
Last eval: Feb 25, 2025

TruthfulQA

817 questions designed to elicit imitative falsehoods. Measures whether models repeat common misconceptions.

Knowledge · Text · 1 result
Top results
1. Mistral NeMo: 50.3%
Last eval: Jul 18, 2024