AI benchmarks

AA-LCR

Artificial Analysis Long-Context Reasoning

Long-context reasoning evaluation across documents up to 128k tokens. Used as one component of the Artificial Analysis Intelligence Index.

Reasoning Text 1 result

Top results

1. Kimi K2.7 Code: 66.3%

Last eval 12 Jun 2026

ARC-AGI-2

Abstraction and Reasoning Corpus — AGI v2

Abstract visual-grid puzzles designed to resist memorisation. Each task can be solved by humans from a few examples; LLMs typically struggle without test-time adaptation.

Reasoning Text 17 results

Top results

1. GPT 5.5: 85.0%
2. GPT 5.4 Pro: 83.3%
3. Gemini 3.1 Pro: 77.1%

Last eval 23 Apr 2026

BIG-Bench Hard

The 23-task subset of BIG-Bench where prior LLMs underperformed humans. Mix of logical, algorithmic, and language-understanding tasks.

Reasoning Text 6 results

Top results

1. Seed 1.5: 91.6%
2. Nova Pro: 86.9%
3. Claude Opus 3: 86.8%

Last eval 22 Jan 2025

DROP Saturated

Discrete Reasoning Over Paragraphs

Reading-comprehension benchmark requiring discrete operations (addition, counting, sorting) over passages. Mostly saturated by frontier models.

Reasoning Text 8 results

Top results

1. Seed 1.5: 93.0%
2. Command A: 91.1%
3. Nova Pro: 85.4%

Last eval 16 Apr 2025

LiveBench

LiveBench (Reasoning Avg)

Continuously refreshed benchmark across reasoning, coding, math, data analysis, language and instruction following. New questions every month to stay contamination-free.

Reasoning Text 2 results

Top results

1. Qwen3 235B A22B: 77.1
2. Qwen3 30B A3B: 74.3

Last eval 28 Apr 2025

Aider Polyglot

Aider Polyglot Coding Benchmark

225 hard Exercism programming exercises across 6 languages (C++, Go, Java, JavaScript, Python, Rust). Measures whole-file edit accuracy under a realistic agentic-coding harness.

Coding Text 12 results

Top results

1. Claude Opus 4.5: 89.4%
2. GPT 5.1: 88.0%
3. GPT 5 (Thinking): 88.0%

Last eval 24 Nov 2025

CodeForces ELO

CodeForces Live Contest ELO

Live competitive-programming ELO inferred from a model's performance on recent Codeforces rounds. Reported as a Codeforces rating.

Coding Text 12 results

Top results

1. Deepseek V4 Pro: 3,206.0 rating
2. o4 mini: 2,719.0 rating
3. o3: 2,706.0 rating

Last eval 24 Apr 2026
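
To give a feel for what a gap between two Codeforces-style ratings means, the sketch below uses the textbook Elo expected-score formula. This page does not say how the model ratings are inferred, and Codeforces' own rating system is an Elo-like variant, so treat the function name and the 400-point scale as illustrative assumptions rather than the leaderboard's method.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the standard Elo model.

    Assumption: the classic logistic curve with a 400-point scale; the
    leaderboard's actual rating inference may differ.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    # Compare the first- and second-ranked ratings listed for this benchmark.
    p = elo_expected_score(3206.0, 2719.0)
    print(f"Expected score of the higher-rated entry: {p:.3f}")
```

Equal ratings give an expected score of 0.5, and the two players' expected scores always sum to 1.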

HumanEval Saturated

HumanEval (pass@1)

OpenAI's 164 hand-written Python programming problems with unit tests. The original code-LLM benchmark; now saturated and broadly considered contaminated in modern training corpo…

Coding Text 14 results

Top results

1. GPT-4o: 90.2%
2. Llama 3.3: 88.4%
3. Claude Haiku 3.5: 88.1%

Last eval 03 Apr 2026
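
The pass@1 metric named in the HumanEval entry is commonly computed with the unbiased pass@k estimator introduced alongside the benchmark: generate n samples per problem, count the c samples that pass the unit tests, and estimate 1 - C(n-c, k) / C(n, k). The sketch below is a minimal version of that estimator; whether the results listed here used sampling or a single greedy completion is not stated on this page.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples generated, c: samples that passed all tests,
    k: number of attempts the metric allows.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k must contain a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With k = 1 the estimate reduces to the fraction of passing samples.
    print(pass_at_k(n=10, c=3, k=1))
```

The benchmark-level score is the mean of this value over all problems.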

HumanEval+ Saturated

HumanEval+ (EvalPlus)

HumanEval with substantially expanded test cases (~80x more) to catch wrong-but-passing solutions.

Coding Text 2 results

Top results

1. Phi 4 reasoning plus: 92.3%
2. WizardCoder: 64.6%

Last eval 08 Jul 2025

LiveCodeBench

Continuously refreshed competitive-programming problems sourced from LeetCode, AtCoder, and Codeforces after the model's knowledge cutoff. Designed to stay contamination-free.

Coding Text 28 results

Top results

1. Deepseek V4 Pro: 93.5%
2. Kimi K2.6: 89.6%
3. Kimi K2.5: 85.0%

Last eval 24 Apr 2026

MBPP Saturated

Mostly Basic Python Problems

974 short Python problems with test cases. One of the original code-LLM benchmarks; saturated by frontier models.

Coding Text 5 results

Top results

1. Nemotron 3 Super: 78.4%
2. Code Llama: 66.7%
3. Mixtral 8x7B: 60.7%

Last eval 03 Apr 2026

SWE-bench Multimodal

Variant of SWE-bench where issues include screenshots, diagrams and other visual context. Tests multimodal software-engineering ability.

Coding Multimodal 1 result

Top results

1. Deepseek 3.2: 70.2%

Last eval 01 Dec 2025

SWE-bench Verified

500 manually validated GitHub issues from popular Python repos. Models must produce a patch that passes the hidden test suite. The current standard for "real software engineering…

Coding Text 53 results

Top results

1. Claude Opus 4.7: 87.6%
2. Claude Opus 4.5: 80.9%
3. Claude Opus 4.6: 80.8%

Last eval 27 Apr 2026

AIME 2024

American Invitational Mathematics Examination 2024

30 problems from AIME I and II 2024. The standard high-school competition math eval before AIME 2025 superseded it as the primary signal.

Math Text 19 results

Top results

1. o4 mini: 93.4%
2. o3: 91.6%
3. Qwen3 235B A22B: 85.7%

Last eval 03 Apr 2026

AIME 2025

American Invitational Mathematics Examination 2025

30 problems from the 2025 AIME I and II contests. High-school competition math with integer answers 0-999; valuable post-cutoff signal for 2024-trained models.

Math Text 39 results

Top results

1. Grok 4 Heavy: 100.0%
2. GPT 5.2 Thinking: 100.0%
3. Trinity Large Thinking: 96.3%

Last eval 05 May 2026
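
Because every AIME answer is an integer from 0 to 999, grading can be done by exact match. The sketch below shows one way to do it. The extraction rule (take the last integer in the response) and both function names are assumptions made for illustration; real harnesses often require a specific answer format instead.

```python
import re
from typing import Optional, Sequence


def parse_aime_answer(text: str) -> Optional[int]:
    """Return the last integer in the response if it lies in 0..999, else None.

    Assumption: the final number in the text is the model's answer.
    """
    numbers = re.findall(r"\d+", text)
    if not numbers:
        return None
    value = int(numbers[-1])  # leading zeros such as "073" parse to 73
    return value if 0 <= value <= 999 else None


def aime_accuracy(responses: Sequence[str], gold: Sequence[int]) -> float:
    """Fraction of responses whose parsed answer exactly matches the key."""
    if len(responses) != len(gold) or not gold:
        raise ValueError("responses and gold must be non-empty and equal length")
    correct = sum(parse_aime_answer(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


if __name__ == "__main__":
    print(aime_accuracy(["So the answer is 073.", "I get 12"], [73, 15]))
```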

GSM8K Saturated

Grade School Math 8K

8.5k grade-school math word problems requiring 2-8 step arithmetic reasoning. Saturated by all frontier models; mostly useful as a smoke test today.

Math Text 12 results

Top results

1. Claude Opus 3: 95.0%
2. Nova Pro: 94.8%
3. Nova Lite: 94.5%

Last eval 03 Apr 2026

MATH Saturated

MATH (Hendrycks)

12.5k competition mathematics problems (AMC, AIME, USAMO style). Reported as overall % or split by Level 1-5 difficulty. The "easy" levels are now saturated; Level 5 still discri…

Math Text 17 results

Top results

1. Seed 1.5: 88.6%
2. Nemotron 3 Super: 84.8%
3. Command A: 80.0%

Last eval 03 Apr 2026

MATH-500

MATH-500 (OpenAI subset)

500-question subset of MATH popularised by OpenAI's o-series releases. Reported widely as the standard 'MATH' number on modern leaderboards.

Math Text 6 results

Top results

1. DeepSeek-R1: 97.3%
2. Claude Sonnet 3.7 (Thinking): 96.2%
3. Llama 4 Behemoth: 95.0%

Last eval 30 Apr 2025

USAMO 2025

USA Mathematical Olympiad 2025

Six proof-based problems from the 2025 USAMO. Graded out of 42 (7 points per problem) by expert judges.

Math Text 2 results

Top results

1. Grok 4 Heavy: 61.9 points
2. Grok 4: 37.5 points

Last eval 07 Sep 2025
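
The USAMO scoring arithmetic (six problems, 7 points each, 42 in total) can be sketched as follows. The helper names are hypothetical, and allowing fractional per-problem scores (for example, when several judges' marks are averaged) is an assumption.

```python
from typing import Sequence

NUM_PROBLEMS = 6
MAX_POINTS_PER_PROBLEM = 7
MAX_TOTAL = NUM_PROBLEMS * MAX_POINTS_PER_PROBLEM  # 42


def usamo_total(points: Sequence[float]) -> float:
    """Sum per-problem scores after checking each lies in 0..7."""
    if len(points) != NUM_PROBLEMS:
        raise ValueError(f"expected {NUM_PROBLEMS} problem scores")
    for p in points:
        if not 0 <= p <= MAX_POINTS_PER_PROBLEM:
            raise ValueError("each problem is graded from 0 to 7 points")
    return sum(points)


def as_percentage(total: float) -> float:
    """Express a total out of 42 as a percentage of the maximum."""
    return 100.0 * total / MAX_TOTAL


if __name__ == "__main__":
    total = usamo_total([7, 7, 3, 2, 1, 1])
    print(total, round(as_percentage(total), 1))
```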

ARC Challenge Saturated

AI2 Reasoning Challenge (Challenge set)

Grade-school science multiple-choice, hard subset. Saturated by frontier models but still in many evaluation harnesses.

Knowledge Text 10 results

Top results

1. Gemma 2: 554.0%
2. Claude Opus 3: 96.4%
3. Nemotron 3 Super: 96.1%

Last eval 03 Apr 2026

GPQA Diamond

Graduate-Level Google-Proof Q&A — Diamond subset

PhD-level multiple-choice questions in biology, physics, and chemistry, written by domain experts so non-experts cannot answer them even with web search. Diamond is the hardest c…

Knowledge Text 90 results

Top results

1. GPT-5.6 Sol: 94.6%
2. GPT 5.4 Pro: 94.4%
3. Gemini 3.1 Pro: 94.3%

Last eval 09 Jul 2026

Humanity's Last Exam

Humanity's Last Exam (HLE)

A 2,500-question exam crowdsourced from subject-matter experts across hundreds of disciplines. Designed to remain unsaturated by frontier models for as long as possible.

Knowledge Text 30 results

Top results

1. Claude Fable 5: 59.0%
2. Kimi K2.6: 54.0%
3. MiMo V2.5 Pro: 48.0%

Last eval 30 Jun 2026

MMLU Saturated

Massive Multitask Language Understanding

Multiple-choice questions across 57 academic subjects (humanities, STEM, social sciences, professional). Standard 5-shot accuracy. Largely saturated by frontier models.

Knowledge Text 22 results

Top results

1. GPT 4.1: 90.2%
2. GPT-4o: 88.7%
3. Seed 1.5: 88.6%

Last eval 03 Apr 2026

MMLU-Pro

A harder, more reasoning-focused replacement for MMLU. 10 answer choices instead of 4 and curated to remove trivially answerable items.

Knowledge Text 26 results

Top results

1. GPT OSS 120B: 90.0%
2. Qwen 3.6 27B: 86.2%
3. Qwen 3.5 27B: 86.1%

Last eval 03 Apr 2026

SimpleQA Verified

SimpleQA Verified (Epoch curation)

Short-form factual questions with single, unambiguous answers. Tests world knowledge and (critically) hallucination — refusing or hedging counts as not-correct.

Knowledge Text 2 results

Top results

1. Gemini 3 Pro (Thinking): 72.1%
2. Gemini 2.5 Flash-Lite: 10.7%

Last eval 18 Nov 2025
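
The SimpleQA scoring rule, where a refusal or hedge counts as not correct, can be sketched as below. The three grade labels and the extra "accuracy given attempted" figure are assumptions chosen for illustration; this page does not specify the grader's exact output.

```python
from collections import Counter
from typing import Dict, Sequence


def summarize_grades(grades: Sequence[str]) -> Dict[str, float]:
    """Aggregate per-question grades into headline rates.

    Assumed labels: "correct", "incorrect", "not_attempted".
    A refusal or hedge ("not_attempted") earns no credit toward accuracy.
    """
    if not grades:
        raise ValueError("no grades to summarize")
    counts = Counter(grades)
    total = len(grades)
    correct = counts["correct"]
    attempted = correct + counts["incorrect"]
    return {
        "accuracy": correct / total,
        "attempted_rate": attempted / total,
        # Separates "wrong when it answers" from "declines to answer".
        "accuracy_given_attempted": correct / attempted if attempted else 0.0,
    }


if __name__ == "__main__":
    print(summarize_grades(["correct", "incorrect", "not_attempted", "correct"]))
```

Reporting the attempted rate alongside accuracy shows whether a low score comes from wrong answers or from declining to answer.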

TriviaQA Saturated

Open-domain question answering over 650k trivia question/answer pairs. Long-tail world knowledge.

Knowledge Text 3 results

Top results

1. LLaMA 2: 85.0%
2. Mixtral 8x7B: 71.5%
3. Gemma 2: 59.4%

Last eval 25 Feb 2025

TruthfulQA

817 questions designed to elicit imitative falsehoods. Measures whether models repeat common misconceptions.

Knowledge Text 1 result

Top results

1. Mistral NeMo: 50.3%

Last eval 18 Jul 2024

ChartQA

Question answering over charts and plots, mixing extraction and visual-reasoning questions.

Multimodal Multimodal 5 results

Top results

1. Llama 4 Maverick: 90.0%
2. Llama 4 Scout: 88.8%
3. Pixtral Large: 88.1%

Last eval 16 Apr 2025

DocVQA

Document VQA over scanned business documents. Tests OCR-grounded reading.

Multimodal Multimodal 5 results

Top results

1. Llama 4 Maverick: 94.4%
2. Llama 4 Scout: 94.4%
3. Pixtral Large: 93.3%

Last eval 16 Apr 2025

MathVista

MathVista (testmini)

Mathematical reasoning over visual contexts: figures, charts, diagrams, geometric drawings.

Multimodal Multimodal 9 results

Top results

1. Qwen 3.6 27B: 87.4%
2. o3: 86.8%
3. o4 mini: 84.3%

Last eval 16 Apr 2025

MMMU

Massive Multi-discipline Multimodal Understanding

11.5k college-level questions across 30 subjects requiring image + text reasoning (charts, diagrams, medical scans, music notation, …).

Multimodal Multimodal 27 results

Top results

1. GPT 5.1: 84.2%
2. GPT 5 (Thinking): 84.2%
3. o3: 82.9%

Last eval 24 Apr 2026

MMMU-Pro

Harder MMMU variant: filters out text-only-solvable items and adds a vision-only setting where the question itself is rendered into the image.

Multimodal Multimodal 18 results

Top results

1. GPT-5.6 Sol: 83.0%
2. GPT 5.4: 81.2%
3. Gemini 3 Flash (Thinking): 81.2%

Last eval 09 Jul 2026

VideoMME

Video-MME

Comprehensive video-understanding benchmark across 30+ categories with short, medium and long videos.

Multimodal Multimodal 1 result

Top results

1. Qwen 3.6 27B: 87.7%

ARC-AGI-3

ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence

An interactive benchmark for studying agentic intelligence through novel, abstract, turn-based environments in which agents must explore, infer goals, build internal models of en…

Agentic Vision 3 results

Top results

1. GPT-5.6 Sol (Max): 7.80%
2. GPT-5.6 Terra (Max): 0.80%
3. GPT-5.6 Luna (Max): 0.20%

Last eval 09 Jul 2026

BFCL v3

Berkeley Function-Calling Leaderboard v3

Evaluates function/tool-calling correctness across single, parallel, multi-turn and irrelevance-detection scenarios.

Agentic Text 2 results

Top results

1. Qwen3 235B A22B: 70.8%
2. Qwen3 30B A3B: 69.1%

Last eval 28 Apr 2025

GeneBench-Pro

GeneBench-Pro: Evaluating Multistage Statistical Reasoning in Genomics, Quantitative Biology, and Translational Biomedicine

A 129-problem benchmark testing whether AI agents can perform realistic, multi-stage scientific analyses in genomics, quantitative biology, and translational biomedicine. Each pr…

Agentic Text 11 results

Top results

1. GPT 5.6 Sol (max): 28.7%
2. GPT-5.6 Terra (max): 23.3%
3. GPT-5.6 Luna (max): 16.5%

Last eval 30 Jun 2026

OSWorld

369 real desktop tasks across Ubuntu, Windows and macOS apps. Agents act through screenshots + mouse/keyboard.

Agentic Multimodal 7 results

Top results

1. GPT 5.4: 75.0%
2. GPT 5.3 Codex: 74.0%
3. Claude Sonnet 4.6: 72.5%

Last eval 05 Mar 2026

Tau2-Bench Telecom

τ²-Bench Telecom

Multi-turn customer-service agent benchmark in a telecom domain: the model must take real tool actions while a simulated customer pushes back on incomplete or wrong answers.

Agentic Text 14 results

Top results

1. Gemini 3.1 Pro: 99.3%
2. GPT 5.4: 98.9%
3. Claude Opus 4.5: 98.2%

Last eval 12 Jun 2026

Terminal-Bench 2.1

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Terminal-Bench 2.1 evaluates AI agents on 89 hard, realistic tasks in command-line terminal environments inspired by real workflows, spanning domains such as compiling code, trai…

Agentic Text 5 results

Top results

1. GPT-5.6 Sol: 88.8%
2. GPT 5.6 Terra: 84.3%
3. GPT 5.6 Luna: 82.5%

Last eval 30 Jun 2026

Terminal-Bench Hard

The hardest split of Terminal-Bench: agents must complete real CLI tasks (debugging, system admin, multi-step automation) inside a sandboxed terminal.

Agentic Text 26 results

Top results

1. GPT 5.5: 82.7%
2. GPT 5.3 Codex: 77.3%
3. GLM 4.6: 75.9%

Last eval 12 Jun 2026

ViralBench

ViralBench: The First AI Marketing Benchmark

ViralBench evaluates AI models on their ability to autonomously generate viral TikTok content in the fitness space, running each model twice daily in an agentic loop with tools f…

Agentic Multimodal 3 results

Top results

1. GPT 5.5: 3,945.0 views/day
2. Claude Opus 4.8: 3,201.0 views/day
3. Kimi K2.6: 2,306.0 views/day

Last eval 26 Jun 2026

IFBench

Instruction Following Benchmark

Measures how reliably a model follows complex multi-constraint instructions, a known weak spot for many otherwise strong models.

Language Text 5 results

Top results

1. Qwen 3.5 122B A10B: 76.1%
2. Nemotron 3: 71.5%
3. Qwen 3.5 35B A3B: 70.2%

Last eval 12 Jun 2026

IFEval

Instruction-Following Eval

Verifiable instruction-following: ~25 instruction types whose compliance can be checked deterministically (e.g. word counts, formats).

Language Text 12 results

Top results

1. Claude Sonnet 3.7 (Thinking): 93.2%
2. Nova Pro: 92.1%
3. Llama 3.3: 92.1%

Last eval 24 Feb 2026
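
The kind of deterministic compliance check the IFEval entry describes (word counts, formats) can be sketched as below. The three checker functions and their names are hypothetical stand-ins, not the benchmark's own instruction set.

```python
import json
from typing import Callable, Iterable


def within_word_limit(response: str, max_words: int) -> bool:
    """True if the response has at most max_words whitespace-separated words."""
    return len(response.split()) <= max_words


def is_valid_json(response: str) -> bool:
    """True if the whole response parses as JSON."""
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return False
    return True


def is_all_lowercase(response: str) -> bool:
    """True if the response contains no uppercase letters."""
    return response == response.lower()


def follows_all(response: str, checks: Iterable[Callable[[str], bool]]) -> bool:
    """A prompt with several constraints passes only if every check passes."""
    return all(check(response) for check in checks)


if __name__ == "__main__":
    reply = '{"summary": "short and lowercase"}'
    checks = [is_valid_json, is_all_lowercase, lambda r: within_word_limit(r, 10)]
    print(follows_all(reply, checks))
```

Because each check is a pure function of the response text, the same response always receives the same verdict, with no judge model involved.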

MGSM

Multilingual Grade School Math

GSM8K translated into 10 typologically diverse languages. Tests cross-lingual mathematical reasoning.

Language Text 5 results

Top results

1. Llama 3.3: 91.1%
2. Claude Opus 3: 90.7%
3. GPT-4o: 90.5%

Last eval 03 Apr 2026

RULER 128k

RULER (128k context)

Synthetic long-context evaluation suite measuring needle-in-a-haystack retrieval, multi-key retrieval and tracing across 128k-token contexts.

Language Text 1 result

Top results

1. Nemotron 3 Super: 88.3%

Last eval 03 Apr 2026
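
A needle-in-a-haystack test of the kind RULER includes can be sketched as follows: a single fact is planted at a chosen depth inside repeated filler text, and the model's answer is checked for that fact. This is a simplified illustration with hypothetical names. It measures length in sentences rather than tokens, and covers only the single-needle case, not multi-key retrieval or tracing.

```python
def build_haystack(filler: str, needle: str, n_sentences: int, depth: float) -> str:
    """Place needle among n_sentences copies of filler.

    depth is the relative position: 0.0 puts the needle first, 1.0 last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler] * n_sentences
    sentences.insert(round(depth * n_sentences), needle)
    return " ".join(sentences)


def retrieved(model_output: str, secret: str) -> bool:
    """Score a response as correct if it contains the planted secret."""
    return secret in model_output


if __name__ == "__main__":
    context = build_haystack(
        filler="The sky is blue.",
        needle="The magic number is 4821.",
        n_sentences=10,
        depth=0.5,
    )
    prompt = context + "\n\nWhat is the magic number?"
    print(len(prompt), retrieved("It is 4821.", "4821"))
```

Sweeping depth and context length over a grid shows where in the context retrieval starts to fail.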
