AI benchmarks

Browse benchmarks tracking how AI models perform on reasoning, coding, multimodal and knowledge tasks. Each benchmark has its own leaderboard with the latest results from frontier models.

Benchmarks: 56
Categories: 8
Results recorded: 538
Models scored: 113

Cross-benchmark model leaderboard

Compare how every tracked model ranks across the headline benchmarks in one matrix.
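
One way to picture such a matrix: rows are models, columns are benchmarks, and a cell stays empty when a model has no recorded result. The sketch below is illustrative only. It uses a few scores listed on this page, and the `build_matrix` helper and its data layout are hypothetical, not the site's implementation.

```python
from collections import defaultdict

# (model, benchmark, score) triples copied from the top results on this page.
RESULTS = [
    ("Claude Opus 4.5", "Aider Polyglot", 89.4),
    ("Claude Opus 4.5", "SWE-bench Verified", 80.9),
    ("Deepseek V4 Pro", "LiveCodeBench", 93.5),
    ("Kimi K2.6", "LiveCodeBench", 89.6),
]


def build_matrix(results):
    """Hypothetical helper: return (benchmarks, rows).

    rows maps each model to a list of scores aligned with `benchmarks`,
    using None where the model has no result for that benchmark.
    """
    benchmarks = sorted({bench for _, bench, _ in results})
    cells = defaultdict(dict)
    for model, bench, score in results:
        cells[model][bench] = score
    rows = {
        model: [cells[model].get(bench) for bench in benchmarks]
        for model in sorted(cells)
    }
    return benchmarks, rows


benchmarks, rows = build_matrix(RESULTS)
print(benchmarks)
for model, scores in rows.items():
    print(model, scores)
```

Keeping missing cells as `None` rather than 0 avoids treating "not evaluated" as a failing score.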

Aider Polyglot

Aider Polyglot Coding Benchmark

225 hard Exercism programming exercises across 6 languages (C++, Go, Java, JavaScript, Python, Rust). Measures whole-file edit accuracy under a realistic agentic-coding harness.

Coding | Text | 12 results
Top results
1. Claude Opus 4.5: 89.4%
2. GPT 5.1: 88.0%
3. GPT 5 (Thinking): 88.0%
Last eval: Nov 24, 2025

CodeForces ELO

CodeForces Live Contest ELO

Live competitive-programming Elo rating inferred from a model's performance on recent Codeforces rounds. Reported as a Codeforces rating.
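
To give a feel for what a rating gap means, the sketch below applies the textbook Elo expected-score formula to two ratings from this leaderboard. This page does not say how the rating is inferred from contest results, so treating the numbers as plain Elo ratings is an assumption made here for illustration, and `expected_score` is a hypothetical helper name.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Textbook Elo expected score of A against B (logistic curve, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings taken from the top two entries on this page.
top, second = 3206.0, 2719.0
p = expected_score(top, second)
print(f"Expected score of the higher-rated entry: {p:.3f}")
```

Equal ratings give an expected score of 0.5, and the two players' expected scores always sum to 1.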

Coding | Text | 12 results
Top results
1. Deepseek V4 Pro: 3,206.0 rating
2. o4 mini: 2,719.0 rating
3. o3: 2,706.0 rating
Last eval: Apr 24, 2026

HumanEval (Saturated)

HumanEval (pass@1)

OpenAI's 164 hand-written Python programming problems with unit tests. The original code-LLM benchmark; now saturated and broadly considered contaminated in modern training corpo…
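
The pass@1 metric in the title can be sketched with the unbiased pass@k estimator commonly used with HumanEval: draw n samples per problem, count the c samples that pass every unit test, and estimate the chance that at least one of k samples passes. The sample counts below are hypothetical and are not results from this leaderboard; the benchmark score is the mean of the per-problem values.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples passing all unit tests, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Probability that a random size-k subset contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical (n, c) counts for three problems.
per_problem = [(10, 3), (10, 10), (10, 0)]
scores = [pass_at_k(n, c, k=1) for n, c in per_problem]
benchmark_pass_at_1 = sum(scores) / len(scores)
print(scores, benchmark_pass_at_1)
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass.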

Coding | Text | 14 results
Top results
1. GPT-4o: 90.2%
2. Llama 3.3: 88.4%
3. Claude Haiku 3.5: 88.1%
Last eval: Apr 3, 2026

HumanEval+ (Saturated)

HumanEval+ (EvalPlus)

HumanEval with substantially expanded test cases (~80x more) to catch wrong-but-passing solutions.
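
A wrong-but-passing solution is one that satisfies a small test set while still being incorrect. The sketch below makes that concrete with a deliberately buggy primality check; the function, both test sets, and the `passes` helper are hypothetical illustrations and are not taken from HumanEval or HumanEval+.

```python
def is_prime_buggy(n: int) -> bool:
    """Hypothetical model output with an off-by-one bug in the divisor range."""
    if n < 2:
        return False
    # Bug: the upper bound should be int(n ** 0.5) + 1, so some divisors are skipped.
    for d in range(2, int(n ** 0.5)):
        if n % d == 0:
            return False
    return True


def passes(solution, tests) -> bool:
    """Return True only if the solution matches the expected output on every test."""
    return all(solution(value) == expected for value, expected in tests)


# A small original-style test set that happens to miss the bug.
base_tests = [(2, True), (7, True), (10, False), (12, False)]
# An expanded set adding perfect squares, which expose the bug.
expanded_tests = base_tests + [(4, False), (9, False), (25, False)]

print(passes(is_prime_buggy, base_tests))
print(passes(is_prime_buggy, expanded_tests))
```

The buggy function clears the base set but fails once the extra cases are added, which is the failure mode that expanded test suites are meant to catch.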

Coding | Text | 2 results
Top results
1. Phi 4 reasoning plus: 92.3%
2. WizardCoder: 64.6%
Last eval: Jul 8, 2025

LiveCodeBench

Continuously refreshed competitive-programming problems sourced from LeetCode, AtCoder, and Codeforces after the model's knowledge cutoff. Designed to stay contamination-free.

Coding | Text | 26 results
Top results
1. Deepseek V4 Pro: 93.5%
2. Kimi K2.6: 89.6%
3. Kimi K2.5: 85.0%
Last eval: Apr 24, 2026

MBPP (Saturated)

Mostly Basic Python Problems

974 short Python problems with test cases. One of the original code-LLM benchmarks; saturated by frontier models.

Coding | Text | 5 results
Top results
1. Nemotron 3 Super: 78.4%
2. Code Llama: 66.7%
3. Mixtral 8x7B: 60.7%
Last eval: Apr 3, 2026

SWE-bench Multimodal

Variant of SWE-bench where issues include screenshots, diagrams and other visual context. Tests multimodal software-engineering ability.

Coding | Multimodal | 1 result
Top results
1. Deepseek 3.2: 70.2%
Last eval: Dec 1, 2025

SWE-bench Verified

500 manually validated GitHub issues from popular Python repos. Models must produce a patch that passes the hidden test suite. The current standard for "real software engineering…

Coding | Text | 49 results
Top results
1. Claude Opus 4.7: 87.6%
2. Claude Opus 4.5: 80.9%
3. Claude Opus 4.6: 80.8%
Last eval: Apr 27, 2026