AI benchmarks

Browse benchmarks tracking how AI models perform on reasoning, coding, multimodal and knowledge tasks. Each benchmark has its own leaderboard with the latest results from frontier models.

Benchmarks

Cross-benchmark model leaderboard

Compare how every tracked model ranks across the headline benchmarks in one matrix.

Aider Polyglot

Aider Polyglot Coding Benchmark

225 hard Exercism programming exercises across 6 languages (C++, Go, Java, JavaScript, Python, Rust). Measures whole-file edit accuracy under a realistic agentic-coding harness.

Coding Text 12 results

Last eval: 24 Nov 2025

CodeForces ELO

CodeForces Live Contest ELO

Live competitive-programming ELO inferred from a model's performance on recent Codeforces rounds. Reported as a Codeforces rating.

Coding Text 12 results

Last eval: 24 Apr 2026
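As a rough illustration of what an ELO-style rating means, the sketch below uses the textbook Elo expected-score formula. This is an assumption for illustration only: Codeforces uses its own rating variant, and how this leaderboard infers a model's rating from contest results is not described here. The function name `elo_expected_score` is hypothetical.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the standard Elo model.

    Returns a value in (0, 1): 0.5 means evenly matched, and a 400-point
    rating advantage corresponds to roughly a 10-to-1 expected score ratio.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    # Hypothetical ratings, chosen only to show the shape of the curve.
    for a, b in [(1500, 1500), (1900, 1500), (1500, 1900)]:
        print(a, b, round(elo_expected_score(a, b), 3))
```

Under this model, a higher reported rating says the model would be expected to outscore lower-rated participants more often; it does not by itself say how many problems were solved.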

HumanEval (saturated)

HumanEval (pass@1)

OpenAI's 164 hand-written Python programming problems with unit tests. The original code-LLM benchmark; now saturated and broadly considered contaminated in modern training corpo…

Coding Text 14 results

Last eval: 03 Apr 2026
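For readers unfamiliar with the pass@1 metric named above, here is a minimal sketch of the unbiased pass@k estimator introduced alongside HumanEval: with n sampled solutions per problem, of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). Whether the results on this leaderboard use multiple samples or a single greedy sample per problem is not stated, so treat the sampling setup as an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples generated, c: samples that passed all tests,
    k: number of attempts the metric allows.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n per problem, so with one sample per problem pass@1 is simply the fraction of problems solved.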

HumanEval+ (saturated)

HumanEval+ (EvalPlus)

HumanEval with substantially expanded test cases (~80x more) to catch wrong-but-passing solutions.

Coding Text 2 results

Last eval: 08 Jul 2025
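To make "wrong-but-passing" concrete, the toy example below shows a buggy solution that passes a sparse test set but fails once more test cases are added. The problem, the function `buggy_median`, and both test lists are hypothetical and are not taken from HumanEval or EvalPlus.

```python
def buggy_median(values):
    """Incorrect median: ignores the even-length case."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def passes(solution, tests):
    """Return True if the solution matches the expected output on every test."""
    return all(solution(inp) == expected for inp, expected in tests)


# Sparse tests: only odd-length inputs, so the bug stays hidden.
sparse_tests = [([3, 1, 2], 2), ([5], 5)]

# Expanded tests: adding even-length inputs exposes the bug.
expanded_tests = sparse_tests + [([1, 2, 3, 4], 2.5), ([4, 2], 3.0)]

if __name__ == "__main__":
    print("sparse:", passes(buggy_median, sparse_tests))
    print("expanded:", passes(buggy_median, expanded_tests))
```

The same solution counts as a pass under the sparse suite and as a failure under the expanded one, which is why scores on an expanded-test benchmark tend to be lower than on the original.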

LiveCodeBench

Continuously refreshed competitive-programming problems sourced from LeetCode, AtCoder, and Codeforces, restricted to problems released after the evaluated model's knowledge cutoff. Designed to stay contamination-free.

Coding Text 28 results

Last eval: 24 Apr 2026
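The contamination-avoidance idea described above amounts to a date filter: only problems released after a model's knowledge cutoff are scored. The sketch below illustrates that filter under assumed data; the `Problem` record, its field names, and the sample entries are hypothetical and do not reflect LiveCodeBench's actual schema or tooling.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str   # hypothetical identifier
    source: str       # e.g. "LeetCode", "AtCoder", "Codeforces"
    released: date    # date the problem was first published


def after_cutoff(problems: list[Problem], knowledge_cutoff: date) -> list[Problem]:
    """Keep only problems released strictly after the model's knowledge cutoff."""
    return [p for p in problems if p.released > knowledge_cutoff]


if __name__ == "__main__":
    # Made-up problems for illustration only.
    pool = [
        Problem("p1", "LeetCode", date(2024, 1, 10)),
        Problem("p2", "AtCoder", date(2025, 3, 2)),
        Problem("p3", "Codeforces", date(2025, 9, 15)),
    ]
    eligible = after_cutoff(pool, knowledge_cutoff=date(2025, 1, 1))
    print([p.problem_id for p in eligible])
```

Because the eligible set depends on each model's cutoff, two models may be scored on different problem windows; comparisons are cleanest when restricted to a window that postdates both cutoffs.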

MBPP (saturated)

Mostly Basic Python Problems

974 short Python problems with test cases. One of the original code-LLM benchmarks; saturated by frontier models.

Coding Text 5 results

Last eval: 03 Apr 2026

SWE-bench Multimodal

Variant of SWE-bench where issues include screenshots, diagrams and other visual context. Tests multimodal software-engineering ability.

Coding Multimodal 1 result

Top results

DeepSeek 3.2

70.2%

Last eval: 01 Dec 2025

SWE-bench Verified

500 manually validated GitHub issues from popular Python repos. Models must produce a patch that passes the hidden test suite. The current standard for "real software engineering…

Coding Text 53 results

Last eval: 27 Apr 2026