AI benchmarks
Browse benchmarks tracking how AI models perform on reasoning, coding, multimodal and knowledge tasks. Each benchmark has its own leaderboard with the latest results from frontier models.
Cross-benchmark model leaderboard
Compare how every tracked model ranks across the headline benchmarks in one matrix.
Aider Polyglot
225 hard Exercism programming exercises across 6 languages (C++, Go, Java, JavaScript, Python, Rust). Measures whole-file edit accuracy under a realistic agentic-coding harness.
Codeforces Elo
Live competitive-programming Elo rating inferred from a model's performance on recent Codeforces rounds. Reported on the Codeforces rating scale.
HumanEval (Saturated)
OpenAI's 164 hand-written Python programming problems with unit tests. The original code-LLM benchmark; now saturated and broadly considered contaminated in modern training corpo…
HumanEval+ (Saturated)
HumanEval with substantially expanded test cases (~80x more) to catch wrong-but-passing solutions.
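A minimal sketch of what a "wrong-but-passing" solution looks like. The function, the test cases, and the helper name `passes` are all hypothetical and are not taken from HumanEval or HumanEval+; they only illustrate how a larger test set can expose a bug that a small one misses.

```python
def is_prime_buggy(n: int) -> bool:
    """Hypothetical model output with two bugs:
    it never checks n < 2, and the loop stops one divisor too early."""
    for d in range(2, int(n ** 0.5)):  # bug: upper bound should be int(n ** 0.5) + 1
        if n % d == 0:
            return False
    return True


def passes(func, cases) -> bool:
    """Return True only if func gives the expected result for every case."""
    return all(func(value) == expected for value, expected in cases)


# A small, hand-written test set (illustrative, not the benchmark's tests).
BASE_TESTS = [(7, True), (12, False), (13, True)]

# An expanded set adding edge cases: 1 and perfect squares of primes.
EXTENDED_TESTS = BASE_TESTS + [(1, False), (4, False), (9, False), (25, False)]

base_ok = passes(is_prime_buggy, BASE_TESTS)
extended_ok = passes(is_prime_buggy, EXTENDED_TESTS)

print("passes base tests:", base_ok)
print("passes extended tests:", extended_ok)
```

Under the small test set the buggy function counts as solved; the added edge cases reject it.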
LiveCodeBench
Continuously refreshed competitive-programming problems from LeetCode, AtCoder, and Codeforces, restricted to those published after the model's knowledge cutoff. Designed to stay contamination-free.
MBPP (Saturated)
974 short Python problems with test cases. One of the original code-LLM benchmarks; saturated by frontier models.
SWE-bench Multimodal
Variant of SWE-bench where issues include screenshots, diagrams and other visual context. Tests multimodal software-engineering ability.
SWE-bench Verified
500 manually validated GitHub issues from popular Python repos. Models must produce a patch that passes the hidden test suite. The current standard for "real software engineering…
