AI benchmarks
Browse benchmarks that track how AI models perform on reasoning, coding, multimodal, and knowledge tasks. Each benchmark has its own leaderboard with the latest results from frontier models.
Cross-benchmark model leaderboard
Compare how every tracked model ranks across the headline benchmarks in one matrix.
ARC-AGI-2
Abstract visual-grid puzzles designed to resist memorisation. Each task can be solved by humans from a few examples; LLMs typically struggle without test-time adaptation.
BIG-Bench Hard
The 23-task subset of BIG-Bench on which earlier language models failed to outperform the average human rater. It mixes logical, algorithmic, and language-understanding tasks.
DROP (saturated)
Reading-comprehension benchmark requiring discrete operations (addition, counting, sorting) over passages. Mostly saturated by frontier models.
LiveBench
Continuously refreshed benchmark covering reasoning, coding, math, data analysis, language, and instruction following. New questions are added every month to limit test-set contamination.
