AI benchmarks

Browse benchmarks tracking how AI models perform on reasoning, coding, multimodal and knowledge tasks. Each benchmark has its own leaderboard with the latest results from frontier models.

Benchmarks

Cross-benchmark model leaderboard

Compare how every tracked model ranks across the headline benchmarks in one matrix.

AA-LCR

Artificial Analysis Long-Context Reasoning

Long-context reasoning evaluation across documents up to 128k tokens. Used as one component of the Artificial Analysis Intelligence Index.

Reasoning Text 1 result

Top results

Kimi K2.7 Code: 66.3%

Last eval 12 Jun 2026

ARC-AGI-2

Abstraction and Reasoning Corpus — AGI v2

Abstract visual-grid puzzles designed to resist memorisation. Each task can be solved by humans from a few examples; LLMs typically struggle without test-time adaptation.

Reasoning Text 17 results

Last eval 23 Apr 2026

BIG-Bench Hard

The 23-task subset of BIG-Bench where prior LLMs underperformed humans. Mix of logical, algorithmic, and language-understanding tasks.

Reasoning Text 6 results

Last eval 22 Jan 2025

DROP (Saturated)

Discrete Reasoning Over Paragraphs

Reading-comprehension benchmark requiring discrete operations (addition, counting, sorting) over passages. Mostly saturated by frontier models.

Reasoning Text 8 results

Last eval 16 Apr 2025

LiveBench

LiveBench (Reasoning Avg)

Continuously refreshed benchmark across reasoning, coding, math, data analysis, language and instruction following. New questions every month to stay contamination-free.

Reasoning Text 2 results

Last eval 28 Apr 2025