ARC-AGI-2

Abstraction and Reasoning Corpus for Artificial General Intelligence, version 2

Abstract visual-grid puzzles designed to resist memorisation. Each task can be solved by humans from a few examples; LLMs typically struggle without test-time adaptation.
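To make "solved from a few examples" concrete, here is a minimal Python sketch. It assumes the JSON layout used by the public ARC task files: a `train` list of input/output demonstration pairs and a `test` list, with each grid given as rows of integer colour codes 0 to 9. The toy task, the recolouring rule and the function names are hypothetical, and the task is far simpler than real ARC-AGI-2 tasks.

```python
from typing import Dict, List, Optional

Grid = List[List[int]]  # rows of integer colour codes 0-9

# Hypothetical toy task in the ARC JSON layout: two demonstration pairs
# plus one test input whose output must be produced.
toy_task = {
    "train": [
        {"input": [[1, 0], [0, 1]], "output": [[2, 0], [0, 2]]},
        {"input": [[1, 1], [0, 0]], "output": [[2, 2], [0, 0]]},
    ],
    "test": [{"input": [[0, 1], [1, 1]]}],
}


def induce_colour_map(train: List[dict]) -> Optional[Dict[int, int]]:
    """Infer a cell-wise colour substitution consistent with every demo pair.

    Returns None if a pair changes the grid shape or if the pairs
    contradict each other (the task is not a pure recolouring).
    """
    mapping: Dict[int, int] = {}
    for pair in train:
        src, dst = pair["input"], pair["output"]
        if len(src) != len(dst) or any(len(a) != len(b) for a, b in zip(src, dst)):
            return None
        for row_in, row_out in zip(src, dst):
            for a, b in zip(row_in, row_out):
                if mapping.setdefault(a, b) != b:
                    return None  # one input colour mapped to two outputs
    return mapping


def apply_colour_map(grid: Grid, mapping: Dict[int, int]) -> Grid:
    # Colours never seen in the demonstrations are left unchanged.
    return [[mapping.get(c, c) for c in row] for row in grid]


rule = induce_colour_map(toy_task["train"])
assert rule is not None
prediction = apply_colour_map(toy_task["test"][0]["input"], rule)
print(prediction)  # [[0, 2], [2, 2]]
```

A human spots this rule from two pairs; the difficulty of the benchmark comes from tasks whose rules are much less obvious than a fixed colour substitution.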

Reasoning · Text · Metric: accuracy (max 100.0%) · Released Mar 2025
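The accuracy metric can be read as the percentage of test outputs solved exactly. The sketch below assumes the ARC Prize convention that a prediction counts only on an exact cell-for-cell match, that up to two attempts are allowed per test output, and that the benchmark score is the mean of per-task scores; these are assumptions about the scoring harness, not details stated on this page, and the function names are hypothetical.

```python
from typing import List, Sequence

Grid = List[List[int]]


def task_score(attempts: Sequence[Sequence[Grid]], targets: Sequence[Grid],
               max_attempts: int = 2) -> float:
    """Fraction of a task's test outputs matched exactly by an allowed attempt.

    attempts[i] holds the candidate grids submitted for test output i;
    only the first max_attempts candidates are considered.
    """
    if len(attempts) != len(targets):
        raise ValueError("one attempt list per test output is required")
    solved = sum(
        any(guess == target for guess in guesses[:max_attempts])
        for guesses, target in zip(attempts, targets)
    )
    return solved / len(targets)


def benchmark_accuracy(task_scores: Sequence[float]) -> float:
    # Mean task score as a percentage, so the maximum is 100.0%.
    return 100.0 * sum(task_scores) / len(task_scores)


scores = [
    task_score([[[[1]], [[2]]]], [[[2]]]),           # second attempt matches: 1.0
    task_score([[[[0, 1]]]], [[[1, 0]]]),            # cells in wrong order: 0.0
    task_score([[[[3]]], [[[4]]]], [[[3]], [[5]]]),  # 1 of 2 test outputs: 0.5
]
print(benchmark_accuracy(scores))  # 50.0
```

Exact matching means there is no partial credit for a nearly correct grid, which is one reason scores on this benchmark can stay low even when a model's outputs look plausible.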


Results

Models scored: 17

Top score: 85.0% (GPT 5.5)

Median score: 45.1%
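The summary figures follow from the All results table. A minimal Python sketch that recomputes them, assuming "Models scored" counts the canonical rows and the median is taken over their primary scores:

```python
import statistics

# Primary scores (%) copied from the "All results" table on this page.
scores = {
    "GPT 5.5": 85.0,
    "GPT 5.4 Pro": 83.3,
    "Gemini 3.1 Pro": 77.1,
    "GPT 5.4": 73.3,
    "Claude Opus 4.6": 69.2,
    "Claude Sonnet 4.6": 58.3,
    "GPT 5.2 Pro": 54.2,
    "GPT 5.2 Thinking": 52.9,
    "Gemini 3 Deep Think": 45.1,
    "Muse Spark": 42.5,
    "Claude Opus 4.5": 37.6,
    "Gemini 3 Flash (Thinking)": 33.6,
    "Gemini 3 Pro": 31.1,
    "GPT 5.1": 17.6,
    "Grok 4": 15.9,
    "Gemini 2.5 Pro (Thinking)": 4.9,
    "Gemini 2.5 Flash (Thinking)": 2.5,
}

models_scored = len(scores)
top_model = max(scores, key=scores.get)
# With 17 (an odd number of) scores, the median is the 9th value in sorted order.
median_score = statistics.median(list(scores.values()))
print(models_scored, top_model, median_score)  # 17 GPT 5.5 45.1
```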

Best results

Top primary scores; one row per model. [Chart of the ten highest primary scores, from 85.0% (GPT 5.5) to 42.5% (Muse Spark).]

Frontier over time

Each dot is one model result; the line traces the running best score.
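The frontier line can be reproduced from the All results table. A minimal Python sketch, assuming the x-axis is the eval date and that a result joins the frontier only when it strictly beats every earlier score; same-day results are visited best-first. The function name `running_frontier` is hypothetical.

```python
from datetime import datetime
from typing import List, Tuple

# (model, primary score %, eval date) rows from the "All results" table.
RESULTS: List[Tuple[str, float, str]] = [
    ("GPT 5.5", 85.0, "23 Apr 2026"),
    ("GPT 5.4 Pro", 83.3, "05 Mar 2026"),
    ("Gemini 3.1 Pro", 77.1, "19 Feb 2026"),
    ("GPT 5.4", 73.3, "05 Mar 2026"),
    ("Claude Opus 4.6", 69.2, "05 Feb 2026"),
    ("Claude Sonnet 4.6", 58.3, "17 Feb 2026"),
    ("GPT 5.2 Pro", 54.2, "11 Dec 2025"),
    ("GPT 5.2 Thinking", 52.9, "11 Dec 2025"),
    ("Gemini 3 Deep Think", 45.1, "12 Feb 2026"),
    ("Muse Spark", 42.5, "08 Apr 2026"),
    ("Claude Opus 4.5", 37.6, "24 Nov 2025"),
    ("Gemini 3 Flash (Thinking)", 33.6, "17 Dec 2025"),
    ("Gemini 3 Pro", 31.1, "18 Nov 2025"),
    ("GPT 5.1", 17.6, "13 Nov 2025"),
    ("Grok 4", 15.9, "09 Jul 2025"),
    ("Gemini 2.5 Pro (Thinking)", 4.9, "17 Dec 2025"),
    ("Gemini 2.5 Flash (Thinking)", 2.5, "17 Dec 2025"),
]


def running_frontier(rows: List[Tuple[str, float, str]]) -> List[Tuple[str, str, float]]:
    """Return (date, model, score) for each result that set a new best score."""
    ordered = sorted(
        rows, key=lambda r: (datetime.strptime(r[2], "%d %b %Y"), -r[1])
    )
    frontier: List[Tuple[str, str, float]] = []
    best = float("-inf")
    for model, score, date in ordered:
        if score > best:  # strictly better than every earlier result
            best = score
            frontier.append((date, model, score))
    return frontier


for date, model, score in running_frontier(RESULTS):
    print(f"{date}  {model:<16} {score:.1f}%")
```

On these 17 results the sketch finds nine record-setting entries, from Grok 4 (15.9%, Jul 2025) to GPT 5.5 (85.0%, Apr 2026).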

All results

Showing one canonical row per model.

#	Model	Score	Conditions	Eval date	Source	Flags
1	GPT 5.5	85.0%	CoT	23 Apr 2026	Self-reported	Primary
2	GPT 5.4 Pro	83.3%	CoT	05 Mar 2026	Self-reported	Primary
3	Gemini 3.1 Pro	77.1%	CoT	19 Feb 2026	Self-reported	Primary
4	GPT 5.4	73.3%	CoT	05 Mar 2026	Self-reported	Primary
5	Claude Opus 4.6	69.2%	—	05 Feb 2026	Self-reported	Primary
6	Claude Sonnet 4.6	58.3%	CoT	17 Feb 2026	Self-reported	Primary
7	GPT 5.2 Pro	54.2%	—	11 Dec 2025	Self-reported	Primary
8	GPT 5.2 Thinking	52.9%	CoT	11 Dec 2025	Self-reported	Primary
9	Gemini 3 Deep Think	45.1%	CoT	12 Feb 2026	Self-reported	Primary
10	Muse Spark	42.5%	CoT	08 Apr 2026	Self-reported	Primary
11	Claude Opus 4.5	37.6%	—	24 Nov 2025	Self-reported	Primary
12	Gemini 3 Flash (Thinking)	33.6%	—	17 Dec 2025	Self-reported	Primary
13	Gemini 3 Pro	31.1%	CoT	18 Nov 2025	Self-reported	Primary
14	GPT 5.1	17.6%	—	13 Nov 2025	Self-reported	Primary
15	Grok 4	15.9%	CoT	09 Jul 2025	Self-reported	Primary
16	Gemini 2.5 Pro (Thinking)	4.90%	—	17 Dec 2025	Self-reported	Primary
17	Gemini 2.5 Flash (Thinking)	2.50%	—	17 Dec 2025	Self-reported	Primary
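The page does not say how the canonical row is chosen when a model has several configurations. A minimal sketch, assuming the row flagged Primary wins and the higher score breaks any remaining tie; the `Row` type, the function name and the placeholder models below are hypothetical and are not real results.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Row:
    model: str
    score: float      # accuracy, %
    conditions: str   # e.g. "CoT"; empty where the table shows a dash
    primary: bool     # the "Primary" flag


def canonical_rows(rows: List[Row]) -> List[Row]:
    """Keep one row per model (Primary first, then best score), ranked by score."""
    chosen: Dict[str, Row] = {}
    for row in rows:
        current = chosen.get(row.model)
        # Compare (is primary, score) tuples; the larger tuple wins.
        if current is None or (row.primary, row.score) > (current.primary, current.score):
            chosen[row.model] = row
    return sorted(chosen.values(), key=lambda r: r.score, reverse=True)


# Hypothetical configurations for two placeholder models.
rows = [
    Row("model-a", 40.0, "CoT", primary=True),
    Row("model-a", 47.5, "CoT", primary=False),  # higher, but not primary
    Row("model-b", 30.0, "", primary=False),
    Row("model-b", 35.0, "CoT", primary=False),
]
for rank, row in enumerate(canonical_rows(rows), start=1):
    print(rank, row.model, f"{row.score:.1f}%")
# prints "1 model-a 40.0%" then "2 model-b 35.0%"
```

Preferring the Primary flag over the raw maximum keeps a model's headline number stable when extra, non-primary configurations are added.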
