ARC Challenge

AI2 Reasoning Challenge (Challenge set)

Grade-school science multiple-choice questions, hard subset. Frontier models have saturated it, but it is still included in many evaluation harnesses.

Knowledge · Text · Accuracy · Max 100.0% · Released Mar 2018 · Saturated · Possibly contaminated
Results: 10
Models scored: 9
Top: Gemma 2, 554.0%
Median: 91.3%
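The summary figures can be reproduced from the results table on this page. The sketch below assumes they are computed over the ten rows flagged Primary; that assumption, and the names `PRIMARY` and `summarize`, are illustrative and not taken from the site. The Gemma 2 score is copied as listed, even though it exceeds the stated 100.0% maximum.

```python
from statistics import median

# (model, score in percent) for the rows flagged "Primary" in the results table.
# Assumption: the summary is computed over exactly these rows.
PRIMARY = [
    ("Gemma 2", 554.0),
    ("Claude Opus 3", 96.4),
    ("Nemotron 3 Super", 96.1),
    ("Nova Pro", 94.8),
    ("Nova Lite", 92.4),
    ("Nova Micro", 90.2),
    ("Llama 3.2", 78.6),
    ("Mixtral 8x7B", 59.7),
    ("Mixtral 8x7B", 59.7),
    ("Mistral 7B", 55.6),
]


def summarize(rows):
    """Return result count, distinct model count, top row, and median score."""
    scores = [score for _, score in rows]
    return {
        "results": len(rows),
        "models": len({model for model, _ in rows}),
        "top": max(rows, key=lambda row: row[1]),
        # Rounded to one decimal place to match the page's formatting.
        "median": round(median(scores), 1),
    }


summary = summarize(PRIMARY)
print(summary)
```

With an even number of results, the median is the mean of the two middle scores (90.2 and 92.4), which gives 91.3.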

Best results

Top primary scores; one row per model.
#1 Gemma 2: 554.0%
#4 Nova Pro: 94.8%
#5 Nova Lite: 92.4%
#7 Llama 3.2: 78.6%
#10 Mistral 7B: 55.6%

Frontier over time

Each dot is one model result; the line traces the running best score.
[Chart: Best score over time. Y-axis from 0.00 to 100.0; x-axis from Sep 2023 to Apr 2026.]
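The running best line is a cumulative maximum over results ordered by evaluation date. A minimal sketch, assuming the chart plots the rows flagged Primary in the results table; the function name `running_best` and its optional `cap` parameter are hypothetical and not part of the site. The `cap` option only illustrates dropping scores above the benchmark's stated maximum, which the page may or may not do.

```python
from datetime import date

# (model, score in percent, eval date) for rows flagged "Primary" in the results table.
RESULTS = [
    ("Mistral 7B", 55.6, date(2023, 9, 1)),
    ("Mixtral 8x7B", 59.7, date(2023, 12, 1)),
    ("Mixtral 8x7B", 59.7, date(2024, 1, 8)),
    ("Claude Opus 3", 96.4, date(2024, 10, 22)),
    ("Llama 3.2", 78.6, date(2024, 10, 22)),
    ("Nova Pro", 94.8, date(2024, 12, 3)),
    ("Nova Lite", 92.4, date(2024, 12, 3)),
    ("Nova Micro", 90.2, date(2024, 12, 3)),
    ("Gemma 2", 554.0, date(2025, 2, 25)),
    ("Nemotron 3 Super", 96.1, date(2026, 4, 3)),
]


def running_best(results, cap=None):
    """Return (eval_date, best_score_so_far) pairs in date order.

    If cap is given, scores above it are skipped (hypothetical filter).
    """
    frontier = []
    best = float("-inf")
    for _, score, when in sorted(results, key=lambda row: row[2]):
        if cap is not None and score > cap:
            continue
        best = max(best, score)
        frontier.append((when, best))
    return frontier


for when, best in running_best(RESULTS, cap=100.0):
    print(when.isoformat(), best)
```

Without `cap`, the Gemma 2 entry as listed would dominate the line from Feb 2025 onward; with `cap=100.0`, the running best stays at the Claude Opus 3 score from Oct 2024.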

All results

Showing all configurations, including non-primary alternates.
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|-------|-------|------------|-----------|--------|-------|
| 1 | Gemma 2 | 554.0% | | 25 Feb 2025 | Self-reported | Primary |
| 2 | Claude Opus 3 | 96.4% | 25-shot | 22 Oct 2024 | Self-reported | Primary |
| 3 | Nemotron 3 Super | 96.1% | 25-shot | 03 Apr 2026 | Self-reported | Primary |
| 4 | Nova Pro | 94.8% | 0-shot | 03 Dec 2024 | Self-reported | Primary |
| 5 | Nova Lite | 92.4% | 0-shot | 03 Dec 2024 | Self-reported | Primary |
| 6 | Claude 2 | 91.0% | 5-shot · standard | 11 Jul 2023 | Self-reported | |
| 7 | Nova Micro | 90.2% | 0-shot | 03 Dec 2024 | Self-reported | Primary |
| 8 | Claude Haiku 3 | 89.2% | 25-shot · standard | 04 Mar 2024 | Self-reported | |
| 9 | GPT 3.5 | 85.2% | 25-shot · standard | 14 Mar 2023 | Self-reported | |
| 10 | Llama 3.2 | 78.6% | 0-shot | 22 Oct 2024 | Self-reported | Primary |
| 11 | Mixtral 8x7B | 59.7% | | 01 Dec 2023 | Self-reported | Primary |
| 12 | Mixtral 8x7B | 59.7% | | 08 Jan 2024 | Self-reported | Primary |
| 13 | Mistral 7B | 55.6% | | 01 Sep 2023 | Self-reported | Primary |