TAAFT

ARC Challenge

AI2 Reasoning Challenge (Challenge set)

The hard subset of a grade-school science multiple-choice benchmark. Frontier models have saturated it, but it is still included in many evaluation harnesses.

Knowledge · Text · Accuracy · Max 100.0% · Released Mar 2018 · Saturated · Possibly contaminated
Results: 10
Models scored: 9
Top: Gemma 2 (554.0%)
Median: 91.3%
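The summary figures can be reproduced from the ten rows of the All results table. The following is a minimal sketch, assuming the median is taken over all ten result rows and that "models scored" counts distinct model names; the variable names are illustrative and not part of the page.

```python
from statistics import median

# (model, score in percent) pairs copied from the "All results" table.
results = [
    ("Gemma 2", 554.0),
    ("Claude Opus 3", 96.4),
    ("Nemotron 3 Super", 96.1),
    ("Nova Pro", 94.8),
    ("Nova Lite", 92.4),
    ("Nova Micro", 90.2),
    ("Llama 3.2", 78.6),
    ("Mixtral 8x7B", 59.7),
    ("Mixtral 8x7B", 59.7),
    ("Mistral 7B", 55.6),
]

scores = [score for _, score in results]

n_results = len(results)                          # every row counts as a result
n_models = len({name for name, _ in results})     # distinct model names (assumption)
top_model, top_score = max(results, key=lambda r: r[1])
median_score = round(median(scores), 1)           # mean of the two middle values

print(n_results, n_models, top_model, top_score, median_score)
```

Under these assumptions the sketch yields 10 results, 9 models, Gemma 2 as the top entry, and a median of 91.3, matching the figures shown.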

Best results

Top primary scores; one row per model.
# Score
1 554.0%
4 94.8%
5 92.4%
7 78.6%
10 55.6%

Frontier over time

Each dot is one model result; the line traces the running best score.
[Chart: Best score over time. Y-axis: 0.00 to 100.0. X-axis: Sep 2023 to Apr 2026.]
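The "running best" line described above can be sketched as a cumulative maximum over results ordered by evaluation date. This is an illustrative sketch, not the page's own plotting code: the scores and dates are copied from the All results table, and skipping any score above the stated 100.0% maximum (so the line stays within the chart's axis range) is an assumption.

```python
from datetime import date
from itertools import accumulate

MAX_SCORE = 100.0  # "Max 100.0%" from the benchmark header

# (eval date, model, score in percent) copied from the "All results" table.
results = [
    (date(2025, 2, 25), "Gemma 2", 554.0),
    (date(2024, 10, 22), "Claude Opus 3", 96.4),
    (date(2026, 4, 3), "Nemotron 3 Super", 96.1),
    (date(2024, 12, 3), "Nova Pro", 94.8),
    (date(2024, 12, 3), "Nova Lite", 92.4),
    (date(2024, 12, 3), "Nova Micro", 90.2),
    (date(2024, 10, 22), "Llama 3.2", 78.6),
    (date(2023, 12, 1), "Mixtral 8x7B", 59.7),
    (date(2024, 1, 8), "Mixtral 8x7B", 59.7),
    (date(2023, 9, 1), "Mistral 7B", 55.6),
]

# Assumption: scores above the benchmark maximum are left off the chart.
plotted = sorted(
    (r for r in results if r[2] <= MAX_SCORE),
    key=lambda r: r[0],
)

# Each dot is one result; the line is the best score seen so far.
dots = [score for _, _, score in plotted]
running_best = list(accumulate(dots, max))

for (when, model, score), best in zip(plotted, running_best):
    print(when.isoformat(), model, score, best)
```

With these inputs the line starts at 55.6 in Sep 2023 and stays at 96.4 from Oct 2024 onward.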

All results

Showing one canonical row per model.
#   Model             Score   Conditions  Eval date    Source         Flags
1   Gemma 2           554.0%  -           25 Feb 2025  Self-reported  Primary
2   Claude Opus 3     96.4%   25-shot     22 Oct 2024  Self-reported  Primary
3   Nemotron 3 Super  96.1%   25-shot     03 Apr 2026  Self-reported  Primary
4   Nova Pro          94.8%   0-shot      03 Dec 2024  Self-reported  Primary
5   Nova Lite         92.4%   0-shot      03 Dec 2024  Self-reported  Primary
6   Nova Micro        90.2%   0-shot      03 Dec 2024  Self-reported  Primary
7   Llama 3.2         78.6%   0-shot      22 Oct 2024  Self-reported  Primary
8   Mixtral 8x7B      59.7%   -           01 Dec 2023  Self-reported  Primary
9   Mixtral 8x7B      59.7%   -           08 Jan 2024  Self-reported  Primary
10  Mistral 7B        55.6%   -           01 Sep 2023  Self-reported  Primary
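Because the benchmark header states a maximum of 100.0%, a simple range check over the table is a natural sanity test for self-reported scores. The sketch below is hypothetical: the function name `out_of_range` and the 0 to 100 bounds are illustrative choices, and nothing here is part of the leaderboard's own tooling.

```python
MAX_SCORE = 100.0  # "Max 100.0%" from the benchmark header

# (rank, model, score in percent) copied from the table above.
rows = [
    (1, "Gemma 2", 554.0),
    (2, "Claude Opus 3", 96.4),
    (3, "Nemotron 3 Super", 96.1),
    (4, "Nova Pro", 94.8),
    (5, "Nova Lite", 92.4),
    (6, "Nova Micro", 90.2),
    (7, "Llama 3.2", 78.6),
    (8, "Mixtral 8x7B", 59.7),
    (9, "Mixtral 8x7B", 59.7),
    (10, "Mistral 7B", 55.6),
]


def out_of_range(rows, lo=0.0, hi=MAX_SCORE):
    """Return the rows whose score falls outside [lo, hi] (hypothetical helper)."""
    return [row for row in rows if not lo <= row[2] <= hi]


flagged = out_of_range(rows)
for rank, model, score in flagged:
    print(f"row {rank}: {model} reports {score}% (allowed range 0.0 to {MAX_SCORE}%)")
```

Run against the rows listed, the check flags only row 1.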