GSM8K

Grade School Math 8K

About 8,500 grade-school math word problems, each requiring 2 to 8 steps of arithmetic reasoning. Frontier models have saturated the benchmark, so today it is mostly useful as a smoke test.

Math · Text · accuracy · Max 100.0% · Released Oct 2021 · Saturated · Possibly contaminated

12 results · 11 models scored · Top: Claude Opus 3 (95.0%) · Median: 84.2%

Best results

Top primary scores; one row per model.
#2 Nova Pro 94.8%
#3 Nova Lite 94.5%
#5 GPT-4 92.0%
#7 Llama 3.2 77.7%
#10 LLaMA 2 56.8%

Frontier over time

Each dot is one model result; the line traces the running best score.
[Chart: best score over time. Y-axis: score, 0 to 100. X-axis: eval date, Jul 2023 to Apr 2026.]
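The running best line described above can be recomputed from the scores and eval dates in the All results listing. The following is a minimal sketch, not the site's actual charting code; the names `results` and `running_best` are hypothetical and were chosen only for illustration.

```python
from datetime import date
from itertools import accumulate

# (eval date, model, score in %) copied from the "All results" listing.
results = [
    (date(2024, 10, 22), "Claude Opus 3", 95.0),
    (date(2024, 12, 3), "Nova Pro", 94.8),
    (date(2024, 12, 3), "Nova Lite", 94.5),
    (date(2024, 12, 3), "Nova Micro", 92.3),
    (date(2024, 3, 4), "GPT-4", 92.0),
    (date(2026, 4, 3), "Nemotron 3 Super", 90.7),
    (date(2024, 9, 25), "Llama 3.2", 77.7),
    (date(2023, 12, 1), "Mixtral 8x7B", 74.4),
    (date(2024, 1, 8), "Mixtral 8x7B", 74.4),
    (date(2023, 7, 19), "LLaMA 2", 56.8),
    (date(2023, 9, 1), "Mistral 7B", 52.2),
    (date(2025, 2, 25), "Gemma 2", 23.9),
]


def running_best(rows):
    """Return (eval date, best score so far) pairs in chronological order."""
    ordered = sorted(rows, key=lambda row: row[0])
    # accumulate with max yields the cumulative maximum of the scores.
    bests = accumulate((score for _, _, score in ordered), max)
    return [(day, best) for (day, _, _), best in zip(ordered, bests)]


if __name__ == "__main__":
    for day, best in running_best(results):
        print(day.isoformat(), best)
```

With these rows, the running best starts at 56.8% (LLaMA 2, Jul 2023), rises to 74.4% and then 92.0%, and reaches 95.0% with Claude Opus 3 in Oct 2024; no later result exceeds it.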

All results

Showing one canonical row per model.
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|-------|-------|------------|-----------|--------|-------|
| 1 | Claude Opus 3 | 95.0% | 0-shot · CoT | Oct 22, 2024 | self reported | primary |
| 2 | Nova Pro | 94.8% | 0-shot · CoT | Dec 3, 2024 | self reported | primary |
| 3 | Nova Lite | 94.5% | 0-shot · CoT | Dec 3, 2024 | self reported | primary |
| 4 | Nova Micro | 92.3% | 0-shot · CoT | Dec 3, 2024 | self reported | primary |
| 5 | GPT-4 | 92.0% | 5-shot · CoT | Mar 4, 2024 | self reported | primary |
| 6 | Nemotron 3 Super | 90.7% | 8-shot | Apr 3, 2026 | self reported | primary |
| 7 | Llama 3.2 | 77.7% | 8-shot · CoT | Sep 25, 2024 | self reported | primary |
| 8 | Mixtral 8x7B | 74.4% | | Dec 1, 2023 | self reported | primary |
| 9 | Mixtral 8x7B | 74.4% | | Jan 8, 2024 | self reported | primary |
| 10 | LLaMA 2 | 56.8% | 8-shot | Jul 19, 2023 | paper | primary, verified |
| 11 | Mistral 7B | 52.2% | | Sep 1, 2023 | self reported | primary |
| 12 | Gemma 2 | 23.9% | 5-shot · Maj@1 | Feb 25, 2025 | self reported | primary |
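The summary figures at the top of the page (12 results, 11 models scored, top 95.0%, median 84.2%) can be checked against the rows above. This is a small sketch under the assumption that the median is taken over all 12 listed results; the variable names and the `summarize` helper are hypothetical.

```python
from statistics import median

# (model, score in %) copied from the rows above.
rows = [
    ("Claude Opus 3", 95.0),
    ("Nova Pro", 94.8),
    ("Nova Lite", 94.5),
    ("Nova Micro", 92.3),
    ("GPT-4", 92.0),
    ("Nemotron 3 Super", 90.7),
    ("Llama 3.2", 77.7),
    ("Mixtral 8x7B", 74.4),
    ("Mixtral 8x7B", 74.4),
    ("LLaMA 2", 56.8),
    ("Mistral 7B", 52.2),
    ("Gemma 2", 23.9),
]


def summarize(rows):
    """Recompute the headline statistics from (model, score) pairs."""
    scores = [score for _, score in rows]
    top_model, top_score = max(rows, key=lambda row: row[1])
    return {
        "results": len(rows),
        "models_scored": len({model for model, _ in rows}),
        "top_model": top_model,
        "top_score": top_score,
        # With an even number of results, the median is the mean of the
        # two middle scores; round to one decimal to match the page.
        "median": round(median(scores), 1),
    }


if __name__ == "__main__":
    print(summarize(rows))
```

The count of 11 models against 12 results follows from Mixtral 8x7B appearing twice, and the median is the mean of the sixth and seventh scores, (90.7 + 77.7) / 2 = 84.2.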