GSM8K
About 8.5K grade-school math word problems, each requiring 2 to 8 steps of arithmetic reasoning. Frontier models have saturated it, so today it is mostly useful as a smoke test.
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 3 | 95.0% | 0-shot · CoT | Oct 22, 2024 | self reported | primary |
| 2 | Nova Pro | 94.8% | 0-shot · CoT | Dec 3, 2024 | self reported | primary |
| 3 | Nova Lite | 94.5% | 0-shot · CoT | Dec 3, 2024 | self reported | primary |
| 4 | Nova Micro | 92.3% | 0-shot · CoT | Dec 3, 2024 | self reported | primary |
| 5 | GPT-4 | 92.0% | 5-shot · CoT | Mar 4, 2024 | self reported | primary |
| 6 | Nemotron 3 Super | 90.7% | 8-shot | Apr 3, 2026 | self reported | primary |
| 7 | Llama 3.2 | 77.7% | 8-shot · CoT | Sep 25, 2024 | self reported | primary |
| 8 | Mixtral 8x7B | 74.4% | — | Dec 1, 2023 | self reported | primary |
| 9 | Mixtral 8x7B | 74.4% | — | Jan 8, 2024 | self reported | primary |
| 10 | LLaMA 2 | 56.8% | 8-shot | Jul 19, 2023 | paper | primary verified |
| 11 | Mistral 7B | 52.2% | — | Sep 1, 2023 | self reported | primary |
| 12 | Gemma 2 | 23.9% | 5-shot · Maj@1 | Feb 25, 2025 | self reported | primary |
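
For readers reproducing a Score column like the one above: GSM8K is usually graded by exact match on the final numeric answer, and each reference solution in the dataset ends with a line of the form `#### <answer>`. The sketch below shows one way to compute such an accuracy. The function names are illustrative, and taking the last number in the model output as its answer is an assumption; the harnesses behind the listed results may extract answers differently, which is one reason scores under different conditions are not strictly comparable.

```python
import re
from typing import Optional, Sequence

# Matches integers and decimals, optionally negative, with thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def to_number(text: str) -> Optional[float]:
    """Return the last number found in text, or None if there is none."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def reference_answer(solution: str) -> Optional[float]:
    """Read the gold answer that follows the '####' marker."""
    return to_number(solution.split("####")[-1])


def predicted_answer(completion: str) -> Optional[float]:
    """Assumption: the model's final answer is the last number it prints."""
    return to_number(completion)


def accuracy(completions: Sequence[str], solutions: Sequence[str]) -> float:
    """Fraction of completions whose final number equals the gold answer."""
    correct = 0
    for completion, solution in zip(completions, solutions):
        predicted = predicted_answer(completion)
        if predicted is not None and predicted == reference_answer(solution):
            correct += 1
    return correct / len(solutions)


if __name__ == "__main__":
    # Toy inputs for illustration only, not real GSM8K items or model outputs.
    completions = [
        "16 - 3 - 4 = 9 eggs, and 9 * 2 = 18 dollars. The answer is 18.",
        "The total is 1,200 pages.",
        "I am not sure.",
    ]
    solutions = ["9 * 2 = 18\n#### 18", "600 * 2 = 1200\n#### 1200", "#### 5"]
    print(f"{accuracy(completions, solutions):.1%}")
```

Multiplying the returned fraction by 100 gives a percentage in the same form as the table. Few-shot and CoT settings change only the prompt that produces the completions, not this grading step.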
