MGSM
Multilingual Grade School Math: 250 GSM8K problems manually translated into 10 typologically diverse languages. Tests cross-lingual mathematical reasoning.
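Each model in the results table gets a single percentage. As a rough illustration of how a score of this kind can be computed, the sketch below assumes exact match on the last number in each response and an unweighted mean of per-language accuracies. The function names, the extraction regex, and the toy records are hypothetical; individual reports may use different prompting and answer-extraction rules.

```python
import re
from collections import defaultdict


def extract_final_number(text):
    """Return the last number found in a response, or None if there is none.

    Assumption: commas are thousands separators and are stripped. This is a
    simplification and is wrong for locales that use a decimal comma.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def mgsm_style_score(records):
    """Score (language, response, gold_answer) tuples.

    Returns a dict of per-language accuracy and the unweighted mean of
    those accuracies across languages.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, response, gold in records:
        total[lang] += 1
        pred = extract_final_number(response)
        if pred is not None and abs(pred - float(gold)) < 1e-6:
            correct[lang] += 1
    per_lang = {lang: correct[lang] / total[lang] for lang in total}
    macro = sum(per_lang.values()) / len(per_lang)
    return per_lang, macro


# Hypothetical toy records, not real benchmark items.
records = [
    ("de", "Also sind es 18 Dollar.", 18),
    ("de", "Die Antwort ist 5.", 3),
    ("es", "La respuesta es 1,000.", 1000),
    ("es", "Por lo tanto, 42.", 42),
]
per_lang, macro = mgsm_style_score(records)
print(per_lang, macro)
```

Averaging per language rather than pooling all items keeps each language equally weighted, which matters if the per-language item counts ever differ.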
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 3.5 | 91.6% | 0-shot · CoT · standard | 20 Jun 2024 | Self-reported | |
| 2 | Llama 3.3 | 91.1% | 0-shot | 06 Dec 2024 | Self-reported | Primary |
| 3 | Claude Opus 3 | 90.7% | 0-shot | 22 Oct 2024 | Self-reported | Primary |
| 4 | GPT-4o | 90.5% | — | 16 Apr 2025 | Self-reported | Primary |
| 5 | Gemini 1.5 Pro | 87.5% | 8-shot · standard | 01 May 2024 | Self-reported | |
| 6 | Nemotron 3 Super | 87.5% | 8-shot | 03 Apr 2026 | Self-reported | Primary |
| 7 | Gemini 1.5 Flash | 82.6% | 8-shot · standard | 01 May 2024 | Self-reported | |
| 8 | Gemini Ultra | 79.0% | 8-shot · CoT | 06 Dec 2023 | Paper | |
| 9 | Claude Haiku 3 | 75.1% | 0-shot · CoT · standard | 04 Mar 2024 | Self-reported | |
| 10 | Llama 3.2 | 58.2% | 0-shot · CoT | 25 Oct 2024 | Self-reported | Primary |
