MGSM

Multilingual Grade School Math

GSM8K translated into 10 typologically diverse languages. Tests cross-lingual mathematical reasoning.

Language · Text · Metric: Accuracy (max 100.0%) · Released Oct 2022
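Results: 5 · Models scored: 5 · Top: 91.1% (Llama 3.3) · Median: 90.5%
These summary statistics cover primary results only. The full table below also lists non-primary alternates, including a higher 91.6% for Claude Sonnet 3.5.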
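The headline figures can be recomputed from the rows flagged Primary in the results table. This is a minimal sketch, assuming (as the numbers suggest) that the summary counts only primary results. The list is transcribed from that table, and `summarize` is a hypothetical helper name.

```python
from statistics import median

# Primary-flagged rows transcribed from the "All results" table: (model, score %).
primary = [
    ("Llama 3.3", 91.1),
    ("Claude Opus 3", 90.7),
    ("GPT-4o", 90.5),
    ("Nemotron 3 Super", 87.5),
    ("Llama 3.2", 58.2),
]


def summarize(rows):
    """Return (result count, distinct model count, top (model, score), median score)."""
    scores = [score for _, score in rows]
    top = max(rows, key=lambda row: row[1])
    return len(rows), len({model for model, _ in rows}), top, median(scores)


count, models, top, med = summarize(primary)
print(f"Results: {count} · Models scored: {models} · Top: {top[1]}% ({top[0]}) · Median: {med}%")
```

With five scores the median is simply the third value after sorting, which is why GPT-4o's 90.5% doubles as the median.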

Best results

Top primary scores; one row per model.
Rank 1: Llama 3.3, 91.1%
Rank 3: GPT-4o, 90.5%
Rank 5: Llama 3.2, 58.2%
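"One row per model" amounts to filtering to primary results, keeping each model's best score, and ranking. The sketch below assumes a hypothetical `(model, score, is_primary)` row layout transcribed from the results table. It is an illustration of that group-by-max step, not the site's actual code.

```python
# Hypothetical row layout: (model, score %, is_primary), transcribed from "All results".
rows = [
    ("Claude Sonnet 3.5", 91.6, False),
    ("Llama 3.3", 91.1, True),
    ("Claude Opus 3", 90.7, True),
    ("GPT-4o", 90.5, True),
    ("Gemini 1.5 Pro", 87.5, False),
    ("Nemotron 3 Super", 87.5, True),
    ("Gemini 1.5 Flash", 82.6, False),
    ("Gemini Ultra", 79.0, False),
    ("Claude Haiku 3", 75.1, False),
    ("Llama 3.2", 58.2, True),
]


def best_primary_per_model(rows):
    """Keep each model's highest primary score, then rank models by it, descending."""
    best = {}
    for model, score, is_primary in rows:
        if is_primary and score > best.get(model, float("-inf")):
            best[model] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


for rank, (model, score) in enumerate(best_primary_per_model(rows), start=1):
    print(f"Rank {rank}: {model}, {score}%")
```

Ranks 1, 3 and 5 of this ranking match the three rows listed above. Claude Sonnet 3.5 drops out because its 91.6% is not flagged as primary.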

Frontier over time

Each dot is one model result; the line traces the running best score.
[Figure: best score over time. Y-axis: score, 0 to 100%; x-axis: eval date, Oct 2024 to Apr 2026.]
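The running-best line is a cumulative maximum over results sorted by eval date. This is a minimal sketch that assumes the chart plots primary results only, which is inferred from its axis starting at Oct 2024, the earliest primary eval date. The dates and scores come from the results table, and `running_best` is a hypothetical helper.

```python
from datetime import date

# Primary results as (eval date, model, score %), transcribed from the "All results" table.
# Assumption: the chart plots primary results only.
results = [
    (date(2024, 12, 6), "Llama 3.3", 91.1),
    (date(2024, 10, 22), "Claude Opus 3", 90.7),
    (date(2025, 4, 16), "GPT-4o", 90.5),
    (date(2026, 4, 3), "Nemotron 3 Super", 87.5),
    (date(2024, 10, 25), "Llama 3.2", 58.2),
]


def running_best(results):
    """Sort by eval date and pair each result with the best score seen up to that date."""
    best = float("-inf")
    trace = []
    for when, model, score in sorted(results):  # tuples sort by date first
        best = max(best, score)
        trace.append((when, model, score, best))
    return trace


for when, model, score, best in running_best(results):
    print(f"{when.isoformat()}  {model:<17} {score:5.1f}%  best so far {best:.1f}%")
```

Under this assumption the frontier rises only twice: to 90.7% with Claude Opus 3 (22 Oct 2024) and to 91.1% with Llama 3.3 (06 Dec 2024). Later results sit below that line.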

All results

Showing all configurations, including non-primary alternates.
#   Model              Score  Conditions               Eval date    Source         Flags
1   Claude Sonnet 3.5  91.6%  0-shot · CoT · standard  20 Jun 2024  Self-reported
2   Llama 3.3          91.1%  0-shot                   06 Dec 2024  Self-reported  Primary
3   Claude Opus 3      90.7%  0-shot                   22 Oct 2024  Self-reported  Primary
4   GPT-4o             90.5%  -                        16 Apr 2025  Self-reported  Primary
5   Gemini 1.5 Pro     87.5%  8-shot · standard        01 May 2024  Self-reported
6   Nemotron 3 Super   87.5%  8-shot                   03 Apr 2026  Self-reported  Primary
7   Gemini 1.5 Flash   82.6%  8-shot · standard        01 May 2024  Self-reported
8   Gemini Ultra       79.0%  8-shot · CoT             06 Dec 2023  Paper
9   Claude Haiku 3     75.1%  0-shot · CoT · standard  04 Mar 2024  Self-reported
10  Llama 3.2          58.2%  0-shot · CoT             25 Oct 2024  Self-reported  Primary