MGSM

Multilingual Grade School Math

GSM8K translated into 10 typologically diverse languages. Tests cross-lingual mathematical reasoning.

Language · Text · Accuracy · Max 100.0% · Released Oct 2022

Results

Models scored

Top (primary configurations): Llama 3.3, 91.1%

Median (primary configurations): 90.5%

Best results

Top primary scores; one row per model.

Llama 3.3	91.1%
Claude Opus 3	90.7%
GPT-4o	90.5%
Nemotron 3 Super	87.5%
Llama 3.2	58.2%

Each dot is one model result; the line traces the running best score.
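A running best of this kind can be sketched as a cumulative maximum over results sorted by evaluation date. The snippet below is only an illustration: it assumes the chart uses every configuration listed in the results table on this page (the page does not say whether non-primary rows are plotted), and the names `results` and `running_best` are made up for the example.

```python
from datetime import date
from itertools import accumulate

# (model, eval date, score in %) copied from the results table on this page.
# Assumption: all ten configurations are plotted, primary or not.
results = [
    ("Claude Sonnet 3.5", date(2024, 6, 20), 91.6),
    ("Llama 3.3", date(2024, 12, 6), 91.1),
    ("Claude Opus 3", date(2024, 10, 22), 90.7),
    ("GPT-4o", date(2025, 4, 16), 90.5),
    ("Gemini 1.5 Pro", date(2024, 5, 1), 87.5),
    ("Nemotron 3 Super", date(2026, 4, 3), 87.5),
    ("Gemini 1.5 Flash", date(2024, 5, 1), 82.6),
    ("Gemini Ultra", date(2023, 12, 6), 79.0),
    ("Claude Haiku 3", date(2024, 3, 4), 75.1),
    ("Llama 3.2", date(2024, 10, 25), 58.2),
]


def running_best(rows):
    """Return (eval_date, best_score_so_far) pairs in chronological order."""
    ordered = sorted(rows, key=lambda r: r[1])  # stable sort by eval date
    best = accumulate((score for _, _, score in ordered), max)
    return [(row[1], b) for row, b in zip(ordered, best)]


trace = running_best(results)
for day, best in trace:
    print(day.isoformat(), f"{best:.1f}%")
```

Under that assumption the trace only rises when a new result beats every earlier one, so later but lower scores leave the line flat.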

Showing all configurations, including non-primary alternates.

#	Model	Score	Conditions	Eval date	Source	Flags
1	Claude Sonnet 3.5	91.6%	0-shot · CoT · standard	20 Jun 2024	Self-reported
2	Llama 3.3	91.1%	0-shot	06 Dec 2024	Self-reported	Primary
3	Claude Opus 3	90.7%	0-shot	22 Oct 2024	Self-reported	Primary
4	GPT-4o	90.5%	—	16 Apr 2025	Self-reported	Primary
5	Gemini 1.5 Pro	87.5%	8-shot · standard	01 May 2024	Self-reported
6	Nemotron 3 Super	87.5%	8-shot	03 Apr 2026	Self-reported	Primary
7	Gemini 1.5 Flash	82.6%	8-shot · standard	01 May 2024	Self-reported
8	Gemini Ultra	79.0%	8-shot · CoT	06 Dec 2023	Paper
9	Claude Haiku 3	75.1%	0-shot · CoT · standard	04 Mar 2024	Self-reported
10	Llama 3.2	58.2%	0-shot · CoT	25 Oct 2024	Self-reported	Primary
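
The summary figures near the top of the page (top model Llama 3.3 at 91.1%, median 90.5%) appear to be computed over the rows flagged Primary, not over all configurations. A minimal sketch of that reading, with the rows copied from the table and the variable names chosen only for illustration:

```python
from statistics import median

# (model, score in %, flagged Primary) copied from the table above.
rows = [
    ("Claude Sonnet 3.5", 91.6, False),
    ("Llama 3.3", 91.1, True),
    ("Claude Opus 3", 90.7, True),
    ("GPT-4o", 90.5, True),
    ("Gemini 1.5 Pro", 87.5, False),
    ("Nemotron 3 Super", 87.5, True),
    ("Gemini 1.5 Flash", 82.6, False),
    ("Gemini Ultra", 79.0, False),
    ("Claude Haiku 3", 75.1, False),
    ("Llama 3.2", 58.2, True),
]

# Assumption: the summary statistics use primary rows only.
primary = [(model, score) for model, score, is_primary in rows if is_primary]

top_model, top_score = max(primary, key=lambda r: r[1])
median_primary = median([score for _, score in primary])
median_all = median([score for _, score, _ in rows])

print(f"Top primary: {top_model} ({top_score}%)")
print(f"Median over primary rows: {median_primary}%")
print(f"Median over all rows: {median_all}%")
```

Restricting to primary rows reproduces both published figures, whereas the median over all ten rows is 87.5%. This is consistent with the primary-only reading but does not confirm how the page computes them.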