MBPP

Mostly Basic Python Problems

974 short Python problems with test cases. One of the original code-LLM benchmarks; saturated by frontier models.

Coding · Text · Pass@k · Max 100.0% · Released Aug 2021 · Saturated · Possibly contaminated

Results

Models scored

Top: Nemotron 3 Super, 78.4%

Median: 60.7%

Best results

Top primary scores; one row per model: 78.4%, 66.7%, 60.7%, 41.4%, 29.6%.

Chart: each dot is one model result; the line traces the running best score.

Showing all configurations, including non-primary alternates.

#	Model	Score	Conditions	Eval date	Source	Flags
1	Claude Haiku 3	80.4%	0-shot · standard	04 Mar 2024	Self-reported
2	Nemotron 3 Super	78.4%	3-shot · pass@1 n=32	03 Apr 2026	Self-reported	Primary
3	Code Llama	66.7%	pass@10	01 Aug 2023	Self-reported	Primary
4	Mixtral 8x7B	60.7%	—	08 Jan 2024	Self-reported	Primary
5	Mistral 7B	47.5%	3-shot	10 Oct 2023	Paper
6	Code Llama	41.4%	pass@1	01 Aug 2023	Paper	Primary
7	Gemma 2	29.6%	3-shot	25 Feb 2025	Self-reported	Primary
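
The Conditions column mixes pass@1 (in one case with n=32 samples) and pass@10, so scores are not directly comparable across rows. As a rough illustration of what these settings mean, here is a minimal sketch of the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c is the number that pass all test cases. Whether each listed result was computed this way is an assumption; the function names and the (n, c) input format are hypothetical and chosen only for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated, c: samples that passed every test, k: budget.
    Returns the probability that at least one of k samples drawn
    without replacement from the n generated samples is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-problem pass@k over (n, c) pairs (hypothetical format)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Illustrative numbers only, not taken from any model in the table.
    toy_results = [(32, 8), (32, 0), (32, 32)]
    print(f"pass@1  = {benchmark_pass_at_k(toy_results, 1):.3f}")
    print(f"pass@10 = {benchmark_pass_at_k(toy_results, 10):.3f}")
```

With k=1 the estimator reduces to c/n, the fraction of samples that pass, so a "pass@1 n=32" entry averages that fraction over 32 samples per problem. Larger k can only raise the score on the same samples, which is why a pass@10 row is not evidence of a stronger model than a pass@1 row.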