MBPP
974 short Python problems, each with test cases. One of the original code-LLM benchmarks; now saturated by frontier models.
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Claude Haiku 3 | 80.4% | 0-shot · standard | 04 Mar 2024 | Self-reported | |
| 2 | Nemotron 3 Super | 78.4% | 3-shot · pass@1 n=32 | 03 Apr 2026 | Self-reported | Primary |
| 3 | Code Llama | 66.7% | pass@10 | 01 Aug 2023 | Self-reported | Primary |
| 4 | Mixtral 8x7B | 60.7% | — | 08 Jan 2024 | Self-reported | Primary |
| 5 | Mistral 7B | 47.5% | 3-shot | 10 Oct 2023 | Paper | |
| 6 | Code Llama | 41.4% | pass@1 | 01 Aug 2023 | Paper | Primary |
| 7 | Gemma 2 | 29.6% | 3-shot | 25 Feb 2025 | Self-reported | Primary |
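
Several rows in the Conditions column report pass@k figures (for example `pass@1 n=32` and `pass@10`). As a rough illustration of how such a number can be computed, the sketch below uses the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn for a problem and c is how many of them pass the tests. This page does not say which estimator each entry used, and reading `n=32` as "32 samples per problem" is an assumption. The function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed.

    Uses 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem results: (samples drawn, samples passing).
example = [(32, 8), (32, 0), (32, 32)]
print(f"pass@1 = {benchmark_pass_at_k(example, k=1):.4f}")
```

With k=1 the estimator reduces to c/n per problem, so a pass@1 score with n=32 is simply the mean pass rate across 32 samples. Scores reported under different k are not directly comparable, which is worth keeping in mind when reading the `pass@10` and `pass@1` rows side by side.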
