TAAFT

MBPP

Mostly Basic Python Problems

974 short Python problems with test cases. One of the original code-LLM benchmarks; saturated by frontier models.

Coding · Text · Pass@k · Max 100.0% · Released Aug 2021 · Saturated · Possibly contaminated
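The page tags MBPP with the Pass@k metric but does not define it. As a minimal sketch, the widely used unbiased estimator draws n samples per problem, counts the c samples that pass every test, and estimates the chance that at least one of k randomly chosen samples passes. The function names `pass_at_k` and `benchmark_score` are illustrative, and nothing here is confirmed as the procedure behind any score listed on this page.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them pass all tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(per_problem: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs; returns a fraction in [0, 1]."""
    if not per_problem:
        raise ValueError("need at least one problem")
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)
```

For example, a problem with 8 passing samples out of 32 contributes `pass_at_k(32, 8, 1) == 0.25` to a pass@1 score. Larger k can only raise the estimate, which is why pass@1 and pass@10 figures for the same model are not directly comparable.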
Results: 5
Models scored: 4
Top: Nemotron 3 Super (78.4%)
Median: 60.7%
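These summary figures appear to be computed over the rows flagged Primary in the results table, not over all seven listed configurations. That is an inference from the numbers, not something the page states. A short sketch that reproduces them under that assumption:

```python
from statistics import median

# (model, score in %) for the rows flagged Primary in the results table.
primary = [
    ("Nemotron 3 Super", 78.4),
    ("Code Llama", 66.7),   # pass@10
    ("Mixtral 8x7B", 60.7),
    ("Code Llama", 41.4),   # pass@1
    ("Gemma 2", 29.6),
]

summary = {
    "results": len(primary),                          # primary rows
    "models": len({model for model, _ in primary}),   # distinct model names
    "top": max(primary, key=lambda row: row[1]),      # highest primary score
    "median": median(score for _, score in primary),  # middle of the 5 scores
}
print(summary)
```

Under this reading, Code Llama contributes two primary results (pass@1 and pass@10) but counts as one model, which would explain 5 results against 4 models.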

Best results

Top primary scores; one row per model.

Frontier over time

Each dot is one model result; the line traces the running best score.
[Chart: best score over time. Y-axis: 0 to 100; x-axis: Aug 2023 to Apr 2026.]
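The running-best line described above can be sketched as a cumulative maximum over results sorted by evaluation date. The rows below are copied from the results table. Whether the page's chart includes non-primary rows is not stated, so including all seven here is an assumption, and the helper name `running_best` is illustrative.

```python
from datetime import date

# (eval date, model, score in %) for every row in the results table.
rows = [
    (date(2024, 3, 4), "Claude Haiku 3", 80.4),
    (date(2026, 4, 3), "Nemotron 3 Super", 78.4),
    (date(2023, 8, 1), "Code Llama", 66.7),
    (date(2024, 1, 8), "Mixtral 8x7B", 60.7),
    (date(2023, 10, 10), "Mistral 7B", 47.5),
    (date(2023, 8, 1), "Code Llama", 41.4),
    (date(2025, 2, 25), "Gemma 2", 29.6),
]


def running_best(results):
    """Return (date, best score so far) for each result, in date order."""
    best = float("-inf")
    frontier = []
    for day, _model, score in sorted(results, key=lambda r: r[0]):
        best = max(best, score)
        frontier.append((day, best))
    return frontier


for day, best in running_best(rows):
    print(day.isoformat(), best)
```

With all rows included, the line stays at 66.7 from Aug 2023 and steps up to 80.4 in Mar 2024. If only Primary rows were plotted, it would instead peak at 78.4 in Apr 2026. Either way, the line mixes pass@1 and pass@10 results and different shot counts, so it tracks reported scores, not like-for-like progress.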

All results

Showing all configurations, including non-primary alternates.
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|-------|-------|------------|-----------|--------|-------|
| 1 | Claude Haiku 3 | 80.4% | 0-shot · standard | 04 Mar 2024 | Self-reported | |
| 2 | Nemotron 3 Super | 78.4% | 3-shot · pass@1 n=32 | 03 Apr 2026 | Self-reported | Primary |
| 3 | Code Llama | 66.7% | pass@10 | 01 Aug 2023 | Self-reported | Primary |
| 4 | Mixtral 8x7B | 60.7% | | 08 Jan 2024 | Self-reported | Primary |
| 5 | Mistral 7B | 47.5% | 3-shot | 10 Oct 2023 | Paper | |
| 6 | Code Llama | 41.4% | pass@1 | 01 Aug 2023 | Paper | Primary |
| 7 | Gemma 2 | 29.6% | 3-shot | 25 Feb 2025 | Self-reported | Primary |