MBPP

Mostly Basic Python Problems

974 short Python problems with test cases. One of the original code-LLM benchmarks; saturated by frontier models.

Coding · Text · pass@k · Max 100.0% · Released Aug 2021 · Saturated · Possibly contaminated
Results: 5
Models scored: 4
Top: 78.4% (Nemotron 3 Super)
Median: 60.7%
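
The summary figures can be checked against the listed results. The sketch below is only a plausible reconstruction: it assumes the median is taken over all five result rows and that "models scored" counts distinct model names, neither of which the page states. The variable names are illustrative.

```python
from statistics import median

# (model, score in percent) pairs copied from the results listed on this page.
results = [
    ("Nemotron 3 Super", 78.4),
    ("Code Llama", 66.7),
    ("Mixtral 8x7B", 60.7),
    ("Code Llama", 41.4),
    ("Gemma 2", 29.6),
]

scores = [score for _, score in results]

summary = {
    "results": len(results),                          # number of result rows
    "models": len({model for model, _ in results}),   # distinct model names (assumption)
    "top": max(results, key=lambda row: row[1]),      # best-scoring row
    "median": median(scores),                         # median over all rows (assumption)
}

print(summary)
```

Note that the rows mix different conditions (pass@1, pass@10, 3-shot), so the median summarizes the listing rather than a like-for-like comparison between models.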

Best results

Top primary scores; one row per model.

Frontier over time

Each dot is one model result; the line traces the running best score.
[Chart: best score over time. Y-axis: score, 0 to 100. X-axis: Aug 2023 to Apr 2026.]

All results

Showing one canonical row per model.
# | Model | Score | Conditions | Eval date | Source | Flags
1 | Nemotron 3 Super | 78.4% | 3-shot · pass@1 n=32 | Apr 3, 2026 | self reported | primary
2 | Code Llama | 66.7% | pass@10 | Aug 1, 2023 | self reported | primary
3 | Mixtral 8x7B | 60.7% | - | Jan 8, 2024 | self reported | primary
4 | Code Llama | 41.4% | pass@1 | Aug 1, 2023 | paper | primary
5 | Gemma 2 | 29.6% | 3-shot | Feb 25, 2025 | self reported | primary
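
The Conditions column mixes pass@1 and pass@10, and one entry reports pass@1 with n=32 samples. As a rough illustration of how such numbers are commonly computed, here is a minimal sketch of the widely used unbiased pass@k estimator. It is an assumption that the listed results used this estimator; the page does not say. The function names `pass_at_k` and `benchmark_score` and the example counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated for the problem
    c: samples that passed all of the problem's test cases
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # With fewer than k failing samples, every draw of k contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that k samples drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem pass@k estimates over the whole benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical example: three problems, 32 samples each, with 32, 8, and 0 passes.
example = benchmark_score([32, 8, 0], n=32, k=1)
print(f"pass@1 = {example:.3f}")
```

For k=1 the estimate reduces to the fraction of passing samples, c/n, so a larger n lowers the variance of a pass@1 score without changing what it measures. pass@10 credits a problem if any of ten attempts passes, so it is not directly comparable to pass@1 in the same listing.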