HumanEval
OpenAI's 164 hand-written Python programming problems, each with unit tests. The original code-LLM benchmark; now saturated and widely considered to have leaked into modern training corpora.
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 3.5 | 92.0% | 0-shot · standard | 20 Jun 2024 | Self-reported | |
| 2 | GPT-4o | 90.2% | — | 16 Apr 2025 | Self-reported | Primary |
| 3 | Llama 3.3 | 88.4% | 0-shot · Pass@1 | 06 Dec 2024 | Self-reported | Primary |
| 4 | Claude Haiku 3.5 | 88.1% | 0-shot | 22 Oct 2024 | Self-reported | Primary |
| 5 | Claude Opus 3 | 84.9% | 0-shot | 22 Oct 2024 | Self-reported | Primary |
| 6 | Mistral Small 3 | 84.8% | Pass@1 | 30 Dec 2025 | Self-reported | Primary |
| 7 | Gemini 1.5 | 84.1% | 0-shot · standard | 01 May 2024 | Self-reported | |
| 8 | Nemotron 3 Super | 79.4% | 0-shot · pass@1 n=32 | 03 Apr 2026 | Self-reported | Primary |
| 9 | Claude Haiku 3 | 75.9% | 0-shot · standard | 04 Mar 2024 | Self-reported | |
| 10 | Gemini Ultra | 74.4% | 0-shot · standard | 06 Dec 2023 | Self-reported | |
| 11 | Gemini 1.5 Flash | 74.3% | 0-shot · standard | 01 May 2024 | Self-reported | |
| 12 | WizardCoder | 73.2% | — | 01 Aug 2023 | Paper | Primary |
| 13 | Pixtral 12B | 72.0% | Pass@1 | 10 Oct 2024 | Self-reported | Primary |
| 14 | Claude 2 | 71.2% | 0-shot · standard | 11 Jul 2023 | Self-reported | |
| 15 | Code Llama | 67.8% | — | 01 Aug 2023 | Paper | Primary |
| 16 | GPT 3.5 | 48.1% | 0-shot · standard | 14 Mar 2023 | Self-reported | |
| 17 | Mixtral 8x7B | 40.2% | — | 01 Dec 2023 | Paper | Primary |
| 18 | Mixtral 8x7B | 40.2% | — | 08 Jan 2024 | Self-reported | Primary |
| 19 | Mistral 7B | 30.5% | — | 01 Sep 2023 | Paper | Primary |
| 20 | LLaMA 2 | 29.9% | 0-shot | 19 Jul 2023 | Paper | Primary Verified |
| 21 | LLaMA 2 70B | 29.9% | 0-shot | 11 Jul 2023 | Paper | |
| 22 | Gemma 2 | 17.7% | Pass@1 | 25 Feb 2025 | Self-reported | Primary |
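
Several rows report pass@1, and one reports pass@1 with n=32 samples per problem. The sketch below shows one way such a score could be computed, assuming the unbiased pass@k estimator described in the original HumanEval paper: for each problem, draw n samples, count the c samples that pass every unit test, and estimate 1 - C(n-c, k) / C(n, k). The function names `pass_at_k` and `benchmark_score` are illustrative, and the sources in the table may have computed their numbers differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated for the problem
    c: samples that passed all unit tests
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k draw contains a passing one.
    if n - c < k:
        return 1.0
    # Probability that a random size-k draw contains at least one passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(correct_counts: list[int], n: int, k: int = 1) -> float:
    """Average the per-problem estimates (hypothetical helper)."""
    if not correct_counts:
        raise ValueError("correct_counts must not be empty")
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Three made-up problems with 32 samples each; the counts are illustrative only.
    print(benchmark_score([32, 16, 0], n=32, k=1))
```

For k=1 the estimate reduces to c/n per problem, so a "pass@1 n=32" figure is effectively the mean fraction of passing samples across all problems. Drawing more samples lowers the variance of the reported number; it does not change what is being measured.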
