HumanEval

HumanEval (pass@1)

HumanEval is OpenAI's set of 164 hand-written Python programming problems, each checked by unit tests. It was the original code-LLM benchmark; it is now saturated, and its problems are broadly considered to have leaked into modern training corpora.

Coding · Text · Pass@k · Max 100.0% · Released Jul 2021 · Saturated · Possibly contaminated
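The Pass@k metric named above is usually computed with the unbiased estimator from the paper that introduced HumanEval: draw n samples per problem, count the c samples that pass the unit tests, and estimate 1 - C(n-c, k) / C(n, k), averaged over all problems. A minimal sketch follows; the function names and the sample counts in the usage note are illustrative assumptions, not values taken from this leaderboard.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n, so a hypothetical problem with 8 passing samples out of 32 contributes `pass_at_k(32, 8, 1)`, which is 0.25.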

Results

Models scored

Top primary score: GPT-4o, 90.2%

Median: 72.6%
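The 72.6% median can be reproduced from the rows flagged Primary in the All results table on this page. A minimal sketch, assuming the median is taken over all 14 Primary rows, including both Mixtral 8x7B entries; that is an inference from the numbers, not something the page states.

```python
from statistics import median

# Scores of the rows flagged Primary in the All results table (transcribed by hand).
primary_scores = [90.2, 88.4, 88.1, 84.9, 84.8, 79.4, 73.2,
                  72.0, 67.8, 40.2, 40.2, 30.5, 29.9, 17.7]

# With an even number of rows, the median is the mean of the two middle values.
primary_median = round(median(primary_scores), 1)
print(primary_median)
```

Dropping the second Mixtral 8x7B row would leave 13 scores and a median of 73.2, which is why the sketch keeps both rows.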

Best results

Top primary scores; one row per model.

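Model	Score
GPT-4o	90.2%
Llama 3.3	88.4%
Claude Haiku 3.5	88.1%
Claude Opus 3	84.9%
Mistral Small 3	84.8%
Nemotron 3 Super	79.4%
WizardCoder	73.2%
Pixtral 12B	72.0%
Code Llama	67.8%
Mixtral 8x7B	40.2%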

Frontier over time

Each dot is one model result; the line traces the running best score.
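The running best score described above is a cumulative maximum over results ordered by evaluation date. A minimal sketch, using a hand-picked subset of rows from the All results table; whether the page's own chart includes non-primary configurations is not stated, so the choice of rows here is an assumption, and `running_best` is a hypothetical helper name.

```python
from datetime import date
from itertools import accumulate

# (eval date, score) pairs: a hand-picked subset of rows from the All results table.
results = [
    (date(2024, 6, 20), 92.0),   # Claude Sonnet 3.5
    (date(2023, 3, 14), 48.1),   # GPT 3.5
    (date(2023, 12, 1), 40.2),   # Mixtral 8x7B
    (date(2023, 7, 11), 71.2),   # Claude 2
    (date(2024, 10, 22), 88.1),  # Claude Haiku 3.5
    (date(2023, 8, 1), 73.2),    # WizardCoder
    (date(2023, 12, 6), 74.4),   # Gemini Ultra
    (date(2024, 5, 1), 84.1),    # Gemini 1.5
]


def running_best(points):
    """Return (date, best score so far) pairs in chronological order."""
    ordered = sorted(points)  # tuples sort by date first
    bests = accumulate((score for _, score in ordered), max)
    return [(day, best) for (day, _), best in zip(ordered, bests)]


frontier = running_best(results)
```

A result that scores below the current best, such as the Mixtral 8x7B row here, still appears as a dot but leaves the frontier line flat.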

All results

Showing all configurations, including non-primary alternates.

#	Model	Score	Conditions	Eval date	Source	Flags
1	Claude Sonnet 3.5	92.0%	0-shot · standard	20 Jun 2024	Self-reported
2	GPT-4o	90.2%	—	16 Apr 2025	Self-reported	Primary
3	Llama 3.3	88.4%	0-shot · Pass@1	06 Dec 2024	Self-reported	Primary
4	Claude Haiku 3.5	88.1%	0-shot	22 Oct 2024	Self-reported	Primary
5	Claude Opus 3	84.9%	0-shot	22 Oct 2024	Self-reported	Primary
6	Mistral Small 3	84.8%	Pass@1	30 Dec 2025	Self-reported	Primary
7	Gemini 1.5	84.1%	0-shot · standard	01 May 2024	Self-reported
8	Nemotron 3 Super	79.4%	0-shot · pass@1 n=32	03 Apr 2026	Self-reported	Primary
9	Claude Haiku 3	75.9%	0-shot · standard	04 Mar 2024	Self-reported
10	Gemini Ultra	74.4%	0-shot · standard	06 Dec 2023	Self-reported
11	Gemini 1.5 Flash	74.3%	0-shot · standard	01 May 2024	Self-reported
12	WizardCoder	73.2%	—	01 Aug 2023	Paper	Primary
13	Pixtral 12B	72.0%	Pass@1	10 Oct 2024	Self-reported	Primary
14	Claude 2	71.2%	0-shot · standard	11 Jul 2023	Self-reported
15	Code Llama	67.8%	—	01 Aug 2023	Paper	Primary
16	GPT 3.5	48.1%	0-shot · standard	14 Mar 2023	Self-reported
17	Mixtral 8x7B	40.2%	—	01 Dec 2023	Paper	Primary
18	Mixtral 8x7B	40.2%	—	08 Jan 2024	Self-reported	Primary
19	Mistral 7B	30.5%	—	01 Sep 2023	Paper	Primary
20	LLaMA 2	29.9%	0-shot	19 Jul 2023	Paper	Primary Verified
21	LLaMA 2 70B	29.9%	0-shot	11 Jul 2023	Paper
22	Gemma 2	17.7%	Pass@1	25 Feb 2025	Self-reported	Primary
