BIG-Bench Hard

The 23-task subset of BIG-Bench on which earlier language models did not outperform the average human rater. The tasks mix logical, algorithmic, and language-understanding problems.

Reasoning · Text · accuracy · Max 100.0% · Released Oct 2022

Results

Models scored

Top: Seed 1.5 (91.6%)

Median: 84.6%
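The top and median figures appear to be taken over the primary scores only. That is an assumption, but it is consistent with the results table: the six rows flagged primary give a top of 91.6% and a median of 84.6%. A minimal Python sketch of that arithmetic, where `primary_scores` and `summarize` are illustrative names and not part of the leaderboard:

```python
from statistics import median

# Primary scores copied from the results table (rows flagged "primary").
# Assumption: the summary statistics are computed over exactly these rows.
primary_scores = {
    "Seed 1.5": 91.6,
    "Nova Pro": 86.9,
    "Claude Opus 3": 86.8,
    "Nova Lite": 82.4,
    "Nova Micro": 79.5,
    "LLaMA 2": 51.2,
}


def summarize(scores):
    """Return (top_model, top_score, median_score) for a model -> score map."""
    top_model = max(scores, key=scores.get)
    # Round to one decimal place to match the precision shown on the page.
    return top_model, scores[top_model], round(median(scores.values()), 1)


top_model, top_score, median_score = summarize(primary_scores)
print(f"Top: {top_model} ({top_score}%), median: {median_score}%")
```

With an even number of scores, the median is the mean of the two middle values, here 86.8 and 82.4.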

Best results

Top primary scores; one row per model.

Seed 1.5: 91.6%
Nova Pro: 86.9%
Claude Opus 3: 86.8%
Nova Lite: 82.4%
Nova Micro: 79.5%
LLaMA 2: 51.2%

Each dot is one model result; the line traces the running best score.
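The running best line can be sketched as a cumulative maximum over results ordered by evaluation date. The sketch below assumes the chart's horizontal axis is the eval date and uses a handful of rows from the results table; `results` and `running_best` are hypothetical names chosen for illustration.

```python
from datetime import date
from itertools import accumulate

# (eval date, model, score) for a few rows from the results table.
# Assumption: the chart orders results by eval date.
results = [
    (date(2024, 6, 20), "Claude Sonnet 3.5", 93.1),
    (date(2023, 7, 19), "LLaMA 2", 51.2),
    (date(2024, 3, 4), "Claude Haiku 3", 73.7),
    (date(2023, 12, 6), "Gemini Ultra", 83.6),
    (date(2025, 1, 22), "Seed 1.5", 91.6),
]


def running_best(rows):
    """Sort rows by eval date and pair each date with the best score so far."""
    ordered = sorted(rows, key=lambda row: row[0])
    # accumulate(..., max) yields the cumulative maximum of the scores.
    best_so_far = accumulate((score for _, _, score in ordered), max)
    return [(day, best) for (day, _, _), best in zip(ordered, best_so_far)]


for day, best in running_best(results):
    print(day.isoformat(), best)
```

A result that scores below the current best, such as Claude Haiku 3 here, still appears as a dot but leaves the line flat.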

Showing all configurations, including non-primary alternates.

#	Model	Score	Conditions	Eval date	Source	Flags
1	Claude Sonnet 3.5	93.1%	3-shot · CoT · standard	Jun 20, 2024	self reported	—
2	Seed 1.5	91.6%	—	Jan 22, 2025	self reported	primary
3	Gemini 1.5	89.2%	3-shot · CoT · standard	May 1, 2024	self reported	—
4	Nova Pro	86.9%	3-shot · CoT	Dec 3, 2024	self reported	primary
5	Claude Opus 3	86.8%	3-shot · CoT	Oct 22, 2024	self reported	primary
6	Gemini 1.5 Flash	85.5%	3-shot · CoT · standard	May 1, 2024	self reported	—
7	Gemini Ultra	83.6%	3-shot · CoT · standard	Dec 6, 2023	self reported	—
8	Nova Lite	82.4%	3-shot · CoT	Dec 3, 2024	self reported	primary
9	Nova Micro	79.5%	3-shot · CoT	Dec 3, 2024	self reported	primary
10	Claude Haiku 3	73.7%	3-shot · CoT · standard	Mar 4, 2024	self reported	—
11	LLaMA 2	51.2%	3-shot	Jul 19, 2023	paper	primary, verified
12	LLaMA 2 70B	51.2%	3-shot	Jul 11, 2023	paper	—
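To reproduce a primary-only view from the table, one option is to parse the tab-separated rows and keep those whose Flags cell contains `primary`. This is a sketch under the assumption that the table is available as tab-separated text; `TSV` and `primary_rows` are hypothetical names, and only three rows are included for brevity.

```python
import csv
import io

# A few rows in the table's tab-separated layout; the Flags cell may be absent.
TSV = (
    "#\tModel\tScore\tConditions\tEval date\tSource\tFlags\n"
    "1\tClaude Sonnet 3.5\t93.1%\t3-shot · CoT · standard\tJun 20, 2024\tself reported\n"
    "2\tSeed 1.5\t91.6%\t—\tJan 22, 2025\tself reported\tprimary\n"
    "11\tLLaMA 2\t51.2%\t3-shot\tJul 19, 2023\tpaper\tprimary verified\n"
)


def primary_rows(tsv_text):
    """Yield (model, score) for rows whose Flags cell contains 'primary'."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Short rows leave Flags as None; treat that as "no flags".
        flags = (row.get("Flags") or "").replace(",", " ").split()
        if "primary" in flags:
            yield row["Model"], float(row["Score"].rstrip("%"))


print(list(primary_rows(TSV)))
```

Splitting the Flags cell into tokens means a row flagged both primary and verified is still matched.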