BIG-Bench Hard

A 23-task subset of BIG-Bench on which prior LLMs underperformed humans. The tasks are a mix of logical, algorithmic, and language-understanding problems.

Reasoning · Text · Metric: accuracy (max 100.0%) · Released Oct 2022

Results: 6
Models scored: 6
Top: Seed 1.5 (91.6%)
Median: 84.6%

Best results

Top primary scores; one row per model.
| Rank | Score |
|------|-------|
| 1 | 91.6% |
| 2 | 86.9% |
| 4 | 82.4% |
| 6 | 51.2% |

Frontier over time

Each dot is one model result; the line traces the running best score.
[Chart: Best score over time. Y-axis: score, 0 to 100. X-axis: eval date, Jul 2023 to Jan 2025.]
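The running-best line can be reproduced from dated results with a cumulative maximum. The sketch below is illustrative only: it assumes the chart plots the rows flagged primary in the All results table (the page does not say which rows are plotted), and the names `primary_results` and `running_best` are hypothetical.

```python
from datetime import date
from itertools import accumulate

# (model, eval date, score in %) copied from the rows flagged "primary"
# in the All results table. Assumption: these are the plotted dots.
primary_results = [
    ("Seed 1.5", date(2025, 1, 22), 91.6),
    ("Nova Pro", date(2024, 12, 3), 86.9),
    ("Claude Opus 3", date(2024, 10, 22), 86.8),
    ("Nova Lite", date(2024, 12, 3), 82.4),
    ("Nova Micro", date(2024, 12, 3), 79.5),
    ("LLaMA 2", date(2023, 7, 19), 51.2),
]


def running_best(results):
    """Return (eval date, best score so far) pairs in chronological order."""
    # sorted() is stable, so rows sharing a date keep their input order.
    ordered = sorted(results, key=lambda row: row[1])
    # accumulate with max yields the cumulative maximum of the scores.
    best_so_far = accumulate((score for _, _, score in ordered), max)
    return [(when, best) for (_, when, _), best in zip(ordered, best_so_far)]


frontier = running_best(primary_results)
for when, best in frontier:
    print(when.isoformat(), f"{best:.1f}%")
```

Rows that share an eval date are processed in input order, so intermediate values on a single date depend on that order; the value after the last row of each date does not.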

All results

Showing all configurations, including non-primary alternates.
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|-------|-------|------------|-----------|--------|-------|
| 1 | Claude Sonnet 3.5 | 93.1% | 3-shot · CoT · standard | Jun 20, 2024 | self reported | |
| 2 | Seed 1.5 | 91.6% | | Jan 22, 2025 | self reported | primary |
| 3 | Gemini 1.5 | 89.2% | 3-shot · CoT · standard | May 1, 2024 | self reported | |
| 4 | Nova Pro | 86.9% | 3-shot · CoT | Dec 3, 2024 | self reported | primary |
| 5 | Claude Opus 3 | 86.8% | 3-shot · CoT | Oct 22, 2024 | self reported | primary |
| 6 | Gemini 1.5 Flash | 85.5% | 3-shot · CoT · standard | May 1, 2024 | self reported | |
| 7 | Gemini Ultra | 83.6% | 3-shot · CoT · standard | Dec 6, 2023 | self reported | |
| 8 | Nova Lite | 82.4% | 3-shot · CoT | Dec 3, 2024 | self reported | primary |
| 9 | Nova Micro | 79.5% | 3-shot · CoT | Dec 3, 2024 | self reported | primary |
| 10 | Claude Haiku 3 | 73.7% | 3-shot · CoT · standard | Mar 4, 2024 | self reported | |
| 11 | LLaMA 2 | 51.2% | 3-shot | Jul 19, 2023 | paper | primary, verified |
| 12 | LLaMA 2 70B | 51.2% | 3-shot | Jul 11, 2023 | paper | |
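The summary figures near the top of the page (6 results, top score 91.6%, median 84.6%) appear to be computed over the primary-flagged rows only. That is an inference from the numbers, not something the page states. A minimal sketch of the check, with the hypothetical names `rows` and `summarize_primary`:

```python
from statistics import median

# (model, score in %, flagged primary?) copied from the All results table.
rows = [
    ("Claude Sonnet 3.5", 93.1, False),
    ("Seed 1.5", 91.6, True),
    ("Gemini 1.5", 89.2, False),
    ("Nova Pro", 86.9, True),
    ("Claude Opus 3", 86.8, True),
    ("Gemini 1.5 Flash", 85.5, False),
    ("Gemini Ultra", 83.6, False),
    ("Nova Lite", 82.4, True),
    ("Nova Micro", 79.5, True),
    ("Claude Haiku 3", 73.7, False),
    ("LLaMA 2", 51.2, True),
    ("LLaMA 2 70B", 51.2, False),
]


def summarize_primary(rows):
    """Count, top entry, and median score over primary-flagged rows."""
    primary = [(model, score) for model, score, is_primary in rows if is_primary]
    top_model, top_score = max(primary, key=lambda row: row[1])
    return {
        "count": len(primary),
        "top_model": top_model,
        "top_score": top_score,
        # Rounded to one decimal to match the precision shown on the page.
        "median": round(median([score for _, score in primary]), 1),
    }


summary = summarize_primary(rows)
print(summary)
```

Under this assumption, the highest score in the table (Claude Sonnet 3.5, 93.1%) is excluded from the summary because that row is not flagged primary.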