
BIG-Bench Hard

The 23-task subset of BIG-Bench on which earlier language models scored below the average human rater. The tasks mix logical, algorithmic, and language-understanding problems.

Reasoning · Text · Metric: accuracy · Max 100.0% · Released Oct 2022
Results: 6
Models scored: 6
Top: Seed 1.5 (91.6%)
Median: 84.6%

Best results

Top primary scores; one row per model.
Chart labels: #1 91.6%, #2 86.9%, #4 82.4%, #6 51.2%.

Frontier over time

Each dot is one model result; the line traces the running best score.
[Chart: best score over time. Y-axis: 0 to 100; x-axis: Jul 2023 to Jan 2025.]
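A minimal sketch of how a running-best line like this could be derived from the rows in the All results table. How the site actually computes its line is not stated, so the tie-breaking rule for results sharing a date and the function name `running_best` are assumptions made for illustration.

```python
from datetime import date

# (model, score in %, eval date), copied from the "All results" table.
results = [
    ("Seed 1.5", 91.6, date(2025, 1, 22)),
    ("Nova Pro", 86.9, date(2024, 12, 3)),
    ("Claude Opus 3", 86.8, date(2024, 10, 22)),
    ("Nova Lite", 82.4, date(2024, 12, 3)),
    ("Nova Micro", 79.5, date(2024, 12, 3)),
    ("LLaMA 2", 51.2, date(2023, 7, 19)),
]


def running_best(rows):
    """Return (date, best score so far) for each point where the best improves."""
    frontier = []
    best = float("-inf")
    # Assumption: sort by date, and within one date visit higher scores first,
    # so a single day contributes at most one frontier point.
    for _, score, when in sorted(rows, key=lambda r: (r[2], -r[1])):
        if score > best:
            best = score
            frontier.append((when, best))
    return frontier


frontier = running_best(results)
for when, best in frontier:
    print(when.isoformat(), best)
```

With the six rows above, the frontier moves four times: Jul 2023, Oct 2024, Dec 2024, and Jan 2025.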

All results

Showing one canonical row per model.
#  Model          Score  Conditions    Eval date     Source         Flags
1  Seed 1.5       91.6%  -             Jan 22, 2025  self reported  primary
2  Nova Pro       86.9%  3-shot · CoT  Dec 3, 2024   self reported  primary
3  Claude Opus 3  86.8%  3-shot · CoT  Oct 22, 2024  self reported  primary
4  Nova Lite      82.4%  3-shot · CoT  Dec 3, 2024   self reported  primary
5  Nova Micro     79.5%  3-shot · CoT  Dec 3, 2024   self reported  primary
6  LLaMA 2        51.2%  3-shot        Jul 19, 2023  paper          primary, verified
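
The summary figures near the top of the page (top model and median) can be checked against the six rows above. This is a sketch only; rounding the median to one decimal place is an assumption about how the page displays it.

```python
from statistics import median

# Scores in % per model, copied from the table above.
scores = {
    "Seed 1.5": 91.6,
    "Nova Pro": 86.9,
    "Claude Opus 3": 86.8,
    "Nova Lite": 82.4,
    "Nova Micro": 79.5,
    "LLaMA 2": 51.2,
}

# Model with the highest score.
top_model = max(scores, key=scores.get)

# With an even number of rows, the median is the mean of the two middle scores.
# Assumption: the page rounds to one decimal place.
med = round(median(scores.values()), 1)

print(f"Top: {top_model} ({scores[top_model]}%)")
print(f"Median: {med}%")
```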