BIG-Bench Hard
A 23-task subset of BIG-Bench on which earlier language models failed to outperform the average human rater. The tasks mix logical reasoning, algorithmic manipulation, and language understanding.
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Seed 1.5 | 91.6% | — | Jan 22, 2025 | self-reported | primary |
| 2 | Nova Pro | 86.9% | 3-shot · CoT | Dec 3, 2024 | self-reported | primary |
| 3 | Claude Opus 3 | 86.8% | 3-shot · CoT | Oct 22, 2024 | self-reported | primary |
| 4 | Nova Lite | 82.4% | 3-shot · CoT | Dec 3, 2024 | self-reported | primary |
| 5 | Nova Micro | 79.5% | 3-shot · CoT | Dec 3, 2024 | self-reported | primary |
| 6 | LLaMA 2 | 51.2% | 3-shot | Jul 19, 2023 | paper | primary verified |
