TAAFT

IFEval

Instruction-Following Eval

Verifiable instruction-following: ~25 instruction types whose compliance can be checked deterministically (e.g. word counts, formats).
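To make "checked deterministically" concrete, here is a minimal sketch of what such checks might look like. The function names, the three instruction types chosen, and the simple pass-rate scoring are illustrative assumptions, not the benchmark's actual implementation.

```python
import json


def check_min_words(response: str, min_words: int) -> bool:
    """Pass if the response has at least `min_words` whitespace-separated words."""
    return len(response.split()) >= min_words


def check_valid_json(response: str) -> bool:
    """Pass if the entire response parses as JSON."""
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return False
    return True


def check_all_lowercase(response: str) -> bool:
    """Pass if the response contains no uppercase letters."""
    return response == response.lower()


def pass_rate(results: list[bool]) -> float:
    """Hypothetical scoring: percentage of checks that passed."""
    return 100.0 * sum(results) / len(results)


# Example: one response evaluated against three hypothetical instructions.
response = '{"answer": "ok"}'
results = [
    check_valid_json(response),
    check_all_lowercase(response),
    check_min_words(response, 5),
]
print(results, round(pass_rate(results), 1))
```

Each check is a pure function of the response text, so the same response always gets the same verdict and no human or model judge is needed.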

Language · Text · Accuracy · Max 100.0% · Released Nov 2023
Results: 12
Models scored: 12
Top: Claude Sonnet 3.7 (Thinking), 93.2%
Median: 89.6%
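The top score and median can be reproduced from the twelve scores listed under All results. The sketch below assumes the median is the plain statistical median of the canonical rows, which the page does not state explicitly.

```python
from statistics import median

# Scores (%) copied from the All results listing, one canonical row per model.
scores = {
    "Claude Sonnet 3.7 (Thinking)": 93.2,
    "Nova Pro": 92.1,
    "Llama 3.3": 92.1,
    "Command A": 90.9,
    "Claude Sonnet 3.7": 90.8,
    "Nova Lite": 89.7,
    "Seed 1.5": 89.5,
    "Nova Micro": 87.2,
    "GPT 4.1": 87.0,
    "Mistral Small 3": 82.9,
    "Llama 3.2": 77.4,
    "Qwen 3.5 27B": 76.5,
}

top_model = max(scores, key=scores.get)
# With 12 values, the median is the mean of the 6th and 7th sorted scores.
median_score = round(median(scores.values()), 1)

print(top_model, scores[top_model], median_score)
```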

Best results

Top primary scores; one row per model.

Frontier over time

Each dot is one model result; the line traces the running best score.
[Chart: Best score over time. Y-axis: score, 0 to 100. X-axis: Dec 2024 to Feb 2026.]
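The running best line can be sketched as a cumulative maximum over results ordered by evaluation date. The data below is copied from the All results listing; how the page orders results that share a date is an assumption (here, listing order is kept).

```python
from datetime import date
from itertools import accumulate

# (eval date, score %) pairs from the All results listing, in listing order.
results = [
    (date(2025, 2, 24), 93.2),   # Claude Sonnet 3.7 (Thinking)
    (date(2024, 12, 3), 92.1),   # Nova Pro
    (date(2024, 12, 6), 92.1),   # Llama 3.3
    (date(2025, 4, 7), 90.9),    # Command A
    (date(2025, 2, 24), 90.8),   # Claude Sonnet 3.7
    (date(2024, 12, 3), 89.7),   # Nova Lite
    (date(2025, 1, 22), 89.5),   # Seed 1.5
    (date(2024, 12, 3), 87.2),   # Nova Micro
    (date(2025, 4, 14), 87.0),   # GPT 4.1
    (date(2025, 1, 30), 82.9),   # Mistral Small 3
    (date(2025, 9, 25), 77.4),   # Llama 3.2
    (date(2026, 2, 24), 76.5),   # Qwen 3.5 27B
]

# Stable sort by date, so same-day results keep their listing order.
ordered = sorted(results, key=lambda r: r[0])

# Running best: the highest score seen up to and including each result.
running_best = list(accumulate((score for _, score in ordered), max))

for (day, score), best in zip(ordered, running_best):
    print(day.isoformat(), score, best)
```

Under these assumptions the line starts at 92.1 in Dec 2024 and steps up to 93.2 on 24 Feb 2025, where it stays for the later, lower-scoring results.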

All results

Showing one canonical row per model.
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 3.7 (Thinking) | 93.2% | | 24 Feb 2025 | Self-reported | Primary |
| 2 | Nova Pro | 92.1% | 0-shot | 03 Dec 2024 | Self-reported | Primary |
| 3 | Llama 3.3 | 92.1% | | 06 Dec 2024 | Self-reported | Primary |
| 4 | Command A | 90.9% | | 07 Apr 2025 | Self-reported | Primary |
| 5 | Claude Sonnet 3.7 | 90.8% | | 24 Feb 2025 | Self-reported | Primary |
| 6 | Nova Lite | 89.7% | 0-shot | 03 Dec 2024 | Self-reported | Primary |
| 7 | Seed 1.5 | 89.5% | 0-shot · CoT | 22 Jan 2025 | Self-reported | Primary |
| 8 | Nova Micro | 87.2% | 0-shot | 03 Dec 2024 | Self-reported | Primary |
| 9 | GPT 4.1 | 87.0% | | 14 Apr 2025 | Self-reported | Primary |
| 10 | Mistral Small 3 | 82.9% | | 30 Jan 2025 | Self-reported | Primary |
| 11 | Llama 3.2 | 77.4% | | 25 Sep 2025 | Self-reported | Primary |
| 12 | Qwen 3.5 27B | 76.5% | | 24 Feb 2026 | Third-party | Primary, Verified |