
IFEval

Instruction-Following Eval

Verifiable instruction-following: ~25 instruction types whose compliance can be checked deterministically (e.g. word counts, formats).
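To illustrate what "checked deterministically" can mean in practice, here is a minimal sketch of three rule-based checkers and an accuracy helper. The function names and the specific rules are hypothetical and chosen only for illustration; they are not the benchmark's own implementation.

```python
import json


def check_max_words(response: str, limit: int) -> bool:
    """Pass if the response has at most `limit` whitespace-separated words."""
    return len(response.split()) <= limit


def check_is_json(response: str) -> bool:
    """Pass if the whole response parses as JSON."""
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return False
    return True


def check_all_lowercase(response: str) -> bool:
    """Pass if the response contains no uppercase letters."""
    return response == response.lower()


def accuracy(results: list[bool]) -> float:
    """Share of checks that passed, as a percentage (hypothetical scoring helper)."""
    return 100.0 * sum(results) / len(results) if results else 0.0


# Example: three responses, each graded against one instruction.
graded = [
    check_max_words("a short answer", limit=5),
    check_is_json('{"answer": 42}'),
    check_all_lowercase("This has a capital letter"),
]
score = accuracy(graded)
```

Because each check is a pure function of the response text, the same response always receives the same verdict, with no human or model grader involved.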

Language · Text · Accuracy · Max 100.0% · Released Nov 2023
Results: 12 (primary)
Models scored: 12
Top: Claude Sonnet 3.7 (Thinking), 93.2%
Median: 89.6%
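The summary figures can be reproduced from the rows flagged Primary in the results listing. The sketch below assumes the summary covers primary results only, which is consistent with the counts shown; the variable names are illustrative.

```python
from statistics import median

# Scores (in %) of the 12 rows flagged "Primary" in the results listing.
primary_scores = [
    93.2, 92.1, 92.1, 90.9, 90.8, 89.7,
    89.5, 87.2, 87.0, 82.9, 77.4, 76.5,
]

count = len(primary_scores)
top = max(primary_scores)
# With an even number of values, the median is the mean of the two middle ones.
mid = round(median(primary_scores), 1)

print(f"results={count} top={top}% median={mid}%")
```

With 12 values, the median is the mean of the sixth and seventh scores (89.7 and 89.5).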

Best results

Top primary scores; one row per model.

Frontier over time

Each dot is one model result; the line traces the running best score.
[Chart: best score over time. Y-axis: score, 0 to 100. X-axis: evaluation date, Dec 2024 to Feb 2026.]
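A running-best line of this kind can be computed by sorting results by evaluation date and taking a cumulative maximum. The sketch below uses a subset of the primary rows from the results listing; whether the chart itself is limited to primary results is an assumption.

```python
from datetime import date
from itertools import accumulate

# (model, eval date, score in %) for a subset of primary rows.
results = [
    ("Command A", date(2025, 4, 7), 90.9),
    ("Nova Pro", date(2024, 12, 3), 92.1),
    ("Claude Sonnet 3.7 (Thinking)", date(2025, 2, 24), 93.2),
    ("Mistral Small 3", date(2025, 1, 30), 82.9),
    ("Seed 1.5", date(2025, 1, 22), 89.5),
    ("Qwen 3.5 27B", date(2026, 2, 24), 76.5),
]

# Sort chronologically, then carry the best score seen so far forward.
ordered = sorted(results, key=lambda row: row[1])
running_best = list(accumulate((score for _, _, score in ordered), max))

# Pair each date with the frontier value at that point in time.
frontier = [(row[1], best) for row, best in zip(ordered, running_best)]
```

Each dot on the chart corresponds to one entry in `ordered`; the line corresponds to `frontier`, which never decreases.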

All results

Showing all configurations, including non-primary alternates.
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Qwen 3.7 Max | 94.3% | 0-shot · CoT · standard | 20 May 2026 | Self-reported | |
| 2 | Claude Sonnet 3.7 (Thinking) | 93.2% | | 24 Feb 2025 | Self-reported | Primary |
| 3 | Nova Pro | 92.1% | 0-shot | 03 Dec 2024 | Self-reported | Primary |
| 4 | Llama 3.3 | 92.1% | | 06 Dec 2024 | Self-reported | Primary |
| 5 | Command A | 90.9% | | 07 Apr 2025 | Self-reported | Primary |
| 6 | Claude Sonnet 3.7 | 90.8% | | 24 Feb 2025 | Self-reported | Primary |
| 7 | Nova Lite | 89.7% | 0-shot | 03 Dec 2024 | Self-reported | Primary |
| 8 | Seed 1.5 | 89.5% | 0-shot · CoT | 22 Jan 2025 | Self-reported | Primary |
| 9 | Claude Sonnet 3.5 | 87.8% | 0-shot · standard | 22 Oct 2024 | Self-reported | |
| 10 | Nova Micro | 87.2% | 0-shot | 03 Dec 2024 | Self-reported | Primary |
| 11 | GPT 4.1 | 87.0% | | 14 Apr 2025 | Self-reported | Primary |
| 12 | Mistral Small 3 | 82.9% | | 30 Jan 2025 | Self-reported | Primary |
| 13 | Llama 3.2 | 77.4% | | 25 Sep 2025 | Self-reported | Primary |
| 14 | Claude Haiku 3 | 77.2% | 0-shot · standard | 22 Oct 2024 | Self-reported | |
| 15 | Qwen 3.5 27B | 76.5% | | 24 Feb 2026 | Third-party | Primary, Verified |