IFBench

Instruction Following Benchmark

Measures how reliably a model follows complex multi-constraint instructions, a known weak spot for many otherwise strong models.

Language · Text · Accuracy · Max 100.0% · Released Jun 2025

Results

Models scored

Top: 76.1% (Qwen 3.5 122B A10B)

Median: 73.2%

Best results

Top primary scores; one row per model.

Chart: each dot is one model result; the line traces the running best score.

Showing all configurations including non-primary alternates.

#	Model	Score	Conditions	Eval date	Source	Flags
1	Qwen 3.7 Max	79.1%	0-shot · CoT · standard	20 May 2026	Self-reported
2	Qwen 3.5 122B A10B	76.1%	—	24 Apr 2026	Third-party	Primary Verified
3	Qwen 3.5 35B A3B	70.2%	—	15 Feb 2025	Third-party	Primary Verified
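
The summary figures above can be reproduced from the table rows. The sketch below is illustrative only: it assumes the top and median cards are computed over rows flagged Primary (the page does not state this), and the names `rows`, `primary`, and `running_best` are hypothetical. Under that assumption the median of the two primary scores is 73.15%, which is consistent with the 73.2% shown if the page rounds half up.

```python
from datetime import date
from statistics import median

# Rows copied from the table above: (model, score %, eval date, is_primary).
rows = [
    ("Qwen 3.7 Max", 79.1, date(2026, 5, 20), False),
    ("Qwen 3.5 122B A10B", 76.1, date(2026, 4, 24), True),
    ("Qwen 3.5 35B A3B", 70.2, date(2025, 2, 15), True),
]

# Assumption: the summary cards only consider rows flagged Primary.
primary = [r for r in rows if r[3]]

# Highest-scoring primary row and the median primary score.
top = max(primary, key=lambda r: r[1])
median_score = median(r[1] for r in primary)


def running_best(results):
    """Return (eval date, best score so far) pairs in eval-date order."""
    best = float("-inf")
    trace = []
    for _, score, when, _ in sorted(results, key=lambda r: r[2]):
        best = max(best, score)
        trace.append((when, best))
    return trace


# Running-best line for primary rows only, and for all configurations.
primary_trace = running_best(primary)
full_trace = running_best(rows)
```

Calling `running_best(rows)` instead of `running_best(primary)` includes the self-reported Qwen 3.7 Max result, so the traced line ends at 79.1% rather than 76.1%.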