TAAFT

AI model leaderboard

Every tracked model is ranked across the headline benchmarks. The Intelligence Index averages each model's normalized scores over the benchmarks it has been scored on.
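The averaging can be sketched as follows. The page does not define its normalization, so this is only an illustration. It assumes a plain mean of the available percentage scores, which reproduces the listed rows (for example Kimi K2.6: 90.5, 54.0, 80.2 and 89.6 average to 78.6 with 4/8 coverage). The three-benchmark minimum is also an assumption, inferred from rows with only one or two scores showing no index. The names `intelligence_index`, `coverage`, `MIN_BENCHMARKS` and `TOTAL_BENCHMARKS` are hypothetical.

```python
from statistics import mean
from typing import Optional, Sequence

TOTAL_BENCHMARKS = 8  # the table lists eight headline benchmarks
MIN_BENCHMARKS = 3    # assumption: rows with fewer scores show no index


def intelligence_index(scores: Sequence[Optional[float]]) -> Optional[float]:
    """Mean of the available scores, rounded to one decimal place.

    `scores` holds one entry per headline benchmark; None marks a
    benchmark the model has not been scored on. Returns None when too
    few benchmarks are available (assumed threshold, see above).
    """
    available = [s for s in scores if s is not None]
    if len(available) < MIN_BENCHMARKS:
        return None
    return round(mean(available), 1)


def coverage(scores: Sequence[Optional[float]]) -> str:
    """Format the scored-benchmark count the way the table does, e.g. '4/8'."""
    scored = sum(s is not None for s in scores)
    return f"{scored}/{TOTAL_BENCHMARKS}"


# Scores taken from the Kimi K2.6 row; the positions of the None entries
# are arbitrary here, since only the available values affect the result.
kimi_k26 = [90.5, 54.0, 80.2, 89.6, None, None, None, None]
print(intelligence_index(kimi_k26), coverage(kimi_k26))
```

Because only the available scores are averaged, an index built from three benchmarks is not directly comparable to one built from six, which is why the coverage fraction is worth reading alongside it.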

Best overall: 90.8 (Intelligence Index)
Best at knowledge: 94.4 (GPQA Diamond)
Best at math: 100.0 (AIME 2025)
Best at coding: 87.6 (SWE-bench Verified)
Best at multimodal: 84.2 (MMMU)

Models × benchmarks

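Cells are best primary scores. Color intensity reflects normalized score. Where a row ends with an Intelligence Index, the fraction after it (for example 4/8) is the number of the eight headline benchmarks the model has a score for.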
# Model MMLU-Pro GPQA Diamond Humanity's Last Exam AIME 2025 SWE-bench Verified LiveCodeBench MMMU AA-LCR Intelligence Index
1 Deepseek V4 Pro 93.5%
2 Kimi K2.6 90.5% 54.0% 80.2% 89.6% 78.6 4/8
3 Kimi K2.5 76.8% 85.0%
4 Deepseek 3.2 85.0% 82.4% 40.8% 93.1% 73.1% 83.3% 76.3 6/8
5 GLM 4.6 81.0% 17.2% 93.9% 68.0% 82.8% 68.6 5/8
6 Gemma 4 85.2% 84.3% 80.0% 83.2 3/8
7 Grok 4 Heavy 88.4% 44.4% 100.0% 79.4% 78.1 4/8
8 Grok 3 Think 84.6% 93.3% 79.4% 85.8 3/8
9 Grok 4 87.5% 25.4% 91.7% 79.0% 70.9 4/8
10 DeepSeek V3.1 Terminus 85.0% 80.7% 21.7% 88.4% 74.9% 70.1 5/8
11 DeepSeek V3.2 Exp 85.0% 79.9% 89.3% 67.8% 74.1% 79.2 5/8
12 Qwen3-235B-A22B 81.5% 70.7%
13 Qwen3 235B A22B 81.5% 70.7%
14 Gemini 2.5 Pro 84.0% 18.8% 86.7% 63.8% 70.4% 64.7 5/8
15 Nemotron 3 Nano 78.3% 75.0% 89.1% 68.3% 77.7 4/8
16 Qwen3 30B A3B 65.8% 70.9% 62.6% 66.4 3/8
17 Grok 3 79.9% 75.4% 57.0% 73.2% 71.4 4/8
18 Kimi K2 Instruct 75.1% 49.5% 65.8% 53.7% 61.0 4/8
19 Magistral Medium 70.8% 64.9% 50.3% 62.0 3/8
20 Llama 4 Behemoth 82.2% 73.7% 49.4% 76.1% 70.4 4/8
21 Llama 4 Maverick 80.5% 69.8% 43.4% 73.4% 66.8 4/8
22 Grok 3 mini 78.9% 66.2% 41.5% 69.4% 64.0 4/8
23 Mistral Large 3 43.9% 34.4%
24 Gemini 2.5 Flash-Lite 64.6% 5.10% 49.8% 31.6% 33.7% 72.9% 43.0 6/8
25 Llama 4 Scout 74.3% 57.2% 32.8% 69.4% 58.4 4/8
26 Claude Haiku 3.5 41.6% 65.0% 40.6% 49.1 3/8
27 Claude Haiku 4.5 73.0% 80.7% 73.3% 73.2% 75.1 4/8
28 Claude Opus 3 50.4%
29 Claude Opus 4.5 87.0% 80.9% 80.7% 82.9 3/8
30 Claude Opus 4.6 91.3% 80.8%
31 Claude Opus 4.7 94.2% 46.9% 87.6% 76.2 3/8
32 Claude Sonnet 3.7 62.3% 62.3% 71.8% 65.5 3/8
33 Claude Sonnet 3.7 (Thinking) 78.2% 62.3% 75.0% 71.8 3/8
34 Claude Sonnet 4 75.4% 70.5% 72.7% 74.4% 73.3 4/8
35 Claude Sonnet 4.5 83.4% 87.0% 77.2% 77.8% 81.4 4/8
36 Claude Sonnet 4.6 89.9% 33.2% 79.6% 67.6 3/8
37 Command A 69.6% 50.8%
38 DeepSeek 3.2 Speciale 30.6% 96.0%
39 DeepSeek V3 75.9% 59.1% 42.0% 59.0 3/8
40 DeepSeek-R1 84.0% 71.5% 70.0% 49.2% 68.7 4/8
41 Devstral 2 72.2%
42 Gemini 2.5 Flash (Thinking) 82.8% 11.0% 72.0% 60.4% 56.6 4/8
43 Gemini 2.5 Pro (Thinking) 86.4% 21.6% 88.0% 59.6% 63.9 4/8
44 Gemini 3 Deep Think 93.8% 41.0%
45 Gemini 3 Flash 90.4% 78.0%
46 Gemini 3 Flash (Thinking) 90.4% 33.7% 95.2% 78.0% 74.3 4/8
47 Gemini 3 Pro 91.9% 37.5% 95.0% 76.2% 75.2 4/8
48 Gemini 3.1 Pro 94.3% 44.4% 80.6% 73.1 3/8
49 Gemma 3 78.0% 72.6%
50 GLM 5 86.0% 77.8%
51 GLM-5.1 86.2% 31.0%
52 GPT 4.1 66.3% 55.0% 75.0% 65.4 3/8
53 GPT 5 77.8% 6.30% 61.9% 52.8% 74.4% 54.6 5/8
54 GPT 5 (Thinking) 85.7% 24.8% 94.6% 74.9% 84.2% 72.8 5/8
55 GPT 5.1 88.1% 94.6% 74.9% 84.2% 85.5 4/8
56 GPT 5.1 Thinking 88.1% 94.6%
57 GPT 5.2 Pro 93.2%
58 GPT 5.2 Thinking 92.4% 100.0% 80.0% 90.8 3/8
59 GPT 5.3 Codex 92.6% 56.8%
60 GPT 5.4 92.8% 57.7%
61 GPT 5.4 Mini 88.0%
62 GPT 5.4 Nano 82.8%
63 GPT 5.4 Pro 94.4%
64 GPT 5.5 93.6% 41.4%
65 GPT 5.5 Instant 81.2%
66 GPT OSS 120B 90.0% 80.1%
67 GPT-4 Turbo 50.4%
68 GPT-4o 53.6% 69.1%
69 Grok Code Fast 1 70.8%
70 Llama 3.1 Nemotron Ultra 76.0%
71 Llama 3.2 32.8%
72 Llama 3.3 68.9% 50.5%
73 MiniMax M2.5 80.2%
74 Mistral Medium 3.5 77.6%
75 Mistral Small 3 66.3%
76 Muse Spark 89.5% 42.8% 77.4% 69.9 3/8
77 Nemotron 3 Super 75.7% 60.0%
78 Nova Lite 42.0%
79 Nova Micro 40.0%
80 Nova Premier 42.4%
81 Nova Pro 46.9%
82 o1 78.0% 8.12% 79.2% 48.9% 77.6% 58.4 5/8
83 o3 83.3% 20.3% 88.9% 69.1% 82.9% 68.9 5/8
84 o4 mini 92.7% 68.1% 81.6% 80.8 3/8
85 Opus 4.1 Thinking 80.9% 74.5%
86 Phi 4 reasoning plus 76.0% 69.3% 78.0% 74.4 3/8
87 Pixtral 12B 52.0%
88 Pixtral Large 64.0%
89 Qwen 3.5 122B A10B 86.6% 72.0% 76.9% 78.5 3/8
90 Qwen 3.5 27B 86.1% 85.5% 72.4% 81.3 3/8
91 Qwen 3.5 35B A3B 84.2% 69.2%
92 Qwen3 Coder 67.0%
93 Qwen3-30B-A3B 65.8%
94 R1 1776 71.5% 70.0%
95 Seed 1.5 80.1% 65.0% 73.9% 73.0 3/8

Capability scatter

Each dot is a model. Position shows two-axis capability; size reflects how many headline benchmarks the model has been scored on.
[Figure: Capability scatter, MMLU-Pro vs SWE-bench Verified. Both axes run from 0 to 100. Labeled points:]
Deepseek 3.2: MMLU-Pro 85.0, SWE-bench Verified 73.1
DeepSeek V3.2 Exp: MMLU-Pro 85.0, SWE-bench Verified 67.8
Claude Haiku 3.5: MMLU-Pro 41.6, SWE-bench Verified 40.6
DeepSeek V3: MMLU-Pro 75.9, SWE-bench Verified 42.0
DeepSeek-R1: MMLU-Pro 84.0, SWE-bench Verified 49.2
Qwen 3.5 27B: MMLU-Pro 86.1, SWE-bench Verified 72.4
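As a rough illustration of how such a scatter could be prepared, the sketch below pairs each model's MMLU-Pro and SWE-bench Verified scores with a dot size derived from its benchmark coverage. The scores and coverage counts are taken from the chart labels and table rows above. The linear size mapping, its `min_area` and `max_area` bounds, and the names `Point` and `marker_area` are assumptions, since the page does not say how dot size is computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    model: str
    mmlu_pro: float   # MMLU-Pro score, percent
    swe_bench: float  # SWE-bench Verified score, percent
    coverage: int     # headline benchmarks scored, out of 8


POINTS = [
    Point("Deepseek 3.2", 85.0, 73.1, 6),
    Point("DeepSeek V3.2 Exp", 85.0, 67.8, 5),
    Point("Claude Haiku 3.5", 41.6, 40.6, 3),
    Point("DeepSeek V3", 75.9, 42.0, 3),
    Point("DeepSeek-R1", 84.0, 49.2, 4),
    Point("Qwen 3.5 27B", 86.1, 72.4, 3),
]


def marker_area(coverage: int, total: int = 8,
                min_area: float = 20.0, max_area: float = 200.0) -> float:
    """Map benchmark coverage to a dot area (hypothetical linear scale)."""
    return min_area + (max_area - min_area) * coverage / total


for p in POINTS:
    print(f"{p.model}: ({p.mmlu_pro}, {p.swe_bench}) area={marker_area(p.coverage)}")
```

The resulting coordinates and areas can be passed to any plotting library; only models with both scores can be placed on the chart.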