AI model leaderboard

Every tracked model ranked across the headline benchmarks. The Intelligence Index averages each model's normalized scores; click any benchmark column header to sort by it.
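
As a rough illustration of how such an index could be computed, here is a minimal sketch. It assumes that percentage scores are already on a normalized 0 to 100 scale, that benchmarks without a score are skipped, and that a model needs at least three scored benchmarks to receive an index. These are assumptions inferred from the table, not a documented formula, and the function name `intelligence_index` and the `min_benchmarks` parameter are hypothetical. Because the flattened table does not show which column each value belongs to, the example passes bare score values.

```python
from typing import Optional, Sequence


def intelligence_index(
    scores: Sequence[Optional[float]],
    min_benchmarks: int = 3,  # assumed threshold, not stated by the page
) -> Optional[float]:
    """Average the available scores, skipping benchmarks with no result.

    Assumes each score is already normalized to a 0-100 scale.
    Returns None when too few benchmarks have been scored.
    """
    available = [s for s in scores if s is not None]
    if len(available) < min_benchmarks:
        return None
    return round(sum(available) / len(available), 1)


# Scores listed for the top two rows of the table (3 of 8 benchmarks each).
print(intelligence_index([92.4, 100.0, 80.0]))  # 90.8
print(intelligence_index([84.6, 93.3, 79.4]))   # 85.8

# A row with only two scores gets no index under the assumed threshold.
print(intelligence_index([93.6, 41.4]))         # None
```

Under these assumptions the sketch reproduces the index shown for the first two rows (90.8 and 85.8), which suggests a plain mean over scored benchmarks, although the site may normalize some benchmarks differently.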

Best overall: 90.8 Intelligence Index
Best at knowledge: 94.4 GPQA Diamond
Best at math: 100.0 AIME 2025
Best at coding: 87.6 SWE-bench Verified
Best at multimodal: 84.2 MMMU

Models × benchmarks

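Each cell shows the model's best primary score on that benchmark; benchmarks without a score are left blank. Color intensity reflects the normalized score. The last two values in a row are the Intelligence Index and the number of headline benchmarks scored (for example, 3/8). Click a column header to sort.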
# Model MMLU-Pro GPQA Diamond Humanity's Last Exam AIME 2025 SWE-bench Verified LiveCodeBench MMMU AA-LCR Intelligence Index
1 GPT 5.2 Thinking 92.4% 100.0% 80.0% 90.8 3/8
2 Grok 3 Think 84.6% 93.3% 79.4% 85.8 3/8
3 GPT 5.1 88.1% 94.6% 74.9% 84.2% 85.5 4/8
4 Gemma 4 85.2% 84.3% 80.0% 83.2 3/8
5 Claude Opus 4.5 87.0% 80.9% 80.7% 82.9 3/8
6 Claude Sonnet 4.5 83.4% 87.0% 77.2% 77.8% 81.4 4/8
7 Qwen 3.5 27B 86.1% 85.5% 72.4% 81.3 3/8
8 o4 mini 92.7% 68.1% 81.6% 80.8 3/8
9 DeepSeek V3.2 Exp 85.0% 79.9% 89.3% 67.8% 74.1% 79.2 5/8
10 Kimi K2.6 90.5% 54.0% 80.2% 89.6% 78.6 4/8
11 Qwen 3.5 122B A10B 86.6% 72.0% 76.9% 78.5 3/8
12 Grok 4 Heavy 88.4% 44.4% 100.0% 79.4% 78.1 4/8
13 Nemotron 3 Nano 78.3% 75.0% 89.1% 68.3% 77.7 4/8
14 Deepseek 3.2 85.0% 82.4% 40.8% 93.1% 73.1% 83.3% 76.3 6/8
15 Claude Opus 4.7 94.2% 46.9% 87.6% 76.2 3/8
16 Gemini 3 Pro 91.9% 37.5% 95.0% 76.2% 75.2 4/8
17 Claude Haiku 4.5 73.0% 80.7% 73.3% 73.2% 75.1 4/8
18 Phi 4 reasoning plus 76.0% 69.3% 78.0% 74.4 3/8
19 Gemini 3 Flash (Thinking) 90.4% 33.7% 95.2% 78.0% 74.3 4/8
20 Claude Sonnet 4 75.4% 70.5% 72.7% 74.4% 73.3 4/8
21 Gemini 3.1 Pro 94.3% 44.4% 80.6% 73.1 3/8
22 Seed 1.5 80.1% 65.0% 73.9% 73.0 3/8
23 GPT 5 (Thinking) 85.7% 24.8% 94.6% 74.9% 84.2% 72.8 5/8
24 Claude Sonnet 3.7 (Thinking) 78.2% 62.3% 75.0% 71.8 3/8
25 Grok 3 79.9% 75.4% 57.0% 73.2% 71.4 4/8
26 Grok 4 87.5% 25.4% 91.7% 79.0% 70.9 4/8
27 Llama 4 Behemoth 82.2% 73.7% 49.4% 76.1% 70.4 4/8
28 DeepSeek V3.1 Terminus 85.0% 80.7% 21.7% 88.4% 74.9% 70.1 5/8
29 Muse Spark 89.5% 42.8% 77.4% 69.9 3/8
30 o3 83.3% 20.3% 88.9% 69.1% 82.9% 68.9 5/8
31 DeepSeek-R1 84.0% 71.5% 70.0% 49.2% 68.7 4/8
32 GLM 4.6 81.0% 17.2% 93.9% 68.0% 82.8% 68.6 5/8
33 Claude Sonnet 4.6 89.9% 33.2% 79.6% 67.6 3/8
34 Llama 4 Maverick 80.5% 69.8% 43.4% 73.4% 66.8 4/8
35 Qwen3 30B A3B 65.8% 70.9% 62.6% 66.4 3/8
36 Claude Sonnet 3.7 62.3% 62.3% 71.8% 65.5 3/8
37 GPT 4.1 66.3% 55.0% 75.0% 65.4 3/8
38 Gemini 2.5 Pro 84.0% 18.8% 86.7% 63.8% 70.4% 64.7 5/8
39 Grok 3 mini 78.9% 66.2% 41.5% 69.4% 64.0 4/8
40 Gemini 2.5 Pro (Thinking) 86.4% 21.6% 88.0% 59.6% 63.9 4/8
41 Magistral Medium 70.8% 64.9% 50.3% 62.0 3/8
42 Kimi K2 Instruct 75.1% 49.5% 65.8% 53.7% 61.0 4/8
43 DeepSeek V3 75.9% 59.1% 42.0% 59.0 3/8
44 Llama 4 Scout 74.3% 57.2% 32.8% 69.4% 58.4 4/8
45 o1 78.0% 8.12% 79.2% 48.9% 77.6% 58.4 5/8
46 Gemini 2.5 Flash (Thinking) 82.8% 11.0% 72.0% 60.4% 56.6 4/8
47 GPT 5 77.8% 6.30% 61.9% 52.8% 74.4% 54.6 5/8
48 Claude Haiku 3.5 41.6% 65.0% 40.6% 49.1 3/8
49 Gemini 2.5 Flash-Lite 64.6% 5.10% 49.8% 31.6% 33.7% 72.9% 43.0 6/8
50 GPT 5.5 Instant 81.2%
51 Mistral Medium 3.5 77.6%
52 Deepseek V4 Pro 93.5%
53 GPT 5.5 93.6% 41.4%
54 GLM-5.1 86.2% 31.0%
55 GPT 5.4 Pro 94.4%
56 GPT 5.4 92.8% 57.7%
57 Gemma 3 78.0% 72.6%
58 GPT-4 Turbo 50.4%
59 GPT 5.1 Thinking 88.1% 94.6%
60 Claude Opus 3 50.4%
61 Mistral Small 3 66.3%
62 Pixtral Large 64.0%
63 GPT 5.4 Mini 88.0%
64 GPT 5.4 Nano 82.8%
65 Qwen 3.5 35B A3B 84.2% 69.2%
66 Nova Premier 42.4%
67 Command A 69.6% 50.8%
68 Qwen3 Coder 67.0%
69 Claude Opus 4.6 91.3% 80.8%
70 GPT 5.2 Pro 93.2%
71 Gemini 3 Flash 90.4% 78.0%
72 Gemini 3 Deep Think 93.8% 41.0%
73 MiniMax M2.5 80.2%
74 GLM 5 86.0% 77.8%
75 Kimi K2.5 76.8% 85.0%
76 DeepSeek 3.2 Speciale 30.6% 96.0%
77 Devstral 2 72.2%
78 Mistral Large 3 43.9% 34.4%
79 GPT OSS 120B 90.0% 80.1%
80 Nemotron 3 Super 75.7% 60.0%
81 Qwen3-235B-A22B 81.5% 70.7%
82 Qwen3-30B-A3B 65.8%
83 R1 1776 71.5% 70.0%
84 Grok Code Fast 1 70.8%
85 Llama 3.1 Nemotron Ultra 76.0%
86 Opus 4.1 Thinking 80.9% 74.5%
87 Nova Pro 46.9%
88 Nova Lite 42.0%
89 Nova Micro 40.0%
90 GPT 5.3 Codex 92.6% 56.8%
91 Llama 3.2 32.8%
92 Pixtral 12B 52.0%
93 GPT-4o 53.6% 69.1%
94 Qwen3 235B A22B 81.5% 70.7%
95 Llama 3.3 68.9% 50.5%

Capability scatter

Each dot is a model. Position shows two-axis capability; size reflects how many headline benchmarks the model has been scored on.
[Figure: Capability scatter, MMLU-Pro vs SWE-bench Verified. Both axes run from 0 to 100. Labeled points (MMLU-Pro, SWE-bench Verified): Qwen 3.5 27B (86.1, 72.4); DeepSeek V3.2 Exp (85.0, 67.8); Deepseek 3.2 (85.0, 73.1); DeepSeek-R1 (84.0, 49.2); DeepSeek V3 (75.9, 42.0); Claude Haiku 3.5 (41.6, 40.6).]