AI Model Leaderboard: performance evaluation of the world's leading AI models
Total models: 247
Last updated: 2025-08-12
Model | Company | Context Length | AI Analysis Index | MMLU-Pro (Reasoning & Knowledge) | GPQA Diamond (Science Reasoning) | Final Exam | LiveCodeBench | SciCode | HumanEval | MATH-500 | AIME 2024 | Chatbot Arena | Price per 1M Tokens | Output Tokens per Second |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GPT-5 (high) | | 400k | 69 | 87% | 85% | 27% | 67% | 43% | 73% | 94% | 76% | 96% | 99% | 99% |
GPT-5 (medium) | | 400k | 68 | 87% | 84% | 24% | 70% | 41% | 71% | 92% | 73% | 92% | 99% | 98% |
Grok 4 | | 256k | 68 | 87% | 88% | 24% | 82% | 46% | 54% | 93% | 68% | 94% | 99% | 98% |
o3-pro | | 200k | 68 | 85% | | | | | | | | | | |
o3 | | 200k | 67 | 85% | 83% | 20% | 78% | 41% | 71% | 88% | 69% | 90% | 99% | 99% |
o4-mini (high) | | 200k | 65 | 83% | 78% | 18% | 80% | 47% | 69% | 91% | 55% | 94% | 99% | 99% |
Gemini 2.5 Pro | | 1m | 65 | 86% | 84% | 21% | 80% | 43% | 49% | 88% | 66% | 89% | 97% | |
GPT-5 mini | | 400k | 64 | 83% | 80% | 15% | 69% | 41% | 71% | 85% | 66% | | | |
Qwen3 235B 2507 (Reasoning) | | 256k | 64 | 84% | 79% | 15% | 79% | 42% | 51% | 91% | 67% | 94% | 98% | 98% |
GPT-5 (low) | | 400k | 63 | 86% | 81% | 18% | 75% | 39% | 67% | 83% | 59% | 83% | 99% | 99% |
Claude 4.1 Opus Thinking | | 200k | 61 | | | | | | | | | | | |
gpt-oss-120B (high) | | 131k | 61 | 81% | 78% | 19% | 64% | 36% | 69% | 89% | 51% | | | |
Gemini 2.5 Pro (Mar '25) | | 1m | 59 | 86% | 84% | 17% | 78% | 40% | 87% | 98% | 99% | | | |
Claude 4 Sonnet Thinking | | 200k | 59 | 84% | 78% | 10% | 66% | 40% | 55% | 74% | 65% | 77% | 99% | |
DeepSeek R1 0528 | | 128k | 59 | 85% | 81% | 15% | 77% | 40% | 40% | 76% | 56% | 89% | 98% | 97% |
Gemini 2.5 Flash (Reasoning) | | 1m | 58 | 83% | 79% | 11% | 70% | 39% | 50% | 73% | 62% | 82% | 98% | 96% |
Grok 3 mini Reasoning (high) | | 1m | 58 | 83% | 79% | 11% | 70% | 41% | 46% | 85% | 50% | 93% | 99% | 98% |
Gemini 2.5 Pro (May '25) | | 1m | 58 | 84% | 82% | 15% | 77% | 42% | 84% | 99% | 99% | | | |
GLM-4.5 | | 128k | 56 | 84% | 78% | 12% | 74% | 35% | 44% | 74% | 48% | 87% | 98% | 98% |
o3-mini (high) | | 200k | 55 | 80% | 77% | 12% | 73% | 40% | 86% | 99% | | | | |
Claude 4 Opus Thinking | | 200k | 55 | 87% | 80% | 12% | 64% | 40% | 54% | 73% | 34% | 76% | 98% | |
GPT-5 nano | | 400k | 54 | 77% | 67% | 8% | 60% | 34% | 66% | 78% | 40% | | | |
Qwen3 30B 2507 (Reasoning) | | 33k | 53 | 81% | 71% | 10% | 71% | 33% | 91% | 98% | | | | |
MiniMax M1 80k | | 1m | 53 | 82% | 70% | 8% | 71% | 37% | 42% | 61% | 54% | 85% | 98% | |
o3-mini | | 200k | 53 | 79% | 75% | 9% | 72% | 40% | 77% | 97% | 97% | | | |
Llama Nemotron Super 49B v1.5 (Reasoning) | | 128k | 52 | 81% | 75% | 7% | 74% | 35% | 37% | 77% | 34% | 86% | 98% | 95% |
o1 | | 200k | 52 | 84% | 75% | 8% | 68% | 36% | 72% | 97% | 97% | | | |
MiniMax M1 40k | | 1m | 51 | 81% | 68% | 8% | 66% | 38% | 81% | 97% | | | | |
Qwen3 235B 2507 (Non-reasoning) | | 256k | 51 | 83% | 75% | 11% | 52% | 36% | 46% | 72% | 31% | 72% | 98% | 96% |
Sonar Reasoning Pro | | 127k | 51 | 79% | 96% | | | | | | | | | |
EXAONE 4.0 32B (Reasoning) | | 131k | 51 | 82% | 74% | 11% | 75% | 34% | 36% | 80% | 14% | 84% | 98% | 97% |
Gemini 2.5 Flash (April '25) (Reasoning) | | 1m | 50 | 80% | 70% | 12% | 51% | 36% | 84% | 98% | | | | |
DeepSeek R1 (Jan '25) | | 128k | 50 | 84% | 71% | 9% | 62% | 36% | 68% | 97% | 98% | | | |
GLM-4.5-Air | | 128k | 49 | 82% | 73% | 7% | 68% | 31% | 67% | 97% | 93% | | | |
o1-preview | | 128k | 49 | 92% | 96% | | | | | | | | | |
gpt-oss-20B (high) | | 131k | 49 | 74% | 62% | 9% | 72% | 35% | 61% | 62% | 19% | | | |
Claude 4.1 Opus | | 200k | 49 | | | | | | | | | | | |
Kimi K2 | | 128k | 49 | 82% | 77% | 7% | 56% | 18% | 42% | 57% | 51% | 69% | 97% | 93% |
Qwen3 235B (Reasoning) | | 33k | 48 | 83% | 70% | 12% | 62% | 40% | 39% | 82% | 0% | 84% | 93% | |
QwQ-32B | | 131k | 48 | 76% | 59% | 8% | 63% | 36% | 78% | 96% | 98% | | | |
Gemini 2.5 Flash | | 1m | 47 | 81% | 68% | 5% | 50% | 29% | 39% | 60% | 46% | 50% | 93% | 95% |
Claude 3.7 Sonnet Thinking | | 200k | 47 | 84% | 77% | 10% | 47% | 40% | 49% | 95% | 98% | | | |
GPT-4.1 | | 1m | 47 | 81% | 67% | 5% | 46% | 38% | 43% | 35% | 61% | 44% | 91% | 96% |
Claude 4 Opus | | 200k | 47 | 86% | 70% | 6% | 54% | 41% | 43% | 36% | 36% | 56% | 94% | 97% |
Llama Nemotron Ultra Reasoning | | 128k | 46 | 83% | 73% | 8% | 64% | 35% | 38% | 64% | 7% | 75% | 95% | |
Qwen3 30B 2507 (Non-reasoning) | | 33k | 46 | 78% | 66% | 7% | 52% | 30% | 73% | 98% | 94% | | | |
Claude 4 Sonnet | | 200k | 46 | 84% | 68% | 4% | 45% | 37% | 45% | 38% | 44% | 41% | 93% | 97% |
Grok 3 Reasoning Beta | | 1m | 46 | | | | | | | | | | | |
o1-pro | | 200k | 46 | | | | | | | | | | | |
Qwen3 14B (Reasoning) | | 33k | 45 | 77% | 60% | 4% | 52% | 32% | 76% | 96% | 96% | | | |
Qwen3 Coder 480B | | 262k | 45 | 79% | 62% | 4% | 59% | 36% | 41% | 39% | 42% | 48% | 94% | 97% |
Gemini 2.5 Flash-Lite (Reasoning) | | 1m | 44 | 76% | 63% | 6% | 59% | 19% | 70% | 97% | 97% | | | |
Qwen3 32B (Reasoning) | | 33k | 44 | 80% | 67% | 8% | 55% | 35% | 36% | 73% | 0% | 81% | 96% | |
DeepSeek V3 0324 (Mar '25) | | 128k | 44 | 82% | 66% | 5% | 41% | 36% | 41% | 41% | 41% | 52% | 94% | 92% |
GPT-5 (minimal) | | 400k | 44 | 81% | 67% | 5% | 56% | 39% | 46% | 32% | 25% | 37% | 86% | 95% |
Solar Pro 2 (Reasoning) | | 66k | 43 | 81% | 69% | 7% | 62% | 30% | 37% | 61% | 0% | 69% | 97% | 97% |
o1-mini | | 128k | 43 | 74% | 60% | 5% | 58% | 32% | 60% | 94% | 97% | | | |
GPT-4.5 (Preview) | | 128k | 42 | | | | | | | | | | | |
Qwen3 30B (Reasoning) | | 33k | 42 | 78% | 62% | 7% | 51% | 28% | 42% | 72% | 0% | 75% | 96% | |
GPT-4.1 mini | | 1m | 42 | 78% | 66% | 5% | 48% | 40% | 43% | 93% | 95% | | | |
Llama 4 Maverick | | 1m | 42 | 81% | 67% | 5% | 40% | 33% | 43% | 19% | 46% | 39% | 89% | 88% |
Gemini 2.0 Flash Thinking exp. (Jan '25) | | 1m | 42 | 80% | 70% | 7% | 32% | 33% | 50% | 94% | | | | |
DeepSeek R1 0528 Qwen3 8B | | 33k | 42 | 74% | 61% | 6% | 51% | 20% | 65% | 93% | 91% | | | |
DeepSeek R1 Distill Qwen 32B | | 128k | 41 | 74% | 62% | 6% | 27% | 38% | 69% | 94% | 95% | | | |
Qwen3 8B (Reasoning) | | 33k | 41 | 74% | 59% | 4% | 41% | 23% | 75% | 90% | | | | |
Llama 3.3 Nemotron Super 49B Reasoning | | 128k | 40 | 79% | 64% | 7% | 28% | 28% | 58% | 96% | 96% | | | |
EXAONE 4.0 32B | | 131k | 40 | 77% | 63% | 5% | 47% | 25% | 47% | 94% | 91% | | | |
Solar Pro 2 (Reasoning) | | 64k | 40 | 77% | 58% | 6% | 46% | 16% | 66% | 90% | | | | |
Grok 3 | | 1m | 40 | 80% | 69% | 5% | 43% | 37% | 33% | 87% | 91% | | | |
GPT-4o (March 2025) | | 128k | 40 | 80% | 66% | 5% | 43% | 37% | 26% | 33% | 89% | 96% | | |
Mistral Medium 3 | | 128k | 39 | 76% | 58% | 4% | 40% | 33% | 39% | 30% | 28% | 44% | 91% | 90% |
Gemini 2.0 Pro Experimental | | 2m | 38 | 81% | 62% | 7% | 35% | 31% | 36% | 92% | 95% | | | |
DeepSeek R1 Distill Qwen 14B | | 128k | 38 | 74% | 48% | 4% | 38% | 24% | 67% | 95% | 93% | | | |
Sonar Reasoning | | 127k | 38 | 62% | 77% | 92% | | | | | | | | |
Gemini 2.5 Flash (April '25) | | 1m | 38 | 78% | 59% | 5% | 41% | 23% | 43% | 93% | | | | |
Gemini 2.0 Flash | | 1m | 38 | 78% | 62% | 5% | 33% | 33% | 40% | 22% | 28% | 33% | 93% | 90% |
Magistral Medium | | 40k | 38 | 75% | 68% | 10% | 53% | 30% | 25% | 40% | 0% | 70% | 92% | |
DeepSeek R1 Distill Llama 70B | | 128k | 37 | 80% | 40% | 6% | 27% | 31% | 67% | 94% | 97% | | | |
Claude 3.7 Sonnet | | 200k | 37 | 80% | 66% | 5% | 39% | 38% | 21% | 22% | 85% | 95% | | |
Qwen3 4B (Reasoning) | | 32k | 36 | 70% | 52% | 5% | 47% | 4% | 66% | 93% | 91% | | | |
Reka Flash 3 | | 128k | 36 | 67% | 53% | 5% | 44% | 27% | 51% | 89% | 95% | | | |
Magistral Small | | 40k | 36 | 75% | 64% | 7% | 51% | 24% | 25% | 41% | 0% | 71% | 96% | 96% |
Gemini 2.0 Flash (exp) | | 1m | 36 | 78% | 64% | 5% | 21% | 34% | 30% | 91% | 91% | | | |
Nova Premier | | 1m | 35 | 73% | 57% | 5% | 32% | 28% | 36% | 17% | 30% | 17% | 84% | 91% |
Gemini 2.5 Flash-Lite | | 1m | 35 | 72% | 47% | 4% | 40% | 18% | 50% | 93% | 93% | | | |
DeepSeek V3 (Dec '24) | | 128k | 35 | 75% | 56% | 4% | 36% | 35% | 25% | 89% | 91% | | | |
Qwen2.5 Max | | 32k | 34 | 76% | 59% | 5% | 36% | 34% | 23% | 84% | 93% | | | |
Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning) | | 128k | 34 | 56% | 41% | 5% | 49% | 10% | 71% | 95% | | | | |
Gemini 1.5 Pro (Sep) | | 2m | 34 | 75% | 59% | 5% | 32% | 30% | 23% | 88% | 90% | | | |
Solar Pro 2 | | 64k | 34 | 73% | 54% | 4% | 39% | 27% | 30% | 87% | 88% | | | |
Claude 3.5 Sonnet (Oct) | | 200k | 33 | 77% | 60% | 4% | 38% | 37% | 16% | 77% | 93% | | | |
Qwen3 Coder 30B | | 262k | 33 | 71% | 52% | 4% | 40% | 28% | 30% | 89% | 92% | | | |
Qwen3 235B | | 33k | 33 | 76% | 61% | 5% | 34% | 30% | 37% | 24% | 0% | 33% | 90% | |
Solar Pro 2 | | 66k | 33 | 75% | 56% | 4% | 42% | 25% | 34% | 30% | 0% | 41% | 89% | 88% |
Llama 4 Scout | | 10m | 33 | 75% | 59% | 4% | 30% | 17% | 40% | 14% | 26% | 28% | 84% | 83% |
Sonar | | 127k | 32 | 69% | 47% | 7% | 30% | 23% | 49% | 82% | 82% | | | |
Mistral Small 3.2 | | 128k | 32 | 68% | 51% | 4% | 28% | 26% | 34% | 27% | 17% | 32% | 88% | 85% |
Sonar Pro | | 200k | 32 | 76% | 58% | 8% | 28% | 23% | 29% | 75% | 85% | | | |
Command A | | 256k | 32 | 71% | 53% | 5% | 29% | 28% | 37% | 13% | 18% | 10% | 82% | 82% |
QwQ 32B-Preview | | 33k | 32 | 65% | 56% | 5% | 34% | 4% | 45% | 91% | 87% | | | |
Devstral Medium | | 256k | 31 | 71% | 49% | 4% | 34% | 29% | 30% | 5% | 29% | 7% | 71% | 94% |
Llama 3.3 70B | | 128k | 31 | 71% | 50% | 4% | 29% | 26% | 47% | 8% | 15% | 30% | 77% | 86% |
Gemini 2.0 Flash-Lite (Feb '25) | | 1m | 30 | 72% | 54% | 4% | 19% | 25% | 28% | 87% | 88% | | | |
Qwen3 30B | | 33k | 30 | 71% | 52% | 5% | 32% | 26% | 32% | 22% | 0% | 26% | 86% | |
GPT-4.1 nano | | 1m | 30 | 66% | 51% | 4% | 33% | 26% | 24% | 85% | 88% | | | |
Qwen3 14B | | 33k | 30 | 68% | 47% | 4% | 28% | 27% | 28% | 87% | | | | |
Qwen3 32B | | 33k | 30 | 73% | 54% | 4% | 29% | 28% | 32% | 20% | 0% | 30% | 87% | 90% |
GPT-4o (May '24) | | 128k | 30 | 74% | 53% | 3% | 33% | 31% | 11% | 79% | 94% | | | |
Gemini 2.0 Flash-Lite (Preview) | | 1m | 30 | 54% | 4% | 18% | 25% | 30% | 87% | 90% | | | | |
GPT-4o (Nov '24) | | 128k | 30 | 75% | 54% | 3% | 31% | 33% | 34% | 6% | 0% | 15% | 76% | 93% |
GPT-4o (Aug '24) | | 128k | 29 | 52% | 3% | 32% | 12% | 80% | 93% | | | | | |
Llama 3.1 405B | | 128k | 29 | 73% | 52% | 4% | 31% | 30% | 21% | 70% | 85% | | | |
Qwen2.5 72B | | 131k | 29 | 72% | 49% | 4% | 28% | 27% | 16% | 86% | 88% | | | |
MiniMax-Text-01 | | 4m | 29 | 76% | 58% | 4% | 25% | 25% | 13% | 75% | 86% | | | |
Nova Pro | | 300k | 29 | 69% | 50% | 3% | 23% | 21% | 38% | 7% | 19% | 11% | 79% | 83% |
Claude 3.5 Sonnet (June) | | 200k | 29 | 75% | 56% | 4% | 32% | 10% | 70% | 90% | | | | |
Tulu3 405B | | 128k | 29 | 72% | 52% | 4% | 29% | 30% | 13% | 78% | 89% | | | |
GPT-4o (ChatGPT) | | 128k | 29 | 77% | 51% | 4% | 33% | 53% | 10% | 80% | 94% | | | |
Llama 3.3 Nemotron Super 49B v1 | | 128k | 28 | 70% | 52% | 4% | 28% | 23% | 19% | 78% | 83% | | | |
Grok 2 | | 131k | 28 | 71% | 51% | 4% | 27% | 28% | 13% | 78% | 86% | | | |
Phi-4 | | 16k | 28 | 71% | 57% | 4% | 23% | 26% | 24% | 18% | 0% | 14% | 81% | 87% |
Gemini 1.5 Flash (Sep) | | 1m | 28 | 68% | 46% | 4% | 27% | 27% | 18% | 83% | 84% | | | |
GPT-4 Turbo | | 128k | 28 | 69% | 3% | 29% | 32% | 15% | 74% | 92% | | | | |
Mistral Large 2 (Nov '24) | | 128k | 27 | 70% | 49% | 4% | 29% | 29% | 11% | 74% | 90% | | | |
Llama Nemotron Super 49B v1.5 | | 128k | 27 | 69% | 48% | 4% | 29% | 24% | 33% | 22% | 14% | 77% | 86% | |
Qwen3 1.7B (Reasoning) | | 32k | 27 | 57% | 36% | 5% | 31% | 4% | 51% | 89% | 85% | | | |
Mistral Small 3.1 | | 128k | 26 | 66% | 45% | 5% | 21% | 27% | 30% | 4% | 14% | 9% | 71% | 86% |
Grok Beta | | 128k | 26 | 70% | 47% | 5% | 24% | 30% | 10% | 74% | 87% | | | |
Pixtral Large | | 128k | 26 | 70% | 51% | 4% | 26% | 29% | 7% | 71% | 85% | | | |
Qwen2.5 Instruct 32B | | 128k | 26 | 70% | 47% | 4% | 25% | 23% | 11% | 81% | 90% | | | |
Llama 3.1 Nemotron 70B | | 128k | 26 | 69% | 47% | 5% | 17% | 23% | 25% | 73% | 82% | | | |
Qwen3 8B | | 33k | 25 | 64% | 45% | 3% | 20% | 17% | 24% | 83% | | | | |
Mistral Large 2 (Jul '24) | | 128k | 25 | 68% | 47% | 3% | 27% | 27% | 9% | 71% | 89% | | | |
Gemma 3 27B | | 128k | 25 | 67% | 43% | 5% | 14% | 21% | 32% | 21% | 0% | 25% | 88% | 89% |
Qwen2.5 Coder 32B | | 131k | 25 | 64% | 42% | 4% | 30% | 27% | 12% | 77% | 90% | | | |
GPT-4 | | 8k | 25 | | | | | | | | | | | |
Nova Lite | | 300k | 25 | 59% | 43% | 5% | 17% | 14% | 34% | 7% | 18% | 11% | 77% | 84% |
GPT-4o mini | | 128k | 24 | 65% | 43% | 4% | 23% | 23% | 31% | 15% | 12% | 79% | 88% | |
Llama 3.1 70B | | 128k | 24 | 68% | 41% | 5% | 23% | 27% | 17% | 65% | 81% | | | |
Gemma 3 12B | | 128k | 24 | 60% | 35% | 5% | 14% | 17% | 37% | 18% | 7% | 22% | 85% | 83% |
Mistral Small 3 | | 32k | 24 | 65% | 46% | 4% | 25% | 24% | 8% | 72% | 85% | | | |
DeepSeek-V2.5 (Dec '24) | | 128k | 24 | 76% | 88% | | | | | | | | | |
Qwen3 4B | | 32k | 24 | 59% | 40% | 4% | 23% | 17% | 21% | 84% | | | | |
Claude 3 Opus | | 200k | 24 | 70% | 49% | 3% | 28% | 23% | 3% | 64% | 85% | | | |
Claude 3.5 Haiku | | 200k | 23 | 63% | 41% | 4% | 31% | 27% | 3% | 72% | 86% | | | |
Gemini 2.0 Flash Thinking exp. (Dec '24) | | 2m | 23 | 48% | 94% | | | | | | | | | |
DeepSeek-V2.5 | | 128k | 23 | 87% | | | | | | | | | | |
Devstral Small (May '25) | | 256k | 23 | 63% | 43% | 4% | 26% | 25% | 7% | 68% | 85% | | | |
Mistral Saba | | 32k | 23 | 61% | 42% | 4% | 24% | 13% | 68% | 85% | | | | |
DeepSeek R1 Distill Llama 8B | | 128k | 23 | 54% | 30% | 4% | 23% | 12% | 33% | 85% | 84% | | | |
Reka Core | | 128k | 22 | 56% | 73% | | | | | | | | | |
Gemini 1.5 Pro (May) | | 2m | 22 | 66% | 37% | 4% | 24% | 27% | 8% | 67% | 83% | | | |
R1 1776 | | 128k | 22 | 95% | | | | | | | | | | |
Qwen2.5 Turbo | | 1m | 22 | 63% | 41% | 4% | 16% | 15% | 12% | 81% | 85% | | | |
Reka Flash | | 128k | 22 | 53% | 74% | | | | | | | | | |
Llama 3.2 90B (Vision) | | 128k | 22 | 67% | 43% | 5% | 21% | 24% | 5% | 63% | 82% | | | |
Solar Mini | | 4k | 22 | 33% | 59% | | | | | | | | | |
Reka Flash (Feb '24) | | 128k | 22 | 33% | 61% | | | | | | | | | |
Reka Edge | | 128k | 21 | 22% | 41% | | | | | | | | | |
Grok-1 | | 8k | 21 | | | | | | | | | | | |
Qwen2 72B | | 131k | 21 | 62% | 37% | 4% | 16% | 23% | 15% | 70% | 83% | | | |
Devstral Small | | 256k | 21 | 62% | 41% | 4% | 25% | 24% | 0% | 64% | 85% | | | |
Nova Micro | | 130k | 20 | 53% | 36% | 5% | 14% | 9% | 29% | 6% | 10% | 8% | 70% | 80% |
Gemma 2 27B | | 8k | 20 | 57% | 36% | 4% | 28% | 13% | 30% | 54% | 76% | | | |
Gemini 1.5 Flash-8B | | 1m | 19 | 57% | 36% | 5% | 22% | 23% | 3% | 69% | 12% | | | |
Llama 3.1 8B | | 128k | 19 | 48% | 26% | 5% | 12% | 13% | 29% | 4% | 16% | 8% | 52% | 67% |
Gemma 3n E4B | | 32k | 18 | 49% | 30% | 4% | 15% | 8% | 28% | 14% | 0% | 14% | 77% | |
DeepHermes 3 - Mistral 24B | | 32k | 18 | 58% | 38% | 4% | 20% | 23% | 5% | 60% | 75% | | | |
Jamba 1.7 Large | | 256k | 18 | 58% | 39% | 4% | 18% | 19% | 6% | 60% | 71% | | | |
Jamba 1.5 Large | | 256k | 18 | 57% | 43% | 4% | 14% | 16% | 5% | 61% | 24% | | | |
Granite 3.3 8B | | 128k | 18 | 47% | 34% | 4% | 13% | 10% | 22% | 7% | 4% | 5% | 67% | 71% |
Hermes 3 - Llama-3.1 70B | | 128k | 17 | 57% | 40% | 4% | 19% | 23% | 2% | 54% | 75% | | | |
DeepSeek-Coder-V2 | | 128k | 17 | 74% | 87% | | | | | | | | | |
Jamba 1.6 Large | | 256k | 17 | 56% | 39% | 4% | 17% | 18% | 5% | 58% | 70% | | | |
Gemini 1.5 Flash (May) | | 1m | 17 | 57% | 32% | 4% | 20% | 18% | 9% | 55% | 72% | | | |
Yi-Large | | 32k | 16 | 59% | 36% | 3% | 11% | 19% | 7% | 56% | 74% | | | |
Claude 3 Sonnet | | 200k | 16 | 58% | 40% | 4% | 18% | 23% | 5% | 41% | 71% | | | |
Codestral (Jan '25) | | 256k | 16 | 45% | 31% | 5% | 24% | 25% | 4% | 61% | 85% | | | |
Llama 3 70B | | 8k | 16 | 57% | 38% | 4% | 20% | 19% | 0% | 48% | 79% | | | |
Mistral Small (Sep '24) | | 33k | 16 | 53% | 38% | 4% | 14% | 16% | 6% | 56% | 81% | | | |
Gemini 1.0 Ultra | | 33k | 16 | | | | | | | | | | | |
Gemma 3n E4B (May '25) | | 32k | 15 | 48% | 28% | 5% | 14% | 9% | 11% | 75% | 76% | | | |
Phi-4 Multimodal | | 128k | 15 | 49% | 32% | 4% | 13% | 11% | 9% | 69% | 73% | | | |
Qwen2.5 Coder 7B | | 131k | 15 | 47% | 34% | 5% | 13% | 15% | 5% | 66% | 90% | | | |
Mistral Large (Feb '24) | | 33k | 15 | 52% | 35% | 3% | 18% | 21% | 0% | 53% | 71% | | | |
Jamba Instruct | | 256k | 15 | 34% | 27% | 5% | 5% | 8% | 24% | 0% | | | | |
Mixtral 8x22B | | 65k | 14 | 54% | 33% | 4% | 15% | 19% | 0% | 55% | 72% | | | |
Phi-4 Mini | | 128k | 14 | 47% | 33% | 4% | 13% | 11% | 3% | 70% | 74% | | | |
Llama 2 Chat 7B | | 4k | 14 | 16% | 23% | 6% | 0% | 0% | 0% | 6% | 13% | | | |
Gemma 3 4B | | 128k | 14 | 42% | 29% | 5% | 11% | 7% | 28% | 13% | 6% | 77% | 72% | |
Llama 3.2 11B (Vision) | | 128k | 13 | 46% | 22% | 5% | 11% | 9% | 52% | 69% | | | | |
Qwen3 1.7B | | 32k | 13 | 41% | 28% | 5% | 13% | 7% | 10% | 72% | | | | |
Qwen1.5 Chat 110B | | 32k | 13 | 29% | | | | | | | | | | |
Phi-3 Medium 14B | | 128k | 13 | 54% | 33% | 5% | 15% | 12% | 1% | 46% | 0% | | | |
Claude 2.1 | | 200k | 12 | 50% | 32% | 4% | 20% | 18% | 3% | 37% | 16% | | | |
Claude 3 Haiku | | 200k | 12 | 15% | 19% | 1% | 39% | 76% | | | | | | |
Pixtral 12B | | 128k | 11 | 47% | 34% | 5% | 12% | 14% | 0% | 46% | 78% | | | |
Qwen3 0.6B (Reasoning) | | 32k | 11 | 35% | 24% | 6% | 12% | 3% | 10% | 75% | 49% | | | |
Claude 2.0 | | 100k | 11 | 49% | 34% | 17% | 19% | 0% | | | | | | |
DeepSeek-V2 | | 128k | 11 | 87% | | | | | | | | | | |
Mistral Small (Feb '24) | | 33k | 11 | 42% | 30% | 4% | 11% | 13% | 1% | 56% | 79% | | | |
Mistral Medium | | 33k | 11 | 49% | 35% | 3% | 10% | 12% | 4% | 41% | | | | |
GPT-3.5 Turbo | | 4k | 11 | 46% | 30% | 44% | 70% | | | | | | | |
Gemma 3n E2B | | 32k | 10 | 38% | 23% | 4% | 10% | 5% | 9% | 69% | | | | |
Ministral 8B | | 128k | 10 | 39% | 28% | 5% | 11% | 12% | 4% | 57% | 77% | | | |
Gemma 2 9B | | 8k | 10 | 50% | 31% | 4% | 13% | 1% | 0% | 52% | 65% | | | |
Phi-3 Mini | | 4k | 10 | 44% | 32% | 4% | 12% | 9% | 4% | 46% | 25% | | | |
Arctic | | 4k | 10 | 75% | | | | | | | | | | |
Qwen Chat 72B | | 34k | 10 | | | | | | | | | | | |
LFM 40B | | 32k | 10 | 43% | 33% | 5% | 10% | 7% | 2% | 48% | 51% | | | |
Command-R+ | | 128k | 9 | 43% | 34% | 5% | 11% | 12% | 0% | 40% | 63% | | | |
Llama 3 8B | | 8k | 9 | 41% | 30% | 5% | 10% | 12% | 0% | 50% | 71% | | | |
PALM-2 | | 8k | 9 | | | | | | | | | | | |
Gemini 1.0 Pro | | 33k | 9 | 43% | 28% | 5% | 12% | 12% | 1% | 40% | 2% | | | |
DeepSeek Coder V2 Lite | | 128k | 8 | 43% | 32% | 5% | 16% | 14% | | | | | | |
Codestral (May '24) | | 33k | 8 | 33% | 26% | 5% | 21% | 22% | 0% | 35% | 80% | | | |
Aya Expanse 32B | | 128k | 8 | 38% | 23% | 5% | 14% | 15% | 0% | 45% | 68% | | | |
Llama 2 Chat 70B | | 4k | 8 | 41% | 33% | 5% | 10% | 0% | 32% | 34% | | | | |
DeepSeek LLM 67B (V1) | | 4k | 8 | 75% | | | | | | | | | | |
Llama 2 Chat 13B | | 4k | 8 | 41% | 32% | 5% | 10% | 12% | 2% | 33% | 0% | | | |
Command-R+ (Apr '24) | | 128k | 8 | 43% | 32% | 5% | 12% | 12% | 1% | 28% | 64% | | | |
OpenChat 3.5 | | 8k | 8 | 31% | 23% | 5% | 12% | 0% | 31% | 68% | | | | |
DBRX | | 33k | 8 | 40% | 33% | 7% | 9% | 12% | 3% | 28% | 67% | | | |
Ministral 3B | | 128k | 8 | 34% | 26% | 6% | 7% | 9% | 0% | 54% | 74% | | | |
Mistral NeMo | | 128k | 8 | 40% | 31% | 4% | 6% | 10% | 0% | 40% | 65% | | | |
Llama 3.2 3B | | 128k | 7 | 35% | 26% | 5% | 8% | 5% | 7% | 49% | 56% | | | |
DeepSeek R1 Distill Qwen 1.5B | | 128k | 7 | 27% | 10% | 3% | 7% | 7% | 18% | 69% | 45% | | | |
Jamba 1.5 Mini | | 256k | 6 | 37% | 30% | 5% | 6% | 8% | 1% | 36% | 63% | | | |
Jamba 1.7 Mini | | 258k | 6 | 39% | 32% | 5% | 6% | 9% | 1% | 26% | 48% | | | |
Jamba 1.6 Mini | | 256k | 5 | 37% | 30% | 5% | 7% | 10% | 3% | 26% | 43% | | | |
Mixtral 8x7B | | 33k | 5 | 39% | 29% | 5% | 7% | 3% | 0% | 30% | 1% | | | |
Qwen3 0.6B | | 32k | 4 | 23% | 23% | 5% | 7% | 4% | 2% | 52% | 34% | | | |
DeepHermes 3 - Llama-3.1 8B | | 128k | 4 | 37% | 27% | 4% | 9% | 9% | 0% | 22% | 54% | | | |
Aya Expanse 8B | | 8k | 4 | 31% | 25% | 5% | 7% | 8% | 0% | 32% | 44% | | | |
Command-R | | 128k | 3 | 34% | 29% | 5% | 4% | 9% | 0% | 15% | 42% | | | |
Command-R (Mar '24) | | 128k | 2 | 34% | 28% | 5% | 5% | 6% | 1% | 16% | 40% | | | |
Claude Instant | | 100k | 2 | 43% | 33% | 4% | 11% | 0% | 26% | 2% | | | | |
Qwen Chat 14B | | 8k | 2 | | | | | | | | | | | |
Codestral-Mamba | | 256k | 2 | 21% | 21% | 5% | 13% | 11% | 0% | 24% | 80% | | | |
Gemma 3 1B | | 32k | 1 | 14% | 24% | 5% | 2% | 1% | 0% | 48% | 32% | | | |
Llama 3.2 1B | | 128k | 1 | 20% | 20% | 5% | 2% | 2% | 0% | 14% | 40% | | | |
Llama 65B | | 2k | 1 | | | | | | | | | | | |
Mistral 7B | | 8k | 1 | 25% | 18% | 4% | 5% | 2% | 0% | 12% | 40% | | | |
Grok 3 mini Reasoning (low) | | 1m | | | | | | | | | | | | |
GPT-4o mini Realtime (Dec '24) | | 128k | | | | | | | | | | | | |
GPT-4o Realtime (Dec '24) | | 128k | | | | | | | | | | | | |
GPT-3.5 Turbo (0613) | | 4k | | | | | | | | | | | | |