
AI Model Leaderboard

Compare cutting-edge language models based on comprehensive benchmark performance. Track progress in reasoning, coding, mathematics, and general knowledge.

Total Models: 134
Top Score: 90.70
Avg Score: 41.18
Open Source: 62
Multimodal: 37
Latest Release: Qwen3 235B-A22B Thinking 2507

Showing 134 of 134 models
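
The summary figures above (total models, top score, average score, open-source and multimodal counts) follow directly from the per-model records listed below. A minimal sketch of that aggregation in Python, assuming the entries have been parsed into dictionaries; the field names and the placeholder flag values are assumptions for illustration, not the site's actual schema:

```python
# Minimal sketch: recomputing the header statistics from parsed leaderboard
# records. Only the first two entries are filled in; the multimodal flags
# here are illustrative placeholders.
models = [
    {"name": "Qwen3 235B-A22B Thinking 2507", "overall": 90.70,
     "license": "Open Source", "multimodal": False},
    {"name": "GLM 4.5", "overall": 89.46,
     "license": "Proprietary", "multimodal": False},
    # ... one record per leaderboard entry (134 in total)
]

total = len(models)
top_score = max(m["overall"] for m in models)
avg_score = sum(m["overall"] for m in models) / total
open_source = sum(1 for m in models if m["license"] == "Open Source")
multimodal = sum(1 for m in models if m["multimodal"])

print(f"Total Models: {total}")
print(f"Top Score: {top_score:.2f}")
print(f"Avg Score: {avg_score:.2f}")
print(f"Open Source: {open_source}")
print(f"Multimodal: {multimodal}")
```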
1

Qwen3 235B-A22B Thinking 2507

Alibaba

Overall Score
90.70
No AGI benchmarks
93.2%
MMLU
89.7%
Code
84.5%
Math
95.1%
GSM8K
94.3%
HellaSwag
95.8%
ARC
Context:33K
Released:Jul 2025
License:Open Source

Advanced reasoning model with chain-of-thought capabilities and extended thinking time
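
Every card in the listing repeats the same fields (rank, model, vendor, overall score, per-benchmark scores, context, release date, license, description). For anyone post-processing the leaderboard, a minimal sketch of one way to represent a single entry, populated with the values from the card above; the class and field names are assumptions for illustration only:

```python
from dataclasses import dataclass

@dataclass
class LeaderboardEntry:
    # Field names mirror the labels shown in each card; this is an assumed
    # representation, not the site's actual data model.
    rank: int
    name: str
    vendor: str
    overall_score: float
    benchmarks: dict          # benchmark name -> score in percent
    context_window: str
    released: str
    license: str
    description: str

# The top-ranked entry, transcribed into this structure.
qwen3_thinking = LeaderboardEntry(
    rank=1,
    name="Qwen3 235B-A22B Thinking 2507",
    vendor="Alibaba",
    overall_score=90.70,
    benchmarks={"MMLU": 93.2, "Code": 89.7, "Math": 84.5,
                "GSM8K": 95.1, "HellaSwag": 94.3, "ARC": 95.8},
    context_window="33K",
    released="Jul 2025",
    license="Open Source",
    description="Advanced reasoning model with chain-of-thought capabilities "
                "and extended thinking time",
)
```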

2

GLM 4.5

Z.ai

Overall Score
89.46
No AGI benchmarks
92.5%
MMLU
88.3%
Code
82.1%
Math
94.2%
GSM8K
93.8%
HellaSwag
95.1%
ARC
Context:128K
Released:Jul 2025
License:Proprietary

Large-scale mixture of experts model with 355B total parameters and 32B active

3

Qwen3 235B-A22B Instruct 2507

Alibaba

Overall Score
88.75
No AGI benchmarks
91.8%
MMLU
87.5%
Code
81.3%
Math
93.6%
GSM8K
93.2%
HellaSwag
94.7%
ARC
Context:33K
Released:Jul 2025
License:Open Source

Large-scale MoE model with 235B total parameters and 22B active, optimized for instruction following

4

GLM 4.5 Air

Z.ai

Overall Score
85.45
No AGI benchmarks
88.7%
MMLU
83.2%
Code
76.8%
Math
91.5%
GSM8K
91.2%
HellaSwag
92.8%
ARC
Context:128K
Released:Jul 2025
License:Proprietary

Lightweight version of GLM 4.5 with 106B total parameters and 12B active

5

Step 2.0

StepFun

Overall Score
82.59
No AGI benchmarks
85.6%
MMLU
79.8%
Code
73.2%
Math
88.9%
GSM8K
90.1%
HellaSwag
91.7%
ARC
Context:33K
Released:May 2024
License:Proprietary

Second generation model with improved reasoning and extended context

6

Nvidia Nemotron 4 340B

Nvidia

Overall Score
82.04
No AGI benchmarks
85.3%
MMLU
78.9%
Code
72.6%
Math
88.4%
GSM8K
89.7%
HellaSwag
91.2%
ARC
Context:4K
Released:Jun 2024
License:Proprietary

Large-scale model optimized for synthetic data generation and instruction following

7

EXAONE 3.5 27B

LG AI

Overall Score
80.49
No AGI benchmarks
83.7%
MMLU
76.8%
Code
71.2%
Math
86.9%
GSM8K
88.4%
HellaSwag
90.3%
ARC
Context:8K
Released:Aug 2024
License:Open Source

Enhanced bilingual model with improved reasoning and longer context

8

Nvidia NeMo 43B

Nvidia

Overall Score
75.63
No AGI benchmarks
78.2%
MMLU
72.5%
Code
65.3%
Math
82.1%
GSM8K
85.4%
HellaSwag
87.2%
ARC
Context:8K
Released:Oct 2023
License:Proprietary

Early large language model from Nvidia focused on enterprise applications

9

StarCoder2 15B

Hugging Face

Overall Score
75.30
No AGI benchmarks
75.3%
Code
Context:8K
Released:Feb 2024
License:Open Source

Advanced code generation model with 4x larger context than StarCoder

10

Step 1.0

StepFun

Overall Score
73.10
No AGI benchmarks
74.8%
MMLU
70.2%
Code
62.5%
Math
79.8%
GSM8K
84.2%
HellaSwag
86.1%
ARC
Context:16K
Released:Oct 2023
License:Proprietary

First generation model from StepFun with focus on Chinese language understanding

11

Test Model

Test AI Corp

Overall Score
72.35
ARC-AGI 2: 9.9% | HLE (Text): 99.0%
9.9%
ARC-AGI 2
99.0%
HLE (Text)
99.9%
MMLU
99.9%
Code
99.9%
Math
99.9%
GSM8K
Parameters:Undisclosed
Context:200K
Released:Jun 2025
License:Proprietary

Test model for benchmarking purposes with exceptional performance across all metrics

12

EXAONE 3.0 7B

LG AI

Overall Score
70.22
No AGI benchmarks
71.4%
MMLU
68.2%
Code
58.9%
Math
76.3%
GSM8K
82.1%
HellaSwag
84.5%
ARC
Context:4K
Released:Dec 2023
License:Open Source

Bilingual Korean-English model with strong performance on both languages

13

o3

OpenAI

Overall Score
67.06
ARC-AGI 2: 75.7% | HLE: 20.3%
75.7%
ARC-AGI 2
20.3%
HLE
95.2%
MMLU
96.3%
Code
96.7%
Math
98.9%
GSM8K
Context:256K
Released:Jan 2025
License:Proprietary

Revolutionary reasoning model achieving near-human performance on ARC-AGI

14

Mistral Magistral

Mistral AI

Overall Score
63.73
ARC-AGI 2: 21.8% | HLE: 45.7%
21.8%
ARC-AGI 2
45.7%
HLE
93.1%
MMLU
91.8%
Code
87.4%
Math
95.7%
GSM8K
Parameters:405B
Context:128K
Released:Jan 2025
License:Proprietary

Flagship multimodal model from Mistral AI with 405B parameters, designed for advanced reasoning and instruction following

15

Qwen3 Coder 480B

Alibaba

Overall Score
62.36
ARC-AGI 2: 19.5% | HLE (Text): 65.3%
19.5%
ARC-AGI 2
65.3%
HLE (Text)
92.7%
MMLU
94.8%
Code
88.2%
Math
95.1%
GSM8K
Parameters:480B
Context:66K
Released:Jan 2025
License:Open Source

Massive-scale coding-specialized model with 480B parameters, designed for complex software development tasks

16

Llama 3.3 Nemotron Super 70B

Nvidia

Overall Score
59.16
ARC-AGI 2: 17.2% | HLE (Text): 61.4%
17.2%
ARC-AGI 2
61.4%
HLE (Text)
90.4%
MMLU
88.7%
Code
83.9%
Math
94.2%
GSM8K
Parameters:70B
Context:131K
Released:Jan 2025
License:Proprietary

Advanced Llama 3.3-based model enhanced by Nvidia with superior reasoning and instruction following capabilities

17

OpenReasoning Nemotron 27B

Nvidia

Overall Score
59.03
ARC-AGI 2: 18.1% | HLE (Text): 62.7%
18.1%
ARC-AGI 2
62.7%
HLE (Text)
87.6%
MMLU
84.9%
Code
85.2%
Math
92.8%
GSM8K
Parameters:27B
Context:33K
Released:Jan 2025
License:Open Source

Open-source reasoning-focused model built on Nemotron architecture with enhanced logical thinking capabilities

18

Mistral Voxtral

Mistral AI

Overall Score
58.70
ARC-AGI 2: 16.7% | HLE (Text): 61.8%
16.7%
ARC-AGI 2
61.8%
HLE (Text)
89.3%
MMLU
87.6%
Code
81.9%
Math
93.4%
GSM8K
Parameters:70B
Context:128K
Released:Jan 2025
License:Proprietary

Multimodal voice and text model from Mistral AI with advanced audio processing and speech understanding capabilities

19

DBRX Instruct

Databricks

Overall Score
58.61
No AGI benchmarks
74.5%
MMLU
48.1%
Code
38.2%
Math
72.8%
GSM8K
Context:8K
Released:Mar 2024
License:Open Source

Instruction-tuned version of DBRX with enhanced capabilities

20

Qwen3 30B-A3B Thinking

Alibaba

Overall Score
57.60
ARC-AGI 2: 16.3% | HLE (Text): 58.9%
16.3%
ARC-AGI 2
58.9%
HLE (Text)
89.1%
MMLU
85.3%
Code
84.7%
Math
93.2%
GSM8K
Parameters:30B total, 3B active
Context:33K
Released:Jan 2025
License:Open Source

Advanced reasoning-focused MoE model with enhanced thinking capabilities and step-by-step problem solving

21

Mistral Devstral

Mistral AI

Overall Score
57.10
ARC-AGI 2: 15.6% | HLE (Text): 58.4%
15.6%
ARC-AGI 2
58.4%
HLE (Text)
85.7%
MMLU
92.4%
Code
84.1%
Math
91.3%
GSM8K
Parameters:22B
Context:33K
Released:Jan 2025
License:Open Source

Specialized coding model from Mistral AI with 22B parameters, optimized for software development and programming tasks

22

Kimi K2 Instruct

Moonshot AI

Overall Score
57.07
ARC-AGI 2: 15.1% | HLE (Text): 59.7%
15.1%
ARC-AGI 2
59.7%
HLE (Text)
88.9%
MMLU
86.4%
Code
79.2%
Math
92.1%
GSM8K
Parameters:120B
Context:200K
Released:Jan 2025
License:Proprietary

Advanced Chinese-English bilingual model with exceptional long-context understanding and instruction following

23

A.X 4.0 VL Light

SK Telecom

Overall Score
56.31
No AGI benchmarks
71.2%
MMLU
45.3%
Code
38.7%
Math
68.9%
GSM8K
Context:16K
Released:Jan 2025
License:Proprietary

SK Telecom's vision-language model optimized for Korean and English

24

Qwen3 Coder 30B

Alibaba

Overall Score
55.89
ARC-AGI 2: 14.2% | HLE (Text): 56.8%
14.2%
ARC-AGI 2
56.8%
HLE (Text)
86.3%
MMLU
89.2%
Code
82.5%
Math
91.7%
GSM8K
Parameters:30B
Context:33K
Released:Jan 2025
License:Open Source

Efficient coding-specialized model with 30B parameters, optimized for software development and programming tasks

25

DBRX

Databricks

Overall Score
55.83
No AGI benchmarks
73.2%
MMLU
43.5%
Code
35.8%
Math
68.4%
GSM8K
Context:8K
Released:Mar 2024
License:Open Source

132B parameter MoE model outperforming GPT-3.5 and competitive with GPT-4

26

Kanana 1.5 15.7B

Kakao

Overall Score
54.30
No AGI benchmarks
69.8%
MMLU
43.6%
Code
36.2%
Math
65.4%
GSM8K
Context:16K
Released:Jan 2025
License:Open Source

Kakao's Korean-optimized language model with strong instruction following

27

Kimi K2 Base

Moonshot AI

Overall Score
54.18
ARC-AGI 2: 13.4% | HLE (Text): 55.2%
13.4%
ARC-AGI 2
55.2%
HLE (Text)
86.2%
MMLU
83.7%
Code
76.8%
Math
89.3%
GSM8K
Parameters:120B
Context:200K
Released:Jan 2025
License:Research Only

Base version of Kimi K2 model optimized for research and fine-tuning applications with strong foundational capabilities

28

Llama 3.1 405B

Meta AI

Overall Score
53.31
ARC-AGI 2: 0.5% | HLE (Text): 64.0%
0.5%
ARC-AGI 2
64.0%
HLE (Text)
87.3%
MMLU
89.0%
Code
73.8%
Math
96.8%
GSM8K
Parameters:405B
Context:128K
Released:Jul 2024
License:Open Source

Largest open-source model rivaling proprietary systems

29

Qwen3 30B-A3B Instruct

Alibaba

Overall Score
53.12
ARC-AGI 2: 12.8% | HLE (Text): 52.1%
12.8%
ARC-AGI 2
52.1%
HLE (Text)
87.2%
MMLU
82.1%
Code
76.8%
Math
89.4%
GSM8K
Parameters:30B total, 3B active
Context:33K
Released:Jan 2025
License:Open Source

Advanced MoE model with 30B total parameters and 3B active, optimized for instruction following with high efficiency

30

OLMoCR 7B

Allen Institute for AI

Overall Score
52.97
No AGI benchmarks
68.2%
MMLU
42.8%
Code
35.6%
Math
62.4%
GSM8K
Context:8K
Released:Jul 2025
License:Open Source

Open language model optimized for OCR and document understanding

31

OpenReasoning Nemotron 7B

Nvidia

Overall Score
52.11
ARC-AGI 2: 13.7% | HLE (Text): 51.3%
13.7%
ARC-AGI 2
51.3%
HLE (Text)
82.4%
MMLU
79.3%
Code
78.6%
Math
87.9%
GSM8K
Parameters:7B
Context:33K
Released:Jan 2025
License:Open Source

Compact reasoning-focused model with 7B parameters, optimized for efficient deployment with strong logical capabilities

32

Mistral Large 2

Mistral AI

Overall Score
51.38
ARC-AGI 2: 0.1% | HLE: 4.5%
0.1%
ARC-AGI 2
4.5%
HLE
84.0%
MMLU
92.0%
Code
71.5%
Math
93.4%
GSM8K
Parameters:123B
Context:128K
Released:Jul 2024
License:Commercial

Flagship model with exceptional coding capabilities

33

Llama 3.3 70B Instruct

Meta AI

Overall Score
51.32
ARC-AGI 2: 0.3% | HLE (Text): 58.0%
0.3%
ARC-AGI 2
58.0%
HLE (Text)
86.0%
MMLU
88.4%
Code
77.0%
Math
95.1%
GSM8K
Parameters:70B
Context:128K
Released:Dec 2024
License:Open Source

Llama 3.3 70B delivers performance nearly matching the 405B model while being significantly more efficient and cost-effective.

34

Falcon 180B

Technology Innovation Institute

Overall Score
51.02
No AGI benchmarks
70.5%
MMLU
36.8%
Code
32.1%
Math
58.9%
GSM8K
Context:8K
Released:Sep 2023
License:Open Source

180B parameter model, top open-source model at launch

35

Qwen 2.5 72B

Alibaba

Overall Score
50.96
ARC-AGI 2: 0.5% | HLE (Text): 58.0%
0.5%
ARC-AGI 2
58.0%
HLE (Text)
85.9%
MMLU
86.4%
Code
74.2%
Math
93.8%
GSM8K
Parameters:72B
Context:131K
Released:Sep 2024
License:Open Source

Leading Chinese model with strong math and coding

36

DeepSeek-V2

DeepSeek

Overall Score
50.35
ARC-AGI 2: 0.1% | HLE (Text): 60.0%
0.1%
ARC-AGI 2
60.0%
HLE (Text)
83.7%
MMLU
80.1%
Code
72.8%
Math
92.3%
GSM8K
Parameters:236B
Context:128K
Released:May 2024
License:Open Source

Large MoE model with efficient architecture

37

Step 3

StepFun

Overall Score
49.31
ARC-AGI 2: 15.2% | HLE (Text): 42.5%
15.2%
ARC-AGI 2
42.5%
HLE (Text)
92.8%
MMLU
86.5%
Code
78.2%
Math
93.6%
GSM8K
Context:8K
Released:Jan 2025
License:Open Source

Advanced multimodal model with strong reasoning capabilities

38

GPT-4o mini

OpenAI

Overall Score
48.92
ARC-AGI 2: 0.2% | HLE (Text): 55.0%
0.2%
ARC-AGI 2
55.0%
HLE (Text)
82.0%
MMLU
87.2%
Code
70.2%
Math
93.1%
GSM8K
Context:128K
Released:Jul 2024
License:Proprietary

GPT-4o mini is OpenAI's most cost-efficient small model, designed for fast responses and handling large contexts.

39

Stable LM 2 12B

Stability AI

Overall Score
48.74
No AGI benchmarks
68.7%
MMLU
35.4%
Code
28.9%
Math
55.2%
GSM8K
Context:8K
Released:Jan 2024
License:Open Source

Stability AI's 12B parameter language model with improved performance

40

AFM 4.5B

Arcee AI

Overall Score
47.26
No AGI benchmarks
62.4%
MMLU
38.2%
Code
28.9%
Math
56.7%
GSM8K
Context:8K
Released:Jan 2025
License:Open Source

Arcee AI's efficient 4.5B parameter model focused on domain adaptation

41

Llama 3.1 70B

Meta AI

Overall Score
47.25
ARC-AGI 2: 0.1% | HLE (Text): 52.0%
0.1%
ARC-AGI 2
52.0%
HLE (Text)
83.6%
MMLU
80.5%
Code
64.7%
Math
93.0%
GSM8K
Parameters:70B
Context:128K
Released:Jul 2024
License:Open Source

Highly capable open model with extended context

42

Claude 3.5 Haiku

Anthropic

Overall Score
47.08
ARC-AGI 2: 0.0% | HLE (Text): 58.0%
0.0%
ARC-AGI 2
58.0%
HLE (Text)
78.7%
MMLU
88.2%
Code
44.5%
Math
91.2%
GSM8K
Context:200K
Released:Oct 2024
License:Proprietary

Claude 3.5 Haiku is Anthropic's fastest and most cost-effective model, optimized for speed while maintaining strong performance.

43

Mixtral 8x22B

Mistral AI

Overall Score
45.80
ARC-AGI 2: 0.0% | HLE (Text): 54.0%
0.0%
ARC-AGI 2
54.0%
HLE (Text)
77.8%
MMLU
74.5%
Code
58.7%
Math
88.4%
GSM8K
Parameters:141B
Context:66K
Released:Apr 2024
License:Open Source

Large mixture-of-experts model with strong performance

44

StripedHyena 7B

Together AI

Overall Score
45.54
No AGI benchmarks
61.3%
MMLU
36.2%
Code
27.8%
Math
52.4%
GSM8K
Context:8K
Released:Dec 2023
License:Open Source

Efficient transformer alternative with state-space model architecture

45

Claude 3 Sonnet

Anthropic

Overall Score
45.17
ARC-AGI 2: 0.0% | HLE (Text): 52.0%
0.0%
ARC-AGI 2
52.0%
HLE (Text)
79.0%
MMLU
73.0%
Code
53.1%
Math
92.3%
GSM8K
Parameters:Undisclosed
Context:200K
Released:Mar 2024
License:Proprietary

Mid-tier model striking an ideal balance between intelligence and speed for enterprise workloads

46

Llama 3.2 90B Vision

Meta AI

Overall Score
44.39
ARC-AGI 2: 0.1% | HLE: 5.5%
0.1%
ARC-AGI 2
5.5%
HLE
85.5%
MMLU
75.2%
Code
58.3%
Math
91.8%
GSM8K
Parameters:90B
Context:128K
Released:Sep 2024
License:Open Source

First multimodal Llama model with vision capabilities

47

Gemini 2.5 Pro

Google DeepMind

Overall Score
44.23
ARC-AGI 2: 5.0% | HLE: 21.6%
5.0%
ARC-AGI 2
21.6%
HLE
91.7%
MMLU
91.5%
Code
82.3%
Math
97.2%
GSM8K
Context:2.0M
Released:Jun 2025
License:Proprietary

Advanced multimodal model with unprecedented 2M token context and superior reasoning

48

Yi-Large

01.AI

Overall Score
44.07
ARC-AGI 2: 0.1% | HLE (Text): 54.0%
0.1%
ARC-AGI 2
54.0%
HLE (Text)
85.4%
MMLU
75.2%
Code
54.8%
Math
91.7%
GSM8K
Parameters:Undisclosed
Context:33K
Released:May 2024
License:Proprietary

Bilingual model excelling in English and Chinese

49

Grok 4 Heavy

xAI

Overall Score
43.84
ARC-AGI 2: 12.8% | HLE: 9.8%
12.8%
ARC-AGI 2
9.8%
HLE
94.5%
MMLU
92.8%
Code
88.5%
Math
98.5%
GSM8K
Parameters:Undisclosed
Context:256K
Released:Jul 2025
License:Proprietary

Most capable Grok model, available at premium tier for $300/month

50

Inflection-2.5

Inflection AI

Overall Score
43.51
ARC-AGI 2: 0.1% | HLE (Text): 58.0%
0.1%
ARC-AGI 2
58.0%
HLE (Text)
80.4%
MMLU
70.5%
Code
47.3%
Math
88.2%
GSM8K
Parameters:Undisclosed
Context:33K
Released:Mar 2024
License:Proprietary

Conversational AI with emotional intelligence

51

DeepSeek-R1-0528

DeepSeek

Overall Score
42.90
ARC-AGI 2: 7.0% | HLE (Text): 14.0%
7.0%
ARC-AGI 2
14.0%
HLE (Text)
92.1%
MMLU
95.5%
Code
88.2%
Math
97.8%
GSM8K
Parameters:1.2T
Context:256K
Released:May 2025
License:Open Source

Advanced iteration of DeepSeek-R1 with enhanced reasoning and larger scale

52

Kimi K1.5

Moonshot AI

Overall Score
42.66
ARC-AGI 2: 6.2% | HLE (Text): 38.4%
6.2%
ARC-AGI 2
38.4%
HLE (Text)
79.4%
MMLU
68.7%
Code
61.2%
Math
78.5%
GSM8K
Parameters:85B
Context:128K
Released:Oct 2024
License:Proprietary

Enhanced Kimi model with dramatically expanded context window and improved reasoning across all domains

53

o4-mini

OpenAI

Overall Score
42.55
ARC-AGI 2: 2.4% | HLE: 18.1%
2.4%
ARC-AGI 2
18.1%
HLE
88.4%
MMLU
94.1%
Code
92.7%
Math
96.3%
GSM8K
Context:128K
Released:Jan 2025
License:Proprietary

Cost-efficient reasoning model with performance approaching larger models

54

o3-mini

OpenAI

Overall Score
42.20
ARC-AGI 2: 10.0% | HLE (Text): 13.4%
10.0%
ARC-AGI 2
13.4%
HLE (Text)
88.2%
MMLU
92.1%
Code
85.3%
Math
94.5%
GSM8K
Context:128K
Released:Mar 2025
License:Proprietary

Efficient reasoning model with o3-level capabilities at lower cost

55

Claude 3 Haiku

Anthropic

Overall Score
41.76
ARC-AGI 2: 0.0% | HLE (Text): 48.0%
0.0%
ARC-AGI 2
48.0%
HLE (Text)
75.2%
MMLU
75.0%
Code
38.9%
Math
88.9%
GSM8K
Parameters:Undisclosed
Context:200K
Released:Mar 2024
License:Proprietary

Fastest Claude 3 model, optimized for speed and efficiency

56

o1-preview

OpenAI

Overall Score
41.12
ARC-AGI 2: 10.0% | HLE: 8.1%
10.0%
ARC-AGI 2
8.1%
HLE
93.5%
MMLU
93.8%
Code
78.0%
Math
97.2%
GSM8K
Parameters:Undisclosed
Context:128K
Released:Sep 2024
License:Proprietary

Reasoning-focused model with chain-of-thought capabilities

57

Command R+

Cohere

Overall Score
40.49
ARC-AGI 2: 0.1% | HLE (Text): 52.0%
0.1%
ARC-AGI 2
52.0%
HLE (Text)
75.2%
MMLU
71.3%
Code
42.5%
Math
87.3%
GSM8K
Parameters:104B
Context:128K
Released:Apr 2024
License:Commercial

Enterprise model optimized for RAG and tool use

58

DeepSeek-R1

DeepSeek

Overall Score
40.31
ARC-AGI 2: 5.0% | HLE (Text): 8.5%
5.0%
ARC-AGI 2
8.5%
HLE (Text)
91.6%
MMLU
95.0%
Code
86.7%
Math
97.1%
GSM8K
Parameters:671B
Context:128K
Released:Jan 2025
License:Open Source

DeepSeek-R1 is an advanced reasoning model that rivals OpenAI o1 performance through reinforcement learning and chain-of-thought reasoning.

59

Zephyr 7B Beta

Hugging Face

Overall Score
40.20
No AGI benchmarks
61.4%
MMLU
29.5%
Code
13.1%
Math
52.2%
GSM8K
Context:8K
Released:Oct 2023
License:Open Source

Fine-tuned Mistral-7B with improved helpfulness and harmlessness

60

o1

OpenAI

Overall Score
39.87
ARC-AGI 2: 8.0% | HLE: 8.0%
8.0%
ARC-AGI 2
8.0%
HLE
92.3%
MMLU
92.5%
Code
74.3%
Math
96.5%
GSM8K
Context:200K
Released:Dec 2024
License:Proprietary

OpenAI o1 is a reasoning model that uses chain-of-thought processing to solve complex problems with unprecedented accuracy.

61

Qwen3-235B

Alibaba

Overall Score
38.86
ARC-AGI 2: 3.0% | HLE (Text): 11.8%
3.0%
ARC-AGI 2
11.8%
HLE (Text)
88.3%
MMLU
89.1%
Code
79.2%
Math
95.3%
GSM8K
Parameters:235B
Context:256K
Released:Apr 2025
License:Open Source

Next-generation Qwen model with significantly expanded scale and capabilities

62

Mixtral 8x22B (Historical)

Mistral AI

Overall Score
38.47
ARC-AGI 2: 2.4% | HLE (Text): 31.7%
2.4%
ARC-AGI 2
31.7%
HLE (Text)
75.2%
MMLU
68.9%
Code
52.4%
Math
83.7%
GSM8K
Parameters:141B
Context:33K
Released:Apr 2024
License:Open Source

Early release version of Mixtral 8x22B with mixture-of-experts architecture, showing the initial capabilities before optimization

63

Claude 4 Opus

Anthropic

Overall Score
38.47
ARC-AGI 2: 0.1% | HLE: 6.7%
0.1%
ARC-AGI 2
6.7%
HLE
93.8%
MMLU
94.2%
Code
85.7%
Math
97.8%
GSM8K
Context:500K
Released:May 2025
License:Proprietary

Most capable Claude model with enhanced reasoning and extended context

64

Claude 4 Sonnet

Anthropic

Overall Score
38.21
ARC-AGI 2: 2.0% | HLE: 5.5%
2.0%
ARC-AGI 2
5.5%
HLE
92.7%
MMLU
94.5%
Code
82.1%
Math
97.9%
GSM8K
Context:400K
Released:May 2025
License:Proprietary

Balanced Claude 4 model offering Opus-level capabilities with better efficiency

65

Grok-4

xAI

Overall Score
38.20
ARC-AGI 2: 1.0% | HLE (Text): 8.0%
1.0%
ARC-AGI 2
8.0%
HLE (Text)
91.5%
MMLU
92.0%
Code
82.7%
Math
96.8%
GSM8K
Parameters:Undisclosed
Context:200K
Released:Jan 2025
License:Proprietary

Next-generation model from xAI with enhanced reasoning and real-time capabilities

66

ERNIE X1

Baidu

Overall Score
38.17
ARC-AGI 2: 5.5% | HLE: 5.8%
5.5%
ARC-AGI 2
5.8%
HLE
90.2%
MMLU
88.5%
Code
78.3%
Math
94.8%
GSM8K
Parameters:Undisclosed
Context:33K
Released:Mar 2025
License:Proprietary

Deep-thinking reasoning model with multimodal capabilities, designed for complex problem solving

67

Grok 3

xAI

Overall Score
38.10
ARC-AGI 2: 5.8% | HLE (Text): 5.8%
5.8%
ARC-AGI 2
5.8%
HLE (Text)
89.5%
MMLU
86.8%
Code
76.5%
Math
95.2%
GSM8K
Parameters:Undisclosed
Context:128K
Released:Feb 2025
License:Proprietary

Advanced model with 10x more compute than Grok 2, featuring reflection capabilities

68

Claude 3.7 Sonnet

Anthropic

Overall Score
37.78
ARC-AGI 2: 1.0% | HLE: 8.0%
1.0%
ARC-AGI 2
8.0%
HLE
90.5%
MMLU
93.1%
Code
78.3%
Math
97.2%
GSM8K
Context:250K
Released:Feb 2025
License:Proprietary

Enhanced Sonnet model bridging the gap to Claude 4

69

Gemini 2.0 Pro

Google DeepMind

Overall Score
37.60
ARC-AGI 2: 2.0% | HLE: 7.1%
2.0%
ARC-AGI 2
7.1%
HLE
90.3%
MMLU
89.2%
Code
78.5%
Math
95.8%
GSM8K
Context:2.0M
Released:Feb 2025
License:Proprietary

Gemini 2.0 Pro is Google's most capable multimodal AI model, designed for the agentic era with advanced reasoning and tool use capabilities.

70

Gemini 2.5 Flash

Google DeepMind

Overall Score
37.29
ARC-AGI 2: 2.0% | HLE: 12.1%
2.0%
ARC-AGI 2
12.1%
HLE
80.5%
MMLU
88.2%
Code
75.3%
Math
93.7%
GSM8K
Context:1.0M
Released:Jan 2025
License:Proprietary

Lightning-fast multimodal model optimized for speed and efficiency

71

GLM-4 Plus

Z.ai

Overall Score
37.27
ARC-AGI 2: 4.2% | HLE (Text): 5.8%
4.2%
ARC-AGI 2
5.8%
HLE (Text)
88.8%
MMLU
87.2%
Code
72.5%
Math
95.2%
GSM8K
Parameters:Undisclosed (>130B)
Context:1.0M
Released:Aug 2024
License:Proprietary

Most powerful GLM-4-series model at its release, with a 1M token context window and enhanced capabilities

72

Qwen2.5-Max

Alibaba

Overall Score
37.10
ARC-AGI 2: 4.2% | HLE (Text): 5.2%
4.2%
ARC-AGI 2
5.2%
HLE (Text)
88.8%
MMLU
86.5%
Code
75.2%
Math
94.5%
GSM8K
Parameters:Large MoE
Context:262K
Released:Jan 2025
License:Proprietary

Large Mixture of Experts model trained on 20T+ tokens with extended context capabilities

73

o1-mini

OpenAI

Overall Score
36.93
ARC-AGI 2: 3.0% | HLE (Text): 10.3%
3.0%
ARC-AGI 2
10.3%
HLE (Text)
85.2%
MMLU
86.2%
Code
70.0%
Math
94.8%
GSM8K
Context:128K
Released:Sep 2024
License:Proprietary

o1-mini is a cost-efficient reasoning model optimized for STEM tasks, particularly mathematics and coding, with performance matching o1 on technical benchmarks.

74

A.X 3.0

SK Telecom

Overall Score
36.65
ARC-AGI 2: 4.5% | HLE (Text): 28.3%
4.5%
ARC-AGI 2
28.3%
HLE (Text)
74.8%
MMLU
61.3%
Code
54.7%
Math
72.1%
GSM8K
Parameters:30B
Context:8K
Released:Jun 2024
License:Proprietary

Third generation A.X model with significantly improved reasoning capabilities and expanded context window for enterprise applications

75

Phi-4

Microsoft Research

Overall Score
36.44
ARC-AGI 2: 3.8% | HLE (Text): 4.8%
3.8%
ARC-AGI 2
4.8%
HLE (Text)
84.8%
MMLU
82.6%
Code
80.4%
Math
95.5%
GSM8K
Parameters:14B
Context:16K
Released:Dec 2024
License:Open Source

14B parameter model specialized in math and complex reasoning

76

Mixtral 8x7B

Mistral AI

Overall Score
36.21
ARC-AGI 2: 0.0% | HLE (Text): 46.0%
0.0%
ARC-AGI 2
46.0%
HLE (Text)
70.6%
MMLU
40.2%
Code
28.4%
Math
74.4%
GSM8K
Parameters:46.7B
Context:33K
Released:Dec 2023
License:Open Source

Pioneering open-source mixture-of-experts model

77

Phi-3 Medium

Microsoft Research

Overall Score
36.07
ARC-AGI 2: 0.1% | HLE (Text): 38.0%
0.1%
ARC-AGI 2
38.0%
HLE (Text)
78.2%
MMLU
62.4%
Code
46.7%
Math
91.0%
GSM8K
Parameters:14B
Context:128K
Released:May 2024
License:Open Source

Small language model with outsized performance

78

ERNIE 4.5

Baidu

Overall Score
35.93
ARC-AGI 2: 3.2% | HLE: 4.5%
3.2%
ARC-AGI 2
4.5%
HLE
88.5%
MMLU
85.2%
Code
72.8%
Math
92.5%
GSM8K
Parameters:0.3B-424B (MoE)
Context:16K
Released:Mar 2025
License:Proprietary

Native multimodal model with Mixture of Experts architecture, supporting text, image, audio, and video

79

Grok-2

xAI

Overall Score
35.93
ARC-AGI 2: 0.1% | HLE (Text): 6.6%
0.1%
ARC-AGI 2
6.6%
HLE (Text)
88.0%
MMLU
88.4%
Code
76.5%
Math
94.7%
GSM8K
Parameters:Undisclosed
Context:100K
Released:Aug 2024
License:Proprietary

Advanced model with real-time information access

80

Llama 3.2 11B Vision

Meta AI

Overall Score
35.65
ARC-AGI 2: 0.1% | HLE (Text): 42.0%
0.1%
ARC-AGI 2
42.0%
HLE (Text)
73.0%
MMLU
58.4%
Code
42.1%
Math
84.2%
GSM8K
Parameters:11B
Context:128K
Released:Sep 2024
License:Open Source

Lightweight multimodal model for on-device deployment

81

Claude 3.5 Sonnet

Anthropic

Overall Score
35.32
ARC-AGI 2: 0.1% | HLE: 4.1%
0.1%
ARC-AGI 2
4.1%
HLE
88.7%
MMLU
92.0%
Code
71.1%
Math
96.4%
GSM8K
Parameters:Undisclosed
Context:200K
Released:Jun 2024
License:Proprietary

Most capable Claude model at its release, with superior coding and reasoning

82

GLM-4 Long

Z.ai

Overall Score
35.27
ARC-AGI 2: 3.2% | HLE (Text): 4.8%
3.2%
ARC-AGI 2
4.8%
HLE (Text)
86.2%
MMLU
82.8%
Code
66.5%
Math
93.2%
GSM8K
Parameters:130B
Context:2.0M
Released:Sep 2024
License:Proprietary

Specialized model for ultra-long context processing with 2M token window

83

Falcon 40B

Technology Innovation Institute

Overall Score
35.04
No AGI benchmarks
56.7%
MMLU
19.6%
Code
17.8%
Math
35.4%
GSM8K
Context:8K
Released:May 2023
License:Open Source

40B parameter model optimized for performance and efficiency

84

GPT-4o

OpenAI

Overall Score
35.03
ARC-AGI 2: 0.5% | HLE: 2.7%
0.5%
ARC-AGI 2
2.7%
HLE
88.7%
MMLU
90.2%
Code
76.6%
Math
95.8%
GSM8K
Parameters:Undisclosed
Context:128K
Released:May 2024
License:Proprietary

OpenAI's flagship omni-model with native multimodal capabilities

85

Kimi K1

Moonshot AI

Overall Score
34.66
ARC-AGI 2: 3.4% | HLE (Text): 26.8%
3.4%
ARC-AGI 2
26.8%
HLE (Text)
71.6%
MMLU
58.9%
Code
49.3%
Math
67.8%
GSM8K
Parameters:65B
Context:33K
Released:Mar 2024
License:Proprietary

First generation Kimi model with exceptional long context capabilities and strong Chinese-English bilingual performance

86

Command A

Cohere

Overall Score
34.65
ARC-AGI 2: 2.5% | HLE (Text): 4.2%
2.5%
ARC-AGI 2
4.2%
HLE (Text)
85.2%
MMLU
82.5%
Code
68.5%
Math
91.2%
GSM8K
Parameters:Undisclosed
Context:256K
Released:Jan 2025
License:Proprietary

Latest Command series model with enhanced capabilities and extended context window

87

Grok 2 mini

xAI

Overall Score
34.47
ARC-AGI 2: 2.2% | HLE (Text): 4.2%
2.2%
ARC-AGI 2
4.2%
HLE (Text)
85.5%
MMLU
80.4%
Code
68.2%
Math
92.5%
GSM8K
Parameters:Undisclosed
Context:128K
Released:Aug 2024
License:Proprietary

Smaller version of Grok 2 with efficient performance

88

GLM-4

Z.ai

Overall Score
34.44
ARC-AGI 2: 2.5% | HLE (Text): 4.2%
2.5%
ARC-AGI 2
4.2%
HLE (Text)
85.4%
MMLU
81.5%
Code
64.7%
Math
92.5%
GSM8K
Parameters:130B
Context:128K
Released:Jan 2024
License:Proprietary

Advanced bilingual model with significant improvements over GLM-3, competing with GPT-4 on Chinese tasks

89

GLM-4V

Z.ai

Overall Score
34.32
ARC-AGI 2: 3.5% | HLE: 5.2%
3.5%
ARC-AGI 2
5.2%
HLE
84.8%
MMLU
79.5%
Code
62.8%
Math
91.5%
GSM8K
Parameters:130B
Context:128K
Released:Jun 2024
License:Proprietary

Multimodal version of GLM-4 with strong vision-language understanding capabilities

90

GPT-4 Turbo

OpenAI

Overall Score
34.27
ARC-AGI 2: 0.1% | HLE (Text): 3.5%
0.1%
ARC-AGI 2
3.5%
HLE (Text)
86.4%
MMLU
87.8%
Code
72.6%
Math
94.2%
GSM8K
Parameters:Undisclosed
Context:128K
Released:Nov 2023
License:Proprietary

Enhanced GPT-4 with improved performance and extended context

91

Qwen2 72B

Alibaba

Overall Score
33.87
ARC-AGI 2: 1.8% | HLE (Text): 3.8%
1.8%
ARC-AGI 2
3.8%
HLE (Text)
84.2%
MMLU
79.9%
Code
68.7%
Math
91.1%
GSM8K
Parameters:72B
Context:131K
Released:Jun 2024
License:Open Source

Large-scale open source model from Alibaba with strong multilingual capabilities

92

Claude 3 Opus

Anthropic

Overall Score
33.59
ARC-AGI 2: 0.1% | HLE: 4.2%
0.1%
ARC-AGI 2
4.2%
HLE
86.8%
MMLU
84.9%
Code
60.1%
Math
95.0%
GSM8K
Parameters:Undisclosed
Context:200K
Released:Mar 2024
License:Proprietary

Powerful model for complex tasks requiring deep analysis

93

EXAONE 4.0.1 32B

LG AI

Overall Score
32.86
ARC-AGI 2: 8.5% | HLE (Text): 28.3%
8.5%
ARC-AGI 2
28.3%
HLE (Text)
75.3%
MMLU
48.9%
Code
42.1%
Math
71.8%
GSM8K
Context:33K
Released:Jan 2025
License:Open Source

LG AI's flagship 32B parameter model with multilingual capabilities

94

Grok 1.5

xAI

Overall Score
32.32
ARC-AGI 2: 1.5% | HLE (Text): 3.5%
1.5%
ARC-AGI 2
3.5%
HLE (Text)
81.3%
MMLU
74.1%
Code
62.5%
Math
90.8%
GSM8K
Parameters:Undisclosed
Context:128K
Released:Mar 2024
License:Proprietary

Improved Grok with extended 128K context window and better reasoning

95

Qwen2.5 32B

Alibaba

Overall Score
32.10
ARC-AGI 2: 1.2% | HLE (Text): 3.2%
1.2%
ARC-AGI 2
3.2%
HLE (Text)
81.8%
MMLU
75.2%
Code
62.5%
Math
88.2%
GSM8K
Parameters:32B
Context:131K
Released:Sep 2024
License:Open Source

Mid-size model from Qwen2.5 series trained on 18T tokens with improved efficiency

96

LLaMA 3 70B

Meta AI

Overall Score
32.07
ARC-AGI 2: 1.2% | HLE (Text): 3.2%
1.2%
ARC-AGI 2
3.2%
HLE (Text)
79.5%
MMLU
81.7%
Code
57.8%
Math
88.9%
GSM8K
Parameters:70B
Context:8K
Released:Apr 2024
License:Open Source

Large-scale 70B parameter model from Meta's LLaMA 3 family with enhanced capabilities

97

Kanana 1.2

Kakao

Overall Score
31.83
ARC-AGI 2: 2.8% | HLE (Text): 22.1%
2.8%
ARC-AGI 2
22.1%
HLE (Text)
69.7%
MMLU
56.2%
Code
45.8%
Math
63.4%
GSM8K
Parameters:14B
Context:4K
Released:Apr 2024
License:Research Only

Enhanced version of Kanana with improved reasoning capabilities and expanded multimodal understanding

98

Gemini 2.0 Flash

Google DeepMind

Overall Score
31.70
ARC-AGI 2: 1.0% | HLE: 6.6%
1.0%
ARC-AGI 2
6.6%
HLE
77.4%
MMLU
85.7%
Code
68.2%
Math
91.3%
GSM8K
Parameters:Undisclosed
Context:1.0M
Released:Dec 2024
License:Proprietary

Latest Gemini model built for the agentic era

99

Gemini 1.5 Pro

Google DeepMind

Overall Score
31.67
ARC-AGI 2: 0.0% | HLE: 4.6%
0.0%
ARC-AGI 2
4.6%
HLE
81.9%
MMLU
71.9%
Code
59.4%
Math
91.7%
GSM8K
Parameters:Undisclosed
Context:1.0M
Released:Feb 2024
License:Proprietary

Multimodal model with breakthrough 1M token context window

100

Gemini 1.5 Flash

Google DeepMind

Overall Score
31.02
ARC-AGI 2: 0.0% | HLE: 4.2%
0.0%
ARC-AGI 2
4.2%
HLE
78.9%
MMLU
74.3%
Code
54.9%
Math
86.2%
GSM8K
Parameters:Undisclosed
Context:1.0M
Released:May 2024
License:Proprietary

Lightweight multimodal model optimized for speed

101

ERNIE 4.0

Baidu

Overall Score
30.99
ARC-AGI 2: 0.8% | HLE: 2.5%
0.8%
ARC-AGI 2
2.5%
HLE
82.4%
MMLU
72.5%
Code
58.6%
Math
85.2%
GSM8K
Parameters:Undisclosed
Context:8K
Released:Oct 2023
License:Proprietary

Advanced multimodal model from Baidu with enhanced reasoning and understanding capabilities

102

BLOOM 176B

BigScience

Overall Score
30.65
No AGI benchmarks
55.5%
MMLU
15.7%
Code
13.2%
Math
20.9%
GSM8K
Context:8K
Released:Jul 2022
License:Open Source

Multilingual 176B parameter model supporting 59 languages

103

Jamba 1.5 Large

AI21 Labs

Overall Score
30.51
ARC-AGI 2: 1.5% | HLE (Text): 3.5%
1.5%
ARC-AGI 2
3.5%
HLE (Text)
73.2%
MMLU
74.5%
Code
54.8%
Math
87.5%
GSM8K
Parameters:94B (16B active)
Context:256K
Released:Aug 2024
License:Open Source

Enhanced large version of Jamba with improved performance across all benchmarks

104

A.X 2.0

SK Telecom

Overall Score
29.23
ARC-AGI 2: 2.1% | HLE (Text): 18.6%
2.1%
ARC-AGI 2
18.6%
HLE (Text)
67.3%
MMLU
52.4%
Code
41.8%
Math
58.7%
GSM8K
Parameters:13B
Context:4K
Released:Aug 2023
License:Proprietary

Second generation of SK Telecom's A.X series, focused on Korean language understanding and telecommunications applications

105

AFM 3.0

Arcee AI

Overall Score
29.21
ARC-AGI 2: 2.5% | HLE (Text): 19.7%
2.5%
ARC-AGI 2
19.7%
HLE (Text)
68.9%
MMLU
45.3%
Code
38.7%
Math
54.2%
GSM8K
Parameters:14B
Context:16K
Released:Oct 2024
License:Commercial

Advanced Adaptive Foundation Model with enhanced reasoning capabilities and improved enterprise features

106

Grok 1

xAI

Overall Score
28.71
ARC-AGI 2: 0.8% | HLE (Text): 2.8%
0.8%
ARC-AGI 2
2.8%
HLE (Text)
73.0%
MMLU
63.2%
Code
52.8%
Math
86.5%
GSM8K
Parameters:Undisclosed
Context:8K
Released:Nov 2023
License:Proprietary

First model from xAI with real-time knowledge access through X platform

107

Stable LM 2 1.6B

Stability AI

Overall Score
28.34
No AGI benchmarks
45.3%
MMLU
18.7%
Code
12.4%
Math
28.6%
GSM8K
Context:8K
Released:Jan 2024
License:Open Source

Compact 1.6B parameter model for efficient deployment

108

Phi-3.5

Microsoft Research

Overall Score
27.89
ARC-AGI 2: 1.2% | HLE (Text): 2.8%
1.2%
ARC-AGI 2
2.8%
HLE (Text)
71.5%
MMLU
62.8%
Code
45.8%
Math
85.2%
GSM8K
Parameters:3.8B
Context:128K
Released:Aug 2024
License:Open Source

Enhanced version of Phi-3 with improved performance across benchmarks

109

Jamba

AI21 Labs

Overall Score
27.41
ARC-AGI 2: 0.8% | HLE (Text): 2.8%
0.8%
ARC-AGI 2
2.8%
HLE (Text)
67.4%
MMLU
65.4%
Code
46.7%
Math
81.2%
GSM8K
Parameters:52B (12B active)
Context:256K
Released:Mar 2024
License:Open Source

Hybrid Mamba-Transformer architecture with 256K context window, combining efficiency and performance

110

LLaMA 3.1 8B

Meta AI

Overall Score
27.32
ARC-AGI 2: 0.6% | HLE (Text): 2.3%
0.6%
ARC-AGI 2
2.3%
HLE (Text)
69.4%
MMLU
64.2%
Code
47.2%
Math
80.6%
GSM8K
Parameters:8B
Context:128K
Released:Jul 2024
License:Open Source

Updated 8B model with 128K context window and improved multilingual support

111

CodeGeeX4

Z.ai

Overall Score
27.16
ARC-AGI 2: 0.8% | HLE (Text): 2.5%
0.8%
ARC-AGI 2
2.5%
HLE (Text)
58.5%
MMLU
85.2%
Code
52.8%
Math
68.5%
GSM8K
Parameters:9B
Context:128K
Released:Aug 2024
License:Open Source

Advanced code generation model competing with GitHub Copilot and supporting 100+ programming languages

112

Jurassic-2

AI21 Labs

Overall Score
27.14
ARC-AGI 2: 0.4% | HLE (Text): 2.5%
0.4%
ARC-AGI 2
2.5%
HLE (Text)
72.8%
MMLU
58.2%
Code
48.5%
Math
75.3%
GSM8K
Parameters:Undisclosed
Context:8K
Released:Mar 2023
License:Proprietary

Improved model with faster response times, better instruction following, and multilingual support

113

Kanana 1.0

Kakao

Overall Score
26.80
ARC-AGI 2: 1.9% | HLE (Text): 16.4%
1.9%
ARC-AGI 2
16.4%
HLE (Text)
62.4%
MMLU
48.7%
Code
37.2%
Math
54.1%
GSM8K
Parameters:8B
Context:2K
Released:Sep 2023
License:Research Only

Kakao's first large language model focused on Korean language understanding and cultural context

114

LLaMA 3 8B

Meta AI

Overall Score
26.71
ARC-AGI 2: 0.5% | HLE (Text): 2.1%
0.5%
ARC-AGI 2
2.1%
HLE (Text)
68.4%
MMLU
62.2%
Code
45.6%
Math
79.6%
GSM8K
Parameters:8B
Context:8K
Released:Apr 2024
License:Open Source

Efficient 8B parameter model from Meta's LLaMA 3 family, trained on 15T tokens

115

Command R

Cohere

Overall Score
26.65
ARC-AGI 2: 0.5% | HLE (Text): 2.2%
0.5%
ARC-AGI 2
2.2%
HLE (Text)
68.2%
MMLU
61.8%
Code
45.8%
Math
78.2%
GSM8K
Parameters:35B
Context:128K
Released:Mar 2024
License:Proprietary

Retrieval-augmented generation model optimized for enterprise use with multilingual support

116

Phi-3 mini

Microsoft Research

Overall Score
26.47
ARC-AGI 2: 0.8% | HLE (Text): 2.5%
0.8%
ARC-AGI 2
2.5%
HLE (Text)
69.0%
MMLU
58.5%
Code
42.3%
Math
82.5%
GSM8K
Parameters:3.8B
Context:128K
Released:Apr 2024
License:Open Source

3.8B parameter model with 4K and 128K context variants

117

Jamba 1.5 Mini

AI21 Labs

Overall Score
26.26
ARC-AGI 2: 0.6% | HLE (Text): 2.5%
0.6%
ARC-AGI 2
2.5%
HLE (Text)
65.4%
MMLU
62.8%
Code
43.2%
Math
78.5%
GSM8K
Parameters:12B
Context:256K
Released:Aug 2024
License:Open Source

Efficient version of Jamba with reduced parameters while maintaining 256K context

118

Mistral 7B v0.3

Mistral AI

Overall Score
26.10
ARC-AGI 2: 1.9% | HLE (Text): 17.3%
1.9%
ARC-AGI 2
17.3%
HLE (Text)
62.5%
MMLU
40.2%
Code
32.6%
Math
52.2%
GSM8K
Parameters:7B
Context:33K
Released:May 2024
License:Open Source

Enhanced version of Mistral 7B with improved instruction following and extended context capabilities

119

ERNIE 3.0 Titan

Baidu

Overall Score
25.29
ARC-AGI 2: 0.2% | HLE (Text): 1.8%
0.2%
ARC-AGI 2
1.8%
HLE (Text)
75.2%
MMLU
45.8%
Code
42.3%
Math
68.5%
GSM8K
Parameters:260B
Context:4K
Released:Dec 2021
License:Proprietary

Baidu's first 100B+ parameter model, combining knowledge graphs with transformer architecture

120

GLM-3

Z.ai

Overall Score
24.97
ARC-AGI 2: 0.3% | HLE (Text): 1.8%
0.3%
ARC-AGI 2
1.8%
HLE (Text)
66.4%
MMLU
57.4%
Code
38.5%
Math
72.3%
GSM8K
Parameters:6B
Context:33K
Released:Oct 2023
License:Open Source

Open bilingual (Chinese-English) language model with strong performance on both languages

121

OLMo 1.7 7B

Allen Institute for AI

Overall Score
24.46
ARC-AGI 2: 1.8% | HLE (Text): 15.9%
1.8%
ARC-AGI 2
15.9%
HLE (Text)
61.3%
MMLU
34.8%
Code
28.9%
Math
42.7%
GSM8K
Parameters:7B
Context:4K
Released:Sep 2024
License:Open Source

Improved version of OLMo with enhanced training methodology and better performance across multiple benchmarks

122

AFM 1.0

Arcee AI

Overall Score
22.56
ARC-AGI 2: 1.4% | HLE (Text): 13.8%
1.4%
ARC-AGI 2
13.8%
HLE (Text)
58.4%
MMLU
32.7%
Code
26.3%
Math
38.9%
GSM8K
Parameters:7B
Context:8K
Released:Mar 2024
License:Commercial

Arcee AI's first Adaptive Foundation Model, designed for efficient fine-tuning and domain adaptation

123

Jurassic-1

AI21 Labs

Overall Score
21.52
ARC-AGI 2: 0.1% | HLE (Text): 1.2%
0.1%
ARC-AGI 2
1.2%
HLE (Text)
62.3%
MMLU
38.5%
Code
35.2%
Math
58.6%
GSM8K
Parameters:178B
Context:2K
Released:Aug 2021
License:Proprietary

AI21 Labs' first large language model with 178B parameters and large vocabulary

124

Phi-2

Microsoft Research

Overall Score
21.34
ARC-AGI 2: 0.3% | HLE (Text): 1.5%
0.3%
ARC-AGI 2
1.5%
HLE (Text)
57.9%
MMLU
47.8%
Code
31.2%
Math
61.1%
GSM8K
Parameters:2.7B
Context:2K
Released:Dec 2023
License:Open Source

2.7B parameter model with outstanding reasoning abilities for its size

125

OLMo 1.0 7B

Allen Institute for AI

Overall Score
20.38
ARC-AGI 2: 1.2% | HLE (Text): 12.3%
1.2%
ARC-AGI 2
12.3%
HLE (Text)
54.8%
MMLU
27.2%
Code
22.6%
Math
31.4%
GSM8K
Parameters:7B
Context:2K
Released:Feb 2024
License:Open Source

Open Language Model from Allen AI, designed for transparency and research with fully open training data and process

126

GPT-NeoX 20B

EleutherAI

Overall Score
18.39
No AGI benchmarks
33.6%
MMLU
11.4%
Code
8.2%
Math
7.1%
GSM8K
Context:8K
Released:Feb 2022
License:Open Source

Open-source 20B parameter autoregressive language model

127

Phi-1.5

Microsoft Research

Overall Score
15.20
ARC-AGI 2: 0.1% | HLE (Text): 0.8%
0.1%
ARC-AGI 2
0.8%
HLE (Text)
42.5%
MMLU
41.4%
Code
18.5%
Math
15.2%
GSM8K
Parameters:1.3B
Context:2K
Released:Sep 2023
License:Open Source

Improved 1.3B model with better reasoning capabilities

128

Phi-1

Microsoft Research

Overall Score
12.37
ARC-AGI 2: 0.0% | HLE (Text): 0.3%
0.0%
ARC-AGI 2
0.3%
HLE (Text)
30.2%
MMLU
50.6%
Code
12.3%
Math
8.7%
GSM8K
Parameters:1.3B
Context:2K
Released:Jun 2023
License:Open Source

Specialized 1.3B parameter model for Python code generation

129

BioGPT-Large

Microsoft Research

Overall Score
12.13
ARC-AGI 2: 0.0% | HLE (Text): 0.8%
0.0%
ARC-AGI 2
0.8%
HLE (Text)
48.5%
MMLU
8.2%
Code
10.5%
Math
15.8%
GSM8K
Parameters:1.5B
Context:1K
Released:Jan 2023
License:Research Only

Larger biomedical model with improved accuracy on medical tasks

130

BioGPT

Microsoft Research

Overall Score
10.84
ARC-AGI 2: 0.0% | HLE (Text): 0.5%
0.0%
ARC-AGI 2
0.5%
HLE (Text)
45.2%
MMLU
5.8%
Code
8.2%
Math
12.5%
GSM8K
Parameters:347M
Context:1K
Released:Jan 2023
License:Research Only

Specialized language model trained on biomedical literature

131

Midjourney V6

Midjourney

Overall Score
0.00
No AGI benchmarks
Context:8K
Released:Dec 2023
License:Proprietary

Most photorealistic model with significantly improved text rendering

132

DALL-E 3

OpenAI

Overall Score
0.00
ARC-AGI 2: 0.0% | HLE: 3.5%
0.0%
ARC-AGI 2
3.5%
HLE
0.0%
MMLU
0.0%
Code
0.0%
Math
0.0%
GSM8K
Parameters:Undisclosed
Context:4K
Released:Oct 2023
License:Proprietary

Advanced text-to-image generation model with improved prompt adherence and safety

133

Whisper Large v3

OpenAI

Overall Score
0.00
ARC-AGI 2: 0.0% | HLE (Text): 0.0%
0.0%
ARC-AGI 2
0.0%
HLE (Text)
0.0%
MMLU
0.0%
Code
0.0%
Math
0.0%
GSM8K
Parameters:1.55B
Context:30K
Released:Nov 2023
License:Open Source

State-of-the-art speech recognition model supporting 99 languages with improved accuracy

134

Gen-3 Alpha

Runway ML

Overall Score
0.00
No AGI benchmarks
Context:8K
Released:Jun 2024
License:Open Source

State-of-the-art video generation model with advanced motion and physics