
AI Model Leaderboard

Compare cutting-edge language models based on comprehensive benchmark performance. Track progress in reasoning, coding, mathematics, and general knowledge.

Total Models: 134
Top Score: 90.70
Avg Score: 41.18
Open Source: 62
Multimodal: 37
Latest Release: Qwen3 235B-A22B Thinking 2507

Showing 134 of 134 models
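
The summary figures above (total models, top score, average score, open-source and multimodal counts) follow directly from the per-model records listed below. A minimal sketch of that aggregation in Python, assuming the entries have been parsed into dictionaries; the field names and the placeholder flag values are assumptions for illustration, not the site's actual schema:

```python
# Minimal sketch: recomputing the header statistics from parsed leaderboard
# records. Only the first two entries are filled in; the multimodal flags
# here are illustrative placeholders.
models = [
    {"name": "Qwen3 235B-A22B Thinking 2507", "overall": 90.70,
     "license": "Open Source", "multimodal": False},
    {"name": "GLM 4.5", "overall": 89.46,
     "license": "Proprietary", "multimodal": False},
    # ... one record per leaderboard entry (134 in total)
]

total = len(models)
top_score = max(m["overall"] for m in models)
avg_score = sum(m["overall"] for m in models) / total
open_source = sum(1 for m in models if m["license"] == "Open Source")
multimodal = sum(1 for m in models if m["multimodal"])

print(f"Total Models: {total}")
print(f"Top Score: {top_score:.2f}")
print(f"Avg Score: {avg_score:.2f}")
print(f"Open Source: {open_source}")
print(f"Multimodal: {multimodal}")
```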
1

Qwen3 235B-A22B Thinking 2507

Alibaba

Overall Score
90.70
No AGI benchmarks
93.2%
MMLU
89.7%
Code
84.5%
Math
95.1%
GSM8K
94.3%
HellaSwag
95.8%
ARC
Context:33K
Released:Jul 2025
License:Open Source

Advanced reasoning model with chain-of-thought capabilities and extended thinking time
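
Every card in the listing repeats the same fields (rank, model, vendor, overall score, per-benchmark scores, context, release date, license, description). For anyone post-processing the leaderboard, a minimal sketch of one way to represent a single entry, populated with the values from the card above; the class and field names are assumptions for illustration only:

```python
from dataclasses import dataclass

@dataclass
class LeaderboardEntry:
    # Field names mirror the labels shown in each card; this is an assumed
    # representation, not the site's actual data model.
    rank: int
    name: str
    vendor: str
    overall_score: float
    benchmarks: dict          # benchmark name -> score in percent
    context_window: str
    released: str
    license: str
    description: str

# The top-ranked entry, transcribed into this structure.
qwen3_thinking = LeaderboardEntry(
    rank=1,
    name="Qwen3 235B-A22B Thinking 2507",
    vendor="Alibaba",
    overall_score=90.70,
    benchmarks={"MMLU": 93.2, "Code": 89.7, "Math": 84.5,
                "GSM8K": 95.1, "HellaSwag": 94.3, "ARC": 95.8},
    context_window="33K",
    released="Jul 2025",
    license="Open Source",
    description="Advanced reasoning model with chain-of-thought capabilities "
                "and extended thinking time",
)
```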

2

GLM 4.5

Z.ai

Overall Score
89.46
No AGI benchmarks
92.5%
MMLU
88.3%
Code
82.1%
Math
94.2%
GSM8K
93.8%
HellaSwag
95.1%
ARC
Context:128K
Released:Jul 2025
License:Proprietary

Large-scale mixture of experts model with 355B total parameters and 32B active

3

Qwen3 235B-A22B Instruct 2507

Alibaba

Overall Score
88.75
No AGI benchmarks
91.8%
MMLU
87.5%
Code
81.3%
Math
93.6%
GSM8K
93.2%
HellaSwag
94.7%
ARC
Context:33K
Released:Jul 2025
License:Open Source

Large-scale MoE model with 235B total parameters and 22B active, optimized for instruction following

4

GLM 4.5 Air

Z.ai

Overall Score
85.45
No AGI benchmarks
88.7%
MMLU
83.2%
Code
76.8%
Math
91.5%
GSM8K
91.2%
HellaSwag
92.8%
ARC
Context:128K
Released:Jul 2025
License:Proprietary

Lightweight version of GLM 4.5 with 106B total parameters and 12B active

5

Step 2.0

StepFun

Overall Score
82.59
No AGI benchmarks
85.6%
MMLU
79.8%
Code
73.2%
Math
88.9%
GSM8K
90.1%
HellaSwag
91.7%
ARC
Context:33K
Released:May 2024
License:Proprietary

Second generation model with improved reasoning and extended context

6

Nvidia Nemotron 4 340B

Nvidia

Overall Score
82.04
No AGI benchmarks
85.3%
MMLU
78.9%
Code
72.6%
Math
88.4%
GSM8K
89.7%
HellaSwag
91.2%
ARC
Context:4K
Released:Jun 2024
License:Proprietary

Large-scale model optimized for synthetic data generation and instruction following

7

EXAONE 3.5 27B

LG AI

Overall Score
80.49
No AGI benchmarks
83.7%
MMLU
76.8%
Code
71.2%
Math
86.9%
GSM8K
88.4%
HellaSwag
90.3%
ARC
Context:8K
Released:Aug 2024
License:Open Source

Enhanced bilingual model with improved reasoning and longer context

8

Nvidia NeMo 43B

Nvidia

Overall Score
75.63
No AGI benchmarks
78.2%
MMLU
72.5%
Code
65.3%
Math
82.1%
GSM8K
85.4%
HellaSwag
87.2%
ARC
Context:8K
Released:Oct 2023
License:Proprietary

Early large language model from Nvidia focused on enterprise applications

9

StarCoder2 15B

Hugging Face

Overall Score
75.30
No AGI benchmarks
75.3%
Code
Context:8K
Released:Feb 2024
License:Open Source

Advanced code generation model with 4x larger context than StarCoder

10

Step 1.0

StepFun

Overall Score
73.10
No AGI benchmarks
74.8%
MMLU
70.2%
Code
62.5%
Math
79.8%
GSM8K
84.2%
HellaSwag
86.1%
ARC
Context:16K
Released:Oct 2023
License:Proprietary

First generation model from StepFun with focus on Chinese language understanding

11

Test Model

Test AI Corp

Overall Score
72.35
ARC-AGI 2: 9.9% | HLE (Text): 99.0%
9.9%
ARC-AGI 2
99.0%
HLE (Text)
99.9%
MMLU
99.9%
Code
99.9%
Math
99.9%
GSM8K
Parameters:Undisclosed
Context:200K
Released:Jun 2025
License:Proprietary

Test model for benchmarking purposes with exceptional performance across all metrics

12

EXAONE 3.0 7B

LG AI

Overall Score
70.22
No AGI benchmarks
71.4%
MMLU
68.2%
Code
58.9%
Math
76.3%
GSM8K
82.1%
HellaSwag
84.5%
ARC
Context:4K
Released:Dec 2023
License:Open Source

Bilingual Korean-English model with strong performance on both languages

13

o3

OpenAI

Overall Score
67.06
ARC-AGI 2: 75.7% | HLE: 20.3%
75.7%
ARC-AGI 2
20.3%
HLE
95.2%
MMLU
96.3%
Code
96.7%
Math
98.9%
GSM8K
Context:256K
Released:Jan 2025
License:Proprietary

Revolutionary reasoning model achieving near-human performance on ARC-AGI

14

Mistral Magistral

Mistral AI

Overall Score
63.73
ARC-AGI 2: 21.8% | HLE: 45.7%
21.8%
ARC-AGI 2
45.7%
HLE
93.1%
MMLU
91.8%
Code
87.4%
Math
95.7%
GSM8K
Parameters:405B
Context:128K
Released:Jan 2025
License:Proprietary

Flagship multimodal model from Mistral AI with 405B parameters, designed for advanced reasoning and instruction following

15

Qwen3 Coder 480B

Alibaba

Overall Score
62.36
ARC-AGI 2: 19.5% | HLE (Text): 65.3%
19.5%
ARC-AGI 2
65.3%
HLE (Text)
92.7%
MMLU
94.8%
Code
88.2%
Math
95.1%
GSM8K
Parameters:480B
Context:66K
Released:Jan 2025
License:Open Source

Massive-scale coding-specialized model with 480B parameters, designed for complex software development tasks

16

Llama 3.3 Nemotron Super 70B

Nvidia

Overall Score
59.16
ARC-AGI 2: 17.2% | HLE (Text): 61.4%
17.2%
ARC-AGI 2
61.4%
HLE (Text)
90.4%
MMLU
88.7%
Code
83.9%
Math
94.2%
GSM8K
Parameters:70B
Context:131K
Released:Jan 2025
License:Proprietary

Advanced Llama 3.3-based model enhanced by Nvidia with superior reasoning and instruction following capabilities

17

OpenReasoning Nemotron 27B

Nvidia

Overall Score
59.03
ARC-AGI 2: 18.1% | HLE (Text): 62.7%
18.1%
ARC-AGI 2
62.7%
HLE (Text)
87.6%
MMLU
84.9%
Code
85.2%
Math
92.8%
GSM8K
Parameters:27B
Context:33K
Released:Jan 2025
License:Open Source

Open-source reasoning-focused model built on Nemotron architecture with enhanced logical thinking capabilities

18

Mistral Voxtral

Mistral AI

Overall Score
58.70
ARC-AGI 2: 16.7% | HLE (Text): 61.8%
16.7%
ARC-AGI 2
61.8%
HLE (Text)
89.3%
MMLU
87.6%
Code
81.9%
Math
93.4%
GSM8K
Parameters:70B
Context:128K
Released:Jan 2025
License:Proprietary

Multimodal voice and text model from Mistral AI with advanced audio processing and speech understanding capabilities

19

DBRX Instruct

Databricks

Overall Score
58.61
No AGI benchmarks
74.5%
MMLU
48.1%
Code
38.2%
Math
72.8%
GSM8K
Context:8K
Released:Mar 2024
License:Open Source

Instruction-tuned version of DBRX with enhanced capabilities

20

Qwen3 30B-A3B Thinking

Alibaba

Overall Score
57.60
ARC-AGI 2: 16.3% | HLE (Text): 58.9%
16.3%
ARC-AGI 2
58.9%
HLE (Text)
89.1%
MMLU
85.3%
Code
84.7%
Math
93.2%
GSM8K
Parameters:30B total, 3B active
Context:33K
Released:Jan 2025
License:Open Source

Advanced reasoning-focused MoE model with enhanced thinking capabilities and step-by-step problem solving

21

Mistral Devstral

Mistral AI

Overall Score
57.10
ARC-AGI 2: 15.6% | HLE (Text): 58.4%
15.6%
ARC-AGI 2
58.4%
HLE (Text)
85.7%
MMLU
92.4%
Code
84.1%
Math
91.3%
GSM8K
Parameters:22B
Context:33K
Released:Jan 2025
License:Open Source

Specialized coding model from Mistral AI with 22B parameters, optimized for software development and programming tasks

22

Kimi K2 Instruct

Moonshot AI

Overall Score
57.07
ARC-AGI 2: 15.1% | HLE (Text): 59.7%
15.1%
ARC-AGI 2
59.7%
HLE (Text)
88.9%
MMLU
86.4%
Code
79.2%
Math
92.1%
GSM8K
Parameters:120B
Context:200K
Released:Jan 2025
License:Proprietary

Advanced Chinese-English bilingual model with exceptional long-context understanding and instruction following

23

A.X 4.0 VL Light

SK Telecom

Overall Score
56.31
No AGI benchmarks
71.2%
MMLU
45.3%
Code
38.7%
Math
68.9%
GSM8K
Context:16K
Released:Jan 2025
License:Proprietary

SK Telecom's vision-language model optimized for Korean and English

24

Qwen3 Coder 30B

Alibaba

Overall Score
55.89
ARC-AGI 2: 14.2% | HLE (Text): 56.8%
14.2%
ARC-AGI 2
56.8%
HLE (Text)
86.3%
MMLU
89.2%
Code
82.5%
Math
91.7%
GSM8K
Parameters:30B
Context:33K
Released:Jan 2025
License:Open Source

Efficient coding-specialized model with 30B parameters, optimized for software development and programming tasks

25

DBRX

Databricks

Overall Score
55.83
No AGI benchmarks
73.2%
MMLU
43.5%
Code
35.8%
Math
68.4%
GSM8K
Context:8K
Released:Mar 2024
License:Open Source

132B parameter MoE model outperforming GPT-3.5 and competitive with GPT-4

26

Kanana 1.5 15.7B

Kakao

Overall Score
54.30
No AGI benchmarks
69.8%
MMLU
43.6%
Code
36.2%
Math
65.4%
GSM8K
Context:16K
Released:Jan 2025
License:Open Source

Kakao's Korean-optimized language model with strong instruction following

27

Kimi K2 Base

Moonshot AI

Overall Score
54.18
ARC-AGI 2: 13.4% | HLE (Text): 55.2%
13.4%
ARC-AGI 2
55.2%
HLE (Text)
86.2%
MMLU
83.7%
Code
76.8%
Math
89.3%
GSM8K
Parameters:120B
Context:200K
Released:Jan 2025
License:Research Only

Base version of Kimi K2 model optimized for research and fine-tuning applications with strong foundational capabilities

28

Llama 3.1 405B

Meta AI

Overall Score
53.31
ARC-AGI 2: 0.5% | HLE (Text): 64.0%
0.5%
ARC-AGI 2
64.0%
HLE (Text)
87.3%
MMLU
89.0%
Code
73.8%
Math
96.8%
GSM8K
Parameters:405B
Context:128K
Released:Jul 2024
License:Open Source

Largest open-source model rivaling proprietary systems

29

Qwen3 30B-A3B Instruct

Alibaba

Overall Score
53.12
ARC-AGI 2: 12.8% | HLE (Text): 52.1%
12.8%
ARC-AGI 2
52.1%
HLE (Text)
87.2%
MMLU
82.1%
Code
76.8%
Math
89.4%
GSM8K
Parameters:30B total, 3B active
Context:33K
Released:Jan 2025
License:Open Source

Advanced MoE model with 30B total parameters and 3B active, optimized for instruction following with high efficiency

30

OLMoCR 7B

Allen Institute for AI

Overall Score
52.97
No AGI benchmarks
68.2%
MMLU
42.8%
Code
35.6%
Math
62.4%
GSM8K
Context:8K
Released:Jul 2025
License:Open Source

Open language model optimized for OCR and document understanding

31

OpenReasoning Nemotron 7B

Nvidia

Overall Score
52.11
ARC-AGI 2: 13.7% | HLE (Text): 51.3%
13.7%
ARC-AGI 2
51.3%
HLE (Text)
82.4%
MMLU
79.3%
Code
78.6%
Math
87.9%
GSM8K
Parameters:7B
Context:33K
Released:Jan 2025
License:Open Source

Compact reasoning-focused model with 7B parameters, optimized for efficient deployment with strong logical capabilities

32

Mistral Large 2

Mistral AI

Overall Score
51.38
ARC-AGI 2: 0.1% | HLE: 4.5%
0.1%
ARC-AGI 2
4.5%
HLE
84.0%
MMLU
92.0%
Code
71.5%
Math
93.4%
GSM8K
Parameters:123B
Context:128K
Released:Jul 2024
License:Commercial

Flagship model with exceptional coding capabilities

33

Llama 3.3 70B Instruct

Meta AI

Overall Score
51.32
ARC-AGI 2: 0.3% | HLE (Text): 58.0%
0.3%
ARC-AGI 2
58.0%
HLE (Text)
86.0%
MMLU
88.4%
Code
77.0%
Math
95.1%
GSM8K
Parameters:70B
Context:128K
Released:Dec 2024
License:Open Source

Llama 3.3 70B delivers performance nearly matching the 405B model while being significantly more efficient and cost-effective.

34

Falcon 180B

Technology Innovation Institute

Overall Score
51.02
No AGI benchmarks
70.5%
MMLU
36.8%
Code
32.1%
Math
58.9%
GSM8K
Context:8K
Released:Sep 2023
License:Open Source

180B parameter model, top open-source model at launch

35

Qwen 2.5 72B

Alibaba

Overall Score
50.96
ARC-AGI 2: 0.5% | HLE (Text): 58.0%
0.5%
ARC-AGI 2
58.0%
HLE (Text)
85.9%
MMLU
86.4%
Code
74.2%
Math
93.8%
GSM8K
Parameters:72B
Context:131K
Released:Sep 2024
License:Open Source

Leading Chinese model with strong math and coding

36

DeepSeek-V2

DeepSeek

Overall Score
50.35
ARC-AGI 2: 0.1% | HLE (Text): 60.0%
0.1%
ARC-AGI 2
60.0%
HLE (Text)
83.7%
MMLU
80.1%
Code
72.8%
Math
92.3%
GSM8K
Parameters:236B
Context:128K
Released:May 2024
License:Open Source

Large MoE model with efficient architecture

37

Step 3

StepFun

Overall Score
49.31
ARC-AGI 2: 15.2% | HLE (Text): 42.5%
15.2%
ARC-AGI 2
42.5%
HLE (Text)
92.8%
MMLU
86.5%
Code
78.2%
Math
93.6%
GSM8K
Context:8K
Released:Jan 2025
License:Open Source

Advanced multimodal model with strong reasoning capabilities

38

GPT-4o mini

OpenAI

Overall Score
48.92
ARC-AGI 2: 0.2% | HLE (Text): 55.0%
0.2%
ARC-AGI 2
55.0%
HLE (Text)
82.0%
MMLU
87.2%
Code
70.2%
Math
93.1%
GSM8K
Context:128K
Released:Jul 2024
License:Proprietary

GPT-4o mini is OpenAI's most cost-efficient small model, designed for fast responses and handling large contexts.

39

Stable LM 2 12B

Stability AI

Overall Score
48.74
No AGI benchmarks
68.7%
MMLU
35.4%
Code
28.9%
Math
55.2%
GSM8K
Context:8K
Released:Jan 2024
License:Open Source

Stability AI's 12B parameter language model with improved performance

40

AFM 4.5B

Arcee AI

Overall Score
47.26
No AGI benchmarks
62.4%
MMLU
38.2%
Code
28.9%
Math
56.7%
GSM8K
Context:8K
Released:Jan 2025
License:Open Source

Arcee AI's efficient 4.5B parameter model focused on domain adaptation

41

Llama 3.1 70B

Meta AI

Overall Score
47.25
ARC-AGI 2: 0.1% | HLE (Text): 52.0%
0.1%
ARC-AGI 2
52.0%
HLE (Text)
83.6%
MMLU
80.5%
Code
64.7%
Math
93.0%
GSM8K
Parameters:70B
Context:128K
Released:Jul 2024
License:Open Source

Highly capable open model with extended context

42

Claude 3.5 Haiku

Anthropic

Overall Score
47.08
ARC-AGI 2: 0.0% | HLE (Text): 58.0%
0.0%
ARC-AGI 2
58.0%
HLE (Text)
78.7%
MMLU
88.2%
Code
44.5%
Math
91.2%
GSM8K
Context:200K
Released:Oct 2024
License:Proprietary

Claude 3.5 Haiku is Anthropic's fastest and most cost-effective model, optimized for speed while maintaining strong performance.

43

Mixtral 8x22B

Mistral AI

Overall Score
45.80
ARC-AGI 2: 0.0% | HLE (Text): 54.0%
0.0%
ARC-AGI 2
54.0%
HLE (Text)
77.8%
MMLU
74.5%
Code
58.7%
Math
88.4%
GSM8K
Parameters:141B
Context:66K
Released:Apr 2024
License:Open Source

Large mixture-of-experts model with strong performance

44

StripedHyena 7B

Together AI

Overall Score
45.54
No AGI benchmarks
61.3%
MMLU
36.2%
Code
27.8%
Math
52.4%
GSM8K
Context:8K
Released:Dec 2023
License:Open Source

Efficient transformer alternative with state-space model architecture

45

Claude 3 Sonnet

Anthropic

Overall Score
45.17
ARC-AGI 2: 0.0% | HLE (Text): 52.0%
0.0%
ARC-AGI 2
52.0%
HLE (Text)
79.0%
MMLU
73.0%
Code
53.1%
Math
92.3%
GSM8K
Parameters:Undisclosed
Context:200K
Released:Mar 2024
License:Proprietary

Mid-tier model striking an ideal balance between intelligence and speed for enterprise workloads

46

Llama 3.2 90B Vision

Meta AI

Overall Score
44.39
ARC-AGI 2: 0.1% | HLE: 5.5%
0.1%
ARC-AGI 2
5.5%
HLE
85.5%
MMLU
75.2%
Code
58.3%
Math
91.8%
GSM8K
Parameters:90B
Context:128K
Released:Sep 2024
License:Open Source

First multimodal Llama model with vision capabilities

47

Gemini 2.5 Pro

Google DeepMind

Overall Score
44.23
ARC-AGI 2: 5.0% | HLE: 21.6%
5.0%
ARC-AGI 2
21.6%
HLE
91.7%
MMLU
91.5%
Code
82.3%
Math
97.2%
GSM8K
Context:2.0M
Released:Jun 2025
License:Proprietary

Advanced multimodal model with unprecedented 2M token context and superior reasoning

48

Yi-Large

01.AI

Overall Score
44.07
ARC-AGI 2: 0.1% | HLE (Text): 54.0%
0.1%
ARC-AGI 2
54.0%
HLE (Text)
85.4%
MMLU
75.2%
Code
54.8%
Math
91.7%
GSM8K
Parameters:Undisclosed
Context:33K
Released:May 2024
License:Proprietary

Bilingual model excelling in English and Chinese

49

Grok 4 Heavy

xAI

Overall Score
43.84
ARC-AGI 2: 12.8% | HLE: 9.8%
12.8%
ARC-AGI 2
9.8%
HLE
94.5%
MMLU
92.8%
Code
88.5%
Math
98.5%
GSM8K
Parameters:Undisclosed
Context:256K
Released:Jul 2025
License:Proprietary

Most capable Grok model, available at premium tier for $300/month

50

Inflection-2.5

Inflection AI

Overall Score
43.51
ARC-AGI 2: 0.1% | HLE (Text): 58.0%
0.1%
ARC-AGI 2
58.0%
HLE (Text)
80.4%
MMLU
70.5%
Code
47.3%
Math
88.2%
GSM8K
Parameters:Undisclosed
Context:33K
Released:Mar 2024
License:Proprietary

Conversational AI with emotional intelligence

51

DeepSeek-R1-0528

DeepSeek

Overall Score
42.90
ARC-AGI 2: 7.0% | HLE (Text): 14.0%
7.0%
ARC-AGI 2
14.0%
HLE (Text)
92.1%
MMLU
95.5%
Code
88.2%
Math
97.8%
GSM8K
Parameters:1.2T
Context:256K
Released:May 2025
License:Open Source

Advanced iteration of DeepSeek-R1 with enhanced reasoning and larger scale

52

Kimi K1.5

Moonshot AI

Overall Score
42.66
ARC-AGI 2: 6.2% | HLE (Text): 38.4%
6.2%
ARC-AGI 2
38.4%
HLE (Text)
79.4%
MMLU
68.7%
Code
61.2%
Math
78.5%
GSM8K
Parameters:85B
Context:128K
Released:Oct 2024
License:Proprietary

Enhanced Kimi model with dramatically expanded context window and improved reasoning across all domains

53

o4-mini

OpenAI

Overall Score
42.55
ARC-AGI 2: 2.4% | HLE: 18.1%
2.4%
ARC-AGI 2
18.1%
HLE
88.4%
MMLU
94.1%
Code
92.7%
Math
96.3%
GSM8K
Context:128K
Released:Jan 2025
License:Proprietary

Cost-efficient reasoning model with performance approaching larger models

54

o3-mini

OpenAI

Overall Score
42.20
ARC-AGI 2: 10.0% | HLE (Text): 13.4%
10.0%
ARC-AGI 2
13.4%
HLE (Text)
88.2%
MMLU
92.1%
Code
85.3%
Math
94.5%
GSM8K
Context:128K
Released:Mar 2025
License:Proprietary

Efficient reasoning model with o3-level capabilities at lower cost

55

Claude 3 Haiku

Anthropic

Overall Score
41.76
ARC-AGI 2: 0.0% | HLE (Text): 48.0%
0.0%
ARC-AGI 2
48.0%
HLE (Text)
75.2%
MMLU
75.0%
Code
38.9%
Math
88.9%
GSM8K
Parameters:Undisclosed
Context:200K
Released:Mar 2024
License:Proprietary

Fastest Claude 3 model, optimized for speed and efficiency

56

o1-preview

OpenAI

Overall Score
41.12
ARC-AGI 2: 10.0% | HLE: 8.1%
10.0%
ARC-AGI 2
8.1%
HLE
93.5%
MMLU
93.8%
Code
78.0%
Math
97.2%
GSM8K
Parameters:Undisclosed
Context:128K
Released:Sep 2024
License:Proprietary

Reasoning-focused model with chain-of-thought capabilities

57

Command R+

Cohere

Overall Score
40.49
ARC-AGI 2: 0.1% | HLE (Text): 52.0%
0.1%
ARC-AGI 2
52.0%
HLE (Text)
75.2%
MMLU
71.3%
Code
42.5%
Math
87.3%
GSM8K
Parameters:104B
Context:128K
Released:Apr 2024
License:Commercial

Enterprise model optimized for RAG and tool use

58

DeepSeek-R1

DeepSeek

Overall Score
40.31
ARC-AGI 2: 5.0% | HLE (Text): 8.5%
5.0%
ARC-AGI 2
8.5%
HLE (Text)
91.6%
MMLU
95.0%
Code
86.7%
Math
97.1%
GSM8K
Parameters:671B
Context:128K
Released:Jan 2025
License:Open Source

DeepSeek-R1 is an advanced reasoning model that rivals OpenAI o1 performance through reinforcement learning and chain-of-thought reasoning.

59

Zephyr 7B Beta

Hugging Face

Overall Score
40.20
No AGI benchmarks
61.4%
MMLU
29.5%
Code
13.1%
Math
52.2%
GSM8K
Context:8K
Released:Oct 2023
License:Open Source

Fine-tuned Mistral-7B with improved helpfulness and harmlessness

60

o1

OpenAI

Overall Score
39.87
ARC-AGI 2: 8.0% | HLE: 8.0%
8.0%
ARC-AGI 2
8.0%
HLE
92.3%
MMLU
92.5%
Code
74.3%
Math
96.5%
GSM8K
Context:200K
Released:Dec 2024
License:Proprietary

OpenAI o1 is a reasoning model that uses chain-of-thought processing to solve complex problems with unprecedented accuracy.

61

Qwen3-235B

Alibaba

Overall Score
38.86
ARC-AGI 2: 3.0% | HLE (Text): 11.8%
3.0%
ARC-AGI 2
11.8%
HLE (Text)
88.3%
MMLU
89.1%
Code
79.2%
Math
95.3%
GSM8K
Parameters:235B
Context:256K
Released:Apr 2025
License:Open Source

Next-generation Qwen model with significantly expanded scale and capabilities

62

Mixtral 8x22B (Historical)

Mistral AI

Overall Score
38.47
ARC-AGI 2: 2.4% | HLE (Text): 31.7%
2.4%
ARC-AGI 2
31.7%
HLE (Text)
75.2%
MMLU
68.9%
Code
52.4%
Math
83.7%
GSM8K
Parameters:141B
Context:33K
Released:Apr 2024
License:Open Source

Early release version of Mixtral 8x22B with mixture-of-experts architecture, showing the initial capabilities before optimization

63

Claude 4 Opus

Anthropic

Overall Score
38.47
ARC-AGI 2: 0.1% | HLE: 6.7%
0.1%
ARC-AGI 2
6.7%
HLE
93.8%
MMLU
94.2%
Code
85.7%
Math
97.8%
GSM8K
Context:500K
Released:May 2025
License:Proprietary

Most capable Claude model with enhanced reasoning and extended context

64

Claude 4 Sonnet

Anthropic

Overall Score
38.21
ARC-AGI 2: 2.0% | HLE: 5.5%
2.0%
ARC-AGI 2
5.5%
HLE
92.7%
MMLU
94.5%
Code
82.1%
Math
97.9%
GSM8K
Context:400K
Released:May 2025
License:Proprietary

Balanced Claude 4 model offering Opus-level capabilities with better efficiency

65

Grok-4

xAI

Overall Score
38.20
ARC-AGI 2: 1.0% | HLE (Text): 8.0%
1.0%
ARC-AGI 2
8.0%
HLE (Text)
91.5%
MMLU
92.0%
Code
82.7%
Math
96.8%
GSM8K
Parameters:Undisclosed
Context:200K
Released:Jan 2025
License:Proprietary

Next-generation model from xAI with enhanced reasoning and real-time capabilities

66

ERNIE X1

Baidu

Overall Score
38.17
ARC-AGI 2: 5.5% | HLE: 5.8%
5.5%
ARC-AGI 2
5.8%
HLE
90.2%
MMLU
88.5%
Code
78.3%
Math
94.8%
GSM8K
Parameters:Undisclosed
Context:33K
Released:Mar 2025
License:Proprietary

Deep-thinking reasoning model with multimodal capabilities, designed for complex problem solving

67

Grok 3

xAI

Overall Score
38.10
ARC-AGI 2: 5.8% | HLE (Text): 5.8%
5.8%
ARC-AGI 2
5.8%
HLE (Text)
89.5%
MMLU
86.8%
Code
76.5%
Math
95.2%
GSM8K
Parameters:Undisclosed
Context:128K
Released:Feb 2025
License:Proprietary

Advanced model with 10x more compute than Grok 2, featuring reflection capabilities

68

Claude 3.7 Sonnet

Anthropic

Overall Score
37.78
ARC-AGI 2: 1.0% | HLE: 8.0%
1.0%
ARC-AGI 2
8.0%
HLE
90.5%
MMLU
93.1%
Code
78.3%
Math
97.2%
GSM8K
Context:250K
Released:Feb 2025
License:Proprietary

Enhanced Sonnet model bridging the gap to Claude 4

69

Gemini 2.0 Pro

Google DeepMind

Overall Score
37.60
ARC-AGI 2: 2.0% | HLE: 7.1%
2.0%
ARC-AGI 2
7.1%
HLE
90.3%
MMLU
89.2%
Code
78.5%
Math
95.8%
GSM8K
Context:2.0M
Released:Feb 2025
License:Proprietary

Gemini 2.0 Pro is Google's most capable multimodal AI model, designed for the agentic era with advanced reasoning and tool use capabilities.

70

Gemini 2.5 Flash

Google DeepMind

Overall Score
37.29
ARC-AGI 2: 2.0% | HLE: 12.1%
2.0%
ARC-AGI 2
12.1%
HLE
80.5%
MMLU
88.2%
Code
75.3%
Math
93.7%
GSM8K
Context:1.0M
Released:Jan 2025
License:Proprietary

Lightning-fast multimodal model optimized for speed and efficiency

71

GLM-4 Plus

Z.ai

Overall Score
37.27
ARC-AGI 2: 4.2% | HLE (Text): 5.8%
4.2%
ARC-AGI 2
5.8%
HLE (Text)
88.8%
MMLU
87.2%
Code
72.5%
Math
95.2%
GSM8K
Parameters:Undisclosed (>130B)
Context:1.0M
Released:Aug 2024
License:Proprietary

Most powerful GLM-4-series model at its release, with a 1M token context window and enhanced capabilities

72

Qwen2.5-Max

Alibaba

Overall Score
37.10
ARC-AGI 2: 4.2% | HLE (Text): 5.2%
4.2%
ARC-AGI 2
5.2%
HLE (Text)
88.8%
MMLU
86.5%
Code
75.2%
Math
94.5%
GSM8K
Parameters:Large MoE
Context:262K
Released:Jan 2025
License:Proprietary

Large Mixture of Experts model trained on 20T+ tokens with extended context capabilities

73

o1-mini

OpenAI

Overall Score
36.93
ARC-AGI 2: 3.0% | HLE (Text): 10.3%
3.0%
ARC-AGI 2
10.3%
HLE (Text)
85.2%
MMLU
86.2%
Code
70.0%
Math
94.8%
GSM8K
Context:128K
Released:Sep 2024
License:Proprietary

o1-mini is a cost-efficient reasoning model optimized for STEM tasks, particularly mathematics and coding, with performance matching o1 on technical benchmarks.

74

A.X 3.0

SK Telecom

Overall Score
36.65
ARC-AGI 2: 4.5% | HLE (Text): 28.3%
4.5%
ARC-AGI 2
28.3%
HLE (Text)
74.8%
MMLU
61.3%
Code
54.7%
Math
72.1%
GSM8K
Parameters:30B
Context:8K
Released:Jun 2024
License:Proprietary

Third generation A.X model with significantly improved reasoning capabilities and expanded context window for enterprise applications

75

Phi-4

Microsoft Research

Overall Score
36.44
ARC-AGI 2: 3.8% | HLE (Text): 4.8%
3.8%
ARC-AGI 2
4.8%
HLE (Text)
84.8%
MMLU
82.6%
Code
80.4%
Math
95.5%
GSM8K
Parameters:14B
Context:16K
Released:Dec 2024
License:Open Source

14B parameter model specialized in math and complex reasoning

76

Mixtral 8x7B

Mistral AI

Overall Score
36.21
ARC-AGI 2: 0.0% | HLE (Text): 46.0%
0.0%
ARC-AGI 2
46.0%
HLE (Text)
70.6%
MMLU
40.2%
Code
28.4%
Math
74.4%
GSM8K
Parameters:46.7B
Context:33K
Released:Dec 2023
License:Open Source

Pioneering open-source mixture-of-experts model

77

Phi-3 Medium

Microsoft Research

Overall Score
36.07
ARC-AGI 2: 0.1% | HLE (Text): 38.0%
0.1%
ARC-AGI 2
38.0%
HLE (Text)
78.2%
MMLU
62.4%
Code
46.7%
Math
91.0%
GSM8K
Parameters:14B
Context:128K
Released:May 2024
License:Open Source

Small language model with outsized performance

78

ERNIE 4.5

Baidu

Overall Score
35.93
ARC-AGI 2: 3.2% | HLE: 4.5%
3.2%
ARC-AGI 2
4.5%
HLE
88.5%
MMLU
85.2%
Code
72.8%
Math
92.5%
GSM8K
Parameters:0.3B-424B (MoE)
Context:16K
Released:Mar 2025
License:Proprietary

Native multimodal model with Mixture of Experts architecture, supporting text, image, audio, and video

79

Grok-2

xAI

Overall Score
35.93
ARC-AGI 2: 0.1% | HLE (Text): 6.6%
0.1%
ARC-AGI 2
6.6%
HLE (Text)
88.0%
MMLU
88.4%
Code
76.5%
Math
94.7%
GSM8K
Parameters:Undisclosed
Context:100K
Released:Aug 2024
License:Proprietary

Advanced model with real-time information access

80

Llama 3.2 11B Vision

Meta AI

Overall Score
35.65
ARC-AGI 2: 0.1% | HLE (Text): 42.0%
0.1%
ARC-AGI 2
42.0%
HLE (Text)
73.0%
MMLU
58.4%
Code
42.1%
Math
84.2%
GSM8K
Parameters:11B
Context:128K
Released:Sep 2024
License:Open Source

Lightweight multimodal model for on-device deployment

81

Claude 3.5 Sonnet

Anthropic

Overall Score
35.32
ARC-AGI 2: 0.1% | HLE: 4.1%
0.1%
ARC-AGI 2
4.1%
HLE
88.7%
MMLU
92.0%
Code
71.1%
Math
96.4%
GSM8K
Parameters:Undisclosed
Context:200K
Released:Jun 2024
License:Proprietary

Most capable Claude model at its release, with superior coding and reasoning

82

GLM-4 Long

Z.ai

Overall Score
35.27
ARC-AGI 2: 3.2% | HLE (Text): 4.8%
3.2%
ARC-AGI 2
4.8%
HLE (Text)
86.2%
MMLU
82.8%
Code
66.5%
Math
93.2%
GSM8K
Parameters:130B
Context:2.0M
Released:Sep 2024
License:Proprietary

Specialized model for ultra-long context processing with 2M token window

83

Falcon 40B

Technology Innovation Institute

Overall Score
35.04
No AGI benchmarks
56.7%
MMLU
19.6%
Code
17.8%
Math
35.4%
GSM8K
Context:8K
Released:May 2023
License:Open Source

40B parameter model optimized for performance and efficiency

84

GPT-4o

OpenAI

Overall Score
35.03
ARC-AGI 2: 0.5% | HLE: 2.7%
0.5%
ARC-AGI 2
2.7%
HLE
88.7%
MMLU
90.2%
Code
76.6%
Math
95.8%
GSM8K
Parameters:Undisclosed
Context:128K
Released:May 2024
License:Proprietary

OpenAI's flagship omni-model with native multimodal capabilities

85

Kimi K1

Moonshot AI

Overall Score
34.66
ARC-AGI 2: 3.4% | HLE (Text): 26.8%
3.4%
ARC-AGI 2
26.8%
HLE (Text)
71.6%
MMLU
58.9%
Code
49.3%
Math
67.8%
GSM8K
Parameters:65B
Context:33K
Released:Mar 2024
License:Proprietary

First generation Kimi model with exceptional long context capabilities and strong Chinese-English bilingual performance

86

Command A

Cohere

Overall Score
34.65
ARC-AGI 2: 2.5% | HLE (Text): 4.2%
2.5%
ARC-AGI 2
4.2%
HLE (Text)
85.2%
MMLU
82.5%
Code
68.5%
Math
91.2%
GSM8K
Parameters:Undisclosed
Context:256K
Released:Jan 2025
License:Proprietary

Latest Command series model with enhanced capabilities and extended context window

87

Grok 2 mini

xAI

Overall Score
34.47
ARC-AGI 2: 2.2% | HLE (Text): 4.2%
2.2%
ARC-AGI 2
4.2%
HLE (Text)
85.5%
MMLU
80.4%
Code
68.2%
Math
92.5%
GSM8K
Parameters:Undisclosed
Context:128K
Released:Aug 2024
License:Proprietary

Smaller version of Grok 2 with efficient performance

88

GLM-4

Z.ai

Overall Score
34.44
ARC-AGI 2: 2.5% | HLE (Text): 4.2%
2.5%
ARC-AGI 2
4.2%
HLE (Text)
85.4%
MMLU
81.5%
Code
64.7%
Math
92.5%
GSM8K
Parameters:130B
Context:128K
Released:Jan 2024
License:Proprietary

Advanced bilingual model with significant improvements over GLM-3, competing with GPT-4 on Chinese tasks

89

GLM-4V

Z.ai

Overall Score
34.32
ARC-AGI 2: 3.5% | HLE: 5.2%
3.5%
ARC-AGI 2
5.2%
HLE
84.8%
MMLU
79.5%
Code
62.8%
Math
91.5%
GSM8K
Parameters:130B
Context:128K
Released:Jun 2024
License:Proprietary

Multimodal version of GLM-4 with strong vision-language understanding capabilities

90

GPT-4 Turbo

OpenAI

Overall Score
34.27
ARC-AGI 2: 0.1% | HLE (Text): 3.5%
0.1%
ARC-AGI 2
3.5%
HLE (Text)
86.4%
MMLU
87.8%
Code
72.6%
Math
94.2%
GSM8K
Parameters:Undisclosed
Context:128K
Released:Nov 2023
License:Proprietary

Enhanced GPT-4 with improved performance and extended context

91

Qwen2 72B

Alibaba

Overall Score
33.87
ARC-AGI 2: 1.8% | HLE (Text): 3.8%
1.8%
ARC-AGI 2
3.8%
HLE (Text)
84.2%
MMLU
79.9%
Code
68.7%
Math
91.1%
GSM8K
Parameters:72B
Context:131K
Released:Jun 2024
License:Open Source

Large-scale open source model from Alibaba with strong multilingual capabilities

92

Claude 3 Opus

Anthropic

Overall Score
33.59
ARC-AGI 2: 0.1% | HLE: 4.2%
0.1%
ARC-AGI 2
4.2%
HLE
86.8%
MMLU
84.9%
Code
60.1%
Math
95.0%
GSM8K
Parameters:Undisclosed
Context:200K
Released:Mar 2024
License:Proprietary

Powerful model for complex tasks requiring deep analysis

93

EXAONE 4.0.1 32B

LG AI

Overall Score
32.86
ARC-AGI 2: 8.5% | HLE (Text): 28.3%
8.5%
ARC-AGI 2
28.3%
HLE (Text)
75.3%
MMLU
48.9%
Code
42.1%
Math
71.8%
GSM8K
Context:33K
Released:Jan 2025
License:Open Source

LG AI's flagship 32B parameter model with multilingual capabilities

94

Grok 1.5

xAI

Overall Score
32.32
ARC-AGI 2: 1.5% | HLE (Text): 3.5%
1.5%
ARC-AGI 2
3.5%
HLE (Text)
81.3%
MMLU
74.1%
Code
62.5%
Math
90.8%
GSM8K
Parameters:Undisclosed
Context:128K
Released:Mar 2024
License:Proprietary

Improved Grok with extended 128K context window and better reasoning

95

Qwen2.5 32B

Alibaba

Overall Score
32.10
ARC-AGI 2: 1.2% | HLE (Text): 3.2%
1.2%
ARC-AGI 2
3.2%
HLE (Text)
81.8%
MMLU
75.2%
Code
62.5%
Math
88.2%
GSM8K
Parameters:32B
Context:131K
Released:Sep 2024
License:Open Source

Mid-size model from Qwen2.5 series trained on 18T tokens with improved efficiency

96

LLaMA 3 70B

Meta AI

Overall Score
32.07
ARC-AGI 2: 1.2% | HLE (Text): 3.2%
1.2%
ARC-AGI 2
3.2%
HLE (Text)
79.5%
MMLU
81.7%
Code
57.8%
Math
88.9%
GSM8K
Parameters:70B
Context:8K
Released:Apr 2024
License:Open Source

Large-scale 70B parameter model from Meta's LLaMA 3 family with enhanced capabilities

97

Kanana 1.2

Kakao

Overall Score
31.83
ARC-AGI 2: 2.8% | HLE (Text): 22.1%
2.8%
ARC-AGI 2
22.1%
HLE (Text)
69.7%
MMLU
56.2%
Code
45.8%
Math
63.4%
GSM8K
Parameters:14B
Context:4K
Released:Apr 2024
License:Research Only

Enhanced version of Kanana with improved reasoning capabilities and expanded multimodal understanding

98

Gemini 2.0 Flash

Google DeepMind

Overall Score
31.70
ARC-AGI 2: 1.0% | HLE: 6.6%
1.0%
ARC-AGI 2
6.6%
HLE
77.4%
MMLU
85.7%
Code
68.2%
Math
91.3%
GSM8K
Parameters:Undisclosed
Context:1.0M
Released:Dec 2024
License:Proprietary

Latest Gemini model built for the agentic era

99

Gemini 1.5 Pro

Google DeepMind

Overall Score
31.67
ARC-AGI 2: 0.0% | HLE: 4.6%
0.0%
ARC-AGI 2
4.6%
HLE
81.9%
MMLU
71.9%
Code
59.4%
Math
91.7%
GSM8K
Parameters:Undisclosed
Context:1.0M
Released:Feb 2024
License:Proprietary

Multimodal model with breakthrough 1M token context window

100

Gemini 1.5 Flash

Google DeepMind

Overall Score
31.02
ARC-AGI 2: 0.0% | HLE: 4.2%
0.0%
ARC-AGI 2
4.2%
HLE
78.9%
MMLU
74.3%
Code
54.9%
Math
86.2%
GSM8K
Parameters:Undisclosed
Context:1.0M
Released:May 2024
License:Proprietary

Lightweight multimodal model optimized for speed

101

ERNIE 4.0

Baidu

Overall Score
30.99
ARC-AGI 2: 0.8% | HLE: 2.5%
0.8%
ARC-AGI 2
2.5%
HLE
82.4%
MMLU
72.5%
Code
58.6%
Math
85.2%
GSM8K
Parameters:Undisclosed
Context:8K
Released:Oct 2023
License:Proprietary

Advanced multimodal model from Baidu with enhanced reasoning and understanding capabilities

102

BLOOM 176B

BigScience

Overall Score
30.65
No AGI benchmarks
55.5%
MMLU
15.7%
Code
13.2%
Math
20.9%
GSM8K
Context:8K
Released:Jul 2022
License:Open Source

Multilingual 176B parameter model supporting 59 languages

103

Jamba 1.5 Large

AI21 Labs

Overall Score
30.51
ARC-AGI 2: 1.5% | HLE (Text): 3.5%
1.5%
ARC-AGI 2
3.5%
HLE (Text)
73.2%
MMLU
74.5%
Code
54.8%
Math
87.5%
GSM8K
Parameters:94B (16B active)
Context:256K
Released:Aug 2024
License:Open Source

Enhanced large version of Jamba with improved performance across all benchmarks

104

A.X 2.0

SK Telecom

Overall Score
29.23
ARC-AGI 2: 2.1% | HLE (Text): 18.6%
2.1%
ARC-AGI 2
18.6%
HLE (Text)
67.3%
MMLU
52.4%
Code
41.8%
Math
58.7%
GSM8K
Parameters:13B
Context:4K
Released:Aug 2023
License:Proprietary

Second generation of SK Telecom's A.X series, focused on Korean language understanding and telecommunications applications

105

AFM 3.0

Arcee AI

Overall Score
29.21
ARC-AGI 2: 2.5% | HLE (Text): 19.7%
2.5%
ARC-AGI 2
19.7%
HLE (Text)
68.9%
MMLU
45.3%
Code
38.7%
Math
54.2%
GSM8K
Parameters:14B
Context:16K
Released:Oct 2024
License:Commercial

Advanced Adaptive Foundation Model with enhanced reasoning capabilities and improved enterprise features

106

Grok 1

xAI

Overall Score
28.71
ARC-AGI 2: 0.8% | HLE (Text): 2.8%
0.8%
ARC-AGI 2
2.8%
HLE (Text)
73.0%
MMLU
63.2%
Code
52.8%
Math
86.5%
GSM8K
Parameters:Undisclosed
Context:8K
Released:Nov 2023
License:Proprietary

First model from xAI with real-time knowledge access through X platform

107

Stable LM 2 1.6B

Stability AI

Overall Score
28.34
No AGI benchmarks
45.3%
MMLU
18.7%
Code
12.4%
Math
28.6%
GSM8K
Context:8K
Released:Jan 2024
License:Open Source

Compact 1.6B parameter model for efficient deployment

108

Phi-3.5

Microsoft Research

Overall Score
27.89
ARC-AGI 2: 1.2% | HLE (Text): 2.8%
1.2%
ARC-AGI 2
2.8%
HLE (Text)
71.5%
MMLU
62.8%
Code
45.8%
Math
85.2%
GSM8K
Parameters:3.8B
Context:128K
Released:Aug 2024
License:Open Source

Enhanced version of Phi-3 with improved performance across benchmarks

109

Jamba

AI21 Labs

Overall Score
27.41
ARC-AGI 2: 0.8% | HLE (Text): 2.8%
0.8%
ARC-AGI 2
2.8%
HLE (Text)
67.4%
MMLU
65.4%
Code
46.7%
Math
81.2%
GSM8K
Parameters:52B (12B active)
Context:256K
Released:Mar 2024
License:Open Source

Hybrid Mamba-Transformer architecture with 256K context window, combining efficiency and performance

110

LLaMA 3.1 8B

Meta AI

Overall Score
27.32
ARC-AGI 2: 0.6% | HLE (Text): 2.3%
0.6%
ARC-AGI 2
2.3%
HLE (Text)
69.4%
MMLU
64.2%
Code
47.2%
Math
80.6%
GSM8K
Parameters:8B
Context:128K
Released:Jul 2024
License:Open Source

Updated 8B model with 128K context window and improved multilingual support

111

CodeGeeX4

Z.ai

Overall Score
27.16
ARC-AGI 2: 0.8% | HLE (Text): 2.5%
0.8%
ARC-AGI 2
2.5%
HLE (Text)
58.5%
MMLU
85.2%
Code
52.8%
Math
68.5%
GSM8K
Parameters:9B
Context:128K
Released:Aug 2024
License:Open Source

Advanced code generation model competing with GitHub Copilot and supporting 100+ programming languages

112

Jurassic-2

AI21 Labs

Overall Score
27.14
ARC-AGI 2: 0.4% | HLE (Text): 2.5%
0.4%
ARC-AGI 2
2.5%
HLE (Text)
72.8%
MMLU
58.2%
Code
48.5%
Math
75.3%
GSM8K
Parameters:Undisclosed
Context:8K
Released:Mar 2023
License:Proprietary

Improved model with faster response times, better instruction following, and multilingual support

113

Kanana 1.0

Kakao

Overall Score
26.80
ARC-AGI 2: 1.9% | HLE (Text): 16.4%
1.9%
ARC-AGI 2
16.4%
HLE (Text)
62.4%
MMLU
48.7%
Code
37.2%
Math
54.1%
GSM8K
Parameters:8B
Context:2K
Released:Sep 2023
License:Research Only

Kakao's first large language model focused on Korean language understanding and cultural context

114

LLaMA 3 8B

Meta AI

Overall Score
26.71
ARC-AGI 2: 0.5% | HLE (Text): 2.1%
0.5%
ARC-AGI 2
2.1%
HLE (Text)
68.4%
MMLU
62.2%
Code
45.6%
Math
79.6%
GSM8K
Parameters:8B
Context:8K
Released:Apr 2024
License:Open Source

Efficient 8B parameter model from Meta's LLaMA 3 family, trained on 15T tokens

115

Command R

Cohere

Overall Score
26.65
ARC-AGI 2: 0.5% | HLE (Text): 2.2%
0.5%
ARC-AGI 2
2.2%
HLE (Text)
68.2%
MMLU
61.8%
Code
45.8%
Math
78.2%
GSM8K
Parameters:35B
Context:128K
Released:Mar 2024
License:Proprietary

Retrieval-augmented generation model optimized for enterprise use with multilingual support

116

Phi-3 mini

Microsoft Research

Overall Score
26.47
ARC-AGI 2: 0.8% | HLE (Text): 2.5%
0.8%
ARC-AGI 2
2.5%
HLE (Text)
69.0%
MMLU
58.5%
Code
42.3%
Math
82.5%
GSM8K
Parameters:3.8B
Context:128K
Released:Apr 2024
License:Open Source

3.8B parameter model with 4K and 128K context variants

117

Jamba 1.5 Mini

AI21 Labs

Overall Score
26.26
ARC-AGI 2: 0.6% | HLE (Text): 2.5%
0.6%
ARC-AGI 2
2.5%
HLE (Text)
65.4%
MMLU
62.8%
Code
43.2%
Math
78.5%
GSM8K
Parameters:12B
Context:256K
Released:Aug 2024
License:Open Source

Efficient version of Jamba with reduced parameters while maintaining 256K context

118

Mistral 7B v0.3

Mistral AI

Overall Score
26.10
ARC-AGI 2: 1.9% | HLE (Text): 17.3%
1.9%
ARC-AGI 2
17.3%
HLE (Text)
62.5%
MMLU
40.2%
Code
32.6%
Math
52.2%
GSM8K
Parameters:7B
Context:33K
Released:May 2024
License:Open Source

Enhanced version of Mistral 7B with improved instruction following and extended context capabilities

119

ERNIE 3.0 Titan

Baidu

Overall Score
25.29
ARC-AGI 2: 0.2% | HLE (Text): 1.8%
0.2%
ARC-AGI 2
1.8%
HLE (Text)
75.2%
MMLU
45.8%
Code
42.3%
Math
68.5%
GSM8K
Parameters:260B
Context:4K
Released:Dec 2021
License:Proprietary

Baidu's first 100B+ parameter model, combining knowledge graphs with transformer architecture

120

GLM-3

Z.ai

Overall Score
24.97
ARC-AGI 2: 0.3% | HLE (Text): 1.8%
0.3%
ARC-AGI 2
1.8%
HLE (Text)
66.4%
MMLU
57.4%
Code
38.5%
Math
72.3%
GSM8K
Parameters:6B
Context:33K
Released:Oct 2023
License:Open Source

Open bilingual (Chinese-English) language model with strong performance on both languages

121

OLMo 1.7 7B

Allen Institute for AI

Overall Score
24.46
ARC-AGI 2: 1.8% | HLE (Text): 15.9%
1.8%
ARC-AGI 2
15.9%
HLE (Text)
61.3%
MMLU
34.8%
Code
28.9%
Math
42.7%
GSM8K
Parameters:7B
Context:4K
Released:Sep 2024
License:Open Source

Improved version of OLMo with enhanced training methodology and better performance across multiple benchmarks

122

AFM 1.0

Arcee AI

Overall Score
22.56
ARC-AGI 2: 1.4% | HLE (Text): 13.8%
1.4%
ARC-AGI 2
13.8%
HLE (Text)
58.4%
MMLU
32.7%
Code
26.3%
Math
38.9%
GSM8K
Parameters:7B
Context:8K
Released:Mar 2024
License:Commercial

Arcee AI's first Adaptive Foundation Model, designed for efficient fine-tuning and domain adaptation

123

Jurassic-1

AI21 Labs

Overall Score
21.52
ARC-AGI 2: 0.1% | HLE (Text): 1.2%
0.1%
ARC-AGI 2
1.2%
HLE (Text)
62.3%
MMLU
38.5%
Code
35.2%
Math
58.6%
GSM8K
Parameters:178B
Context:2K
Released:Aug 2021
License:Proprietary

AI21 Labs' first large language model with 178B parameters and large vocabulary

124

Phi-2

Microsoft Research

Overall Score
21.34
ARC-AGI 2: 0.3% | HLE (Text): 1.5%
0.3%
ARC-AGI 2
1.5%
HLE (Text)
57.9%
MMLU
47.8%
Code
31.2%
Math
61.1%
GSM8K
Parameters:2.7B
Context:2K
Released:Dec 2023
License:Open Source

2.7B parameter model with outstanding reasoning abilities for its size

125

OLMo 1.0 7B

Allen Institute for AI

Overall Score
20.38
ARC-AGI 2: 1.2% | HLE (Text): 12.3%
1.2%
ARC-AGI 2
12.3%
HLE (Text)
54.8%
MMLU
27.2%
Code
22.6%
Math
31.4%
GSM8K
Parameters:7B
Context:2K
Released:Feb 2024
License:Open Source

Open Language Model from Allen AI, designed for transparency and research with fully open training data and process

126

GPT-NeoX 20B

EleutherAI

Overall Score
18.39
No AGI benchmarks
33.6%
MMLU
11.4%
Code
8.2%
Math
7.1%
GSM8K
Context:8K
Released:Feb 2022
License:Open Source

Open-source 20B parameter autoregressive language model

127

Phi-1.5

Microsoft Research

Overall Score
15.20
ARC-AGI 2: 0.1% | HLE (Text): 0.8%
0.1%
ARC-AGI 2
0.8%
HLE (Text)
42.5%
MMLU
41.4%
Code
18.5%
Math
15.2%
GSM8K
Parameters:1.3B
Context:2K
Released:Sep 2023
License:Open Source

Improved 1.3B model with better reasoning capabilities

128

Phi-1

Microsoft Research

Overall Score
12.37
ARC-AGI 2: 0.0% | HLE (Text): 0.3%
0.0%
ARC-AGI 2
0.3%
HLE (Text)
30.2%
MMLU
50.6%
Code
12.3%
Math
8.7%
GSM8K
Parameters:1.3B
Context:2K
Released:Jun 2023
License:Open Source

Specialized 1.3B parameter model for Python code generation

129

BioGPT-Large

Microsoft Research

Overall Score
12.13
ARC-AGI 2: 0.0% | HLE (Text): 0.8%
0.0%
ARC-AGI 2
0.8%
HLE (Text)
48.5%
MMLU
8.2%
Code
10.5%
Math
15.8%
GSM8K
Parameters:1.5B
Context:1K
Released:Jan 2023
License:Research Only

Larger biomedical model with improved accuracy on medical tasks

130

BioGPT

Microsoft Research

Overall Score
10.84
ARC-AGI 2: 0.0% | HLE (Text): 0.5%
0.0%
ARC-AGI 2
0.5%
HLE (Text)
45.2%
MMLU
5.8%
Code
8.2%
Math
12.5%
GSM8K
Parameters:347M
Context:1K
Released:Jan 2023
License:Research Only

Specialized language model trained on biomedical literature

131

Midjourney V6

Midjourney

Overall Score
0.00
No AGI benchmarks
Context:8K
Released:Dec 2023
License:Proprietary

Most photorealistic model with significantly improved text rendering

132

DALL-E 3

OpenAI

Overall Score
0.00
ARC-AGI 2: 0.0% | HLE: 3.5%
0.0%
ARC-AGI 2
3.5%
HLE
0.0%
MMLU
0.0%
Code
0.0%
Math
0.0%
GSM8K
Parameters:Undisclosed
Context:4K
Released:Oct 2023
License:Proprietary

Advanced text-to-image generation model with improved prompt adherence and safety

133

Whisper Large v3

OpenAI

Overall Score
0.00
ARC-AGI 2: 0.0% | HLE (Text): 0.0%
0.0%
ARC-AGI 2
0.0%
HLE (Text)
0.0%
MMLU
0.0%
Code
0.0%
Math
0.0%
GSM8K
Parameters:1.55B
Context:30K
Released:Nov 2023
License:Open Source

State-of-the-art speech recognition model supporting 99 languages with improved accuracy

134

Gen-3 Alpha

Runway ML

Overall Score
0.00
No AGI benchmarks
Context:8K
Released:Jun 2024
License:Open Source

State-of-the-art video generation model with advanced motion and physics