Top 3 Models by Overall Score

#1: Claude Opus 4 (Overall 86) · SWE-bench 84 · LiveCodeBench 88 · HumanEval 96
#2: o3 (Overall 85) · SWE-bench 82 · LiveCodeBench 88 · HumanEval 96
#3: o1 (Overall 83) · SWE-bench 80 · LiveCodeBench 84 · HumanEval 95

All Models — Ranked by Overall Score

| Rank | Model | Provider | Overall | SWE-bench | LiveCodeBench | HumanEval | BigCodeBench | Input $/M | Price/Point |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4 | Anthropic | 86 | 84 | 88 | 96 | 76 | $15.00 | 0.174 |
| 2 | o3 | OpenAI | 85 | 82 | 88 | 96 | 74 | $10.00 | 0.118 |
| 3 | o1 | OpenAI | 83 | 80 | 84 | 95 | 73 | $15.00 | 0.181 |
| 4 | GPT-4.1 | OpenAI | 80 | 76 | 82 | 94 | 68 | $2.00 | 0.025 |
| 5 | o3-mini | OpenAI | 80 | 76 | 85 | 94 | 65 | $1.10 | 0.014 |
| 6 | Claude Sonnet 4 | Anthropic | 78 | 74 | 82 | 92 | 64 | $3.00 | 0.038 |
| 7 | Claude 3 Opus | Anthropic | 78 | 74 | 80 | 94 | 64 | $15.00 | 0.192 |
| 8 | Gemini 2.5 Pro | Google | 76 | 72 | 79 | 89 | 64 | n/a | 0.016 |
| 9 | GPT-4o | OpenAI | 75 | 70 | 78 | 90 | 62 | $2.50 | 0.033 |
| 10 | Claude 3.5 Sonnet | Anthropic | 72 | 68 | 75 | 90 | 58 | $3.00 | 0.042 |
| 11 | o4-mini | OpenAI | 72 | 66 | 74 | 92 | 56 | $1.10 | 0.015 |
| 12 | Qwen3 Plus | Qwen | 72 | 66 | 74 | 88 | 58 | $3.00 | 0.042 |
| 13 | DeepSeek Reasoner (R1) | DeepSeek | 72 | 68 | 76 | 90 | 56 | $0.55 | 0.008 |
| 14 | Claude Sonnet 4 Lite | Anthropic | 70 | 64 | 74 | 88 | 56 | $1.00 | 0.014 |
| 15 | GPT-4 Turbo | OpenAI | 70 | 64 | 72 | 88 | 56 | $10.00 | 0.143 |
| 16 | o1-mini | OpenAI | 70 | 64 | 72 | 90 | 54 | $1.10 | 0.016 |
| 17 | Grok 3 | xAI | 70 | 64 | 72 | 88 | 56 | $3.00 | 0.043 |
| 18 | GPT-4.1 Mini | OpenAI | 68 | 62 | 70 | 86 | 54 | $0.40 | 0.006 |
| 19 | GPT-4 | OpenAI | 68 | 60 | 70 | 86 | 54 | $30.00 | 0.441 |
| 20 | Gemini 2.0 Pro | Google | 68 | 62 | 70 | 86 | 54 | n/a | 0.018 |
| 21 | Qwen Max | Qwen | 68 | 62 | 70 | 86 | 54 | $1.60 | 0.024 |
| 22 | Claude 3 Sonnet | Anthropic | 65 | 58 | 68 | 85 | 50 | $3.00 | 0.046 |
| 23 | Gemini 2.5 Flash | Google | 65 | 58 | 68 | 85 | 50 | n/a | 0.002 |
| 24 | Mistral Large 2 | Mistral | 65 | 58 | 66 | 84 | 52 | $2.00 | 0.031 |
| 25 | Grok Code Fast 1 | xAI | 65 | 58 | 68 | 84 | 50 | n/a | 0.077 |
| 26 | Gemini 1.5 Pro | Google | 62 | 56 | 64 | 82 | 46 | n/a | 0.020 |
| 27 | DeepSeek Chat (V3) | DeepSeek | 62 | 56 | 64 | 84 | 46 | $0.27 | 0.004 |
| 28 | Qwen Coder | Qwen | 60 | 54 | 62 | 82 | 45 | n/a | 0.007 |
| 29 | Mistral Codestral | Mistral | 60 | 54 | 64 | 82 | 44 | $0.30 | 0.005 |
| 30 | GPT-4o Mini | OpenAI | 58 | 50 | 60 | 78 | 44 | $0.15 | 0.003 |
| 31 | DeepSeek Coder V2 | DeepSeek | 58 | 50 | 60 | 82 | 42 | $0.27 | 0.005 |
| 32 | Claude 4 Haiku | Anthropic | 55 | 48 | 58 | 78 | 40 | $0.80 | 0.015 |
| 33 | Gemini 2.0 Flash | Google | 55 | 48 | 56 | 78 | 40 | n/a | 0.002 |
| 34 | Qwen Plus | Qwen | 55 | 48 | 58 | 78 | 40 | $0.40 | 0.007 |
| 35 | Claude 3.5 Haiku | Anthropic | 52 | 45 | 55 | 75 | 38 | $0.80 | 0.015 |
| 36 | Llama 3.3 70B | Meta | 52 | 44 | 54 | 76 | 38 | n/a | 0.004 |
| 37 | Gemini 1.5 Flash | Google | 50 | 42 | 52 | 72 | 36 | n/a | 0.001 |
| 38 | Grok 3 Mini | xAI | 50 | 42 | 52 | 72 | 36 | $0.30 | 0.006 |
| 39 | Mistral Nemo | Mistral | 48 | 40 | 50 | 70 | 32 | $0.15 | 0.003 |
| 40 | Claude 3 Haiku | Anthropic | 45 | 38 | 46 | 68 | 30 | $0.25 | 0.006 |
| 41 | Microsoft Phi-4 | Microsoft | 45 | 38 | 46 | 68 | 30 | $0.10 | 0.002 |
| 42 | Qwen Turbo | Qwen | 42 | 35 | 44 | 65 | 28 | $0.08 | 0.002 |
| 43 | Mistral Small | Mistral | 42 | 35 | 44 | 65 | 28 | $0.10 | 0.002 |
| 44 | GPT-3.5 Turbo | OpenAI | 40 | 32 | 42 | 62 | 26 | $0.50 | 0.013 |
| 45 | Reka Flash | Reka | 40 | 32 | 42 | 62 | 26 | $0.20 | 0.005 |

Best Value — Lowest Price per Score Point

Models ranked by input cost (per million tokens) divided by overall score. The lower the price per point, the more coding capability you get per dollar; the sketch after the table shows the calculation. Models without a published input price keep their listed price per point but cannot be independently verified.

| Rank | Model | Provider | Overall Score | Input $/M | Price per Point |
|---|---|---|---|---|---|
| 1 | Gemini 1.5 Flash | Google | 50 | n/a | $0.001 |
| 2 | Gemini 2.5 Flash | Google | 65 | n/a | $0.002 |
| 3 | Gemini 2.0 Flash | Google | 55 | n/a | $0.002 |
| 4 | Qwen Turbo | Qwen | 42 | $0.08 | $0.002 |
| 5 | Mistral Small | Mistral | 42 | $0.10 | $0.002 |
| 6 | Microsoft Phi-4 | Microsoft | 45 | $0.10 | $0.002 |
| 7 | GPT-4o Mini | OpenAI | 58 | $0.15 | $0.003 |
| 8 | Mistral Nemo | Mistral | 48 | $0.15 | $0.003 |
| 9 | DeepSeek Chat (V3) | DeepSeek | 62 | $0.27 | $0.004 |
| 10 | Llama 3.3 70B | Meta | 52 | n/a | $0.004 |
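
The value ranking is mechanical enough to reproduce. A minimal Python sketch, using a handful of rows from the leaderboard above (models without a published input price are left out, since their price per point cannot be recomputed):

```python
# Price per point = input price ($/M tokens) / overall score.
# Sketch using a few rows from the table above.
models = [
    ("Claude Opus 4", 86, 15.00),
    ("o3", 85, 10.00),
    ("GPT-4o Mini", 58, 0.15),
    ("DeepSeek Chat (V3)", 62, 0.27),
    ("Mistral Nemo", 48, 0.15),
]

ranked = sorted(
    ((name, score, price, price / score) for name, score, price in models),
    key=lambda row: row[3],  # ascending: cheaper per point is better
)
for rank, (name, score, price, ppp) in enumerate(ranked, start=1):
    print(f"{rank}. {name}: ${ppp:.3f}/point (overall {score}, ${price:.2f}/M input)")
```

Note that the metric uses input price only, so models that emit long outputs (reasoning models in particular) may look cheaper here than they are in practice.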

Benchmark Sources

- SWE-bench Verified: resolving real GitHub issues in production codebases
- LiveCodeBench: competitive programming problems
- HumanEval: function-level code generation
- BigCodeBench: practical, multi-step coding tasks
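
For context on how a column like HumanEval is scored: results are conventionally reported as pass@k, the probability that at least one of k sampled completions passes the unit tests. A short Python sketch of the standard unbiased estimator from the HumanEval paper (Chen et al., 2021):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: completions sampled per problem
    c: completions that passed the unit tests
    k: evaluation budget
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    # 1 - C(n-c, k) / C(n, k), computed as a stable running product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 40 passing.
print(round(pass_at_k(200, 40, 1), 3))   # 0.2 -- pass@1 is the raw pass rate
print(round(pass_at_k(200, 40, 10), 3))  # ~0.89
```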

Methodology

Scores are aggregated from published third-party benchmarks: SWE-bench measures real GitHub issue resolution, LiveCodeBench measures competitive programming ability, HumanEval measures function-level code generation, and BigCodeBench measures practical, multi-step coding tasks. All scores are normalized to a 0-100 scale.

Data compiled on April 18, 2026. Where multiple sources report a benchmark, the highest verified score is used.
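
One aggregation detail worth making explicit: the overall column is consistent with an unweighted mean of the four benchmark columns, rounded to the nearest integer. This is an inference from the data above, not a stated formula. A minimal sketch:

```python
# Overall score as the unweighted mean of the four normalized benchmarks,
# rounded to the nearest integer -- an assumption that matches the table
# above (e.g. Claude Opus 4: (84 + 88 + 96 + 76) / 4 = 86).
def overall(swe: int, livecode: int, humaneval: int, bigcode: int) -> int:
    return round((swe + livecode + humaneval + bigcode) / 4)

assert overall(84, 88, 96, 76) == 86  # Claude Opus 4
assert overall(82, 88, 96, 74) == 85  # o3
assert overall(32, 42, 62, 26) == 40  # Reka Flash
```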