AI Coding Model Benchmarks 2026
Third-party benchmark scores for 45 AI coding models across four standard tests: SWE-bench (real GitHub issue resolution), LiveCodeBench (competitive programming), HumanEval (function-level code generation), and BigCodeBench (practical, multi-step coding tasks). All scores are normalized to a 0-100 scale; see Methodology at the end of the page.
Top 3 Models by Overall Score
1. Claude Opus 4 (Anthropic): 86
2. o3 (OpenAI): 85
3. o1 (OpenAI): 83
All Models — Ranked by Overall Score
| Rank | Model | Provider | Overall | SWE-bench | LiveCodeBench | HumanEval | BigCodeBench | Input $/M | Price/Point |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4 | Anthropic | 86 | 84 | 88 | 96 | 76 | $15.00 | 0.174 |
| 2 | o3 | OpenAI | 85 | 82 | 88 | 96 | 74 | $10.00 | 0.167 |
| 3 | o1 | OpenAI | 83 | 80 | 84 | 95 | 73 | $15.00 | 0.181 |
| 4 | GPT-4.1 | OpenAI | 80 | 76 | 82 | 94 | 68 | $2.00 | 0.063 |
| 5 | o3-mini | OpenAI | 80 | 76 | 85 | 94 | 65 | $1.10 | 0.014 |
| 6 | Claude Sonnet 4 | Anthropic | 78 | 74 | 82 | 92 | 64 | $3.00 | 0.038 |
| 7 | Claude 3 Opus | Anthropic | 78 | 74 | 80 | 94 | 64 | $15.00 | 0.192 |
| 8 | Gemini 2.5 Pro | Google | 76 | 72 | 79 | 89 | 64 | — | 0.016 |
| 9 | GPT-4o | OpenAI | 75 | 70 | 78 | 90 | 62 | $2.50 | 0.033 |
| 10 | Claude 3.5 Sonnet | Anthropic | 72 | 68 | 75 | 90 | 58 | $3.00 | 0.042 |
| 11 | o4-mini | OpenAI | 72 | 66 | 74 | 92 | 56 | $1.10 | 0.015 |
| 12 | Qwen3 6 Plus | Qwen | 72 | 66 | 74 | 88 | 58 | $3.00 | 0.011 |
| 13 | DeepSeek Reasoner (R1) | DeepSeek | 72 | 68 | 76 | 90 | 56 | $0.550 | 0.008 |
| 14 | Claude Sonnet 4 Lite | Anthropic | 70 | 64 | 74 | 88 | 56 | $1.00 | 0.014 |
| 15 | GPT-4 Turbo | OpenAI | 70 | 64 | 72 | 88 | 56 | $10.00 | 0.133 |
| 16 | o1-mini | OpenAI | 70 | 64 | 72 | 90 | 54 | $1.10 | 0.016 |
| 17 | Grok 3 | xAI | 70 | 64 | 72 | 88 | 56 | $3.00 | 0.043 |
| 18 | GPT-4.1 Mini | OpenAI | 68 | 62 | 70 | 86 | 54 | $0.400 | 0.022 |
| 19 | GPT-4 | OpenAI | 68 | 60 | 70 | 86 | 54 | $30.00 | 0.441 |
| 20 | Gemini 2.0 Pro | Google | 68 | 62 | 70 | 86 | 54 | — | 0.018 |
| 21 | Qwen Max | Qwen | 68 | 62 | 70 | 86 | 54 | $1.60 | 0.024 |
| 22 | Claude 3 Sonnet | Anthropic | 65 | 58 | 68 | 85 | 50 | $3.00 | 0.046 |
| 23 | Gemini 2.5 Flash | Google | 65 | 58 | 68 | 85 | 50 | — | 0.002 |
| 24 | Mistral Large 2 | Mistral | 65 | 58 | 66 | 84 | 52 | $2.00 | 0.031 |
| 25 | Grok Code Fast 1 | xAI | 65 | 58 | 68 | 84 | 50 | — | 0.077 |
| 26 | Gemini 1.5 Pro | Google | 62 | 56 | 64 | 82 | 46 | — | 0.020 |
| 27 | DeepSeek Chat (V3) | DeepSeek | 62 | 56 | 64 | 84 | 46 | $0.270 | 0.004 |
| 28 | Qwen Coder | Qwen | 60 | 54 | 62 | 82 | 45 | — | 0.007 |
| 29 | Mistral Codestral | Mistral | 60 | 54 | 64 | 82 | 44 | $0.300 | 0.005 |
| 30 | GPT-4o Mini | OpenAI | 58 | 50 | 60 | 78 | 44 | $0.150 | 0.003 |
| 31 | DeepSeek Coder V2 | DeepSeek | 58 | 50 | 60 | 82 | 42 | $0.270 | 0.005 |
| 32 | Claude 4 Haiku | Anthropic | 55 | 48 | 58 | 78 | 40 | $0.800 | 0.015 |
| 33 | Gemini 2.0 Flash | Google | 55 | 48 | 56 | 78 | 40 | — | 0.002 |
| 34 | Qwen Plus | Qwen | 55 | 48 | 58 | 78 | 40 | $0.400 | 0.007 |
| 35 | Claude 3.5 Haiku | Anthropic | 52 | 45 | 55 | 75 | 38 | $0.800 | 0.015 |
| 36 | Llama 3.3 70B | Meta | 52 | 44 | 54 | 76 | 38 | — | 0.004 |
| 37 | Gemini 1.5 Flash | Google | 50 | 42 | 52 | 72 | 36 | — | 0.001 |
| 38 | Grok 3 Mini | xAI | 50 | 42 | 52 | 72 | 36 | $0.300 | 0.006 |
| 39 | Mistral Nemo | Mistral | 48 | 40 | 50 | 70 | 32 | $0.150 | 0.002 |
| 40 | Claude 3 Haiku | Anthropic | 45 | 38 | 46 | 68 | 30 | $0.250 | 0.006 |
| 41 | Microsoft Phi-4 | Microsoft | 45 | 38 | 46 | 68 | 30 | $0.100 | 0.002 |
| 42 | Qwen Turbo | Qwen | 42 | 35 | 44 | 65 | 28 | $0.080 | 0.002 |
| 43 | Mistral Small | Mistral | 42 | 35 | 44 | 65 | 28 | $0.100 | 0.002 |
| 44 | GPT-3.5 Turbo | OpenAI | 40 | 32 | 42 | 62 | 26 | $0.500 | 0.013 |
| 45 | Reka Flash | Reka | 40 | 32 | 42 | 62 | 26 | $0.200 | 0.025 |
Best Value — Lowest Price per Score Point
Models are ranked by input price divided by overall score. The lower the price per point, the more coding capability you get per dollar. A reproduction sketch follows the table.
| Rank | Model | Provider | Overall Score | Input Price | Price per Point |
|---|---|---|---|---|---|
| 1 | Gemini 1.5 Flash | Google | 50 | — | $0.001 |
| 2 | Gemini 2.5 Flash | Google | 65 | — | $0.002 |
| 3 | Gemini 2.0 Flash | Google | 55 | — | $0.002 |
| 4 | Qwen Turbo | Qwen | 42 | $0.080/M | $0.002 |
| 5 | Mistral Nemo | Mistral | 48 | $0.150/M | $0.002 |
| 6 | Mistral Small | Mistral | 42 | $0.100/M | $0.002 |
| 7 | Microsoft Phi-4 | Microsoft | 45 | $0.100/M | $0.002 |
| 8 | GPT-4o Mini | OpenAI | 58 | $0.150/M | $0.003 |
| 9 | DeepSeek Chat (V3) | DeepSeek | 62 | $0.270/M | $0.004 |
| 10 | Llama 3.3 70B | Meta | 52 | — | $0.004 |
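Assuming the price-per-point column is simply input price divided by overall score, which matches most rows where an input price is listed, it can be reproduced with a few lines of Python. The values below are spot-checks against the tables above.

```python
def price_per_point(input_price_per_m: float, overall_score: float) -> float:
    """Dollars of input cost per point of overall benchmark score (lower is better)."""
    return input_price_per_m / overall_score

# Spot-check against the leaderboard: Claude Opus 4 at $15.00/M input, overall 86.
print(round(price_per_point(15.00, 86), 3))  # 0.174
# DeepSeek Chat (V3): $0.27/M input, overall 62.
print(round(price_per_point(0.27, 62), 3))   # 0.004
```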
Benchmark Sources
- SWE-bench Verified: resolving real GitHub issues in production codebases
- LiveCodeBench: competitive programming problems
- HumanEval: function-level code generation (illustrated just below this list)
- BigCodeBench: practical, multi-step coding tasks
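To make the HumanEval row concrete: each task supplies a function signature and docstring, and the model's completion is scored by running unit tests against it. The problem below is a made-up illustration in that shape, not an actual benchmark item.

```python
# Hypothetical HumanEval-style task: the model receives the signature and
# docstring and must produce the body, which is then run against hidden tests.
def running_total(numbers: list[int]) -> list[int]:
    """Return the cumulative sums of `numbers`, e.g. [1, 2, 3] -> [1, 3, 6]."""
    totals, acc = [], 0
    for n in numbers:
        acc += n
        totals.append(acc)
    return totals

# The grader executes assertions like these to score pass@1.
assert running_total([1, 2, 3]) == [1, 3, 6]
assert running_total([]) == []
```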
Methodology
Scores are aggregated from published third-party benchmarks and normalized to a 0-100 scale. SWE-bench measures real GitHub issue resolution, LiveCodeBench measures competitive programming ability, HumanEval measures function-level code generation, and BigCodeBench measures practical, multi-step coding tasks. Data was compiled on April 18, 2026; where multiple sources exist, the highest verified score is used.
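For readers who want to reproduce the aggregation: the Overall column in the tables above is consistent with an unweighted mean of the four benchmark scores. The sketch below assumes that weighting; it is inferred from the published rows rather than a stated formula.

```python
def overall_score(swe_bench: float, live_code_bench: float,
                  human_eval: float, big_code_bench: float) -> float:
    """Unweighted mean of the four normalized (0-100) benchmark scores."""
    return (swe_bench + live_code_bench + human_eval + big_code_bench) / 4

# Spot-checks against the leaderboard:
print(overall_score(84, 88, 96, 76))  # Claude Opus 4 -> 86.0
print(overall_score(82, 88, 96, 74))  # o3 -> 85.0
```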