AI Coding Model Benchmarks 2026
Third-party benchmark scores for 45 AI coding models across four standard tests: SWE-bench (real GitHub issue resolution), LiveCodeBench (competitive programming), HumanEval (function-level code generation), and BigCodeBench (practical, multi-step coding tasks). All scores are normalized to a 0-100 scale; see Methodology at the end of the page.
Top 3 Models by Overall Score
1. Claude Opus 4 (Anthropic): 86
2. o3 (OpenAI): 85
3. o1 (OpenAI): 83
All Models — Ranked by Overall Score
| Rank | Model | Provider | Overall | SWE-bench | LiveCodeBench | HumanEval | BigCodeBench | Input $/M | Price/Point |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4 | Anthropic | 86 | 84 | 88 | 96 | 76 | $15.00 | 0.174 |
| 2 | o3 | OpenAI | 85 | 82 | 88 | 96 | 74 | $10.00 | 0.167 |
| 3 | o1 | OpenAI | 83 | 80 | 84 | 95 | 73 | $15.00 | 0.181 |
| 4 | GPT-4.1 | OpenAI | 80 | 76 | 82 | 94 | 68 | $2.00 | 0.063 |
| 5 | o3-mini | OpenAI | 80 | 76 | 85 | 94 | 65 | $1.10 | 0.014 |
| 6 | Claude Sonnet 4 | Anthropic | 78 | 74 | 82 | 92 | 64 | $3.00 | 0.038 |
| 7 | Claude 3 Opus | Anthropic | 78 | 74 | 80 | 94 | 64 | $15.00 | 0.192 |
| 8 | Gemini 2.5 Pro | Google | 76 | 72 | 79 | 89 | 64 | — | 0.016 |
| 9 | GPT-4o | OpenAI | 75 | 70 | 78 | 90 | 62 | $2.50 | 0.033 |
| 10 | Claude 3.5 Sonnet | Anthropic | 72 | 68 | 75 | 90 | 58 | $3.00 | 0.042 |
| 11 | o4-mini | OpenAI | 72 | 66 | 74 | 92 | 56 | $1.10 | 0.015 |
| 12 | Qwen3 6 Plus | Qwen | 72 | 66 | 74 | 88 | 58 | $3.00 | 0.011 |
| 13 | DeepSeek Reasoner (R1) | DeepSeek | 72 | 68 | 76 | 90 | 56 | $0.550 | 0.008 |
| 14 | Claude Sonnet 4 Lite | Anthropic | 70 | 64 | 74 | 88 | 56 | $1.00 | 0.014 |
| 15 | GPT-4 Turbo | OpenAI | 70 | 64 | 72 | 88 | 56 | $10.00 | 0.133 |
| 16 | o1-mini | OpenAI | 70 | 64 | 72 | 90 | 54 | $1.10 | 0.016 |
| 17 | Grok 3 | xAI | 70 | 64 | 72 | 88 | 56 | $3.00 | 0.043 |
| 18 | GPT-4.1 Mini | OpenAI | 68 | 62 | 70 | 86 | 54 | $0.400 | 0.022 |
| 19 | GPT-4 | OpenAI | 68 | 60 | 70 | 86 | 54 | $30.00 | 0.441 |
| 20 | Gemini 2.0 Pro | Google | 68 | 62 | 70 | 86 | 54 | — | 0.018 |
| 21 | Qwen Max | Qwen | 68 | 62 | 70 | 86 | 54 | $1.60 | 0.024 |
| 22 | Claude 3 Sonnet | Anthropic | 65 | 58 | 68 | 85 | 50 | $3.00 | 0.046 |
| 23 | Gemini 2.5 Flash | Google | 65 | 58 | 68 | 85 | 50 | — | 0.002 |
| 24 | Mistral Large 2 | Mistral | 65 | 58 | 66 | 84 | 52 | $2.00 | 0.031 |
| 25 | Grok Code Fast 1 | xAI | 65 | 58 | 68 | 84 | 50 | — | 0.077 |
| 26 | Gemini 1.5 Pro | Google | 62 | 56 | 64 | 82 | 46 | — | 0.020 |
| 27 | DeepSeek Chat (V3) | DeepSeek | 62 | 56 | 64 | 84 | 46 | $0.270 | 0.004 |
| 28 | Qwen Coder | Qwen | 60 | 54 | 62 | 82 | 45 | — | 0.007 |
| 29 | Mistral Codestral | Mistral | 60 | 54 | 64 | 82 | 44 | $0.300 | 0.005 |
| 30 | GPT-4o Mini | OpenAI | 58 | 50 | 60 | 78 | 44 | $0.150 | 0.003 |
| 31 | DeepSeek Coder V2 | DeepSeek | 58 | 50 | 60 | 82 | 42 | $0.270 | 0.005 |
| 32 | Claude 4 Haiku | Anthropic | 55 | 48 | 58 | 78 | 40 | $0.800 | 0.015 |
| 33 | Gemini 2.0 Flash | Google | 55 | 48 | 56 | 78 | 40 | — | 0.002 |
| 34 | Qwen Plus | Qwen | 55 | 48 | 58 | 78 | 40 | $0.400 | 0.007 |
| 35 | Claude 3.5 Haiku | Anthropic | 52 | 45 | 55 | 75 | 38 | $0.800 | 0.015 |
| 36 | Llama 3.3 70B | Meta | 52 | 44 | 54 | 76 | 38 | — | 0.004 |
| 37 | Gemini 1.5 Flash | Google | 50 | 42 | 52 | 72 | 36 | — | 0.001 |
| 38 | Grok 3 Mini | xAI | 50 | 42 | 52 | 72 | 36 | $0.300 | 0.006 |
| 39 | Mistral Nemo | Mistral | 48 | 40 | 50 | 70 | 32 | $0.150 | 0.002 |
| 40 | Claude 3 Haiku | Anthropic | 45 | 38 | 46 | 68 | 30 | $0.250 | 0.006 |
| 41 | Microsoft Phi-4 | Microsoft | 45 | 38 | 46 | 68 | 30 | $0.100 | 0.002 |
| 42 | Qwen Turbo | Qwen | 42 | 35 | 44 | 65 | 28 | $0.080 | 0.002 |
| 43 | Mistral Small | Mistral | 42 | 35 | 44 | 65 | 28 | $0.100 | 0.002 |
| 44 | GPT-3.5 Turbo | OpenAI | 40 | 32 | 42 | 62 | 26 | $0.500 | 0.013 |
| 45 | Reka Flash | Reka | 40 | 32 | 42 | 62 | 26 | $0.200 | 0.025 |
Best Value — Lowest Price per Score Point
Models are ranked by input price divided by overall score. The lower the price per point, the more coding capability you get per dollar. A reproduction sketch follows the table.
| Rank | Model | Provider | Overall Score | Input Price | Price per Point |
|---|---|---|---|---|---|
| 1 | Gemini 1.5 Flash | Google | 50 | — | $0.001 |
| 2 | Gemini 2.5 Flash | Google | 65 | — | $0.002 |
| 3 | Gemini 2.0 Flash | Google | 55 | — | $0.002 |
| 4 | Qwen Turbo | Qwen | 42 | $0.080/M | $0.002 |
| 5 | Mistral Nemo | Mistral | 48 | $0.150/M | $0.002 |
| 6 | Mistral Small | Mistral | 42 | $0.100/M | $0.002 |
| 7 | Microsoft Phi-4 | Microsoft | 45 | $0.100/M | $0.002 |
| 8 | GPT-4o Mini | OpenAI | 58 | $0.150/M | $0.003 |
| 9 | DeepSeek Chat (V3) | DeepSeek | 62 | $0.270/M | $0.004 |
| 10 | Llama 3.3 70B | Meta | 52 | — | $0.004 |
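Assuming the price-per-point column is simply input price divided by overall score, which matches most rows where an input price is listed, it can be reproduced with a few lines of Python. The values below are spot-checks against the tables above.

```python
def price_per_point(input_price_per_m: float, overall_score: float) -> float:
    """Dollars of input cost per point of overall benchmark score (lower is better)."""
    return input_price_per_m / overall_score

# Spot-check against the leaderboard: Claude Opus 4 at $15.00/M input, overall 86.
print(round(price_per_point(15.00, 86), 3))  # 0.174
# DeepSeek Chat (V3): $0.27/M input, overall 62.
print(round(price_per_point(0.27, 62), 3))   # 0.004
```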
Benchmark Sources
- SWE-bench Verified: resolving real GitHub issues in production codebases
- LiveCodeBench: competitive programming problems
- HumanEval: function-level code generation (illustrated just below this list)
- BigCodeBench: practical, multi-step coding tasks
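To make the HumanEval row concrete: each task supplies a function signature and docstring, and the model's completion is scored by running unit tests against it. The problem below is a made-up illustration in that shape, not an actual benchmark item.

```python
# Hypothetical HumanEval-style task: the model receives the signature and
# docstring and must produce the body, which is then run against hidden tests.
def running_total(numbers: list[int]) -> list[int]:
    """Return the cumulative sums of `numbers`, e.g. [1, 2, 3] -> [1, 3, 6]."""
    totals, acc = [], 0
    for n in numbers:
        acc += n
        totals.append(acc)
    return totals

# The grader executes assertions like these to score pass@1.
assert running_total([1, 2, 3]) == [1, 3, 6]
assert running_total([]) == []
```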
Methodology
Scores are aggregated from published third-party benchmarks and normalized to a 0-100 scale. SWE-bench measures real GitHub issue resolution, LiveCodeBench measures competitive programming ability, HumanEval measures function-level code generation, and BigCodeBench measures practical, multi-step coding tasks. Data was compiled on April 18, 2026; where multiple sources exist, the highest verified score is used.
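For readers who want to reproduce the aggregation: the Overall column in the tables above is consistent with an unweighted mean of the four benchmark scores. The sketch below assumes that weighting; it is inferred from the published rows rather than a stated formula.

```python
def overall_score(swe_bench: float, live_code_bench: float,
                  human_eval: float, big_code_bench: float) -> float:
    """Unweighted mean of the four normalized (0-100) benchmark scores."""
    return (swe_bench + live_code_bench + human_eval + big_code_bench) / 4

# Spot-checks against the leaderboard:
print(overall_score(84, 88, 96, 76))  # Claude Opus 4 -> 86.0
print(overall_score(82, 88, 96, 74))  # o3 -> 85.0
```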