Performance Scores

Overall

72
Rank #10 of 45 (78th percentile)

SWE-bench

68
Rank #10 of 45 (78th percentile)

LiveCodeBench

75
Rank #11 of 45 (76th percentile)

HumanEval

90
Rank #9 of 45 (80th percentile)

BigCodeBench

58
Rank #10 of 45 (78th percentile)
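
The percentage shown beside each rank appears to be the share of the 45 ranked models that score lower, not the fraction of the field the model sits within. The sketch below checks that reading against the ranks listed above. The formula and rounding rule are inferred from the published figures rather than documented by the source, and the function name `percent_outranked` is hypothetical.

```python
def percent_outranked(rank: int, total: int) -> int:
    """Whole-number share of models ranked below `rank` (assumed formula)."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    # Models ranked below this one, as a percentage of the whole field.
    return round(100 * (total - rank) / total)


# Ranks as listed in the scores above; the field size is 45 throughout.
ranks = [
    ("Overall", 10),
    ("SWE-bench", 10),
    ("LiveCodeBench", 11),
    ("HumanEval", 9),
    ("BigCodeBench", 10),
]

for name, rank in ranks:
    print(f"{name}: rank #{rank} of 45 -> {percent_outranked(rank, 45)}%")
```

Under this assumption the computed values match the listed 78%, 78%, 76%, 80%, and 78%. By the same arithmetic, rank #10 of 45 places the model within roughly the top 22% of the field.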

Strengths & Weaknesses

Strengths

  • Balanced performance
  • Computer use capability
  • Artifact generation

Weaknesses

  • Older architecture
  • Trails Claude Sonnet 4 (72 vs. 78 overall) at the same input price

Compare with Similar-Priced Models

Model               Overall Score   Input $/M tokens
Claude 3.5 Sonnet   72              $3.00
Claude Sonnet 4     78              $3.00
Claude 3 Sonnet     65              $3.00
GPT-4o              75              $2.50
Qwen 3.6 Plus       72              $3.00
Qwen Max            68              $1.60