Performance Scores

Overall

75
Rank #9 of 45 (top 20%)

SWE-bench

70
Rank #9 of 45 (top 20%)

LiveCodeBench

78
Rank #9 of 45 (top 20%)

HumanEval

90
Rank #9 of 45 (top 20%)

BigCodeBench

62
Rank #9 of 45 (top 20%)
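
A note on reading the rank lines: a position of #9 out of 45 places a model in the top 20% of the field, which is the same as scoring ahead of 80% of it. The sketch below shows that arithmetic. It assumes no ties and that the percentage is derived from rank alone; the function names are hypothetical and not taken from this page.

```python
def top_share(rank: int, total: int) -> float:
    """Percentage of the field at or above this rank (lower is better)."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return 100.0 * rank / total


def share_outranked(rank: int, total: int) -> float:
    """Percentage of the field ranked strictly below this model."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return 100.0 * (total - rank) / total


if __name__ == "__main__":
    # Rank #9 of 45, as listed for each benchmark above.
    print(top_share(9, 45))        # share of the field at or above rank 9
    print(share_outranked(9, 45))  # share of the field ranked below rank 9
```

The two numbers always sum to 100, so a page can report either one as long as it labels which is meant.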

Strengths & Weaknesses

Strengths

  • Strong general-purpose performance
  • Good multimodal capabilities

Weaknesses

  • Less consistent on coding than Claude

Compare with Similar-Priced Models

Model               Overall Score   Input $/M tokens
GPT-4o              75              $2.50
Claude Sonnet 4     78              $3.00
Claude 3.5 Sonnet   72              $3.00
Claude 3 Sonnet     65              $3.00
Qwen 3.6 Plus       72              $3.00
Qwen Max            68              $1.60