Performance Scores

Overall

86
Rank #1 of 45 (98th percentile)

SWE-bench

84
Rank #1 of 45 (98th percentile)

LiveCodeBench

88
Rank #1 of 45 (98th percentile)

HumanEval

96
Rank #1 of 45 (98th percentile)

BigCodeBench

76
Rank #1 of 45 (98th percentile)
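
The 98% figure shown beside each rank is consistent with a simple percentile calculation: the share of the 45 ranked models that place below the given rank. The page does not state its formula, so the sketch below is an assumption, and the function name `percentile_from_rank` is hypothetical.

```python
def percentile_from_rank(rank: int, total: int) -> int:
    """Return the rounded percentage of ranked models placed below `rank`.

    Assumed formula (not confirmed by the source): (total - rank) / total.
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return round(100 * (total - rank) / total)


# Rank #1 of 45 leaves 44 of 45 models below it: 44 / 45 is about 97.8%.
print(percentile_from_rank(1, 45))
```

Under this assumption the top rank can never reach 100%, because the model itself is counted in the total.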

Strengths & Weaknesses

Strengths

  • Best at complex reasoning
  • Strong system design
  • Excellent debugging

Weaknesses

  • Expensive for bulk tasks
  • Slower response times

Compare with Similar-Priced Models

Model            Overall Score   Input Price ($/M tokens)
Claude Opus 4    86              $15.00
Claude 3 Opus    78              $15.00
GPT-4 Turbo      70              $10.00
OpenAI o1        83              $15.00
OpenAI o3        85              $10.00