Performance Scores

Overall

83
Rank #3 of 45 (ahead of 93% of models)

SWE-bench

80
Rank #3 of 45 (ahead of 93% of models)

LiveCodeBench

84
Rank #4 of 45 (ahead of 91% of models)

HumanEval

95
Rank #3 of 45 (ahead of 93% of models)

BigCodeBench

73
Rank #3 of 45 (ahead of 93% of models)
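
The page does not state how the Overall score or the percentage next to each rank is computed. The sketch below is one reading that happens to reproduce the displayed numbers: Overall as the unweighted mean of the four benchmark scores, and the percentage as the share of the 45 ranked models that place below the given rank. Both formulas, and the function names, are assumptions made for illustration, not a documented methodology.

```python
def overall_score(benchmark_scores):
    """Unweighted mean of the benchmark scores, rounded to a whole number.

    Assumption: every benchmark carries equal weight.
    """
    return round(sum(benchmark_scores.values()) / len(benchmark_scores))


def percent_ranked_below(rank, total):
    """Share of ranked models that place below `rank`, as a whole percent.

    Assumption: the percentage shown beside each rank is (total - rank) / total.
    """
    return round(100 * (total - rank) / total)


# Scores as listed on this page.
scores = {
    "SWE-bench": 80,
    "LiveCodeBench": 84,
    "HumanEval": 95,
    "BigCodeBench": 73,
}

print(overall_score(scores))         # 83
print(percent_ranked_below(3, 45))   # 93
print(percent_ranked_below(4, 45))   # 91
```

Under these assumptions the outputs match the Overall score of 83 and the percentages shown for ranks #3 and #4. A weighted average or a different percentile convention could produce the same figures, so treat this as a consistency check only.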

Strengths & Weaknesses

Strengths

  • Strong step-by-step reasoning
  • Best at math-heavy coding

Weaknesses

  • Expensive ($15.00 per million input tokens)
  • Slow

Compare with Similar-Priced Models

Model            Overall Score   Input $/M tokens
o1               83              $15.00
Claude Opus 4    86              $15.00
Claude 3 Opus    78              $15.00
GPT-4 Turbo      70              $10.00
OpenAI o3        85              $10.00