Performance Scores

Overall

85
Rank #2 of 45 (96th percentile)

SWE-bench

82
Rank #2 of 45 (96th percentile)

LiveCodeBench

88
Rank #1 of 45 (98th percentile)

HumanEval

96
Rank #1 of 45 (98th percentile)

BigCodeBench

74
Rank #2 of 45 (96th percentile)
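
The percentage shown beside each rank appears to be the share of the 45 ranked models that sit below the given rank, rounded to a whole number. That reading is an assumption inferred from the figures on this page, not a documented formula, and the function name below is hypothetical. A minimal sketch of the arithmetic:

```python
def share_outranked(rank: int, total: int) -> int:
    """Percentage of the field ranked below `rank`, rounded to a whole number.

    Assumed formula (inferred from the page, not documented):
        100 * (total - rank) / total
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return round(100 * (total - rank) / total)


if __name__ == "__main__":
    # The page ranks 45 models; ranks 1 and 2 are the ones shown above.
    for rank in (1, 2):
        print(f"Rank #{rank} of 45: {share_outranked(rank, 45)}%")
```

Under this assumption, rank 1 of 45 gives 98 and rank 2 of 45 gives 96, which matches the values listed for each benchmark.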

Strengths & Weaknesses

Strengths

  • Latest reasoning model
  • Ranks first or second of 45 models on every benchmark listed

Weaknesses

  • Very expensive ($10.00 per million input tokens)
  • Slow

Compare with Similar-Priced Models

Model          Overall Score   Input $/M tokens
o3             85              $10.00
GPT-4 Turbo    70              $10.00