o3 thinks before it speaks, literally. It runs an extended internal reasoning chain before producing a response, which makes it slower than models that answer directly but more reliable on problems that require multi-step logic, mathematics, or careful deduction. It handles ambiguous or hard problems by working through them rather than pattern-matching to a quick answer, and that deliberation comes at a higher compute cost.

| Benchmark | Score | Metric | Recorded |
|---|---|---|---|
| LCR | 69.3 | pass@1_accuracy | 5d ago |
| IFBench | 69.3 | prompt_level_loose_accuracy | 5d ago |
| SWE-Bench | 69.1 | accuracy | 5d ago |
| AIME 2024 | 91.6 | accuracy | 5d ago |
| Aider Polyglot | 76.9 | accuracy | 5d ago |
| AIME 2025 | 88.9 | accuracy | 5d ago |