o1 thinks before it speaks, literally. It spends extra time reasoning through a problem internally before producing an answer, which makes it noticeably stronger at multi-step logic, math, and complex coding tasks than models that respond immediately. The trade-off is speed and cost: it is slower and more expensive per query, and it can be overkill for straightforward conversational tasks.

| Benchmark | Score | Type | Recorded |
|---|---|---|---|
| TAU2 | 50.0 | accuracy | 5d ago |
| SWE-Bench | 48.9 | accuracy | 5d ago |
| AIME 2024 | 74.3 | accuracy | 5d ago |
| AIME 2025 | 79.2 | accuracy | 5d ago |