R1 thinks out loud: it works through problems step by step, showing its reasoning chain before arriving at an answer. This makes it particularly transparent on math, logic, and coding tasks, where you can follow (and verify) its work. The trade-off is verbosity: responses are often long, and the model can over-deliberate on simple questions.

| Benchmark | Score | Metric | Recorded |
|---|---|---|---|
| LiveCodeBench | 65.9 | accuracy | 5d ago |
| IFBench | 38.0 | prompt_level_loose_accuracy | 5d ago |
| SWE-Bench | 49.2 | accuracy | 5d ago |
| Aider Polyglot | 56.9 | accuracy | 5d ago |
| AIME 2024 | 79.8 | accuracy | 5d ago |
| SciCode | 4.6 | main_problem_pass@1 | 5d ago |
| AIME 2025 | 87.5 | accuracy | 5d ago |
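
Because the reasoning chain comes before the answer and can be long, it is often useful to separate the two when post-processing a response. The following is a minimal sketch, not a documented interface: it assumes the reasoning is wrapped in `<think>...</think>` tags, which the text above does not state, and `split_reasoning` is a hypothetical helper name.

```python
import re

# Assumption: the reasoning chain is delimited by <think>...</think> tags.
# Adjust the pattern if your deployment marks reasoning differently.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(response: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from a raw model response."""
    match = THINK_BLOCK.search(response)
    if match is None:
        # No tagged reasoning found: treat the whole response as the answer.
        return "", response.strip()
    reasoning = match.group(1).strip()
    # Everything outside the tagged block is kept as the final answer.
    answer = (response[:match.start()] + response[match.end():]).strip()
    return reasoning, answer


raw = "<think>17 * 3 = 51, and 51 + 4 = 55.</think>\nThe result is 55."
reasoning, answer = split_reasoning(raw)
print(answer)  # prints "The result is 55."
```

Keeping the reasoning as a separate string lets you log it for verification while showing users only the shorter final answer.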