The heavyweight of the open-weight world, this model brings frontier-level reasoning and instruction-following to teams willing to run their own infrastructure. It handles complex multi-step problems, nuanced writing, and long-context tasks with notable coherence, though its sheer size means deployment requires serious compute resources. Think of it as the colleague who can do almost anything but needs a large desk to work at.
| Benchmark | Score | Type | Recorded |
|---|---|---|---|
| HellaSwag | 88.5 | accuracy | 5d ago |
| SciCode | 1.5 | main_problem_pass@1 | 5d ago |
| ARC-Challenge | 95.0 | accuracy | 5d ago |
| TruthfulQA | 65.3 | accuracy | 5d ago |
| WinoGrande | 87.2 | accuracy | 5d ago |
| GSM8K | 96.0 | accuracy | 5d ago |