GPT-4o is OpenAI's multimodal workhorse — comfortable switching between analyzing images and handling text without missing a beat. It carries a large 128k context window, meaning it can hold long documents or extended conversations in memory at once. The trade-off is that, like most large hosted models, its internals are opaque and parameter counts aren't disclosed.
| Benchmark | Score | Type | Recorded |
|---|---|---|---|
| SciCode | 1.5 | main_problem_pass@1 | 9d ago |
| TAU2 | 42.8 | accuracy | 9d ago |
| HumanEval+ | 86.6 | pass@1 | 2y ago |
| MMLU | 88.7 | 5-shot | 2y ago |
| Aider Polyglot | 23.1 | accuracy | 9d ago |
| MT-Bench | 9.3 | GPT-4-judge | 2y ago |
| MATH | 76.6 | 4-shot | 2y ago |
| Chatbot Arena | 1285.0 | Bradley-Terry Elo |
| 1y ago |
| SWE-Bench | 33.2 | accuracy | 9d ago |
| AIME 2024 | 12.0 | accuracy | 9d ago |
| Humanity's Last Exam | 9.4 | 0-shot | 1y ago |
| HumanEval | 90.2 | pass@1 | 2y ago |
| GPQA Diamond | 53.6 | 0-shot-CoT | 2y ago |