Long Context Retrieval
Tests a model's ability to retrieve specific information from very long documents, measuring long-context comprehension and retrieval accuracy.
Models must locate and extract specific facts, figures, or passages from long documents (10K–1M tokens). Tests robustness of attention mechanisms and context utilisation at extended lengths.
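As a rough illustration of this kind of task, the sketch below plants one fact (a "needle") at a chosen depth inside filler text and scores answers by substring match. This is not the benchmark's actual harness: the function names `build_haystack`, `toy_model`, and `retrieval_accuracy`, the filler sentence, the needle, and the substring-match scoring rule are all assumptions made for illustration.

```python
def build_haystack(needle: str, filler: str, n_sentences: int, depth: float) -> str:
    """Place `needle` among `n_sentences` copies of `filler`.

    `depth` is the relative position: 0.0 puts the needle first, 1.0 puts it last.
    """
    sentences = [filler] * n_sentences
    sentences.insert(int(depth * n_sentences), needle)
    return " ".join(sentences)


def toy_model(document: str, keyword: str) -> str:
    """Hypothetical stand-in for a model: return the first sentence containing `keyword`."""
    for sentence in document.split(". "):
        if keyword in sentence:
            return sentence
    return ""


def retrieval_accuracy(answers: list[str], expected: list[str]) -> float:
    """Fraction of answers that contain the expected fact (case-insensitive)."""
    hits = sum(exp.lower() in ans.lower() for ans, exp in zip(answers, expected))
    return hits / len(expected)


# Assumed example data, not taken from the benchmark.
NEEDLE = "The access code for vault 7 is 4912."
FILLER = "The sky was grey that morning."

# Probe several needle depths in a 1,000-sentence document.
depths = [0.0, 0.25, 0.5, 0.75, 1.0]
answers = [
    toy_model(build_haystack(NEEDLE, FILLER, 1000, d), "access code") for d in depths
]
score = retrieval_accuracy(answers, ["4912"] * len(depths))
```

Replacing `toy_model` with a call to a real model and sweeping both document length and needle depth gives a per-length, per-depth accuracy grid, which is one common way to summarise retrieval robustness at long context.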
Shows open-weight models only. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard — their scores come from provider-reported benchmarks.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.7 | 70.3% |
| 2 | o3 | 69.3% |
| 3 | Grok 4 | 68.0% |
| 4 | Claude Sonnet 4.5 | 66.0% |
| 5 | Gemini 2.5 Pro | 66.0% |
| 6 | Claude Opus 4.5 | 65.3% |
| 7 | Claude Sonnet 4 | 65.0% |