Multistep Soft Reasoning
Tests multi-step reasoning over natural language narratives including murder mysteries, team allocation puzzles, and object placement tracking
Algorithmically generated reasoning problems embedded in realistic narratives. Models must chain multiple reasoning steps across a natural-language context. Problems span murder mystery deduction, constraint-based team allocation, and spatial object tracking.
Shows open-weight models only. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard; their scores come from provider-reported benchmarks.
| # | Model | Score |
|---|---|---|
| 1 | Qwen2.5 32B | 22.7% |
| 2 | gpt2 | 15.3% |
| 3 | Qwen2.5 7B | 14.1% |
| 4 | phi 2 | 13.8% |
| 5 | Qwen2.5 32B Instruct | 13.5% |
| 6 | Qwen2 1.5B Instruct | 12.0% |
| 7 | Phi 3 medium 128k instruct | 11.4% |
| 8 | distilgpt2 | 11.2% |
| 9 | pythia 160m | 10.7% |
| 10 | Qwen2.5 14B Instruct | 10.2% |
| 11 | gemma 2 9b it | 9.7% |
| 12 | Qwen2.5 Coder 7B Instruct | 9.5% |
| 13 | gemma 2 27b it | 9.1% |
| 14 | Llama 3.1 8B Instruct | 8.6% |
| 15 | Qwen2.5 7B Instruct | 8.5% |
| 16 | Mistral 7B Instruct v0.2 | 7.6% |
| 17 | Qwen2.5 3B Instruct | 7.6% |
| 18 | gemma 2 2b it | 7.1% |
| 19 | Meta Llama 3 8B | 6.2% |
| 20 | gpt2 large | 5.7% |
| 21 | gpt j 6b | 5.3% |
| 22 | TinyLlama 1.1B Chat v1.0 | 4.3% |
| 23 | Llama 3.2 3B | 3.8% |
| 24 | falcon 7b instruct | 3.3% |
| 25 | Qwen2.5 1.5B Instruct | 3.2% |
| 26 | Llama 3.2 1B Instruct | 3.0% |
| 27 | Llama 3.2 1B | 2.6% |
| 28 | Llama 3.2 3B Instruct | 1.4% |
| 29 | Qwen2.5 0.5B Instruct | 0.9% |
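
For readers who want to work with these numbers programmatically, the following is a minimal sketch of loading a table in this format. The function names `parse_leaderboard` and `is_sorted_descending` are hypothetical helpers written for this illustration, not part of any leaderboard tooling, and the sample contains only three rows copied from the table above. The parser assumes three-column rows and accepts scores with or without a trailing percent sign.

```python
import re


def parse_leaderboard(markdown: str) -> list[tuple[int, str, float]]:
    """Parse '| rank | model | score |' rows into (rank, model, score) tuples."""
    rows = []
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the header row, the '---' separator, and malformed lines.
        if len(cells) != 3 or not cells[0].isdigit():
            continue
        # Accept "22.7%" as well as a bare "22.7".
        match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*%?", cells[2])
        if match is None:
            continue
        rows.append((int(cells[0]), cells[1], float(match.group(1))))
    return rows


def is_sorted_descending(rows: list[tuple[int, str, float]]) -> bool:
    """Check that scores never increase as the rank number grows."""
    scores = [score for _, _, score in rows]
    return all(a >= b for a, b in zip(scores, scores[1:]))


# Three rows copied from the table above, for illustration only.
sample = """\
| # | Model | Score |
|---|---|---|
| 1 | Qwen2.5 32B | 22.7% |
| 2 | gpt2 | 15.3% |
| 29 | Qwen2.5 0.5B Instruct | 0.9% |
"""

rows = parse_leaderboard(sample)
for rank, model, score in rows:
    print(f"{rank:>2}  {model:<24} {score:.1f}%")
```

Running a check such as `is_sorted_descending` over the parsed rows is a cheap way to confirm that the rank column and the score column agree before doing any further analysis.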