MuSR

Multistep Soft Reasoning

reasoning · Score: 0-100 (% accuracy) · 29 models scored

About

Tests multi-step reasoning over natural-language narratives, including murder mysteries, team allocation puzzles, and object placement tracking.

Methodology

Problems are generated algorithmically and embedded in realistic narratives. Solving them requires a model to chain multiple reasoning steps across the natural-language context. The problems span three domains: murder mystery deduction, constraint-based team allocation, and spatial object tracking.
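As a rough illustration of the 0-100 accuracy scale, the sketch below scores a handful of toy problems. It assumes each problem is posed as a multiple-choice question with one correct option and that the score is plain percent accuracy. The `Item` schema, the `accuracy_score` helper, and the toy items are hypothetical and are not taken from the dataset or from any leaderboard code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Item:
    """One multiple-choice problem (hypothetical schema, not the dataset's own)."""
    domain: str          # e.g. "murder_mystery", "team_allocation", "object_placement"
    question: str
    choices: list[str]
    answer_index: int    # index into `choices` of the correct option


def accuracy_score(items: list[Item], predictions: list[int]) -> float:
    """Return percent accuracy on a 0-100 scale (assumed scoring rule)."""
    if len(items) != len(predictions):
        raise ValueError("need exactly one prediction per item")
    if not items:
        raise ValueError("cannot score an empty item list")
    correct = sum(
        1 for item, pred in zip(items, predictions) if pred == item.answer_index
    )
    return 100.0 * correct / len(items)


# Toy items invented for illustration only.
items = [
    Item("murder_mystery", "Who is the most likely murderer?", ["Alice", "Bob"], 1),
    Item("team_allocation", "Which assignment is most efficient?", ["A", "B", "C"], 0),
    Item("object_placement", "Where would Dana look for the keys?", ["drawer", "shelf"], 1),
    Item("murder_mystery", "Who is the most likely murderer?", ["Carl", "Eve"], 0),
]
predictions = [1, 0, 0, 0]  # the third prediction is wrong

score = accuracy_score(items, predictions)
print(f"{score:.1f}%")  # 75.0%
```

A leaderboard may apply further normalization (for example, adjusting for the random-guess baseline), which this sketch does not model.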


Model Leaderboard

Only open-weight models are shown. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard; any scores for them come from provider-reported benchmarks.

#	Model	Score
1	Qwen2.5 32B	22.7%
2	gpt2	15.3%
3	Qwen2.5 7B	14.1%
4	phi 2	13.8%
5	Qwen2.5 32B Instruct	13.5%