Multistep Soft Reasoning
Tests multi-step reasoning over natural language narratives, including murder mysteries, team allocation puzzles, and object placement tracking.
Algorithmically generated reasoning problems embedded in realistic narratives. Requires models to chain multiple reasoning steps across natural language context. Problems span murder mystery deduction, constraint-based team allocation, and spatial object tracking.
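To make the "chain multiple reasoning steps" idea concrete, here is a toy sketch of the state tracking behind an object placement question. It is only an illustration under stated assumptions: the function name `track_objects`, the `(object, source, destination)` move format, and the sample items are all hypothetical, and this is not the benchmark's generator, data format, or scoring code.

```python
from typing import Dict, List, Tuple

# Hypothetical move format: (object, source location, destination location).
Move = Tuple[str, str, str]


def track_objects(initial: Dict[str, str], moves: List[Move]) -> Dict[str, str]:
    """Apply each move in order and return every object's final location."""
    locations = dict(initial)  # copy so the caller's dict is not mutated
    for obj, src, dst in moves:
        # Each step depends on the state left by the previous steps,
        # which is why the moves cannot be evaluated independently.
        if locations.get(obj) != src:
            raise ValueError(f"{obj!r} is not at {src!r}")
        locations[obj] = dst
    return locations


# Made-up example: three moves, two of which affect the same object.
initial = {"keys": "drawer", "phone": "desk"}
moves = [
    ("keys", "drawer", "backpack"),
    ("phone", "desk", "shelf"),
    ("keys", "backpack", "counter"),
]
final = track_objects(initial, moves)
print(final["keys"])
```

In a narrative version of this task the moves are scattered through prose rather than given as a list, so a model has to extract them and then apply them in the right order.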
Shows open-weight models only. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard; their scores come from provider-reported benchmarks.