DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking

Matt L. Wiemann, Lindsay M. Smith, Peter Melchior, Siddharth Mishra-Sharma, Andrew Gordon Wilson et al. | May 25, 2026 | arXiv

Key Takeaway

LLMs can predict physical outcomes but struggle with genuine scientific discovery: the strongest models pass only 50% of the simulated worlds, and accurate predictions do not guarantee conceptual understanding of the underlying laws.

Summary

DiscoverPhysics is a benchmark that tests whether large language models can discover unknown physical laws by designing experiments in simulated worlds governed by non-standard physics.
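
To make the idea of discovering a law through experiments concrete, here is a minimal sketch of the general pattern, not the benchmark's actual simulator or API. It assumes a hypothetical toy world in which attraction follows F = strength / r**exponent with a non-standard hidden exponent; the names make_world, measure_force, and fit_power_law are invented for illustration. The "experimenter" picks separations, records the force, and recovers the hidden law with a log-log fit.

```python
import math


def make_world(exponent=2.5, strength=1.0):
    """Hypothetical toy world: attraction follows F = strength / r**exponent.

    The exponent 2.5 is an arbitrary non-standard choice for illustration.
    """
    def measure_force(r):
        # The experimenter sees only this measurement, not the parameters.
        return strength / r ** exponent
    return measure_force


def fit_power_law(rs, forces):
    """Least-squares line fit in log-log space; returns (exponent, strength)."""
    xs = [math.log(r) for r in rs]
    ys = [math.log(f) for f in forces]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    # log F = log(strength) - exponent * log r, so the slope is -exponent.
    return -slope, math.exp(intercept)


# Design the experiment: choose separations spanning more than a decade.
world = make_world()
separations = [0.5, 1.0, 2.0, 4.0, 8.0]
readings = [world(r) for r in separations]

exponent, strength = fit_power_law(separations, readings)
print(f"recovered exponent ~ {exponent:.3f}, strength ~ {strength:.3f}")
```

In this noiseless toy the fit recovers the hidden parameters almost exactly. A realistic setting would be much harder: the experimenter would not know the functional form in advance and would have to propose and refine hypotheses over many experiments.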

reasoning, evaluation, agents

Key Terms

agentic-reasoning, hypothesis-refinement, long-horizon-reasoning, n-body-simulator