Large language models (LLMs) can predict physics outcomes but struggle with true scientific discovery: even the strongest models pass only 50% of the worlds, and high prediction accuracy does not guarantee conceptual understanding of the underlying laws.
DiscoverPhysics is a benchmark that tests whether large language models can discover unknown physical laws by designing experiments in simulated worlds with non-standard physics.
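
To make the idea of "designing experiments in a world with non-standard physics" concrete, here is a toy sketch. It is not taken from DiscoverPhysics: the hidden force law, its constants, the noise level, and the names `hidden_world` and `fit_power_law` are all assumptions made up for illustration. An experimenter picks the distances to probe, observes noisy forces, and recovers the exponent of the hidden law from a log-log fit.

```python
import math
import random

def hidden_world(r, rng):
    """Hypothetical simulated world: force falls off as 1 / r**2.5, not 1 / r**2."""
    true_force = 3.0 / r ** 2.5
    return true_force * (1.0 + rng.gauss(0.0, 0.01))  # roughly 1% measurement noise

def fit_power_law(xs, ys):
    """Least-squares line through (log x, log y); returns (exponent, prefactor)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    return slope, math.exp(mean_y - slope * mean_x)

# "Experiment design": probe distances spread geometrically so the
# log-log fit sees a wide range of scales.
rng = random.Random(0)
distances = [0.5 * 1.5 ** i for i in range(10)]
forces = [hidden_world(r, rng) for r in distances]

exponent, prefactor = fit_power_law(distances, forces)
print(f"estimated law: F = {prefactor:.2f} * r^{exponent:.2f}")
```

A good fit here only shows that the data are well predicted by a power law. Stating the law itself (an inverse 2.5-power force rather than the familiar inverse-square one) is the separate, conceptual step that the summary above says models often miss.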