Current LLMs cannot reliably evaluate research soundness on their own: they systematically overestimate the viability of weak research ideas, a critical limitation if you're building AI systems to automate scientific discovery.
This paper introduces SoundnessBench, a benchmark of 1,099 machine learning research proposals labeled with reviewer scores, to test whether AI models can reliably judge whether a research idea is methodologically sound before any experiments are run.