LLMs show promise for scientific discovery tasks: GPT-4 variants stay closely aligned with the conclusions of real research papers even when given limited context. Current benchmarks, however, do not adequately test the creative hypothesis generation that genuine scientific breakthroughs require.
This paper introduces ProjectionBench, a benchmark that tests whether large language models can generate scientific hypotheses the way real researchers do. Each model receives a research question whose details are revealed gradually, and the hypotheses it generates are compared with the conclusions of the actual paper using semantic similarity.
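
The evaluation loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the benchmark's actual implementation: the text here does not say which similarity model is used, so a bag-of-words cosine similarity stands in for whatever semantic similarity measure the benchmark applies (a real setup would more plausibly use sentence embeddings). All function names (`progressive_prompts`, `score_model`, and the `generate` callable that wraps a model) are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def progressive_prompts(question, details):
    """Yield prompts that reveal 0, 1, ..., len(details) details in order."""
    for k in range(len(details) + 1):
        yield "\n".join([question, *details[:k]])


def score_model(generate, question, details, conclusion):
    """Score the hypothesis generated at each disclosure level.

    `generate` is a hypothetical callable mapping a prompt string to a
    hypothesis string. The similarity used here is an assumed stand-in
    (bag-of-words cosine), not the benchmark's own metric.
    """
    target = bag_of_words(conclusion)
    return [
        cosine_similarity(bag_of_words(generate(prompt)), target)
        for prompt in progressive_prompts(question, details)
    ]
```

Calling `score_model` with a function that wraps a model returns one similarity score per disclosure level, which makes it possible to see how alignment with the paper's conclusion changes as more details are revealed.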