Current LLMs struggle with clinical decision-making not only in what they decide but also in how they gather information: they ask redundant questions and fail at management decisions even when they get the diagnosis right, a gap that outcome-only evaluation does not reveal.
ClinEnv is an interactive benchmark that tests how well AI models act as doctors by simulating real patient cases over multiple decision stages. Unlike static medical benchmarks, it requires models to actively gather information from specialized agents before making treatment decisions, and scores both the quality of decisions and the process of gathering information.
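
The idea of scoring both decisions and the information-gathering process can be illustrated with a minimal Python sketch. This is not ClinEnv's implementation: the `Stage` structure, the exact-match decision scoring, and the definition of redundancy as a repeated query are all assumptions made for illustration, and the sample case is invented.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One decision stage of a simulated case (hypothetical structure)."""
    queries: list[str]        # questions the model sent to information agents
    decision: str             # the model's decision at the end of the stage
    reference_decision: str   # the expected decision for the stage


def redundancy_rate(stages: list[Stage]) -> float:
    """Fraction of queries that repeat an earlier query in the same case.

    Assumption: two queries are 'the same' if they match after
    trimming whitespace and lowercasing.
    """
    seen: set[str] = set()
    total = 0
    repeated = 0
    for stage in stages:
        for query in stage.queries:
            key = query.strip().lower()
            total += 1
            if key in seen:
                repeated += 1
            seen.add(key)
    return repeated / total if total else 0.0


def decision_accuracy(stages: list[Stage]) -> float:
    """Fraction of stages whose decision exactly matches the reference."""
    if not stages:
        return 0.0
    correct = sum(s.decision == s.reference_decision for s in stages)
    return correct / len(stages)


def score_episode(stages: list[Stage]) -> dict[str, float]:
    """Report outcome and process scores side by side."""
    return {
        "decision_accuracy": decision_accuracy(stages),
        "redundancy_rate": redundancy_rate(stages),
    }


# Invented two-stage case: correct diagnosis, wrong management decision,
# and one query repeated across stages.
example = [
    Stage(["symptom onset?", "lab result A?"], "diagnosis_X", "diagnosis_X"),
    Stage(["Lab result A?", "current medications?"], "plan_P", "plan_Q"),
]
print(score_episode(example))
```

Reporting the two numbers separately, rather than folding them into one score, keeps a correct diagnosis from hiding a poor management decision or a wasteful questioning pattern.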