LLM agents excel at finding papers but fail at the harder task of determining whether papers actually meet study eligibility criteria—a bottleneck that stage-specific metrics can help diagnose better than single overall scores.
MetaSyn is a benchmark dataset of 442 expert-curated meta-analyses from Nature journals that tests how well AI systems can perform evidence synthesis—retrieving relevant studies, screening them against eligibility criteria, and aggregating results.