Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

Anzhe Xie, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai|June 15, 2026arXiv

Key Takeaway

LLM agents excel at finding papers but fail at the harder task of determining whether papers actually meet study eligibility criteria—a bottleneck that stage-specific metrics can help diagnose better than single overall scores.

Summary

MetaSyn is a benchmark dataset of 442 expert-curated meta-analyses from Nature journals that tests how well AI systems can perform evidence synthesis—retrieving relevant studies, screening them against eligibility criteria, and aggregating results.

evaluation reasoning agents

Key Terms

rag-pipeline screening-bottleneck stage-attributed-metrics pi-eco-criteria hard-negatives