For scientific summarization, training on carefully selected high-quality examples outperforms training on larger random datasets: quality matters more than quantity when building summarization systems.
This paper creates a large biomedical summarization dataset (1.88M articles) and shows that author-written abstracts vary in quality. When training examples are selected for how well each abstract aligns with its source article, models achieve better results with less data than models trained on random samples, improving both training efficiency and the factuality of the generated summaries.
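
The selection idea can be illustrated with a minimal sketch. The summary above does not say which alignment measure the paper uses, so the score below is an assumed stand-in: the fraction of abstract tokens that also appear in the article. The names `tokenize`, `alignment_score`, and `select_top_k` are hypothetical and chosen only for illustration.

```python
import re
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def alignment_score(article: str, abstract: str) -> float:
    """Assumed proxy for alignment: share of abstract tokens found in the article.

    This is NOT the paper's metric, only a simple token-overlap stand-in.
    """
    abstract_tokens = tokenize(abstract)
    if not abstract_tokens:
        return 0.0
    article_vocab = set(tokenize(article))
    supported = sum(1 for tok in abstract_tokens if tok in article_vocab)
    return supported / len(abstract_tokens)


def select_top_k(pairs: List[Tuple[str, str]], k: int) -> List[Tuple[str, str]]:
    """Keep the k (article, abstract) pairs with the highest alignment score."""
    ranked = sorted(pairs, key=lambda p: alignment_score(p[0], p[1]), reverse=True)
    return ranked[:k]


# Toy (article, abstract) pairs: the first abstract is fully supported by its
# article, the second introduces content the article never mentions.
pairs = [
    ("Aspirin reduced fever in treated mice.", "Aspirin reduced fever in mice."),
    ("Aspirin reduced fever in treated mice.", "Statins cure all cancers in humans."),
]
selected = select_top_k(pairs, k=1)
```

In this toy case the well-supported abstract is kept and the unsupported one is dropped. A random-sampling baseline would instead draw `k` pairs uniformly, which is the comparison the summary describes.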