AstroConcepts: A Large-Scale Multi-Label Classification Corpus for Astrophysics

Atilla Kaan Alkan, Felix Grezes, Sergi Blanco-Cuaresma, Jennifer Lynn Bartlett, Daniel Chivvis et al. | April 2, 2026 | arXiv

Key Takeaway

When classifying scientific text against thousands of rare concepts, vocabulary-constrained LLMs perform competitively with specialized models, which suggests that heavy domain adaptation is not always needed. Frequency-stratified evaluation is nonetheless critical for spotting performance gaps hidden by aggregate metr...
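
To make the point about frequency-stratified evaluation concrete, here is a minimal Python sketch. It is not the paper's evaluation code: the function names, the toy labels ("galaxies", "magnetars"), and the frequency bins are all hypothetical and chosen only for illustration. It assumes gold and predicted labels are given as one set per document, and it buckets labels by how often they appear in the training data.

```python
from collections import Counter


def per_label_f1(y_true, y_pred):
    """Return {label: F1} for multi-label data given as lists of sets."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        for label in gold & pred:
            tp[label] += 1
        for label in pred - gold:
            fp[label] += 1
        for label in gold - pred:
            fn[label] += 1
    labels = set(tp) | set(fp) | set(fn)
    # Every label here has at least one TP, FP, or FN, so the denominator is > 0.
    return {l: 2 * tp[l] / (2 * tp[l] + fp[l] + fn[l]) for l in labels}


def micro_f1(y_true, y_pred):
    """Aggregate F1 pooled over all (document, label) decisions."""
    tp = sum(len(g & p) for g, p in zip(y_true, y_pred))
    fp = sum(len(p - g) for g, p in zip(y_true, y_pred))
    fn = sum(len(g - p) for g, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0


def stratified_macro_f1(y_train, y_true, y_pred, bins):
    """Average per-label F1 within each training-frequency bin.

    bins maps a bin name to an inclusive (low, high) frequency range.
    The thresholds are an assumption of this sketch, not taken from the paper.
    """
    freq = Counter(label for labels in y_train for label in labels)
    f1 = per_label_f1(y_true, y_pred)
    result = {}
    for name, (low, high) in bins.items():
        scores = [s for label, s in f1.items() if low <= freq[label] <= high]
        result[name] = sum(scores) / len(scores) if scores else None
    return result


# Toy data: "galaxies" is frequent in training, "magnetars" is rare.
y_train = [{"galaxies"}] * 5 + [{"galaxies", "magnetars"}]
y_true = [{"galaxies"}, {"galaxies"}, {"galaxies", "magnetars"}, {"magnetars"}]
y_pred = [{"galaxies"}, {"galaxies"}, {"galaxies"}, {"galaxies"}]

bins = {"rare": (0, 1), "frequent": (2, float("inf"))}
aggregate = micro_f1(y_true, y_pred)
by_frequency = stratified_macro_f1(y_train, y_true, y_pred, bins)
print(f"micro-F1: {aggregate:.3f}")
print(by_frequency)
```

On this toy data the pooled micro-F1 looks moderate, while the per-bin view shows that the rare label is never predicted correctly. That is the kind of gap an aggregate score can hide.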

Summary

AstroConcepts is a dataset of 21,702 astrophysics paper abstracts labeled with 2,367 specialized astronomy concepts, designed to study extreme class imbalance in scientific text classification.

data evaluation applications

Key Terms

multi-label-classification class-imbalance domain-adaptation vocabulary-constrained frequency-stratified-evaluation