For mid-training, let the data sources themselves tell you which quality criteria matter: MIRA automatically discovers source-specific rubrics, making data selection both more effective and more scalable than selection with fixed evaluation criteria.
MIRA is a data selection method for mid-training large language models that automatically discovers what quality criteria matter for each data source, then uses those criteria to filter training data efficiently.
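To make the "per-source criteria, then filter" idea concrete, here is a minimal Python sketch. It is not MIRA's implementation: the rubric discovery step is not shown, and the names `Document`, `score`, `select`, the toy criteria, and the 0.5 threshold are all hypothetical placeholders chosen for illustration. It assumes only that each source ends up with its own list of criteria, each scoring a text in [0, 1].

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Document:
    source: str  # which data source the document came from
    text: str


# Hypothetical representation: a rubric is a list of criteria,
# each mapping a text to a score in [0, 1].
Rubric = List[Callable[[str], float]]


def score(doc: Document, rubrics: Dict[str, Rubric]) -> float:
    """Average the criteria of the rubric that belongs to the document's source."""
    criteria = rubrics[doc.source]
    return sum(criterion(doc.text) for criterion in criteria) / len(criteria)


def select(docs: List[Document], rubrics: Dict[str, Rubric], threshold: float) -> List[Document]:
    """Keep documents whose source-specific score reaches the threshold."""
    return [d for d in docs if score(d, rubrics) >= threshold]


# Toy, hand-written rubrics standing in for automatically discovered ones.
rubrics: Dict[str, Rubric] = {
    "code": [
        lambda t: 1.0 if "def " in t else 0.0,  # defines a function
        lambda t: 1.0 if "#" in t else 0.0,     # contains a comment
    ],
    "web": [
        lambda t: min(len(t.split()) / 5, 1.0),  # has at least a few words
    ],
}

docs = [
    Document("code", "def add(a, b):  # sum\n    return a + b"),
    Document("code", "x = 1"),
    Document("web", "click here"),
    Document("web", "Gradient descent updates weights along the negative gradient."),
]

kept = select(docs, rubrics, threshold=0.5)
```

The point of the sketch is the lookup `rubrics[doc.source]`: a code snippet and a web page are judged by different criteria rather than by one fixed checklist.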