For non-English language models, aggressively filtering training data for quality and repeating the filtered data over multiple epochs beats a single pass over a larger, less curated dataset: a practical insight for resource-constrained language model development.
This paper challenges the assumption that more diverse data is always better for language model training. For German, the researchers found that repeatedly training on a smaller, quality-filtered dataset outperforms a single pass over a larger, less-filtered dataset, even when the filtered data is repeated for up to 7 epochs.
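A minimal sketch of the compute-matched comparison implied above: keep only the highest-quality documents and repeat them until a fixed token budget is spent, versus spending the same budget on a single pass over the full corpus. All names here (`quality_score`, `build_schedule`, the toy corpus, and the token budget) are hypothetical illustrations, not the paper's actual filtering pipeline or training setup.

```python
# Hypothetical sketch of "filter aggressively, then repeat" vs. "one pass over everything".
# The real work would use a trained quality classifier and a full LM training loop.

def quality_score(doc: str) -> float:
    # Placeholder heuristic standing in for a real quality filter
    # (e.g. a classifier or perplexity-based score in the actual paper).
    return min(len(doc.split()) / 100.0, 1.0)

def build_schedule(documents: list[str], keep_fraction: float, token_budget: int) -> list[str]:
    """Keep the top `keep_fraction` of documents by quality score, then repeat
    that filtered subset until the fixed token budget is exhausted."""
    ranked = sorted(documents, key=quality_score, reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_fraction))]

    schedule, tokens_seen = [], 0
    while tokens_seen < token_budget:  # repetition over the subset = multiple epochs
        for doc in kept:
            schedule.append(doc)
            tokens_seen += len(doc.split())
            if tokens_seen >= token_budget:
                break
    return schedule

if __name__ == "__main__":
    corpus = [
        "kurzer text",
        "ein etwas laengerer deutscher beispieltext " * 5,
        "noch ein dokument mit mittlerer laenge " * 3,
    ]
    budget = 200  # toy token budget; both settings see the same number of tokens

    filtered_run = build_schedule(corpus, keep_fraction=0.3, token_budget=budget)  # few docs, many epochs
    diverse_run = build_schedule(corpus, keep_fraction=1.0, token_budget=budget)   # all docs, roughly one pass
    print(f"filtered schedule: {len(filtered_run)} docs, diverse schedule: {len(diverse_run)} docs")
```

The point of the sketch is only that both settings consume the same token budget; the paper's finding is that the repeated, filtered schedule trains the better German model.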