Removing redundant or low-frequency facts from training data lets models memorize more unique facts within their capacity limits, so a smaller model can match the fact-memorization performance of a larger one.
This paper shows that LLMs struggle to memorize facts when the training data contains too many facts or has a skewed frequency distribution. The researchers propose a data pruning method that selects which facts to include in training, enabling smaller models to memorize significantly more facts: a 110M-parameter model trained on pruned data matches a 1.3B-parameter model trained on the full data.
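The pruning intuition can be sketched in a few lines: drop facts that appear too rarely to be memorized, and cap how often any single fact repeats, so the surviving frequency distribution is flatter and model capacity is spread over more unique facts. The sketch below is illustrative only and not the paper's actual method; `min_count` and `max_repeats` are hypothetical parameters chosen for the example.

```python
from collections import Counter

def prune_facts(facts, min_count=2, max_repeats=100):
    """Illustrative pruning: drop facts seen fewer than `min_count` times
    (too rare to memorize) and cap each fact at `max_repeats` occurrences
    (flattening a skewed frequency distribution). Thresholds are
    hypothetical, not values from the paper."""
    counts = Counter(facts)
    kept = Counter()
    pruned = []
    for fact in facts:
        if counts[fact] < min_count:
            continue  # too rare: unlikely to be memorized anyway
        if kept[fact] >= max_repeats:
            continue  # redundant repeat: spend capacity on other facts
        kept[fact] += 1
        pruned.append(fact)
    return pruned

# Example: a skewed corpus where one fact dominates and one is a singleton.
corpus = (["Paris is the capital of France"] * 500
          + ["Tokyo is the capital of Japan"] * 5
          + ["Ottawa is the capital of Canada"])  # seen once: dropped
pruned = prune_facts(corpus)
print(len(corpus), "->", len(pruned))  # 506 -> 105
```

Capping repeats targets the redundancy problem while the minimum-count filter targets the long tail; together they shrink the corpus without shrinking the set of memorizable facts by much.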