The sequence of training data is as important as which data you select; reordering data using simple principles can boost LLM training efficiency and stability with minimal overhead.
This paper shows that the order in which data is fed to a language model during training significantly affects the result. The researchers identified four key principles for organizing training data (Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, and Local Diversity) and created two new data-ordering methods that improve training stability and performance without extra computational cost.
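The summary names the four principles but does not define them, so the sketch below is only an illustration of why reordering is cheap: it computes a permutation of sample indices once, before training, and leaves the data itself untouched. It is not the paper's method. The per-sample difficulty `scores`, the function name `cyclic_curriculum_order`, and the parameters `num_cycles` and `window` are all hypothetical, and the easy-to-hard cycles with small shuffled windows are an assumed reading of what a cyclic, locally mixed ordering might look like.

```python
import random


def cyclic_curriculum_order(scores, num_cycles, window, seed=0):
    """Return a permutation of range(len(scores)).

    Hypothetical illustration only: `scores` is an assumed per-sample
    difficulty value (lower = easier); the summary does not specify one.
    """
    rng = random.Random(seed)
    # Rank sample indices from easiest to hardest.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    order = []
    for c in range(num_cycles):
        # A strided slice gives each cycle its own easy-to-hard sweep
        # over the full difficulty range.
        cycle = ranked[c::num_cycles]
        # Shuffle inside small windows so neighbouring samples are mixed
        # while the overall easy-to-hard trend of the cycle is kept.
        for start in range(0, len(cycle), window):
            block = cycle[start:start + window]
            rng.shuffle(block)
            order.extend(block)
    return order


if __name__ == "__main__":
    difficulty = [0.5, 0.1, 0.9, 0.3, 0.7, 0.2]
    print(cyclic_curriculum_order(difficulty, num_cycles=2, window=2))
```

Because the output is just an index permutation, every sample is still seen exactly once per pass; the only added work is one sort and a few small shuffles done ahead of training.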