Training models with randomized positional encodings from a larger range than your data helps them generalize to much longer sequences without requiring long-context training data.
This paper proposes Randomized YaRN, a training method that helps language models generalize to much longer text sequences than they were trained on. By randomly sampling positional encodings from a larger range during training on short sequences, the model learns to handle longer contexts it hasn't seen before—improving performance on reasoning tasks with 16K-128K token contexts.