You can extend a transformer's context length at inference time by reusing and accumulating the KV cache across sequence chunks: no retraining is needed, and the approach remains numerically stable even over very long sequences.
KV-Fold enables long-context inference by treating the key-value cache as an accumulator that is passed between sequence chunks. The model processes each chunk while attending to the cached keys and values from previous chunks, allowing it to handle contexts of up to 128K tokens without retraining or architectural changes.
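To make the accumulator idea concrete, here is a minimal sketch of chunked processing with a carried-over KV cache, written against the Hugging Face `transformers` API. It is a generic illustration of the pattern described above, not the KV-Fold reference implementation; the model name, chunk size, and input text are placeholder assumptions.

```python
# Sketch: feed a long input in chunks, folding each chunk's keys/values
# into a cache that is passed forward to the next chunk.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM that supports use_cache
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

long_text = "..."  # placeholder: the long document to ingest
input_ids = tokenizer(long_text, return_tensors="pt").input_ids

chunk_size = 512          # assumption: chunk length for the example
past_key_values = None    # the accumulated KV cache carried between chunks

with torch.no_grad():
    for start in range(0, input_ids.size(1), chunk_size):
        chunk = input_ids[:, start:start + chunk_size]
        # Each chunk attends to the keys/values cached from earlier chunks.
        out = model(chunk, past_key_values=past_key_values, use_cache=True)
        # Fold the new chunk's KV entries into the accumulator.
        past_key_values = out.past_key_values

# past_key_values now covers the full ingested context and can seed generation.
```

In this pattern the cache grows with every chunk, so memory scales with total context length; whether attention over that accumulated cache stays well behaved beyond the model's training length is exactly the property the approach above claims to provide.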