VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay et al. | May 28, 2026 | arXiv

Key Takeaway

Low-rank KV cache compression works in video diffusion not because attention is inherently low-rank, but because the model learns to use whatever rank capacity it is given. This insight could make long-context generation more efficient in other domains as well.

Summary

This paper introduces VideoMLA, a technique that compresses the key-value (KV) cache in video diffusion models by storing a shared low-rank latent representation instead of separate per-head keys and values.
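
To make the idea of a shared low-rank cache concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the class name `LatentKVCache`, the projection names `w_down`, `w_up_k`, and `w_up_v`, and all dimensions are hypothetical, and random matrices stand in for learned weights. Positional encoding and windowing are omitted. The sketch only shows the storage pattern: one latent vector of size `rank` is cached per token, and per-head keys and values are rebuilt from it on demand.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rand_matrix(rows, cols, rng):
    """Random stand-in for a learned projection matrix (assumption)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class LatentKVCache:
    """Hypothetical cache that stores one shared latent per token."""

    def __init__(self, d_model, n_heads, head_dim, rank, seed=0):
        rng = random.Random(seed)
        # Shared down-projection: token feature -> low-rank latent.
        self.w_down = rand_matrix(d_model, rank, rng)
        # Per-head up-projections: latent -> key / value for that head.
        self.w_up_k = [rand_matrix(rank, head_dim, rng) for _ in range(n_heads)]
        self.w_up_v = [rand_matrix(rank, head_dim, rng) for _ in range(n_heads)]
        self.latents = []  # the only per-token state that is kept

    def append(self, x):
        """Compress one token feature vector and cache its latent."""
        self.latents.append(matmul([x], self.w_down)[0])

    def keys(self, head):
        """Rebuild this head's keys for all cached tokens."""
        return matmul(self.latents, self.w_up_k[head])

    def values(self, head):
        """Rebuild this head's values for all cached tokens."""
        return matmul(self.latents, self.w_up_v[head])

    def floats_stored(self):
        """Number of scalars held in the cache."""
        return sum(len(c) for c in self.latents)


def per_head_floats(n_tokens, n_heads, head_dim):
    """Scalars a conventional per-head KV cache would hold (keys + values)."""
    return 2 * n_tokens * n_heads * head_dim


cache = LatentKVCache(d_model=16, n_heads=4, head_dim=8, rank=6)
rng = random.Random(1)
for _ in range(5):
    cache.append([rng.gauss(0.0, 1.0) for _ in range(16)])

print(cache.floats_stored())      # 30
print(per_head_floats(5, 4, 8))   # 320
```

With these toy sizes the latent cache holds 30 scalars for five tokens, against 320 for per-head keys and values. The saving scales with the ratio of `rank` to `2 * n_heads * head_dim`, at the cost of the up-projection work done when keys and values are rebuilt.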

efficiency, architecture

Key Terms

kv-cache, low-rank-approximation, multi-head-latent-attention, sliding-window-attention, spectral-properties