Low-rank KV cache compression works in video diffusion not because attention is inherently low-rank, but because the model learns to use whatever rank capacity is available. This insight could improve the efficiency of long-context generation across domains.
This paper introduces VideoMLA, a technique that compresses the key-value cache in video diffusion models by storing shared low-rank representations instead of per-head keys and values.
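
To make the idea of a shared low-rank cache concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the projection names (`w_down`, `w_up_k`, `w_up_v`), the dimensions, and the random weights are all assumptions chosen for illustration. It shows only the general pattern of caching one low-rank latent per token and expanding it into per-head keys and values on demand.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
n_tokens, d_model, n_heads, head_dim, rank = 16, 64, 4, 16, 8

rng = np.random.default_rng(0)
x = rng.standard_normal((n_tokens, d_model))  # token hidden states

# Assumed projections: one shared down-projection, plus up-projections
# that recover the keys and values for every head from the same latent.
w_down = rng.standard_normal((d_model, rank)) / np.sqrt(d_model)
w_up_k = rng.standard_normal((rank, n_heads * head_dim)) / np.sqrt(rank)
w_up_v = rng.standard_normal((rank, n_heads * head_dim)) / np.sqrt(rank)

# Only this (n_tokens, rank) latent is kept in the cache.
latent_cache = x @ w_down

# Per-head keys and values are rebuilt from the latent when needed.
k = (latent_cache @ w_up_k).reshape(n_tokens, n_heads, head_dim)
v = (latent_cache @ w_up_v).reshape(n_tokens, n_heads, head_dim)

# Cache size in floats: separate per-head K and V versus the shared latent.
per_head_floats = 2 * n_tokens * n_heads * head_dim
latent_floats = n_tokens * rank
print(per_head_floats, latent_floats)
```

In this toy setting the reconstructed keys can never exceed rank `rank` across all heads combined, so `rank` is the capacity the model has to work with. That is the quantity the summary above refers to when it says the model learns to use whatever rank is available.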