Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Zefeng He et al.|May 1, 2026arXiv

Key Takeaway

PVM solves a fundamental problem in vision-language models where visual understanding degrades during long text generation by creating a separate, always-accessible pathway to visual information—improving reasoning tasks with minimal added parameters.

Summary

Large vision-language models struggle when generating long text because visual information gets diluted by accumulated text tokens. This paper introduces Persistent Visual Memory (PVM), a lightweight add-on module that maintains direct access to visual embeddings throughout generation, preventing the model from losing sight of the image as it produces longer outputs.

architecture multimodal efficiency

Key Terms

vision-language-models visual-signal-dilution attention-partition retrieval-pathway feed-forward-network