The degradation of visual understanding in models as generated text accumulates, causing attention to shift away from image tokens.