Background knowledge and patterns learned from text that models rely on, sometimes at the expense of visual information.