Multimodal representations that preserve the scene's spatial and geometric information, so that the context needed for disambiguation is retained.
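As a rough illustration of why retained geometry helps with disambiguation, the sketch below attaches normalized bounding-box geometry to a per-object appearance vector. It is only a minimal sketch under stated assumptions: `SceneObject`, `spatial_features`, `encode`, the two-dimensional appearance vectors, and the 640x480 image size are all hypothetical and do not come from the source. The point is that two objects with identical appearance stay distinguishable once position is part of the representation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SceneObject:
    """Hypothetical detected object: an appearance vector plus a pixel-space box."""
    label: str
    appearance: List[float]                  # assumed visual embedding (toy values)
    box: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in pixels


def spatial_features(box, image_w, image_h):
    """Return normalized geometry: box center x, center y, width, height."""
    x0, y0, x1, y1 = box
    return [
        (x0 + x1) / 2 / image_w,
        (y0 + y1) / 2 / image_h,
        (x1 - x0) / image_w,
        (y1 - y0) / image_h,
    ]


def encode(obj, image_w, image_h):
    """Concatenate appearance and geometry so position survives encoding."""
    return obj.appearance + spatial_features(obj.box, image_w, image_h)


# Two cups with identical appearance vectors at different positions.
left_cup = SceneObject("cup", [0.2, 0.9], (40, 200, 120, 300))
right_cup = SceneObject("cup", [0.2, 0.9], (500, 200, 580, 300))

left_vec = encode(left_cup, 640, 480)
right_vec = encode(right_cup, 640, 480)

# Index 2 is the normalized center x (it follows the two appearance entries),
# so a phrase like "the left cup" can be resolved from the encoding alone.
leftmost = min([left_cup, right_cup], key=lambda o: encode(o, 640, 480)[2])
```

With appearance alone the two cups are indistinguishable; the four appended geometry values are what keep the referent recoverable. A learned model would typically project these values rather than concatenate them raw, but that design choice is outside what the text specifies.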