A learned encoding that captures the layout, objects, and visual features within an individual video frame or a region of a frame.
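
As a rough illustration of the definition, the sketch below encodes each frame of a tiny grayscale video independently, turning every patch (region) into a small feature vector. It is only a toy under stated assumptions: the function name `encode_frame`, the patch size, and the feature dimension are hypothetical, and the random projection is a stand-in for weights that a real model would learn during training.

```python
import random


def encode_frame(frame, patch=2, dim=3, seed=0):
    """Map one grayscale frame (list of pixel rows) to a grid of per-region vectors."""
    rng = random.Random(seed)
    # Assumption: random weights stand in for a learned projection.
    weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(patch * patch)]
    rows, cols = len(frame) // patch, len(frame[0]) // patch
    grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Flatten the pixels of the patch at grid position (r, c).
            pixels = [
                frame[r * patch + i][c * patch + j]
                for i in range(patch)
                for j in range(patch)
            ]
            # Linear projection of the patch into a dim-sized feature vector.
            row.append(
                [sum(p * weights[k][d] for k, p in enumerate(pixels)) for d in range(dim)]
            )
        grid.append(row)
    return grid


# A made-up video: 3 frames, each 4 pixels high and 6 pixels wide.
video = [
    [[float((t + y + x) % 5) for x in range(6)] for y in range(4)]
    for t in range(3)
]

# Each frame is encoded on its own; no information is shared across time.
features = [encode_frame(frame) for frame in video]
```

The result keeps the spatial arrangement of the regions: `features[t][r][c]` is the vector for the patch at row `r`, column `c` of frame `t`, so layout within a frame is preserved while nothing about motion between frames is represented.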