A self-supervised learning approach that predicts future embeddings from video without reconstructing pixels.
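
As a rough illustration of the idea, the sketch below computes a loss between predicted and actual future-frame embeddings, so the error is measured in embedding space and no pixels are reconstructed. Everything here is an assumption made for illustration and is not taken from the source: the toy linear encoder, the linear predictor, the mean-squared-error loss, the one-frame prediction horizon, and the names `encode`, `embedding_prediction_loss`, `W_enc`, and `W_pred` are all hypothetical.

```python
import numpy as np


def encode(frames, W_enc):
    """Toy encoder (hypothetical): flatten each frame and project it to an embedding."""
    flat = frames.reshape(frames.shape[0], -1)   # (T, H*W)
    return np.tanh(flat @ W_enc)                 # (T, D)


def embedding_prediction_loss(context_frames, future_frames, W_enc, W_pred):
    """Mean squared error between predicted and actual future embeddings.

    The loss compares embeddings only; no pixel values are reconstructed.
    """
    z_context = encode(context_frames, W_enc)    # embeddings of the observed frames
    z_future = encode(future_frames, W_enc)      # embeddings of the frames to predict
    z_predicted = z_context @ W_pred             # toy linear predictor (assumption)
    return float(np.mean((z_predicted - z_future) ** 2))


rng = np.random.default_rng(0)
video = rng.random((8, 16, 16))                  # 8 synthetic grayscale frames, 16x16 each
W_enc = rng.normal(scale=0.05, size=(16 * 16, 32))
W_pred = rng.normal(scale=0.1, size=(32, 32))

# Pair each frame with the one that follows it (one-step horizon, an assumption).
context, future = video[:-1], video[1:]
loss = embedding_prediction_loss(context, future, W_enc, W_pred)
print(f"embedding prediction loss: {loss:.4f}")
```

In an actual training setup the encoder and predictor would be learned networks and the target embeddings would usually be protected against trivial collapse, which this sketch does not attempt to model.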