A training approach where a model learns to predict missing parts of video by understanding both spatial and temporal patterns without reconstructing actual pixels.