A self-supervised learning approach for audio that learns speech representations from unlabeled recordings by masking portions of the audio and predicting the hidden content, much as masked language models learn from text.
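The masked-prediction idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the method's actual implementation: the function names (`sample_span_mask`, `apply_mask`, `masked_mse`) are hypothetical, each audio frame is reduced to a single float, the masking probability and span length are arbitrary, and a mean squared error stands in for whatever training objective the real approach uses.

```python
import random


def sample_span_mask(num_frames, mask_prob=0.1, span_len=5, seed=0):
    """Return a list of booleans; True marks a frame hidden from the model.

    Each frame is chosen as a span start with probability mask_prob, and
    the span_len frames from that start are masked (spans may overlap).
    Both values are arbitrary illustrative choices.
    """
    rng = random.Random(seed)
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < mask_prob:
            for t in range(start, min(start + span_len, num_frames)):
                mask[t] = True
    return mask


def apply_mask(frames, mask, mask_value=0.0):
    """Replace every masked frame with a placeholder value."""
    return [mask_value if m else x for x, m in zip(frames, mask)]


def masked_mse(predictions, targets, mask):
    """Mean squared error over masked positions only.

    A stand-in objective (assumption): the model is scored only on the
    frames it could not see.
    """
    errors = [(p - t) ** 2 for p, t, m in zip(predictions, targets, mask) if m]
    return sum(errors) / len(errors) if errors else 0.0


# Toy "audio": one float per frame instead of a real feature vector.
frames = [float(i) for i in range(50)]
mask = sample_span_mask(len(frames))
corrupted = apply_mask(frames, mask)

# A real model would map `corrupted` to predictions; here the corrupted
# input itself plays the role of a trivial baseline that guesses the
# placeholder for every hidden frame.
baseline_loss = masked_mse(corrupted, frames, mask)
```

Scoring only the masked positions is the key design choice: the model gets no credit for copying frames it can already see, so it has to infer the hidden audio from the surrounding context.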