A neural network that compresses audio into discrete tokens for language model processing and reconstructs waveforms from those tokens.
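
The round trip described above (waveform to discrete tokens, then tokens back to a waveform) can be illustrated with a toy sketch. This is not a neural codec: it assumes a fixed random codebook and plain nearest-neighbor vector quantization in place of the learned encoder, quantizer, and decoder a real model would use. The names `FRAME_SIZE`, `CODEBOOK_SIZE`, `encode`, and `decode` are hypothetical and chosen only for illustration.

```python
import math
import random

FRAME_SIZE = 4      # samples covered by one token (assumed value)
CODEBOOK_SIZE = 64  # number of distinct token IDs (assumed value)


def nearest(frame, codebook):
    """Return the index of the codebook vector closest to `frame`."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(frame, codebook[i])),
    )


def encode(waveform, codebook):
    """Compress a waveform into one integer token per frame."""
    n_frames = len(waveform) // FRAME_SIZE  # trailing partial frame is dropped
    return [
        nearest(waveform[i * FRAME_SIZE:(i + 1) * FRAME_SIZE], codebook)
        for i in range(n_frames)
    ]


def decode(tokens, codebook):
    """Reconstruct an approximate waveform by looking up each token."""
    return [sample for t in tokens for sample in codebook[t]]


# A random codebook stands in for one that a real codec would learn.
random.seed(0)
codebook = [
    [random.uniform(-1.0, 1.0) for _ in range(FRAME_SIZE)]
    for _ in range(CODEBOOK_SIZE)
]

# 0.1 s of a 440 Hz sine at 16 kHz as a stand-in for real audio.
waveform = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]

tokens = encode(waveform, codebook)          # discrete IDs a language model could consume
reconstruction = decode(tokens, codebook)    # lossy approximation of the input
```

Here each token replaces `FRAME_SIZE` samples, so the token sequence is shorter than the waveform, and the reconstruction is lossy because every frame is snapped to its nearest codebook entry. A trained codec would aim for far better fidelity at a comparable token rate.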