A single neural network that generates output one token at a time, using the same architecture for every modality.
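As a rough illustration of the idea, the sketch below runs one generation loop over a shared token vocabulary. Everything in it is an assumption made for the example: the vocabulary split (`TEXT_RANGE`, `IMAGE_RANGE`), the stand-in `toy_model`, and greedy selection are hypothetical and are not taken from any particular system.

```python
import random

# Hypothetical shared vocabulary: ids 0-9 stand for text tokens,
# ids 10-19 stand for image tokens. A real system would use far larger ranges.
TEXT_RANGE = range(0, 10)
IMAGE_RANGE = range(10, 20)
VOCAB_SIZE = 20


def modality_of(token_id: int) -> str:
    """Map a token id back to the modality it belongs to."""
    return "text" if token_id in TEXT_RANGE else "image"


def toy_model(context: list[int]) -> list[float]:
    """Stand-in for the single network: one score per vocabulary entry.

    The scores are pseudo-random but depend only on the context,
    so the same context always yields the same scores.
    """
    rng = random.Random(sum(context) + len(context))
    return [rng.random() for _ in range(VOCAB_SIZE)]


def generate(prompt: list[int], steps: int) -> list[int]:
    """Append `steps` tokens, one at a time, using the same model for every token."""
    tokens = list(prompt)
    for _ in range(steps):
        scores = toy_model(tokens)
        # Greedy choice: pick the highest-scoring id, whatever its modality.
        next_id = max(range(VOCAB_SIZE), key=scores.__getitem__)
        tokens.append(next_id)
    return tokens


if __name__ == "__main__":
    out = generate([1, 12], steps=5)  # prompt mixes a text id and an image id
    print([(t, modality_of(t)) for t in out])
```

The point of the sketch is that `generate` never branches on modality: text and image tokens pass through the same function and the same loop, and the modality is only recovered afterwards from the token id.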