Multi-token prediction helps LLMs learn better world models than single-token prediction does, but it requires grounding in actual state representations to avoid learning shortcuts that violate real-world constraints.
This paper investigates whether large language models develop coherent internal world models by comparing next-token prediction with multi-token prediction. The authors propose LSE-MTP, a method that anchors token predictions to ground-truth hidden states to reduce hallucinations and improve the model's ability to learn structured representations of the world.
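The paper does not spell out the training objective here, but plain multi-token prediction can be illustrated with a minimal sketch: each position predicts the next k future tokens through separate heads, and the losses are summed. All names (`multi_token_loss`, the head matrices) are illustrative assumptions, not the paper's LSE-MTP method, which additionally anchors predictions to hidden states.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_token_loss(hidden, heads, targets):
    """Mean cross-entropy over k prediction heads (illustrative sketch).

    hidden:  (T, d) hidden states, one per sequence position
    heads:   list of k (d, V) projection matrices; head j predicts token t+1+j
    targets: (T,) token ids of the sequence
    With k = 1 this reduces to ordinary next-token prediction.
    """
    T, _ = hidden.shape
    total, count = 0.0, 0
    for j, W in enumerate(heads):
        valid = T - 1 - j          # positions that have a token j+1 steps ahead
        if valid <= 0:
            continue
        probs = softmax(hidden[:valid] @ W)        # (valid, V)
        tgt = targets[1 + j : 1 + j + valid]       # the future tokens to hit
        total += -np.log(probs[np.arange(valid), tgt] + 1e-12).sum()
        count += valid
    return total / count
```

LSE-MTP, as described above, would add a further term tying these predictions to state representations; since its exact form is not given here, the sketch shows only the multi-token part.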