Token initialization is a critical bottleneck when extending language models with new vocabulary: grounding new tokens in semantically meaningful positions before fine-tuning substantially improves downstream task performance.
When language models add new vocabulary tokens for specific tasks such as recommendation, the new tokens are typically initialized as averages of existing embeddings. This paper shows that this approach fails: the new tokens all collapse into the same narrow subspace and lose their distinctiveness.
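The collapse can be seen directly in a minimal NumPy sketch. The vocabulary size, embedding dimension, and subset-averaging variant below are illustrative assumptions, not the paper's actual setup: averaging shrinks each new embedding toward the global mean, so the new tokens end up packed far more tightly together than the existing ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical existing vocabulary: 1000 embeddings in 64 dimensions.
vocab = rng.normal(size=(1000, 64))

def mean_init(vocab, n_new, subset_size, rng):
    """Initialize each new token as the average of a random subset
    of existing embeddings (an illustrative averaging variant)."""
    idx = rng.integers(0, vocab.shape[0], size=(n_new, subset_size))
    return vocab[idx].mean(axis=1)

new_tokens = mean_init(vocab, n_new=50, subset_size=100, rng=rng)

def mean_pairwise_dist(X):
    """Average Euclidean distance over all distinct pairs of rows."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return d[np.triu_indices_from(d, k=1)].mean()

# Averaging over k embeddings scales the per-coordinate variance by ~1/k,
# so new-token pairwise distances shrink by roughly sqrt(1/k) relative to
# the existing vocabulary -- they cluster in one small region.
ratio = mean_pairwise_dist(new_tokens) / mean_pairwise_dist(vocab)
print(ratio < 0.2)  # the new tokens are far less spread out
```

With 100-element subsets the distance ratio lands near 1/sqrt(100) = 0.1, which is the collapse the paper describes: every new token sits in essentially the same place, regardless of what it is meant to represent.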