Token initialization is a critical bottleneck when extending language models with new vocabulary: grounding new tokens in semantically meaningful positions before fine-tuning substantially improves downstream task performance.
When language models add new vocabulary tokens for specific tasks such as recommendation, the new tokens are typically initialized as averages of existing embeddings. This paper shows that this approach fails: the new tokens all collapse into the same narrow subspace and lose their distinctiveness.
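The collapse can be seen directly in a minimal NumPy sketch. The vocabulary size, embedding dimension, and subset-averaging variant below are illustrative assumptions, not the paper's actual setup: averaging shrinks each new embedding toward the global mean, so the new tokens end up packed far more tightly together than the existing ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical existing vocabulary: 1000 embeddings in 64 dimensions.
vocab = rng.normal(size=(1000, 64))

def mean_init(vocab, n_new, subset_size, rng):
    """Initialize each new token as the average of a random subset
    of existing embeddings (an illustrative averaging variant)."""
    idx = rng.integers(0, vocab.shape[0], size=(n_new, subset_size))
    return vocab[idx].mean(axis=1)

new_tokens = mean_init(vocab, n_new=50, subset_size=100, rng=rng)

def mean_pairwise_dist(X):
    """Average Euclidean distance over all distinct pairs of rows."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return d[np.triu_indices_from(d, k=1)].mean()

# Averaging over k embeddings scales the per-coordinate variance by ~1/k,
# so new-token pairwise distances shrink by roughly sqrt(1/k) relative to
# the existing vocabulary -- they cluster in one small region.
ratio = mean_pairwise_dist(new_tokens) / mean_pairwise_dist(vocab)
print(ratio < 0.2)  # the new tokens are far less spread out
```

With 100-element subsets the distance ratio lands near 1/sqrt(100) = 0.1, which is the collapse the paper describes: every new token sits in essentially the same place, regardless of what it is meant to represent.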