When building coreference systems for software mentions, choose between lexical and contextual methods based on your upstream noise type and corpus size: embeddings handle boundary noise better and scale more efficiently to large corpora, while string matching degrades more gracefully under substitution errors.
This paper compares two approaches for identifying when software names refer to the same project across documents: a simple string-matching method and an embedding-based approach. Testing on noisy data shows they fail in different ways: embeddings handle boundary errors better, while string matching handles substitution errors better. Embeddings also scale more efficiently to large datasets.
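The contrast between the two failure modes can be illustrated with a self-contained sketch. The names, noise variants, and the character-trigram vector below are illustrative assumptions, not the paper's actual models: `string_sim` stands in for the lexical matcher (difflib's edit-based ratio), and a trigram count vector with cosine similarity stands in for a learned embedding.

```python
from difflib import SequenceMatcher
from math import sqrt

def string_sim(a: str, b: str) -> float:
    """Lexical similarity: normalized edit-based ratio (the string-matching side)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def trigram_vec(s: str) -> dict:
    """Character-trigram count vector; a cheap stand-in for a learned embedding."""
    s = s.lower()
    vec = {}
    for i in range(len(s) - 2):
        gram = s[i:i + 3]
        vec[gram] = vec.get(gram, 0) + 1
    return vec

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c * v.get(g, 0) for g, c in u.items())
    nu = sqrt(sum(c * c for c in u.values()))
    nv = sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical mentions of one project under the two noise types.
canonical = "TensorFlow"
boundary = "TensorFlo"      # boundary error: mention span truncated by the tagger
substituted = "TensorFlaw"  # substitution error: one character replaced

for variant in (boundary, substituted):
    lex = string_sim(canonical, variant)
    ctx = cosine(trigram_vec(canonical), trigram_vec(variant))
    print(f"{variant}: lexical={lex:.2f}, trigram-cosine={ctx:.2f}")
```

Under these toy inputs, truncation removes only the trigrams at the affected boundary, so the vector similarity stays high, whereas a mid-string substitution destroys every trigram spanning the changed character; the edit-based ratio penalizes both variants by a comparable single-character cost.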