You can get high-quality topic discovery by using LLMs to label a small sample of documents, then training a cheap, locally deployable embedding model on those labels, avoiding the cost of querying LLMs at scale.
PRISM combines large language models with lightweight clustering to discover precise topics in text. It fine-tunes a sentence encoder on sparse LLM-generated labels, then clusters the resulting embeddings to separate closely related topics. This achieves better results than existing methods while requiring far fewer LLM queries, and the final model remains interpretable and deployable locally.
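The pipeline above can be sketched in miniature. This is not PRISM's actual implementation: the synthetic embeddings, the use of linear discriminant analysis as a stand-in for encoder fine-tuning, and all sizes and hyperparameters below are illustrative assumptions. The point is the shape of the method: label a small sample (here, by revealing ground truth rather than calling an LLM), adapt the representation on those sparse labels, then cluster everything.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Simulated document embeddings: 3 latent topics in 32-d space.
# In practice these would come from a pretrained sentence encoder.
centers = rng.normal(size=(3, 32))
true_topics = rng.integers(0, 3, size=300)
X = centers[true_topics] + 0.5 * rng.normal(size=(300, 32))

# Step 1: "LLM labeling" of a small sample -- simulated here by
# revealing the true topic for only 30 of the 300 documents.
labeled_idx = rng.choice(300, size=30, replace=False)
y_sparse = true_topics[labeled_idx]

# Step 2: train a cheap supervised projection on the sparse labels
# (a linear stand-in for fine-tuning the sentence encoder).
lda = LinearDiscriminantAnalysis(n_components=2)
Z = lda.fit(X[labeled_idx], y_sparse).transform(X)

# Step 3: cluster the adapted embeddings over the full corpus
# to recover topic assignments for every document.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
print(np.bincount(clusters))  # documents per discovered topic
```

Note the division of labor: the expensive labeler touches only 10% of the corpus, while the cheap projection and clustering run over everything and could be re-run locally on new documents at no LLM cost.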