Style diversity in synthetic data matters more than topic diversity for training intent classifiers: varying how things are said helps more than varying what is discussed.
This paper presents a method for generating synthetic training data for intent classification without any human annotations. The approach combines intent definitions with LLM generation under style and topic diversity controls, then applies post-hoc stylization models to create varied, realistic dialogue. Results show that the synthetic data reaches 93% of the performance achieved with human-annotated data.
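To make the idea of style and topic diversity controls concrete, here is a minimal sketch of how generation prompts might be assembled from intent definitions. It is an illustration under stated assumptions, not the paper's implementation: the intent names, style list, topic list, prompt wording, and the `build_prompts` helper are all hypothetical, and the sketch stops before any LLM call or post-hoc stylization step.

```python
import itertools
import random

# Hypothetical intent definitions; the paper's actual intents are not listed in this summary.
INTENT_DEFINITIONS = {
    "cancel_order": "The user wants to cancel an order they have placed.",
    "track_order": "The user wants to know where their order is.",
}

# Hypothetical diversity controls: styles vary how something is said,
# topics vary what is being discussed.
STYLES = ["terse", "polite and formal", "frustrated", "casual with slang"]
TOPICS = ["electronics", "groceries", "clothing"]


def build_prompts(intents, styles, topics, per_intent, seed=0):
    """Return (intent_label, prompt) pairs with distinct style/topic pairs per intent."""
    rng = random.Random(seed)
    combos = list(itertools.product(styles, topics))
    prompts = []
    for label, definition in intents.items():
        # Sample without replacement so no style/topic pair repeats within an intent.
        chosen = rng.sample(combos, k=min(per_intent, len(combos)))
        for style, topic in chosen:
            prompt = (
                f"Intent definition: {definition}\n"
                "Write one user utterance expressing this intent.\n"
                f"Style: {style}\n"
                f"Topic: {topic}"
            )
            prompts.append((label, prompt))
    return prompts


prompts = build_prompts(INTENT_DEFINITIONS, STYLES, TOPICS, per_intent=5)
```

Each prompt would then be sent to a text generator of your choice, and the returned utterance would inherit the intent label for free, which is what removes the need for human annotation. Holding the topic list fixed while widening the style list (or the reverse) is one way to probe the style-versus-topic comparison described above.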