The Significance of Style Diversity in Annotation-Free Synthetic Data Generation

Zahra Abbasiantaeb, Zeno Belligoli, Omar Essam, Mohammad Aliannejadi | June 18, 2026 | arXiv

Key Takeaway

Style diversity in synthetic data matters more than topic diversity for training intent classifiers: varying how things are said helps more than varying what is discussed.

Summary

This paper presents a method for generating synthetic training data for intent classification without any human annotations. The approach starts from intent definitions and uses LLM generation with style and topic diversity controls, followed by post-hoc stylization models, to create varied, realistic dialogue. Results show that the synthetic data reaches 93% of the performance obtained with human-annotated data.
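To make the idea of style and topic diversity controls concrete, here is a minimal sketch of how generation prompts might be assembled from intent definitions. It is an illustration under stated assumptions, not the authors' pipeline: the intent names, style labels, topic labels, prompt wording, and the `build_prompts` helper are all hypothetical, and no LLM is actually called.

```python
import itertools
import random

# Hypothetical intent definitions; the paper's actual intents are not listed in this summary.
INTENTS = {
    "book_flight": "The user wants to book a plane ticket.",
    "cancel_booking": "The user wants to cancel an existing reservation.",
}

# Hypothetical diversity controls: style varies HOW something is said,
# topic varies WHAT the conversation is about.
STYLES = ["formal", "casual", "terse", "verbose"]
TOPICS = ["business travel", "family holiday"]


def build_prompts(intents, styles, topics, per_intent, seed=0):
    """Return one prompt record per sampled (intent, style, topic) combination."""
    rng = random.Random(seed)
    combos = list(itertools.product(styles, topics))
    prompts = []
    for name, definition in intents.items():
        # Sample without replacement so each intent gets distinct combinations.
        k = min(per_intent, len(combos))
        for style, topic in rng.sample(combos, k):
            text = (
                f"Intent definition: {definition}\n"
                f"Write one user utterance expressing this intent.\n"
                f"Style: {style}\n"
                f"Topic: {topic}"
            )
            prompts.append(
                {"intent": name, "style": style, "topic": topic, "prompt": text}
            )
    return prompts


if __name__ == "__main__":
    for record in build_prompts(INTENTS, STYLES, TOPICS, per_intent=3):
        print(record["intent"], "|", record["style"], "|", record["topic"])
```

Each record keeps its intent label, so the text an LLM returns for a prompt can be used as a labeled training example without human annotation. Holding the topic list fixed while enlarging the style list (or the reverse) is one simple way to compare the two kinds of diversity.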

data training evaluation

Key Terms

synthetic-data intent-classification llm-as-a-judge data-diversity