You can use synthetic data for research if you can prove your task is exchangeable with historical tasks for which real data is available; under that condition, this framework provides statistical guarantees that your conclusions remain valid.
This paper provides statistical methods for safely using synthetic data in research. It introduces 'task exchangeability', a condition under which the current research task can be treated as statistically interchangeable with past tasks for which real data exists. The authors develop inference techniques with validity guarantees under this condition and test them on LLM-generated survey responses and AI evaluation tasks.
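To make the role of exchangeability concrete, here is a minimal sketch of one standard exchangeability-based technique, a split-conformal interval. It is an illustration only and is not claimed to be the authors' procedure. The function name `conformal_interval`, its parameters, and the toy numbers are all hypothetical. The idea: on each historical task we know both the synthetic-data estimate and the real-data estimate, so we can measure how far off the synthetic estimate was. If the new task is exchangeable with those historical tasks, a suitable quantile of the past errors bounds the error on the new task with probability at least `1 - alpha`.

```python
import math


def conformal_interval(hist_synth, hist_real, new_synth, alpha=0.1):
    """Interval for the unseen real-data estimate on a new task.

    Assumption (hypothetical setup, not taken from the paper): the new task
    is exchangeable with the historical tasks, so its synthetic-vs-real
    error is equally likely to fall at any rank among the past errors.
    """
    if len(hist_synth) != len(hist_real) or len(hist_synth) == 0:
        raise ValueError("need one synthetic and one real estimate per historical task")
    if not 0 < alpha < 1:
        raise ValueError("alpha must be in (0, 1)")

    # Nonconformity score: absolute gap between synthetic and real estimates.
    scores = sorted(abs(s - r) for s, r in zip(hist_synth, hist_real))
    n = len(scores)

    # Finite-sample corrected rank used in split conformal prediction.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few historical tasks for this confidence level: no finite bound.
        return (-math.inf, math.inf)

    q = scores[k - 1]
    return (new_synth - q, new_synth + q)


# Toy example with made-up numbers: nine historical tasks, one new task.
hist_synth = [10, 12, 9, 15, 11, 14, 8, 13, 10]
hist_real = [11, 12, 7, 12, 11, 18, 8, 14, 5]
print(conformal_interval(hist_synth, hist_real, new_synth=10, alpha=0.5))
```

The guarantee in this sketch is marginal: it holds on average over tasks drawn exchangeably, not for any one task in particular. With few historical tasks and a small `alpha`, the function returns an unbounded interval, which reflects that the available evidence cannot support that confidence level.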