You can audit the composition of an LLM's training data by analyzing its outputs, even without access to the original training corpus: statistical techniques correct for classifier confusion and recover the underlying data mixture.
This paper introduces a method for estimating the composition of a large language model's training data from the text the model generates.
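To make the idea of correcting for classifier confusion concrete, here is a minimal sketch, not the paper's actual estimator. It assumes generated samples have already been labeled by some category classifier, and that the classifier's confusion matrix has been measured on held-out data with known labels. The function name `estimate_mixture`, the three categories, and all numbers are hypothetical. The update is a standard expectation-maximization step for mixture weights, swapped in here for illustration.

```python
def estimate_mixture(predicted_share, confusion, iterations=500):
    """Estimate true category shares from confusion-distorted predicted shares.

    predicted_share[j]: fraction of generated samples the classifier labeled j.
    confusion[i][j]:    P(classifier says j | true category is i), rows sum to 1.
    Both inputs are assumptions of this sketch, not quantities from the paper.
    """
    n_true = len(confusion)
    n_pred = len(predicted_share)
    # Start from a uniform guess over the true categories.
    p = [1.0 / n_true] * n_true
    for _ in range(iterations):
        # Predicted-label distribution implied by the current estimate.
        implied = [
            sum(p[i] * confusion[i][j] for i in range(n_true))
            for j in range(n_pred)
        ]
        # EM update: reweight each category by how well it explains
        # the observed predicted-label shares.
        p = [
            p[i] * sum(
                predicted_share[j] * confusion[i][j] / implied[j]
                for j in range(n_pred)
                if implied[j] > 0
            )
            for i in range(n_true)
        ]
    return p


# Hypothetical example: three categories with a noisy classifier.
confusion = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.10, 0.20, 0.70],
]
# Shares the classifier would report if the true mixture were [0.6, 0.3, 0.1].
observed = [0.58, 0.29, 0.13]
estimate = estimate_mixture(observed, confusion)
print([round(x, 3) for x in estimate])
```

The raw classifier shares (0.58, 0.29, 0.13) differ from the mixture that produced them because some samples are mislabeled. The correction uses the measured error rates to undo that distortion, and the update keeps the estimate non-negative and summing to one.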