Synthetic data can efficiently teach a model the grammatical structure of a low-resource language, but semantic understanding requires authentic data. Synthetic bootstrapping therefore works best as a primer before curriculum learning with real examples.
This paper tackles machine translation for Q'eqchi' Mayan, an Indigenous language with almost no digital text. Instead of scraping the web (which violates data sovereignty), the researchers generated synthetic training data from dictionaries and fine-tuned a multilingual model using LoRA adapters.
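
To make the idea of dictionary-derived synthetic data concrete, here is a minimal sketch of template-based sentence-pair generation. It is not the paper's pipeline: the lexicon entries are placeholder tokens rather than real Q'eqchi' words, and the word orders in the template, the `LEXICON` and `TEMPLATES` names, and the `generate_pairs` helper are all assumptions made up for illustration.

```python
import itertools

# Hypothetical bilingual lexicon. The source-side entries are placeholder
# tokens, NOT real Q'eqchi' words; a real lexicon would come from a dictionary.
LEXICON = {
    "noun": [("src_noun_1", "dog"), ("src_noun_2", "child")],
    "verb": [("src_verb_1", "sees"), ("src_verb_2", "hears")],
}

# Hypothetical templates: the slot order on each side of the pair.
# The orders are chosen arbitrarily for illustration.
TEMPLATES = [
    {"src": ["verb", "obj", "subj"], "tgt": ["subj", "verb", "obj"]},
]


def generate_pairs(lexicon, templates):
    """Fill every template with every lexicon combination.

    Returns a list of (source_sentence, target_sentence) tuples.
    """
    pairs = []
    for template in templates:
        combos = itertools.product(
            lexicon["noun"], lexicon["verb"], lexicon["noun"]
        )
        for subj, verb, obj in combos:
            if subj == obj:
                continue  # skip sentences where subject and object coincide
            slots = {"subj": subj, "verb": verb, "obj": obj}
            # Source side: join the placeholder tokens in template order.
            src = " ".join(slots[name][0] for name in template["src"])
            # Target side: English glosses, with "the" added before nouns.
            tgt = " ".join(
                slots[name][1] if name == "verb" else "the " + slots[name][1]
                for name in template["tgt"]
            )
            pairs.append((src, tgt))
    return pairs


pairs = generate_pairs(LEXICON, TEMPLATES)
for src, tgt in pairs:
    print(f"{src}\t{tgt}")
```

Pairs built this way vary word order and slot fillers systematically, which is consistent with the claim that synthetic data conveys grammatical structure. Every sentence is also semantically arbitrary, which hints at why authentic examples are still needed for meaning.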