Data Synthesis and Parameter-Efficient Fine-Tuning for Low-Resource NMT: A Case Study on Q'eqchi' Mayan

Alexander Chulzhanov, Soeren Eberhardt, Arjun Mukherjee | June 8, 2026 | arXiv

Key Takeaway

Synthetic data can efficiently teach a model the grammatical structure of a low-resource language, but semantic understanding requires authentic data. Synthetic bootstrapping therefore works best as a primer, followed by curriculum learning on real examples.
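
The ordering described in this takeaway can be illustrated with a minimal sketch. It is not the authors' training pipeline: the function name `staged_schedule`, the `primer_epochs` parameter, and the use of source-sentence length as a difficulty proxy for the curriculum are all assumptions made for illustration, and the sentence pairs are placeholder tokens rather than real Q'eqchi' data.

```python
def staged_schedule(synthetic_pairs, authentic_pairs, primer_epochs=1):
    """Return a training order: synthetic primer first, then authentic pairs.

    Each pair is a (source, target) tuple of strings. Hypothetical helper,
    not taken from the paper.
    """
    # Phase 1: repeat the synthetic data as a structural "primer".
    schedule = [
        ("synthetic", pair)
        for _ in range(primer_epochs)
        for pair in synthetic_pairs
    ]
    # Phase 2: authentic data, shortest source sentences first.
    # Length as a difficulty measure is an assumption of this sketch.
    ordered = sorted(authentic_pairs, key=lambda pair: len(pair[0].split()))
    schedule += [("authentic", pair) for pair in ordered]
    return schedule


# Placeholder tokens stand in for real sentence pairs.
synthetic = [("s1 s2", "t1 t2")]
authentic = [("a b c", "x y z"), ("a", "x")]
plan = staged_schedule(synthetic, authentic, primer_epochs=2)
```

In a real setup each entry of `plan` would feed a training step; the point here is only the phase order, with synthetic examples exhausted before any authentic example is seen.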

Summary

This paper tackles machine translation for Q'eqchi' Mayan, an Indigenous language with almost no digital text. Instead of scraping the web (which would violate data sovereignty), the authors created synthetic training data from dictionaries and fine-tuned a multilingual model with LoRA adapters.
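
For readers unfamiliar with LoRA, the sketch below shows the general idea in plain Python: the pretrained weight matrix W stays frozen and only a low-rank update (alpha / r) * B A is trained. This follows the standard LoRA formulation, not any configuration reported for this paper; the helper names, the rank, the `alpha` value, and the matrix sizes are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: frozen weight, shape d_out x d_in
    b: trainable factor, shape d_out x r
    a: trainable factor, shape r x d_in
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_trainable_params(d_out, d_in, r):
    """Number of trainable values in B (d_out x r) plus A (r x d_in)."""
    return r * (d_out + d_in)


# Toy 2x3 layer with rank 1. B starts at zero, as in standard LoRA,
# so the adapted layer initially matches the frozen one.
w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
a = [[0.5, -0.5, 1.0]]
b_init = [[0.0], [0.0]]
w_start = lora_effective_weight(w, a, b_init, alpha=2.0)

# After training, B is nonzero and shifts the effective weight.
b_trained = [[1.0], [0.0]]
w_adapted = lora_effective_weight(w, a, b_trained, alpha=2.0)
```

The parameter saving is the reason this suits low-resource settings: for a hypothetical 1024 x 1024 layer at rank 8, `lora_trainable_params(1024, 1024, 8)` counts 16,384 trainable values against 1,048,576 in the full matrix.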

data efficiency training

Key Terms

parameter-efficient-fine-tuning lora synthetic-data curriculum-learning negative-transfer