You can select optimal training data 40x faster than competing methods by matching source distributions through clustering and target distributions through regression, without sacrificing quality.
CRAFT is a fast method for selecting high-quality training subsets from massive datasets. It combines clustering with statistical matching to pick training examples whose target outputs align with your validation set, and it can select fine-tuning data for translation models from millions of examples in under a minute.
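To make the two-stage idea concrete, here is a minimal sketch in Python with NumPy. It is not CRAFT's actual implementation. It assumes sentence embeddings for the source and target sides are already computed, it uses a plain k-means for the clustering stage, and it swaps the regression-based target matching for a simpler stand-in: ranking examples by distance to the validation target centroid of their cluster. The names `kmeans`, `assign`, and `select_subset`, and the parameters `budget` and `k`, are all hypothetical.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means on float rows of x; returns the k cluster centers."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(x, centers)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old center if a cluster empties out
                centers[j] = members.mean(axis=0)
    return centers


def assign(x, centers):
    """Index of the nearest center (squared Euclidean distance) for each row."""
    d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def select_subset(train_src, train_tgt, val_src, val_tgt, budget, k=8, seed=0):
    """Pick roughly `budget` training indices.

    Stage 1 (source side): cluster training source embeddings and split the
    budget across clusters in proportion to where validation sources fall.
    Stage 2 (target side, assumed stand-in for regression): inside each
    cluster, keep the examples whose target embedding is closest to the
    mean validation target embedding of that cluster.
    """
    centers = kmeans(train_src, k, seed=seed)
    train_lab = assign(train_src, centers)
    val_lab = assign(val_src, centers)
    val_share = np.bincount(val_lab, minlength=k) / len(val_lab)

    chosen = []
    for j in range(k):
        pool = np.flatnonzero(train_lab == j)
        quota = min(len(pool), int(round(budget * val_share[j])))
        if quota == 0:
            continue
        ref = val_tgt[val_lab == j].mean(axis=0)
        dist = np.linalg.norm(train_tgt[pool] - ref, axis=1)
        chosen.extend(pool[np.argsort(dist)[:quota]].tolist())
    return np.array(chosen, dtype=int)
```

Because each cluster's quota is rounded independently, the returned subset can differ from `budget` by a few examples. The dense distance matrix in `assign` is fine for a toy run, but a dataset with millions of rows would need chunked or approximate nearest-center assignment.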