CRAFT: Clustered Regression for Adaptive Filtering of Training data

Parthasarathi Panda, Asheswari Swain, Subhrakanta Panda | April 24, 2026 | arXiv

Key Takeaway

You can select training data 40x faster than competing methods without sacrificing quality, by matching the source distribution through clustering and the target distribution through regression.

Summary

CRAFT is a fast method for selecting high-quality training data subsets from massive datasets. It uses clustering and statistical matching to pick training examples whose target outputs align with your validation set, enabling efficient fine-tuning of translation models on millions of examples in under a minute.
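The general idea of cluster-based selection can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes source and target embeddings are already computed and that k-means centroids are already fitted. It also replaces the regression-based target matching with a plain mean distance to validation targets in the same cluster. All function names (`nearest_centroid`, `cluster_histogram`, `kl_divergence`, `select_subset`) are hypothetical.

```python
import numpy as np

def nearest_centroid(x, centroids):
    """Assign each row of x to its closest centroid (squared Euclidean)."""
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def cluster_histogram(labels, k, eps=1e-9):
    """Smoothed share of examples per cluster; eps avoids log(0) in the KL term."""
    counts = np.bincount(labels, minlength=k).astype(float) + eps
    return counts / counts.sum()

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same clusters."""
    return float(np.sum(p * np.log(p / q)))

def select_subset(pool_src, pool_tgt, val_src, val_tgt, centroids, budget):
    """Pick about `budget` pool indices.

    Assumption: per-cluster quotas follow the validation set's cluster shares
    (source matching), and within a cluster the examples whose targets lie
    closest on average to the validation targets are kept (a simple stand-in
    for the paper's regression-based target matching).
    """
    k = len(centroids)
    pool_labels = nearest_centroid(pool_src, centroids)
    val_labels = nearest_centroid(val_src, centroids)
    val_hist = cluster_histogram(val_labels, k)

    chosen = []
    for c in range(k):
        pool_idx = np.flatnonzero(pool_labels == c)
        val_idx = np.flatnonzero(val_labels == c)
        quota = min(int(round(budget * val_hist[c])), len(pool_idx))
        if quota == 0 or len(val_idx) == 0:
            continue
        # Mean distance from each pool target to the validation targets in cluster c.
        diffs = pool_tgt[pool_idx][:, None, :] - val_tgt[val_idx][None, :, :]
        dist = np.linalg.norm(diffs, axis=2).mean(axis=1)
        chosen.extend(pool_idx[np.argsort(dist)[:quota]].tolist())
    return np.array(chosen, dtype=int)
```

Comparing `kl_divergence` between the validation histogram and the histogram of the selected subset against the same quantity for the full pool gives a quick check that the selection moved the source distribution toward the validation set.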

data training efficiency

Key Terms

training-data-curation k-means-clustering kl-divergence conditional-expected-distance lora