Smarter scale selection in Block Floating Point quantization can reduce quantization error by 27% and improve language model performance by up to 15 points without slowing down inference.
This paper improves quantization of AI models by optimizing how Block Floating Point (BFP) formats choose their scale factors. Instead of fixing each block's scale to its maximum magnitude, the proposed method, ScaleSearch, searches for scales that directly minimize quantization error. The approach composes with existing quantization techniques and includes a specialized attention algorithm, yielding up to 15-point improvements on math reasoning tasks.
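The summary does not spell out the search procedure, but the core idea can be illustrated with a minimal sketch: quantize each block with a shared scale, then grid-search candidate scales below the conventional max-based one, keeping whichever minimizes mean-squared error. The function names (`max_based_scale`, `search_scale`), the 4-bit mantissa width, and the grid parameters here are illustrative assumptions, not details from the paper.

```python
import numpy as np

def bfp_quantize(block, scale, mantissa_bits=4):
    """Round a block to signed integers sharing one scale, then dequantize."""
    qmax = 2 ** (mantissa_bits - 1) - 1           # e.g. 7 for 4-bit mantissas
    q = np.clip(np.round(block / scale), -qmax, qmax)
    return q * scale                               # dequantized reconstruction

def max_based_scale(block, mantissa_bits=4):
    """Conventional BFP scale: fit the block's largest magnitude exactly."""
    qmax = 2 ** (mantissa_bits - 1) - 1
    return np.max(np.abs(block)) / qmax

def search_scale(block, mantissa_bits=4, candidates=32, shrink=0.5):
    """Grid-search scales at and below the max-based scale, minimizing MSE.
    (Illustrative strategy; the paper's actual search is not specified here.)"""
    base = max_based_scale(block, mantissa_bits)
    best_scale, best_err = base, np.inf
    for s in np.linspace(shrink * base, base, candidates):
        err = np.mean((block - bfp_quantize(block, s, mantissa_bits)) ** 2)
        if err < best_err:
            best_scale, best_err = s, err
    return best_scale

# Example: an outlier-heavy block, where a smaller scale often wins.
rng = np.random.default_rng(0)
block = rng.normal(size=16)
block[0] *= 8.0                                    # inject an outlier
mse = lambda s: np.mean((block - bfp_quantize(block, s)) ** 2)
print(f"max-based MSE: {mse(max_based_scale(block)):.4f}, "
      f"searched MSE: {mse(search_scale(block)):.4f}")
```

Because the grid includes the max-based scale itself, the searched scale is never worse in MSE; on outlier-heavy blocks it is typically smaller, trading some clipping of the outlier for finer resolution on the bulk of the values.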