GSQ achieves near-frontier accuracy at 2-3 bits per weight using standard scalar quantization that runs on existing inference hardware, making ultra-low-precision models practical without complex custom kernels.
GSQ is a new quantization method that compresses large language models to 2-3 bits per parameter while preserving accuracy. It uses the Gumbel-Softmax technique to learn soft, differentiable assignments of weights to discrete values, bridging the gap between simple but limited scalar quantization and vector quantization methods that are accurate but hard to deploy.
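To make the core idea concrete, here is a minimal sketch of Gumbel-Softmax assignment to a scalar quantization grid. This is not the paper's implementation; the function name, the squared-distance logits, and the temperature value are illustrative assumptions. Each weight gets logits favoring nearby grid levels, Gumbel noise makes the assignment stochastic, and a softmax keeps it differentiable so the assignments can be trained; annealing the temperature toward zero recovers hard rounding.

```python
import numpy as np

def gumbel_softmax_quantize(w, levels, tau=0.5, rng=None):
    """Softly assign each weight in w to one of a few discrete levels.

    Hypothetical sketch: logits are negative squared distances, so
    closer levels score higher; Gumbel noise plus softmax gives a
    stochastic, differentiable assignment that approaches hard
    nearest-level rounding as tau -> 0.
    """
    rng = rng or np.random.default_rng(0)
    # Shape (n_weights, n_levels): closer levels get higher logits.
    logits = -(w[:, None] - levels[None, :]) ** 2
    # Gumbel(0, 1) noise turns the argmax into a softmax sample.
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    scaled = (logits + gumbel) / tau
    scaled -= scaled.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum(axis=1, keepdims=True)
    # Soft quantized weights: expectation over the discrete levels.
    return probs @ levels

levels = np.array([-1.0, -0.33, 0.33, 1.0])  # a 2-bit grid (4 values)
w = np.array([0.9, -0.1, 0.4])
w_q = gumbel_softmax_quantize(w, levels, tau=0.1)
```

Because the output is a convex combination of grid levels, each soft-quantized weight stays inside the grid's range, and at low temperature it concentrates near a single level, which is what makes the result compatible with plain scalar-quantized inference.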