GSQ achieves near-frontier accuracy at 2-3 bits per weight using standard scalar quantization that runs on existing inference hardware, making ultra-low-precision models practical without complex custom kernels.
GSQ is a new quantization method that compresses large language models to 2-3 bits per parameter while preserving accuracy. It uses the Gumbel-Softmax technique to learn soft, differentiable assignments of weights to discrete values, bridging the gap between simple but limited scalar quantization and vector quantization methods that are accurate but hard to deploy.
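To make the core idea concrete, here is a minimal sketch of Gumbel-Softmax assignment to a scalar quantization grid. This is not the paper's implementation; the function name, the squared-distance logits, and the temperature value are illustrative assumptions. Each weight gets logits favoring nearby grid levels, Gumbel noise makes the assignment stochastic, and a softmax keeps it differentiable so the assignments can be trained; annealing the temperature toward zero recovers hard rounding.

```python
import numpy as np

def gumbel_softmax_quantize(w, levels, tau=0.5, rng=None):
    """Softly assign each weight in w to one of a few discrete levels.

    Hypothetical sketch: logits are negative squared distances, so
    closer levels score higher; Gumbel noise plus softmax gives a
    stochastic, differentiable assignment that approaches hard
    nearest-level rounding as tau -> 0.
    """
    rng = rng or np.random.default_rng(0)
    # Shape (n_weights, n_levels): closer levels get higher logits.
    logits = -(w[:, None] - levels[None, :]) ** 2
    # Gumbel(0, 1) noise turns the argmax into a softmax sample.
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    scaled = (logits + gumbel) / tau
    scaled -= scaled.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum(axis=1, keepdims=True)
    # Soft quantized weights: expectation over the discrete levels.
    return probs @ levels

levels = np.array([-1.0, -0.33, 0.33, 1.0])  # a 2-bit grid (4 values)
w = np.array([0.9, -0.1, 0.4])
w_q = gumbel_softmax_quantize(w, levels, tau=0.1)
```

Because the output is a convex combination of grid levels, each soft-quantized weight stays inside the grid's range, and at low temperature it concentrates near a single level, which is what makes the result compatible with plain scalar-quantized inference.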