Speculative decoding's performance depends heavily on compression level and task type; adaptive speculation-length selection based on draft-model signals can significantly outperform fixed hyperparameters with minimal computational overhead.
SpecKV speeds up LLM inference by dynamically choosing how many tokens the draft model proposes at each step, rather than using a fixed number. It learns from draft-model signals (token confidence and entropy) to predict which proposal lengths the target model is most likely to accept, achieving 56% faster inference than standard fixed-length approaches.
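As a rough illustration of the idea, the sketch below implements one plausible length-selection rule: keep extending the draft while its per-token confidence stays high and its entropy stays low. The function names, thresholds, and stopping rule are hypothetical; SpecKV's actual learned predictor may differ.

```python
import math

def token_entropy(probs):
    # Shannon entropy (in nats) of a next-token probability distribution.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def choose_speculation_length(top_probs, entropies,
                              max_len=8,
                              conf_threshold=0.8,
                              entropy_threshold=2.0):
    # Hypothetical rule: extend the draft one token at a time while the
    # draft model's top-1 probability is high and its entropy is low,
    # since confident drafts are more likely to be accepted by the target.
    length = 0
    for p, h in zip(top_probs, entropies):
        if p < conf_threshold or h > entropy_threshold:
            break
        length += 1
    return max(1, min(length, max_len))  # always propose at least one token

# Example: confidence drops at the fourth drafted token, so propose 3.
top_probs = [0.95, 0.91, 0.88, 0.42, 0.30]
entropies = [0.3, 0.5, 0.6, 2.4, 2.9]
print(choose_speculation_length(top_probs, entropies))  # -> 3
```

A thresholded rule like this adds only constant work per drafted token, consistent with the minimal-overhead claim above; a learned mapping from (confidence, entropy) to expected accepted length would slot into the same interface.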