Speculative decoding's performance depends heavily on compression level and task type; adaptive speculation-length selection based on draft-model signals can significantly outperform fixed hyperparameters with minimal computational overhead.
SpecKV speeds up LLM inference by dynamically choosing how many tokens the draft model proposes at each step, rather than using a fixed number. It learns from draft-model signals (token confidence and entropy) to predict which proposal lengths the target model is most likely to accept, achieving 56% faster inference than standard fixed-length approaches.
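As a rough illustration of the idea, the sketch below implements one plausible length-selection rule: keep extending the draft while its per-token confidence stays high and its entropy stays low. The function names, thresholds, and stopping rule are hypothetical; SpecKV's actual learned predictor may differ.

```python
import math

def token_entropy(probs):
    # Shannon entropy (in nats) of a next-token probability distribution.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def choose_speculation_length(top_probs, entropies,
                              max_len=8,
                              conf_threshold=0.8,
                              entropy_threshold=2.0):
    # Hypothetical rule: extend the draft one token at a time while the
    # draft model's top-1 probability is high and its entropy is low,
    # since confident drafts are more likely to be accepted by the target.
    length = 0
    for p, h in zip(top_probs, entropies):
        if p < conf_threshold or h > entropy_threshold:
            break
        length += 1
    return max(1, min(length, max_len))  # always propose at least one token

# Example: confidence drops at the fourth drafted token, so propose 3.
top_probs = [0.95, 0.91, 0.88, 0.42, 0.30]
entropies = [0.3, 0.5, 0.6, 2.4, 2.9]
print(choose_speculation_length(top_probs, entropies))  # -> 3
```

A thresholded rule like this adds only constant work per drafted token, consistent with the minimal-overhead claim above; a learned mapping from (confidence, entropy) to expected accepted length would slot into the same interface.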