Draft-based Approximate Inference for LLMs
Kevin Galim*, Ethan Ewer*, Wonjun Kang, and 3 more authors
In International Conference on Learning Representations, 2026
* Equal contribution
We present a unified framework for approximate inference in long-context LLMs that uses small draft models to predict token and KV-cache importance. We introduce SpecKV, SpecPC, and SpecKV-PC, which enable more accurate KV-cache and prompt compression while retaining the memory, latency, and throughput gains of existing methods.
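The core idea of draft-guided compression can be sketched as follows. This is an illustrative sketch, not the paper's implementation: it assumes the draft model exposes attention weights over the prompt, averages them into a per-token importance score, and keeps only the top-k KV-cache entries. The function names and the averaging rule are assumptions for illustration.

```python
import numpy as np

def draft_importance(attn_weights: np.ndarray) -> np.ndarray:
    """Average draft-model attention (heads, queries, prompt_len)
    into one importance score per prompt token. Hypothetical scoring rule."""
    return attn_weights.mean(axis=(0, 1))

def compress_kv(keys, values, attn_weights, keep: int):
    """Retain the `keep` highest-scoring KV entries, preserving token order."""
    scores = draft_importance(attn_weights)
    kept_idx = np.sort(np.argsort(scores)[-keep:])
    return keys[kept_idx], values[kept_idx], kept_idx

# Toy usage with random keys/values and draft attention weights.
rng = np.random.default_rng(0)
prompt_len, head_dim = 8, 4
keys = rng.normal(size=(prompt_len, head_dim))
values = rng.normal(size=(prompt_len, head_dim))
attn = rng.random((2, 3, prompt_len))  # (heads, queries, prompt_len)
k_small, v_small, kept = compress_kv(keys, values, attn, keep=4)
```

The compressed cache `k_small`, `v_small` would then be passed to the target model in place of the full prompt KV cache.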