Attention Drift: What Autoregressive Speculative Decoding Models Learn

“`html

A new phenomenon called “attention drift” has been identified in speculative decoding models, where the drafter’s attention progressively shifts from the prompt to its own generated tokens as it continues to draft subsequent sequences.
To mitigate this issue, two architectural changes are proposed: applying post-norm on the drafter’s hidden states and incorporating per-hidden-state RMSNorm after capturing target hidden states. These interventions improve the model’s acceptance length by up to 2× under template perturbation and 1.10× across various benchmarks.
The proposed modifications allow for shorter training times while enabling the model to generalize over longer drafting sequences, offering a more robust solution to the problem of attention drift in speculative decoding models.

“`