Harmonized Speculative Sampling (HASS): Aligning Training and Decoding for Efficient Large Language Model Inference
HASS aligns the training and decoding contexts and objectives of speculative sampling via harmonized objective distillation and multi-step context alignment, achieving a 2.81–4.05× speedup over vanilla autoregressive decoding and an 8%–20% improvement over EAGLE‑2, while preserving generation quality in real-world deployments at Xiaohongshu.
Speculative sampling is a widely adopted lossless acceleration technique for large language model (LLM) inference. Recent works have enriched draft models with contextual information from the target LLM (e.g., hidden states and the KV cache) to boost speed, but these methods introduce a mismatch between training and decoding contexts, as well as inconsistencies in training objectives.
The Xiaohongshu algorithm team proposes the HASS algorithm, which aligns both the objective and the context of the draft model across the training and decoding phases. HASS achieves a 2.81–4.05× speedup over ordinary inference and improves on the state‑of‑the‑art EAGLE‑2 method by 8%–20%. The paper is available at https://arxiv.org/pdf/2408.15766 .
Speculative sampling follows a draft‑then‑verify paradigm: (1) an efficient draft model generates multiple draft tokens; (2) the target LLM verifies these tokens in parallel; (3) accepted tokens are kept, and at the first rejected token a corrective token is resampled from the target's distribution (if all drafts are accepted, the target appends one new token).
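The draft‑then‑verify loop above can be sketched with the standard speculative‑sampling accept/reject rule. This is a minimal illustration of the general paradigm, not HASS‑specific code; the function name and list-based probability representation are our own for clarity.

```python
import random

def speculative_verify(draft_tokens, p_draft, p_target, rng=random.random):
    """Verify draft tokens against the target model's distributions.

    draft_tokens: list of proposed token ids
    p_draft[i][t], p_target[i][t]: draft/target probability of token t at step i
    Returns the accepted prefix, plus a corrective token on first rejection.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        # Accept token i with probability min(1, p_target / p_draft).
        if rng() < min(1.0, p_target[i][tok] / p_draft[i][tok]):
            accepted.append(tok)
        else:
            # On rejection, resample from the residual distribution
            # max(0, p_target - p_draft), renormalized.
            residual = [max(0.0, pt - pd)
                        for pt, pd in zip(p_target[i], p_draft[i])]
            u, acc = rng() * sum(residual), 0.0
            for t, w in enumerate(residual):
                acc += w
                if u < acc:
                    accepted.append(t)
                    break
            return accepted
    # All drafts accepted: the target then contributes one extra token
    # from its next-position distribution (omitted here).
    return accepted
```

The accept/resample rule guarantees that the output distribution matches the target LLM exactly, which is why speculative sampling is lossless.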
Two major issues are identified: (a) context inconsistency—during training the draft model can access hidden states from previous steps, while during decoding it cannot; (b) objective mismatch—draft models should prioritize high‑probability tokens that the target LLM would accept, but existing methods ignore this decoding‑stage goal.
HASS addresses these problems with two components: Harmonized Objective Distillation, which applies a ranking‑distillation (Top‑K) loss to transfer the ordering of high‑probability tokens from the target LLM to the draft model; and Harmonized Context Alignment, a multi‑step alignment training strategy that keeps the draft model's training context identical to the one it sees at decoding time. Training consists of n steps: step 1 follows EAGLE's training procedure; step 2 uses the features produced in step 1 to construct queries while preserving the attention masks; steps j ≥ 3 repeat this process using features from step j − 1. Although the training cost is n times that of EAGLE, decoding overhead is unchanged.
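To make the Top‑K idea concrete, here is one plausible form of such a ranking‑distillation loss: a soft cross‑entropy restricted to the target LLM's K highest‑probability tokens, so the draft model concentrates its capacity on tokens the target would likely accept. This is a hedged sketch (the function name and exact normalization are our assumptions), not the paper's verbatim formulation.

```python
import math

def topk_distill_loss(target_probs, draft_logits, k=10):
    """Hypothetical Top-K distillation loss: soft cross-entropy over the
    target's K most probable tokens, with the target mass renormalized
    to that set. Inputs are per-token lists over the vocabulary."""
    # Indices of the target LLM's top-K tokens.
    topk = sorted(range(len(target_probs)),
                  key=lambda t: target_probs[t], reverse=True)[:k]
    # Renormalize the target's probability mass over the top-K set.
    z_t = sum(target_probs[t] for t in topk)
    # Draft log-probabilities via a numerically stable log-softmax.
    m = max(draft_logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in draft_logits))
    # Negative soft cross-entropy restricted to the top-K tokens.
    return -sum((target_probs[t] / z_t) * (draft_logits[t] - log_z)
                for t in topk)
```

Restricting the loss to the top‑K set is what ties the training objective to the decoding‑stage goal: only high‑probability tokens ever survive verification, so mass spent elsewhere is wasted.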
Extensive experiments (Tables 1–2) show that HASS consistently yields the longest acceptance lengths and the best speedups across various datasets (including HumanEval) and target LLMs. Ablation studies examine the Top‑K loss hyper‑parameters, the number of alignment steps, and step‑wise weighting, finding that 3–4 alignment steps with a moderate weighting of the early steps perform best. Acceptance‑rate curves further confirm the effectiveness of harmonized context alignment.
In summary, HASS delivers significant inference acceleration while preserving generation quality, and the technique has already been deployed in real‑world scenarios at Xiaohongshu.
Xiaohongshu Tech REDtech