Harmonized Speculative Sampling (HASS): Aligning Training and Decoding for Efficient Large Language Model Inference
HASS aligns the training and decoding contexts and objectives of speculative sampling via harmonized objective distillation and multi-step context alignment, achieving a 2.81–4.05× speedup over vanilla autoregressive decoding and an 8%–20% improvement over EAGLE‑2, while preserving generation quality in real-world deployments at Xiaohongshu.
Speculative sampling is a widely adopted lossless acceleration technique for large language model (LLM) inference. Recent works have enriched draft models with contextual information from the target LLM (e.g., hidden states and the KV cache) to boost speed, but these methods introduce a mismatch between training and decoding contexts, as well as inconsistencies in training objectives.
The Xiaohongshu algorithm team proposes the HASS algorithm, which aligns both the objective and the context of the draft model across the training and decoding phases. HASS achieves a 2.81–4.05× speedup over ordinary inference and improves on the state‑of‑the‑art EAGLE‑2 method by 8%–20%. The paper is available at https://arxiv.org/pdf/2408.15766 .
Speculative sampling follows a draft‑then‑verify paradigm: (1) an efficient draft model generates multiple draft tokens; (2) the target LLM verifies these tokens in parallel; (3) accepted tokens are kept, and at the first rejected token a corrective token is resampled from the target's distribution (if all drafts are accepted, the target appends one new token).
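The draft‑then‑verify loop above can be sketched with the standard speculative‑sampling accept/reject rule. This is a minimal illustration of the general paradigm, not HASS‑specific code; the function name and list-based probability representation are our own for clarity.

```python
import random

def speculative_verify(draft_tokens, p_draft, p_target, rng=random.random):
    """Verify draft tokens against the target model's distributions.

    draft_tokens: list of proposed token ids
    p_draft[i][t], p_target[i][t]: draft/target probability of token t at step i
    Returns the accepted prefix, plus a corrective token on first rejection.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        # Accept token i with probability min(1, p_target / p_draft).
        if rng() < min(1.0, p_target[i][tok] / p_draft[i][tok]):
            accepted.append(tok)
        else:
            # On rejection, resample from the residual distribution
            # max(0, p_target - p_draft), renormalized.
            residual = [max(0.0, pt - pd)
                        for pt, pd in zip(p_target[i], p_draft[i])]
            u, acc = rng() * sum(residual), 0.0
            for t, w in enumerate(residual):
                acc += w
                if u < acc:
                    accepted.append(t)
                    break
            return accepted
    # All drafts accepted: the target then contributes one extra token
    # from its next-position distribution (omitted here).
    return accepted
```

The accept/resample rule guarantees that the output distribution matches the target LLM exactly, which is why speculative sampling is lossless.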
Two major issues are identified: (a) context inconsistency—during training the draft model can access hidden states from previous steps, while during decoding it cannot; (b) objective mismatch—draft models should prioritize high‑probability tokens that the target LLM would accept, but existing methods ignore this decoding‑stage goal.
HASS addresses these problems with two components: Harmonized Objective Distillation, which applies a ranking‑distillation (Top‑K) loss to transfer the ordering of high‑probability tokens from the target LLM to the draft model; and Harmonized Context Alignment, a multi‑step alignment training strategy that keeps the draft model's training context identical to the one it sees at decoding time. Training consists of n steps: step 1 follows EAGLE's training procedure; step 2 uses the features produced in step 1 to construct queries while preserving the attention masks; steps j ≥ 3 repeat this process using features from step j − 1. Although the training cost is n times that of EAGLE, decoding overhead is unchanged.
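To make the Top‑K idea concrete, here is one plausible form of such a ranking‑distillation loss: a soft cross‑entropy restricted to the target LLM's K highest‑probability tokens, so the draft model concentrates its capacity on tokens the target would likely accept. This is a hedged sketch (the function name and exact normalization are our assumptions), not the paper's verbatim formulation.

```python
import math

def topk_distill_loss(target_probs, draft_logits, k=10):
    """Hypothetical Top-K distillation loss: soft cross-entropy over the
    target's K most probable tokens, with the target mass renormalized
    to that set. Inputs are per-token lists over the vocabulary."""
    # Indices of the target LLM's top-K tokens.
    topk = sorted(range(len(target_probs)),
                  key=lambda t: target_probs[t], reverse=True)[:k]
    # Renormalize the target's probability mass over the top-K set.
    z_t = sum(target_probs[t] for t in topk)
    # Draft log-probabilities via a numerically stable log-softmax.
    m = max(draft_logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in draft_logits))
    # Negative soft cross-entropy restricted to the top-K tokens.
    return -sum((target_probs[t] / z_t) * (draft_logits[t] - log_z)
                for t in topk)
```

Restricting the loss to the top‑K set is what ties the training objective to the decoding‑stage goal: only high‑probability tokens ever survive verification, so mass spent elsewhere is wasted.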
Extensive experiments (Tables 1–2) show that HASS consistently yields the longest acceptance lengths and the best speedups across various datasets (including HumanEval) and target LLMs. Ablation studies examine the Top‑K loss hyper‑parameters, the number of alignment steps, and step‑wise weighting, finding that 3–4 alignment steps with a moderate weighting of the early steps perform best. Acceptance‑rate curves further confirm the effectiveness of harmonized context alignment.
In summary, HASS delivers significant inference acceleration while preserving generation quality, and the technique has already been deployed in real‑world scenarios at Xiaohongshu.
Xiaohongshu Tech REDtech