Decoding Strategies for Generative Models: Top‑k, Top‑p, Contrastive Search, Beam Search, and Sampling
The article explains how generative models use deterministic methods like greedy and beam search and stochastic techniques such as top‑k, top‑p, contrastive search and sampling, describing their mechanisms, temperature control, repetition penalties, and practical trade‑offs for balancing fluency, diversity and coherence.
Generative models use two main categories of decoding methods: deterministic (e.g., greedy search and beam search) and stochastic (e.g., sampling, top‑k, top‑p, contrastive search). Deterministic methods often produce less natural text, while stochastic methods introduce randomness to improve diversity and fluency.
Top‑k sampling: At each decoding step the model keeps only the k highest‑probability tokens, renormalizes their probabilities, and samples the next token from this reduced set.
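As an illustration, the top‑k selection step can be sketched in a few lines of NumPy. This is a minimal sketch of the mechanism, not the implementation used by any particular library, and the function name `top_k_sample` is our own:

```python
import numpy as np

def top_k_sample(logits, k, rng):
    """Keep the k highest-scoring tokens, renormalize, and sample one."""
    top_indices = np.argsort(logits)[-k:]          # ids of the k most probable tokens
    top_logits = logits[top_indices]
    probs = np.exp(top_logits - top_logits.max())  # softmax over the kept tokens only
    probs /= probs.sum()
    return rng.choice(top_indices, p=probs)

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
next_token = top_k_sample(logits, k=3, rng=rng)
# next_token is always one of the 3 most probable tokens (ids 0, 1, or 2)
```

Because every token outside the top k is discarded before sampling, low‑probability "tail" tokens can never be chosen, regardless of how the randomness falls.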
Top‑p (nucleus) sampling: The model sorts tokens by descending probability, accumulates them until the cumulative probability exceeds a threshold p, and then samples the next token from this dynamically sized set.
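The construction of the nucleus can be sketched as follows (again a minimal illustration under our own naming, not any library's implementation):

```python
import numpy as np

def nucleus_set(probs, p):
    """Return the smallest prefix of tokens (by probability) whose cumulative mass exceeds p."""
    order = np.argsort(probs)[::-1]               # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # first prefix whose mass exceeds p
    return order[:cutoff]

probs = np.array([0.5, 0.3, 0.1, 0.06, 0.04])
print(nucleus_set(probs, p=0.85))  # tokens 0, 1 and 2 together cover 0.9 > 0.85
```

Unlike top‑k, the number of candidate tokens adapts to the shape of the distribution: a peaked distribution yields a small nucleus, a flat one a large nucleus.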
Temperature controls randomness: a higher temperature yields a flatter distribution and more diverse output, while a lower temperature makes the distribution sharper and more deterministic.
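Concretely, temperature divides the logits before the softmax; this small sketch (our own helper, for illustration) shows the flattening and sharpening effect:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply a numerically stable softmax."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                      # subtract max for numerical stability
    probs = np.exp(z)
    return probs / probs.sum()

logits = [2.0, 1.0, 0.0]
hot = softmax_with_temperature(logits, temperature=2.0)   # flatter: mass spread out
cold = softmax_with_temperature(logits, temperature=0.5)  # sharper: mass concentrated
# cold[0] > hot[0]: the top token gains probability mass as temperature drops
```

As the temperature approaches 0, essentially all mass concentrates on the single highest‑logit token, which is why very low temperatures behave almost like greedy decoding.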
Contrastive search: Combines model confidence with a degeneration penalty based on the cosine similarity between the candidate token's representation and the representations of previous context tokens. The penalty discourages repetition; when the penalty weight α is zero, contrastive search reduces to greedy decoding.
Code example for contrastive search:
```python
output = model.generate(
    input_ids,
    penalty_alpha=0.6,  # α in contrastive search
    top_k=4,            # k in contrastive search
    max_length=512
)
```

Beam search: Keeps the num_beams most likely candidate sequences at each step, expands each of them, and finally returns the highest‑probability sequence. It reduces the risk of missing a high‑probability sequence that greedy search would discard, but it can still produce repeated fragments.
An n‑gram repetition penalty can be applied to beam search to prevent duplicate n‑grams:
```python
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,  # prevent repeated 2‑grams
    early_stopping=True
)
```

Sampling (do_sample=True): Makes generation nondeterministic. Lowering the temperature sharpens the distribution; as the temperature approaches 0, sampling collapses back toward greedy decoding and inherits its repetition issues.
Example of activating sampling without top‑k:
```python
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=0
)
```

Example with temperature control:
```python
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=0,
    temperature=0.7
)
```

Combining top‑k and top‑p (and returning multiple sequences):
```python
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3
)
```

The article discusses practical trade‑offs: choosing appropriate decoding methods, randomness parameters, and temperature values based on the task and desired output characteristics. It also cites research indicating that high‑quality human language does not strictly follow maximum‑probability rules, highlighting the importance of incorporating randomness and creativity into generation.
Baidu Geek Talk