Sampling Strategies for Large Language Models: Greedy, Beam, Top‑K, Top‑p, and Temperature
The article explains how greedy search, beam search, Top‑K, Top‑p (nucleus) sampling, and temperature each shape large language model generation, comparing their effects on repetition, diversity, and creativity, and provides concise TensorFlow‑based code examples illustrating these inference‑time strategies.
In recent years, large Transformer‑based language models such as OpenAI’s ChatGPT and Meta’s LLaMA have driven rapid progress in open‑domain text generation.
During inference, several decoding strategies and sampling parameters shape the output: greedy search, beam search, Top‑K sampling, Top‑p (nucleus) sampling, and temperature.
Greedy search always picks the token with the highest probability at each step, which can lead to repetitive or sub‑optimal sequences.
Beam search keeps the num_beams most likely hypotheses at each step and selects the highest‑scoring final sequence, reducing the risk of missing high‑probability paths but still not guaranteeing a global optimum.
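The bookkeeping behind beam search can be sketched in plain Python over a toy next‑token model (the `next_log_probs` callback and the tiny three‑token vocabulary here are hypothetical stand‑ins for a real language model, not part of the TensorFlow examples below):

```python
import math

def beam_search(next_log_probs, start, num_beams=2, steps=3):
    """Keep the num_beams highest-scoring hypotheses at every step."""
    beams = [(0.0, [start])]  # (cumulative log-prob, token sequence)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((score + lp, seq + [tok]))
        # Prune to the num_beams best candidates by cumulative log-prob
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams[0][1]  # highest-scoring final sequence

# Toy bigram "model": next-token probabilities depend only on the last token.
table = {
    'a': {'b': 0.6, 'c': 0.4},
    'b': {'a': 0.9, 'c': 0.1},
    'c': {'a': 0.5, 'b': 0.5},
}
toy = lambda seq: {t: math.log(p) for t, p in table[seq[-1]].items()}
print(beam_search(toy, 'a', num_beams=2, steps=3))
```

With `num_beams=1` this degenerates to greedy search; larger beams trade extra computation for a lower risk of discarding a sequence whose high probability only becomes apparent later.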
Top‑K sampling restricts sampling to the K most probable tokens, renormalising their probabilities before drawing a token.
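The truncate‑and‑renormalise step can be illustrated in a few lines of plain Python (a sketch over a made‑up probability vector, independent of the library code below):

```python
import random

def top_k_sample(probs, k, rng=random):
    """Keep the k most probable tokens, renormalise, then draw one."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)           # mass of the kept tokens
    tokens = [t for t, _ in top]
    weights = [p / total for _, p in top]    # renormalised distribution
    return rng.choices(tokens, weights=weights)[0]

vocab_probs = {'the': 0.40, 'a': 0.25, 'dog': 0.20, 'cat': 0.10, 'xyz': 0.05}
print(top_k_sample(vocab_probs, k=3))  # only 'the', 'a', or 'dog' can appear
```

Because K is fixed, the candidate set has the same size whether the distribution is sharply peaked or nearly flat, which is exactly the limitation Top‑p addresses.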
Top‑p (nucleus) sampling selects the smallest set of tokens whose cumulative probability exceeds a threshold p, allowing the candidate set size to adapt to the distribution.
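The adaptive candidate‑set selection can be sketched as follows (the two example distributions are made up for illustration):

```python
def nucleus(probs, p):
    """Smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    keep, cum = [], 0.0
    for tok, pr in ranked:
        keep.append(tok)
        cum += pr
        if cum >= p:
            break
    return keep

peaked = {'the': 0.90, 'a': 0.05, 'dog': 0.03, 'cat': 0.02}
flat   = {'the': 0.30, 'a': 0.28, 'dog': 0.22, 'cat': 0.20}
print(nucleus(peaked, 0.92))  # peaked distribution -> small candidate set
print(nucleus(flat, 0.92))    # flat distribution -> larger candidate set
```

Unlike Top‑K, the nucleus shrinks to a couple of tokens when the model is confident and expands when many continuations are plausible.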
Temperature rescales the logits before the softmax: T<1 sharpens the distribution, T>1 flattens it, affecting the “creativity” of the generated text.
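The effect of temperature on the softmax can be sketched directly (the logit values are arbitrary examples):

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T before the softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 0.5))  # T<1: sharper, more deterministic
print(softmax_with_temperature(logits, 1.0))  # unchanged distribution
print(softmax_with_temperature(logits, 2.0))  # T>1: flatter, more "creative"
```

As T grows, the probability of the top token falls and mass spreads to the tail; as T approaches 0, sampling approaches greedy search.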
Below are concise code examples using the transformers library (TensorFlow backend) to demonstrate each strategy.
!pip install -q git+https://github.com/huggingface/transformers.git
!pip install -q tensorflow
import tensorflow as tf
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)
model_inputs = tokenizer('I enjoy walking with my cute dog', return_tensors='tf')
# Greedy
greedy_output = model.generate(**model_inputs, max_new_tokens=40)
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))
# Beam search
beam_output = model.generate(input_ids=model_inputs['input_ids'],
                             max_length=50,
                             num_beams=5,
                             early_stopping=True)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
# Beam with no‑repeat n‑gram
beam_output = model.generate(input_ids=model_inputs['input_ids'],
                             max_length=50,
                             num_beams=5,
                             no_repeat_ngram_size=2,
                             early_stopping=True)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
# Top‑K
sample_output = model.generate(input_ids=model_inputs['input_ids'],
                               do_sample=True,
                               max_length=50,
                               top_k=50)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
# Top‑p
sample_output = model.generate(input_ids=model_inputs['input_ids'],
                               do_sample=True,
                               max_length=50,
                               top_p=0.92,
                               top_k=0)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
Experiments show that greedy search often repeats phrases, beam search reduces repetition but may still produce duplicates, and adding a no‑repeat n‑gram constraint eliminates them. Top‑K and Top‑p generate more diverse text, with Top‑p adapting the candidate set size to the probability distribution. Adjusting temperature further controls the trade‑off between determinism and creativity.
In summary, greedy, beam, Top‑K, Top‑p, and temperature are inference‑time parameters that operate on the LLM’s output probability distribution, each offering different balances of quality, diversity, and computational cost.
DaTaobao Tech
Official account of DaTaobao Technology