Sampling Strategies for Large Language Models: Greedy, Beam, Top‑K, Top‑p, and Temperature
The article explains how greedy search, beam search, Top‑K, Top‑p (nucleus) sampling, and temperature each shape large language model generation, comparing their effects on repetition, diversity, and creativity, and provides concise TensorFlow‑based code examples illustrating these inference‑time strategies.
In recent years, large Transformer‑based language models such as OpenAI’s ChatGPT and Meta’s LLaMA have driven rapid progress in open‑domain text generation.
During inference, several decoding strategies and sampling parameters shape the output: greedy search, beam search, Top‑K sampling, Top‑p (nucleus) sampling, and temperature.
Greedy search always picks the token with the highest probability at each step, which can lead to repetitive or sub‑optimal sequences.
Beam search keeps the num_beams most likely hypotheses at each step and selects the highest‑scoring final sequence, reducing the risk of missing high‑probability paths but still not guaranteeing a global optimum.
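The bookkeeping behind beam search can be sketched in plain Python over a toy next‑token model (the `next_log_probs` callback and the tiny three‑token vocabulary here are hypothetical stand‑ins for a real language model, not part of the TensorFlow examples below):

```python
import math

def beam_search(next_log_probs, start, num_beams=2, steps=3):
    """Keep the num_beams highest-scoring hypotheses at every step."""
    beams = [(0.0, [start])]  # (cumulative log-prob, token sequence)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((score + lp, seq + [tok]))
        # Prune to the num_beams best candidates by cumulative log-prob
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams[0][1]  # highest-scoring final sequence

# Toy bigram "model": next-token probabilities depend only on the last token.
table = {
    'a': {'b': 0.6, 'c': 0.4},
    'b': {'a': 0.9, 'c': 0.1},
    'c': {'a': 0.5, 'b': 0.5},
}
toy = lambda seq: {t: math.log(p) for t, p in table[seq[-1]].items()}
print(beam_search(toy, 'a', num_beams=2, steps=3))
```

With `num_beams=1` this degenerates to greedy search; larger beams trade extra computation for a lower risk of discarding a sequence whose high probability only becomes apparent later.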
Top‑K sampling restricts sampling to the K most probable tokens, renormalising their probabilities before drawing a token.
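The truncate‑and‑renormalise step can be illustrated in a few lines of plain Python (a sketch over a made‑up probability vector, independent of the library code below):

```python
import random

def top_k_sample(probs, k, rng=random):
    """Keep the k most probable tokens, renormalise, then draw one."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)           # mass of the kept tokens
    tokens = [t for t, _ in top]
    weights = [p / total for _, p in top]    # renormalised distribution
    return rng.choices(tokens, weights=weights)[0]

vocab_probs = {'the': 0.40, 'a': 0.25, 'dog': 0.20, 'cat': 0.10, 'xyz': 0.05}
print(top_k_sample(vocab_probs, k=3))  # only 'the', 'a', or 'dog' can appear
```

Because K is fixed, the candidate set has the same size whether the distribution is sharply peaked or nearly flat, which is exactly the limitation Top‑p addresses.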
Top‑p (nucleus) sampling selects the smallest set of tokens whose cumulative probability exceeds a threshold p, allowing the candidate set size to adapt to the distribution.
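The adaptive candidate‑set selection can be sketched as follows (the two example distributions are made up for illustration):

```python
def nucleus(probs, p):
    """Smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    keep, cum = [], 0.0
    for tok, pr in ranked:
        keep.append(tok)
        cum += pr
        if cum >= p:
            break
    return keep

peaked = {'the': 0.90, 'a': 0.05, 'dog': 0.03, 'cat': 0.02}
flat   = {'the': 0.30, 'a': 0.28, 'dog': 0.22, 'cat': 0.20}
print(nucleus(peaked, 0.92))  # peaked distribution -> small candidate set
print(nucleus(flat, 0.92))    # flat distribution -> larger candidate set
```

Unlike Top‑K, the nucleus shrinks to a couple of tokens when the model is confident and expands when many continuations are plausible.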
Temperature rescales the logits before the softmax: T<1 sharpens the distribution, T>1 flattens it, affecting the “creativity” of the generated text.
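The effect of temperature on the softmax can be sketched directly (the logit values are arbitrary examples):

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T before the softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 0.5))  # T<1: sharper, more deterministic
print(softmax_with_temperature(logits, 1.0))  # unchanged distribution
print(softmax_with_temperature(logits, 2.0))  # T>1: flatter, more "creative"
```

As T grows, the probability of the top token falls and mass spreads to the tail; as T approaches 0, sampling approaches greedy search.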
Below are concise code examples using the transformers library (TensorFlow backend) to demonstrate each strategy.
!pip install -q git+https://github.com/huggingface/transformers.git
!pip install -q tensorflow
import tensorflow as tf
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)
model_inputs = tokenizer('I enjoy walking with my cute dog', return_tensors='tf')
# Greedy
greedy_output = model.generate(**model_inputs, max_new_tokens=40)
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))
# Beam search
beam_output = model.generate(input_ids=model_inputs['input_ids'],
                             max_length=50,
                             num_beams=5,
                             early_stopping=True)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
# Beam with no‑repeat n‑gram
beam_output = model.generate(input_ids=model_inputs['input_ids'],
                             max_length=50,
                             num_beams=5,
                             no_repeat_ngram_size=2,
                             early_stopping=True)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
# Top‑K
sample_output = model.generate(input_ids=model_inputs['input_ids'],
                               do_sample=True,
                               max_length=50,
                               top_k=50)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
# Top‑p
sample_output = model.generate(input_ids=model_inputs['input_ids'],
                               do_sample=True,
                               max_length=50,
                               top_p=0.92,
                               top_k=0)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
Experiments show that greedy search often repeats phrases, beam search reduces repetition but may still produce duplicates, and adding a no‑repeat n‑gram constraint eliminates them. Top‑K and Top‑p generate more diverse text, with Top‑p adapting the candidate set size to the probability distribution. Adjusting temperature further controls the trade‑off between determinism and creativity.
In summary, greedy, beam, Top‑K, Top‑p, and temperature are inference‑time parameters that operate on the LLM’s output probability distribution, each offering different balances of quality, diversity, and computational cost.
DaTaobao Tech
Official account of DaTaobao Technology