Training a Positive Review Generator with RLHF and PPO
This article demonstrates how to apply Reinforcement Learning from Human Feedback (RLHF) using a sentiment‑analysis model as a reward function and Proximal Policy Optimization (PPO) to fine‑tune a language model that generates positive product reviews, complete with code snippets and experimental results.
With the rise of ChatGPT, many are interested in the core idea of RLHF (Reinforcement Learning from Human Feedback). Using reinforcement learning instead of pure supervised learning lets the model explore outputs beyond what the supervised data contains, potentially surpassing the performance ceiling of purely supervised fine-tuning.
The article walks through a concrete example: training a language model to generate positive reviews. The model receives a prompt such as "刚收到货,感觉" (just received the product, feeling) and must complete it with a positive comment.
```
prompt:   刚收到货,感觉 (just received the product, feeling)
output 1: 刚收到货,感觉 有 点 不 符 合 预 期 ,不 好
          (just received the product, it feels a bit below expectations, not good)
output 2: 刚收到货,感觉 挺 无 奈 的 送 货 速 度 不 太 行
          (just received the product, feeling quite helpless, the delivery speed is poor)
...
```
Initially the model has no preference and may produce negative reviews. By introducing a reward signal based on a sentiment‑analysis model, we can guide the model toward positive outputs.
The reward is obtained by feeding the concatenated prompt and generated response into a pretrained sentiment classifier (e.g., a RoBERTa model fine‑tuned on Chinese sentiment data). The classifier returns a probability between 0.0 and 1.0, which serves as the reward.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Sentiment model initialization: a RoBERTa classifier fine-tuned on
# Chinese JD product reviews serves as the reward model.
senti_tokenizer = AutoTokenizer.from_pretrained('uer/roberta-base-finetuned-jd-binary-chinese')
senti_model = AutoModelForSequenceClassification.from_pretrained('uer/roberta-base-finetuned-jd-binary-chinese')
sentiment_pipe = pipeline('sentiment-analysis', model=senti_model,
                          tokenizer=senti_tokenizer, device=pipe_device)

# Score the concatenated prompt + generated response for each sample in the batch.
texts = [q + r for q, r in zip(batch['query'], batch['response'])]
pipe_outputs = sentiment_pipe(texts)  # returns sentiment scores
```
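When the pipeline is configured to return per-class probabilities, the positive-class probability can be pulled out of each result and used as the scalar reward. A minimal sketch of that extraction step, with hard-coded outputs in the shape this pipeline family typically produces (the exact label strings depend on the checkpoint and are an assumption here):

```python
# Hypothetical pipeline output: each text yields a list of {label, score}
# dicts, one per class. The label strings below are illustrative.
pipe_outputs = [
    [{"label": "negative (stars 1, 2 and 3)", "score": 0.2},
     {"label": "positive (stars 4 and 5)", "score": 0.8}],
    [{"label": "negative (stars 1, 2 and 3)", "score": 0.9},
     {"label": "positive (stars 4 and 5)", "score": 0.1}],
]

def positive_scores(outputs):
    """Extract the positive-class probability of each text as its reward."""
    return [next(s["score"] for s in scores if s["label"].startswith("positive"))
            for scores in outputs]

rewards = positive_scores(pipe_outputs)
print(rewards)  # -> [0.8, 0.1]
```

The first text is scored as mostly positive (reward 0.8), the second mostly negative (reward 0.1), so maximizing reward pushes generation toward positive reviews.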
With rewards computed, the model is updated using PPO. The PPO step is a single line:
```python
ppo_trainer.step(query_tensors, response_tensors, rewards)  # PPO update
```
PPO optimizes two losses: the policy‑gradient loss (pg_loss) and the value‑function loss (value_loss). The article shows the relevant formulas and code snippets for computing advantages, importance ratios, and the combined loss.
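For reference, the quantities computed in the snippet that follows correspond to the standard PPO formulation; a sketch in LaTeX, with notation matching the code (`gamma` is $\gamma$, `lam` is $\lambda$):

```latex
% Per-token TD error and Generalized Advantage Estimation (GAE)
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
\qquad
\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1}

% Importance ratio between the updated and old policies
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}

% Policy-gradient loss (the snippet shows the unclipped term;
% full PPO takes the max with a term whose ratio is clipped to [1-\epsilon, 1+\epsilon])
L^{PG} = -\hat{A}_t \, r_t(\theta)

% Value-function loss against the empirical return R_t
L^{VF} = \left( V_\theta(s_t) - R_t \right)^2
```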
```python
# Generalized Advantage Estimation (GAE): sweep backwards over the
# generated tokens, accumulating discounted TD errors.
lastgaelam = 0
advantages_reversed = []
for t in reversed(range(gen_len)):
    nextvalues = values[:, t + 1] if t < gen_len - 1 else 0.0
    delta = rewards[:, t] + self.ppo_params['gamma'] * nextvalues - values[:, t]
    lastgaelam = delta + self.ppo_params['gamma'] * self.ppo_params['lam'] * lastgaelam
    advantages_reversed.append(lastgaelam)
advantages = torch.stack(advantages_reversed[::-1]).transpose(0, 1)

# Policy-gradient loss: importance ratio between the updated and old
# policies, multiplied by the (negated) advantage.
logits, _, vpred = self.model(model_input)
logprob = logprobs_from_logits(logits[:, :-1, :], model_input[:, 1:])
ratio = torch.exp(logprob - old_logprobs)
pg_losses = -advantages * ratio
```
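The backward recursion above can be checked numerically without tensors; a single-sequence sketch on plain Python lists, where `gamma`, `lam`, and the reward/value numbers are made up for illustration:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one sequence (plain lists)."""
    gen_len = len(rewards)
    lastgaelam = 0.0
    advantages_reversed = []
    for t in reversed(range(gen_len)):
        nextvalue = values[t + 1] if t < gen_len - 1 else 0.0
        delta = rewards[t] + gamma * nextvalue - values[t]
        lastgaelam = delta + gamma * lam * lastgaelam
        advantages_reversed.append(lastgaelam)
    return advantages_reversed[::-1]

# Toy example: reward arrives only on the final token, as in this
# article's setup where the sentiment score rates the whole response.
adv = gae([0.0, 0.0, 0.85], [0.1, 0.2, 0.3], gamma=1.0, lam=1.0)
print([round(a, 2) for a in adv])  # -> [0.75, 0.65, 0.55]
```

Even with reward only on the last token, GAE propagates credit backwards, so every generated token receives a non-zero advantage.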
To provide a value estimate for each token, a Value Head is added to the GPT‑2 model:
```python
class GPT2HeadWithValueModel(GPT2PreTrainedModel):
    """GPT-2 with the usual language-model head plus a scalar Value Head."""
    def __init__(self, config):
        super().__init__(config)
        config.num_labels = 1
        self.transformer = GPT2Model(config)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.v_head = ValueHead(config)  # added Value Head
        self.init_weights()

class ValueHead(nn.Module):
    """Maps each token's hidden state to a scalar value estimate."""
    def __init__(self, config):
        super().__init__()
        self.summary = nn.Linear(config.hidden_size, 1)
```
Training curves show that the average reward rises from around 0.68 to 0.85 as training progresses. Early in training the model generates random or negative comments, while later it consistently produces positive sentiment.
Examples of generated outputs before and after training are illustrated with images in the original article.
Finally, the full source code is available at github.com/HarderThenHarder/transformers_tasks/tree/main/RLHF.