Training a Positive Review Generator with RLHF and PPO
This article demonstrates how to apply Reinforcement Learning from Human Feedback (RLHF) using a sentiment‑analysis model as a reward function and Proximal Policy Optimization (PPO) to fine‑tune a language model that generates positive product reviews, complete with code snippets and experimental results.
With the rise of ChatGPT, many are interested in the core idea of RLHF (Reinforcement Learning from Human Feedback). Using reinforcement learning instead of pure supervised learning lets the model explore outputs beyond what the supervised data contains, potentially surpassing the performance ceiling of purely supervised fine-tuning.
The article walks through a concrete example: training a language model to generate positive reviews. The model receives a prompt such as "刚收到货,感觉" (just received the product, feeling) and must complete it with a positive comment.
```
prompt:   刚收到货,感觉 (just received the product, feeling)
output 1: 刚收到货,感觉 有 点 不 符 合 预 期 ,不 好
          (just received the product, it feels a bit below expectations, not good)
output 2: 刚收到货,感觉 挺 无 奈 的 送 货 速 度 不 太 行
          (just received the product, feeling quite helpless, the delivery speed is poor)
...
```
Initially the model has no preference and may produce negative reviews. By introducing a reward signal based on a sentiment‑analysis model, we can guide the model toward positive outputs.
The reward is obtained by feeding the concatenated prompt and generated response into a pretrained sentiment classifier (e.g., a RoBERTa model fine‑tuned on Chinese sentiment data). The classifier returns a probability between 0.0 and 1.0, which serves as the reward.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Sentiment model initialization: a RoBERTa classifier fine-tuned on
# Chinese JD product reviews serves as the reward model.
senti_tokenizer = AutoTokenizer.from_pretrained('uer/roberta-base-finetuned-jd-binary-chinese')
senti_model = AutoModelForSequenceClassification.from_pretrained('uer/roberta-base-finetuned-jd-binary-chinese')
sentiment_pipe = pipeline('sentiment-analysis', model=senti_model,
                          tokenizer=senti_tokenizer, device=pipe_device)

# Score the concatenated prompt + generated response for each sample in the batch.
texts = [q + r for q, r in zip(batch['query'], batch['response'])]
pipe_outputs = sentiment_pipe(texts)  # returns sentiment scores
```
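When the pipeline is configured to return per-class probabilities, the positive-class probability can be pulled out of each result and used as the scalar reward. A minimal sketch of that extraction step, with hard-coded outputs in the shape this pipeline family typically produces (the exact label strings depend on the checkpoint and are an assumption here):

```python
# Hypothetical pipeline output: each text yields a list of {label, score}
# dicts, one per class. The label strings below are illustrative.
pipe_outputs = [
    [{"label": "negative (stars 1, 2 and 3)", "score": 0.2},
     {"label": "positive (stars 4 and 5)", "score": 0.8}],
    [{"label": "negative (stars 1, 2 and 3)", "score": 0.9},
     {"label": "positive (stars 4 and 5)", "score": 0.1}],
]

def positive_scores(outputs):
    """Extract the positive-class probability of each text as its reward."""
    return [next(s["score"] for s in scores if s["label"].startswith("positive"))
            for scores in outputs]

rewards = positive_scores(pipe_outputs)
print(rewards)  # -> [0.8, 0.1]
```

The first text is scored as mostly positive (reward 0.8), the second mostly negative (reward 0.1), so maximizing reward pushes generation toward positive reviews.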
With rewards computed, the model is updated using PPO. The PPO step is a single line:
```python
ppo_trainer.step(query_tensors, response_tensors, rewards)  # PPO update
```
PPO optimizes two losses: the policy‑gradient loss (pg_loss) and the value‑function loss (value_loss). The article shows the relevant formulas and code snippets for computing advantages, importance ratios, and the combined loss.
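For reference, the quantities computed in the snippet that follows correspond to the standard PPO formulation; a sketch in LaTeX, with notation matching the code (`gamma` is $\gamma$, `lam` is $\lambda$):

```latex
% Per-token TD error and Generalized Advantage Estimation (GAE)
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
\qquad
\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1}

% Importance ratio between the updated and old policies
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}

% Policy-gradient loss (the snippet shows the unclipped term;
% full PPO takes the max with a term whose ratio is clipped to [1-\epsilon, 1+\epsilon])
L^{PG} = -\hat{A}_t \, r_t(\theta)

% Value-function loss against the empirical return R_t
L^{VF} = \left( V_\theta(s_t) - R_t \right)^2
```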
```python
# Generalized Advantage Estimation (GAE): sweep backwards over the
# generated tokens, accumulating discounted TD errors.
lastgaelam = 0
advantages_reversed = []
for t in reversed(range(gen_len)):
    nextvalues = values[:, t + 1] if t < gen_len - 1 else 0.0
    delta = rewards[:, t] + self.ppo_params['gamma'] * nextvalues - values[:, t]
    lastgaelam = delta + self.ppo_params['gamma'] * self.ppo_params['lam'] * lastgaelam
    advantages_reversed.append(lastgaelam)
advantages = torch.stack(advantages_reversed[::-1]).transpose(0, 1)

# Policy-gradient loss: importance ratio between the updated and old
# policies, multiplied by the (negated) advantage.
logits, _, vpred = self.model(model_input)
logprob = logprobs_from_logits(logits[:, :-1, :], model_input[:, 1:])
ratio = torch.exp(logprob - old_logprobs)
pg_losses = -advantages * ratio
```
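The backward recursion above can be checked numerically without tensors; a single-sequence sketch on plain Python lists, where `gamma`, `lam`, and the reward/value numbers are made up for illustration:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one sequence (plain lists)."""
    gen_len = len(rewards)
    lastgaelam = 0.0
    advantages_reversed = []
    for t in reversed(range(gen_len)):
        nextvalue = values[t + 1] if t < gen_len - 1 else 0.0
        delta = rewards[t] + gamma * nextvalue - values[t]
        lastgaelam = delta + gamma * lam * lastgaelam
        advantages_reversed.append(lastgaelam)
    return advantages_reversed[::-1]

# Toy example: reward arrives only on the final token, as in this
# article's setup where the sentiment score rates the whole response.
adv = gae([0.0, 0.0, 0.85], [0.1, 0.2, 0.3], gamma=1.0, lam=1.0)
print([round(a, 2) for a in adv])  # -> [0.75, 0.65, 0.55]
```

Even with reward only on the last token, GAE propagates credit backwards, so every generated token receives a non-zero advantage.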
To provide a value estimate for each token, a Value Head is added to the GPT‑2 model:
```python
class GPT2HeadWithValueModel(GPT2PreTrainedModel):
    """GPT-2 with the usual language-model head plus a scalar Value Head."""
    def __init__(self, config):
        super().__init__(config)
        config.num_labels = 1
        self.transformer = GPT2Model(config)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.v_head = ValueHead(config)  # added Value Head
        self.init_weights()

class ValueHead(nn.Module):
    """Maps each token's hidden state to a scalar value estimate."""
    def __init__(self, config):
        super().__init__()
        self.summary = nn.Linear(config.hidden_size, 1)
```
Training curves show that the average reward rises from around 0.68 to 0.85 as training progresses. Early in training the model generates random or negative comments, while later it consistently produces positive sentiment.
Examples of generated outputs before and after training are illustrated with images in the original article.
Finally, the full source code is available at github.com/HarderThenHarder/transformers_tasks/tree/main/RLHF.