
PAI‑ChatLearn: A Flexible Large‑Scale RLHF Training Framework for Massive Models

The article introduces PAI‑ChatLearn, a flexible and high‑performance framework developed by Alibaba Cloud's PAI team that supports full‑pipeline RLHF training for large models, explains the evolution of parallel training strategies, details the framework’s architecture and configuration, and showcases performance results and practical usage examples.

DataFunTalk

Overview

PAI‑ChatLearn is a framework from Alibaba Cloud's Machine Learning Platform for AI (PAI) designed for efficient reinforcement learning from human feedback (RLHF) on massive models (e.g., a 175B policy model paired with a 175B reward model). It supports supervised fine‑tuning (SFT), reward model (RM) training, and the full RLHF pipeline, with flexible backend selection.

Evolution of Large‑Model Training

As model sizes grew beyond the capacity of a single GPU, three parallelism strategies emerged:

Data Parallelism – replicates the whole model on multiple devices and synchronises gradients (e.g., ZeRO, FSDP).

Model Parallelism – splits the model itself, including Tensor Parallelism and Pipeline Parallelism, each with specific trade‑offs.

Task Parallelism – distributes different training tasks (SFT, RM, RLHF) across devices to improve resource utilisation.

Hybrid parallelism combines these methods to train extremely large models efficiently.
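To make the first strategy concrete, here is a minimal pure‑Python sketch of the data‑parallel pattern: each simulated device computes a gradient on its own shard of the batch, an all‑reduce averages them, and every replica applies the identical update. The single‑weight toy model is illustrative only; real frameworks such as ZeRO and FSDP perform this over sharded tensors with NCCL collectives.

```python
# Toy model: one scalar weight w, per-example loss (w*x - y)^2.

def local_gradient(w, shard):
    # Gradient of the squared error, averaged over this replica's shard.
    g = 0.0
    for x, y in shard:
        g += 2 * (w * x - y) * x
    return g / len(shard)

def all_reduce_mean(grads):
    # Stand-in for the collective communication step (gradient sync).
    return sum(grads) / len(grads)

def data_parallel_step(w, shards, lr=0.05):
    grads = [local_gradient(w, s) for s in shards]  # per-device compute
    g = all_reduce_mean(grads)                      # gradient all-reduce
    return w - lr * g                               # same update everywhere

# Two "devices", each holding half the batch; targets follow y = 2x.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
```

Because every replica sees the same averaged gradient, all copies of `w` stay bitwise identical after each step, which is exactly the invariant data parallelism relies on.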

RLHF Process

The RLHF workflow consists of three stages:

Pre‑train and SFT a base language model.

Train a reward model (RM) that scores model outputs against human preferences.

Apply Proximal Policy Optimization (PPO) to fine‑tune the policy model using the RM.
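The core update in stage 3 can be illustrated with PPO's clipped surrogate objective. The sketch below is the generic textbook formulation, not ChatLearn's actual loss code: `ratio` is the new policy's probability of a token divided by the old policy's, and the clip keeps each policy update small.

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Clipped PPO surrogate for a single (ratio, advantage) pair.

    PPO maximizes min(ratio * A, clip(ratio, 1-eps, 1+eps) * A);
    we return the negation so it can be minimized as a loss.
    """
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return -min(ratio * advantage, clipped * advantage)
```

With a positive advantage the clip caps how much the policy can move toward an action (e.g., `ratio=1.5` behaves like `ratio=1.2`); with a negative advantage it caps how far it can move away.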

Typical data formats for each stage are shown below:

SFT: {'query': 'question', 'response': 'reply'}
RM: {'query': 'question', 'response': ['reply1', 'reply2'], 'score': [0.8, 0.2]}
RLHF (PPO): {'prompt': 'question'}
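A small sketch of how these three record shapes map onto the pipeline stages; the `stage_of` helper is hypothetical, written only to make the distinction explicit, and is not part of PAI‑ChatLearn.

```python
def stage_of(record):
    """Classify a data record by the pipeline stage it feeds.

    - a bare prompt is PPO/RLHF rollout input,
    - a list of scored responses is reward-model training data,
    - a single query/response pair is SFT data.
    """
    if "prompt" in record:
        return "ppo"
    if isinstance(record.get("response"), list):
        return "rm"
    if "query" in record and "response" in record:
        return "sft"
    raise ValueError(f"unrecognized record: {record!r}")
```

For example, `stage_of({'query': 'question', 'response': ['reply1', 'reply2'], 'score': [0.8, 0.2]})` returns `"rm"`.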

PAI‑ChatLearn Architecture

API Layer – defines abstract RLHF modules, configuration objects (RLHF Config, Model Config) and model‑building interfaces.

Engine Layer – handles resource allocation, scheduling, and execution. It uses DistActor to encapsulate each model as a distributed actor, allowing separate backends for training (Megatron, DeepSpeed, custom) and inference (PyTorch, vLLM).

Parallel Strategy – each model can be assigned its own parallelism (data, tensor, pipeline) and resource quota, enabling mixed‑parallel training of multiple models simultaneously.
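As an illustration of per‑model parallel strategy, the hedged sketch below models each component's parallelism degrees and GPU quota as a plain dataclass. `ParallelPlan` and its fields are invented for exposition; they are not ChatLearn's real configuration objects, but they capture the idea that the policy and reward models can each carry independent settings.

```python
from dataclasses import dataclass

@dataclass
class ParallelPlan:
    """Illustrative per-model parallelism + resource quota (hypothetical)."""
    data_parallel: int = 1
    tensor_parallel: int = 1
    pipeline_parallel: int = 1
    num_gpus: int = 1

    def world_size(self):
        # GPUs implied by the product of the three parallelism degrees.
        return self.data_parallel * self.tensor_parallel * self.pipeline_parallel

# The policy trains with heavy tensor/pipeline splitting, while the
# reward model gets a smaller, differently shaped slice of the cluster.
policy_plan = ParallelPlan(data_parallel=2, tensor_parallel=8,
                           pipeline_parallel=2, num_gpus=32)
reward_plan = ParallelPlan(data_parallel=4, tensor_parallel=4, num_gpus=16)
```

The point of the abstraction is that `world_size()` is computed per model, so two models with incompatible parallel layouts can still be scheduled side by side in one RLHF job.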

Training Workflow

Users configure the entire environment via YAML files, specifying DLC or local execution, model resources, and RLHF hyper‑parameters (batch size, checkpoint interval, evaluation schedule). The typical steps are:

Initialize chatlearn and define models (policy, reward, value).

Create an Engine and dataset.

Call engine.learn() to start the RLHF training loop.
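The three steps above can be sketched as follows. Treat this as illustrative pseudocode: identifiers such as `chatlearn.init`, `RLHFEngine`, and `set_dataset` follow the workflow the article describes, but exact names and signatures depend on the ChatLearn version and the model classes you define.

```python
# Illustrative pseudocode for the workflow above; not a verbatim API listing.
import chatlearn
from chatlearn import RLHFEngine

chatlearn.init()                      # 1. initialize from the YAML config

policy = PolicyModel("policy")        #    models named in the Model Config
reference = PolicyModel("reference")
reward = RewardModel("reward")
value = ValueModel("value")

engine = RLHFEngine(policy, reference, reward, value)  # 2. create the engine
engine.set_dataset(prompts)           #    ... and attach the prompt dataset

engine.learn()                        # 3. run the RLHF training loop
```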

Usage Example

The article walks through a concrete example using an open‑source transformer model (e.g., Vicuna‑13B): preparing SFT data, converting the model to Megatron format, training SFT, RM, and finally RLHF, followed by offline batch inference or online serving via PAI‑EAS/vLLM.

Performance Highlights

PAI‑ChatLearn outperforms DeepSpeed‑Chat by 48%–82% at 7B–30B model scales.

It can train a 66B policy model with a 66B reward model on 32 GPUs, a configuration where DeepSpeed‑Chat runs out of memory, and scales up to 175B + 175B training.

On the MT‑Bench benchmark, RLHF models achieve an 11% average score increase over SFT models.

Vicuna‑13B RLHF experiments show superior results compared with other open‑source models of the same size.

Q&A Highlights

Multiple reward models can be combined by adding distributed actors and aggregating their scores.

Tensor‑parallel or pipeline‑parallel splitting does not increase model size; earlier Megatron versions had a bug that duplicated checkpoints.

The RLHF pipeline follows a standard PPO implementation; most hyper‑parameters are fixed by that recipe rather than being arbitrarily configurable.
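The first Q&A point, combining several reward models, amounts to aggregating their scalar scores into the single reward PPO consumes. A minimal weighted‑mean sketch (the `aggregate_rewards` helper is hypothetical, not ChatLearn API):

```python
def aggregate_rewards(scores, weights=None):
    """Combine per-reward-model scores into one scalar via a weighted mean.

    Each distributed reward actor scores the same response; `weights`
    lets one objective (e.g. safety) count more than another.
    """
    if weights is None:
        weights = [1.0] * len(scores)
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total

# e.g. helpfulness RM scores 0.9, safety RM scores 0.4, weighted 2:1
combined = aggregate_rewards([0.9, 0.4], weights=[2.0, 1.0])
```

Other aggregations (min, product, learned mixing) slot into the same place; the framework-level point is simply that each extra reward model is one more distributed actor feeding this step.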

Overall, PAI‑ChatLearn provides a modular, scalable, and easy‑to‑use solution for large‑scale RLHF training, enabling researchers and engineers to focus on model performance rather than low‑level parallelism details.

Tags: deep learning, RLHF, distributed computing, AI Framework, Large Model Training, PAI-ChatLearn
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
