
Understanding Reward Model Training in InstructGPT Using Ranking Sequences

This article explains how InstructGPT's reward model is trained by collecting human‑annotated ranking sequences instead of absolute scores, describes the rank‑loss formulation, provides Python code for the model and loss computation, and presents experimental results demonstrating the approach.

DataFunSummit

In ChatGPT (InstructGPT), the reward model is trained using human‑annotated ranking sequences rather than direct numeric scores for each sentence, as illustrated by the accompanying diagram.

The article compares this to essay grading: absolute scores vary widely from teacher to teacher, but ordering essays from best to worst yields a much more consistent training signal.

Annotators are asked to rank generated responses (e.g., two example sentences about bananas) instead of assigning absolute scores, which simplifies the labeling task and produces uniform supervision.
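To make this concrete, here is a minimal, hypothetical sketch of what one ranking annotation might look like and how ordered training pairs fall out of it. The field names and example responses are illustrative only, not the article's actual data format:

```python
# Hypothetical annotation record: for one prompt, annotators order the
# candidate responses from best to worst rather than scoring each one.
ranking_example = {
    "prompt": "Tell me something about bananas.",
    "ranked_responses": [  # index 0 = best, last = worst
        "Bananas are a potassium-rich fruit grown in tropical regions.",
        "Bananas are long and yellow.",
        "banana banana banana",
    ],
}

# Every higher-ranked response should earn a larger reward than every
# lower-ranked one, so one ordering expands into pairwise constraints.
pairs = [
    (better, worse)
    for i, better in enumerate(ranking_example["ranked_responses"])
    for worse in ranking_example["ranked_responses"][i + 1:]
]
print(len(pairs))  # 3 ranked responses -> 3 ordered pairs
```

This pairwise expansion is exactly what the loss function later iterates over.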

The rank‑loss function is introduced: for a sorted list A > B > C > D, the loss sums the differences between higher‑ranked and lower‑ranked rewards, applies a sigmoid to bound each term between 0 and 1, and finally takes the negative to turn minimization into maximization of the reward gap.
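Written out, the procedure described above corresponds to something like the following (the notation is mine: $r_i$ is the reward of the $i$-th ranked response in a list of $K$, and $\sigma$ is the sigmoid):

```latex
\mathcal{L} = -\frac{1}{\binom{K}{2}} \sum_{i<j} \sigma\!\left(r_i - r_j\right)
```

Each term is bounded in $(0, 1)$ by the sigmoid, and minimizing the negated average pushes every higher-ranked reward $r_i$ above every lower-ranked reward $r_j$.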

Python implementation details are provided. The reward model wraps an ERNIE backbone with a linear layer that maps the pooled output to a scalar reward, and the loss is computed by iterating over all ordered pairs in each ranking list:

from typing import List

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, encoder):
        """init func.
        Args:
            encoder (transformers.AutoModel): backbone, default ERNIE 3.0"""
        super().__init__()
        self.encoder = encoder
        self.reward_layer = nn.Linear(768, 1)  # reward layer maps to 1‑D reward

    def forward(self, input_ids, token_type_ids, attention_mask=None, pos_ids=None) -> torch.Tensor:
        """forward returns a score for each sentence.
        Returns:
            reward: (batch, 1)"""
        pooler_output = self.encoder(
            input_ids=input_ids,
            token_type_ids=token_type_ids,
            position_ids=pos_ids,
            attention_mask=attention_mask,
        )["pooler_output"]
        reward = self.reward_layer(pooler_output)
        return reward


def compute_rank_list_loss(rank_rewards_list: List[List[torch.Tensor]], device='cpu') -> torch.Tensor:
    """Compute rank loss from ordered reward lists.
    The loss is the sum of sigmoid‑scaled differences between higher‑ranked and lower‑ranked rewards, averaged and negated.
    """
    if not isinstance(rank_rewards_list, list):
        raise TypeError(f'@param rank_rewards_list expected "list", received {type(rank_rewards_list)}.')
    loss, add_count = torch.tensor([0.0], device=device), 0
    for rank_rewards in rank_rewards_list:
        for i in range(len(rank_rewards)-1):
            for j in range(i+1, len(rank_rewards)):
                diff = torch.sigmoid(rank_rewards[i] - rank_rewards[j])  # torch.sigmoid; F.sigmoid is deprecated
                loss = loss + diff
                add_count += 1
    loss = loss / add_count  # average over all ordered pairs
    return -loss  # negate so that minimizing the loss widens the reward gaps
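As a dependency-free sanity check of the pairwise loop above, the same computation can be traced with plain floats (the reward values here are stand-in numbers of mine, not the article's). A reward list that agrees with the ranking should produce a lower, more negative loss than the same rewards reversed:

```python
import math
from itertools import combinations

def rank_loss(rewards):
    """Plain-float restatement of the pairwise rank loss above."""
    # sigmoid of each (higher-ranked minus lower-ranked) reward difference
    terms = [1.0 / (1.0 + math.exp(-(hi - lo)))
             for hi, lo in combinations(rewards, 2)]
    return -sum(terms) / len(terms)  # average over pairs, then negate

ordered = rank_loss([2.0, 1.0, 0.0])    # rewards agree with the ranking
reversed_ = rank_loss([0.0, 1.0, 2.0])  # rewards contradict the ranking
print(round(ordered, 3), round(reversed_, 3))  # -0.781 -0.219
assert ordered < reversed_
```

This also shows why the loss values reported later in the article are negative: the sigmoid terms are positive, and the final negation flips the sign.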

Experimental results show the loss decreasing over training steps (e.g., from –0.52 to –0.71) and an evaluation accuracy of 0.5, with example predictions where a positive comment receives a high score (~10.7) and a negative comment receives a low score (~‑9.3).

The article also mentions an annotation platform for collecting ranking data and provides a link to the full source code repository.

Machine Learning · Python · Ranking · Reward Model · RLHF · InstructGPT
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
