Reproducing DeepSeek R1's Reasoning Ability with GRPO on Qwen2.5-7B in Colab
This article explains how to replicate DeepSeek R1's slow-thinking inference using the GRPO reinforcement-learning algorithm on the Qwen2.5-7B model in a free Colab notebook, covering the underlying chain-of-thought (CoT) concept, reward-function design, data preparation, training configuration, and observed results.
DeepSeek R1 has recently attracted massive attention by demonstrating that AI can acquire a "system-2", or slow-thinking, capability: a concept originally introduced by OpenAI's o1 model but made affordable and open-source by DeepSeek R1.
Effect Demonstration
After only 100 training steps on the Qwen2.5-7B model, the fine-tuned model can solve arithmetic word problems in the required format, and the GRPO reward score rises accordingly.
Question: Harry has 50 books in his library. His sister Flora owns twice as many books as Harry, and their cousin Gary owns half as many books as Harry. How many books do the three of them own in total? (Answer: 175)
Response:
<reasoning>
Harry has 50 books. Flora owns twice as many books as Harry, i.e. 100 books. Gary owns half as many books as Harry, i.e. 25 books. Together they own 50 + 100 + 25 = 175 books.
</reasoning>
<answer>
175
</answer>
Question: Because of COVID-19 restrictions, local shop owner Mr. Trent raised the price of every item in his store by 15% to keep his revenue the same as before. If one orange costs $40 and one mango costs $50, calculate the total cost of buying 10 oranges and 10 mangoes at the new prices. (Answer: 1035)
Response:
<reasoning>
Original total: 10 * 40 + 10 * 50 = 400 + 500 = 900
Price increase: 15%
New total: 900 * (1 + 15%) = 900 * 1.15 = 1035
</reasoning>
<answer>
1035
</answer>

GRPO Algorithm Overview
DeepSeek R1's reasoning ability is bootstrapped with GRPO (Group Relative Policy Optimization), a reinforcement-learning method that avoids the heavy value network used in traditional PPO. Instead, it generates multiple candidate outputs for each input, scores them with a reward function, and computes a group-wise advantage by normalising rewards within each sample group.
Multiple‑sample generation: For each question, the model produces several candidate answers using the current policy.
Reward evaluation: Rewards consider answer correctness, format compliance, and reasoning quality.
Group advantage calculation: The mean and standard deviation of the rewards are used to normalise each candidate’s reward, eliminating the need for a separate value model.
This approach reduces memory consumption and improves training stability compared with PPO.
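The group-wise advantage step can be sketched in a few lines of plain Python (a standalone illustration with made-up reward values, not the actual TRL implementation):

```python
# Sketch of GRPO's group-wise advantage: each candidate's reward is
# normalised by the mean and standard deviation of its own sample group,
# so no separate value network is needed.
def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Six sampled answers to one question, scored by a reward function:
rewards = [2.5, 0.0, 0.5, 2.5, 0.0, 0.5]
print([round(a, 2) for a in group_advantages(rewards)])
# [1.39, -0.93, -0.46, 1.39, -0.93, -0.46]
```

Candidates above the group mean receive positive advantages and are reinforced; the small `eps` term guards against division by a zero standard deviation when all candidates score identically.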
Training Process
Loading the Model
Necessary packages are installed, the quantised Qwen2.5‑7B model is loaded, and LoRA parameters are added for fine‑tuning.
%%capture
# Skip restarting message in Colab
import sys; modules = list(sys.modules.keys())
for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
!pip install unsloth vllm
!pip install --upgrade pillow
# If you are running this notebook locally, you also need to install `diffusers`
# !pip install diffusers
# Temporarily install a specific TRL nightly version
!pip install git+https://github.com/huggingface/trl.git@e95f9fb74a3c3647b86f251b7e230ec51c64b72b
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)
from unsloth import is_bfloat16_supported
import torch
max_seq_length = 1024
lora_rank = 32
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-7B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True,
    fast_inference = True,
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

Loading Data
The dataset consists of Chinese GSM8K math questions, each with a single numeric answer and no explicit reasoning traces; the model must learn to reason through reinforcement learning alone.
import re
from datasets import load_dataset, Dataset
SYSTEM_PROMPT = """Respond to the user's question in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""
XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""
def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()
def get_gsm8k_questions(split = "train") -> Dataset:
    data = load_dataset('swulling/gsm8k_chinese')[split]
    data = data.map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question_zh-cn']}
        ],
        'answer': x['answer_only']
    })
    return data
dataset = get_gsm8k_questions()

Reward Functions Definition
Five reward functions are implemented: correctness, integer detection, strict XML format, soft XML format, and XML structure scoring.
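For intuition, the snippet below scores one toy completion against the strict `<reasoning>`/`<answer>` layout and the correctness check (a self-contained illustration; the completion string and reference answer are invented):

```python
import re

# A toy completion in the target <reasoning>/<answer> layout.
completion = "<reasoning>\n50 + 100 + 25 = 175\n</reasoning>\n<answer>\n175\n</answer>\n"

# Strict layout: each tag on its own line, nothing before or after.
strict = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
print(bool(re.match(strict, completion)))  # True

# Correctness: pull out the answer text and compare to the reference.
answer = completion.split("<answer>")[-1].split("</answer>")[0].strip()
reward = 2.0 if answer == "175" else 0.0
print(reward)  # 2.0
```

Note that without `re.DOTALL`, `.` does not match newlines, so this strict pattern only accepts single-line reasoning blocks.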
# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [c[0]['content'] for c in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [2.0 if r == a else 0.0 for r, a in zip(extracted, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [c[0]['content'] for c in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted]
def strict_format_reward_func(completions, **kwargs) -> list[float]:
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [c[0]["content"] for c in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if m else 0.0 for m in matches]
def soft_format_reward_func(completions, **kwargs) -> list[float]:
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [c[0]["content"] for c in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if m else 0.0 for m in matches]
def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1: count += 0.125
    if text.count("\n</reasoning>\n") == 1: count += 0.125
    if text.count("\n<answer>\n") == 1: count += 0.125
    if text.count("\n</answer>") == 1: count += 0.125
    # penalise extra characters after </answer>
    extra = len(text.split("\n</answer>\n")[-1])
    count -= extra * 0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [c[0]["content"] for c in completions]
    return [count_xml(c) for c in contents]

Model Training
GRPO training is performed with the TRL library, using the defined reward functions and a modest training schedule suitable for a single 14 GB Colab GPU.
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = True,
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "paged_adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1,
    num_generations = 6,
    max_prompt_length = 256,
    max_completion_length = 200,
    max_steps = 250,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "none",
    output_dir = "outputs",
)
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()

After the training run, the model demonstrates genuine reasoning ability without any explicit chain-of-thought data, confirming the effectiveness of the GRPO approach.
References
Colab notebook: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb
Dataset: https://huggingface.co/datasets/swulling/gsm8k_chinese
Model: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies