Reproducing DeepSeek R1's Reasoning Ability with GRPO on Qwen2.5-7B in Colab
This article explains how to replicate DeepSeek R1's slow-thinking inference using the GRPO reinforcement-learning algorithm on the Qwen2.5-7B model in a free Colab notebook, covering the underlying chain-of-thought (CoT) concept, reward-function design, data preparation, training configuration, and observed results.
DeepSeek R1 has recently attracted massive attention by demonstrating that AI can acquire a "system-2", or slow-thinking, capability: a concept originally introduced by OpenAI's o1 model but made affordable and open-source by DeepSeek R1.
Effect Demonstration
After only 100 training steps on the Qwen2.5-7B model, the fine-tuned model can solve arithmetic word problems in the required format, and the GRPO reward score rises accordingly.
Question: Harry has 50 books in his library. His sister Flora owns twice as many books as Harry, and their cousin Gary owns half as many books as Harry. How many books do the three of them own in total? (Answer: 175)
Response:
<reasoning>
Harry has 50 books. Flora owns twice as many books as Harry, i.e. 100 books. Gary owns half as many books as Harry, i.e. 25 books. Together they own 50 + 100 + 25 = 175 books.
</reasoning>
<answer>
175
</answer>
Question: Because of COVID-19 restrictions, local shop owner Mr. Trent raised the price of every item in his store by 15% to keep his revenue the same as before. If one orange costs $40 and one mango costs $50, calculate the total cost of buying 10 oranges and 10 mangoes at the new prices. (Answer: 1035)
Response:
<reasoning>
Original total: 10 * 40 + 10 * 50 = 400 + 500 = 900
Price increase: 15%
New total: 900 * (1 + 15%) = 900 * 1.15 = 1035
</reasoning>
<answer>
1035
</answer>

GRPO Algorithm Overview
DeepSeek R1's reasoning ability is bootstrapped with GRPO (Group Relative Policy Optimization), a reinforcement-learning method that avoids the heavy value network used in traditional PPO. Instead, it generates multiple candidate outputs for each input, scores them with a reward function, and computes a group-wise advantage by normalising rewards within each sample group.
Multiple‑sample generation: For each question, the model produces several candidate answers using the current policy.
Reward evaluation: Rewards consider answer correctness, format compliance, and reasoning quality.
Group advantage calculation: The mean and standard deviation of the rewards are used to normalise each candidate’s reward, eliminating the need for a separate value model.
This approach reduces memory consumption and improves training stability compared with PPO.
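The group-wise advantage step can be sketched in a few lines of plain Python (a standalone illustration with made-up reward values, not the actual TRL implementation):

```python
# Sketch of GRPO's group-wise advantage: each candidate's reward is
# normalised by the mean and standard deviation of its own sample group,
# so no separate value network is needed.
def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Six sampled answers to one question, scored by a reward function:
rewards = [2.5, 0.0, 0.5, 2.5, 0.0, 0.5]
print([round(a, 2) for a in group_advantages(rewards)])
# [1.39, -0.93, -0.46, 1.39, -0.93, -0.46]
```

Candidates above the group mean receive positive advantages and are reinforced; the small `eps` term guards against division by a zero standard deviation when all candidates score identically.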
Training Process
Loading the Model
Necessary packages are installed, the quantised Qwen2.5‑7B model is loaded, and LoRA parameters are added for fine‑tuning.
%%capture
# Skip restarting message in Colab
import sys; modules = list(sys.modules.keys())
for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
!pip install unsloth vllm
!pip install --upgrade pillow
# If you are running this notebook locally, you also need to install `diffusers`
# !pip install diffusers
# Temporarily install a specific TRL nightly version
!pip install git+https://github.com/huggingface/trl.git@e95f9fb74a3c3647b86f251b7e230ec51c64b72b
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)
from unsloth import is_bfloat16_supported
import torch
max_seq_length = 1024
lora_rank = 32
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-7B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True,
    fast_inference = True,
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

Loading Data
The dataset consists of Chinese GSM8K math questions, each with a single numeric answer and no explicit reasoning traces; the model must learn to reason through reinforcement learning alone.
import re
from datasets import load_dataset, Dataset
SYSTEM_PROMPT = """Respond to the user's question in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""
XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""
def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()
def get_gsm8k_questions(split = "train") -> Dataset:
    data = load_dataset('swulling/gsm8k_chinese')[split]
    data = data.map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question_zh-cn']}
        ],
        'answer': x['answer_only']
    })
    return data
dataset = get_gsm8k_questions()

Reward Functions Definition
Five reward functions are implemented: correctness, integer detection, strict XML format, soft XML format, and XML structure scoring.
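For intuition, the snippet below scores one toy completion against the strict `<reasoning>`/`<answer>` layout and the correctness check (a self-contained illustration; the completion string and reference answer are invented):

```python
import re

# A toy completion in the target <reasoning>/<answer> layout.
completion = "<reasoning>\n50 + 100 + 25 = 175\n</reasoning>\n<answer>\n175\n</answer>\n"

# Strict layout: each tag on its own line, nothing before or after.
strict = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
print(bool(re.match(strict, completion)))  # True

# Correctness: pull out the answer text and compare to the reference.
answer = completion.split("<answer>")[-1].split("</answer>")[0].strip()
reward = 2.0 if answer == "175" else 0.0
print(reward)  # 2.0
```

Note that without `re.DOTALL`, `.` does not match newlines, so this strict pattern only accepts single-line reasoning blocks.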
# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [c[0]['content'] for c in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [2.0 if r == a else 0.0 for r, a in zip(extracted, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [c[0]['content'] for c in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted]
def strict_format_reward_func(completions, **kwargs) -> list[float]:
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [c[0]["content"] for c in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if m else 0.0 for m in matches]
def soft_format_reward_func(completions, **kwargs) -> list[float]:
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [c[0]["content"] for c in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if m else 0.0 for m in matches]
def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1: count += 0.125
    if text.count("\n</reasoning>\n") == 1: count += 0.125
    if text.count("\n<answer>\n") == 1: count += 0.125
    if text.count("\n</answer>") == 1: count += 0.125
    # penalise extra characters after </answer>
    extra = len(text.split("\n</answer>\n")[-1])
    count -= extra * 0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [c[0]["content"] for c in completions]
    return [count_xml(c) for c in contents]

Model Training
GRPO training is performed with the TRL library, using the defined reward functions and a modest training schedule suitable for a single 14 GB Colab GPU.
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = True,
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "paged_adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1,
    num_generations = 6,
    max_prompt_length = 256,
    max_completion_length = 200,
    max_steps = 250,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "none",
    output_dir = "outputs",
)
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()

After the training run, the model demonstrates genuine reasoning ability without any explicit chain-of-thought data, confirming the effectiveness of the GRPO approach.
References
Colab notebook: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb
Dataset: https://huggingface.co/datasets/swulling/gsm8k_chinese
Model: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies