
Experiments with Reinforcement Learning Fine‑Tuning of a 0.5B Qwen Model on the KK Dataset

The author reports a series of reinforcement‑learning fine‑tuning experiments on a 0.5‑billion‑parameter Qwen‑0.5B instruct model using the KK (Knights and Knaves) dataset, detailing reward‑design adjustments, curriculum‑style data scaling, observed convergence issues, and hypotheses about why small models fail to develop long reasoning chains.


After submitting a paper to ICML, the author quickly turned to reinforcement learning from human feedback (RLHF) and began experimenting with two open‑source projects: HuggingFace's Open‑R1 and Logic‑RL. Because Logic‑RL shards model inference and training across multiple GPUs, the author first tried Open‑R1 on a math‑question task and then moved to Logic‑RL.

Using only four power‑limited RTX 3090 GPUs, the author fine‑tuned the 0.5‑billion‑parameter Qwen‑0.5B instruct model on the KK dataset. Initial attempts with the original Logic‑RL reward rules let the model earn rewards merely for formatting its output correctly, after which its generated reasoning length collapsed to a few dozen tokens.

To address this, the reward was changed so that the model received the high score only when both the format and the answer were correct; every other case received the minimum score. Even so, the model learned to emit a short … block followed by the answer, effectively skipping the reasoning process. Removing that block requirement from the reward allowed the model to retain its reasoning steps during training.
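The all‑or‑nothing reward described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual Logic‑RL code: the `<think>`/`<answer>` tag layout, the `HIGH`/`LOW` constants, and exact string matching of the gold answer are all assumptions made for the example.

```python
import re

HIGH, LOW = 1.0, -1.0  # illustrative reward values, not the author's constants

def reward(completion: str, gold_answer: str) -> float:
    """All-or-nothing reward: high score only if BOTH the expected
    format (a reasoning block plus a tagged answer) and the answer
    itself are correct; minimum score otherwise."""
    # look for a tagged answer span anywhere in the completion
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    format_ok = ("<think>" in completion
                 and "</think>" in completion
                 and m is not None)
    answer_ok = m is not None and m.group(1).strip() == gold_answer.strip()
    return HIGH if (format_ok and answer_ok) else LOW
```

As the article notes, under such a rule a correct answer with a near‑empty reasoning block still collects the full reward, which is exactly the shortcut the model exploited.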

The author observed that the 0.5B model struggled with puzzles involving three or more characters (3ppl and above), which demand correspondingly longer reasoning chains. Directly mixing in 3ppl–7ppl data caused the reward to hover near the minimum and the model to produce nonsensical output. To mitigate this, a curriculum‑style schedule was adopted: first train on 2ppl data for a few steps, then step up through 3ppl, 4ppl, and 5ppl to 6ppl, loading the previously saved checkpoint at each stage. This run showed some interesting phenomena, such as occasional error‑checking behavior.
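The checkpoint‑chaining curriculum can be expressed as a simple driver loop. This is a hedged sketch: `train_for_steps`, its parameters, and the per‑stage step count are hypothetical placeholders standing in for whatever training entry point the author actually used.

```python
def run_curriculum(train_for_steps,
                   stages=("2ppl", "3ppl", "4ppl", "5ppl", "6ppl"),
                   steps_per_stage=200):
    """Train on progressively harder ppl splits, resuming each stage
    from the checkpoint saved by the previous one.

    `train_for_steps(dataset, steps, resume_from)` is a hypothetical
    callable that trains for `steps` steps on the named split and
    returns the path of the checkpoint it saved.
    """
    checkpoint = None  # first stage starts from the base model
    for stage in stages:
        checkpoint = train_for_steps(dataset=stage,
                                     steps=steps_per_stage,
                                     resume_from=checkpoint)
    return checkpoint
```

The key design choice is that each harder split only ever sees a policy already adapted to the easier one, rather than the raw base model.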

Despite these adjustments, the model consistently converged to a very short or even incorrect reasoning pattern. Visualizations of training on 5ppl and 6ppl showed a steady decline in chain‑of‑thought length. The author hypothesizes that the reward‑based RL process acts like a lottery: correct answers receive reward regardless of reasoning quality, while longer, correct reasoning may be penalized if the final answer is wrong, leading the optimizer to favor short, answer‑only policies for small models.

Consequently, the small model appears incapable of learning to use long reasoning chains to solve harder problems, even though it can achieve modest accuracy (≈33% on 5ppl, ≈22% on 6ppl) by memorizing answer patterns. The author suggests that larger models may have the capacity to retain and reinforce long‑chain reasoning, whereas small models quickly discard it.

In summary, the experiments with a 0.5B model and Logic‑RL did not succeed; the model’s size is likely the limiting factor. Future work may involve trying a larger model, though hardware constraints remain a concern. The author acknowledges being a newcomer to RL and welcomes corrections.

Tags: reinforcement learning · LLM fine-tuning · curriculum learning · reward design · small models
Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
