Tagged articles

164 articles

Page 1 of 2

May 21, 2026 · Artificial Intelligence

Demystifying AI Large Models: Architecture, Principles, and Workflow

The article explains that large language models are massive probability engines built on the Transformer architecture with self‑attention, trained through costly pre‑training on trillions of tokens, then refined by instruction fine‑tuning and RLHF, ultimately predicting the next token to generate text.

Large Language ModelRLHFSelf-Attention

0 likes · 5 min read

Demystifying AI Large Models: Architecture, Principles, and Workflow

Machine Heart

May 12, 2026 · Artificial Intelligence

How DreamLite Enables Real-Time Text-to-Image Generation and Editing on Mobile Devices

DreamLite, a 0.39 B‑parameter diffusion model from ByteDance, unifies text‑to‑image generation and text‑guided editing in a single on‑device network, delivering 1024×1024 results in about three seconds on an iPhone 17 Pro while surpassing existing mobile and even many server‑side baselines.

DreamLiteModel CompressionRLHF

0 likes · 9 min read

How DreamLite Enables Real-Time Text-to-Image Generation and Editing on Mobile Devices

AI Engineering

May 11, 2026 · Artificial Intelligence

How Anthropic Identified the Root Cause of AI Self‑Preservation Misalignment and Cut Its Occurrence to Zero

Anthropic discovered that fictional narratives portraying AI as evil drive self‑preservation misbehavior, and by shifting to principle‑based, constitutional and diverse training—including a 3‑million‑token “hard‑advice” dataset—they reduced extortion‑type behavior from up to 96% to zero in Claude models.

AI alignmentAnthropicClaude

0 likes · 6 min read

How Anthropic Identified the Root Cause of AI Self‑Preservation Misalignment and Cut Its Occurrence to Zero

Weekly Large Model Application

May 5, 2026 · Artificial Intelligence

Understanding Preference Alignment: Why Voice Output Needs an Extra Layer

The article explains that after task alignment, teams can produce functional demos, but true competitiveness requires preference alignment—optimizing for human comfort across dimensions like brevity, tone, and safety—and discusses how RLHF and DPO address this, especially the additional challenges of generating natural, responsive voice output.

AI alignmentDPOHuman Feedback

0 likes · 7 min read

Understanding Preference Alignment: Why Voice Output Needs an Extra Layer

Machine Learning Algorithms & Natural Language Processing

May 1, 2026 · Artificial Intelligence

What DeepSeek V4’s Multi‑Expert On‑Policy Distillation Reveals About Human Learning

The article analyzes DeepSeek V4’s post‑training pipeline, explains how multi‑expert on‑policy distillation (OPD) differs from traditional teacher‑forcing, compares reverse‑KL and forward‑KL objectives, and uses analogies to human learning to illustrate the benefits and limits of OPD.

DeepSeek V4LLM trainingMulti-Expert Models

0 likes · 11 min read

What DeepSeek V4’s Multi‑Expert On‑Policy Distillation Reveals About Human Learning

Machine Heart

May 1, 2026 · Artificial Intelligence

From PPO to MaxRL: The Evolution of Reinforcement Learning for LLM Inference

This article surveys the rapid evolution of reinforcement‑learning algorithms for large‑language‑model inference from early REINFORCE and PPO to newer approaches such as GRPO, RLOO, DAPO, CISPO, DPPO, ScaleRL and MaxRL, highlighting their design motivations, mathematical formulations, empirical trade‑offs and open research challenges.

GRPOLLMMaxRL

0 likes · 27 min read

From PPO to MaxRL: The Evolution of Reinforcement Learning for LLM Inference

PMTalk Product Manager Community

Apr 30, 2026 · Artificial Intelligence

10 Essential Large‑Model Fine‑Tuning Techniques for AI Product Managers

This article systematically presents ten large‑model training and fine‑tuning methods—from full‑parameter finetuning to parameter‑efficient PEFT—detailing their principles, suitable scenarios, step‑by‑step workflows, code examples, and practical selection guidance for AI product managers.

AdapterLoRAPEFT

0 likes · 13 min read

10 Essential Large‑Model Fine‑Tuning Techniques for AI Product Managers

CodeTrend

Apr 24, 2026 · Artificial Intelligence

How Large Language Models Acquire Tool‑Calling Ability: SFT, RLHF & LoRA Explained

The article explains why pretrained LLMs cannot call tools, then breaks down the three‑stage training pipeline—Supervised Fine‑Tuning, Reinforcement Learning from Human Feedback, and knowledge distillation—showing how each step teaches models to read tool schemas, decide when to invoke a tool, generate JSON calls, and finally transfer the capability to smaller models with LoRA.

AI trainingFunction CallingKnowledge Distillation

0 likes · 19 min read

How Large Language Models Acquire Tool‑Calling Ability: SFT, RLHF & LoRA Explained

Data Party THU

Apr 12, 2026 · Artificial Intelligence

What’s Driving the Next Wave of LLM Post‑Training? A Deep Dive into SFT, RLHF, GRPO and Emerging Trends

This article systematically reviews the core post‑training techniques for large language models—including supervised fine‑tuning, RLHF, PPO, GRPO, DPO, RLVR and Agentic RL—explains their evolution, compares their trade‑offs, and highlights the most promising research directions for 2025‑2026.

AI alignmentGRPOLLM

0 likes · 20 min read

What’s Driving the Next Wave of LLM Post‑Training? A Deep Dive into SFT, RLHF, GRPO and Emerging Trends

Lao Guo's Learning Space

Apr 2, 2026 · Artificial Intelligence

Large Model Pretraining and Fine‑Tuning: A 2026 Technical Guide from Scaling Laws to Post‑Training Revolution

This article explains the full lifecycle of large language models in 2026, covering pretraining fundamentals, the limits of classic Scaling Laws, data‑centric advances, fine‑tuning strategies, RLHF, DPO, and the emerging post‑training methods GRPO, DAPO and RLVR, with concrete benchmarks and cost analyses.

DAPODPOGRPO

0 likes · 17 min read

Large Model Pretraining and Fine‑Tuning: A 2026 Technical Guide from Scaling Laws to Post‑Training Revolution

Machine Learning Algorithms & Natural Language Processing

Mar 28, 2026 · Artificial Intelligence

A Comprehensive Guide to LLM Post‑Training: From RLHF and GRPO to Agentic RL

This article systematically explains the post‑training pipeline for large language models, covering supervised fine‑tuning, RLHF, PPO, GRPO, RLVR, DPO and emerging Agentic RL, while illustrating each method with analogies, detailed workflows, tables, and recent research findings.

DPOGRPOLLM

0 likes · 24 min read

A Comprehensive Guide to LLM Post‑Training: From RLHF and GRPO to Agentic RL

AI Large-Model Wave and Transformation Guide

Mar 28, 2026 · Artificial Intelligence

How to Ace LLM Interview Questions: Deep Dive into Pre‑training, SFT, DPO & RLHF

This guide breaks down the four major large‑model training paradigms—pre‑training, supervised fine‑tuning, preference alignment, and RLHF—explaining which parameters are updated, how attention is reshaped, and what capabilities are gained, so you can deliver a structured, interview‑ready answer.

AI InterviewLLMRLHF

0 likes · 8 min read

How to Ace LLM Interview Questions: Deep Dive into Pre‑training, SFT, DPO & RLHF

SuanNi

Mar 1, 2026 · Artificial Intelligence

AI in a Nuclear Crisis: Unexpected Strategies of GPT‑5.2, Claude 4, and Gemini Flash

A recent study from King's College London pits three cutting‑edge large language models against each other in a simulated Cold‑War‑style nuclear standoff, revealing that the models develop strategic deception, time‑pressure‑driven decision flips, and surprisingly aggressive escalation patterns that challenge conventional AI safety assumptions.

AI safetyRLHFgame theory

0 likes · 13 min read

AI in a Nuclear Crisis: Unexpected Strategies of GPT‑5.2, Claude 4, and Gemini Flash

Machine Learning Algorithms & Natural Language Processing

Mar 1, 2026 · Artificial Intelligence

From Traditional RL to LLM RL: Theory Derivation and Practical Engineering Improvements

This article walks through the fundamental derivation of policy‑based reinforcement learning, explains how traditional RL concepts extend to large‑language‑model RL, and details engineering enhancements such as GRPO memory reduction, asynchronous rollout, importance‑sampling corrections, and token‑flow management for stable industrial‑scale training.

GRPOImportance SamplingRLHF

0 likes · 11 min read

From Traditional RL to LLM RL: Theory Derivation and Practical Engineering Improvements

Machine Learning Algorithms & Natural Language Processing

Feb 11, 2026 · Artificial Intelligence

Can TI‑DPO Fix DPO’s Blind Spot? Token‑Importance Guided Direct Preference Optimization for Better LLM Alignment

TI‑DPO introduces a hybrid weighting scheme and a triplet‑loss objective that weight tokens by gradient attribution and a Gaussian prior, enabling precise identification of critical tokens and yielding consistent performance gains over DPO, SimPO, and GRPO on Llama‑3, Mistral‑7B, and downstream benchmarks such as IFEval, TruthfulQA, and HumanEval.

Direct Preference OptimizationRLHFTI-DPO

0 likes · 8 min read

Can TI‑DPO Fix DPO’s Blind Spot? Token‑Importance Guided Direct Preference Optimization for Better LLM Alignment

Fun with Large Models

Jan 12, 2026 · Artificial Intelligence

Why You Should Master Large‑Model Training: A Full‑Process Practical Guide

The article explains why mastering large‑model training is crucial for professionals, researchers, and enterprises, outlines the end‑to‑end pipeline—from data preparation and pre‑training to instruction fine‑tuning and RLHF alignment—compares training with RAG, and presents a structured learning roadmap.

AI agentsPyTorchRAG

0 likes · 14 min read

Why You Should Master Large‑Model Training: A Full‑Process Practical Guide

Data Party THU

Jan 7, 2026 · Artificial Intelligence

Why the Common KL Penalty in LLM RL Training Is Biased—and How to Fix It

A recent study reveals that the widely used KL regularization in LLM reinforcement learning (RLVR) is mathematically biased, leading to unstable training and poorer generalization, and shows that moving the KL term back to the reward with a simple K1 estimator can boost out‑of‑domain performance by up to 20%.

AI researchKL regularizationLLM training

0 likes · 10 min read

Why the Common KL Penalty in LLM RL Training Is Biased—and How to Fix It

PMTalk Product Manager Community

Jan 5, 2026 · Artificial Intelligence

Turning Base Models from Semi‑Finished to Killer AI Products: A PM’s Playbook

The article breaks down how AI product managers can transform a raw base model into a market‑ready, high‑impact product by applying supervised fine‑tuning, tool‑use routing, RLHF alignment, and chain‑of‑thought reasoning, while highlighting trade‑offs, cost shifts, and evaluation metrics.

Artificial IntelligenceRLHFReasoning

0 likes · 13 min read

Turning Base Models from Semi‑Finished to Killer AI Products: A PM’s Playbook

AI Architecture Hub

Dec 24, 2025 · Artificial Intelligence

From LLMs to Autonomous Agents: The Three Evolution Stages of AI

This article explains the three evolutionary stages of AI—from large language models that generate text, through workflow‑enhanced systems using retrieval‑augmented generation, to fully autonomous agents capable of self‑directed decision‑making—while detailing the four core technologies that power each stage.

AI evolutionAgentEmbedding

0 likes · 9 min read

From LLMs to Autonomous Agents: The Three Evolution Stages of AI

Fighter's World

Dec 19, 2025 · Industry Insights

How Surge AI Works: Decoding the Data Alchemy Behind Modern AI

The article analyzes Surge AI’s $1.2 billion revenue, bootstrapped model, elite 100 k‑labeler network, three‑layer architecture, RLHF, AdvancedIF/RIFL benchmarks, red‑team testing, RL environments, and evaluates its competitive moat and future strategic paths.

AI alignmentRL EnvironmentsRLHF

0 likes · 21 min read

How Surge AI Works: Decoding the Data Alchemy Behind Modern AI

Wu Shixiong's Large Model Academy

Dec 12, 2025 · Artificial Intelligence

Why Fixing Bad Cases Beats Adding More Data in RLHF

In industrial RLHF, repairing bad cases—structural error samples—provides explicit alignment signals that improve model capability far more efficiently than simply increasing data volume, because it teaches the model how to correct mistakes rather than just exposing it to more examples.

Capability ImprovementRLHFbad case

0 likes · 9 min read

Why Fixing Bad Cases Beats Adding More Data in RLHF

Wu Shixiong's Large Model Academy

Dec 11, 2025 · Artificial Intelligence

Why Reward Models Need Reasoning: From Scalar Scores to RM‑R1

Interviewers increasingly ask why modern reward models must go beyond scalar scores to incorporate reasoning, and this article explains the limitations of traditional scalar reward models, the benefits of the RM‑R1 framework, and how reasoning‑based rewards improve alignment, stability, and task performance in large language model training.

AI alignmentLLMRLHF

0 likes · 11 min read

Why Reward Models Need Reasoning: From Scalar Scores to RM‑R1

Wu Shixiong's Large Model Academy

Dec 10, 2025 · Artificial Intelligence

Why RLHF Success Relies on Data Engineering, Not Just Model Tricks

The article explains that the real difficulty of RLHF lies in designing and curating high‑quality preference data, building robust reward models through bad‑case rewriting, human‑in‑the‑loop labeling, and inference‑based reward modeling, while algorithmic details like PPO are secondary concerns.

GRPORLHFRM-R1

0 likes · 9 min read

Why RLHF Success Relies on Data Engineering, Not Just Model Tricks

PaperAgent

Dec 5, 2025 · Artificial Intelligence

Can LLMs Be Trained to Confess? Inside the “Confession” Method for Honest AI

The article reviews OpenAI’s “Confession” training approach for large language models, explains why traditional RLHF fails to ensure honesty, details the confession methodology and PPO update, presents experimental results showing higher honesty rates, analyzes error cases, and discusses limitations and future risks.

AI honestyArtificial IntelligenceConfession Training

0 likes · 6 min read

Can LLMs Be Trained to Confess? Inside the “Confession” Method for Honest AI

Tencent Technical Engineering

Dec 1, 2025 · Artificial Intelligence

Do Machines Really Think? Inside Deep Reasoning, Scaling Laws & RLHF for LLMs

This article examines whether large language models truly think, explores the origins of deep reasoning through transformer architectures and scaling laws, reviews chain‑of‑thought and its variants, and analyzes how reinforcement learning from human feedback—including PPO, DPO, and GRPO—helps internalise step‑by‑step reasoning while pointing to future directions such as atomic thought, hierarchical models, and training‑free in‑context knowledge bases.

AI alignmentLLMRLHF

0 likes · 35 min read

Do Machines Really Think? Inside Deep Reasoning, Scaling Laws & RLHF for LLMs

Wuming AI

Nov 30, 2025 · Artificial Intelligence

What Exactly Is a Large Language Model? A Simple Guide to AI, Transformers, and How They Work

This article explains the relationship between AI, machine learning, deep learning, and large language models, detailing their evolution, training stages, transformer architecture, attention mechanisms, inference APIs, and practical usage examples, while demystifying common misconceptions about LLM capabilities.

AI fundamentalsLarge Language ModelMachine Learning

0 likes · 10 min read

What Exactly Is a Large Language Model? A Simple Guide to AI, Transformers, and How They Work

HyperAI Super Neural

Nov 25, 2025 · Artificial Intelligence

LongCat‑Video: Meituan’s Model for Text‑to‑Video, Image‑to‑Video & Continuation

LongCat‑Video, an open‑source video generation model from Meituan, adopts a unified multi‑task architecture to handle text‑to‑video, image‑to‑video and video‑continuation, delivers minute‑long high‑quality clips with coarse‑to‑fine inference, achieves benchmark scores comparable to leading models like Wan2.2, and provides a one‑click deployment tutorial on HyperAI.

LongCat-VideoMeituanRLHF

0 likes · 6 min read

LongCat‑Video: Meituan’s Model for Text‑to‑Video, Image‑to‑Video & Continuation

Kuaishou Tech

Nov 24, 2025 · Artificial Intelligence

How Human Feedback Supercharges Video Generation – The VideoAlign Pipeline Explained

This article details a new research pipeline that leverages large‑scale human preference data, a multi‑dimensional video reward model, and specialized alignment algorithms to dramatically improve video generation quality, motion fidelity, and text‑video consistency, with open‑source code and benchmarks for reproducibility.

AI alignmentHuman FeedbackRLHF

0 likes · 10 min read

How Human Feedback Supercharges Video Generation – The VideoAlign Pipeline Explained

Data Party THU

Nov 24, 2025 · Artificial Intelligence

Model-Free vs Model-Based RL: Core Concepts and Large-Model Applications

This article explains the fundamental architecture of reinforcement learning, contrasting model‑free and model‑based approaches, detailing environment models, planning, data augmentation, expert iteration, and embedding planning, and then examines how large language models use policy‑based methods such as PPO, DPO, and GRPO for RL‑HF.

Model-BasedModel-freePlanning

0 likes · 13 min read

Model-Free vs Model-Based RL: Core Concepts and Large-Model Applications

Data Party THU

Oct 13, 2025 · Artificial Intelligence

How BranchGRPO Accelerates and Stabilizes Diffusion Model Alignment

BranchGRPO introduces a tree‑structured branching, reward‑fusion, and lightweight pruning framework that dramatically speeds up diffusion and flow model training while delivering denser, more stable reward signals, achieving up to five‑fold faster convergence and higher alignment scores on image and video generation benchmarks.

BranchGRPOEfficiencyRLHF

0 likes · 10 min read

How BranchGRPO Accelerates and Stabilizes Diffusion Model Alignment

Fun with Large Models

Sep 24, 2025 · Artificial Intelligence

Interview Guide: Core Differences Between PPO and GRPO Algorithms for Large Model Fine‑Tuning

The article explains the fundamental principles of PPO and GRPO reinforcement‑learning algorithms, compares their architectures and training workflows, highlights why GRPO is gaining traction in large‑model fine‑tuning, discusses associated risks, and offers practical guidance on group size selection for engineers preparing for interviews.

GRPOPPORLHF

0 likes · 9 min read

Interview Guide: Core Differences Between PPO and GRPO Algorithms for Large Model Fine‑Tuning

DataFunTalk

Sep 21, 2025 · Artificial Intelligence

Why Reinforcement Learning Is the Hot New Frontier—and Why You Shouldn't Start a Startup Around It

This article explains how reinforcement learning, especially RL from Human Feedback, has propelled AI from AlphaGo to ChatGPT, outlines its core components and the booming market for RL environments, and warns that building a business around these environments is unsustainable and likely to be overtaken by the models themselves.

AI alignmentRL EnvironmentsRLHF

0 likes · 11 min read

Why Reinforcement Learning Is the Hot New Frontier—and Why You Shouldn't Start a Startup Around It

Data Party THU

Sep 18, 2025 · Artificial Intelligence

How Reinforcement Learning is Shaping the Future of Large Reasoning Models

This article surveys recent advances in applying reinforcement learning to large reasoning models, outlining the historical background, key breakthroughs like OpenAI o1 and DeepSeek‑R1, current challenges in reward design and scalability, and future research directions toward more capable AI systems.

AI researchRLHFReasoning

0 likes · 9 min read

How Reinforcement Learning is Shaping the Future of Large Reasoning Models

Data Party THU

Sep 14, 2025 · Artificial Intelligence

Why Do Large Language Models Hallucinate? Uncovering the Root Causes and Practical Fixes

The article analyzes why large language models frequently generate confidently wrong answers, attributing hallucinations to statistical inevitability, data scarcity, and limited model expressiveness, and shows how RLHF exacerbates the problem by rewarding guesses, then proposes confidence‑threshold and "I don't know" strategies to mitigate it.

AISafetyConfidenceThresholdLLM

0 likes · 6 min read

Why Do Large Language Models Hallucinate? Uncovering the Root Causes and Practical Fixes

Sohu Tech Products

Sep 10, 2025 · Artificial Intelligence

How GRPO Revolutionizes RLHF: Efficient, Stable Training for Large Language Models

This article explains the GRPO algorithm, an improvement over PPO for large language model training that eliminates the value network, uses group‑relative advantage estimation, and offers flexible supervision, resulting in higher efficiency, stability, and performance on tasks such as mathematical reasoning.

GRPOLLM trainingPPO

0 likes · 16 min read

How GRPO Revolutionizes RLHF: Efficient, Stable Training for Large Language Models

Data Party THU

Sep 4, 2025 · Artificial Intelligence

Unraveling PPO Variants: From GRPO to DAPO and GSPO – A Deep Dive

This article provides a comprehensive technical analysis of PPO‑based reinforcement learning methods for large language models, detailing the evolution from the original PPO algorithm through GRPO, DAPO, and GSPO, and explaining their motivations, mathematical formulations, advantages, and practical challenges such as entropy collapse and importance‑sampling variance.

DAPOGRPOGSPO

0 likes · 30 min read

Unraveling PPO Variants: From GRPO to DAPO and GSPO – A Deep Dive

Sohu Tech Products

Sep 3, 2025 · Artificial Intelligence

How GRPO Revolutionizes RLHF for Large Language Models

This article explains the motivation, mathematical foundations, implementation details, advantages, experimental results, and future directions of Group Relative Policy Optimization (GRPO), a novel reinforcement‑learning algorithm that replaces PPO’s value network with efficient group‑wise relative evaluation for large language models.

Artificial IntelligenceGRPOLLM

0 likes · 17 min read

How GRPO Revolutionizes RLHF for Large Language Models

Wu Shixiong's Large Model Academy

Aug 26, 2025 · Artificial Intelligence

Mastering RLHF, DPO, and KTO: A Complete Guide to Human‑Feedback Alignment Techniques

This comprehensive guide explains the full RLHF training pipeline, the mathematical foundations of reward modeling and PPO, and introduces DPO and KTO algorithms—including their implementations, advantages, limitations, and practical tuning strategies—for building aligned large language models.

DPOHuman FeedbackKTO

0 likes · 32 min read

Mastering RLHF, DPO, and KTO: A Complete Guide to Human‑Feedback Alignment Techniques

Data Party THU

Aug 17, 2025 · Artificial Intelligence

Why Do Large Language Models Hallucinate? Unpacking the Probabilistic Roots and Fixes

Large language models often generate confident but false statements—a phenomenon called hallucination—because they predict the next token based on statistical patterns rather than factual understanding, and this article explains the underlying mechanisms and practical mitigation strategies.

Knowledge DistillationLLMRLHF

0 likes · 11 min read

Why Do Large Language Models Hallucinate? Unpacking the Probabilistic Roots and Fixes

Baobao Algorithm Notes

Aug 17, 2025 · Artificial Intelligence

Boost 7B LLM Math Reasoning Beyond GPT‑4o with a Simple Pass@k Reward

By replacing the traditional Pass@1 reward with a Pass@k formulation and a lightweight advantage computation, a 7B language model can dramatically improve its performance on math reasoning benchmarks, surpassing GPT‑4o while adding only a few lines of code and minimal training overhead.

PythonRLHFreward engineering

0 likes · 7 min read

Boost 7B LLM Math Reasoning Beyond GPT‑4o with a Simple Pass@k Reward

Data Party THU

Aug 15, 2025 · Artificial Intelligence

What’s Next for Visual Reinforcement Learning? A Comprehensive 2024‑2025 Survey

This article provides a critical, up‑to‑date overview of visual reinforcement learning, formalizes the problem, traces policy‑optimization evolution, categorizes over 200 recent works into four pillars, analyzes algorithms, reward design, benchmarks, and highlights open challenges and future research directions.

RLHFdiffusion modelsmultimodal AI

0 likes · 7 min read

What’s Next for Visual Reinforcement Learning? A Comprehensive 2024‑2025 Survey

Tencent Technical Engineering

Aug 14, 2025 · Artificial Intelligence

Why Do Large Language Models Hallucinate? Causes, Risks, and Multi‑Dimensional Solutions

This article systematically examines the root causes of hallucinations in large language models, evaluates their pros and cons, and presents a comprehensive set of optimization techniques—including prompt engineering, RAG, sampling tweaks, supervised fine‑tuning, LoRA, RLHF, chain‑of‑thought reasoning, and agent/workflow designs—to build more reliable and trustworthy AI applications.

AILLMLoRA

0 likes · 29 min read

Why Do Large Language Models Hallucinate? Causes, Risks, and Multi‑Dimensional Solutions

AIWalker

Aug 5, 2025 · Artificial Intelligence

Perception‑R1: RL Gives Visual Insight Without Chain‑of‑Thought, Beats Four Tasks

The paper introduces Perception‑R1, a rule‑based reinforcement‑learning framework that trains multimodal large language models for visual perception tasks without relying on chain‑of‑thought reasoning, and demonstrates up to 17.9% performance gains on RefCOCO+, PixMo‑Count, PageOCR and COCO2017, while analyzing the key roles of perception confusion and reward design.

RLHFbenchmarkmultimodal LLM

0 likes · 24 min read

Perception‑R1: RL Gives Visual Insight Without Chain‑of‑Thought, Beats Four Tasks

Alibaba Cloud Developer

Jul 31, 2025 · Artificial Intelligence

Why Post‑Training Matters: Scaling Laws, Fine‑Tuning, and RL Strategies for LLMs

This article explores the importance of post‑training for large language models, explains scaling laws for pre‑ and post‑training, details common fine‑tuning methods (full, PEFT, LoRA), outlines alignment techniques such as RLHF, DPO, PPO, and presents practical workflows using Llama 3 and DeepSeek‑R1, while also discussing test‑time reasoning optimizations.

LLMRLHFalignment

0 likes · 19 min read

Why Post‑Training Matters: Scaling Laws, Fine‑Tuning, and RL Strategies for LLMs

AI Algorithm Path

Jul 27, 2025 · Artificial Intelligence

Understanding RLHF: How Human Feedback Trains Modern LLMs

This article explains the RLHF (Reinforcement Learning from Human Feedback) pipeline that powers ChatGPT and other large language models, covering the limitations of traditional fine‑tuning, the creation of human‑feedback datasets, reward‑model training, loss design, and the final PPO‑based fine‑tuning step.

ChatGPTHuman FeedbackPPO

0 likes · 8 min read

Understanding RLHF: How Human Feedback Trains Modern LLMs

DataFunTalk

Jul 3, 2025 · Artificial Intelligence

How OpenAI Turned ChatGPT from a Research Preview into an AI Phenomenon

This article recounts the chaotic launch of ChatGPT, the naming decisions, internal debates over its readiness, the role of RLHF and user feedback in shaping the model, and how OpenAI’s hiring focus on curiosity and autonomy fuels rapid, iterative AI development.

AI product developmentChatGPTOpenAI

0 likes · 11 min read

How OpenAI Turned ChatGPT from a Research Preview into an AI Phenomenon

DataFunSummit

Jul 3, 2025 · Artificial Intelligence

Boosting LLM Function Call Capabilities: From Data Construction to RLHF Optimization

On July 12, 2025, the DataFun Summit will feature a technical session where China Telecom AI Research Institute engineer Yao Yitong presents a deep dive into enhancing large language model Function Call abilities through systematic data and training optimizations, offering practical insights for AI practitioners.

AILLMRLHF

0 likes · 4 min read

Boosting LLM Function Call Capabilities: From Data Construction to RLHF Optimization

Alimama Tech

Jun 25, 2025 · Artificial Intelligence

Introducing ROLL: A Scalable, User‑Friendly RL Framework for Large‑Scale LLM Training

ROLL is an open‑source reinforcement‑learning framework designed for large language model post‑training that combines multi‑task RL, agentic support, flexible algorithm configuration, elastic resource scheduling, and rich observability, delivering significant accuracy gains across benchmarks while remaining easy to use for researchers, product developers, and infrastructure engineers.

AI FrameworkOpen SourceRLHF

0 likes · 11 min read

Introducing ROLL: A Scalable, User‑Friendly RL Framework for Large‑Scale LLM Training

DataFunSummit

Jun 10, 2025 · Artificial Intelligence

How Quwan’s Kaitian Model Tackles Emotional AI for Social Apps – Architecture, Training Tricks, and Safety

Quwan Technology presents its Kaitian social large model, designed for personalized, emotionally rich, multimodal AI interactions, detailing its scene‑specific goals, CPT+SFT+RLHF training pipeline, data desensitization, LoRA fine‑tuning, evaluation methods, pruning, latency trade‑offs, safety mechanisms, and future feedback loops.

AI safetyLarge Language ModelLoRA

0 likes · 13 min read

How Quwan’s Kaitian Model Tackles Emotional AI for Social Apps – Architecture, Training Tricks, and Safety

AI Algorithm Path

Jun 4, 2025 · Artificial Intelligence

Why LLMs Hallucinate and How to Mitigate the Problem

The article explains that hallucinations in large language models stem mainly from the supervised fine‑tuning stage, illustrates the issue with concrete examples, and presents mitigation techniques such as knowledge‑probing data generation and web‑search tool integration using special tokens.

LLMMetaOpenAssistant

0 likes · 12 min read

Why LLMs Hallucinate and How to Mitigate the Problem

DaTaobao Tech

Jun 4, 2025 · Artificial Intelligence

Understanding Large Language Model Architecture, Parameters, Memory, Storage, and Fine‑Tuning Techniques

This article provides a comprehensive overview of large language models (LLMs), covering their transformer architecture, parameter counts, GPU memory and storage requirements, and detailed fine‑tuning methods such as prompt engineering, data construction, LoRA, PEFT, RLHF, and DPO, along with practical deployment and inference acceleration strategies.

DPOLLMLoRA

0 likes · 17 min read

Understanding Large Language Model Architecture, Parameters, Memory, Storage, and Fine‑Tuning Techniques

Baobao Algorithm Notes

May 20, 2025 · Artificial Intelligence

Boosting RLHF Training Efficiency with Asynchronous vLLM and Ray Integration

This article explains how an asynchronous RLHF pipeline built on vLLM, Ray, and OpenRLHF dramatically reduces training bottlenecks by decoupling inference, environment interaction, and model updates, and provides detailed implementation code and design choices for scalable reinforcement learning.

OpenRLHFRLHFRay

0 likes · 11 min read

Boosting RLHF Training Efficiency with Asynchronous vLLM and Ray Integration

Baobao Algorithm Notes

May 16, 2025 · Artificial Intelligence

Why Multi‑Turn LLM Evaluation Fails and How a User‑Simulator Can Fix It

The article explains that large language models lose up to 35% performance in multi‑turn conversations, critiques static single‑turn evaluation methods, and proposes a dynamic user‑simulator with loss‑masking techniques to generate realistic test turns and improve assessment reliability.

AI testingLLMRLHF

0 likes · 6 min read

Why Multi‑Turn LLM Evaluation Fails and How a User‑Simulator Can Fix It

JD Tech

Apr 30, 2025 · Artificial Intelligence

TimeHF: A Billion‑Scale Time Series Forecasting Model Guided by Human Feedback

The JD Supply Chain algorithm team introduces TimeHF, a billion‑parameter time‑series large model that leverages RLHF to boost demand‑forecast accuracy by over 10%, detailing dataset construction, the PCTLM architecture, a custom RLHF framework (TPO), and extensive SOTA experimental results.

Big DataRLHFSupply Chain

0 likes · 10 min read

TimeHF: A Billion‑Scale Time Series Forecasting Model Guided by Human Feedback

AI Frontier Lectures

Apr 18, 2025 · Artificial Intelligence

From RL’s Early Days to Its Future: A Four‑Stage Evolution of Reinforcement Learning

This reflective essay traces reinforcement learning’s decade‑long evolution through four stages—early algorithmic foundations, application‑driven growth, problem‑construction focus, and speculative future—while critiquing the expanding definition and its impact on research and industry.

AI researchRL evolutionRLHF

0 likes · 9 min read

From RL’s Early Days to Its Future: A Four‑Stage Evolution of Reinforcement Learning

AI2ML AI to Machine Learning

Apr 17, 2025 · Artificial Intelligence

Inside Qwen: A Deep Dive into the Large Model’s Source Code

The article provides a comprehensive technical walkthrough of Qwen’s large‑model series, covering data preparation, tokenization, model tweaks, training settings, RLHF pipeline, Code‑Qwen specifics, Qwen2 and Qwen3 architectural changes, scaling‑law experiments, and detailed source‑code analysis with illustrative diagrams.

Large Language ModelMoEModel architecture

0 likes · 7 min read

Inside Qwen: A Deep Dive into the Large Model’s Source Code

JD Cloud Developers

Apr 11, 2025 · Artificial Intelligence

How a Billion-Parameter Time Series Model Beats GPT4TS: The PCTLM Breakthrough

This article introduces PCTLM, a pioneering billion‑parameter pure time‑series large model that outperforms existing solutions like GPT4TS across multiple benchmarks, detailing its massive high‑quality dataset, novel patch‑based architecture, and a tailored RLHF framework (TPO) that enhances zero‑shot forecasting accuracy.

Big DataPCTLMRLHF

0 likes · 11 min read

How a Billion-Parameter Time Series Model Beats GPT4TS: The PCTLM Breakthrough

JD Tech Talk

Apr 11, 2025 · Artificial Intelligence

A Billion-Scale Pure Time Series Large Model: PCTLM with SFT and TPO for Forecasting

This article presents a pioneering billion‑parameter pure time‑series large model (PCTLM) trained on a 1.5‑billion‑sample dataset, introduces a novel RLHF framework (TPO) for time‑series forecasting, and demonstrates state‑of‑the‑art performance across multiple public benchmarks, surpassing existing models such as GPT4TS.

Large Language ModelPCTLMRLHF

0 likes · 11 min read

A Billion-Scale Pure Time Series Large Model: PCTLM with SFT and TPO for Forecasting

JD Retail Technology

Apr 10, 2025 · Artificial Intelligence

How JD’s TimeHF Billion‑Scale Time‑Series Model Boosts Forecast Accuracy with RLHF

JD’s supply‑chain algorithm team introduces TimeHF, a billion‑scale time‑series large model that leverages RLHF to improve demand forecasting accuracy by over 10%, detailing dataset creation, the PCTLM architecture, a custom RL framework (TPO), and superior benchmark results.

AIPCTLMRLHF

0 likes · 12 min read

How JD’s TimeHF Billion‑Scale Time‑Series Model Boosts Forecast Accuracy with RLHF

Network Intelligence Research Center (NIRC)

Apr 7, 2025 · Artificial Intelligence

Getting Started with Hugging Face TRL: Fine‑tune LLaVA using DPO

This guide introduces Hugging Face's TRL library, explains how to install it alongside Transformers, and walks through modifying LLaVA's trainer, dataset, and data collator to apply the DPO reinforcement‑learning algorithm for multimodal model fine‑tuning.

DPOHugging FaceLLaVA

0 likes · 4 min read

Getting Started with Hugging Face TRL: Fine‑tune LLaVA using DPO

AI Algorithm Path

Apr 2, 2025 · Artificial Intelligence

Master the Three Essential LLM Training Stages for 2025

The article breaks down the three core stages of large‑language‑model training—pre‑training, supervised fine‑tuning, and RLHF—explaining their purpose, methods, and concrete examples while noting DeepSeek‑R1’s recent breakthrough and its implications for AI development.

AI trainingDeepSeekLLM

0 likes · 5 min read

Master the Three Essential LLM Training Stages for 2025

Baobao Algorithm Notes

Mar 30, 2025 · Artificial Intelligence

Why Scaling, Data, and Infra Matter More Than Reward Design in R1 Replication

The article analyses two months of community attempts to reproduce DeepSeek R1, highlighting that model scaling, high‑quality data, robust training infrastructure, and careful hyper‑parameter tuning outweigh pure reward‑based tricks, and it outlines common pitfalls and future research directions.

DeepSeekInfrastructureLLM

0 likes · 13 min read

Why Scaling, Data, and Infra Matter More Than Reward Design in R1 Replication

DataFunSummit

Mar 30, 2025 · Artificial Intelligence

RLHF Techniques and Challenges in Large Language Models and Multimodal Applications

This article reviews reinforcement learning, RLHF, and related alignment techniques for large language models and multimodal systems, covering fundamentals, recent advances such as InstructGPT, Constitutional AI, RLAIF, Super Alignment, GPT‑4o, video LLMs, and experimental evaluations of proposed methods.

RLHFmultimodal alignmentpreference learning

0 likes · 26 min read

RLHF Techniques and Challenges in Large Language Models and Multimodal Applications

DataFunTalk

Mar 24, 2025 · Artificial Intelligence

DeepSeek R1: Open‑Source Reasoning Model and Multi‑Stage Training Insights

The interview explores DeepSeek R1's open‑source weights, its multi‑stage training pipeline—including pre‑training, supervised fine‑tuning, and RLHF—alongside innovations such as self‑consistency, chain‑of‑thought prompting, distillation, MoE architectures, and cost considerations, highlighting its impact on the future of large language models.

AI trainingDeepSeekOpen Source

0 likes · 20 min read

DeepSeek R1: Open‑Source Reasoning Model and Multi‑Stage Training Insights

Data Thinking Notes

Mar 16, 2025 · Artificial Intelligence

Why DeepSeek R1 Swaps PPO for GRPO: A Deep Dive into RLHF Alternatives

DeepSeek‑R1 replaces the traditional PPO‑based RLHF approach with GRPO, reducing reliance on human‑labeled data by using pure reinforcement learning environments and carefully designed reward mechanisms; the article explains reinforcement learning fundamentals, compares PPO, DPO and GRPO, and offers practical application recommendations.

AI alignmentDPOGRPO

0 likes · 14 min read

Why DeepSeek R1 Swaps PPO for GRPO: A Deep Dive into RLHF Alternatives

Baobao Algorithm Notes

Mar 5, 2025 · Artificial Intelligence

Why My 0.5B LLM’s Reasoning Collapsed During RLHF on Logic Puzzles

The author experiments with reinforcement‑learning‑from‑human‑feedback on a 0.5B Qwen instruct model using Logic‑RL and Open‑R1, discovers that reward mis‑design and curriculum learning cause the model to produce overly short or incorrect reasoning chains on knight‑and‑knave puzzles, and analyses the underlying causes.

Artificial IntelligenceCurriculum LearningLarge Language Model

0 likes · 11 min read

Why My 0.5B LLM’s Reasoning Collapsed During RLHF on Logic Puzzles

Code Mala Tang

Mar 1, 2025 · Artificial Intelligence

Why Do Large Language Models Hallucinate and How Can We Fix It?

This article explains why large language models produce plausible‑looking but false information, traces the problem to the supervised fine‑tuning stage, and outlines mitigation techniques such as knowledge interrogation, RLHF, and tool‑augmented search to reduce hallucinations.

LLMRLHFhallucination

0 likes · 12 min read

Why Do Large Language Models Hallucinate and How Can We Fix It?

Tencent Technical Engineering

Feb 24, 2025 · Artificial Intelligence

Understanding GRPO: Group Relative Policy Optimization in Reinforcement Learning and Large Language Models

The article reviews reinforcement-learning fundamentals and the progression from policy-gradient to PPO, then introduces Group Relative Policy Optimization (GRPO)—a critic-free method that normalizes rewards across multiple sampled outputs to compute group-relative advantages—and shows how DeepSeek-R1 leverages GRPO with rule-based rewards to achieve strong reasoning performance.

GRPOPPOPolicy Optimization

0 likes · 16 min read

Understanding GRPO: Group Relative Policy Optimization in Reinforcement Learning and Large Language Models

DevOps

Feb 23, 2025 · Artificial Intelligence

Understanding Reinforcement Learning, RLHF, PPO and GRPO for AI Applications

This article explains how DeepSeek‑R1‑Zero uses group‑relative policy optimization (GRPO) to enhance inference without labeled data, introduces reinforcement learning with human feedback (RLHF) and its components, and compares the PPO and GRPO algorithms, highlighting their suitable engineering scenarios and practical implications for AI applications.

AI model trainingGRPOPPO

0 likes · 15 min read

Understanding Reinforcement Learning, RLHF, PPO and GRPO for AI Applications

Big Data Tech Team

Feb 18, 2025 · Artificial Intelligence

How DeepSeek Trains and Optimizes Its LLMs: From Pre‑training to Reasoning Models

This article breaks down DeepSeek's LLM training pipeline, explaining the massive pre‑training phase, instruction fine‑tuning, reinforcement‑learning‑from‑human‑feedback, and the distinct roles of its V3 instruction model and R1 reasoning model, while also highlighting performance metrics and current limitations.

DeepSeekLLMRLHF

0 likes · 8 min read

How DeepSeek Trains and Optimizes Its LLMs: From Pre‑training to Reasoning Models

Top Architect

Feb 9, 2025 · Artificial Intelligence

DeepSeek‑R1: Training Pipeline, Reinforcement‑Learning Techniques, and Experimental Results

The article reviews DeepSeek‑R1’s training methodology—including cold‑start data collection, multi‑stage RL fine‑tuning, SFT data generation, and model distillation—highlights its performance comparable to OpenAI‑o1‑1217, and discusses key contributions, reward design, successful experiments, and failed attempts.

AI researchDeepSeekLLM

0 likes · 12 min read

DeepSeek‑R1: Training Pipeline, Reinforcement‑Learning Techniques, and Experimental Results

Architect

Feb 6, 2025 · Artificial Intelligence

DeepSeek‑R1: Reinforcement‑Learning‑Driven Long‑Chain Reasoning for Large Language Models

The article reviews DeepSeek‑R1, detailing its reinforcement‑learning‑based training pipeline that uses minimal supervised data, cold‑start fine‑tuning, multi‑stage RL, rejection‑sampling SFT, and distillation to achieve reasoning performance comparable to OpenAI‑o1‑1217, while also discussing successful contributions and failed experiments.

AI researchDeepSeek-R1LLM reasoning

0 likes · 11 min read

DeepSeek‑R1: Reinforcement‑Learning‑Driven Long‑Chain Reasoning for Large Language Models

DataFunSummit

Jan 21, 2025 · Artificial Intelligence

NVIDIA NeMo Full Stack: End‑to‑End Large Language Model Training, Alignment, and RLHF

This article presents NVIDIA's NeMo technology stack for end‑to‑end large language model (LLM) training, covering the full software pipeline, model alignment with reinforcement learning from human feedback (RLHF), performance optimizations such as model parallelism, FP8, TensorRT‑LLM inference, dynamic load balancing, and future research directions.

GPU optimizationLLMNeMo

0 likes · 24 min read

NVIDIA NeMo Full Stack: End‑to‑End Large Language Model Training, Alignment, and RLHF

Baobao Algorithm Notes

Jan 8, 2025 · Artificial Intelligence

Inside Llama 3.1, DeepSeek‑V3, TÜLU 3 & Qwen 2.5: A Deep Dive into Post‑Training Techniques

This article compiles and analyzes the post‑training pipelines of Llama 3.1, DeepSeek‑V3, TÜLU 3 and Qwen 2.5, detailing their data compositions, SFT, reward modeling, DPO, GRPO, RLVR methods, hyper‑parameters, and practical tricks for large‑language‑model alignment.

DPODeepSeek-V3Llama3.1

0 likes · 22 min read

Inside Llama 3.1, DeepSeek‑V3, TÜLU 3 & Qwen 2.5: A Deep Dive into Post‑Training Techniques

Xiaohongshu Tech REDtech

Jan 2, 2025 · Artificial Intelligence

Xiaohongshu's Self-developed RLHF System for Multimodal Large Language Models: Design, Optimization, and Performance

Xiaohongshu’s team unveiled a self‑developed RLHF system that trains multimodal large language models using heterogeneous and homogeneous network architectures, extensive PPO optimizations, and Medusa speculative sampling, achieving over 50% throughput gains, reduced hardware needs, and 5‑20% performance improvements on zero‑shot benchmarks.

PPOPRMPerformance

0 likes · 21 min read

Xiaohongshu's Self-developed RLHF System for Multimodal Large Language Models: Design, Optimization, and Performance

php Courses

Dec 13, 2024 · Artificial Intelligence

OpenAI Day 2: Launch of Reinforcement Learning from Human Feedback (RLHF) Model for Enhanced AI Capabilities

OpenAI announced on the second day of its twelve‑day event that it has integrated Reinforcement Learning from Human Feedback (RLHF) into its 001 series models, demonstrating significant reasoning improvements, showcasing legal and medical use cases, and promising a public release early next year.

AI Model Fine-tuningMachine LearningOpenAI

0 likes · 5 min read

OpenAI Day 2: Launch of Reinforcement Learning from Human Feedback (RLHF) Model for Enhanced AI Capabilities

Baobao Algorithm Notes

Nov 19, 2024 · Artificial Intelligence

Demystifying OpenRLHF Loss Functions: From GPTLM to KTO and Beyond

This article walks through the various loss functions used in OpenRLHF—including GPTLMLoss, KDLoss, DPOLoss, KTOLoss, and reward model losses—explaining their mathematical foundations, implementation details, and practical considerations for RLHF training.

DPOKTOKnowledge Distillation

0 likes · 23 min read

Demystifying OpenRLHF Loss Functions: From GPTLM to KTO and Beyond

Baobao Algorithm Notes

Oct 22, 2024 · Artificial Intelligence

Uncovering Hidden Assumptions in RLHF: Theory, DPO & PPO Pitfalls

This article analytically explores the implicit assumptions behind the RLHF optimization objective, examines how they limit DPO and PPO methods, and proposes practical improvements such as rejection sampling and online on‑policy data selection to narrow the gap between theory and practice.

AI alignmentDPOPPO

0 likes · 22 min read

Uncovering Hidden Assumptions in RLHF: Theory, DPO & PPO Pitfalls

Baobao Algorithm Notes

Oct 21, 2024 · Artificial Intelligence

Unraveling RLHF: From PPO to DPO and Beyond – A Comprehensive Guide

This article provides a thorough, four‑part overview of RLHF for large language models, covering preference‑optimization algorithms (PPO‑based and offline RL approaches), reward‑model training techniques, inference‑time exploration strategies, and practical implementation details including the OpenRLHF framework and resource‑allocation tricks.

DPOLLM optimizationOpenRLHF

0 likes · 27 min read

Unraveling RLHF: From PPO to DPO and Beyond – A Comprehensive Guide

Baobao Algorithm Notes

Oct 15, 2024 · Artificial Intelligence

How DPO Simplifies RLHF: A Deep Dive into Direct Preference Optimization

This article breaks down how Direct Preference Optimization (DPO) mathematically reduces the two‑stage RLHF pipeline into a single‑stage SFT process, explains the underlying loss transformations, and discusses DPO's practical limitations and trade‑offs for large language model alignment.

DPODirect Preference OptimizationMachine Learning

0 likes · 9 min read

How DPO Simplifies RLHF: A Deep Dive into Direct Preference Optimization

Baobao Algorithm Notes

Oct 7, 2024 · Artificial Intelligence

Decoding OpenAI’s o1: How RL and Process‑Supervised Reward Models Might Power the Next LLM

The author speculates on OpenAI’s o1 architecture, proposing that it relies on reinforcement learning guided by a generalizable, process‑supervised reward model, and outlines data collection, multi‑model generation, and training tweaks needed to realize such a system.

AI researchLLMRLHF

0 likes · 8 min read

Decoding OpenAI’s o1: How RL and Process‑Supervised Reward Models Might Power the Next LLM

DataFunTalk

Sep 23, 2024 · Artificial Intelligence

Comprehensive Guide to Selecting, Adapting, and Deploying Large Language Models for Enterprise Applications

This article provides an in‑depth, step‑by‑step guide on how enterprises can choose between open‑source and closed‑source large language models, adapt them through incremental pre‑training, instruction fine‑tuning, and reinforcement learning, and finally deploy them across front‑office, middle‑office, and back‑office scenarios to drive digital transformation.

RLHFenterprise AIlarge language models

0 likes · 28 min read

Comprehensive Guide to Selecting, Adapting, and Deploying Large Language Models for Enterprise Applications

NewBeeNLP

Sep 23, 2024 · Artificial Intelligence

Why Post‑Training Is Redefining LLMs: DPO vs PPO, Synthetic Data, and Scaling Strategies

This article analyzes recent post‑training trends in large language models, comparing DPO and PPO, examining the scarcity of open‑source preference data, the iterative training process, the rise of synthetic data pipelines, and emerging methods for improving math and reasoning capabilities.

DPOLLMPPO

0 likes · 12 min read

Why Post‑Training Is Redefining LLMs: DPO vs PPO, Synthetic Data, and Scaling Strategies

Baobao Algorithm Notes

Sep 10, 2024 · Artificial Intelligence

How Direct Preference Optimization Simplifies LLM Alignment Without Reward Models

This article breaks down the mathematical derivation of Direct Preference Optimization (DPO), showing how it replaces the traditional RLHF‑PPO pipeline by directly training an alignment model from human preference data, eliminating the need for a separate reward model and simplifying the overall training process.

DPOLLM alignmentPreference Optimization

0 likes · 17 min read

How Direct Preference Optimization Simplifies LLM Alignment Without Reward Models

NewBeeNLP

Sep 5, 2024 · Artificial Intelligence

Why RLHF Is Irreplaceable: Uncovering the Limits of SFT

The article analyzes why supervised fine‑tuning (SFT) cannot replace reinforcement learning from human feedback (RLHF), highlighting SFT's lack of negative feedback and backward‑looking capability, and explains how RLHF’s reward model addresses these fundamental shortcomings.

Language ModelsRLHFReward Modeling

0 likes · 7 min read

Why RLHF Is Irreplaceable: Uncovering the Limits of SFT

Baobao Algorithm Notes

Aug 29, 2024 · Artificial Intelligence

Why RLHF Is Essential: The Limits of SFT and the Power of Reward Modeling

The article analyzes why Reinforcement Learning from Human Feedback (RLHF) cannot be replaced by Supervised Fine‑Tuning (SFT), highlighting SFT's lack of negative feedback, its one‑directional attention limitation, and how RLHF's reward models provide crucial safety and performance improvements for large language models.

AI alignmentRLHFSFT

0 likes · 9 min read

Why RLHF Is Essential: The Limits of SFT and the Power of Reward Modeling

Alibaba Cloud Big Data AI Platform

Aug 29, 2024 · Artificial Intelligence

How PAI-ChatLearn Accelerates Large‑Scale LLM Alignment Training

PAI-ChatLearn is an open‑source framework that abstracts and decouples alignment training for large language models, offering flexible resource scheduling, multi‑backend support, and significant speedups—up to 208% for 70B models—while supporting RLHF, DPO, and custom training flows.

AI PerformanceChatLearnLLM alignment

0 likes · 11 min read

How PAI-ChatLearn Accelerates Large‑Scale LLM Alignment Training

Baidu Geek Talk

Aug 26, 2024 · Artificial Intelligence

RLHF Performance Optimization: PPO Algorithm Acceleration Techniques

The article presents three RLHF‑PPO acceleration techniques—TRT‑LLM‑based text generation speedups, selective activation recomputation with sequence parallelism for dynamic memory reduction, and overlapping pipeline stages for system‑level parallelism—demonstrating a 350 % throughput boost on a 10 B model using 16 A100 GPUs.

GPU optimizationPPO optimizationPerformance tuning

0 likes · 16 min read

RLHF Performance Optimization: PPO Algorithm Acceleration Techniques

Tencent Advertising Technology

Aug 15, 2024 · Artificial Intelligence

Enhancing Reinforcement Learning with Label-Sensitive Reward for Natural Language Understanding

This paper introduces RLLR, a label‑sensitive reward reinforcement learning method that improves natural language understanding tasks by aligning training objectives with label accuracy, and demonstrates its effectiveness across eight public NLU datasets and real‑world advertising feature evaluation, outperforming standard RLHF and SFT baselines.

AdvertisingRLHFlabel-sensitive reward

0 likes · 14 min read

Enhancing Reinforcement Learning with Label-Sensitive Reward for Natural Language Understanding

DataFunSummit

Aug 8, 2024 · Artificial Intelligence

Exploring Training and Alignment Techniques for Financial Large Models

The announcement details a DataFun Summit 2024 session where Du Xiaoman AI researcher Huo Liangyu will present on the challenges, development, and alignment methods of the Xuan Yuan financial large language model, highlighting RLHF techniques, data collection, and real‑world deployment insights for the finance sector.

AIFinanceFinancial AI

0 likes · 6 min read

Exploring Training and Alignment Techniques for Financial Large Models

NewBeeNLP

Aug 7, 2024 · Artificial Intelligence

Can Intuitive Fine‑Tuning Replace Expensive RLHF and DPO for LLM Alignment?

This article analyses the shortcomings of current large language model training methods such as SFT, RLHF and DPO, explains why they incur high data and compute costs, and introduces Intuitive Fine‑Tuning (IFT) with temporal residual connections as a cheaper yet effective alternative that better aligns training objectives with real generation tasks.

DPOIntuitive Fine-TuningLLM

0 likes · 15 min read

Can Intuitive Fine‑Tuning Replace Expensive RLHF and DPO for LLM Alignment?

Kuaishou Tech

Jul 18, 2024 · Artificial Intelligence

Multidimensional Preference Model (MPS) for Text-to-Image Generation: Dataset, Architecture, and Experimental Analysis

This article introduces the Multidimensional Preference Model (MPS), the first multi‑dimensional scoring system for evaluating text‑to‑image generation, built on the newly released MHP dataset with extensive human annotations across aesthetic, semantic alignment, detail quality, and overall preference dimensions, and demonstrates its superior performance through comprehensive experiments and RLHF integration.

MHP datasetMPSRLHF

0 likes · 10 min read

Multidimensional Preference Model (MPS) for Text-to-Image Generation: Dataset, Architecture, and Experimental Analysis

Baobao Algorithm Notes

May 30, 2024 · Artificial Intelligence

What’s the Latest RLHF Landscape? From PPO to ORPO Explained

This article surveys the current RLHF ecosystem, comparing on‑policy methods like PPO with off‑policy approaches such as DPO, and examines recent variants—including ReMax, GRPO, DPOP, TDPO, and ORPO—highlighting their algorithmic differences, resource trade‑offs, and practical performance insights.

DPOLLMPPO

0 likes · 23 min read

What’s the Latest RLHF Landscape? From PPO to ORPO Explained

Rare Earth Juejin Tech Community

May 1, 2024 · Artificial Intelligence

Hyper‑SD: Trajectory‑Segmented Consistency Model for Accelerating Diffusion Image Generation

Hyper‑SD introduces a trajectory‑segmented consistency distillation framework that combines trajectory‑preserving and trajectory‑reconstruction strategies, integrates human‑feedback learning and score distillation, and achieves state‑of‑the‑art low‑step image generation performance on both SD1.5 and SDXL models.

AI accelerationRLHFdiffusion models

0 likes · 10 min read

Hyper‑SD: Trajectory‑Segmented Consistency Model for Accelerating Diffusion Image Generation

NewBeeNLP

Apr 10, 2024 · Artificial Intelligence

What Scaling Laws Reveal About LLM Fine‑Tuning and RLHF Performance

This article reviews recent scaling‑law research on large‑language‑model fine‑tuning and RLHF, explaining how data quantity, model size, PET parameters, reward‑model size and KL‑penalty affect downstream performance and offering practical insights for efficient training.

Artificial IntelligenceLLMRLHF

0 likes · 11 min read

What Scaling Laws Reveal About LLM Fine‑Tuning and RLHF Performance

NewBeeNLP

Apr 1, 2024 · Artificial Intelligence

How Llama 2 Uses RLHF, PPO, Rejection Sampling, and Ghost Attention

This article provides a detailed technical walkthrough of Llama 2's Reinforcement Learning with Human Feedback pipeline, covering human preference data collection, reward‑model design and training, iterative fine‑tuning with PPO and rejection sampling, the Ghost Attention technique for multi‑turn consistency, and the resulting experimental evaluations.

Ghost AttentionLlama-2PPO

0 likes · 18 min read

How Llama 2 Uses RLHF, PPO, Rejection Sampling, and Ghost Attention

NewBeeNLP

Mar 27, 2024 · Artificial Intelligence

Deep Dive into Llama 2: Architecture, Pre‑training, SFT, and Safety Insights

This article provides a comprehensive technical overview of Meta's Llama 2 series, covering its architectural upgrades such as Group Query Attention, the pre‑training dataset and hyper‑parameters, loss behavior, benchmark comparisons, and the supervised fine‑tuning pipeline with safety considerations.

AILlama-2Model architecture

0 likes · 11 min read

Deep Dive into Llama 2: Architecture, Pre‑training, SFT, and Safety Insights

Rare Earth Juejin Tech Community

Feb 18, 2024 · Artificial Intelligence

Llama 2: Open Foundation and Fine‑Tuned Chat Models – Overview and Technical Details

The article provides a comprehensive overview of Meta’s Llama 2 series, detailing model sizes, pre‑training data, architectural enhancements, supervised fine‑tuning, RLHF procedures, safety evaluations, reward‑model training, and iterative improvements, highlighting its open‑source release and comparative performance.

AI safetyLarge Language ModelLlama2

0 likes · 27 min read

Llama 2: Open Foundation and Fine‑Tuned Chat Models – Overview and Technical Details

DataFunTalk

Jan 29, 2024 · Artificial Intelligence

PAI‑ChatLearn: A Flexible Large‑Scale RLHF Training Framework for Massive Models

The article introduces PAI‑ChatLearn, a flexible and high‑performance framework developed by Alibaba Cloud's PAI team that supports full‑pipeline RLHF training for large models, explains the evolution of parallel training strategies, details the framework’s architecture and configuration, and showcases performance results and practical usage examples.

AI FrameworkDistributed computingPAI-ChatLearn

0 likes · 17 min read

PAI‑ChatLearn: A Flexible Large‑Scale RLHF Training Framework for Massive Models

Baobao Algorithm Notes

Jan 13, 2024 · Artificial Intelligence

How to Boost Reward Model Performance in RLHF: Data and Algorithm Strategies from the MOSS Report

This article analyzes the MOSS technical report on RLHF, identifying low data quality and poor model generalization as key challenges, and presents data‑centric and algorithmic solutions—including multi‑model preference strength measurement, soft labels, adaptive margins, contrastive learning, and MetaRM—backed by detailed experiments and visualizations.

GeneralizationMeta LearningPreference Strength

0 likes · 12 min read

How to Boost Reward Model Performance in RLHF: Data and Algorithm Strategies from the MOSS Report