Tagged articles
690 articles
Page 4 of 7
Kuaishou Large Model
Kuaishou Large Model
Aug 19, 2025 · Artificial Intelligence

How Klear-Reasoner Achieves SOTA Math & Code Reasoning with GPPO

Klear-Reasoner, built on Qwen3‑8B‑Base, introduces the Gradient‑Preserving Clipping Policy Optimization (GPPO) algorithm to overcome traditional clip limitations, achieving state‑of‑the‑art performance on AIME2024/2025 and LiveCodeBench while providing detailed experimental analysis and data‑quality insights.

GPPOcode reasoninggradient clipping
0 likes · 11 min read
How Klear-Reasoner Achieves SOTA Math & Code Reasoning with GPPO
AntTech
AntTech
Aug 19, 2025 · Artificial Intelligence

How UI‑Venus Achieves SOTA in Multimodal GUI Agent Benchmarks

Ant Group's open‑source native GUI agent UI‑Venus leverages multimodal large‑model and reinforcement‑learning techniques to outperform prior models on grounding and navigation benchmarks, while using a high‑quality data pipeline and a self‑evolving alignment mechanism to push the limits of GUI automation.

GUI AgentSOTAbenchmark
0 likes · 7 min read
How UI‑Venus Achieves SOTA in Multimodal GUI Agent Benchmarks
AI Info Trend
AI Info Trend
Aug 19, 2025 · Industry Insights

What’s Driving the AI Revolution in 2025? Key Trends and Insights

The 2025 H1 AI Core Achievements and Trends report reveals how agents are reshaping productivity, models are gaining inference power and becoming smaller, reinforcement learning is overtaking pre‑training, and industry competition is intensifying, with China and the US narrowing their technology gap.

AIChinaIndustry Insights
0 likes · 10 min read
What’s Driving the AI Revolution in 2025? Key Trends and Insights
Kuaishou Tech
Kuaishou Tech
Aug 18, 2025 · Artificial Intelligence

How Klear-Reasoner Achieves SOTA Math & Code Reasoning with GPPO Optimization

The Klear‑Reasoner model, built on Qwen3‑8B‑Base and powered by the novel Gradient‑Preserving Clipping Policy Optimization (GPPO) algorithm, surpasses same‑size open‑source baselines on challenging math (AIME) and code (LiveCodeBench) benchmarks, while revealing key insights on data quality, reward design, and clipping strategies for large‑language‑model reasoning.

GPPOLLMcode reasoning
0 likes · 11 min read
How Klear-Reasoner Achieves SOTA Math & Code Reasoning with GPPO Optimization
Baobao Algorithm Notes
Baobao Algorithm Notes
Aug 15, 2025 · Artificial Intelligence

Unlocking LLM Performance: Classic Deep RL Tricks Reimagined for Modern Training

This article systematically adapts classic deep reinforcement‑learning techniques—such as multi‑step returns, TD(λ)/GAE, V‑trace corrections, uncertainty‑aware weighting, safety constraints, distribution‑robust optimization, and value‑guided decoding—to improve large language model training and inference, providing concrete formulas, implementation tips, and empirical results.

Deep RLGAELLM
0 likes · 17 min read
Unlocking LLM Performance: Classic Deep RL Tricks Reimagined for Modern Training
Baobao Algorithm Notes
Baobao Algorithm Notes
Aug 14, 2025 · Artificial Intelligence

Why Standard SFT Fails to Generalize and How One‑Line Dynamic Fine‑Tuning Fixes It

The article analyzes the poor generalization of supervised fine‑tuning (SFT) for large language models, reveals its gradient as a high‑variance inverse‑probability policy gradient, proposes a one‑line Dynamic Fine‑Tuning correction, and shows substantial gains on challenging math and offline RL benchmarks.

Dynamic Fine-TuningGeneralizationLLM alignment
0 likes · 7 min read
Why Standard SFT Fails to Generalize and How One‑Line Dynamic Fine‑Tuning Fixes It
AIWalker
AIWalker
Aug 13, 2025 · Artificial Intelligence

Look-Back Triggers Visual Reflection in Qwen-2.5-VL, +6.3% Perception

Look-Back is an implicit training paradigm that enables the Qwen‑2.5‑VL‑7B multimodal LLM to autonomously re‑focus on visual inputs during reasoning, achieving a 6.3 % boost in perception tasks, outperforming prior baselines while requiring no extra image tokens or model architecture changes.

Look-BackQwen-2.5-VLimplicit training
0 likes · 26 min read
Look-Back Triggers Visual Reflection in Qwen-2.5-VL, +6.3% Perception
Kuaishou Tech
Kuaishou Tech
Aug 6, 2025 · Artificial Intelligence

How Supervised Learning‑Enhanced Multi‑Group Actor‑Critic Boosts Live Stream Allocation in Short‑Video Feeds

This article presents the SL‑MGAC framework, a supervised‑learning‑enhanced multi‑group Actor‑Critic algorithm that improves live‑stream insertion decisions in mixed short‑video and live‑stream recommendation systems, achieving higher stability and better long‑term user engagement while satisfying platform constraints, as validated by extensive offline and online experiments.

KDD 2025actor-criticlive stream recommendation
0 likes · 9 min read
How Supervised Learning‑Enhanced Multi‑Group Actor‑Critic Boosts Live Stream Allocation in Short‑Video Feeds
AIWalker
AIWalker
Aug 5, 2025 · Artificial Intelligence

Perception‑R1: RL Gives Visual Insight Without Chain‑of‑Thought, Beats Four Tasks

The paper introduces Perception‑R1, a rule‑based reinforcement‑learning framework that trains multimodal large language models for visual perception tasks without relying on chain‑of‑thought reasoning, and demonstrates up to 17.9% performance gains on RefCOCO+, PixMo‑Count, PageOCR and COCO2017, while analyzing the key roles of perception confusion and reward design.

RLHFbenchmarkmultimodal LLM
0 likes · 24 min read
Perception‑R1: RL Gives Visual Insight Without Chain‑of‑Thought, Beats Four Tasks
AI Info Trend
AI Info Trend
Aug 4, 2025 · Industry Insights

How AI Agents and Small Models Are Redefining Productivity in 2025 H1

The report analyzes first‑half‑2025 AI breakthroughs, covering the rise of general‑purpose agents, rapid inference improvements, small‑model proliferation, reinforcement‑learning compute dominance, evolving transformer architectures, and shifting industry dynamics, offering actionable insights for researchers, product leaders, and decision‑makers.

AIAgentLarge Language Model
0 likes · 9 min read
How AI Agents and Small Models Are Redefining Productivity in 2025 H1
JD Tech
JD Tech
Jul 29, 2025 · Artificial Intelligence

How Causal Inference Meets Large Language Models to Revolutionize E‑commerce Pricing

This article describes a QCon talk that combines causal inference with large language models to build a retrieval‑augmented generation pricing system for e‑commerce, detailing the three‑step algorithm, LLM‑driven modeling challenges, process‑reward tree search, reinforcement‑learning fine‑tuning, and experimental gains in accuracy and speed.

Retrieval-Augmented Generationcausal inferencee‑commerce pricing
0 likes · 17 min read
How Causal Inference Meets Large Language Models to Revolutionize E‑commerce Pricing
AI Algorithm Path
AI Algorithm Path
Jul 27, 2025 · Artificial Intelligence

Understanding RLHF: How Human Feedback Trains Modern LLMs

This article explains the RLHF (Reinforcement Learning from Human Feedback) pipeline that powers ChatGPT and other large language models, covering the limitations of traditional fine‑tuning, the creation of human‑feedback datasets, reward‑model training, loss design, and the final PPO‑based fine‑tuning step.

ChatGPTHuman FeedbackPPO
0 likes · 8 min read
Understanding RLHF: How Human Feedback Trains Modern LLMs
AI2ML AI to Machine Learning
AI2ML AI to Machine Learning
Jul 24, 2025 · Artificial Intelligence

Exploring Recent Large‑Model Agent Papers: Insights and Analyses

This article reviews a series of recent research papers on large‑model agents, covering topics such as reinforcement‑learning‑driven ML agents, premise‑critique ability of LLMs, long‑term tool‑augmented LLM evaluation, agentic RAG, set‑based retrieval for multi‑hop QA, mobile VLM agents, and broader surveys of LLM applications, summarizing each work’s problem statement, prior approaches, novel contributions, experimental results, limitations, and future directions.

LLM evaluationRetrieval-Augmented Generationagentic AI
0 likes · 46 min read
Exploring Recent Large‑Model Agent Papers: Insights and Analyses
Fun with Large Models
Fun with Large Models
Jul 23, 2025 · Artificial Intelligence

Why ChatGPT Agent Sets the Benchmark for Future Large‑Model AI Agents

The article analyzes OpenAI's ChatGPT Agent—its launch, performance metrics, all‑in‑one tool integration, real‑world use cases, pricing tiers, core capabilities, and how it surpasses competing agents like Manus, highlighting its significance for the next generation of AI agents.

AI agentChatGPT AgentUse Cases
0 likes · 11 min read
Why ChatGPT Agent Sets the Benchmark for Future Large‑Model AI Agents
JD Tech Talk
JD Tech Talk
Jul 23, 2025 · Artificial Intelligence

Causal Inference + LLMs: Transforming E‑Commerce Pricing Strategies

This article describes how integrating causal inference with large language models and Retrieval‑Augmented Generation can automate and explain e‑commerce product pricing, detailing the three‑step workflow, reinforcement‑learning rewards, experimental results, and future directions for end‑to‑end RAG‑LLM training.

RAGcausal inferencee‑commerce pricing
0 likes · 15 min read
Causal Inference + LLMs: Transforming E‑Commerce Pricing Strategies
JD Cloud Developers
JD Cloud Developers
Jul 23, 2025 · Artificial Intelligence

How Causal Inference Meets Large Language Models to Revolutionize E‑commerce Pricing

At QCon 2025, the author presented a novel approach that integrates causal inference with large language models using Retrieval‑Augmented Generation, process rewards, and tree‑search to generate explainable, accurate e‑commerce pricing recommendations, dramatically improving accuracy from 44% to 74% while cutting inference time to seconds.

causal inferencee‑commerce pricingreinforcement learning
0 likes · 14 min read
How Causal Inference Meets Large Language Models to Revolutionize E‑commerce Pricing
DataFunTalk
DataFunTalk
Jul 23, 2025 · Artificial Intelligence

Qwen3‑Coder: Open‑Source AI Programming Agent That Beats the Competition

Alibaba’s Tongyi team unveiled the open‑source Qwen3‑Coder, a massive 450‑billion‑parameter programming model that outperforms leading closed‑source solutions, supports up to 1 M token context, offers a free CLI tool, and demonstrates impressive code generation capabilities across animations, games, and real‑world tasks.

AI programmingLarge Language ModelOpen Source
0 likes · 5 min read
Qwen3‑Coder: Open‑Source AI Programming Agent That Beats the Competition
Kuaishou Tech
Kuaishou Tech
Jul 21, 2025 · Artificial Intelligence

Can AI Models Think on Demand? Inside KAT‑V1 AutoThink’s Dynamic Reasoning

The article introduces KAT‑V1 AutoThink, a dual‑mode large language model that automatically switches between thinking and non‑thinking modes based on problem difficulty, details its novel training paradigm, reinforcement‑learning enhancements, performance benchmarks against leading open‑source models, and provides open‑source resources for further research.

Knowledge DistillationLarge Language Modelauto-think
0 likes · 14 min read
Can AI Models Think on Demand? Inside KAT‑V1 AutoThink’s Dynamic Reasoning
JD Retail Technology
JD Retail Technology
Jul 21, 2025 · Artificial Intelligence

How Causal Inference Meets Large Language Models to Revolutionize E‑commerce Pricing

This article presents a comprehensive approach that combines causal inference, large language models, and retrieval‑augmented generation to automate e‑commerce price recommendation, detailing the three‑step workflow, challenges across product categories, the RAG architecture, process‑reward‑guided tree search, reinforcement learning refinements, and experimental results showing significant accuracy and speed improvements.

causal inferencechain-of-thoughte‑commerce pricing
0 likes · 16 min read
How Causal Inference Meets Large Language Models to Revolutionize E‑commerce Pricing
Alimama Tech
Alimama Tech
Jul 17, 2025 · Artificial Intelligence

How to Build a High‑Scoring AI Werewolf Agent: Strategies, Prompt Engineering, and Code

This article details the author's experience designing a top‑performing AI Werewolf agent for the Taotian Group's AI Werewolf Challenge, covering game rules, core challenges, prompt engineering, caching, concurrent requests, model selection, reinforcement‑learning‑style tuning, and tactical strategies for each role, with code examples.

AI agentLLMWerewolf
0 likes · 25 min read
How to Build a High‑Scoring AI Werewolf Agent: Strategies, Prompt Engineering, and Code
DataFunTalk
DataFunTalk
Jul 16, 2025 · Artificial Intelligence

MiniMax-M1 Revealed: Hybrid Attention, RL Training, and 1M Token Context

MiniMax’s latest M1 model, unveiled after a $300 million funding round, showcases a 4.56‑trillion‑parameter hybrid‑expert architecture with lightning attention, supporting up to one million tokens, and leverages reinforcement‑learning techniques to enhance long‑context handling, inference efficiency, and system‑2 reasoning capabilities.

AI scalingHybrid Attentionlarge language models
0 likes · 16 min read
MiniMax-M1 Revealed: Hybrid Attention, RL Training, and 1M Token Context
AI Algorithm Path
AI Algorithm Path
Jul 14, 2025 · Artificial Intelligence

The Most Powerful Open‑Source Agent Model: Kimi K2

Kimi K2, an open‑source trillion‑parameter AI model released by Moonshot AI, offers Base and Instruct variants, achieves leading scores on benchmarks such as SWE‑bench, LiveCodeBench and AceBench, and introduces a novel post‑training autonomous‑exploration stage with MuonClip optimization to enable robust tool use and reinforcement‑learning‑driven self‑improvement.

Benchmark performanceKimi K2Large Language Model
0 likes · 8 min read
The Most Powerful Open‑Source Agent Model: Kimi K2
AI Frontier Lectures
AI Frontier Lectures
Jul 14, 2025 · Artificial Intelligence

Can Language Models Self‑Edit? Inside SEAL’s Self‑Adapting LLM Framework

The article surveys recent AI self‑evolution research, highlights the SEAL self‑adapting language model framework, explains its reinforcement‑learning based self‑editing mechanism, and presents experimental results on few‑shot learning and knowledge integration, while noting limitations and providing links to the paper and code.

AI self-improvementMeta LearningSEAL
0 likes · 12 min read
Can Language Models Self‑Edit? Inside SEAL’s Self‑Adapting LLM Framework
Python Programming Learning Circle
Python Programming Learning Circle
Jul 10, 2025 · Artificial Intelligence

Build a DQN Autonomous Driving Agent with gym and highway‑env

This tutorial walks through installing gym and highway‑env, configuring six driving scenarios, processing observations (kinematics, images, occupancy grids), defining actions and rewards, constructing a DQN network, training it with a reinforcement‑learning loop, and analyzing collision, time, and reward metrics.

Autonomous DrivingDQNgym
0 likes · 10 min read
Build a DQN Autonomous Driving Agent with gym and highway‑env
Data Thinking Notes
Data Thinking Notes
Jul 8, 2025 · Artificial Intelligence

How Xiaohongshu Leverages Large Models to Revolutionize Content Recommendation

This article details Xiaohongshu's multi‑stage recommendation pipeline—using massive multi‑modal pre‑training, long‑sequence modeling, real‑time context features, reinforcement learning and online deep learning—to precisely surface valuable content, address cold‑start challenges, and break information bubbles for billions of users.

Large Language ModelMultimodal Learningonline deep learning
0 likes · 16 min read
How Xiaohongshu Leverages Large Models to Revolutionize Content Recommendation
DataFunSummit
DataFunSummit
Jul 5, 2025 · Artificial Intelligence

Boosting Large Model Training: Optimizing Performance with the Verl Framework

Join the DataFun Summit 2025 on July 12 to hear Tencent FinTech senior researcher Gong Dihong discuss how redesigning the Verl training system, integrating Megatron and Sglang, and applying new synchronization and offloading techniques dramatically speeds up large‑model reinforcement‑learning training.

AI PerformanceLarge ModelsMegatron
0 likes · 4 min read
Boosting Large Model Training: Optimizing Performance with the Verl Framework
AI Frontier Lectures
AI Frontier Lectures
Jul 2, 2025 · Artificial Intelligence

Can Language Models Self‑Edit? Inside the SEAL Framework for Self‑Adapting LLMs

This article reviews recent AI self‑evolution research and provides an in‑depth analysis of the SEAL (Self‑Adapting Language) framework, which enables large language models to generate and learn from their own synthetic data through a nested reinforcement‑learning and fine‑tuning loop, with experimental results on few‑shot and knowledge‑integration tasks.

Meta LearningSEALfew-shot learning
0 likes · 11 min read
Can Language Models Self‑Edit? Inside the SEAL Framework for Self‑Adapting LLMs
DataFunTalk
DataFunTalk
Jul 2, 2025 · Artificial Intelligence

How GLM-4.1V-Thinking Sets New Standards in Multimodal AI Reasoning

Zhipu AI unveiled the GLM-4.1V-Thinking series, an open‑source multimodal model that outperforms larger rivals on visual‑language tasks, supports video analysis, GUI agents, and advanced scientific reasoning, while introducing a curriculum‑sampling reinforcement‑learning framework and a new Agent application platform.

AI agentsGLM-4.1VOpen Source
0 likes · 10 min read
How GLM-4.1V-Thinking Sets New Standards in Multimodal AI Reasoning
Baobao Algorithm Notes
Baobao Algorithm Notes
Jun 30, 2025 · Artificial Intelligence

How End‑to‑End Reinforcement Learning Powers the Kimi‑Researcher AI Agent

The article examines Kimi‑Researcher, an AI research agent built with end‑to‑end reinforcement learning, detailing its technical motivations, advantages over traditional workflow‑based and SFT methods, performance breakthroughs on benchmark exams, and diverse real‑world use cases ranging from literature reviews to legal analysis.

AI agentBenchmark performanceEnd-to-End RL
0 likes · 10 min read
How End‑to‑End Reinforcement Learning Powers the Kimi‑Researcher AI Agent
Fighter's World
Fighter's World
Jun 28, 2025 · Artificial Intelligence

What Is the Generator‑Verifier Gap and Why It Matters for LLM Reasoning

The article explains the Generator‑Verifier Gap (GVG)—the asymmetry where verifying a solution is far cheaper than generating it—covers its origin, its impact on test‑time scaling for large language models, reinforcement‑learning approaches, and how the concept can shape agent architectures and AI product strategy.

Agent ArchitectureGenerator-Verifier GapLLM
0 likes · 21 min read
What Is the Generator‑Verifier Gap and Why It Matters for LLM Reasoning
Alimama Tech
Alimama Tech
Jun 25, 2025 · Artificial Intelligence

Introducing ROLL: A Scalable, User‑Friendly RL Framework for Large‑Scale LLM Training

ROLL is an open‑source reinforcement‑learning framework designed for large language model post‑training that combines multi‑task RL, agentic support, flexible algorithm configuration, elastic resource scheduling, and rich observability, delivering significant accuracy gains across benchmarks while remaining easy to use for researchers, product developers, and infrastructure engineers.

AI FrameworkOpen SourceRLHF
0 likes · 11 min read
Introducing ROLL: A Scalable, User‑Friendly RL Framework for Large‑Scale LLM Training
DataFunTalk
DataFunTalk
Jun 21, 2025 · Artificial Intelligence

Why AI Gets Overconfident: Bias, Hallucinations, and Reinforcement Learning Solutions

This talk explores how large AI models become overconfident, leading to bias and hallucinations, examines adversarial examples in vision and language, explains why data and algorithms cause these issues, and shows how reinforcement learning can teach models to admit uncertainty and align with human values.

AI alignmentAI safetyBias
0 likes · 19 min read
Why AI Gets Overconfident: Bias, Hallucinations, and Reinforcement Learning Solutions
Kuaishou Large Model
Kuaishou Large Model
Jun 20, 2025 · Artificial Intelligence

How OneRec Revolutionizes Short-Video Recommendations with End-to-End Generative AI

OneRec, an end-to-end generative recommendation system from Kuaishou, uses an encoder-decoder architecture, reward-based preference alignment, and reinforcement learning to dramatically improve video recommendation efficiency, boosting user engagement and reducing operational costs while achieving scaling-law performance comparable to large language models.

EfficiencyKuaishouLarge Models
0 likes · 18 min read
How OneRec Revolutionizes Short-Video Recommendations with End-to-End Generative AI
Kuaishou Tech
Kuaishou Tech
Jun 20, 2025 · Artificial Intelligence

How OneRec Redefines Recommendation with End‑to‑End Generative Modeling and RL Alignment

The OneRec system from Kuaishou replaces traditional cascade recommendation pipelines with an encoder‑decoder architecture, leverages reward‑based preference alignment via reinforcement learning, achieves ten‑fold FLOPs gains, cuts operational costs by 90%, and delivers significant user‑engagement improvements across short‑video and local‑service scenarios.

Generative ModelingKuaishouOneRec
0 likes · 17 min read
How OneRec Redefines Recommendation with End‑to‑End Generative Modeling and RL Alignment
Xiaohongshu Tech REDtech
Xiaohongshu Tech REDtech
Jun 19, 2025 · Artificial Intelligence

Can Adaptive Chain‑of‑Thought Learning Halve LLM Thinking Time?

The article introduces the Think When You Need (TWYN) method, a reinforcement‑learning approach that dynamically adapts chain‑of‑thought length, dramatically cuts redundant token generation in large language models, and maintains or improves accuracy across diverse reasoning benchmarks.

Efficiencyadaptive inferencechain-of-thought
0 likes · 9 min read
Can Adaptive Chain‑of‑Thought Learning Halve LLM Thinking Time?
DataFunTalk
DataFunTalk
Jun 17, 2025 · Artificial Intelligence

Kimi-Dev-72B Sets New Open‑Source SOTA on SWE‑bench Verified (60.4% Score)

Kimi-Dev-72B, an open-source 72-billion-parameter code model from Moonshot AI, achieved a record 60.4% score on the SWE-bench Verified benchmark, surpassing larger models, and incorporates BugFixer/TestWriter dual roles, extensive mid-stage training on billions of GitHub data, and reinforcement-learning-driven self-play, with code available on Hugging Face and GitHub.

AISWE-benchopen-source
0 likes · 7 min read
Kimi-Dev-72B Sets New Open‑Source SOTA on SWE‑bench Verified (60.4% Score)
Fighter's World
Fighter's World
Jun 14, 2025 · Artificial Intelligence

How Can LLMs Learn to “Think” in Complex Industry Scenarios?

The article analyzes how large language models can acquire true reasoning abilities for hard‑to‑score industry tasks by combining Chain‑of‑Thought prompting with reinforcement learning, addressing vague reward signals, reward hacking, and loyalty, and proposing a toolbox of reward engineering, synthetic data, hierarchical RL and multi‑agent collaboration.

LLMReward Modelingchain-of-thought
0 likes · 22 min read
How Can LLMs Learn to “Think” in Complex Industry Scenarios?
Fun with Large Models
Fun with Large Models
Jun 12, 2025 · Artificial Intelligence

Implement GRPO to Give LLMs Reasoning Ability with Qwen2.5‑0.5B

This article explains the GRPO reinforcement‑learning algorithm, shows its core idea of internal group competition without a separate evaluator model, and provides a complete, step‑by‑step code walkthrough—including environment setup, dataset preparation, reward‑function design, training configuration, and evaluation—using the Qwen2.5‑0.5B‑Instruct model on the GSM8K math dataset.

GRPOGSM8KQwen2.5
0 likes · 23 min read
Implement GRPO to Give LLMs Reasoning Ability with Qwen2.5‑0.5B
Kuaishou Tech
Kuaishou Tech
Jun 4, 2025 · Artificial Intelligence

KwaiCoder-AutoThink-preview: An Automatic‑Thinking Large Model Enhanced with Step‑SRPO Reinforcement Learning

The KwaiPilot team released the KwaiCoder‑AutoThink‑preview model, which introduces a novel automatic‑thinking training paradigm and a process‑supervised reinforcement‑learning method called Step‑SRPO, enabling the model to dynamically switch between thinking and non‑thinking modes, reduce inference cost, and achieve up to 20‑point gains on code and math benchmarks while handling large‑scale codebases.

AI researchLarge Language Modelautomatic thinking
0 likes · 12 min read
KwaiCoder-AutoThink-preview: An Automatic‑Thinking Large Model Enhanced with Step‑SRPO Reinforcement Learning
Baobao Algorithm Notes
Baobao Algorithm Notes
Jun 3, 2025 · Artificial Intelligence

Can 1K Fine‑Tuning Replace 100K RL Steps? Insights from Re‑distillation Research

An extensive analysis shows that a 1K‑sample fine‑tuning stage can replicate the generalization gains of thousands of reinforcement‑learning steps, explains the compressibility of RL, introduces a sample‑effect theory, and demonstrates that re‑distillation and small‑scale SFT dramatically improve LLM performance.

Re-distillationSample Effectlarge language models
0 likes · 23 min read
Can 1K Fine‑Tuning Replace 100K RL Steps? Insights from Re‑distillation Research
AI Frontier Lectures
AI Frontier Lectures
May 31, 2025 · Artificial Intelligence

Why Embodied Intelligence Is Exploding and What It Means for the Future

The article analyzes the recent surge in embodied intelligence, examines why physical agents matter despite advances in large language models, outlines common failure modes, discusses key research decisions such as 2D versus 3D perception and tactile sensing, and explores the roles of imitation learning, VLA, and reinforcement learning in shaping the field.

VLAVisionimitation learning
0 likes · 24 min read
Why Embodied Intelligence Is Exploding and What It Means for the Future
AI Frontier Lectures
AI Frontier Lectures
May 30, 2025 · Artificial Intelligence

Can Diffusion Chains Unlock More Creative Reasoning in Large Language Models?

Recent work from West Lake University's MAPLE Lab introduces a diffusion‑based “Divergent Thought Chain” that treats each intermediate denoising step of a diffusion language model as a reasoning step, using result‑based reinforcement learning to optimize non‑linear token generation and achieving state‑of‑the‑art performance on math and code tasks.

chain-of-thoughtcode generationdiffusion language models
0 likes · 14 min read
Can Diffusion Chains Unlock More Creative Reasoning in Large Language Models?
Alibaba Cloud Developer
Alibaba Cloud Developer
May 28, 2025 · Artificial Intelligence

Unlocking LLM Fine‑Tuning: From Architecture to LoRA, DPO and Deployment

This article provides a comprehensive guide to large language model fine‑tuning, covering model architecture, parameter and memory calculations, prompt engineering, data construction, LoRA and PEFT techniques, reinforcement learning methods such as DPO, and practical deployment workflows on internal platforms.

Fine‑TuningLLMLoRA
0 likes · 21 min read
Unlocking LLM Fine‑Tuning: From Architecture to LoRA, DPO and Deployment
JD Cloud Developers
JD Cloud Developers
May 27, 2025 · Artificial Intelligence

How JD’s Young AI Engineers Tackle Real-World Model Challenges

Young JD algorithm engineers share how they solve tough AI problems—from optimizing large‑model training and reward‑model design for ad image generation, to building LLM‑based query expansion, agent evaluation, and model pruning with FFT and RDP—illustrating practical breakthroughs and personal growth in cutting‑edge AI research.

AIModel PruningReward Modeling
0 likes · 15 min read
How JD’s Young AI Engineers Tackle Real-World Model Challenges
AI Algorithm Path
AI Algorithm Path
May 27, 2025 · Artificial Intelligence

Reinforcement Learning Tutorial 8: Building State Feature Representations for Objective Optimization

This tutorial explains how to construct state feature vectors for reinforcement‑learning value‑function approximation, covering linear, polynomial, Fourier, and radial‑basis representations, as well as state aggregation techniques such as coarse coding and tile coding, and discusses non‑parametric approaches like kernel methods.

feature engineeringfourier basisfunction approximation
0 likes · 16 min read
Reinforcement Learning Tutorial 8: Building State Feature Representations for Objective Optimization
AIWalker
AIWalker
May 26, 2025 · Artificial Intelligence

VisionReasoner: RL‑Unified Model Beats YOLO‑World Detection, Segmentation, Counting

VisionReasoner presents a reinforcement‑learning‑driven unified framework that simultaneously tackles detection, segmentation, and counting tasks, employing a novel multi‑target cognition strategy and efficient Hungarian‑based matching, and demonstrates substantial gains—29.1% on COCO detection, 22.1% on ReasonSeg, and 15.3% on CountBench—using only 7,000 training samples.

SegmentationVisionReasonercounting
0 likes · 20 min read
VisionReasoner: RL‑Unified Model Beats YOLO‑World Detection, Segmentation, Counting
JD Tech
JD Tech
May 26, 2025 · Artificial Intelligence

Solving Technical Challenges at JD Retail: Multi‑Reward Models, LLM‑Based Query Expansion, Model Pruning, and Reinforcement Learning

This article details how JD Retail's young algorithm engineers tackled a series of AI engineering problems—including advertising image quality assessment with multi‑reward models, large‑language‑model‑driven query expansion, FFT‑and‑RDP‑based model pruning, and agent‑centric reinforcement learning—while sharing practical growth insights and code snippets.

AIcomputer visionlarge language models
0 likes · 15 min read
Solving Technical Challenges at JD Retail: Multi‑Reward Models, LLM‑Based Query Expansion, Model Pruning, and Reinforcement Learning
Alibaba Cloud Developer
Alibaba Cloud Developer
May 26, 2025 · Artificial Intelligence

How Multi‑Agent Planning Boosts Copilot 3.0 with DeepSeek R1 GRPO Training

This article examines Copilot 3.0’s planning module, explains how DeepSeek R1’s GRPO reinforcement‑learning pipeline enables flexible multi‑agent orchestration, addresses the limitations of Copilot 2.0, and presents experimental results that show a 61% reduction in reasoning length and a 9% relative gain in accuracy.

AIPlanningmodel training
0 likes · 14 min read
How Multi‑Agent Planning Boosts Copilot 3.0 with DeepSeek R1 GRPO Training
AI Algorithm Path
AI Algorithm Path
May 25, 2025 · Artificial Intelligence

Reinforcement Learning Tutorial 7: Introducing Value Function Approximation Methods

This article explains why tabular reinforcement‑learning methods scale poorly, introduces supervised‑learning‑based value‑function approximation using a parameterized vector w, discusses loss design, stochastic‑gradient updates, bootstrapping, semi‑gradient techniques, and linear function approximation, and summarizes practical implications.

gradient Monte Carlolinear function approximationreinforcement learning
0 likes · 13 min read
Reinforcement Learning Tutorial 7: Introducing Value Function Approximation Methods
IT Services Circle
IT Services Circle
May 25, 2025 · Artificial Intelligence

DeepSeek Core Technologies and Model Innovations: DeepSeek‑V3 and DeepSeek‑R1 Technical Overview

The article provides a detailed technical overview of DeepSeek's flagship large language models, DeepSeek‑V3 and DeepSeek‑R1, describing their MoE architecture, training frameworks, reinforcement‑learning based fine‑tuning, inference optimizations, and the broader impact of these innovations on the AI landscape while also promoting related books and resources.

AIDeepSeekLarge Language Model
0 likes · 10 min read
DeepSeek Core Technologies and Model Innovations: DeepSeek‑V3 and DeepSeek‑R1 Technical Overview
AI Algorithm Path
AI Algorithm Path
May 24, 2025 · Artificial Intelligence

How N-step Temporal-Difference Methods Extend TD Learning in Reinforcement AI

This tutorial explains how n-step temporal‑difference (TD) algorithms generalize the one‑step TD and Monte‑Carlo methods, presents the n‑step return update rule, walks through a three‑step TD example, shows how Sarsa and Q‑learning can be extended, and discusses how to choose the optimal n value for a given problem.

Monte CarloQ-Learningalgorithm analysis
0 likes · 9 min read
How N-step Temporal-Difference Methods Extend TD Learning in Reinforcement AI
AI Algorithm Path
AI Algorithm Path
May 23, 2025 · Artificial Intelligence

Understanding Temporal‑Difference Algorithms in Reinforcement Learning

This tutorial explains temporal‑difference (TD) learning, compares it with dynamic programming and Monte‑Carlo methods, walks through concrete soccer‑match examples, shows one‑step TD versus constant‑α Monte‑Carlo updates, discusses convergence, bias, and introduces popular TD variants such as Sarsa, Q‑learning, Expected Sarsa and double learning.

Monte CarloQ-LearningTD learning
0 likes · 18 min read
Understanding Temporal‑Difference Algorithms in Reinforcement Learning
AI Algorithm Path
AI Algorithm Path
May 22, 2025 · Artificial Intelligence

Monte Carlo Policy Improvement in RL: Epsilon‑Greedy, On‑Policy vs Off‑Policy, and Incremental Updates

This tutorial explains how Monte Carlo methods are enhanced in reinforcement learning through epsilon‑greedy and epsilon‑soft policies, Monte Carlo control, a Blackjack Q‑function example, the distinction between on‑policy and off‑policy learning, importance sampling, and efficient incremental update techniques.

Epsilon-GreedyImportance SamplingMonte Carlo
0 likes · 14 min read
Monte Carlo Policy Improvement in RL: Epsilon‑Greedy, On‑Policy vs Off‑Policy, and Incremental Updates
AIWalker
AIWalker
May 22, 2025 · Artificial Intelligence

VisionReasoner: RL‑Unified System Beats YOLO‑World on Detection, Segmentation, Counting

VisionReasoner introduces a reinforcement‑learning‑driven unified framework that simultaneously handles detection, segmentation, and counting tasks within a single model, achieving 29.1% higher COCO detection AP, 22.1% better ReasonSeg segmentation, and 15.3% improvement on CountBench, while requiring only 7,000 training samples and offering efficient multi‑target matching via batch computation and the Hungarian algorithm.

LVLMObject CountingVisionReasoner
0 likes · 19 min read
VisionReasoner: RL‑Unified System Beats YOLO‑World on Detection, Segmentation, Counting
JD Tech Talk
JD Tech Talk
May 22, 2025 · Artificial Intelligence

From Academic Research to Industrial Anti‑Fraud: Leveraging LLMs, Reinforcement Learning, and Model Distillation for Advertising Risk Detection

The article recounts Xiaoting’s journey from a PhD research background to leading JD.com’s ad‑fraud detection, detailing how large language models, reinforcement learning, and model distillation were applied to identify hidden address codes, reduce false‑positive rates to 0.3%, and balance accuracy with real‑time performance in a high‑traffic e‑commerce environment.

AIAd FraudAdvertising
0 likes · 11 min read
From Academic Research to Industrial Anti‑Fraud: Leveraging LLMs, Reinforcement Learning, and Model Distillation for Advertising Risk Detection
JD Retail Technology
JD Retail Technology
May 22, 2025 · Industry Insights

Cracking Hidden Ad Fraud: JD’s AI‑Driven Anti‑Cheat System Explained

This article recounts the journey of a JD PhD trainee who transformed academic research on anomaly detection into a production‑grade, LLM‑enhanced anti‑fraud system that identifies concealed address codes in CPS ads, detailing model design, LoRA fine‑tuning, reinforcement learning, distillation, cost‑aware deployment, and lessons learned for scalable ad risk management.

Large Language Modelad fraud detectionindustry AI
0 likes · 12 min read
Cracking Hidden Ad Fraud: JD’s AI‑Driven Anti‑Cheat System Explained
AI Algorithm Path
AI Algorithm Path
May 19, 2025 · Artificial Intelligence

Understanding Policy Evaluation and Improvement in Reinforcement Learning

This article explains how to solve Bellman equations, use iterative policy‑evaluation methods, apply the policy‑improvement theorem, and combine both steps in policy iteration, value iteration, and asynchronous variants, illustrated with a 5‑state example and a 4×4 gridworld.

Bellman equationGridWorldgeneralized policy iteration
0 likes · 15 min read
Understanding Policy Evaluation and Improvement in Reinforcement Learning
Amap Tech
Amap Tech
May 19, 2025 · Artificial Intelligence

Group Policy Gradient: Direct Objective Optimization for Faster Reinforcement Learning

The article introduces Group Policy Gradient (GPG), a reinforcement‑learning framework that eliminates surrogate loss functions and critic models, directly optimizes the original objective, reduces bias and variance, and achieves state‑of‑the‑art performance on both single‑modal and multimodal tasks.

AI researchLLM fine-tuningPolicy Gradient
0 likes · 7 min read
Group Policy Gradient: Direct Objective Optimization for Faster Reinforcement Learning
AI Algorithm Path
AI Algorithm Path
May 18, 2025 · Artificial Intelligence

Reinforcement Learning Tutorial Part 1: Core Concepts Explained

This article introduces the fundamental concepts of reinforcement learning, covering the agent‑environment interaction, key terminology, reward structures, task types, policies, value functions, the Bellman equations, and how optimal strategies are derived and approximated in practice.

Bellman equationMarkov Decision ProcessOptimal Policy
0 likes · 13 min read
Reinforcement Learning Tutorial Part 1: Core Concepts Explained
Kuaishou Tech
Kuaishou Tech
May 13, 2025 · Artificial Intelligence

How KuaiMod Uses Multimodal AI to Revolutionize Short‑Video Content Quality

This article analyzes KuaiMod, a multimodal large‑model solution developed by Kuaishou for short‑video content quality assessment, detailing its benchmark dataset, chain‑of‑thought data construction, offline SFT + DPO training, online reinforcement‑learning updates, evaluation results, and large‑scale deployment impact.

KuaiModbenchmarkcontent moderation
0 likes · 19 min read
How KuaiMod Uses Multimodal AI to Revolutionize Short‑Video Content Quality
AI Frontier Lectures
AI Frontier Lectures
May 13, 2025 · Artificial Intelligence

How T2I‑R1 Boosts Text‑to‑Image Generation with Dual‑Level CoT Reasoning

Recent large language models have shown strong reasoning abilities, and this work extends chain‑of‑thought reasoning to autoregressive image generation by introducing T2I‑R1, a dual‑level (Semantic‑CoT and Token‑CoT) framework trained with reinforcement learning that unifies high‑level planning and low‑level token generation, achieving state‑of‑the‑art results.

generative AIreinforcement learningsemantic planning
0 likes · 7 min read
How T2I‑R1 Boosts Text‑to‑Image Generation with Dual‑Level CoT Reasoning
AI Frontier Lectures
AI Frontier Lectures
May 13, 2025 · Artificial Intelligence

How Diffusion Policy is Transforming Vision‑Based Robot Motion Learning

This article provides a comprehensive, step‑by‑step analysis of Diffusion Policy for robot visuomotor control, covering its motivation, task characteristics, model design, dataset preparation, training pipeline, inference procedure, experimental results, and open research questions.

Machine Learningdiffusion modelspolicy learning
0 likes · 63 min read
How Diffusion Policy is Transforming Vision‑Based Robot Motion Learning
Tencent Technical Engineering
Tencent Technical Engineering
May 12, 2025 · Artificial Intelligence

Comprehensive Summary and Expansion of Andrej Karpathy’s 7‑Hour LLM Lecture

This article provides a detailed Chinese‑to‑English summary of Andrej Karpathy’s 7‑hour LLM tutorial, covering chat process analysis, tokenization, pre‑training data pipelines, model architecture, training strategies, post‑training fine‑tuning, reinforcement learning, chain‑of‑thought reasoning, and current industry applications.

AILLMmodel architecture
0 likes · 25 min read
Comprehensive Summary and Expansion of Andrej Karpathy’s 7‑Hour LLM Lecture
JD Retail Technology
JD Retail Technology
May 7, 2025 · Artificial Intelligence

Solving Technical Challenges with Large AI Models at JD Retail: Reward Modeling, Query Expansion, and Model Pruning

JD Retail’s engineering team tackles hard AI problems by replacing a monolithic reward model with specialized small models for ad‑image generation, deploying an LLM‑driven query‑expansion pipeline that lifts conversion rates, and pruning text‑to‑image transformers using FFT and RDP to boost throughput 40% without loss, while building comprehensive evaluation tools and a semantic smart‑assistant.

AILarge ModelsModel Pruning
0 likes · 14 min read
Solving Technical Challenges with Large AI Models at JD Retail: Reward Modeling, Query Expansion, and Model Pruning
AIWalker
AIWalker
May 6, 2025 · Artificial Intelligence

SimpleAR: High‑Quality 1024×1024 Images with Just 0.5B Parameters via Pretraining, SFT, and RL

SimpleAR demonstrates that a vanilla autoregressive model with only 0.5 B parameters can generate high‑fidelity 1024×1024 images, covering pretraining, supervised fine‑tuning, and reinforcement learning, achieving competitive GenEval (0.59) and DPG‑Bench (79.66) scores while reducing inference time to about 14 seconds with vLLM and KV‑cache optimizations.

Supervised Fine‑Tuningautoregressivebenchmark
0 likes · 14 min read
SimpleAR: High‑Quality 1024×1024 Images with Just 0.5B Parameters via Pretraining, SFT, and RL
DevOps
DevOps
May 5, 2025 · Artificial Intelligence

DeepSeek Releases Math‑Specialized Large Model V2 and ProverBench Evaluation Suite

DeepSeek has quietly open‑sourced a new mathematics‑focused large language model, DeepSeek‑Prover‑V2 (available in 671B and 7B variants), achieving 88.9% on MiniF2F and strong results on PutnamBench, alongside the high‑quality ProverBench dataset and a novel recursive theorem‑proving pipeline.

AIDeepSeekLarge Language Model
0 likes · 4 min read
DeepSeek Releases Math‑Specialized Large Model V2 and ProverBench Evaluation Suite
Architect
Architect
May 5, 2025 · Artificial Intelligence

How Agentic RAG‑R1 Turns Retrieval‑Augmented Generation into an Autonomous AI Agent

Agentic RAG‑R1, an open‑source project from Peking University, combines Retrieval‑Augmented Generation with an agentic AI loop, introduces the GRPO reinforcement‑learning optimizer, supports LoRA‑based fine‑tuning, quantization and multimodal tool calls, and demonstrates significant accuracy gains on the MedQA benchmark across both Chinese and English test sets.

LLM Tool UseOpen SourceRetrieval-Augmented Generation
0 likes · 8 min read
How Agentic RAG‑R1 Turns Retrieval‑Augmented Generation into an Autonomous AI Agent
AI Frontier Lectures
AI Frontier Lectures
May 5, 2025 · Industry Insights

What Will Large Language Models Look Like in the Next Five Years? A Deep Dive into Trends and Challenges

The article reviews five years of AI model evolution, analyzes current scaling and reinforcement‑learning trends, and forecasts architectural, mathematical, and infrastructure directions for large language models through 2030, highlighting potential breakthroughs and the risks of over‑reliance on benchmarks.

AI trendsModel Scalingindustry analysis
0 likes · 22 min read
What Will Large Language Models Look Like in the Next Five Years? A Deep Dive into Trends and Challenges
AI Algorithm Path
AI Algorithm Path
May 3, 2025 · Artificial Intelligence

DeepSeek Prover V2: Pioneering the Next Era of AI‑Driven Formal Math Reasoning

DeepSeek‑Prover‑V2, an open‑source LLM specialized for Lean 4, bridges intuitive high‑level reasoning and strict formal verification through sub‑goal decomposition, dual operation modes, and a novel cold‑start data pipeline, achieving state‑of‑the‑art results on MiniF2F, PutnamBench and CombiBench while highlighting trade‑offs in inference cost and model scalability.

AI mathematicsDeepSeek Prover V2LLM
0 likes · 18 min read
DeepSeek Prover V2: Pioneering the Next Era of AI‑Driven Formal Math Reasoning
Baobao Algorithm Notes
Baobao Algorithm Notes
May 2, 2025 · Artificial Intelligence

Do Reinforcement Learning Techniques Really Boost LLM Reasoning? A Deep Dive into Recent Models

This article analyzes whether reinforcement learning enhances large language model reasoning, compares findings from DeepSeek-Math, a Tsinghua‑Shanghai Jiao‑Tong paper, and Qwen3, and outlines practical training pipelines—including Seed‑Thinking‑v1.5, DeepSeek‑R1, Kimi‑K1.5, and Qwen3—that aim to endow LLMs with robust reasoning capabilities.

Artificial IntelligenceLLMReasoning
0 likes · 12 min read
Do Reinforcement Learning Techniques Really Boost LLM Reasoning? A Deep Dive into Recent Models
Mafengwo Technology
Mafengwo Technology
Apr 30, 2025 · Artificial Intelligence

How MaFengWo’s mfw-32B Travel LLM Outperforms DeepSeek‑R1 in Speed and Accuracy

The article details the development, training, and evaluation of MaFengWo's 32‑billion‑parameter travel large language model (mfw‑32B), highlighting its superior itinerary planning, personalized demand capture, budget management, and resource efficiency compared to DeepSeek‑R1, and describing the SFT and reinforcement‑learning stages that enabled these gains.

Large Language ModelLoRAai-optimization
0 likes · 14 min read
How MaFengWo’s mfw-32B Travel LLM Outperforms DeepSeek‑R1 in Speed and Accuracy
AIWalker
AIWalker
Apr 28, 2025 · Artificial Intelligence

SimpleAR: Autoregressive Visual Generation at 1024×1024 Using Only 0.5B Parameters

SimpleAR is a minimalist autoregressive visual generation framework that, with only 0.5 B parameters, achieves competitive 1024×1024 image synthesis through a three‑stage pipeline of large‑scale pretraining, supervised fine‑tuning, and GRPO‑based reinforcement learning, and demonstrates significant inference speedups using KV‑cache, vLLM, and speculative decoding.

Inference Accelerationautoregressive generationbenchmark
0 likes · 14 min read
SimpleAR: Autoregressive Visual Generation at 1024×1024 Using Only 0.5B Parameters
DataFunTalk
DataFunTalk
Apr 25, 2025 · Artificial Intelligence

Does Reinforcement Learning Really Expand Reasoning Capacity in Large Language Models? Insights from Recent Empirical Study

Recent empirical research by Tsinghua’s LeapLab and Shanghai Jiao Tong University reveals that reinforcement‑learning‑based fine‑tuning (RLVR) improves sampling efficiency but does not extend the fundamental reasoning abilities of large language models beyond their base capabilities, as demonstrated across mathematics, code, and visual reasoning benchmarks.

AI researchRLVRReasoning
0 likes · 12 min read
Does Reinforcement Learning Really Expand Reasoning Capacity in Large Language Models? Insights from Recent Empirical Study
AntTech
AntTech
Apr 24, 2025 · Artificial Intelligence

Key Takeaways from Ant Group and Tsinghua’s Presentations on the AReaL Reinforcement Learning Framework and AWorld Multi‑Agent Framework at ICLR 2025

At ICLR 2025 in Singapore, Ant Group and Tsinghua University showcased the open‑source reinforcement‑learning platform AReaL and the multi‑agent system AWorld, highlighting their recent breakthroughs, system design challenges, performance results on the GAIA benchmark, and upcoming development plans.

AI frameworksICLR2025Open Source
0 likes · 7 min read
Key Takeaways from Ant Group and Tsinghua’s Presentations on the AReaL Reinforcement Learning Framework and AWorld Multi‑Agent Framework at ICLR 2025
Kuaishou Tech
Kuaishou Tech
Apr 24, 2025 · Artificial Intelligence

Two‑Stage History‑Resampling Policy Optimization (SRPO) for Large‑Scale LLM Reinforcement Learning

The article introduces SRPO, a two‑stage history‑resampling reinforcement‑learning framework that systematically tackles common GRPO training issues and achieves state‑of‑the‑art performance on both math and code benchmarks with far fewer training steps, while also revealing emergent self‑reflection behaviors in large language models.

LLM optimizationSRPOcross-domain training
0 likes · 12 min read
Two‑Stage History‑Resampling Policy Optimization (SRPO) for Large‑Scale LLM Reinforcement Learning
AI Frontier Lectures
AI Frontier Lectures
Apr 24, 2025 · Artificial Intelligence

How d1 Boosts Reasoning in Diffusion LLMs with Reinforcement Learning

Researchers from UCLA and Meta AI introduce d1, a two‑stage post‑training framework that combines supervised fine‑tuning and a novel diffu‑GRPO reinforcement‑learning algorithm to enable efficient reasoning in masked diffusion large language models, achieving state‑of‑the‑art performance on multiple math and logic benchmarks.

AId1diffu-GRPO
0 likes · 9 min read
How d1 Boosts Reasoning in Diffusion LLMs with Reinforcement Learning
AI Frontier Lectures
AI Frontier Lectures
Apr 24, 2025 · Artificial Intelligence

Why AI’s Second Half Is About Products, Not Just Models – A Deep Dive

The article argues that AI is entering a new phase where defining real‑world tasks and robust evaluation outweigh pure model improvements, highlighting the rise of reasoning‑augmented reinforcement learning, the need for product‑oriented thinking, and the shortcomings of current i.i.d. benchmark practices.

AI trendsIndustry Insightproduct focus
0 likes · 9 min read
Why AI’s Second Half Is About Products, Not Just Models – A Deep Dive
AntTech
AntTech
Apr 21, 2025 · Artificial Intelligence

InclusionAI Community to Present AReaL Reinforcement Learning Framework and AWorld Multi‑Agent Framework at ICLR 2025

The InclusionAI open‑source community, initiated by Ant Group, will showcase the latest advances of its reinforcement‑learning framework AReaL and multi‑agent framework AWorld at the ICLR 2025 conference in Singapore, highlighting performance breakthroughs, open‑source contributions, and industry‑focused AI research.

AReaLAWorldAnt Group
0 likes · 5 min read
InclusionAI Community to Present AReaL Reinforcement Learning Framework and AWorld Multi‑Agent Framework at ICLR 2025
AI Algorithm Path
AI Algorithm Path
Apr 20, 2025 · Artificial Intelligence

Boosting Visual Reasoning in VLMs with Reinforcement Learning

The article analyzes how reinforcement learning, which transformed LLM reasoning in DeepSeek, can be applied to visual‑language models to overcome the limitations of traditional chain‑of‑thought prompting and supervised fine‑tuning, presenting concrete reward designs, training pipelines, and a critical assessment of their strengths and weaknesses.

LLMRL trainingchain-of-thought
0 likes · 10 min read
Boosting Visual Reasoning in VLMs with Reinforcement Learning
Fighter's World
Fighter's World
Apr 18, 2025 · Artificial Intelligence

Rethinking the AGI Roadmap: From Data Imitation to Experience‑Driven Superiority

The article analyzes the emerging "Era of Experience" in AI, arguing that reliance on static human data limits progress and that reinforcement learning‑based experiential learning—exemplified by AlphaZero—offers a path toward surpassing human knowledge, while outlining the technical, safety, and ethical challenges ahead.

AGIAlphaZeroArtificial Intelligence
0 likes · 19 min read
Rethinking the AGI Roadmap: From Data Imitation to Experience‑Driven Superiority
AI Frontier Lectures
AI Frontier Lectures
Apr 17, 2025 · Artificial Intelligence

Why Reinforcement Learning Fails to Boost Small LLM Reasoning: A Deep Dive

This article analyzes a recent study on language‑model reasoning, revealing that reinforcement learning often brings little or no improvement, while evaluation variance caused by seeds, hardware, and decoding settings can dramatically affect benchmark results, and supervised fine‑tuning emerges as a more reliable path.

LLMReproducibilityreinforcement learning
0 likes · 12 min read
Why Reinforcement Learning Fails to Boost Small LLM Reasoning: A Deep Dive
Data Thinking Notes
Data Thinking Notes
Apr 15, 2025 · Artificial Intelligence

Understanding AI Agents: From Reinforcement Learning to LLM-Powered Planning

Professor Li Hongyi’s lecture provides a comprehensive, step‑by‑step exploration of AI agents, covering their definitions, reinforcement‑learning roots, LLM integration, memory mechanisms, tool usage, planning strategies, benchmarks, and practical examples, offering a valuable resource for anyone studying modern artificial intelligence.

AI agentsMemoryPlanning
0 likes · 67 min read
Understanding AI Agents: From Reinforcement Learning to LLM-Powered Planning
Volcano Engine Developer Services
Volcano Engine Developer Services
Apr 14, 2025 · Artificial Intelligence

Introducing Multi‑SWE‑bench: The First Multilingual Code‑Fix Benchmark for LLMs

ByteDance’s Doubao model team has open‑sourced Multi‑SWE‑bench, a multilingual benchmark covering seven major programming languages with 1,632 real‑world bug‑fix tasks, complete Docker environments, difficulty grading, and strict human validation, aiming to evaluate and advance large‑language‑model code‑repair capabilities beyond Python.

LLM Benchmarkcode repairdataset
0 likes · 11 min read
Introducing Multi‑SWE‑bench: The First Multilingual Code‑Fix Benchmark for LLMs
AI Algorithm Path
AI Algorithm Path
Apr 13, 2025 · Artificial Intelligence

Understanding GRPO: Group Relative Policy Optimization for LLM Training

The article explains GRPO, a reinforcement‑learning algorithm that extends PPO with group sampling, no value network, dual penalties and KL regularisation, showing how it improves efficiency and stability when fine‑tuning large language models such as DeepSeek‑Math and DeepSeek‑R1.

DeepSeekGRPOPPO
0 likes · 6 min read
Understanding GRPO: Group Relative Policy Optimization for LLM Training
Network Intelligence Research Center (NIRC)
Network Intelligence Research Center (NIRC)
Apr 9, 2025 · Artificial Intelligence

Why Scaling Laws Fail for Video MLLMs: Uncovering the Temporal Hacking Problem

The article analyzes the anti‑scaling phenomenon in video large‑language models, identifies a “temporal hacking” shortcut where models focus on a few key frames, formalizes it via reward‑hacking theory, introduces the Temporal Perplexity (TPL) metric, and proposes an Unhackable Temporal Rewarding (UTR) framework to mitigate the issue.

Scaling LawTemporal PerplexityUTR
0 likes · 14 min read
Why Scaling Laws Fail for Video MLLMs: Uncovering the Temporal Hacking Problem
AI Algorithm Path
AI Algorithm Path
Apr 2, 2025 · Artificial Intelligence

Vision‑Reasoning Model: Enabling LLMs to See and Think

The article analyzes the limitations of current visual language models and large reasoning models, proposes a combined Vision‑Reasoning Model (VRM), details its architecture using LLaVA, describes end‑to‑end fine‑tuning and reinforcement‑learning reward design, and argues that such models will become the next breakthrough in AI.

DeepSeekLLaVALarge Language Model
0 likes · 9 min read
Vision‑Reasoning Model: Enabling LLMs to See and Think
Data Thinking Notes
Data Thinking Notes
Mar 30, 2025 · Artificial Intelligence

How DeepSeek‑R1 and Kimi‑K1.5 Push the Boundaries of Strong Reasoning Models

This comprehensive analysis by the Peking University AI Alignment team dissects the technical innovations behind DeepSeek‑R1, DeepSeek‑R1 Zero, and Kimi‑K1.5, covering reinforcement‑learning‑based post‑training, rule‑based rewards, GRPO optimization, scaling laws, multimodal extensions, safety challenges, and future research directions.

AI alignmentDeepSeekKimi
0 likes · 57 min read
How DeepSeek‑R1 and Kimi‑K1.5 Push the Boundaries of Strong Reasoning Models
Fighter's World
Fighter's World
Mar 29, 2025 · Industry Insights

A Year in AI: Key Insights from the Unsupervised Learning & Latent Space Podcast

The podcast recap dissects a year of rapid AI change, highlighting surprise‑fast open‑source model releases, shifting foundation‑model dynamics, the rise of GPT wrappers, over‑hyped agents, undervalued memory, product‑market fit debates, infrastructure opportunities, and lingering mysteries like RL in non‑verifiable domains.

AI infrastructureAI trendsGPT wrappers
0 likes · 22 min read
A Year in AI: Key Insights from the Unsupervised Learning & Latent Space Podcast
JavaEdge
JavaEdge
Mar 27, 2025 · Artificial Intelligence

Can a Single LLM Both See and Reason? Exploring Visual Reasoning Models (VRM)

This article examines the limitations of current vision‑language and reasoning models, proposes a visual reasoning model (VRM) that can process images and perform deep logical inference, and discusses architecture, training methods, reinforcement‑learning reward designs, and practical challenges.

Artificial IntelligenceLLMVision-Language Model
0 likes · 8 min read
Can a Single LLM Both See and Reason? Exploring Visual Reasoning Models (VRM)
JD Tech
JD Tech
Mar 26, 2025 · Artificial Intelligence

CTR-Driven Advertising Image Generation Using Multimodal Large Language Models (CAIG)

The JD advertising team proposes a CTR‑driven advertising image generation framework (CAIG) that leverages multimodal large language models, a novel reward model, and product‑centric preference optimization to produce ad images with superior click‑through performance, validated by extensive offline and online experiments.

CTR optimizationReward Modeladvertising image generation
0 likes · 10 min read
CTR-Driven Advertising Image Generation Using Multimodal Large Language Models (CAIG)
AI Frontier Lectures
AI Frontier Lectures
Mar 24, 2025 · Artificial Intelligence

What Can AI Agents Learn from the Latest AIR 2025 Research?

The article compiles insights from the AIR 2025 conference and related talks, covering the evolution of agents from reinforcement‑learning to LLM‑driven systems, novel agent architectures like AIDE, GUI agents, natural‑language reinforcement learning, and scaling advances in large language models such as Qwen, while highlighting key algorithms, benchmarks, and open research questions.

AI agentsAgent ArchitectureGUI agents
0 likes · 27 min read
What Can AI Agents Learn from the Latest AIR 2025 Research?
JD Tech Talk
JD Tech Talk
Mar 24, 2025 · Artificial Intelligence

MaRCA: Multi‑Agent Reinforcement Learning Computation Allocation for Full‑Chain Ad Serving

This article presents MaRCA, a multi‑agent reinforcement learning framework that allocates computation resources across the full ad‑serving chain by modeling user value, compute consumption, and action rewards, enabling fine‑grained power‑tilting toward high‑quality traffic and achieving significant business gains under strict latency constraints.

ad servingai-optimizationcomputation allocation
0 likes · 16 min read
MaRCA: Multi‑Agent Reinforcement Learning Computation Allocation for Full‑Chain Ad Serving
Architect
Architect
Mar 23, 2025 · Artificial Intelligence

The Future of AI Agents: From Prompt‑Driven Workflows to Model‑as‑Product and Reinforcement‑Learning‑Powered Agents

The article argues that the next wave of AI agents will shift from brittle, prompt‑driven workflows like Manus to truly autonomous, model‑centric agents trained with reinforcement learning and reasoning, exemplified by OpenAI's DeepResearch and Anthropic's Claude Sonnet 3.7, while the API‑driven market model collapses.

AI agentsClaudeDeepResearch
0 likes · 28 min read
The Future of AI Agents: From Prompt‑Driven Workflows to Model‑as‑Product and Reinforcement‑Learning‑Powered Agents
Baobao Algorithm Notes
Baobao Algorithm Notes
Mar 20, 2025 · Artificial Intelligence

Unlocking Large‑Scale Deep Reinforcement Learning: PPO, GAE, and PPG Deep Dive

This comprehensive guide examines large‑scale deep reinforcement learning, detailing policy‑gradient fundamentals, the mathematics of PPO and GAE, practical implementation tricks, reward and observation normalization, network initialization, and the newer Phasic Policy Gradient method, all supported by code snippets and key research references.

Algorithm OptimizationDeep RLGAE
0 likes · 19 min read
Unlocking Large‑Scale Deep Reinforcement Learning: PPO, GAE, and PPG Deep Dive