Tagged articles

690 articles

Aug 19, 2025 · Artificial Intelligence

How Klear-Reasoner Achieves SOTA Math & Code Reasoning with GPPO

Klear-Reasoner, built on Qwen3‑8B‑Base, introduces the Gradient‑Preserving Clipping Policy Optimization (GPPO) algorithm to overcome traditional clip limitations, achieving state‑of‑the‑art performance on AIME2024/2025 and LiveCodeBench while providing detailed experimental analysis and data‑quality insights.

GPPO · code reasoning · gradient clipping

0 likes · 11 min read

AntTech

Aug 19, 2025 · Artificial Intelligence

How UI‑Venus Achieves SOTA in Multimodal GUI Agent Benchmarks

Ant Group's open‑source native GUI agent UI‑Venus leverages multimodal large‑model and reinforcement‑learning techniques to outperform prior models on grounding and navigation benchmarks, while using a high‑quality data pipeline and a self‑evolving alignment mechanism to push the limits of GUI automation.

GUI Agent · SOTA · benchmark

0 likes · 7 min read

AI Info Trend

Aug 19, 2025 · Industry Insights

What’s Driving the AI Revolution in 2025? Key Trends and Insights

The 2025 H1 AI Core Achievements and Trends report reveals how agents are reshaping productivity, models are gaining inference power and becoming smaller, reinforcement learning is overtaking pre‑training, and industry competition is intensifying, with China and the US narrowing their technology gap.

AI · China · Industry Insights

0 likes · 10 min read

Kuaishou Tech

Aug 18, 2025 · Artificial Intelligence

How Klear-Reasoner Achieves SOTA Math & Code Reasoning with GPPO Optimization

The Klear‑Reasoner model, built on Qwen3‑8B‑Base and powered by the novel Gradient‑Preserving Clipping Policy Optimization (GPPO) algorithm, surpasses same‑size open‑source baselines on challenging math (AIME) and code (LiveCodeBench) benchmarks, while revealing key insights on data quality, reward design, and clipping strategies for large‑language‑model reasoning.

GPPO · LLM · code reasoning

0 likes · 11 min read

Baobao Algorithm Notes

Aug 15, 2025 · Artificial Intelligence

Unlocking LLM Performance: Classic Deep RL Tricks Reimagined for Modern Training

This article systematically adapts classic deep reinforcement‑learning techniques—such as multi‑step returns, TD(λ)/GAE, V‑trace corrections, uncertainty‑aware weighting, safety constraints, distributionally robust optimization, and value‑guided decoding—to improve large language model training and inference, providing concrete formulas, implementation tips, and empirical results.

Deep RL · GAE · LLM

0 likes · 17 min read

Baobao Algorithm Notes

Aug 14, 2025 · Artificial Intelligence

Why Standard SFT Fails to Generalize and How One‑Line Dynamic Fine‑Tuning Fixes It

The article analyzes the poor generalization of supervised fine‑tuning (SFT) for large language models, reveals its gradient as a high‑variance inverse‑probability policy gradient, proposes a one‑line Dynamic Fine‑Tuning correction, and shows substantial gains on challenging math and offline RL benchmarks.

Dynamic Fine-Tuning · Generalization · LLM alignment

0 likes · 7 min read

AIWalker

Aug 13, 2025 · Artificial Intelligence

Look-Back Triggers Visual Reflection in Qwen-2.5-VL, +6.3% Perception

Look-Back is an implicit training paradigm that enables the Qwen‑2.5‑VL‑7B multimodal LLM to autonomously re‑focus on visual inputs during reasoning, achieving a 6.3% boost on perception tasks and outperforming prior baselines while requiring no extra image tokens or model architecture changes.

Look-Back · Qwen-2.5-VL · implicit training

0 likes · 26 min read

Alibaba Cloud Big Data AI Platform

Aug 8, 2025 · Artificial Intelligence

Reproducing the GSPO Reinforcement Learning Algorithm on Alibaba PAI: A Step‑by‑Step Guide

This article introduces the GSPO (Group Sequence Policy Optimization) reinforcement learning algorithm, explains its advantages over GRPO, and provides a detailed, end‑to‑end tutorial for reproducing GSPO training on Alibaba Cloud's PAI platform using the PAI‑ChatLearn framework.

ChatLearn · GSPO · PAI

0 likes · 8 min read

Kuaishou Tech

Aug 6, 2025 · Artificial Intelligence

How Supervised Learning‑Enhanced Multi‑Group Actor‑Critic Boosts Live Stream Allocation in Short‑Video Feeds

This article presents the SL‑MGAC framework, a supervised‑learning‑enhanced multi‑group Actor‑Critic algorithm that improves live‑stream insertion decisions in mixed short‑video and live‑stream recommendation systems, achieving higher stability and better long‑term user engagement while satisfying platform constraints, as validated by extensive offline and online experiments.

KDD 2025 · actor-critic · live stream recommendation

0 likes · 9 min read

AIWalker

Aug 5, 2025 · Artificial Intelligence

Perception‑R1: RL Gives Visual Insight Without Chain‑of‑Thought, Beats Four Tasks

The paper introduces Perception‑R1, a rule‑based reinforcement‑learning framework that trains multimodal large language models for visual perception tasks without relying on chain‑of‑thought reasoning, and demonstrates up to 17.9% performance gains on RefCOCO+, PixMo‑Count, PageOCR and COCO2017, while analyzing the key roles of perception confusion and reward design.

RLHF · benchmark · multimodal LLM

0 likes · 24 min read

AI Info Trend

Aug 4, 2025 · Industry Insights

How AI Agents and Small Models Are Redefining Productivity in 2025 H1

The report analyzes first‑half‑2025 AI breakthroughs, covering the rise of general‑purpose agents, rapid inference improvements, small‑model proliferation, reinforcement‑learning compute dominance, evolving transformer architectures, and shifting industry dynamics, offering actionable insights for researchers, product leaders, and decision‑makers.

AI · Agent · Large Language Model

0 likes · 9 min read

AIWalker

Aug 4, 2025 · Artificial Intelligence

Introducing CAIG: CTR‑Driven Advertising Image Generation with Open‑Source Code

CAIG leverages a multimodal large language model, a novel reward model, and product‑centered preference optimization to generate ad images that maximize click‑through rate, achieving state‑of‑the‑art performance in both online and offline evaluations.

CTR · Open Source · ad image generation

0 likes · 7 min read

JD Tech

Jul 29, 2025 · Artificial Intelligence

How Causal Inference Meets Large Language Models to Revolutionize E‑commerce Pricing

This article describes a QCon talk that combines causal inference with large language models to build a retrieval‑augmented generation pricing system for e‑commerce, detailing the three‑step algorithm, LLM‑driven modeling challenges, process‑reward tree search, reinforcement‑learning fine‑tuning, and experimental gains in accuracy and speed.

Retrieval-Augmented Generation · causal inference · e‑commerce pricing

0 likes · 17 min read

AI Algorithm Path

Jul 27, 2025 · Artificial Intelligence

Understanding RLHF: How Human Feedback Trains Modern LLMs

This article explains the RLHF (Reinforcement Learning from Human Feedback) pipeline that powers ChatGPT and other large language models, covering the limitations of traditional fine‑tuning, the creation of human‑feedback datasets, reward‑model training, loss design, and the final PPO‑based fine‑tuning step.

ChatGPT · Human Feedback · PPO

0 likes · 8 min read

AI2ML AI to Machine Learning

Jul 24, 2025 · Artificial Intelligence

Exploring Recent Large‑Model Agent Papers: Insights and Analyses

This article reviews a series of recent research papers on large‑model agents, covering topics such as reinforcement‑learning‑driven ML agents, premise‑critique ability of LLMs, long‑term tool‑augmented LLM evaluation, agentic RAG, set‑based retrieval for multi‑hop QA, mobile VLM agents, and broader surveys of LLM applications, summarizing each work’s problem statement, prior approaches, novel contributions, experimental results, limitations, and future directions.

LLM evaluation · Retrieval-Augmented Generation · agentic AI

0 likes · 46 min read

Fun with Large Models

Jul 23, 2025 · Artificial Intelligence

Why ChatGPT Agent Sets the Benchmark for Future Large‑Model AI Agents

The article analyzes OpenAI's ChatGPT Agent—its launch, performance metrics, all‑in‑one tool integration, real‑world use cases, pricing tiers, core capabilities, and how it surpasses competing agents like Manus, highlighting its significance for the next generation of AI agents.

AI agent · ChatGPT Agent · Use Cases

0 likes · 11 min read

JD Tech Talk

Jul 23, 2025 · Artificial Intelligence

Causal Inference + LLMs: Transforming E‑Commerce Pricing Strategies

This article describes how integrating causal inference with large language models and Retrieval‑Augmented Generation can automate and explain e‑commerce product pricing, detailing the three‑step workflow, reinforcement‑learning rewards, experimental results, and future directions for end‑to‑end RAG‑LLM training.

RAG · causal inference · e‑commerce pricing

0 likes · 15 min read

JD Cloud Developers

Jul 23, 2025 · Artificial Intelligence

How Causal Inference Meets Large Language Models to Revolutionize E‑commerce Pricing

At QCon 2025, the author presented a novel approach that integrates causal inference with large language models using Retrieval‑Augmented Generation, process rewards, and tree‑search to generate explainable, accurate e‑commerce pricing recommendations, dramatically improving accuracy from 44% to 74% while cutting inference time to seconds.

causal inference · e‑commerce pricing · reinforcement learning

0 likes · 14 min read

DataFunTalk

Jul 23, 2025 · Artificial Intelligence

Qwen3‑Coder: Open‑Source AI Programming Agent That Beats the Competition

Alibaba’s Tongyi team unveiled the open‑source Qwen3‑Coder, a 480‑billion‑parameter programming model that outperforms leading closed‑source solutions, supports a context of up to 1M tokens, offers a free CLI tool, and demonstrates impressive code generation capabilities across animations, games, and real‑world tasks.

AI programming · Large Language Model · Open Source

0 likes · 5 min read

Kuaishou Tech

Jul 21, 2025 · Artificial Intelligence

Can AI Models Think on Demand? Inside KAT‑V1 AutoThink’s Dynamic Reasoning

The article introduces KAT‑V1 AutoThink, a dual‑mode large language model that automatically switches between thinking and non‑thinking modes based on problem difficulty, details its novel training paradigm, reinforcement‑learning enhancements, performance benchmarks against leading open‑source models, and provides open‑source resources for further research.

Knowledge Distillation · Large Language Model · auto-think

0 likes · 14 min read

JD Retail Technology

Jul 21, 2025 · Artificial Intelligence

How Causal Inference Meets Large Language Models to Revolutionize E‑commerce Pricing

This article presents a comprehensive approach that combines causal inference, large language models, and retrieval‑augmented generation to automate e‑commerce price recommendation, detailing the three‑step workflow, challenges across product categories, the RAG architecture, process‑reward‑guided tree search, reinforcement learning refinements, and experimental results showing significant accuracy and speed improvements.

causal inference · chain-of-thought · e‑commerce pricing

0 likes · 16 min read

Alimama Tech

Jul 17, 2025 · Artificial Intelligence

How to Build a High‑Scoring AI Werewolf Agent: Strategies, Prompt Engineering, and Code

This article details the author's experience designing a top‑performing AI Werewolf agent for the Taotian Group's AI Werewolf Challenge, covering game rules, core challenges, prompt engineering, caching, concurrent requests, model selection, reinforcement‑learning‑style tuning, and tactical strategies for each role, with code examples.

AI agent · LLM · Werewolf

0 likes · 25 min read

DataFunTalk

Jul 16, 2025 · Artificial Intelligence

MiniMax-M1 Revealed: Hybrid Attention, RL Training, and 1M Token Context

MiniMax’s latest M1 model, unveiled after a $300 million funding round, features a 456‑billion‑parameter hybrid mixture‑of‑experts architecture with lightning attention, supports contexts of up to one million tokens, and uses reinforcement learning to improve long‑context handling, inference efficiency, and System 2 reasoning.

AI scaling · Hybrid Attention · large language models

0 likes · 16 min read

AI Algorithm Path

Jul 14, 2025 · Artificial Intelligence

The Most Powerful Open‑Source Agent Model: Kimi K2

Kimi K2, an open‑source trillion‑parameter AI model released by Moonshot AI, offers Base and Instruct variants, achieves leading scores on benchmarks such as SWE‑bench, LiveCodeBench and AceBench, and introduces a novel post‑training autonomous‑exploration stage with MuonClip optimization to enable robust tool use and reinforcement‑learning‑driven self‑improvement.

Benchmark performance · Kimi K2 · Large Language Model

0 likes · 8 min read

AI Frontier Lectures

Jul 14, 2025 · Artificial Intelligence

Can Language Models Self‑Edit? Inside SEAL’s Self‑Adapting LLM Framework

The article surveys recent AI self‑evolution research, highlights the SEAL self‑adapting language model framework, explains its reinforcement‑learning based self‑editing mechanism, and presents experimental results on few‑shot learning and knowledge integration, while noting limitations and providing links to the paper and code.

AI self-improvement · Meta Learning · SEAL

0 likes · 12 min read

Python Programming Learning Circle

Jul 10, 2025 · Artificial Intelligence

Build a DQN Autonomous Driving Agent with gym and highway‑env

This tutorial walks through installing gym and highway‑env, configuring six driving scenarios, processing observations (kinematics, images, occupancy grids), defining actions and rewards, constructing a DQN network, training it with a reinforcement‑learning loop, and analyzing collision, time, and reward metrics.

Autonomous Driving · DQN · gym

0 likes · 10 min read

Data Thinking Notes

Jul 8, 2025 · Artificial Intelligence

How Xiaohongshu Leverages Large Models to Revolutionize Content Recommendation

This article details Xiaohongshu's multi‑stage recommendation pipeline—using massive multi‑modal pre‑training, long‑sequence modeling, real‑time context features, reinforcement learning and online deep learning—to precisely surface valuable content, address cold‑start challenges, and break information bubbles for billions of users.

Large Language Model · Multimodal Learning · online deep learning

0 likes · 16 min read

DataFunSummit

Jul 5, 2025 · Artificial Intelligence

Boosting Large Model Training: Optimizing Performance with the Verl Framework

Join the DataFun Summit 2025 on July 12 to hear Tencent FinTech senior researcher Gong Dihong discuss how redesigning the Verl training system, integrating Megatron and SGLang, and applying new synchronization and offloading techniques dramatically speeds up large‑model reinforcement‑learning training.

AI Performance · Large Models · Megatron

0 likes · 4 min read

AI Frontier Lectures

Jul 2, 2025 · Artificial Intelligence

Can Language Models Self‑Edit? Inside the SEAL Framework for Self‑Adapting LLMs

This article reviews recent AI self‑evolution research and provides an in‑depth analysis of the SEAL (Self‑Adapting Language) framework, which enables large language models to generate and learn from their own synthetic data through a nested reinforcement‑learning and fine‑tuning loop, with experimental results on few‑shot and knowledge‑integration tasks.

Meta Learning · SEAL · few-shot learning

0 likes · 11 min read

DataFunTalk

Jul 2, 2025 · Artificial Intelligence

How GLM-4.1V-Thinking Sets New Standards in Multimodal AI Reasoning

Zhipu AI unveiled the GLM-4.1V-Thinking series, an open‑source multimodal model that outperforms larger rivals on visual‑language tasks, supports video analysis, GUI agents, and advanced scientific reasoning, while introducing a curriculum‑sampling reinforcement‑learning framework and a new Agent application platform.

AI agents · GLM-4.1V · Open Source

0 likes · 10 min read

Baobao Algorithm Notes

Jun 30, 2025 · Artificial Intelligence

How End‑to‑End Reinforcement Learning Powers the Kimi‑Researcher AI Agent

The article examines Kimi‑Researcher, an AI research agent built with end‑to‑end reinforcement learning, detailing its technical motivations, advantages over traditional workflow‑based and SFT methods, performance breakthroughs on benchmark exams, and diverse real‑world use cases ranging from literature reviews to legal analysis.

AI agent · Benchmark performance · End-to-End RL

0 likes · 10 min read

Fighter's World

Jun 28, 2025 · Artificial Intelligence

What Is the Generator‑Verifier Gap and Why It Matters for LLM Reasoning

The article explains the Generator‑Verifier Gap (GVG)—the asymmetry where verifying a solution is far cheaper than generating it—covers its origin, its impact on test‑time scaling for large language models, reinforcement‑learning approaches, and how the concept can shape agent architectures and AI product strategy.

Agent Architecture · Generator-Verifier Gap · LLM

0 likes · 21 min read

Alimama Tech

Jun 25, 2025 · Artificial Intelligence

Introducing ROLL: A Scalable, User‑Friendly RL Framework for Large‑Scale LLM Training

ROLL is an open‑source reinforcement‑learning framework designed for large language model post‑training that combines multi‑task RL, agentic support, flexible algorithm configuration, elastic resource scheduling, and rich observability, delivering significant accuracy gains across benchmarks while remaining easy to use for researchers, product developers, and infrastructure engineers.

AI Framework · Open Source · RLHF

0 likes · 11 min read

DataFunTalk

Jun 21, 2025 · Artificial Intelligence

Why AI Gets Overconfident: Bias, Hallucinations, and Reinforcement Learning Solutions

This talk explores how large AI models become overconfident, leading to bias and hallucinations, examines adversarial examples in vision and language, explains why data and algorithms cause these issues, and shows how reinforcement learning can teach models to admit uncertainty and align with human values.

AI alignment · AI safety · Bias

0 likes · 19 min read

Kuaishou Large Model

Jun 20, 2025 · Artificial Intelligence

How OneRec Revolutionizes Short-Video Recommendations with End-to-End Generative AI

OneRec, an end-to-end generative recommendation system from Kuaishou, uses an encoder-decoder architecture, reward-based preference alignment, and reinforcement learning to dramatically improve video recommendation efficiency, boosting user engagement and reducing operational costs while achieving scaling-law performance comparable to large language models.

Efficiency · Kuaishou · Large Models

0 likes · 18 min read

Kuaishou Tech

Jun 20, 2025 · Artificial Intelligence

How OneRec Redefines Recommendation with End‑to‑End Generative Modeling and RL Alignment

The OneRec system from Kuaishou replaces traditional cascade recommendation pipelines with an encoder‑decoder architecture, leverages reward‑based preference alignment via reinforcement learning, achieves ten‑fold FLOPs gains, cuts operational costs by 90%, and delivers significant user‑engagement improvements across short‑video and local‑service scenarios.

Generative Modeling · Kuaishou · OneRec

0 likes · 17 min read

Xiaohongshu Tech REDtech

Jun 19, 2025 · Artificial Intelligence

Can Adaptive Chain‑of‑Thought Learning Halve LLM Thinking Time?

The article introduces the Think When You Need (TWYN) method, a reinforcement‑learning approach that dynamically adapts chain‑of‑thought length, dramatically cuts redundant token generation in large language models, and maintains or improves accuracy across diverse reasoning benchmarks.

Efficiency · adaptive inference · chain-of-thought

0 likes · 9 min read

DataFunTalk

Jun 17, 2025 · Artificial Intelligence

Kimi-Dev-72B Sets New Open‑Source SOTA on SWE‑bench Verified (60.4% Score)

Kimi-Dev-72B, an open-source 72-billion-parameter code model from Moonshot AI, scored a record 60.4% on the SWE-bench Verified benchmark, surpassing larger models. It combines BugFixer and TestWriter dual roles, extensive mid-stage training on billions of tokens of GitHub data, and reinforcement-learning-driven self-play; the code is available on Hugging Face and GitHub.

AI · SWE-bench · open-source

0 likes · 7 min read

Fighter's World

Jun 14, 2025 · Artificial Intelligence

How Can LLMs Learn to “Think” in Complex Industry Scenarios?

The article analyzes how large language models can acquire true reasoning abilities for hard‑to‑score industry tasks by combining Chain‑of‑Thought prompting with reinforcement learning, addressing vague reward signals, reward hacking, and faithfulness, and proposing a toolbox of reward engineering, synthetic data, hierarchical RL, and multi‑agent collaboration.

LLM · Reward Modeling · chain-of-thought

0 likes · 22 min read

Fun with Large Models

Jun 12, 2025 · Artificial Intelligence

Implement GRPO to Give LLMs Reasoning Ability with Qwen2.5‑0.5B

This article explains the GRPO reinforcement‑learning algorithm, shows its core idea of internal group competition without a separate evaluator model, and provides a complete, step‑by‑step code walkthrough—including environment setup, dataset preparation, reward‑function design, training configuration, and evaluation—using the Qwen2.5‑0.5B‑Instruct model on the GSM8K math dataset.

GRPO · GSM8K · Qwen2.5

0 likes · 23 min read

Kuaishou Tech

Jun 4, 2025 · Artificial Intelligence

KwaiCoder-AutoThink-preview: An Automatic‑Thinking Large Model Enhanced with Step‑SRPO Reinforcement Learning

The KwaiPilot team released the KwaiCoder‑AutoThink‑preview model, which introduces a novel automatic‑thinking training paradigm and a process‑supervised reinforcement‑learning method called Step‑SRPO, enabling the model to dynamically switch between thinking and non‑thinking modes, reduce inference cost, and achieve up to 20‑point gains on code and math benchmarks while handling large‑scale codebases.

AI research · Large Language Model · automatic thinking

0 likes · 12 min read

Baobao Algorithm Notes

Jun 3, 2025 · Artificial Intelligence

Can 1K Fine‑Tuning Replace 100K RL Steps? Insights from Re‑distillation Research

An extensive analysis shows that a 1K‑sample fine‑tuning stage can replicate the generalization gains of thousands of reinforcement‑learning steps, explains the compressibility of RL, introduces a sample‑effect theory, and demonstrates that re‑distillation and small‑scale SFT dramatically improve LLM performance.

Re-distillation · Sample Effect · large language models

0 likes · 23 min read

AI Frontier Lectures

May 31, 2025 · Artificial Intelligence

Why Embodied Intelligence Is Exploding and What It Means for the Future

The article analyzes the recent surge in embodied intelligence, examines why physical agents matter despite advances in large language models, outlines common failure modes, discusses key research decisions such as 2D versus 3D perception and tactile sensing, and explores the roles of imitation learning, VLA, and reinforcement learning in shaping the field.

VLA · Vision · imitation learning

0 likes · 24 min read

AI Frontier Lectures

May 30, 2025 · Artificial Intelligence

Can Diffusion Chains Unlock More Creative Reasoning in Large Language Models?

Recent work from Westlake University's MAPLE Lab introduces a diffusion‑based “Divergent Thought Chain” that treats each intermediate denoising step of a diffusion language model as a reasoning step, using outcome‑based reinforcement learning to optimize non‑linear token generation and achieving state‑of‑the‑art performance on math and code tasks.

chain-of-thought · code generation · diffusion language models

0 likes · 14 min read

Alibaba Cloud Developer

May 28, 2025 · Artificial Intelligence

Unlocking LLM Fine‑Tuning: From Architecture to LoRA, DPO and Deployment

This article provides a comprehensive guide to large language model fine‑tuning, covering model architecture, parameter and memory calculations, prompt engineering, data construction, LoRA and PEFT techniques, reinforcement learning methods such as DPO, and practical deployment workflows on internal platforms.

Fine‑Tuning · LLM · LoRA

0 likes · 21 min read

JD Cloud Developers

May 27, 2025 · Artificial Intelligence

How JD’s Young AI Engineers Tackle Real-World Model Challenges

Young JD algorithm engineers share how they solve tough AI problems—from optimizing large‑model training and reward‑model design for ad image generation, to building LLM‑based query expansion, agent evaluation, and model pruning with FFT and RDP—illustrating practical breakthroughs and personal growth in cutting‑edge AI research.

AI · Model Pruning · Reward Modeling

0 likes · 15 min read

AI Algorithm Path

May 27, 2025 · Artificial Intelligence

Reinforcement Learning Tutorial 8: Building State Feature Representations for Objective Optimization

This tutorial explains how to construct state feature vectors for reinforcement‑learning value‑function approximation, covering linear, polynomial, Fourier, and radial‑basis representations, as well as state aggregation techniques such as coarse coding and tile coding, and discusses non‑parametric approaches like kernel methods.

feature engineering · fourier basis · function approximation

0 likes · 16 min read

AIWalker

May 26, 2025 · Artificial Intelligence

VisionReasoner: RL‑Unified Model Beats YOLO‑World Detection, Segmentation, Counting

VisionReasoner presents a reinforcement‑learning‑driven unified framework that simultaneously tackles detection, segmentation, and counting tasks, employing a novel multi‑target cognition strategy and efficient Hungarian‑based matching, and demonstrates substantial gains—29.1% on COCO detection, 22.1% on ReasonSeg, and 15.3% on CountBench—using only 7,000 training samples.

Segmentation · VisionReasoner · counting

0 likes · 20 min read

JD Tech

May 26, 2025 · Artificial Intelligence

Solving Technical Challenges at JD Retail: Multi‑Reward Models, LLM‑Based Query Expansion, Model Pruning, and Reinforcement Learning

This article details how JD Retail's young algorithm engineers tackled a series of AI engineering problems—including advertising image quality assessment with multi‑reward models, large‑language‑model‑driven query expansion, FFT‑and‑RDP‑based model pruning, and agent‑centric reinforcement learning—while sharing practical growth insights and code snippets.

AI · computer vision · large language models

0 likes · 15 min read

Alibaba Cloud Developer

May 26, 2025 · Artificial Intelligence

How Multi‑Agent Planning Boosts Copilot 3.0 with DeepSeek R1 GRPO Training

This article examines Copilot 3.0’s planning module, explains how DeepSeek R1’s GRPO reinforcement‑learning pipeline enables flexible multi‑agent orchestration, addresses the limitations of Copilot 2.0, and presents experimental results that show a 61% reduction in reasoning length and a 9% relative gain in accuracy.

AI · Planning · model training

0 likes · 14 min read

AI Algorithm Path

May 25, 2025 · Artificial Intelligence

Reinforcement Learning Tutorial 7: Introducing Value Function Approximation Methods

This article explains why tabular reinforcement‑learning methods scale poorly, introduces supervised‑learning‑based value‑function approximation using a parameterized vector w, discusses loss design, stochastic‑gradient updates, bootstrapping, semi‑gradient techniques, and linear function approximation, and summarizes practical implications.

gradient Monte Carlo · linear function approximation · reinforcement learning

0 likes · 13 min read

IT Services Circle

May 25, 2025 · Artificial Intelligence

DeepSeek Core Technologies and Model Innovations: DeepSeek‑V3 and DeepSeek‑R1 Technical Overview

The article provides a detailed technical overview of DeepSeek's flagship large language models, DeepSeek‑V3 and DeepSeek‑R1, describing their MoE architecture, training frameworks, reinforcement‑learning based fine‑tuning, inference optimizations, and the broader impact of these innovations on the AI landscape while also promoting related books and resources.

AI · DeepSeek · Large Language Model

0 likes · 10 min read

AI Algorithm Path

May 24, 2025 · Artificial Intelligence

How N-step Temporal-Difference Methods Extend TD Learning in Reinforcement AI

This tutorial explains how n-step temporal‑difference (TD) algorithms generalize the one‑step TD and Monte‑Carlo methods, presents the n‑step return update rule, walks through a three‑step TD example, shows how Sarsa and Q‑learning can be extended, and discusses how to choose the optimal n value for a given problem.

Monte Carlo · Q-Learning · algorithm analysis

0 likes · 9 min read

AI Algorithm Path

May 23, 2025 · Artificial Intelligence

Understanding Temporal‑Difference Algorithms in Reinforcement Learning

This tutorial explains temporal‑difference (TD) learning, compares it with dynamic programming and Monte‑Carlo methods, walks through concrete soccer‑match examples, shows one‑step TD versus constant‑α Monte‑Carlo updates, discusses convergence and bias, and introduces popular TD variants such as Sarsa, Q‑learning, Expected Sarsa and double learning.

Monte Carlo · Q-Learning · TD learning

0 likes · 18 min read

AI Algorithm Path

May 22, 2025 · Artificial Intelligence

Monte Carlo Policy Improvement in RL: Epsilon‑Greedy, On‑Policy vs Off‑Policy, and Incremental Updates

This tutorial explains how Monte Carlo methods are enhanced in reinforcement learning through epsilon‑greedy and epsilon‑soft policies, Monte Carlo control, a Blackjack Q‑function example, the distinction between on‑policy and off‑policy learning, importance sampling, and efficient incremental update techniques.

Epsilon-Greedy · Importance Sampling · Monte Carlo

0 likes · 14 min read

AIWalker

May 22, 2025 · Artificial Intelligence

VisionReasoner: RL‑Unified System Beats YOLO‑World on Detection, Segmentation, Counting

VisionReasoner introduces a reinforcement‑learning‑driven unified framework that simultaneously handles detection, segmentation, and counting tasks within a single model, achieving 29.1% higher COCO detection AP, 22.1% better ReasonSeg segmentation, and 15.3% improvement on CountBench, while requiring only 7,000 training samples and offering efficient multi‑target matching via batch computation and the Hungarian algorithm.

LVLM · Object Counting · VisionReasoner

0 likes · 19 min read

JD Tech Talk

May 22, 2025 · Artificial Intelligence

From Academic Research to Industrial Anti‑Fraud: Leveraging LLMs, Reinforcement Learning, and Model Distillation for Advertising Risk Detection

The article recounts Xiaoting’s journey from a PhD research background to leading JD.com’s ad‑fraud detection, detailing how large language models, reinforcement learning, and model distillation were applied to identify hidden address codes, reduce false‑positive rates to 0.3%, and balance accuracy with real‑time performance in a high‑traffic e‑commerce environment.

AI · Ad Fraud · Advertising

0 likes · 11 min read

JD Retail Technology

May 22, 2025 · Industry Insights

Cracking Hidden Ad Fraud: JD’s AI‑Driven Anti‑Cheat System Explained

This article recounts the journey of a JD PhD trainee who transformed academic research on anomaly detection into a production‑grade, LLM‑enhanced anti‑fraud system that identifies concealed address codes in CPS ads, detailing model design, LoRA fine‑tuning, reinforcement learning, distillation, cost‑aware deployment, and lessons learned for scalable ad risk management.

Large Language Model · ad fraud detection · industry AI

0 likes · 12 min read

AI Algorithm Path

May 21, 2025 · Artificial Intelligence

Understanding Monte Carlo Algorithms for Reinforcement Learning with a Blackjack Case Study

This article explains Monte Carlo methods for reinforcement learning, compares model‑free and model‑based approaches, details V‑ and Q‑function estimation using a Blackjack example, and discusses exploration‑exploitation trade‑offs and practical advantages of MC algorithms.

Blackjack · Model-free · Monte Carlo

0 likes · 13 min read

AI Algorithm Path

May 19, 2025 · Artificial Intelligence

Understanding Policy Evaluation and Improvement in Reinforcement Learning

This article explains how to solve Bellman equations, use iterative policy‑evaluation methods, apply the policy‑improvement theorem, and combine both steps in policy iteration, value iteration, and asynchronous variants, illustrated with a 5‑state example and a 4×4 gridworld.

Bellman equation · GridWorld · generalized policy iteration

0 likes · 15 min read

Amap Tech

May 19, 2025 · Artificial Intelligence

Group Policy Gradient: Direct Objective Optimization for Faster Reinforcement Learning

The article introduces Group Policy Gradient (GPG), a reinforcement‑learning framework that eliminates surrogate loss functions and critic models, directly optimizes the original objective, reduces bias and variance, and achieves state‑of‑the‑art performance on both single‑modal and multimodal tasks.

AI research · LLM fine-tuning · Policy Gradient

0 likes · 7 min read

AI Algorithm Path

May 18, 2025 · Artificial Intelligence

Reinforcement Learning Tutorial Part 1: Core Concepts Explained

This article introduces the fundamental concepts of reinforcement learning, covering the agent‑environment interaction, key terminology, reward structures, task types, policies, value functions, the Bellman equations, and how optimal strategies are derived and approximated in practice.

Bellman equation · Markov Decision Process · Optimal Policy

0 likes · 13 min read

Kuaishou Tech

May 14, 2025 · Artificial Intelligence

StableReinforce and R1-Reward: Enhancing Multimodal Reward Models with Reinforcement Learning

This article presents StableReinforce and the R1-Reward model, demonstrating how reinforcement learning techniques can stabilize training and significantly improve the performance of multimodal reward models for large language models across several benchmarks.

AI · LLM · R1-Reward

0 likes · 15 min read

Kuaishou Tech

May 13, 2025 · Artificial Intelligence

How KuaiMod Uses Multimodal AI to Revolutionize Short‑Video Content Quality

This article analyzes KuaiMod, a multimodal large‑model solution developed by Kuaishou for short‑video content quality assessment, detailing its benchmark dataset, chain‑of‑thought data construction, offline SFT + DPO training, online reinforcement‑learning updates, evaluation results, and large‑scale deployment impact.

KuaiMod · benchmark · content moderation

0 likes · 19 min read

AI Frontier Lectures

May 13, 2025 · Artificial Intelligence

How T2I‑R1 Boosts Text‑to‑Image Generation with Dual‑Level CoT Reasoning

Recent large language models have shown strong reasoning abilities, and this work extends chain‑of‑thought reasoning to autoregressive image generation by introducing T2I‑R1, a dual‑level (Semantic‑CoT and Token‑CoT) framework trained with reinforcement learning that unifies high‑level planning and low‑level token generation, achieving state‑of‑the‑art results.

generative AI · reinforcement learning · semantic planning

0 likes · 7 min read

AI Frontier Lectures

May 13, 2025 · Artificial Intelligence

How Diffusion Policy is Transforming Vision‑Based Robot Motion Learning

This article provides a comprehensive, step‑by‑step analysis of Diffusion Policy for robot visuomotor control, covering its motivation, task characteristics, model design, dataset preparation, training pipeline, inference procedure, experimental results, and open research questions.

Machine Learning · diffusion models · policy learning

0 likes · 63 min read

Tencent Technical Engineering

May 12, 2025 · Artificial Intelligence

Comprehensive Summary and Expansion of Andrej Karpathy’s 7‑Hour LLM Lecture

This article provides a detailed Chinese‑to‑English summary of Andrej Karpathy’s 7‑hour LLM tutorial, covering chat process analysis, tokenization, pre‑training data pipelines, model architecture, training strategies, post‑training fine‑tuning, reinforcement learning, chain‑of‑thought reasoning, and current industry applications.

AI · LLM · model architecture

0 likes · 25 min read

JD Retail Technology

May 7, 2025 · Artificial Intelligence

Solving Technical Challenges with Large AI Models at JD Retail: Reward Modeling, Query Expansion, and Model Pruning

JD Retail’s engineering team tackles hard AI problems by replacing a monolithic reward model with specialized small models for ad‑image generation, deploying an LLM‑driven query‑expansion pipeline that lifts conversion rates, and pruning text‑to‑image transformers using FFT and RDP to boost throughput by 40% without loss, while also building comprehensive evaluation tools and a semantic smart‑assistant.

AI · Large Models · Model Pruning

0 likes · 14 min read

AIWalker

May 6, 2025 · Artificial Intelligence

SimpleAR: High‑Quality 1024×1024 Images with Just 0.5B Parameters via Pretraining, SFT, and RL

SimpleAR demonstrates that a vanilla autoregressive model with only 0.5 B parameters, trained through pretraining, supervised fine‑tuning, and reinforcement learning, can generate high‑fidelity 1024×1024 images, achieving competitive GenEval (0.59) and DPG‑Bench (79.66) scores while reducing inference time to about 14 seconds with vLLM and KV‑cache optimizations.

Supervised Fine‑Tuning · autoregressive · benchmark

0 likes · 14 min read

DevOps

May 5, 2025 · Artificial Intelligence

DeepSeek Releases Math‑Specialized Large Model V2 and ProverBench Evaluation Suite

DeepSeek has quietly open‑sourced a new mathematics‑focused large language model, DeepSeek‑Prover‑V2 (available in 671B and 7B variants), achieving 88.9% on MiniF2F and strong results on PutnamBench, alongside the high‑quality ProverBench dataset and a novel recursive theorem‑proving pipeline.

AI · DeepSeek · Large Language Model

0 likes · 4 min read

Architect

May 5, 2025 · Artificial Intelligence

How Agentic RAG‑R1 Turns Retrieval‑Augmented Generation into an Autonomous AI Agent

Agentic RAG‑R1, an open‑source project from Peking University, combines Retrieval‑Augmented Generation with an agentic AI loop, adopts the GRPO reinforcement‑learning algorithm, supports LoRA‑based fine‑tuning, quantization and multimodal tool calls, and demonstrates significant accuracy gains on the MedQA benchmark across both Chinese and English test sets.

LLM Tool Use · Open Source · Retrieval-Augmented Generation

0 likes · 8 min read

AI Frontier Lectures

May 5, 2025 · Industry Insights

What Will Large Language Models Look Like in the Next Five Years? A Deep Dive into Trends and Challenges

The article reviews five years of AI model evolution, analyzes current scaling and reinforcement‑learning trends, and forecasts architectural, mathematical, and infrastructure directions for large language models through 2030, highlighting potential breakthroughs and the risks of over‑reliance on benchmarks.

AI trends · Model Scaling · industry analysis

0 likes · 22 min read

AI Algorithm Path

May 3, 2025 · Artificial Intelligence

DeepSeek Prover V2: Pioneering the Next Era of AI‑Driven Formal Math Reasoning

DeepSeek‑Prover‑V2, an open‑source LLM specialized for Lean 4, bridges intuitive high‑level reasoning and strict formal verification through sub‑goal decomposition, dual operation modes, and a novel cold‑start data pipeline, achieving state‑of‑the‑art results on MiniF2F, PutnamBench and CombiBench while highlighting trade‑offs in inference cost and model scalability.

AI mathematics · DeepSeek Prover V2 · LLM

0 likes · 18 min read

Baobao Algorithm Notes

May 2, 2025 · Artificial Intelligence

Do Reinforcement Learning Techniques Really Boost LLM Reasoning? A Deep Dive into Recent Models

This article analyzes whether reinforcement learning enhances large language model reasoning, compares findings from DeepSeek-Math, a Tsinghua‑Shanghai Jiao‑Tong paper, and Qwen3, and outlines practical training pipelines—including Seed‑Thinking‑v1.5, DeepSeek‑R1, Kimi‑K1.5, and Qwen3—that aim to endow LLMs with robust reasoning capabilities.

Artificial Intelligence · LLM · Reasoning

0 likes · 12 min read

Mafengwo Technology

Apr 30, 2025 · Artificial Intelligence

How MaFengWo’s mfw-32B Travel LLM Outperforms DeepSeek‑R1 in Speed and Accuracy

The article details the development, training, and evaluation of MaFengWo's 32‑billion‑parameter travel large language model (mfw‑32B), highlighting its superior itinerary planning, personalized demand capture, budget management, and resource efficiency compared to DeepSeek‑R1, and describing the SFT and reinforcement‑learning stages that enabled these gains.

Large Language Model · LoRA · ai-optimization

0 likes · 14 min read

AIWalker

Apr 28, 2025 · Artificial Intelligence

SimpleAR: Autoregressive Visual Generation at 1024×1024 Using Only 0.5B Parameters

SimpleAR is a minimalist autoregressive visual generation framework that, with only 0.5 B parameters, achieves competitive 1024×1024 image synthesis through a three‑stage pipeline of large‑scale pretraining, supervised fine‑tuning, and GRPO‑based reinforcement learning, and demonstrates significant inference speedups using KV‑cache, vLLM, and speculative decoding.

Inference Acceleration · autoregressive generation · benchmark

0 likes · 14 min read

DataFunTalk

Apr 25, 2025 · Artificial Intelligence

Does Reinforcement Learning Really Expand Reasoning Capacity in Large Language Models? Insights from Recent Empirical Study

Recent empirical research by Tsinghua’s LeapLab and Shanghai Jiao Tong University reveals that reinforcement learning with verifiable rewards (RLVR) improves sampling efficiency but does not extend the fundamental reasoning abilities of large language models beyond those of their base models, as demonstrated across mathematics, code, and visual reasoning benchmarks.

AI research · RLVR · Reasoning

0 likes · 12 min read

AntTech

Apr 24, 2025 · Artificial Intelligence

Key Takeaways from Ant Group and Tsinghua’s Presentations on the AReaL Reinforcement Learning Framework and AWorld Multi‑Agent Framework at ICLR 2025

At ICLR 2025 in Singapore, Ant Group and Tsinghua University showcased the open‑source reinforcement‑learning platform AReaL and the multi‑agent system AWorld, highlighting their recent breakthroughs, system design challenges, performance results on the GAIA benchmark, and upcoming development plans.

AI frameworks · ICLR2025 · Open Source

0 likes · 7 min read

Kuaishou Tech

Apr 24, 2025 · Artificial Intelligence

Two‑Stage History‑Resampling Policy Optimization (SRPO) for Large‑Scale LLM Reinforcement Learning

The article introduces SRPO, a two‑stage history‑resampling reinforcement‑learning framework that systematically tackles common GRPO training issues and achieves state‑of‑the‑art performance on both math and code benchmarks with far fewer training steps, while also revealing emergent self‑reflection behaviors in large language models.

LLM optimization · SRPO · cross-domain training

0 likes · 12 min read

AI Frontier Lectures

Apr 24, 2025 · Artificial Intelligence

How d1 Boosts Reasoning in Diffusion LLMs with Reinforcement Learning

Researchers from UCLA and Meta AI introduce d1, a two‑stage post‑training framework that combines supervised fine‑tuning and a novel diffu‑GRPO reinforcement‑learning algorithm to enable efficient reasoning in masked diffusion large language models, achieving state‑of‑the‑art performance on multiple math and logic benchmarks.

AI · d1 · diffu-GRPO

0 likes · 9 min read

AI Frontier Lectures

Apr 24, 2025 · Artificial Intelligence

Why AI’s Second Half Is About Products, Not Just Models – A Deep Dive

The article argues that AI is entering a new phase where defining real‑world tasks and robust evaluation outweigh pure model improvements, highlighting the rise of reasoning‑augmented reinforcement learning, the need for product‑oriented thinking, and the shortcomings of current i.i.d. benchmark practices.

AI trends · Industry Insight · product focus

0 likes · 9 min read

AntTech

Apr 21, 2025 · Artificial Intelligence

InclusionAI Community to Present AReaL Reinforcement Learning Framework and AWorld Multi‑Agent Framework at ICLR 2025

The InclusionAI open‑source community, initiated by Ant Group, will showcase the latest advances of its reinforcement‑learning framework AReaL and multi‑agent framework AWorld at the ICLR 2025 conference in Singapore, highlighting performance breakthroughs, open‑source contributions, and industry‑focused AI research.

AReaL · AWorld · Ant Group

0 likes · 5 min read

DataFunTalk

Apr 21, 2025 · Artificial Intelligence

Mechanize: A Controversial AI Startup Aiming to Fully Automate All Work and the Global Economy

Mechanize, a new AI startup founded by Epoch AI co‑founder Tamay Besiroglu, aims to fully automate all white‑collar work and the global economy, targeting a $60 trillion labor market, but faces technical hurdles, investor scrutiny, and widespread criticism over its radical vision.

AI automation · AI startups · Artificial Intelligence

0 likes · 6 min read

AI Algorithm Path

Apr 20, 2025 · Artificial Intelligence

Boosting Visual Reasoning in VLMs with Reinforcement Learning

The article analyzes how reinforcement learning, which transformed LLM reasoning in DeepSeek, can be applied to visual‑language models to overcome the limitations of traditional chain‑of‑thought prompting and supervised fine‑tuning, presenting concrete reward designs, training pipelines, and a critical assessment of their strengths and weaknesses.

LLM · RL training · chain-of-thought

0 likes · 10 min read

Fighter's World

Apr 18, 2025 · Artificial Intelligence

Rethinking the AGI Roadmap: From Data Imitation to Experience‑Driven Superiority

The article analyzes the emerging "Era of Experience" in AI, arguing that reliance on static human data limits progress and that reinforcement learning‑based experiential learning—exemplified by AlphaZero—offers a path toward surpassing human knowledge, while outlining the technical, safety, and ethical challenges ahead.

AGI · AlphaZero · Artificial Intelligence

0 likes · 19 min read

AI Frontier Lectures

Apr 18, 2025 · Artificial Intelligence

From RL’s Early Days to Its Future: A Four‑Stage Evolution of Reinforcement Learning

This reflective essay traces reinforcement learning’s decade‑long evolution through four stages—early algorithmic foundations, application‑driven growth, problem‑construction focus, and speculative future—while critiquing the expanding definition and its impact on research and industry.

AI research · RL evolution · RLHF

0 likes · 9 min read

AI Frontier Lectures

Apr 17, 2025 · Artificial Intelligence

Why Reinforcement Learning Fails to Boost Small LLM Reasoning: A Deep Dive

This article analyzes a recent study on language‑model reasoning, revealing that reinforcement learning often brings little or no improvement, while evaluation variance caused by seeds, hardware, and decoding settings can dramatically affect benchmark results, and supervised fine‑tuning emerges as a more reliable path.

LLM · Reproducibility · reinforcement learning

0 likes · 12 min read

Data Thinking Notes

Apr 15, 2025 · Artificial Intelligence

Understanding AI Agents: From Reinforcement Learning to LLM-Powered Planning

Professor Li Hongyi’s lecture provides a comprehensive, step‑by‑step exploration of AI agents, covering their definitions, reinforcement‑learning roots, LLM integration, memory mechanisms, tool usage, planning strategies, benchmarks, and practical examples, offering a valuable resource for anyone studying modern artificial intelligence.

AI agents · Memory · Planning

0 likes · 67 min read

Volcano Engine Developer Services

Apr 14, 2025 · Artificial Intelligence

Introducing Multi‑SWE‑bench: The First Multilingual Code‑Fix Benchmark for LLMs

ByteDance’s Doubao model team has open‑sourced Multi‑SWE‑bench, a multilingual benchmark covering seven major programming languages with 1,632 real‑world bug‑fix tasks, complete Docker environments, difficulty grading, and strict human validation, aiming to evaluate and advance large‑language‑model code‑repair capabilities beyond Python.

LLM Benchmark · code repair · dataset

0 likes · 11 min read

AI Algorithm Path

Apr 13, 2025 · Artificial Intelligence

Understanding GRPO: Group Relative Policy Optimization for LLM Training

The article explains GRPO, a reinforcement‑learning algorithm that extends PPO with group sampling, no value network, dual penalties and KL regularisation, showing how it improves efficiency and stability when fine‑tuning large language models such as DeepSeek‑Math and DeepSeek‑R1.

DeepSeek · GRPO · PPO

0 likes · 6 min read

Network Intelligence Research Center (NIRC)

Apr 9, 2025 · Artificial Intelligence

Why Scaling Laws Fail for Video MLLMs: Uncovering the Temporal Hacking Problem

The article analyzes the anti‑scaling phenomenon in video large‑language models, identifies a “temporal hacking” shortcut where models focus on a few key frames, formalizes it via reward‑hacking theory, introduces the Temporal Perplexity (TPL) metric, and proposes an Unhackable Temporal Rewarding (UTR) framework to mitigate the issue.

Scaling Law · Temporal Perplexity · UTR

0 likes · 14 min read

AI Algorithm Path

Apr 2, 2025 · Artificial Intelligence

Vision‑Reasoning Model: Enabling LLMs to See and Think

The article analyzes the limitations of current visual language models and large reasoning models, proposes a combined Vision‑Reasoning Model (VRM), details its architecture using LLaVA, describes end‑to‑end fine‑tuning and reinforcement‑learning reward design, and argues that such models will become the next breakthrough in AI.

DeepSeek · LLaVA · Large Language Model

0 likes · 9 min read

Data Thinking Notes

Mar 30, 2025 · Artificial Intelligence

How DeepSeek‑R1 and Kimi‑K1.5 Push the Boundaries of Strong Reasoning Models

This comprehensive analysis by the Peking University AI Alignment team dissects the technical innovations behind DeepSeek‑R1, DeepSeek‑R1‑Zero, and Kimi‑K1.5, covering reinforcement‑learning‑based post‑training, rule‑based rewards, GRPO optimization, scaling laws, multimodal extensions, safety challenges, and future research directions.

AI alignment · DeepSeek · Kimi

0 likes · 57 min read

Fighter's World

Mar 29, 2025 · Industry Insights

A Year in AI: Key Insights from the Unsupervised Learning & Latent Space Podcast

The podcast recap dissects a year of rapid AI change, highlighting surprisingly fast open‑source model releases, shifting foundation‑model dynamics, the rise of GPT wrappers, over‑hyped agents, undervalued memory, product‑market fit debates, infrastructure opportunities, and lingering mysteries like RL in non‑verifiable domains.

AI infrastructure · AI trends · GPT wrappers

0 likes · 22 min read

JavaEdge

Mar 27, 2025 · Artificial Intelligence

Can a Single LLM Both See and Reason? Exploring Visual Reasoning Models (VRM)

This article examines the limitations of current vision‑language and reasoning models, proposes a visual reasoning model (VRM) that can process images and perform deep logical inference, and discusses architecture, training methods, reinforcement‑learning reward designs, and practical challenges.

Artificial Intelligence · LLM · Vision-Language Model

0 likes · 8 min read

JD Tech

Mar 26, 2025 · Artificial Intelligence

CTR-Driven Advertising Image Generation Using Multimodal Large Language Models (CAIG)

The JD advertising team proposes a CTR‑driven advertising image generation framework (CAIG) that leverages multimodal large language models, a novel reward model, and product‑centric preference optimization to produce ad images with superior click‑through performance, validated by extensive offline and online experiments.

CTR optimization · Reward Model · advertising image generation

0 likes · 10 min read

AI Frontier Lectures

Mar 24, 2025 · Artificial Intelligence

What Can AI Agents Learn from the Latest AIR 2025 Research?

The article compiles insights from the AIR 2025 conference and related talks, covering the evolution of agents from reinforcement‑learning to LLM‑driven systems, novel agent architectures like AIDE, GUI agents, natural‑language reinforcement learning, and scaling advances in large language models such as Qwen, while highlighting key algorithms, benchmarks, and open research questions.

AI agents · Agent Architecture · GUI agents

0 likes · 27 min read

JD Tech Talk

Mar 24, 2025 · Artificial Intelligence

MaRCA: Multi‑Agent Reinforcement Learning Computation Allocation for Full‑Chain Ad Serving

This article presents MaRCA, a multi‑agent reinforcement learning framework that allocates computation resources across the full ad‑serving chain by modeling user value, compute consumption, and action rewards, enabling fine‑grained tilting of compute toward high‑quality traffic and achieving significant business gains under strict latency constraints.

ad serving · ai-optimization · computation allocation

0 likes · 16 min read

Architect

Mar 23, 2025 · Artificial Intelligence

The Future of AI Agents: From Prompt‑Driven Workflows to Model‑as‑Product and Reinforcement‑Learning‑Powered Agents

The article argues that the next wave of AI agents will shift from brittle, prompt‑driven workflows like Manus to truly autonomous, model‑centric agents trained with reinforcement learning and reasoning, exemplified by OpenAI's Deep Research and Anthropic's Claude 3.7 Sonnet, while the API‑driven market model collapses.

AI agents · Claude · DeepResearch

0 likes · 28 min read

Baobao Algorithm Notes

Mar 20, 2025 · Artificial Intelligence

Unlocking Large‑Scale Deep Reinforcement Learning: PPO, GAE, and PPG Deep Dive

This comprehensive guide examines large‑scale deep reinforcement learning, detailing policy‑gradient fundamentals, the mathematics of PPO and GAE, practical implementation tricks, reward and observation normalization, network initialization, and the newer Phasic Policy Gradient method, all supported by code snippets and key research references.

Algorithm Optimization · Deep RL · GAE

0 likes · 19 min read
