Tagged articles

690 articles

Dec 23, 2025 · Industry Insights

How AI Will Boost Collective Productivity: Key Takeaways from Microsoft’s 2025 Future of Work Report

Microsoft’s 2025 New Future of Work report reveals that AI, driven by breakthroughs in reinforcement learning, is shifting from answering questions to executing complex tasks, while investment and corporate adoption surge unevenly and employee involvement emerges as a critical factor for sustainable productivity gains.

AI · Future of Work · Industry Insights

0 likes · 8 min read

Bilibili Tech

Dec 19, 2025 · Artificial Intelligence

SABER: Switchable and Balanced Training for Efficient LLM Reasoning

SABER introduces a reinforcement‑learning framework that lets large language models dynamically switch among four token‑budgeted reasoning modes, dramatically cutting inference length while preserving or improving accuracy across math, code, and logic tasks.

Budgeted Computation · Efficient Reasoning · LLM

0 likes · 13 min read

Instant Consumer Technology Team

Dec 16, 2025 · Artificial Intelligence

How Mind Lab Trained a Trillion‑Parameter Agentic Memory with Only 10% GPU Power

This article explains how the Mind Lab team tackled the challenges of training a 1‑trillion‑parameter mixture‑of‑experts model for agentic memory using reinforcement learning, LoRA, and a custom Megatron‑Bridge architecture, achieving a ten‑fold speedup while consuming just a fraction of the usual GPU resources.

AI · Agentic Apps · LoRA

0 likes · 9 min read

Network Intelligence Research Center (NIRC)

Dec 15, 2025 · Artificial Intelligence

Turning LLM-Generated Network Configurations into Verified, Safe Updates with Artanis

The paper introduces Artanis, an intent‑based network configuration update framework that combines large‑language‑model generation with a verification‑feedback loop and reinforcement‑learning optimization, addressing hallucination‑induced errors and ensuring safe, policy‑compliant deployments across diverse network scales.

Configuration Management · Intent-based Networking · LLM

0 likes · 9 min read

Bighead's Algorithm Notes

Dec 13, 2025 · Artificial Intelligence

Key Quantitative Finance Papers (Dec 6‑12 2025) – AI‑Driven Insights

This article summarizes ten recent arXiv papers (Dec 6‑12 2025) that explore AI‑driven techniques—from neural‑network ranking and reinforcement learning to quantum models and LLM agents—for quantitative finance and investment decision‑making.

Machine Learning · Quantitative Finance · cryptocurrency

0 likes · 18 min read

AntTech

Dec 11, 2025 · Artificial Intelligence

Unlock Scalable RL: AReaL’s Decoupled Agentic Framework & Single‑Controller Design

This article explains how the open‑source AReaL framework boosts large‑scale reinforcement learning by separating agent execution from training logic, introducing a decoupled Agentic RL service and a Single‑Controller architecture that improves data flow, fault tolerance, and GPU utilization.

Open-source · Scalable RL · agentic AI

0 likes · 14 min read

AI Frontier Lectures

Dec 9, 2025 · Artificial Intelligence

Can Token‑Level Surrogates Stabilize RL for Large Language Models? A Deep Dive

This article analyzes why optimizing sequence‑level rewards for LLMs with token‑level surrogate objectives can improve reinforcement‑learning stability, explains the theoretical conditions required, introduces Routing Replay for MoE models, and presents extensive experiments validating the approach.

Importance Sampling · Mixture of Experts · large language models

0 likes · 12 min read

Data Party THU

Dec 9, 2025 · Artificial Intelligence

Can Robots Learn Human Moves Directly from AI‑Generated Videos? The GenMimic Breakthrough

The GenMimic paper introduces a novel framework that enables humanoid robots to zero‑shot imitate human actions generated by AI video models, presenting a new dataset, a two‑stage 4D reconstruction pipeline, and a reinforcement‑learning strategy with weighted‑tracking and symmetry losses, validated in simulation and on a real 23‑DoF robot.

Humanoid Robots · Video Generation · reinforcement learning

0 likes · 11 min read

Baidu Tech Salon

Dec 8, 2025 · Artificial Intelligence

How Baidu’s HuiBosheng AI Live Platform Generates Super‑Human Scripts and Real‑Time Interaction

The article details Baidu HuiBosheng's end‑to‑end AI live‑streaming platform, covering merchant workflow, multimodal product understanding, style‑aware script generation, reinforcement‑learning‑driven smart control, voice and avatar cloning, and a data‑flywheel that continuously improves model performance, illustrated with real‑world GMV results.

AI · Data Flywheel · Multimodal

0 likes · 20 min read

Bighead's Algorithm Notes

Dec 7, 2025 · Artificial Intelligence

AlphaQuanter: An End‑to‑End Tool‑Orchestrating Agent Using Reinforcement Learning for Stock Trading

AlphaQuanter tackles the three major limitations of existing LLM trading agents by introducing a single‑agent framework that dynamically orchestrates market tools, learns transparent decision policies via reinforcement learning, and achieves state‑of‑the‑art performance on key financial metrics across extensive stock‑level experiments.

AlphaQuanter · Financial AI · LLM agent

0 likes · 13 min read

Baobao Algorithm Notes

Dec 7, 2025 · Artificial Intelligence

Key Lessons from Scaling Agent RL Training: Stability, Tooling, and Reward Design

Over recent months of extensive agent reinforcement‑learning experiments across search, data‑analysis, and multi‑source scenarios, the author shares twelve practical insights covering stability, environment‑reward‑algorithm priorities, tool‑call reliability, reward hacking pitfalls, evaluation alignment, and scaling tricks for larger models.

PPO EWMA · RL scaling · reinforcement learning

0 likes · 7 min read

Baobao Algorithm Notes

Dec 7, 2025 · Artificial Intelligence

Can RL Really Boost LLM Reasoning? A Critical Review of Recent Findings

This article critically examines recent RL‑for‑LLM studies, revealing that reinforcement learning improves search efficiency but does not extend the intrinsic reasoning capabilities of base models, and explores the underlying model‑conditioned optimization bias, comparisons with SFT distillation, and the trade‑off with catastrophic forgetting.

Catastrophic Forgetting · LLM · SFT

0 likes · 11 min read

AntTech

Dec 4, 2025 · Artificial Intelligence

How AState Reduces Trillion‑Parameter RL Weight Sync to 6 Seconds

AState is a general‑purpose state data management system for reinforcement‑learning tasks that tackles low IO efficiency, slow weight synchronization, and state‑recovery challenges, achieving sub‑10‑second weight sync for trillion‑parameter models through a three‑layer architecture, zero‑redundancy transfers, and hardware‑aware co‑design, with the code openly available on GitHub.

AState · High Performance Computing · Large Models

0 likes · 23 min read

Model Perspective

Dec 1, 2025 · Artificial Intelligence

From AI to Everyday Life: How Reinforcement Learning Shapes Our Choices

This article explains the core concepts of reinforcement learning, illustrates how its reward‑based mechanism appears in media creation, career advancement, education and social media, and warns of the pitfalls of over‑optimizing external rewards while offering practical ways to balance intrinsic motivation and reflective thinking.

Artificial Intelligence · Motivation · behavioral psychology

0 likes · 12 min read

PaperAgent

Dec 1, 2025 · Artificial Intelligence

How Deep Research Turns LLMs into Autonomous AI Scientists

This article surveys the emerging Deep Research (DR) paradigm that upgrades large language models into research agents capable of autonomous planning, multi‑source evidence gathering, memory management, and verifiable long‑form report generation, outlining its stages, core components, training pipeline, and evaluation benchmarks.

AI agents · AI research automation · LLM agents

0 likes · 6 min read

Data Party THU

Nov 29, 2025 · Artificial Intelligence

Unlocking AI Agents: From Fundamentals to Building Your First LLM‑Powered Agent

This comprehensive guide explores the concept of AI agents, detailing their definitions, classifications, and core interaction loops, then walks you through building a functional LLM‑driven travel assistant with step‑by‑step code, tool integration, and practical insights on agent versus workflow paradigms.

AI agents · Agent Architecture · LLM

0 likes · 39 min read

Bighead's Algorithm Notes

Nov 28, 2025 · Artificial Intelligence

Weekly Quantitative Finance Paper Digest (Nov 22‑28, 2025)

This digest summarizes five recent arXiv papers on AI-driven portfolio optimization and financial time‑series forecasting, covering G‑Learning with GIRL, transfer‑learning strategies, hybrid LSTM‑PPO frameworks, time‑series foundation models, and a KAN versus LSTM performance comparison, highlighting their methods, datasets, and reported Sharpe improvements.

Financial AI · portfolio optimization · reinforcement learning

0 likes · 9 min read

Tencent Advertising Technology

Nov 28, 2025 · Artificial Intelligence

How Retrv-R1 Redefines Universal Multimodal Retrieval with Reasoning‑Driven MLLM

Retrv‑R1, a reasoning‑driven multimodal large language model framework, tackles the precision‑efficiency dilemma of universal multimodal retrieval by introducing a two‑stage coarse‑to‑fine pipeline, an information‑compression module, a detail‑inspection mechanism, and a three‑stage training strategy, achieving SOTA performance across accuracy, efficiency, and generalization benchmarks.

Efficiency · Generalization · MLLM

0 likes · 21 min read

Alimama Tech

Nov 26, 2025 · Artificial Intelligence

How Alibaba’s ROCK & ROLL Enable Scalable Agentic AI Training

Alibaba’s open‑source ROCK environment sandbox and the ROLL reinforcement‑learning engine together provide a standardized, high‑throughput training loop that lets developers scale Agentic AI from a single machine to thousands of parallel instances while simplifying debugging and resource management.

Infrastructure · Scalable Training · agentic AI

0 likes · 12 min read

ITPUB

Nov 24, 2025 · Artificial Intelligence

Why Memory, Not Size, Is the Next Bottleneck for Large Language Models

In a detailed interview, the CTO of Memory Tensor (Shanghai) explains how limited memory capacity hampers large models, outlines the MemOS memory operating system, discusses information‑theoretic metrics, multimodal extensions, and reinforcement‑learning strategies for scalable, secure, and explainable AI memory management.

AI Architecture · Multimodal AI · information theory

0 likes · 23 min read

Data Party THU

Nov 23, 2025 · Artificial Intelligence

Can a Drone Learn to Land Itself? A Deep Reinforcement Learning Walkthrough

This article walks through the fundamentals of reinforcement learning, builds a custom drone‑landing simulation, defines state and action spaces, designs reward functions, implements a neural‑network policy with Bernoulli sampling, and trains it using REINFORCE with baseline techniques, while exposing common pitfalls such as reward‑cheating.

OpenAI Gym · Policy Gradient · Python

0 likes · 22 min read
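
The REINFORCE-with-baseline update named in the summary above can be sketched in a few lines. This is a minimal illustration, not the article's code: the function names, the mean-return baseline, and the binary "thruster on/off" action are all assumptions made for the example, and a real implementation would compute the loss inside an autodiff framework so it can be differentiated.

```python
import math

def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

def bernoulli_log_prob(p, action):
    """Log-probability of a binary action under Bernoulli(p).

    Assumption for this sketch: 1 = thruster on, 0 = thruster off.
    """
    return math.log(p) if action == 1 else math.log(1.0 - p)

def reinforce_loss(probs, actions, rewards, gamma=0.99):
    """REINFORCE loss with a mean-return baseline: -sum(log pi(a_t) * (G_t - b))."""
    returns = discounted_returns(rewards, gamma)
    baseline = sum(returns) / len(returns)  # simplest possible baseline (assumed)
    return -sum(
        bernoulli_log_prob(p, a) * (g - baseline)
        for p, a, g in zip(probs, actions, returns)
    )

# One toy episode: reward only arrives at the final step.
returns = discounted_returns([0.0, 0.0, 1.0], gamma=0.5)
loss = reinforce_loss([0.5, 0.5, 0.5], [1, 0, 1], [0.0, 0.0, 1.0], gamma=0.5)
```

Subtracting the baseline leaves the expected gradient unchanged but lowers its variance, which is why the technique is commonly paired with REINFORCE.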

AntTech

Nov 21, 2025 · Artificial Intelligence

How Awex Enables Sub‑Second TB‑Scale Weight Sync for Trillion‑Parameter RL Models

Awex is a high‑performance Python framework that synchronizes training and inference weights for trillion‑parameter reinforcement‑learning models in seconds, using unified conversion, metadata management, and NCCL/RDMA transfer plans, dramatically reducing RL training latency and supporting diverse parallel strategies.

High Performance Computing · Large Models · Python

0 likes · 17 min read

Xiaohongshu Tech REDtech

Nov 20, 2025 · Artificial Intelligence

How DeepAgent Achieves End‑to‑End Reasoning with 16,000+ Scalable Tools

DeepAgent is a new end‑to‑end reasoning agent that unifies autonomous thinking, dynamic tool search, and execution, handling over 16,000 real APIs, supporting embodied environments and research assistance, and achieving state‑of‑the‑art results across multiple benchmarks through its unified reasoning core, memory‑folding mechanisms, structured memory, and the ToolPO training framework.

AI agents · General AI · deep reasoning

0 likes · 14 min read

360 Zhihui Cloud Developer

Nov 20, 2025 · Artificial Intelligence

How DeepAgent Redefines AI Agents with Memory Folding and ToolPO

This article breaks down the DeepAgent paper, explaining its novel "main model + auxiliary model" architecture, the memory‑folding mechanism that compresses long‑context reasoning, and the ToolPO reinforcement strategy that enables efficient tool discovery and usage.

AI agents · ToolPO · large language models

0 likes · 8 min read

Baobao Algorithm Notes

Nov 20, 2025 · Artificial Intelligence

Why Reinforcement Learning Preserves LLM Generality Better Than Supervised Fine‑Tuning

The article analyzes why reinforcement learning (RL) fine‑tuning retains a large language model's general abilities better than supervised fine‑tuning (SFT), explaining the off‑policy distribution shift of SFT and the on‑policy data consistency, KL penalty, and trust‑region mechanisms that give RL its anti‑forgetting properties.

Catastrophic Forgetting · LLM · On-Policy Data

0 likes · 8 min read

Instant Consumer Technology Team

Nov 19, 2025 · Artificial Intelligence

How We Built an AI‑Powered Automated Video Editing Pipeline for Short‑Form Marketing

This article details the end‑to‑end AIGC video automation system we created—from raw material ingestion and multimodal content understanding to script generation, AI‑driven editing, rendering, and multi‑channel distribution—highlighting architecture, key modules, technical choices, performance results, and lessons learned.

AIGC · Multimodal AI · Script Generation

0 likes · 16 min read

AI Tech Publishing

Nov 17, 2025 · Artificial Intelligence

Frontier AI Models in RL Environments Reveal an Agent Capability Hierarchy

The article evaluates nine cutting‑edge AI models on 150 simulated workplace tasks, showing that even the strongest models complete fewer than 40% of tasks, and uses these results to propose a hierarchical framework of agentic capabilities ranging from tool use to common‑sense reasoning.

AI model evaluation · agentic capabilities · common sense reasoning

0 likes · 19 min read

Data Party THU

Nov 15, 2025 · Artificial Intelligence

How Reinforcement Learning Powers Intelligent AI Agents and LangGraph Workflows

This article explains how reinforcement learning (RL) underpins intelligent AI agents, covering the Markov Decision Process fundamentals, key RL components, multi‑hop reasoning on knowledge graphs, and a step‑by‑step LangGraph example that integrates an RL‑driven tutoring policy with Python code.

AI agents · LangGraph · Python

0 likes · 17 min read

Kuaishou Tech

Nov 14, 2025 · Artificial Intelligence

How GRPO‑Guard Stops Over‑Optimization in Flow‑Based Visual Generators

This article explains the over‑optimization problem in GRPO‑based flow models, analyzes why importance‑ratio clipping fails, and introduces GRPO‑Guard with RatioNorm and cross‑step gradient balancing, showing through extensive experiments that it stabilizes training and improves image quality across multiple diffusion backbones and tasks.

GRPO-Guard · flow matching · generative AI

0 likes · 9 min read

Bighead's Algorithm Notes

Nov 13, 2025 · Artificial Intelligence

Paper Review: AlphaGAT’s Two‑Stage Learning for Adaptive Portfolio Selection

AlphaGAT introduces a two‑stage learning framework that first extracts robust alpha factors with a CATimeMixer model and a novel loss, then dynamically weights these factors via reinforcement learning (PPO) and a graph attention network, achieving superior portfolio performance across DJIA, HSI, CSI‑100 and crypto markets despite noisy data and distribution shifts.

AlphaGAT · Financial AI · Time-series

0 likes · 14 min read

Alimama Tech

Nov 11, 2025 · Artificial Intelligence

Accelerating LLM RL with Async Training, Mini‑Critics, and Attention Rewards

This article introduces the 3A collaborative framework—Async architecture, Asymmetric PPO mini‑critics, and an attention‑based reasoning rhythm—demonstrating how decoupled, fine‑grained parallel training and structure‑aware reward allocation dramatically improve efficiency, scalability, and interpretability of reinforcement learning for large language models.

Asynchronous Training · attention mechanisms · large language models

0 likes · 23 min read

DataFunTalk

Nov 7, 2025 · Artificial Intelligence

Training-Free GRPO: Low‑Cost Reinforcement Learning for Large Language Models

Training-Free GRPO, proposed by Tencent Youtu Lab, replaces parameter updates with an iteratively built experience knowledge base. This makes reinforcement learning for large language models far cheaper, cutting training expenses from thousands of dollars to under $20 while maintaining strong performance on math reasoning and web search tasks.

AI · Cost Reduction · reinforcement learning

0 likes · 6 min read

Architect's Guide

Nov 7, 2025 · Artificial Intelligence

Why Multi-Agent Communication Protocols Are Crucial for Next-Gen AI

The article examines the need for Multi‑Agent Communication Protocols (MCP), outlines the limitations of single‑agent and centralized systems, compares MCP with other interaction methods, reviews current research and industrial applications, and highlights future directions such as hardware integration, bio‑inspired mechanisms, and blockchain convergence.

Blockchain · communication protocols · decentralized AI

0 likes · 9 min read

Kuaishou Tech

Nov 5, 2025 · Artificial Intelligence

How HiPO Gives LLMs a Smart Thinking Switch to Cut Costs and Boost Accuracy

This article explains the overthinking problem of large language models, introduces the HiPO framework with hybrid data cold‑start and reinforcement‑learning reward mechanisms that let models decide when to think deeply or answer directly, and shows experimental results demonstrating significant efficiency gains and accuracy improvements across multiple benchmarks.

Efficiency · Hybrid Policy Optimization · LLM

0 likes · 13 min read

Network Intelligence Research Center (NIRC)

Nov 4, 2025 · Artificial Intelligence

SEAgent: A Self‑Evolving Computer Agent that Learns Software Use Autonomously

SEAgent introduces a self‑evolving framework that enables a GUI agent to master unfamiliar software through autonomous exploration and experience learning, leveraging a curriculum generator, a world‑state model, and GRPO‑based reinforcement with adversarial imitation, achieving state‑of‑the‑art performance on OSWorld.

Curriculum Learning · GUI automation · SEAgent

0 likes · 6 min read

DataFunSummit

Nov 3, 2025 · Artificial Intelligence

Boosting Private Agentic AI: LLM Post‑Training, DPO, and End‑to‑End Evaluation

This article shares practical experience on deploying private Agentic AI, covering background, architecture design, challenges, data generation, reinforcement learning with DPO, automated multi‑dimensional evaluation, and future plans for open‑source models and richer tool integration.

DPO · LLM fine-tuning · Private Deployment

0 likes · 16 min read

Data Party THU

Oct 31, 2025 · Artificial Intelligence

How SPG’s Sandwich Gradient Boosts Diffusion Language Models Across Four Benchmarks

The SPG algorithm introduces a sandwiched policy gradient that uses computable lower and upper evidence bounds to align reinforcement learning for discrete diffusion language models, achieving faster convergence, higher peaks, and lower variance on four major reasoning benchmarks.

Diffusion Language Model · EUBO · Policy Gradient

0 likes · 9 min read

Bilibili Tech

Oct 31, 2025 · Artificial Intelligence

RIVAL: Adversarial RL Framework Elevates Conversational Subtitle Translation

RIVAL (Reinforcement Learning with Iterative and Adversarial Optimization) introduces an adversarial game between a reward model and a translation LLM, combining qualitative preference rewards with quantitative metrics like BLEU, to overcome distribution shift in RLHF and achieve superior performance on conversational subtitle and WMT translation tasks.

BLEU · LLM · Reward Modeling

0 likes · 13 min read

Baobao Algorithm Notes

Oct 31, 2025 · Artificial Intelligence

How Risk‑Sensitive Reinforcement Learning Improves LLM Pass@K Performance

This article analyzes why standard reinforcement learning can degrade Pass@K metrics after fine‑tuning large language models, introduces a risk‑sensitive RL objective that reshapes the advantage estimator, and demonstrates through bandit and mathematical‑reasoning experiments that the RS‑GRPO method consistently boosts diversity and overall Pass@K scores across multiple LLMs.

Exploration-Exploitation · LLM fine-tuning · Policy Gradient

0 likes · 12 min read

Baobao Algorithm Notes

Oct 31, 2025 · Artificial Intelligence

Unlocking LLM RL Scaling: The Best Practices from Meta’s New Study

Meta’s recent paper reveals a sigmoid‑shaped scaling law for LLM reinforcement learning, presents extensive 40‑k GPU‑hour experiments, compares various RL designs such as PPO‑off‑policy‑k and Pipeline‑RL‑k, and distills the findings into a practical “ScaleRL” recipe that improves performance and efficiency.

LLM · RL Optimization · Scaling Law

0 likes · 10 min read

DataFunTalk

Oct 30, 2025 · Artificial Intelligence

How On-Policy Distillation Cuts LLM Training Cost by 90%

Thinking Machines Lab introduces On-Policy Distillation, a post‑training technique that matches reinforcement‑learning performance at up to ten times lower compute cost, and demonstrates its effectiveness through extensive experiments on reasoning, personalization, and catastrophic‑forgetting mitigation.

Knowledge Distillation · model efficiency · on-policy distillation

0 likes · 15 min read

Baobao Algorithm Notes

Oct 30, 2025 · Artificial Intelligence

Why LLM RL Training Crashes While SFT Stays Stable: Insights & Tricks

The article examines the fundamental similarity between SFT and RL loss functions for large language models, explains why RL training is prone to instability, discusses infrastructure and data quality challenges, and reviews practical tricks and reward‑model considerations for more reliable RL fine‑tuning.

AI · LLM · Reward Modeling

0 likes · 11 min read

Instant Consumer Technology Team

Oct 28, 2025 · Artificial Intelligence

How 7B AgentFlow Beats 200B GPT-4o: Small Models, Big Wins

AgentFlow, a Stanford-led multi‑agent system built on a 7B model, outperforms massive models like GPT‑4o across ten benchmarks by leveraging modular agents, on‑policy learning, and a novel Flow‑GRPO training engine that solves sparse‑reward, long‑horizon challenges.

AgentFlow · Small Model Performance · multi-agent systems

0 likes · 12 min read

Data Party THU

Oct 24, 2025 · Artificial Intelligence

BREEZE: Enhancing Zero‑Shot Reinforcement Learning with Behavioral Regularization

The paper introduces BREEZE, a behavior‑regularized zero‑shot RL framework that improves stability, policy extraction, and representation quality by combining in‑sample learning, task‑conditioned diffusion models, and expressive attention‑based architectures, achieving near‑state‑of‑the‑art performance on benchmarks like ExORL and D4RL Kitchen.

behavioral regularization · diffusion model · offline RL

0 likes · 3 min read

Data Party THU

Oct 22, 2025 · Artificial Intelligence

Demystifying Large‑Model Reinforcement Learning: From MDP Basics to Bellman and Advantage Functions

This article provides a comprehensive introduction to reinforcement learning for large language models, covering the Markov Decision Process formulation, the four core elements of RL, state‑value and action‑value functions, Bellman equations, and the advantage function that underpins modern policy‑gradient algorithms.

AI fundamentals · Bellman equation · Large Language Model

0 likes · 13 min read
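
For reference, the quantities named in the summary above can be written in standard textbook notation. The symbols here (policy π, discount factor γ, transition kernel P, reward r) follow common convention and are not taken from the article itself.

```latex
\begin{aligned}
V^{\pi}(s)    &= \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[\, Q^{\pi}(s, a) \,\big] \\
Q^{\pi}(s, a) &= \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[\, r(s, a) + \gamma\, V^{\pi}(s') \,\big] \\
A^{\pi}(s, a) &= Q^{\pi}(s, a) - V^{\pi}(s)
\end{aligned}
```

The first two lines together form the Bellman expectation equations; the third defines the advantage, which measures how much better action a is than the policy's average behavior in state s.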

Data Party THU

Oct 21, 2025 · Artificial Intelligence

Why DQN Overestimates Q‑Values and How Double DQN Fixes It

The article explains how DQN’s use of the max operator introduces a maximization bias that leads to overestimated Q‑values, and shows how Double DQN separates action selection from value evaluation to eliminate this bias, improving stability and performance in Atari benchmarks.

DQN · Double DQN · algorithm analysis

0 likes · 7 min read
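
The difference described in the summary above comes down to how the bootstrap target is formed. The sketch below is illustrative only: the function names and the toy Q-values are assumptions, and plain Python lists stand in for network outputs.

```python
def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """Standard DQN: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)

def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN: the online network selects the action, the target network scores it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]

# Toy next-state Q-values from the two networks (made-up numbers).
next_q_online = [1.0, 2.5, 2.0]
next_q_target = [1.2, 1.8, 3.0]

y_dqn = dqn_target(1.0, next_q_target, gamma=0.9)
y_ddqn = double_dqn_target(1.0, next_q_online, next_q_target, gamma=0.9)
```

Because the evaluated action is not necessarily the one that maximizes the target network's own estimate, the Double DQN target can never exceed the standard DQN target for the same inputs, which is the mechanism that dampens the overestimation.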

Data Thinking Notes

Oct 19, 2025 · Artificial Intelligence

How GSPO Improves Stability in Large Language Model Training

GSPO (Group Sequence Policy Optimization) is a reinforcement‑learning algorithm for LLMs that replaces token‑level GRPO with sequence‑level optimization, addressing instability in ultra‑large model training, especially for long‑sequence and MoE architectures, by aligning reward granularity and reducing variance.

GRPO · GSPO · large language models

0 likes · 11 min read
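
To make the token-level versus sequence-level distinction in the summary above concrete, here is a minimal sketch. It assumes the sequence ratio is the length-normalized likelihood ratio, i.e. the exponential of the mean per-token log-ratio; the function names, the toy log-probabilities, and the clipping range `eps` are hypothetical choices for the example rather than values from the article.

```python
import math

def token_ratios(logp_new, logp_old):
    """Token-level importance ratios: one ratio per generated token."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def sequence_ratio(logp_new, logp_old):
    """Sequence-level ratio, length-normalized: exp(mean per-token log-ratio)."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))

def clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate applied to a single ratio (eps is illustrative)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

# Toy per-token log-probabilities of one response under the new and old policies.
logp_new = [-1.0, -0.5, -2.0]
logp_old = [-1.2, -0.4, -2.3]

per_token = token_ratios(logp_new, logp_old)    # three separate ratios
per_sequence = sequence_ratio(logp_new, logp_old)  # one ratio for the whole response
objective = clipped_objective(per_sequence, advantage=1.0)
```

With a single ratio per response, every token in that response is clipped or kept together, so the unit of optimization matches the unit at which the reward is assigned.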

Bilibili Tech

Oct 17, 2025 · Artificial Intelligence

How Bilibili’s Multimodal Team Won 2nd Place at ICCV MIPI with a Novel SFT+GRPO Strategy

This article details how Bilibili’s multimedia lab leveraged a multimodal training pipeline combining data‑compressed SFT and the GRPO reinforcement‑learning algorithm to achieve a 13.5% metric boost and secure second place in the ICCV MIPI Detailed Image Quality Assessment competition.

GRPO · MIPI competition · SFT

0 likes · 15 min read

Xiaohe Frontend Team

Oct 15, 2025 · Artificial Intelligence

REFRAG: Using Tiny Models to Compress RAG for Faster, Smarter AI

Meta’s new REFRAG framework lets a lightweight encoder compress retrieved text into semantic tags, enabling large language models to answer queries with far fewer tokens, lower latency, and higher throughput, while preserving core meaning and allowing flexible placement of compressed information within prompts.

LLM efficiency · Model Compression · RAG

0 likes · 8 min read

Meituan Technology Team

Oct 15, 2025 · Artificial Intelligence

What’s New in Large Model Research? Top Meituan AI Papers Up to Oct 2025

This curated list showcases Meituan’s latest large‑model breakthroughs and academic papers up to October 2025, spanning LLM system optimizations, multimodal generation, evaluation benchmarks, quantization techniques, and reinforcement‑learning‑driven improvements, offering researchers valuable insights and resources across the AI landscape.

AI research · Multimodal AI · benchmarking

0 likes · 10 min read

Data Party THU

Oct 15, 2025 · Artificial Intelligence

Designing Safe, Sample-Efficient, and Robust Reinforcement Learning for Ranking and Diffusion Models

This paper proposes a reinforcement‑learning framework that simultaneously ensures safety, sample efficiency, and robustness, applying a contextual‑bandit perspective to ranking/recommendation systems and text‑to‑image diffusion models, and introduces novel algorithms for safe deployment, variance‑reduced off‑policy estimation, and a LOOP method for generative RL.

Robustness · Safety · contextual bandits

0 likes · 5 min read

Alibaba Cloud Developer

Oct 15, 2025 · Artificial Intelligence

Mastering Structured Output in Large Language Models: Techniques, Challenges, and Future Trends

Large language models are evolving from free‑form text generators to reliable data providers by mastering structured output through prompt engineering, validation frameworks, constrained decoding, supervised fine‑tuning, reinforcement learning, and API‑level capabilities, enabling seamless integration with software systems while addressing hallucinations and format reliability.

API · LLM · Structured Output

0 likes · 28 min read

Volcano Engine Developer Services

Oct 14, 2025 · Artificial Intelligence

How CollabLLM Redefines LLM Collaboration with Multi‑Turn Training

CollabLLM tackles the limitations of large language models in everyday multi‑turn dialogues by introducing a user‑centric, multi‑turn training framework that leverages simulated interactions, multi‑round reward modeling, and veRL toolchain support, achieving superior performance over single‑turn baselines.

LLM · collaborative training · multi-turn dialogue

0 likes · 13 min read

Shopee Tech Team

Oct 14, 2025 · Artificial Intelligence

How SPEC‑RL Boosts On‑Policy Reinforcement Learning Speed by Up to 3×

SPEC‑RL introduces speculative rollouts that reuse verified historical rollouts as prefixes, cutting rollout time by 2–3× while maintaining or improving performance across various math and reasoning benchmarks, and works seamlessly with PPO, GRPO, DAPO and other on‑policy algorithms.

AI efficiency · Training Acceleration · large language models

0 likes · 8 min read

AntTech

Oct 14, 2025 · Artificial Intelligence

How Ring-1T Achieves Trillion-Scale Deep Thinking and Competitive Benchmarks

The Ring-1T model, a trillion-parameter AI system released as open source, leverages advanced reinforcement learning techniques, extensive benchmark evaluations, and custom training frameworks to deliver balanced performance across math, code, reasoning, and creative tasks while highlighting current limitations and future development plans.

AI model · Large Language Model · benchmark evaluation

0 likes · 8 min read

Data Party THU

Oct 13, 2025 · Artificial Intelligence

How BranchGRPO Accelerates and Stabilizes Diffusion Model Alignment

BranchGRPO introduces a tree‑structured branching, reward‑fusion, and lightweight pruning framework that dramatically speeds up diffusion and flow model training while delivering denser, more stable reward signals, achieving up to five‑fold faster convergence and higher alignment scores on image and video generation benchmarks.

BranchGRPO · Efficiency · RLHF

0 likes · 10 min read

Bighead's Algorithm Notes

Oct 12, 2025 · Artificial Intelligence

Trading-R1: Open-Source LLM Framework for Explainable Financial Trading

This article reviews Trading‑R1, an open‑source LLM reasoning framework that integrates multimodal financial data, three‑stage supervised fine‑tuning and reinforcement learning to generate structured investment arguments and risk‑adjusted trade decisions, achieving superior Sharpe ratio and drawdown performance on real‑world stock and ETF tests.

Financial Trading · LLM · Multimodal

0 likes · 11 min read

Kuaishou Large Model

Oct 11, 2025 · Artificial Intelligence

How Large-Scale Reinforcement Learning Boosted KAT-Dev-72B-Exp to 74.6% on SWE‑Bench

The KwaiPilot team introduced KAT-Dev-72B-Exp, an open‑source LLM trained with large‑scale reinforcement learning that achieved a record‑breaking 74.6% score on SWE‑Bench Verified, thanks to innovations like Trie Packing, entropy‑aware advantage scaling, and a decoupled data‑environment architecture.

KAT-Dev-72B-Exp · Trie Packing · entropy scaling

0 likes · 6 min read

Kuaishou Tech

Oct 11, 2025 · Artificial Intelligence

How KAT-Dev-72B-Exp Sets a New Record in Large‑Scale RL for Code Generation

The KAT‑Dev‑72B‑Exp model, an experimental reinforcement‑learning version of KAT‑Coder, achieves 74.6% accuracy on the SWE‑Bench Verified benchmark, introduces Trie Packing and entropy‑aware advantage scaling, and showcases a decoupled training architecture that dramatically speeds up large‑scale agentic RL training.

AI · agentic training · code generation

0 likes · 9 min read

Data Party THU

Oct 10, 2025 · Artificial Intelligence

Can Language Models Self‑Train Without Data? Inside the Language Self‑Play Framework

This article examines the Language Self‑Play (LSP) approach for data‑free training of large language models, detailing its challenger‑solver game formulation, advantage calculations, loss functions, self‑reward extension, experimental setup on AlpacaEval, and results that show LSP can match or surpass data‑driven baselines.

LLM · data-free training · large language models

0 likes · 14 min read

Data Party THU

Oct 9, 2025 · Artificial Intelligence

How Reinforcement Learning Is Transforming the Full Lifecycle of Large Language Models

This survey systematically reviews recent advances in applying reinforcement learning across the entire lifecycle of large language models, detailing methods, datasets, benchmarks, open‑source tools, and future challenges such as scalability, reward design, and evaluation standards.

AI Survey · LLM lifecycle · RLVR

0 likes · 9 min read

DataFunTalk

Oct 9, 2025 · Artificial Intelligence

From Physics to DeepMind: How a Tsinghua Star Is Shaping AI Research

Google DeepMind hired Shunyu Yao, a Tsinghua physics prodigy and former Anthropic researcher, whose rapid transition from theoretical physics to AI highlights the intense workload, values clash, and the accelerating pace of large‑model research.

AI research · DeepMind · Physics

0 likes · 9 min read

Model Perspective

Oct 8, 2025 · Artificial Intelligence

How Mathematical Models Reveal the Hidden Dynamics of Addiction

This article explores how differential equations, SIR-like population models, and reinforcement‑learning frameworks can quantitatively describe the onset, persistence, and spread of addictive behaviors, offering insights into feedback loops, neural adaptation, and optimal intervention strategies.

addiction modeling · dynamical systems · intervention optimization

0 likes · 10 min read

DataFunSummit

Oct 7, 2025 · Artificial Intelligence

Deep Thinking in Large Language Models: Overcoming Domain Challenges

This presentation explores how large language models can transcend their general knowledge limits by developing domain‑specific deep thinking abilities, addressing challenges such as complex instruction execution, expert reasoning gaps, and tool integration, and proposes reinforcement‑learning‑driven frameworks, structured thinking pipelines, and tool‑calling mechanisms to achieve rational intelligence.

deep reasoning · domain adaptation · reinforcement learning

0 likes · 27 min read

DataFunTalk

Oct 7, 2025 · Artificial Intelligence

Can Reinforcement Learning Spot Hallucinations in LLMs? Introducing RL4HS

Apple’s new paper presents RL4HS, a reinforcement‑learning framework that uses span‑level rewards and class‑aware policy optimization to detect hallucinated text spans in large language models, outperforming GPT‑5 and other baselines and offering more precise, auditable error identification.

RL4HS · hallucination detection · reinforcement learning

0 likes · 9 min read

Amap Tech

Oct 5, 2025 · Artificial Intelligence

Can One Navigation Brain Power All Robots? Inside CE-Nav’s Cross‑Embodiment Breakthrough

CE-Nav introduces a two‑stage imitation‑then‑reinforcement framework that decouples generic geometric planning from robot‑specific dynamics, enabling low‑cost, high‑performance navigation across quadrupeds, humanoids, and drones while requiring only brief online fine‑tuning.

Simulation · VelFlow · cross-embodiment

0 likes · 11 min read

Amap Tech

Oct 3, 2025 · Artificial Intelligence

How FantasyHSI Enables Autonomous 3D Human Interaction in Any Scene

FantasyHSI introduces a graph‑based multi‑agent framework that combines visual‑language models and video‑generation diffusion to let digital humans perceive, plan, and interact autonomously in any 3D scene, producing physically plausible, long‑duration actions for animation creation and embodied‑AI simulation.

3D synthesis · Graph Modeling · Video Generation

0 likes · 12 min read

AI2ML AI to Machine Learning

Oct 1, 2025 · Artificial Intelligence

2025 Large Model Engineering Breakthroughs: Cutting Costs, Boosting Performance, and Extending Context

The 2025 open‑source reports reveal major advances in large‑model engineering, including drastic cost cuts such as DeepSeek‑V3 training for $5.57 M, performance gains where Gemma 3 4B matches Gemma 2 27B, memory efficiencies like 85 % KV‑cache reduction, and a suite of new techniques—from loss‑free MoE balancing to multi‑token prediction—that together push context lengths to one million tokens and enable multimodal, aligned, and industry‑specific models.

Cost Reduction · Model Compression · Multimodal AI

0 likes · 13 min read

Data Party THU

Sep 28, 2025 · Artificial Intelligence

Can the OaK Architecture Unlock General AI? A Deep Dive into Continuous Learning and Planning

The article presents Richard Sutton’s OaK architecture—a domain‑general, empirical, open‑ended framework that equips agents with continuously learnable components, meta‑learned step‑sizes, and a five‑stage FC‑STOMP pipeline to build world models, generate sub‑problems, learn options, and plan at run‑time.

AI Architecture · continual learning · meta‑learning

0 likes · 22 min read

HyperAI Super Neural

Sep 28, 2025 · Artificial Intelligence

Weekly AI Paper Digest: Vision‑Language Models for Safety, Unstable Singularities, and RL‑Driven Reasoning

This week’s AI paper roundup highlights five recent studies: a construction‑site vision‑language dataset and safety inspection tasks, a deep CORAL method for unsupervised domain adaptation, the discovery of a new family of unstable singularities in nonlinear PDEs, a reinforcement‑learning approach that boosts reasoning in large language models, and the PANORAMA architecture for omnidirectional vision in embodied AI.

Construction Safety · Omnidirectional Vision · PDE Research

0 likes · 6 min read

Huawei Cloud Developer Alliance

Sep 28, 2025 · Artificial Intelligence

Essential AI Reading List: Must‑Read Books Across AI, ML, DL, and Ethics

This curated list presents the most influential AI books, covering foundational theory, machine learning, deep learning, reinforcement learning, computer vision, and AI ethics, with editorial insights and author biographies to guide readers through the evolving landscape of artificial intelligence.

AI ethics · Artificial Intelligence · reinforcement learning

0 likes · 25 min read

HyperAI Super Neural

Sep 26, 2025 · Artificial Intelligence

Nvidia’s ReaSyn Uses Chain‑of‑Reaction Reasoning to Boost Molecule Reconstruction and Path Diversity

ReaSyn, a new framework from Nvidia’s research team, treats synthesis pathways as chain‑of‑thought reasoning using a novel Chain‑of‑Reaction representation, achieving the highest reconstruction rates and path diversity in molecule synthesis tasks, and outperforming prior methods across multiple benchmark optimizations.

AI drug discovery · ReaSyn · chain-of-reaction

0 likes · 14 min read

Bighead's Algorithm Notes

Sep 25, 2025 · Artificial Intelligence

How MARS Uses Risk‑Aware Multi‑Agent RL to Master Portfolio Management

This article reviews the MARS framework, a risk‑aware multi‑agent reinforcement‑learning system for automated portfolio management that tackles market non‑stationarity and proactive risk control, detailing its hierarchical architecture, formal MDP formulation, training process, and superior experimental results on DJIA and HSI benchmarks.

Portfolio Management · deep learning · financial markets

0 likes · 13 min read

Fun with Large Models

Sep 24, 2025 · Artificial Intelligence

Interview Guide: Core Differences Between PPO and GRPO Algorithms for Large Model Fine‑Tuning

The article explains the fundamental principles of PPO and GRPO reinforcement‑learning algorithms, compares their architectures and training workflows, highlights why GRPO is gaining traction in large‑model fine‑tuning, discusses associated risks, and offers practical guidance on group size selection for engineers preparing for interviews.

GRPO · PPO · RLHF

0 likes · 9 min read

Data Party THU

Sep 20, 2025 · Artificial Intelligence

How DeepSeek Trained a $30M LLM for Just $294K – Inside the R1 Model

The article reports that DeepSeek’s R1 large language model, detailed in a peer‑reviewed Nature paper, was trained for about $294 k (roughly $300 k) using Nvidia H800 chips and pure reinforcement‑learning techniques, achieving competitive performance while remaining open‑source.

DeepSeek · Large Language Model · Nvidia H800

0 likes · 9 min read

Bighead's Algorithm Notes

Sep 20, 2025 · Artificial Intelligence

Weekly Quantitative Finance Paper Digest (Sep 13‑19, 2025)

This digest summarizes seven recent arXiv papers that apply reinforcement learning, multi‑agent frameworks, dynamic factor models, high‑frequency trading LLMs, quantum GANs, multi‑LLM sentiment analysis, and context‑aware language models to advance quantitative finance and AI‑driven market prediction.

Quantitative Finance · Quantum Machine Learning · large language models

0 likes · 12 min read

Data Party THU

Sep 19, 2025 · Artificial Intelligence

How DeepSeek R1 Redefines AI Reasoning with Pure Reinforcement Learning

DeepSeek R1 replaces traditional supervised fine‑tuning with a pure reinforcement‑learning pipeline, introducing the GRPO algorithm and a four‑stage training regime that dramatically lowers cost, boosts reasoning and code‑generation performance, and raises important ethical, privacy, and societal considerations for large language models.

AI reasoning · DeepSeek · GRPO

0 likes · 14 min read

HyperAI Super Neural

Sep 19, 2025 · Artificial Intelligence

Weekly AI Paper Roundup: RL Advances, Tree‑Structured QA, and GraphRAG Breakthroughs

This article surveys five recent AI papers, covering reinforcement learning for large reasoning models, a tree‑structured table QA framework (ST‑Raptor), visual representation alignment for multimodal LLMs, GraphRAG‑based generation, and an LLM‑driven cryptographic vulnerability detector, each with key insights and links.

cryptographic vulnerability detection · graph retrieval · large language models

0 likes · 5 min read

DataFunSummit

Sep 18, 2025 · Artificial Intelligence

Boosting LLM Function Call: Data, Training, and Agent Optimization Strategies

This presentation by Yao Yitong of China Telecom AI Research Institute explains why Function Call is essential for LLM deployment, outlines data‑centric and training‑centric optimization methods, discusses common pitfalls and reward‑function design for reinforcement learning, and showcases practical Agent application patterns for real‑world tasks.

Agent · LLM · Training Optimization

0 likes · 36 min read

HyperAI Super Neural

Sep 18, 2025 · Artificial Intelligence

DeepSeek‑R1 Costs $294K to Train, Hits Nature Cover as First Peer‑Reviewed Large Model

DeepSeek‑R1, the first mainstream large language model to pass peer review in Nature, was trained for $294,000 using 512 (64×8) H800 GPUs, and its pure‑RL precursor, DeepSeek‑R1‑Zero, reached up to 86.7% accuracy on AIME 2024 with self‑consistency decoding, outperforming human averages across math, coding, and science tasks.

AI research · DeepSeek-R1 · Large Language Model

0 likes · 10 min read

Bighead's Algorithm Notes

Sep 14, 2025 · Artificial Intelligence

How MM‑DREX Uses Multimodal LLMs for Dynamic Expert Routing in Financial Trading

The article reviews the MM‑DREX framework, which tackles the non‑stationarity of financial markets by modeling trading as a POMDP, employing a vision‑language model‑driven dynamic router to allocate four heterogeneous experts, and demonstrates superior returns, Sharpe ratios, and drawdown control across stocks, futures, and crypto compared with 15 strong baselines.

Dynamic Routing · LLM · POMDP

0 likes · 13 min read

Fighter's World

Sep 12, 2025 · Artificial Intelligence

Why Are Production‑Grade AI Agents So Hard to Build?

The article analyses why production‑grade AI agents remain unreliable, pinpointing the scarcity of high‑quality task‑action data, the limits of static benchmarks, and the need for massive data‑generation engines, simulation sandboxes, sophisticated RL reward design, and efficient context engineering.

AI agent · Context Engineering · Data Generation

0 likes · 21 min read

DataFunTalk

Sep 12, 2025 · Artificial Intelligence

Key Takeaways from AI Leaders at the 2025 Inclusion·Bund Conference

The 2025 Inclusion·Bund conference gathered top AI pioneers, including Turing laureate Richard Sutton, Alibaba Cloud founder Wang Jian, HKU professor Ma Yi, Unitree Robotics CEO Wang Xingxing, and historian Yuval Harari, to discuss the limits of intelligence, the shift toward open‑source resources, embodied AI, and the societal implications of rapid AI advancement.

AI · Artificial Intelligence · reinforcement learning

0 likes · 15 min read

Bighead's Algorithm Notes

Sep 11, 2025 · Artificial Intelligence

Fin-PRM: Alibaba’s Dianjin Team Introduces a Domain-Specific Process Reward Model for Financial Reasoning

Fin‑PRM, a domain‑specific process reward model for financial reasoning introduced by Alibaba’s Dianjin team, employs dual‑level step and trajectory rewards to provide fine‑grained supervision, achieving up to 12.9% accuracy gains in supervised fine‑tuning and 5.1% improvements in Best‑of‑N inference on benchmarks such as CFLUE and FinQA.

CFLUE · Fin-PRM · FinQA

0 likes · 11 min read

Instant Consumer Technology Team

Sep 11, 2025 · Artificial Intelligence

How REFRAG Cuts LLM Decoding Time by 30×: A New Efficient RAG Framework

REFRAG (REpresentation For RAG) introduces a novel decoding framework that compresses, senses, and expands context using precomputed chunk embeddings, achieving up to 30.85× faster first-token generation and 16× larger context windows without sacrificing perplexity, as validated across diverse long‑context tasks.

LLM · RAG · chunk embeddings

0 likes · 18 min read

Sohu Tech Products

Sep 10, 2025 · Artificial Intelligence

How GRPO Revolutionizes RLHF: Efficient, Stable Training for Large Language Models

This article explains the GRPO algorithm, an improvement over PPO for large language model training that eliminates the value network, uses group‑relative advantage estimation, and offers flexible supervision, resulting in higher efficiency, stability, and performance on tasks such as mathematical reasoning.

GRPO · LLM training · PPO

0 likes · 16 min read
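
The group‑relative advantage estimation mentioned in the GRPO summary above can be illustrated with a minimal sketch. This is not code from the article: the function name `group_relative_advantages`, the `eps` term, and the choice of population standard deviation are assumptions made for illustration. Each sampled response to the same prompt is scored against the mean and spread of its own group, so no value network is needed.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    `rewards` holds the scalar rewards of several responses sampled for
    the same prompt. The group mean plays the role of the baseline that
    a learned value network would otherwise provide.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 if correct, else 0.0.
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

Responses scoring above the group mean receive a positive advantage and those below a negative one; when every response in a group gets the same reward, all advantages are zero and the group contributes no gradient signal.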

Bighead's Algorithm Notes

Sep 5, 2025 · Artificial Intelligence

Weekly Quantitative Finance Paper Digest (Aug 30 – Sep 5, 2025)

This digest reviews four recent AI‑driven finance papers: a robust MCVaR portfolio optimizer with ellipsoidal support and RKHS uncertainty, a PPO‑based adaptive weighting system for LLM‑generated alphas, an empirical comparison of price‑based, GICS‑based, and LLM‑embedding stock clustering, and a diffusion‑model approach that generates future financial chart images from current charts and text prompts.

Quantitative Finance · diffusion models · large language models

0 likes · 9 min read

Data Party THU

Sep 4, 2025 · Artificial Intelligence

Unraveling PPO Variants: From GRPO to DAPO and GSPO – A Deep Dive

This article provides a comprehensive technical analysis of PPO‑based reinforcement learning methods for large language models, detailing the evolution from the original PPO algorithm through GRPO, DAPO, and GSPO, and explaining their motivations, mathematical formulations, advantages, and practical challenges such as entropy collapse and importance‑sampling variance.

DAPO · GRPO · GSPO

0 likes · 30 min read

Sohu Tech Products

Sep 3, 2025 · Artificial Intelligence

How GRPO Revolutionizes RLHF for Large Language Models

This article explains the motivation, mathematical foundations, implementation details, advantages, experimental results, and future directions of Group Relative Policy Optimization (GRPO), a novel reinforcement‑learning algorithm that replaces PPO’s value network with efficient group‑wise relative evaluation for large language models.

Artificial Intelligence · GRPO · LLM

0 likes · 17 min read

Bighead's Algorithm Notes

Sep 3, 2025 · Artificial Intelligence

Decoding TINs: Reconstructing Classic Technical Analysis with Neural Networks

The paper introduces Technical Indicator Networks (TINs), a framework that maps traditional technical analysis formulas to neural‑network topologies, initializes weights to preserve indicator behavior, and uses reinforcement learning for dynamic optimization, achieving significantly higher Sharpe, Sortino, and cumulative returns on US30 component stocks than conventional MACD approaches.

Algorithmic Trading · Financial AI · Technical Indicator Networks

0 likes · 9 min read

Baobao Algorithm Notes

Sep 3, 2025 · Artificial Intelligence

How Atom-Searcher Boosts LLM Reasoning with Atomic Thought Rewards

Atom-Searcher introduces an atomic‑thought reinforcement‑learning framework that decomposes complex reasoning into fine‑grained units, uses a Reasoning Reward Model to assign step‑wise rewards, dynamically balances process and result incentives, and achieves state‑of‑the‑art performance on multiple LLM benchmarks.

Agentic Research · Atomic Thought · LLM

0 likes · 12 min read

Data STUDIO

Sep 2, 2025 · Artificial Intelligence

Understanding NAS: Core Algorithms and Python Implementations

This article reviews Neural Architecture Search (NAS), explains its bi‑level optimization formulation, compares three major search strategies—reinforcement learning, evolutionary algorithms, and differentiable gradient‑based methods—provides complete Python code for each, and analyzes experimental results highlighting performance trade‑offs and remaining challenges.

Differentiable Architecture Search · Evolutionary Algorithms · NAS

0 likes · 25 min read

Data Party THU

Aug 30, 2025 · Artificial Intelligence

Understanding Multi‑Armed Bandits: Balancing Exploration and Exploitation in Reinforcement Learning

Multi‑armed bandit models illustrate the core exploration‑exploitation dilemma in reinforcement learning, covering greedy, ε‑greedy, and optimistic‑initial‑value strategies, as well as sample‑average and incremental Q‑value estimation methods with practical examples and visual illustrations.

Q-value estimation · exploration vs exploitation · greedy

0 likes · 15 min read
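
The ε‑greedy strategy and incremental Q‑value estimation named in the bandit summary above can be sketched as follows. This is an illustrative example, not the article's code: the function names, the Gaussian reward model with unit variance, and the default parameters are all assumptions.

```python
import random


def incremental_update(q, n, reward):
    """Incremental sample average: Q_n = Q_{n-1} + (R_n - Q_{n-1}) / n."""
    return q + (reward - q) / n


def run_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    """Play a k-armed bandit with an epsilon-greedy policy.

    Rewards are drawn from a unit-variance Gaussian around each arm's
    true mean (an assumption made only for this sketch).
    """
    rng = random.Random(seed)
    k = len(true_means)
    q = [0.0] * k  # estimated value of each arm
    n = [0] * k    # number of times each arm was pulled
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                  # explore
        else:
            arm = max(range(k), key=q.__getitem__)  # exploit
        reward = rng.gauss(true_means[arm], 1.0)
        n[arm] += 1
        q[arm] = incremental_update(q[arm], n[arm], reward)
    return q, n


estimates, pulls = run_bandit([0.2, 0.5, 1.0])
print(estimates, pulls)
```

The incremental form gives the same result as averaging all past rewards for an arm but needs only the current estimate and the pull count, which is why it is usually preferred over storing the full reward history.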

Bighead's Algorithm Notes

Aug 29, 2025 · Artificial Intelligence

Weekly Quantitative Finance Paper Digest (Aug 23‑29, 2025)

This digest summarizes nine recent arXiv papers covering quantum portfolio optimization, thematic investing with semantic stock representations, multi‑indicator reinforcement learning for trading, attention‑based asset pricing, ESG variable selection, deep neural networks for return distribution forecasting, a foundation model for financial time‑series, a multi‑agent trading system with self‑reflection, and dynamic weighting machine‑learning stock selection strategies.

ESG · Machine Learning · Quantitative Finance

0 likes · 17 min read

Network Intelligence Research Center (NIRC)

Aug 27, 2025 · Artificial Intelligence

Perception‑R1: A Rule‑Based RL Method that Elevates Multimodal Model Vision

Perception‑R1, a post‑training framework that applies rule‑based reinforcement learning to existing multimodal LLMs, dramatically improves visual perception tasks such as grounding, OCR, counting and object detection, as demonstrated by extensive benchmarks and ablation studies.

GRPO · Perception Policy · Reward Modeling

0 likes · 10 min read

Wu Shixiong's Large Model Academy

Aug 26, 2025 · Artificial Intelligence

Mastering RLHF, DPO, and KTO: A Complete Guide to Human‑Feedback Alignment Techniques

This comprehensive guide explains the full RLHF training pipeline, the mathematical foundations of reward modeling and PPO, and introduces DPO and KTO algorithms—including their implementations, advantages, limitations, and practical tuning strategies—for building aligned large language models.

DPO · Human Feedback · KTO

0 likes · 32 min read
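
As a rough illustration of the DPO objective that the guide above covers, the per‑pair loss can be written in a few lines. This is a sketch of the standard DPO formulation, not code from the article; the function name, argument names, and the default `beta` are assumptions, and the inputs are taken to be summed log‑probabilities of whole responses.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen gain - rejected gain)).

    Each "gain" is how much more log-probability the policy assigns to a
    response than the frozen reference model does.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(m)) == log(1 + exp(-m)), computed in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy favors the chosen response more than the reference does.
print(dpo_loss(-0.5, -2.0, -1.0, -1.0))
```

The loss equals log 2 when the policy matches the reference and falls below it as the policy shifts probability toward the preferred response, all without training a separate reward model.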

Kuaishou Tech

Aug 23, 2025 · Artificial Intelligence

How Thyme Enables Models to Think Beyond Images with Code‑Driven Multimodal Reasoning

The Kwai Keye team presents Thyme, a novel multimodal reasoning framework that lets large language models generate and safely execute Python code for image manipulation and complex calculations, achieving significant performance gains over existing vision‑language models across perception, reasoning, and hallucination‑reduction benchmarks.

AI research · Large Language Model · Multimodal

0 likes · 12 min read

Architect's Must-Have

Aug 22, 2025 · Artificial Intelligence

Why Multi-Agent Communication Protocols Are the Future of AI Collaboration

This article examines the limitations of single-agent AI, explains how Multi-Agent Communication Protocols (MCP) address challenges such as incomplete perception, decision conflicts, and scalability, and outlines current research, industrial applications, and future directions including edge integration and blockchain synergy.

Blockchain · communication protocols · distributed AI

0 likes · 8 min read

Data Thinking Notes

Aug 21, 2025 · Artificial Intelligence

Why Intermediate Tokens Matter: Denny Zhou’s Deep Insights into LLM Reasoning

This article distills Denny Zhou’s Stanford CS25 lecture, explaining how large language models achieve reasoning through intermediate token generation, chain‑of‑thought prompting, self‑consistency, reinforcement‑learning fine‑tuning, and answer aggregation, while highlighting theoretical foundations and practical breakthroughs.

LLM · Reasoning · chain-of-thought

0 likes · 18 min read

Kuaishou Tech

Aug 21, 2025 · Artificial Intelligence

How SeamlessFlow Doubles RL Training Throughput and Cuts Time by 62%

SeamlessFlow, an industrial‑scale reinforcement‑learning training framework released by Kwaipilot, decouples trainer and agents via a novel data‑plane, introduces a tag‑based resource scheduler, and eliminates pipeline bubbles, achieving up to 100% token‑throughput boost and 62% reduction in overall training time across large‑model RL workloads.

distributed training · pipeline optimization · reinforcement learning

0 likes · 13 min read