Tagged articles
690 articles
AI Info Trend
Dec 23, 2025 · Industry Insights

How AI Will Boost Collective Productivity: Key Takeaways from Microsoft’s 2025 Future of Work Report

Microsoft’s 2025 New Future of Work report reveals that AI, driven by breakthroughs in reinforcement learning, is shifting from answering questions to executing complex tasks, while investment and corporate adoption surge unevenly and employee involvement emerges as a critical factor for sustainable productivity gains.

AI · Future of Work · Industry Insights
8 min read

Bilibili Tech
Dec 19, 2025 · Artificial Intelligence

SABER: Switchable and Balanced Training for Efficient LLM Reasoning

SABER introduces a reinforcement‑learning framework that lets large language models dynamically switch among four token‑budgeted reasoning modes, dramatically cutting inference length while preserving or improving accuracy across math, code, and logic tasks.

Budgeted Computation · Efficient Reasoning · LLM
13 min read

Instant Consumer Technology Team
Dec 16, 2025 · Artificial Intelligence

How Mind Lab Trained a Trillion‑Parameter Agentic Memory with Only 10% GPU Power

This article explains how the Mind Lab team tackled the challenges of training a 1‑trillion‑parameter mixture‑of‑experts model for agentic memory using reinforcement learning, LoRA, and a custom Megatron‑Bridge architecture, achieving a ten‑fold speedup while consuming just a fraction of the usual GPU resources.

AI · Agentic Apps · LoRA
9 min read

Network Intelligence Research Center (NIRC)
Dec 15, 2025 · Artificial Intelligence

Turning LLM-Generated Network Configurations into Verified, Safe Updates with Artanis

The paper introduces Artanis, an intent‑based network configuration update framework that combines large‑language‑model generation with a verification‑feedback loop and reinforcement‑learning optimization, addressing hallucination‑induced errors and ensuring safe, policy‑compliant deployments across diverse network scales.

Configuration Management · Intent-based Networking · LLM
9 min read

AntTech
Dec 11, 2025 · Artificial Intelligence

Unlock Scalable RL: AReaL’s Decoupled Agentic Framework & Single‑Controller Design

This article explains how the open‑source AReaL framework boosts large‑scale reinforcement learning by separating agent execution from training logic, introducing a decoupled Agentic RL service and a Single‑Controller architecture that improves data flow, fault tolerance, and GPU utilization.

Open-source · Scalable RL · agentic AI
14 min read

AI Frontier Lectures
Dec 9, 2025 · Artificial Intelligence

Can Token‑Level Surrogates Stabilize RL for Large Language Models? A Deep Dive

This article analyzes why optimizing sequence‑level rewards for LLMs with token‑level surrogate objectives can improve reinforcement‑learning stability, explains the theoretical conditions required, introduces Routing Replay for MoE models, and presents extensive experiments validating the approach.

Importance Sampling · Mixture of Experts · large language models
12 min read

Data Party THU
Dec 9, 2025 · Artificial Intelligence

Can Robots Learn Human Moves Directly from AI‑Generated Videos? The GenMimic Breakthrough

The GenMimic paper introduces a novel framework that enables humanoid robots to zero‑shot imitate human actions generated by AI video models, presenting a new dataset, a two‑stage 4D reconstruction pipeline, and a reinforcement‑learning strategy with weighted‑tracking and symmetry losses, validated in simulation and on a real 23‑DoF robot.

Humanoid Robots · Video Generation · reinforcement learning
11 min read

Baidu Tech Salon
Dec 8, 2025 · Artificial Intelligence

How Baidu’s HuiBosheng AI Live Platform Generates Super‑Human Scripts and Real‑Time Interaction

The article details Baidu HuiBosheng's end‑to‑end AI live‑streaming platform, covering merchant workflow, multimodal product understanding, style‑aware script generation, reinforcement‑learning‑driven smart control, voice and avatar cloning, and a data‑flywheel that continuously improves model performance, illustrated with real‑world GMV results.

AI · Data Flywheel · Multimodal
20 min read

Bighead's Algorithm Notes
Dec 7, 2025 · Artificial Intelligence

AlphaQuanter: An End‑to‑End Tool‑Orchestrating Agent Using Reinforcement Learning for Stock Trading

AlphaQuanter tackles the three major limitations of existing LLM trading agents by introducing a single‑agent framework that dynamically orchestrates market tools, learns transparent decision policies via reinforcement learning, and achieves state‑of‑the‑art performance on key financial metrics across extensive stock‑level experiments.

AlphaQuanter · Financial AI · LLM agent
13 min read

Baobao Algorithm Notes
Dec 7, 2025 · Artificial Intelligence

Key Lessons from Scaling Agent RL Training: Stability, Tooling, and Reward Design

Over recent months of extensive agent reinforcement‑learning experiments across search, data‑analysis, and multi‑source scenarios, the author shares twelve practical insights covering stability, environment‑reward‑algorithm priorities, tool‑call reliability, reward hacking pitfalls, evaluation alignment, and scaling tricks for larger models.

PPO EWMA · RL scaling · reinforcement learning
7 min read

Baobao Algorithm Notes
Dec 7, 2025 · Artificial Intelligence

Can RL Really Boost LLM Reasoning? A Critical Review of Recent Findings

This article critically examines recent RL‑for‑LLM studies, revealing that reinforcement learning improves search efficiency but does not extend the intrinsic reasoning capabilities of base models, and explores the underlying model‑conditioned optimization bias, comparisons with SFT distillation, and the trade‑off with catastrophic forgetting.

Catastrophic Forgetting · LLM · SFT
11 min read

AntTech
Dec 4, 2025 · Artificial Intelligence

How AState Reduces Trillion‑Parameter RL Weight Sync to 6 Seconds

AState is a general‑purpose state data management system for reinforcement‑learning tasks that tackles low IO efficiency, slow weight synchronization, and state‑recovery challenges, achieving sub‑10‑second weight sync for trillion‑parameter models through a three‑layer architecture, zero‑redundancy transfers, and hardware‑aware co‑design, with the code openly available on GitHub.

AState · High Performance Computing · Large Models
23 min read

Model Perspective
Dec 1, 2025 · Artificial Intelligence

From AI to Everyday Life: How Reinforcement Learning Shapes Our Choices

This article explains the core concepts of reinforcement learning, illustrates how its reward‑based mechanism appears in media creation, career advancement, education and social media, and warns of the pitfalls of over‑optimizing external rewards while offering practical ways to balance intrinsic motivation and reflective thinking.

Artificial Intelligence · Motivation · behavioral psychology
12 min read

PaperAgent
Dec 1, 2025 · Artificial Intelligence

How Deep Research Turns LLMs into Autonomous AI Scientists

This article surveys the emerging Deep Research (DR) paradigm that upgrades large language models into research agents capable of autonomous planning, multi‑source evidence gathering, memory management, and verifiable long‑form report generation, outlining its stages, core components, training pipeline, and evaluation benchmarks.

AI agents · AI research automation · LLM agents
6 min read

Data Party THU
Nov 29, 2025 · Artificial Intelligence

Unlocking AI Agents: From Fundamentals to Building Your First LLM‑Powered Agent

This comprehensive guide explores the concept of AI agents, detailing their definitions, classifications, and core interaction loops, then walks you through building a functional LLM‑driven travel assistant with step‑by‑step code, tool integration, and practical insights on agent versus workflow paradigms.

AI agents · Agent Architecture · LLM
39 min read

Bighead's Algorithm Notes
Nov 28, 2025 · Artificial Intelligence

Weekly Quantitative Finance Paper Digest (Nov 22‑28, 2025)

This digest summarizes five recent arXiv papers on AI-driven portfolio optimization and financial time‑series forecasting, covering G‑Learning with GIRL, transfer‑learning strategies, hybrid LSTM‑PPO frameworks, time‑series foundation models, and a KAN versus LSTM performance comparison, highlighting their methods, datasets, and reported Sharpe improvements.

Financial AI · portfolio optimization · reinforcement learning
9 min read

Tencent Advertising Technology
Nov 28, 2025 · Artificial Intelligence

How Retrv-R1 Redefines Universal Multimodal Retrieval with Reasoning‑Driven MLLM

Retrv‑R1, a reasoning‑driven multimodal large language model framework, tackles the precision‑efficiency dilemma of universal multimodal retrieval by introducing a two‑stage coarse‑to‑fine pipeline, an information‑compression module, a detail‑inspection mechanism, and a three‑stage training strategy, achieving SOTA performance across accuracy, efficiency, and generalization benchmarks.

Efficiency · Generalization · MLLM
21 min read

Alimama Tech
Nov 26, 2025 · Artificial Intelligence

How Alibaba’s ROCK & ROLL Enable Scalable Agentic AI Training

Alibaba’s open‑source ROCK environment sandbox and the ROLL reinforcement‑learning engine together provide a standardized, high‑throughput training loop that lets developers scale Agentic AI from a single machine to thousands of parallel instances while simplifying debugging and resource management.

Infrastructure · Scalable Training · agentic AI
12 min read

ITPUB
Nov 24, 2025 · Artificial Intelligence

Why Memory, Not Size, Is the Next Bottleneck for Large Language Models

In a detailed interview, the CTO of Memory Tensor (Shanghai) explains how limited memory capacity hampers large models, outlines the MemOS memory operating system, discusses information‑theoretic metrics, multimodal extensions, and reinforcement‑learning strategies for scalable, secure, and explainable AI memory management.

AI Architecture · Multimodal AI · information theory
23 min read

Data Party THU
Nov 23, 2025 · Artificial Intelligence

Can a Drone Learn to Land Itself? A Deep Reinforcement Learning Walkthrough

This article walks through the fundamentals of reinforcement learning, builds a custom drone‑landing simulation, defines state and action spaces, designs reward functions, implements a neural‑network policy with Bernoulli sampling, and trains it using REINFORCE with baseline techniques, while exposing common pitfalls such as reward‑cheating.
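As a rough illustration of the REINFORCE-with-baseline idea mentioned in this summary, the sketch below computes discounted returns, subtracts a mean-return baseline, and weights Bernoulli log-probabilities by the result. All function names are hypothetical, the mean-return baseline is an assumption (the article's baseline may differ), and the policy network is replaced by a plain list of action probabilities.

```python
import math
import random


def discounted_returns(rewards, gamma):
    """Return-to-go G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def sample_bernoulli_action(p, rng):
    """Sample a binary action (1 = fire thruster, say) with probability p."""
    return 1 if rng.random() < p else 0


def bernoulli_log_prob(p, action):
    """Log-probability of a binary action; p must lie strictly in (0, 1)."""
    return math.log(p) if action == 1 else math.log(1.0 - p)


def reinforce_loss(probs, actions, rewards, gamma):
    """Surrogate loss: -mean(log pi(a_t | s_t) * (G_t - baseline))."""
    returns = discounted_returns(rewards, gamma)
    baseline = sum(returns) / len(returns)  # assumed baseline: mean return
    terms = [
        bernoulli_log_prob(p, a) * (g - baseline)
        for p, a, g in zip(probs, actions, returns)
    ]
    return -sum(terms) / len(terms)
```

In a real training loop the probabilities would come from a differentiable policy network and this loss would be minimized with an autodiff framework; the version here only shows the arithmetic.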

OpenAI Gym · Policy Gradient · Python
22 min read

AntTech
Nov 21, 2025 · Artificial Intelligence

How Awex Enables Sub‑Second TB‑Scale Weight Sync for Trillion‑Parameter RL Models

Awex is a high‑performance Python framework that synchronizes training and inference weights for trillion‑parameter reinforcement‑learning models in seconds, using unified conversion, metadata management, and NCCL/RDMA transfer plans, dramatically reducing RL training latency and supporting diverse parallel strategies.

High Performance Computing · Large Models · Python
17 min read

Xiaohongshu Tech REDtech
Nov 20, 2025 · Artificial Intelligence

How DeepAgent Achieves End‑to‑End Reasoning with 16,000+ Scalable Tools

DeepAgent is a new end‑to‑end reasoning agent that unifies autonomous thinking, dynamic tool search, and execution, handling over 16,000 real APIs, supporting embodied environments and research assistance, and achieving state‑of‑the‑art results across multiple benchmarks through its unified reasoning core, memory‑folding mechanisms, structured memory, and the ToolPO training framework.

AI agents · General AI · deep reasoning
14 min read

360 Zhihui Cloud Developer
Nov 20, 2025 · Artificial Intelligence

How DeepAgent Redefines AI Agents with Memory Folding and ToolPO

This article breaks down the DeepAgent paper, explaining its novel "main model + auxiliary model" architecture, the memory‑folding mechanism that compresses long‑context reasoning, and the ToolPO reinforcement strategy that enables efficient tool discovery and usage.

AI agents · ToolPO · large language models
8 min read

Baobao Algorithm Notes
Nov 20, 2025 · Artificial Intelligence

Why Reinforcement Learning Preserves LLM Generality Better Than Supervised Fine‑Tuning

The article analyzes why reinforcement learning (RL) fine‑tuning retains a large language model's general abilities better than supervised fine‑tuning (SFT), explaining the off‑policy distribution shift of SFT and the on‑policy data consistency, KL penalty, and trust‑region mechanisms that give RL its anti‑forgetting properties.

Catastrophic Forgetting · LLM · On-Policy Data
8 min read

Instant Consumer Technology Team
Nov 19, 2025 · Artificial Intelligence

How We Built an AI‑Powered Automated Video Editing Pipeline for Short‑Form Marketing

This article details the end‑to‑end AIGC video automation system we created—from raw material ingestion and multimodal content understanding to script generation, AI‑driven editing, rendering, and multi‑channel distribution—highlighting architecture, key modules, technical choices, performance results, and lessons learned.

AIGC · Multimodal AI · Script Generation
16 min read

AI Tech Publishing
Nov 17, 2025 · Artificial Intelligence

Frontier AI Models in RL Environments Reveal an Agent Capability Hierarchy

The article evaluates nine cutting‑edge AI models on 150 simulated workplace tasks, showing that even the strongest models complete fewer than 40% of tasks, and uses these results to propose a hierarchical framework of agentic capabilities ranging from tool use to common‑sense reasoning.

AI model evaluation · agentic capabilities · common sense reasoning
19 min read

Data Party THU
Nov 15, 2025 · Artificial Intelligence

How Reinforcement Learning Powers Intelligent AI Agents and LangGraph Workflows

This article explains how reinforcement learning (RL) underpins intelligent AI agents, covering the Markov Decision Process fundamentals, key RL components, multi‑hop reasoning on knowledge graphs, and a step‑by‑step LangGraph example that integrates an RL‑driven tutoring policy with Python code.

AI agents · LangGraph · Python
17 min read

Kuaishou Tech
Nov 14, 2025 · Artificial Intelligence

How GRPO‑Guard Stops Over‑Optimization in Flow‑Based Visual Generators

This article explains the over‑optimization problem in GRPO‑based flow models, analyzes why importance‑ratio clipping fails, and introduces GRPO‑Guard with RatioNorm and cross‑step gradient balancing, showing through extensive experiments that it stabilizes training and improves image quality across multiple diffusion backbones and tasks.

GRPO-Guard · flow matching · generative AI
9 min read

Bighead's Algorithm Notes
Nov 13, 2025 · Artificial Intelligence

Paper Review: AlphaGAT’s Two‑Stage Learning for Adaptive Portfolio Selection

AlphaGAT introduces a two‑stage learning framework that first extracts robust alpha factors with a CATimeMixer model and a novel loss, then dynamically weights these factors via reinforcement learning (PPO) and a graph attention network, achieving superior portfolio performance across DJIA, HSI, CSI‑100 and crypto markets despite noisy data and distribution shifts.

AlphaGAT · Financial AI · Time-series
14 min read

Alimama Tech
Nov 11, 2025 · Artificial Intelligence

Accelerating LLM RL with Async Training, Mini‑Critics, and Attention Rewards

This article introduces the 3A collaborative framework—Async architecture, Asymmetric PPO mini‑critics, and an attention‑based reasoning rhythm—demonstrating how decoupled, fine‑grained parallel training and structure‑aware reward allocation dramatically improve efficiency, scalability, and interpretability of reinforcement learning for large language models.

Asynchronous Training · attention mechanisms · large language models
23 min read

DataFunTalk
Nov 7, 2025 · Artificial Intelligence

Training-Free GRPO: Low‑Cost Reinforcement Learning for Large Language Models

Training-Free GRPO, proposed by Tencent Youtu Lab, eliminates parameter updates by iteratively building an experience knowledge base, enabling cost‑effective reinforcement learning for large language models, dramatically reducing training expenses from thousands of dollars to under $20 while maintaining strong performance across math reasoning and web search tasks.

AI · Cost Reduction · reinforcement learning
6 min read

Architect's Guide
Nov 7, 2025 · Artificial Intelligence

Why Multi-Agent Communication Protocols Are Crucial for Next-Gen AI

The article examines the need for Multi‑Agent Communication Protocols (MCP), outlines the limitations of single‑agent and centralized systems, compares MCP with other interaction methods, reviews current research and industrial applications, and highlights future directions such as hardware integration, bio‑inspired mechanisms, and blockchain convergence.

Blockchain · communication protocols · decentralized AI
9 min read

Kuaishou Tech
Nov 5, 2025 · Artificial Intelligence

How HiPO Gives LLMs a Smart Thinking Switch to Cut Costs and Boost Accuracy

This article explains the overthinking problem of large language models, introduces the HiPO framework with hybrid data cold‑start and reinforcement‑learning reward mechanisms that let models decide when to think deeply or answer directly, and shows experimental results demonstrating significant efficiency gains and accuracy improvements across multiple benchmarks.

Efficiency · Hybrid Policy Optimization · LLM
13 min read

Network Intelligence Research Center (NIRC)
Nov 4, 2025 · Artificial Intelligence

SEAgent: A Self‑Evolving Computer Agent that Learns Software Use Autonomously

SEAgent introduces a self‑evolving framework that enables a GUI agent to master unfamiliar software through autonomous exploration and experience learning, leveraging a curriculum generator, a world‑state model, and GRPO‑based reinforcement with adversarial imitation, achieving state‑of‑the‑art performance on OSWorld.

Curriculum Learning · GUI automation · SEAgent
6 min read

Bilibili Tech
Oct 31, 2025 · Artificial Intelligence

RIVAL: Adversarial RL Framework Elevates Conversational Subtitle Translation

RIVAL (Reinforcement Learning with Iterative and Adversarial Optimization) introduces an adversarial game between a reward model and a translation LLM, combining qualitative preference rewards with quantitative metrics like BLEU, to overcome distribution shift in RLHF and achieve superior performance on conversational subtitle and WMT translation tasks.

BLEU · LLM · Reward Modeling
13 min read

Baobao Algorithm Notes
Oct 31, 2025 · Artificial Intelligence

How Risk‑Sensitive Reinforcement Learning Improves LLM Pass@K Performance

This article analyzes why standard reinforcement learning can degrade Pass@K metrics after fine‑tuning large language models, introduces a risk‑sensitive RL objective that reshapes the advantage estimator, and demonstrates through bandit and mathematical‑reasoning experiments that the RS‑GRPO method consistently boosts diversity and overall Pass@K scores across multiple LLMs.
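For readers unfamiliar with the metric, the sketch below shows the unbiased Pass@K estimator that is widely used in code-generation evaluation: given n sampled answers of which c are correct, it estimates the chance that at least one of k randomly chosen samples is correct. Whether the article uses this exact estimator is an assumption; the function name is hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate Pass@K from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Because Pass@K rewards having at least one correct answer among k attempts, a policy that collapses onto a single answer can raise Pass@1 while leaving Pass@K flat or lower, which is the tension the summary describes.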

Exploration-Exploitation · LLM fine-tuning · Policy Gradient
12 min read

Baobao Algorithm Notes
Oct 31, 2025 · Artificial Intelligence

Unlocking LLM RL Scaling: The Best Practices from Meta’s New Study

Meta’s recent paper reveals a sigmoid‑shaped scaling law for LLM reinforcement learning, presents extensive 40‑k GPU‑hour experiments, compares various RL designs such as PPO‑off‑policy‑k and Pipeline‑RL‑k, and distills the findings into a practical “ScaleRL” recipe that improves performance and efficiency.

LLM · RL Optimization · Scaling Law
10 min read

DataFunTalk
Oct 30, 2025 · Artificial Intelligence

How On-Policy Distillation Cuts LLM Training Cost by 90%

Thinking Machines Lab introduces On-Policy Distillation, a post‑training technique that matches reinforcement‑learning performance while cutting compute cost by up to a factor of ten, and demonstrates its effectiveness through extensive experiments on reasoning, personalization, and catastrophic‑forgetting mitigation.

Knowledge Distillation · model efficiency · on-policy distillation
15 min read

Baobao Algorithm Notes
Oct 30, 2025 · Artificial Intelligence

Why LLM RL Training Crashes While SFT Stays Stable: Insights & Tricks

The article examines the fundamental similarity between SFT and RL loss functions for large language models, explains why RL training is prone to instability, discusses infrastructure and data quality challenges, and reviews practical tricks and reward‑model considerations for more reliable RL fine‑tuning.

AI · LLM · Reward Modeling
11 min read

Instant Consumer Technology Team
Oct 28, 2025 · Artificial Intelligence

How 7B AgentFlow Beats 200B GPT-4o: Small Models, Big Wins

AgentFlow, a Stanford-led multi‑agent system built on a 7B model, outperforms massive models like GPT‑4o across ten benchmarks by leveraging modular agents, on‑policy learning, and a novel Flow‑GRPO training engine that solves sparse‑reward, long‑horizon challenges.

AgentFlow · Small Model Performance · multi-agent systems
12 min read

Data Party THU
Oct 24, 2025 · Artificial Intelligence

BREEZE: Enhancing Zero‑Shot Reinforcement Learning with Behavioral Regularization

The paper introduces BREEZE, a behavior‑regularized zero‑shot RL framework that improves stability, policy extraction, and representation quality by combining in‑sample learning, task‑conditioned diffusion models, and expressive attention‑based architectures, achieving near‑state‑of‑the‑art performance on benchmarks like ExORL and D4RL Kitchen.

behavioral regularization · diffusion model · offline RL
3 min read

Data Party THU
Oct 22, 2025 · Artificial Intelligence

Demystifying Large‑Model Reinforcement Learning: From MDP Basics to Bellman and Advantage Functions

This article provides a comprehensive introduction to reinforcement learning for large language models, covering the Markov Decision Process formulation, the four core elements of RL, state‑value and action‑value functions, Bellman equations, and the advantage function that underpins modern policy‑gradient algorithms.
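The quantities named in this summary are related by the standard textbook identities below, written for a policy $\pi$, transition kernel $P$, reward $r$, and discount factor $\gamma$. The notation is a common convention and an assumption here; the article's own symbols may differ.

```latex
\begin{align}
V^{\pi}(s)   &= \mathbb{E}_{a \sim \pi(\cdot \mid s)}\bigl[\, Q^{\pi}(s, a) \,\bigr] \\
Q^{\pi}(s,a) &= \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\bigl[\, r(s, a) + \gamma\, V^{\pi}(s') \,\bigr] \\
A^{\pi}(s,a) &= Q^{\pi}(s, a) - V^{\pi}(s)
\end{align}
```

Substituting the second line into the first gives the Bellman expectation equation for $V^{\pi}$, and the third line implies that the advantage averages to zero under the policy: $\mathbb{E}_{a \sim \pi(\cdot \mid s)}[A^{\pi}(s,a)] = 0$.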

AI fundamentals · Bellman equation · Large Language Model
13 min read

Data Party THU
Oct 21, 2025 · Artificial Intelligence

Why DQN Overestimates Q‑Values and How Double DQN Fixes It

The article explains how DQN’s use of the max operator introduces a maximization bias that leads to overestimated Q‑values, and shows how Double DQN separates action selection from value evaluation to eliminate this bias, improving stability and performance in Atari benchmarks.
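The difference between the two bootstrap targets can be sketched in a few lines. The helper names are hypothetical and Q-values are plain lists indexed by action rather than network outputs; this is an illustration of the selection/evaluation split, not the article's code.

```python
def dqn_target(reward, gamma, q_target_next, done):
    """DQN target: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, gamma, q_online_next, q_target_next, done):
    """Double DQN target: online network selects the action, target network scores it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]


# Toy next-state Q-values for three actions (made-up numbers).
q_online_next = [1.0, 2.5, 2.0]
q_target_next = [1.2, 1.8, 3.0]

y_dqn = dqn_target(1.0, 0.9, q_target_next, done=False)
y_ddqn = double_dqn_target(1.0, 0.9, q_online_next, q_target_next, done=False)
```

Since the Double DQN target evaluates one particular action under the target network, it can never exceed the max-based DQN target computed from the same target-network values, which is how it damps the upward bias.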

DQN · Double DQN · algorithm analysis
7 min read

Data Thinking Notes
Oct 19, 2025 · Artificial Intelligence

How GSPO Improves Stability in Large Language Model Training

GSPO (Group Sequence Policy Optimization) is a reinforcement‑learning algorithm for LLMs that replaces token‑level GRPO with sequence‑level optimization, addressing instability in ultra‑large model training, especially for long‑sequence and MoE architectures, by aligning reward granularity and reducing variance.
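To make the token-level versus sequence-level distinction concrete, the sketch below contrasts per-token importance ratios with a single length-normalized sequence ratio, computed as the exponential of the mean token log-ratio. The summary itself does not state the formula; the length-normalized form and the function names are assumptions for illustration.

```python
import math


def token_ratios(logp_new, logp_old):
    """Per-token importance ratios pi_new(y_t) / pi_old(y_t), one per token."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence ratio: exp(mean of token log-ratios).

    Equivalent to (pi_new(y) / pi_old(y)) ** (1 / len(y)).
    """
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))


# Toy per-token log-probabilities for one three-token response (made-up numbers).
logp_new = [-1.0, -0.5, -2.0]
logp_old = [-1.2, -0.4, -1.0]

per_token = token_ratios(logp_new, logp_old)
per_sequence = sequence_ratio(logp_new, logp_old)
```

The sequence ratio is the geometric mean of the token ratios, so one outlier token moves it far less than it moves that token's individual ratio, which is one intuition for the variance reduction the summary mentions.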

GRPO · GSPO · large language models
11 min read

Xiaohe Frontend Team
Oct 15, 2025 · Artificial Intelligence

REFRAG: Using Tiny Models to Compress RAG for Faster, Smarter AI

Meta’s new REFRAG framework lets a lightweight encoder compress retrieved text into semantic tags, enabling large language models to answer queries with far fewer tokens, lower latency, and higher throughput, while preserving core meaning and allowing flexible placement of compressed information within prompts.

LLM efficiency · Model Compression · RAG
8 min read

Meituan Technology Team
Oct 15, 2025 · Artificial Intelligence

What’s New in Large Model Research? Top Meituan AI Papers Up to Oct 2025

This curated list showcases Meituan’s latest large‑model breakthroughs and academic papers up to October 2025, spanning LLM system optimizations, multimodal generation, evaluation benchmarks, quantization techniques, and reinforcement‑learning‑driven improvements, offering researchers valuable insights and resources across the AI landscape.

AI research · Multimodal AI · benchmarking
10 min read

Data Party THU
Oct 15, 2025 · Artificial Intelligence

Designing Safe, Sample-Efficient, and Robust Reinforcement Learning for Ranking and Diffusion Models

This paper proposes a reinforcement‑learning framework that simultaneously ensures safety, sample efficiency, and robustness, applying a contextual‑bandit perspective to ranking/recommendation systems and text‑to‑image diffusion models, and introduces novel algorithms for safe deployment, variance‑reduced off‑policy estimation, and a LOOP method for generative RL.

Robustness · Safety · contextual bandits
5 min read

Alibaba Cloud Developer
Oct 15, 2025 · Artificial Intelligence

Mastering Structured Output in Large Language Models: Techniques, Challenges, and Future Trends

Large language models are evolving from free‑form text generators to reliable data providers by mastering structured output through prompt engineering, validation frameworks, constrained decoding, supervised fine‑tuning, reinforcement learning, and API‑level capabilities, enabling seamless integration with software systems while addressing hallucinations and format reliability.

API · LLM · Structured Output
28 min read

Volcano Engine Developer Services
Oct 14, 2025 · Artificial Intelligence

How CollabLLM Redefines LLM Collaboration with Multi‑Turn Training

CollabLLM tackles the limitations of large language models in everyday multi‑turn dialogues by introducing a user‑centric, multi‑turn training framework that leverages simulated interactions, multi‑round reward modeling, and veRL toolchain support, achieving superior performance over single‑turn baselines.

LLM · collaborative training · multi-turn dialogue
13 min read

Shopee Tech Team
Oct 14, 2025 · Artificial Intelligence

How SPEC‑RL Boosts On‑Policy Reinforcement Learning Speed by Up to 3×

SPEC‑RL introduces speculative rollouts that reuse verified historical rollouts as prefixes, cutting rollout time by 2–3× while maintaining or improving performance across various math and reasoning benchmarks, and works seamlessly with PPO, GRPO, DAPO and other on‑policy algorithms.

AI efficiency · Training Acceleration · large language models
8 min read

AntTech
Oct 14, 2025 · Artificial Intelligence

How Ring-1T Achieves Trillion-Scale Deep Thinking and Competitive Benchmarks

The Ring-1T model, a trillion-parameter AI system released as open source, leverages advanced reinforcement learning techniques, extensive benchmark evaluations, and custom training frameworks to deliver balanced performance across math, code, reasoning, and creative tasks while highlighting current limitations and future development plans.

AI model · Large Language Model · benchmark evaluation
8 min read

Data Party THU
Oct 13, 2025 · Artificial Intelligence

How BranchGRPO Accelerates and Stabilizes Diffusion Model Alignment

BranchGRPO introduces a tree‑structured branching, reward‑fusion, and lightweight pruning framework that dramatically speeds up diffusion and flow model training while delivering denser, more stable reward signals, achieving up to five‑fold faster convergence and higher alignment scores on image and video generation benchmarks.

BranchGRPO · Efficiency · RLHF
0 likes · 10 min read
How BranchGRPO Accelerates and Stabilizes Diffusion Model Alignment
Bighead's Algorithm Notes
Bighead's Algorithm Notes
Oct 12, 2025 · Artificial Intelligence

Trading-R1: Open-Source LLM Framework for Explainable Financial Trading

This article reviews Trading‑R1, an open‑source LLM reasoning framework that integrates multimodal financial data with a three‑stage curriculum of supervised fine‑tuning and reinforcement learning to generate structured investment arguments and risk‑adjusted trade decisions, achieving superior Sharpe ratio and drawdown performance on real‑world stock and ETF tests.

Financial Trading · LLM · Multimodal
0 likes · 11 min read
Trading-R1: Open-Source LLM Framework for Explainable Financial Trading
Kuaishou Large Model
Kuaishou Large Model
Oct 11, 2025 · Artificial Intelligence

How Large-Scale Reinforcement Learning Boosted KAT-Dev-72B-Exp to 74.6% on SWE‑Bench

The KwaiPilot team introduced KAT-Dev-72B-Exp, an open‑source LLM trained with large‑scale reinforcement learning that achieved a record‑breaking 74.6% score on SWE‑Bench Verified, thanks to innovations like Trie Packing, entropy‑aware advantage scaling, and a decoupled data‑environment architecture.

KAT-Dev-72B-Exp · Trie Packing · entropy scaling
0 likes · 6 min read
How Large-Scale Reinforcement Learning Boosted KAT-Dev-72B-Exp to 74.6% on SWE‑Bench
Kuaishou Tech
Kuaishou Tech
Oct 11, 2025 · Artificial Intelligence

How KAT-Dev-72B-Exp Sets a New Record in Large‑Scale RL for Code Generation

The KAT‑Dev‑72B‑Exp model, an experimental reinforcement‑learning version of KAT‑Coder, achieves 74.6% accuracy on the SWE‑Bench Verified benchmark, introduces Trie Packing and entropy‑aware advantage scaling, and showcases a decoupled training architecture that dramatically speeds up large‑scale agentic RL training.

AI · agentic training · code generation
0 likes · 9 min read
How KAT-Dev-72B-Exp Sets a New Record in Large‑Scale RL for Code Generation
Data Party THU
Data Party THU
Oct 10, 2025 · Artificial Intelligence

Can Language Models Self‑Train Without Data? Inside the Language Self‑Play Framework

This article examines the Language Self‑Play (LSP) approach for data‑free training of large language models, detailing its challenger‑solver game formulation, advantage calculations, loss functions, self‑reward extension, experimental setup on AlpacaEval, and results that show LSP can match or surpass data‑driven baselines.

LLM · data-free training · large language models
0 likes · 14 min read
Can Language Models Self‑Train Without Data? Inside the Language Self‑Play Framework
DataFunTalk
DataFunTalk
Oct 9, 2025 · Artificial Intelligence

From Physics to DeepMind: How a Tsinghua Star Is Shaping AI Research

Google DeepMind hired Shunyu Yao, a Tsinghua physics prodigy and former Anthropic researcher, whose rapid transition from theoretical physics to AI highlights the intense workload, values clash, and the accelerating pace of large‑model research.

AI research · DeepMind · Physics
0 likes · 9 min read
From Physics to DeepMind: How a Tsinghua Star Is Shaping AI Research
Model Perspective
Model Perspective
Oct 8, 2025 · Artificial Intelligence

How Mathematical Models Reveal the Hidden Dynamics of Addiction

This article explores how differential equations, SIR-like population models, and reinforcement‑learning frameworks can quantitatively describe the onset, persistence, and spread of addictive behaviors, offering insights into feedback loops, neural adaptation, and optimal intervention strategies.

addiction modeling · dynamical systems · intervention optimization
0 likes · 10 min read
How Mathematical Models Reveal the Hidden Dynamics of Addiction
DataFunSummit
DataFunSummit
Oct 7, 2025 · Artificial Intelligence

Deep Thinking in Large Language Models: Overcoming Domain Challenges

This presentation explores how large language models can transcend their general knowledge limits by developing domain‑specific deep thinking abilities, addressing challenges such as complex instruction execution, expert reasoning gaps, and tool integration, and proposes reinforcement‑learning‑driven frameworks, structured thinking pipelines, and tool‑calling mechanisms to achieve rational intelligence.

deep reasoning · domain adaptation · reinforcement learning
0 likes · 27 min read
Deep Thinking in Large Language Models: Overcoming Domain Challenges
DataFunTalk
DataFunTalk
Oct 7, 2025 · Artificial Intelligence

Can Reinforcement Learning Spot Hallucinations in LLMs? Introducing RL4HS

Apple’s new paper presents RL4HS, a reinforcement‑learning framework that uses span‑level rewards and class‑aware policy optimization to detect hallucinated text spans in large language models, outperforming GPT‑5 and other baselines and offering more precise, auditable error identification.

RL4HS · hallucination detection · reinforcement learning
0 likes · 9 min read
Can Reinforcement Learning Spot Hallucinations in LLMs? Introducing RL4HS
Amap Tech
Amap Tech
Oct 3, 2025 · Artificial Intelligence

How FantasyHSI Enables Autonomous 3D Human Interaction in Any Scene

FantasyHSI introduces a graph‑based multi‑agent framework that combines visual‑language models and video‑generation diffusion to let digital humans perceive, plan, and interact autonomously in any 3D scene, producing physically plausible, long‑duration actions for animation creation and embodied‑AI simulation.

3D synthesis · Graph Modeling · Video Generation
0 likes · 12 min read
How FantasyHSI Enables Autonomous 3D Human Interaction in Any Scene
AI2ML AI to Machine Learning
AI2ML AI to Machine Learning
Oct 1, 2025 · Artificial Intelligence

2025 Large Model Engineering Breakthroughs: Cutting Costs, Boosting Performance, and Extending Context

The 2025 open‑source reports reveal major advances in large‑model engineering, including drastic cost cuts such as DeepSeek‑V3 training for $5.57 M, performance gains where Gemma 3 4B matches Gemma 2 27B, memory efficiencies like 85 % KV‑cache reduction, and a suite of new techniques—from loss‑free MoE balancing to multi‑token prediction—that together push context lengths to one million tokens and enable multimodal, aligned, and industry‑specific models.

Cost Reduction · Model Compression · Multimodal AI
0 likes · 13 min read
2025 Large Model Engineering Breakthroughs: Cutting Costs, Boosting Performance, and Extending Context
Data Party THU
Data Party THU
Sep 28, 2025 · Artificial Intelligence

Can the OaK Architecture Unlock General AI? A Deep Dive into Continuous Learning and Planning

The article presents Richard Sutton’s OaK architecture—a domain‑general, empirical, open‑ended framework that equips agents with continuously learnable components, meta‑learned step‑sizes, and a five‑stage FC‑STOMP pipeline to build world models, generate sub‑problems, learn options, and plan at run‑time.

AI Architecture · continual learning · meta‑learning
0 likes · 22 min read
Can the OaK Architecture Unlock General AI? A Deep Dive into Continuous Learning and Planning
HyperAI Super Neural
HyperAI Super Neural
Sep 28, 2025 · Artificial Intelligence

Weekly AI Paper Digest: Vision‑Language Models for Safety, Unstable Singularities, and RL‑Driven Reasoning

This week’s AI paper roundup highlights five recent studies: a construction‑site vision‑language dataset and safety inspection tasks, a deep CORAL method for unsupervised domain adaptation, the discovery of a new family of unstable singularities in nonlinear PDEs, a reinforcement‑learning approach that boosts reasoning in large language models, and the PANORAMA architecture for omnidirectional vision in embodied AI.

Construction Safety · Omnidirectional Vision · PDE Research
0 likes · 6 min read
Weekly AI Paper Digest: Vision‑Language Models for Safety, Unstable Singularities, and RL‑Driven Reasoning
Huawei Cloud Developer Alliance
Huawei Cloud Developer Alliance
Sep 28, 2025 · Artificial Intelligence

Essential AI Reading List: Must‑Read Books Across AI, ML, DL, and Ethics

This curated list presents the most influential AI books, covering foundational theory, machine learning, deep learning, reinforcement learning, computer vision, and AI ethics, with editorial insights and author biographies to guide readers through the evolving landscape of artificial intelligence.

AI ethics · Artificial Intelligence · reinforcement learning
0 likes · 25 min read
Essential AI Reading List: Must‑Read Books Across AI, ML, DL, and Ethics
HyperAI Super Neural
HyperAI Super Neural
Sep 26, 2025 · Artificial Intelligence

Nvidia’s ReaSyn Uses Chain‑of‑Reaction Reasoning to Boost Molecule Reconstruction and Path Diversity

ReaSyn, a new framework from Nvidia’s research team, treats synthesis pathways as chain‑of‑thought reasoning using a novel Chain‑of‑Reaction representation, achieving the highest reconstruction rates and path diversity in molecule synthesis tasks, and outperforming prior methods across multiple benchmark optimizations.

AI drug discovery · ReaSyn · chain-of-reaction
0 likes · 14 min read
Nvidia’s ReaSyn Uses Chain‑of‑Reaction Reasoning to Boost Molecule Reconstruction and Path Diversity
Bighead's Algorithm Notes
Bighead's Algorithm Notes
Sep 25, 2025 · Artificial Intelligence

How MARS Uses Risk‑Aware Multi‑Agent RL to Master Portfolio Management

This article reviews the MARS framework, a risk‑aware multi‑agent reinforcement‑learning system for automated portfolio management that tackles market non‑stationarity and proactive risk control, detailing its hierarchical architecture, formal MDP formulation, training process, and superior experimental results on DJIA and HSI benchmarks.

Portfolio Management · deep learning · financial markets
0 likes · 13 min read
How MARS Uses Risk‑Aware Multi‑Agent RL to Master Portfolio Management
Fun with Large Models
Fun with Large Models
Sep 24, 2025 · Artificial Intelligence

Interview Guide: Core Differences Between PPO and GRPO Algorithms for Large Model Fine‑Tuning

The article explains the fundamental principles of PPO and GRPO reinforcement‑learning algorithms, compares their architectures and training workflows, highlights why GRPO is gaining traction in large‑model fine‑tuning, discusses associated risks, and offers practical guidance on group size selection for engineers preparing for interviews.

GRPO · PPO · RLHF
0 likes · 9 min read
Interview Guide: Core Differences Between PPO and GRPO Algorithms for Large Model Fine‑Tuning
Data Party THU
Data Party THU
Sep 20, 2025 · Artificial Intelligence

How DeepSeek Trained a $30M LLM for Just $294K – Inside the R1 Model

The article reports that DeepSeek’s R1 large language model, detailed in a peer‑reviewed Nature paper, was trained for roughly $294 k on Nvidia H800 chips using novel pure reinforcement‑learning techniques, achieving competitive performance while remaining open‑source.

DeepSeek · Large Language Model · Nvidia H800
0 likes · 9 min read
How DeepSeek Trained a $30M LLM for Just $294K – Inside the R1 Model
Bighead's Algorithm Notes
Bighead's Algorithm Notes
Sep 20, 2025 · Artificial Intelligence

Weekly Quantitative Finance Paper Digest (Sep 13‑19, 2025)

This digest summarizes seven recent arXiv papers that apply reinforcement learning, multi‑agent frameworks, dynamic factor models, high‑frequency trading LLMs, quantum GANs, multi‑LLM sentiment analysis, and context‑aware language models to advance quantitative finance and AI‑driven market prediction.

Quantitative Finance · Quantum Machine Learning · large language models
0 likes · 12 min read
Weekly Quantitative Finance Paper Digest (Sep 13‑19, 2025)
Data Party THU
Data Party THU
Sep 19, 2025 · Artificial Intelligence

How DeepSeek R1 Redefines AI Reasoning with Pure Reinforcement Learning

DeepSeek R1 replaces traditional supervised fine‑tuning with a pure reinforcement‑learning pipeline, introducing the GRPO algorithm and a four‑stage training regime that dramatically lowers cost, boosts reasoning and code‑generation performance, and raises important ethical, privacy, and societal considerations for large language models.

AI reasoning · DeepSeek · GRPO
0 likes · 14 min read
How DeepSeek R1 Redefines AI Reasoning with Pure Reinforcement Learning
HyperAI Super Neural
HyperAI Super Neural
Sep 19, 2025 · Artificial Intelligence

Weekly AI Paper Roundup: RL Advances, Tree‑Structured QA, and GraphRAG Breakthroughs

This article surveys five recent AI papers, covering reinforcement learning for large reasoning models, a tree‑structured table QA framework (ST‑Raptor), visual representation alignment for multimodal LLMs, GraphRAG‑based generation, and an LLM‑driven cryptographic vulnerability detector, each with key insights and links.

cryptographic vulnerability detection · graph retrieval · large language models
0 likes · 5 min read
Weekly AI Paper Roundup: RL Advances, Tree‑Structured QA, and GraphRAG Breakthroughs
DataFunSummit
DataFunSummit
Sep 18, 2025 · Artificial Intelligence

Boosting LLM Function Call: Data, Training, and Agent Optimization Strategies

This presentation by Yao Yitong of China Telecom AI Research Institute explains why Function Call is essential for LLM deployment, outlines data‑centric and training‑centric optimization methods, discusses common pitfalls and reward‑function design for reinforcement learning, and showcases practical Agent application patterns for real‑world tasks.

Agent · LLM · Training Optimization
0 likes · 36 min read
Boosting LLM Function Call: Data, Training, and Agent Optimization Strategies
HyperAI Super Neural
HyperAI Super Neural
Sep 18, 2025 · Artificial Intelligence

DeepSeek‑R1 Costs $294K to Train, Hits Nature Cover as First Peer‑Reviewed Large Model

DeepSeek‑R1, the first mainstream large language model to pass peer review in Nature, was trained for $294,000 on 512 (64×8) H800 GPUs, and its pure‑RL precursor, DeepSeek‑R1‑Zero, reached up to 86.7% on AIME 2024 with majority voting, outperforming human averages across math, coding, and science tasks.

AI research · DeepSeek-R1 · Large Language Model
0 likes · 10 min read
DeepSeek‑R1 Costs $294K to Train, Hits Nature Cover as First Peer‑Reviewed Large Model
Bighead's Algorithm Notes
Bighead's Algorithm Notes
Sep 14, 2025 · Artificial Intelligence

How MM‑DREX Uses Multimodal LLMs for Dynamic Expert Routing in Financial Trading

The article reviews the MM‑DREX framework, which tackles the non‑stationarity of financial markets by modeling trading as a POMDP, employing a vision‑language model‑driven dynamic router to allocate four heterogeneous experts, and demonstrates superior returns, Sharpe ratios, and drawdown control across stocks, futures, and crypto compared with 15 strong baselines.

Dynamic Routing · LLM · POMDP
0 likes · 13 min read
How MM‑DREX Uses Multimodal LLMs for Dynamic Expert Routing in Financial Trading
Fighter's World
Fighter's World
Sep 12, 2025 · Artificial Intelligence

Why Are Production‑Grade AI Agents So Hard to Build?

The article analyses why production‑grade AI agents remain unreliable, pinpointing the scarcity of high‑quality task‑action data, the limits of static benchmarks, and the need for massive data‑generation engines, simulation sandboxes, sophisticated RL reward design, and efficient context engineering.

AI agent · Context Engineering · Data Generation
0 likes · 21 min read
Why Are Production‑Grade AI Agents So Hard to Build?
DataFunTalk
DataFunTalk
Sep 12, 2025 · Artificial Intelligence

Key Takeaways from AI Leaders at the 2025 Inclusion·Bund Conference

The 2025 Inclusion·Bund conference gathered top AI pioneers, including Turing laureate Richard Sutton, Alibaba Cloud founder Wang Jian, HKU professor Ma Yi, Yushu Tech CEO Wang Xingxing, and historian Yuval Harari, to discuss the limits of intelligence, the shift toward open‑source resources, embodied AI, and the societal implications of rapid AI advancement.

AI · Artificial Intelligence · reinforcement learning
0 likes · 15 min read
Key Takeaways from AI Leaders at the 2025 Inclusion·Bund Conference
Bighead's Algorithm Notes
Bighead's Algorithm Notes
Sep 11, 2025 · Artificial Intelligence

Fin-PRM: Alibaba’s Dianjin Team Introduces a Domain-Specific Process Reward Model for Financial Reasoning

Fin‑PRM, a domain‑specific process reward model for financial reasoning introduced by Alibaba’s Dianjin team, employs dual‑level step and trajectory rewards to provide fine‑grained supervision, achieving up to 12.9% accuracy gains in supervised fine‑tuning and 5.1% improvements in Best‑of‑N inference on benchmarks such as CFLUE and FinQA.

CFLUE · Fin-PRM · FinQA
0 likes · 11 min read
Fin-PRM: Alibaba’s Dianjin Team Introduces a Domain-Specific Process Reward Model for Financial Reasoning
Instant Consumer Technology Team
Instant Consumer Technology Team
Sep 11, 2025 · Artificial Intelligence

How REFRAG Cuts LLM Decoding Time by 30×: A New Efficient RAG Framework

REFRAG (REpresentation For RAG) introduces a novel decoding framework that compresses, senses, and expands context using precomputed chunk embeddings, achieving up to 30.85× faster first-token generation and 16× larger context windows without sacrificing perplexity, as validated across diverse long‑context tasks.

LLM · RAG · chunk embeddings
0 likes · 18 min read
How REFRAG Cuts LLM Decoding Time by 30×: A New Efficient RAG Framework
Bighead's Algorithm Notes
Bighead's Algorithm Notes
Sep 5, 2025 · Artificial Intelligence

Weekly Quantitative Finance Paper Digest (Aug 30 – Sep 5, 2025)

This digest reviews four recent AI‑driven finance papers: a robust MCVaR portfolio optimizer with ellipsoidal support and RKHS uncertainty, a PPO‑based adaptive weighting system for LLM‑generated alphas, an empirical comparison of price‑based, GICS‑based, and LLM‑embedding stock clustering, and a diffusion‑model approach that generates future financial chart images from current charts and text prompts.

Quantitative Finance · diffusion models · large language models
0 likes · 9 min read
Weekly Quantitative Finance Paper Digest (Aug 30 – Sep 5, 2025)
Data Party THU
Data Party THU
Sep 4, 2025 · Artificial Intelligence

Unraveling PPO Variants: From GRPO to DAPO and GSPO – A Deep Dive

This article provides a comprehensive technical analysis of PPO‑based reinforcement learning methods for large language models, detailing the evolution from the original PPO algorithm through GRPO, DAPO, and GSPO, and explaining their motivations, mathematical formulations, advantages, and practical challenges such as entropy collapse and importance‑sampling variance.

DAPO · GRPO · GSPO
0 likes · 30 min read
Unraveling PPO Variants: From GRPO to DAPO and GSPO – A Deep Dive
Sohu Tech Products
Sohu Tech Products
Sep 3, 2025 · Artificial Intelligence

How GRPO Revolutionizes RLHF for Large Language Models

This article explains the motivation, mathematical foundations, implementation details, advantages, experimental results, and future directions of Group Relative Policy Optimization (GRPO), a novel reinforcement‑learning algorithm that replaces PPO’s value network with efficient group‑wise relative evaluation for large language models.

Artificial Intelligence · GRPO · LLM
0 likes · 17 min read
How GRPO Revolutionizes RLHF for Large Language Models
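
As a rough illustration of the group‑wise relative evaluation mentioned in the GRPO summary above, the sketch below standardizes each sampled response's reward against the other responses drawn for the same prompt, so no learned value network is needed to form a baseline. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for this example; they are not taken from the linked article.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Hypothetical helper for illustration: the group mean acts as the baseline
    and the group standard deviation rescales the signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    # eps avoids division by zero when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored 1.0 (correct) or 0.0 (wrong)
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group average receive a positive advantage and those below receive a negative one; when all rewards in a group are identical, every advantage is zero and the group contributes no learning signal.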
Bighead's Algorithm Notes
Bighead's Algorithm Notes
Sep 3, 2025 · Artificial Intelligence

Decoding TINs: Reconstructing Classic Technical Analysis with Neural Networks

The paper introduces Technical Indicator Networks (TINs), a framework that maps traditional technical analysis formulas to neural‑network topologies, initializes weights to preserve indicator behavior, and uses reinforcement learning for dynamic optimization, achieving significantly higher Sharpe, Sortino, and cumulative returns on US30 component stocks than conventional MACD approaches.

Algorithmic Trading · Financial AI · Technical Indicator Networks
0 likes · 9 min read
Decoding TINs: Reconstructing Classic Technical Analysis with Neural Networks
Baobao Algorithm Notes
Baobao Algorithm Notes
Sep 3, 2025 · Artificial Intelligence

How Atom-Searcher Boosts LLM Reasoning with Atomic Thought Rewards

Atom-Searcher introduces an atomic‑thought reinforcement‑learning framework that decomposes complex reasoning into fine‑grained units, uses a Reasoning Reward Model to assign step‑wise rewards, dynamically balances process and result incentives, and achieves state‑of‑the‑art performance on multiple LLM benchmarks.

Agentic Research · Atomic Thought · LLM
0 likes · 12 min read
How Atom-Searcher Boosts LLM Reasoning with Atomic Thought Rewards
Data STUDIO
Data STUDIO
Sep 2, 2025 · Artificial Intelligence

Understanding NAS: Core Algorithms and Python Implementations

This article reviews Neural Architecture Search (NAS), explains its bi‑level optimization formulation, compares three major search strategies—reinforcement learning, evolutionary algorithms, and differentiable gradient‑based methods—provides complete Python code for each, and analyzes experimental results highlighting performance trade‑offs and remaining challenges.

Differentiable Architecture Search · Evolutionary Algorithms · NAS
0 likes · 25 min read
Understanding NAS: Core Algorithms and Python Implementations
Data Party THU
Data Party THU
Aug 30, 2025 · Artificial Intelligence

Understanding Multi‑Armed Bandits: Balancing Exploration and Exploitation in Reinforcement Learning

Multi‑armed bandit models illustrate the core exploration‑exploitation dilemma in reinforcement learning, covering greedy, ε‑greedy, and optimistic‑initial‑value strategies, as well as sample‑average and incremental Q‑value estimation methods with practical examples and visual illustrations.

Q-value estimation · exploration vs exploitation · greedy
0 likes · 15 min read
Understanding Multi‑Armed Bandits: Balancing Exploration and Exploitation in Reinforcement Learning
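
The ε‑greedy strategy and incremental Q‑value estimation named in the summary above can be sketched in a few lines. This is a minimal example under stated assumptions: Gaussian rewards with unit variance, and the hypothetical names `incremental_update` and `run_bandit`, none of which come from the linked article.

```python
import random


def incremental_update(q, n, reward):
    """Incremental sample average: Q_n = Q_{n-1} + (R_n - Q_{n-1}) / n."""
    return q + (reward - q) / n


def run_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Play a k-armed bandit with an epsilon-greedy policy (illustrative)."""
    rng = random.Random(seed)
    k = len(true_means)
    q = [0.0] * k  # estimated value of each arm
    n = [0] * k    # number of times each arm was pulled
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                     # explore: random arm
        else:
            arm = max(range(k), key=lambda a: q[a])    # exploit: best estimate
        reward = rng.gauss(true_means[arm], 1.0)       # assumed reward noise
        n[arm] += 1
        q[arm] = incremental_update(q[arm], n[arm], reward)
    return q, n


q_estimates, pull_counts = run_bandit([0.2, 0.5, 1.0])
```

The incremental form gives the same result as averaging all past rewards for an arm, but it only stores the current estimate and a pull count. Setting `epsilon=0` recovers the purely greedy strategy.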
Bighead's Algorithm Notes
Bighead's Algorithm Notes
Aug 29, 2025 · Artificial Intelligence

Weekly Quantitative Finance Paper Digest (Aug 23‑29, 2025)

This digest summarizes nine recent arXiv papers covering quantum portfolio optimization, thematic investing with semantic stock representations, multi‑indicator reinforcement learning for trading, attention‑based asset pricing, ESG variable selection, deep neural networks for return distribution forecasting, a foundation model for financial time‑series, a multi‑agent trading system with self‑reflection, and dynamic weighting machine‑learning stock selection strategies.

ESG · Machine Learning · Quantitative Finance
0 likes · 17 min read
Weekly Quantitative Finance Paper Digest (Aug 23‑29, 2025)
Network Intelligence Research Center (NIRC)
Network Intelligence Research Center (NIRC)
Aug 27, 2025 · Artificial Intelligence

Perception‑R1: A Rule‑Based RL Method that Elevates Multimodal Model Vision

Perception‑R1, a post‑training framework that applies rule‑based reinforcement learning to existing multimodal LLMs, dramatically improves visual perception tasks such as grounding, OCR, counting and object detection, as demonstrated by extensive benchmarks and ablation studies.

GRPO · Perception Policy · Reward Modeling
0 likes · 10 min read
Perception‑R1: A Rule‑Based RL Method that Elevates Multimodal Model Vision
Kuaishou Tech
Kuaishou Tech
Aug 23, 2025 · Artificial Intelligence

How Thyme Enables Models to Think Beyond Images with Code‑Driven Multimodal Reasoning

The Kwai Keye team presents Thyme, a novel multimodal reasoning framework that lets large language models generate and safely execute Python code for image manipulation and complex calculations, achieving significant performance gains over existing vision‑language models across perception, reasoning, and hallucination‑reduction benchmarks.

AI research · Large Language Model · Multimodal
0 likes · 12 min read
How Thyme Enables Models to Think Beyond Images with Code‑Driven Multimodal Reasoning
Architect's Must-Have
Architect's Must-Have
Aug 22, 2025 · Artificial Intelligence

Why Multi-Agent Communication Protocols Are the Future of AI Collaboration

This article examines the limitations of single-agent AI, explains how Multi-Agent Communication Protocols (MCP) address challenges such as incomplete perception, decision conflicts, and scalability, and outlines current research, industrial applications, and future directions including edge integration and blockchain synergy.

Blockchain · communication protocols · distributed AI
0 likes · 8 min read
Why Multi-Agent Communication Protocols Are the Future of AI Collaboration
Data Thinking Notes
Data Thinking Notes
Aug 21, 2025 · Artificial Intelligence

Why Intermediate Tokens Matter: Denny Zhou’s Deep Insights into LLM Reasoning

This article distills Denny Zhou’s Stanford CS25 lecture, explaining how large language models achieve reasoning through intermediate token generation, chain‑of‑thought prompting, self‑consistency, reinforcement‑learning fine‑tuning, and answer aggregation, while highlighting theoretical foundations and practical breakthroughs.

LLM · Reasoning · chain-of-thought
0 likes · 18 min read
Why Intermediate Tokens Matter: Denny Zhou’s Deep Insights into LLM Reasoning
Kuaishou Tech
Kuaishou Tech
Aug 21, 2025 · Artificial Intelligence

How SeamlessFlow Doubles RL Training Throughput and Cuts Time by 62%

SeamlessFlow, an industrial‑scale reinforcement‑learning training framework released by KwaiPilot, decouples the trainer from agents via a novel data plane, introduces a tag‑based resource scheduler, and eliminates pipeline bubbles, achieving up to a 100% boost in token throughput and a 62% reduction in overall training time across large‑model RL workloads.

distributed training · pipeline optimization · reinforcement learning
0 likes · 13 min read
How SeamlessFlow Doubles RL Training Throughput and Cuts Time by 62%