Tagged articles

689 articles

May 31, 2026 · Artificial Intelligence

Reinforcement Learning Launches a New Paradigm for Spatial Omics Experiment Design

A reinforcement‑learning framework called SOFisher, developed by teams from Fudan and Beijing Institute of Technology, enables intelligent, adaptive selection of field‑of‑view positions in costly spatial‑omics experiments, dramatically improving target detection efficiency and revealing disease‑relevant cellular niches with far fewer measurements.

AI-driven microscopy · Alzheimer's disease · SOFisher

0 likes · 7 min read

Machine Learning Algorithms & Natural Language Processing

May 30, 2026 · Artificial Intelligence

Breaking the Agent Training Bottleneck: Open‑Source ClawGym Data, Training, and Evaluation Pipeline

ClawGym provides a complete open‑source framework for Claw‑style personal agents, linking a 13.5 K synthetic task dataset, black‑box rollout training, sandbox‑parallel reinforcement learning, and a rigorously verified benchmark of 200 tasks, and demonstrates that synthetic data can lift a 30 B model beyond a 235 B baseline.

ClawGym · OpenClaw · agent training

0 likes · 16 min read

Machine Heart

May 30, 2026 · Artificial Intelligence

How Abstract Symbols Cut AI Inference Cost by 11×

The article examines IBM Research's Abstract‑CoT approach, which replaces verbose natural‑language chain‑of‑thought reasoning with a compact abstract token vocabulary, achieving up to an 11‑fold reduction in inference tokens while maintaining comparable accuracy across math, instruction‑following, and multi‑hop QA benchmarks.

AI inference · Abstract-CoT · chain-of-thought

0 likes · 11 min read

AI Engineering

May 30, 2026 · Artificial Intelligence

A Unified Toolbox for JEPA and World Model Research: stable-worldmodel

Researchers tackling world‑model problems often rebuild data pipelines, environments, and baselines from scratch, but the open‑source stable‑worldmodel platform consolidates diverse dataset formats, SOTA baselines, hundreds of environments, and multiple solvers, offering a three‑step workflow with demonstrated storage and speed advantages.

JEPA · LanceDB · datasets

0 likes · 4 min read

SuanNi

May 29, 2026 · Artificial Intelligence

SenseNova-U1-8B-MoT-Infographic: Academic Charts, Posters, Recipes

The SenseNova-U1-8B-MoT-Infographic model dramatically improves AI‑generated infographics by enhancing dense‑text rendering, layout stability, and chart accuracy through targeted data, extended mid‑training, and reinforcement‑learning fine‑tuning, achieving top scores on BizGenEval and IGenBench and surpassing many commercial rivals.

AI model · Multimodal · SenseNova

0 likes · 9 min read

Old Zhang's AI Learning

May 29, 2026 · Artificial Intelligence

How NVIDIA’s Polar Enables Any Agent Framework to Plug Into Reinforcement Learning

Integrating diverse AI agent harnesses into reinforcement‑learning pipelines is notoriously labor‑intensive, but NVIDIA’s new Polar system inserts an API‑proxy layer that treats any harness as a black box, enabling seamless rollout recording and trajectory reconstruction, as demonstrated by dramatic performance gains on a 4B model across multiple harnesses.

AI agent · API Proxy · NVIDIA

0 likes · 10 min read

Machine Heart

May 29, 2026 · Artificial Intelligence

DiffusionOPD: A New Online Policy Distillation Paradigm for Multi‑Task Diffusion Models

DiffusionOPD introduces a unified on‑policy distillation framework for diffusion models that decouples single‑task online policy exploration from multi‑task capability integration, training expert teachers per task and distilling their skills into a single student model, achieving faster convergence and higher performance across composition, OCR, and aesthetic tasks.

KL divergence · PPO · diffusion models

0 likes · 8 min read

SuanNi

May 28, 2026 · Artificial Intelligence

How a 3.8B Model Beats 6B+ Models Using Just 20% of the Compute – Inside Microsoft Lens

Microsoft’s Lens team shows that a 3.8 B‑parameter image‑generation model can match or surpass 6 B‑plus models while consuming only about 19 % of the GPU compute, thanks to aggressive model compression, dense captioning, mixed‑resolution training, optimized VAE and language encoders, and targeted RL fine‑tuning.

benchmarking · dense captioning · image generation

0 likes · 14 min read

HyperAI Super Neural

May 28, 2026 · Artificial Intelligence

Large-Model RL Advances: Credit Allocation, Complex Reasoning, Agent Learning

HyperAI curates six cutting‑edge large‑model reinforcement‑learning papers—from ECHO’s free world‑model learning to DelTA’s discriminative token credit, GoLongRL’s capability‑oriented long‑context RL, Anti‑SD’s reverse distillation, RubricEM’s rubric‑guided policy decomposition, and Poly‑EPO’s diversity‑driven exploration—highlighting their methods, benchmarks, and performance gains.

Agent Learning · Complex Reasoning · Credit Assignment

0 likes · 10 min read

Machine Learning Algorithms & Natural Language Processing

May 28, 2026 · Artificial Intelligence

Open‑Source 35B Intern‑S2‑Preview Rivals Trillion‑Parameter Models on Scientific Benchmarks

The open‑source 35‑billion‑parameter Intern‑S2‑Preview model achieves scientific‑task performance comparable to trillion‑parameter models through end‑to‑end “general‑specialized” training, reinforcement‑learning scaling, and hardware‑aware optimizations, and it outperforms leading closed‑source models on benchmarks such as MolecularIQ and crystal‑structure generation.

InternLM · Large Language Model · Open Source

0 likes · 11 min read

Data Party THU

May 27, 2026 · Artificial Intelligence

How Bengio’s TBA Decouples Sampling and Learning to Speed Up LLM RL by 50×

The article explains how large‑language‑model post‑training suffers from rollout bottlenecks, introduces the Trajectory Balance with Asynchrony (TBA) framework that separates a Searcher from a Trainer, reuses off‑policy trajectories via a Trajectory Balance objective, and demonstrates up to 50× speed‑ups while preserving or improving performance on math reasoning, preference fine‑tuning, and automated red‑team tasks.

Asynchronous Training · LLM · Large Models

0 likes · 9 min read
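
For orientation, the generic trajectory-balance loss from the GFlowNet literature, written for a response y generated token by token from a prompt x, is sketched below. This is a sketch of the general form only: the summary does not say how TBA defines the reward R(x, y) or estimates the normalizer Z, so both symbols are placeholders here.

```latex
% Generic trajectory-balance objective for an autoregressive policy \pi_\theta.
% Z_\theta(x): learned or estimated normalizer; R(x, y): unnormalized reward (both assumed).
\mathcal{L}_{\mathrm{TB}}(x, y)
  = \Big( \log Z_\theta(x) + \log \pi_\theta(y \mid x) - \log R(x, y) \Big)^{2},
\qquad
\log \pi_\theta(y \mid x) = \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t})
```

Because the loss is a squared residual that can be evaluated on any (x, y) pair, it does not require y to have been sampled from the current policy, which is what makes reuse of off-policy trajectories possible.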

SuanNi

May 26, 2026 · Artificial Intelligence

Why Tokens Are Burning Out and a Free Claude Opus 4.6‑Level Model Is Coming

The SkyClaw‑v1.0 model from Skywork AI offers a free, soon‑to‑be open‑source large‑language model for agent applications that matches Claude Opus 4.6 in performance while cutting token costs dramatically, and the article details its benchmarks, training pipeline, and deployment recommendations.

Agent · Large Language Model · OpenAI API

0 likes · 7 min read

Machine Heart

May 26, 2026 · Artificial Intelligence

Can China’s SkyClaw‑v1.0 Challenge Claude Opus 4.6 with High Performance at Low Cost?

SkyClaw‑v1.0, a domestically released Agent model, delivers benchmark scores that surpass many open‑source rivals and approach top‑tier closed models like Claude Opus 4.6, while offering a dramatically lower price and a frictionless deployment experience for developers.

AI Benchmark · Agent · Claude Opus 4.6

0 likes · 12 min read

Machine Heart

May 23, 2026 · Artificial Intelligence

Why Can’t LLMs Directly Copy AlphaGo’s MCTS Success?

The article analyzes why large language models cannot simply adopt AlphaGo’s Monte‑Carlo Tree Search, highlighting credit‑assignment difficulties, gradient‑variance explosion in multi‑step RL, and how AlphaGo’s tight integration of value and policy networks amortizes search in a way LLMs cannot replicate.

AlphaGo · Credit Assignment · LLM

0 likes · 6 min read

Machine Heart

May 22, 2026 · Artificial Intelligence

ATLAS: One Word Unifies Agentic and Latent Visual Reasoning

ATLAS introduces a discrete functional token that simultaneously serves as an agentic operation and a latent reasoning unit, enabling large multimodal models to perform visual tasks without external tools or intermediate image generation, and achieves competitive results through SFT‑plus‑RL training and a token‑level gradient‑anchor technique.

ATLAS · agentic reasoning · functional token

0 likes · 11 min read

Machine Learning Algorithms & Natural Language Processing

May 21, 2026 · Artificial Intelligence

Breaking the UED Bottleneck: PACE Locates the Reinforcement‑Learning Zone of Proximal Development

The paper introduces PACE, a Parameter‑Change based Unsupervised Environment Design method that evaluates training levels by the magnitude of induced policy‑parameter updates, offering a low‑variance, computationally cheap signal that consistently outperforms prior UED approaches on MiniGrid and Craftax benchmarks.

Craftax · Curriculum Learning · ICML 2026

0 likes · 11 min read

Machine Heart

May 21, 2026 · Artificial Intelligence

Breaking the Traditional UED Bottleneck: Using RL to Precisely Locate the Zone of Proximal Development

The paper introduces PACE, a Parameter Change Environment Design method that evaluates training levels by measuring induced policy parameter updates, offering a low‑variance learning‑progress signal that outperforms prior UED approaches on MiniGrid and Craftax benchmarks, achieving higher success rates and more stable generalization.

Craftax · Curriculum Learning · ICML 2026

0 likes · 10 min read

Old Zhang's AI Learning

May 21, 2026 · Artificial Intelligence

SkillOS: Enabling Agents to Self‑Manage Their Skills

SkillOS reframes skill management for LLM agents as a long‑horizon reinforcement‑learning problem, letting a trainable Skill Curator automatically insert, update, or delete markdown‑based skills, which the frozen Agent Executor then consumes, improving memory‑free performance and cross‑task transfer.

LLM agents · Markdown · SkillOS

0 likes · 6 min read

Alimama Tech

May 21, 2026 · Artificial Intelligence

Bridging LLMs' Social Gap: Graphia Uses Social Graphs as Supervision for Full Macro‑Micro Alignment

Graphia, a new LLM‑based social simulation framework, leverages social graph data as high‑quality supervision to jointly align microscopic interaction predictions and macroscopic network structures, achieving significant gains on TDGG and IDGG benchmarks across three real‑world datasets.

Graphia · LLM · dynamic graphs

0 likes · 12 min read

Machine Heart

May 21, 2026 · Artificial Intelligence

OneModel 1.7 Hits 99% LIBERO Success, Bridging ‘Seeing’ to ‘Doing’ with Implicit Predictive Policy

OneModel 1.7 FrontoStria‑RL achieves a 99% average success rate on the LIBERO benchmark, surpassing π0.5, GR00T‑N1.5 and OpenVLA‑OFT, by introducing a Predictive Policy Latent that implicitly links world‑model understanding to action execution and is continuously refined through a reinforcement‑learning loop and a Retrieve‑then‑Steer memory mechanism.

Embodied AI · LIBERO Benchmark · Predictive Policy Latent

0 likes · 15 min read

Data Party THU

May 21, 2026 · Artificial Intelligence

ICML 2026: MedScope Introduces a New Paradigm for Long Medical Video Reasoning—From Watching to Verifying

MedScope proposes a "Think with Videos" paradigm that lets AI models actively locate and verify evidence in long clinical videos, using coarse‑to‑fine tool calling, evidence‑centric training data (ClinVideoSuite) and a grounding‑aware reinforcement learning objective, achieving superior performance on multiple video‑understanding benchmarks.

Evidence-based QA · Long Video Reasoning · Medical Video AI

0 likes · 10 min read

PaperAgent

May 21, 2026 · Artificial Intelligence

238 Promising Reinforcement‑Learning Ideas Likely to Earn CCF‑A Papers in 2026

The article compiles 238 cutting‑edge reinforcement‑learning ideas across 21 research directions, highlights recent breakthroughs such as Sutton’s Intentional Updates, and provides brief overviews of representative papers—including knowledge‑graph, Kalman‑filter, agentic, LLM‑driven, and world‑model approaches—along with links to the accompanying source code.

Kalman filter · LLM · agentic RL

0 likes · 6 min read

Machine Learning Algorithms & Natural Language Processing

May 20, 2026 · Artificial Intelligence

Composer 2.5 Narrows the Gap to Claude Opus 4.7 with Ten‑Fold Cost Savings

Composer 2.5, the latest AI‑coding model from Cursor, claims near‑parity with Claude Opus 4.7 and GPT‑5.5 while delivering up to ten times higher efficiency at a price of $0.5 per million input tokens and $2.5 per million output tokens, backed by new reinforcement‑learning techniques, large‑scale synthetic data, and a custom Muon optimizer with a dual‑grid HSDP architecture.

AI programming · Composer 2.5 · HSDP

0 likes · 13 min read
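
To make the quoted pricing concrete, the arithmetic can be sketched as below. Only the two per-million-token rates come from the summary; the function name and the example token counts are hypothetical.

```python
# Rates quoted in the summary above (USD per one million tokens).
INPUT_USD_PER_M = 0.5
OUTPUT_USD_PER_M = 2.5


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in USD under the quoted per-million-token rates."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 200k input tokens and 40k output tokens.
    print(f"{request_cost_usd(200_000, 40_000):.2f} USD")
```

Under these rates an output token costs five times as much as an input token, so long generations dominate the bill well before long prompts do.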

Machine Heart

May 19, 2026 · Artificial Intelligence

HyperEyes: Parallel Multimodal Search Agents Move from Deep to Wide for Efficiency

HyperEyes introduces a unified‑location‑as‑search (UGS) action space, parallel data synthesis, and a dual‑granularity efficiency‑aware RL framework that enable multimodal agents to perform simultaneous multi‑target retrieval, dramatically reducing interaction rounds while improving accuracy and cost‑efficiency across benchmark evaluations.

Agent · Efficiency · benchmark

0 likes · 9 min read

Machine Heart

May 19, 2026 · Artificial Intelligence

100k‑Token Natural‑Language Reasoning Enables a 30B‑A3B Model to Reach Olympiad Gold Level

A 30B‑A3B model, trained with reverse‑perplexity supervised fine‑tuning, two‑stage reinforcement learning, and a multi‑round generate‑verify‑revise inference loop, achieves gold‑medal‑level performance on IMO, USAMO and IPhO problems using natural‑language reasoning of more than 100k tokens and no external tools.

30B-A3B · natural language processing · olympiad AI

0 likes · 11 min read

ByteDance SE Lab

May 19, 2026 · Artificial Intelligence

Introducing Uni-Agent: veRL’s Open‑Source Unified Framework for General‑Purpose Agent Training

Uni-Agent is an open‑source framework that unifies building, running, and training of general AI agents, offering extensible model, tool, and environment modules, scalable sandbox execution via veFaaS, live monitoring, and demonstrated performance gains on large‑scale coding‑agent experiments.

Agent · Open Source · Scalable Execution

0 likes · 8 min read

AI Insight Log

May 19, 2026 · Artificial Intelligence

Cursor Returns with Composer 2.5: Openly Built on Kimi, 10× Lower Cost, Musk Endorses

Cursor unveiled Composer 2.5, reporting benchmark scores comparable to Opus 4.7 and GPT‑5.5, a ten‑fold cost reduction, explicit use of Moonshot’s Kimi K2.5 as a base, new RL training techniques, and a partnership with SpaceXAI that multiplies compute power, all highlighted by Elon Musk’s retweet.

AI model · Composer 2.5 · Cursor

0 likes · 10 min read

Machine Learning Algorithms & Natural Language Processing

May 19, 2026 · Artificial Intelligence

From P(y|x) to P(y): Reinforcement Learning in Pre‑train Space Unlocks Endogenous Reasoning

The paper introduces PreRL, which removes the input condition to directly optimize the reasoning trajectory (P(y)) of large language models, and combines it with standard RL in Dual Space RL (DSRL), achieving consistent gains on math and out‑of‑distribution benchmarks, faster training, and richer reasoning behaviors.

DSRL · PreRL · Reasoning

0 likes · 11 min read

Machine Heart

May 18, 2026 · Artificial Intelligence

Composer 2.5 Delivers Opus‑level Performance at One‑Tenth the Cost

Composer 2.5, Cursor’s latest LLM, matches Claude Opus 4.7‑level capabilities while costing roughly one‑tenth as much, thanks to larger training scale, precise text‑feedback reinforcement learning, 25× more synthetic tasks, and a new Muon‑HSDP optimizer that boosts efficiency up to ten‑fold.

Composer 2.5 · LLM · Muon optimizer

0 likes · 9 min read

Bighead's Algorithm Notes

May 18, 2026 · Artificial Intelligence

FineFT: Efficient Risk-Aware Reinforcement Learning for Futures Trading

FineFT introduces a three‑stage ensemble reinforcement‑learning framework that tackles high‑leverage reward volatility and missing ability‑boundary awareness in crypto futures trading by using selective TD‑error updates, VAE‑based market‑state boundary detection, and a risk‑aware routing mechanism, ultimately outperforming twelve baselines on six financial metrics while cutting risk by over 40%.

ensemble methods · financial RL · futures trading

0 likes · 12 min read

Machine Heart

May 18, 2026 · Artificial Intelligence

ICML 2026: Teaching Large Models to Think and Speak – Turning “When to Speak” into a Learnable Strategy

The paper “When to Think, When to Speak” introduces Side‑by‑Side Interleaved Reasoning, a learnable disclosure policy that lets LLMs alternate between internal thinking and user‑visible answer fragments, reducing content latency while preserving or improving accuracy on math and scientific QA benchmarks.

CoT · LLM · Qwen3

0 likes · 10 min read

Machine Heart

May 17, 2026 · Artificial Intelligence

What Exactly Is a World Model? History, Technology, and the $10 B Bet

The article traces the two decades‑long, parallel research lines that birthed video world models—dreaming agents in reinforcement learning and learning physics from human video—explains how they converged in 2024‑2025, evaluates current capabilities and limitations, and analyzes the $10 billion investment landscape and strategic moves by NVIDIA, OpenAI, and others.

AI research · Simulation · Video Generation

0 likes · 32 min read

Data Party THU

May 16, 2026 · Artificial Intelligence

SubQ Beats Transformers: 12‑Million‑Token Context Model at Only 5% of Opus Cost

The article analyzes SubQ, a new LLM architecture using Subquadratic Sparse Attention (SSA) to achieve a 12‑million‑token context window with linear compute scaling, delivering up to 52× speedup and costing just 5% of Opus while matching dense‑attention performance on long‑context benchmarks.

SSA · Sparse Attention · SubQ

0 likes · 14 min read

Machine Heart

May 16, 2026 · Artificial Intelligence

GIPO: Overcoming Utilization Collapse for Efficient Large‑Model Reinforcement Learning

GIPO (Gaussian Importance Sampling Policy Optimization) replaces PPO’s hard clipping with a smooth Gaussian‑weighted trust region, achieving log‑space symmetry and bias‑variance balance that mitigates policy lag and utilization collapse, and demonstrates superior stability and sample efficiency on GridWorld, LIBERO, MetaWorld, and 7‑billion‑parameter VLA experiments.

Bias-Variance Tradeoff · GIPO · Policy Optimization

0 likes · 17 min read
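
The contrast between a hard clip and a smooth, log-symmetric weight can be illustrated with a toy comparison. The Gaussian form below is an assumption chosen for illustration and is not taken from the GIPO paper; `hard_clip_mask`, `gaussian_weight`, `eps` and `sigma` are hypothetical names and values.

```python
import math


def hard_clip_mask(ratio: float, eps: float = 0.2) -> float:
    """Simplified on/off view of PPO clipping: 1 inside [1 - eps, 1 + eps], else 0.

    This ignores the sign of the advantage, which real PPO takes into account.
    """
    return 1.0 if 1.0 - eps <= ratio <= 1.0 + eps else 0.0


def gaussian_weight(ratio: float, sigma: float = 0.2) -> float:
    """Illustrative smooth weight: a Gaussian in the log of the importance ratio.

    Assumed form, not the paper's formula. It equals 1 at ratio == 1 and gives
    ratio r and 1 / r the same weight, i.e. it is symmetric in log space.
    """
    return math.exp(-(math.log(ratio) ** 2) / (2.0 * sigma ** 2))


if __name__ == "__main__":
    for r in (0.5, 0.81, 1.0, 1 / 0.81, 2.0):
        print(f"ratio={r:.3f}  clip={hard_clip_mask(r):.0f}  gauss={gaussian_weight(r):.4f}")
```

The interval [1 - eps, 1 + eps] is not symmetric in log space: a ratio of 0.81 falls inside it while its reciprocal, about 1.235, falls outside. The Gaussian weight treats both identically and decays smoothly instead of switching off.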

Machine Heart

May 16, 2026 · Artificial Intelligence

Why More Compute Can't Fix LLM Inference Lag and Why RL Leads to Overtraining

In a deep interview, former Google TPU architect Reiner Pope explains that low‑concurrency fast‑mode services trade higher fees for faster streaming but are limited by memory‑bandwidth bottlenecks, that optimal concurrency balances compute and memory costs, and that pipeline‑parallel sparse expert models and reinforcement‑learning fine‑tuning introduce new inefficiencies and overtraining risks.

LLM · Memory Bandwidth · Overtraining

0 likes · 7 min read

Machine Heart

May 14, 2026 · Artificial Intelligence

Breaking Homogeneous Reasoning: I²B‑LPO Guides RLVR from Repeated Sampling to Effective Exploration

I²B‑LPO is an exploration‑enhancement framework for RLVR that branches rollouts at high‑entropy nodes, injects latent variables via pseudo self‑attention, and filters paths with an information‑bottleneck self‑reward, achieving up to 5.3% accuracy and 7.4% diversity improvements on multiple math reasoning benchmarks.

RLVR · entropy · exploration

0 likes · 14 min read

Machine Heart

May 14, 2026 · Artificial Intelligence

How PsiBot Uses 100,000 Hours of Human Data to Power Embodied Intelligence

PsiBot demonstrates that, with a 100,000‑hour human‑operation dataset captured via exoskeleton gloves and ego‑vision, a world‑model (W0) and reinforcement‑learning policy (R2) can bridge the gap to robot control, offering a scalable alternative to costly teleoperation pipelines.

Embodied AI · World Model · data collection

0 likes · 12 min read

Kuaishou Tech

May 13, 2026 · Artificial Intelligence

OneSearch‑V2 Launches: Self‑Distilled Generative Search That Truly Understands Users

OneSearch‑V2 introduces a latent‑reasoning enhanced self‑distillation framework that augments query understanding with thought‑augmented CoT, aligns preferences via direct user behavior feedback, and achieves up to 4 % CTR lift and significant order growth without adding inference cost or latency.

LLM · behavioral feedback · e-commerce

0 likes · 26 min read

Machine Learning Algorithms & Natural Language Processing

May 12, 2026 · Artificial Intelligence

Breaking Off‑Policy Shift: Bengio’s TBA Decouples Sampling and Learning for 50× Faster LLM RL

Trajectory Balance with Asynchrony (TBA) separates sample generation (Searcher) from model updates (Trainer), uses a trajectory‑balance objective to incorporate off‑policy data, and achieves up to 50× speedup in large‑model RL post‑training while preserving or improving performance on math reasoning, preference fine‑tuning, and red‑team tasks.

Asynchronous Training · LLM · Off-Policy

0 likes · 10 min read

Machine Learning Algorithms & Natural Language Processing

May 12, 2026 · Artificial Intelligence

LaST‑R1: Embodied Robot Model Hits 99.9% LIBERO Success via Physical Reasoning

LaST‑R1 presents a new embodied AI framework that inserts latent physical reasoning before action generation and jointly optimizes reasoning and control with LAPO, achieving 99.9% average success on the LIBERO benchmark after a single‑trajectory warm‑up and boosting real‑world task success from 52.5% to 93.75%, while showing superior generalization to unseen objects, backgrounds and lighting.

Embodied AI · LAPO · LIBERO Benchmark

0 likes · 11 min read

Data Party THU

May 12, 2026 · Artificial Intelligence

MathForge: Leveraging Hard Problems in RL to Boost Large‑Model Mathematical Reasoning (ICLR 2026)

MathForge tackles the long‑standing question of which math problems deserve focus in reinforcement‑learning‑based training, introducing a difficulty‑aware optimizer (DGPO) and multi‑aspect question reformulation (MQR) that together prioritize harder‑but‑learnable questions, yielding consistent performance gains across model sizes and modalities.

DGPO · Difficulty‑Aware Optimization · MQR

0 likes · 11 min read

AsiaInfo Technology: New Tech Exploration

May 12, 2026 · Artificial Intelligence

Silicon Brain: Neural Connections, Symbolic Reasoning, and Reinforcement Learning in AGI

This article analyses DeepMind’s three‑pronged AGI paradigm—combining neural networks, symbolic systems, and reinforcement learning—by dissecting AlphaGo, AlphaFold 2, Gemini, and the Genie‑Sima loop, mapping the biological inspiration, outlining engineering and safety challenges, and proposing research directions for large‑scale deployment in communication scenarios.

AGI · DeepMind · Engineering Challenges

0 likes · 21 min read

Machine Learning Algorithms & Natural Language Processing

May 11, 2026 · Artificial Intelligence

Heuristic Learning: A New Reinforcement Learning Paradigm for Continual Learning

The article proposes Heuristic Learning (HL) as a way to tackle continual learning’s catastrophic forgetting by using coding agents that iteratively refine rule‑based policies, showing empirical gains on Atari, MuJoCo, and VizDoom tasks and outlining HL’s benefits, challenges, and future integration with neural networks.

LLM · coding agents · continual learning

0 likes · 15 min read

PaperAgent

May 11, 2026 · Artificial Intelligence

SkillOS: How Skill Governance Powers Self‑Evolving AI Agents

SkillOS addresses the one‑off nature of current LLM agents by introducing a closed‑loop system where a trainable Skill Curator continuously extracts, updates, and manages reusable skills from execution traces, leading to measurable gains in success rates, efficiency, and cross‑task generalization.

Grouped Task Streams · LLM agents · Meta-Strategy Skills

0 likes · 10 min read

Machine Heart

May 10, 2026 · Artificial Intelligence

Sutton’s New Intentional Updates: Solving Streaming RL’s Major Flaw with a 1967 Formula

The article reviews the recent Intentional Updates framework—co‑authored by Turing laureate Richard Sutton—that redefines step‑size in streaming reinforcement learning using a 1967 NLMS‑style formula, details its algorithmic design, experimental validation, and remaining challenges.

Policy Gradient · Sutton · intentional updates

0 likes · 11 min read
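
For readers unfamiliar with the classical rule the summary alludes to, a minimal sketch of the standard normalized least-mean-squares (NLMS) update for a linear predictor follows. This is the textbook NLMS step, not the Intentional Updates algorithm itself; the function name and default values are hypothetical.

```python
def nlms_update(weights, features, target, mu=1.0, eps=1e-8):
    """One NLMS step: the raw step size mu is divided by the squared feature norm.

    With mu == 1 (and negligible eps) the prediction on this same sample moves
    all the way to the target, so the step is expressed in terms of the intended
    change in the output rather than a raw learning rate.
    """
    prediction = sum(w * x for w, x in zip(weights, features))
    error = target - prediction
    norm_sq = sum(x * x for x in features) + eps  # eps guards against division by zero
    return [w + mu * error * x / norm_sq for w, x in zip(weights, features)]


if __name__ == "__main__":
    w = nlms_update([0.0, 0.0], [3.0, 4.0], target=10.0)
    print(w, sum(wi * xi for wi, xi in zip(w, [3.0, 4.0])))
```

Setting `mu` below 1 closes only that fraction of the error on the current sample, which is the usual way to trade tracking speed against noise sensitivity.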

Machine Heart

May 10, 2026 · Artificial Intelligence

Embodied AI Unveiled: Ted Xiao Revisits Three Eras of Robot Learning from Google RT‑1/2 to SayCan

In a detailed interview, Ted Xiao, former Google DeepMind researcher, walks through the existence‑proof, foundation‑model, and scaling eras of embodied robot learning, explaining the technical challenges, pivotal decisions, and the evolving role of large language and vision models in robotics.

Embodied AI · foundation models · imitation learning

0 likes · 19 min read

DataFunTalk

May 10, 2026 · Artificial Intelligence

DeepSeek vs MCTS: Decoding the ‘Chicken & Liquor’ Dilemma in LLM Training

The article analyzes why DeepSeek’s large‑model training struggles with Monte‑Carlo Tree Search, explains its use of Chain‑of‑Thought prompting, GRPO entropy‑boosting and rejection‑sampling fine‑tuning, compares these methods with Google’s OmegaPRM and PRM approaches, and proposes a concrete MCTS‑driven data‑generation pipeline to overcome the “chicken and liquor” trade‑off.

DeepSeek · GRPO · Monte Carlo Tree Search

0 likes · 14 min read

Machine Heart

May 10, 2026 · Artificial Intelligence

Stop Fragmenting Long Texts: HiLight Lets AI Highlight Key Points Directly

The HiLight approach inserts lightweight highlight tags into full-length inputs, training a small Emphasis Actor to score token importance and guide a frozen large language model, improving performance on tasks like recommendation and QA without modifying the solver, while keeping low latency and training cost.

LLM · Low latency · evaluation

0 likes · 9 min read

Machine Learning Algorithms & Natural Language Processing

May 9, 2026 · Artificial Intelligence

Heuristic Learning: Reinforcement Without Parameter Updates via .py File

OpenAI researcher Yong Jiayi introduces Heuristic Learning, a reinforcement‑learning paradigm that replaces gradient‑based neural network updates with code editing driven by GPT‑5.4, achieving the theoretical maximum score of 864 on Atari Breakout and matching or surpassing PPO on multiple Atari and robot tasks.

Atari Benchmark · GPT-5.4 · continual learning

0 likes · 8 min read

PaperAgent

May 9, 2026 · Artificial Intelligence

How Anthropic’s Natural Language Autoencoders Open the LLM Black Box

Anthropic’s Natural Language Autoencoders (NLA) translate high‑dimensional LLM activation vectors into readable text, using an Activation Verbalizer and Reconstruction module trained via RL to maximize Fraction of Variance Explained, and reveal internal planning, language bias, tool‑call hallucinations, and hidden reasoning across multiple Claude models.

Activation Verbalizer · Anthropic · Claude

0 likes · 9 min read
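
The metric named in the summary, Fraction of Variance Explained, can be sketched with its usual definition: one minus the residual sum of squares over the total sum of squares around the mean. How the NLA work computes it in detail is not stated here, so the function below is a generic version with hypothetical names and toy data.

```python
def fraction_of_variance_explained(original, reconstructed):
    """FVE = 1 - SS_res / SS_tot over a batch of equal-length vectors."""
    n, dim = len(original), len(original[0])
    # Per-dimension mean of the original vectors.
    mean = [sum(vec[j] for vec in original) / n for j in range(dim)]
    ss_res = sum(
        (a - b) ** 2
        for vec, rec in zip(original, reconstructed)
        for a, b in zip(vec, rec)
    )
    ss_tot = sum((a - m) ** 2 for vec in original for a, m in zip(vec, mean))
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    acts = [[1.0, 2.0], [3.0, 6.0]]  # toy "activation" vectors
    print(fraction_of_variance_explained(acts, acts))
    print(fraction_of_variance_explained(acts, [[2.0, 4.0], [2.0, 4.0]]))
```

A perfect reconstruction scores 1, and a reconstruction that always outputs the mean vector scores 0, so the value reads as the share of activation variance the text bottleneck preserves.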

DeepHub IMBA

May 8, 2026 · Artificial Intelligence

Building a Custom 8×8 GridWorld with Q‑Learning in Gymnasium

This tutorial walks through creating a custom 8×8 GridWorld environment in Gymnasium, implementing a Q‑Learning agent that learns to navigate from the top‑left corner to the bottom‑right goal while avoiding walls, and visualizing training curves, learned policies, and a performance comparison with a random agent.

GridWorld · Gymnasium · Python

0 likes · 10 min read
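
The kind of agent the tutorial describes can be sketched with plain tabular Q-learning. The version below uses only the standard library rather than the Gymnasium API, and the wall layout, rewards and hyperparameters are hypothetical choices, not the tutorial's.

```python
import random

SIZE = 8
START, GOAL = (0, 0), (7, 7)
WALLS = {(2, 2), (2, 3), (2, 4), (5, 3), (5, 4), (5, 5)}  # hypothetical layout
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Move one cell; hitting a wall or the border leaves the agent in place."""
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < SIZE and 0 <= c < SIZE) or (r, c) in WALLS:
        r, c = state
    return (r, c), -1.0, (r, c) == GOAL  # next state, step cost, done flag


def train(episodes=3000, alpha=0.5, gamma=0.99, epsilon=0.1, max_steps=500, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = {(r, c): [0.0] * 4 for r in range(SIZE) for c in range(SIZE)}
    for _ in range(episodes):
        state = START
        for _ in range(max_steps):
            if rng.random() < epsilon:
                a = rng.randrange(4)  # explore
            else:
                a = max(range(4), key=lambda i: q[state][i])  # exploit
            nxt, reward, done = step(state, ACTIONS[a])
            # Q-learning target: bootstrap from the best action in the next state.
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
            if done:
                break
    return q


def greedy_path(q, limit=SIZE * SIZE):
    """Follow the learned greedy policy from START for at most `limit` steps."""
    state, path = START, [START]
    for _ in range(limit):
        a = max(range(4), key=lambda i: q[state][i])
        state, _reward, done = step(state, ACTIONS[a])
        path.append(state)
        if done:
            break
    return path


if __name__ == "__main__":
    print(greedy_path(train()))
```

Initializing the table at zero while every step costs -1 makes untried actions look attractive, which drives early exploration without a high epsilon.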

Machine Heart

May 8, 2026 · Industry Insights

How SGLang’s $100M Seed Funding Powers the Next‑Gen Open AI Infrastructure

RadixArk raised a $100 million seed round backed by top hardware and AI investors to turn the open‑source SGLang inference engine and the Miles RL framework into day‑0 standards, aiming to democratize AI infrastructure and eliminate bottlenecks from training to inference.

AI infrastructure · DeepSeek V4 · Hardware‑agnostic AI

0 likes · 10 min read

Machine Learning Algorithms & Natural Language Processing

May 7, 2026 · Artificial Intelligence

Latent Action RL Shrinks Exploration Space for Multimodal Dialogue Fine‑Tuning

By learning a compact latent‑action space from paired image‑text and large‑scale text data, the authors reduce the RL search space from a vocabulary of over 150k tokens to a 128‑entry codebook, enabling more efficient fine‑tuning of multimodal conversational agents and achieving consistent gains across several RL algorithms.

Multimodal · Vision-Language Models · dialogue agents

0 likes · 11 min read

Alimama Tech

May 7, 2026 · Artificial Intelligence

Dual‑Phase RL‑LLM Framework DARA for Few‑Shot Online Advertising Budget Allocation

The DARA framework splits online advertising budget allocation into a few‑shot LLM reasoning stage and a fine‑grained optimizer stage, enhanced by a dynamically updated RL‑fine‑tuning algorithm (GRPO‑Adaptive), achieving significantly lower ROI variance than traditional baselines in both real and simulated environments.

LLM · budget allocation · few-shot learning

0 likes · 16 min read

PaperAgent

May 7, 2026 · Artificial Intelligence

190 Must-Read AI Agent Papers + 321 Google Implementation Cases – Free Resource Pack

The article provides a free compiled resource containing 190 essential AI Agent papers—from fundamentals to cutting‑edge topics—along with 321 Google‑released implementation cases and 500 open‑source agent applications, all with source code to help beginners and researchers quickly understand the field and reproduce results.

AI agent · LLM · Memory

0 likes · 6 min read

Machine Heart

May 6, 2026 · Artificial Intelligence

Can Adaptive Guidance Unlock Small Model Reasoning? Introducing G²RPO‑A

The paper identifies reward sparsity as the core obstacle for small language models in reinforcement‑learning‑based reasoning, proposes G²RPO‑A, which injects high‑quality thinking trajectories and dynamically adjusts guidance length, and demonstrates large accuracy gains on math and code benchmarks: for example, Qwen3‑1.7B improves from 50.96% to 67.21% on MATH500 and from 46.08% to 75.93% on HumanEval.

G²RPO‑A · adaptive guidance · code generation

0 likes · 10 min read

Machine Heart

May 6, 2026 · Artificial Intelligence

PromptEcho: Leveraging Frozen Multimodal Models for High‑Quality Text‑to‑Image Rewards Without Labels

PromptEcho computes a continuous reward for text‑to‑image generation by measuring how well a frozen vision‑language model can reconstruct the original prompt from the generated image, eliminating the need for annotated data or a trained reward model and outperforming prior methods across multiple benchmarks.

PromptEcho · Reward Modeling · benchmark

0 likes · 10 min read

Machine Learning Algorithms & Natural Language Processing

May 5, 2026 · Artificial Intelligence

LLMBeginner: A Project‑Based Roadmap for Zero‑Base Mastery of Large Language Models

The LLMBeginner project from the MLNLP community offers a staged, project‑oriented learning path—covering big‑picture concepts, deep learning and reinforcement learning fundamentals, LLM theory and practice, and agent development—to guide beginners from fragmented resources to systematic mastery, with both concise and detailed versions hosted on GitHub.

Agent · GitHub · LLM

0 likes · 5 min read

Data Party THU

May 4, 2026 · Artificial Intelligence

Understanding the Mathematical Foundations of Reinforcement Learning

This article provides a concise overview of a ten‑chapter reinforcement‑learning textbook, outlining the progression from basic concepts such as states and rewards to advanced algorithms like policy gradients and actor‑critic methods, and explains how each chapter builds on the previous ones.

Bellman equation · Monte Carlo · Policy Gradient

0 likes · 11 min read

Machine Learning Algorithms & Natural Language Processing

May 2, 2026 · Artificial Intelligence

Real-World Large-Scale Test Shows Robots Learning While Deploying Outperform Baselines on Eight Tasks

The article presents the LWD (Learning While Deploying) framework, detailing its reinforcement‑learning‑driven data flywheel, the DIVL value‑evaluation and QAM policy‑optimization modules, and experimental results where a dual‑arm robot improves success rates by up to 17% and reduces cycle time by 23.75 seconds across eight real‑world tasks, surpassing strong baselines.

DIVL · Data Flywheel · LWD

0 likes · 12 min read

AI Explorer

May 2, 2026 · Industry Insights

AI Industry Highlights May 2, 2026: Funding Surge, New Tools, and Research Breakthroughs

In May 2026, the AI sector saw a 77% rise in capital spending by the four biggest tech firms, Meta's acquisition of robot startup ARI, reinforcement‑learning advances boosting LLM inference, OpenAI's ChatGPT Images 2.0 launch, Tencent's Hy‑MT model outperforming Google, Microsoft's legal‑AI assistant, a 400B model running on iPhone, and notable research from CMU and independent scholars.

AI investment · CMU research · Meta

0 likes · 5 min read

Machine Heart

May 1, 2026 · Artificial Intelligence

From PPO to MaxRL: The Evolution of Reinforcement Learning for LLM Reasoning

This article surveys the rapid evolution of reinforcement‑learning algorithms for large‑language‑model reasoning, from early REINFORCE and PPO to newer approaches such as GRPO, RLOO, DAPO, CISPO, DPPO, ScaleRL and MaxRL, highlighting their design motivations, mathematical formulations, empirical trade‑offs and open research challenges.

GRPO · LLM · MaxRL

0 likes · 27 min read

Machine Heart

Apr 30, 2026 · Artificial Intelligence

Why GPT‑5 Models Keep Talking About Goblins: RL Reward Leakage Uncovered

The article analyzes how DeepSeek’s "极" bug and OpenAI’s recurring "goblin" output stem from unclean training data and an unintended reinforcement‑learning reward bias, showing how a persona‑specific habit leaked into general model behavior and how engineers responded.

GPT-5 · Goblin bug · Nerdy persona

0 likes · 8 min read

Machine Heart

Apr 30, 2026 · Artificial Intelligence

How LWD Redefines Embodied AI Training with Fleet‑Scale Reinforcement Learning

LWD (Learning While Deploying) introduces a distributed multi‑robot reinforcement‑learning framework that continuously improves VLA policies during real‑world deployment, leveraging DIVL, QAM, dynamic n‑step TD and an asynchronous actor‑learner architecture to achieve over 90% success on five‑minute tasks and outperform traditional behavior‑cloning, HG‑DAgger and RECAP baselines.

Embodied AI · LWD · VLA

0 likes · 13 min read

PaperAgent

Apr 30, 2026 · Artificial Intelligence

Why Reinforcement Learning Is the Future: 2026 Top‑Conference RL Paper Collection

The article highlights the rapid rise of reinforcement learning across major 2026 conferences, curates 181 RL papers from eight top venues, and provides detailed summaries of innovative works such as MSRL and MedVR, offering free access to the papers and code.

Large Models · Reward Modeling · agentic RL

0 likes · 6 min read

PaperAgent

Apr 30, 2026 · Artificial Intelligence

How Agentic AI is Redefining World Modeling

The article reviews the paper "Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond", introducing a two‑axis framework (capability levels L1‑L3 and law domains) to map diverse world‑modeling systems, highlighting that most current systems stall at L1, that explicit law encoding is crucial for long‑term stability, and that L3 represents the ultimate, self‑evolving model.

AI agents · AI research · Simulation

1 likes · 6 min read

SuanNi

Apr 28, 2026 · Artificial Intelligence

ASI‑EVOLVE: AI Designs AI and Beats Human SOTA by Almost Three‑Fold

The open‑source ASI‑EVOLVE framework lets AI autonomously design AI across model architecture, data curation, and reinforcement‑learning algorithms, achieving up to three times the human‑level state‑of‑the‑art performance and demonstrating cross‑domain gains in drug‑target prediction.

AI-driven AI · ASI-EVOLVE · Cross-domain AI

0 likes · 12 min read

Machine Learning Algorithms & Natural Language Processing

Apr 28, 2026 · Artificial Intelligence

Can Reasoning Models Keep Improving? TEMPO Uses EM to Stop Reward Drift

The paper introduces TEMPO, a test‑time training framework inspired by the Expectation‑Maximization algorithm, which alternates policy optimization (M‑step) with Critic calibration (E‑step) to prevent reward‑signal drift, and demonstrates on Qwen3 and OLMO3 models that it continuously improves reasoning performance and maintains output diversity beyond the saturation point of existing TTT methods.

EM algorithm · Reasoning · Test-Time Training

0 likes · 14 min read

AI2ML AI to Machine Learning

Apr 28, 2026 · Artificial Intelligence

Which of the Three Types of AI Agents Are You Building?

The article classifies today’s booming AI agents into three categories—foundation‑model RL agents, OpenClaw‑style autonomous agents, and ontology‑driven agents—detailing their architectures, key components, comparative strengths, and how they converge toward the envisioned L4/L5 AGI stages.

AI agents · Agent Orchestration · LLM

0 likes · 9 min read

Machine Heart

Apr 28, 2026 · Artificial Intelligence

Can LLMs Answer More Accurately While Writing Less? Introducing SHAPE’s Reasoning Tax

The SHAPE framework (Stage‑aware Hierarchical Advantage via Potential Estimation) adds a milestone‑based “reasoning tax” to large language model inference, providing step‑wise correctness signals and penalizing verbosity, which yields an average 3% accuracy gain and a 30% reduction in token consumption across multiple math‑reasoning benchmarks.

ACL 2026 · LLM · Mathematical Reasoning

0 likes · 10 min read

Machine Heart

Apr 28, 2026 · Artificial Intelligence

World’s First Open‑Source Large Model for Real‑World Medical Video Understanding

The article introduces uAI‑NEXUS‑MedVLM, the world's first open‑source large model for real‑world medical video understanding, built on the MedVidBench dataset and the MedGRPO training framework, which together overcome data scarcity, evaluation gaps, and task‑specialization challenges in surgical video AI, achieving state‑of‑the‑art performance across eight benchmark tasks.

AI in Surgery · Large Language Model · MedVidBench

0 likes · 18 min read

PMTalk Product Manager Community

Apr 28, 2026 · Artificial Intelligence

First Principle for Agent Product Managers: Choosing Between Single Agent, Multi‑Agent Collaboration, and Workflow

The article presents a decision framework for AI product managers, mapping workflow determinism and context certainty to four technical patterns—traditional RPA + AI, single Agent + RAG/knowledge graph, end‑to‑end RL Agent, and multi‑Agent collaboration—each with concrete use‑case examples and selection guidelines.

AI agents · Multi-Agent Systems · RPA

0 likes · 6 min read

360 Tech Engineering

Apr 28, 2026 · Artificial Intelligence

How 360 AI Institute Boosted Airline Translation Accuracy from 70% to 96%

The 360 AI Research Institute tackled the zero‑tolerance translation demands of airline maintenance by building a specialized parallel corpus and applying RAG‑enhanced, SFT‑fine‑tuned, and RL‑reinforced models, raising Chinese‑to‑English translation accuracy from 70% to 96% and enabling a one‑month rollout.

AI translation · RAG · SFT

0 likes · 5 min read

AI Explorer

Apr 27, 2026 · Artificial Intelligence

Reinforcement Learning Scaling Law Shows How RL Fine‑Tuning Boosts Large Model Reasoning

A new study by USTC and Shanghai AI Lab uncovers a power‑law scaling relationship between RL fine‑tuning compute and large‑model reasoning performance, offering a quantitative way to predict and control AI capability growth.

AI research · Scaling Law · large language models

0 likes · 7 min read

Machine Heart

Apr 27, 2026 · Artificial Intelligence

ACL 2026: Unveiling a Predictive Scaling Law for Reinforcement Learning Fine‑Tuning of Large Models

The paper presents a systematic empirical study that derives a power‑law scaling formula for reinforcement‑learning post‑training of large language models, demonstrating accurate inter‑ and intra‑model performance prediction, learning‑efficiency saturation, data‑reuse benefits, and cross‑architecture validity.

Data Reuse · Llama 3 · Qwen2.5

0 likes · 11 min read

Machine Learning Algorithms & Natural Language Processing

Apr 25, 2026 · Artificial Intelligence

From Classic Multi-Agent Paradigms to Future Large-Foundation-Model-Driven Systems

This review surveys classic multi-agent systems and the emerging large-foundation-model-driven MAS paradigm, comparing their architectures, perception, communication, decision-making and control, and discusses how integrating LFMs enables semantic reasoning, greater adaptability, and new research challenges.

Collaborative AI · Large Foundation Models · Multi-Agent Systems

0 likes · 8 min read

Alibaba Cloud Developer

Apr 24, 2026 · Artificial Intelligence

How Hermes Agent Achieves Self‑Evolution: A Deep Dive into Prompt, Context, and Harness Design

This article provides a detailed technical analysis of Hermes Agent, explaining how its dynamic skill generation and reinforcement‑learning loop enable true self‑evolution, and examines the prompt engineering, context compression, memory architecture, harness mechanisms, error handling, and plugin ecosystem that differentiate it from OpenClaw and Claude Code.

Agent Framework · Context Compression · Hermes Agent

0 likes · 41 min read

Bighead's Algorithm Notes

Apr 22, 2026 · Artificial Intelligence

How DeepAries’s Adaptive Rebalancing Timing Boosts Portfolio Returns

DeepAries is a novel deep reinforcement‑learning framework that jointly learns when to rebalance a portfolio and how to allocate assets by combining a Transformer‑based state encoder with PPO, and extensive experiments on four major markets show it significantly outperforms fixed‑frequency baselines in risk‑adjusted return, transaction cost, and drawdown.

DeepAries · PPO · Portfolio Management

0 likes · 15 min read

AntTech

Apr 22, 2026 · Artificial Intelligence

How Multi‑Agent MCTS and Information‑Gain Rewards Are Transforming Mobile GUI and Search Agents

This article reviews two recent ICLR 2026 papers—M²‑Miner, a multi‑agent Monte‑Carlo Tree Search framework for low‑cost mobile GUI data mining, and IGPO, an information‑gain‑based reinforcement‑learning method that provides dense rewards for multi‑turn search agents—detailing their designs, experiments, and open‑source releases.

GUI Data Mining · Information Gain · LLM agents

0 likes · 8 min read

Java Architect Essentials

Apr 21, 2026 · Artificial Intelligence

Why Cursor’s Composer 2 Beats Claude Opus 4.6 in Performance and Cost

Cursor’s new Composer 2 model outperforms Claude Opus 4.6 on benchmarks like Terminal‑Bench 2.0, cuts pricing to $0.50/$2.50 per million tokens, and introduces a self‑summary reinforcement‑learning technique that dramatically reduces context loss in long‑running coding tasks.

AI programming · Composer 2 · Cursor

0 likes · 9 min read

Machine Heart

Apr 21, 2026 · Artificial Intelligence

Monet Enables Multimodal Models to Perform Human‑like Abstract Visual Thinking

Monet introduces a training paradigm that lets multimodal large language models reason directly in a continuous latent visual space, replacing external tool calls with implicit visual embeddings, and demonstrates significant gains on both in‑distribution perception tasks and out‑of‑distribution abstract visual reasoning through three‑stage supervised fine‑tuning followed by a novel visual‑latent policy optimization method.

Latent Embedding · MLLM · Multimodal

0 likes · 15 min read

AIWalker

Apr 20, 2026 · Artificial Intelligence

How VA‑π Bridges Tokenizers and Autoregressive Generators for Pixel‑Perfect Images

VA‑π introduces a lightweight post‑training framework that uses variational inference and reinforcement learning to align tokenizers with visual autoregressive generators, achieving dramatic quality gains, extreme training efficiency, and robust pixel‑level reconstruction across diverse image generation tasks.

Autoregressive Models · Pixel Alignment · post-training

0 likes · 14 min read

Data Party THU

Apr 20, 2026 · Artificial Intelligence

How MemPO Uses Reinforcement Learning to Turn Agent Memory into a Trainable Policy

MemPO introduces a self‑memory policy optimization framework that lets long‑horizon LLM agents autonomously manage and refine their memory via reinforcement learning, using global‑trajectory and informative‑memory advantage estimates, achieving up to 25.98% F1 gain and 73% token reduction on benchmark tasks.

LLM · Long-Horizon Agents · MemPO

0 likes · 8 min read

Baidu Maps Tech Team

Apr 20, 2026 · Artificial Intelligence

How Baidu Maps Reinvents LBS Search with Multi‑Agent AI and RL

Facing the shift from keyword indexing to generative AI, Baidu Maps overhauled its LBS architecture by introducing a native multi‑agent system, a context‑engineering (ACE) framework, and reinforcement‑learning alignment, enabling dynamic routing, knowledge evolution, and a 36% boost in planning compliance while maintaining zero tolerance for factual errors.

AI agents · Context Engineering · LLM

0 likes · 10 min read

Old Zhang's AI Learning

Apr 19, 2026 · Artificial Intelligence

From Zero to Deployment: A Complete Qwen3.5 Fine‑Tuning Guide

This guide shows how to fine‑tune Qwen3.5 models—from 0.8B to 122B—using Unsloth Studio or pure code, covering text SFT, vision fine‑tuning, MoE models, reinforcement‑learning (GRPO), extensive GGUF quantization benchmarks, hardware requirements, export formats, and deployment tips.

LLM · Unsloth · fine-tuning

0 likes · 12 min read

Machine Heart

Apr 19, 2026 · Artificial Intelligence

World Engine: How Post‑Training Is Launching a New Era of Physical AGI

World Engine introduces a post‑training pipeline that combines high‑fidelity 3DGS simulation, hard‑case mining with diffusion generation, and reinforcement‑learning optimization to give autonomous‑driving models true decision‑making ability, surpassing data‑scaling limits and achieving significant safety gains in both industrial simulations and real‑world tests.

Autonomous Driving · Physical AI · Simulation

0 likes · 11 min read

Machine Learning Algorithms & Natural Language Processing

Apr 16, 2026 · Artificial Intelligence

Efficient Reasoning with Reward Shaping: Compressing Qwen 30B‑Series Chains by 20‑40%

The article analyzes how reward‑shaping techniques can shorten the chain‑of‑thought outputs of Qwen 30B‑parameter series models by 20‑40% while preserving or slightly improving performance on AIME‑25 and out‑of‑distribution benchmarks, and it details the experimental design, strategic considerations, and practical insights behind this efficient reasoning approach.

Efficient Inference · Qwen · Reward Shaping

0 likes · 16 min read

AI Explorer

Apr 16, 2026 · Artificial Intelligence

How NVIDIA, HKU, and MIT’s Sol‑RL Framework Supercharges Diffusion Model Training

NVIDIA, the University of Hong Kong (HKU), and MIT introduced the Sol‑RL framework, which uses reinforcement‑learning‑guided sampling to cut diffusion model training time severalfold without sacrificing image quality, potentially lowering entry barriers for small teams and shifting the AIGC industry toward an efficiency‑driven competition.

AIGC · NVIDIA · Sol-RL

0 likes · 6 min read

Xiaohongshu Tech REDtech

Apr 15, 2026 · Artificial Intelligence

How Relax Powers Scalable Multi‑Modal RL Training with Full‑Async Pipelines

Relax, an open‑source reinforcement‑learning engine from Xiaohongshu AI Platform, combines service‑oriented fault‑tolerant architecture, a distributed checkpoint service, and an asynchronous training pipeline to achieve up to 76% speed‑up and near‑zero overhead for multi‑modal RL workloads.

Asynchronous Pipeline · Ray Serve · distributed training

0 likes · 10 min read

SuanNi

Apr 12, 2026 · Artificial Intelligence

How MemPO Gives AI Agents Long‑Term Memory and Cuts Costs by 70%

The paper introduces MemPO, a self‑memory policy optimization algorithm that lets large language model agents actively manage their memory, dramatically improving accuracy on complex multi‑step tasks while reducing token consumption by up to 73%, and validates the approach with extensive experiments and analysis.

AI · Efficiency · Long-term Memory

0 likes · 11 min read

CodeTrend

Apr 11, 2026 · Artificial Intelligence

Inside OpenClaw: Architecture, Core Technologies, and Security Risks

The article provides a detailed technical analysis of the OpenClaw AI‑agent framework, covering its three‑layer architecture, prompt compiler, heartbeat mechanism, file‑based memory, skill system, ReAct loop, model‑agnostic routing, reinforcement‑learning extension, security concerns, and a side‑by‑side comparison with Hermes Agent.

Agent Framework · OpenClaw · file-based memory

0 likes · 13 min read

Machine Heart

Apr 11, 2026 · Artificial Intelligence

How 100,000 Hours of Human Data Propelled Psi‑R2 to Lead MolmoSpaces

Lingchu AI demonstrates that scaling human‑operation data to nearly 100,000 hours, combined with a two‑model system and reinforcement learning, can replace costly robot‑teleoperation data and achieve top performance on the MolmoSpaces benchmark.

Embodied AI · Psi-R2 · Psi-W0

0 likes · 12 min read

AI2ML AI to Machine Learning

Apr 10, 2026 · Artificial Intelligence

Why HermesAgent Outperforms OpenClaw: A Deep Source‑Code Analysis

The article dissects HermesAgent’s architecture, showing how it extends OpenClaw with self‑learning, reinforcement‑learning modules, and advanced prompt‑evolution techniques to curb runaway token costs and achieve more deterministic results, while also detailing its TUI‑driven CLI and evaluation workflow.

DSPy · GEPA · HermesAgent

0 likes · 8 min read

Machine Heart

Apr 10, 2026 · Artificial Intelligence

AdaGen: Enabling Adaptive, Data‑Driven Strategies for Image Generation Models

AdaGen replaces handcrafted static schedules in multi‑step image generators with a universal, learnable policy network trained via reinforcement learning, using an MDP formulation, adversarial rewards and action smoothing, achieving consistent quality and efficiency gains across diffusion, autoregressive, mask and flow models while adding negligible overhead.

MDP · action smoothing · adaptive policy

0 likes · 11 min read

Machine Heart

Apr 9, 2026 · Artificial Intelligence

How TDM‑R1 Boosts Few‑Step Image Generation: GenEval Jumps from 61% to 92% and Beats GPT‑4o

The TDM‑R1 framework introduces a two‑stage reinforcement learning pipeline that lets 4‑step diffusion models achieve a GenEval score of 92%, surpassing 80‑step baselines and GPT‑4o while also fixing instruction compliance, text rendering, and compositional generation issues.

GenEval · OCR improvement · TDM-R1

0 likes · 15 min read

Alibaba Cloud Big Data AI Platform

Apr 9, 2026 · Artificial Intelligence

How Data Flywheels Accelerate Small Agentic Model Training

This article details a data‑flywheel framework for training compact agentic language models, describing synthetic task generation, mock environment simulation, rubric‑based reward design, iterative hard‑sample augmentation, and experimental results that show consistent performance gains across benchmarks.

Synthetic Environments · agentic models · data augmentation

0 likes · 17 min read

Machine Heart

Apr 9, 2026 · Artificial Intelligence

From Direct Generation to Agentic Text-to-Image: Introducing the Open-Source Gen-Searcher

Gen-Searcher equips text-to-image models with search, reasoning, and web‑browsing capabilities, turning the traditional direct‑generation pipeline into an agentic system that fetches and verifies real‑world knowledge, dramatically improving accuracy and quality across multiple benchmarks.

Gen-Searcher · KnowGen · agentic AI

0 likes · 7 min read

Machine Heart

Apr 8, 2026 · Artificial Intelligence

Meta Unveils Muse Spark: The First Model from Its Superintelligence Lab

Meta has launched Muse Spark, its inaugural model from the newly formed Superintelligence Lab, showcasing multimodal capabilities, tool use, visual chain‑of‑thought, and multi‑agent orchestration, while detailing pretraining scaling gains, reinforcement‑learning improvements, and test‑time reasoning efficiencies.

AI scaling · Meta · Muse Spark

0 likes · 9 min read

AIWalker

Apr 6, 2026 · Artificial Intelligence

How TIR‑Agent Turns Image‑Restoration Tools into a Learnable Decision‑Making Agent

The paper introduces TIR‑Agent, an image‑restoration agent that learns a tool‑calling policy via supervised fine‑tuning and reinforcement learning, addressing exploration stagnation and multi‑objective reward imbalance, and demonstrates over 2.5× faster inference and superior multi‑metric performance on synthetic and real degradation datasets.

agent-based AI · computer vision · image restoration

0 likes · 18 min read

DataFunSummit

Apr 5, 2026 · Industry Insights

How Datus AI Is Redefining Data Engineering with an Open‑Source Data Agent

This article examines Datus AI’s open‑source Data Engineering Agent, detailing its architecture, interactive context engineering, evaluation results, and future roadmap, and explains how it tackles the challenges of scaling AI‑driven data workflows.

AI agents · NL2SQL · Open Source

0 likes · 20 min read
