Tagged articles

13 articles

Page 1 of 1

May 20, 2026 · Artificial Intelligence

AutoTTS Shows How AI Agents Can Outperform Human‑Designed Test‑Time Scaling Strategies

The paper “LLMs Improving LLMs” introduces AutoTTS, an environment where a Claude‑based explorer agent automatically searches test‑time scaling policies, achieving up to 69.5% token savings and superior accuracy on unseen models, all for $39.9 and 160 minutes without any LLM calls during evaluation.

AutoTTSClaudeLLM agents

0 likes · 7 min read

AutoTTS Shows How AI Agents Can Outperform Human‑Designed Test‑Time Scaling Strategies

Machine Heart

May 11, 2026 · Artificial Intelligence

How PRISM Enables Efficient Test‑Time Scaling for Discrete Diffusion Language Models

The article analyzes how the PRISM framework redesigns test‑time scaling for discrete diffusion language models by replacing costly Best‑of‑N sampling with a three‑stage hierarchical search, local branching via partial remasking, and self‑verified feedback, achieving large accuracy gains on math and code benchmarks while cutting inference compute by up to four‑fold.

Discrete DiffusionHierarchical SearchInference Optimization

0 likes · 11 min read

How PRISM Enables Efficient Test‑Time Scaling for Discrete Diffusion Language Models

Machine Heart

Apr 25, 2026 · Artificial Intelligence

Open‑Source Models Dominate 21 Scientific Discovery Tasks with SimpleTES

The SimpleTES framework decomposes trial‑and‑error into three scalable dimensions—Concurrency, Length, and Candidates—enabling test‑time scaling that lets open‑source models outperform closed‑source rivals across 21 diverse scientific benchmarks, from LASSO regression to quantum circuit compilation.

AI for ScienceOpen-source modelsScientific Discovery

0 likes · 13 min read

Open‑Source Models Dominate 21 Scientific Discovery Tasks with SimpleTES

Machine Heart

Apr 25, 2026 · Artificial Intelligence

Can Multi-Model Co-Evolution Shatter the Single-Model Ceiling? Squeeze Evolve Achieves Validator-Free SOTA Inference

The paper introduces Squeeze Evolve, a validator‑free multi‑model evolutionary framework that orchestrates diverse large language models to break the performance ceiling of any single model, delivering up to 23‑point accuracy improvements and 1.4‑3.3× cost reductions across math, vision, and scientific benchmarks.

AI researchInference OptimizationSqueeze Evolve

0 likes · 8 min read

Can Multi-Model Co-Evolution Shatter the Single-Model Ceiling? Squeeze Evolve Achieves Validator-Free SOTA Inference

Old Zhang's AI Learning

Jan 27, 2026 · Artificial Intelligence

Qwen3‑Max‑Thinking Boosts Performance with Test‑Time Scaling—Why It Still Isn’t Open‑Source

Alibaba’s new Qwen3‑Max‑Thinking model adds inference‑time scaling and adaptive tool use, delivering large gains on math, coding, and agent benchmarks while remaining closed‑source, and it offers drop‑in OpenAI‑compatible API access at the cost of higher latency and token usage.

AI BenchmarkAdaptive Tool UseLarge Language Model

0 likes · 7 min read

Qwen3‑Max‑Thinking Boosts Performance with Test‑Time Scaling—Why It Still Isn’t Open‑Source

Data Party THU

Oct 29, 2025 · Artificial Intelligence

Can Test-Time Scaling Unlock More Reliable Vision‑Language‑Action Robots?

The paper introduces RoboMonkey, a framework that applies a generate‑and‑verify paradigm and test‑time scaling to Vision‑Language‑Action models, showing that increasing sampling and verification at inference dramatically reduces action error across multiple VLA architectures, and presents scalable verifier training, synthetic data augmentation, and efficient deployment strategies.

AI researchAction VerificationRoboMonkey

0 likes · 8 min read

Can Test-Time Scaling Unlock More Reliable Vision‑Language‑Action Robots?

vivo Internet Technology

Aug 25, 2025 · Artificial Intelligence

How DiMo-GUI Boosts Multimodal LLMs for GUI Grounding Without Training

DiMo-GUI is a plug‑and‑play framework that dramatically improves multimodal large language models' ability to locate GUI elements by using a hierarchical dynamic visual reasoning loop and modality‑aware optimization, achieving up to double the performance on high‑resolution GUI benchmarks without any additional training data.

GUI groundingTest-Time Scalingdynamic visual reasoning

0 likes · 7 min read

How DiMo-GUI Boosts Multimodal LLMs for GUI Grounding Without Training

Kuaishou Large Model

Jul 3, 2025 · Artificial Intelligence

How EvoSearch Boosts Image & Video Generation with Test‑Time Evolutionary Search

The EvoSearch method introduced by HKUST and Kuaishou’s KuaLing team leverages test‑time scaling to dramatically improve diffusion‑based image and video generation without training, using evolutionary search along the denoising trajectory, achieving state‑of‑the‑art results on SD2.1, Flux‑1‑dev and other models.

Evolutionary SearchTest-Time ScalingVideo Generation

0 likes · 8 min read

How EvoSearch Boosts Image & Video Generation with Test‑Time Evolutionary Search

Kuaishou Tech

Jul 2, 2025 · Artificial Intelligence

How EvoSearch Supercharges Image and Video Generation with Test‑Time Evolutionary Search

EvoSearch, a test‑time evolutionary search method, dramatically improves image and video generation by increasing inference compute without extra training, outperforming existing scaling techniques on diffusion and flow models while maintaining robustness and diversity across multiple benchmarks.

AI researchEvolutionary SearchTest-Time Scaling

0 likes · 8 min read

How EvoSearch Supercharges Image and Video Generation with Test‑Time Evolutionary Search

Fighter's World

Jun 28, 2025 · Artificial Intelligence

What Is the Generator‑Verifier Gap and Why It Matters for LLM Reasoning

The article explains the Generator‑Verifier Gap (GVG)—the asymmetry where verifying a solution is far cheaper than generating it—covers its origin, its impact on test‑time scaling for large language models, reinforcement‑learning approaches, and how the concept can shape agent architectures and AI product strategy.

Agent ArchitectureGenerator-Verifier GapLLM

0 likes · 21 min read

What Is the Generator‑Verifier Gap and Why It Matters for LLM Reasoning

AI Frontier Lectures

Apr 4, 2025 · Artificial Intelligence

Why Test‑Time Scaling Is Revolutionizing LLM Reasoning in 2025

This article surveys the latest research on large language model reasoning, highlighting test‑time scaling methods, chain‑of‑thought variants, and novel inference‑time techniques that boost performance while exposing trade‑offs, costs, and future directions for AI developers.

AILLMTest-Time Scaling

0 likes · 26 min read

Why Test‑Time Scaling Is Revolutionizing LLM Reasoning in 2025

Architect

Feb 19, 2025 · Artificial Intelligence

Does Scaling Law Still Hold for Grok 3? A Deep Dive into LLM Training Economics

The article critically examines whether the pre‑training Scaling Law still applies to Grok 3, compares its compute usage and model size with DeepSeek and OpenAI models, evaluates the cost‑effectiveness of pre‑training, RL and test‑time scaling, and explores how these insights shape future large‑language‑model development strategies.

Grok-3Pre‑trainingRL scaling

0 likes · 11 min read

Does Scaling Law Still Hold for Grok 3? A Deep Dive into LLM Training Economics

Baobao Algorithm Notes

Sep 29, 2024 · Artificial Intelligence

Decoding OpenAI o1: Test‑Time Scaling, PRM Search & Inference Strategies

This article analyses the training tricks behind OpenAI's o1 model, explaining test/inference‑time scaling laws, post‑training techniques, process‑supervised reward models (PRM), various inference‑time search methods, data‑collection pipelines, and the trade‑offs between allocating compute to pre‑training versus inference.

LLM inferenceOpenAI o1Reward Model

0 likes · 34 min read