RTPurbo: >97% Sparsity and 9× Faster Long-Context LLM Inference with Minimal Training

The article presents RTPurbo, a lightweight two‑stage training method that converts full‑attention LLMs into highly sparse models with over 97% sparsity, achieving up to 9.36× prefill and 2.01× decode speedups while preserving near‑lossless accuracy across long‑context benchmarks up to 512K tokens.

Dynamic Token Selection · Kernel Optimization · LLM inference

17 min read

AI Frontier Lectures

Jul 13, 2025 · Artificial Intelligence

How HarmoniCa Boosts Diffusion Model Speed with Joint Training‑Inference Caching

HarmoniCa, a new feature‑caching framework co‑designed by HKUST, Beihang University, and SenseTime, tackles diffusion model inference bottlenecks by aligning training and inference through Step‑Wise Denoising Training and an Image Error Proxy Objective, achieving up to 2× speedup while preserving image quality.

Performance Acceleration · diffusion models · feature caching

9 min read

Code DAO

Dec 11, 2021 · Artificial Intelligence

Nimble: A Lightweight Parallel GPU Scheduler Boosting Deep Learning Performance

The article analyzes how Nimble reduces GPU scheduling overhead and enables parallel execution through ahead‑of‑time scheduling and automatic multi‑stream assignment, achieving up to 22.3× inference speedup over PyTorch and significantly improving GPU utilization for deep learning workloads.

GPU scheduling · Parallel Execution · Performance Acceleration

9 min read
