RTPurbo: >97% Sparsity and 9× Faster Long-Context LLM Inference with Minimal Training
The article presents RTPurbo, a lightweight two‑stage training method that converts full‑attention LLMs into highly sparse models with over 97% sparsity, achieving up to 9.36× prefill and 2.01× decode speedups while preserving near‑lossless accuracy across long‑context benchmarks up to 512K tokens.
