Understanding Large Model Inference Engines and Reducing Token Interval (TPOT)
Large‑model inference engines convert prompts into responses through a Prefill stage and an autoregressive Decoder stage, measured by TTFT and TPOT respectively. Baidu's AIAK suite improves TPOT by separating tokenization into its own process, scheduling with static slots, and executing the forward pass asynchronously. Together these cut token‑interval latency from ~35 ms to ~14 ms and raise GPU utilization to about 75 %, while quantization and speculative execution push throughput further.
Large model inference engines are the core runtime that receives prompts and generates responses for generative language models, orchestrating heterogeneous hardware to convert electrical power into human knowledge.
The basic workflow consists of receiving concurrent requests containing prompts and sampling parameters, tokenizing and batching them, scheduling GPU forward inference, processing results, and returning token IDs to users.
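The workflow above can be sketched as a toy serving loop. Everything here is illustrative: the `Request` class, the word-hash "tokenizer", and `engine_step` (which stands in for a batched GPU forward pass) are hypothetical stand-ins, not any real engine's API.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    # Hypothetical request object: a prompt plus sampling parameters.
    prompt: str
    max_new_tokens: int = 4
    token_ids: list = field(default_factory=list)

def tokenize(text):
    # Toy tokenizer: one "token ID" per whitespace-separated word.
    return [hash(w) % 50000 for w in text.split()]

def engine_step(batch):
    # Stand-in for a batched GPU forward pass: one next-token ID per request.
    return [sum(r.token_ids) % 50000 for r in batch]

def serve(requests):
    # 1) tokenize and batch, 2) schedule forward steps, 3) return token IDs.
    for r in requests:
        r.token_ids = tokenize(r.prompt)
    outputs = [[] for _ in requests]
    for _ in range(max(r.max_new_tokens for r in requests)):
        next_ids = engine_step(requests)
        for r, out, t in zip(requests, outputs, next_ids):
            r.token_ids.append(t)
            out.append(t)
    return outputs
```

A real engine interleaves admission and completion of requests inside this loop (continuous batching); the fixed batch here keeps the sketch short.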
The processing is divided into two stages: the Prefill stage, where the model builds a contextual memory of the prompt, and the Decoder (autoregressive) stage, where the model repeatedly predicts the next token until a stop condition is met.
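The two stages can be separated explicitly. In this minimal sketch, `prefill` processes the whole prompt at once to build the contextual memory (the KV cache, represented here as a plain token list), and the decode loop emits one token per step until a stop token appears; `decode_step` is a toy next-token rule, not a real model.

```python
def prefill(prompt_ids):
    # Process the entire prompt in one pass and build the "contextual
    # memory" (KV cache); here the cache is just the running token list.
    return list(prompt_ids)

def decode_step(cache):
    # Toy rule standing in for one autoregressive forward pass.
    return (cache[-1] + 1) % 100

def generate(prompt_ids, stop_token, max_steps=16):
    cache = prefill(prompt_ids)      # Prefill stage -> governs TTFT
    out = []
    for _ in range(max_steps):       # Decoder stage -> governs TPOT
        t = decode_step(cache)
        cache.append(t)              # each new token extends the context
        out.append(t)
        if t == stop_token:          # stop condition ends generation
            break
    return out
```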
Performance is typically measured from the user perspective using Service Level Objectives (SLOs): Time To First Token (TTFT) evaluates the Prefill stage, while Time Per Output Token (TPOT) evaluates the Decoder stage. Throughput is assessed by the maximum Tokens Per Second (TPS) under full load.
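Given per-token arrival timestamps, the three metrics fall out of simple arithmetic; this helper and its argument names are illustrative, not from the article.

```python
def compute_slos(request_start, token_times):
    # token_times: wall-clock timestamps at which each output token arrived.
    ttft = token_times[0] - request_start          # Time To First Token (Prefill)
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0  # mean Time Per Output Token (Decoder)
    tps = len(token_times) / (token_times[-1] - request_start)  # Tokens Per Second
    return ttft, tpot, tps
```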
Popular open‑source inference engines such as vLLM, SGLang, LMDeploy, and TRT‑LLM aim to maximize throughput, but vLLM still incurs significant CPU work during tokenization/detokenization, which lengthens TPOT and reduces GPU utilization.
To reduce TPOT, Baidu’s AIAK suite introduces three optimization layers:
Multi‑process architecture: separates tokenization/detokenization into a Triton model and overlaps it with GPU inference, cutting CPU‑only time by about 10%.
Static Slot scheduling: transforms global scheduling into a local, incremental approach, reuses slot information across steps, and moves token‑level operations to CUDA kernels, shrinking latency from milliseconds to microseconds and avoiding host‑to‑device copies.
Asynchronous execution: runs the forward‑pass task on a background thread while the main thread handles scheduling, communicating via a queue and GPU stream events, achieving near‑zero token interval and 100 % GPU utilization.
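The static-slot idea in the second layer can be sketched as a fixed table of per-request state that persists across decode steps, so each step applies only an incremental delta instead of rebuilding a global schedule. The slot layout and function names below are invented for illustration; AIAK additionally pushes this token-level bookkeeping into CUDA kernels to avoid host-to-device copies.

```python
NUM_SLOTS = 4  # fixed batch capacity; slots are reused across steps

def make_slots():
    return [None] * NUM_SLOTS  # None marks a free slot

def admit(slots, request_id):
    # Incremental change: fill the first free slot, leave others untouched,
    # rather than recomputing the whole batch layout.
    for i, s in enumerate(slots):
        if s is None:
            slots[i] = {"id": request_id, "generated": 0}
            return i
    return -1  # batch is full

def step(slots):
    # Per-step work touches only occupied slots; slot state carries over,
    # so no global rescheduling happens between tokens.
    for s in slots:
        if s is not None:
            s["generated"] += 1
```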
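The asynchronous-execution layer can be approximated with a background worker thread fed through a queue: the main thread keeps scheduling while the worker runs the forward pass, so scheduling cost hides behind inference time. This is a minimal sketch with plain thread synchronization; the article pairs the queue with GPU stream events, and `forward_pass` here is just a timed stand-in.

```python
import queue
import threading
import time

def forward_pass(step_id):
    time.sleep(0.001)  # stand-in for one GPU forward pass
    return step_id * 2

def run_async(num_steps):
    # Main thread schedules; a background thread executes the forward pass.
    # The two communicate through a queue, so scheduling for step N+1
    # overlaps with inference for step N.
    work, results = queue.Queue(), []

    def worker():
        while True:
            step_id = work.get()
            if step_id is None:   # sentinel: no more work
                break
            results.append(forward_pass(step_id))

    t = threading.Thread(target=worker)
    t.start()
    for step_id in range(num_steps):
        work.put(step_id)         # enqueue while the worker is busy
    work.put(None)
    t.join()
    return results
```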
Combined, these techniques lower the token interval from ~35 ms to ~14 ms and raise GPU utilization from 50‑60 % to about 75 %, moving toward the ultimate goal of maximal throughput and zero token‑interval latency.
Beyond these, AIAK also invests in quantization, speculative execution, service‑oriented design, multi‑chip adaptation, and other engineering efforts to deliver a high‑performance, multi‑scenario inference engine.
Baidu Geek Talk