Understanding Large Model Inference Engines and Reducing Token Interval (TPOT)
Large‑model inference engines convert prompts into responses through a Prefill stage and an autoregressive Decoder stage, measured by TTFT and TPOT respectively. Baidu's AIAK suite improves TPOT by separating tokenization into its own process, scheduling with static slots, and executing the forward pass asynchronously. Together these cut token‑interval latency from ~35 ms to ~14 ms and raise GPU utilization to about 75 %, while quantization and speculative execution push throughput further.
Large model inference engines are the core runtime that receives prompts and generates responses for generative language models, orchestrating heterogeneous hardware to convert electrical power into human knowledge.
The basic workflow consists of receiving concurrent requests containing prompts and sampling parameters, tokenizing and batching them, scheduling GPU forward inference, processing results, and returning token IDs to users.
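The workflow above can be sketched as a toy serving loop. Everything here is illustrative: the `Request` class, the word-hash "tokenizer", and `engine_step` (which stands in for a batched GPU forward pass) are hypothetical stand-ins, not any real engine's API.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    # Hypothetical request object: a prompt plus sampling parameters.
    prompt: str
    max_new_tokens: int = 4
    token_ids: list = field(default_factory=list)

def tokenize(text):
    # Toy tokenizer: one "token ID" per whitespace-separated word.
    return [hash(w) % 50000 for w in text.split()]

def engine_step(batch):
    # Stand-in for a batched GPU forward pass: one next-token ID per request.
    return [sum(r.token_ids) % 50000 for r in batch]

def serve(requests):
    # 1) tokenize and batch, 2) schedule forward steps, 3) return token IDs.
    for r in requests:
        r.token_ids = tokenize(r.prompt)
    outputs = [[] for _ in requests]
    for _ in range(max(r.max_new_tokens for r in requests)):
        next_ids = engine_step(requests)
        for r, out, t in zip(requests, outputs, next_ids):
            r.token_ids.append(t)
            out.append(t)
    return outputs
```

A real engine interleaves admission and completion of requests inside this loop (continuous batching); the fixed batch here keeps the sketch short.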
The processing is divided into two stages: the Prefill stage, where the model builds a contextual memory of the prompt, and the Decoder (autoregressive) stage, where the model repeatedly predicts the next token until a stop condition is met.
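The two stages can be separated explicitly. In this minimal sketch, `prefill` processes the whole prompt at once to build the contextual memory (the KV cache, represented here as a plain token list), and the decode loop emits one token per step until a stop token appears; `decode_step` is a toy next-token rule, not a real model.

```python
def prefill(prompt_ids):
    # Process the entire prompt in one pass and build the "contextual
    # memory" (KV cache); here the cache is just the running token list.
    return list(prompt_ids)

def decode_step(cache):
    # Toy rule standing in for one autoregressive forward pass.
    return (cache[-1] + 1) % 100

def generate(prompt_ids, stop_token, max_steps=16):
    cache = prefill(prompt_ids)      # Prefill stage -> governs TTFT
    out = []
    for _ in range(max_steps):       # Decoder stage -> governs TPOT
        t = decode_step(cache)
        cache.append(t)              # each new token extends the context
        out.append(t)
        if t == stop_token:          # stop condition ends generation
            break
    return out
```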
Performance is typically measured from the user perspective using Service Level Objectives (SLOs): Time To First Token (TTFT) evaluates the Prefill stage, while Time Per Output Token (TPOT) evaluates the Decoder stage. Throughput is assessed by the maximum Tokens Per Second (TPS) under full load.
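Given per-token arrival timestamps, the three metrics fall out of simple arithmetic; this helper and its argument names are illustrative, not from the article.

```python
def compute_slos(request_start, token_times):
    # token_times: wall-clock timestamps at which each output token arrived.
    ttft = token_times[0] - request_start          # Time To First Token (Prefill)
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0  # mean Time Per Output Token (Decoder)
    tps = len(token_times) / (token_times[-1] - request_start)  # Tokens Per Second
    return ttft, tpot, tps
```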
Popular open‑source inference engines such as vLLM, SGLang, LMDeploy, and TRT‑LLM aim to maximize throughput, but vLLM still incurs significant CPU work during tokenization/detokenization, which lengthens TPOT and reduces GPU utilization.
To reduce TPOT, Baidu’s AIAK suite introduces three optimization layers:
Multi‑process architecture: separates tokenization/detokenization into a Triton model and overlaps it with GPU inference, cutting CPU‑only time by about 10%.
Static Slot scheduling: transforms global scheduling into a local, incremental approach, reuses slot information across steps, and moves token‑level operations to CUDA kernels, shrinking latency from milliseconds to microseconds and avoiding host‑to‑device copies.
Asynchronous execution: runs the forward‑pass task on a background thread while the main thread handles scheduling, communicating via a queue and GPU stream events, achieving near‑zero token interval and 100 % GPU utilization.
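The static-slot idea in the second layer can be sketched as a fixed table of per-request state that persists across decode steps, so each step applies only an incremental delta instead of rebuilding a global schedule. The slot layout and function names below are invented for illustration; AIAK additionally pushes this token-level bookkeeping into CUDA kernels to avoid host-to-device copies.

```python
NUM_SLOTS = 4  # fixed batch capacity; slots are reused across steps

def make_slots():
    return [None] * NUM_SLOTS  # None marks a free slot

def admit(slots, request_id):
    # Incremental change: fill the first free slot, leave others untouched,
    # rather than recomputing the whole batch layout.
    for i, s in enumerate(slots):
        if s is None:
            slots[i] = {"id": request_id, "generated": 0}
            return i
    return -1  # batch is full

def step(slots):
    # Per-step work touches only occupied slots; slot state carries over,
    # so no global rescheduling happens between tokens.
    for s in slots:
        if s is not None:
            s["generated"] += 1
```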
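The asynchronous-execution layer can be approximated with a background worker thread fed through a queue: the main thread keeps scheduling while the worker runs the forward pass, so scheduling cost hides behind inference time. This is a minimal sketch with plain thread synchronization; the article pairs the queue with GPU stream events, and `forward_pass` here is just a timed stand-in.

```python
import queue
import threading
import time

def forward_pass(step_id):
    time.sleep(0.001)  # stand-in for one GPU forward pass
    return step_id * 2

def run_async(num_steps):
    # Main thread schedules; a background thread executes the forward pass.
    # The two communicate through a queue, so scheduling for step N+1
    # overlaps with inference for step N.
    work, results = queue.Queue(), []

    def worker():
        while True:
            step_id = work.get()
            if step_id is None:   # sentinel: no more work
                break
            results.append(forward_pass(step_id))

    t = threading.Thread(target=worker)
    t.start()
    for step_id in range(num_steps):
        work.put(step_id)         # enqueue while the worker is busy
    work.put(None)
    t.join()
    return results
```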
Combined, these techniques lower the token interval from ~35 ms to ~14 ms and raise GPU utilization from 50‑60 % to about 75 %, moving toward the ultimate goal of maximal throughput and zero token‑interval latency.
Beyond these, AIAK also invests in quantization, speculative execution, service‑oriented design, multi‑chip adaptation, and other engineering efforts to deliver a high‑performance, multi‑scenario inference engine.
Baidu Geek Talk