
Baidu Geek Talk
Jan 15, 2025 · Artificial Intelligence

Understanding Large Model Inference Engines and Reducing Token Interval (TPOT)

Large‑model inference engines turn prompts into responses in two phases: a Prefill stage that processes the whole input at once, and an autoregressive Decode stage that emits output tokens one by one. Their latency is measured by TTFT (time to first token) and TPOT (time per output token, i.e. the token interval). Baidu's AIAK suite reduces TPOT by moving tokenization out of the decode loop, scheduling requests into static slots, and executing stages asynchronously, cutting token‑interval latency from roughly 35 ms to about 14 ms and raising GPU utilization to around 75%. It also applies quantization and speculative execution for higher throughput.
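
To make the two metrics concrete: TTFT covers the Prefill phase up to the first emitted token, while TPOT is the average gap between consecutive decode tokens. Below is a minimal sketch of how one might measure both over a streaming generation API; the `stream` iterable is a placeholder assumption for whatever token stream your engine exposes, not part of AIAK.

```python
import time

def measure_ttft_tpot(stream):
    """Measure TTFT and TPOT over an iterable that yields one token at a time.

    `stream` is a hypothetical stand-in for any streaming inference response.
    """
    start = time.perf_counter()
    timestamps = []
    for _ in stream:
        timestamps.append(time.perf_counter())

    if not timestamps:
        raise ValueError("stream produced no tokens")

    # TTFT: delay until the first token arrives (dominated by Prefill)
    ttft = timestamps[0] - start
    # TPOT: average interval between consecutive decode tokens
    if len(timestamps) > 1:
        tpot = (timestamps[-1] - timestamps[0]) / (len(timestamps) - 1)
    else:
        tpot = 0.0
    return ttft, tpot

# Example usage with a dummy token stream:
# ttft, tpot = measure_ttft_tpot(iter(["Hello", ",", " world"]))
```

Under this framing, the optimizations described in the article (detached tokenization, static slot scheduling, asynchronous execution) all attack the per-token interval that `tpot` captures, rather than the one-time Prefill cost in `ttft`.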

AI acceleration · GPU utilization · TPOT
10 min read