vLLM, llama.cpp, and MLX Embrace Google’s TurboQuant: 8× Memory Savings for Local LLMs
This article reviews how four LLM inference frameworks (oMLX, mlx‑vlm, llama.cpp, and vLLM) are integrating Google’s TurboQuant compression. It covers the reported KV‑cache memory reduction of up to 79%, decoding speed close to full precision, and the current integration status of each project.
Framework status (quick overview)
oMLX (Apple Silicon) – released v0.2.21, supports 128K context with 79% KV‑cache reduction.
mlx‑vlm (Apple Silicon) – PR in progress, Metal kernel implementation approaching full‑precision decoding.
llama.cpp (all platforms) – experimental branch compiled, community evaluating TurboQuant support.
vLLM (CUDA) – detailed six‑step integration plan posted, PR pending.
oMLX: TurboQuant KV‑Cache on macOS
oMLX is a macOS‑optimized local LLM inference server with menu‑bar management, batch processing, and a two‑tier KV cache (memory + SSD). TurboQuant KV‑Cache is toggled via the Admin UI.
Prefill uses full fp16 (zero quality loss); the first decode token quantizes the accumulated KV cache into 3‑bit or 4‑bit codebook indices. Decode attention runs on a fused two‑pass Flash‑Attention Metal kernel that reads directly from the packed indices, avoiding de‑quantisation and intermediate fp16 tensors.
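To make the idea of "codebook indices" concrete, here is a minimal sketch of nearest‑centroid scalar quantization in Python. It is not oMLX's or TurboQuant's actual procedure, which the article does not spell out. The codebook values and the names `CODEBOOK`, `quantize`, and `dequantize` are assumptions chosen for illustration.

```python
from bisect import bisect_left

# Hypothetical 3-bit codebook: 8 sorted centroids. Real codebooks are fitted
# to the data distribution; these values are illustrative only.
CODEBOOK = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]

def quantize(values, codebook=CODEBOOK):
    """Map each float to the index of its nearest centroid."""
    indices = []
    for v in values:
        pos = bisect_left(codebook, v)
        # The nearest centroid is on one side of the insertion point.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(codebook)]
        indices.append(min(candidates, key=lambda i: abs(codebook[i] - v)))
    return indices

def dequantize(indices, codebook=CODEBOOK):
    """Look each index back up in the codebook (lossy reconstruction)."""
    return [codebook[i] for i in indices]

kv_row = [0.1, -1.3, 0.9, 2.4, -0.6]
idx = quantize(kv_row)
print(idx)              # [4, 1, 5, 7, 2]
print(dequantize(idx))  # [0.25, -1.25, 0.75, 1.75, -0.75]
```

Each index fits in 3 bits because the codebook has 8 entries. A fused kernel of the kind described above would read such indices directly instead of materializing the reconstructed fp16 values.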
KV‑cache memory savings for Qwen3.5‑35B‑A3B (3‑bit TurboQuant):
32K context: 735 MB → 195 MB (73% saved)
64K context: 1 407 MB → 327 MB (77% saved)
128K context: 2 749 MB → 589 MB (79% saved) – zero quality loss
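The percentages above can be checked, and converted into compression ratios, with a few lines of Python. The helper name `savings` is ours; the megabyte figures are the ones listed above.

```python
def savings(before_mb, after_mb):
    """Return (percent saved, compression ratio) for a before/after size pair."""
    saved_pct = 100 * (1 - after_mb / before_mb)
    ratio = before_mb / after_mb
    return saved_pct, ratio

rows = [("32K", 735, 195), ("64K", 1407, 327), ("128K", 2749, 589)]
for ctx, before, after in rows:
    pct, ratio = savings(before, after)
    print(f"{ctx}: {pct:.1f}% saved, {ratio:.2f}x smaller")
```

The 128K row works out to roughly 4.7×, which sits inside the 4‑5× range quoted for vLLM later in the article.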
Relative speed compared with fp16 baseline:
Qwen3.5‑35B‑A3B – Prefill 95%, Decode 87%
Qwen3.5‑27B – Prefill 97%, Decode 95%
Installation:
# Install oMLX
brew tap jundot/omlx https://github.com/jundot/omlx
brew install omlx
# Start the service
brew services start omlx
The release also includes oQ+, which adds GPTQ weight optimisation on top of mixed‑precision quantisation, plus batch‑processing acceleration for MoE models. Compressing Qwen3.5‑35B‑A3B (256 experts × 40 layers) takes six minutes, a 15× speedup over sequential processing.
mlx‑vlm: Metal kernels approaching full precision
PR #858 (https://github.com/Blaizzy/mlx-vlm/pull/858) adds a complete TurboQuant inference chain. Five commits introduce the following kernels:
_mse_score_kernel – MSE scoring
_pack_lowbit_kernel / _unpack_lowbit_kernel – low‑bit pack/unpack
_qjl_score_kernel – 1‑bit residual correction
_prod_score_kernel – inner‑product calculation
scaled_dot_product_attention – adapted for the TurboQuant fast‑decode path (single‑query inputs)
Multi‑head optimisation kernels:
_prod_score_multi_kernel – multi‑head batch processing
_mse_weighted_rot_multi_kernel – weighted rotation multi‑head
_prod_score_repeat_kernel – repetition‑mode optimisation
The 4‑bit PolarQuant path adds:
_polar_prod_score_kernel – polar‑coordinate inner product
_polar_turbo_score_repeat_kernel – polar‑coordinate repetition optimisation
Decoding speed reaches 70‑85% of full‑precision performance and continues to improve.
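For readers unfamiliar with low‑bit packing, the following Python sketch shows what a pack/unpack pair does in principle. It is a CPU illustration under our own assumptions (least‑significant‑bit‑first layout, the names `pack_lowbit` and `unpack_lowbit`), not a transcription of the Metal kernels in the PR.

```python
def pack_lowbit(indices, bits):
    """Pack small integers into bytes, least-significant bits first."""
    mask = (1 << bits) - 1
    buffer, filled, out = 0, 0, bytearray()
    for idx in indices:
        if not 0 <= idx <= mask:
            raise ValueError(f"index {idx} does not fit in {bits} bits")
        buffer |= idx << filled
        filled += bits
        while filled >= 8:          # emit every completed byte
            out.append(buffer & 0xFF)
            buffer >>= 8
            filled -= 8
    if filled:
        out.append(buffer & 0xFF)   # flush the partial final byte
    return bytes(out)

def unpack_lowbit(packed, bits, count):
    """Recover `count` integers of width `bits` from packed bytes."""
    mask = (1 << bits) - 1
    buffer, filled, out = 0, 0, []
    for byte in packed:
        buffer |= byte << filled
        filled += 8
        while filled >= bits and len(out) < count:
            out.append(buffer & mask)
            buffer >>= bits
            filled -= bits
    return out

indices = [4, 1, 5, 7, 2, 0, 3, 6]
packed = pack_lowbit(indices, bits=3)
print(len(packed))                                        # 3
print(unpack_lowbit(packed, 3, len(indices)) == indices)  # True
```

Eight 3‑bit indices occupy 3 bytes, compared with 16 bytes for the same eight values stored as fp16.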
llama.cpp: community effort
Issue #20977 (https://github.com/ggml-org/llama.cpp/issues/20977) requests TurboQuant support. Developer @mudler maintains a feat/turbo-quant branch in a fork (https://github.com/mudler/llama.cpp/tree/feat/turbo-quant) that already compiles and runs; evaluation is ongoing.
vLLM: six‑step integration plan
Issue #38171 (https://github.com/vllm-project/vllm/issues/38171) outlines the following steps:
Extend CacheDType with "turboquant".
Create a TurboQuantConfig class using the @register_quantization_config decorator.
Implement the KV‑cache method by subclassing BaseKVCacheMethod and registering the codebook parameters.
Update quantisation detection so is_quantized_kv_cache() recognises TurboQuant.
Implement CUDA/Triton kernels for encoding (writing quantised values to the cache) and decoding (restoring values before attention).
Update memory management to accommodate codebook overhead and variable compression rates.
For cloud inference, vLLM + TurboQuant yields 4‑5× KV‑cache compression, allowing an H100 GPU to serve more concurrent requests and longer contexts.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
