Tagged articles

27 articles

May 29, 2026 · Artificial Intelligence

Beyond TurboQuant: Introducing a True 2‑bit KV Quantization for Long‑Context LLM Inference

OSCAR, a new attention‑aware 2‑bit KV cache quantization method, cuts KV memory by up to 8×, delivers up to 3× decode speedup and 7× throughput gain, and matches BF16 accuracy across 4B‑32B models on diverse long‑context reasoning tasks, surpassing TurboQuant.

2-bit compression · KV Cache · LLM Quantization

0 likes · 12 min read

Xiaomi Tech

May 26, 2026 · Artificial Intelligence

MiMo V2.5 API Gets Permanent Price Cut and Token Plan Overhaul – Incentive Program Ends

MiMo announces a permanent price cut of up to 99% for its V2.5 API, a 5‑8× usage boost in its Token Plan billing, a full reset of all Token Plan quotas, and the conclusion of its Hundred‑Trillion Token Creator Incentive Program, effective May 27, 2026.

AI infrastructure · API pricing · Inference Optimization

0 likes · 5 min read

Machine Heart

May 14, 2026 · Artificial Intelligence

How China’s MUSA GPU Backend Earned Native Support in SGLang’s Mainline

The recent SGLang × MUSA meetup revealed that MUSA’s GPU backend has been merged into SGLang’s official codebase, delivering zero‑learning‑cost integration, performance gains of up to 66% on DeepSeek‑V4, and a growing ecosystem of adapters, high‑performance kernels, and distributed inference support.

AI inference · DeepSeek · GPU

0 likes · 12 min read

Old Zhang's AI Learning

May 11, 2026 · Artificial Intelligence

Ling-2.6-1T: 1T‑Parameter, Fast‑Thinking, Agent‑Ready Model After DeepSeek‑V4

Ant Group's Ling‑2.6‑1T, a 1‑trillion‑parameter LLM built for token efficiency and fast thinking, outperforms competing models on elite reasoning and agentic benchmarks, offers easy local deployment via vLLM or SGLang, provides a quantized 3.6‑bit version, and includes practical usage tips for developers and knowledge workers.

Agentic Model · Claude Code Integration · Ling-2.6-1T

0 likes · 12 min read

Machine Heart

May 8, 2026 · Industry Insights

How SGLang’s $100M Seed Funding Powers the Next‑Gen Open AI Infrastructure

RadixArk raised a $100 million seed round backed by top hardware and AI investors to turn the open‑source SGLang inference engine and the Miles RL framework into day‑0 standards, aiming to democratize AI infrastructure and eliminate bottlenecks from training to inference.

AI infrastructure · DeepSeek V4 · Hardware‑agnostic AI

0 likes · 10 min read

Old Zhang's AI Learning

May 1, 2026 · Artificial Intelligence

DeepSeek‑V4 Local Deployment: How SGLang Overcomes the Architecture Challenges

The article analyzes DeepSeek‑V4's architectural innovations—including mixed sparse attention, mHC, and native FP4 weights—explains SGLang's ShadowRadix, HiSparse, and in‑graph speculative decoding solutions, presents benchmark gains, provides Docker deployment steps, and warns of key pitfalls for long‑context inference.

DeepSeek V4 · HiSparse · SGLang

0 likes · 15 min read

Old Zhang's AI Learning

Apr 20, 2026 · Artificial Intelligence

Kimi K2.6: The Most Powerful Open-Source Agent Model – Architecture, Benchmarks, and Deployment Guide

Kimi K2.6, an open-source 1-trillion-parameter MoE model, expands Agent capabilities with 256K context, multimodal inputs, and the ability to coordinate 300 sub-Agents over 4,000 steps, achieving top scores on benchmarks like Terminal-Bench 2.0, SWE-Bench Pro, and BrowseComp, while offering flexible deployment via vLLM, SGLang, and KTransformers.

Agent Model · Deployment · KTransformers

0 likes · 11 min read

Old Zhang's AI Learning

Apr 14, 2026 · Artificial Intelligence

Qwen3.5-27B-DFlash Delivers Up to 5× Faster Inference Without Quality Loss

The DFlash approach replaces speculative decoding’s autoregressive drafter with a block diffusion model and injects target‑model hidden features into every KV‑cache layer, achieving up to 5× speed‑up for Qwen3.5‑27B on a single GPU and 1.5–1.9× under high‑concurrency workloads while preserving output quality.

DFlash · Inference Acceleration · SGLang

0 likes · 12 min read

Old Zhang's AI Learning

Apr 12, 2026 · Artificial Intelligence

Deploy the Open‑Source MiniMax‑M2.7 Model Locally: Step‑by‑Step Guide

MiniMax‑M2.7, the newly open‑sourced 230‑billion‑parameter MoE model, offers self‑evolution, professional software engineering and agent capabilities, and can be deployed locally using Ollama, vLLM, SGLang or Docker with 4‑8 H200 GPUs, while the article details hardware needs, performance gains and tool‑calling/Thinking features.

Deployment · GPU · LLM

0 likes · 11 min read

Old Zhang's AI Learning

Apr 11, 2026 · Artificial Intelligence

Mastering SGLang: KV Cache and RadixAttention for Faster LLM Inference

This article reviews the DeepLearning.AI short course on SGLang, explains why large‑language‑model inference is slow, details how KV Cache reduces the per‑token decoding computation from O(n²) to O(n), introduces RadixAttention for cross‑request caching, and presents code examples and benchmark results showing up to 10× speedup in real‑world RAG scenarios.

KV Cache · LLM inference · Performance Optimization

0 likes · 13 min read

AI Engineering

Mar 25, 2026 · Artificial Intelligence

Is “Harness Engineering” Just Rebranded Engineering Common Sense?

The article examines the hype around “harness engineering” in LLM workflows, showing through SGLang’s multi‑agent experience that the approach merely repackages established software‑engineering principles such as separation of concerns, docs‑as‑code, and structured routing, and discusses its limits and future implications.

Harness Engineering · LLM · SGLang

0 likes · 8 min read

DeepHub IMBA

Mar 3, 2026 · Artificial Intelligence

The Evolution of KV Cache Management: From Continuous Allocation to Unified Hybrid Memory Architecture

The article traces five eras of KV cache management for LLM inference—from its absence before Transformers to the emerging unified hybrid memory architecture—comparing vLLM, SGLang, and TensorRT‑LLM and offering a decision framework for selecting the right solution in various deployment scenarios.

KV Cache · LLM inference · PagedAttention

0 likes · 16 min read

Alibaba Cloud Infrastructure

Jan 30, 2026 · Artificial Intelligence

Deploy Kimi 2.5 LLM on Alibaba Cloud with SGLang, RBG, and Openclaw

This guide walks through preparing the Kimi 2.5 model, uploading it to OSS, configuring persistent storage, and using SGLang, RoleBasedGroup, and Openclaw to deploy a production‑grade inference service on Alibaba Cloud Kubernetes with step‑by‑step commands and YAML examples.

AI · Deployment · Kimi

0 likes · 14 min read

Old Zhang's AI Learning

Jan 23, 2026 · Artificial Intelligence

Open‑Source GLM‑ASR‑Nano‑2512: Chinese Dialect‑Optimized Speech Recognition on Consumer‑Grade GPUs

GLM‑ASR‑Nano‑2512, a 1.5 B‑parameter open‑source speech‑recognition model released in December 2025, delivers state‑of‑the‑art accuracy on Chinese dialects and low‑volume audio, outperforms Whisper V3 on benchmark tests, runs on consumer GPUs, and provides detailed installation and deployment guides for transformers, vLLM and SGLang.

Chinese dialects · GLM-ASR-Nano-2512 · Open Source

0 likes · 11 min read

AI Engineering

Jan 22, 2026 · Industry Insights

SGLang Spins Out as RadixArk with $400M Valuation Amid Inference Infrastructure Boom

SGLang, the open‑source inference accelerator, has been spun out into RadixArk—a $400 million‑valued startup aiming to democratize AI infrastructure, while the broader market sees a surge of funding for inference‑focused companies.

AI inference · AI infrastructure · RadixArk

0 likes · 5 min read

MaGe Linux Operations

Jan 6, 2026 · Artificial Intelligence

How SGLang Boosted LLM Inference on H800 GPUs to 420 Tokens/s

This guide details how switching from vLLM to SGLang on eight NVIDIA H800 GPUs increased Llama‑3‑70B‑Instruct throughput from 180 to 420 tokens per second, covering SGLang’s core innovations, environment setup, configuration tweaks, performance benchmarks, troubleshooting tips, and production‑grade deployment scripts.

FlashInfer · GPU optimization · H800

0 likes · 19 min read

Alibaba Cloud Developer

Dec 23, 2025 · Artificial Intelligence

How Hybrid Transformer‑Mamba Architectures Overcome KVCache Challenges in Large‑Model Inference

This article explains how SGLang’s hybrid model design combines Transformer attention with Mamba state‑space layers, introduces a dual‑pool memory architecture and elastic allocation, and presents specialized prefix‑cache and speculative‑decoding techniques that together enable efficient, scalable inference for long‑context large language models.

Inference Optimization · KVCache · SGLang

0 likes · 22 min read

Baidu Intelligent Cloud Tech Hub

Dec 17, 2025 · Artificial Intelligence

How AFD Splits Attention and FFN to Boost DeepSeek‑V3 Inference by Up to 19%

The article details the Attention‑FFN Disaggregation (AFD) technique used by Baidu Baige to separate self‑attention and feed‑forward network stages in DeepSeek‑V3 models, describing multi‑stage scheduling, three‑batch overlap, communication optimizations, and performance results that achieve up to 19% throughput improvement under a 100 ms SLO.

3BO · AFD · Attention-FFN Disaggregation

0 likes · 17 min read

Baidu Intelligent Cloud Tech Hub

Nov 19, 2025 · Artificial Intelligence

Boost LLM Inference Speed with Token‑Level Two‑Chunk Overlap

Token‑level Two‑Chunk Overlap replaces traditional batch‑level Two‑Batch Overlap, dynamically splitting sequences into balanced token chunks, enabling near‑equal compute and communication times, improving GPU utilization and achieving up to 30% throughput gains in heterogeneous request workloads, with zero accuracy loss.

Batch scheduling · GPU utilization · LLM inference

0 likes · 9 min read

Meituan Technology Team

Sep 11, 2025 · Artificial Intelligence

How LongCat-Flash Achieves Ultra-Fast, Low-Cost AI Agent Inference with SGLang

LongCat-Flash, an open‑source Mixture‑of‑Experts model released by Meituan, leverages model‑system co‑design, PD‑disaggregation, SBO scheduling and large‑scale expert parallelism within the SGLang framework to deliver dramatically lower latency, higher throughput and cost‑effective inference for AI agents, with detailed deployment instructions provided.

LongCat-Flash · Low latency · Mixture of Experts

0 likes · 15 min read

Volcano Engine Developer Services

Jul 17, 2025 · Artificial Intelligence

How Distributed KVCache (EIC) Revolutionizes Large‑Model Inference Performance

This article examines how Volcano Engine's Elastic Instant Cache (EIC) tackles the memory bottleneck, high‑concurrency latency, and cross‑node coordination challenges of large language model inference by decoupling storage and computation, pooling resources, and applying layered optimizations, ultimately boosting AI inference efficiency, scalability, and cost‑effectiveness across various deployment scenarios.

AI infrastructure · KVCache · LLM inference

0 likes · 30 min read

Baobao Algorithm Notes

Jun 3, 2025 · Artificial Intelligence

How to Train a 671B‑Scale Model with RL: Insights from a verl Internship

This article shares a detailed, first‑hand analysis of the technical challenges, framework choices, memory management, weight conversion, precision alignment, and efficiency optimizations encountered while building reinforcement‑learning pipelines for a 671‑billion‑parameter model using the verl ecosystem.

GPU Memory Management · Large Models · Megatron

0 likes · 16 min read

Architect's Alchemy Furnace

May 7, 2025 · Artificial Intelligence

Which LLM Inference Engine Reigns Supreme? A Deep Dive into Transformers, vLLM, Llama.cpp, SGLang, MLX and Ollama

This article provides a comprehensive comparison of seven popular large‑language‑model inference engines—Transformers, vLLM, Llama.cpp, SGLang, MLX, Ollama and others—detailing their core features, performance characteristics, hardware compatibility, concurrency support, and ideal use‑cases, plus practical installation guidance for Xinference.

LLM · MLX · SGLang

0 likes · 17 min read

Liangxu Linux

Apr 28, 2025 · Artificial Intelligence

Deploy DeepSeek‑R1 on Your Server in 15 Minutes with Zero Code

This guide shows how to use the lightweight OpenStation platform to install, configure, and launch the DeepSeek‑R1 large model on a personal server in under 15 minutes, covering zero‑code deployment, resource management, inference engine selection, and integration with CherryStudio.

AI model deployment · CherryStudio · DeepSeek-R1

0 likes · 7 min read

Zhihu Tech Column

Mar 14, 2025 · Artificial Intelligence

Insights from Zhihu’s ZhiLight Large Model Inference Framework: Architecture, Parallelism, and Performance Optimizations

The article summarizes Zhihu’s technical talk on the ZhiLight large‑model inference framework, detailing model execution mechanisms, GPU load analysis, multi‑GPU parallel strategies, open‑source engine comparisons, compute‑communication overlap, quantization techniques, benchmark results, and future directions for scalable LLM deployment.

GPU parallelism · SGLang · Tensor Parallelism

0 likes · 11 min read

Meituan Technology Team

Mar 6, 2025 · Artificial Intelligence

INT8 Quantization and Inference Optimization of DeepSeek R1 Model

Meituan’s search and recommendation team converted the FP8‑only DeepSeek‑R1 model to INT8 by first casting weights to BF16 and then applying block‑wise or channel‑wise quantization, which preserves GSM8K and MMLU accuracy while delivering 33% to 50% higher throughput on A100‑80G GPUs. The team released the SGLang‑based inference scripts and quantized weights publicly, enabling deployment on older NVIDIA hardware without accuracy loss.

DeepSeek-R1 · GPU deployment · INT8 Quantization

0 likes · 11 min read

Architects' Tech Alliance

Feb 27, 2025 · Artificial Intelligence

How Inspur Metabrain R1 Server Enables 1000+ Concurrent Users for DeepSeek 671B via SGLang Optimization

The Inspur Metabrain R1 inference server, equipped with FP8 acceleration and a 1128 GB HBM3e memory pool, has been tightly integrated with SGLang 0.4.3 to run the 671‑billion‑parameter DeepSeek R1 model, delivering over 1,000 concurrent user sessions and up to 3,976 tokens/s throughput.

AI server · DeepSeek · Inference Optimization

0 likes · 5 min read
