Tagged articles
133 articles
Baobao Algorithm Notes
Jun 3, 2025 · Artificial Intelligence

How to Train a 671B‑Scale Model with RL: Insights from a verl Internship

This article shares a detailed, first‑hand analysis of the technical challenges, framework choices, memory management, weight conversion, precision alignment, and efficiency optimizations encountered while building reinforcement‑learning pipelines for a 671‑billion‑parameter model using the verl ecosystem.

GPU Memory Management · Large Models · Megatron
0 likes · 16 min read
Architect's Alchemy Furnace
May 7, 2025 · Artificial Intelligence

Which LLM Inference Engine Reigns Supreme? A Deep Dive into Transformers, vLLM, Llama.cpp, SGLang, MLX and Ollama

This article provides a comprehensive comparison of seven popular large‑language‑model inference engines—Transformers, vLLM, Llama.cpp, SGLang, MLX, Ollama and others—detailing their core features, performance characteristics, hardware compatibility, concurrency support, and ideal use‑cases, plus practical installation guidance for Xinference.

LLM · MLX · SGLang
0 likes · 17 min read
AIWalker
May 6, 2025 · Artificial Intelligence

SimpleAR: High‑Quality 1024×1024 Images with Just 0.5B Parameters via Pretraining, SFT, and RL

SimpleAR demonstrates that a vanilla autoregressive model with only 0.5 B parameters can generate high‑fidelity 1024×1024 images, covering pretraining, supervised fine‑tuning, and reinforcement learning, achieving competitive GenEval (0.59) and DPG‑Bench (79.66) scores while reducing inference time to about 14 seconds with vLLM and KV‑cache optimizations.

Supervised Fine‑Tuning · autoregressive · benchmark
0 likes · 14 min read
Liangxu Linux
Apr 28, 2025 · Artificial Intelligence

Deploy DeepSeek‑R1 on Your Server in 15 Minutes with Zero Code

This guide shows how to use the lightweight OpenStation platform to install, configure, and launch the DeepSeek‑R1 large model on a personal server in under 15 minutes, covering zero‑code deployment, resource management, inference engine selection, and integration with CherryStudio.

AI model deployment · CherryStudio · DeepSeek-R1
0 likes · 7 min read
Alibaba Cloud Infrastructure
Apr 16, 2025 · Artificial Intelligence

Optimizing Multi‑Node Distributed LLM Inference with ACK Gateway and vLLM

This article presents a step‑by‑step guide for deploying and optimizing large‑language‑model inference across multiple GPU‑enabled nodes using ACK Gateway with Inference Extension, vLLM’s tensor‑ and pipeline‑parallel techniques, and Kubernetes resources such as LeaderWorkerSet, PVCs, and custom routing policies, followed by performance benchmarking and analysis.

ACK Gateway · Kubernetes · LLM
0 likes · 19 min read
Alibaba Cloud Developer
Apr 7, 2025 · Artificial Intelligence

Why Does GPU Memory Keep Growing in DeepSeek‑R1 Inference? Uncovering PyTorch’s Cache

After the full‑precision DeepSeek‑R1 model was deployed on a 2×8‑GPU ACS cluster, repeated stress tests showed GPU memory usage rising continuously without being released. This article details the investigation, reproduces the behavior, examines vLLM logs and Prometheus metrics, identifies PyTorch’s caching allocator as the root cause, and offers mitigation tips.

DeepSeek · GPU Memory · Memory Cache
0 likes · 21 min read
Infra Learning Club
Apr 4, 2025 · Artificial Intelligence

Testing Augment Code: A Powerful New Rival to Cursor

The article evaluates Augment Code, an AI‑powered coding assistant with a 200K‑token context, persistent memory, multimodal input, and top SWE‑bench scores. It walks through installation, explores the assistant's use on the vLLM codebase and PagedAttention, demonstrates adding a new model and auto‑generating a WeChat mini‑program, and compares its capabilities and speed to Cursor.

AI coding assistant · Augment Code · Cursor
0 likes · 8 min read
Alibaba Cloud Observability
Mar 24, 2025 · Artificial Intelligence

Achieving Full Observability for AI Inference Apps with Prometheus

This article explores the observability challenges of AI inference services, outlines a comprehensive Prometheus‑based metric collection strategy, and demonstrates practical monitoring implementations for Ray Serve, vLLM, GPU resources, and custom metrics to build stable, high‑performance inference pipelines.

AI inference · Observability · Prometheus
0 likes · 19 min read
ByteDance Cloud Native
Mar 20, 2025 · Artificial Intelligence

How to Deploy DeepSeek‑R1 671B on AIBrix: Multi‑Node GPU Inference in Hours

This guide explains how to use the AIBrix distributed inference platform to deploy the massive DeepSeek‑R1 671B model across multiple GPU nodes, covering cluster setup, custom vLLM images, storage options, RDMA networking, autoscaling, request handling, and observability, turning a weeks‑long deployment into an hour‑scale process.

AIBrix · DeepSeek-R1 · GPU cluster
0 likes · 14 min read
Alibaba Cloud Developer
Mar 18, 2025 · Artificial Intelligence

How to Build a Full‑Stack Observability Solution for AI Inference with Prometheus

This article explores the monitoring challenges of large‑scale AI inference services, outlines the key observability requirements, and provides a complete Prometheus‑based metric collection framework—including Ray Serve and vLLM integrations—to help developers build stable, high‑performance inference applications.

AI inference · Prometheus · Ray Serve
0 likes · 21 min read
Alibaba Cloud Infrastructure
Mar 17, 2025 · Cloud Native

Boost LLM Inference with ACK Gateway AI Extension: A Step‑by‑Step Guide

This guide demonstrates how to deploy the QwQ‑32B large language model on an Alibaba Cloud ACK cluster, configure OSS storage, enable the ACK Gateway with AI Extension, set up InferencePool and InferenceModel resources, and benchmark intelligent routing versus standard gateway routing, revealing latency and throughput improvements.

ACK Gateway · AI Extension · Kubernetes
0 likes · 16 min read
Zhihu Tech Column
Mar 14, 2025 · Artificial Intelligence

Insights from Zhihu’s ZhiLight Large Model Inference Framework: Architecture, Parallelism, and Performance Optimizations

The article summarizes Zhihu’s technical talk on the ZhiLight large‑model inference framework, detailing model execution mechanisms, GPU load analysis, multi‑GPU parallel strategies, open‑source engine comparisons, compute‑communication overlap, quantization techniques, benchmark results, and future directions for scalable LLM deployment.

GPU parallelism · SGLang · Tensor Parallelism
0 likes · 11 min read
Alibaba Cloud Infrastructure
Mar 9, 2025 · Cloud Computing

Deploy QwQ-32B LLM Inference on Alibaba Cloud ACS with vLLM: Step‑by‑Step Guide

This guide walks you through using Alibaba Cloud Container Compute Service (ACS) to provision GPU resources, prepare the QwQ-32B model, configure persistent storage, deploy the model with vLLM, set up OpenWebUI, verify the service, and optionally benchmark its performance, all with detailed commands and YAML examples.

ACS · Alibaba Cloud · GPU
0 likes · 17 min read
AIWalker
Feb 27, 2025 · Artificial Intelligence

Step-by-Step Guide to Deploying, Testing, and Optimizing DeepSeek‑R1: A Complete Tutorial

This article provides a comprehensive, hands‑on guide for installing and configuring DeepSeek‑R1 with Ollama and vLLM, setting up multi‑node multi‑GPU environments, running performance benchmarks, optimizing runtime parameters, and even generating a full PyTorch distributed‑training script.

DeepSeek-R1 · GPU optimization · LLM deployment
0 likes · 39 min read
Alibaba Cloud Big Data AI Platform
Feb 25, 2025 · Artificial Intelligence

Accelerate DeepSeek‑V2‑Lite Deployment with FlashMLA: A Step‑by‑Step Guide

This tutorial walks users through installing FlashMLA, integrating it with the vLLM framework, downloading the DeepSeek‑V2‑Lite‑Chat model, benchmarking various MLA implementations, and running a local inference demo that shows FlashMLA’s speed advantage on long‑sequence generation.

DeepSeek · FlashMLA · InferenceOptimization
0 likes · 16 min read
Alibaba Cloud Native
Feb 18, 2025 · Cloud Native

Deploy DeepSeek‑R1 on Alibaba Cloud ACK One Using ACS GPU in Minutes

This guide shows how to overcome on‑premise compute limits by registering a local Kubernetes cluster to Alibaba Cloud ACK One, provisioning ACS GPU resources, and deploying the DeepSeek‑R1 inference model with the vLLM framework through a series of concrete commands and YAML configurations.

ACK One · ACS GPU · DeepSeek
0 likes · 15 min read
Alibaba Cloud Native
Feb 13, 2025 · Artificial Intelligence

Tackling the ‘Impossible Triangle’: Scaling vLLM on Alibaba Cloud GPU Reservations

This article examines the performance, cost, and stability challenges of large‑scale vLLM deployments, explains the “impossible triangle” dilemma, and provides a detailed, cloud‑native solution using Alibaba Cloud Function Compute GPU reserved instances with step‑by‑step deployment instructions and code examples.

Alibaba Cloud · GPU Reserved Instances · deployment guide
0 likes · 14 min read
Alibaba Cloud Infrastructure
Feb 13, 2025 · Cloud Computing

Deploy DeepSeek‑R1 LLM on Alibaba Cloud ACK One with ACS GPU in Minutes

This guide walks you through deploying the DeepSeek‑R1 large‑language‑model inference service on Alibaba Cloud ACK One registered clusters using ACS GPU compute, covering model preparation, OSS storage setup, PersistentVolume configuration, arena‑based service deployment, and verification steps with concrete commands and parameters.

ACK One · ACS GPU · DeepSeek
0 likes · 14 min read
Alibaba Cloud Infrastructure
Feb 13, 2025 · Artificial Intelligence

Deploying DeepSeek‑R1 671B Distributed Inference Service on Alibaba Cloud ACK with vLLM and Dify

This article explains how to quickly deploy the full‑parameter DeepSeek‑R1 671B model in a multi‑node GPU‑enabled Kubernetes cluster on Alibaba Cloud ACK, covering prerequisites, model parallelism, vLLM‑Ray distributed deployment, service verification, and integration with Dify to build a private AI Q&A assistant.

DeepSeek · Dify · Distributed Deployment
0 likes · 12 min read
Baidu Geek Talk
Feb 12, 2025 · Artificial Intelligence

Deploy DeepSeek, Llama, Qwen Models Fast on Baidu Baige AI Heterogeneous Platform

This guide walks you through creating a lightweight compute instance, adding it to the Baidu Baige AI heterogeneous computing platform, deploying vLLM, and loading and serving small‑scale dense models such as DeepSeek, Llama and Qwen. It also provides recommended configuration lists for low‑cost, high‑performance inference.

AI model deployment · Baidu Baige · Cloud AI
0 likes · 3 min read
Baidu Geek Talk
Jan 15, 2025 · Artificial Intelligence

Understanding Large Model Inference Engines and Reducing Token Interval (TPOT)

Large‑model inference engines turn prompts into responses in two phases, a Prefill stage and an autoregressive Decode stage, and are measured by TTFT (time to first token) and TPOT (time per output token). Baidu’s AIAK suite improves TPOT by separating tokenization, using static slot scheduling, and executing asynchronously, cutting token‑interval latency from ~35 ms to ~14 ms and raising GPU utilization to about 75 %. It also uses quantization and speculative execution for higher throughput.

AI acceleration · GPU utilization · TPOT
0 likes · 10 min read
Baidu Intelligent Cloud Tech Hub
Jan 7, 2025 · Artificial Intelligence

How Baidu’s AIAK Boosts LLM Inference Speed by Cutting Token Latency

This article explains the architecture of large‑model inference engines, key performance metrics like TTFT and TPOT, the limitations of popular engines such as vLLM, and Baidu Baige's AIAK solutions—including multi‑process, static slot, and asynchronous execution—that dramatically reduce token‑interval latency and increase GPU utilization.

AIAK · GPU utilization · LLM Performance
0 likes · 10 min read
DataFunSummit
Dec 28, 2024 · Artificial Intelligence

Memory Optimization for Large Model Inference: Virtual Tensor and LayerKV Techniques

This talk presents the Ant Group team's recent work on memory optimization for large‑model inference, covering GPU memory challenges, virtual memory management (VMM), the Virtual Tensor framework, LayerKV techniques, performance comparisons with PagedAttention and FlashAttention, and extensive experimental results demonstrating reduced latency and higher QPS.

GPU · Performance · Virtual Memory
0 likes · 25 min read
Infra Learning Club
Nov 1, 2024 · Artificial Intelligence

Configuring vLLM swap_space and cpu_offload_gb for Stable Large-Model Inference

The article explains vLLM’s GPU compute capability requirement, describes the swap_space and cpu_offload_gb parameters, outlines their ideal usage scenarios, and provides step‑by‑step code examples that demonstrate how adjusting these settings enables loading and running a 7B‑parameter model on a 16 GB T4 GPU.

GPU Memory Management · cpu_offload_gb · large language model inference
0 likes · 9 min read
21CTO
Apr 23, 2024 · Artificial Intelligence

Deploy Large Language Models with vLLM and Quantization for Low Latency

This guide explains how to deploy open‑source large language models using vLLM, benchmark latency and throughput, and apply 8‑bit/4‑bit quantization techniques such as bitsandbytes and NF4 to achieve faster inference on limited‑GPU hardware.

LLM deployment · Python · large language models
0 likes · 13 min read
Baobao Algorithm Notes
Apr 5, 2024 · Artificial Intelligence

How vLLM’s PagedAttention Revolutionizes GPU Memory Management for LLM Inference

This article explains how vLLM’s PagedAttention, inspired by operating‑system virtual‑memory paging, dynamically allocates KV‑cache memory to dramatically reduce GPU memory fragmentation, improve throughput, and handle scheduling, preemption, and distributed inference for large language models.

GPU Memory · LLM inference · PagedAttention
0 likes · 25 min read