Tagged articles

133 articles

Jun 3, 2025 · Artificial Intelligence

How to Train a 671B‑Scale Model with RL: Insights from a verl Internship

This article shares a detailed, first‑hand analysis of the technical challenges, framework choices, memory management, weight conversion, precision alignment, and efficiency optimizations encountered while building reinforcement‑learning pipelines for a 671‑billion‑parameter model using the verl ecosystem.

GPU Memory Management · Large Models · Megatron

0 likes · 16 min read

Baobao Algorithm Notes

May 20, 2025 · Artificial Intelligence

Boosting RLHF Training Efficiency with Asynchronous vLLM and Ray Integration

This article explains how an asynchronous RLHF pipeline built on vLLM, Ray, and OpenRLHF dramatically reduces training bottlenecks by decoupling inference, environment interaction, and model updates, and provides detailed implementation code and design choices for scalable reinforcement learning.

OpenRLHF · RLHF · Ray

0 likes · 11 min read

Architect's Alchemy Furnace

May 7, 2025 · Artificial Intelligence

Which LLM Inference Engine Reigns Supreme? A Deep Dive into Transformers, vLLM, Llama.cpp, SGLang, MLX and Ollama

This article provides a comprehensive comparison of seven popular large‑language‑model inference engines—Transformers, vLLM, Llama.cpp, SGLang, MLX, Ollama and others—detailing their core features, performance characteristics, hardware compatibility, concurrency support, and ideal use‑cases, plus practical installation guidance for Xinference.

LLM · MLX · SGLang

0 likes · 17 min read

AIWalker

May 6, 2025 · Artificial Intelligence

SimpleAR: High‑Quality 1024×1024 Images with Just 0.5B Parameters via Pretraining, SFT, and RL

SimpleAR demonstrates that a vanilla autoregressive model with only 0.5B parameters can generate high‑fidelity 1024×1024 images. The article covers pretraining, supervised fine‑tuning, and reinforcement learning; the model achieves competitive GenEval (0.59) and DPG‑Bench (79.66) scores, and vLLM and KV‑cache optimizations reduce inference time to about 14 seconds.

Supervised Fine‑Tuning · autoregressive · benchmark

0 likes · 14 min read

Liangxu Linux

Apr 28, 2025 · Artificial Intelligence

Deploy DeepSeek‑R1 on Your Server in 15 Minutes with Zero Code

This guide shows how to use the lightweight OpenStation platform to install, configure, and launch the DeepSeek‑R1 large model on a personal server in under 15 minutes, covering zero‑code deployment, resource management, inference engine selection, and integration with CherryStudio.

AI model deployment · CherryStudio · DeepSeek-R1

0 likes · 7 min read

Alibaba Cloud Infrastructure

Apr 16, 2025 · Artificial Intelligence

Optimizing Multi‑Node Distributed LLM Inference with ACK Gateway and vLLM

This article presents a step‑by‑step guide for deploying and optimizing large‑language‑model inference across multiple GPU‑enabled nodes using ACK Gateway with Inference Extension, vLLM’s tensor‑ and pipeline‑parallel techniques, and Kubernetes resources such as LeaderWorkerSet, PVCs, and custom routing policies, followed by performance benchmarking and analysis.

ACK Gateway · Kubernetes · LLM

0 likes · 19 min read

Alibaba Cloud Developer

Apr 7, 2025 · Artificial Intelligence

Why Does GPU Memory Keep Growing in DeepSeek‑R1 Inference? Uncovering PyTorch’s Cache

After deploying the full‑precision DeepSeek‑R1 model on a 2×8‑GPU ACS cluster, repeated stress tests showed GPU memory usage continuously rising without release; this article details the investigation, reproduces the behavior, examines vLLM logs, Prometheus metrics, and reveals PyTorch’s caching allocator as the root cause, offering mitigation tips.

DeepSeek · GPU Memory · Memory Cache

0 likes · 21 min read

Infra Learning Club

Apr 4, 2025 · Artificial Intelligence

Testing Augment Code: A Powerful New Rival to Cursor

The article evaluates Augment Code, an AI‑powered coding assistant with a 200K‑token context, persistent memory, multimodal input, and top SWE‑bench scores. It walks through installation, explores the tool's use on the vLLM codebase and PagedAttention, demonstrates adding a new model and auto‑generating a WeChat mini‑program, and compares its capabilities and speed to Cursor.

AI coding assistant · Augment Code · Cursor

0 likes · 8 min read

Alibaba Cloud Observability

Mar 24, 2025 · Artificial Intelligence

Achieving Full Observability for AI Inference Apps with Prometheus

This article explores the observability challenges of AI inference services, outlines a comprehensive Prometheus‑based metric collection strategy, and demonstrates practical monitoring implementations for Ray Serve, vLLM, GPU resources, and custom metrics to build stable, high‑performance inference pipelines.

AI inference · Observability · Prometheus

0 likes · 19 min read

ByteDance Cloud Native

Mar 20, 2025 · Artificial Intelligence

How to Deploy DeepSeek‑R1 671B on AIBrix: Multi‑Node GPU Inference in Hours

This guide explains how to use the AIBrix distributed inference platform to deploy the massive DeepSeek‑R1 671B model across multiple GPU nodes, covering cluster setup, custom vLLM images, storage options, RDMA networking, autoscaling, request handling, and observability, turning a weeks‑long deployment into an hour‑scale process.

AIBrix · DeepSeek-R1 · GPU cluster

0 likes · 14 min read

Alibaba Cloud Developer

Mar 18, 2025 · Artificial Intelligence

How to Build a Full‑Stack Observability Solution for AI Inference with Prometheus

This article explores the monitoring challenges of large‑scale AI inference services, outlines the key observability requirements, and provides a complete Prometheus‑based metric collection framework—including Ray Serve and vLLM integrations—to help developers build stable, high‑performance inference applications.

AI inference · Prometheus · Ray Serve

0 likes · 21 min read

Alibaba Cloud Infrastructure

Mar 17, 2025 · Cloud Native

Boost LLM Inference with ACK Gateway AI Extension: A Step‑by‑Step Guide

This guide demonstrates how to deploy the QwQ‑32B large language model on an Alibaba Cloud ACK cluster, configure OSS storage, enable the ACK Gateway with AI Extension, set up InferencePool and InferenceModel resources, and benchmark intelligent routing versus standard gateway routing, revealing latency and throughput improvements.

ACK Gateway · AI Extension · Kubernetes

0 likes · 16 min read

Zhihu Tech Column

Mar 14, 2025 · Artificial Intelligence

Insights from Zhihu’s ZhiLight Large Model Inference Framework: Architecture, Parallelism, and Performance Optimizations

The article summarizes Zhihu’s technical talk on the ZhiLight large‑model inference framework, detailing model execution mechanisms, GPU load analysis, multi‑GPU parallel strategies, open‑source engine comparisons, compute‑communication overlap, quantization techniques, benchmark results, and future directions for scalable LLM deployment.

GPU parallelism · SGLang · Tensor Parallelism

0 likes · 11 min read

Alibaba Cloud Infrastructure

Mar 9, 2025 · Cloud Computing

Deploy QwQ-32B LLM Inference on Alibaba Cloud ACS with vLLM: Step‑by‑Step Guide

This guide walks you through using Alibaba Cloud Container Compute Service (ACS) to provision GPU resources, prepare the QwQ-32B model, configure persistent storage, deploy the model with vLLM, set up OpenWebUI, verify the service, and optionally benchmark its performance, all with detailed commands and YAML examples.

ACS · Alibaba Cloud · GPU

0 likes · 17 min read

Alibaba Cloud Infrastructure

Mar 8, 2025 · Artificial Intelligence

Deploying QwQ-32B LLM with vLLM on Alibaba Cloud ACK and Configuring Intelligent Routing

This guide explains how to deploy the QwQ-32B large language model using vLLM on an Alibaba Cloud ACK Kubernetes cluster, configure storage, set up OpenWebUI, enable ACK Gateway with AI Extension for intelligent routing, and benchmark the inference service performance.

ACK · Kubernetes · LLM

0 likes · 17 min read

ByteDance Cloud Native

Mar 7, 2025 · Artificial Intelligence

How to Deploy the QwQ-32B Large Language Model on Volcengine Cloud in Minutes

This guide walks you through the end‑to‑end process of deploying the open‑source QwQ‑32B inference model on Volcengine's cloud platform, covering GPU ECS selection, VKE cluster creation, continuous delivery CP setup, vLLM service launch, and API gateway exposure.

GPU ECS · Large Language Model · QwQ-32B

0 likes · 8 min read

AIWalker

Feb 27, 2025 · Artificial Intelligence

Step-by-Step Guide to Deploying, Testing, and Optimizing DeepSeek‑R1: A Complete Tutorial

This article provides a comprehensive, hands‑on guide for installing and configuring DeepSeek‑R1 with Ollama and vLLM, setting up multi‑node multi‑GPU environments, running performance benchmarks, optimizing runtime parameters, and even generating a full PyTorch distributed‑training script.

DeepSeek-R1 · GPU optimization · LLM deployment

0 likes · 39 min read

Alibaba Cloud Big Data AI Platform

Feb 25, 2025 · Artificial Intelligence

Accelerate DeepSeek‑V2‑Lite Deployment with FlashMLA: A Step‑by‑Step Guide

This tutorial walks users through installing FlashMLA, integrating it with the vLLM framework, downloading the DeepSeek‑V2‑Lite‑Chat model, benchmarking various MLA implementations, and running a local inference demo that shows FlashMLA’s speed advantage on long‑sequence generation.

DeepSeek · FlashMLA · InferenceOptimization

0 likes · 16 min read

Alibaba Cloud Native

Feb 18, 2025 · Cloud Native

Deploy DeepSeek‑R1 on Alibaba Cloud ACK One Using ACS GPU in Minutes

This guide shows how to overcome on‑premise compute limits by registering a local Kubernetes cluster to Alibaba Cloud ACK One, provisioning ACS GPU resources, and deploying the DeepSeek‑R1 inference model with the vLLM framework through a series of concrete commands and YAML configurations.

ACK One · ACS GPU · DeepSeek

0 likes · 15 min read

Alibaba Cloud Native

Feb 13, 2025 · Artificial Intelligence

Tackling the ‘Impossible Triangle’: Scaling vLLM on Alibaba Cloud GPU Reservations

This article examines the performance, cost, and stability challenges of large‑scale vLLM deployments, explains the “impossible triangle” dilemma, and provides a detailed, cloud‑native solution using Alibaba Cloud Function Compute GPU reserved instances with step‑by‑step deployment instructions and code examples.

Alibaba Cloud · GPU Reserved Instances · deployment guide

0 likes · 14 min read

Alibaba Cloud Infrastructure

Feb 13, 2025 · Cloud Computing

Deploy DeepSeek‑R1 LLM on Alibaba Cloud ACK One with ACS GPU in Minutes

This guide walks you through deploying the DeepSeek‑R1 large‑language‑model inference service on Alibaba Cloud ACK One registered clusters using ACS GPU compute, covering model preparation, OSS storage setup, PersistentVolume configuration, arena‑based service deployment, and verification steps with concrete commands and parameters.

ACK One · ACS GPU · DeepSeek

0 likes · 14 min read

Alibaba Cloud Infrastructure

Feb 13, 2025 · Artificial Intelligence

Deploying DeepSeek‑R1 671B Distributed Inference Service on Alibaba Cloud ACK with vLLM and Dify

This article explains how to quickly deploy the full‑parameter DeepSeek‑R1 671B model in a multi‑node GPU‑enabled Kubernetes cluster on Alibaba Cloud ACK, covering prerequisites, model parallelism, vLLM‑Ray distributed deployment, service verification, and integration with Dify to build a private AI Q&A assistant.

DeepSeek · Dify · Distributed Deployment

0 likes · 12 min read

Alibaba Cloud Infrastructure

Feb 12, 2025 · Artificial Intelligence

Deploying DeepSeek‑R1 Distilled Qwen‑32B‑FP8 Model on Alibaba Cloud GPU Instances with Docker and OpenWebUI

This guide explains how to prepare an Alibaba Cloud GPU instance, install Docker and NVIDIA tools, pull or build a container image, and run the FP8‑quantized DeepSeek‑R1‑Distill‑Qwen‑32B model using vLLM and OpenWebUI for both offline and online inference.

DeepSeek · FP8 quantization · GPU

0 likes · 18 min read

Baidu Geek Talk

Feb 12, 2025 · Artificial Intelligence

Deploy DeepSeek, Llama, Qwen Models Fast on Baidu Baige AI Heterogeneous Platform

This guide walks you through creating a lightweight compute instance, adding it to the Baidu Baige AI heterogeneous computing platform, deploying vLLM, and loading and serving small‑scale dense models such as DeepSeek, Llama and Qwen. It also provides recommended configuration lists for low‑cost, high‑performance inference.

AI model deployment · Baidu Baige · Cloud AI

0 likes · 3 min read

Alibaba Cloud Developer

Feb 5, 2025 · Artificial Intelligence

Deploy DeepSeek Models on Alibaba Cloud PAI with One-Click: A Step-by-Step Guide

This tutorial shows how to log into Alibaba Cloud PAI, navigate to the Model Gallery, select a DeepSeek model such as the distilled DeepSeek‑R1‑Distill‑Qwen‑7B, and deploy it with a single click using vLLM or BladeLLM, providing endpoint and token details for immediate use.

AI · Alibaba Cloud · BladeLLM

0 likes · 3 min read

Baidu Geek Talk

Jan 15, 2025 · Artificial Intelligence

Understanding Large Model Inference Engines and Reducing Token Interval (TPOT)

Large‑model inference engines convert prompts into responses via a Prefill stage and an autoregressive Decode stage, with performance measured by TTFT (time to first token) and TPOT (time per output token). Baidu’s AIAK suite improves TPOT by separating tokenization, using static slot scheduling, and executing asynchronously, which cuts token‑interval latency from ~35 ms to ~14 ms and raises GPU utilization to about 75%. It also leverages quantization and speculative execution for higher throughput.

AI acceleration · GPU utilization · TPOT

0 likes · 10 min read

Baobao Algorithm Notes

Jan 9, 2025 · Artificial Intelligence

How to Efficiently Deploy and Manage 100 LoRA‑Enhanced LLMs with vLLM

A technical walkthrough shows how to use vLLM to load multiple LoRA adapters for role‑playing LLMs, analyzes the massive GPU and labor costs of naïve deployment, and presents a hosted multi‑LoRA platform as a cost‑effective solution.

AI inference · LLM · LoRA

0 likes · 11 min read

Baidu Intelligent Cloud Tech Hub

Jan 7, 2025 · Artificial Intelligence

How Baidu’s AIAK Boosts LLM Inference Speed by Cutting Token Latency

This article explains the architecture of large‑model inference engines, key performance metrics like TTFT and TPOT, the limitations of popular engines such as vLLM, and Baidu Baige's AIAK solutions—including multi‑process, static slot, and asynchronous execution—that dramatically reduce token‑interval latency and increase GPU utilization.

AIAK · GPU utilization · LLM Performance

0 likes · 10 min read

DataFunSummit

Dec 28, 2024 · Artificial Intelligence

Memory Optimization for Large Model Inference: Virtual Tensor and LayerKV Techniques

This talk presents the Ant Group team's recent work on large‑model inference memory optimization, covering GPU memory challenges, virtual memory management (VMM), the Virtual Tensor framework, LayerKV techniques, performance comparisons with PagedAttention and FlashAttention, and extensive experimental results demonstrating reduced latency and higher QPS.

GPU · Performance · Virtual Memory

0 likes · 25 min read

Infra Learning Club

Nov 1, 2024 · Artificial Intelligence

Configuring vLLM swap_space and cpu_offload_gb for Stable Large-Model Inference

The article explains vLLM’s GPU compute capability requirement, describes the swap_space and cpu_offload_gb parameters, outlines their ideal usage scenarios, and provides step‑by‑step code examples that demonstrate how adjusting these settings enables loading and running a 7B‑parameter model on a 16 GB T4 GPU.

GPU Memory Management · cpu_offload_gb · large language model inference

0 likes · 9 min read

DeWu Technology

Aug 19, 2024 · Artificial Intelligence

Multi‑LoRA Deployment for Large Language Models: Concepts, Fine‑tuning, and Cost‑Effective Strategies

The article introduces a multi‑LoRA strategy that lets many scenario‑specific adapters share a single base LLM, dramatically cutting GPU usage and cost while preserving performance, and explains how to fine‑tune with LoRA, merge adapters, and serve them efficiently using vLLM.

LoRA · Model Deployment · fine-tuning

0 likes · 10 min read

21CTO

Apr 23, 2024 · Artificial Intelligence

Deploy Large Language Models with vLLM and Quantization for Low Latency

This guide explains how to deploy open‑source large language models using vLLM, benchmark latency and throughput, and apply 8‑bit/4‑bit quantization techniques such as bitsandbytes and NF4 to achieve faster inference on hardware with limited GPU resources.

LLM deployment · Python · large language models

0 likes · 13 min read

Baobao Algorithm Notes

Apr 5, 2024 · Artificial Intelligence

How vLLM’s PagedAttention Revolutionizes GPU Memory Management for LLM Inference

This article explains how vLLM’s PagedAttention, inspired by operating‑system virtual‑memory paging, dynamically allocates KV‑cache memory to dramatically reduce GPU memory fragmentation, improve throughput, and handle scheduling, preemption, and distributed inference for large language models.

GPU Memory · LLM inference · PagedAttention

0 likes · 25 min read
