Collection size

99 articles

Page 2 of 5

Dec 27, 2025 · Artificial Intelligence

How to Deploy and Optimize Enterprise‑Scale LLM Inference Services: A Practical Guide

This guide walks you through deploying large language models such as ChatGLM and Llama in production, covering environment setup, model quantization, dynamic batching, service configuration, Nginx load balancing, monitoring, troubleshooting, and best‑practice recommendations for high‑performance, cost‑effective AI inference.

GPULLMPerformance tuning

0 likes · 48 min read

How to Deploy and Optimize Enterprise‑Scale LLM Inference Services: A Practical Guide

Baidu Intelligent Cloud Tech Hub

Jan 12, 2026 · Artificial Intelligence

How to Reduce Large‑Model Inference Cold‑Start to Seconds with vLLM Optimizations

This article details how Baidu Cloud's hybrid‑cloud team leveraged the vLLM framework to cut the cold‑start time of massive models like Qwen3‑235B‑A22B from minutes to a few seconds through accelerated weight loading, CUDA‑graph capture postponement, cross‑instance state reuse, fork‑based process startup, and guard‑instance pre‑warming techniques.

CUDA Graphcold-start optimizationlarge-model inference

0 likes · 16 min read

How to Reduce Large‑Model Inference Cold‑Start to Seconds with vLLM Optimizations

Baidu Intelligent Cloud Tech Hub

Mar 18, 2026 · Artificial Intelligence

How vLLM‑Kunlun Brings CUDA‑Like Inference to Kunlun XPU: Architecture, Adaptation, and Performance Wins

This article details the vLLM‑Kunlun open‑source project that adapts the high‑performance vLLM inference engine to Baidu's Kunlun XPU, covering platform overview, model‑porting workflow, plugin architecture, concrete case studies with MIMO‑Flash‑V2 and Qwen 3.5, and the performance‑tuning techniques that enable seamless, GPU‑level inference on domestic hardware.

AIHardwareKunlun

0 likes · 12 min read

How vLLM‑Kunlun Brings CUDA‑Like Inference to Kunlun XPU: Architecture, Adaptation, and Performance Wins

Geek Labs

May 7, 2026 · Artificial Intelligence

Running Large Language Models Locally on RTX 3090: Two Open‑Source Solutions

This article introduces two recent GitHub projects—club‑3090, which enables single‑ or dual‑RTX 3090 inference of 27‑billion‑parameter models with detailed performance benchmarks, and library‑skills, a tool that keeps AI agents synchronized with the latest official library APIs—explaining their configurations, usage steps, hardware requirements, and target audiences.

AI agentsDockerRTX 3090

0 likes · 7 min read

Running Large Language Models Locally on RTX 3090: Two Open‑Source Solutions

58 Tech

Jan 6, 2026 · Artificial Intelligence

How vLLM 0.8.4 Implements Multi‑LoRA for Efficient Large‑Model Inference

This article provides a step‑by‑step technical walkthrough of vLLM 0.8.4 on a single GPU, detailing the platform’s startup, model loading, Multi‑LoRA deployment, internal ZMQ communication, request scheduling, and inference execution, while exposing key source‑code snippets and architectural diagrams.

GPU inferenceLoRA adaptersModel Serving

0 likes · 35 min read

How vLLM 0.8.4 Implements Multi‑LoRA for Efficient Large‑Model Inference

Ops Development Stories

Sep 19, 2024 · Artificial Intelligence

How to Connect Qwen LLMs with Higress AI Gateway: A Hands‑On Guide

This tutorial walks through setting up a local k3d cluster, installing Higress, and using its AI plugins—including AI Proxy, AI JSON formatter, AI Agent, and AI Statistics—to integrate and observe Alibaba Cloud's Qwen large language models across various use cases such as weather and flight queries.

AI gatewayAI pluginsHigress

0 likes · 30 min read

How to Connect Qwen LLMs with Higress AI Gateway: A Hands‑On Guide

Alibaba Cloud Native

Mar 3, 2026 · Cloud Native

Deploy Alibaba's Qwen3.5-397B Model in Minutes with Serverless Function Compute

This guide explains how to quickly deploy the new Qwen3.5-397B-A17B open‑source large model using Alibaba Cloud Function Compute's serverless GPU service, covering model features, deployment steps, required commands, and performance benefits.

AICloud NativeFunction Compute

0 likes · 5 min read

Deploy Alibaba's Qwen3.5-397B Model in Minutes with Serverless Function Compute

Old Meng AI Explorer

Apr 20, 2026 · Artificial Intelligence

Unlock Free High‑Performance LLM APIs with NVIDIA NIM – A Step‑by‑Step Guide

This article explains what NVIDIA NIM is, compares its generous free quota to other LLM providers, lists the supported free models, walks through a five‑minute sign‑up, shows three code examples for calling the API, offers model‑selection advice, and provides a hands‑on case for building a free AI chat interface.

AI modelsFree LLM APINIM

0 likes · 16 min read

Unlock Free High‑Performance LLM APIs with NVIDIA NIM – A Step‑by‑Step Guide

Alibaba Cloud Native

Aug 21, 2025 · Cloud Native

How Higress AI Gateway Optimizes LLM Load Balancing with Global, Prefix, and GPU‑Aware Algorithms

This article explains why traditional load‑balancing methods fall short for large language model services and introduces Higress AI Gateway's three specialized algorithms—global minimum‑request, prefix‑matching, and GPU‑aware load balancing—detailing their design, Redis‑based implementation, deployment steps, and performance gains.

GPULLMRedis

0 likes · 11 min read

How Higress AI Gateway Optimizes LLM Load Balancing with Global, Prefix, and GPU‑Aware Algorithms

Baidu Intelligent Cloud Tech Hub

Dec 4, 2025 · Artificial Intelligence

How Offloading Latent Cache to CPU Boosts DeepSeek‑V3.2‑Exp Decoding Throughput

This report details the analysis of memory bottlenecks in DeepSeek‑V3.2‑Exp, proposes the Expanded Sparse Server (ESS) that offloads latent cache to CPU memory, and demonstrates through high‑fidelity simulation that the approach, combined with cache‑warmup and overlap techniques, can double decoding throughput for long‑context inference.

Cache offloadGPU‑CPU optimizationLLM inference

0 likes · 21 min read

How Offloading Latent Cache to CPU Boosts DeepSeek‑V3.2‑Exp Decoding Throughput

Lao Guo's Learning Space

Apr 19, 2026 · Artificial Intelligence

Which Framework Wins for Running Large Models? vLLM vs llama.cpp vs MLX (2026 Deep Comparison)

The article provides a 2026 deep comparative analysis of three major large‑model inference frameworks—vLLM, llama.cpp, and MLX—detailing their core designs, recent updates, benchmark results on various hardware, deployment complexity, and recommended use cases to help developers choose the right tool.

MLXbenchmarkframework comparison

0 likes · 15 min read

Which Framework Wins for Running Large Models? vLLM vs llama.cpp vs MLX (2026 Deep Comparison)

21CTO

Apr 23, 2024 · Artificial Intelligence

Deploy Large Language Models with vLLM and Quantization for Low Latency

This guide explains how to deploy open‑source large language models using vLLM, benchmark latency and throughput, and apply 8‑bit/4‑bit quantization techniques such as BitsandBytes and NF4 to achieve faster inference on limited‑GPU hardware.

LLM deploymentPythonlarge language models

0 likes · 13 min read

Deploy Large Language Models with vLLM and Quantization for Low Latency

Alibaba Cloud Native

Jan 26, 2024 · Artificial Intelligence

Deploy a Serverless Stable Diffusion API for Scalable AI Image Generation

This guide explains how to overcome GPU cost, high‑concurrency, and model‑switching challenges by using Alibaba Cloud's Serverless Stable Diffusion API, detailing deployment steps, supported use cases, performance advantages, and the full set of RESTful endpoints for AI image creation.

AIAPIFunction Compute

0 likes · 19 min read

Deploy a Serverless Stable Diffusion API for Scalable AI Image Generation

Alibaba Cloud Infrastructure

Jun 12, 2024 · Artificial Intelligence

Deploy Llama‑2 on ACK with KServe, Triton, and TensorRT‑LLM – Step‑by‑Step Guide

This tutorial walks through deploying the Llama‑2‑7b‑hf model on Alibaba Cloud Kubernetes (ACK) using KServe, Triton Inference Server with the TensorRT‑LLM backend, covering prerequisites, model preparation, YAML configuration, PV/PVC setup, runtime creation, and troubleshooting steps.

AI inferenceKServeKubernetes

0 likes · 13 min read

Deploy Llama‑2 on ACK with KServe, Triton, and TensorRT‑LLM – Step‑by‑Step Guide

Alibaba Cloud Developer

Mar 25, 2025 · Cloud Computing

How to Deploy QwQ-32B Model on Alibaba Cloud Function Compute Using CAP

This guide walks you through deploying the open‑source QwQ‑32B model on Alibaba Cloud Function Compute with CAP, covering required services, step‑by‑step deployment, cost notes, accessing the demo UI, interacting with the model, scaling settings, and resource cleanup.

Alibaba CloudCAPFunction Compute

0 likes · 7 min read

How to Deploy QwQ-32B Model on Alibaba Cloud Function Compute Using CAP

Architect

Mar 1, 2025 · Artificial Intelligence

How to Build a High‑Performance, Scalable LLM Inference Engine: From Paged Attention to Multi‑GPU Parallelism

This article analyzes the challenges of deploying large language models locally and presents a comprehensive set of engineering techniques—including CPU/GPU process separation, Paged Attention, Radix Attention, chunked prefill, output‑length reduction, multi‑GPU tensor parallelism, and speculative decoding—to dramatically boost inference throughput and cut response latency.

LLM inferencePerformance OptimizationSpeculative Decoding

0 likes · 23 min read

How to Build a High‑Performance, Scalable LLM Inference Engine: From Paged Attention to Multi‑GPU Parallelism

Baobao Algorithm Notes

Dec 24, 2023 · Artificial Intelligence

Must‑Read AI Agent and LLM Research Papers for Deep Understanding

This curated reading list compiles essential papers on AI agents, task planning, hallucination mitigation, multimodal models, image/video generation, foundational LLM research, open‑source large models, fine‑tuning techniques, and performance optimization, providing a comprehensive roadmap for anyone aiming to master modern generative AI.

AI agentsMultimodal LearningPerformance Optimization

0 likes · 23 min read

Must‑Read AI Agent and LLM Research Papers for Deep Understanding

ByteDance Cloud Native

Mar 20, 2025 · Artificial Intelligence

How to Deploy DeepSeek‑R1 671B on AIBrix: Multi‑Node GPU Inference in Hours

This guide explains how to use the AIBrix distributed inference platform to deploy the massive DeepSeek‑R1 671B model across multiple GPU nodes, covering cluster setup, custom vLLM images, storage options, RDMA networking, autoscaling, request handling, and observability, turning a weeks‑long deployment into an hour‑scale process.

AIBrixDeepSeek-R1GPU cluster

0 likes · 14 min read

How to Deploy DeepSeek‑R1 671B on AIBrix: Multi‑Node GPU Inference in Hours

Volcano Engine Developer Services

Jun 13, 2023 · Artificial Intelligence

Deploy Stable Diffusion on Volcengine Cloud: A Step‑by‑Step Guide

Learn how to deploy your own Stable Diffusion text‑to‑image model on Volcengine Cloud by setting up a VKE Kubernetes cluster, configuring storage, GPU resources, container images, and exposing the service via ALB or API Gateway, while leveraging mGPU sharing and serverless GPU options.

AIGPUKubernetes

0 likes · 14 min read

Deploy Stable Diffusion on Volcengine Cloud: A Step‑by‑Step Guide

HyperAI Super Neural

Apr 8, 2026 · Artificial Intelligence

One‑Click Deploy Gemma‑4‑31B with 256K Context, Matching Qwen 3.5 397B Performance

HyperAI’s tutorial lets developers instantly launch the open‑source Gemma‑4‑31B model—supporting multimodal input, up to 256 K token context and over 140 languages—through a one‑click deployment on RTX 6000 or RTX 5090 GPUs, with detailed step‑by‑step instructions and optional compute credits.

256K contextGemma-4-31BHyperAI

0 likes · 5 min read

One‑Click Deploy Gemma‑4‑31B with 256K Context, Matching Qwen 3.5 397B Performance