Collection size: 99 articles
MaGe Linux Operations
Dec 27, 2025 · Artificial Intelligence

How to Deploy and Optimize Enterprise‑Scale LLM Inference Services: A Practical Guide

This guide walks you through deploying large language models such as ChatGLM and Llama in production, covering environment setup, model quantization, dynamic batching, service configuration, Nginx load balancing, monitoring, troubleshooting, and best‑practice recommendations for high‑performance, cost‑effective AI inference.

GPU · LLM · Performance tuning
48 min read
Baidu Intelligent Cloud Tech Hub
Jan 12, 2026 · Artificial Intelligence

How to Reduce Large‑Model Inference Cold‑Start to Seconds with vLLM Optimizations

This article details how Baidu Cloud's hybrid‑cloud team leveraged the vLLM framework to cut the cold‑start time of massive models like Qwen3‑235B‑A22B from minutes to a few seconds through accelerated weight loading, CUDA‑graph capture postponement, cross‑instance state reuse, fork‑based process startup, and guard‑instance pre‑warming techniques.

CUDA Graph · cold-start optimization · large-model inference
16 min read
Baidu Intelligent Cloud Tech Hub
Mar 18, 2026 · Artificial Intelligence

How vLLM‑Kunlun Brings CUDA‑Like Inference to Kunlun XPU: Architecture, Adaptation, and Performance Wins

This article details the vLLM‑Kunlun open‑source project that adapts the high‑performance vLLM inference engine to Baidu's Kunlun XPU, covering platform overview, model‑porting workflow, plugin architecture, concrete case studies with MIMO‑Flash‑V2 and Qwen 3.5, and the performance‑tuning techniques that enable seamless, GPU‑level inference on domestic hardware.

AI · Hardware · Kunlun
12 min read
Geek Labs
May 7, 2026 · Artificial Intelligence

Running Large Language Models Locally on RTX 3090: Two Open‑Source Solutions

This article introduces two recent GitHub projects—club‑3090, which enables single‑ or dual‑RTX 3090 inference of 27‑billion‑parameter models with detailed performance benchmarks, and library‑skills, a tool that keeps AI agents synchronized with the latest official library APIs—explaining their configurations, usage steps, hardware requirements, and target audiences.

AI agents · Docker · RTX 3090
7 min read
58 Tech
Jan 6, 2026 · Artificial Intelligence

How vLLM 0.8.4 Implements Multi‑LoRA for Efficient Large‑Model Inference

This article provides a step‑by‑step technical walkthrough of vLLM 0.8.4 on a single GPU, detailing the platform’s startup, model loading, Multi‑LoRA deployment, internal ZMQ communication, request scheduling, and inference execution, while exposing key source‑code snippets and architectural diagrams.

GPU inference · LoRA adapters · Model Serving
35 min read
Ops Development Stories
Sep 19, 2024 · Artificial Intelligence

How to Connect Qwen LLMs with Higress AI Gateway: A Hands‑On Guide

This tutorial walks through setting up a local k3d cluster, installing Higress, and using its AI plugins—including AI Proxy, AI JSON formatter, AI Agent, and AI Statistics—to integrate and observe Alibaba Cloud's Qwen large language models across various use cases such as weather and flight queries.

AI gateway · AI plugins · Higress
30 min read
Old Meng AI Explorer
Apr 20, 2026 · Artificial Intelligence

Unlock Free High‑Performance LLM APIs with NVIDIA NIM – A Step‑by‑Step Guide

This article explains what NVIDIA NIM is, compares its generous free quota to other LLM providers, lists the supported free models, walks through a five‑minute sign‑up, shows three code examples for calling the API, offers model‑selection advice, and provides a hands‑on case for building a free AI chat interface.

AI models · Free LLM API · NIM
16 min read
Alibaba Cloud Native
Aug 21, 2025 · Cloud Native

How Higress AI Gateway Optimizes LLM Load Balancing with Global, Prefix, and GPU‑Aware Algorithms

This article explains why traditional load‑balancing methods fall short for large language model services and introduces Higress AI Gateway's three specialized algorithms—global minimum‑request, prefix‑matching, and GPU‑aware load balancing—detailing their design, Redis‑based implementation, deployment steps, and performance gains.

GPU · LLM · Redis
11 min read
Baidu Intelligent Cloud Tech Hub
Dec 4, 2025 · Artificial Intelligence

How Offloading Latent Cache to CPU Boosts DeepSeek‑V3.2‑Exp Decoding Throughput

This report details the analysis of memory bottlenecks in DeepSeek‑V3.2‑Exp, proposes the Expanded Sparse Server (ESS) that offloads latent cache to CPU memory, and demonstrates through high‑fidelity simulation that the approach, combined with cache‑warmup and overlap techniques, can double decoding throughput for long‑context inference.

Cache offload · GPU‑CPU optimization · LLM inference
21 min read
Lao Guo's Learning Space
Apr 19, 2026 · Artificial Intelligence

Which Framework Wins for Running Large Models? vLLM vs llama.cpp vs MLX (2026 Deep Comparison)

The article provides a 2026 deep comparative analysis of three major large‑model inference frameworks—vLLM, llama.cpp, and MLX—detailing their core designs, recent updates, benchmark results on various hardware, deployment complexity, and recommended use cases to help developers choose the right tool.

MLX · benchmark · framework comparison
15 min read
21CTO
Apr 23, 2024 · Artificial Intelligence

Deploy Large Language Models with vLLM and Quantization for Low Latency

This guide explains how to deploy open‑source large language models using vLLM, benchmark latency and throughput, and apply 8‑bit/4‑bit quantization techniques such as bitsandbytes and NF4 to achieve faster inference on limited‑GPU hardware.

LLM deployment · Python · large language models
13 min read
Alibaba Cloud Native
Jan 26, 2024 · Artificial Intelligence

Deploy a Serverless Stable Diffusion API for Scalable AI Image Generation

This guide explains how to overcome GPU cost, high‑concurrency, and model‑switching challenges by using Alibaba Cloud's Serverless Stable Diffusion API, detailing deployment steps, supported use cases, performance advantages, and the full set of RESTful endpoints for AI image creation.

AI · API · Function Compute
19 min read
Alibaba Cloud Infrastructure
Jun 12, 2024 · Artificial Intelligence

Deploy Llama‑2 on ACK with KServe, Triton, and TensorRT‑LLM – Step‑by‑Step Guide

This tutorial walks through deploying the Llama‑2‑7b‑hf model on Alibaba Cloud Kubernetes (ACK) using KServe, Triton Inference Server with the TensorRT‑LLM backend, covering prerequisites, model preparation, YAML configuration, PV/PVC setup, runtime creation, and troubleshooting steps.

AI inference · KServe · Kubernetes
13 min read
Architect
Mar 1, 2025 · Artificial Intelligence

How to Build a High‑Performance, Scalable LLM Inference Engine: From Paged Attention to Multi‑GPU Parallelism

This article analyzes the challenges of deploying large language models locally and presents a comprehensive set of engineering techniques—including CPU/GPU process separation, Paged Attention, Radix Attention, chunked prefill, output‑length reduction, multi‑GPU tensor parallelism, and speculative decoding—to dramatically boost inference throughput and cut response latency.

LLM inference · Performance Optimization · Speculative Decoding
23 min read
Baobao Algorithm Notes
Dec 24, 2023 · Artificial Intelligence

Must‑Read AI Agent and LLM Research Papers for Deep Understanding

This curated reading list compiles essential papers on AI agents, task planning, hallucination mitigation, multimodal models, image/video generation, foundational LLM research, open‑source large models, fine‑tuning techniques, and performance optimization, providing a comprehensive roadmap for anyone aiming to master modern generative AI.

AI agents · Multimodal Learning · Performance Optimization
23 min read
ByteDance Cloud Native
Mar 20, 2025 · Artificial Intelligence

How to Deploy DeepSeek‑R1 671B on AIBrix: Multi‑Node GPU Inference in Hours

This guide explains how to use the AIBrix distributed inference platform to deploy the massive DeepSeek‑R1 671B model across multiple GPU nodes, covering cluster setup, custom vLLM images, storage options, RDMA networking, autoscaling, request handling, and observability, turning a weeks‑long deployment into an hour‑scale process.

AIBrix · DeepSeek-R1 · GPU cluster
14 min read
HyperAI Super Neural
Apr 8, 2026 · Artificial Intelligence

One‑Click Deploy Gemma‑4‑31B with 256K Context, Matching Qwen 3.5 397B Performance

HyperAI’s tutorial lets developers launch the open‑source Gemma‑4‑31B model, which supports multimodal input, a context window of up to 256K tokens, and over 140 languages, through a one‑click deployment on RTX 6000 or RTX 5090 GPUs, with step‑by‑step instructions and optional compute credits.

256K context · Gemma-4-31B · HyperAI
5 min read