Collection size

99 articles

May 1, 2023 · Cloud Native

Deploy FastChat on Alibaba Cloud ASK: A Serverless AI Model Tutorial

This guide shows how to quickly deploy the open‑source FastChat AI assistant on Alibaba Cloud ASK, a serverless Kubernetes platform, covering prerequisites, YAML configuration, GPU handling, verification steps, and three usage scenarios: the web UI, API calls, and a VS Code extension.

AI · ASK · Deployment

12 min read

Old Zhang's AI Learning

Apr 28, 2026 · Artificial Intelligence

vLLM 0.20 Arrives with DeepSeek V4 Support – What’s New?

The vLLM 0.20.0 release dramatically upgrades the inference engine with DeepSeek V4 support, default CUDA 13, PyTorch 2.11, Transformers v5 compatibility, FlashAttention 4 MLA prefill, TurboQuant 2‑bit KV cache, an online quantization front‑end, IR enhancements, Model Runner V2 features, and a slew of new models, while providing detailed installation and upgrade guidance.

CUDA 13 · DeepSeek V4 · FlashAttention

10 min read

James' Growth Diary

May 25, 2026 · Artificial Intelligence

Practical Agent Performance Tuning: Slash Latency 75%, Cut Token Costs 71%, Boost Throughput 217%

The article walks through a systematic performance map of LangChain agents and demonstrates concrete latency, token‑usage, and concurrency optimizations (streaming responses, Redis caching, model routing, prompt trimming, context summarisation, dynamic tool selection, parallel graph nodes, and batch processing), showing real‑world gains of up to 75% lower latency, 71% fewer tokens, and a 217% throughput increase.

Agent Optimization · LangChain · LangGraph

30 min read

Alibaba Cloud Native

Mar 27, 2025 · Cloud Native

Deploy the QwQ‑32B LLM on Alibaba Cloud Function Compute with CAP in Minutes

This guide walks you through deploying the open‑source QwQ‑32B model on Alibaba Cloud Function Compute using the Cloud Application Platform (CAP), covering architecture, required services, account setup, step‑by‑step deployment, cost considerations, model interaction via Open WebUI and Chatbox, scaling configuration, and resource cleanup.

CAP · Function Compute · Ollama

8 min read

Old Zhang's AI Learning

May 1, 2026 · Artificial Intelligence

NVIDIA’s Open‑Source Multimodal Nemotron 3 Nano Omni: Run Locally on Consumer GPUs (English‑Only)

NVIDIA’s Nemotron 3 Nano Omni 30B‑A3B‑Reasoning model, an open‑source multimodal LLM with 30B parameters, a 256K context window, and video, audio, image, and text capabilities, outperforms comparable models by up to 9.2× in video throughput and runs on consumer GPUs via 4‑bit GGUF quantization, but currently supports only English input.

GGUF · GPU · Multimodal

17 min read

ByteDance Cloud Native

Mar 7, 2025 · Artificial Intelligence

How to Deploy the QwQ-32B Large Language Model on Volcengine Cloud in Minutes

This guide walks you through the end‑to‑end process of deploying the open‑source QwQ‑32B inference model on Volcengine's cloud platform, covering GPU ECS selection, VKE cluster creation, continuous delivery CP setup, vLLM service launch, and API gateway exposure.

GPU ECS · Large Language Model · QwQ-32B

8 min read

Old Zhang's AI Learning

Apr 20, 2026 · Artificial Intelligence

Qwen3.6-35B Quantized Model on vLLM: Local Deployment and Performance Benchmark

The article details how to deploy the 4‑bit quantized Qwen3.6-35B model with vLLM 0.17 (and the 0.19.1 patch) in a Docker container, compares its memory usage and token‑generation speed with Qwen3.5‑35B, and shares practical scripts along with an observed throughput of roughly 150 tokens per second.

Docker · LLM deployment · Performance Benchmark

5 min read

Old Zhang's AI Learning

Apr 25, 2026 · Artificial Intelligence

Deploying DeepSeek‑V4‑Flash Locally on 2 × NVIDIA H20 (96 GB) – Quick Performance Test

This article walks through deploying DeepSeek‑V4‑Flash on a server with two NVIDIA H20 GPUs (96 GB each), detailing model download, Docker image preparation, launch script tweaks, and memory compression via FP8 and expert parallelism. It reports observed concurrency limits and tokens‑per‑second speeds, including a test that disables the model's thinking mode.

DeepSeek V4 · Docker · FP8 quantization

6 min read

MaGe Linux Operations

Jul 21, 2025 · Artificial Intelligence

Master Multi‑GPU Load Balancing for OLLAMA: From Zero to Production

This guide walks you through configuring OLLAMA for multi‑GPU load balancing, covering hardware checks, CUDA setup, native and Docker deployment methods, detailed parameter tuning, advanced sharding strategies, troubleshooting, performance optimization, and production‑grade monitoring to maximize throughput and stability of large language models.

AI deployment · CUDA · Ollama

16 min read

Alibaba Cloud Native

Dec 19, 2024 · Artificial Intelligence

Deploy Open-Source LLMs on Alibaba Cloud Function Compute in 10 Minutes

This guide explains how to quickly launch an open‑source large language model from ModelScope on Alibaba Cloud Function Compute, covering the required cloud services, step‑by‑step deployment, reserved‑instance configuration, and how to invoke the model via the provided domain.

AI · Alibaba Cloud · Deployment

7 min read

Alibaba Cloud Native

Dec 27, 2023 · Cloud Computing

One‑Click Deployment of LLMs to Alibaba Cloud Function Compute with SwingDeploy

This guide explains how to quickly select a ModelScope open‑source LLM, deploy it to Alibaba Cloud Function Compute using the SwingDeploy one‑click feature, enable reserved idle billing, and evaluate the cost savings compared with traditional GPU provisioning.

Function Compute · GPU · LLM

11 min read

Alibaba Cloud Big Data AI Platform

Jul 11, 2024 · Artificial Intelligence

How Llumnix Cuts LLM Serving Latency by 10× with Dynamic Scheduling

Alibaba Cloud's PAI team unveiled Llumnix, a dynamic scheduling framework for large language model serving that dramatically reduces tail latency, speeds high‑priority requests, and cuts costs, earning acceptance at OSDI 2024 and now open‑sourced on GitHub.

AI Systems · Cost Reduction · Dynamic Scheduling

5 min read

Architect's Alchemy Furnace

Jul 17, 2025 · Artificial Intelligence

Explore the Ultimate Open-Source LLM Catalog: Models, Tools, and Resources

This article compiles a comprehensive, up‑to‑date inventory of open‑source large language models from Chinese and international organizations, detailing each model’s architecture, parameter count, multilingual capabilities, deployment requirements, and associated tools, offering a valuable reference for AI researchers and developers.

AI · LLM · Large Language Model

50 min read

AI Explorer

Mar 3, 2026 · Artificial Intelligence

How LMCache’s Lightning‑Fast KV Cache Slashes LLM First‑Token Latency

LMCache separates the KV cache from a vLLM instance into a shared service, dramatically cutting first‑token latency for repeated text, enabling multiple GPU instances to reuse cached vectors, improving hardware utilization, and supporting use cases such as long‑document QA, multi‑GPU load balancing, and prompt‑engineering, with a quick Docker‑based demo.

Docker · KV Cache · LLM inference

6 min read

Alibaba Cloud Infrastructure

Feb 8, 2025 · Artificial Intelligence

Deploying a Production‑Ready DeepSeek‑R1 Inference Service on Alibaba Cloud ACK with KServe

This guide explains how to deploy a production‑ready DeepSeek‑R1 inference service on Alibaba Cloud ACK using KServe, covering model preparation, storage configuration, service deployment, observability, autoscaling, model acceleration, gray‑release and GPU‑shared inference.

DeepSeek · GPU · KServe

13 min read

Baidu Intelligent Cloud Tech Hub

Nov 19, 2025 · Artificial Intelligence

Boost LLM Inference Speed with Token‑Level Two‑Chunk Overlap

Token‑level Two‑Chunk Overlap replaces traditional batch‑level Two‑Batch Overlap, dynamically splitting sequences into balanced token chunks, enabling near‑equal compute and communication times, improving GPU utilization and achieving up to 30% throughput gains in heterogeneous request workloads, with zero accuracy loss.

Batch scheduling · GPU utilization · LLM inference

9 min read

DaTaobao Tech

Sep 27, 2023 · Artificial Intelligence

FlashAttention-2: Efficient Attention Algorithm for Transformer Acceleration and AIGC Applications

FlashAttention‑2 is an IO‑aware exact attention algorithm that cuts GPU HBM traffic through tiling and recomputation, reduces non‑matmul FLOPs, and extends sequence parallelism and warp‑level work distribution. It delivers up to 2× speedup over FlashAttention with near‑GEMM efficiency, enabling longer‑context Transformer training and inference for AIGC with fastunet and negligible accuracy loss.

AIGC · Attention optimization · FlashAttention-2

20 min read

Old Zhang's AI Learning

Mar 7, 2026 · Artificial Intelligence

vLLM 0.17.0 Release: Full Qwen 3.5 Support and Anthropic API Compatibility

The vLLM 0.17.0 release brings FlashAttention 4 integration, a mature Model Runner V2, complete Qwen 3.5 series support, a one‑click performance‑mode flag, Anthropic API compatibility, advanced weight‑offloading, broader hardware support beyond NVIDIA, ASR model integration, and detailed upgrade and installation guidance.

ASR · Anthropic API · FlashAttention

12 min read

AI Explorer

Mar 18, 2026 · Artificial Intelligence

Run and Fine‑Tune Hundreds of Open‑Source LLMs Locally with Unsloth

Unsloth offers a unified web UI that accelerates fine‑tuning by up to 2×, cuts VRAM usage by 70% (80% for RL), supports hundreds of open‑source models, and provides simple installation steps for rapid local AI experimentation.

AI workstation · GPU optimization · LLM

6 min read

Alibaba Cloud Infrastructure

Jan 21, 2026 · Artificial Intelligence

Boost LLM Performance: Deploy Qwen3‑235B with PD‑Separation, MoE, SGLang & RBG

This article details how to deploy the 235‑billion‑parameter Qwen3‑235B model using PD‑separation and MoE techniques, explains the associated challenges, and demonstrates a production‑grade solution built on the high‑performance SGLang inference engine and the RoleBasedGroup (RBG) orchestration framework, complete with benchmark results and best‑practice YAML examples.

AI · Kubernetes · LLM

21 min read