Collection size

99 articles

Nov 24, 2025 · Artificial Intelligence

How ktransformers Lets Your Laptop Run 13B LLMs Without a GPU

ktransformers is an open‑source AI model optimization framework that dramatically reduces memory usage and speeds up loading and inference, enabling ordinary laptops, even those without a GPU, to run 7B‑13B large language models for coding, content creation, and academic assistance.

KTransformers · LLM optimization · Model Compression

0 likes · 10 min read

Volcano Engine Developer Services

Feb 10, 2025 · Artificial Intelligence

How to Quickly Deploy DeepSeek‑R1‑Distill on Volcengine Cloud: Three Practical Methods

This article explains how to deploy DeepSeek's open‑source large language models—especially DeepSeek‑R1‑Distill—on Volcengine Cloud using three approaches: a containerized VKE solution, a serverless veFaaS setup, and a one‑click Terraform script, complete with step‑by‑step instructions, code snippets, and configuration tips.

DeepSeek · Volcengine · cloud deployment

0 likes · 18 min read

Architect's Alchemy Furnace

May 7, 2025 · Artificial Intelligence

Which LLM Inference Engine Reigns Supreme? A Deep Dive into Transformers, vLLM, Llama.cpp, SGLang, MLX and Ollama

This article provides a comprehensive comparison of seven popular large‑language‑model inference engines—Transformers, vLLM, Llama.cpp, SGLang, MLX, Ollama and others—detailing their core features, performance characteristics, hardware compatibility, concurrency support, and ideal use‑cases, plus practical installation guidance for Xinference.

LLM · MLX · SGLang

0 likes · 17 min read

Volcano Engine Developer Services

Jun 30, 2023 · Cloud Native

Deploy Langchain‑ChatGLM on Volcengine VKE: A Step‑by‑Step Cloud‑Native Guide

This tutorial walks you through preparing a VKE cluster, pulling the Langchain‑ChatGLM container image, creating the necessary Deployment and Service resources, and adding a local knowledge base, enabling you to run a Langchain‑based ChatGLM service with GPU support on Volcengine’s cloud‑native platform.

AI deployment · ChatGLM · GPU

0 likes · 6 min read

DataFunSummit

Dec 28, 2024 · Artificial Intelligence

Memory Optimization for Large Model Inference: Virtual Tensor and LayerKV Techniques

This talk presents the Ant Group team's recent work on large‑model inference memory optimization, covering GPU memory challenges, virtual memory management (VMM), the Virtual Tensor framework, LayerKV techniques, performance comparisons with Page Attention and FlashAttention, and extensive experimental results demonstrating reduced latency and higher QPS.

GPU · Performance · Virtual Memory

0 likes · 25 min read

JD Tech Talk

Feb 10, 2025 · Artificial Intelligence

Deploy DeepSeek on JD Cloud GPU and Chat with It via Ollama & Chatbox

This guide walks you through preparing a JD Cloud GPU instance, installing NVIDIA drivers, deploying Ollama, running the DeepSeek LLM (including model download and execution), configuring the Chatbox graphical client for interactive queries, and optionally feeding local documents into AnythingLLM for a private knowledge base.

AnythingLLM · Chatbox · DeepSeek

0 likes · 17 min read

Raymond Ops

Dec 16, 2025 · Artificial Intelligence

Master Multi‑GPU Load Balancing for OLLAMA: From Setup to Production

This guide walks you through configuring Ollama for multi‑GPU load balancing, covering hardware checks, CUDA and Docker setup, native and containerized deployment methods, core parameter tuning, advanced sharding, dynamic monitoring, troubleshooting, production best practices, and a real‑world RTX 4090 case study.

AI inference · CUDA · GPU

0 likes · 15 min read

360 Smart Cloud

Mar 4, 2021 · Artificial Intelligence

Optimizing BERT Online Service Deployment at 360 Search

This article describes the challenges of deploying a large BERT model as an online service for 360 Search and details engineering optimizations—including framework selection, model quantization, knowledge distillation, stream scheduling, caching, and dynamic sequence handling—that dramatically improve latency, throughput, and resource utilization.

BERT · FP16 quantization · GPU optimization

0 likes · 12 min read

Alibaba Cloud Native

Feb 18, 2025 · Cloud Native

Deploy DeepSeek‑R1 on Alibaba Cloud ACK One Using ACS GPU in Minutes

This guide shows how to overcome on‑premise compute limits by registering a local Kubernetes cluster to Alibaba Cloud ACK One, provisioning ACS GPU resources, and deploying the DeepSeek‑R1 inference model with the vLLM framework through a series of concrete commands and YAML configurations.

ACK One · ACS GPU · DeepSeek

0 likes · 15 min read

Old Zhang's AI Learning

Mar 4, 2026 · Artificial Intelligence

How to Turn Thinking Mode On or Off for Qwen3.5 Models in Ollama, LM Studio, llama.cpp, and vLLM

This guide shows step‑by‑step how to enable or disable the thinking mode of Qwen3.5 series large language models across Ollama, LM Studio (GGUF and MLX), llama.cpp, and vLLM/SGLang using command‑line flags, custom model YAML files, and API parameters.

LM Studio · Ollama · Thinking mode

0 likes · 4 min read

Alibaba Cloud Developer

Dec 24, 2025 · Artificial Intelligence

Boosting LLM Inference: RoleBasedGroup & Mooncake for Stable, High‑Performance Service

Large language model inference faces memory pressure, but by externalizing KVCache with Mooncake and orchestrating roles via the Kubernetes‑native RoleBasedGroup (RBG), developers can achieve stable, high‑throughput, cost‑effective serving with seamless in‑place upgrades and topology‑aware performance.

AI infrastructure · KVCache · Kubernetes

0 likes · 21 min read

Baidu Intelligent Cloud Tech Hub

Dec 10, 2025 · Artificial Intelligence

Accelerate LLM Deployment on Baidu Kunlun XPU with the Open‑Source vLLM‑Kunlun Plugin

The vLLM‑Kunlun Plugin, built on the vLLM hardware‑plugin RFC, lets developers deploy any major large language model on Baidu's Kunlun XPU instantly without modifying vLLM core code, dramatically shortening migration time, providing high‑performance fusion operators, and offering open‑source tools for precision verification and profiling.

Kunlun · LLM · Open Source

0 likes · 8 min read

Baobao Algorithm Notes

Oct 15, 2023 · Artificial Intelligence

Run a 70B FP16 Model on a Single 16 GB GPU with PyTorch Meta Device

This article explains how to overcome GPU memory limits by using PyTorch 1.9's meta device to create an empty model, load large‑scale model weights layer‑by‑layer, move each part to a 16 GB GPU for inference, and release memory, enabling a 70B FP16 model to run on a single consumer‑grade GPU.

GPU memory optimization · PyTorch · meta device

0 likes · 12 min read

Old Meng AI Explorer

Jan 10, 2026 · Artificial Intelligence

Run Large Language Models on a Laptop: How ktransformers Breaks the GPU Barrier

ktransformers is an open‑source AI model optimization framework that uses dynamic quantization, layer fusion and memory reuse to cut memory usage by up to 50%, double loading speed and reduce inference cost, enabling 7B‑13B models to run smoothly on ordinary CPUs or low‑end GPUs.

KTransformers · Open Source · Python

0 likes · 11 min read

Architect's Alchemy Furnace

Mar 27, 2025 · Artificial Intelligence

Xinference vs Ollama: Which Open‑Source LLM Engine Fits Your Needs?

This article provides a comprehensive side‑by‑side comparison of the open‑source LLM serving tools Xinference and Ollama, examining their core goals, architecture, model support, deployment options, performance, ecosystem integration, typical use cases, future roadmap, and guidance on selecting the right solution for enterprise or personal projects.

LLM · Model Serving · Open Source

0 likes · 7 min read

AI Algorithm Path

Feb 24, 2025 · Artificial Intelligence

Flash-MLA: Boosting LLM Inference Speed on Nvidia Hopper GPUs

Flash-MLA is an open‑source GPU kernel optimized for Nvidia Hopper GPUs that compresses the KV cache of Multi‑head Latent Attention (MLA), cutting memory usage by up to 93.3% and delivering 580 TFLOPS compute, thereby dramatically accelerating large‑language‑model inference while lowering cost.

DeepSeek · Flash-MLA · GPU optimization

0 likes · 8 min read

Alibaba Cloud Observability

Oct 20, 2025 · Artificial Intelligence

How We Boosted Embedding Throughput 16× and Cut Vector Index Costs in a Cloud‑Native Setup

This article examines the high cost and low throughput of embedding vectors in log‑processing scenarios, analyzes the performance bottlenecks of inference frameworks, and details a series of cloud‑native optimizations—including switching to vLLM, deploying multiple model replicas with Triton, decoupling tokenization, and priority queuing—that together raise throughput by 16× and reduce per‑token pricing by two orders of magnitude.

Embedding · GPU inference · Performance Optimization

0 likes · 9 min read

Alibaba Cloud Big Data AI Platform

Sep 19, 2023 · Artificial Intelligence

BladeLLM: Ultra‑Long Context LLM Inference via RaggedAttention & AutoTuner

BladeLLM, Alibaba Cloud’s large‑model inference engine, supports ultra‑long context lengths of up to 70K tokens, using a novel RaggedAttention mechanism and a DNN‑based AutoTuner to deliver strong performance, memory efficiency, and low‑latency inference across diverse workloads.

AI infrastructure · AutoTuner · LLM inference

0 likes · 11 min read

HyperAI Super Neural

Mar 10, 2026 · Artificial Intelligence

Deploy Popular Open‑Source LLMs on Free CPU in Minutes – Qwen3.5, DeepSeek‑R1, Gemma 3, Llama 3.2 and More

This guide shows how to use HyperAI’s free CPU quota to quickly deploy popular open‑source LLMs such as Qwen3.5, DeepSeek‑R1, Gemma 3 and Llama 3.2, walking through environment setup, model download, and inference execution without needing local GPU hardware.

CPU · HyperAI · LLM

0 likes · 6 min read
