Collection size
99 articles
Page 5 of 5
Old Meng AI Explorer
Old Meng AI Explorer
Nov 24, 2025 · Artificial Intelligence

How ktransformers Lets Your Laptop Run 13B LLMs Without a GPU

ktransformers is an open‑source AI model optimization framework that dramatically reduces memory usage and speeds up loading and inference, enabling ordinary laptops— even without a GPU— to run 7B‑13B large language models for coding, content creation, and academic assistance.

KTransformersLLM optimizationModel Compression
0 likes · 10 min read
How ktransformers Lets Your Laptop Run 13B LLMs Without a GPU
Volcano Engine Developer Services
Volcano Engine Developer Services
Feb 10, 2025 · Artificial Intelligence

How to Quickly Deploy DeepSeek‑R1‑Distill on Volcengine Cloud: Three Practical Methods

This article explains how to deploy DeepSeek's open‑source large language models—especially DeepSeek‑R1‑Distill—on Volcengine Cloud using three approaches: a containerized VKE solution, a serverless veFaaS setup, and a one‑click Terraform script, complete with step‑by‑step instructions, code snippets, and configuration tips.

DeepSeekVolcenginecloud deployment
0 likes · 18 min read
How to Quickly Deploy DeepSeek‑R1‑Distill on Volcengine Cloud: Three Practical Methods
Architect's Alchemy Furnace
Architect's Alchemy Furnace
May 7, 2025 · Artificial Intelligence

Which LLM Inference Engine Reigns Supreme? A Deep Dive into Transformers, vLLM, Llama.cpp, SGLang, MLX and Ollama

This article provides a comprehensive comparison of seven popular large‑language‑model inference engines—Transformers, vLLM, Llama.cpp, SGLang, MLX, Ollama and others—detailing their core features, performance characteristics, hardware compatibility, concurrency support, and ideal use‑cases, plus practical installation guidance for Xinference.

LLMMLXSGLang
0 likes · 17 min read
Which LLM Inference Engine Reigns Supreme? A Deep Dive into Transformers, vLLM, Llama.cpp, SGLang, MLX and Ollama
Volcano Engine Developer Services
Volcano Engine Developer Services
Jun 30, 2023 · Cloud Native

Deploy Langchain‑ChatGLM on Volcengine VKE: A Step‑by‑Step Cloud‑Native Guide

This tutorial walks you through preparing a VKE cluster, pulling the Langchain‑ChatGLM container image, creating the necessary Deployment and Service resources, and adding a local knowledge base, enabling you to run a Langchain‑based ChatGLM service with GPU support on Volcengine’s cloud‑native platform.

AI deploymentChatGLMGPU
0 likes · 6 min read
Deploy Langchain‑ChatGLM on Volcengine VKE: A Step‑by‑Step Cloud‑Native Guide
DataFunSummit
DataFunSummit
Dec 28, 2024 · Artificial Intelligence

Memory Optimization for Large Model Inference: Virtual Tensor and LayerKV Techniques

This talk presents the Ant Group team's recent work on large‑model inference memory optimization, covering GPU memory challenges, virtual memory management (VMM), the Virtual Tensor framework, LayerKV techniques, performance comparisons with Page Attention and FlashAttention, and extensive experimental results demonstrating reduced latency and higher QPS.

GPUPerformanceVirtual Memory
0 likes · 25 min read
Memory Optimization for Large Model Inference: Virtual Tensor and LayerKV Techniques
JD Tech Talk
JD Tech Talk
Feb 10, 2025 · Artificial Intelligence

Deploy DeepSeek on JD Cloud GPU and Chat with It via Ollama & Chatbox

This guide walks you through preparing a JD Cloud GPU instance, installing NVIDIA drivers, deploying Ollama, running the DeepSeek LLM (including model download and execution), configuring the Chatbox graphical client for interactive queries, and optionally feeding local documents into AnythingLLM for a private knowledge base.

AnythingLLMChatboxDeepSeek
0 likes · 17 min read
Deploy DeepSeek on JD Cloud GPU and Chat with It via Ollama & Chatbox
Raymond Ops
Raymond Ops
Dec 16, 2025 · Artificial Intelligence

Master Multi‑GPU Load Balancing for OLLAMA: From Setup to Production

This guide walks you through configuring OLLAMA for multi‑GPU load balancing, covering hardware checks, CUDA and Docker setup, native and containerized deployment methods, core parameter tuning, advanced sharding, dynamic monitoring, troubleshooting, production best practices, and a real‑world RTX 4090 case study.

AI inferenceCUDAGPU
0 likes · 15 min read
Master Multi‑GPU Load Balancing for OLLAMA: From Setup to Production
360 Smart Cloud
360 Smart Cloud
Mar 4, 2021 · Artificial Intelligence

Optimizing BERT Online Service Deployment at 360 Search

This article describes the challenges of deploying a large BERT model as an online service for 360 Search and details engineering optimizations—including framework selection, model quantization, knowledge distillation, stream scheduling, caching, and dynamic sequence handling—that dramatically improve latency, throughput, and resource utilization.

BERTFP16 quantizationGPU optimization
0 likes · 12 min read
Optimizing BERT Online Service Deployment at 360 Search
Alibaba Cloud Native
Alibaba Cloud Native
Feb 18, 2025 · Cloud Native

Deploy DeepSeek‑R1 on Alibaba Cloud ACK One Using ACS GPU in Minutes

This guide shows how to overcome on‑premise compute limits by registering a local Kubernetes cluster to Alibaba Cloud ACK One, provisioning ACS GPU resources, and deploying the DeepSeek‑R1 inference model with the vLLM framework through a series of concrete commands and YAML configurations.

ACK OneACS GPUDeepSeek
0 likes · 15 min read
Deploy DeepSeek‑R1 on Alibaba Cloud ACK One Using ACS GPU in Minutes
Alibaba Cloud Developer
Alibaba Cloud Developer
Dec 24, 2025 · Artificial Intelligence

Boosting LLM Inference: RoleBasedGroup & Mooncake for Stable, High‑Performance Service

Large language model inference faces memory pressure, but by externalizing KVCache with Mooncake and orchestrating roles via the Kubernetes‑native RoleBasedGroup (RBG), developers can achieve stable, high‑throughput, cost‑effective serving with seamless in‑place upgrades and topology‑aware performance.

AI infrastructureKVCacheKubernetes
0 likes · 21 min read
Boosting LLM Inference: RoleBasedGroup & Mooncake for Stable, High‑Performance Service
Baidu Intelligent Cloud Tech Hub
Baidu Intelligent Cloud Tech Hub
Dec 10, 2025 · Artificial Intelligence

Accelerate LLM Deployment on Baidu Kunlun XPU with the Open‑Source vLLM‑Kunlun Plugin

The vLLM‑Kunlun Plugin, built on the vLLM hardware‑plugin RFC, lets developers deploy any major large language model on Baidu's Kunlun XPU instantly without modifying vLLM core code, dramatically shortening migration time, providing high‑performance fusion operators, and offering open‑source tools for precision verification and profiling.

KunlunLLMOpen Source
0 likes · 8 min read
Accelerate LLM Deployment on Baidu Kunlun XPU with the Open‑Source vLLM‑Kunlun Plugin
Baobao Algorithm Notes
Baobao Algorithm Notes
Oct 15, 2023 · Artificial Intelligence

Run a 70B FP16 Model on a Single 16 GB GPU with PyTorch Meta Device

This article explains how to overcome GPU memory limits by using PyTorch 1.9's meta device to create an empty model, load large‑scale model weights layer‑by‑layer, move each part to a 16 GB GPU for inference, and release memory, enabling a 70B FP16 model to run on a single consumer‑grade GPU.

GPU memory optimizationPyTorchmeta device
0 likes · 12 min read
Run a 70B FP16 Model on a Single 16 GB GPU with PyTorch Meta Device
Architect's Alchemy Furnace
Architect's Alchemy Furnace
Mar 27, 2025 · Artificial Intelligence

Xinference vs Ollama: Which Open‑Source LLM Engine Fits Your Needs?

This article provides a comprehensive side‑by‑side comparison of the open‑source LLM serving tools Xinference and Ollama, examining their core goals, architecture, model support, deployment options, performance, ecosystem integration, typical use cases, future roadmap, and guidance on selecting the right solution for enterprise or personal projects.

LLMModel ServingOpen Source
0 likes · 7 min read
Xinference vs Ollama: Which Open‑Source LLM Engine Fits Your Needs?
AI Algorithm Path
AI Algorithm Path
Feb 24, 2025 · Artificial Intelligence

Flash-MLA: Boosting LLM Inference Speed on Nvidia Hopper GPUs

Flash-MLA is an open‑source GPU kernel optimized for Nvidia Hopper GPUs that compresses the KV cache of multi‑head attention, cutting memory usage by up to 93.3% and delivering 580 TFLOPS compute, thereby dramatically accelerating large‑language‑model inference while lowering cost.

DeepSeekFlash-MLAGPU optimization
0 likes · 8 min read
Flash-MLA: Boosting LLM Inference Speed on Nvidia Hopper GPUs
Alibaba Cloud Observability
Alibaba Cloud Observability
Oct 20, 2025 · Artificial Intelligence

How We Boosted Embedding Throughput 16× and Cut Vector Index Costs in a Cloud‑Native Setup

This article examines the high cost and low throughput of embedding vectors in log‑processing scenarios, analyzes the performance bottlenecks of inference frameworks, and details a series of cloud‑native optimizations—including switching to vLLM, deploying multiple model replicas with Triton, decoupling tokenization, and priority queuing—that together raise throughput by 16× and reduce per‑token pricing by two orders of magnitude.

EmbeddingGPU inferencePerformance Optimization
0 likes · 9 min read
How We Boosted Embedding Throughput 16× and Cut Vector Index Costs in a Cloud‑Native Setup
Alibaba Cloud Big Data AI Platform
Alibaba Cloud Big Data AI Platform
Sep 19, 2023 · Artificial Intelligence

BladeLLM: Ultra‑Long Context LLM Inference via RaggedAttention & AutoTuner

BladeLLM, Alibaba Cloud’s large‑model inference engine, pushes the limits of LLMs by supporting ultra‑long context lengths up to 70 K tokens, leveraging novel RaggedAttention and a DNN‑based AutoTuner to deliver superior performance, memory efficiency, and low‑latency inference across diverse workloads.

AI infrastructureAutoTunerLLM inference
0 likes · 11 min read
BladeLLM: Ultra‑Long Context LLM Inference via RaggedAttention & AutoTuner