Tag

AI inference

18 articles collected under this tag.

Code Mala Tang
Apr 3, 2025 · Artificial Intelligence

Intel Core Ultra 5 vs Apple M1: Which Wins for Large Language Model Inference?

This article compares the inference performance of a high‑end Intel Core Ultra 5 AI workstation with an Apple M1 MacBook Air using the IPEX‑LLM library, detailing installation steps, minimal code changes, resource usage, and benchmark results for small and large language models (a sketch of the code change follows this entry).

AI inference · Apple M1 · IPEX-LLM
0 likes · 9 min read
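The "minimal code changes" the article refers to amount to an import swap: IPEX‑LLM ships a drop‑in replacement for the Hugging Face model classes. A minimal sketch assuming the standard transformers workflow; the model name, prompt, and generation settings are illustrative, not taken from the article:

```python
# IPEX-LLM drop-in: only the model import changes versus stock Hugging Face.
# Model name, prompt, and generation settings are illustrative assumptions.
from transformers import AutoTokenizer
# from transformers import AutoModelForCausalLM          # stock Hugging Face
from ipex_llm.transformers import AutoModelForCausalLM   # IPEX-LLM drop-in

model_id = "Qwen/Qwen2-1.5B-Instruct"  # any HF causal LM
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,  # INT4 weight-only quantization on Intel hardware
    trust_remote_code=True,
)

inputs = tokenizer("What is AI inference?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```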
Alibaba Cloud Infrastructure
Mar 18, 2025 · Cloud Native

Gray Release of LoRA and Base Models Using ACK Gateway with AI Extension on Kubernetes

This guide explains how to deploy large language model inference services on a GPU-enabled Kubernetes cluster, configure ACK Gateway with AI Extension for intelligent routing and load balancing, and perform gray releases for both LoRA fine‑tuned models and base models such as QwQ‑32B and DeepSeek‑R1, including step‑by‑step commands and validation procedures (a weighted‑routing sketch follows this entry).

ACK Gateway · AI inference · LLM
0 likes · 25 min read
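ACK Gateway configures gray releases declaratively through its own Kubernetes resources; the sketch below is not that API, only a plain‑Python illustration of the idea underneath: splitting traffic between a base backend and a LoRA variant by weight. The service names, URLs, and 90/10 split are assumptions.

```python
# Weighted backend selection, the core mechanism of a gray release.
# Names, URLs, and weights are illustrative, not from the ACK Gateway config.
import random

BACKENDS = [
    {"name": "qwq-32b-base", "url": "http://base-svc:8000/v1", "weight": 90},
    {"name": "qwq-32b-lora", "url": "http://lora-svc:8000/v1", "weight": 10},
]

def pick_backend() -> dict:
    """Choose a backend with probability proportional to its weight."""
    total = sum(b["weight"] for b in BACKENDS)
    r = random.uniform(0, total)
    for b in BACKENDS:
        r -= b["weight"]
        if r <= 0:
            return b
    return BACKENDS[-1]

# A gray release gradually raises the LoRA weight (10 -> 50 -> 100)
# while monitoring quality and latency, then retires the old backend.
print(pick_backend()["name"])
```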
Alibaba Cloud Infrastructure
Feb 21, 2025 · Artificial Intelligence

Deploying DeepSeek R1 Model Inference on ACK Edge with Virtual Nodes and Serverless GPU

This article explains how to use Alibaba Cloud ACK Edge to manage on‑premises GPU resources and seamlessly fall back to cloud‑based ACS Serverless GPU via virtual nodes for deploying DeepSeek R1 inference, covering environment preparation, model download, storage setup, custom scheduling, and scaling strategies.

ACK Edge · AI inference · DeepSeek
0 likes · 16 min read
Java Tech Enthusiast
Feb 15, 2025 · Artificial Intelligence

DeepSeek-R1: High-Performance AI Inference Model

DeepSeek‑R1 is a high‑performance AI inference model that leverages reinforcement‑learning techniques to boost reasoning on complex tasks, has become a Chinese‑New‑Year sensation, and requires substantial hardware resources for local deployment, especially the full‑scale 671‑billion‑parameter version.

AI deployment · AI inference · AI model
0 likes · 4 min read
Tencent Tech
Feb 4, 2025 · Artificial Intelligence

Deploy and Test DeepSeek Large Language Models on Tencent Cloud TI in Minutes

This guide walks you through quickly deploying DeepSeek series models on the Tencent Cloud TI platform, covering model selection, resource planning, step‑by‑step service creation, free online trial, API testing via built‑in tools or curl, and managing inference services for both large and compact models (an example API call follows this entry).

AI inference · DeepSeek · Model Deployment
0 likes · 13 min read
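Once a TI service is deployed, the article tests it with the console's built‑in tools or curl; the same check from Python looks roughly like the sketch below. The endpoint URL, header, and payload shape are assumptions here; the TI console shows the real values for your service.

```python
# Hypothetical smoke test against a deployed endpoint; URL, token, and
# model name are placeholders to be copied from the TI console.
import requests

URL = "https://<your-ti-endpoint>/v1/chat/completions"  # placeholder
HEADERS = {"Authorization": "Bearer <your-token>"}
payload = {
    "model": "deepseek-r1-distill-qwen-7b",  # illustrative model name
    "messages": [{"role": "user", "content": "Hello"}],
}

resp = requests.post(URL, headers=HEADERS, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```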
DevOps
Jan 6, 2025 · Artificial Intelligence

Ten Popular Large Language Model Deployment Engines and Tools: Features, Advantages, and Limitations

This article reviews ten mainstream LLM deployment solutions—including WebLLM, LM Studio, Ollama, vLLM, LightLLM, OpenLLM, Hugging Face TGI, GPT4All, llama.cpp, and Triton Inference Server—detailing their technical characteristics, strengths, drawbacks, and example deployment workflows for both personal and enterprise environments (a vLLM sketch follows this entry).

AI inference · GPU Acceleration · LLM
0 likes · 16 min read
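For a flavor of one of the ten engines, here is what vLLM's offline batch API looks like; the model name is illustrative and any Hugging Face causal LM works.

```python
# vLLM offline inference: continuous batching serves many prompts per call.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-1.5B-Instruct")  # illustrative model choice
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["What is PagedAttention?", "Explain the KV cache."], params)
for out in outputs:
    print(out.outputs[0].text)
```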
DevOps Cloud Academy
Dec 2, 2024 · Artificial Intelligence

Key Kubernetes Features that Benefit AI Inference Workloads

This article explains how Kubernetes’ native scalability, resource optimization, performance tuning, portability, and fault‑tolerance features align with the demands of AI inference, helping organizations run large ML models efficiently, cost‑effectively, and reliably across diverse environments (the autoscaling rule is sketched after this entry).

AI inference · fault tolerance · kubernetes
0 likes · 15 min read
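The scalability the article leans on is concrete: Kubernetes' Horizontal Pod Autoscaler scales inference replicas by the rule desiredReplicas = ceil(currentReplicas × currentMetric ÷ targetMetric). A worked instance, with the utilization figures as made‑up examples:

```python
# Kubernetes HPA scaling rule, applied to an inference deployment.
# The 90%/60% utilization figures are illustrative.
import math

def desired_replicas(current: int, current_metric: float, target_metric: float) -> int:
    return math.ceil(current * current_metric / target_metric)

# 4 pods averaging 90% utilization against a 60% target scale out to 6.
print(desired_replicas(4, 0.90, 0.60))  # -> 6
```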
JD Tech
Mar 18, 2024 · Artificial Intelligence

High‑Performance Inference Architecture: Distributed Graph Heterogeneous Computing Framework and GPU Multi‑Stream Optimization

The article describes how JD’s advertising team tackled the high‑concurrency, low‑latency challenges of online recommendation inference by designing a distributed graph heterogeneous computing framework, optimizing GPU kernel launches with TensorBatch and deep‑learning compiler techniques, and adopting a multi‑stream GPU architecture, achieving significant throughput and latency improvements (a generic multi‑stream sketch follows this entry).

AI inference · GPU optimization · deep learning compiler
0 likes · 14 min read
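JD's framework is in‑house, but the multi‑stream idea generalizes: independent requests run on separate CUDA streams so their kernels overlap instead of serializing on the default stream. A generic PyTorch sketch, not JD's code:

```python
# Two independent "requests" overlapped on separate CUDA streams.
import torch

assert torch.cuda.is_available(), "requires a CUDA GPU"
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
with torch.cuda.stream(s1):
    r1 = a @ a  # request 1 executes on stream 1
with torch.cuda.stream(s2):
    r2 = b @ b  # request 2 executes concurrently on stream 2

torch.cuda.synchronize()  # wait for both streams before using results
print(r1.shape, r2.shape)
```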
php中文网 Courses
Jan 8, 2024 · Artificial Intelligence

Setting Up and Using LocalAI as an Open‑Source Alternative to the ChatGPT API

LocalAI is an open‑source, cost‑effective alternative to the ChatGPT API that lets you download and run thousands of language models locally via Docker or compiled binaries, offering privacy, customization, and easy integration into projects through a compatible API (a client sketch follows this entry).

AI inference · API · Docker
0 likes · 7 min read
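Because LocalAI speaks the OpenAI wire protocol, the stock openai client works once base_url points at the local server; the port below is LocalAI's default, and the model name must match one you have installed.

```python
# Talking to a local LocalAI server through the standard openai client.
# The model name is illustrative and must match a locally configured model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="llama-3.2-1b-instruct",  # whatever you installed in LocalAI
    messages=[{"role": "user", "content": "Say hello from LocalAI."}],
)
print(resp.choices[0].message.content)
```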
DataFunSummit
Jul 4, 2023 · Artificial Intelligence

PPL: A Full‑Platform Deep Learning Deployment Framework by SenseTime

The article presents SenseTime's PPL framework, detailing its toolchain, inference engine, multi‑backend operator library, quantization tools, CUDA optimizations, and performance benchmarks across CPUs, GPUs, DSPs, and DSAs, and outlining future plans for broader chip support and AI for Science.

AI inference · CUDA Optimization · Deep Learning Deployment
0 likes · 23 min read
Bilibili Tech
Jun 13, 2023 · Artificial Intelligence

InferX Inference Framework and Its Integration with Triton for High‑Performance AI Model Serving

Bilibili’s self‑developed InferX framework, combined with NVIDIA Triton Inference Server, streamlines AI model serving by adding quantization, structured sparsity, and custom kernels, delivering up to eight‑fold throughput gains, cutting GPU usage by half, and enabling faster, cost‑effective OCR and large‑model deployments (a Triton client sketch follows this entry).

AI inference · GPU utilization · InferX
0 likes · 10 min read
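InferX itself is Bilibili‑internal, but the Triton side of such a deployment is public API; a client sketch, with the model name, tensor names, and input shape as illustrative assumptions:

```python
# Minimal Triton HTTP client call; model and tensor names are hypothetical.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("INPUT0", list(batch.shape), "FP32")
inp.set_data_from_numpy(batch)

result = client.infer(model_name="ocr_recognizer", inputs=[inp])
print(result.as_numpy("OUTPUT0").shape)
```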
Tencent Tech
Apr 18, 2023 · Artificial Intelligence

How Tencent’s Zixiao AI Chip Supercharges Real‑Time Meeting Subtitles

Tencent’s home‑grown Zixiao AI inference chip, combined with the LightRuntime engine, dramatically reduces latency and cost for real‑time subtitles in Tencent Meeting, handling tens of thousands of concurrent audio streams while meeting sub‑second delay requirements through hardware‑software co‑optimizations and mixed‑precision model tuning.

AI inference · Performance Optimization · Tencent Meeting
0 likes · 16 min read
High Availability Architecture
Apr 3, 2023 · Cloud Native

Design and Implementation of Punica: A One‑Stop, Unattended AI Inference Platform

The article describes Punica, a cloud‑native, function‑as‑a‑service platform that unifies content‑understanding inference services through a one‑stop portal and unattended operations, improving deployment speed, resource utilization, and reducing manual effort for AI model serving.

AI inference · FaaS · Resource Scheduling
0 likes · 13 min read
Baidu Tech Salon
Mar 29, 2023 · Artificial Intelligence

Punica System: Enhancing AI Inference Service Efficiency Through FaaS Architecture

The Punica system unifies AI inference development, testing, deployment, and maintenance on a FaaS‑based one‑stop platform that automates resource scheduling, self‑healing, and monitoring, supporting multiple frameworks and GPUs, thereby doubling onboarding speed, quintupling scaling efficiency, and reclaiming hundreds of GPU cards.

AI inference · Container Framework · FaaS architecture
0 likes · 13 min read
Baidu Geek Talk
Mar 29, 2023 · Cloud Native

Punica: A Cloud‑Native Platform for Content Understanding Inference Services

Punica provides a cloud‑native, one‑stop platform that unifies Baidu’s content‑understanding inference services, automates testing, resource provisioning, and monitoring, and enables unattended, self‑healing operations with dynamic scaling and GPU scheduling, cutting onboarding time by half and reclaiming hundreds of GPUs.

AI inference · Inference Platform · Resource Scheduling
0 likes · 14 min read
DataFunTalk
Dec 7, 2022 · Artificial Intelligence

Vivo's Self‑Developed Streaming Speech‑Recognition Inference Engine and KunlunChip High‑Performance Inference Library

The article details vivo's development of a high‑accuracy, high‑performance streaming speech‑recognition inference engine built on the WeNet framework, its optimization techniques such as dynamic batching and memory pooling, collaborative acceleration with KunlunChip's high‑performance inference library, and extensive performance benchmarks demonstrating multi‑batch GPU and XPU gains (a dynamic‑batching sketch follows this entry).

AI inference · Kunlun chip · Speech Recognition
0 likes · 10 min read
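vivo's engine is proprietary; the sketch below only illustrates the dynamic‑batching idea it describes: hold the first request briefly, gather whatever else arrives within the window or up to a size cap, and run the group as one batch. The cap and window are made‑up numbers.

```python
# Generic dynamic batching: batch = first request + whatever arrives
# within MAX_WAIT_S, capped at MAX_BATCH. Values are illustrative.
import queue
import time

MAX_BATCH, MAX_WAIT_S = 8, 0.01
requests_q: queue.Queue = queue.Queue()

def collect_batch() -> list:
    batch = [requests_q.get()]                 # block for the first request
    deadline = time.monotonic() + MAX_WAIT_S
    while len(batch) < MAX_BATCH:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests_q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

for i in range(5):
    requests_q.put(f"audio-chunk-{i}")
print(len(collect_batch()))  # -> 5: all five chunks fit in one batch
```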
Tencent Architect
Jun 9, 2022 · Artificial Intelligence

From Zero to Chip: Tencent’s Multi‑Year Journey in AI, FPGA, and Smart‑NIC Development

Tencent’s hardware teams evolved from a lack of verification tools in 2019 to building AI inference chips, video‑encoding silicon, and intelligent NICs, overcoming FPGA challenges, scaling cloud infrastructure, and delivering high‑performance, low‑cost solutions for massive multimedia and AI workloads.

AI inference · FPGA · Tencent
0 likes · 16 min read
Yiche Technology
Jan 27, 2022 · Backend Development

C++ Multithreaded Service Architecture for High‑Throughput AI Inference

The article explains how to design a C++‑based multithreaded service that uses Pthreads, channels, and TensorRT to parallelize deep‑learning inference tasks, thereby reducing latency and dramatically increasing throughput for AI applications such as facial‑recognition access control systems (a worker‑pool sketch follows this entry).

AI inference · C++ · Concurrency
0 likes · 11 min read
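The article's implementation is C++ with Pthreads and TensorRT; this Python sketch only mirrors the channel/worker‑pool pattern it describes, with run_inference standing in for a real TensorRT execution call.

```python
# Channel-fed worker pool: a queue acts as the channel, N threads drain it.
import queue
import threading

tasks: queue.Queue = queue.Queue()    # the "channel" feeding the workers
results: queue.Queue = queue.Queue()

def run_inference(frame) -> str:
    return f"label-for-{frame}"       # placeholder for model execution

def worker() -> None:
    while True:
        frame = tasks.get()
        if frame is None:             # sentinel shuts the worker down
            break
        results.put(run_inference(frame))
        tasks.task_done()

for _ in range(4):                    # four parallel inference workers
    threading.Thread(target=worker, daemon=True).start()

for frame in range(16):               # enqueue 16 pseudo camera frames
    tasks.put(frame)
tasks.join()                          # block until every frame is processed
print(results.qsize())                # -> 16
```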