Tag: GPU inference


Baidu Geek Talk
Nov 9, 2023 · Artificial Intelligence

Deep Learning Model Architecture Evolution in Baidu Search

The article chronicles how Baidu Search’s Model Architecture Group evolved deep‑learning‑driven search, detailing the shift from inverted‑index to semantic vector indexing, the use of transformer‑based models for text and image queries, large‑scale offline/online pipelines, and extensive GPU‑centric optimizations such as pruning, quantization, and distillation, all aimed at delivering precise, cost‑effective results to hundreds of millions of users.

Deep Learning · ERNIE · GPU inference
14 min read
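The shift from inverted‑index matching to semantic vector indexing described in this article can be illustrated with a minimal brute‑force retrieval sketch. All functions below are hypothetical stand‑ins: a production system embeds queries with a transformer encoder such as ERNIE and searches an approximate‑nearest‑neighbour index, not a character‑frequency toy.

```python
import math

def embed(text):
    # Toy embedding: a character-frequency vector, normalized to unit
    # length. A real system would use a learned transformer encoder.
    vec = [0.0] * 26
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            vec[ord(ch) - ord('a')] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Vectors are unit-normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

def search(query, corpus, top_k=2):
    # Brute-force nearest-neighbour search over document embeddings;
    # semantic indexing replaces exact term matching with this kind
    # of similarity ranking.
    q = embed(query)
    scored = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return scored[:top_k]
```

At web scale the linear scan is replaced by an ANN structure (e.g. HNSW or IVF), but the ranking signal (embedding similarity rather than term overlap) is the same.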
NetEase Media Technology Team
Aug 9, 2023 · Artificial Intelligence

GPU Model Inference Optimization Practices in NetEase News Recommendation System

The article outlines practical GPU inference optimization for NetEase’s news recommendation, covering model analysis with Netron, multi‑GPU parallelism, memory‑copy reduction, batch sizing, TensorRT conversion and tuning, custom plugins, and the GRPS serving framework to achieve significant latency and utilization gains.

GPU inference · Multi-GPU · TensorRT
44 min read
DeWu Technology
Mar 8, 2023 · Artificial Intelligence

Optimizing Python GPU Inference Services with CPU/GPU Process Separation and TensorRT

By isolating CPU preprocessing and post‑processing from GPU inference into separate processes and applying TensorRT’s FP16/INT8 optimizations, the custom Python framework boosts a Python vision inference service from roughly 4.5 QPS to 27.4 QPS (a 5‑10× speedup) while reducing GPU utilization and cost.

CPU-GPU Separation · CUDA · GPU inference
14 min read
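The CPU/GPU process‑separation pattern this article summarizes can be sketched with Python's standard multiprocessing module. The stage functions below are toy stand‑ins, not DeWu's framework: in the real service the inference process owns the CUDA context and runs the TensorRT engine, while the CPU stages decode and resize images.

```python
from multiprocessing import Process, Queue

def preprocess_worker(raw_q, infer_q):
    # CPU-bound stage in its own process, so Python-level preprocessing
    # never blocks the GPU stage (a trivial transform stands in here).
    while True:
        item = raw_q.get()
        if item is None:               # poison pill: propagate and exit
            infer_q.put(None)
            break
        infer_q.put(item * 2)          # pretend preprocessing

def inference_worker(infer_q, out_q):
    # GPU-bound stage; in production this process would hold the CUDA
    # context and call the TensorRT engine (a dummy "model" stands in).
    while True:
        item = infer_q.get()
        if item is None:
            out_q.put(None)
            break
        out_q.put(item + 1)            # pretend inference

def run_pipeline(inputs):
    raw_q, infer_q, out_q = Queue(), Queue(), Queue()
    procs = [Process(target=preprocess_worker, args=(raw_q, infer_q)),
             Process(target=inference_worker, args=(infer_q, out_q))]
    for p in procs:
        p.start()
    for x in inputs:
        raw_q.put(x)
    raw_q.put(None)                    # shut both stages down
    results = []
    while True:
        r = out_q.get()
        if r is None:
            break
        results.append(r)
    for p in procs:
        p.join()
    return results
```

Because each stage is a separate process, the GIL no longer serializes CPU preprocessing against inference dispatch, which is the core of the speedup the article reports.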
DataFunSummit
Nov 3, 2022 · Artificial Intelligence

Applying NVIDIA MPS to Boost GPU Utilization for Recommendation Inference

This article explains why traditional CPU inference and naïve GPU usage are inefficient for recommendation workloads, introduces NVIDIA Multi‑Process Service (MPS) technology, describes VIVO’s custom Rust‑based inference engine and deployment strategies, and presents performance and cost benefits along with practical deployment considerations.

GPU inference · Kubernetes · MPS
13 min read
DataFunTalk
Feb 14, 2021 · Artificial Intelligence

TurboTransformers: An Efficient GPU Serving System for Transformer Models

TurboTransformers introduces a suite of GPU‑centric optimizations—including a high‑throughput batch reduction algorithm, a variable‑length‑aware memory allocator, and a dynamic‑programming‑based batch scheduling strategy—that together deliver significantly lower latency and higher throughput for Transformer‑based NLP services compared with existing frameworks such as PyTorch, TensorFlow, ONNX Runtime and TensorRT.

BERT · Dynamic batching · GPU inference
13 min read
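The variable‑length‑aware batching idea behind TurboTransformers can be illustrated with a toy greedy scheduler that groups requests of similar sequence length to cut padding waste. Everything below is a simplified sketch; TurboTransformers itself uses a dynamic‑programming formulation to choose batch boundaries.

```python
def padded_cost(batch):
    # Cost of one batch in token slots: every sequence is padded to the
    # longest sequence in the batch.
    return len(batch) * max(batch)

def schedule(lengths, max_batch=4):
    # Sort requests by length, then cut fixed-size batches, so each
    # batch holds sequences of similar length and padding is minimized.
    ordered = sorted(lengths)
    return [ordered[i:i + max_batch] for i in range(0, len(ordered), max_batch)]

def total_cost(batches):
    return sum(padded_cost(b) for b in batches)
```

For example, running lengths [5, 120, 7, 128] as one padded batch costs 4 × 128 = 512 token slots, while the two length‑sorted batches [5, 7] and [120, 128] cost 2 × 7 + 2 × 128 = 270, which is the kind of saving a length‑aware scheduler exploits.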
58 Tech
Nov 6, 2019 · Artificial Intelligence

TensorRT Acceleration and Integration Design for the 58 AI Platform (WPAI)

This article explains how the 58 AI platform leverages NVIDIA TensorRT to accelerate deep‑learning inference on GPUs, describes three integration approaches, details the TF‑TRT implementation and Kubernetes deployment, and presents performance gains for ResNet‑50 and OCR models.

AI Platform · GPU inference · Kubernetes deployment
7 min read