Tag

TensorRT

0 views collected around this technical thread.

58 Tech
58 Tech
Apr 11, 2025 · Artificial Intelligence

Optimization of Multimodal Visual Large Model Inference: Pre‑processing, ViT TensorRT, CUDA Graphs, Tokenization, Prefix Cache, and Quantization

This report details a comprehensive set of optimizations for multimodal visual large‑model (VLM) inference—including image pre‑processing acceleration, TensorRT integration for the ViT module, CUDA‑Graph replay, token‑count reduction, prefix‑cache handling, and weight quantization—demonstrating up to three‑fold throughput gains while maintaining accuracy.

CUDA GraphTensorRTVisual Language Model
0 likes · 19 min read
Optimization of Multimodal Visual Large Model Inference: Pre‑processing, ViT TensorRT, CUDA Graphs, Tokenization, Prefix Cache, and Quantization
Zhuanzhuan Tech
Zhuanzhuan Tech
Oct 16, 2024 · Artificial Intelligence

Optimizing TorchServe Inference Service Architecture for High‑Performance AI Deployment

This article details the engineering practice of optimizing TorchServe‑based AI inference services, covering background challenges, framework selection, GPU‑accelerated Torch‑TRT integration, CPU‑side preprocessing improvements, and deployment on Kubernetes to achieve higher throughput and lower resource consumption.

GPUOptimizationKubernetesModelServing
0 likes · 17 min read
Optimizing TorchServe Inference Service Architecture for High‑Performance AI Deployment
DataFunTalk
DataFunTalk
Jan 26, 2024 · Artificial Intelligence

Efficient Deployment of Speech AI Models on GPUs

This article explains how to efficiently deploy speech AI models—including ASR and TTS—on GPUs using NVIDIA's Triton Inference Server and TensorRT, covering background challenges, GPU‑based solutions, decoding optimizations, Whisper acceleration with TensorRT‑LLM, streaming TTS improvements, voice‑cloning pipelines, future plans, and a Q&A session.

ASRGPUInference
0 likes · 20 min read
Efficient Deployment of Speech AI Models on GPUs
Rare Earth Juejin Tech Community
Rare Earth Juejin Tech Community
Jan 17, 2024 · Artificial Intelligence

Building a License Plate Recognition Service with C++, TensorRT, and Go

This article details how to train a YOLOv8‑pose model for license‑plate detection, convert it to TensorRT engine, implement C++ inference and preprocessing, expose the functionality via CGO to Go, and assemble a lightweight web service for real‑time plate recognition.

C++License Plate RecognitionModel Inference
0 likes · 12 min read
Building a License Plate Recognition Service with C++, TensorRT, and Go
NetEase Media Technology Team
NetEase Media Technology Team
Aug 9, 2023 · Artificial Intelligence

GPU Model Inference Optimization Practices in NetEase News Recommendation System

The article outlines practical GPU inference optimization for NetEase’s news recommendation, covering model analysis with Netron, multi‑GPU parallelism, memory‑copy reduction, batch sizing, TensorRT conversion and tuning, custom plugins, and the GRPS serving framework to achieve significant latency and utilization gains.

GPU inferenceMulti-GPUPerformance
0 likes · 44 min read
GPU Model Inference Optimization Practices in NetEase News Recommendation System
DataFunSummit
DataFunSummit
Apr 18, 2023 · Artificial Intelligence

Best Practices for Deploying Speech AI on GPUs with Triton and TensorRT

This article presents comprehensive best‑practice guidelines for deploying conversational speech AI—including ASR and TTS pipelines—on GPU servers using NVIDIA Triton Inference Server and TensorRT, covering workflow overview, performance optimizations, streaming inference, and real‑world deployment tips.

ASRGPU DeploymentSpeech AI
0 likes · 14 min read
Best Practices for Deploying Speech AI on GPUs with Triton and TensorRT
Kuaishou Large Model
Kuaishou Large Model
Mar 31, 2023 · Artificial Intelligence

How Kuaishou Elevates Video Quality and AI Performance at NVIDIA GTC 2023

At NVIDIA GTC 2023, Kuaishou engineers unveiled cutting‑edge solutions ranging from video quality assessment and enhancement, 3D digital‑human live streaming, a custom TensorRT‑based performance framework, large‑scale recommendation model acceleration, to multimodal massive‑model deployment for short‑video scenarios.

AI optimizationDigital HumanMultimodal Models
0 likes · 9 min read
How Kuaishou Elevates Video Quality and AI Performance at NVIDIA GTC 2023
DeWu Technology
DeWu Technology
Mar 8, 2023 · Artificial Intelligence

Optimizing Python GPU Inference Services with CPU/GPU Process Separation and TensorRT

By isolating CPU preprocessing and post‑processing from GPU inference into separate processes and applying TensorRT’s FP16/INT8 optimizations, the custom Python framework boosts Python vision inference services from roughly 4.5 to 27.4 QPS—a 5‑10× speedup—while reducing GPU utilization and cost.

CPU-GPU SeparationCUDAGPU inference
0 likes · 14 min read
Optimizing Python GPU Inference Services with CPU/GPU Process Separation and TensorRT
Alimama Tech
Alimama Tech
Nov 2, 2022 · Artificial Intelligence

Optimizing GPU Utilization for Multimedia AI Services with high_service

The article presents high_service, a high‑performance inference framework that boosts GPU utilization in multimedia AI services by separating CPU‑heavy preprocessing from GPU inference, employing priority‑based auto‑scaling, multi‑tenant sharing, and TensorRT‑accelerated models to eliminate GIL bottlenecks, reduce waste, and adapt to fluctuating traffic, with future work targeting automated bottleneck detection and further CPU‑GPU offloading.

GPU utilizationHigh Performance ComputingTensorRT
0 likes · 19 min read
Optimizing GPU Utilization for Multimedia AI Services with high_service
Alimama Tech
Alimama Tech
Oct 26, 2022 · Artificial Intelligence

GPU Utilization Analysis and Optimization for Alibaba's Intelligent Creative Video Service

The paper analyzes why Alibaba Mama’s intelligent creative video service suffers low GPU utilization—due to Python GIL blocking, lack of kernel fusion, and serialized CUDA streams—and details service‑level changes (separate CPU/GPU processes, shared‑memory queues, priority scheduling) and operator‑level kernel‑fusion techniques (channels‑last layouts, custom pooling, TensorRT conversion) that raise utilization from ~30 % to near 100 % and boost throughput by 75 %.

GPU optimizationPythonTensorRT
0 likes · 20 min read
GPU Utilization Analysis and Optimization for Alibaba's Intelligent Creative Video Service
Yiche Technology
Yiche Technology
Jan 27, 2022 · Backend Development

C++ Multithreaded Service Architecture for High‑Throughput AI Inference

The article explains how to design a C++‑based multithreaded service that uses Pthreads, channels, and TensorRT to parallelize deep‑learning inference tasks, thereby reducing latency and dramatically increasing throughput for AI applications such as facial‑recognition access control systems.

AI inferenceC++Performance
0 likes · 11 min read
C++ Multithreaded Service Architecture for High‑Throughput AI Inference
58 Tech
58 Tech
Dec 21, 2021 · Artificial Intelligence

dl_inference: Open‑Source Deep Learning Inference Service with TensorRT and MKL Acceleration

dl_inference is an open‑source, production‑grade deep learning inference platform that supports TensorFlow, PyTorch and Caffe models, offering GPU and CPU deployment, TensorRT and MKL acceleration, multi‑node load balancing, and extensive Q&A on model conversion, hardware requirements, INT8 quantization, and performance gains.

CPUGPUInference
0 likes · 8 min read
dl_inference: Open‑Source Deep Learning Inference Service with TensorRT and MKL Acceleration
58 Tech
58 Tech
Dec 8, 2021 · Artificial Intelligence

dl_inference: A General Deep Learning Inference Service with TensorRT and Intel MKL Acceleration

The article introduces dl_inference, an open‑source deep learning inference platform that integrates TensorRT GPU acceleration, Intel MKL CPU optimization, and Caffe support, detailing its features, model conversion workflow, deployment steps, performance gains, and how developers can contribute.

DockerInferenceIntel MKL
0 likes · 12 min read
dl_inference: A General Deep Learning Inference Service with TensorRT and Intel MKL Acceleration
iQIYI Technical Product Team
iQIYI Technical Product Team
Nov 5, 2021 · Artificial Intelligence

Accelerating 4K Video Super‑Resolution with TensorRT: iQIYI’s Optimization and Production Practices

iQIYI optimized a 4K video super-resolution model using TensorRT, employing split of graph, operator fusion, custom CUDA kernels, and int8 quantization, achieving tenfold speedup (≈180 ms per 1080p frame) and demonstrating deep customization potential for large‑scale production.

INT8 quantizationTensorRTdeep learning
0 likes · 17 min read
Accelerating 4K Video Super‑Resolution with TensorRT: iQIYI’s Optimization and Production Practices
360 Smart Cloud
360 Smart Cloud
Mar 4, 2021 · Artificial Intelligence

Optimizing BERT Online Service Deployment at 360 Search

This article describes the challenges of deploying a large BERT model as an online service for 360 Search and details engineering optimizations—including framework selection, model quantization, knowledge distillation, stream scheduling, caching, and dynamic sequence handling—that dramatically improve latency, throughput, and resource utilization.

BERTFP16 quantizationGPU optimization
0 likes · 12 min read
Optimizing BERT Online Service Deployment at 360 Search
360 Tech Engineering
360 Tech Engineering
Mar 1, 2021 · Artificial Intelligence

Deploying BERT as an Online Service: Challenges and Optimizations at 360 Search

This article details the engineering challenges of serving a large BERT model in real‑time for 360 Search and describes a series of optimizations—including TensorRT‑based kernel fusion, model quantization, knowledge distillation, multi‑stream execution, caching, and dynamic sequence handling—that together achieve low latency, high throughput, and stable deployment on GPU clusters.

BERTGPUOnline Inference
0 likes · 10 min read
Deploying BERT as an Online Service: Challenges and Optimizations at 360 Search
DataFunTalk
DataFunTalk
Jan 10, 2021 · Artificial Intelligence

Didi's Machine Translation System: Architecture, Techniques, and WMT2020 Competition Experience

This article presents a comprehensive overview of Didi's machine translation platform, covering its evolution from statistical to neural models, the Transformer architecture with relative position and larger FFN, data preparation, training strategies such as back‑translation and knowledge distillation, deployment optimizations with TensorRT, and the team's successful participation in the WMT2020 news translation task.

BLEUTensorRTTransformer
0 likes · 14 min read
Didi's Machine Translation System: Architecture, Techniques, and WMT2020 Competition Experience
58 Tech
58 Tech
Nov 20, 2020 · Artificial Intelligence

Evolution and Practice of the 58.com AI Algorithm Platform (WPAI)

The article details the development, architecture, and optimization of 58.com’s AI algorithm platform (WPAI), covering its background, overall design, large‑scale distributed machine learning, deep‑learning platform features, inference performance enhancements, GPU resource scheduling improvements, and future directions.

AI PlatformGPU schedulingKubernetes
0 likes · 15 min read
Evolution and Practice of the 58.com AI Algorithm Platform (WPAI)
Didi Tech
Didi Tech
Oct 27, 2020 · Artificial Intelligence

Didi's Machine Translation System: Architecture, Techniques, and WMT2020 Competition Experience

Didi's machine translation system combines a Transformer‑big architecture with relative position representations, enlarged feed‑forward networks, iterative back‑translation, knowledge‑distillation and domain fine‑tuning, optimized via TensorRT for speed, achieving a BLEU 36.6 and third place in the WMT2020 Chinese‑to‑English news task.

BLEUTensorRTTransformer
0 likes · 15 min read
Didi's Machine Translation System: Architecture, Techniques, and WMT2020 Competition Experience
Zhengtong Technical Team
Zhengtong Technical Team
Aug 14, 2020 · Artificial Intelligence

ZTFace: A High‑Precision, Fast Face Recognition Algorithm

This article presents ZTFace, an end‑to‑end face recognition solution that integrates face detection, alignment, feature embedding, verification, anti‑spoofing and attribute recognition using deep learning, details its backbone networks, loss functions, training datasets, experimental results on WIDER FACE and LFW, and demonstrates acceleration with TensorRT.

Computer VisionTensorRTZTFace
0 likes · 17 min read
ZTFace: A High‑Precision, Fast Face Recognition Algorithm