Tag

GPU Acceleration

19 articles collected around this technical thread.

DataFunSummit
Jun 12, 2025 · Artificial Intelligence

How Alibaba Cloud’s AI Search Evolves with Agentic RAG and Multi‑Model Innovations

This article details Alibaba Cloud AI Search’s development journey, covering its dual product lines, the evolution of Agentic RAG technology, multi‑agent architectures, vector retrieval breakthroughs, GPU‑accelerated indexing, NL2SQL capabilities, deployment models, and future directions for AI‑driven search solutions.

AI Search · GPU Acceleration · Large Models
0 likes · 33 min read
DeWu Technology
Feb 17, 2025 · Artificial Intelligence

Optimizing Large Model Inference: High‑Performance Frameworks and Techniques

The article reviews high‑performance inference strategies for large language models such as DeepSeek‑R1, detailing CPU‑GPU process separation, Paged and Radix Attention, Chunked Prefill, output‑length reduction, tensor‑parallel multi‑GPU scaling, and speculative decoding, each shown to markedly boost throughput and cut latency in real deployments.

AI · Distributed Inference · GPU Acceleration
0 likes · 22 min read
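One of the techniques the article above covers, speculative decoding, can be illustrated with a toy sketch: a cheap draft model proposes a short run of tokens and the target model verifies them, keeping the longest accepted prefix. Both "models" below are deterministic stand-ins invented for illustration, not anything from the article; in a real serving stack the verification step is a single batched forward pass on the GPU.

```python
def draft_propose(prefix, k):
    # Hypothetical draft model: guesses each next token as previous + 1 (mod 10).
    out, last = [], (prefix[-1] if prefix else -1)
    for _ in range(k):
        last = (last + 1) % 10
        out.append(last)
    return out

def target_next(prefix):
    # Hypothetical target model: agrees with the +1 pattern except every
    # 5th position, where it emits 0 (so drafts are sometimes rejected).
    if len(prefix) % 5 == 4:
        return 0
    return ((prefix[-1] if prefix else -1) + 1) % 10

def greedy_decode(n_tokens):
    # Baseline: one target-model call per token.
    seq = []
    while len(seq) < n_tokens:
        seq.append(target_next(seq))
    return seq

def speculative_decode(n_tokens, k=4):
    seq = []
    while len(seq) < n_tokens:
        proposal = draft_propose(seq, k)
        # Verify draft tokens against the target; accept while they match.
        for tok in proposal:
            if tok == target_next(seq) and len(seq) < n_tokens:
                seq.append(tok)          # accepted draft token
            else:
                break
        else:
            continue                     # all k accepted; draft again
        if len(seq) < n_tokens:
            seq.append(target_next(seq)) # rejection: take the target's token
    return seq
```

Because accepted tokens always match what the target would have produced, the output is identical to plain greedy decoding, but the target is consulted far fewer times when the draft's acceptance rate is high.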
Architect
Feb 2, 2025 · Artificial Intelligence

Deploying DeepSeek‑R1 Locally with Ollama and Accessing It via Spring Boot and Spring AI

This guide explains how to install Ollama, download and run the open‑source DeepSeek‑R1 language model locally, configure GPU acceleration, and integrate the model into a Spring Boot application using Spring AI to provide an API service for AI inference.

AI model deployment · DeepSeek-R1 · GPU Acceleration
0 likes · 12 min read
DevOps
Jan 6, 2025 · Artificial Intelligence

Ten Popular Large Language Model Deployment Engines and Tools: Features, Advantages, and Limitations

This article reviews ten mainstream LLM deployment solutions—including WebLLM, LM Studio, Ollama, vLLM, LightLLM, OpenLLM, HuggingFace TGI, GPT4ALL, llama.cpp, and Triton Inference Server—detailing their technical characteristics, strengths, drawbacks, and example deployment workflows for both personal and enterprise environments.

AI inference · GPU Acceleration · LLM
0 likes · 16 min read
360 Zhihui Cloud Developer
Dec 17, 2024 · Artificial Intelligence

Translate Foreign Videos into Chinese with Whisper, Ollama & FFmpeg

This guide shows how to automatically extract subtitles from English videos using OpenAI's Whisper, translate them into Chinese with a locally‑deployed Ollama large language model, and finally merge the bilingual subtitles back into the video using FFmpeg, all with GPU acceleration.

AI · FFmpeg · GPU Acceleration
0 likes · 11 min read
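The final step of the pipeline above, merging subtitles back into the video with FFmpeg, can be sketched as building the muxing command programmatically. The file names below are placeholders and the exact flags the article uses are not given; these follow standard FFmpeg usage for embedding an SRT track into an MP4 without re-encoding.

```python
def build_mux_command(video_in, srt_in, video_out):
    # Assemble an FFmpeg invocation that adds a subtitle stream as a
    # soft-sub track, copying audio/video as-is.
    return [
        "ffmpeg",
        "-i", video_in,      # source video
        "-i", srt_in,        # bilingual subtitle file produced earlier
        "-c", "copy",        # copy audio/video streams (no re-encode)
        "-c:s", "mov_text",  # MP4-compatible subtitle codec
        video_out,
    ]

cmd = build_mux_command("talk.mp4", "talk_bilingual.srt", "talk_zh.mp4")
```

For hard-burned subtitles one would instead re-encode with a `subtitles=` video filter, which is where FFmpeg's GPU decode/encode options (e.g. `-hwaccel cuda`) become relevant.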
AntTech
Nov 29, 2024 · Artificial Intelligence

AI Inference with Trusted Execution Environment: HyperGPU and DistMSM Accelerated Zero‑Knowledge Proofs Win 2024 Financial Cipher Cup Innovation Award

The award‑winning solution combines a GPU‑accelerated TEE framework (HyperGPU) and a multi‑GPU zkSNARK acceleration scheme (DistMSM) to provide fast, privacy‑preserving AI inference proofs, earning the third‑place Innovation Team prize at the 2024 Financial Cipher Cup competition.

AI · DistMSM · Financial Cipher
0 likes · 6 min read
DataFunSummit
Jun 17, 2024 · Artificial Intelligence

Strategies for Reducing Cost and Improving Efficiency in Recommendation Systems with Alibaba Cloud PAI‑Rec

This article discusses how Alibaba Cloud’s AI platform PAI‑Rec reduces recommendation system costs and boosts efficiency by optimizing training resources, leveraging FeatureStore, EasyRec and TorchEasyRec frameworks, detailing workflow stages, feature consistency, GPU acceleration, componentized model configuration, and practical deployment timelines.

AI Platform · Feature Store · GPU Acceleration
0 likes · 14 min read
DataFunTalk
Jun 3, 2024 · Artificial Intelligence

Deploying Speech AI Services Quickly with NVIDIA Riva

This article explains how to use NVIDIA Riva to rapidly deploy speech AI services, covering Riva's overview, Chinese ASR model updates, TTS capabilities, customization options, the Quickstart tool, and a Q&A session that clarifies deployment, model fine‑tuning, and integration with NeMo and Triton.

ASR · GPU Acceleration · NVIDIA Riva
0 likes · 13 min read
Didi Tech
Apr 16, 2024 · Artificial Intelligence

Optimizing DSP Deep Model Latency by Externalizing Feature Processing with EzFeaFly

By externalizing feature processing with the EzFeaFly tool and feeding a dense index/value tensor directly to the GPU, the DSP platform decouples feature transformation from model inference, cutting instance usage by ~40%, reducing inference latency by 70‑80%, and achieving an end‑to‑end latency improvement of over 60% while lowering costs.

DSP · GPU Acceleration · Performance Optimization
0 likes · 11 min read
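The "dense index/value tensor" idea above can be sketched in miniature: each sample's sparse features (id → value) are flattened into two fixed-width arrays that a GPU kernel can consume directly, with unused slots padded. EzFeaFly's actual layout is internal to the article; the function below is purely illustrative, with `width` and `pad_id` as assumed parameters.

```python
def to_dense(samples, width, pad_id=0):
    # samples: list of {feature_id: value} dicts, one per request.
    # Returns parallel [batch x width] index and value rows.
    indices, values = [], []
    for feats in samples:
        ids = sorted(feats)[:width]                 # truncate to fixed width
        row_idx = ids + [pad_id] * (width - len(ids))
        row_val = [feats[i] for i in ids] + [0.0] * (width - len(ids))
        indices.append(row_idx)
        values.append(row_val)
    return indices, values
```

Because the shape is fixed per batch, the result can be copied to the device in one transfer and indexed without per-sample branching, which is what lets feature processing move off the inference critical path.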
DataFunSummit
Apr 14, 2024 · Artificial Intelligence

TensorRT-LLM: NVIDIA’s Scalable LLM Inference Framework – Overview, Features, Workflow, Performance, and Future Directions

This article presents a comprehensive overview of NVIDIA’s TensorRT-LLM, detailing its product positioning as a scalable LLM inference solution, key features such as model support, low-precision and quantization techniques, parallelism strategies, the end-to-end usage workflow, performance highlights, future roadmap, and answers to common technical questions.

GPU Acceleration · LLM inference · Nvidia
0 likes · 13 min read
Bilibili Tech
Mar 5, 2024 · Game Development

Bilibili Color Space Conversion Engine for Video Processing

Bilibili's color space conversion engine processes user‑uploaded videos with varied color parameters into a unified format, using layered filters, precomputed optimizations, CPU and CUDA implementations, handling transformations, quantization, chroma subsampling, matrix conversion, transfer functions, gamut and tone mapping, HDR dynamic metadata, and achieving high performance for millions of users.

Color space · GPU Acceleration · HDR
0 likes · 19 min read
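One stage of such a conversion pipeline, weighting RGB by a standard's luma coefficients, is small enough to show directly. The BT.601 and BT.709 coefficients below are the standard published values; everything else (function names, the scalar per-pixel formulation) is an illustrative simplification of what a production engine would fuse into vectorized CPU or CUDA kernels.

```python
# Standard luma coefficients (Kr, Kg, Kb) for the two common SDR color spaces.
BT601_LUMA = (0.299, 0.587, 0.114)
BT709_LUMA = (0.2126, 0.7152, 0.0722)

def luma(rgb, coeffs):
    # Weighted sum of linear R, G, B components; coefficients sum to 1,
    # so white (1, 1, 1) maps to luma 1.0 under either standard.
    r, g, b = rgb
    kr, kg, kb = coeffs
    return kr * r + kg * g + kb * b
```

The difference between the two coefficient sets is exactly why content tagged BT.601 looks subtly wrong when decoded as BT.709, and why an engine like the one described must normalize every upload's color metadata before transcoding.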
DataFunTalk
Jan 31, 2024 · Artificial Intelligence

Introduction to NVIDIA TensorRT-LLM Inference Framework

TensorRT-LLM is NVIDIA's scalable inference framework for large language models that combines TensorRT compilation, fast kernels, multi‑GPU parallelism, low‑precision quantization, and a PyTorch‑like API to deliver high‑performance LLM serving with extensive customization and future‑focused enhancements.

Artificial Intelligence · GPU Acceleration · LLM inference
0 likes · 12 min read
JD Retail Technology
Jan 30, 2024 · Artificial Intelligence

Next-Generation Multi‑GPU Synchronous Training Architecture for Large‑Scale Sparse Recommendation Models

The article details JD Retail's evolution from TensorFlow‑based sparse training to a custom high‑performance parameter server and a fully GPU‑accelerated, multi‑node, multi‑card synchronous training framework that leverages GPU‑RDMA, two‑level CPU‑DRAM/GPU‑HBM caching, and pipeline parallelism to overcome storage, I/O, and compute challenges of trillion‑parameter recommendation systems.

AI infrastructure · GPU Acceleration · Recommendation systems
0 likes · 12 min read
JD Retail Technology
Jan 25, 2024 · Artificial Intelligence

Optimizing High‑Concurrency Online Inference for Recommendation Models with Distributed Heterogeneous Computing and GPU Acceleration

This article describes how JD Retail's advertising technology team tackled the high‑compute demands of modern recommendation models by designing a distributed graph‑partitioned heterogeneous computing framework, introducing TensorBatch request aggregation, leveraging deep‑learning compiler bucketing and asynchronous compilation, and implementing a multi‑stream GPU architecture to dramatically improve online inference throughput and latency.

GPU Acceleration · Online Inference · Recommendation systems
0 likes · 13 min read
DataFunTalk
Dec 23, 2023 · Artificial Intelligence

NVIDIA Merlin: Product Overview, Models, Distributed Embeddings, Hierarchical KV and Parameter Server

This article introduces NVIDIA's Merlin recommendation system suite, detailing its product overview, model and system libraries, TensorFlow Distributed Embedding plugin, hierarchical key‑value store, and hierarchical parameter server, while highlighting integration with NVTabular, Triton, and performance gains on GPU‑accelerated training and inference.

Distributed Embedding · GPU Acceleration · Hierarchical KV
0 likes · 13 min read
DataFunSummit
Nov 19, 2023 · Artificial Intelligence

Overview of NVIDIA Merlin for Recommendation Systems

This article introduces NVIDIA's Merlin suite, covering product overview, Merlin Models & Systems, the TensorFlow Distributed Embedding (TFDE) plugin, the Hierarchical‑KV library, and the Hierarchical Parameter Server (HPS), while highlighting their architecture, performance benefits, and ease of integration for large‑scale recommendation workloads.

Distributed Embedding · GPU Acceleration · Hierarchical KV
0 likes · 13 min read
DaTaobao Tech
May 24, 2023 · Mobile Development

Understanding and Optimizing Mobile Page Performance and Jank

Effective mobile page performance requires identifying three jank types—screen tearing, frame drops, and long unresponsiveness—monitoring metrics such as response time, animation latency, idle time, and SM, understanding the CPU‑GPU rendering pipeline, and applying optimizations like hardware acceleration, transform‑based animations, reduced layout thrashing, task slicing, and GPU‑friendly techniques.

FPS · GPU Acceleration · Jank
0 likes · 13 min read
Bilibili Tech
Apr 21, 2023 · Artificial Intelligence

Design and Optimization of Bilibili's Large-Scale Video Duplicate Detection System

Bilibili built a massive video‑duplicate detection platform that trains a self‑supervised ResNet‑50 feature extractor, removes black borders, and uses a two‑stage ANN‑plus‑segment‑level matching pipeline accelerated by custom GPU decoding and inference, boosting duplicate rejection 7.5×, recall 3.75×, and cutting manual misses from 65 to 5 per day.

Feature Extraction · GPU Acceleration · System Architecture
0 likes · 19 min read
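The first stage of a two-stage matching pipeline like the one described, candidate retrieval over feature embeddings, reduces to similarity search. The toy below does an exhaustive cosine-similarity scan over an in-memory index; a production system would use a GPU-backed ANN index instead, and the threshold value is an arbitrary illustration, not Bilibili's.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def candidates(query, index, threshold=0.9):
    # Rank stored video fingerprints by similarity to the query embedding
    # and keep those above the threshold, best match first.
    scored = [(vid, cosine(query, emb)) for vid, emb in index.items()]
    return [vid for vid, s in sorted(scored, key=lambda t: -t[1]) if s >= threshold]
```

Candidates that survive this stage would then go to the finer segment-level matcher, which is where most of the false positives from embedding-only comparison get filtered out.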
DataFunSummit
Apr 9, 2023 · Artificial Intelligence

PGLBox: An Industrial-Scale GPU‑Accelerated Graph Learning Framework

This article introduces the development trends of graph learning frameworks, explains GPU acceleration techniques such as UVA and multi‑GPU pipelines, details the design of the PaddlePaddle Graph Learning (PGL) framework and its large‑scale engine PGLBox, and demonstrates how these technologies enable industrial‑grade graph representation learning with billions of nodes and edges.

GPU Acceleration · Graph Neural Networks · Message Passing
0 likes · 18 min read