Tagged articles

124 articles

Jun 13, 2023 · Artificial Intelligence

InferX Inference Framework and Its Integration with Triton for High‑Performance AI Model Serving

Bilibili’s self‑developed InferX framework, combined with NVIDIA Triton Inference Server, streamlines AI model serving by adding quantization, structured sparsity, and custom kernels, delivering up to eight‑fold throughput gains, cutting GPU usage by half, and enabling faster, cost‑effective OCR and large‑model deployments.

AI inference · GPU utilization · InferX

0 likes · 10 min read

Top Architect

Apr 21, 2023 · Artificial Intelligence

Fine‑Tuning LLaMA‑7B with Alpaca‑LoRA to Build a Chinese ChatGPT

This article explains why and how to fine‑tune the LLaMA‑7B model with the low‑cost Alpaca‑LoRA approach. It covers hardware requirements, dataset preparation, LoRA training, and optional model merging and quantization, and provides ready‑to‑run code snippets for single‑ and multi‑GPU setups.

Alpaca-LoRA · GPU · LLM

0 likes · 10 min read

21CTO

Apr 11, 2023 · Artificial Intelligence

Build a ChatGPT‑Scale Open‑Source Model with ColossalAI’s End‑to‑End RLHF Pipeline

This article introduces ColossalChat, an open‑source ChatGPT‑like model built on LLaMA and the Colossal‑AI framework, detailing its full RLHF workflow, bilingual dataset, low‑cost training tricks, quantized inference, and step‑by‑step code to help developers quickly replicate large‑language‑model capabilities.

ChatGPT · ColossalAI · RLHF

0 likes · 10 min read

Alibaba Cloud Big Data AI Platform

Dec 9, 2022 · Artificial Intelligence

What’s New in BladeDISC 0.3.0? Boosting PyTorch 2.0, GPU/CPU Optimizations, and Quantization

BladeDISC 0.3.0 introduces full PyTorch 2.0 compilation support, new TorchDynamo optimizations, extensive optimizations for memory‑intensive GPU computation, Shape Constraint IR, experimental quantization across multiple hardware platforms, and a suite of compiler‑level improvements for training and inference acceleration.

BladeDISC · Compiler · GPU optimization

0 likes · 11 min read

Meituan Technology Team

Sep 22, 2022 · Artificial Intelligence

Quantization Deployment Scheme for YOLOv6: Methods, Optimizations, and Performance Evaluation

The paper proposes a full quantization pipeline for YOLOv6 that combines a re‑parameterization optimizer, partial PTQ, channel‑wise distillation, graph‑scale merging, and GPU‑offloaded preprocessing, enabling an INT8 model to retain ~42% mAP while delivering a throughput increase of over 200% and a 40% QPS gain versus FP16.

Channel Distillation · Model Deployment · PTQ

0 likes · 16 min read

Meituan Technology Team

Sep 15, 2022 · Artificial Intelligence

YOLOv6 2.0: Enhanced Object Detection Models and Quantization Solutions

The new YOLOv6 2.0 release upgrades lightweight and medium‑large models with a CSPStackRep backbone, self‑distillation, and a custom quantization pipeline, delivering up to 869 FPS for the quantized YOLOv6‑S and achieving 49.5%/52.5% AP on COCO while halving training time.

COCO benchmark · CSPStackRep · TensorRT

0 likes · 6 min read

Kuaishou Large Model

Jul 29, 2022 · Fundamentals

How Automatic Quantization Slashes Memory Use in High‑Resolution Physical Simulations

This article explains how researchers applied quantization techniques to high‑resolution physical simulations, enabling over 50% memory reduction without noticeable visual loss, by modeling error propagation, using constrained optimization, and introducing dithering, with results demonstrated on GPU‑based smoke, fluid, and elastic body simulations.

GPU memory optimization · Physical Simulation · SIGGRAPH

0 likes · 6 min read

DataFunSummit

Jun 14, 2022 · Artificial Intelligence

Practical Acceleration of Deep Model Inference: Case Studies and Optimization Techniques

This talk presents practical methods for accelerating deep model inference, detailing two case studies—text QA and speech QA—along with their technical challenges, and outlines optimization strategies such as model compression, multi‑operator fusion, matrix multiplication tuning, quantization, and dynamic batching.

Dynamic Batching · Inference Acceleration · Model Compression

0 likes · 12 min read

Code DAO

May 21, 2022 · Artificial Intelligence

How Quantization and Fusion Accelerate CNN Inference on Edge Devices

The article explains CNN inference optimization by applying PyTorch quantization and module‑fusion techniques, compares model size and latency before and after quantization, shows code for building, quantizing, and fusing a simple CNN, and presents benchmark results on CPU, highlighting a four‑fold size reduction and up to 1.7× speed‑up.

CNN · Model Compression · PyTorch

0 likes · 11 min read

DataFunTalk

Apr 22, 2022 · Artificial Intelligence

Inference Optimization Techniques and GPU Parallel Acceleration for Tencent Intelligent Dialogue Models

This article presents a comprehensive overview of inference optimization methods—including model pruning, quantization, knowledge distillation, caching, instruction‑set acceleration, and operator fusion—and details a GPU‑centric parallel acceleration methodology with CUDA basics, performance‑analysis tools, theoretical limits, and practical case studies, all illustrated with real‑world examples from Tencent's intelligent dialogue products.

Caching · GPU Acceleration · Knowledge Distillation

0 likes · 18 min read

DataFunSummit

Jan 29, 2022 · Artificial Intelligence

Survey of Model Pruning and Quantization Techniques for Deep Learning

This article provides a comprehensive overview of recent advances in deep learning model compression, focusing on pruning methods—including unstructured, structured, filter-wise, channel-wise, shape-wise, and stripe-wise approaches—and quantization techniques such as linear, non‑linear, clustering, power‑of‑two, binary, and 8‑bit quantization, while discussing evaluation criteria, sparsity ratios, fine‑tuning, and training‑aware quantization.

Model Compression · deep learning · neural networks

0 likes · 23 min read

Laiye Technology Team

Jan 28, 2022 · Artificial Intelligence

Survey of Model Compression and Quantization Techniques for Deep Neural Networks

This article provides a comprehensive overview of deep learning model compression and acceleration methods, detailing pruning strategies, various pruning types, evaluation criteria, sparsity ratios, fine‑tuning procedures, as well as linear and non‑linear quantization approaches, their implementations, and practical considerations.

Efficiency · Model Compression · deep learning

0 likes · 26 min read

Kuaishou Large Model

Jul 30, 2021 · Fundamentals

How QuanTaichi Cuts GPU Memory Needs for High‑Fidelity Physics Simulations

QuanTaichi introduces a new language abstraction and compiler system that quantizes simulation data, dramatically reducing memory and bandwidth usage so that high‑precision physical effects—once requiring multiple GPUs—can now run on a single GPU, even on mobile devices.

Compiler · GPU optimization · Graphics

0 likes · 12 min read

Kuaishou Tech

Jul 14, 2021 · Fundamentals

QuanTaichi: A Physical Compiler for Automatic Quantization of High‑Precision Simulations

QuanTaichi, built on the Taichi language, introduces custom numeric types, bit‑struct adapters, and compiler optimizations that dramatically reduce memory and bandwidth for particle‑based physics simulations, enabling high‑precision GPU rendering on a single card and even on mobile devices.

GPU simulation · Graphics · Physical Simulation

0 likes · 13 min read

DataFunSummit

Jun 5, 2021 · Artificial Intelligence

Compression Techniques for BERT: Analysis, Quantization, Pruning, Distillation, and Structure‑Preserving Methods

This article reviews BERT’s architecture, analyzes the storage and compute costs of each layer, and systematically presents compression methods—including quantization, pruning, knowledge distillation (Distilled BiLSTM and MobileBERT), and structure‑preserving techniques—aimed at enabling efficient deployment on resource‑constrained mobile devices.

BERT · Knowledge Distillation · Mobile Deployment

0 likes · 15 min read

DataFunTalk

Jun 3, 2021 · Artificial Intelligence

Compression Techniques for BERT: Analysis, Quantization, Pruning, Distillation, and Structure-Preserving Methods

This article examines the internal structure of BERT and systematically presents various model‑compression strategies—including quantization, pruning, knowledge distillation, and structure‑preserving techniques—highlighting their impact on storage, computational cost, and inference speed for deployment on resource‑constrained mobile devices.

BERT · Knowledge Distillation · Model Compression

0 likes · 16 min read

Kuaishou Tech

Mar 18, 2021 · Artificial Intelligence

Hammer: An Integrated Hardware-Aware Model Compression Framework

Hammer is an integrated hardware-aware model compression tool developed by Kuaishou in collaboration with universities, combining pruning, quantization, search, and distillation to achieve efficient and accurate neural network models tailored to specific hardware.

AI Framework · Kuaishou · NAS

0 likes · 9 min read

Sohu Tech Products

Jan 6, 2021 · Artificial Intelligence

Overview of Main Model Compression and Acceleration Techniques: Structural Optimization, Pruning, Quantization, and Knowledge Distillation

This article reviews four mainstream model compression and acceleration methods—structural optimization, pruning, quantization, and knowledge distillation—explaining their principles, implementations, and performance, and presents practical examples such as DistilBERT, TinyBERT, and FastBERT with comparative results.

AI · Knowledge Distillation · Model Compression

0 likes · 14 min read

Didi Tech

Oct 21, 2020 · Artificial Intelligence

Deep Model Compression Techniques for Intelligent Automotive Cockpits

The article reviews deep‑model compression methods—ADMM‑based structured pruning, low‑bit quantization, and teacher‑student knowledge distillation—and their automated AutoCompress workflow, demonstrating how these techniques shrink neural networks enough to run real‑time driver‑monitoring and other intelligent cockpit functions on resource‑limited automotive hardware while preserving accuracy.

ADMM · Edge AI · Knowledge Distillation

0 likes · 16 min read

AntTech

Jun 9, 2020 · Artificial Intelligence

Deep Learning Model Compression and Acceleration Techniques for Mobile AI

This article reviews the motivations, challenges, and a comprehensive set of algorithmic, framework, and hardware methods—including structural optimization, quantization, pruning, and knowledge distillation—to compress and accelerate deep learning models for deployment on mobile devices, highlighting benefits such as reduced server load, lower latency, improved reliability, and enhanced privacy.

Knowledge Distillation · Model Compression · mobile AI

0 likes · 17 min read

Tencent Tech

Feb 27, 2020 · Artificial Intelligence

How to Speed Up Deep Learning Models: Cutting-Edge Acceleration Techniques

Deep learning models often suffer from slow training and deployment due to their size, but a range of advanced acceleration methods—including model architecture optimization, pruning, quantization, knowledge distillation, and distributed training techniques—can dramatically improve speed and efficiency while maintaining performance.

Knowledge Distillation · deep learning · distributed training

0 likes · 14 min read

Alibaba Cloud Developer

May 21, 2019 · Artificial Intelligence

How Alibaba’s Offline AI Advances Model Compression and Edge Inference

Alibaba’s Machine Intelligence Lab shares two years of breakthroughs in offline AI, detailing low‑bit quantization, unified sparsity frameworks, hardware‑software co‑design, lightweight networks, and on‑device detection, alongside standardized training tools, multi‑platform inference engines, and productized edge solutions such as smart boxes and integrated cameras.

AI · Model Compression · edge inference

0 likes · 16 min read

Hulu Beijing

Apr 30, 2019 · Artificial Intelligence

How Can Deep Neural Networks Be Accelerated and Compressed? Key Techniques Explained

This article reviews why deep neural networks are over‑parameterized, outlines the challenges of deploying them on mobile and embedded devices, and presents six major strategies—pruning, low‑rank approximation, filter selection, quantization, knowledge distillation, and novel architecture design—to accelerate and compress models while preserving performance.

Knowledge Distillation · deep learning · model acceleration

0 likes · 11 min read

Tencent Architect

Nov 13, 2017 · Artificial Intelligence

Survey of Bandwidth Optimization Techniques in AI Accelerators

This article reviews various architectural strategies—including streaming processing, on‑chip memory optimization, bit‑width compression, sparsity techniques, on‑chip models with chip‑level interconnects, and emerging technologies such as binary networks, memristors, and HBM—to alleviate bandwidth bottlenecks in FPGA/ASIC/TPU AI accelerators.

AI · ASIC · Accelerators

0 likes · 20 min read
