Tagged articles
124 articles
Page 2 of 2
Bilibili Tech
Bilibili Tech
Jun 13, 2023 · Artificial Intelligence

InferX Inference Framework and Its Integration with Triton for High‑Performance AI Model Serving

Bilibili’s self‑developed InferX framework, combined with NVIDIA Triton Inference Server, streamlines AI model serving by adding quantization, structured sparsity, and custom kernels, delivering up to eight‑fold throughput gains, cutting GPU usage by half, and enabling faster, cost‑effective OCR and large‑model deployments.

AI inferenceGPU utilizationInferX
0 likes · 10 min read
InferX Inference Framework and Its Integration with Triton for High‑Performance AI Model Serving
Top Architect
Top Architect
Apr 21, 2023 · Artificial Intelligence

Fine‑Tuning LLaMA‑7B with Alpaca‑LoRA to Build a Chinese ChatGPT

This article explains why and how to fine‑tune the LLaMA‑7B model using the cheap Alpaca‑LoRA approach, covering hardware requirements, dataset preparation, LoRA training, optional model merging and quantization, and provides ready‑to‑run code snippets for single‑ and multi‑GPU setups.

Alpaca-LoRAGPULLM
0 likes · 10 min read
Fine‑Tuning LLaMA‑7B with Alpaca‑LoRA to Build a Chinese ChatGPT
21CTO
21CTO
Apr 11, 2023 · Artificial Intelligence

Build a ChatGPT‑Scale Open‑Source Model with ColossalAI’s End‑to‑End RLHF Pipeline

This article introduces ColossalChat, an open‑source ChatGPT‑like model built on LLaMA and the Colossal‑AI framework, detailing its full RLHF workflow, bilingual dataset, low‑cost training tricks, quantized inference, and step‑by‑step code to help developers quickly replicate large‑language‑model capabilities.

ChatGPTColossalAIRLHF
0 likes · 10 min read
Build a ChatGPT‑Scale Open‑Source Model with ColossalAI’s End‑to‑End RLHF Pipeline
Alibaba Cloud Big Data AI Platform
Alibaba Cloud Big Data AI Platform
Dec 9, 2022 · Artificial Intelligence

What’s New in BladeDISC 0.3.0? Boosting PyTorch 2.0, GPU/CPU Optimizations, and Quantization

BladeDISC 0.3.0 introduces full PyTorch 2.0 compilation support, new TorchDynamo optimizations, extensive GPU memory‑intensive compute enhancements, Shape Constraint IR, experimental quantization across multiple hardware platforms, and a suite of compiler‑level improvements for training and inference acceleration.

BladeDISCCompilerGPU optimization
0 likes · 11 min read
What’s New in BladeDISC 0.3.0? Boosting PyTorch 2.0, GPU/CPU Optimizations, and Quantization
Meituan Technology Team
Meituan Technology Team
Sep 22, 2022 · Artificial Intelligence

Quantization Deployment Scheme for YOLOv6: Methods, Optimizations, and Performance Evaluation

The paper proposes a full quantization pipeline for YOLOv6 that combines a re‑parameterization optimizer, partial PTQ, channel‑wise distillation, graph‑scale merging, and GPU‑offloaded preprocessing, enabling an INT8 model to retain ~42 % mAP while delivering over 200 % throughput increase and 40 % QPS gain versus FP16.

Channel DistillationModel DeploymentPTQ
0 likes · 16 min read
Quantization Deployment Scheme for YOLOv6: Methods, Optimizations, and Performance Evaluation
Kuaishou Large Model
Kuaishou Large Model
Jul 29, 2022 · Fundamentals

How Automatic Quantization Slashes Memory Use in High‑Resolution Physical Simulations

This article explains how researchers applied quantization techniques to high‑resolution physical simulations, enabling over 50% memory reduction without noticeable visual loss, by modeling error propagation, using constrained optimization, and introducing dithering, with results demonstrated on GPU‑based smoke, fluid, and elastic body simulations.

GPU memory optimizationPhysical SimulationSIGGRAPH
0 likes · 6 min read
How Automatic Quantization Slashes Memory Use in High‑Resolution Physical Simulations
DataFunSummit
DataFunSummit
Jun 14, 2022 · Artificial Intelligence

Practical Acceleration of Deep Model Inference: Case Studies and Optimization Techniques

This talk presents practical methods for accelerating deep model inference, detailing two case studies—text QA and speech QA—along with their technical challenges, and outlines optimization strategies such as model compression, multi‑operator fusion, matrix multiplication tuning, quantization, and dynamic batching.

Dynamic BatchingInference AccelerationModel Compression
0 likes · 12 min read
Practical Acceleration of Deep Model Inference: Case Studies and Optimization Techniques
Code DAO
Code DAO
May 21, 2022 · Artificial Intelligence

How Quantization and Fusion Accelerate CNN Inference on Edge Devices

The article explains CNN inference optimization by applying PyTorch quantization and module‑fusion techniques, compares model size and latency before and after quantization, shows code for building, quantizing, and fusing a simple CNN, and presents benchmark results on CPU, highlighting a four‑fold size reduction and up to 1.7× speed‑up.

CNNModel CompressionPyTorch
0 likes · 11 min read
How Quantization and Fusion Accelerate CNN Inference on Edge Devices
DataFunTalk
DataFunTalk
Apr 22, 2022 · Artificial Intelligence

Inference Optimization Techniques and GPU Parallel Acceleration for Tencent Intelligent Dialogue Models

This article presents a comprehensive overview of inference optimization methods—including model pruning, quantization, knowledge distillation, caching, instruction‑set acceleration, and operator fusion—and details a GPU‑centric parallel acceleration methodology with CUDA basics, performance‑analysis tools, theoretical limits, and practical case studies, all illustrated with real‑world examples from Tencent's intelligent dialogue products.

CachingGPU AccelerationKnowledge Distillation
0 likes · 18 min read
Inference Optimization Techniques and GPU Parallel Acceleration for Tencent Intelligent Dialogue Models
DataFunSummit
DataFunSummit
Jan 29, 2022 · Artificial Intelligence

Survey of Model Pruning and Quantization Techniques for Deep Learning

This article provides a comprehensive overview of recent advances in deep learning model compression, focusing on pruning methods—including unstructured, structured, filter-wise, channel-wise, shape-wise, and stripe-wise approaches—and quantization techniques such as linear, non‑linear, clustering, power‑of‑two, binary, and 8‑bit quantization, while discussing evaluation criteria, sparsity ratios, fine‑tuning, and training‑aware quantization.

Model Compressiondeep learningneural networks
0 likes · 23 min read
Survey of Model Pruning and Quantization Techniques for Deep Learning
Laiye Technology Team
Laiye Technology Team
Jan 28, 2022 · Artificial Intelligence

Survey of Model Compression and Quantization Techniques for Deep Neural Networks

This article provides a comprehensive overview of deep learning model compression and acceleration methods, detailing pruning strategies, various pruning types, evaluation criteria, sparsity ratios, fine‑tuning procedures, as well as linear and non‑linear quantization approaches, their implementations, and practical considerations.

EfficiencyModel Compressiondeep learning
0 likes · 26 min read
Survey of Model Compression and Quantization Techniques for Deep Neural Networks
DataFunSummit
DataFunSummit
Jun 5, 2021 · Artificial Intelligence

Compression Techniques for BERT: Analysis, Quantization, Pruning, Distillation, and Structure‑Preserving Methods

This article reviews BERT’s architecture, analyzes the storage and compute costs of each layer, and systematically presents compression methods—including quantization, pruning, knowledge distillation (Distilled BiLSTM and MobileBERT), and structure‑preserving techniques—aimed at enabling efficient deployment on resource‑constrained mobile devices.

BERTKnowledge DistillationMobile Deployment
0 likes · 15 min read
Compression Techniques for BERT: Analysis, Quantization, Pruning, Distillation, and Structure‑Preserving Methods
DataFunTalk
DataFunTalk
Jun 3, 2021 · Artificial Intelligence

Compression Techniques for BERT: Analysis, Quantization, Pruning, Distillation, and Structure-Preserving Methods

This article examines the internal structure of BERT and systematically presents various model‑compression strategies—including quantization, pruning, knowledge distillation, and structure‑preserving techniques—highlighting their impact on storage, computational cost, and inference speed for deployment on resource‑constrained mobile devices.

BERTKnowledge DistillationModel Compression
0 likes · 16 min read
Compression Techniques for BERT: Analysis, Quantization, Pruning, Distillation, and Structure-Preserving Methods
Kuaishou Tech
Kuaishou Tech
Mar 18, 2021 · Artificial Intelligence

Hammer: An Integrated Hardware-Aware Model Compression Framework

Hammer is an integrated hardware-aware model compression tool developed by Kuaishou in collaboration with universities, combining pruning, quantization, search, and distillation to achieve efficient and accurate neural network models tailored to specific hardware.

AI FrameworkKuaishouNAS
0 likes · 9 min read
Hammer: An Integrated Hardware-Aware Model Compression Framework
Sohu Tech Products
Sohu Tech Products
Jan 6, 2021 · Artificial Intelligence

Overview of Main Model Compression and Acceleration Techniques: Structural Optimization, Pruning, Quantization, and Knowledge Distillation

This article reviews four mainstream model compression and acceleration methods—structural optimization, pruning, quantization, and knowledge distillation—explaining their principles, implementations, and performance, and presents practical examples such as DistillBERT, TinyBERT, and FastBERT with comparative results.

AIKnowledge DistillationModel Compression
0 likes · 14 min read
Overview of Main Model Compression and Acceleration Techniques: Structural Optimization, Pruning, Quantization, and Knowledge Distillation
Didi Tech
Didi Tech
Oct 21, 2020 · Artificial Intelligence

Deep Model Compression Techniques for Intelligent Automotive Cockpits

The article reviews deep‑model compression methods—ADMM‑based structured pruning, low‑bit quantization, and teacher‑student knowledge distillation—and their automated AutoCompress workflow, demonstrating how these techniques shrink neural networks enough to run real‑time driver‑monitoring and other intelligent cockpit functions on resource‑limited automotive hardware while preserving accuracy.

ADMMEdge AIKnowledge Distillation
0 likes · 16 min read
Deep Model Compression Techniques for Intelligent Automotive Cockpits
AntTech
AntTech
Jun 9, 2020 · Artificial Intelligence

Deep Learning Model Compression and Acceleration Techniques for Mobile AI

This article reviews the motivations, challenges, and a comprehensive set of algorithmic, framework, and hardware methods—including structural optimization, quantization, pruning, and knowledge distillation—to compress and accelerate deep learning models for deployment on mobile devices, highlighting benefits such as reduced server load, lower latency, improved reliability, and enhanced privacy.

Knowledge DistillationModel Compressionmobile AI
0 likes · 17 min read
Deep Learning Model Compression and Acceleration Techniques for Mobile AI
Tencent Tech
Tencent Tech
Feb 27, 2020 · Artificial Intelligence

How to Speed Up Deep Learning Models: Cutting-Edge Acceleration Techniques

Deep learning models often suffer from slow training and deployment due to their size, but a range of advanced acceleration methods—including model architecture optimization, pruning, quantization, knowledge distillation, and distributed training techniques—can dramatically improve speed and efficiency while maintaining performance.

Knowledge Distillationdeep learningdistributed training
0 likes · 14 min read
How to Speed Up Deep Learning Models: Cutting-Edge Acceleration Techniques
Alibaba Cloud Developer
Alibaba Cloud Developer
May 21, 2019 · Artificial Intelligence

How Alibaba’s Offline AI Advances Model Compression and Edge Inference

Alibaba’s Machine Intelligence Lab shares two years of breakthroughs in offline AI, detailing low‑bit quantization, unified sparsity frameworks, hardware‑software co‑design, lightweight networks, and on‑device detection, alongside standardized training tools, multi‑platform inference engines, and productized edge solutions such as smart boxes and integrated cameras.

AIModel Compressionedge inference
0 likes · 16 min read
How Alibaba’s Offline AI Advances Model Compression and Edge Inference
Hulu Beijing
Hulu Beijing
Apr 30, 2019 · Artificial Intelligence

How Can Deep Neural Networks Be Accelerated and Compressed? Key Techniques Explained

This article reviews why deep neural networks are over‑parameterized, outlines the challenges of deploying them on mobile and embedded devices, and presents six major strategies—pruning, low‑rank approximation, filter selection, quantization, knowledge distillation, and novel architecture design—to accelerate and compress models while preserving performance.

Knowledge Distillationdeep learningmodel acceleration
0 likes · 11 min read
How Can Deep Neural Networks Be Accelerated and Compressed? Key Techniques Explained
Tencent Architect
Tencent Architect
Nov 13, 2017 · Artificial Intelligence

Survey of Bandwidth Optimization Techniques in AI Accelerators

This article reviews various architectural strategies—including streaming processing, on‑chip memory optimization, bit‑width compression, sparsity techniques, on‑chip models with chip‑level interconnects, and emerging technologies such as binary networks, memristors, and HBM—to alleviate bandwidth bottlenecks in FPGA/ASIC/TPU AI accelerators.

AIASICAccelerators
0 likes · 20 min read
Survey of Bandwidth Optimization Techniques in AI Accelerators