Tag

AI acceleration

20 articles collected under this tag.

Architects' Tech Alliance
Jun 9, 2025 · Artificial Intelligence

What Makes Nvidia’s Blackwell GPUs a Game-Changer for AI Performance?

In March 2024 Nvidia unveiled the Blackwell GPU family and the GB200 NVL72 architecture, featuring a 4 nm‑class process, redesigned CUDA cores, next‑generation ray tracing, upgraded DLSS, major FP16/FP8 compute gains, 8 TB/s of memory bandwidth, and NVLink Gen5, while also posing significant power, cooling, and packaging challenges for large‑scale AI deployments.

AI acceleration · Blackwell · GPU
0 likes · 6 min read
Architects' Tech Alliance
Jun 6, 2025 · Artificial Intelligence

B30 vs H20: Which NVIDIA GPU Wins for AI Workloads and Budgets?

This article compares NVIDIA’s China‑specific B30 and the higher‑end H20 GPU, detailing their GPU architecture updates, memory technologies, performance metrics, power and cooling characteristics, and price positioning, to help enterprises and developers choose the most suitable accelerator for AI and deep‑learning workloads.

AI acceleration · B30 · GPU
0 likes · 13 min read
Architects' Tech Alliance
Apr 28, 2025 · Artificial Intelligence

NVLink High‑Speed Interconnect: Architecture, Evolution, and Performance

NVLink, NVIDIA's high‑bandwidth interconnect introduced with the P100 GPU, replaces PCIe by offering significantly higher data rates and lower latency for GPU‑GPU and GPU‑CPU communication, and has evolved through multiple generations to support modern AI and high‑performance computing workloads.

AI acceleration · GPU interconnect · High Performance Computing
0 likes · 9 min read
AntTech
Mar 19, 2025 · Artificial Intelligence

Award-Winning HPCA 2025 Papers on Near‑DRAM Processing (UniNDP) and GPU‑Accelerated Fully Homomorphic Encryption (WarpDrive)

At HPCA 2025, two standout papers—UniNDP, a unified compilation and simulation tool for near‑DRAM processing architectures, and WarpDrive, a GPU‑based fully homomorphic encryption accelerator leveraging Tensor and CUDA cores—demonstrate significant performance gains for AI workloads and privacy‑preserving computation.

AI acceleration · Fully Homomorphic Encryption · GPU
0 likes · 5 min read
DataFunTalk
Feb 26, 2025 · Artificial Intelligence

DeepGEMM: An Open‑Source FP8 GEMM Library for Efficient AI Model Training and Inference

DeepGEMM is an open‑source FP8‑precision GEMM library that delivers up to 1350 TFLOPS on NVIDIA Hopper GPUs, offering JIT‑compiled, lightweight code (~300 lines) for dense and MoE matrix multiplication, with easy deployment, configurable environment variables, and performance advantages over CUTLASS for large AI models.
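To make the FP8 idea concrete, here is a minimal pure‑Python sketch of the scale‑quantize‑rescale pattern that low‑precision GEMM libraries build on. The helper names are hypothetical, and real FP8 (E4M3) kernels such as DeepGEMM's run on Hopper tensor cores with much finer‑grained scaling:

```python
# Toy illustration of scaled low-precision GEMM: quantize both operands
# with per-tensor scales, multiply in low precision, then rescale the
# accumulator once. Not DeepGEMM's actual implementation.

def quantize(mat, levels=448.0):  # 448 is the E4M3 maximum normal value
    amax = max(abs(x) for row in mat for x in row) or 1.0
    scale = levels / amax
    # Crude rounding stands in for the FP8 cast.
    q = [[round(x * scale) for x in row] for row in mat]
    return q, scale

def scaled_gemm(a, b):
    qa, sa = quantize(a)
    qb, sb = quantize(b)
    n, k, m = len(qa), len(qb), len(qb[0])
    # Accumulate in the quantized domain, then undo both scales at once.
    return [[sum(qa[i][t] * qb[t][j] for t in range(k)) / (sa * sb)
             for j in range(m)] for i in range(n)]

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
print(scaled_gemm(a, b))  # recovers A (multiplied by the identity)
```

The single deferred rescale is what lets the inner loop stay in cheap low‑precision arithmetic.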

AI acceleration · DeepGEMM · FP8
0 likes · 7 min read
Baidu Geek Talk
Jan 15, 2025 · Artificial Intelligence

Understanding Large Model Inference Engines and Reducing Token Interval (TPOT)

Large‑model inference engines turn prompts into responses through a prefill stage followed by autoregressive decoding, with latency measured by TTFT (time to first token) and TPOT (time per output token). Baidu’s AIAK suite improves TPOT by separating tokenization from generation, using static slot scheduling, and executing asynchronously, cutting the token interval from ~35 ms to ~14 ms and raising GPU utilization to about 75%, while quantization and speculative decoding deliver further throughput gains.
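The two metrics above are easy to pin down in code. This is a hypothetical helper for illustration, not part of AIAK:

```python
# TTFT (time to first token) and TPOT (time per output token, i.e. the
# average token interval) computed from emission timestamps.

def ttft_and_tpot(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Given the request start time and the wall-clock time at which each
    output token was emitted, return (TTFT, TPOT) in seconds."""
    ttft = token_times[0] - request_start
    # TPOT averages the gaps between consecutive decode tokens.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return ttft, sum(gaps) / len(gaps)

# Example: first token after 120 ms, then one token every 14 ms
# (the post-optimization interval the article cites).
times = [0.120 + 0.014 * i for i in range(11)]
ttft, tpot = ttft_and_tpot(0.0, times)
print(f"TTFT = {ttft*1000:.0f} ms, TPOT = {tpot*1000:.0f} ms")
```

TTFT is dominated by prefill compute, TPOT by per-step decode latency, which is why the two are optimized separately.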

AI acceleration · GPU utilization · TPOT
0 likes · 10 min read
DataFunSummit
Dec 4, 2024 · Artificial Intelligence

Accelerating Large Language Model Inference with the YiNian LLM Framework

This article presents the YiNian LLM framework, detailing how KVCache, prefill/decode separation, continuous batching, PagedAttention, and multi‑hardware scheduling are used to speed up large language model inference while managing GPU memory and latency.
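For readers new to the technique, a minimal sketch of the KVCache idea, in hypothetical framework‑agnostic Python (real engines store tensors per layer and head, often in paged blocks):

```python
class KVCache:
    """Per-sequence cache of attention keys/values, one entry per token.

    Caching means each past token's key/value projections are computed
    once and reused at every subsequent decode step.
    """
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

cache = KVCache()
# Prefill: one append per prompt token.
for token_k, token_v in [([1.0, 0.0], [0.5, 0.5]), ([0.0, 1.0], [0.2, 0.8])]:
    cache.append(token_k, token_v)
# Decode: only the newly generated token is projected and appended.
cache.append([1.0, 1.0], [0.9, 0.1])
print(len(cache))  # 3 cached tokens
```

The trade-off the article explores is that this cache grows linearly with sequence length, which is what paged block management and scheduling exist to control.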

AI acceleration · GPU · KVCache
0 likes · 20 min read
DataFunSummit
Oct 2, 2024 · Artificial Intelligence

NVIDIA’s Solutions for Large Language Models: NeMo Framework, TensorRT‑LLM, and Retrieval‑Augmented Generation

This article explains NVIDIA’s end‑to‑end stack for large language models, covering the NeMo Framework for data processing, training, and deployment, the open‑source TensorRT‑LLM inference accelerator, and the Retrieval‑Augmented Generation (RAG) technique that enriches model outputs with external knowledge.

AI acceleration · NeMo · Nvidia
0 likes · 17 min read
Xiaohongshu Tech REDtech
Sep 19, 2024 · Artificial Intelligence

Target-Driven Distillation (TDD): A Multi‑Goal Distillation Method for Accelerating Diffusion Models

Target‑Driven Distillation (TDD) is a multi‑goal distillation method that flexibly selects short‑range target steps and decouples guidance during training, enabling 4‑to‑8‑step diffusion generation that preserves high‑resolution detail, works with LoRA, ControlNet, and InstantID, and outperforms existing consistency‑distillation techniques in speed and quality.

AI acceleration · Image Generation · diffusion models
0 likes · 9 min read
DataFunSummit
Sep 5, 2024 · Artificial Intelligence

NVIDIA’s End‑to‑End Solutions for Large Language Models: NeMo Framework, TensorRT‑LLM, and Retrieval‑Augmented Generation

This article introduces NVIDIA’s comprehensive solutions for large language models, covering the NeMo Framework’s full‑stack development pipeline, the open‑source TensorRT‑LLM inference accelerator, and Retrieval‑Augmented Generation techniques, while detailing data preprocessing, distributed training, model fine‑tuning, deployment, and performance optimizations.

AI acceleration · NeMo Framework · Nvidia
0 likes · 16 min read
ByteDance SYS Tech
Aug 12, 2024 · Cloud Native

How mGPU Enables Efficient GPU Sharing for AI Workloads

This article explains the mGPU solution that virtualizes NVIDIA GPUs for containers, detailing its driver architecture, compute and memory isolation mechanisms, performance benchmarks on ResNet‑50 inference, and how it boosts GPU utilization by over 50% for AI and high‑performance computing tasks.
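As a toy illustration of the compute‑isolation idea (not mGPU's actual driver mechanism), proportional sharing of GPU time can be sketched as follows; the function name and container weights are made up:

```python
# Toy model of weighted GPU time-slicing: each container is granted
# scheduling slices in proportion to its configured weight, which is the
# high-level contract compute-isolation layers like mGPU enforce.

def share_gpu(weights: dict[str, int], total_slices: int) -> dict[str, int]:
    """Distribute total_slices scheduling slices proportionally to weights."""
    total_w = sum(weights.values())
    granted = {c: (w * total_slices) // total_w for c, w in weights.items()}
    # Hand out any integer-division remainder round-robin so no slice is lost.
    leftover = total_slices - sum(granted.values())
    for c in list(weights)[:leftover]:
        granted[c] += 1
    return granted

print(share_gpu({"training": 3, "inference": 1}, 100))
# {'training': 75, 'inference': 25}
```

Memory isolation is the harder half in practice, since over-allocation by one container must fail its allocation rather than evict a neighbor's.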

AI acceleration · Container Orchestration · GPU sharing
0 likes · 10 min read
Rare Earth Juejin Tech Community
May 1, 2024 · Artificial Intelligence

Hyper‑SD: Trajectory‑Segmented Consistency Model for Accelerating Diffusion Image Generation

Hyper‑SD introduces a trajectory‑segmented consistency distillation framework that combines trajectory‑preserving and trajectory‑reconstruction strategies, integrates human‑feedback learning and score distillation, and achieves state‑of‑the‑art low‑step image generation performance on both SD1.5 and SDXL models.

AI acceleration · Image Generation · Model Distillation
0 likes · 10 min read
DevOps Operations Practice
Apr 29, 2024 · Fundamentals

Introduction to CPUs and GPUs: Functions, Advanced Features, and Key Differences

This article explains the basic functions of CPUs and GPUs, their advanced capabilities and real‑world applications, and compares their architectures, processing models, and roles in environments such as IoT, mobile devices, Kubernetes, and AI workloads.

AI acceleration · CPU · GPU
0 likes · 7 min read
Sohu Tech Products
Mar 27, 2024 · Artificial Intelligence

NVIDIA NeMo Framework, TensorRT‑LLM, and RAG for Large Language Model Solutions

NVIDIA’s comprehensive LLM ecosystem combines the full‑stack NeMo Framework for data curation, distributed training, fine‑tuning, inference acceleration with TensorRT‑LLM and Triton, plus Retrieval‑Augmented Generation and Guardrails, enabling efficient, low‑latency, knowledge‑grounded model deployment across clusters.

AI acceleration · NeMo Framework · Nvidia
0 likes · 16 min read
AntTech
Jan 9, 2024 · Artificial Intelligence

ATorch: Ant Group’s Open‑Source Distributed Training Acceleration Library for Large‑Scale AI Models

Ant Group’s newly open‑sourced ATorch library extends PyTorch with a layered architecture and automated resource‑aware strategies, raising hardware utilization in large‑model training to as much as 60%, improving stability, and delivering significant throughput gains across multi‑node, multi‑GPU deployments.

AI acceleration · Large Models · PyTorch
0 likes · 6 min read
Architects' Tech Alliance
Dec 23, 2023 · Artificial Intelligence

Future Development Paths of Computing Power Technology (2023): Chip Architecture, Near‑Memory Computing, and Distributed xPU Systems

The article outlines the accelerating demand for high‑performance computing driven by AI, AR/VR, biotech and other workloads, examines the limits of Moore's law, and presents emerging solutions such as advanced chip architectures, chiplet integration, near‑memory/in‑memory computing, and distributed xPU‑based systems for scalable, efficient compute.

AI acceleration · Chiplet · chip architecture
0 likes · 11 min read
Architects' Tech Alliance
Nov 15, 2023 · Fundamentals

FPGA: A Versatile Chip Igniting New Momentum and the Future of Domestic Substitution (2023)

The article analyzes the rapid growth of FPGA technology, its flexible architecture and low‑cost development, the expanding role of FPGA in data‑center acceleration, the strategic moves of AMD, Intel and Nvidia in heterogeneous computing, and forecasts a strong market expansion worldwide through 2025.

AI acceleration · FPGA · data center
0 likes · 10 min read
Architects' Tech Alliance
Sep 11, 2023 · Artificial Intelligence

Open Acceleration Specification AI Server Design Guide (2023): Architecture, OAM Modules, UBB Board, and System Design

The 2023 Open Acceleration Specification AI Server Design Guide details the hardware architecture, OAM module and UBB board specifications, cooling, management, fault diagnosis, and software platform needed to build high‑performance, scalable AI compute clusters for large‑model training.

AI acceleration · Hardware Architecture · Large Model Training
0 likes · 10 min read
Architects' Tech Alliance
Sep 4, 2023 · Artificial Intelligence

Overview of AI Chip Types, Architectures, and Market Trends

The article explains the various AI‑capable chips such as CPUs, GPUs, FPGAs, NPUs, and TPUs, compares their performance and efficiency, describes heterogeneous CPU+xPU solutions, and provides market share data while highlighting the growing adoption of specialized AI accelerators.

AI acceleration · AI chips · CPU
0 likes · 7 min read
Architects' Tech Alliance
Aug 2, 2023 · Fundamentals

Emerging Trends in Digital Infrastructure: Beyond Moore's Law, Chiplet, and Compute‑in‑Memory

The article surveys recent digital‑infrastructure trends, explaining why traditional Moore's Law scaling is slowing, describing More‑Moore and Beyond‑CMOS approaches, and detailing new chip architectures such as DSA, 3D stacking, Chiplet, compute‑in‑memory, and distributed xPU‑centric systems that together address the growing compute demands of AI, AR/VR, and bio‑pharma workloads.

AI acceleration · Compute-in-Memory · Moore's law
0 likes · 11 min read