Tagged articles

14 articles

May 25, 2026 · Artificial Intelligence

vLLM Deep Dive: Continuous Batching and Paged Attention for Fast LLM Inference

This article walks through a two‑month source‑code study of vLLM, explaining how token‑level scheduling, continuous batching, and the Paged Attention mechanism reshape tensor dimensions to turn large‑model inference into a compute‑bound, high‑throughput process while managing GPU memory efficiently.

FlashAttention · GPU optimization · LLM inference

29 min read

Old Zhang's AI Learning

Apr 28, 2026 · Artificial Intelligence

vLLM 0.20 Arrives with DeepSeek V4 Support – What’s New?

The vLLM 0.20.0 release dramatically upgrades the inference engine with DeepSeek V4 support, default CUDA 13, PyTorch 2.11, Transformers v5 compatibility, FlashAttention 4 MLA prefill, TurboQuant 2‑bit KV cache, an online quantization front‑end, IR enhancements, Model Runner V2 features, and a slew of new models, while providing detailed installation and upgrade guidance.

CUDA 13 · DeepSeek V4 · FlashAttention

10 min read

Data Party THU

Mar 26, 2026 · Artificial Intelligence

How Mixture-of-Depths Attention Boosts Large Language Model Efficiency

This article examines the Mixture‑of‑Depths Attention (MoDA) mechanism, detailing its novel flash‑compatible KV layout, combined sequence‑depth attention, theoretical analysis, and extensive experiments that show significantly lower validation loss and higher accuracy on downstream tasks than the OLMo2 baseline.

Deep KV · FlashAttention · Mixture-of-Depths Attention

9 min read

DeepHub IMBA

Mar 25, 2026 · Artificial Intelligence

TPU Architecture and Pallas Kernels: From Memory Hierarchy to FlashAttention

This article explains why TPU programming differs from GPU programming, describes the explicit HBM‑VMEM‑register data movement required on TPU, introduces the Pallas grid‑BlockSpec‑Ref model, and walks through four progressively more complex kernels (element‑wise add, tiled dot product, fused RMSNorm with scratch memory, and a production‑grade FlashAttention implementation), showing how each kernel maps to the TPU memory hierarchy and leverages Pallas features such as input_output_aliases and PrefetchScalarGridSpec.

FlashAttention · JAX · Memory Hierarchy

20 min read

Old Zhang's AI Learning

Mar 7, 2026 · Artificial Intelligence

vLLM 0.17.0 Release: Full Qwen 3.5 Support and Anthropic API Compatibility

The vLLM 0.17.0 release brings FlashAttention 4 integration, a mature Model Runner V2, complete Qwen 3.5 series support, a one‑click performance‑mode flag, Anthropic API compatibility, advanced weight‑offloading, broader hardware support beyond NVIDIA, ASR model integration, and detailed upgrade and installation guidance.

ASR · Anthropic API · FlashAttention

12 min read

Code Mala Tang

Mar 5, 2026 · Artificial Intelligence

Master YOLOv12: A Step‑by‑Step Guide to Build, Train, and Deploy Custom Models

This tutorial walks readers through the fundamentals of YOLOv12, covering model variants, dataset preparation with Roboflow, optional FlashAttention acceleration, installation, model selection, training commands, post‑training tasks such as tracking, validation, inference, exporting to ONNX, and benchmarking, all with concrete code snippets and practical tips.

FlashAttention · Python · Roboflow

8 min read

AI2ML AI to Machine Learning

Dec 19, 2025 · Artificial Intelligence

The 9 Key Ideas Behind FlashAttention

FlashAttention accelerates transformer inference by combining nine techniques—including loss‑less attention, GPU memory‑pyramid optimization, SRAM‑reusing tiling, safe softmax scaling, online buffering, tile‑size constraints, parallel multiplication, reduced KV slicing, and integrated backward‑pass caching—to achieve efficient, high‑throughput computation on modern GPUs.

FlashAttention · GPU optimization · Online Algorithm

8 min read

Bilibili Tech

Mar 4, 2025 · Artificial Intelligence

Engineering Practices and Optimizations for Text‑to‑Video Generation Models (OpenSora, CogVideoX) by the Bilibili TTV Team

The Bilibili TTV team optimized OpenSora and CogVideoX text‑to‑video models by redesigning data storage with Alluxio, parallelizing VAE encoding, applying dynamic sequence‑parallel and DeepSpeed‑Ulysses attention, adapting GPU code for NPU execution, leveraging profiling‑driven kernel fusion, FlashAttention, and expandable memory to dramatically increase training efficiency and frame throughput, while outlining future pipeline‑parallel and ZeRO‑3 scaling plans.

Diffusion Transformer · FlashAttention · Model Parallelism

26 min read

Ops Development & AI Practice

Feb 16, 2025 · Artificial Intelligence

Why FlashAttention Supercharges Qwen Models: A Technical Deep Dive

This article explains the FlashAttention algorithm and its memory‑efficient tiling and recomputation techniques, shows how enabling the flash_attn flag dramatically speeds up Qwen‑series large models, and outlines the hardware and software requirements and potential trade‑offs.

FlashAttention · GPU optimization · Large Language Model

8 min read

NewBeeNLP

Nov 18, 2024 · Artificial Intelligence

How to Optimize Multi-Head Attention: From MQA to FlashAttention and Beyond

This article examines various techniques for compressing and accelerating the KV cache in transformer models (MQA, GQA, MLA, sliding‑window and linear attention, FlashAttention, PagedAttention, and Ring Attention, as well as mixed‑precision training and ZeRO parallelism), providing code snippets, implementation details, and practical trade‑offs.

FlashAttention · KV Cache · Model Parallelism

17 min read

Baobao Algorithm Notes

Nov 7, 2024 · Artificial Intelligence

Demystifying FlashAttention: A Minimalist Derivation of the Algorithm

This article presents a concise, step‑by‑step derivation of FlashAttention, explaining the prerequisite linear‑algebra concepts, the softmax simplifications, and the parallel computation workflow—including the LSE‑enhanced version—so readers can grasp the algorithm’s elegance without heavy mathematics.

Algorithm Derivation · FlashAttention · Large Language Models

8 min read

Sohu Tech Products

Sep 11, 2024 · Artificial Intelligence

How RoPE and FlashAttention Empower GLM-4-Plus for Long-Text Mastery

This article explains the core mechanisms of Transformer models, details the Rotary Position Embedding (RoPE) and FlashAttention techniques for handling long sequences, introduces the GLM-4-Plus series, and presents an empirical evaluation on the THUCNews dataset showing its superior long-text performance.

FlashAttention · GLM-4-Plus · Long Text

13 min read

DeWu Technology

May 15, 2024 · Artificial Intelligence

Accelerating Large Language Model Inference: Techniques and Framework Recommendations

Deploying a dedicated inference cluster and applying four key optimizations (FlashAttention‑based attention computation, PagedAttention KV‑cache management, Mixture‑of‑Experts parameter reduction, and tensor parallelism) can accelerate large language model inference by up to 50% for models as large as 70 B parameters while cutting deployment costs.

FlashAttention · Inference Acceleration · Large Language Models

17 min read

Huawei Cloud Developer Alliance

Nov 16, 2023 · Artificial Intelligence

ChatGLM2 vs ChatGLM3: MQA, FlashAttention, and New Prompt Features

During the Saturday session, we reviewed ChatGLM2’s upgrades—Multi‑Query Attention and FlashAttention—demonstrated deployment on Ascend + ModelArts + MindSpore, and introduced ChatGLM3’s revamped prompt design, native tool‑calling and code‑interpreter capabilities, while previewing the next lecture on text‑generation decoding.

ChatGLM2 · ChatGLM3 · FlashAttention

6 min read