Tag: Vision Transformer

Ximalaya Technology Team
Oct 10, 2023 · Artificial Intelligence

MiniGPT-5: A Novel Multimodal Generation Model for Coherent Text-Image Synthesis

MiniGPT-5 is a novel multimodal generation model that uses generative vokens to interleave text and image synthesis. It couples large language models with Stable Diffusion through a two-stage training strategy that requires no domain-specific annotations, achieving state-of-the-art coherence and quality on benchmarks such as CC3M, VIST, and MMDialog.
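
As a rough sketch of the generative-voken idea: the LLM emits special voken tokens, and their hidden states are projected into the conditioning space that Stable Diffusion's U-Net expects. The `VokenMapper` below, its two-layer MLP, and all dimensions are illustrative assumptions, not MiniGPT-5's actual feature mapper.

```python
import torch
import torch.nn as nn

class VokenMapper(nn.Module):
    """Hypothetical sketch: map the LLM hidden state at a generative-voken
    position into Stable Diffusion's text-conditioning space (77 x 768 for
    SD 1.x). Architecture and dimensions are assumptions for illustration."""
    def __init__(self, llm_dim=4096, cond_tokens=77, cond_dim=768):
        super().__init__()
        self.cond_tokens, self.cond_dim = cond_tokens, cond_dim
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, 4 * cond_dim),
            nn.GELU(),
            nn.Linear(4 * cond_dim, cond_tokens * cond_dim),
        )

    def forward(self, voken_hidden):            # (B, llm_dim)
        cond = self.proj(voken_hidden)          # (B, 77 * 768)
        return cond.view(-1, self.cond_tokens, self.cond_dim)

# The (B, 77, 768) output stands in for CLIP text embeddings when calling
# the diffusion U-Net, so generation is driven by the LLM's voken states.
voken_hidden = torch.randn(2, 4096)
cond = VokenMapper()(voken_hidden)              # (2, 77, 768)
```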

AI research · Multimodal Generation · Stable Diffusion
9 min read
Rare Earth Juejin Tech Community
Jul 24, 2023 · Artificial Intelligence

Understanding Slide-Transformer: An Efficient Local Attention Module for Vision Transformers

This article explains the Slide-Transformer paper, describing how the proposed Slide Attention replaces inefficient Im2Col‑based local attention with depthwise convolutions and a deformable shift module, achieving high efficiency, flexibility, and hardware‑agnostic performance for Vision Transformers.
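
The article's key point, that Im2Col-style neighbourhood gathering can be re-expressed with depthwise convolutions, can be sketched in a few lines (a minimal illustration of the idea only; the deformable shift module and the full attention computation are omitted):

```python
import torch
import torch.nn.functional as F

def shift_gather(x, k=3):
    """Gather each pixel's k*k neighbourhood via depthwise conv2d with
    fixed one-hot 'shift' kernels: equivalent to F.unfold (Im2Col), but
    expressed as convolutions, which map better onto hardware."""
    B, C, H, W = x.shape
    # one one-hot kernel per (channel, offset) pair
    weight = x.new_zeros(C * k * k, 1, k, k)
    for c in range(C):
        for o in range(k * k):
            weight[c * k * k + o, 0, o // k, o % k] = 1.0
    return F.conv2d(x, weight, padding=k // 2, groups=C)  # (B, C*k*k, H, W)

x = torch.randn(2, 4, 8, 8)
a = shift_gather(x, k=3)
b = F.unfold(x, 3, padding=1).view(2, 4 * 9, 8, 8)  # Im2Col reference
assert torch.allclose(a, b)  # same neighbourhoods, conv-only implementation
```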

Deep Learning · Deformable Shift · Depthwise Convolution
13 min read
Rare Earth Juejin Tech Community
Jul 12, 2023 · Artificial Intelligence

Comprehensive Guide to Vision Transformer (ViT): Architecture, Patch Tokenization, Embedding, Fine‑tuning, and Performance

This article provides an in-depth overview of Vision Transformer (ViT), covering its Transformer-based architecture, patch-to-token conversion, token and position embeddings, fine-tuning strategies such as 2-D interpolation of position embeddings, experimental results against CNNs, and the model's broader significance for multimodal AI research.
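
Two of the mechanisms covered, patch tokenization and 2-D interpolation of position embeddings when fine-tuning at a new resolution, can be sketched as follows (standard ViT idioms, not the article's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each to a
    token: a stride-p conv is exactly 'flatten patch + linear layer'."""
    def __init__(self, img=224, p=16, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=p, stride=p)
        n = (img // p) ** 2
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))

    def forward(self, x):
        t = self.proj(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        t = torch.cat([self.cls.expand(len(t), -1, -1), t], dim=1)
        return t + self.pos

def interpolate_pos(pos, old_grid, new_grid):
    """2-D interpolate patch position embeddings for fine-tuning at a
    higher resolution; the CLS embedding is kept as-is."""
    cls_pos, patch_pos = pos[:, :1], pos[:, 1:]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, -1).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, -1)
    return torch.cat([cls_pos, patch_pos], dim=1)
```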

Deep Learning · Fine-tuning · Patch Embedding
25 min read
DataFunSummit
Jun 23, 2023 · Artificial Intelligence

Frontiers of Video Action Recognition: Concepts, Algorithms, and Applications

This article introduces video action recognition, covering its basic definition, downstream tasks, and major algorithmic families (CNN-based, Vision-Transformer-based, self-supervised, and multimodal approaches), and discusses practical deployment scenarios and open challenges in the field.

CNN · Multimodal Models · Vision Transformer
16 min read
Rare Earth Juejin Tech Community
Oct 18, 2022 · Artificial Intelligence

Practical Implementation of Vision Transformer (ViT) for Image Classification in PyTorch

This article walks readers through building, training, and evaluating a Vision Transformer (ViT) model for a five‑class flower classification task, providing detailed code snippets, model architecture explanations, training script adjustments, and experimental results that highlight the importance of pre‑trained weights.
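
The takeaway about pre-trained weights corresponds to a setup like the following sketch, which uses torchvision's ViT-B/16 rather than the article's own training scripts; only the five-class head matches the flower task:

```python
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load ImageNet-pretrained ViT-B/16; training the same model from scratch
# on a few thousand flower images performs far worse, hence the article's
# emphasis on pre-trained weights.
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# Replace the classification head with a fresh 5-way linear layer.
model.heads.head = nn.Linear(model.heads.head.in_features, 5)

# Optionally freeze the backbone and fine-tune only the new head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("heads")
```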

Deep Learning · PyTorch · ViT
13 min read
Rare Earth Juejin Tech Community
Oct 10, 2022 · Artificial Intelligence

A Beginner’s Journey into Vision Transformers (ViT) for Computer Vision Engineers

This article introduces the fundamentals of Vision Transformers (ViT) for computer-vision developers, starting with an overview of the Transformer architecture, followed by a detailed explanation of self-attention and multi-head attention, and step-by-step PyTorch code examples that illustrate query, key, and value computation and attention scoring.
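
The query/key/value computation and attention scoring described above reduce to a few lines of PyTorch. A minimal single-head sketch (multi-head attention runs several of these in parallel on split channels):

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head self-attention: project tokens to queries, keys, and
    values, score queries against keys, and mix values by those scores."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, N, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # scaled dot-product: (B, N, N) attention matrix
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(x.size(-1))
        attn = attn.softmax(dim=-1)             # each row sums to 1
        return attn @ v                         # weighted sum of values

x = torch.randn(2, 16, 64)                      # 16 tokens of width 64
out = SelfAttention(64)(x)                      # (2, 16, 64)
```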

PyTorch · Self-Attention · Vision Transformer
12 min read
AntTech
Jun 15, 2022 · Artificial Intelligence

XYLayoutLM: Towards Layout-Aware Multimodal Networks for Visually-Rich Document Understanding

XYLayoutLM introduces a layout‑aware multimodal network that improves visually‑rich document understanding by augmenting XY‑Cut for robust reading order generation and employing a Dilated Conditional Position Encoding to handle variable‑length inputs, achieving state‑of‑the‑art performance on XFUN and FUNSD datasets.
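
For context, the vanilla XY-Cut that the paper augments recursively splits a page's text boxes along whitespace gaps to derive a reading order. A compact sketch of that basic algorithm (the textbook method only, not the paper's augmented, shuffle-robust variant):

```python
def xy_cut(boxes):
    """Vanilla recursive XY-Cut over (x0, y0, x1, y1) text boxes: cut at
    the first whitespace gap on the y axis (top/bottom), else on the
    x axis (left/right), and recurse until no gap remains."""
    if len(boxes) <= 1:
        return list(boxes)
    for axis in (1, 0):                      # 1 = y (rows first), 0 = x
        order = sorted(boxes, key=lambda b: b[axis])
        reach = order[0][axis + 2]           # running max of far edge
        for i in range(1, len(order)):
            if order[i][axis] >= reach:      # whitespace gap: cut here
                return xy_cut(order[:i]) + xy_cut(order[i:])
            reach = max(reach, order[i][axis + 2])
    # no separating gap: fall back to top-to-bottom, left-to-right
    return sorted(boxes, key=lambda b: (b[1], b[0]))

# Two columns under a full-width title: returns title, left col, right col.
page = [(0, 0, 10, 1), (0, 2, 4, 9), (6, 2, 10, 9)]
print(xy_cut(page))
```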

Document Understanding · Multimodal · Vision Transformer
10 min read
DataFunTalk
Jun 9, 2022 · Artificial Intelligence

Understanding and Reproducing MAE (Masked AutoEncoder) for Self‑Supervised Vision Learning with EasyCV

This article introduces the MAE (Masked AutoEncoder) self‑supervised learning method, explains its asymmetric encoder‑decoder design and high masking ratio, evaluates its performance, and provides a step‑by‑step guide to reproduce MAE using Alibaba’s EasyCV framework, including code snippets, training tips, and troubleshooting.
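
The crux of the asymmetric design is that only a small random subset of patch tokens ever enters the encoder. A minimal sketch of that masking step in plain PyTorch (the common per-sample shuffle idiom, independent of EasyCV's implementation):

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Keep a random 25% of patch tokens per sample; only these visible
    tokens enter the (large) encoder, while the (small) decoder later
    reconstructs pixels at the masked positions."""
    B, N, D = tokens.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)   # per-sample randomness
    ids_shuffle = noise.argsort(dim=1)               # random permutation
    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(
        tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_shuffle                      # perm needed to unshuffle

tokens = torch.randn(2, 196, 768)                    # 14x14 ViT patch tokens
visible, perm = random_masking(tokens)               # visible: (2, 49, 768)
```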

EasyCV · MAE · PyTorch
15 min read
Baidu Geek Talk
Mar 28, 2022 · Artificial Intelligence

Robust Input Visualization Methods for Vision Transformers

The paper proposes a robust Grad‑CAM‑inspired visualization for Vision Transformers that combines attention weights and gradients to generate class‑specific saliency maps, demonstrates superior alignment with discriminative regions across ViT, Swin and Volo models, and shows a 76% false‑positive reduction in Baidu’s porn‑content risk control system.
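
In the spirit of Grad-CAM, such methods weight attention maps by their gradients with respect to the target class. A rough sketch of that attention-times-gradient recipe (an illustration of the general idea only; the paper's exact formulation, and its handling of Swin and VOLO, will differ):

```python
import torch

def attn_grad_saliency(attn, attn_grad, grid=14):
    """Grad-CAM-style saliency for a ViT: weight the last block's attention
    by its gradient w.r.t. the target logit, average over heads, and read
    off how strongly the CLS token attends to each patch.

    attn, attn_grad: (B, heads, N+1, N+1) from forward/backward hooks."""
    weighted = (attn * attn_grad).clamp(min=0)   # keep positive evidence
    weighted = weighted.mean(dim=1)              # average heads: (B, N+1, N+1)
    cls_to_patch = weighted[:, 0, 1:]            # CLS row, drop CLS column
    sal = cls_to_patch.reshape(-1, grid, grid)
    # normalize each map to [0, 1] for visualization
    sal = sal - sal.amin(dim=(1, 2), keepdim=True)
    return sal / sal.amax(dim=(1, 2), keepdim=True).clamp(min=1e-6)

attn = torch.rand(2, 12, 197, 197)               # e.g. ViT-B/16, 14x14 patches
grad = torch.randn(2, 12, 197, 197)
maps = attn_grad_saliency(attn, grad)            # (2, 14, 14) saliency maps
```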

Deep Learning · Grad-CAM · Input Visualization
11 min read