Tag: Vision Transformer

Ximalaya Technology Team
Oct 10, 2023 · Artificial Intelligence

MiniGPT-5: A Novel Multimodal Generation Model for Coherent Text-Image Synthesis

MiniGPT-5 is a novel multimodal generation model that uses generative vokens to interleave text and image synthesis. It couples large language models with Stable Diffusion through a two-stage training strategy that requires no domain-specific annotations, achieving state-of-the-art coherence and quality on benchmarks such as CC3M, VIST, and MMDialog.
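
As a rough sketch of the generative-voken idea: the LLM emits special voken tokens, and their hidden states are projected into the conditioning space that Stable Diffusion's U-Net expects. The `VokenMapper` below, its two-layer MLP, and all dimensions are illustrative assumptions, not MiniGPT-5's actual feature mapper.

```python
import torch
import torch.nn as nn

class VokenMapper(nn.Module):
    """Hypothetical sketch: map the LLM hidden state at a generative-voken
    position into Stable Diffusion's text-conditioning space (77 x 768 for
    SD 1.x). Architecture and dimensions are assumptions for illustration."""
    def __init__(self, llm_dim=4096, cond_tokens=77, cond_dim=768):
        super().__init__()
        self.cond_tokens, self.cond_dim = cond_tokens, cond_dim
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, 4 * cond_dim),
            nn.GELU(),
            nn.Linear(4 * cond_dim, cond_tokens * cond_dim),
        )

    def forward(self, voken_hidden):            # (B, llm_dim)
        cond = self.proj(voken_hidden)          # (B, 77 * 768)
        return cond.view(-1, self.cond_tokens, self.cond_dim)

# The (B, 77, 768) output stands in for CLIP text embeddings when calling
# the diffusion U-Net, so generation is driven by the LLM's voken states.
voken_hidden = torch.randn(2, 4096)
cond = VokenMapper()(voken_hidden)              # (2, 77, 768)
```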

AI research · Multimodal Generation · Stable Diffusion
9 min read
Rare Earth Juejin Tech Community
Jul 24, 2023 · Artificial Intelligence

Understanding Slide-Transformer: An Efficient Local Attention Module for Vision Transformers

This article explains the Slide-Transformer paper, describing how the proposed Slide Attention replaces inefficient Im2Col‑based local attention with depthwise convolutions and a deformable shift module, achieving high efficiency, flexibility, and hardware‑agnostic performance for Vision Transformers.
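
The article's key point, that Im2Col-style neighbourhood gathering can be re-expressed with depthwise convolutions, can be sketched in a few lines (a minimal illustration of the idea only; the deformable shift module and the full attention computation are omitted):

```python
import torch
import torch.nn.functional as F

def shift_gather(x, k=3):
    """Gather each pixel's k*k neighbourhood via depthwise conv2d with
    fixed one-hot 'shift' kernels: equivalent to F.unfold (Im2Col), but
    expressed as convolutions, which map better onto hardware."""
    B, C, H, W = x.shape
    # one one-hot kernel per (channel, offset) pair
    weight = x.new_zeros(C * k * k, 1, k, k)
    for c in range(C):
        for o in range(k * k):
            weight[c * k * k + o, 0, o // k, o % k] = 1.0
    return F.conv2d(x, weight, padding=k // 2, groups=C)  # (B, C*k*k, H, W)

x = torch.randn(2, 4, 8, 8)
a = shift_gather(x, k=3)
b = F.unfold(x, 3, padding=1).view(2, 4 * 9, 8, 8)  # Im2Col reference
assert torch.allclose(a, b)  # same neighbourhoods, conv-only implementation
```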

Deep Learning · Deformable Shift · Depthwise Convolution
13 min read
Rare Earth Juejin Tech Community
Jul 12, 2023 · Artificial Intelligence

Comprehensive Guide to Vision Transformer (ViT): Architecture, Patch Tokenization, Embedding, Fine‑tuning, and Performance

This article provides an in-depth overview of Vision Transformer (ViT), covering its Transformer-based architecture, patch-to-token conversion, token and position embeddings, fine-tuning strategies such as 2-D interpolation of position embeddings, experimental results against CNNs, and the model's broader significance for multimodal AI research.
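
Two of the mechanisms covered, patch tokenization and 2-D interpolation of position embeddings when fine-tuning at a new resolution, can be sketched as follows (standard ViT idioms, not the article's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each to a
    token: a stride-p conv is exactly 'flatten patch + linear layer'."""
    def __init__(self, img=224, p=16, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=p, stride=p)
        n = (img // p) ** 2
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))

    def forward(self, x):
        t = self.proj(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        t = torch.cat([self.cls.expand(len(t), -1, -1), t], dim=1)
        return t + self.pos

def interpolate_pos(pos, old_grid, new_grid):
    """2-D interpolate patch position embeddings for fine-tuning at a
    higher resolution; the CLS embedding is kept as-is."""
    cls_pos, patch_pos = pos[:, :1], pos[:, 1:]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, -1).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, -1)
    return torch.cat([cls_pos, patch_pos], dim=1)
```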

Deep Learning · Fine-tuning · Patch Embedding
25 min read
DataFunSummit
Jun 23, 2023 · Artificial Intelligence

Frontiers of Video Action Recognition: Concepts, Algorithms, and Applications

This article introduces video action recognition, covering its basic definition, downstream tasks, and major algorithmic families (CNN-based, Vision-Transformer-based, self-supervised, and multimodal approaches), and discusses practical deployment scenarios and open challenges in the field.

CNN · Multimodal Models · Vision Transformer
16 min read
Rare Earth Juejin Tech Community
Oct 18, 2022 · Artificial Intelligence

Practical Implementation of Vision Transformer (ViT) for Image Classification in PyTorch

This article walks readers through building, training, and evaluating a Vision Transformer (ViT) model for a five‑class flower classification task, providing detailed code snippets, model architecture explanations, training script adjustments, and experimental results that highlight the importance of pre‑trained weights.
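
The takeaway about pre-trained weights corresponds to a setup like the following sketch, which uses torchvision's ViT-B/16 rather than the article's own training scripts; only the five-class head matches the flower task:

```python
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load ImageNet-pretrained ViT-B/16; training the same model from scratch
# on a few thousand flower images performs far worse, hence the article's
# emphasis on pre-trained weights.
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# Replace the classification head with a fresh 5-way linear layer.
model.heads.head = nn.Linear(model.heads.head.in_features, 5)

# Optionally freeze the backbone and fine-tune only the new head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("heads")
```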

Deep Learning · PyTorch · ViT
13 min read
Rare Earth Juejin Tech Community
Oct 10, 2022 · Artificial Intelligence

A Beginner’s Journey into Vision Transformers (ViT) for Computer Vision Engineers

This article introduces the fundamentals of Vision Transformers (ViT) for computer-vision developers, starting with an overview of the Transformer architecture, followed by a detailed explanation of self-attention and multi-head attention, and step-by-step PyTorch code examples that illustrate query, key, and value computation and attention scoring.
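
The query/key/value computation and attention scoring described above reduce to a few lines of PyTorch. A minimal single-head sketch (multi-head attention runs several of these in parallel on split channels):

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head self-attention: project tokens to queries, keys, and
    values, score queries against keys, and mix values by those scores."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, N, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # scaled dot-product: (B, N, N) attention matrix
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(x.size(-1))
        attn = attn.softmax(dim=-1)             # each row sums to 1
        return attn @ v                         # weighted sum of values

x = torch.randn(2, 16, 64)                      # 16 tokens of width 64
out = SelfAttention(64)(x)                      # (2, 16, 64)
```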

PyTorch · Self-Attention · Vision Transformer
12 min read
AntTech
Jun 15, 2022 · Artificial Intelligence

XYLayoutLM: Towards Layout-Aware Multimodal Networks for Visually-Rich Document Understanding

XYLayoutLM introduces a layout‑aware multimodal network that improves visually‑rich document understanding by augmenting XY‑Cut for robust reading order generation and employing a Dilated Conditional Position Encoding to handle variable‑length inputs, achieving state‑of‑the‑art performance on XFUN and FUNSD datasets.
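
For context, the vanilla XY-Cut that the paper augments recursively splits a page's text boxes along whitespace gaps to derive a reading order. A compact sketch of that basic algorithm (the textbook method only, not the paper's augmented, shuffle-robust variant):

```python
def xy_cut(boxes):
    """Vanilla recursive XY-Cut over (x0, y0, x1, y1) text boxes: cut at
    the first whitespace gap on the y axis (top/bottom), else on the
    x axis (left/right), and recurse until no gap remains."""
    if len(boxes) <= 1:
        return list(boxes)
    for axis in (1, 0):                      # 1 = y (rows first), 0 = x
        order = sorted(boxes, key=lambda b: b[axis])
        reach = order[0][axis + 2]           # running max of far edge
        for i in range(1, len(order)):
            if order[i][axis] >= reach:      # whitespace gap: cut here
                return xy_cut(order[:i]) + xy_cut(order[i:])
            reach = max(reach, order[i][axis + 2])
    # no separating gap: fall back to top-to-bottom, left-to-right
    return sorted(boxes, key=lambda b: (b[1], b[0]))

# Two columns under a full-width title: returns title, left col, right col.
page = [(0, 0, 10, 1), (0, 2, 4, 9), (6, 2, 10, 9)]
print(xy_cut(page))
```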

Document Understanding · Multimodal · Vision Transformer
10 min read
DataFunTalk
Jun 9, 2022 · Artificial Intelligence

Understanding and Reproducing MAE (Masked AutoEncoder) for Self‑Supervised Vision Learning with EasyCV

This article introduces the MAE (Masked AutoEncoder) self‑supervised learning method, explains its asymmetric encoder‑decoder design and high masking ratio, evaluates its performance, and provides a step‑by‑step guide to reproduce MAE using Alibaba’s EasyCV framework, including code snippets, training tips, and troubleshooting.
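
The crux of the asymmetric design is that only a small random subset of patch tokens ever enters the encoder. A minimal sketch of that masking step in plain PyTorch (the common per-sample shuffle idiom, independent of EasyCV's implementation):

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Keep a random 25% of patch tokens per sample; only these visible
    tokens enter the (large) encoder, while the (small) decoder later
    reconstructs pixels at the masked positions."""
    B, N, D = tokens.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)   # per-sample randomness
    ids_shuffle = noise.argsort(dim=1)               # random permutation
    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(
        tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_shuffle                      # perm needed to unshuffle

tokens = torch.randn(2, 196, 768)                    # 14x14 ViT patch tokens
visible, perm = random_masking(tokens)               # visible: (2, 49, 768)
```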

EasyCV · MAE · PyTorch
15 min read
Baidu Geek Talk
Mar 28, 2022 · Artificial Intelligence

Robust Input Visualization Methods for Vision Transformers

The paper proposes a robust Grad‑CAM‑inspired visualization for Vision Transformers that combines attention weights and gradients to generate class‑specific saliency maps, demonstrates superior alignment with discriminative regions across ViT, Swin and Volo models, and shows a 76% false‑positive reduction in Baidu’s porn‑content risk control system.
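
In the spirit of Grad-CAM, such methods weight attention maps by their gradients with respect to the target class. A rough sketch of that attention-times-gradient recipe (an illustration of the general idea only; the paper's exact formulation, and its handling of Swin and VOLO, will differ):

```python
import torch

def attn_grad_saliency(attn, attn_grad, grid=14):
    """Grad-CAM-style saliency for a ViT: weight the last block's attention
    by its gradient w.r.t. the target logit, average over heads, and read
    off how strongly the CLS token attends to each patch.

    attn, attn_grad: (B, heads, N+1, N+1) from forward/backward hooks."""
    weighted = (attn * attn_grad).clamp(min=0)   # keep positive evidence
    weighted = weighted.mean(dim=1)              # average heads: (B, N+1, N+1)
    cls_to_patch = weighted[:, 0, 1:]            # CLS row, drop CLS column
    sal = cls_to_patch.reshape(-1, grid, grid)
    # normalize each map to [0, 1] for visualization
    sal = sal - sal.amin(dim=(1, 2), keepdim=True)
    return sal / sal.amax(dim=(1, 2), keepdim=True).clamp(min=1e-6)

attn = torch.rand(2, 12, 197, 197)               # e.g. ViT-B/16, 14x14 patches
grad = torch.randn(2, 12, 197, 197)
maps = attn_grad_saliency(attn, grad)            # (2, 14, 14) saliency maps
```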

Deep Learning · Grad-CAM · Input Visualization
11 min read