Optimization of Multimodal Visual Large Model Inference: Pre‑processing, ViT TensorRT, CUDA Graphs, Tokenization, Prefix Cache, and Quantization
This report details a comprehensive set of optimizations for multimodal visual large‑model (VLM) inference: image pre‑processing acceleration, TensorRT deployment of the ViT encoder, CUDA Graph replay, visual token‑count reduction, prefix‑cache handling, and weight quantization. Together these techniques yield up to a three‑fold throughput gain while preserving model accuracy.
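As a flavor of one of the techniques listed above, the sketch below illustrates visual token‑count reduction by average‑pooling 2x2 neighborhoods of ViT patch tokens, a common way to shrink the token sequence fed to the language model by 4x. The function name, grid size, and embedding dimension are illustrative assumptions, not the report's actual implementation.

```python
# Hypothetical sketch of visual token-count reduction: merge each 2x2
# window of patch tokens into one token by averaging (4x fewer tokens).
# Pure-Python stand-in for what would normally be a tensor op on GPU.

def merge_tokens_2x2(tokens, grid_h, grid_w):
    """Average-pool a flat (grid_h*grid_w, dim) token list over 2x2
    windows, reducing the token count by a factor of 4."""
    assert grid_h % 2 == 0 and grid_w % 2 == 0
    dim = len(tokens[0])
    merged = []
    for r in range(0, grid_h, 2):
        for c in range(0, grid_w, 2):
            # Gather the four tokens in this 2x2 window.
            window = [tokens[(r + dr) * grid_w + (c + dc)]
                      for dr in (0, 1) for dc in (0, 1)]
            merged.append([sum(v[d] for v in window) / 4.0
                           for d in range(dim)])
    return merged

# Example: a 24x24 patch grid (576 tokens) shrinks to 12x12 (144 tokens).
tokens = [[float(i)] * 8 for i in range(24 * 24)]
out = merge_tokens_2x2(tokens, 24, 24)
print(len(out))  # 144
```

In practice this kind of merge is done on-device inside the vision tower (e.g. via pooling or token-merging layers), but the arithmetic is the same: fewer visual tokens means a shorter prefill and higher end-to-end throughput.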