Tag

image captioning


DataFunTalk
Sep 26, 2023 · Artificial Intelligence

MiniGPT-4: Enhancing Vision‑Language Understanding with Large Language Models

This article presents MiniGPT-4, a multimodal system that combines a frozen visual encoder (Q‑Former + ViT) with an open‑source large language model (Vicuna), describes its motivation, training pipeline, demo capabilities, observed limitations, and includes a brief Q&A session.

AI research · MiniGPT-4 · Multimodal
0 likes · 15 min read
360 Tech Engineering
Jun 25, 2023 · Artificial Intelligence

Visual Capability as a Fundamental Requirement for AGI and the SEEChat Multimodal Dialogue Model

The article reviews why visual ability is essential for artificial general intelligence, compares native multimodal and expert‑stitching integration approaches, and details the architectures of models such as KOSMOS‑1, PaLM‑E, Flamingo, BLIP‑2, LLaVA, and MiniGPT‑4. It then introduces the SEEChat project, which fuses CLIP vision encoders with ChatGLM‑6B via a projection layer, and presents its training pipeline, experimental results, and future directions.

AGI · Model Fusion · SEEChat
0 likes · 13 min read
Alimama Tech
Feb 1, 2023 · Artificial Intelligence

CapOnImage: Context-driven Dense Captioning on Images

The paper presents CapOnImage, a novel image‑on‑image captioning task that generates location‑specific decorative text for product images, and introduces the 2.1‑million‑image CapOnImage2M dataset. It proposes a mixed‑modality transformer with position‑aware pre‑training and progressive training. The model achieves superior accuracy and diversity and is already deployed in Alibaba’s advertising platforms, with measurable business impact.

Multimodal · advertising · context-aware
0 likes · 9 min read
DataFunSummit
Oct 9, 2022 · Artificial Intelligence

Understanding the GIT Image‑to‑Text Model: Architecture, Examples, and Performance Comparison

The article introduces the GIT image‑to‑text (image captioning) model, explains its transformer‑based architecture, showcases multiple example outputs, discusses training details, compares its performance with Flamingo and on the COCO benchmark, and highlights its applicability to tasks such as VQA, video captioning, and image classification.

GIT model · Vision-Language · image captioning
0 likes · 12 min read
JD Tech
Aug 14, 2018 · Artificial Intelligence

GCN‑LSTM Image Captioning Model by JD AI Research Institute

JD AI Research Institute presented a GCN‑LSTM encoder‑decoder system that uses graph convolutional networks to integrate semantic and spatial relationships between objects, significantly improving image captioning performance and achieving state‑of‑the‑art results on the COCO benchmark.

COCO dataset · LSTM · computer vision
0 likes · 7 min read