
Baidu Commercial Multimodal Understanding and AIGC Innovation Practices

This article presents Baidu's commercial multimodal understanding framework and AIGC innovations, detailing rich-media multimodal perception, the VICAN‑12B multimodal representation‑generation model, scenario‑specific fine‑tuning, feature quantization for ranking, and practical applications such as marketing content generation, digital‑human video creation, and poster synthesis.

DataFunTalk

Introduction – This talk covers Baidu's commercial multimodal understanding and AIGC (Artificial Intelligence Generated Content) innovations, in two main parts: rich‑media multimodal understanding and Baidu AIGC‑Qingduo.

1. Rich‑Media Multimodal Understanding

Challenges include diverse commercial scenarios, redundant independent modeling across them, mismatched visual‑textual materials, semantically meaningless ID features, and the need to fuse visual, video, and other features efficiently.

What constitutes a good multimodal representation? It should broaden data applicability, enhance visual fidelity, and allow fine‑tuning per scenario. Early approaches used separate image and text encoders (CNN, detection‑based features) and later shifted to Vision Transformers (ViT) and CLIP‑style dual‑tower models, which excel at retrieval but lack reasoning capabilities.
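The dual‑tower retrieval behavior described above can be illustrated with a minimal NumPy sketch of a CLIP‑style symmetric contrastive objective. This is a generic illustration of the technique, not Baidu's implementation; the function names and temperature value are purely illustrative.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Each tower encodes its modality independently; only the dot product of
    the two normalized embeddings couples them, which is why dual towers
    excel at retrieval but cannot do cross-modal reasoning.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature        # (B, B) cosine-similarity matrix
    diag = np.arange(len(logits))

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)           # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[diag, diag].mean()              # matched pairs lie on the diagonal

    # average the image->text and text->image retrieval directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pulls each image embedding toward its paired text embedding and pushes it away from every other caption in the batch.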

To improve natural‑language‑to‑visual perception, Baidu built a hundred‑billion‑scale training set and introduced the VICAN‑12B multimodal representation + generation model. The architecture combines a dual‑tower (large‑scale ViT visual encoder, stacked Transformer text encoder) with a single‑tower generation head, trained jointly on three tasks: generation, classification, and image‑text contrast, achieving strong performance.

2. Scenario‑Specific Fine‑Tuning

For visual retrieval, the base representation is fine‑tuned with commercial click‑through signals as labels, achieving SOTA results on seven datasets.
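Using click‑through signals as supervision amounts to fitting a lightweight head on top of the (frozen) base representation. A minimal sketch under that assumption, with a logistic head trained by gradient descent on binary click labels, might look as follows; the actual fine‑tuning recipe in the talk is not disclosed, so treat every name and hyperparameter here as illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def finetune_ctr_head(features, clicks, lr=0.1, steps=200):
    """Fit a logistic head on frozen multimodal embeddings.

    features: (N, D) embeddings from the pretrained representation model.
    clicks:   (N,) labels in {0, 1} derived from commercial click logs.
    Returns the learned weight vector and bias.
    """
    n, d = features.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = sigmoid(features @ w + b)
        grad = p - clicks                      # gradient of binary cross-entropy
        w -= lr * features.T @ grad / n
        b -= lr * grad.mean()
    return w, b
```

In practice one would fine‑tune deeper layers and use far larger batches, but the supervision signal, clicked versus not clicked, plays the same role as the labels here.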

In ranking scenarios, multimodal features are quantized similarly to textual tokenization, using sparse activation and STE (Straight‑Through Estimator) to discretize continuous signals into IDs while preserving partial order. A two‑step process—learning discreteness then learning fusion—integrates quantized visual IDs with sparse ranking features, reducing quantization loss below 1%.
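The two ideas in that paragraph, order‑preserving discretization and the STE, can be sketched in a few lines of NumPy. This is a toy scalar quantizer, not Baidu's production quantizer; in a real training loop the STE would be wired into an autograd framework rather than called by hand.

```python
import numpy as np

def scalar_quantize(x, codebook):
    """Map each continuous feature value to its nearest codeword's ID.

    With a sorted codebook this preserves partial order:
    x1 <= x2 implies id(x1) <= id(x2), so the discrete IDs remain
    usable as sparse ranking features.
    """
    ids = np.abs(x[..., None] - codebook).argmin(axis=-1)
    return codebook[ids], ids

def ste_backward(grad_wrt_quantized):
    """Straight-Through Estimator: the forward pass used the hard,
    non-differentiable argmin, but the backward pass treats quantization
    as the identity, letting gradients flow to the continuous features."""
    return grad_wrt_quantized
```

The "learn discreteness, then learn fusion" two‑step described above would first train the codebook with this quantizer, then freeze the IDs and train the ranking model that consumes them alongside the other sparse features.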

3. Baidu AIGC‑Qingduo (Commercial AIGC)

The platform links inspiration, creation, and delivery: AI assists in prompt generation, multimodal generation (text, images, digital humans, videos), and automated optimization of ad delivery.

Key components include a commercial prompt system, a large language model (Wenxin), and a digital‑human pipeline that can produce a video in three minutes by combining prompt‑driven script generation, digital‑human selection, and AI‑based face, background, and voice replacement.

Marketing poster generation leverages the same hundred‑billion‑scale multimodal representation with a diffusion‑based UNet, allowing customers to edit images or change backgrounds via prompts, supported by a lightweight fine‑tuning mechanism.
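For readers unfamiliar with diffusion‑based generation, the core mechanism behind such a UNet is the forward noising process, in which the model learns to predict (and remove) the added noise, conditioned on the prompt embedding. A minimal sketch with a linear beta schedule follows; the schedule, timestep count, and function names are standard textbook choices, not details from the talk.

```python
import numpy as np

def linear_alpha_bar(T=1000):
    """Cumulative signal-retention coefficients for a linear beta schedule."""
    betas = np.linspace(1e-4, 0.02, T)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Forward diffusion step: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.

    The UNet is trained to predict eps from (x_t, t, prompt embedding);
    generation then runs this process in reverse, starting from pure noise.
    """
    eps = rng.normal(size=x0.shape)
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps
```

Prompt‑driven editing and background replacement work by conditioning this denoising process on both the text prompt and the regions of the original image that should be preserved.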

Conclusion – The talk covered Baidu's end‑to‑end multimodal understanding pipeline, scenario‑aware fine‑tuning, and practical AIGC applications that boost commercial content production, personalization, and system monetization.

Tags: ranking, large language model, multimodal, AIGC, Baidu, visual-language, feature quantization
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
