
Baidu Commercial Multimodal Understanding and AIGC Innovation Practices

This article presents Baidu's commercial multimodal understanding and AIGC innovations, detailing rich‑media multimodal perception, a unified large‑scale representation framework, scenario‑specific fine‑tuning, and practical applications such as marketing copy, digital‑human video, and poster generation.

DataFunSummit

Introduction

The talk shares Baidu's commercial multimodal understanding and AIGC innovation practices, focusing on two main parts: rich‑media multimodal understanding and Baidu AIGC‑Qingduo.

1. Rich‑Media Multimodal Understanding

Challenges include the diversity of commercial scenarios, redundant per‑scenario modeling, mismatched text‑image pairs, semantically meaningless ID features, and the difficulty of efficiently fusing visual semantics with other modalities.

A good multimodal representation should broaden data coverage, deepen visual fidelity, and allow scenario‑specific fine‑tuning.

Historical approaches evolved from CNN‑based visual encoders to ViT and CLIP, but these lacked the reasoning ability needed for tasks such as VQA.

Baidu built a 12‑billion‑parameter multimodal representation + generation model (VICAN‑12B) that combines a large ViT visual encoder, a text transformer for retrieval, and a generation head, trained on a hundred‑billion‑scale dataset.
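The talk does not publish VICAN‑12B's internals, but a retrieval‑oriented representation model of this kind typically aligns its visual and text towers with a symmetric contrastive objective. A minimal numpy sketch of that alignment step (all function names, dimensions, and the temperature value are illustrative assumptions, not Baidu's implementation):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Put embeddings on the unit sphere so dot products become cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    img_emb, txt_emb: (batch, dim) outputs of the visual and text towers;
    row i of each matrix is assumed to be a matching pair.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature            # (batch, batch) similarity matrix
    labels = np.arange(len(logits))               # diagonal entries are the positives

    def xent(lg):
        # Cross-entropy of the diagonal (positive) entries, numerically stabilized.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image->text and text->image retrieval directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Perfectly aligned towers drive this loss toward zero, which is why the same backbone can then serve both the retrieval head and the generation head.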

2. Scenario‑Specific Fine‑Tuning

For visual retrieval, the base representation is fine‑tuned with click‑through signals, achieving state‑of‑the‑art results on seven benchmark datasets.

In ranking scenarios, multimodal features are tokenized much like text tokens, using sparse activation, straight‑through estimation (STE), and encoder‑decoder quantization, preserving partial order and keeping quantization loss below 1%.
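The quantization step can be pictured as a nearest‑codeword lookup over a learned codebook, with the relative reconstruction error serving as the "quantization loss below 1%" metric the talk cites. A minimal numpy sketch under those assumptions (codebook training and the STE backward pass are omitted; numpy has no autograd):

```python
import numpy as np

def quantize(features, codebook):
    """Map continuous multimodal features to discrete token IDs via nearest codeword.

    features: (n, dim) embeddings; codebook: (k, dim) learned codewords.
    Returns (ids, reconstruction): the discrete tokens and their decoded vectors.
    At training time a straight-through estimator would pass gradients through
    this non-differentiable lookup, i.e. out = x + stop_gradient(q - x).
    """
    # Squared distance from every feature to every codeword, shape (n, k).
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = d.argmin(axis=1)
    return ids, codebook[ids]

def relative_quantization_loss(features, reconstruction):
    # Fraction of the signal lost to discretization; the talk reports <1%.
    return np.linalg.norm(features - reconstruction) / np.linalg.norm(features)
```

The resulting integer IDs can then be fed into a ranking model exactly like text token IDs.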

Fusion combines discrete multimodal IDs with sparse features, using multi‑scale residual connections to reduce loss.
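A residual fusion of this kind can be sketched in a few lines: the embedded discrete ID is projected into the dense feature space and added back, so the fused vector falls back to the dense path when the ID carries little signal. The layering and shapes below are hypothetical, not Baidu's architecture:

```python
import numpy as np

def residual_fuse(dense_feat, id_emb, weights):
    # Project the discrete-ID embedding into feature space and add it
    # residually onto the dense multimodal feature.
    return dense_feat + id_emb @ weights

def multi_scale_fusion(dense_feat, id_embs, weight_stack):
    """Apply residual fusion at several scales.

    id_embs: one embedded discrete ID per scale; weight_stack: the matching
    projection matrices. Each scale refines the running representation.
    """
    x = dense_feat
    for id_emb, w in zip(id_embs, weight_stack):
        x = residual_fuse(x, id_emb, w)
    return x
```

The residual form is what keeps information loss low: an uninformative ID embedding contributes (near) zero and leaves the dense representation intact.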

3. Baidu AIGC‑Qingduo (Commercial AIGC)

Integrates marketing with AIGC to boost content production efficiency, covering inspiration (AI‑driven prompt discovery), creation (text, image, digital‑human, video generation), and delivery (AI‑guided optimization).

Marketing copy generation relies on a commercial prompt system combined with Baidu's large language model, incorporating knowledge graphs, style tags, selling points, and user personas.
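Conceptually, such a prompt system composes the structured inputs into one instruction before calling the LLM. A toy sketch of that assembly step (the field names and wording are illustrative; the talk does not publish Baidu's actual prompt schema):

```python
def build_marketing_prompt(product, selling_points, style_tag, persona, kg_facts):
    """Assemble a marketing-copy prompt from structured commercial inputs.

    kg_facts: facts pulled from a knowledge graph; selling_points: points to
    emphasize; style_tag: desired tone; persona: target audience description.
    """
    facts = "; ".join(kg_facts)
    points = ", ".join(selling_points)
    return (
        f"You are a marketing copywriter. Product: {product}. "
        f"Known facts: {facts}. "
        f"Selling points to emphasize: {points}. "
        f"Writing style: {style_tag}. "
        f"Target audience: {persona}. "
        "Write one short piece of ad copy."
    )
```

The returned string would then be sent to the language model; varying the style tag or persona yields differently targeted copy from the same product data.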

Digital‑human video generation uses prompts to control the script, avatar, voice, and background, enabling a pipeline that produces a personalized marketing video in about three minutes.

Poster generation leverages the large multimodal representation and diffusion‑based UNet to allow background replacement and fine‑tuning for brand‑specific visuals.

Overall, Baidu's multimodal and AIGC solutions provide a unified, transferable representation that enhances commercial systems across advertising, recommendation, and content creation.

Tags: advertising, large language model, multimodal, AIGC, Baidu, visual‑language
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
