
Baidu Commercial Multimodal Understanding and AIGC Innovation Practices

This article presents Baidu's commercial multimodal understanding and AIGC innovations, detailing rich‑media multimodal perception, a unified large‑scale representation framework, scenario‑specific fine‑tuning, and practical applications such as marketing copy, digital‑human video, and poster generation.

DataFunSummit

Introduction

The talk shares Baidu's commercial multimodal understanding and AIGC innovation practices, focusing on two main parts: rich‑media multimodal understanding and Baidu AIGC‑Qingduo.

1. Rich‑Media Multimodal Understanding

Challenges include the diversity of commercial scenarios, redundant per‑scenario modeling, mismatched text‑image pairs, semantically meaningless ID features, and the difficulty of efficiently fusing visual semantics with other modalities.

A good multimodal representation should broaden data coverage, deepen visual fidelity, and allow scenario‑specific fine‑tuning.

Historical approaches evolved from CNN‑based visual encoders to ViT and CLIP, but these lacked the reasoning ability needed for tasks such as VQA.

Baidu built a 12‑billion‑parameter multimodal representation + generation model (VICAN‑12B) that combines a large ViT visual encoder, a text transformer for retrieval, and a generation head, trained on a hundred‑billion‑scale dataset.
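The talk does not publish VICAN‑12B's internals, but a retrieval‑oriented representation model of this kind typically aligns its visual and text towers with a symmetric contrastive objective. A minimal numpy sketch of that alignment step (all function names, dimensions, and the temperature value are illustrative assumptions, not Baidu's implementation):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Put embeddings on the unit sphere so dot products become cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    img_emb, txt_emb: (batch, dim) outputs of the visual and text towers;
    row i of each matrix is assumed to be a matching pair.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature            # (batch, batch) similarity matrix
    labels = np.arange(len(logits))               # diagonal entries are the positives

    def xent(lg):
        # Cross-entropy of the diagonal (positive) entries, numerically stabilized.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image->text and text->image retrieval directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Perfectly aligned towers drive this loss toward zero, which is why the same backbone can then serve both the retrieval head and the generation head.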

2. Scenario‑Specific Fine‑Tuning

For visual retrieval, the base representation is fine‑tuned with click‑through signals, achieving state‑of‑the‑art results on seven benchmark datasets.

In ranking scenarios, multimodal features are tokenized much like text tokens, using sparse activation, straight‑through estimation (STE), and encoder‑decoder quantization, preserving partial order and keeping quantization loss below 1%.
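The quantization step can be pictured as a nearest‑codeword lookup over a learned codebook, with the relative reconstruction error serving as the "quantization loss below 1%" metric the talk cites. A minimal numpy sketch under those assumptions (codebook training and the STE backward pass are omitted; numpy has no autograd):

```python
import numpy as np

def quantize(features, codebook):
    """Map continuous multimodal features to discrete token IDs via nearest codeword.

    features: (n, dim) embeddings; codebook: (k, dim) learned codewords.
    Returns (ids, reconstruction): the discrete tokens and their decoded vectors.
    At training time a straight-through estimator would pass gradients through
    this non-differentiable lookup, i.e. out = x + stop_gradient(q - x).
    """
    # Squared distance from every feature to every codeword, shape (n, k).
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = d.argmin(axis=1)
    return ids, codebook[ids]

def relative_quantization_loss(features, reconstruction):
    # Fraction of the signal lost to discretization; the talk reports <1%.
    return np.linalg.norm(features - reconstruction) / np.linalg.norm(features)
```

The resulting integer IDs can then be fed into a ranking model exactly like text token IDs.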

Fusion combines discrete multimodal IDs with sparse features, using multi‑scale residual connections to reduce loss.
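A residual fusion of this kind can be sketched in a few lines: the embedded discrete ID is projected into the dense feature space and added back, so the fused vector falls back to the dense path when the ID carries little signal. The layering and shapes below are hypothetical, not Baidu's architecture:

```python
import numpy as np

def residual_fuse(dense_feat, id_emb, weights):
    # Project the discrete-ID embedding into feature space and add it
    # residually onto the dense multimodal feature.
    return dense_feat + id_emb @ weights

def multi_scale_fusion(dense_feat, id_embs, weight_stack):
    """Apply residual fusion at several scales.

    id_embs: one embedded discrete ID per scale; weight_stack: the matching
    projection matrices. Each scale refines the running representation.
    """
    x = dense_feat
    for id_emb, w in zip(id_embs, weight_stack):
        x = residual_fuse(x, id_emb, w)
    return x
```

The residual form is what keeps information loss low: an uninformative ID embedding contributes (near) zero and leaves the dense representation intact.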

3. Baidu AIGC‑Qingduo (Commercial AIGC)

Integrates marketing with AIGC to boost content production efficiency, covering inspiration (AI‑driven prompt discovery), creation (text, image, digital‑human, video generation), and delivery (AI‑guided optimization).

Marketing copy generation relies on a commercial prompt system combined with Baidu's large language model, incorporating knowledge graphs, style tags, selling points, and user personas.
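Conceptually, such a prompt system composes the structured inputs into one instruction before calling the LLM. A toy sketch of that assembly step (the field names and wording are illustrative; the talk does not publish Baidu's actual prompt schema):

```python
def build_marketing_prompt(product, selling_points, style_tag, persona, kg_facts):
    """Assemble a marketing-copy prompt from structured commercial inputs.

    kg_facts: facts pulled from a knowledge graph; selling_points: points to
    emphasize; style_tag: desired tone; persona: target audience description.
    """
    facts = "; ".join(kg_facts)
    points = ", ".join(selling_points)
    return (
        f"You are a marketing copywriter. Product: {product}. "
        f"Known facts: {facts}. "
        f"Selling points to emphasize: {points}. "
        f"Writing style: {style_tag}. "
        f"Target audience: {persona}. "
        "Write one short piece of ad copy."
    )
```

The returned string would then be sent to the language model; varying the style tag or persona yields differently targeted copy from the same product data.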

Digital‑human video generation uses prompts to control the script, avatar, voice, and background, enabling a pipeline that produces a personalized marketing video in about three minutes.

Poster generation leverages the large multimodal representation and diffusion‑based UNet to allow background replacement and fine‑tuning for brand‑specific visuals.

Overall, Baidu's multimodal and AIGC solutions provide a unified, transferable representation that enhances commercial systems across advertising, recommendation, and content creation.

Tags: advertising, large language model, multimodal, AIGC, Baidu, visual‑language
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
