
AliMe MKG: Multimodal Knowledge Graph for Live E‑commerce and Its Technical Exploration

This report presents AliMe MKG, a multimodal knowledge graph designed for live e‑commerce, detailing its business background, construction and application, the three types of multimodal knowledge (triples, sentences, and visual media), the underlying extraction techniques, and its deployment in digital‑human anchors and intelligent live‑room assistants.

DataFunSummit

AliMe MKG is a multimodal knowledge graph created by Alibaba to support digital‑human anchors in live e‑commerce, enabling merchants to launch live streams with a single click and provide 24‑hour product presentation.

Business background: Traditional live‑streaming requires costly human anchors and carries reputation risk. A digital‑human solution reduces cost and risk and enables continuous streaming, powered by an intelligent script system that delivers product descriptions, images, and videos.

Intelligent script system: The system generates multimodal scripts containing text, associated images, and videos, which are displayed to the digital human during the broadcast. The scripts rely on an underlying multimodal knowledge graph.

Multimodal knowledge graph: The graph stores three categories of knowledge: (1) triple‑type knowledge linking scene, pain point, demand, and product; (2) sentence‑type knowledge providing detailed product descriptions such as usage methods; (3) multimodal knowledge linking images and videos to product attributes.
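The three knowledge categories can be pictured as a small data-structure sketch. All node names, relation names, and file names below are invented for illustration; the real graph is built on Alibaba's product middle-platform and is far larger.

```python
# Illustrative sketch of the three knowledge types in AliMe MKG.
# Every node, relation, and asset name here is assumed, not from the system.

# (1) Triple-type knowledge: scene -> pain point -> demand -> product
triples = [
    ("outdoor_hiking", "has_pain_point", "sunburn"),
    ("sunburn", "raises_demand", "sun_protection"),
    ("sun_protection", "satisfied_by", "sunscreen_spf50"),
]

# (2) Sentence-type knowledge: detailed descriptions attached to a product
sentences = {
    "sunscreen_spf50": ["Apply evenly 15 minutes before sun exposure."],
}

# (3) Multimodal knowledge: images/videos linked to product attributes
media = {
    ("sunscreen_spf50", "texture"): {"image": "img_001.jpg"},
    ("sunscreen_spf50", "usage"):   {"video": "clip_017.mp4"},
}

def script_for(product):
    """Assemble a simple multimodal script for one product from the graph."""
    text = sentences.get(product, [])
    assets = {attr: m for (p, attr), m in media.items() if p == product}
    return {"text": text, "assets": assets}
```

A script for `sunscreen_spf50` would then bundle its description sentences with the texture image and usage video, which is roughly what the intelligent script system hands to the digital human.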

Construction and application: The graph is built on Alibaba’s product middle‑platform, adding nodes for scenes, pain points, demands, sentences, and visual media. It evolved from a triple‑only graph in 2019 to a full multimodal graph by 2020, supporting both digital‑human anchors and an intelligent live‑room assistant for personalized product recommendation.

Triple‑type knowledge extraction: Node mining uses phrase extraction and entity recognition; relation construction uses relation‑extraction algorithms, employing distant supervision and external knowledge to reduce manual labeling.
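The distant-supervision idea mentioned above can be sketched in a few lines: any sentence that mentions both entities of a known triple is treated as a noisy positive training example for that relation, so no manual labeling is needed. The knowledge-base entries below are invented for illustration.

```python
# Minimal distant-supervision labeler. A sentence containing both the head
# and tail of a known triple is taken as a (noisy) positive example for
# that relation. Entity and relation names are assumed examples.

KB = {("dry skin", "relieved_by"): "moisturizer"}

def distant_label(sentence, kb):
    """Return all (head, relation, tail) triples co-occurring in the sentence."""
    labels = []
    for (head, rel), tail in kb.items():
        if head in sentence and tail in sentence:
            labels.append((head, rel, tail))
    return labels

sent = "Customers with dry skin say this moisturizer helps overnight."
print(distant_label(sent, KB))
# -> [('dry skin', 'relieved_by', 'moisturizer')]
```

In practice such auto-labeled data is noisy (co-occurrence does not guarantee the relation is expressed), which is why it is typically paired with denoising techniques before training the relation-extraction model.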

Sentence‑type knowledge extraction: Sentences are mined from internal micro‑articles, product reviews, and detail pages. Techniques include summarization, pipeline extraction, polarity classification, OCR text extraction, text rewriting, and language‑model scoring to ensure fluent, relevant sentences.
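The language-model scoring step can be illustrated with a toy bigram model: candidate sentences are ranked by their average log-probability, and disfluent candidates score lower. The tiny corpus and add-one smoothing here are assumptions; the deployed system would use a large pretrained LM.

```python
import math
from collections import Counter

# Toy fluency scorer: rank candidate sentences by average bigram
# log-probability under a small reference corpus (assumed for illustration).

corpus = "this cream absorbs quickly and this cream feels light".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
V = len(unigrams)

def score(sentence):
    """Average log-prob per bigram with add-one smoothing (higher = more fluent)."""
    toks = sentence.split()
    logp = 0.0
    for a, b in zip(toks, toks[1:]):
        logp += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
    return logp / max(1, len(toks) - 1)

candidates = ["this cream absorbs quickly", "quickly cream this absorbs"]
best = max(candidates, key=score)
# best -> "this cream absorbs quickly"
```

The same scoring idea also filters rewritten or OCR-extracted sentences: only candidates above a fluency threshold make it into the graph.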

Multimodal knowledge extraction: Image knowledge is obtained from product detail images using image‑text matching (Vision Transformer for images, StructBERT for text) and pre‑training tasks (contrastive learning, masked region feature regression, masked language modeling). Video knowledge uses video grounding: clips are extracted, encoded with a 3D‑CNN, combined with ASR text, and processed by a multimodal transformer; multiple‑instance learning reduces annotation cost.
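The contrastive-learning objective in the image-text pretraining can be sketched as an InfoNCE-style loss: matched (image, text) embedding pairs are pulled together and mismatched pairs pushed apart. The two-dimensional toy embeddings below are assumptions; in the described system the embeddings would come from the ViT image encoder and the StructBERT text encoder.

```python
import math

# InfoNCE-style contrastive loss over paired image/text embeddings.
# image_embs[i] and text_embs[i] are a matched pair; all other texts in
# the batch act as negatives for image i. Toy vectors stand in for
# encoder outputs.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(image_embs, text_embs, temperature=0.1):
    """Mean cross-entropy of matching each image to its paired text."""
    loss = 0.0
    for i, img in enumerate(image_embs):
        logits = [dot(img, txt) / temperature for txt in text_embs]
        m = max(logits)  # stabilize the log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]  # -log softmax of the true pair
    return loss / len(image_embs)

aligned  = info_nce([[1, 0], [0, 1]], [[1, 0], [0, 1]])  # pairs match
shuffled = info_nce([[1, 0], [0, 1]], [[0, 1], [1, 0]])  # pairs swapped
```

Correctly aligned pairs yield a much lower loss than shuffled ones, which is exactly the signal that teaches the encoders to place a product image near its matching attribute text.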

Advanced research topics: Multi‑modal NER enhances entity recognition by prompting CLIP with image‑derived labels; multi‑modal entity linking combines multi‑modal retrieval and contrastive learning with a dual‑tower model to disambiguate mentions.
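The dual-tower retrieval step of multi-modal entity linking can be sketched as follows: the mention (fused text plus image features) and each candidate entity are encoded by separate towers, then compared by cosine similarity. The entity names and three-dimensional feature vectors below are invented for illustration.

```python
import math

# Dual-tower sketch for multi-modal entity linking: encode mention and
# candidates independently, score by cosine similarity, link to the
# highest-scoring entity. Vectors and entity names are assumed.

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

entity_index = {
    "apple_fruit": [0.9, 0.1, 0.0],
    "apple_inc":   [0.1, 0.9, 0.2],
}

def link(mention_emb, index):
    """Return the highest-scoring candidate entity for a mention embedding."""
    return max(index, key=lambda e: cosine(mention_emb, index[e]))

# A mention whose fused text+image embedding leans toward the fruit sense:
print(link([0.8, 0.2, 0.1], entity_index))
# -> apple_fruit
```

Because the towers are independent, candidate embeddings can be precomputed and indexed, which keeps disambiguation fast at live-stream scale; contrastive training of the two towers is what makes the cosine scores discriminative.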

Applications: The graph powers digital‑human anchors for scripted product presentation and an intelligent assistant that recommends multimodal product content based on user profiles, improving conversion rates in live streams.

Q&A highlights: Business metrics compare digital‑human conversion and follower rates to human anchors; scripts are reviewed by merchants and can be edited; merchant edits feed back into model improvement; offline script quality is evaluated on reliability, diversity, and vividness; graph updates occur monthly for schema knowledge and daily for instance knowledge.

Overall, AliMe MKG demonstrates how a multimodal knowledge graph can drive AI‑enabled live‑commerce solutions.

Tags: e-commerce, AI, digital human, knowledge extraction, multimodal knowledge graph, video grounding
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
