
Multimodal Search: From Mobile to the 5G+ Intelligent Era – Baidu’s Voice and Visual Search Technologies

This article reviews Baidu's multimodal search advancements, covering the evolution of voice and visual search, technical architectures, algorithmic improvements, and future prospects such as the DuXiaoxiao app that integrates speech, image, and text AI for immersive user experiences.

DataFunTalk

Since the launch of the iPhone 4 in 2010, smartphones have become ubiquitous, and the rollout of 5G has dramatically increased data-transfer speeds, enabling new search modalities such as voice and visual search.

The content is organized into four main parts:

Multimodal search – originating from mobile devices and flourishing in the 5G‑plus‑intelligent era.

Voice search – focusing on clear capture, understanding, and satisfying user intent.

Visual search – delivering "see‑what‑you‑see" results.

"Breaking the circle" – exploring unlimited possibilities for future multimodal products.

1. Concept of multimodal search: Baidu’s app provides a voice button for speech-based queries and a camera button for visual search. Voice search can replace traditional text input, while visual search extracts the information behind images.

2. Why Baidu began building multimodal search in 2015: The rise of smartphones made voice input feasible; 4G networks greatly improved upload speeds, making image upload easy; the user base expanded beyond young adults to children and seniors.

3. Changes brought by the 5G era: More immersive experiences, lower latency, and the proliferation of smart speakers, wearables, and AR devices create new user demands for multimodal search.

4. Voice search:

Goals: "listen clearly" (robust recognition), "understand what is heard" (query understanding), and "satisfy the intent" (precise answers).

Challenges for listening clearly: noisy environments, dialects, and low speaking volume.

Understanding challenges: colloquial expressions, long‑tail queries, context continuity (e.g., follow‑up questions).

Satisfaction: voice assistants must return the most relevant top‑1 answer rather than a list.

Technical solution for voice search: three stages – (1) speech recognition and error correction; (2) query understanding, covering query generalization, QA conversion, context handling, and session management; (3) intelligent QA over knowledge graphs to return precise answers.
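To make the three-stage structure concrete, here is a minimal Python sketch. All names (Session, recognize_and_correct, understand_query, answer_from_kg) are hypothetical placeholders for illustration, not Baidu's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    """Tracks prior turns so follow-up questions can be resolved (context continuity)."""
    history: list = field(default_factory=list)

def recognize_and_correct(audio: bytes) -> str:
    """Stage 1: speech recognition plus error correction (stubbed)."""
    return "how tall is the eiffel tower"  # pretend ASR output

def understand_query(query: str, session: Session) -> str:
    """Stage 2: generalization, QA conversion, context handling (stubbed).
    A follow-up like 'and the burj khalifa?' would be rewritten into a
    self-contained question using session.history."""
    session.history.append(query)
    return query

def answer_from_kg(query: str) -> str:
    """Stage 3: intelligent QA over a knowledge graph (stubbed)."""
    return "330 m"

session = Session()
query = understand_query(recognize_and_correct(b"<pcm audio>"), session)
print(answer_from_kg(query))  # a single top-1 answer, not a list of links
```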

5. Visual search:

Goal: "see‑what‑you‑see" – retrieve the underlying content of captured images or video streams.

Key challenges: interaction efficiency, pixel‑level perception, and accurate object recognition.

Achievements: Baidu’s visual perception can respond within ~100 ms on mobile devices, covering more than 60 scenarios and indexing over 80 million entities, billions of products, and over 100 billion images.

Visual technology layers:

1) Perception on the device – 2D/3D detection, tracking, scene recognition, AR positioning and rendering.

2) Recognition – detailed search and fulfillment after perception.

3) Foundational technology – image, text, and video understanding; face and human-body detection; cloud-edge performance optimization; multimodal QA.

Visual perception pipeline (six steps): detection & segmentation, tracking, coarse‑grained understanding, AR rendering, MR interaction, and application scenarios such as multi‑object tracking, quiz search, AR translation, and real‑time word extraction.
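The six steps can be read as a per-frame processing chain. The sketch below is purely illustrative; each stage function is a hypothetical stub standing in for an on-device model:

```python
import numpy as np

# Hypothetical per-frame pipeline; stubs stand in for on-device models.
def detect_and_segment(frame):                       # step 1: detection & segmentation
    return [{"box": (10, 10, 80, 80)}]

def track(frame, detections):                        # step 2: tracking across frames
    return [{**d, "track_id": 0} for d in detections]

def coarse_classify(frame, tracks):                  # step 3: coarse-grained understanding
    return [{**t, "label": "cup"} for t in tracks]

def render_ar(frame, objects):                       # step 4: AR overlay rendering
    return frame

def handle_interaction(frame, objects):              # step 5: MR interaction
    return objects

def process_frame(frame):
    detections = detect_and_segment(frame)
    tracks = track(frame, detections)
    objects = coarse_classify(frame, tracks)
    frame = render_ar(frame, objects)
    return handle_interaction(frame, objects)        # step 6: feeds application scenarios

frame = np.zeros((480, 640, 3), dtype=np.uint8)      # dummy camera frame
print(process_frame(frame))
```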

Algorithm evolution for detection:

First generation – a lightweight one-stage detector built on MobileNet-V1, using pruning and focal loss (see the focal-loss sketch after this list).

Second generation – addressed instability in continuous frames with multi‑frame information fusion.

Third generation – improved small-object recall using YOLOv3 and neural architecture search.

Fourth generation – applied knowledge distillation to close the gap with larger models such as RetinaNet with a ResNet-50 backbone.
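Focal loss (Lin et al., 2017), used in the first-generation detector, down-weights easy examples so training concentrates on hard ones. Below is a minimal PyTorch version of the standard binary formulation, not Baidu's code:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balance weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Example: anchor-level objectness scores vs. 0/1 labels
logits = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float()
print(focal_loss(logits, targets))
```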

Visual retrieval workflow: feature extraction (hand-crafted SIFT descriptors or learned CNN embeddings) followed by approximate nearest neighbor (ANN) search over the index.
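This extract-then-search pattern can be prototyped with an off-the-shelf ANN library. The sketch below uses FAISS as an illustrative choice; the article does not name Baidu's actual indexing stack, and the feature vectors are random stand-ins:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 128                                            # feature dimension (e.g., a CNN embedding)
db = np.random.rand(10000, d).astype("float32")    # stand-in for indexed image features
queries = np.random.rand(5, d).astype("float32")   # features of query images

# IVF index: cluster the database into nlist cells, then probe a few at query time.
nlist = 100
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(db)
index.add(db)
index.nprobe = 8                                   # cells visited per query: recall vs. latency

distances, ids = index.search(queries, 5)          # top-5 nearest images per query
print(ids)
```

The nprobe parameter is the usual ANN trade-off knob: visiting more cells raises recall at the cost of latency, which matters at the billions-of-images scale described above.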

Supervised vs. unsupervised methods:

Supervised methods suffer from limited, noisy labeled data, insufficient diversity, and high annotation cost.

Unsupervised approaches include spectral clustering for pseudo‑labels and BYOL‑style contrastive learning to obtain robust image representations.
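As a sketch of the pseudo-label idea (illustrative only, using scikit-learn rather than Baidu's in-house pipeline): cluster unlabeled image embeddings, then treat the cluster ids as class labels for supervised training:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Stand-in for embeddings of unlabeled images from a pretrained backbone.
embeddings = np.random.randn(500, 64).astype(np.float32)

clusterer = SpectralClustering(
    n_clusters=20,
    affinity="nearest_neighbors",  # build the similarity graph from k-NN
    n_neighbors=10,
    assign_labels="kmeans",
    random_state=0,
)
pseudo_labels = clusterer.fit_predict(embeddings)

# The (embedding, pseudo_label) pairs can now train a classifier as if labeled.
print(pseudo_labels[:10])
```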

Future outlook – "Breaking the circle": Baidu’s DuXiaoxiao app, unveiled at the 2020 Baidu World Conference, integrates voice, visual, and text AI, features a virtual avatar with emotional TTS, and exemplifies the next generation of multimodal search.

In summary, multimodal search is evolving rapidly with advances in speech recognition, visual perception, and large‑scale unsupervised learning, promising richer, more immersive user experiences.

Tags: artificial intelligence, 5G, visual search, Baidu, multimodal search, voice search
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
