Tag

multimodal learning

DataFunSummit
Feb 4, 2025 · Artificial Intelligence

Training Optimization for Large-Scale Multimodal Models in Content Safety

This article examines the challenges of content safety, outlines the limitations of current task‑specific multimodal models, and proposes large‑model‑inspired training optimizations—including diversified data construction, automated annotation, parameter fine‑tuning, and multi‑task evaluation—to improve efficiency, accuracy, and scalability of multimodal AI systems.

AI optimization · Large Model Training · content safety
26 min read

DataFunSummit
Oct 29, 2024 · Artificial Intelligence

Decentralized Distribution in Xiaohongshu: Strengthening Sideinfo, Multimodal Fusion, and Interest Exploration

This article details Xiaohongshu's technical approaches to decentralized content distribution, covering business background, core challenges, high‑frequency recommendation pipelines, link‑level analysis, sideinfo decoupling, graph‑model integration, multimodal signal fusion, explicit interest exploration, interest protection, and future research directions.

Recommendation systems · decentralized distribution · graph models
24 min read

DataFunSummit
Sep 16, 2024 · Artificial Intelligence

Multimodal Content Understanding and Cold-Start Practices in NetEase Cloud Music Community Recommendation System

This article details how NetEase Cloud Music leverages multimodal content understanding—using audio models like MusicCLIP and Audio MAE and image‑text fusion via FLAVA—to improve recommendation performance for new content and new users, covering system architecture, cold‑start solutions, and future AI‑driven directions.

AI models · Cold Start · audio representation
15 min read

Bilibili Tech
Aug 27, 2024 · Artificial Intelligence

Multimodal Video Scene Classification for Adaptive Video Processing

The paper presents a multimodal video scene classification system that leverages CLIP‑generated pseudo‑labels and a fine‑tuned image encoder to automatically identify nature, animation/game, and document scenes, enabling more effective adaptive transcoding, intelligent restoration, and quality assessment for user‑generated content on platforms such as Bilibili.

Bilibili multimedia · Clip · Computer Vision
17 min read

AntTech
Aug 16, 2024 · Artificial Intelligence

PC²: Pseudo‑Classification Based Pseudo‑Captioning for Noisy Correspondence Learning in Cross‑Modal Retrieval

The paper introduces PC², a novel framework that combines pseudo‑classification and pseudo‑captioning to mitigate noisy correspondence in cross‑modal retrieval, presents a large‑scale web‑page/image‑meta‑description dataset called Noise of Web (NoW), and demonstrates significant performance gains on multiple benchmark datasets including Flickr30K, MS‑COCO, and the newly released NoW.

PC2 · cross-modal retrieval · dataset
16 min read

DataFunTalk
Aug 5, 2024 · Artificial Intelligence

Enhancing Taobao Display Advertising with Multimodal Representations: Challenges, Approaches, and Insights

This article presents a comprehensive study on integrating multimodal image‑text representations into large‑scale e‑commerce advertising CTR models, introducing a semantic‑aware contrastive pre‑training (SCL) method and two application algorithms (SimTier and MAKE) that together achieve over 1% GAUC improvement and significant online gains.

CTR prediction · Pretraining · Recommendation systems
21 min read

Alimama Tech
Aug 2, 2024 · Artificial Intelligence

Multimodal Representations Boost Taobao Display Advertising CTR

Alibaba’s advertising team introduces semantic‑aware contrastive learning to pre‑train multimodal image‑text embeddings, integrates them via SimTier and MAKE into ID‑based CTR models, achieving up to 6.9% lift in Taobao display ad click‑through rates and improving long‑tail item performance.

CTR prediction · Recommendation systems · contrastive learning
21 min read

AntTech
Jul 23, 2024 · Artificial Intelligence

Ant Group’s 11 Papers Accepted at ICML 2024 Cover AI Efficiency, Security, Multimodal Learning, and More

At ICML 2024 in Vienna, Ant Group had eleven papers accepted, spanning topics such as quantization-aware secure inference for transformers, multimodal contrastive captioners, self-cognitive denoising with noisy labels, directed graph embedding, GAN improvement via score matching, and trustworthy alignment of retrieval-augmented large language models.

AI security · Ant Group · ICML2024
18 min read

Kuaishou Tech
Apr 17, 2024 · Artificial Intelligence

Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

The paper presented at AAAI introduces the EERCF method, a coarse‑to‑fine visual representation and two‑stage recall‑then‑rerank strategy that dramatically reduces cross‑modal matching FLOPs while preserving state‑of‑the‑art retrieval performance on multiple video benchmarks.

AI · Efficiency · coarse-to-fine representation
8 min read

NetEase Cloud Music Tech Team
Dec 21, 2023 · Artificial Intelligence

Video and Image Technologies in NetEase Cloud Music: Architecture, Algorithms, and Applications

The article examines NetEase Cloud Music’s video and image technology stack—covering a four‑module architecture, algorithms for content understanding, intelligent production, moderation, and interactive effects—and explains how these systems enhance user experience, streamline backend processing, and position the platform for future AIGC‑driven innovations.

AI algorithms · Cloud Music · Image Recognition
11 min read

DataFunTalk
Nov 10, 2023 · Artificial Intelligence

Multimodal Cold-Start Techniques for Music Recommendation at NetEase Cloud Music

This article presents NetEase Cloud Music's multimodal cold-start recommendation approach, detailing the problem's significance, feature extraction using CLIP, I2I2U indirect modeling, U2I DSSM direct modeling with contrastive learning and interest‑boundary mechanisms, deployment pipeline, evaluation results, and future optimization directions.

Cold Start · contrastive learning · deep learning
14 min read

Xiaohongshu Tech REDtech
Jun 20, 2023 · Artificial Intelligence

Open-Vocabulary Object Attribute Recognition with OvarNet: A Unified Framework for Detection and Attribute Classification

At CVPR 2023 the Xiaohongshu team presented OvarNet, a unified one‑stage Faster‑RCNN model built on CLIP that uses prompt learning and knowledge distillation to jointly detect objects and recognize open‑vocabulary attributes, achieving state‑of‑the‑art results on VAW, MS‑COCO, LSA and OVAD datasets.

Computer Vision · attribute recognition · knowledge distillation
12 min read

DataFunSummit
May 5, 2023 · Artificial Intelligence

Advances in Virtual Humans, Multimodal Technology, and General AI – Insights from OPPO

The article presents OPPO's latest research on virtual human audio‑lip and RGB driving, multimodal learning breakthroughs such as CETNETs and cross‑modal matching, and a reflective discussion on the challenges and future directions of general artificial intelligence, highlighting the interconnections among these three domains.

AI Engineering · audio2lip · general AI
9 min read

NetEase LeiHuo Testing Center
May 27, 2022 · Artificial Intelligence

Multimodal Model for Game Frame Rate Prediction

This article explains how a multimodal deep learning model combines static and temporal game data to predict frame rates, helping identify performance bottlenecks and improve client smoothness through feature fusion, data pipelines, and real‑time inference in modern games.

AI · Feature Engineering · deep learning
7 min read

DataFunTalk
May 22, 2022 · Artificial Intelligence

Advances in Information‑Flow Recommendation: Pre‑trained Models and Multimodal User‑Interface Modeling

This article reviews Huawei Noah's Ark Lab's work on modern information‑flow recommendation, covering the evolution from collaborative filtering to deep learning, the application of BERT‑based pre‑training for news ranking, multimodal user‑interface modeling, practical deployment challenges, and future research directions.

AI · BERT · Huawei
19 min read

NetEase Media Technology Team
Apr 11, 2022 · Artificial Intelligence

Multimodal Video Tagging: Challenges and a Two‑Stage Recall‑Ranking Solution

To tackle the massive, multimodal tagging challenge of short‑video platforms—characterized by a huge long‑tail tag set, sparse annotations, and uneven modality contributions—the authors propose a two‑stage recall‑ranking system that first retrieves candidates via text, visual, audio and classification cues, then refines them with contrastive learning and extensive hard‑negative sampling, achieving 0.884 tag accuracy in a real‑world news video recommender.

Embedding · Recommendation systems · deep learning
12 min read

IEG Growth Platform Technology Team
Feb 14, 2022 · Artificial Intelligence

Multimodal Evolution and Application in Tencent Game Advertising System

This article describes the end‑to‑end multimodal modeling pipeline—covering text, image, and video understanding, model evolution from shallow to deep networks, key‑frame extraction, fine‑tuning, and multimodal fusion—used in Tencent's game ad exchange platform, along with practical deployment challenges and solutions.

CNN · Text Classification · Transformer
22 min read

DataFunTalk
Jan 15, 2022 · Artificial Intelligence

Multimodal + Music: MMatch Series Technologies and Their Applications at Tencent Music

This article presents the multimodal learning demands of QQ Music, introduces the MMatch series of multimodal matching technologies—including image‑text matching, music similarity, AI tagging, and video scoring—and details their practical applications in business scenarios such as merchant public‑play, search, recommendation, and future product ideas.

Artificial Intelligence · Recommendation systems · Tencent Music
25 min read

Amap Tech
Nov 4, 2021 · Artificial Intelligence

POI Signboard Image Retrieval: Technical Solution, Model Design, and Future Directions

To efficiently filter unchanged POI signboards, the authors propose a multimodal image‑retrieval system that combines enhanced global and local visual features with BERT‑encoded OCR text, using metric learning and alignment techniques to achieve over 95% accuracy while handling occlusion, viewpoint variation, and subtle text changes.

Computer Vision · deep learning · image retrieval
17 min read

Kuaishou Tech
Oct 20, 2021 · Artificial Intelligence

HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval

This paper proposes HiT, a hierarchical transformer model with momentum contrast for video-text retrieval, addressing limitations in existing multimodal learning methods by introducing hierarchical cross-modal contrast matching and momentum cross-modal contrast to improve retrieval performance.

Artificial Intelligence · HCM · MCC
9 min read