Tagged articles

396 articles

May 8, 2025 · Artificial Intelligence

Advances and Future of AI Agents: Capabilities, Trends, and Applications

AI agents are advancing quickly in perception, autonomous planning, tool use, and memory, with a breakthrough expected in 2025. The article attributes this to multimodal models, neural‑symbolic reasoning, and embodied intelligence, cites investment forecasts of $27 billion, and points to general‑purpose agents such as Manus and emerging applications in code generation, research, healthcare, and risk analysis.

AI agent · Agent Framework · Autonomous Planning

0 likes · 12 min read

Spring Full-Stack Practical Cases

May 7, 2025 · Artificial Intelligence

Unlock Multimodal AI with Spring AI: Hands‑On Image & ID Recognition Cases

This article introduces Spring AI's multimodal capabilities, explains the Message API for handling text, image, audio, and video inputs, and provides step‑by‑step Spring Boot examples for image analysis, ID card extraction, and structured JSON output of car‑color counts.

Artificial Intelligence · Java · Multimodal

0 likes · 8 min read

AI Algorithm Path

May 2, 2025 · Artificial Intelligence

Qwen3 Launch: Open-Source Models Redefine General AI

The Qwen3 series introduces eight open‑source large language models ranging from 0.6B to 235B parameters, combines dense and Mixture‑of‑Experts architectures, supports multimodal input, offers mixed inference modes, and demonstrates benchmark superiority over leading models such as OpenAI o1 and Gemini 2.5 Pro.

AI agents · Large Language Model · Mixture of Experts

0 likes · 10 min read

Data Thinking Notes

Apr 29, 2025 · Artificial Intelligence

From Transformers to DeepSeek‑R1: How LLMs Evolved to 2025

This article chronicles the evolution of large language models from the 2017 Transformer breakthrough through BERT, GPT series, multimodal models, and recent cost‑efficient innovations like DeepSeek‑R1, highlighting key architectures, training methods, alignment techniques, and their transformative impact on AI applications.

AI alignment · Multimodal · Transformer

0 likes · 29 min read

DevOps

Apr 27, 2025 · Artificial Intelligence

Large Model Technologies: RAG, AI Agents, Multimodal Applications, and Future Trends

This article examines how Retrieval‑Augmented Generation (RAG), AI agents, and multimodal large‑model techniques are reshaping AI‑industry integration, discusses their technical challenges and practical implementations, and outlines future development directions across algorithms, products, and domain‑specific applications.

AI agents · Artificial Intelligence · Large Models

0 likes · 14 min read

Kuaishou Tech

Apr 23, 2025 · Artificial Intelligence

Kuaishou's Accepted Papers at ICLR 2025 and Their Summaries

The article highlights Kuaishou's eleven high‑quality papers accepted at ICLR 2025, covering advances in streaming video understanding, 3D trajectory control, multimodal talking‑face animation, transformer indexing, efficient video generation, industrial recommendation datasets, token gradient conflict in MoE, stable segmentation, multi‑camera video synthesis, large‑scale multimodal instruction tuning, and hallucination detection in retrieval‑augmented generation.

AIResearch · DeepLearning · ICLR2025

0 likes · 20 min read

Liangxu Linux

Apr 22, 2025 · Artificial Intelligence

Top 10 Open-Source OCR Projects on GitHub Ranked by Stars

This article compiles a ranked list of ten popular open-source OCR projects on GitHub, summarizing each tool’s key capabilities—such as multimodal text extraction, PDF linearization, layout analysis, and multilingual support—along with star counts and direct repository links for developers seeking ready-to-use OCR solutions.

GitHub · Multimodal · OCR

0 likes · 9 min read

Swan Home Tech Team

Apr 21, 2025 · Artificial Intelligence

How Front-End Teams Leverage AI: FastGPT Platform, Intelligent Search, and Video Synthesis

This article examines how a front‑end team uses AI innovations—FastGPT visual platform, AI‑powered semantic search, and AI video synthesis—to rebuild business workflows, cut costs, and boost efficiency, highlighting architecture, technical highlights, and practical use cases.

AI · Low‑code platform · Multimodal

0 likes · 7 min read

DataFunTalk

Apr 18, 2025 · Artificial Intelligence

Applying ByteDance’s Doubao‑1.5 Vision Model for Image Counting and Automated Annotation

The article demonstrates how ByteDance’s new Doubao‑1.5 multimodal model can be used to locate and count objects in images—such as sushi plates, street signs, and cartoon hats—by generating coordinates and overlaying visual annotations through a concise Python script.

AI · Doubao · Image Annotation

0 likes · 5 min read

AIWalker

Apr 17, 2025 · Artificial Intelligence

Unveiling DeepSeek’s Janus Series: Decoupled Visual Encoding for Unified Multimodal Understanding and Generation

This article provides an in‑depth analysis of DeepSeek’s Janus and Janus‑Pro models, explaining how decoupling visual encoding resolves the conflict between multimodal understanding and generation, detailing training stages, data scaling, architectural choices, and presenting extensive benchmark results that demonstrate significant performance gains.

DeepSeek · Janus · Model Scaling

0 likes · 23 min read

58UXD

Apr 17, 2025 · Artificial Intelligence

How Zero‑UI and Gemini’s Multimodal AI Are Redefining Human‑Computer Interaction

Zero‑UI, powered by multimodal AI models like Google Gemini, is shifting design from screen‑based interfaces to natural voice, gesture, and environmental interactions, prompting a fundamental redesign of how devices understand user intent across smart homes, cars, and immersive experiences.

AI · Human-Computer Interaction · Multimodal

0 likes · 9 min read

Baidu Tech Salon

Apr 16, 2025 · Artificial Intelligence

Release of the 'Fangsheng' Large Model Benchmark Results (Q1 2025) and Overview of Baidu's Wenxin 4.5 and X1 Models

The China AI Industry Alliance unveiled its Q1 2025 Fangsheng benchmark, showing Baidu’s new multimodal models—Wenxin 4.5 leading basic abilities and Wenxin X1 excelling in reasoning—available for free on the Wenxin Yiyan platform, while Baidu pledges major 2025 investments in AI, data‑center and cloud infrastructure.

AI · FactTesting · Multimodal

0 likes · 4 min read

JD Tech

Apr 15, 2025 · Artificial Intelligence

Reliable Advertising Creative Generation and Personalized Recommendation via Multimodal Feedback and Offline Representation

The article presents a series of technical breakthroughs by JD's advertising team that improve the quality and coverage of AI‑generated ad images through a trustworthy multimodal feedback network, introduce a large human‑annotated image dataset, and enhance creative ranking with offline multimodal representations and online architecture optimizations, ultimately achieving more precise and scalable ad personalization.

AI · AIGC · Advertising

0 likes · 10 min read

58 Tech

Apr 11, 2025 · Artificial Intelligence

Optimization of Multimodal Visual Large Model Inference: Pre‑processing, ViT TensorRT, CUDA Graphs, Tokenization, Prefix Cache, and Quantization

This report details a comprehensive set of optimizations for multimodal visual large‑model (VLM) inference—including image pre‑processing acceleration, TensorRT integration for the ViT module, CUDA‑Graph replay, token‑count reduction, prefix‑cache handling, and weight quantization—demonstrating up to three‑fold throughput gains while maintaining accuracy.

CUDA Graph · Multimodal · TensorRT

0 likes · 19 min read

AntTech

Apr 10, 2025 · Artificial Intelligence

Ant Group Presents Four AI Research Papers at ICLR 2025 Live Showcase

At the ICLR 2025 live session in Singapore, Ant Group showcased four cutting‑edge papers—CodePlan, Animate‑X, Group Position Embedding, and OmniKV—demonstrating advances in large‑language‑model reasoning, universal character animation, layout‑aware document understanding, and efficient long‑context inference.

AI research · Multimodal · Reasoning

0 likes · 6 min read

Baidu Geek Talk

Apr 9, 2025 · Artificial Intelligence

Baidu's Wenxin X1 Large Model Officially Launches on Qianfan Platform

On April 2, Baidu released its Wenxin X1 large model on the Qianfan platform, offering enterprise users and developers a multimodal, deep‑thinking AI with superior math, coding, and reasoning scores, low token‑price API access, batch inference, one‑click distillation, and rapid RAG/Agent application building.

AI · API Service · Baidu

0 likes · 4 min read

AIWalker

Apr 7, 2025 · Artificial Intelligence

Is CLIP Obsolete? LeCun and Xie's New Multimodal Model Beats Language Supervision

A recent study by LeCun, Xie, and collaborators shows that large‑scale visual self‑supervised learning (Web‑SSL) can match or surpass CLIP on diverse VQA tasks, even without any language supervision, by scaling model size and data volume.

CLIP · Model Scaling · Multimodal

0 likes · 13 min read

AI Algorithm Path

Apr 6, 2025 · Artificial Intelligence

Meta’s Open-Source Llama 4: 2‑Trillion‑Parameter Behemoth Redefines AI

Meta’s newly released Llama 4 models, Maverick with about 402 billion total parameters and Scout with about 109 billion, feature a 128‑expert MoE, 10 million‑token context, native multimodal fusion, and FP8 training, delivering benchmark‑leading performance that outpaces GPT‑4o, Gemini 2.0 Flash and DeepSeek v3, while being openly available on Hugging Face and GitHub.

FP8 training · Llama 4 · Meta AI

0 likes · 8 min read

Fighter's World

Apr 5, 2025 · Artificial Intelligence

Is Gemini 2.5 Pro the Turning Point for Google’s AI Strategy?

The article analyses Google’s Gemini 2.5 Pro as a decisive shift toward a “Reasoning Model”, detailing its architectural focus on inference, benchmark breakthroughs such as Humanity’s Last Exam and GPQA Diamond, long‑context capability, multimodal strengths, Vibe‑coding experience, and the roadmap for future Gemini models.

AI strategy · Gemini 2.5 Pro · Multimodal

0 likes · 25 min read

Nightwalker Tech

Apr 1, 2025 · Artificial Intelligence

Evaluation of AutoGLM: Features, Architecture, and Practical Test Results

This article reviews AutoGLM, the first "think‑while‑doing" AI agent released by Zhipu AI, detailing its core capabilities, full‑stack architecture, user experience, identified limitations, and the outcomes of three hands‑on tests using both the client application and a Chrome extension.

AI agent · AutoGLM · Large Language Model

0 likes · 4 min read

AIWalker

Mar 31, 2025 · Artificial Intelligence

VBench-2.0: A Next‑Generation Benchmark for Intrinsic Faithfulness in AI Video Generation

VBench-2.0 expands the original VBench suite with five evaluation dimensions (Human Fidelity, Controllability, Creativity, Physics, and Commonsense) that assess not only the visual quality of generated videos but also their intrinsic faithfulness to physical laws, common sense, and narrative coherence, and it provides open‑source tools, prompts, and human‑aligned metrics for the research community.

AI evaluation · Intrinsic Faithfulness · Multimodal

0 likes · 12 min read

Nightwalker Tech

Mar 28, 2025 · Artificial Intelligence

Comprehensive Evaluation of GPT-4o Multimodal Image Generation Capabilities

This article presents a thorough assessment of GPT‑4o’s new image generation features, detailing multiple test scenarios—from simple portrait creation and style transfer to UI design, product rendering, and educational illustrations—comparing its output with Claude‑3.7‑Sonnet, highlighting strengths in realism and weaknesses in Chinese text handling.

AI evaluation · GPT-4o · Multimodal

0 likes · 16 min read

Meituan Technology Team

Mar 27, 2025 · Artificial Intelligence

Q-Eval-100K Dataset and Q-Eval-Score Evaluation Framework for Text-to-Visual Generation

The Q‑Eval‑100K dataset, comprising 100 k AIGC images and videos with separate visual‑quality and textual‑consistency annotations, powers the open‑source Q‑Eval‑Score framework that fine‑tunes multimodal models to deliver state‑of‑the‑art, scalable, and objective evaluation—including a “vague‑to‑specific” strategy for long prompts—surpassing existing benchmarks.

AIGC · Machine Learning · Multimodal

0 likes · 9 min read

37 Interactive Technology Team

Mar 26, 2025 · Artificial Intelligence

LUI vs GUI: Choosing the Right Interface for AI Product Design

When designing AI products, choosing between a Language User Interface—leveraging speech recognition, NLP, and conversational flexibility—and a Graphical User Interface—relying on visual icons, layouts, and intuitive interaction—depends on technology maturity, response speed, and user learning cost, while emerging multimodal designs increasingly blend both for richer, context‑aware experiences.

AI · GUI · Interaction

0 likes · 11 min read

JD Retail Technology

Mar 25, 2025 · Artificial Intelligence

2024 Advances in Advertising Creative Generation and Selection

In 2024 the advertising team deployed an end‑to‑end AIGC pipeline that automatically creates high‑quality ad images, uses the multimodal Reliable Feedback Network and the million‑size RF1M dataset to filter outputs, builds rich offline and online multimodal representations with contrastive and list‑wise learning, and optimizes ranking architecture to deliver scalable, personalized creative selection.

AI · AIGC · Advertising

0 likes · 10 min read

AI Large Model Application Practice

Mar 24, 2025 · Artificial Intelligence

How to Build a Multimodal RAG Pipeline for PPT Documents with Vision LLMs

This article explains a step‑by‑step implementation of a multimodal Retrieval‑Augmented Generation system that parses PPT/PDF files, extracts rich text and images with vision models, indexes them in a vector store, and generates answers that combine markdown and relevant slide screenshots.

LLM · Multimodal · Python

0 likes · 9 min read

Alibaba Cloud Big Data AI Platform

Mar 21, 2025 · Artificial Intelligence

How to Build Multimodal Image Tagging with RAM and BERT in DataWorks Notebook

This tutorial walks through using DataWorks Notebook with GPU support to combine the open‑vocabulary visual model RAM and the language model BERT for zero‑shot multimodal image tagging, covering environment setup, model installation, dataset preparation, tagging code, and result visualization.

BERT · DataWorks · Multimodal

0 likes · 13 min read

Amap Tech

Mar 19, 2025 · Artificial Intelligence

Driving by the Rules: Integrating Lane-Level Traffic Regulations into Online HD Maps

Gaode Map and Xi'an Jiaotong University introduce the “Driving by the Rules” task, releasing the MapDR benchmark that integrates lane‑level traffic‑sign regulations into online‑constructed HD maps, and provide modular (VLE‑MEE) and end‑to‑end (RuleVLM) baselines to evaluate rule extraction and lane association.

AI · Autonomous Driving · HD maps

0 likes · 8 min read

IT Services Circle

Mar 19, 2025 · Artificial Intelligence

ByteDance’s AI Video Generation Model Goku, Streamer‑Sales Live‑Selling Model, and MimicTalk 3D Talking‑Head Project

ByteDance and partners open‑source three AI projects—Goku for high‑quality text‑to‑video generation, Streamer‑Sales for multimodal live‑selling LLMs, and MimicTalk for rapid 3D talking‑head creation—detailing their core features, underlying transformer‑based architectures, training pipelines, and public repositories.

AI video generation · Multimodal · Transformer

0 likes · 5 min read

JD Tech Talk

Mar 19, 2025 · Artificial Intelligence

Reliable Advertising Image Generation and Creative Selection Using Multimodal Feedback and MLLM Representations

The 2024 advertising team introduced a suite of AI‑driven techniques—including a trustworthy feedback network, a large‑scale human‑annotated dataset, multimodal large language model representations, and online ranking architecture upgrades—to dramatically improve the quality, coverage, and personalization of generated ad creatives.

AIGC · Advertising · MLLM

0 likes · 10 min read

JD Cloud Developers

Mar 19, 2025 · Artificial Intelligence

How AIGC Boosts Ad Creative Quality: Trustworthy Image Generation & Selection

2024 saw the advertising team achieve major breakthroughs in AI-generated ad creatives by introducing a multimodal reliable feedback network to improve image usability, releasing a large human-annotated dataset, and leveraging multimodal large language models for richer representation and more effective online/offline creative selection.

AIGC · Multimodal · ad optimization

0 likes · 10 min read

NewBeeNLP

Mar 18, 2025 · Interview Experience

How to Ace Multimodal Model Interviews at Taobao's Search AI Division

This article recounts a three‑stage interview for a multimodal large‑model position at Taobao's Search AI division, detailing typical questions on CLIP, LoRA, BLIP, Qwen‑VL, Transformer fundamentals, RLHF, and coding challenges, and offers insights on what interviewers focus on.

AI · CLIP · Interview

0 likes · 5 min read

Code Mala Tang

Mar 15, 2025 · Artificial Intelligence

What Makes Google’s New Gemma 3 Model a Game‑Changer for AI Developers?

Google’s Gemma 3, a lightweight open‑source model with up to 27 billion parameters, offers multimodal input, 128K token context, and broad language support, outperforming leading rivals on single‑GPU benchmarks and providing flexible deployment options for developers and researchers alike.

AI model · Gemma 3 · Google AI

0 likes · 9 min read

AIWalker

Mar 7, 2025 · Artificial Intelligence

How GIFNet’s Low‑Level Interaction Breakthrough Enables Universal Multimodal Fusion Across Tasks

The paper introduces GIFNet, a three‑branch network that leverages low‑level visual tasks and a cross‑fusion gating mechanism to achieve a single, task‑agnostic image‑fusion model with dramatically reduced computation, strong generalization to unseen modalities, and even single‑modal enhancement capabilities.

CVPR2025 · GIFNet · Image Fusion

0 likes · 20 min read

DaTaobao Tech

Mar 7, 2025 · Artificial Intelligence

Taobao Content AI: Summary of AIGC Content Generation and Multimodal Model Techniques

Taobao’s AIGC pipeline combines a human‑feedback multimodal reward model, audio‑visual joint pre‑training, and Mixture‑of‑Experts distillation to clean data, align outputs with user preferences, and achieve state‑of‑the‑art multimodal LLM performance that drives content cold‑start and conversion gains in e‑commerce.

AIGC · Multimodal · Reward Model

0 likes · 10 min read

Cognitive Technology Team

Mar 7, 2025 · Artificial Intelligence

From Word Embeddings to Large Language Models: A Comprehensive Overview of AI Model Evolution

This article traces the development of AI models—from early word embeddings like Word2Vec and ELMo, through transformer‑based encoders such as BERT and decoder‑only models like GPT‑1/2/3, to recent multimodal systems and scaling laws—explaining their architectures, training methods, and impact on modern AI applications.

AI · Embedding · Multimodal

0 likes · 22 min read

DaTaobao Tech

Mar 5, 2025 · Artificial Intelligence

Multimodal Large‑Model Cover Generation AI Agent for Taobao Video and Live Streams

Taobao’s new multimodal AI Agent automatically creates high‑quality static and dynamic video covers by planning tasks, consulting a memory of quality criteria, executing frame selection with ReKV streaming and dual‑stage evaluation, generating marketing copy via fine‑tuned Qwen2.5‑7B, and refining layout, resulting in significantly higher click‑through rates, lower latency, and reduced manual effort.

AI · Multimodal · Video processing

0 likes · 17 min read

DaTaobao Tech

Mar 3, 2025 · Artificial Intelligence

How Taobao’s “Faxiang” AI Model Revolutionizes E‑Commerce Video Generation

Taobao’s AIGC video generation platform, built on a large‑scale “Faxiang” model that evolved from UNet to DiT, leverages over 2 billion curated e‑commerce videos, expert alignment, LoRA fine‑tuning, and multi‑control capabilities to deliver diverse, high‑quality product videos that dramatically boost conversion metrics across the marketplace.

AI video generation · AIGC · Large Models

0 likes · 11 min read

JD Retail Technology

Mar 1, 2025 · Industry Insights

How JD Retail’s AI Assistant Uses Multimodal LLMs to Boost E‑Commerce

JD Retail’s AI assistant combines a Master‑Sub agent framework, ReAct paradigm, multimodal integration and MoE architecture to improve sales forecasting, pricing, and recommendation accuracy, while the team’s collaborative culture and open talent pathways illustrate how cutting‑edge AI is applied in real‑world e‑commerce.

AI · JD Retail · LLM

0 likes · 8 min read

AIWalker

Feb 20, 2025 · Artificial Intelligence

Transfusion: A Single Model for Unified Image Generation and Understanding

Transfusion is a 7B‑parameter transformer that jointly trains language modeling and diffusion losses on mixed text‑image data, enabling seamless text generation, image generation, and image understanding within one model and outperforming prior multimodal approaches such as Chameleon across multiple benchmarks.

AI research · Language Modeling · Multimodal

0 likes · 20 min read

Architect

Feb 16, 2025 · Artificial Intelligence

DeepSeek-V3, DeepSeek-R1, and Janus‑Pro: Architecture, Training Techniques, and Performance Insights

This article provides an in‑depth technical overview of DeepSeek‑V3, DeepSeek‑R1 and Janus‑Pro models, covering their Mixture‑of‑Experts architecture, novel MLA attention, auxiliary‑loss‑free load balancing, multi‑token prediction, FP8 mixed‑precision training, efficient cross‑node communication, reinforcement‑learning pipelines, multimodal modeling strategies, performance comparisons, cost statistics, and current limitations.

AI Architecture · DeepSeek-V3 · FP8 training

0 likes · 18 min read

AIWalker

Feb 16, 2025 · Artificial Intelligence

VARGPT: A Unified Autoregressive Architecture for Multimodal Understanding and Generation

VARGPT is a novel multimodal large language model that unifies visual understanding and autoregressive image generation within a single architecture, extending LLaVA with next‑token and next‑scale prediction, trained through three staged data‑curated phases and achieving superior performance on numerous vision‑language benchmarks.

AI research · Large Language Model · Multimodal

0 likes · 20 min read

Architects' Tech Alliance

Feb 16, 2025 · Artificial Intelligence

How DeepSeek’s Distillation Breaks Bottlenecks and Boosts Multimodal AI Performance

This article provides an in‑depth technical analysis of DeepSeek’s model distillation technology, covering its core principles, innovative data‑model fusion strategies, architecture design, training optimizations, performance benchmarks, and the remaining challenges of scaling distillation to multimodal tasks.

DeepSeek · Multimodal · ai-optimization

0 likes · 16 min read

Ops Development & AI Practice

Feb 10, 2025 · Artificial Intelligence

What’s Inside Google Gemini 2.0 Pro? Free Pricing, Multimodal Power & Real‑Time Streaming

The article reviews Google Gemini 2.0 Pro Experimental, detailing its free‑during‑experiment pricing, multimodal understanding, real‑time streaming, native tool integration, usage limits, latency controls, and practical scenarios such as large‑scale code processing and live media handling.

AI · Gemini · Multimodal

0 likes · 5 min read

AIWalker

Feb 8, 2025 · Artificial Intelligence

Introducing Ola: A Full‑Modal Language Model from Tsinghua & Tencent that Unifies Image, Video, and Audio Understanding

The article presents Ola, an open‑source full‑modal LLM that uses progressive modality alignment to jointly process text, images, video, and audio, and demonstrates competitive performance across image, video, and audio benchmarks, surpassing many specialized models.

Large Language Model · Multimodal · Ola

0 likes · 22 min read

AIWalker

Feb 4, 2025 · Artificial Intelligence

Meta’s Open‑Source MILS Enables LLMs to See and Hear Without Training – SOTA on Images, Video, and Audio

The paper introduces MILS, a training‑free multimodal iterative LLM solver that lets large language models perceive and generate across image, video, and audio domains, achieving new state‑of‑the‑art results without any task‑specific data or fine‑tuning.

AI research · LLM · MILS

0 likes · 18 min read

AI Code to Success

Jan 23, 2025 · Industry Insights

Core Tech vs Application Optimization: Where’s the Real Battleground in the AI Large‑Model Race?

The article analyzes the 2025 AI large‑model landscape, contrasting slowing foundational breakthroughs with fierce application competition, highlighting MiniMax’s low‑cost linear‑attention models, multimodal advances, and the strategic shift from price wars to sustainable, technology‑driven growth.

AI · Large Models · Multimodal

0 likes · 7 min read

DataFunSummit

Jan 22, 2025 · Artificial Intelligence

RAG2.0 Engine Design Challenges and Implementation

This article presents a comprehensive overview of the RAG2.0 engine design, covering RAG1.0 limitations, effective chunking methods, accurate retrieval techniques, advanced multimodal processing, hybrid search strategies, database indexing choices, and future directions such as agentic RAG and memory‑enhanced models.

Chunking · Hybrid Search · Multimodal

0 likes · 23 min read

AI Code to Success

Jan 16, 2025 · Industry Insights

How MiniMax’s Open‑Source Linear‑Attention Model Is Shaking Up the Global AI Landscape

MiniMax, a Shanghai‑based AI unicorn, has open‑sourced its MiniMax‑01 series featuring large‑scale linear attention, secured $600 million in funding, launched multimodal products like Talkie and Hailuo AI, and is positioning itself as a competitive force amid rising geopolitical tensions in the global artificial‑intelligence market.

AI · China AI · Industry Insights

0 likes · 4 min read

ZhongAn Tech Team

Jan 12, 2025 · Artificial Intelligence

AI Weekly Digest Issue 10: Market Insights, Industry Solutions, and Notable Technologies

This issue reviews recent AI industry developments, including Kai‑Fu Lee’s clarification of the strategy of 01.AI (Zero‑One), Microsoft’s open‑source Phi‑4 model, the multimodal VITA‑1.5 release, and Hailuo AI’s advanced Chinese voice‑cloning technology, providing technical details and market implications.

AI · Multimodal · voice cloning

0 likes · 10 min read

Infra Learning Club

Jan 2, 2025 · Artificial Intelligence

Three Major LLM Trends in 2025: Ubiquitous Agents, Rising Small Models, and Multimodal Fusion

In 2025, large language models will see three key trends—agents becoming pervasive in daily life and industry, the emergence of efficient small models for edge and specialized tasks, and the integration of multimodal capabilities that combine text, images, and audio to enable more natural human‑machine interaction.

AI trends · LLM · Multimodal

0 likes · 4 min read

Programmer DD

Dec 31, 2024 · Artificial Intelligence

Build an AI‑Powered Expense Tracker with GLM‑4V‑Flash and MaxKB

This article demonstrates how to create an AI‑driven personal expense‑tracking assistant by leveraging Zhipu's GLM‑4V‑Flash multimodal model for receipt OCR, generating SQL statements, and integrating them with MaxKB workflows and a MySQL database, complete with code snippets and deployment steps.

AI · GLM-4V-Flash · MaxKB

0 likes · 13 min read

Baidu Geek Talk

Dec 25, 2024 · Industry Insights

How to Build a Multimodal Web Page Model for the LLM Era

This article examines the unique multimodal and multi‑granular nature of web pages, compares fusion strategies, proposes a cross‑modal attention approach, outlines fine‑ and coarse‑grained pre‑training tasks, and explores low‑cost adaptor methods for adapting large multimodal models to web‑page modeling in the LLM era.

AI · HTML · LLM adaptation

0 likes · 10 min read

DevOps

Dec 23, 2024 · Artificial Intelligence

Understanding AIGC Agents: Definition, Core Features, Underlying Logic, and Commercial Applications

This article explains what AIGC agents are, outlines their four main characteristics, describes the underlying transformer‑based architecture, dual‑stage learning, probabilistic generation and feedback optimization, and explores their current and future commercial use cases across content creation, knowledge bases, customer service, internal operations, and product design.

AIGC · Agent · Artificial Intelligence

0 likes · 14 min read

Tencent Cloud Developer

Dec 5, 2024 · Industry Insights

Why Most RAG Projects Fail and How Tencent’s LeXiang AI Assistant Overcomes Them

The article analyses the rapid growth of Retrieval‑Augmented Generation (RAG) in enterprises, explains why self‑built RAG solutions often collapse under cost and maintenance pressures, and demonstrates how Tencent LeXiang AI Assistant addresses these issues through a robust knowledge‑management core, extensive industry experience, scalable resources, and advanced multimodal capabilities.

AI assistant · Large Language Model · Multimodal

0 likes · 16 min read

21CTO

Dec 4, 2024 · Artificial Intelligence

Introducing Pi-zero: A General‑Purpose AI Foundation Model for Robotics

Physical Intelligence's new Pi-zero model, built on a vision‑language foundation and fine‑tuned with extensive robot data, outperforms prior baselines across multiple tasks, showcasing the promise of large multimodal foundation models for flexible, robust robot control.

AI · Multimodal · Pi-zero

0 likes · 6 min read

Alibaba Cloud Big Data AI Platform

Dec 4, 2024 · Artificial Intelligence

How EasyAnimate V5 Advances AI Video Generation with Multimodal Control

EasyAnimate V5, an Alibaba Cloud AI video generation framework, expands model size to 7B/12B, introduces multimodal control, token‑length based training, and inpaint‑based image‑to‑video strategies, while providing easy deployment via PAI, DSW, and local ComfyUI integration.

AI · LoRA · MMDiT

0 likes · 11 min read

NewBeeNLP

Dec 2, 2024 · Artificial Intelligence

What Are Today’s Unified Generation-and-Understanding Multimodal Model Architectures?

This article surveys current unified generation-and-understanding multimodal large-model architectures, compares LLM-centric and LLM-plus-diffusion designs, extracts common insights, details large-scale training tricks from models like Emu3, Chameleon and Janus, and outlines open research directions for visual encoders.

Multimodal · diffusion · large language models

0 likes · 5 min read

JD Retail Technology

Nov 14, 2024 · Artificial Intelligence

Improving Advertisement Image Generation with a Multimodal Reliable Feedback Network (ECCV 2024)

The paper introduces a Multimodal Reliable Feedback Network (RFNet) and a consistency‑condition regularization technique that together boost the usable rate of automatically generated advertisement images while preserving visual quality, supported by a new million‑image annotated dataset and extensive ECCV‑2024 experiments.

AI · ECCV2024 · Multimodal

0 likes · 8 min read

Bilibili Tech

Nov 8, 2024 · Artificial Intelligence

AI-Powered Game Recognition for League of Legends Live Streaming on Bilibili

Bilibili’s AI‑driven game‑recognition system extracts real‑time LoL events through OCR, hero detection and hot‑spot tagging, generating high‑energy timestamps and interactive overlays that let viewers jump to key moments and view detailed statistics, enhancing spectator engagement and analytical capabilities across major esports tournaments.

AI · Game Recognition · Multimodal

0 likes · 14 min read

Alibaba Cloud Big Data AI Platform

Nov 6, 2024 · Artificial Intelligence

Unlocking Long-Text Video Understanding and LLM Distillation with Alibaba PAI

Alibaba Cloud’s AI platform PAI recently saw two papers accepted at EMNLP2024—VideoCLIP‑XL, which enhances video‑text representation for long descriptions using a large video‑long‑description dataset and novel pre‑training tasks, and TAPIR, a curriculum‑planning framework that distills instruction‑following abilities of large language models—while also releasing associated models, datasets, and integration tools for users.

EMNLP2024 · Multimodal · distillation

0 likes · 8 min read

DataFunSummit

Nov 1, 2024 · Big Data

DataFun Summit Session Overview and E‑book Access Instructions

The article outlines how to obtain the DataFun Summit e‑book by following the public account instructions and provides concise English summaries of twelve technical sessions covering data lineage, integration, AI language models, multimodal content, game AI agents, lake‑warehouse governance, big‑data architecture, and cluster management.

AI · Big Data · DataOps

0 likes · 5 min read

AntTech

Oct 28, 2024 · Artificial Intelligence

Highlights of AI Large‑Model Sessions at CNCC 2024

The CNCC 2024 conference featured a series of expert talks on AI large‑model research, covering paradigm shifts in scientific discovery, knowledge enhancement and governance, data‑infrastructure analytics, vertical‑domain inference, diffusion‑model advances, multimodal model progress, and medical AI applications, illustrating the breadth and impact of large‑model technologies across multiple domains.

AI · Knowledge Governance · Multimodal

0 likes · 9 min read

JD Retail Technology

Oct 15, 2024 · Artificial Intelligence

Large‑Model‑Driven Evolution of E‑commerce Search and Recommendation at JD Retail

The article examines how large language models are reshaping JD Retail's e‑commerce search and recommendation pipelines, detailing industry evolution, technical challenges such as knowledge hallucination, intent understanding, personalization, cost, and safety, and presenting JD's end‑to‑end AIGC architecture, data preprocessing, alignment, evaluation, and next‑generation AI search solutions.

AI · Large Models · Multimodal

0 likes · 36 min read

DataFunTalk

Oct 1, 2024 · Artificial Intelligence

From Early AI to Superintelligence: Challenges and Prospects

The article reviews the evolution of artificial intelligence from early statistical models through deep learning and Transformer architectures, examines current breakthroughs like multimodal models, and discusses the technical, computational, and safety challenges that must be overcome before achieving artificial superintelligence (ASI).

AI · Artificial Intelligence · Multimodal

0 likes · 8 min read

Data Thinking Notes

Sep 26, 2024 · Big Data

How Data Platforms Are Shifting from Cost Efficiency to Value in the AI Era

The talk reviews the evolution of data technologies from early database storage to today’s generative AI-driven era, highlighting how massive data, multimodal processing, and advanced analytics are transforming data systems from cost‑centered infrastructures to value‑focused ecosystems that empower intelligent agents, open data ecosystems, and new application paradigms.

Big Data · Data Platforms · Data Value

0 likes · 19 min read

JD Tech Talk

Sep 23, 2024 · Artificial Intelligence

JD Advertising R&D: AI‑Driven Solutions for Traffic Valuation, Multimodal Understanding, Auction Mechanisms, Generative Recommendation, and Large‑Model Engineering

The JD Advertising R&D team applies cutting‑edge AI techniques—including query intent models, multimodal representation pipelines, reinforcement‑learning‑based auction mechanisms, generative recommendation with quantized product tokens, and large‑model infrastructure—to boost traffic valuation, ad relevance, revenue, and creative generation across the platform.

AI · Advertising · Large Models

0 likes · 19 min read

JD Cloud Developers

Sep 23, 2024 · Artificial Intelligence

How JD’s Advertising Lab Leverages Large‑Scale AI to Transform E‑Commerce Ads

JD's advertising research team combines deep learning, multimodal modeling, reinforcement‑learning auctions, and generative recommendation to boost ad relevance, improve long‑tail product exposure, and overcome large‑model inference challenges in a high‑traffic e‑commerce environment.

Graph Neural Network · Large Models · Multimodal

0 likes · 22 min read

AntData

Sep 9, 2024 · Big Data

From Cost‑Efficiency to Value‑Centric: The Evolution of Data Systems in the Data+AI Era

The article reviews the rapid advances in generative AI and big‑data technologies, traces the historical development of data infrastructure, and argues that modern data systems are shifting from a cost‑efficiency focus to a value‑centric paradigm driven by multimodal and unstructured data, vector search, and machine‑oriented services.

@Data · Artificial Intelligence · Big Data

0 likes · 18 min read

JD Retail Technology

Sep 4, 2024 · Artificial Intelligence

Multimodal Recommendation Algorithms and System Architecture at JD.com

This article presents JD.com’s multimodal recommendation system architecture, covering content understanding, multimodal ranking and recall models, practical deployment pipelines, and future research directions such as large‑model integration and supply‑side generation, all illustrated with detailed diagrams and Q&A.

AI · JD.com · Multimodal

0 likes · 14 min read

AI Large Model Application Practice

Aug 29, 2024 · Artificial Intelligence

8 Essential Indexing Strategies to Boost Enterprise RAG Performance

This article presents eight practical optimization recommendations for the indexing stage of enterprise‑level Retrieval‑Augmented Generation (RAG) applications, covering chunk creation, abbreviation handling, multimodal document processing, semantic enrichment, metadata usage, alternative index types, and embedding model selection.

Chunking · Multimodal · RAG

0 likes · 15 min read

DataFunSummit

Aug 29, 2024 · Artificial Intelligence

Intelligent NPC Practices in Tencent Games: Multi‑Modal LLM Solutions and System Optimizations

This article details Tencent Games' end‑to‑end approach to building intelligent NPCs, covering the opportunities brought by AI, the practical implementation of multimodal LLM‑driven dialogue, knowledge‑augmented retrieval, long‑context handling, safety measures, multimodal expression (voice and facial animation), and system‑level performance optimizations for real‑time deployment.

AI · LLM · Multimodal

0 likes · 18 min read

DataFunSummit

Aug 25, 2024 · Artificial Intelligence

Applying Large AI Models to Financial Data Governance and Innovative Use Cases

This article presents a comprehensive technical overview of how large AI models are reshaping financial data production, governance, multimodal document understanding, lakehouse storage, private‑domain model deployment, data‑centric engineering methods, and multi‑agent intelligent advisory within the finance sector.

AI · Large Models · Multimodal

0 likes · 21 min read

NewBeeNLP

Aug 15, 2024 · Industry Insights

Decoding Xiaohongshu’s Decentralized Recommendation: Sideinfo and Multimodal Fusion

This article analyzes how Xiaohongshu addresses the decentralization challenge in its recommendation system by strengthening side‑information usage, integrating multimodal signals across the full pipeline, and implementing interest exploration and protection mechanisms, while also outlining future research directions such as generative recommendation and large‑model‑driven user profiling.

Multimodal · decentralized-distribution · graph

0 likes · 25 min read

AntTech

Aug 13, 2024 · Artificial Intelligence

Ant Group Contributions to ACL 2024: Summaries of 14 Accepted Papers Across NLP and AI

Held in Bangkok from August 11 to 16, 2024, the ACL conference featured 14 Ant Group papers covering large‑scale information extraction, decomposed LLMs for semantic search, multimodal hallucination detection, long‑context attention mechanisms, concept‑reasoning datasets, knowledge‑graph alignment, and more, highlighting the group's breadth in natural language processing and AI research.

ACL2024 · Multimodal · NLP

0 likes · 20 min read

DataFunSummit

Jul 28, 2024 · Artificial Intelligence

Leveraging Large Language Models for Graph Learning: Opportunities, Current Progress, and Future Directions

This article reviews why large language models can be applied to graph learning, outlines their capabilities and graph data characteristics, surveys current research across different graph types and LLM roles, and proposes future research directions for unified cross‑domain graph learning.

AI · Multimodal · Research Directions

0 likes · 19 min read

Tencent Cloud Developer

Jul 18, 2024 · Artificial Intelligence

Exploring Large Language Models (LLM): Fundamentals, Applications, and Future Directions

This article surveys large language models: their core concepts, evolution through Transformers, GPT and BERT, generation challenges, diverse applications such as QA, multimodal creation, summarization and retrieval‑augmented generation, prompt‑engineering frameworks and tools, LangChain‑based pipelines, AI‑driven agents, and future prospects toward domain‑specific use, multimodality, and AGI.

AI · Agent · LLM

0 likes · 35 min read

Architects' Tech Alliance

Jul 10, 2024 · Industry Insights

Why AI Large Models Are Driving the Next Industrial Revolution

The article analyzes the rapid evolution of AI large models—from their role in advancing AGI through massive pre‑training and fine‑tuning, to current market dynamics led by GPT and domestic Chinese players, and finally to future multimodal applications, content‑factory capabilities, and emerging AIGC revenue models projected to reach trillion‑yuan scales by 2030.

AI · AIGC · GPT

0 likes · 7 min read

Baobao Algorithm Notes

Jul 8, 2024 · Industry Insights

Why Large‑Model Deployment Stalls: Robots, Scaling Laws, and Multimodal Frontiers

The article analyzes current challenges in deploying large AI models, covering robot automation, scaling‑law limits, vertical‑domain use cases, multimodal breakthroughs, algorithmic evolution, and the hardware‑software trade‑offs of training and inference infrastructures, while questioning ROI and practical feasibility.

Large Models · Multimodal · algorithm evolution

0 likes · 21 min read

Baobao Algorithm Notes

Jul 4, 2024 · Artificial Intelligence

Vitron: How a Pixel‑Level Multimodal LLM Bridges Vision and Language

Vitron is a unified pixel‑level visual multimodal large language model that integrates image, video, and region encoders with a text‑centric strategy, delivering precise pixel‑wise perception and a comprehensive suite of vision tasks from understanding to generation and editing.

AI · LLM · Multimodal

0 likes · 12 min read

AI Large Model Application Practice

Jul 4, 2024 · Artificial Intelligence

Mastering Multimodal RAG: From PDF Parsing to Advanced Query Rewriting

This article explains how to handle complex multimodal PDFs in RAG systems, outlines extraction, indexing, and multimodal model integration, details four query‑rewriting strategies (HyDE, stepwise, sub‑question, backward), and presents key evaluation metrics and tools for assessing RAG performance.

Document Parsing · Multimodal · Query Rewriting

0 likes · 12 min read

360 Tech Engineering

Jul 3, 2024 · Artificial Intelligence

360LayoutAnalysis: Open‑Source Lightweight Document Layout Analysis Models for Multiple Scenarios

The 360LayoutAnalysis project from 360 AI Lab releases lightweight, YOLOv8‑based layout analysis models covering Chinese and English papers, Chinese research reports, and a general document scenario, providing fast inference, paragraph‑level detection, and open‑source code and weights for flexible document‑understanding pipelines.

AI model · Layout Analysis · Multimodal

0 likes · 9 min read

JD Tech

Jun 28, 2024 · Artificial Intelligence

An Overview of Large Language Models: History, Fundamentals, Prompt Engineering, Retrieval‑Augmented Generation, Agents, and Multimodal AI

This article provides a comprehensive introduction to large language models, covering their historical development, core architecture, training process, prompt engineering techniques, Retrieval‑Augmented Generation, agent frameworks, multimodal capabilities, safety challenges, and future research directions.

AI agents · AI safety · Multimodal

0 likes · 22 min read

AntTech

Jun 18, 2024 · Artificial Intelligence

Ant Group’s 24 Papers Featured at CVPR2024: Topics and Abstracts

The IEEE CVPR2024 conference in Seattle accepted 2,719 papers out of 11,532 submissions, and Ant Group contributed 24 papers covering computer vision, deep learning, digital humans, large models, multimodal remote sensing, vision‑language distillation, federated incremental learning, model‑stealing defense, and more, one of which was selected as a highlight paper.

Ant Group · CVPR2024 · Multimodal

0 likes · 17 min read

NewBeeNLP

Jun 18, 2024 · Artificial Intelligence

How Shopee Builds an E‑Commerce Knowledge Graph and Leverages Large Models

This article presents Shopee's comprehensive approach to constructing an e‑commerce knowledge graph, detailing the challenges of heterogeneous data, multi‑language handling, entity disambiguation, and the integration of deep learning and large language models to improve product matching, recommendation, and operational efficiency.

AI · Large Language Model · Multimodal

0 likes · 22 min read

DataFunTalk

Jun 14, 2024 · Artificial Intelligence

Shopee's E‑commerce Knowledge Graph Construction and Integration with Large Models

This article presents Shopee's comprehensive exploration of building an e‑commerce knowledge graph, detailing its challenges, construction pipeline, AI‑driven extraction and fusion techniques, multilingual and multimodal modeling, and practical applications ranging from search and recommendation to AI assistants and real‑time updates.

AI applications · Multimodal · e-commerce

0 likes · 21 min read

Alibaba Cloud Developer

Jun 13, 2024 · Artificial Intelligence

Creating a Full AI‑Generated Music Video with Large‑Model Agents

This article documents the end‑to‑end workflow of using large multimodal models and specialized agents to automatically generate a storyboard, compose original music and lyrics, produce keyframes, and assemble a complete music video, while highlighting the remaining manual steps and future automation possibilities.

AI · Multimodal · Music

0 likes · 10 min read

Baobao Algorithm Notes

Jun 5, 2024 · Artificial Intelligence

Is GLM‑4‑9B the New Powerhouse? A Deep Dive into Its Performance and Usage

This article reviews the open‑source 9‑billion‑parameter GLM‑4‑9B model, covering installation, quick‑start inference code, quirky Chinese riddles that highlight its strengths over GPT‑4, extensive benchmark tables for dialogue, multilingual, tool‑calling and multimodal tasks, and its broader impact on the Chinese AI ecosystem.

AI · GLM-4-9B · Multimodal

0 likes · 14 min read

DataFunSummit

Jun 4, 2024 · Artificial Intelligence

Multimodal and Graph Neural Network Techniques for eBay Recommendation Systems

This article details eBay's practical experience integrating multimodal data and graph neural networks into its recommendation pipeline, covering pain‑point analysis, a two‑tower multimodal embedding model with triplet loss and TransH, engineering design, experimental results, and key takeaways for future AI‑driven product development.

Embedding · GNN · Graph Neural Network

0 likes · 19 min read

Alimama Tech

May 29, 2024 · Artificial Intelligence

Mixture of Multi‑Modal Experts for Advertising Recall

The Mixed‑Modal Expert Model combines ID features with image and text embeddings through optimized representations and conditional output fusion, dramatically improving advertising recall—especially for long‑tail items—and delivering measurable gains in click‑recall, revenue, CTR, and page views in large‑scale online tests.

Machine Learning · Multimodal · model

0 likes · 15 min read

NewBeeNLP

May 28, 2024 · Artificial Intelligence

How Generative Models Are Redefining Recommendation Systems

This article reviews recent advances in generative recommendation, highlighting challenges such as item representation and multimodal fusion, and summarizing four key research papers that propose novel tokenization, collaborative integration, and transformer-based multimodal approaches to improve recommendation performance.

AI research · Generative Recommendation · LLM

0 likes · 8 min read

DataFunTalk

May 20, 2024 · Artificial Intelligence

Deploying OPPO Multi‑Modal Pretrained Models in Edge‑Cloud Scenarios: Techniques and Optimizations

This article presents OPPO's practical research on deploying multi‑modal pre‑training models across mobile devices and cloud, covering edge image‑text retrieval, text‑image generation and understanding optimizations, and lightweight diffusion model techniques, with detailed algorithmic improvements, performance results, and real‑world application cases.

AIGC · Edge AI · Model Compression

0 likes · 18 min read

21CTO

May 18, 2024 · Artificial Intelligence

What Makes GPT‑4o Faster, Smarter, and More Multimodal Than GPT‑4?

This article examines OpenAI's GPT‑4o, outlining its key performance, speed, accuracy, latency, multimodal, and resource‑efficiency improvements over GPT‑4, and explains why these enhancements broaden the model's applicability across various AI‑driven applications.

AI model · GPT-4o · Multimodal

0 likes · 6 min read

360 Tech Engineering

May 17, 2024 · Artificial Intelligence

360VL: An Open‑Source Multimodal Large Language Model Based on Llama‑3‑70B

The article introduces 360VL, an open‑source multimodal large language model built on Llama‑3‑70B, describes its novel C‑abs bridge architecture for high‑resolution visual understanding, outlines the two‑stage training with bilingual data, and presents benchmark results showing superior performance over prior LMMs.

AI research · Large Language Model · Llama3

0 likes · 8 min read

CSS Magic

May 14, 2024 · Artificial Intelligence

First Look at GPT-4o: Hands‑On Experience, FAQs, and New Free‑User Benefits

The article provides a hands‑on review of OpenAI's newly released GPT‑4o model, covering its multimodal capabilities, real‑time voice demo, desktop client rollout, access options for paid and free users, practical usage tips, and early observations on API performance and limitations.

AI model · API · ChatGPT

0 likes · 9 min read

DataFunSummit

Apr 24, 2024 · Artificial Intelligence

Multimodal Content Understanding in Baidu Commercial Systems: The ViCAN Model and Its Applications

This article presents Baidu's exploration of multimodal content understanding for commercial advertising, detailing the ViCAN pre‑training model, its contrastive and masked‑language‑modeling tasks, integration across recall, ranking and risk‑control pipelines, quantization with MMDict, and future AIGC‑driven generation, all backed by extensive experiments and Q&A.

AI · AIGC · Advertising

0 likes · 27 min read

21CTO

Apr 20, 2024 · Artificial Intelligence

What Developers Need to Know About Meta’s New Open‑Source Llama 3 Model

Meta’s newly open‑sourced Llama 3 model pushes the frontier of large language models with a larger context window, Mixture‑of‑Experts architecture, multilingual support, and multimodal capabilities, while facing challenges in transparency, bias, and computational resources, and offering diverse applications from NLU to code generation.

AI · Large Language Model · Llama3

0 likes · 10 min read

Architects' Tech Alliance

Apr 7, 2024 · Artificial Intelligence

How Sora Is Redefining Text‑to‑Video Generation: Inside the New AI Model

Sora, the newly announced text‑to‑video large model, can generate one‑minute high‑fidelity videos from textual prompts or static images, handling complex scenes, expressive characters, and sophisticated camera motions while also supporting video extension and frame‑filling, positioning it at the forefront of multimodal AI research.

AI model · Multimodal · Sora

0 likes · 6 min read

DataFunSummit

Mar 27, 2024 · Artificial Intelligence

Generative Multimodal Pretraining (OFA) and Representational Multimodal Pretraining (ONE-PEACE): Research Overview and Findings

This article reviews Tongyi Lab's work on the OFA framework for generative multimodal pretraining and the ONE-PEACE model for unified multimodal representation learning, detailing their architectures, training strategies, experimental results across vision‑language and audio tasks, and future research directions.

Multimodal · OFA · ONE-PEACE

0 likes · 15 min read

Alibaba Cloud Big Data AI Platform

Mar 18, 2024 · Artificial Intelligence

How MuLTI Achieves Memory‑Efficient Video‑Language Understanding with Text‑Guided MultiWay Sampling

The paper presents MuLTI, a multimodal video‑language model that tackles the memory and efficiency challenges of long video‑text sequences by introducing a Text‑Guided MultiWay Sampler and a Multiple Choice Modeling pre‑training task, achieving state‑of‑the‑art results on video QA and retrieval while drastically reducing GPU memory consumption.

Multimodal · efficient-ai · feature fusion

0 likes · 19 min read
