Tagged articles
396 articles
Tencent Cloud Developer
May 8, 2025 · Artificial Intelligence

Advances and Future of AI Agents: Capabilities, Trends, and Applications

AI agents are advancing rapidly toward a 2025 breakthrough in perception, autonomous planning, tool use, and memory, driven by multimodal models, neural‑symbolic reasoning, and embodied intelligence and backed by investment forecasts of $27 billion. General‑purpose agents such as Manus exemplify the trend, alongside emerging applications in code generation, research, healthcare, and risk analysis.

AI agent · Agent Framework · Autonomous Planning
12 min read
AI Algorithm Path
May 2, 2025 · Artificial Intelligence

Qwen3 Launch: Open-Source Models Redefine General AI

The Qwen3 series introduces eight open‑source large language models ranging from 0.6B to 235B parameters. It combines dense and Mixture‑of‑Experts architectures, supports multimodal input, offers hybrid thinking and non‑thinking inference modes, and reports benchmark results competitive with leading models such as OpenAI o1 and Gemini 2.5 Pro.

AI agents · Large Language Model · Mixture of Experts
10 min read
Data Thinking Notes
Apr 29, 2025 · Artificial Intelligence

From Transformers to DeepSeek‑R1: How LLMs Evolved to 2025

This article chronicles the evolution of large language models from the 2017 Transformer breakthrough through BERT, the GPT series, and multimodal models to recent cost‑efficient innovations such as DeepSeek‑R1, highlighting key architectures, training methods, alignment techniques, and their transformative impact on AI applications.

AI alignment · Multimodal · Transformer
29 min read
DevOps
Apr 27, 2025 · Artificial Intelligence

Large Model Technologies: RAG, AI Agents, Multimodal Applications, and Future Trends

This article examines how Retrieval‑Augmented Generation (RAG), AI agents, and multimodal large‑model techniques are reshaping AI‑industry integration, discusses their technical challenges and practical implementations, and outlines future development directions across algorithms, products, and domain‑specific applications.

AI agents · Artificial Intelligence · Large Models
14 min read
Kuaishou Tech
Apr 23, 2025 · Artificial Intelligence

Kuaishou's Accepted Papers at ICLR 2025 and Their Summaries

The article highlights Kuaishou's eleven high‑quality papers accepted at ICLR 2025, covering advances in streaming video understanding, 3D trajectory control, multimodal talking‑face animation, transformer indexing, efficient video generation, industrial recommendation datasets, token gradient conflict in MoE, stable segmentation, multi‑camera video synthesis, large‑scale multimodal instruction tuning, and hallucination detection in retrieval‑augmented generation.

AIResearch · DeepLearning · ICLR2025
20 min read
Liangxu Linux
Apr 22, 2025 · Artificial Intelligence

Top 10 Open-Source OCR Projects on GitHub Ranked by Stars

This article compiles a ranked list of ten popular open-source OCR projects on GitHub, summarizing each tool’s key capabilities—such as multimodal text extraction, PDF linearization, layout analysis, and multilingual support—along with star counts and direct repository links for developers seeking ready-to-use OCR solutions.

GitHub · Multimodal · OCR
9 min read
AIWalker
Apr 17, 2025 · Artificial Intelligence

Unveiling DeepSeek’s Janus Series: Decoupled Visual Encoding for Unified Multimodal Understanding and Generation

This article provides an in‑depth analysis of DeepSeek’s Janus and Janus‑Pro models, explaining how decoupling visual encoding resolves the conflict between multimodal understanding and generation, detailing training stages, data scaling, architectural choices, and presenting extensive benchmark results that demonstrate significant performance gains.

DeepSeek · Janus · Model Scaling
23 min read
58UXD
Apr 17, 2025 · Artificial Intelligence

How Zero‑UI and Gemini’s Multimodal AI Are Redefining Human‑Computer Interaction

Zero‑UI, powered by multimodal AI models like Google Gemini, is shifting design from screen‑based interfaces to natural voice, gesture, and environmental interactions, prompting a fundamental redesign of how devices understand user intent across smart homes, cars, and immersive experiences.

AI · Human-Computer Interaction · Multimodal
9 min read
Baidu Tech Salon
Apr 16, 2025 · Artificial Intelligence

Release of the 'Fangsheng' Large Model Benchmark Results (Q1 2025) and Overview of Baidu's Wenxin 4.5 and X1 Models

The China AI Industry Alliance unveiled its Q1 2025 Fangsheng benchmark results, showing Baidu’s new multimodal models performing strongly: Wenxin 4.5 leads in basic abilities and Wenxin X1 excels in reasoning. Both are available for free on the Wenxin Yiyan platform, and Baidu pledges major 2025 investments in AI, data‑center, and cloud infrastructure.

AI · FactTesting · Multimodal
4 min read
JD Tech
Apr 15, 2025 · Artificial Intelligence

Reliable Advertising Creative Generation and Personalized Recommendation via Multimodal Feedback and Offline Representation

The article presents a series of technical breakthroughs by JD's advertising team that improve the quality and coverage of AI‑generated ad images through a trustworthy multimodal feedback network, introduce a large human‑annotated image dataset, and enhance creative ranking with offline multimodal representations and online architecture optimizations, ultimately achieving more precise and scalable ad personalization.

AI · AIGC · Advertising
10 min read
58 Tech
Apr 11, 2025 · Artificial Intelligence

Optimization of Multimodal Visual Large Model Inference: Pre‑processing, ViT TensorRT, CUDA Graphs, Tokenization, Prefix Cache, and Quantization

This report details a comprehensive set of optimizations for multimodal visual large‑model (VLM) inference—including image pre‑processing acceleration, TensorRT integration for the ViT module, CUDA‑Graph replay, token‑count reduction, prefix‑cache handling, and weight quantization—demonstrating up to three‑fold throughput gains while maintaining accuracy.

CUDA Graph · Multimodal · TensorRT
19 min read
AntTech
Apr 10, 2025 · Artificial Intelligence

Ant Group Presents Four AI Research Papers at ICLR 2025 Live Showcase

At the ICLR 2025 live session in Singapore, Ant Group showcased four cutting‑edge papers—CodePlan, Animate‑X, Group Position Embedding, and OmniKV—demonstrating advances in large‑language‑model reasoning, universal character animation, layout‑aware document understanding, and efficient long‑context inference.

AI research · Multimodal · Reasoning
6 min read
Baidu Geek Talk
Apr 9, 2025 · Artificial Intelligence

Baidu's Wenxin X1 Large Model Officially Launches on Qianfan Platform

On April 2, Baidu released its Wenxin X1 large model on the Qianfan platform, offering enterprise users and developers a multimodal deep‑thinking model with strong math, coding, and reasoning scores, low per‑token API pricing, batch inference, one‑click distillation, and rapid RAG and agent application building.

AI · API Service · Baidu
4 min read
AI Algorithm Path
Apr 6, 2025 · Artificial Intelligence

Meta’s Open-Source Llama 4: 2‑Trillion‑Parameter Behemoth Redefines AI

Meta’s newly released Llama 4 models are Maverick, with about 400 billion total parameters spread across 128 experts, and Scout, with 109 billion total parameters and a 10 million‑token context window. Both use a Mixture‑of‑Experts design with native multimodal fusion and FP8 training. Meta reports that they outperform GPT‑4o and Gemini 2.0 Flash and rival DeepSeek v3 on key benchmarks, and they are openly available on Hugging Face and GitHub.

FP8 training · Llama 4 · Meta AI
8 min read
Fighter's World
Apr 5, 2025 · Artificial Intelligence

Is Gemini 2.5 Pro the Turning Point for Google’s AI Strategy?

The article analyses Google’s Gemini 2.5 Pro as a decisive shift toward a “Reasoning Model”, detailing its architectural focus on inference, benchmark breakthroughs such as Humanity’s Last Exam and GPQA Diamond, long‑context capability, multimodal strengths, Vibe‑coding experience, and the roadmap for future Gemini models.

AI strategy · Gemini 2.5 Pro · Multimodal
25 min read
Nightwalker Tech
Apr 1, 2025 · Artificial Intelligence

Evaluation of AutoGLM: Features, Architecture, and Practical Test Results

This article reviews AutoGLM, the first "think‑while‑doing" AI agent released by Zhipu AI, detailing its core capabilities, full‑stack architecture, user experience, identified limitations, and the outcomes of three hands‑on tests using both the client application and a Chrome extension.

AI agent · AutoGLM · Large Language Model
4 min read
AIWalker
Mar 31, 2025 · Artificial Intelligence

VBench-2.0: A Next‑Generation Benchmark for Intrinsic Faithfulness in AI Video Generation

VBench-2.0 extends the original VBench suite with five broad dimensions (Human Fidelity, Controllability, Creativity, Physics, and Commonsense), each broken down into fine‑grained capabilities, to evaluate not only the visual quality of generated videos but also their intrinsic faithfulness to physical laws, common sense, and narrative coherence. It provides open‑source tools, prompts, and human‑aligned metrics for the research community.

AI evaluation · Intrinsic Faithfulness · Multimodal
12 min read
Nightwalker Tech
Mar 28, 2025 · Artificial Intelligence

Comprehensive Evaluation of GPT-4o Multimodal Image Generation Capabilities

This article presents a thorough assessment of GPT‑4o’s new image generation features across multiple test scenarios, from simple portrait creation and style transfer to UI design, product rendering, and educational illustrations. It compares the output with Claude‑3.7‑Sonnet and finds strengths in realism but weaknesses in rendering Chinese text.

AI evaluation · GPT-4o · Multimodal
16 min read
Meituan Technology Team
Mar 27, 2025 · Artificial Intelligence

Q-Eval-100K Dataset and Q-Eval-Score Evaluation Framework for Text-to-Visual Generation

The Q‑Eval‑100K dataset, comprising 100K AIGC images and videos with separate visual‑quality and text‑consistency annotations, powers the open‑source Q‑Eval‑Score framework. Q‑Eval‑Score fine‑tunes multimodal models to deliver state‑of‑the‑art, scalable, and objective evaluation, including a “vague‑to‑specific” strategy for long prompts, and outperforms existing evaluation approaches.

AIGC · Machine Learning · Multimodal
9 min read
37 Interactive Technology Team
Mar 26, 2025 · Artificial Intelligence

LUI vs GUI: Choosing the Right Interface for AI Product Design

When designing AI products, choosing between a Language User Interface—leveraging speech recognition, NLP, and conversational flexibility—and a Graphical User Interface—relying on visual icons, layouts, and intuitive interaction—depends on technology maturity, response speed, and user learning cost, while emerging multimodal designs increasingly blend both for richer, context‑aware experiences.

AI · GUI · Interaction
11 min read
JD Retail Technology
Mar 25, 2025 · Artificial Intelligence

2024 Advances in Advertising Creative Generation and Selection

In 2024 the advertising team deployed an end‑to‑end AIGC pipeline that automatically creates high‑quality ad images. The pipeline uses the multimodal Reliable Feedback Network and the million‑scale RF1M dataset to filter outputs, builds rich offline and online multimodal representations with contrastive and list‑wise learning, and optimizes the ranking architecture to deliver scalable, personalized creative selection.

AI · AIGC · Advertising
10 min read
Amap Tech
Mar 19, 2025 · Artificial Intelligence

Driving by the Rules: Integrating Lane-Level Traffic Regulations into Online HD Maps

Gaode Map and Xi'an Jiaotong University introduce the “Driving by the Rules” task, releasing the MapDR benchmark that integrates lane‑level traffic‑sign regulations into online‑constructed HD maps, and provide modular (VLE‑MEE) and end‑to‑end (RuleVLM) baselines to evaluate rule extraction and lane association.

AI · Autonomous Driving · HD maps
8 min read
IT Services Circle
Mar 19, 2025 · Artificial Intelligence

ByteDance’s AI Video Generation Model Goku, Streamer‑Sales Live‑Selling Model, and MimicTalk 3D Talking‑Head Project

ByteDance and partners open‑source three AI projects—Goku for high‑quality text‑to‑video generation, Streamer‑Sales for multimodal live‑selling LLMs, and MimicTalk for rapid 3D talking‑head creation—detailing their core features, underlying transformer‑based architectures, training pipelines, and public repositories.

AI video generation · Multimodal · Transformer
5 min read
JD Tech Talk
Mar 19, 2025 · Artificial Intelligence

Reliable Advertising Image Generation and Creative Selection Using Multimodal Feedback and MLLM Representations

The 2024 advertising team introduced a suite of AI‑driven techniques—including a trustworthy feedback network, a large‑scale human‑annotated dataset, multimodal large language model representations, and online ranking architecture upgrades—to dramatically improve the quality, coverage, and personalization of generated ad creatives.

AIGC · Advertising · MLLM
10 min read
JD Cloud Developers
Mar 19, 2025 · Artificial Intelligence

How AIGC Boosts Ad Creative Quality: Trustworthy Image Generation & Selection

2024 saw the advertising team achieve major breakthroughs in AI-generated ad creatives by introducing a multimodal reliable feedback network to improve image usability, releasing a large human-annotated dataset, and leveraging multimodal large language models for richer representation and more effective online/offline creative selection.

AIGC · Multimodal · ad optimization
10 min read
NewBeeNLP
Mar 18, 2025 · Interview Experience

How to Ace Multimodal Model Interviews at Taobao's Search AI Division

This article recounts a three‑stage interview for a multimodal large‑model position at Taobao's Search AI division, detailing typical questions on CLIP, LoRA, BLIP, Qwen‑VL, Transformer fundamentals, RLHF, and coding challenges, and offers insights on what interviewers focus on.

AI · CLIP · Interview
5 min read
Code Mala Tang
Mar 15, 2025 · Artificial Intelligence

What Makes Google’s New Gemma 3 Model a Game‑Changer for AI Developers?

Google’s Gemma 3, a lightweight open‑source model family with up to 27 billion parameters, offers multimodal input, a 128K‑token context window, and pretrained support for over 140 languages. It outperforms leading rivals among models that run on a single GPU and provides flexible deployment options for developers and researchers alike.

AI model · Gemma 3 · Google AI
9 min read
AIWalker
Mar 7, 2025 · Artificial Intelligence

How GIFNet’s Low‑Level Interaction Breakthrough Enables Universal Multimodal Fusion Across Tasks

The paper introduces GIFNet, a three‑branch network that leverages low‑level visual tasks and a cross‑fusion gating mechanism to achieve a single, task‑agnostic image‑fusion model with dramatically reduced computation, strong generalization to unseen modalities, and even single‑modal enhancement capabilities.

CVPR2025 · GIFNet · Image Fusion
20 min read
DaTaobao Tech
Mar 7, 2025 · Artificial Intelligence

Taobao Content AI: Summary of AIGC Content Generation and Multimodal Model Techniques

Taobao’s AIGC pipeline combines a human‑feedback multimodal reward model, audio‑visual joint pre‑training, and Mixture‑of‑Experts distillation to clean data, align outputs with user preferences, and achieve state‑of‑the‑art multimodal LLM performance that drives content cold‑start and conversion gains in e‑commerce.

AIGC · Multimodal · Reward Model
10 min read
Cognitive Technology Team
Mar 7, 2025 · Artificial Intelligence

From Word Embeddings to Large Language Models: A Comprehensive Overview of AI Model Evolution

This article traces the development of AI models—from early word embeddings like Word2Vec and ELMo, through transformer‑based encoders such as BERT and decoder‑only models like GPT‑1/2/3, to recent multimodal systems and scaling laws—explaining their architectures, training methods, and impact on modern AI applications.

AI · Embedding · Multimodal
22 min read
DaTaobao Tech
Mar 5, 2025 · Artificial Intelligence

Multimodal Large‑Model Cover Generation AI Agent for Taobao Video and Live Streams

Taobao’s new multimodal AI Agent automatically creates high‑quality static and dynamic video covers by planning tasks, consulting a memory of quality criteria, executing frame selection with ReKV streaming and dual‑stage evaluation, generating marketing copy via fine‑tuned Qwen2.5‑7B, and refining layout, resulting in significantly higher click‑through rates, lower latency, and reduced manual effort.

AI · Multimodal · Video processing
17 min read
DaTaobao Tech
Mar 3, 2025 · Artificial Intelligence

How Taobao’s “Faxiang” AI Model Revolutionizes E‑Commerce Video Generation

Taobao’s AIGC video generation platform, built on the large‑scale “Faxiang” model that evolved from a UNet to a DiT architecture, leverages over 2 billion curated e‑commerce videos, expert alignment, LoRA fine‑tuning, and multi‑control capabilities to deliver diverse, high‑quality product videos that significantly boost conversion metrics across the marketplace.

AI video generation · AIGC · Large Models
11 min read
JD Retail Technology
Mar 1, 2025 · Industry Insights

How JD Retail’s AI Assistant Uses Multimodal LLMs to Boost E‑Commerce

JD Retail’s AI assistant combines a Master‑Sub agent framework, ReAct paradigm, multimodal integration and MoE architecture to improve sales forecasting, pricing, and recommendation accuracy, while the team’s collaborative culture and open talent pathways illustrate how cutting‑edge AI is applied in real‑world e‑commerce.

AI · JD Retail · LLM
8 min read
AIWalker
Feb 20, 2025 · Artificial Intelligence

Transfusion: A Single Model for Unified Image Generation and Understanding

Transfusion is a 7B‑parameter transformer that jointly trains language modeling and diffusion losses on mixed text‑image data, enabling seamless text generation, image generation, and image understanding within one model and outperforming prior multimodal approaches such as Chameleon across multiple benchmarks.

AI research · Language Modeling · Multimodal
20 min read
Architect
Feb 16, 2025 · Artificial Intelligence

DeepSeek-V3, DeepSeek-R1, and Janus‑Pro: Architecture, Training Techniques, and Performance Insights

This article provides an in‑depth technical overview of the DeepSeek‑V3, DeepSeek‑R1, and Janus‑Pro models, covering their Mixture‑of‑Experts architecture, Multi‑head Latent Attention (MLA), auxiliary‑loss‑free load balancing, multi‑token prediction, FP8 mixed‑precision training, efficient cross‑node communication, reinforcement‑learning pipelines, multimodal modeling strategies, performance comparisons, cost statistics, and current limitations.

AI Architecture · DeepSeek-V3 · FP8 training
18 min read
AIWalker
Feb 16, 2025 · Artificial Intelligence

VARGPT: A Unified Autoregressive Architecture for Multimodal Understanding and Generation

VARGPT is a novel multimodal large language model that unifies visual understanding and autoregressive image generation within a single architecture. It extends LLaVA with next‑token prediction for understanding and next‑scale prediction for generation, is trained in three stages on curated data, and achieves superior performance on numerous vision‑language benchmarks.

AI research · Large Language Model · Multimodal
20 min read
Architects' Tech Alliance
Feb 16, 2025 · Artificial Intelligence

How DeepSeek’s Distillation Breaks Bottlenecks and Boosts Multimodal AI Performance

This article provides an in‑depth technical analysis of DeepSeek’s model distillation technology, covering its core principles, innovative data‑model fusion strategies, architecture design, training optimizations, performance benchmarks, and the remaining challenges of scaling distillation to multimodal tasks.

DeepSeek · Multimodal · ai-optimization
16 min read
AI Code to Success
Jan 23, 2025 · Industry Insights

Core Tech vs Application Optimization: Where’s the Real Battleground in the AI Large‑Model Race?

The article analyzes the 2025 AI large‑model landscape, contrasting slowing foundational breakthroughs with fierce application competition, highlighting MiniMax’s low‑cost linear‑attention models, multimodal advances, and the strategic shift from price wars to sustainable, technology‑driven growth.

AI · Large Models · Multimodal
7 min read
DataFunSummit
Jan 22, 2025 · Artificial Intelligence

RAG2.0 Engine Design Challenges and Implementation

This article presents a comprehensive overview of the RAG2.0 engine design, covering RAG1.0 limitations, effective chunking methods, accurate retrieval techniques, advanced multimodal processing, hybrid search strategies, database indexing choices, and future directions such as agentic RAG and memory‑enhanced models.

Chunking · Hybrid Search · Multimodal
23 min read
AI Code to Success
Jan 16, 2025 · Industry Insights

How MiniMax’s Open‑Source Linear‑Attention Model Is Shaking Up the Global AI Landscape

MiniMax, a Shanghai‑based AI unicorn, has open‑sourced its MiniMax‑01 series featuring large‑scale linear attention, secured $600 million in funding, launched multimodal products like Talkie and Hailuo AI, and is positioning itself as a competitive force amid rising geopolitical tensions in the global artificial‑intelligence market.

AI · China AI · Industry Insights
4 min read
ZhongAn Tech Team
Jan 12, 2025 · Artificial Intelligence

AI Weekly Digest Issue 10: Market Insights, Industry Solutions, and Notable Technologies

This issue reviews recent AI industry developments, including Lee Kai‑fu’s clarification of 01.AI’s strategy, Microsoft’s open‑source Phi‑4 model, the multimodal VITA‑1.5 release, and Hailuo AI’s advanced Chinese voice‑cloning technology, providing technical details and market implications.

AI · Multimodal · voice cloning
10 min read
Infra Learning Club
Jan 2, 2025 · Artificial Intelligence

Three Major LLM Trends in 2025: Ubiquitous Agents, Rising Small Models, and Multimodal Fusion

In 2025, large language models will see three key trends—agents becoming pervasive in daily life and industry, the emergence of efficient small models for edge and specialized tasks, and the integration of multimodal capabilities that combine text, images, and audio to enable more natural human‑machine interaction.

AI trends · LLM · Multimodal
4 min read
Programmer DD
Dec 31, 2024 · Artificial Intelligence

Build an AI‑Powered Expense Tracker with GLM‑4V‑Flash and MaxKB

This article demonstrates how to create an AI‑driven personal expense‑tracking assistant by leveraging Zhipu's GLM‑4V‑Flash multimodal model for receipt OCR, generating SQL statements, and integrating them with MaxKB workflows and a MySQL database, complete with code snippets and deployment steps.

AI · GLM-4V-Flash · MaxKB
13 min read
Baidu Geek Talk
Dec 25, 2024 · Industry Insights

How to Build a Multimodal Web Page Model for the LLM Era

This article examines the unique multimodal and multi‑granular nature of web pages, compares fusion strategies, proposes a cross‑modal attention approach, outlines fine‑ and coarse‑grained pre‑training tasks, and explores low‑cost adaptor methods for adapting large multimodal models to web‑page modeling in the LLM era.

AI · HTML · LLM adaptation
10 min read
DevOps
Dec 23, 2024 · Artificial Intelligence

Understanding AIGC Agents: Definition, Core Features, Underlying Logic, and Commercial Applications

This article explains what AIGC agents are, outlines their four main characteristics, describes the underlying transformer‑based architecture, dual‑stage learning, probabilistic generation and feedback optimization, and explores their current and future commercial use cases across content creation, knowledge bases, customer service, internal operations, and product design.

AIGC · Agent · Artificial Intelligence
14 min read
Tencent Cloud Developer
Dec 5, 2024 · Industry Insights

Why Most RAG Projects Fail and How Tencent’s LeXiang AI Assistant Overcomes Them

The article analyses the rapid growth of Retrieval‑Augmented Generation (RAG) in enterprises, explains why self‑built RAG solutions often collapse under cost and maintenance pressures, and demonstrates how Tencent LeXiang AI Assistant addresses these issues through a robust knowledge‑management core, extensive industry experience, scalable resources, and advanced multimodal capabilities.

AI assistant · Large Language Model · Multimodal
16 min read
21CTO
Dec 4, 2024 · Artificial Intelligence

Introducing Pi-zero: A General‑Purpose AI Foundation Model for Robotics

Physical Intelligence's new Pi-zero model, built on a vision‑language foundation and fine‑tuned with extensive robot data, outperforms prior baselines across multiple tasks, showcasing the promise of large multimodal foundation models for flexible, robust robot control.

AI · Multimodal · Pi-zero
6 min read
NewBeeNLP
Dec 2, 2024 · Artificial Intelligence

What Are Today’s Unified Generation-and-Understanding Multimodal Model Architectures?

This article surveys current unified generation-and-understanding multimodal large-model architectures, compares LLM-centric and LLM-plus-diffusion designs, extracts common insights, details large-scale training tricks from models like Emu3, Chameleon and Janus, and outlines open research directions for visual encoders.

Multimodal · diffusion · large language models
5 min read
JD Retail Technology
Nov 14, 2024 · Artificial Intelligence

Improving Advertisement Image Generation with a Multimodal Reliable Feedback Network (ECCV 2024)

The paper introduces a Multimodal Reliable Feedback Network (RFNet) and a consistency‑condition regularization technique that together boost the usable rate of automatically generated advertisement images while preserving visual quality, supported by a new million‑image annotated dataset and extensive ECCV‑2024 experiments.

AI · ECCV2024 · Multimodal
8 min read
Bilibili Tech
Nov 8, 2024 · Artificial Intelligence

AI-Powered Game Recognition for League of Legends Live Streaming on Bilibili

Bilibili’s AI‑driven game‑recognition system extracts real‑time LoL events through OCR, hero detection and hot‑spot tagging, generating high‑energy timestamps and interactive overlays that let viewers jump to key moments and view detailed statistics, enhancing spectator engagement and analytical capabilities across major esports tournaments.

AI · Game Recognition · Multimodal
14 min read
Alibaba Cloud Big Data AI Platform
Nov 6, 2024 · Artificial Intelligence

Unlocking Long-Text Video Understanding and LLM Distillation with Alibaba PAI

Two papers from Alibaba Cloud’s AI platform PAI were recently accepted at EMNLP 2024: VideoCLIP‑XL, which enhances video‑text representation for long descriptions using a large video‑long‑description dataset and novel pre‑training tasks, and TAPIR, a curriculum‑planning framework that distills the instruction‑following abilities of large language models. The team also released the associated models, datasets, and integration tools for users.

EMNLP2024 · Multimodal · distillation
0 likes · 8 min read
DataFunSummit
Nov 1, 2024 · Big Data

DataFun Summit Session Overview and E‑book Access Instructions

The article explains how to obtain the DataFun Summit e‑book by following the instructions on the WeChat official account. It also summarizes twelve technical sessions covering data lineage, integration, AI language models, multimodal content, game AI agents, lake‑warehouse governance, big‑data architecture, and cluster management.

AI · Big Data · DataOps
0 likes · 5 min read
AntTech
Oct 28, 2024 · Artificial Intelligence

Highlights of AI Large‑Model Sessions at CNCC 2024

The CNCC 2024 conference featured a series of expert talks on AI large‑model research, covering paradigm shifts in scientific discovery, knowledge enhancement and governance, data‑infrastructure analytics, vertical‑domain inference, diffusion‑model advances, multimodal model progress, and medical AI applications, illustrating the breadth and impact of large‑model technologies across multiple domains.

AI · Knowledge Governance · Multimodal
0 likes · 9 min read
JD Retail Technology
Oct 15, 2024 · Artificial Intelligence

Large‑Model‑Driven Evolution of E‑commerce Search and Recommendation at JD Retail

The article examines how large language models are reshaping JD Retail's e‑commerce search and recommendation pipelines, detailing industry evolution, technical challenges such as knowledge hallucination, intent understanding, personalization, cost, and safety, and presenting JD's end‑to‑end AIGC architecture, data preprocessing, alignment, evaluation, and next‑generation AI search solutions.

AI · Large Models · Multimodal
0 likes · 36 min read
DataFunTalk
Oct 1, 2024 · Artificial Intelligence

From Early AI to Superintelligence: Challenges and Prospects

The article reviews the evolution of artificial intelligence from early statistical models through deep learning and Transformer architectures, examines current breakthroughs like multimodal models, and discusses the technical, computational, and safety challenges that must be overcome before achieving artificial superintelligence (ASI).

AI · Artificial Intelligence · Multimodal
0 likes · 8 min read
Data Thinking Notes
Sep 26, 2024 · Big Data

How Data Platforms Are Shifting from Cost Efficiency to Value in the AI Era

The talk reviews the evolution of data technologies from early database storage to today’s generative AI-driven era. It shows how massive data, multimodal processing, and advanced analytics are turning data systems from cost‑centered infrastructures into value‑focused platforms that empower intelligent agents, open data ecosystems, and new application paradigms.

Big Data · Data Platforms · Data Value
0 likes · 19 min read
JD Tech Talk
Sep 23, 2024 · Artificial Intelligence

JD Advertising R&D: AI‑Driven Solutions for Traffic Valuation, Multimodal Understanding, Auction Mechanisms, Generative Recommendation, and Large‑Model Engineering

The JD Advertising R&D team applies AI techniques to improve traffic valuation, ad relevance, revenue, and creative generation across the platform. These techniques include query intent models, multimodal representation pipelines, reinforcement‑learning‑based auction mechanisms, generative recommendation with quantized product tokens, and large‑model infrastructure.

AI · Advertising · Large Models
0 likes · 19 min read
AntData
Sep 9, 2024 · Big Data

From Cost‑Efficiency to Value‑Centric: The Evolution of Data Systems in the Data+AI Era

The article reviews the rapid advances in generative AI and big‑data technologies and traces the historical development of data infrastructure. It argues that modern data systems are shifting from a focus on cost efficiency to a value‑centric paradigm, driven by multimodal and unstructured data, vector search, and machine‑oriented services.

@Data · Artificial Intelligence · Big Data
0 likes · 18 min read
JD Retail Technology
Sep 4, 2024 · Artificial Intelligence

Multimodal Recommendation Algorithms and System Architecture at JD.com

This article presents JD.com’s multimodal recommendation system architecture. It covers content understanding, multimodal ranking and recall models, practical deployment pipelines, and future research directions such as large‑model integration and supply‑side generation, with detailed diagrams and a closing Q&A section.

AI · JD.com · Multimodal
0 likes · 14 min read
AI Large Model Application Practice
Aug 29, 2024 · Artificial Intelligence

8 Essential Indexing Strategies to Boost Enterprise RAG Performance

This article presents eight practical optimization recommendations for the indexing stage of enterprise‑level Retrieval‑Augmented Generation (RAG) applications, covering chunk creation, abbreviation handling, multimodal document processing, semantic enrichment, metadata usage, alternative index types, and embedding model selection.

Chunking · Multimodal · RAG
0 likes · 15 min read
DataFunSummit
Aug 29, 2024 · Artificial Intelligence

Intelligent NPC Practices in Tencent Games: Multi‑Modal LLM Solutions and System Optimizations

This article details Tencent Games’ end‑to‑end approach to building intelligent NPCs. It covers the opportunities AI creates, multimodal LLM‑driven dialogue in practice, knowledge‑augmented retrieval, long‑context handling, safety measures, multimodal expression (voice and facial animation), and system‑level performance optimizations for real‑time deployment.

AI · LLM · Multimodal
0 likes · 18 min read
DataFunSummit
Aug 25, 2024 · Artificial Intelligence

Applying Large AI Models to Financial Data Governance and Innovative Use Cases

This article presents a comprehensive technical overview of how large AI models are reshaping financial data production, governance, multimodal document understanding, lakehouse storage, private‑domain model deployment, data‑centric engineering methods, and multi‑agent intelligent advisory within the finance sector.

AI · Large Models · Multimodal
0 likes · 21 min read
NewBeeNLP
Aug 15, 2024 · Industry Insights

Decoding Xiaohongshu’s Decentralized Recommendation: Sideinfo and Multimodal Fusion

This article analyzes how Xiaohongshu addresses the decentralization challenge in its recommendation system by strengthening side‑information usage, integrating multimodal signals across the full pipeline, and implementing interest exploration and protection mechanisms, while also outlining future research directions such as generative recommendation and large‑model‑driven user profiling.

Multimodal · decentralized-distribution · graph
0 likes · 25 min read
AntTech
Aug 13, 2024 · Artificial Intelligence

Ant Group Contributions to ACL 2024: Summaries of 14 Accepted Papers Across NLP and AI

At ACL 2024, held in Bangkok from August 11 to 16, Ant Group presented 14 papers. Their topics include large‑scale information extraction, decomposed LLMs for semantic search, multimodal hallucination detection, long‑context attention mechanisms, concept‑reasoning datasets, and knowledge‑graph alignment, showing the group’s breadth in natural language processing and AI research.

ACL2024 · Multimodal · NLP
0 likes · 20 min read
DataFunSummit
Jul 28, 2024 · Artificial Intelligence

Leveraging Large Language Models for Graph Learning: Opportunities, Current Progress, and Future Directions

This article reviews why large language models can be applied to graph learning, outlines their capabilities and graph data characteristics, surveys current research across different graph types and LLM roles, and proposes future research directions for unified cross‑domain graph learning.

AI · Multimodal · Research Directions
0 likes · 19 min read
Tencent Cloud Developer
Jul 18, 2024 · Artificial Intelligence

Exploring Large Language Models (LLM): Fundamentals, Applications, and Future Directions

This article surveys large language models across four areas:

- **Core concepts and evolution.** Their core concepts and development through the Transformer, GPT and BERT.
- **Generation challenges.**
- **Applications.** QA, multimodal creation, summarization, and retrieval‑augmented generation.
- **Tooling and outlook.** Prompt‑engineering frameworks and tools, LangChain‑based pipelines, AI‑driven agents, and future prospects including domain‑specific use, multimodality, and AGI.

AI · Agent · LLM
0 likes · 35 min read
Architects' Tech Alliance
Jul 10, 2024 · Industry Insights

Why AI Large Models Are Driving the Next Industrial Revolution

The article analyzes the rapid evolution of AI large models in three parts. It first covers their role in advancing AGI through massive pre‑training and fine‑tuning. It then examines current market dynamics led by GPT and domestic Chinese players. Finally, it looks at future multimodal applications, content‑factory capabilities, and emerging AIGC revenue models projected to reach a trillion‑yuan scale by 2030.

AI · AIGC · GPT
0 likes · 7 min read
Baobao Algorithm Notes
Jul 8, 2024 · Industry Insights

Why Large‑Model Deployment Stalls: Robots, Scaling Laws, and Multimodal Frontiers

The article analyzes current challenges in deploying large AI models, covering robot automation, scaling‑law limits, vertical‑domain use cases, multimodal breakthroughs, algorithmic evolution, and the hardware‑software trade‑offs of training and inference infrastructures, while questioning ROI and practical feasibility.

Large Models · Multimodal · algorithm evolution
0 likes · 21 min read
AI Large Model Application Practice
Jul 4, 2024 · Artificial Intelligence

Mastering Multimodal RAG: From PDF Parsing to Advanced Query Rewriting

This article explains how to handle complex multimodal PDFs in RAG systems, outlines extraction, indexing, and multimodal model integration, details four query‑rewriting strategies (HyDE, stepwise, sub‑question, backward), and presents key evaluation metrics and tools for assessing RAG performance.

Document Parsing · Multimodal · Query Rewriting
0 likes · 12 min read
360 Tech Engineering
Jul 3, 2024 · Artificial Intelligence

360LayoutAnalysis: Open‑Source Lightweight Document Layout Analysis Models for Multiple Scenarios

The 360LayoutAnalysis project from 360 AI Lab releases lightweight, YOLOv8‑based layout analysis models for three scenarios: Chinese and English papers, Chinese research reports, and general documents. The release provides fast inference, paragraph‑level detection, and open‑source code and weights for flexible document‑understanding pipelines.

AI model · Layout Analysis · Multimodal
0 likes · 9 min read
JD Tech
Jun 28, 2024 · Artificial Intelligence

An Overview of Large Language Models: History, Fundamentals, Prompt Engineering, Retrieval‑Augmented Generation, Agents, and Multimodal AI

This article provides a comprehensive introduction to large language models, covering their historical development, core architecture, training process, prompt engineering techniques, Retrieval‑Augmented Generation, agent frameworks, multimodal capabilities, safety challenges, and future research directions.

AI agents · AI safety · Multimodal
0 likes · 22 min read
AntTech
Jun 18, 2024 · Artificial Intelligence

Ant Group’s 24 Papers Featured at CVPR2024: Topics and Abstracts

The IEEE CVPR 2024 conference in Seattle accepted 2,719 papers out of 11,532 submissions. Ant Group contributed 24 papers covering computer vision, deep learning, digital humans, large models, multimodal remote sensing, vision‑language distillation, federated incremental learning, model‑stealing defense, and more. One of the papers was selected as a Highlight.

Ant Group · CVPR2024 · Multimodal
0 likes · 17 min read
NewBeeNLP
Jun 18, 2024 · Artificial Intelligence

How Shopee Builds an E‑Commerce Knowledge Graph and Leverages Large Models

This article presents Shopee's comprehensive approach to constructing an e‑commerce knowledge graph, detailing the challenges of heterogeneous data, multi‑language handling, entity disambiguation, and the integration of deep learning and large language models to improve product matching, recommendation, and operational efficiency.

AI · Large Language Model · Multimodal
0 likes · 22 min read
DataFunTalk
Jun 14, 2024 · Artificial Intelligence

Shopee's E‑commerce Knowledge Graph Construction and Integration with Large Models

This article presents Shopee's comprehensive exploration of building an e‑commerce knowledge graph, detailing its challenges, construction pipeline, AI‑driven extraction and fusion techniques, multilingual and multimodal modeling, and practical applications ranging from search and recommendation to AI assistants and real‑time updates.

AI applications · Multimodal · e-commerce
0 likes · 21 min read
Alibaba Cloud Developer
Jun 13, 2024 · Artificial Intelligence

Creating a Full AI‑Generated Music Video with Large‑Model Agents

This article documents the end‑to‑end workflow of using large multimodal models and specialized agents to automatically generate a storyboard, compose original music and lyrics, produce keyframes, and assemble a complete music video, while highlighting the remaining manual steps and future automation possibilities.

AI · Multimodal · Music
0 likes · 10 min read
Baobao Algorithm Notes
Jun 5, 2024 · Artificial Intelligence

Is GLM‑4‑9B the New Powerhouse? A Deep Dive into Its Performance and Usage

This article reviews the open‑source 9‑billion‑parameter GLM‑4‑9B model, covering installation, quick‑start inference code, quirky Chinese riddles that highlight its strengths over GPT‑4, extensive benchmark tables for dialogue, multilingual, tool‑calling and multimodal tasks, and its broader impact on the Chinese AI ecosystem.

AI · GLM-4-9B · Multimodal
0 likes · 14 min read
DataFunSummit
Jun 4, 2024 · Artificial Intelligence

Multimodal and Graph Neural Network Techniques for eBay Recommendation Systems

This article details eBay's practical experience integrating multimodal data and graph neural networks into its recommendation pipeline. It covers pain‑point analysis, a two‑tower multimodal embedding model trained with triplet loss and TransH, engineering design, experimental results, and key takeaways for future AI‑driven product development.

Embedding · GNN · Graph Neural Network
0 likes · 19 min read
Alimama Tech
May 29, 2024 · Artificial Intelligence

Mixture of Multi‑Modal Experts for Advertising Recall

The Mixture of Multi‑Modal Experts model combines ID features with image and text embeddings through optimized representations and conditional output fusion. This substantially improves advertising recall, especially for long‑tail items. In large‑scale online tests it delivered measurable gains in click‑recall, revenue, CTR, and page views.

Machine Learning · Multimodal · model
0 likes · 15 min read
DataFunTalk
May 28, 2024 · Artificial Intelligence

How Generative Models Are Redefining Recommendation Systems

This article reviews recent advances in generative recommendation, highlighting challenges such as item representation and multimodal fusion, and summarizing four key research papers that propose novel tokenization, collaborative integration, and transformer-based multimodal approaches to improve recommendation performance.

AI research · Generative Recommendation · LLM
0 likes · 8 min read
21CTO
May 20, 2024 · Artificial Intelligence

Deploying OPPO Multi‑Modal Pretrained Models in Edge‑Cloud Scenarios: Techniques and Optimizations

This article presents OPPO's practical research on deploying pretrained multimodal models across mobile devices and the cloud. It covers on‑device image‑text retrieval, optimizations for text‑image generation and understanding, and lightweight diffusion‑model techniques, with detailed algorithmic improvements, performance results, and real‑world application cases.

AIGC · Edge AI · Model Compression
0 likes · 18 min read
360 Tech Engineering
May 18, 2024 · Artificial Intelligence

What Makes GPT‑4o Faster, Smarter, and More Multimodal Than GPT‑4?

This article examines OpenAI's GPT‑4o, outlining its key performance, speed, accuracy, latency, multimodal, and resource‑efficiency improvements over GPT‑4, and explains why these enhancements broaden the model's applicability across various AI‑driven applications.

AI model · GPT-4o · Multimodal
0 likes · 6 min read
CSS Magic
May 17, 2024 · Artificial Intelligence

360VL: An Open‑Source Multimodal Large Language Model Based on Llama‑3‑70B

The article introduces 360VL, an open‑source multimodal large language model built on Llama‑3‑70B, describes its novel C‑abs bridge architecture for high‑resolution visual understanding, outlines the two‑stage training with bilingual data, and presents benchmark results showing superior performance over prior LMMs.

AI research · Large Language Model · Llama3
0 likes · 8 min read
DataFunSummit
May 14, 2024 · Artificial Intelligence

First Look at GPT-4o: Hands‑On Experience, FAQs, and New Free‑User Benefits

The article provides a hands‑on review of OpenAI's newly released GPT‑4o model, covering its multimodal capabilities, real‑time voice demo, desktop client rollout, access options for paid and free users, practical usage tips, and early observations on API performance and limitations.

AI model · API · ChatGPT
0 likes · 9 min read
21CTO
Apr 24, 2024 · Artificial Intelligence

Multimodal Content Understanding in Baidu Commercial Systems: The ViCAN Model and Its Applications

This article presents Baidu's exploration of multimodal content understanding for commercial advertising, detailing the ViCAN pre‑training model, its contrastive and mask‑language learning tasks, integration across recall, ranking and risk‑control pipelines, quantization with MMDict, and future AIGC‑driven generation, all backed by extensive experiments and Q&A.

AI · AIGC · Advertising
0 likes · 27 min read
Architects' Tech Alliance
Apr 20, 2024 · Artificial Intelligence

What Developers Need to Know About Meta’s New Open‑Source Llama 3 Model

The article presents Meta’s newly open‑sourced Llama 3 model as pushing the frontier of large language models, crediting it with a larger context window, a Mixture‑of‑Experts architecture, multilingual support, and multimodal capabilities. It also discusses challenges in transparency, bias, and computational resources. Applications range from natural language understanding (NLU) to code generation.

AI · Large Language Model · Llama3
0 likes · 10 min read
DataFunSummit
Apr 7, 2024 · Artificial Intelligence

How Sora Is Redefining Text‑to‑Video Generation: Inside the New AI Model

Sora, OpenAI’s newly announced text‑to‑video model, can generate high‑fidelity videos up to one minute long from text prompts or still images. It handles complex scenes, expressive characters, and sophisticated camera motion, and it can also extend existing videos and fill in missing frames. These capabilities place it at the forefront of multimodal AI research.

AI model · Multimodal · Sora
0 likes · 6 min read
Alibaba Cloud Big Data AI Platform
Mar 27, 2024 · Artificial Intelligence

Generative Multimodal Pretraining (OFA) and Representational Multimodal Pretraining (ONE-PEACE): Research Overview and Findings

This article reviews Tongyi Lab's work on the OFA framework for generative multimodal pretraining and the ONE-PEACE model for unified multimodal representation learning, detailing their architectures, training strategies, experimental results across vision‑language and audio tasks, and future research directions.

Multimodal · OFA · ONE-PEACE
0 likes · 15 min read
Alibaba Cloud Big Data AI Platform
Mar 18, 2024 · Artificial Intelligence

How MuLTI Achieves Memory‑Efficient Video‑Language Understanding with Text‑Guided MultiWay Sampling

The paper presents MuLTI, a multimodal video‑language model that tackles the memory and efficiency challenges of long video‑text sequences by introducing a Text‑Guided MultiWay Sampler and a Multiple Choice Modeling pre‑training task, achieving state‑of‑the‑art results on video QA and retrieval while drastically reducing GPU memory consumption.

Multimodal · efficient-ai · feature fusion
0 likes · 19 min read