Tag

multimodal AI

1 view collected around this technical thread.

Architects' Tech Alliance
Jun 11, 2025 · Artificial Intelligence

From Transformers to DeepSeek‑R1: The 2017‑2025 Evolution of Large Language Models

This article chronicles the rapid development of large language models from the 2017 Transformer breakthrough through the rise of BERT, GPT‑3, ChatGPT, multimodal systems like GPT‑4V/o, and the recent cost‑efficient DeepSeek‑R1, highlighting key architectural innovations, scaling trends, alignment techniques, and their transformative impact on AI research and industry.

AI alignment · BERT · GPT
0 likes · 26 min read
Kuaishou Large Model
Jun 11, 2025 · Artificial Intelligence

12 Kuaishou Breakthrough Papers at CVPR 2025: Video Generation, Diffusion & Multimodal AI

CVPR 2025 in Nashville will feature 12 Kuaishou papers spanning large‑scale video datasets, quality assessment, 3D/4D reconstruction, controllable generation, diffusion scaling laws, multimodal simulation, and novel benchmarks, highlighting the company's cutting‑edge contributions to video AI research.

computer vision · diffusion models · large-scale datasets
0 likes · 21 min read
Kuaishou Tech
Jun 10, 2025 · Artificial Intelligence

Top 12 Cutting-Edge Video Generation Papers from Kuaishou at CVPR 2025

The article highlights CVPR 2025’s acceptance statistics and showcases twelve cutting‑edge video‑generation papers from Kuaishou, spanning datasets, quality assessment, style control, scaling laws, 4D simulation, interleaved image‑text data, vision‑language acceleration, high‑fidelity avatars, patch‑wise super‑resolution, narrative‑driven benchmarks, sketch‑based editing, and spatio‑temporal diffusion, each with links and abstracts.

CVPR 2025 · Kuaishou · computer vision
0 likes · 20 min read
Kuaishou Large Model
Jun 5, 2025 · Artificial Intelligence

7 Kuaishou Papers Accepted at ACL 2025 Reveal Cutting‑Edge AI Advances

Kuaishou's foundational large‑model team secured seven papers at the prestigious ACL 2025 conference, covering alignment bias during model training, safety in inference, decoding strategies, fine‑grained video‑temporal understanding, and new evaluation benchmarks that push the frontier of multimodal large language models.

ACL 2025 · Large Language Models · benchmark
0 likes · 16 min read
AntTech
Jun 4, 2025 · Artificial Intelligence

LLaDA and LLaDA‑V: Large Language Diffusion Models and Their Multimodal Extensions

This article presents the LLaDA series of diffusion‑based large language models, explains how their generative‑modeling principle yields language intelligence comparable to autoregressive models, and details the multimodal LLaDA‑V architecture, training methods, experimental results, and broader implications for AI research.

Large Language Models · diffusion models · generative modeling
0 likes · 10 min read
Java Architecture Diary
May 19, 2025 · Artificial Intelligence

How Ollama 0.7 Unlocks Local Multimodal AI with One Command

Ollama 0.7 introduces a fully re‑engineered core that brings seamless multimodal model support, lists top visual models, showcases OCR and image analysis capabilities, explains technical breakthroughs, and provides a quick three‑step guide to deploy powerful local AI vision.

AI Engineering · AI models · Image Recognition
0 likes · 7 min read
DaTaobao Tech
Apr 14, 2025 · Artificial Intelligence

Taobao AIGC Content Generation: Short Video Production Techniques

Taobao's Content AI team leverages a proprietary multimodal Mixture‑of‑Experts model to automatically generate short‑form videos, extracting highlights from live streams and creating customized product explainers. The pipeline combines two‑stage CLIP/VideoBLIP training, character‑level timestamps, LLM re‑segmentation, and OCR masking, and now produces over 100k videos daily with a 12% approval boost and notable conversion gains.

AIGC · Content AI · e-commerce
0 likes · 20 min read
Tencent Cloud Developer
Apr 10, 2025 · Artificial Intelligence

The Magic of GPT‑4o: Technical Overview and Speculated Architecture

GPT‑4o combines extremely long‑form text generation, high‑quality image creation, and interactive editing, likely via an autoregressive multimodal transformer that tokenizes visuals through VQ‑VAE/GAN pipelines. Trained on massive data and refined through fine‑tuning and RLHF, it offers a unified model for generation, editing, and understanding.

AI architecture · GPT-4o · VQ-VAE
0 likes · 17 min read
DataFunTalk
Apr 6, 2025 · Artificial Intelligence

Meta Unveils Llama 4: New Multimodal AI Models with Mixture‑of‑Experts Architecture and 10 Million‑Token Context

Meta announced the Llama 4 series—Scout, Maverick and Behemoth—featuring multimodal capabilities, Mixture‑of‑Experts design, up to 10 million‑token context windows, and state‑of‑the‑art performance on STEM, multilingual and image benchmarks, with models now downloadable from llama.com and Hugging Face.

Llama 4 · Mixture of Experts · large language model
0 likes · 14 min read
Architects' Tech Alliance
Mar 31, 2025 · Artificial Intelligence

A Comprehensive History of Large Language Models from the Transformer Era (2017) to DeepSeek‑R1 (2025)

This article reviews the evolution of large language models from the 2017 Transformer breakthrough through BERT, GPT series, alignment techniques, multimodal extensions, open‑weight releases, and the cost‑efficient DeepSeek‑R1 in 2025, highlighting key technical advances, scaling trends, and their societal impact.

AI alignment · LLM evolution · Large Language Models
0 likes · 26 min read
ByteDance Web Infra
Mar 21, 2025 · Artificial Intelligence

Midscene.js: An AI‑Driven UI Automation Framework from ByteDance

Midscene.js is an open‑source UI automation framework that leverages multimodal AI to simplify web UI testing and interaction, offering three core interfaces—Action, Query, and Assert—along with a JavaScript SDK, support for multiple AI models, YAML scripting, and future‑focused features for stable, scalable automation.

AI · JavaScript · Midscene.js
0 likes · 21 min read
DevOps
Mar 19, 2025 · Artificial Intelligence

From Claude 3.5 Sonnet to Manus: The Evolution and Landscape of Computer‑Use AI Agents

This article surveys the rapid development of computer‑use AI agents—from Anthropic’s Claude 3.5 Sonnet and OpenAI’s Operator to the multi‑agent Manus platform—detailing their capabilities, benchmark results, open‑source alternatives, practical challenges, and future prospects for autonomous digital assistants.

AI agents · Anthropic · Computer Use Agent
0 likes · 24 min read
Java Architecture Diary
Mar 19, 2025 · Artificial Intelligence

Unlocking Google’s Gemma 3: Multimodal Power, 128k Context & Local Deployment Guide

This article introduces Google's open‑source Gemma 3 model, highlights its multimodal capabilities, massive 128k‑token context window, and multilingual support, and provides step‑by‑step instructions for installing Ollama, pulling the model, and running local tests with code examples.

AI model · Gemma 3 · Local Deployment
0 likes · 7 min read
DaTaobao Tech
Mar 12, 2025 · Artificial Intelligence

Multimodal Automatic Layout Generation for E-commerce

The project develops a multimodal automatic layout generation system for e‑commerce by fine‑tuning the qwen‑vl‑7b vision‑language model with LoRA on poster and Taobao image‑layout data, employing diffusion‑based image generation and coordinate‑prediction methods to produce structured layouts that power poster, marketing image, and video‑cover creation with over 90% adoption, while exploring multi‑image, style‑aware, and iterative refinement extensions.

LLM · Layout Generation · diffusion
0 likes · 12 min read
DataFunSummit
Feb 26, 2025 · Artificial Intelligence

Applying Multimodal Large Models to Music Recommendation at NetEase Cloud Music

This article details how NetEase Cloud Music leverages multimodal large language models to improve music recommendation across daily, personalized, and playlist scenarios by extracting rich audio, text, and visual features, addressing data skew, cold‑start challenges, and achieving measurable gains in user engagement and distribution efficiency.

Feature Extraction · Large Language Models · NetEase Cloud Music
0 likes · 12 min read
Architecture & Thinking
Feb 26, 2025 · Artificial Intelligence

Unlocking DeepSeek: A Comprehensive Guide to China’s Cutting-Edge AI Chat Model

This article provides an in‑depth overview of DeepSeek, covering its core multimodal and multilingual features, long‑context capabilities, domain optimizations, security, main functions, diverse application scenarios, and practical usage via web interface or API integration.

AI chatbot · Artificial Intelligence · DeepSeek
0 likes · 6 min read
DaTaobao Tech
Feb 24, 2025 · Artificial Intelligence

AIGC Video Generation Techniques for E‑commerce: Lip‑Sync, Head/Body Driving, and Business Applications

The article surveys recent AIGC video generation advances for Taobao e‑commerce, detailing lip‑sync models like Wav2Lip and MuseTalk, head‑driven systems such as Hallo and EchoMimic, body‑driven pipelines including AnimateAnyone and Tango, and a four‑stage production workflow that boosts click‑through rates and enables virtual try‑on.

AIGC · deep learning · e-commerce
0 likes · 21 min read
DataFunSummit
Feb 21, 2025 · Artificial Intelligence

Multimodal Retrieval‑Augmented Generation (RAG): Implementation Paths and Future Prospects

This article explores multimodal Retrieval‑Augmented Generation (RAG), detailing five core topics—including semantic extraction, visual‑language models, scaling strategies, technical roadmap choices, and a Q&A—while presenting three implementation pathways, performance evaluations, and future directions for AI‑driven document understanding.

Document Understanding · RAG · Tensor Retrieval
0 likes · 11 min read
DataFunTalk
Feb 19, 2025 · Artificial Intelligence

Large Models: Concepts, Principles, Classifications and Applications

This report provides a comprehensive overview of large-scale AI models, explaining their definition, massive parameter and data requirements, underlying transformer architecture, classification into language, vision and multimodal models, notable examples such as DeepSeek, and a survey of popular AIGC tools and practical use cases.

AIGC tools · Artificial Intelligence · Large Language Models
0 likes · 9 min read
Xiaohongshu Tech REDtech
Feb 17, 2025 · Artificial Intelligence

WorldSense: A New Benchmark for Evaluating Multimodal Large Models in Real‑World Scenarios

WorldSense, a new benchmark of 1,662 real‑world video‑audio clips and 3,172 QA pairs across 26 cognitive tasks, reveals that current multimodal large models achieve only 25%–48% accuracy, highlighting the crucial role of combined visual‑audio input and the difficulty of audio‑ and emotion‑related reasoning.

Large Modelsbenchmark datasetmodel analysis
0 likes · 12 min read