Tagged articles
21 articles
Machine Heart
Apr 29, 2026 · Artificial Intelligence

Doc‑V*: Reading Only 5 Pages Beats RAG on 80‑Page Docs – 10 Key Insights

Doc‑V* introduces a dynamic, thumbnail‑driven approach that lets a model decide which pages to read, achieving a 49.7% improvement over RAG variants on multi‑page document QA benchmarks without larger models or longer context windows, and demonstrates how strategic evidence acquisition outperforms naïve full‑document reading.

AI · RAG · document understanding
10 min read
Xiaomi Tech
Apr 10, 2026 · Artificial Intelligence

Xiaomi AI’s 8× Faster Mobile Inference and OCR‑Free 80‑Page Document Understanding at ACL 2026

Xiaomi’s AI team announced seven ACL 2026 papers that span low‑bit KV‑cache quantization for 8.3× faster LLM inference, OCR‑free multi‑page document VQA, a new attention‑basin analysis, non‑autoregressive spoken dialogue generation, a comprehensive mobile‑agent benchmark, a success‑rate‑aware training policy, and a progressive universal information‑extraction framework.

Inference Optimization · benchmark · dialogue generation
12 min read
Old Zhang's AI Learning
Jan 31, 2026 · Artificial Intelligence

How a 0.1B‑Parameter OCR Model Beats Multi‑Billion‑Parameter Vision‑Language Models

UniRec‑0.1B, a lightweight OCR model with only 0.1 B parameters, achieves accuracy comparable to or better than multi‑billion‑parameter vision‑language models across text, formula, and mixed‑content tasks, thanks to hierarchical supervision training, a semantic‑decoupled tokenizer, and a large 40 M‑sample dataset, while delivering 2‑9× faster inference and full open‑source availability.

Hierarchical Supervision · OCR · Open Source
12 min read
PaperAgent
Jan 27, 2026 · Artificial Intelligence

How DeepSeek-OCR 2’s Dual-Flow Attention Redefines Document Understanding

DeepSeek-OCR 2 introduces a novel dual‑stream (bidirectional + causal) attention architecture that replaces fixed raster scanning, leverages a Qwen2‑0.5B encoder, and achieves state‑of‑the‑art accuracy on OmniDocBench while reducing token budget and improving reading‑order consistency.

DeepEncoder · DeepSeek · Dual-Stream Attention
8 min read
AntTech
Apr 10, 2025 · Artificial Intelligence

Ant Group Presents Four AI Research Papers at ICLR 2025 Live Showcase

At the ICLR 2025 live session in Singapore, Ant Group showcased four cutting‑edge papers—CodePlan, Animate‑X, Group Position Embedding, and OmniKV—demonstrating advances in large‑language‑model reasoning, universal character animation, layout‑aware document understanding, and efficient long‑context inference.

AI research · Multimodal · Reasoning
6 min read
AI Frontier Lectures
Mar 7, 2025 · Artificial Intelligence

Can Mistral’s New OCR Model Really Beat the Competition? A Deep Dive

Mistral AI’s newly launched OCR API claims to deliver world‑class document understanding with multilingual support, high speed, and self‑hosting options, and benchmark tests show it outperforming Azure OCR and Google Doc AI. Independent evaluations, however, reveal limitations on complex tables and legal forms, prompting a balanced assessment of its readiness for enterprise use.

AI model · Mistral AI · OCR
7 min read
DataFunSummit
Feb 21, 2025 · Artificial Intelligence

Multimodal Retrieval‑Augmented Generation (RAG): Implementation Paths and Future Prospects

This article explores multimodal Retrieval‑Augmented Generation (RAG) in five parts: semantic extraction, visual‑language models, scaling strategies, technical roadmap choices, and a Q&A. It also presents three implementation pathways, performance evaluations, and future directions for AI‑driven document understanding.

RAG · Tensor Retrieval · document understanding
11 min read
Baidu Geek Talk
Jan 6, 2025 · Information Security

MarkupLM-based Detection of Malicious Content Scraping

The article presents a MarkupLM‑based approach that enriches BERT with XPath embeddings to jointly model webpage text and structure, enabling site‑level detection of malicious content‑scraping pages that bypass traditional rule‑based filters and demonstrating the critical role of structural cues in improving spam classification accuracy.

Machine Learning · MarkupLM · XPath embedding
16 min read
NewBeeNLP
Jan 2, 2025 · Artificial Intelligence

Unlocking Multimodal RAG: From Semantic Extraction to Scalable VLM Solutions

This article examines the implementation paths and future prospects of multimodal Retrieval‑Augmented Generation, covering semantic extraction, transformer‑based OCR, visual language models, scaling challenges, tensor indexing, and practical evaluations with tools like Infinity and ColPali.

AI Retrieval · Infinity Database · Multimodal RAG
12 min read
360 Tech Engineering
Nov 15, 2024 · Artificial Intelligence

Advances in Multimodal Large Models and Document Understanding Presented at the 2024 Global Machine Learning Conference (Beijing)

At the 2024 Global Machine Learning Conference in Beijing, 360 AI Research Institute showcased cutting‑edge multimodal large‑model research, fine‑grained open‑world object detection, and document understanding technologies, highlighting open‑source releases, real‑world deployments, and competitive achievements in AI competitions.

AI research · Large Models · document understanding
7 min read
Sohu Tech Products
Nov 6, 2024 · Artificial Intelligence

RAG2.0 Engine Design Challenges and Implementation

The talk outlines RAG2.0’s design challenges (low vector recall, complex documents, semantic gaps) and presents a two‑stage architecture using deep multimodal understanding and knowledge‑graph‑enhanced retrieval, detailing advanced chunking, multi‑index and multi‑path retrieval, efficient ranking models like ColBERT, and future multi‑modal and memory‑augmented agent directions.

ColBERT · Delayed Interaction · Knowledge Graphs
23 min read
360 Tech Engineering
Jul 3, 2024 · Artificial Intelligence

360LayoutAnalysis: Open‑Source Lightweight Document Layout Analysis Models for Multiple Scenarios

The 360LayoutAnalysis project from 360 AI Lab releases lightweight, YOLOv8‑based layout analysis models covering Chinese and English papers, Chinese research reports, and a general document scenario, providing fast inference, paragraph‑level detection, and open‑source code and weights for flexible document‑understanding pipelines.

AI model · Layout Analysis · Multimodal
9 min read
DataFunSummit
Sep 5, 2023 · Artificial Intelligence

Document Intelligence: Background, Technology Stack, Large‑Model Advances, and Enterprise Applications

This article presents a comprehensive overview of document intelligence, covering its background, the evolution of related technologies, large‑model approaches such as multimodal pre‑training and domain‑specific models, and concrete enterprise use cases across various business functions.

Document Intelligence · document understanding · enterprise AI
14 min read
AntTech
Aug 25, 2023 · Artificial Intelligence

LayoutGCN: A Lightweight Graph Convolutional Network for Visually Rich Document Understanding

LayoutGCN is a lightweight, graph‑based framework that jointly encodes text, layout, and image features of visually rich documents, achieving competitive performance on multiple downstream tasks while drastically reducing model size and computational cost, making it suitable for edge deployment.

Graph Neural Network · LayoutGCN · document understanding
24 min read
DataFunSummit
Apr 7, 2023 · Artificial Intelligence

Comprehensive Overview of OCR: Types, Models, Pre‑training Techniques, and DIY Pipelines on ModelScope

This article provides a detailed introduction to OCR technology, covering its fundamental concepts, major categories (document, scene, and handwritten OCR), typical processing pipelines, a suite of open‑source models on ModelScope—including detection, recognition, and table OCR—and recent multimodal pre‑training methods such as VLDoc and VLPT.

ModelScope · OCR · Table OCR
15 min read
AntTech
Jun 15, 2022 · Artificial Intelligence

XYLayoutLM: Towards Layout-Aware Multimodal Networks for Visually-Rich Document Understanding

XYLayoutLM introduces a layout‑aware multimodal network that improves visually‑rich document understanding by augmenting XY‑Cut for robust reading order generation and employing a Dilated Conditional Position Encoding to handle variable‑length inputs, achieving state‑of‑the‑art performance on XFUN and FUNSD datasets.

Multimodal · Vision Transformer · XYCut
10 min read
Architects Research Society
Jan 9, 2022 · Artificial Intelligence

Five Key Trends in AI-Powered Search and Unstructured Data Analysis

The article outlines five major trends—neural-network-enhanced search, semantic search, document understanding, image and voice search, and knowledge graphs—that are transforming enterprise use of unstructured data by leveraging AI to deliver precise, context-aware answers and insights.

AI · Search · document understanding
15 min read