
Xiaomi AI’s 8× Faster Mobile Inference and OCR‑Free 80‑Page Document Understanding at ACL 2026

Xiaomi’s AI team announced seven ACL 2026 papers that span low‑bit KV‑cache quantization for 8.3× faster LLM inference, OCR‑free multi‑page document VQA, a new attention‑basin analysis, non‑autoregressive spoken dialogue generation, a comprehensive mobile‑agent benchmark, a success‑rate‑aware training policy, and a progressive universal information‑extraction framework.

Xiaomi Tech

Apr 10, 2026

VecInfer: Efficient LLM Inference with Low‑Bit KV Cache via Outlier‑Suppressed Vector Quantization

KV‑cache memory is the primary bottleneck for long‑context LLM inference. Scalar quantization methods such as TurboQuant suffer from outlier interference. VecInfer suppresses outliers with a pair of transforms (smooth and Hadamard), adopts a vector‑quantization scheme, and fuses de‑quantization into a custom CUDA kernel. On Llama‑3.1‑8B with a 196k‑token context, 2‑bit quantization attains near‑full‑precision accuracy, speeds up the attention layer by 2.7×, and reduces end‑to‑end latency by 8.3×. Paper: https://arxiv.org/pdf/2510.06175. Code: https://github.com/ydyhello/VecInfer
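To make the outlier-suppression idea concrete, here is a minimal pure-Python sketch. It is not VecInfer's kernel or codebook: the toy vectors, the four-entry codebook, and the helper names are all assumptions for illustration, and the smoothing transform is omitted. It shows that an orthonormal Hadamard rotation spreads a single outlier channel across all channels while leaving query-key dot products unchanged, and that sub-vectors can then be mapped to their nearest codeword.

```python
import math

def hadamard(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def nearest_code(sub, codebook):
    """Index of the codeword closest (squared L2 distance) to a sub-vector."""
    return min(
        range(len(codebook)),
        key=lambda c: sum((s - w) ** 2 for s, w in zip(sub, codebook[c])),
    )

# Toy key vector with one outlier channel (index 3) and a toy query.
key = [0.1, -0.2, 0.05, 8.0, 0.0, 0.15, -0.1, 0.2]
query = [0.3, 0.1, -0.4, 0.2, 0.5, -0.3, 0.1, 0.0]

key_rot = hadamard(key)
query_rot = hadamard(query)

# The rotation is orthogonal, so attention scores are preserved,
# but the largest magnitude in the key shrinks.
peak_before = max(abs(v) for v in key)
peak_after = max(abs(v) for v in key_rot)

# Hypothetical 4-entry codebook over sub-vectors of 4 channels.
codebook = [
    [0.0, 0.0, 0.0, 0.0],
    [3.0, 3.0, 3.0, 3.0],
    [-3.0, -3.0, -3.0, -3.0],
    [3.0, -3.0, -3.0, 3.0],
]
codes = [nearest_code(key_rot[i:i + 4], codebook) for i in range(0, len(key_rot), 4)]

# Storage arithmetic: a 256-entry codebook over 4-channel sub-vectors
# costs log2(256) / 4 bits per element.
bits_per_element = math.log2(256) / 4

print(peak_before, round(peak_after, 3), codes, bits_per_element)
```

The last line of arithmetic shows one way a vector quantizer can reach 2 bits per element; the sub-vector width and codebook size that VecInfer actually uses are not stated in this summary.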

Doc‑V*: Coarse‑to‑Fine Interactive Visual Reasoning for Multi‑Page Document VQA

Doc‑V* introduces an OCR‑free active‑perception paradigm that lets the model browse, retrieve, and integrate information across pages on demand, avoiding the high cost of processing all pages at once and the information loss of static RAG pipelines. Evaluated on four public benchmarks, it achieves the best open‑source results, improving performance on long‑document (>80 pages) tasks by 9.8 percentage points over RAG baselines while substantially lowering peak GPU memory usage.

Attention Basin: Why Contextual Position Matters in Large Language Models

The authors discover a “U‑shaped” attention basin that causes models to overlook crucial middle‑position information in long texts. Analysis attributes this bias to structural perception of semantic block boundaries rather than absolute position. They propose AttnRank, which performs a single low‑cost attention pass to reorder tokens so that key information aligns with attention peaks. This lightweight reordering improves performance across ten architectures without additional training or latency overhead.
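The reordering step can be sketched as a simple assignment problem. The snippet below is an illustration under stated assumptions, not the authors' AttnRank code: it treats the reordered units as whole passages, takes their relevance scores as given, and uses made-up numbers for the U-shaped per-position attention profile that the single attention pass would supply. The function name `reorder_by_attention` is hypothetical.

```python
def reorder_by_attention(items, relevance, position_attention):
    """Place the most relevant items at the positions that receive the most attention."""
    assert len(items) == len(relevance) == len(position_attention)
    # Positions sorted from most to least attended.
    slots = sorted(range(len(items)), key=lambda p: -position_attention[p])
    # Items sorted from most to least relevant.
    ranked = sorted(range(len(items)), key=lambda i: -relevance[i])
    out = [None] * len(items)
    for slot, idx in zip(slots, ranked):
        out[slot] = items[idx]
    return out

passages = ["A", "B", "C", "D", "E"]
relevance = [0.1, 0.9, 0.3, 0.7, 0.2]        # assumed relevance per passage
u_shape = [0.30, 0.15, 0.05, 0.18, 0.32]     # assumed attention per position

reordered = reorder_by_attention(passages, relevance, u_shape)
print(reordered)  # ['D', 'E', 'A', 'C', 'B']
```

The most relevant passage (B) lands in the last position and the second most relevant (D) in the first, which are the two attention peaks of the assumed profile, while the least relevant passage ends up in the middle of the basin.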

ZipVoice‑Dialog: Non‑Autoregressive Spoken Dialogue Generation with Flow Matching

ZipVoice‑Dialog employs flow‑matching to generate spoken dialogue non‑autoregressively. A curriculum‑learning strategy ensures tight text‑audio alignment, and speaker‑turn embeddings enable natural voice switching and stereo output. Experiments show higher inference speed, intelligibility, turn‑switch accuracy, and voice‑similarity compared with autoregressive baselines. The work also releases OpenDialog, a 6,800‑hour multi‑speaker dialogue dataset. Paper: https://arxiv.org/pdf/2507.09318. Code: https://github.com/k2-fsa/ZipVoice
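As a rough illustration of non-autoregressive sampling with flow matching, the sketch below integrates an ODE from noise to a target with fixed-step Euler updates, changing every frame in parallel at each step. It is not ZipVoice-Dialog's model: a real system would use a trained network to predict the velocity, whereas `oracle_velocity` here is a hypothetical stand-in that knows the target, and the three-element vectors stand in for acoustic features.

```python
def euler_sample(velocity, x0, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        # All positions are updated together; no left-to-right decoding.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

target = [0.5, -1.0, 2.0]   # stand-in for the features to generate
noise = [1.3, 0.2, -0.7]    # stand-in for a Gaussian noise sample

def oracle_velocity(x, t):
    """Velocity of the straight path x_t = (1 - t) * x0 + t * x1, written in terms of x_t."""
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

sample = euler_sample(oracle_velocity, noise, steps=8)
print(sample)
```

Because the path is a straight line, a handful of Euler steps is enough in this toy case; the number of sampling steps a trained model needs is an empirical question that this sketch does not answer.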

MobileBench‑OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real‑World Environment

MobileBench‑OL evaluates agents on 80 mainstream Chinese apps covering 1,080 tasks. It measures basic ability, long‑tail usage, long‑duration tasks, GUI reasoning, and noise robustness, and provides an automated evaluation framework with device‑reset mechanisms. Results reveal that current agents achieve less than 20% success on tasks requiring more than 20 steps, exposing a gap between demo performance and reliable daily use. Paper: https://arxiv.org/pdf/2601.20335. Code: https://github.com/xiaomi-research/mobilebench-ol

STEP: Success‑Rate‑Aware Trajectory‑Efficient Policy Optimization

Training mobile agents is costly because each trial must run on a real device. Existing trajectory‑level methods suffer from (1) uniform sampling across difficulty levels, (2) penalizing correct steps in failed trajectories, and (3) high sampling cost for multi‑turn interactions. STEP adopts a “hard‑first, step‑wise” strategy: it adaptively resamples based on task success rate and decomposes trajectories into step‑level samples for fine‑grained optimization. On OSWorld and AndroidWorld benchmarks, STEP converges faster and generalizes better under the same compute budget. Paper: https://arxiv.org/pdf/2511.13091
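The two ingredients of the "hard‑first, step‑wise" strategy can be sketched in a few lines. This is an illustrative approximation, not STEP's actual algorithm: the weighting rule (proportional to failure rate with a small floor), the task names, and the record layout are all assumptions.

```python
def sampling_weights(success_rates, floor=0.05):
    """Weight tasks by failure rate so that harder tasks are sampled more often."""
    raw = {task: max(1.0 - rate, floor) for task, rate in success_rates.items()}
    total = sum(raw.values())
    return {task: w / total for task, w in raw.items()}

def to_step_samples(trajectory):
    """Split one multi-turn trajectory into independent step-level samples."""
    return [
        {"task": trajectory["task"], "step": i, "observation": obs, "action": act}
        for i, (obs, act) in enumerate(trajectory["steps"])
    ]

# Hypothetical running success rates per task.
rates = {"open_settings": 0.9, "send_email": 0.5, "book_ticket": 0.2}
weights = sampling_weights(rates)

# Hypothetical trajectory: a list of (observation, action) pairs.
trajectory = {
    "task": "send_email",
    "steps": [("home_screen", "tap:mail"), ("inbox", "tap:compose"), ("draft", "tap:send")],
}
step_samples = to_step_samples(trajectory)

print(weights)
print(len(step_samples))
```

The floor keeps already-solved tasks in the sampling pool at a low rate, a design choice made here so the sketch never assigns a task zero probability; whether STEP does the same is not stated in this summary.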

ProUIE: A Macro‑to‑Micro Progressive Learning Method for LLM‑based Universal Information Extraction

ProUIE presents a three‑stage progressive learning pipeline that requires no external knowledge. Stage 1 builds a universal extraction backbone, stage 2 refines structured output, and stage 3 applies GRPO with fine‑grained rewards for deep exploration. Evaluated on 36 public IE datasets, ProUIE consistently improves performance, and a 4B backbone surpasses many larger baselines. Paper: https://arxiv.org/pdf/2508.05128
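For readers unfamiliar with GRPO, the sketch below shows its group-relative advantage: several responses to the same prompt are scored, and each reward is normalized by the group's mean and standard deviation. The reward used here, set-level F1 over extracted (span, type) pairs, is an assumption chosen as one plausible fine-grained signal; the summary does not say which rewards ProUIE uses, and the example entities are made up.

```python
import statistics

def f1_reward(predicted, gold):
    """Set-level F1 between predicted and gold extractions (assumed reward)."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

gold = [("Xiaomi", "ORG"), ("ACL 2026", "EVENT")]
group = [
    [("Xiaomi", "ORG"), ("ACL 2026", "EVENT")],   # exact match
    [("Xiaomi", "ORG")],                          # one of two found
    [("Xiaomi", "ORG"), ("ACL", "ORG")],          # one right, one wrong
    [("Beijing", "LOC")],                         # nothing right
]

rewards = [f1_reward(pred, gold) for pred in group]
advantages = group_relative_advantages(rewards)

print([round(r, 3) for r in rewards])
print([round(a, 3) for a in advantages])
```

A graded reward like this separates partially correct extractions from wrong ones, so the normalized advantages carry more signal than a pass/fail score would.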
