Xiaomi’s Imaging Algorithms Win CVPR 2026 NTIRE: Super‑Resolution, Portrait Restoration, Reflection Removal Breakthroughs
Xiaomi secured three top spots at CVPR 2026 NTIRE—first in Efficient Super‑Resolution with SPANV2, first in Portrait Restoration using a dual‑stage cascade, and second in Reflection Removal via RDNet‑XL and diffusion‑model distillation—showcasing hardware‑software co‑design, ultra‑fast inference, and novel algorithmic innovations.
At CVPR 2026 NTIRE, Xiaomi achieved three awards: champion in the Efficient Super‑Resolution track, champion in the Portrait Restoration track, and runner‑up in the Reflection Removal track, demonstrating a comprehensive imaging pipeline from capture to display.
Efficient Super‑Resolution (SPANV2)
The NTIRE Efficient Super‑Resolution challenge requires entries to maintain high reconstruction quality while minimizing inference time, parameter count, and compute for resource‑constrained devices. Xiaomi’s multimedia algorithm team evolved the earlier SPAN architecture into SPANV2, introducing a Governed Residual Update framework that splits each module into a candidate‑correction branch and a learnable governor that decides how the correction is injected.
Key redesigns:
Learnable channel‑mixing attention: replaces the fixed element‑wise product of SPAN with a C×C 1×1 learnable projection, allowing negative values and explicit inter‑channel modeling, improving quality with negligible parameter increase.
span_attn_op: a custom GPU kernel that fuses 1×1 attention convolution, element‑wise addition and multiplication into a single memory‑efficient operation, cutting redundant DRAM traffic and accelerating inference.
Near‑pixel up‑sampling branch: a lightweight depth‑wise convolution branch initialized as nearest‑neighbor up‑sampling, handling low‑frequency regions (sky, walls) without extra computation, letting the main backbone focus on high‑frequency details.
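The redesigns above can be illustrated with a small pure-Python sketch. It is not Xiaomi's implementation: the function names are hypothetical, and the exact formulas are assumptions. The first function reads the channel-mixing attention as a per-pixel C×C matrix product that gates the sum of a feature and its skip input, computed in one pass in the spirit of the fused operator. The remaining functions show one common way to make a convolutional up-sampling branch start out as nearest-neighbor up-sampling: center-tap kernels followed by a pixel shuffle.

```python
def channel_mix_gate(feat, skip, weight):
    """One pixel, C channels: returns (feat + skip) * (weight @ feat).

    Assumed reading of the text: the 1x1 convolution is a per-pixel
    matrix-vector product, and the add and multiply are applied in the
    same pass instead of writing intermediate tensors.
    """
    out = []
    for c, row in enumerate(weight):
        attn = sum(w * f for w, f in zip(row, feat))  # may be negative
        out.append((feat[c] + skip[c]) * attn)
    return out


def nearest_init_kernels(scale, ksize=3):
    """scale*scale kernels, each with a single 1.0 at the center tap."""
    center = ksize // 2
    kernels = []
    for _ in range(scale * scale):
        kernel = [[0.0] * ksize for _ in range(ksize)]
        kernel[center][center] = 1.0
        kernels.append(kernel)
    return kernels


def conv2d_same(img, kernel):
    """Single-channel 2D convolution with zero padding ('same' size)."""
    h, w, k = len(img), len(img[0]), len(kernel)
    center = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - center, x + dx - center
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy][dx] * img[yy][xx]
            out[y][x] = acc
    return out


def pixel_shuffle(planes, scale):
    """Interleave scale*scale planes of size HxW into one (H*s)x(W*s) image."""
    h, w = len(planes[0]), len(planes[0][0])
    out = [[0.0] * (w * scale) for _ in range(h * scale)]
    for i, plane in enumerate(planes):
        oy, ox = divmod(i, scale)
        for y in range(h):
            for x in range(w):
                out[y * scale + oy][x * scale + ox] = plane[y][x]
    return out


def near_pixel_branch(img, kernels, scale):
    """Depth-wise style branch for one channel: convolve, then shuffle."""
    return pixel_shuffle([conv2d_same(img, k) for k in kernels], scale)


if __name__ == "__main__":
    print(channel_mix_gate([1.0, 2.0], [0.5, 0.5], [[1.0, 0.0], [0.0, -1.0]]))
    for row in near_pixel_branch([[1, 2], [3, 4]], nearest_init_kernels(2), 2):
        print(row)
```

At initialization the branch reproduces nearest-neighbor up-sampling exactly, so training only has to learn a correction on top of it. Once the kernels are trained away from the center-tap pattern, that equivalence no longer holds.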
SPANV2 contains only 0.139 M parameters, with 32 feature channels, five stacked SPABV2 modules, and the near‑pixel branch. The resulting model is a few hundred KB and runs on mainstream mobile ISPs/NPUs. It achieved a composite score of 4.43, ranking first, with an average inference latency of 5.256 ms, over 30% faster than the SPAN baseline.
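As a quick sanity check on the "few hundred KB" figure, the weight storage can be estimated from the parameter count alone. The article does not state the storage precision, so the three precisions below are assumptions, and the estimate ignores file-format overhead.

```python
PARAMS = 139_000  # 0.139 M parameters, as reported in the text

# Assumed storage precisions; the source does not say which one is used.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def model_size_kb(num_params, bytes_per_param):
    """Weights-only size in kilobytes (1 KB = 1000 bytes)."""
    return num_params * bytes_per_param / 1000


if __name__ == "__main__":
    for name, nbytes in BYTES_PER_PARAM.items():
        print(f"{name}: {model_size_kb(PARAMS, nbytes):.0f} KB")
```

Under these assumptions, 16-bit and 32-bit storage both land in the few-hundred-KB range the article describes.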
Portrait Restoration (MiPlusCV)
The portrait restoration task aims to recover high‑quality, detail‑rich faces from severely degraded inputs while preserving identity. Xiaomi’s large‑model team proposed a two‑stage cascade with a single‑step diffusion refinement:
Stage 1 – OSDFace coarse restoration: restores facial structure, corrects severe degradation, and ensures stable geometry.
Stage 2 – Z‑Image one‑step diffusion: injects fine‑grained skin texture, hair, and edge details.
Training employs a multi‑objective loss suite:
Pixel‑level L1 loss for basic reconstruction accuracy.
DISTS perceptual loss for natural texture.
ArcFace identity constraint to keep the restored face recognizable.
DINOv2 adversarial loss for high‑level semantic consistency.
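A multi-objective loss of this kind is usually a weighted sum of the individual terms. The sketch below is illustrative only: the weights are hypothetical, the L1 term is computed directly, the identity term is written as one minus cosine similarity between face embeddings (a common formulation, assumed here), and the DISTS and adversarial terms are passed in as precomputed numbers because their models are outside the scope of a short example.

```python
import math


def l1_loss(pred, target):
    """Mean absolute error over flattened pixel values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def identity_loss(emb_pred, emb_target):
    """1 - cosine similarity between two face embeddings (assumed form)."""
    dot = sum(a * b for a, b in zip(emb_pred, emb_target))
    norm = math.sqrt(sum(a * a for a in emb_pred)) * math.sqrt(
        sum(b * b for b in emb_target)
    )
    return 1.0 - dot / norm


def total_loss(terms, weights):
    """Weighted sum of named loss terms."""
    return sum(weights[name] * value for name, value in terms.items())


if __name__ == "__main__":
    # Hypothetical weights; the article does not report the actual values.
    weights = {"l1": 1.0, "dists": 0.5, "identity": 0.1, "adversarial": 0.05}
    terms = {
        "l1": l1_loss([0.2, 0.8], [0.0, 1.0]),
        "dists": 0.3,  # placeholder for a perceptual-model output
        "identity": identity_loss([1.0, 0.0], [0.8, 0.6]),
        "adversarial": 0.7,  # placeholder for a discriminator-based term
    }
    print(total_loss(terms, weights))
```

Keeping each term separate makes it easy to log them individually and to rebalance the weights when one objective starts to dominate.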
After supervised training, a quality‑reward fine‑tuning stage uses several image‑quality models as feedback, nudging results toward higher subjective visual appeal. The method topped both reference‑free quality and identity‑consistency metrics, earning first place.
Reflection Removal (RDNet‑XL)
Reflection removal seeks to separate the transmitted scene from unwanted glass reflections, a task complicated by diverse materials, angles, and lighting. Xiaomi upgraded the backbone from FocalNet‑L to FocalNet‑XL, boosting multi‑scale representation and global context modeling.
To handle difficult reflective samples, the team introduced diffusion‑model knowledge distillation:
Generate 1,000 high‑quality pseudo‑labels by applying SOTA diffusion models (WindowSeat, DAI) to open‑source images.
Align domains by passing both reflected and clean images through the same VAE encoder‑decoder.
Use the diffusion model’s output as a teacher signal for additional distillation training.
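The three steps above can be sketched as a small data-preparation and loss routine. Everything here is a stand-in: `teacher`, `vae_roundtrip`, and `student` are hypothetical callables operating on flat lists of pixel values, the L1 objective and the `distill_weight` value are assumptions, and the article does not describe how the real pipeline combines the supervised and distillation terms.

```python
def l1(a, b):
    """Mean absolute error between two flattened images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def build_pseudo_pairs(reflected_images, teacher, vae_roundtrip):
    """Create (input, pseudo-label) pairs for distillation.

    Step 1: the teacher (a diffusion model in the text) predicts a clean image.
    Step 2: input and pseudo-label both pass through the same VAE round trip,
            so they share the same reconstruction characteristics.
    """
    pairs = []
    for img in reflected_images:
        pseudo_clean = teacher(img)
        pairs.append((vae_roundtrip(img), vae_roundtrip(pseudo_clean)))
    return pairs


def training_loss(student, real_pairs, pseudo_pairs, distill_weight=0.5):
    """Step 3: supervised loss on real pairs plus a weighted distillation term."""
    supervised = sum(l1(student(x), y) for x, y in real_pairs) / len(real_pairs)
    distill = sum(l1(student(x), y) for x, y in pseudo_pairs) / len(pseudo_pairs)
    return supervised + distill_weight * distill


if __name__ == "__main__":
    # Toy stand-ins: the "teacher" darkens the image, the "VAE" quantizes it.
    teacher = lambda img: [max(0.0, v - 0.25) for v in img]
    vae_roundtrip = lambda img: [round(v * 4) / 4 for v in img]
    student = lambda img: img  # untrained identity model

    pseudo = build_pseudo_pairs([[0.5, 0.75]], teacher, vae_roundtrip)
    real = [([1.0], [0.0])]
    print(pseudo, training_loss(student, real, pseudo))
```

Running both the input and the pseudo-label through the same round trip means the student is not rewarded for learning VAE reconstruction artifacts as if they were reflection removal.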
A three‑stage progressive resolution schedule (384×384 → 512×512 → 768×768) stabilizes training on large images, first learning local reflection patterns and then expanding to global structure. This approach secured second place with a subjective score of 4.31 and top objective rankings.
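A progressive resolution schedule like this one amounts to a lookup from training step to crop size. In the sketch below the three crop sizes come from the text, while the per-stage step counts and the helper names are hypothetical.

```python
import random

# (crop size, number of steps); sizes from the text, step counts hypothetical.
STAGES = [(384, 100), (512, 100), (768, 100)]


def crop_size_at(step, stages=STAGES):
    """Return the square crop size used at a given global training step."""
    for size, steps in stages:
        if step < steps:
            return size
        step -= steps
    return stages[-1][0]  # stay at the final resolution afterwards


def random_crop_box(height, width, size, rng=random):
    """Pick a random (top, left, crop_h, crop_w) box, clamped to the image."""
    size = min(size, height, width)
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    return top, left, size, size


if __name__ == "__main__":
    for step in (0, 99, 100, 250):
        size = crop_size_at(step)
        print(step, size, random_crop_box(1024, 1536, size))
```

Smaller crops early on are cheaper per step and expose the model mostly to local reflection patterns; the larger crops later provide the wider context needed for global structure.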
The official NTIRE report highlighted that Xiaomi’s custom fusion operator exposed a shift in bottlenecks from arithmetic complexity to memory bandwidth, underscoring the importance of low‑level hardware‑software co‑design in modern efficient imaging research.
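A rough traffic count shows why fusing small operations matters once memory bandwidth is the bottleneck. The estimate below assumes an unfused pipeline writes every intermediate tensor to DRAM and reads it back, that the fused operator reads its two inputs once and writes one output, and that there is no cache reuse. The 32 channels come from the text; the 256×256 feature-map size and 2-byte elements are illustrative assumptions.

```python
def dram_traffic_bytes(channels, height, width, bytes_per_elem=2):
    """Estimate bytes moved for out = (feat + skip) * conv1x1(feat).

    Returns (unfused, fused) totals under a no-cache-reuse assumption.
    """
    tensor = channels * height * width * bytes_per_elem
    weights = channels * channels * bytes_per_elem

    # Unfused: three separate kernels, each reading inputs and writing output.
    conv = tensor + weights + tensor  # read feat + weights, write attn
    add = 2 * tensor + tensor         # read feat + skip, write sum
    mul = 2 * tensor + tensor         # read sum + attn, write out
    unfused = conv + add + mul

    # Fused: read feat and skip once, keep intermediates on-chip, write out.
    fused = 2 * tensor + weights + tensor
    return unfused, fused


if __name__ == "__main__":
    unfused, fused = dram_traffic_bytes(32, 256, 256)
    print(f"unfused: {unfused / 1e6:.1f} MB")
    print(f"fused:   {fused / 1e6:.1f} MB")
    print(f"ratio:   {unfused / fused:.2f}x")
```

Under these assumptions the fused form moves roughly 3 tensor-sized transfers instead of 8, while the arithmetic is unchanged, which is consistent with the report's point that the bottleneck has shifted to memory bandwidth.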
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
