Neural Rendering Based 3D Modeling and Multi‑Video Visual Localization for E‑commerce
The paper presents Object Drawer, a Taobao Tech system for e-commerce 3D display. It combines neural rendering with a SuperPoint/SuperGlue-based SfM pipeline, enhanced by sparse sampling, loop constraints, frame-skipping matching, and a novel "2D matching, 3D solving" alignment across multiple videos. The system achieves a 99.3% visual-localization success rate and produces high-quality 3D reconstructions with pixel-accurate segmentation.
In October 2021, Taobao Tech released Object Drawer, a neural‑rendering based 3D modeling product that creates a 3D model from a single circular video of a commodity.
Accurate camera pose (translation and rotation) and foreground‑background pixel‑level segmentation are prerequisites for high‑quality reconstruction.
The visual localization task is formally Structure‑from‑Motion (SfM): given multiple view images, algorithms recover camera intrinsics, 6‑DoF poses, and a sparse point cloud. COLMAP is a widely used solution, but its success rate drops below 80% on weak or repetitive textures and fast camera motion.
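As a minimal illustration of the quantities SfM recovers, the sketch below (plain NumPy; the intrinsics values are made up for the example) projects a 3D point through a camera with intrinsics K and a 6-DoF pose (R, t) — the reprojection error of such projections against detected 2D features is what SfM minimizes:

```python
import numpy as np

def project(K, R, t, X):
    """Project a 3D point X into pixel coordinates given intrinsics K
    and a 6-DoF camera pose (rotation R, translation t)."""
    x_cam = R @ X + t          # world frame -> camera frame
    x_img = K @ x_cam          # camera frame -> homogeneous pixels
    return x_img[:2] / x_img[2]

# Toy example: identity pose, point on the optical axis at depth 2
K = np.array([[500., 0., 320.],
              [0., 500., 240.],
              [0., 0., 1.]])
R, t = np.eye(3), np.zeros(3)
uv = project(K, R, t, np.array([0., 0., 2.]))
print(uv)  # -> the principal point (320, 240)
```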
We replace COLMAP’s SIFT features and brute‑force matching with neural network features (SuperPoint & SuperGlue), which greatly improves robustness on weak‑texture and repetitive scenes.
Because SuperGlue matching is time-consuming (~50 ms per image pair) and not parallelizable, we adopt a sparse-sampling plus loop-enhancement strategy that prunes invalid pairs before matching. For a 400-image set this yields a 15× speedup, bringing total matching time down to roughly 4 minutes.
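A minimal sketch of such a pair-selection scheme for a circular capture (the window and loop-span sizes are illustrative assumptions, not the paper's values): each frame is matched only against its temporal neighbours, plus extra pairs tying the end of the sequence back to the start so the loop can close.

```python
def candidate_pairs(n_images, window=5, loop_span=10):
    """Sparse pair selection for a circular capture: sequential
    neighbour pairs plus start<->end loop-closure pairs."""
    pairs = set()
    for i in range(n_images):
        # Sequential pairs within a small temporal window
        for j in range(i + 1, min(i + 1 + window, n_images)):
            pairs.add((i, j))
    # Loop-closure pairs: first frames against last frames
    for i in range(loop_span):
        for j in range(n_images - loop_span, n_images):
            if i < j:
                pairs.add((i, j))
    return sorted(pairs)

pairs = candidate_pairs(400)
brute_force = 400 * 399 // 2
print(len(pairs), brute_force)  # far fewer pairs than exhaustive matching
```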
Symmetric objects cause false loop detections. We use a coarse‑to‑fine SfM pipeline: first compute a rough pose, then refine it with loop constraints.
Fast camera motion produces blurry frames; we add a frame‑skipping matching strategy for such cases.
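One common way to flag blurry frames is the variance of the image Laplacian; low variance suggests motion blur. The sketch below (pure NumPy; the threshold is an illustrative assumption) drops blurry frames so the matcher skips ahead to the next sharp one:

```python
import numpy as np

def blur_score(gray):
    """Variance of a discrete Laplacian -- a standard sharpness proxy."""
    lap = (-4.0 * gray[1:-1, 1:-1]
           + gray[:-2, 1:-1] + gray[2:, 1:-1]
           + gray[1:-1, :-2] + gray[1:-1, 2:])
    return lap.var()

def keep_sharp(frames, threshold=10.0):
    """Return indices of frames sharp enough to match
    (threshold value is an assumption for the toy example)."""
    return [i for i, f in enumerate(frames) if blur_score(f) > threshold]

# Toy check: a textured frame scores high, a featureless one scores zero
rng = np.random.default_rng(0)
sharp = rng.uniform(0, 255, (64, 64))
flat = np.full((64, 64), 128.0)
kept = keep_sharp([sharp, flat])
print(kept)  # -> [0]: the flat frame is discarded
```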
Turntable setups simplify capture but require filtering out points outside the turntable plane using planar tracking.
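The plane-filtering idea can be sketched as a least-squares plane fit followed by a distance test (SVD fit in NumPy; the tolerance is an illustrative assumption, and the paper's planar-tracking method is not reproduced here):

```python
import numpy as np

def fit_plane(points):
    """Fit a plane to Nx3 points by SVD: returns centroid and unit normal."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    return centroid, vt[-1]      # smallest singular vector = plane normal

def on_turntable(points, centroid, normal, tol=0.05):
    """Keep only points within `tol` of the turntable plane."""
    dist = np.abs((points - centroid) @ normal)
    return points[dist < tol]

# Toy cloud: 100 points on the z=0 plane plus two stray background points
rng = np.random.default_rng(1)
plane_pts = np.c_[rng.uniform(-1, 1, (100, 2)), np.zeros(100)]
cloud = np.vstack([plane_pts, [[0.0, 0.0, 1.0], [0.5, 0.5, -2.0]]])
c, n = fit_plane(plane_pts)
kept = on_turntable(cloud, c, n)
print(len(kept))  # -> 100: the two stray points are filtered out
```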
For full‑view reconstruction, we align poses from multiple videos. Simple foreground‑only merging degrades pose accuracy, and point‑cloud registration fails due to scale ambiguity and low overlap. We propose a “2D matching, 3D solving” framework that projects point clouds to images, matches features, counts 2D correspondences, and solves for scale s, rotation R, and translation T (7‑DoF).
The aligned poses enable high‑quality reconstruction of both front and bottom surfaces, achieving a visual‑localization success rate of 99.3% (up from 80% with open‑source methods).
Extending from 2 to 3 videos eliminates blind spots, providing complete coverage of the object.
We integrate the pipeline into Taobao’s 720° display chain, rendering offline results as image sequences shown on the product’s second main image.
To improve segmentation, we propose an end‑to‑end network that jointly optimizes image segmentation and neural rendering, yielding pixel‑level accurate masks even in complex backgrounds.
The 3D‑algorithm team behind Object Drawer has published papers at ICCV, NeurIPS, KDD, CVPR and contributed the 3D‑FRONT dataset.
Published by Taobao Tech, the official account of Taobao Technology.