Artificial Intelligence · 17 min read

How Kuaishou’s Y‑Tech Achieved Real‑Time 3D Photo Rendering on Any Smartphone

The article details Kuaishou Y‑Tech's end‑to‑end solution for converting a single RGB image into an interactive 3D photo on mobile devices, covering depth estimation, image inpainting, the custom KwaiNN inference engine, and real‑time 3D rendering techniques that run on all smartphone models without depth sensors.


3D Photo Overview

The Kuaishou Y‑Tech team proposes a method to transform a single RGB image into a dynamic 3D photo in real time on mobile devices, leveraging learning‑based depth estimation and image inpainting together with their proprietary KwaiNN inference engine and SKwai 3D effects engine.

Algorithm Framework Overview

Generating a 3D photo requires accurate scene depth, occlusion handling, and efficient mobile execution. The main challenges are (1) universal scene depth estimation that preserves facial detail and overall scene geometry, (2) high‑quality image and depth inpainting for large occluded regions, and (3) real‑time performance across diverse phone hardware.

General scene depth estimation: produce high‑quality depth maps for indoor and outdoor scenes, balancing facial fidelity and overall scene scale.

Universal image repair: recover missing regions of arbitrary size with high visual fidelity.

Reconstruction and rendering: rebuild the scene, design camera trajectories, and render new views.

Mobile real‑time operation: ensure all modules run efficiently on the device.

Core Steps

1. Predict portrait segmentation and monocular depth using custom models; refine facial depth with a dedicated 3D face reconstruction pipeline and fuse it with the scene depth.

2. Apply portrait‑aware image inpainting to synthesize background content for occluded areas, then use Poisson diffusion to fill the missing depth.

3. Reconstruct foreground and background meshes, generate continuous virtual camera paths, and render new views with the 3D graphics engine.
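The article does not publish the fusion step's implementation. As a rough illustration of step 1, the sketch below aligns a face‑reconstruction depth map to the scene depth inside the portrait mask (least‑squares scale/offset) and then alpha‑blends the two maps so the transition at the mask boundary stays smooth. The function name and interface are assumptions for illustration, not Kuaishou's API.

```python
import numpy as np

def fuse_face_depth(scene_depth, face_depth, face_mask):
    # Hypothetical fusion: face_depth comes from a 3D face model and is
    # only scale/offset-consistent with scene_depth, so first align it
    # inside the portrait mask with a least-squares linear fit, then
    # alpha-blend using the (soft) mask as the blend weight.
    m = face_mask > 0.5
    scale, offset = np.polyfit(face_depth[m], scene_depth[m], 1)
    aligned = scale * face_depth + offset
    return face_mask * aligned + (1.0 - face_mask) * scene_depth
```

With a soft (feathered) mask the same formula produces a gradual transition between face and scene depth rather than a hard seam.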

Monocular Depth Estimation

A U‑shaped encoder‑decoder network with skip connections extracts semantic and spatial features. Global context blocks (GCB) recalibrate channel features, and a spatial attention block (SAB) modulates local region weights. Multi‑task training jointly learns depth, surface normals, and portrait segmentation, improving both scene and facial depth accuracy.

Monocular depth estimation model
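The article does not give the exact GCB design. The snippet below sketches the general channel‑recalibration idea in NumPy, in the squeeze‑and‑excitation style: pool a global context vector, pass it through a small bottleneck, and gate each channel with a sigmoid weight. Weight shapes and names are assumptions.

```python
import numpy as np

def global_context_block(feat, w1, w2):
    # feat: (C, H, W) feature map. Squeeze a per-channel context vector,
    # run it through a ReLU bottleneck, and rescale every channel with
    # a sigmoid gate in (0, 1) -- channel-wise recalibration.
    context = feat.mean(axis=(1, 2))              # (C,) global average pool
    hidden = np.maximum(0.0, w1 @ context)        # (C // r,) bottleneck
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # (C,) sigmoid gate
    return feat * gate[:, None, None]
```

The spatial attention block (SAB) is the complementary operation: instead of one weight per channel, it predicts one weight per spatial location and modulates local regions.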

Image and Depth Repair

Portrait segmentation isolates the subject, after which a custom inpainting model restores occluded background pixels. Poisson diffusion then propagates depth values into the repaired region, yielding separate foreground/background layers with consistent depth maps.

Image and depth repair pipeline
Image and depth repair pipeline
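Poisson diffusion of depth into the inpainted region can be understood as solving Laplace's equation over the hole with the known depth as the boundary condition. The toy NumPy version below (hypothetical name, Jacobi iteration, holes assumed away from the image border) illustrates the idea, not the production implementation.

```python
import numpy as np

def poisson_fill_depth(depth, hole, iters=2000):
    # hole: boolean mask, True where depth is missing. Each iteration
    # replaces every hole pixel with the mean of its 4 neighbours while
    # known pixels stay fixed -- a Jacobi solve of Laplace's equation,
    # so depth values diffuse smoothly into the hole from its boundary.
    d = depth.astype(np.float64).copy()
    d[hole] = d[~hole].mean()                     # neutral initial guess
    for _ in range(iters):
        nbr_mean = (np.roll(d, 1, 0) + np.roll(d, -1, 0) +
                    np.roll(d, 1, 1) + np.roll(d, -1, 1)) / 4.0
        d[hole] = nbr_mean[hole]
    return d
```

Because the update only rewrites hole pixels, the known foreground/background depth acts as a fixed Dirichlet boundary, which is what keeps the filled layer consistent with the surrounding depth map.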

3D Scene Reconstruction and Rendering

Using the fused depth and repaired background, the system performs adaptive foreground‑background mesh reconstruction. The reconstructed data is fed to the proprietary 3‑D graphics engine, enabling smooth camera motions, gyroscope‑controlled view changes, and optional visual effects such as particles, rain, and atmospheric fog.

3D scene reconstruction
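As a minimal illustration of turning a depth map into renderable geometry, the sketch below back‑projects every pixel through a pinhole camera model and triangulates the regular pixel grid. The adaptive foreground/background splitting described above is omitted, and the function name and intrinsics are assumptions.

```python
import numpy as np

def depth_to_mesh(depth, fx, fy, cx, cy):
    # Back-project each pixel of the (H, W) depth map into a 3D vertex
    # with pinhole intrinsics (fx, fy, cx, cy), then triangulate the
    # grid: two triangles per 2x2 cell of neighbouring pixels.
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    vertices = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    idx = np.arange(h * w).reshape(h, w)
    tl, tr = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    bl, br = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    faces = np.concatenate([np.stack([tl, bl, tr], axis=1),
                            np.stack([tr, bl, br], axis=1)])
    return vertices, faces
```

Rendering this mesh from a slightly translated virtual camera is what produces the parallax effect; the camera trajectory and effects layers are handled by the 3D engine.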

Mobile KwaiNN Inference Engine

KwaiNN, the upgraded version of YCNN, is a mobile‑first AI inference engine that supports CPUs, GPUs (Mali, Adreno, Apple, NVIDIA), and NPUs (Apple Bionic, Huawei HiAI, Qualcomm SNPE, MediaTek APU). It handles CNN/RNN models in float32, float16, and uint8 precision, with hardware‑specific operators (Metal, OpenCL, NEON) and a full toolchain for PyTorch/TFLite conversion, quantization, and architecture search, delivering a roughly 10% performance advantage over competing engines.

KwaiNN inference engine
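The article does not describe KwaiNN's quantization internals; uint8 inference engines generally use an affine (scale plus zero‑point) scheme, sketched generically below. This is not KwaiNN's actual code.

```python
import numpy as np

def quantize_uint8(w):
    # Affine uint8 quantization: map the observed float range [lo, hi]
    # onto [0, 255] with a scale and an integer zero point, so that
    # real_value ~= scale * (q - zero_point).
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize_uint8(q, scale, zero_point):
    # Recover an approximation of the original floats.
    return scale * (q.astype(np.float32) - zero_point)
```

Storing weights and activations as uint8 quarters the memory traffic versus float32, which is typically where the mobile speedup comes from.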

Conclusion

The presented 3D Photo pipeline combines high‑quality monocular depth estimation, robust image/depth inpainting, and an optimized KwaiNN inference stack to deliver the first real‑time mobile 3D photo experience that works on virtually all smartphones without requiring dedicated depth sensors.

Tags: mobile AI, real-time rendering, image inpainting, monocular depth estimation, 3D Photo, KwaiNN
Written by Kuaishou Large Model (Official Kuaishou Account)