Neural Radiance Fields and Generative Intelligent Media: Recent Advances and Applications
Professor Hu Qiang presented recent progress in Neural Radiance Fields, covering implicit and explicit representations, hybrid models, solutions for dynamic scenes, and cloud-based and edge-cloud rendering. He also reviewed generative-AI advances such as diffusion-based text-to-image, text-to-video, and text-to-3D generation, LoRA fine-tuning, and a large-scale story-book dataset, highlighting applications in virtual-real content creation, smart-city modeling, and 6-DoF e-commerce displays.
During Bilibili's 1024 Programmer Festival, a technical sharing session was held featuring several heavyweight guests. Professor Hu Qiang from Shanghai Jiao Tong University's Future Media Network Collaborative Innovation Center delivered a talk on the latest progress in neural radiance fields (NeRF) and generative intelligent media.
Professor Hu introduced the concept of free‑viewpoint scenes and explained that such scenes rely on multi‑view video capture and graphics rendering to allow users to interactively explore virtual environments. He described two main representation methods: explicit (images, videos, meshes, point clouds, volumes, octrees) and implicit (SDF, NeRF, neural light fields), as well as hybrid approaches such as PlenOctrees, Plenoxels, TensoRF, and Instant‑NGP.
NeRF, introduced in an award-winning ECCV 2020 paper, solves novel view synthesis by representing a scene as an implicit neural field. A 5-D input (a 3-D spatial coordinate plus a 2-D viewing direction) is fed into a multilayer perceptron (MLP) that outputs color and volume density, and these outputs are composited via volumetric rendering to produce images. The training loss compares rendered images against the original captured views, so after training the model can synthesize novel viewpoints.
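The volumetric compositing step can be sketched in a few lines. This is a minimal NumPy illustration of the standard alpha-compositing formula for one ray, not the talk's actual implementation; the sample colors, densities, and step sizes are made up for the example:

```python
import numpy as np

def composite(colors, sigmas, deltas):
    """Alpha-composite per-sample colors along a single ray.

    colors: (N, 3) RGB values predicted by the MLP at each sample
    sigmas: (N,) volume densities predicted at each sample
    deltas: (N,) distances between adjacent samples along the ray
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)        # opacity of each segment
    trans = np.cumprod(1.0 - alphas + 1e-10)       # accumulated transmittance
    trans = np.concatenate([[1.0], trans[:-1]])    # shift so T_1 = 1
    weights = trans * alphas                       # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)

# Toy ray with three samples: a nearly opaque red sample in front
# should dominate the final pixel color.
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
sigmas = np.array([50.0, 1.0, 1.0])   # first sample is nearly opaque
deltas = np.array([0.1, 0.1, 0.1])
pixel = composite(colors, sigmas, deltas)
```

Because compositing is differentiable, the photometric loss against captured views flows back through these weights into the MLP's parameters.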
Key factors behind NeRF's high‑quality rendering include positional encoding that maps coordinates to a high‑frequency space, continuous 3D representation that captures fine details, the ability to render from arbitrary viewpoints, and a coarse‑to‑fine hierarchical sampling strategy that reduces computational load.
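The positional encoding mentioned above maps each coordinate to sinusoids at exponentially growing frequencies, which lets a small MLP represent high-frequency detail. A minimal NumPy sketch (the input coordinates here are illustrative):

```python
import numpy as np

def positional_encoding(p, num_freqs=10):
    """Map coordinates to sin/cos features at frequencies 2^0 .. 2^(L-1).

    p: (..., D) coordinates, assumed normalized to roughly [-1, 1]
    returns: (..., D * 2 * num_freqs) encoded features
    """
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi  # 2^k * pi
    scaled = p[..., None] * freqs                  # (..., D, L)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)

xyz = np.array([[0.5, -0.2, 0.1]])                 # one 3-D sample point
feat = positional_encoding(xyz, num_freqs=10)
# 3 coordinates * 2 (sin/cos) * 10 frequencies = 60 features
```

In the original NeRF, spatial coordinates and viewing directions are encoded with different numbers of frequencies before entering the MLP.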
Despite its strengths, NeRF faces challenges: handling dynamic content and complex reflections, the massive data size of 3D scenes, and the heavy computational resources required for training and inference.
To address these issues, recent research explores: (1) end‑to‑end neural rendering pipelines that generate high‑quality free‑viewpoint images from low‑resolution point clouds; (2) cloud‑based rendering where NeRF representations are processed on powerful servers and streamed to end‑users via low‑latency WebRTC; (3) dynamic scene modeling using key‑frame plus residual encoding (DCT, quantization, entropy coding) to enable compact, streamable neural radiance fields; and (4) edge‑cloud collaborative adaptive coding to improve quality, latency, bandwidth, and compute efficiency for real‑time free‑viewpoint video.
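The key-frame-plus-residual idea in (3) can be sketched as a toy round trip: transform the residual against a key frame with a DCT, quantize the coefficients, and invert at the decoder. This is a hypothetical, self-contained illustration on raw 2-D arrays; real streamable neural radiance fields apply such coding to learned feature grids, and the quantization step here is made up:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (inverse is its transpose)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    m[0] *= 1.0 / np.sqrt(2.0)
    return m * np.sqrt(2.0 / n)

def encode_residual(frame, keyframe, step=0.05):
    """2-D DCT of the key-frame residual, then uniform quantization."""
    d = dct_matrix(frame.shape[0])
    coeffs = d @ (frame - keyframe) @ d.T
    return np.round(coeffs / step).astype(np.int32)  # integers for entropy coding

def decode_residual(q, keyframe, step=0.05):
    """Dequantize, inverse-transform, and add back the key frame."""
    d = dct_matrix(keyframe.shape[0])
    return keyframe + d.T @ (q.astype(np.float64) * step) @ d

rng = np.random.default_rng(0)
keyframe = rng.random((8, 8))
frame = keyframe + 0.1 * rng.random((8, 8))  # small temporal change
q = encode_residual(frame, keyframe)
rec = decode_residual(q, keyframe)
err = np.abs(rec - frame).max()              # bounded by the quantization step
```

Because consecutive frames differ little, the residual's quantized coefficients are mostly near zero, which is what makes entropy coding effective afterward.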
The second part of the talk shifted to generative AI for media. Professor Hu highlighted the rapid growth of AIGC (AI‑generated content) across text, image, and video domains, mentioning models such as DALL·E‑2, Midjourney, VideoGPT, and Stable Diffusion.
He explained text‑to‑image generation: a prompt is encoded by a CLIP model, then a diffusion process adds and removes noise to synthesize an image, with Stable Diffusion being a prominent open‑source example.
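The "adds noise" half of the diffusion process has a closed form: the clean image is scaled down while Gaussian noise is scaled up according to a schedule. A minimal NumPy sketch with a DDPM-style linear schedule (the image here is random data, standing in for an encoded prompt-conditioned sample):

```python
import numpy as np

def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0): scale the clean input, add Gaussian noise."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return xt, noise

# Linear beta schedule; alpha_bar_t is the cumulative product of (1 - beta).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.random((8, 8))
xt, eps = forward_diffuse(x0, t=T - 1, alpha_bar=alpha_bar, rng=rng)
# At the final step alpha_bar is tiny, so x_t is almost pure noise.
```

Generation runs this in reverse: a trained network predicts the added noise at each step and gradually removes it, conditioned on the CLIP text embedding.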
Text‑to‑video generation remains challenging due to limited datasets, short video lengths, and difficulties preserving temporal consistency. Current approaches extend image diffusion models frame‑by‑frame, but quality and coherence are still limited.
In the realm of text‑to‑3D, DreamFusion (Google + UC Berkeley) combines NeRF with diffusion models: a randomly initialized NeRF is rendered, noise is added, and a frozen diffusion model evaluates the result against the textual description, providing a loss that updates the NeRF parameters.
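DreamFusion's loop can be sketched as Score Distillation Sampling: noise the rendered image, ask the frozen diffusion model to predict the noise, and use the prediction error as a gradient on the rendering. The sketch below is a toy illustration; `denoiser` is a placeholder for the frozen text-conditioned model, and the weighting choice is one common option, not necessarily the talk's:

```python
import numpy as np

def sds_gradient(rendered, t, alpha_bar, denoiser, rng):
    """One Score Distillation Sampling step (toy sketch).

    rendered: image produced by rendering the current NeRF
    denoiser: frozen model predicting the noise eps_hat(x_t, t)
    The returned gradient w.r.t. the rendered pixels is backpropagated
    through the differentiable renderer into the NeRF parameters.
    """
    eps = rng.standard_normal(rendered.shape)
    xt = np.sqrt(alpha_bar[t]) * rendered + np.sqrt(1.0 - alpha_bar[t]) * eps
    eps_hat = denoiser(xt, t)
    w = 1.0 - alpha_bar[t]          # one common weighting choice
    return w * (eps_hat - eps)      # push renders toward the text-conditioned score

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
rng = np.random.default_rng(0)
img = rng.random((4, 4))
denoiser = lambda xt, t: np.zeros_like(xt)  # stand-in for the frozen model
grad = sds_gradient(img, t=500, alpha_bar=alpha_bar, denoiser=denoiser, rng=rng)
```

The key point is that the diffusion model's weights never change; only the NeRF receives gradients, which is how a 2-D prior supervises a 3-D representation.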
Professor Hu also presented the team’s work on diffusion‑based face restoration for old video footage, using a two‑stage pipeline: first a diffusion model denoises the input, then a specialized enhancement module refines facial details.
To adapt large diffusion models such as Stable Diffusion to specific tasks with limited data, the team employed LoRA (Low-Rank Adaptation), which keeps the pretrained weights W frozen and trains only two low-rank matrices A and B whose product is added to them (W + AB), achieving efficient fine-tuning for super-resolution and other downstream tasks. (LoRA was originally demonstrated on the 175 B-parameter GPT-3; Stable Diffusion itself is around one billion parameters.)
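The low-rank trick is small enough to show directly. A minimal NumPy sketch, with illustrative dimensions and the zero-initialization commonly used so the adapter starts as a no-op:

```python
import numpy as np

def lora_forward(x, W, A, B, scale=1.0):
    """Forward pass with a LoRA adapter: y = x @ (W + scale * A @ B).

    W (d_in, d_out) stays frozen; only A (d_in, r) and B (r, d_out)
    are trained. Computing (x @ A) @ B avoids ever materializing the
    full-rank update A @ B during training.
    """
    return x @ W + scale * (x @ A) @ B

d_in, d_out, r = 64, 64, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((d_in, d_out))      # frozen pretrained weights
A = rng.standard_normal((d_in, r)) * 0.01   # trainable, low rank
B = np.zeros((r, d_out))                    # zero init: adapter starts inert
x = rng.standard_normal((2, d_in))
y = lora_forward(x, W, A, B)
# With B = 0 the output equals the frozen model's output exactly.

# Trainable parameters: 64*4 + 4*64 = 512, versus 64*64 = 4096 in W.
```

For a 175 B-parameter model the same ratio is what makes fine-tuning feasible: the trainable fraction shrinks to a tiny fraction of a percent, and the adapter can be merged into W after training at no inference cost.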
They also built a large‑scale story‑book dataset (≈2 000 books, 30 k image‑text pairs) to explore text‑guided story illustration generation, emphasizing the importance of data scale and quality in the era of foundation models.
During the Q&A, Professor Hu outlined practical applications of NeRF technology, including virtual‑real content creation, large‑scale city modeling for smart‑city initiatives, and 6‑DoF product displays on e‑commerce platforms. Existing deployments mentioned were Shanghai AI Lab’s city‑scale NeRF model “Shusheng·Tianji”, the University of Science and Technology of China’s NeuDim cloud‑reconstruction platform, and Luma AI’s 3D reconstruction service. He also noted remaining challenges: massive compute requirements, multi‑view data acquisition costs, model size, handling dynamic scenes, and achieving real‑time performance.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.