FuseAnyPart: Diffusion‑Driven Facial Parts Swapping via Multiple Reference Images
FuseAnyPart is a diffusion-model-based facial part swapping technique that fuses features from multiple reference images through mask-based fusion and additive injection modules. It delivers high-fidelity, consistent face edits at lower computational cost, outperforms prior methods on CelebA-HQ and FaceForensics++, and is already boosting commercial AIGC applications.
This article presents FuseAnyPart, a novel diffusion‑model‑based method for facial part swapping that can seamlessly fuse features from multiple reference images to generate high‑fidelity face images. The method was accepted as a Spotlight paper at NeurIPS 2024.
Existing facial swapping techniques either focus on whole‑face replacement or suffer from high computational cost and poor control when handling multiple reference images. Key challenges include efficient multi‑reference feature fusion, maintaining consistency among swapped parts, and preserving high visual fidelity.
FuseAnyPart addresses these challenges with two core modules: a mask‑based fusion module and an addition‑based injection module. The fusion module first employs an open‑set detector (e.g., Grounding‑DINO) to obtain masks for facial organs (eyes, nose, mouth). Features of the masked regions are extracted from a pretrained image encoder (e.g., CLIP) at the penultimate layer, preserving rich spatial details. These region features are then spliced in the latent space of the diffusion model, replacing the corresponding target regions.
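The mask-based splicing step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `fuse_parts`, the array shapes, and the assumption of pre-aligned binary masks are all hypothetical; in the actual method, masks come from an open-set detector and features from a CLIP encoder.

```python
import numpy as np

def fuse_parts(target_feats, ref_feats, masks):
    """Splice masked reference features into the target feature map.

    target_feats: (H, W, C) feature map of the target face.
    ref_feats:    list of (H, W, C) feature maps, one per reference image.
    masks:        list of (H, W) binary masks (eyes, nose, mouth, ...),
                  aligned with ref_feats.
    """
    fused = target_feats.copy()
    for feats, mask in zip(ref_feats, masks):
        m = mask[..., None]                   # broadcast mask over channels
        fused = fused * (1 - m) + feats * m   # replace the masked region
    return fused

# Toy example: a 4x4 feature map with 2 channels.
H, W, C = 4, 4, 2
target = np.zeros((H, W, C))
ref = np.ones((H, W, C))
mask = np.zeros((H, W))
mask[1:3, 1:3] = 1.0  # pretend this 2x2 block is the "nose" region

fused = fuse_parts(target, [ref], [mask])
```

Because each part is composited with its own mask, parts from different references can be mixed in a single pass without interfering outside their regions.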
The injection module directly adds the fused feature map to the latent features inside the UNet of the diffusion model, bypassing cross‑attention mechanisms. This additive injection retains spatial alignment, reduces computational overhead, and requires fewer additional parameters. A linear layer with a learnable weight factor adapts the injected features to match the UNet’s latent dimensions.
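A sketch of this additive injection, under stated assumptions: the class name, the zero-initialized scalar, and the flattened `(H*W, dim)` layout are illustrative choices, not details confirmed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

class AdditiveInjection:
    """Linear projection plus a learnable scalar weight, added directly
    to the UNet latent (no cross-attention). Shapes and zero init of the
    scale are assumptions for this sketch."""

    def __init__(self, in_dim, latent_dim):
        self.W = rng.normal(scale=0.02, size=(in_dim, latent_dim))
        self.b = np.zeros(latent_dim)
        self.alpha = 0.0  # learnable scale; zero init leaves the UNet untouched

    def __call__(self, latent, fused):
        # latent: (H*W, latent_dim); fused: (H*W, in_dim)
        return latent + self.alpha * (fused @ self.W + self.b)

latent = rng.normal(size=(16, 8))
fused = rng.normal(size=(16, 4))
inject = AdditiveInjection(in_dim=4, latent_dim=8)
out = inject(latent, fused)
```

With the scale at zero, the module is an identity at initialization, so training can gradually learn how strongly to inject reference features; the only extra parameters are the linear layer and one scalar.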
Experiments were conducted on CelebA-HQ (training) and FaceForensics++ (evaluation). Quantitative results show that FuseAnyPart achieves superior Fréchet Inception Distance (FID) and organ-similarity (OSim) scores compared with prior methods such as FacePartsSwap, E4S, and DiffSwap. Qualitative comparisons demonstrate more natural and coherent results for single-part, multi-part, and cross-style swaps, as well as robust performance across different races and ages.
The technique is already applied in commercial AIGC scenarios (e.g., virtual‑human generation for marketing videos), yielding noticeable improvements in click‑through rates. Future work includes fine‑tuning on domain‑specific datasets and expanding the generated face database for virtual‑human, image‑to‑image, and video‑to‑video applications.
DaTaobao Tech (official account of DaTaobao Technology)