FitDiff: Robust monocular 3D facial shape and reflectance estimation using Diffusion Models

Imperial College London, UK

TL;DR: We introduce FitDiff:
    🔥 A multi-modal diffusion-based generative model that jointly produces facial geometry and appearance (Diffuse Albedo, Specular Albedo, and Normal Maps).
    🔥 The first diffusion model conditioned on identity embeddings acquired from an off-the-shelf face recognition network, introducing a SPADE-conditioned UNet architecture.
    🔥 Through a novel guidance algorithm, it achieves accurate facial identity reconstruction.

FitDiff, a versatile multi-modal diffusion model, produces relightable facial avatars that seamlessly integrate into various commercial rendering platforms.



Given "in-the-wild" facial images, FitDiff reconstructs facial avatars consisting of facial shape and reflectance.

Abstract

In this work, we present FitDiff, a diffusion-based 3D facial avatar generative model. Leveraging diffusion principles, our model accurately generates relightable facial avatars, utilizing an identity embedding extracted from an "in-the-wild" 2D facial image.

The introduced multi-modal diffusion model is the first to concurrently output facial reflectance maps (diffuse and specular albedo and normals) and shapes, showcasing great generalization capabilities. It is solely trained on an annotated subset of a public facial dataset, paired with 3D reconstructions. We revisit the typical 3D facial fitting approach by guiding a reverse diffusion process using perceptual and face recognition losses.

Being the first 3D LDM conditioned on face recognition embeddings, FitDiff reconstructs relightable human avatars that can be used as-is in common rendering engines, starting only from an unconstrained facial image, and achieves state-of-the-art performance.
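
For concreteness, the sketch below shows one common way an identity embedding can be obtained for conditioning: an aligned face crop is passed through a face recognition backbone and the normalized feature vector is kept. The FaceEmbedder module, its 512-dimensional output, and the 112x112 input size are illustrative placeholders, not FitDiff's released code.

    import torch
    import torch.nn as nn

    class FaceEmbedder(nn.Module):
        """Stand-in for an off-the-shelf face recognition backbone
        (e.g. an ArcFace-style network); the architecture and the
        512-d embedding size are illustrative assumptions."""
        def __init__(self, embed_dim: int = 512):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, embed_dim),
            )

        def forward(self, image: torch.Tensor) -> torch.Tensor:
            # Face recognition nets typically emit L2-normalized embeddings
            return nn.functional.normalize(self.backbone(image), dim=-1)

    embedder = FaceEmbedder()
    face_crop = torch.randn(1, 3, 112, 112)   # an aligned facial crop
    identity = embedder(face_crop)            # (1, 512) conditioning vector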


Method


Starting from Gaussian noise, FitDiff concurrently generates facial shape and reflectance maps (diffuse albedo, specular albedo and normals), conditioned on an identity embedding vector. During sampling, a novel guidance algorithm is applied for further control of the resulting facial avatar. Z_T, Z_k, and Z_{k-1} are visualized in image space for illustration purposes.
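
As an illustration of the guidance step described above, here is a minimal classifier-guidance-style reverse loop in PyTorch: at each step the predicted clean latent is rendered, compared against the target identity embedding, and the gradient of that loss nudges the update. The denoiser, render, and embedder callables, their signatures, and the cosine identity loss are assumptions standing in for FitDiff's actual networks and its perceptual and face recognition losses.

    import torch

    def guided_sampling(denoiser, embedder, render, betas, identity,
                        shape, scale=1.0):
        """Classifier-guidance-style reverse diffusion (a sketch, not
        FitDiff's released code): each step nudges the latent toward
        samples whose rendered avatar matches the target identity."""
        alphas = 1.0 - betas
        alpha_bar = torch.cumprod(alphas, dim=0)
        z = torch.randn(shape)                         # z_T ~ N(0, I)
        for k in reversed(range(len(betas))):
            z = z.detach().requires_grad_(True)
            eps = denoiser(z, k, identity)             # predicted noise
            # Tweedie estimate of the clean latent z_0 from z_k
            z0 = (z - (1 - alpha_bar[k]).sqrt() * eps) / alpha_bar[k].sqrt()
            # Identity guidance: embed the rendered avatar and compare it
            # with the target face recognition embedding
            loss = 1 - torch.cosine_similarity(embedder(render(z0)),
                                               identity).mean()
            grad = torch.autograd.grad(loss, z)[0]
            # Standard DDPM posterior mean, shifted by the guidance gradient
            mean = (z - betas[k] / (1 - alpha_bar[k]).sqrt() * eps) \
                   / alphas[k].sqrt()
            noise = torch.randn_like(z) if k > 0 else torch.zeros_like(z)
            z = (mean - scale * grad + betas[k].sqrt() * noise).detach()
        return z

Evaluating the identity loss on the Tweedie estimate of the clean latent, rather than on the noisy intermediate, lets the guidance compare a plausible avatar against the target at every step.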

Unconditional Sampling


FitDiff can also generate diverse facial identities without any input image. These assets are useful across a range of applications, including augmenting and enriching existing datasets, as well as creating entirely random identities for digital applications.
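
Under the same assumptions as the guided sketch above, unconditional generation reduces to a plain DDPM reverse loop from Gaussian noise with the guidance term dropped. Whether FitDiff uses a null embedding, a randomly drawn one, or another mechanism is not stated here, so treat the zero-vector identity argument as a placeholder.

    import torch

    def unconditional_sampling(denoiser, betas, shape, identity):
        """Plain DDPM reverse loop with the guidance term dropped: each
        call starts from fresh Gaussian noise, so each call produces a
        different random avatar."""
        alphas = 1.0 - betas
        alpha_bar = torch.cumprod(alphas, dim=0)
        z = torch.randn(shape)
        for k in reversed(range(len(betas))):
            eps = denoiser(z, k, identity)
            mean = (z - betas[k] / (1 - alpha_bar[k]).sqrt() * eps) \
                   / alphas[k].sqrt()
            noise = torch.randn_like(z) if k > 0 else torch.zeros_like(z)
            z = mean + betas[k].sqrt() * noise
        return z

    # Hypothetical usage: a zero vector stands in for "no identity"
    betas = torch.linspace(1e-4, 0.02, 1000)
    # avatar = unconditional_sampling(denoiser, betas, (1, 4, 64, 64),
    #                                 identity=torch.zeros(1, 512))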

BibTeX


    @InProceedings{Galanakis_2025_WACV,
      author    = {Galanakis, Stathis and Lattas, Alexandros and Moschoglou, Stylianos and Zafeiriou, Stefanos},
      title     = {FitDiff: Robust Monocular 3D Facial Shape and Reflectance Estimation using Diffusion Models},
      booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
      month     = {February},
      year      = {2025},
      pages     = {992-1004}
    }