In this work, we present FitDiff, a diffusion-based 3D facial avatar generative model. Leveraging
diffusion principles, our model accurately generates relightable facial avatars, utilizing an identity
embedding extracted from an “in-the-wild” 2D facial image.
The introduced multi-modal diffusion model is the first to concurrently output facial reflectance maps
(diffuse and specular albedo and normals) and shapes, showcasing great generalization capabilities.
It is trained solely on an annotated subset of a public facial dataset, paired with 3D reconstructions.
We revisit the typical 3D facial fitting approach by guiding a reverse diffusion process using perceptual
and face recognition losses.
As the first 3D latent diffusion model (LDM) conditioned on face recognition embeddings, FitDiff reconstructs
relightable human avatars that can be used as-is in common rendering engines, starting from only an
unconstrained facial image, and achieves state-of-the-art performance.