Action Unit-Based 3D Face Reconstruction Using Transformers

doi:10.21203/rs.3.rs-4310180/v1

2024 · doi:10.21203/rs.3.rs-4310180/v1

preprint OA: closed


Research Article

Hyeonjin Kim, Pei Wang, Hyukjoon Lee

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-4310180/v1. This work is licensed under a CC BY 4.0 License. Status: Posted (Version 1).

Abstract

The reconstruction of 3D face shapes and expressions from single 2D images remains an open problem due to the lack of detailed modeling of human facial movements, such as the correlation between different parts of the face. Facial action units (AUs), which provide a detailed taxonomy of human facial movements based on the observed activation of muscles or muscle groups, can be used to model various facial expression types.
We present a novel 3D face reconstruction framework called AU feature-based 3D FAce Reconstruction using Transformer (AUFART) that, given a single monocular 2D image, generates a 3D face model that is responsive to AU activation and thereby captures expressions. AUFART leverages AU-specific features as well as global facial features to achieve accurate 3D reconstruction of facial expressions using transformers. We also introduce a loss function that drives learning toward minimal discrepancy in AU activations between the input and the rendered reconstruction. The proposed framework achieves an average F1 score of 0.39, outperforming state-of-the-art methods.

Keywords: 3D face reconstruction, Facial action unit, Transformer, Deep learning

1 Introduction

In recent years, rapid advances in deep learning technology have led to numerous innovations in computer vision and graphics research. 3D face reconstruction from 2D images has received a tremendous amount of attention in computer vision and has made major progress thanks to the highly accurate modeling capability of deep learning. 3D face reconstruction enables a wide range of applications such as speech-driven 3D facial animation, 3D avatar generation, virtual makeup, performance capture, virtual and augmented reality, and human-robot interaction [2–7]. Most existing studies use pre-computed 3D morphable models (3DMMs) with prior knowledge about facial geometry and appearance to improve the accuracy and fidelity of 3D face reconstruction [8, 9]. Recent studies utilize deep learning frameworks based on self-supervised learning to predict 3DMM parameters from input images. They can create plausible 3D faces without ground-truth 3D facial scan data by employing various loss functions, such as the landmark reprojection loss, photometric loss, and face recognition loss, to train the deep neural networks [1, 10–13].
Recently, various new loss functions and architectures have been introduced to address the limitations of existing methods with respect to the reconstruction accuracy of rich and detailed facial expressions [12, 13, 46, 47]. In particular, the method of capturing emotions and reconstructing them into 3D faces demonstrates notable efficacy [12]. In contrast, the Facial Action Coding System (FACS) describes a taxonomy of AUs for encoding facial movements and expressions, based on the observation of muscle activations [15]. It has been observed that existing 3D face reconstruction methods handle emotions well, while their performance in encoding AUs is comparatively modest [48]. A number of studies have emphasized the importance of utilizing AUs in the process of 3D face reconstruction [46, 47]. However, they do not explicitly consider the correlations between AUs occurring in the frame-based reconstruction process, and they require AU labels during training, so their performance in in-the-wild scenarios is not guaranteed.

In this paper, we leverage AU features extracted from in-the-wild images in the frame-based reconstruction process. Our approach enables accurate 3D face reconstruction while accounting for AUs, by utilizing a Transformer to model the correlations between AUs within frames. The correlation between AUs is an important factor to model, since human facial expressions are generally formed by multiple AUs. Therefore, a proper method of modeling and leveraging this correlation on top of global facial features, rather than the straightforward use of information about individual AUs, may play a crucial role in reconstructing accurate facial expressions. We propose AUFART (AU feature-based 3D FAce Reconstruction with Transformer), which enables detailed modeling of various facial expression types based on AU information for 3D face reconstruction.
Unlike existing methods that use only global facial features generated from the face in an image by an encoder network, our method can enhance the performance of the 3D face reconstruction model by providing a richer representation of subtle details in facial expressions. A transformer-based 3D face reconstruction model is used to take advantage of the AU-specific features as well as the relationships between these features through the cross-attention mechanism. Several novel AU-based loss functions are also proposed. The reconstructed 3D faces generated by our method are found to be more responsive to the activated AUs in input images.

In summary, our proposed framework comprises three key contributions:
(i) We propose a Transformer-based 3D face reconstruction framework that leverages the features of AUs in the frame-based 3D face reconstruction process, explicitly considering their correlations.
(ii) We integrate a state-of-the-art AU feature extraction module for effective AU feature extraction from in-the-wild images, along with a Transformer model for reconstructing 3D faces from these features. This integration enables high-accuracy facial reconstruction even in diverse environmental conditions and allows modeling of challenging correlations among less easily captured AUs.
(iii) To ensure precise 3D restoration of AU information, we design an AU-based loss function for training our proposed 3D face reconstruction framework.

2 Related Works

2.1 3D Morphable Models

3DMMs are statistical models capable of capturing and representing various facial changes in a low-dimensional space. These models are built from a vast amount of 3D facial scan data. Blanz and Vetter described a method for reconstructing a 3D face from a single image with a pre-computed 3DMM in an analysis-by-synthesis fashion [8].
While the traditional 3DMM is based on Principal Component Analysis (PCA) for facial shape, more recent models such as FLAME, the Basel Face Model, and FaceWarehouse have separated shape, expression, and appearance spaces, enabling richer representations [8, 9]. FLAME is trained on 33,000 scans and represents shape, pose, and expression parameters in well-separated spaces through an effective parameter separation process. FLAME consists of a template mesh, shape blendshapes, pose blendshapes, and expression blendshapes. Each blendshape is composed of displacements from the template mesh, with PCA applied to shape and expression. An iterative optimization approach was used to separate the spaces of each parameter during the model training phase. As a result, FLAME has made 3D facial reconstruction more accurate and manageable than other 3DMMs. For this reason, FLAME is widely used as a powerful and expressive tool for modeling facial geometry and expressions in many research works involving 3D faces, including ours.

2.2 3D Face reconstruction

The popularity of deep learning-based methods that directly learn the mapping between 2D images and 3D face models has grown rapidly over the last few years [10]. Early deep learning-based 3D face reconstruction methods faced challenges related to datasets and training strategies. A large number of 3D facial scans corresponding to 2D images had to be collected to train a deep learning-based model, which incurred a large amount of labor and cost. Self-supervised learning frameworks that try to minimize the difference between input images and rendered images have been proposed to address this issue. They utilize a differentiable rendering layer to enable end-to-end learning by calculating the difference between input and rendered images without ground-truth 3D faces [44]. For each of these frameworks, a training strategy has been proposed for effective self-supervised learning.
RingNet and DECA apply a landmark-based training strategy by predicting landmarks for input images and using them indirectly as pseudo ground truth [1, 11]. They use a landmark reprojection loss, which computes the distance between the ground-truth 2D face landmark and its corresponding landmark on the surface of the 3DMM, projected onto the image. Additionally, EMOCA employs a perception-based training strategy by utilizing a deep learning-based emotion recognition model as a feature extractor to minimize the distance between the features of input and rendered images [12].

2.3 Facial Action Unit

AU detection involves analyzing facial expressions to detect independent movements in each region of the face [15]. Universally recognizable expressions such as surprise, anger, and sadness exist, but actual facial movements and expression styles vary between individuals [16]. The Facial Action Coding System (FACS) has been developed to represent human expressions independently of the individual [15]. FACS is a taxonomy system that encodes facial movements into AUs based on observations of the activation of facial muscles or muscle groups. Compared to categorical emotion models, AUs offer a more comprehensive and objective description of facial expressions [14].

A considerable amount of research has been conducted on automated AU detection, which is useful in tasks related to image-based facial behavior analysis [23]. AU detection can be formulated as a multi-label classification problem, and most research works propose to use machine learning techniques. More recently, the correlation between AUs has been taken into account, as the underlying relationships are found to play an important role in modeling facial expressions [40]. The AU Relationship-aware Node Feature Learning (ANFL) module in ME-GraphAU utilizes a Convolutional Neural Network (CNN) and Graph Neural Network (GNN)-based model for AU detection, considering the relationships between AUs [17].
A CNN-based network generates a facial representation for the input image. Then an AU-specific Feature Generator (AFG), which is composed of fully connected (FC) layers and global average pooling (GAP) layers, extracts AU-specific features from the overall facial representation. A GNN-based network produces an AU relation graph to model the relationships between the extracted AU features. The AU relation graph includes relationships for each pair of AUs and predicts the activation probabilities and co-occurrence patterns of AUs. ME-GraphAU demonstrates state-of-the-art performance on the AU detection benchmarks BP4D and DISFA [19, 24]. In this paper, we apply these AU characteristics to 3D face reconstruction, enhancing the performance of 3D expression representation.

3 Method

The main design goal of AUFART is to build a self-supervised learning-based 3D face reconstruction framework that takes advantage of the information on AU activation given a single monocular 2D image. Figure 1 shows the overall architecture of the AUFART framework.

3.1 Architecture

AUFART learns relationships among AU-specific features and global facial representations to predict accurate 3D face reconstruction parameters. AU activations are related to each other and together describe overall facial expressions [17, 22]. We model the relationships among the AU-specific features and the global facial features with a transformer using cross-attention. We use the pre-trained AFG block from ME-GraphAU to generate the AU-specific features from the face in an image. The AFG is trained to generate AU-specific features dedicated to the AU detection model. The AU-specific features contain both the AU activation status and the associations between AUs for each facial display. These features can enhance the capability of the 3D face reconstruction model by providing a richer representation of subtle details in facial expressions.
The AFG takes an input image, passes it through the backbone network, and generates the AU-specific features as:

$$V_{AFG}=\left\{v_{1},v_{2},\dots ,v_{N}\right\},\quad v_{i}\in \mathbb{R}^{512},\quad N=27, \tag{1}$$

where $N$ is the number of AU-specific features. We also use the pretrained 3D face reconstruction model DECA as a facial global feature generator. The DECA encoder is composed of a CNN and an FC layer. The CNN extracts the global face representation $X_{DECA}\in \mathbb{R}^{2048}$, while the FC layer generates the 3D face reconstruction parameters $\Theta_{DECA}\in \mathbb{R}^{236}$ from $X_{DECA}$. The global face representation $X_{DECA}$ contains generalized global features of the face in an input image. It is projected to $v_{GLB}\in \mathbb{R}^{512}$ with an FC layer $L$:

$$v_{GLB}=X_{DECA}^{T}L,\quad L\in \mathbb{R}^{2048\times 512}. \tag{2}$$

The overall procedure of generating the input features of our model, $V_{AFG}$ and $v_{GLB}$, from an input image with the AFG and DECA is illustrated in Fig. 1. We use a transformer-based 3D face reconstruction model that learns semantic relationships within the generated features $V_{AFG}$ and $v_{GLB}$ and regresses the 3D face reconstruction parameters $\Theta_{AUFRT}$.
A cross-attention mechanism in our transformer-based model enhances the interplay between $V_{AFG}$ and $v_{GLB}$ by enabling the exchange of mutual information between input features. This dynamic interaction allows the model to consider a global context, learning dependencies and correlations among these features. The model consists of layer normalizations (LN), multi-layer perceptron (MLP) layers, and multi-head cross-attention (MHC) layers. We add a learnable regression token [REG] and apply input embedding and position embedding to the set of input features. In the cross-attention process, the AU-specific features $V_{AFG}$ are used as queries, while the global facial features $v_{GLB}$ are treated as keys and values:

$$z_{0}=\left[v_{REG};\, v_{1}E, v_{2}E, \dots , v_{N}E\right]+E_{pos}, \tag{3}$$

$$z'_{l}=\mathrm{MHC}\left(\mathrm{LN}\left(z_{l-1}\right), v_{GLB}, v_{GLB}\right)+z_{l-1}, \tag{4}$$

$$z_{l}=\mathrm{MLP}\left(\mathrm{LN}\left(z'_{l}\right)\right)+z'_{l}, \tag{5}$$

$$y=\mathrm{MLP}\left(\mathrm{LN}\left(z_{L}^{0}\right)\right), \tag{6}$$

where $E_{pos}$ is the position embedding and $E$ is the input embedding. The $\mathrm{MHC}$ receives the query, key, and value inputs in that order. The learnable regression token [REG] is represented as $v_{REG}$ and prepended to the input features. The output $y$ of the above process is used as our 3D face reconstruction parameter $\Theta_{AUFRT}$.
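The data flow of Eqs. (3) to (6) can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: it uses a single attention head, toy dimensions (D=8 and N=4 rather than 512 and 27), random weights, a ReLU MLP, and LN without learned scale or shift, all of which are assumptions made only for illustration. The names `encoder_block`, `make_block`, `rand_matrix`, and the output size `P` are hypothetical.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """LN over one token vector (no learned scale/shift, for brevity)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(W, x):
    """Apply a weight matrix stored as a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context, Wq, Wk, Wv):
    """Single-head attention: each query token attends to the context tokens."""
    Q = [linear(Wq, q) for q in queries]
    K = [linear(Wk, c) for c in context]
    V = [linear(Wv, c) for c in context]
    scale = math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
        out.append([sum(wi * v[i] for wi, v in zip(w, V)) for i in range(len(V[0]))])
    return out

def mlp(x, W1, W2):
    hidden = [max(0.0, h) for h in linear(W1, x)]  # ReLU is an assumption
    return linear(W2, hidden)

def encoder_block(z_prev, context, p):
    normed = [layer_norm(z) for z in z_prev]
    attn = cross_attention(normed, context, p["Wq"], p["Wk"], p["Wv"])
    z_mid = [add(a, z) for a, z in zip(attn, z_prev)]                     # Eq. (4)
    return [add(mlp(layer_norm(z), p["W1"], p["W2"]), z) for z in z_mid]  # Eq. (5)

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def make_block(d, rng):
    return {"Wq": rand_matrix(d, d, rng), "Wk": rand_matrix(d, d, rng),
            "Wv": rand_matrix(d, d, rng),
            "W1": rand_matrix(2 * d, d, rng), "W2": rand_matrix(d, 2 * d, rng)}

rng = random.Random(0)
D, N, P, DEPTH = 8, 4, 6, 2             # toy sizes; the paper uses D=512, N=27
E = rand_matrix(D, D, rng)              # input embedding
E_pos = rand_matrix(N + 1, D, rng)      # position embedding, one row per token
v_reg = [rng.uniform(-0.1, 0.1) for _ in range(D)]   # [REG] token
V_afg = rand_matrix(N, D, rng)          # stand-in for the AU-specific features
v_glb = [rng.uniform(-1.0, 1.0) for _ in range(D)]   # stand-in for the global feature

tokens = [v_reg] + [linear(E, v) for v in V_afg]
z = [add(t, pos) for t, pos in zip(tokens, E_pos)]   # Eq. (3)

for block in [make_block(D, rng) for _ in range(DEPTH)]:
    # v_glb is passed as a single key/value token in this sketch,
    # so each query's softmax weight is 1.0.
    z = encoder_block(z, [v_glb], block)

head = {"W1": rand_matrix(D, D, rng), "W2": rand_matrix(P, D, rng)}
y = mlp(layer_norm(z[0]), head["W1"], head["W2"])    # Eq. (6): read out the [REG] token
```

Only the [REG] position `z[0]` is read out in the last step, which corresponds to the superscript 0 in Eq. (6); the remaining tokens serve as queries inside the blocks.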
Once the 3D face reconstruction parameter values $\Theta_{AUFRT}$ are generated, we use the FLAME decoder for the 3D face reconstruction. Subsequently, we employ a differentiable renderer to generate a rendered image from the reconstructed 3D face. The differentiable renderer makes it possible to compute gradients during the rendering process, enabling end-to-end training. Finally, we minimize the losses between the input image $I$ and the rendered image $I_{Re}$ to train our model.

3.2 Loss function

Given a dataset of 2D face images, AUFART is trained by minimizing:

$$L_{total}=L_{auLmk}+L_{auRel}+L_{auFeat}+L_{reg} \tag{7}$$

with the AU-weighted landmark reprojection loss $L_{auLmk}$, the AU-based relative distance loss $L_{auRel}$, the AU feature loss $L_{auFeat}$, and the parameter regularizer $L_{reg}$.

AU-weighted landmark reprojection loss. This loss dynamically assigns higher weights to the landmark positions corresponding to activated AUs during the computation of the landmark reprojection loss. The landmark reprojection loss in existing studies assigns fixed weights to each facial part in every image [1, 11]. However, the movements of landmarks triggered by the activation of AUs serve as an effective means to describe the AUs [43]. $L_{auLmk}$ assigns dynamic weights to the facial regions where AUs are activated to encourage the accurate representation of AUs in the reconstructed face. This enables AUFART to pay more attention to activated AUs during the training process.
The AU-weighted landmark reprojection loss function is defined as:

$$L_{auLmk}=\sum_{i=1}^{N}\sum_{j=1}^{L_{i}} p_{i}\left\|k_{j}-\left(s\Pi \left(M_{j}\right)+t\right)\right\|_{1}, \tag{8}$$

where $N$ is the number of AUs used in this loss function, $L_{i}$ is the number of landmarks related to the $i^{th}$ AU, $p_{i}$ is the activation status of the $i^{th}$ AU predicted by ME-GraphAU, $k_{j}$ is the $j^{th}$ landmark coordinate in the input image, and $M_{j}$ is the corresponding landmark on the FLAME model's surface. $s$, $\Pi$, and $t$ represent the predicted camera parameters, denoting the isotropic scale, the orthographic 3D-to-2D projection matrix, and the 2D translation, respectively. We employ the Mediapipe landmark detector to predict landmarks from 2D images, utilizing a total of 105 landmarks distributed across the eyebrow, eye, nose, and mouth regions [27]. Table 1 provides details on the facial landmarks associated with AUs, and Fig. 2 (a) illustrates the 105 landmark indices and positions.

AU-based relative distance loss. The AU-based relative distance loss compares the AU configural features of the image landmarks with those of the projected 3D landmarks. The AU configural features are relative distances between facial landmark points and are used to determine AUs [27]. For example, AU 4 (Brow Lowerer) is determined based on the distance between landmark points 21 and 22, which correspond to the inner eyebrow landmarks on the left and right. This type of loss function is similar to the eye closure loss of DECA, which computes an error in the relative offset between landmarks on the upper and lower eyelids for image landmarks and their corresponding projected 3D landmarks.
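Returning to the AU-weighted landmark reprojection loss defined above, the following minimal Python sketch shows how per-AU activation weights scale the L1 reprojection error of the landmarks tied to each AU. It is an illustration under stated assumptions, not the authors' code: the projection drops the z coordinate, scales, and translates; $p_i$ is treated as a weight in [0, 1]; and the function names, the AU-to-landmark mapping, and the toy coordinates are all hypothetical.

```python
def project(point3d, scale, translation):
    """Orthographic projection s * Pi(M) + t: drop z, scale, then translate."""
    x, y, _ = point3d
    return (scale * x + translation[0], scale * y + translation[1])

def au_weighted_landmark_loss(image_lmks, model_lmks, au_probs, au_to_lmks,
                              scale, translation):
    """Sum over AUs i and their landmarks j of p_i * L1(k_j, projected M_j)."""
    loss = 0.0
    for au, p in au_probs.items():
        for j in au_to_lmks[au]:
            u, v = project(model_lmks[j], scale, translation)
            kx, ky = image_lmks[j]
            loss += p * (abs(kx - u) + abs(ky - v))
    return loss

# Toy example with two landmarks and two AUs (all values are made up).
image_lmks = {0: (1.0, 2.0), 1: (3.0, 1.0)}            # detected 2D landmarks k_j
model_lmks = {0: (0.5, 1.0, 0.2), 1: (1.0, 0.0, 0.3)}  # 3D landmarks M_j on the mesh
au_probs = {"AU4": 1.0, "AU9": 0.0}                    # activation weights p_i
au_to_lmks = {"AU4": [0, 1], "AU9": [1]}               # landmarks tied to each AU

loss = au_weighted_landmark_loss(image_lmks, model_lmks, au_probs, au_to_lmks,
                                 scale=2.0, translation=(0.0, 0.5))
```

In this toy case the inactive AU contributes nothing, so only the landmarks tied to the activated AU drive the loss, which is the behaviour the dynamic weighting is meant to produce.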
We extend this approach in the context of AUs by incorporating configural features. The AU-based relative distance loss computes the errors in the configural features of the image landmarks $k$ and the corresponding 3D landmarks $M$ projected onto the image plane:

$$L_{auRel}=\sum_{i=1}^{23}\left\|c_{i}^{k}-c_{i}^{s\Pi \left(M\right)}\right\|_{1}, \tag{9}$$

where $c_{i}^{k}$ and $c_{i}^{s\Pi \left(M\right)}$ are the $i^{th}$ configural features of the image landmarks $k$ and the projected 3D landmarks $s\Pi \left(M\right)$. The configural features were originally defined using a 66-landmark model, but we adapt them to the 68-landmark model from HRNet [28]. The 68 landmark indices are illustrated in Fig. 2 (b), and the configural features corresponding to each AU are described in Table 2.

Table 1 Facial landmarks associated with AUs

| Facial parts | Related AUs | Involved landmarks |
|---|---|---|
| Brow | Brow Lowerer | 0, 1, 2, …, 19 |
| Inner brow | Inner Brow Raiser | 1, 3, 5, 6, 8, 9, 11, 13, 15, 16, 18, 19 |
| Outer brow | Outer Brow Raiser | Elements excluding Inner brow from Brow |
| Eye | Lid Tightener | 20, 21, …, 51 |
| Lower eye | Cheek Raiser | 20, 21, …, 27, 33, 36, 37, …, 43, 49 |
| Upper eye | Upper Lid Raiser | Elements excluding Lower eye from Eye |
| Nose | Nose Wrinkler | 52, 53, …, 64 |
| Mouth | Lip Pucker, Lip Stretch, Lip Tightener | 65, 66, …, 104 |
| Upper mouth | Upper Lip Raiser | 65, 66, 69, 70, …, 76, 85, 86, …, 94, 103, 104 |
| Mouth corner | Lip Corner Puller, Lip Corner Depressor | 71, 72, 73, 74, 79, 80, 81, 82, 85, 86, 88, 89, 90, 91, 92, 93, 97, 98, 99, 100, 103, 104 |

Table 2 Configural features corresponding to each AU

| Facial AU | Configural features |
|---|---|
| Inner Brow Raiser | $c_{1}=\lVert p_{21}-p_{39}\rVert$, $c_{4}=\lVert p_{26}-p_{45}\rVert$ |
| Outer Brow Raiser | $c_{5}=\lVert \frac{p_{19}-p_{20}}{2}-\frac{p_{37}-p_{38}}{2}\rVert$, $c_{6}=\lVert \frac{p_{23}-p_{24}}{2}-\frac{p_{43}-p_{44}}{2}\rVert$ |
| Brow Lowerer | $c_{7}=\lVert p_{21}-p_{22}\rVert$ |
| Upper Lid Raiser | Similar to $c_{5}$, $c_{6}$, and $c_{8}=\lVert \frac{p_{37}-p_{38}}{2}-\frac{p_{40}-p_{41}}{2}\rVert$, $c_{9}=\lVert \frac{p_{43}-p_{44}}{2}-\frac{p_{46}-p_{47}}{2}\rVert$ |
| Lid Tightener | Similar to $c_{8}$, $c_{9}$ |
| Nose Wrinkler | $c_{10}=\lVert p_{27}-p_{29}\rVert$ |
| Upper Lip Raiser | $c_{11}=\lVert p_{60}-p_{65}\rVert$, $c_{12}=\lVert p_{62}-p_{63}\rVert$, $c_{13}=\lVert p_{32}-p_{50}\rVert$, $c_{14}=\lVert p_{33}-p_{51}\rVert$, $c_{15}=\lVert p_{34}-p_{52}\rVert$, $c_{16}=\lVert p_{41}-p_{48}\rVert$, $c_{17}=\lVert p_{46}-p_{54}\rVert$ |
| Lip Corner Puller | $c_{18}=\lVert p_{48}-p_{54}\rVert$, $c_{19}=\lVert \frac{p_{39}+p_{40}+p_{41}}{3}-p_{48}\rVert$, $c_{20}=\lVert \frac{p_{42}+p_{47}+p_{46}}{3}-p_{54}\rVert$ |
| Lip Corner Depressor | Similar to $c_{19}$, $c_{20}$ |
| Lip Stretcher | Similar to $c_{18}$ |
| Lip Tightener | $c_{21}=\lVert p_{51}-p_{57}\rVert$ |
| Jaw Drop | Similar to $c_{21}$, and $c_{22}=\lVert p_{50}-p_{58}\rVert$, $c_{23}=\lVert p_{52}-p_{56}\rVert$ |

AU feature loss. The AU feature loss computes the distance between the AU-specific features of the input image $I$ and the rendered image $I_{Re}$. Optimizing this loss during training encourages the reconstructed 3D face to convey AU activations that are visually similar to those in the image. We utilize the AFG to generate the AU-specific features from both images:

$$L_{auFeat}=\left\|\mathrm{AFG}\left(I\right)-\mathrm{AFG}\left(I_{Re}\right)\right\|_{2}. \tag{10}$$

Parameter regularization.
$L_{reg}$ regularizes the expression $\psi$, pose $\theta$, and camera $c$ parameters with regularization coefficients and is specified as:

$$L_{reg}=\lambda_{\psi}\left\|\psi \right\|_{2}^{2}+\lambda_{\theta}\left\|\theta \right\|_{2}^{2}+\lambda_{c}\left\|c\right\|_{2}^{2}. \tag{11}$$

4 Experiments

4.1 Implementation details

AUFART was trained with a total of approximately 300,000 images from VGGFace2, Aff-wild2, CelebA-HQ, FFHQ, and BUPT-CB [29–33]. We used PyTorch3D to render the reconstructed 3D face onto the image plane [37]. In addition, we used the Adam optimizer with a learning rate of 1e-05, a batch size of 16, and 15 epochs. For parameter regularization, a coefficient of 1e-05 is applied to the expression parameter, and 0.1 is applied to the pose parameter. The loss weights are 0.75 for $L_{auLmk}$, 0.25 for $L_{auRel}$, and 0.75 for $L_{auFeat}$. Among the 3D face reconstruction parameters, the AUFART model predicts only the values of $\psi$, $\theta$, and $c$, while DECA predicts the values of $\beta^{DECA}$, $l^{DECA}$, and $a^{DECA}$.

4.2 Quantitative evaluation

Currently, there is no standard benchmark specifically designed to evaluate the performance of expression reconstruction, while there exist a number of benchmarks for quantitative evaluation of facial identity in the context of 3D face reconstruction [18]. The point-wise distance between a 3D face and a GT scan is not an appropriate performance metric for the 3D reconstruction of facial expressions, since it is dominated by the identity parameters of the two images.
Therefore, we propose to compare the AU activation states of input images and rendered images detected by ME-GraphAU using F1 scores. We use the DISFA dataset for quantitative evaluation because it is one of the training datasets for ME-GraphAU, so the AU detection model can be expected to detect AU activation states reliably during the evaluation. DISFA contains 27 subjects watching a video and consists of 130,815 frames. Table 3 presents results for 9 of the 27 subjects, each identified by subject number. The average results for the remaining 18 subjects are collectively labeled as "Others". This choice is grounded in the nature of the DISFA dataset, where frames within sequences predominantly exhibit neutral expressions. Consequently, selecting the 9 videos with high AU activation, or expressiveness, is appropriate for evaluating the performance of expression reconstruction. Table 4 presents the evaluation results for each AU that the ME-GraphAU model can predict.

In the per-subject evaluation results presented in Table 3, AUFART outperforms both DECA and EMOCA for all subjects. Table 4 provides the per-AU performance of the three methods. AUFART demonstrates superior performance compared to both DECA and EMOCA for AUs related to the upper face area (AU 4, 5, 7), particularly for AU 4 (Brow Lowerer), where it exhibits roughly 5-fold and 1.5-fold improvements over DECA and EMOCA, respectively. In the lower face areas, AUFART outperforms DECA significantly, while its advantage over EMOCA is smaller and not uniform: EMOCA scores slightly higher on AU 9, AU 12, and AU 20 (Table 4). In summary, the average F1 scores are 0.39, 0.18, and 0.30 for AUFART, DECA, and EMOCA, respectively, which confirms that AUFART achieves higher performance than DECA and EMOCA.
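The F1-based comparison used in this evaluation can be sketched in a few lines of Python. This is a hedged illustration, not the authors' evaluation script: it assumes that, for one AU, the activation states detected on the input frames are treated as the reference and the states detected on the rendered frames as the prediction, and the frame values below are made up.

```python
def f1_score(reference, predicted):
    """F1 between two binary activation sequences for a single AU."""
    tp = sum(1 for r, p in zip(reference, predicted) if r and p)
    fp = sum(1 for r, p in zip(reference, predicted) if not r and p)
    fn = sum(1 for r, p in zip(reference, predicted) if r and not p)
    if tp == 0:
        return 0.0  # no true positives: F1 is taken as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical AU activation states over six frames.
au_on_input = [1, 1, 0, 0, 1, 0]      # detected on the input images
au_on_rendered = [1, 0, 0, 1, 1, 0]   # detected on the rendered reconstructions

score = f1_score(au_on_input, au_on_rendered)
```

Here two activations agree, one is missed, and one is spurious, so precision and recall are both 2/3. Averaging such per-AU scores over the evaluated AUs would give a summary figure of the kind reported above.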
Reconstruction of face details such as forehead wrinkles and eyebrow movements, which can be detected by AU 1 (Inner Brow Raiser) and AU 2 (Outer Brow Raiser) [38], was not considered and is left for future work.

Table 3 Per-subject F1 score evaluation results for AU detection on input images and rendered images in DISFA

| Subject | AUFART | DECA | EMOCA |
|---|---|---|---|
| 03 | 0.46 | 0.23 | 0.24 |
| 06 | 0.35 | 0.13 | 0.32 |
| 11 | 0.45 | 0.18 | 0.30 |
| 12 | 0.46 | 0.30 | 0.43 |
| 16 | 0.43 | 0.05 | 0.24 |
| 18 | 0.45 | 0.19 | 0.25 |
| 23 | 0.27 | 0.16 | 0.24 |
| 25 | 0.34 | 0.21 | 0.25 |
| 27 | 0.33 | 0.07 | 0.30 |
| Others | 0.35 | 0.18 | 0.20 |
| Avg. | 0.39 | 0.18 | 0.30 |

Table 4 Per-AU F1 score evaluation results for AU detection on input images and rendered images in DISFA

| AU | AUFART | DECA | EMOCA |
|---|---|---|---|
| 01 | 0.00 | 0.00 | 0.00 |
| 02 | 0.00 | 0.00 | 0.00 |
| 04 | 0.47 | 0.09 | 0.30 |
| 05 | 0.24 | 0.08 | 0.12 |
| 07 | 0.78 | 0.47 | 0.64 |
| 09 | 0.17 | 0.00 | 0.21 |
| 10 | 0.83 | 0.47 | 0.81 |
| 12 | 0.84 | 0.40 | 0.86 |
| 15 | 0.13 | 0.00 | 0.02 |
| 20 | 0.04 | 0.01 | 0.06 |
| 23 | 0.28 | 0.03 | 0.02 |
| 26 | 0.56 | 0.46 | 0.36 |
| Avg. | 0.39 | 0.18 | 0.30 |

4.3 Qualitative evaluation

For the qualitative evaluation, we used an in-the-wild face image dataset called 300W and the DISFA dataset. Figure 3 shows the 3D face reconstruction results for the 300W dataset. We highlight the sub-image areas of high and low accuracy with green/blue and red boxes, respectively. AUFART generates noticeably better eyebrow and eye movements than DECA and EMOCA. This indicates that our transformer-based model not only learns the features related to AU 1, 2, and 4 from the input images but is also guided by the AU-based loss functions. It shows more robust reconstruction performance for upper facial movements as well. Similarly, as can be observed in the blue boxes, AUFART demonstrates higher reconstruction accuracy than DECA and EMOCA for various mouth shapes in the lower face areas, especially for AU 12 (Lip Corner Puller) and AU 15 (Lip Corner Depressor). The experimental results for the DISFA dataset are presented in Fig. 4.
It shows the images of subjects 3, 16, and 27 in the DISFA dataset and compares whether changes in facial expression are captured accurately in two adjacent frames for each subject. For Subject 3, DECA shows facial expressions with less variation between adjacent frames. In contrast, both AUFART and EMOCA show clear expression changes between adjacent frames and reconstruct accurate facial movements within each frame. For Subject 16, the activation of AU 25 (Lips Part) is observable in both AUFART and DECA, whereas it is not observed in EMOCA. Additionally, the activation of AU 1 and AU 2 is observable in AUFART and EMOCA, but not in DECA. For Subject 27, AUFART captures the subtle activation changes of AU 1 between adjacent frames. Consequently, the performance of AUFART is confirmed to be comparable to state-of-the-art models such as DECA and EMOCA, while surpassing them in certain scenarios. In summary, the qualitative evaluation results for the 300W and DISFA datasets indicate that AUFART outperforms DECA and EMOCA in the 3D reconstruction of facial expressions induced by AU activations.

5 Conclusion

In this paper, we propose a 3D face reconstruction framework named AUFART, which is based on a transformer-based model guided by AU features. Our framework incorporates pretrained feature generators for AU-specific features and global facial features, taken from a state-of-the-art AU detection model and a 3D face reconstruction model, respectively. Our transformer-based model is able to capture the relationships among the generated AU-specific features and global facial features and predicts accurate 3D face reconstruction parameters. In addition, we introduce AU-based loss functions that drive learning toward minimal discrepancy in AU activations between the input and the rendered reconstruction. AUFART achieves more accurate 3D face reconstruction of AUs, which were not fully considered in existing frame-based 3D face reconstruction studies.
We compare the AU activation states of input images and rendered images, as detected by a state-of-the-art AU detection model. AUFART achieves an average F1 score of 0.39, an improvement of at least 30% over the frame-based state-of-the-art 3D face reconstruction methods DECA (0.18) and EMOCA (0.30). This highlights the advantage of our proposed method, especially in capturing and reconstructing facial expressions related to AUs. Future work may explore detailed 3D facial restoration based on AU features and investigate temporal modeling for 3D face reconstruction that takes the temporal characteristics of AUs into account.

Declarations

Author Contribution

Hyeonjin Kim conceptualized the main idea, designed experiments, conducted research, and wrote the manuscript. Pei Wang assisted in data collection and experimental work. Professor Hyukjoon Lee verified the main idea, provided guidance on research direction, and contributed to manuscript writing. All authors participated in reviewing and revising the manuscript.

References

[1] Yao Feng, Haiwen Feng, Michael J. Black, and Timo Bolkart. Learning an animatable detailed 3D face model from in-the-wild images. ACM Transactions on Graphics (ToG), Vol. 40, No. 88, pp. 1-13, 2021.
[2] Alexandros Lattas, Stylianos Moschoglou, Baris Gecer, Stylianos Ploumpis, Vasileios Triantafyllou, Abhijeet Ghosh, and Stefanos Zafeiriou. AvatarMe: Realistically renderable 3D facial reconstruction "in-the-wild." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 760–769, 2020.
[3] Kristina Scherbaum, Tobias Ritschel, Matthias Hullin, Thorsten Thormählen, Volker Blanz, and Hans-Peter Seidel. Computer-suggested facial makeup. Comput. Graph. Forum, vol. 30, no. 2, pages 485-492, 2011.
[4] Chen Cao, Derek Bradley, Kun Zhou, and Thabo Beeler. Real-time high-fidelity facial performance capture. ACM Trans. Graph., 34(4), Jul 2015.
[5] Liwen Hu, Shunsuke Saito, Lingyu Wei, Koki Nagano, Jaewoo Seo, Jens Fursund, Iman Sadeghi, Carrie Sun, Yen-Chun Chen, and Hao Li. Avatar digitization from a single image for real-time rendering. ACM Trans. Graph., 36(6), Nov. 2017.
[6] Diego R. Faria, Mario Vieira, Fernanda C.C. Faria, and Cristiano Premebida. Affective facial expressions recognition for human-robot interaction. In Proc. 26th IEEE Int. Symp. Robot Hum. Interact. Commun. (RO-MAN), pages 805-810, Aug. 2017.
[7] Jinbo Xing, Menghan Xia, Yuechen Zhang, Xiaodong Cun, Jue Wang, and Tien-Tsin Wong. Codetalker: Speech-driven 3d facial animation with discrete motion prior. arXiv preprint arXiv:2301.02379, 2023.
[8] Tianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 36(6):194:1–194:17, 2017.
[9] Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter. A 3d face model for pose and illumination invariant face recognition. In 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance, pages 296–301. IEEE, 2009.
[10] Araceli Morales, Gemma Piella, and Federico M. Sukno. Survey on 3d face reconstruction from uncalibrated images. Computer Science Review, 40:1–35, 2021.
[11] Soubhik Sanyal, Timo Bolkart, Haiwen Feng, and Michael J. Black. Learning to regress 3D face shape and expression from an image without 3D supervision. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 7763–7772, 2019.
[12] Radek Danecek, Michael J. Black, and Timo Bolkart. EMOCA: Emotion driven monocular face capture and animation. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 20311–20322, 2022.
[13] Panagiotis P Filntisis, George Retsinas, Foivos Paraperas-Papantoniou, Athanasios Katsamanis, Anastasios Roussos, and Petros Maragos. Spectre: Visual speech-informed perceptual 3d facial expression reconstruction from videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5744–5754, 2023.
[14] Tetiana Martyniuk, Orest Kupyn, Yana Kurlyak, Igor Krashenyi, Jiří Matas, and Viktoriia Sharmanska. Dad-3dheads: A large-scale dense, accurate and diverse dataset for 3d head alignment from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20942–20952, 2022.
[15] Paul Ekman and Wallace V. Friesen. Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, 1978.
[16] Geethu Miriam Jacob and Bjorn Stenger. Facial action unit detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7680–7689, 2021.
[17] Cheng Luo, Siyang Song, Weicheng Xie, Linlin Shen, and Hatice Gunes. Learning multi-dimensional edge feature-based au relation graph for facial action unit recognition. arXiv preprint arXiv:2205.01782, 2022.
[18] Zhen-Hua Feng, Patrik Huber, Josef Kittler, Peter Hancock, Xiao-Jun Wu, Qijun Zhao, Paul Koppen, and Matthias Rätsch. Evaluation of dense 3D reconstruction from 2D face images in the wild. In International Conference on Automatic Face & Gesture Recognition (FG), pages 780–786, 2018.
[19] S. Mohammad Mavadati, Mohammad H. Mahoor, Kevin Bartlett, Philip Trinh, and Jeffrey F. Cohn. Disfa: A spontaneous facial action intensity database. IEEE Transactions on Affective Computing, 4(2):151–160, 2013.
[20] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pages 187–194, 1999.
[21] Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Keliang Zhou. FaceWarehouse: A 3D facial expression database for visual computing. Transactions on Visualization and Computer Graphics, 20:413–425, 2014.
[22] Brais Martinez, Michel F. Valstar, Bihan Jiang, and Maja Pantic. Automatic Analysis of Facial Actions: A Survey. IEEE Transactions on Affective Computing, vol. 10, no. 3, pp. 325-347, July-Sept. 2019.
[23] Cheng-Hao Tu, Chih-Yuan Yang, and Jane Yung-jen Hsu. IdenNet: Identity-aware facial action unit detection. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), pages 1–8. IEEE, 2019.
[24] Xing Zhang, Lijun Yin, Jeffrey F Cohn, Shaun Canavan, Michael Reale, Andy Horowitz, Peng Liu, and Jeffrey M. Girard. Bp4d-spontaneous: a high-resolution spontaneous 3d dynamic facial expression database. Image and Vision Computing, 32(10):692–706, 2014.
[25] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. arXiv:1710.10903, 2017.
[26] Yury Kartynnik, Artsiom Ablavatski, Ivan Grishchenko, and Matthias Grundmann. Real-time facial surface geometry from monocular video on mobile GPUs. In Third Workshop on Computer Vision for AR/VR, Long Beach, CA, 2019.
[27] Nazil Perveen and Chalavadi Krishna Mohan. Configural representation of facial action units for spontaneous facial expression recognition in the wild. In VISIGRAPP (4: VISAPP), pages 93–102, 2020.
[28] Ke Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, and Jingdong Wang. High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514, 2019.
[29] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In International Conference on Automatic Face & Gesture Recognition (FG), pages 67–74, 2018.
[30] Dimitrios Kollias and Stefanos Zafeiriou. Aff-Wild2: Extending the Aff-Wild database for affect recognition. arXiv preprint arXiv:1811.07770, 2018.
[31] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Large-scale celebfaces attributes (celeba) dataset. Retrieved August, 15:2018, 2018.
[32] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
[33] Yaobin Zhang and Weihong Deng. Class-balanced training for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 824–825, 2020.
[34] Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. Retinaface: Single-shot multilevel face localisation in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5203–5212, 2020.
[35] Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), pages 3464–3468. IEEE, 2016.
[36] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
[37] Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3d deep learning with pytorch3d. arXiv:2007.08501, 2020.
[38] Y-I Tian, Takeo Kanade, and Jeffrey F Cohn. Recognizing action units for facial expression analysis. T-PAMI, 23(2):97–115, 2001.
[39] Dolley Shukla, Chandra Shekhar Mithlesh, and Manisha Sharma. A survey on different video scene change detection techniques. International Journal of Science and Research (IJSR). National Conference on Knowledge, Innovation in Technology and Engineering (NCKITE), 2015.
[40] Tengfei Song, Lisha Chen, Wenming Zheng, and Qiang Ji. Uncertain Graph Neural Networks for Facial Action Unit Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, 35(7), 5993-6001, 2021.
[41] Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1021-1030, 2017.
[42] Tengfei Song, Zijun Cui, Wenming Zheng, and Qiang Ji. Hybrid message passing with performance-driven structures for facial action unit detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6267–6276, 2021.
[43] Shangfei Wang, Yanan Chang, and Can Wang. Dual learning for joint facial landmark detection and action unit recognition. IEEE Transactions on Affective Computing, 2021.
[44] A. Tewari, M. Zollhöfer, H. Kim, P. Garrido, F. Bernard, P. Pérez, and C. Theobalt. MoFA: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In Proc. of IEEE ICCV, pp. 1274–1283, 2017.
[45] J. Yang, F. Zhang, B. Chen, and S. U. Khan. Facial expression recognition based on facial action unit. In Proc. 10th Int. Green Sustain. Comput. Conf. (IGSC), pp. 1-6, Oct. 2019.
[46] Chenyi Kuang, Jeffrey O. Kephart, and Qiang Ji. AU-Aware Dynamic 3D Face Reconstruction From Videos With Transformer. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024.
[47] Chenyi Kuang, et al. AU-Aware 3D Face Reconstruction through Personalized AU-Specific Blendshape Learning. In European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022.
[48] Mani Kumar Tellamekala, et al. Are 3D Face Shapes Expressive Enough for Recognising Continuous Emotions and Action Unit Intensities? IEEE Transactions on Affective Computing, 2023.

Additional Declarations

No competing interests reported.
As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-4310180","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":294599586,"identity":"a7b76293-ab3d-4c51-82fb-b72b41674414","order_by":0,"name":"Hyeonjin Kim","email":"","orcid":"","institution":"Kwangwoon University","correspondingAuthor":false,"prefix":"","firstName":"Hyeonjin","middleName":"","lastName":"Kim","suffix":""},{"id":294599587,"identity":"6939eed6-5db7-4b7f-ab43-15d1cf00b2f2","order_by":1,"name":"Pei Wang","email":"","orcid":"","institution":"Kwangwoon University","correspondingAuthor":false,"prefix":"","firstName":"Pei","middleName":"","lastName":"Wang","suffix":""},{"id":294599588,"identity":"e52b4e80-f55d-4eb8-9e29-647eff4e2154","order_by":2,"name":"Hyukjoon Lee","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAoklEQVRIiWNgGAWjYBACA3Yg8eEAmM1MpBagOsYZJGth5iFJizkz88HHNmcOM/C3H2A2riBGi2UzW7Jxzo3DDBJnEpgTzxDlsMM8ZtI5H24zMNxgYD7YQJwW/u+/LYBa5EnQwsPGzHDjNoMBUEsikVrYjCV7zvznMTyT2GxInJbjzQ8//DiWJid3/PBhSaK0wAAPMEZJ0jAKRsEoGAWjAB8AAIOjMNarXirqAAAAAElFTkSuQmCC","orcid":"","institution":"Kwangwoon 
University","correspondingAuthor":true,"prefix":"","firstName":"Hyukjoon","middleName":"","lastName":"Lee","suffix":""}],"badges":[],"createdAt":"2024-04-23 07:38:56","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-4310180/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-4310180/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":55628990,"identity":"f5ef7af1-0150-4bb5-b71a-b9b31e16e58f","added_by":"auto","created_at":"2024-04-30 19:03:08","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":488562,"visible":true,"origin":"","legend":"\u003cp\u003eOverview of architecture for AU-guided 3D face reconstruction with AUFART. (a) Pretrained AU-specific feature generator receives input image and generate AU-specific features. (b) Pretrained DECA encoder receives input image and generate global facial features as well as 3D face reconstruction parameters. (c) Our transformer-based model receives both AU-specific features and global facial features and predict 3D face reconstruction parameters. 
(d) Transformer encoder with multi-head cross attention.\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-4310180/v1/b012ac64a9b02b853b546b0d.png"},{"id":55628988,"identity":"0314f676-37bc-46dd-af56-059c972d0ac1","added_by":"auto","created_at":"2024-04-30 19:03:07","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":206534,"visible":true,"origin":"","legend":"\u003cp\u003eLandmark position and index for each detector\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-4310180/v1/ddad0f644aa059d948ccb4d9.png"},{"id":55628987,"identity":"f7b3f814-bfd6-4c93-a030-826872181963","added_by":"auto","created_at":"2024-04-30 19:03:07","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":3297437,"visible":true,"origin":"","legend":"\u003cp\u003eVisual comparison with DECA, EMOCA, and AUFART on the 300W dataset\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-4310180/v1/765fdee68d207a30039c6a55.png"},{"id":55629625,"identity":"637c1efb-f6fa-4536-bf4b-28059e83e453","added_by":"auto","created_at":"2024-04-30 19:11:07","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":602013,"visible":true,"origin":"","legend":"\u003cp\u003eVisual comparison with DECA, EMOCA, and AUFART on the DISFA, from top to bottom: Input image, DECA, EMOCA, and AUFART, respectively\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-4310180/v1/c810b4dd3f71c2713053f414.png"},{"id":58503968,"identity":"ebf06b86-0d6b-4058-a0f2-2bc697b8f760","added_by":"auto","created_at":"2024-06-17 
14:07:25","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":7739112,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4310180/v1/db9c3e24-a303-40a8-ab28-f16c0cf79bc3.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Action Unit-Based 3D Face Reconstruction Using Transformers","fulltext":[{"header":"1 Introduction","content":"\u003cp\u003eIn recent years, rapid advances in deep learning technology have led to numerous innovative advances in computer vision and graphics research. 3D face reconstruction from 2D images has received a tremendous amount of attention in computer vision and has made major progresses thanks to the highly accurate modeling capability of deep learning. 3D face reconstruction enables a wide range of applications such as speech-driven 3D facial animation, 3D avatar generation, virtual makeup, performance capture, virtual and augmented reality, and human-robot interaction [2\u0026ndash;7].\u003c/p\u003e \u003cp\u003eMost existing studies use pre-computed 3D morphable models (3DMMs) with prior knowledge about facial geometry and appearance to improve the accuracy and fidelity of 3D face reconstruction [8, 9]. Recent studies utilize deep learning frameworks based on self-supervised learning to predict 3DMM parameters from input images. They can create plausible 3D face without ground-truth 3D facial scan data by employing various loss functions, such as the landmark reprojection loss, photometric loss, and face recognition loss, to train the deep neural networks [1, 10\u0026ndash;13].\u003c/p\u003e \u003cp\u003eRecently, various new loss functions and architectures have been introduced to address the limitations of existing methods with respect to reconstruction accuracy of the rich and detailed facial expressions [12, 13, 46, 47]. 
In particular, the method of capturing emotions and reconstructing them into 3D faces demonstrates notable efficacy [12]. In contrast, the Facial Action Coding System (FACS) is a system describing a taxonomy of AUs for encoding facial movements and expressions, based on the observation of muscle activations [15]. It is observed that that within the existing 3D face reconstruction process, there is commendable proficiency in handling emotions, while the performance in encoding AUs is comparatively modest [48]. There exist a number of studies that have emphasized the importance of utilizing AUs in the process of 3D face reconstruction [46, 47]. However, they do not explicitly consider the correlations between AUs occurring in the frame-based reconstruction process and require the use of AU labels during training, leading to a lack of guaranteed performance in in-the-wild scenarios. In this paper, we leverage AU features extracted from in-the-wild images in the frame-based reconstruction process. Our approach enables accurate 3D face reconstruction while accounting for AUs, by utilizing a Transformer to model the correlations between AUs within frames. The correlation between AUs is an important factor to be modeled since human facial expressions are formed by multiple AUs in general. Therefore, a proper method of modeling and leveraging the correlation, not just the straight-forward utilization of the information about individual AUs, on top of global facial features may play a crucial role in reconstructing accurate facial expressions.\u003c/p\u003e \u003cp\u003eIn this paper, we propose AUFART (AU feature-based 3D FAce Reconstruction with Transformer) which enables detailed modeling of various facial expression types based on AU information for 3D face reconstruction. 
Unlike existing methods that use only global facial features generated from the face in an image using an encoder network,, our method can enhance the performance of the 3D face reconstruction model by providing richer representation of subtle details in facial expressions. A transformer-based 3D face reconstruction model is used to take advantage of the AU-specific features as well as the relationships between these features through the cross-attention mechanism. Several novel AU-based loss functions are also proposed. The reconstructed 3D faces generated by our method is found to be more responsive to the activated AUs in input images.\u003c/p\u003e \u003cp\u003eIn summary, our proposed framework comprises three key contributions: (i) We propose a Transformer-based 3D face reconstruction framework that leverages the features of AUs in the frame-based 3D face reconstruction process, explicitly considering their correlations; (ii) We integrate a state-of-the-art AU feature extraction module for effective AU feature extraction from in-the-wild images, along with a Transformer model for reconstructing 3D faces from these features. This integration enables high-accuracy facial reconstruction even in diverse environmental conditions and allows modeling of challenging correlations among less easily captured AUs; (iii) Additionally, to ensure precise 3D restoration of AU information, we design an AU-based loss function for training our proposed 3D face reconstruction framework.\u003c/p\u003e"},{"header":"2 Related Works","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e2.1 3D Morphable Models\u003c/h2\u003e \u003cp\u003e3DMM is statistical models capable of capturing and representing various facial changes in low-dimensional space. These models are built from a vast amount of 3D facial scan data. Vetter and Blantz explained a method for reconstructing a 3D face from a single image with a pre-computed 3DMM in an analysis-by-synthesis fashion [8]. 
While the traditional 3DMM is based on Principal Component Analysis (PCA) for facial shape, more recent models such as FLAME, Basel Face Model, FaceWarehouse have separated shape, expression, and appearance spaces, enabling richer representations [8, 9].\u003c/p\u003e \u003cp\u003eFLAME is trained on 33,000 scan data and represents shape, pose, and expression parameters in the well-separated spaces through an effective parameter separation process. FLAME consists of a template mesh, shape blendshapes, pose blendshapes, and expression blendshapes. Each blendshape is composed of displacements from the template mesh with PCA applied to shape and expression. An iterative optimization approach was used to separate the spaces of each parameter during the model training phase. As a result, FLAME has made 3D facial reconstruction more accurate and manageable than the other 3DMM models. For this reason, FLAME is most widely used as a powerful and expressive tool in modeling facial geometry and expressions in many research works involving 3D faces including ours.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e2.2 3D Face reconstruction\u003c/h2\u003e \u003cp\u003eThe popularity of deep learning-based methods that learn the mapping between 2D images and 3D face models directly has grown rapidly over the last few years [10]. Early deep learning-based 3D face reconstruction methods faced challenges related to the dataset and training strategies. A huge number of 3D facial scan data corresponding to 2D images had to be collected to train a deep learning-based model, which incurred a large amount of labor and cost. Self-supervised learning frameworks that try to minimize the difference between input images and rendered images have been proposed to address this issue. 
They utilize a differentiable rendering layer to enable end-to-end learning by calculating the difference between input and rendered images without ground-truth 3D faces [44]. For each of the frameworks, a training strategy has been proposed for effective self-supervised learning. RingNet and DECA apply a landmark-based training strategy by predicting landmarks for input images and using them indirectly as pseudo ground truth [1, 11]. They use landmark reprojection loss which computes the distance between the ground-truth 2D face landmark and its corresponding landmark on the surface of the 3DMM, projected onto the image. Additionally, EMOCA employs a perception-based training strategy by utilizing a deep learning-based emotion recognition model as a feature extractor to minimize the distance of features for input and rendered images [12].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e2.3 Facial Action Unit\u003c/h2\u003e \u003cp\u003eAU detection involves analyzing facial expressions to detect independent movements in each region of the face [15]. Universally recognizable expressions such as surprise, anger, and sadness coexist, but actual facial movements. and expression styles vary between individuals [16]. Facial Action Coding System (FACS) has been developed to represent human expressions independent of each individual [15]. FACS is a taxonomy system that encodes facial movements into AUs based on observations of the activation of facial muscles or muscle groups. Compared to categorical emotion models, AUs offer a more comprehensive and objective description of facial expressions [14].\u003c/p\u003e \u003cp\u003eA considerable amount of research has been actively conducted in automated AU detection which is useful in tasks related to image-based facial behavior analysis [23]. AU detection can be formulated as a multi-label classification problem, and most research works propose to use machine learning techniques. 
More recently, the correlation between AUs is taken into account as the underlying relationships are found to play an important role in modeling facial expressions [40]. The AU Relationship-aware Node Feature Learning (ANFL) in ME-GraphAU utilizes a Convolutional Neural Network (CNN) and Graph Neural Network (GNN)-based model for AU detection, considering the relationships between AUs [17]. A CNN-based network generates a facial representation for the input image. Then an AU-specific Feature Generator (AFG) which is composed of Fully Connected layers (FC layer) and Global Average Pooling layer (GAP layer) extracts AU-specific features from the overall facial representation. A GNN-based network produces an AU relation graph to model the relationships between the extracted AU features. The AU relation graph includes relationships for each pair of AUs and predicts the activation probabilities and co-occurrence patterns of AUs. ME-GraphAU demonstrates state-of-the-art performance in AU detection benchmarks BP4D and DISFA [19, 24]. In this paper, we apply these AU characteristics to 3D face reconstruction, enhancing the performance of 3D expression representation.\u003c/p\u003e \u003c/div\u003e"},{"header":"3 Method","content":"\u003cp\u003eThe main design goal of AUFART is to build a self-supervised learning-based 3D face reconstruction framework that takes advantage of the information on AU activation given a single monocular 2D image. Figure\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e shows the overall architecture AUFART framework.\u003c/p\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003e3.1 Architecture\u003c/h2\u003e \u003cp\u003eAUFART learns relationships among AU-specific features and global facial representations to predict accurate 3D face reconstruction parameters. Activation of AUs has individual relationships with each other and describes overall facial expressions [17, 22]. 
We model the relationships among the AU-specific features and the global facial features by a transformer with cross-attention.\u003c/p\u003e \u003cp\u003eWe use the pre-trained AFG block from ME-GraphAU to generate the AU-specific features from the face in an image. The AFG is encouraged to generate the AU-specific features dedicated to the AU detection model. The AU-specific features contain both AU activation status and their associations for each facial display. These features can enhance the capability of the 3D face reconstruction model by providing a richer representation of subtle details in facial expressions. The AFG takes an input image, passes it through the backbone network, and generates the AU-specific features as:\u003cdiv id=\"Equ1\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ1\" name=\"EquationSource\"\u003e\n$${V}_{AFG}=\\left\\{{v}_{1},{v}_{2},\\dots ,{v}_{N}\\right\\}, {v}_{i}\\in {\\mathbb{R}}^{512}, N=27,$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e1\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cem\u003eN\u003c/em\u003e is the number of AU-specific features. We also use the pretrained 3D face reconstruction model DECA as a facial global feature generator. The DECA encoder is composed of a CNN and a FC layer. 
The CNN extracts the global face representation \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{X}}_{\\varvec{D}\\varvec{E}\\varvec{C}\\varvec{A}}\\in {\\mathbb{R}}^{2048}\$\u003c/span\u003e\u003c/span\u003e while the FC layer generates the 3D face reconstruction parameters \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{\\Theta }}_{\\varvec{D}\\varvec{E}\\varvec{C}\\varvec{A}}\\in {\\mathbb{R}}^{236}\$\u003c/span\u003e\u003c/span\u003e from \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{X}}_{\\varvec{D}\\varvec{E}\\varvec{C}\\varvec{A}}\$\u003c/span\u003e\u003c/span\u003e. The global face representation \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{X}}_{\\varvec{D}\\varvec{E}\\varvec{C}\\varvec{A}}\$\u003c/span\u003e\u003c/span\u003e contains generalized global features of the face in an input image. The global face representation \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{X}}_{\\varvec{D}\\varvec{E}\\varvec{C}\\varvec{A}}\$\u003c/span\u003e\u003c/span\u003e is projected to \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{v}}_{\\varvec{G}\\varvec{L}\\varvec{B}}\\in {\\mathbb{R}}^{512}\$\u003c/span\u003e\u003c/span\u003e with FC layer \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\varvec{L}\$\u003c/span\u003e\u003c/span\u003e:\u003cdiv id=\"Equ2\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ2\" name=\"EquationSource\"\u003e\n$${v}_{GLB}={X}_{DECA}^{T}L, L\\in {\\mathbb{R}}^{2048\\times 512}.$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e2\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eThe overall procedure of generating input features of our model named \u003cspan class=\"InlineEquation\"\u003e\u003cspan 
class=\"mathinline\"\u003e\${\\varvec{V}}_{\\varvec{A}\\varvec{F}\\varvec{G}}\$\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{v}}_{\\varvec{G}\\varvec{L}\\varvec{B}}\$\u003c/span\u003e\u003c/span\u003e from an input with AFG and DECA is illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eWe use a transformer-based 3D face reconstruction model which learns semantic relationships within generated features \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{V}}_{\\varvec{A}\\varvec{F}\\varvec{G}}\$\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{v}}_{\\varvec{G}\\varvec{L}\\varvec{B}}\$\u003c/span\u003e\u003c/span\u003e and regresses 3D face reconstruction parameters \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{\\Theta }}_{\\varvec{A}\\varvec{U}\\varvec{F}\\varvec{R}\\varvec{T}}\$\u003c/span\u003e\u003c/span\u003e. A cross-attention mechanism in our transformer-based model enhances the interplay between \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{V}}_{\\varvec{A}\\varvec{F}\\varvec{G}}\$\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{v}}_{\\varvec{G}\\varvec{L}\\varvec{B}}\$\u003c/span\u003e\u003c/span\u003e by enabling the exchange of mutual information between input features. This dynamic interaction allows the model to consider a global context, learning dependencies and correlations among these features. The model consists of layer normalizations (LN), multi-layer perceptron layers (MLP layers), and multi-head cross-attention layers (MHC layer). 
We add a learnable regression token [REG] and apply an input embedding and a position embedding to the set of input features. In the cross-attention process, AU-specific features \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{V}}_{\\varvec{A}\\varvec{F}\\varvec{G}}\$\u003c/span\u003e\u003c/span\u003e are used as queries, while global facial features \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{v}}_{\\varvec{G}\\varvec{L}\\varvec{B}}\$\u003c/span\u003e\u003c/span\u003e are treated as keys and values:\u003cdiv id=\"Equ3\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ3\" name=\"EquationSource\"\u003e\n$${z}_{0}=\\left[{v}_{REG}; {v}_{1}E,{v}_{2}E,\\dots ,{v}_{N}E\\right]+{E}_{pos},$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e3\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Equ4\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ4\" name=\"EquationSource\"\u003e\n$${z{\\prime }}_{l}=\\text{M}\\text{H}\\text{C}\\left(\\text{L}\\text{N}\\left({z}_{l-1}\\right),{v}_{GLB},{v}_{GLB} \\right)+{z}_{l-1},$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e4\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Equ5\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ5\" name=\"EquationSource\"\u003e\n$${z}_{l}=\\text{M}\\text{L}\\text{P}\\left(\\text{L}\\text{N}\\left({z{\\prime }}_{l}\\right)\\right)+{z{\\prime }}_{l},$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e5\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eOutput: \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$y=\\text{M}\\text{L}\\text{P}\\left(\\text{L}\\text{N}\\left({z}_{L}^{0}\\right)\\right),\$\u003c/span\u003e\u003c/span\u003e (6)\u003c/p\u003e \u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan 
class=\"mathinline\"\u003e\${\\varvec{E}}_{\\varvec{p}\\varvec{o}\\varvec{s}}\$\u003c/span\u003e\u003c/span\u003e is the position embedding, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\varvec{E}\$\u003c/span\u003e\u003c/span\u003e is the input embedding. \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\text{M}\\text{H}\\text{C}\$\u003c/span\u003e\u003c/span\u003e takes its query, key, and value inputs in that order. The learnable regression token [REG] is represented as \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{v}}_{\\varvec{R}\\varvec{E}\\varvec{G}}\$\u003c/span\u003e\u003c/span\u003e and prepended to the input features. The output \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\varvec{y}\$\u003c/span\u003e\u003c/span\u003e of this process is used as our 3D face reconstruction parameters \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{\\Theta }}_{\\varvec{A}\\varvec{U}\\varvec{F}\\varvec{R}\\varvec{T}}\$\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003eOnce the 3D face reconstruction parameters \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{\\Theta }}_{\\varvec{A}\\varvec{U}\\varvec{F}\\varvec{R}\\varvec{T}}\$\u003c/span\u003e\u003c/span\u003e are generated, we use the FLAME decoder to reconstruct the 3D face. We then employ a differentiable renderer to generate a rendered image from the reconstructed 3D face. Because gradients can be computed through the rendering process, the model can be trained end to end. 
Finally, we minimize the losses between the input image \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\varvec{I}\$\u003c/span\u003e\u003c/span\u003e and the rendered image \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{I}}_{\\varvec{R}\\varvec{e}}\$\u003c/span\u003e\u003c/span\u003e to train our model.\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003e3.2 Loss function\u003c/h2\u003e \u003cp\u003eGiven a dataset of 2D face images, AUFART is trained by minimizing:\u003cdiv id=\"Equ6\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ6\" name=\"EquationSource\"\u003e\n$${L}_{total}= {L}_{auLmk}+{L}_{auRel}+{L}_{auFeat}+{L}_{reg}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e7\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewith AU-weighted landmark reprojection loss \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{L}}_{\\varvec{a}\\varvec{u}\\varvec{L}\\varvec{m}\\varvec{k}}\$\u003c/span\u003e\u003c/span\u003e, AU-based relative distance loss \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{L}}_{\\varvec{a}\\varvec{u}\\varvec{R}\\varvec{e}\\varvec{l}}\$\u003c/span\u003e\u003c/span\u003e, AU feature loss \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{L}}_{\\varvec{a}\\varvec{u}\\varvec{F}\\varvec{e}\\varvec{a}\\varvec{t}}\$\u003c/span\u003e\u003c/span\u003e, and parameter regularizer \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{L}}_{\\varvec{r}\\varvec{e}\\varvec{g}}\$\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cb\u003eAU-weighted landmark reprojection loss.\u003c/b\u003e This loss dynamically assigns higher weights to the landmark positions corresponding 
to activated AUs during the computation of the landmark reprojection loss. The landmark reprojection loss in existing studies assigns fixed weights to each facial part in every image [1, 11]. However, the landmark movements triggered by the activation of AUs are an effective means of describing the AUs [43]. \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{L}}_{\\varvec{a}\\varvec{u}\\varvec{L}\\varvec{m}\\varvec{k}}\$\u003c/span\u003e\u003c/span\u003e therefore assigns dynamic weights to the facial regions where AUs are activated, to encourage an accurate representation of AUs in the reconstructed face. This enables AUFART to pay more attention to activated AUs during training. The AU-weighted landmark reprojection loss function is defined as:\u003cdiv id=\"Equ7\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ7\" name=\"EquationSource\"\u003e\n$${L}_{auLmk}={\\sum }_{i=1}^{N}{\\sum }_{j=1}^{{L}_{i}}{p}_{i}{‖{k}_{j}-\\left(s\\varPi \\left({M}_{j}\\right)+t\\right)‖}_{1},$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e8\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\varvec{N}\$\u003c/span\u003e\u003c/span\u003e is the number of AUs used in this loss function, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{L}}_{\\varvec{i}}\$\u003c/span\u003e\u003c/span\u003e is the number of landmarks related to the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{i}}^{\\varvec{t}\\varvec{h}}\$\u003c/span\u003e\u003c/span\u003e AU, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{p}}_{\\varvec{i}}\$\u003c/span\u003e\u003c/span\u003e is the activation status of the \u003cspan class=\"InlineEquation\"\u003e\u003cspan 
class=\"mathinline\"\u003e\${\\varvec{i}}^{\\varvec{t}\\varvec{h}}\$\u003c/span\u003e\u003c/span\u003e AU predicted by ME-GraphAU, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{k}}_{\\varvec{j}}\$\u003c/span\u003e\u003c/span\u003e is the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{j}}^{\\varvec{t}\\varvec{h}}\$\u003c/span\u003e\u003c/span\u003e landmark coordinate in the input image, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{M}}_{\\varvec{j}}\$\u003c/span\u003e\u003c/span\u003e is the corresponding landmark on the FLAME model\u0026rsquo;s surface. \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\varvec{s},\\varvec{\\varPi },\\varvec{t}\$\u003c/span\u003e\u003c/span\u003e represent the predicted camera parameters, denoting the isotropic scale \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\varvec{s}\$\u003c/span\u003e\u003c/span\u003e, the orthographic 3D-to-2D projection matrix \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\varvec{\\varPi }\$\u003c/span\u003e\u003c/span\u003e, and the 2D translation \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\varvec{t}\$\u003c/span\u003e\u003c/span\u003e, respectively. We employ the MediaPipe landmark detector to predict landmarks from 2D images, using a total of 105 landmarks distributed across the eyebrow, eye, nose, and mouth regions [27]. 
Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e lists the facial landmarks associated with each AU, and Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e(a) illustrates the 105 landmark indices and positions.\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eAU-based relative distance loss.\u003c/b\u003e The AU-based relative distance loss compares the AU configural features computed from the image landmarks with those computed from the projected 3D landmarks. AU configural features are relative distances between facial landmark points and are used to determine AUs [27]. For example, AU 4 (Brow Lowerer) is determined from the distance between landmark points 21 and 22, which correspond to the left and right inner eyebrow landmarks. This type of loss function is similar to the eye closure loss of DECA, which penalizes the error in the relative offset between upper and lower eyelid landmarks, measured on the image landmarks and on their corresponding projected 3D landmarks. We extend this approach to AUs by incorporating configural features. 
Formally, the AU-based relative distance loss sums the errors between the configural features of the image landmarks \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\varvec{k}\$\u003c/span\u003e\u003c/span\u003e and those of the corresponding 3D landmarks \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\varvec{M}\$\u003c/span\u003e\u003c/span\u003e projected onto the image plane:\u003cdiv id=\"Equ8\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ8\" name=\"EquationSource\"\u003e\n$${L}_{auRel}=\\sum _{i=1}^{23}{‖{c}_{i}^{k}-{c}_{i}^{s\\varPi \\left(M\\right)}‖}_{1},$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e9\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{c}}_{\\varvec{i}}^{\\varvec{k}}\$\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{c}}_{\\varvec{i}}^{\\varvec{s}\\varvec{\\Pi }\\left(\\varvec{M}\\right)}\$\u003c/span\u003e\u003c/span\u003e are the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{i}}^{\\varvec{t}\\varvec{h}}\$\u003c/span\u003e\u003c/span\u003e configural features of the image landmarks \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\varvec{k}\$\u003c/span\u003e\u003c/span\u003e and the projected 3D landmarks \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\varvec{s}\\varvec{\\Pi }\\left(\\varvec{M}\\right)\$\u003c/span\u003e\u003c/span\u003e, respectively. The configural features were originally defined on a 66-landmark model; we adapt them to the 68-landmark model of HRNet [28]. 
The 68 landmark indices are illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e(b), and the configural features corresponding to each AU are listed in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eFacial parts, related AUs, and the indices of the involved landmarks\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFacial parts\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRelated AUs\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eInvolved landmarks\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBrow\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eBrow Lowerer\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0, 1, 2, \u0026hellip;, 19\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eInner brow\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eInner Brow Raiser\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c3\"\u003e \u003cp\u003e1, 3, 5, 6, 8, 9, 11, 13, 15, 16, 18, 19\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eOuter brow\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eOuter Brow Raiser\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eElements excluding \u003cb\u003eInner brow\u003c/b\u003e from \u003cb\u003eBrow\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eEye\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLid Tightener\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e20, 21, \u0026hellip;, 51\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLower eye\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCheek Raiser\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e20, 21, \u0026hellip;, 27, 33, 36, 37, \u0026hellip;, 43, 49\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eUpper eye\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eUpper Lid Raiser\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eElements excluding \u003cb\u003eLower eye\u003c/b\u003e from \u003cb\u003eEye\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNose\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNose Wrinkler\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e52, 53, \u0026hellip;, 64\u003c/p\u003e \u003c/td\u003e 
\u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMouth\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLip Pucker, \u003c/p\u003e \u003cp\u003eLip Stretcher, \u003c/p\u003e \u003cp\u003eLip Tightener\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e65, 66, \u0026hellip;, 104\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eUpper mouth\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eUpper Lip Raiser\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e65, 66, 69, 70, \u0026hellip;, 76, 85, 86, \u0026hellip;, 94, 103, 104\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMouth corner\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLip Corner Puller,\u003c/p\u003e \u003cp\u003eLip Corner Depressor\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e71, 72, 73, 74, 79, 80, 81, 82, 85, 86, 88, 89, 90, 91, 92, 93, 97, 98, 99, 100, 103, 104\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eConfigural features corresponding to each AU\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"2\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cthead\u003e 
\u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFacial AU\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eConfigural features\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eInner Brow Raiser\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${c}_{1}=‖{p}_{21}-{p}_{39}‖, {c}_{4}=‖{p}_{26}-{p}_{45}‖.\$\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eOuter Brow Raiser\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${c}_{5}=‖\\frac{{p}_{19}-{p}_{20}}{2}-\\frac{{p}_{37}-{p}_{38}}{2}‖, {c}_{6}=‖\\frac{{p}_{23}-{p}_{24}}{2}-\\frac{{p}_{43}-{p}_{44}}{2}‖.\$\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBrow Lowerer\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${c}_{7}=‖{p}_{21}-{p}_{22}‖\$\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eUpper Lid Raiser\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSimilar to \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${c}_{5}\$\u003c/span\u003e\u003c/span\u003e, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${c}_{6}\$\u003c/span\u003e\u003c/span\u003e, and\u003c/p\u003e \u003cp\u003e\u003cspan 
class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${c}_{8}=‖\\frac{{p}_{37}-{p}_{38}}{2}-\\frac{{p}_{40}-{p}_{41}}{2}‖, {c}_{9}=‖\\frac{{p}_{43}-{p}_{44}}{2}-\\frac{{p}_{46}-{p}_{47}}{2}‖.\$\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLid Tightener\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSimilar to \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${c}_{8}\$\u003c/span\u003e\u003c/span\u003e, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${c}_{9}\$\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNose Wrinkler\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${c}_{10}=‖{p}_{27}-{p}_{29}‖\$\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eUpper Lip Raiser\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${c}_{11}=‖{p}_{60}-{p}_{65}‖, {c}_{12}=‖{p}_{62}-{p}_{63}‖,{c}_{13}=‖{p}_{32}-{p}_{50}‖, {c}_{14}=‖{p}_{33}-{p}_{51}‖,{c}_{15}=‖{p}_{34}-{p}_{52}‖, {c}_{16}=‖{p}_{41}-{p}_{48}‖,{c}_{17}=‖{p}_{46}-{p}_{54}‖.\$\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLip Corner Puller\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan 
class=\"mathinline\"\u003e\${c}_{18}=‖{p}_{48}-{p}_{54}‖,{c}_{19}=‖\\frac{{p}_{39}+{p}_{40}+{p}_{41}}{3}-{p}_{48}‖,{c}_{20}=‖\\frac{{p}_{42}+{p}_{47}+{p}_{46}}{3}-{p}_{54}‖.\$\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLip Corner Depressor\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSimilar to \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${c}_{19}\$\u003c/span\u003e\u003c/span\u003e, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${c}_{20}\$\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLip Stretcher\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSimilar to \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${c}_{18}\$\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLip Tightener\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${c}_{21}=‖{p}_{51}-{p}_{57}‖\$\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eJaw Drop\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSimilar to \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${c}_{21}\$\u003c/span\u003e\u003c/span\u003e and\u003c/p\u003e \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${c}_{22}=‖{p}_{50}-{p}_{58}‖, {c}_{23}=‖{p}_{52}-{p}_{56}‖.\$\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e 
\u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eAU feature loss.\u003c/b\u003e The AU feature loss computes the distance between the AU-specific features of the input image \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\varvec{I}\$\u003c/span\u003e\u003c/span\u003e and those of the rendered image \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{I}}_{\\varvec{R}\\varvec{e}}\$\u003c/span\u003e\u003c/span\u003e. Optimizing this loss during training encourages the reconstructed 3D face to convey AU activations that are visually similar to those in the input image. We use AFG to generate the AU-specific features from both images:\u003cdiv id=\"Equ9\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ9\" name=\"EquationSource\"\u003e\n$${L}_{auFeat}={‖\\text{A}\\text{F}\\text{G}\\left(I\\right)-\\text{A}\\text{F}\\text{G}\\left({I}_{Re}\\right)‖}_{2}.$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e10\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003e \u003cb\u003eParameter regularization.\u003c/b\u003e \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{L}}_{\\varvec{r}\\varvec{e}\\varvec{g}}\$\u003c/span\u003e\u003c/span\u003e regularizes the expression \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\varvec{\\psi }\$\u003c/span\u003e\u003c/span\u003e, pose \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\varvec{\\theta }\$\u003c/span\u003e\u003c/span\u003e, and camera \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\varvec{c}\$\u003c/span\u003e\u003c/span\u003e parameters, each weighted by its own regularization coefficient, and is defined as:\u003cdiv id=\"Equ10\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" 
id=\"FileID_Equ10\" name=\"EquationSource\"\u003e\n$${L}_{reg}={\\lambda }_{\\psi }{‖\\psi ‖}_{2}^{2}+{\\lambda }_{\\theta }{‖\\theta ‖}_{2}^{2}+{\\lambda }_{c}{‖c‖}_{2}^{2}.$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e11\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003c/div\u003e"},{"header":"4 Experiments","content":"\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003e4.1 Implementation details\u003c/h2\u003e \u003cp\u003eAUFART was trained on a total of approximately 300,000 images from VGGFace2, Aff-wild2, CelebA-HQ, FFHQ, and BUPT-CB [29\u0026ndash;33]. We used PyTorch3D to render the reconstructed 3D face onto the image plane [37]. We trained for 15 epochs using the Adam optimizer with a learning rate of 1e-05 and a batch size of 16. For parameter regularization, a coefficient of 1e-05 is applied to the expression parameters and 0.1 to the pose parameters. The loss weights are 0.75 for \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{L}}_{\\varvec{a}\\varvec{u}\\varvec{L}\\varvec{m}\\varvec{k}}\$\u003c/span\u003e\u003c/span\u003e, 0.25 for \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{L}}_{\\varvec{a}\\varvec{u}\\varvec{R}\\varvec{e}\\varvec{l}}\$\u003c/span\u003e\u003c/span\u003e, and 0.75 for \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{L}}_{\\varvec{a}\\varvec{u}\\varvec{F}\\varvec{e}\\varvec{a}\\varvec{t}}\$\u003c/span\u003e\u003c/span\u003e. 
Among the 3D face reconstruction parameters, the AUFART model predicts only \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\varvec{\\psi }\$\u003c/span\u003e\u003c/span\u003e, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\varvec{\\theta }\$\u003c/span\u003e\u003c/span\u003e, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\varvec{c}\$\u003c/span\u003e\u003c/span\u003e, while DECA predicts \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{\\beta }}^{\\varvec{D}\\varvec{E}\\varvec{C}\\varvec{A}}\$\u003c/span\u003e\u003c/span\u003e, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{l}}^{\\varvec{D}\\varvec{E}\\varvec{C}\\varvec{A}}\$\u003c/span\u003e\u003c/span\u003e, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\varvec{a}}^{\\varvec{D}\\varvec{E}\\varvec{C}\\varvec{A}}\$\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003e4.2 Quantitative evaluation\u003c/h2\u003e \u003cp\u003eAlthough a number of benchmarks exist for the quantitative evaluation of facial identity in 3D face reconstruction, there is currently no standard benchmark specifically designed to evaluate expression reconstruction [18]. The point-wise distance between a reconstructed 3D face and a ground-truth (GT) scan is not an appropriate performance metric for the 3D reconstruction of facial expressions, since it is dominated by the identity parameters. Therefore, we propose to compare the AU activation states that ME-GraphAU detects in the input images and in the rendered images, using F1 scores. We use the DISFA dataset for quantitative evaluation because it is one of the training datasets of ME-GraphAU. 
This choice was made so that the AU detection model can be expected to detect AU activation states reliably during the evaluation.\u003c/p\u003e \u003cp\u003eDISFA contains videos of 27 subjects watching a video and consists of 130,815 frames. Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e presents results for 9 of the 27 subjects, each identified by its subject number. The average result for the remaining 18 subjects is reported collectively as \"others\". We report results this way because frames in DISFA sequences predominantly show neutral expressions. The 9 videos with high AU activation, that is, high expressiveness, are therefore better suited to evaluating the performance of expression reconstruction. Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e presents the evaluation results for each AU; the assessed AUs are those that the ME-GraphAU model can predict.\u003c/p\u003e \u003cp\u003eIn the per-subject evaluation results presented in Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, AUFART outperforms both DECA and EMOCA for all subjects. Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e provides the per-AU performance of the three methods. AUFART outperforms both DECA and EMOCA for AUs related to the upper face area (AU 4, 5, 7), particularly for AU 4 (Brow Lowerer). It exhibits a 5-fold and a 1.5-fold performance increase over DECA and EMOCA, respectively. In the lower face area, AUFART outperforms DECA significantly and EMOCA slightly. In summary, the average F1 scores are 0.39, 0.18, and 0.30 for AUFART, DECA, and EMOCA, respectively. These results indicate that AUFART achieves higher performance than DECA and EMOCA. 
Reconstruction of face details such as forehead wrinkles and eyebrow movements, which can be detected by AU1 (Inner Brow Raiser) and AU2 (Outer Brow Raiser) [38], was not considered and is left for future work.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePer-subject F1 score evaluation results for AU detection on input images and rendered images in DISFA\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eSubject\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"3\" nameend=\"c4\" namest=\"c2\"\u003e \u003cp\u003eMethod\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAUFART\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDECA\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eEMOCA\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e03\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e0.46\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c3\"\u003e \u003cp\u003e0.23\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.24\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e06\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e0.35\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.13\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.32\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e11\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e0.45\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.18\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.30\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e12\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e0.46\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.30\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.43\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e16\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e0.43\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.05\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e 
\u003cp\u003e0.24\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e18\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e0.45\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.19\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.25\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e23\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e0.27\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.16\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.24\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e25\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e0.34\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.21\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.25\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e27\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e0.33\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.07\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.30\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003eOthers\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e0.35\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.18\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.20\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAvg.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e0.39\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.18\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.30\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePer-AU F1 score evaluation results for AU detection on input images and rendered images in DISFA\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eAU\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"3\" 
nameend=\"c4\" namest=\"c2\"\u003e \u003cp\u003eMethod\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAUFART\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDECA\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eEMOCA\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e01\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.00\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e02\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.00\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e04\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e0.47\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.09\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.30\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e05\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e 
\u003cp\u003e\u003cb\u003e0.24\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.08\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.12\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e07\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e0.78\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.47\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.64\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e09\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.17\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e0.21\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e0.83\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.47\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.81\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e12\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.84\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.40\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e0.86\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e15\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e0.13\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.02\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e20\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.04\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.01\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e0.06\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e23\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e0.28\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.03\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.02\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e26\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e0.56\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.46\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.36\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e 
\u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAvg.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e0.39\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.18\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.30\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003e4.3 Qualitative evaluation\u003c/h2\u003e \u003cp\u003eFor the qualitative evaluation, we used the in-the-wild face image dataset 300W and the DISFA dataset. Figure\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e shows the 3D face reconstruction results for the 300W dataset. We highlight sub-image areas of high accuracy with green and blue boxes and areas of low accuracy with red boxes. AUFART clearly generates better eyebrow and eye movements than DECA and EMOCA. This confirms that our transformer-based model not only learns the features related to AU 1, 2, and 4 from the input images but is also guided by the AU-based loss functions. It shows more robust reconstruction performance for upper facial movements as well. Similarly, as can be observed in the blue boxes, AUFART demonstrates higher reconstruction accuracy than DECA and EMOCA for various mouth shapes in the lower face area, especially for AU 12 (Lip Corner Puller) and AU 15 (Lip Corner Depressor).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe experimental results for the DISFA dataset are presented in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e. 
It shows images of Subjects 3, 16, and 27 in the DISFA dataset and examines, for each subject, whether changes in facial expression between two adjacent frames are captured accurately. For Subject 3, DECA produces facial expressions with little variation between adjacent frames, whereas both AUFART and EMOCA show clear expression changes and reconstruct accurate facial movements within each frame. For Subject 16, the activation of AU 25 (Lips Part) is observable in the AUFART and DECA reconstructions but not in the EMOCA reconstruction. Additionally, the activation of AU 1 and AU 2 is observable with AUFART and EMOCA, but not with DECA. For Subject 27, AUFART captures the subtle activation changes of AU 1 between adjacent frames. Consequently, AUFART is confirmed to perform comparably to state-of-the-art models such as DECA and EMOCA, while surpassing them in certain scenarios.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eIn summary, the qualitative evaluation results for the 300W and DISFA datasets indicate that AUFART outperforms DECA and EMOCA in the 3D reconstruction of facial expressions induced by AU activations.\u003c/p\u003e \u003c/div\u003e"},{"header":"5 Conclusion","content":"\u003cp\u003eIn this paper, we propose a 3D face reconstruction framework named AUFART, which is based on a transformer model guided by AU features. Our framework incorporates pretrained feature generators that produce AU-specific features and global facial features, taken from a state-of-the-art AU detection model and a state-of-the-art 3D face reconstruction model, respectively. Our transformer-based model captures the relationships among the generated AU-specific features and global facial features and predicts accurate 3D face reconstruction parameters. In addition, we introduce AU-based loss functions that drive learning toward minimal discrepancy in AU activations between the input and the rendered reconstruction. 
AUFART achieves more accurate 3D face reconstruction of AUs, which were not fully considered in existing frame-based 3D face reconstruction studies. We compare the AU activation states that a state-of-the-art AU detection model detects in the input images and in the rendered images. AUFART achieves an average F1 score of 0.39, a relative improvement of at least 30% over the frame-based state-of-the-art 3D face reconstruction methods DECA (0.18) and EMOCA (0.30). This highlights the superior performance of our proposed method, especially in capturing and reconstructing facial expressions related to AUs. Future work may explore detailed 3D facial restoration based on AU features and investigate temporal modeling for 3D face reconstruction that takes the temporal characteristics of AUs into account.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eHyeonjin Kim conceptualized the main idea, designed experiments, conducted research, and wrote the manuscript. Pei Wang assisted in data collection and experimental work. Professor Hyukjoon Lee verified the main idea, provided guidance on research direction, and contributed to manuscript writing. All authors participated in reviewing and revising the manuscript.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eYao Feng, Haiwen Feng, Michael J. Black, and Timo Bolkart. Learning an animatable detailed 3D face model from in-the-wild images. ACM Transactions on Graphics (ToG), Vol.40, No.88, pp.1-13, 2021.\u003c/li\u003e\n\u003cli\u003eAlexandros Lattas, Stylianos Moschoglou, Baris Gecer, Stylianos Ploumpis, Vasileios Triantafyllou, Abhijeet Ghosh, and Stefanos Zafeiriou. AvatarMe: Realistically renderable 3D facial reconstruction \u0026ldquo;in-the-wild.\u0026rdquo; In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 
760\u0026ndash;769, 2020.\u003c/li\u003e\n\u003cli\u003eKristina Scherbaum, Tobias Ritschel, Matthias Hullin, Thorsten Thorm\u0026auml;hlen, Volker Blanz, and Hans-Peter Seidel. Computer-suggested facial makeup. Comput. Graph. Forum, vol. 30, no. 2, pages 485-492, 2011.\u003c/li\u003e\n\u003cli\u003eChen Cao, Derek Bradley, Kun Zhou, and Thabo Beeler. Real-time high-fidelity facial performance capture. ACM Trans. Graph., 34(4), Jul 2015. \u003c/li\u003e\n\u003cli\u003eLiwen Hu, Shunsuke Saito, Lingyu Wei, Koki Nagano, Jaewoo Seo, Jens Fursund, Iman Sadeghi, Carrie Sun, Yen-Chun Chen, and Hao Li. Avatar digitization from a single image for real-time rendering. ACM Trans. Graph., 36(6), Nov. 2017.\u003c/li\u003e\n\u003cli\u003eDiego R. Faria, Mario Vieira, Fernanda C.C. Faria, and Cristiano Premebida. Affective facial expressions recognition for human-robot interaction. In Proc. 26th IEEE Int. Symp. Robot Hum. Interact. Commun. (RO-MAN), pages 805-810, Aug. 2017.\u003c/li\u003e\n\u003cli\u003eJinbo Xing, Menghan Xia, Yuechen Zhang, Xiaodong Cun, Jue Wang, and Tien-Tsin Wong. Codetalker: Speech-driven 3d facial animation with discrete motion prior. arXiv preprint arXiv:2301.02379, 2023.\u003c/li\u003e\n\u003cli\u003eTianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6):194:1\u0026ndash;194:17, 2017.\u003c/li\u003e\n\u003cli\u003ePascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter. A 3d face model for pose and illumination invariant face recognition. In 2009 sixth IEEE international conference on advanced video and signal based surveillance, pages 296\u0026ndash;301. IEEE, 2009.\u003c/li\u003e\n\u003cli\u003eAraceli Morales, Gemma Piella, and Federico M. Sukno. Survey on 3d face reconstruction from uncalibrated images. 
Computer Science Review, 40:1\u0026ndash;35, 2021.\u003c/li\u003e\n\u003cli\u003eSoubhik Sanyal, Timo Bolkart, Haiwen Feng, and Michael J. Black. Learning to regress 3D face shape and expression from an image without 3D supervision. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 7763\u0026ndash;7772, 2019.\u003c/li\u003e\n\u003cli\u003eRadek Danecek, Michael J. Black, and Timo Bolkart. EMOCA: Emotion driven monocular face capture and animation. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 20311\u0026ndash;20322, 2022.\u003c/li\u003e\n\u003cli\u003ePanagiotis P Filntisis, George Retsinas, Foivos Paraperas-Papantoniou, Athanasios Katsamanis, Anastasios Roussos, and Petros Maragos. Spectre: Visual speech-informed perceptual 3d facial expression reconstruction from videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5744\u0026ndash;5754, 2023.\u003c/li\u003e\n\u003cli\u003eTetiana Martyniuk, Orest Kupyn, Yana Kurlyak, Igor Krashenyi, Jiří Matas, and Viktoriia Sharmanska. Dad-3dheads: A large-scale dense, accurate and diverse dataset for 3d head alignment from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20942\u0026ndash;20952, 2022.\u003c/li\u003e\n\u003cli\u003ePaul Ekman and Wallace V. Friesen. Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, 1978.\u003c/li\u003e\n\u003cli\u003eGeethu Miriam Jacob and Bjorn Stenger. Facial action unit detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7680\u0026ndash;7689, 2021.\u003c/li\u003e\n\u003cli\u003eCheng Luo, Siyang Song, Weicheng Xie, Linlin Shen, and Hatice Gunes. Learning multi-dimensional edge feature-based au relation graph for facial action unit recognition. 
arXiv preprint arXiv:2205.01782, 2022.\u003c/li\u003e\n\u003cli\u003eZhen-Hua Feng, Patrik Huber, Josef Kittler, Peter Hancock, Xiao-Jun Wu, Qijun Zhao, Paul Koppen, and Matthias Rätsch. Evaluation of dense 3D reconstruction from 2D face images in the wild. In International Conference on Automatic Face \u0026amp; Gesture Recognition (FG), pages 780\u0026ndash;786, 2018.\u003c/li\u003e\n\u003cli\u003eS. Mohammad Mavadati, Mohammad H. Mahoor, Kevin Bartlett, Philip Trinh, and Jeffrey F. Cohn. Disfa: A spontaneous facial action intensity database. IEEE Transactions on Affective Computing, 4(2):151\u0026ndash;160, 2013.\u003c/li\u003e\n\u003cli\u003eVolker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pages 187\u0026ndash;194, 1999.\u003c/li\u003e\n\u003cli\u003eChen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Keliang Zhou. FaceWarehouse: A 3D facial expression database for visual computing. Transactions on Visualization and Computer Graphics, 20:413\u0026ndash;425, 2014.\u003c/li\u003e\n\u003cli\u003eBrais Martinez, Michel F. Valstar, Bihan Jiang, and Maja Pantic. Automatic Analysis of Facial Actions: A Survey. in IEEE Transactions on Affective Computing, vol. 10, no. 3, pp. 325-347, 1 July-Sept. 2019.\u003c/li\u003e\n\u003cli\u003eCheng-Hao Tu, Chih-Yuan Yang, and Jane Yung-jen Hsu. IdenNet: Identity-aware facial action unit detection. In 2019 14th IEEE International Conference on Automatic Face \u0026amp; Gesture Recognition (FG 2019), pages 1\u0026ndash;8. IEEE, 2019.\u003c/li\u003e\n\u003cli\u003eXing Zhang, Lijun Yin, Jeffrey F Cohn, Shaun Canavan, Michael Reale, Andy Horowitz, Peng Liu, and Jeffrey M. Girard. Bp4d-spontaneous: a high-resolution spontaneous 3d dynamic facial expression database. 
Image and Vision Computing, 32(10):692\u0026ndash;706, 2014.\u003c/li\u003e\n\u003cli\u003ePetar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Li\u0026ograve;, and Yoshua Bengio. Graph attention networks. arXiv:1710.10903, 2017.\u003c/li\u003e\n\u003cli\u003eYury Kartynnik, Artsiom Ablavatski, Ivan Grishchenko, and Matthias Grundmann. Real-time facial surface geometry from monocular video on mobile GPUs. In Third Workshop on Computer Vision for AR/VR, Long Beach, CA, 2019. \u003c/li\u003e\n\u003cli\u003eNazil Perveen and Chalavadi Krishna Mohan. Configural representation of facial action units for spontaneous facial expression recognition in the wild. In VISIGRAPP (4: VISAPP), pages 93\u0026ndash;102, 2020.\u003c/li\u003e\n\u003cli\u003eKe Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, and Jingdong Wang. High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514, 2019.\u003c/li\u003e\n\u003cli\u003eQiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In International Conference on Automatic Face \u0026amp; Gesture Recognition (FG), pages 67\u0026ndash;74, 2018.\u003c/li\u003e\n\u003cli\u003eDimitrios Kollias and Stefanos Zafeiriou. Aff-Wild2: Extending the Aff-Wild database for affect recognition. arXiv preprint arXiv:1811.07770, 2018.\u003c/li\u003e\n\u003cli\u003eZiwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Large-scale celebfaces attributes (celeba) dataset. Retrieved August, 15:2018, 2018.\u003c/li\u003e\n\u003cli\u003eTero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401\u0026ndash;4410, 2019.\u003c/li\u003e\n\u003cli\u003eYaobin Zhang and Weihong Deng. 
Class-balanced training for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 824\u0026ndash;825, 2020.\u003c/li\u003e\n\u003cli\u003eJiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. Retinaface: Single-shot multilevel face localisation in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5203\u0026ndash;5212, 2020.\u003c/li\u003e\n\u003cli\u003eAlex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. In 2016 IEEE international conference on image processing (ICIP), pages 3464\u0026ndash;3468. IEEE, 2016.\u003c/li\u003e\n\u003cli\u003eJiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690\u0026ndash;4699, 2019.\u003c/li\u003e\n\u003cli\u003eNikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3d deep learning with pytorch3d. arXiv:2007.08501, 2020.\u003c/li\u003e\n\u003cli\u003eY-I Tian, Takeo Kanade, and Jeffrey F Cohn. Recognizing action units for facial expression analysis. T-PAMI, 23(2):97\u0026ndash;115, 2001.\u003c/li\u003e\n\u003cli\u003eDolley Shukla, Chandra Shekhar Mithlesh, and Manisha Sharma. A survey on different video scene change detection techniques. International Journal of Science and Research (IJSR). National Conference on Knowledge, Innovation in Technology and Engineering (NCKITE). 2015.\u003c/li\u003e\n\u003cli\u003eTengfei Song, Lisha Chen, Wenming Zheng, and Qiang Ji. Uncertain Graph Neural Networks for Facial Action Unit Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, 35(7), 5993-6001, 2021.\u003c/li\u003e\n\u003cli\u003eAdrian Bulat and Georgios Tzimiropoulos. 
How far are we from solving the 2d \u0026amp; 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1021-1030, 2017.\u003c/li\u003e\n\u003cli\u003eTengfei Song, Zijun Cui, Wenming Zheng, and Qiang Ji. Hybrid message passing with performance-driven structures for facial action unit detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6267\u0026ndash;6276, 2021.\u003c/li\u003e\n\u003cli\u003eShangfei Wang, Yanan Chang, and Can Wang. Dual learning for joint facial landmark detection and action unit recognition. IEEE Transactions on Affective Computing, 2021.\u003c/li\u003e\n\u003cli\u003eA. Tewari, M. Zollh\u0026ouml;fer, H. Kim, P. Garrido, F. Bernard, P. P\u0026eacute;rez, C. Theobalt, MoFA: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction, in: Proc. of IEEE ICCV, 2017, pp. 1274\u0026ndash;1283.\u003c/li\u003e\n\u003cli\u003eJ. Yang, F. Zhang, B. Chen and S. U. Khan, \u0026quot;Facial expression recognition based on facial action unit\u0026quot;, Proc. 10th Int. Green Sustain. Comput. Conf. (IGSC), pp. 1-6, Oct. 2019.\u003c/li\u003e\n\u003cli\u003eKuang, Chenyi, Jeffrey O. Kephart, and Qiang Ji. \u0026quot;AU-Aware Dynamic 3D Face Reconstruction From Videos With Transformer.\u0026quot; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2024.\u003c/li\u003e\n\u003cli\u003eKuang, Chenyi, et al. \u0026quot;AU-Aware 3D Face Reconstruction through Personalized AU-Specific Blendshape Learning.\u0026quot; European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022.\u003c/li\u003e\n\u003cli\u003eTellamekala, Mani Kumar, et al. 
\u0026quot;Are 3D Face Shapes Expressive Enough for Recognising Continuous Emotions and Action Unit Intensities?.\u0026quot; IEEE Transactions on Affective Computing (2023).\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"3D face reconstruction, Facial action unit, Transformer, Deep learning","lastPublishedDoi":"10.21203/rs.3.rs-4310180/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4310180/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eThe reconstruction of 3D face shapes and expressions from single 2D images remains unconquered due to the lack of detailed modeling of human facial movements such as the correlation between the different parts of faces. Facial action units (AUs), which represent detailed taxonomy of the human facial movements based on observation of activation of muscles or muscle groups, can be used to model various facial expression types. 
We present a novel 3D face reconstruction framework called AU feature-based 3D FAce Reconstruction using Transformer (AUFART) that can generate a 3D face model that is responsive to AU activation given a single monocular 2D image to capture expressions. AUFART leverages AU-specific features as well as facial global features to achieve accurate 3D reconstruction of facial expressions using transformers. We also introduce a loss function which is to force the learning toward the minimal discrepancy in AU activations between the input and rendered reconstruction. The proposed framework achieves an average F1 score of 0.39, outperforming state-of-the-art methods.\u003c/p\u003e","manuscriptTitle":"Action Unit-Based 3D Face Reconstruction Using Transformers","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-04-30 19:03:03","doi":"10.21203/rs.3.rs-4310180/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"1f73af63-8721-409b-9f0f-5e38fbe6b131","owner":[],"postedDate":"April 30th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2024-06-17T13:59:14+00:00","versionOfRecord":[],"versionCreatedAt":"2024-04-30 
19:03:03","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-4310180","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4310180","identity":"rs-4310180","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
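
The quantitative evaluation in Tables 3 and 4 compares the AU activation states detected on each input image with those detected on the corresponding rendered reconstruction and summarizes the agreement as an F1 score. The sketch below illustrates that kind of per-AU comparison under stated assumptions: activations are already binarized (1 = active), the input-image detections serve as the reference, and an AU with no activations on either side scores 0.0. The function names and the toy data are hypothetical and are not taken from the paper; how the paper aggregates scores across subjects and AUs is not specified here, so the mean shown is a plain unweighted one.

```python
from typing import Dict, List, Sequence


def f1_score(reference: Sequence[int], predicted: Sequence[int]) -> float:
    """Binary F1 between reference and predicted activation sequences."""
    tp = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 1)
    fp = sum(1 for r, p in zip(reference, predicted) if r == 0 and p == 1)
    fn = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Assumption: an AU that is never active on either side scores 0.0.
    return 2 * tp / denom if denom else 0.0


def per_au_f1(
    input_aus: List[Sequence[int]],
    rendered_aus: List[Sequence[int]],
    au_ids: Sequence[int],
) -> Dict[int, float]:
    """Per-AU F1 over frames; each frame holds one binary value per AU."""
    scores = {}
    for col, au in enumerate(au_ids):
        ref = [frame[col] for frame in input_aus]
        pred = [frame[col] for frame in rendered_aus]
        scores[au] = f1_score(ref, pred)
    return scores


if __name__ == "__main__":
    # Toy data (hypothetical): 4 frames, columns are AU 4 and AU 12.
    detected_on_input = [[1, 0], [1, 1], [0, 0], [1, 0]]
    detected_on_render = [[1, 0], [0, 1], [0, 0], [1, 1]]
    scores = per_au_f1(detected_on_input, detected_on_render, [4, 12])
    for au, score in scores.items():
        print(f"AU {au:02d}: F1 = {score:.2f}")
    print(f"unweighted mean F1 = {sum(scores.values()) / len(scores):.2f}")
```

On the toy data, AU 4 has two true positives and one miss (F1 = 0.80), and AU 12 has one true positive and one false activation (F1 = 0.67). A relative improvement between two methods can then be read as the ratio of their scores, e.g. 0.39 / 0.30 = 1.3 for the averages reported for AUFART and EMOCA.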

