A High-Fidelity Facial Expression Representation Framework Using Curated Multi-Dataset Learning

doi:10.21203/rs.3.rs-9295088/v1


2026 · doi:10.21203/rs.3.rs-9295088/v1

preprint OA: closed


⚙ AI-generated deep summary by claude@2026-07, 2026-07-03 · read from full text ⓘ

This preprint studies facial expression recognition (FER) by combining multiple publicly available FER datasets (e.g., FER2013, Humans Face Emotions, CK+) to increase expression diversity, then applying an automated quality pipeline to filter out blurred, low-resolution, and ambiguous images to create a curated set of 16,828 images across six emotion categories. The authors extract deep facial embeddings using frozen convolutional backbones such as ResNet18 and MobileNetV2 and evaluate these fixed representations with classical machine-learning classifiers under similar experimental conditions, reporting that ResNet18 embeddings are more discriminative and that Random Forests achieve a peak accuracy of 70.34%. They also analyze optimization behavior and find AdamW provides the most consistent convergence and validation consistency across folds, attributing performance differences mainly to dataset fidelity rather than end-to-end architectural complexity. The paper does not explicitly discuss endometriosis or adenomyosis; it was included in the corpus via a keyword match in the upstream search index.

Read from the paper's body, not the abstract. Not a substitute for reading the paper. No clinical advice.

Full text 82,465 characters · extracted from preprint-html

Research Article

A High-Fidelity Facial Expression Representation Framework Using Curated Multi-Dataset Learning

Himanshu Verma, Nimish Kumar, Pankaj Vyas

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-9295088/v1. This work is licensed under a CC BY 4.0 License. Status: Under Review. Version 1.

Abstract

Current facial expression recognition (FER) models tend to be unstable when trained on heterogeneous public data because of annotation noise, visual artifacts, and dataset-dependent bias. This paper proposes a representation-based FER model for psychological stress measurement that prioritizes automated data cleaning and architecture-agnostic feature learning over architectural complexity.
Multiple publicly available FER datasets are combined to increase expression diversity and are filtered by an automated quality pipeline that removes blurred, low-resolution, and ambiguous images, producing a curated corpus of 16,828 images across six emotion categories. Frozen convolutional backbones, such as ResNet18 and MobileNetV2, are used to extract deep facial embeddings so that the backbones can be compared fairly, without the effects of fine-tuning. These embeddings are evaluated with several classical machine-learning classifiers under similar experimental conditions. The findings indicate that ResNet18 embeddings are more discriminative than their lightweight counterparts and that the best overall results are obtained with Random Forest classifiers, which reach a peak accuracy of 70.34%. A further analysis of optimizers shows that AdamW has the most consistent convergence and validation behavior across folds. The results indicate that dataset integrity and representation fidelity are dominant factors in FER performance, outweighing the effect of end-to-end architectural complexity. The proposed framework provides a reliable and reproducible FER system for stress assessment that can operate in limited and noisy data environments.

1. Introduction

Facial expressions provide observable signals of human affective and behavioural states. Automatic FER is therefore a central task in affective computing and behavioural analysis [1], [2]. In practice, FER faces several challenges, including large variations in facial appearance, illumination, and pose, as well as inconsistent image quality [3].
Deep learning, and CNNs in particular, have significantly improved FER performance over hand-crafted feature-based methods [4], [5], [17], owing to the capacity of CNNs to learn hierarchical facial representations that are partly invariant to changes in appearance. Nevertheless, FER models trained on publicly available data tend to generalize poorly and give unstable results across datasets [6].

One contributing factor is dataset quality. FER datasets vary in acquisition conditions, resolution, annotation protocols, and class distributions [22]. Controlled datasets such as CK+ offer high image quality but contain exaggerated expressions [7], while datasets such as FER2013 reflect realistic conditions but suffer from low resolution, label noise, and ambiguous expressions [8]. Training on such data propagates noise into the learned representations. Prior studies indicate that label noise and low-quality samples severely degrade classification accuracy and feature discriminability [9], [10].

Combining multiple datasets can increase variety and reduce dataset-specific bias. However, naive dataset aggregation introduces annotation inconsistencies and visual artefacts [11]. This indicates that dataset curation and automated quality control are required in FER systems.

Beyond classification accuracy, representation quality also matters. High-fidelity facial embeddings stabilize downstream tasks and improve the separation between classes. Deep hierarchical representations obtained with residual network architectures are known to capture subtle facial differences effectively [12]. Hybrid systems built on deep feature extractors also tend to outperform end-to-end training, typically when datasets are moderate in size or partially noisy [13], [14].
This work is motivated by the limitations of current FER systems and aims at robust learning of facial expression representations from carefully curated multi-dataset data. Multiple datasets (FER2013, Humans Face Emotions, and CK+) are merged to increase variety and are screened with an automated pipeline that filters out incorrect or mislabelled data. Deep embeddings obtained from standard CNN backbones are evaluated systematically with a range of machine-learning classifiers. The paper shows that dataset curation combined with representation-oriented learning yields consistent and discriminative facial features that are robust to dataset shift and annotation noise, underscoring the role of data quality rather than architectural complexity.

2. Related Work

Early FER models were based on geometric features and texture descriptors such as Local Binary Patterns and Gabor filters. These methods achieved moderate performance under controlled conditions but deteriorated markedly under pose changes, illumination variation, and occlusion [11], [12]. Their performance was highly sensitive to face alignment and to manually tuned parameters.

The introduction of deep CNNs marked an important turn in FER research. CNNs learn hierarchical representations directly from pixel data, reducing the reliance on manually designed features [4], [13]. VGG-style networks extracted effective features through stacked convolutions and achieved strong performance on FER2013 [5], [14], [24]. Residual networks further increased representational depth by addressing the vanishing-gradient problem with identity mappings [12].

Dataset characteristics are major determinants of FER performance. The CK+ dataset offers high-resolution, labelled images of posed and exaggerated expressions [7]. Models trained only on CK+ tend to fail when generalized to unconstrained settings.
FER2013 introduced large-scale in-the-wild facial images, but its very low resolution and noisy annotations limit achievable accuracy; human performance on it is reported as 65-68% [8]. On FER2013, single-network CNN models usually plateau between 70% and 73% without additional information [14], [19].

Several studies report that dataset noise directly affects the learned decision boundaries. Unmitigated label noise decreases classification accuracy by 5-10% or more [9]. Analysis of FER-specific data shows that mislabelled and ambiguous samples distort class separability and embedding geometry [10]. Noise-aware learning and sample filtering approaches improve robustness without full manual relabelling [9], [21], [15].

Multi-dataset learning has been investigated as a way to increase expression diversity and reduce dataset bias. However, naive dataset merging introduces annotation inconsistencies and duplicated samples [11]. Cross-dataset tests show accuracy drops of over 15% when models are tested outside their training data [18]. These results highlight that dataset harmonization and quality control should precede representation learning. Recent surveys of FER methods, benchmarks, and datasets further emphasize the importance of standardized evaluation protocols [20].

Automated data curation is a relatively recent development. Laplacian variance-based blur detection is effective at removing low-quality facial images and improves CNN training stability [32]. Classifier-confidence-based filtering detects samples that repeatedly receive low confidence across models [21]. Such strategies have been shown to improve downstream accuracy by 2–5% on noisy datasets [9], [10].

The choice of model architecture also affects representation fidelity.
ResNet18 balances depth and computational efficiency while preserving fine-grained facial features [12]. MobileNetV2 reduces model size and inference cost but can lose discriminative power, especially for subtle expressions [37]. Comparative studies report accuracy differences of 2 to 4% between ResNet18 and MobileNetV2 under the same training conditions [28].

Hybrid FER pipelines combining deep embeddings with classical machine-learning classifiers have demonstrated competitive performance. Gradient-boosting classifiers, Random Forests, and Support Vector Machines usually outperform softmax classifiers when trained on fixed deep features [13], [23]. Ensemble approaches, Random Forests in particular, are robust to residual noise and handle nonlinear decision boundaries in moderate-sized data [18], [33]. Gains of 1–3% over end-to-end CNN classifiers are commonly reported in hybrid settings [23], [34].

Optimization strategy further influences representation quality and convergence behavior. Adaptive optimizers such as RMSProp and Adam converge faster but can generalize unstably [26]. Decoupled weight-decay methods such as AdamW regularize more effectively and give more predictable validation performance [6], [27]. FER studies report that AdamW improves accuracy by 1-2% on average compared with standard Adam [6].

Recent efforts have extended FER research to handle both static and dynamic emotion representations. A systematic review categorizes methods as image-based or sequence-based and identifies difficulties related to expression ambiguity, intensity variation, cross-dataset inconsistency, and temporal uncertainty [25].
To address annotation noise, ReSup learns from clean and noisy labels concurrently using agreement between two networks, and demonstrates measurable accuracy gains on FERPlus [26].

Representation diversity has also been shown to strengthen FER. Diversified feature learning models that capture global and local cues achieve high accuracy on RAF-DB, FER+, and AffectNet, confirming the relevance of deep embeddings in unconstrained conditions [27]. Ensemble learning also improves robustness: multi-architecture ensembles trained with distillation are competitive at a lower computational cost [28].

Dataset construction remains a critical issue in FER evaluation. RAF-DB provides large-scale annotations of both basic and compound emotions across varied demographic groups [29]. Refinement work such as relabelling and category expansion of FER datasets has been shown to improve performance on benchmark tasks [30].

Attention-based architectural improvements have also been proposed. Convolutional block attention modules enhance discriminative feature learning, with higher accuracy reported on common benchmarks [31]. Comparative studies on real-time FER highlight trade-offs between accuracy and computational efficiency, favouring lightweight models for low-latency deployment [32]. AffectNet-HQ, a large-scale dataset, enables further benchmarking, with deep CNN models reaching over 85% recognition accuracy [33]. Temporal modeling methods that integrate CNNs with RNN-LSTM networks improve the recognition of expression dynamics, which underlines the importance of temporal context [34].

Multi-loss and ensemble strategies are also used to make FER more robust. Feature fusion models and voting-based ensembles outperform single-model baselines on both RAF-DB and unbalanced versions of AffectNet [35].
Label distribution learning reduces annotation ambiguity and improves performance across multiple datasets [36]. Self-supervised learning systems disentangle identity and expression properties and achieve similar accuracy gains on in-the-wild datasets [37]. Graph attention networks encode spatial-temporal relations and achieve competitive results on FER datasets [38]. Lightweight attention architectures address computational constraints while remaining robust to pose variation [39]. Simplified cross-attention networks reach state-of-the-art results with lower implementation complexity [40]. Hybrid attention mechanisms further improve feature discriminability by combining local and global attention cues [41], [42].

Comparative evaluations on benchmark datasets remain essential. CNN-based FER studies confirm strong performance on FER2013 and CK+ under real-world variability, providing practical baselines for research and deployment [43], [44].

3. Experiment

A. Dataset and Preprocessing

Experiments are conducted on a filtered multi-dataset corpus of facial expressions, created by combining CK+, FER2013, and other publicly available datasets of labelled facial-emotion images. An automated quality filtering process that removes blurred, low-resolution, and ambiguous images reduces a total of 28,272 images to 16,828. Six emotion categories are used in this study: anger, disgust, fear, happiness, sadness, and surprise. Neutral expressions are not included. All images are resized to 224×224 in RGB format and normalized using ImageNet statistics. During training, horizontal flipping, ±15° rotation, and brightness variation are applied as data augmentation. The dataset is divided into 80% training data and 20% test data. To evaluate performance stability, four-fold cross-validation (k = 4) is used with a constant random seed.

B. Feature Extraction and Classification

ResNet18 and MobileNetV2 are used as frozen feature extractors with pre-trained ImageNet weights; all layers are kept frozen to avoid any bias introduced by fine-tuning. This allows a fair comparison of the intrinsic representational ability of the two architectures under the same experimental conditions. The resulting deep features therefore reflect general-purpose visual representations rather than task-specific adaptation.

The architectures of the two backbones are depicted in Figure 1 and Figure 2 to illustrate their structural differences. Figure 1 shows the ResNet18 architecture, which is characterized by residual connections that let information bypass intermediate layers. These skip connections preserve gradient flow during training and allow the network to learn deeper, more discriminative feature representations. Consequently, ResNet18 is well suited to capturing subtle differences in facial expressions, which is essential for distinguishing closely related emotional states.

Figure 2 shows the MobileNetV2 architecture, which is designed for computational efficiency. It uses depthwise separable convolutions and inverted residual blocks to greatly reduce the number of parameters and the computational cost. Although this design makes MobileNetV2 suitable for real-time and resource-constrained scenarios, it can limit the capacity to represent fine-grained facial features compared with deeper residual networks. The comparison of the two architectures highlights the trade-off between representational power and computational efficiency, which is also visible in the experimental results.

Feature embeddings are extracted from the global pooling layer of each network.
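To make the pooling step concrete, here is a minimal sketch of global average pooling, which turns the last convolutional feature map into a fixed-length embedding. The paper gives no extraction code, so `global_average_pool` and the toy `fmap` are illustrative names only, and plain nested lists stand in for framework tensors. In the standard definitions of these networks, a 224×224 input yields a 512-channel final map for ResNet18 and a 1280-channel map for MobileNetV2, so the embeddings would have 512 and 1280 dimensions respectively.

```python
def global_average_pool(feature_map):
    """Collapse a C x H x W feature map (nested lists) into a C-dim embedding.

    Each channel is reduced to the mean of its spatial activations, so the
    embedding length equals the number of channels regardless of H and W.
    """
    embedding = []
    for channel in feature_map:
        values = [v for row in channel for v in row]  # flatten H x W
        embedding.append(sum(values) / len(values))
    return embedding


# Toy feature map with 2 channels of size 2 x 2 (hypothetical values).
fmap = [
    [[1.0, 2.0], [3.0, 4.0]],  # channel 0
    [[0.0, 0.0], [0.0, 8.0]],  # channel 1
]
print(global_average_pool(fmap))  # [2.5, 2.0]
```

Because the backbone is frozen, this vector is computed once per image and can be cached, which keeps the later classifier comparison cheap.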
Classification is performed using Support Vector Machine, Random Forest, XGBoost, K-Nearest Neighbors, Gaussian Naive Bayes, and Decision Tree models trained on the extracted features. Among all classifiers, Random Forest achieves the best performance, with the highest test accuracy of 70.34% using ResNet18 embeddings, and ResNet18 consistently outperforms MobileNetV2 across classifiers.

C. Optimization and Training Setup

The following optimizers are evaluated under identical conditions: SGD, SGD with momentum, Nesterov, Adagrad, RMSProp, Adadelta, Adam, Nadam, AMSGrad, and AdamW. Training uses weighted cross-entropy loss with ReduceLROnPlateau scheduling. Based on validation performance and convergence behavior, AdamW is used for final reporting. Although the CNN backbones are frozen, the optimizers are compared on training the classifier head and on the stability of loss convergence. All experiments are conducted in TensorFlow under a CPU-only, deterministic execution setup.

Results

Table 1 compares the models considered in this work. The proposed hybrid scheme, which combines ResNet18 embeddings with a Random Forest classifier, has the highest accuracy of 70.34% and outperforms both the classical and the deep-learning baselines. Although deep CNN-based models such as VGG, ResNet50, and Inception V3 perform competitively, their accuracy is slightly lower under the same conditions. This underscores the effectiveness of representation-based learning combined with robust classical classifiers. Fig. 3 shows how accuracy varies across models. There is a clear performance gap between lightweight architectures and deeper residual networks.
Specifically, ResNet18 consistently produces more discriminative embeddings than MobileNetV2, which results in a better classification score with every classifier considered. Fig. 5 presents the confusion matrices of both backbone models. ResNet18 separates classes more successfully, with fewer misclassifications between visually similar emotions such as fear and surprise. MobileNetV2, by contrast, shows more confusion between these classes, which suggests a weakness in recognizing subtle facial changes.

Table 1. Model benchmark

Model Name             Accuracy (%)
ARM (ResNet)           69.5
VGG + SMI              69.1
Attentional ConvNet    68.8
CNN + SVM              68.2
MobileNetV2 + RF       67.49
VGG16                  67.05
GoogLeNet              66.84
ResNet50               66.37
Inception V3           66.1
Bag of Words           63.5
Custom CNN             55.44
ResNet18 + RF          70.34

The consistency of the proposed approach is also supported by cross-validation. Performance varies little across folds, as illustrated in Fig. 4, which confirms the robustness of the learned representations to different data splits. This indicates that the model generalizes and does not rely heavily on a particular training subset.

Table 2 shows the validation accuracy obtained with the various optimization strategies. Of all optimizers considered, AdamW is the most accurate (70.94%) and the most stable. Although adaptive optimizers such as Adam and RMSProp converge quickly, they show slightly more variance in validation performance. SGD-based methods, by contrast, converge more slowly and reach lower accuracy.

Table 2. Validation accuracy of optimizers

Optimizer          Best Validation Accuracy (%)
SGD                68.86
SGD + Momentum     69.18
NAG                68.62
Adagrad            68.32
RMSprop            69.96
Adadelta           69.18
Adam               69.87
AdamW              70.94
Nadam              70.61
AMSGrad            70.73
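The practical difference between Adam with an L2 penalty and AdamW's decoupled weight decay can be illustrated with a single scalar update. This is a minimal sketch, not the authors' training code: the function name `adam_step` and all hyperparameter values (`lr`, `beta1`, `beta2`, `eps`, the penalty strength 0.01) are assumptions chosen for illustration, since the paper does not report them.

```python
import math


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, l2=0.0, weight_decay=0.0):
    """One update of a single scalar weight (illustrative values only).

    l2 > 0            -> Adam with an L2 penalty folded into the gradient.
    weight_decay > 0  -> AdamW-style decay applied to the weight directly.
    """
    grad = grad + l2 * w                       # coupled penalty (Adam + L2)
    m = beta1 * m + (1 - beta1) * grad         # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    adaptive_step = lr * m_hat / (math.sqrt(v_hat) + eps)
    decay_step = lr * weight_decay * w         # decoupled: not rescaled by v_hat
    return w - adaptive_step - decay_step, m, v


# With a zero loss gradient, only the regulariser moves the weight.
w_l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, l2=0.01)
w_adamw, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, weight_decay=0.01)
print(w_l2, w_adamw)
```

In the coupled case the penalty passes through the adaptive normalisation, so the first step is close to `lr` whatever the penalty strength; in the decoupled case the shrinkage is `lr * weight_decay * w`, proportional to the weight itself. This is one commonly cited reason why AdamW regularises more predictably than Adam with L2.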
Overall, the findings indicate that the quality of feature representation and dataset curation matter more for FER performance than architectural complexity itself. Curated multi-dataset learning with high-quality embeddings and classical classifiers offers a dependable and stable system for facial expression recognition in noisy and heterogeneous settings.

Optimizer Comparison

Optimizer performance is evaluated under identical training conditions using weighted cross-entropy loss and ReduceLROnPlateau scheduling. Among the tested optimizers (SGD, SGD with momentum, Nesterov, Adagrad, RMSProp, Adadelta, Adam, Nadam, AMSGrad, and AdamW), AdamW consistently achieves the highest test accuracy and the most stable convergence behavior across folds. SGD-based optimizers show slower convergence and higher variance. Adam and RMSProp converge faster than SGD but exhibit less stable validation performance than AdamW. Nadam and AMSGrad provide no consistent improvement over Adam. Based on accuracy stability and convergence consistency, AdamW is selected for all final experiments.

Backbone Comparison

Backbone evaluation is performed using frozen feature extractors under identical data splits and classifiers. ResNet18 consistently outperforms MobileNetV2 across all evaluated classifiers and cross-validation folds. The highest overall accuracy of 70.34% is obtained using ResNet18 embeddings with a Random Forest classifier, while MobileNetV2 achieves lower peak accuracy under the same settings. ResNet18 embeddings exhibit improved class separability and reduced confusion between closely related emotion classes. The results indicate that deeper residual representations provide higher discriminative capacity than lightweight architectures for multi-dataset FER under frozen-feature constraints.

Conclusion

This work presents a high-fidelity facial expression representation framework evaluated under controlled multi-dataset conditions.
Automated data curation reduces noise and improves representation reliability. Frozen CNN feature extraction enables fair backbone comparison without fine-tuning bias. ResNet18 produces more discriminative embeddings than MobileNetV2 across classifiers. Random Forest gives the best classification results of all considered models. AdamW has the most consistent and stable behaviour of optimisation. The findings indicate that the quality of representation and integrity of data are the major factors to drive FER performance in limited training environments. Declarations Ethics approval and consent to participate Not applicable. Consent for publication The author confirms that the image presented in Figure 2 is of the author themself and provides consent for its publication. Availability of data and materials The datasets analyzed in this study are publicly available facial expression recognition datasets including CK+ and FER2013. These datasets are accessible from their respective official repositories subject to their licensing terms. The curated dataset generated through automated quality filtering is derived from these publicly available datasets. Competing interests The authors declare that they have no competing interests. Funding The authors received no specific funding for this work. Authors' contributions H.V. conceptualized the study, designed the methodology, conducted the experiments, analyzed the results, and wrote the main manuscript text. P.V. supervised the research work and contributed to conceptual guidance and manuscript review. N.K. co-supervised the research and contributed to methodological refinement and technical validation. Acknowledgements The authors acknowledge the use of publicly available facial expression datasets used in this research. References Bartlett MS, Littlewort G, Fasel I, Movellan JR. Real-Time Face Detection and Facial Expression Recognition: Development and Applications to Human–Computer Interaction, Proc. 
IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops , pp. 53–60, 2003. Abdat F, Maaoui C, Pruski A. Human–computer interaction using emotion recognition from facial expression, Proc. UKSim 5th European Modelling Symposium on Computer Modelling and Simulation , pp. 196–201, 2011. Fasel B, Luettin J. Automatic facial expression analysis: A survey. Pattern Recogn. 2003;36(1):259–75. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Commun ACM. 2017;60(6):84–90. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition, Proc. International Conference on Learning Representations (ICLR) , 2015. Li S, Deng W. Deep facial expression recognition: A survey. IEEE Trans Affect Comput. 2020;11(1):3–18. Lucey P, Cohn JF, Kanade T, Saragih J, Ambadar Z, Matthews I. The extended Cohn–Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression, Proc. IEEE CVPR Workshops , pp. 94–101, 2010. Goodfellow IJ, et al. Challenges in representation learning: A report on three machine learning contests. Neural Netw. 2015;64:59–63. Frenay F, Verleysen M. Classification in the presence of label noise: A survey. Pattern Recogn. 2014;45(9):1–28. Patrini G, Rozza A, Menon AK, Nock R, Qu L. Making deep neural networks robust to label noise: A loss correction approach, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pp. 1944–1952, 2017. Cohn JF, De la Torre F. Automated face analysis for affective computing. in The Oxford Handbook of Affective Computing. Oxford University Press; 2014. pp. 131–50. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pp. 770–778, 2016. Choi M, Lee J, Kim S. Facial expression recognition using deep features and SVM classifiers. Sensors, 21, 9, Article 3179, 2021. Hasani B, Mahoor MH. 
Facial expression recognition using enhanced deep 3D convolutional neural networks, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pp. 30–40, 2017. Wang K, Peng X, Yang J, Lu S, Qiao Y. Suppressing uncertainties for large-scale facial expression recognition, arXiv preprint arXiv :2007. 03149 , 2020. Oguine OC, Oguine KJ, Bisallah HI, Ofuani D. Hybrid facial expression recognition (FER2013) model for real-time emotion classification, arXiv preprint arXiv:2203.08901 , 2022. Sajjad M, et al. A comprehensive survey on deep facial expression recognition: Challenges, applications and future guidelines. IEEE Access. 2023;11:11245–70. Li S, Deng W, Du J. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. Neural Netw. 2016;73:70–81. Reghunathan RK. Facial expression recognition using pre-trained ensemble models for FER2013. Sensors, 24, 3, Article 812, 2024. Kopalidis T. Advances in facial expression recognition: Methods, benchmarks, models, and datasets, Information , vol. 15, no. 2, Article 98, 2024. Liang C, Dong J. A survey of deep learning-based facial expression recognition research. Front Comput Intell Syst. 2025;4(1):1–20. Agung ES et al. Image-based facial emotion recognition using deep learning with extended emotion categories. Sci Rep, 14, Article 6123, 2024. Ullah S. Facial expression recognition survey: Edge computing deep learning approaches, PeerJ Computer Science , vol. 10, e1913, 2024. Hasani B, Mahoor MH. Facial expression recognition using enhanced deep 3D convolutional neural networks. IEEE Trans Affect Comput. 2020;11(4):653–66. Zhang Z, Zhao X, Li Y. A survey on facial expression recognition of static and dynamic emotions. IEEE Access. 2023;11:32415–38. Wang K, Peng X, Yang J, Lu S, Qiao Y. ReSup: Reliable label noise suppression for facial expression recognition, arXiv preprint arXiv:2305.17895 , 2023. Li J, Zhang Y, Liu Z. 
Learning diversified feature representations for facial expression recognition in the wild, arXiv preprint arXiv:2210.09381 , 2022. Kollias D, Zafeiriou S, Tefas A. Recognizing facial expressions in the wild using multi-architectural ensembles with distillation, arXiv preprint arXiv:2106.16126 , 2021. Li S, Deng W, Du J. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. Neural Netw. 2016;73:70–81. Ben A, Ammar, et al. Introducing a novel dataset for facial emotion recognition. Heliyon. 2024;10(2):e21345. Khan MS, Hussain J, Aboalsamh H, Bebis G. Attention-based facial expression recognition: A survey. IEEE Access. 2021;9:165002–28. Chen Y, et al. Real-time facial expression recognition: Advances and comparative analysis. Int J Comput Vision. 2023;131(4):987–1006. Li H et al. Deep learning-based facial emotion recognition on AffectNet-HQ. Int J Pattern recognit Artif Intell, 38, 5, 2024. Sharma R, Singh P. Integrating CNN and RNN-LSTM for facial expression recognition, Proc. International Conference on Computer Vision Theory and Applications , pp. 214–221, 2024. Zhou G, Xie Y, Fu Y, Wang Z. Multi-loss based feature fusion and top-two voting ensemble decision strategy for facial expression recognition in the wild, arXiv preprint arXiv:2311.03478 , 2023. Liu S, Xu Y, Wan T, Kui X. Ada-DF: An adaptive label distribution fusion network for facial expression recognition. Inf Sci. 2024;677:1–14. He R, Xing Z, Tan W, Yan B. A generative framework for self-supervised facial representation learning, arXiv preprint arXiv:2309.08273 , 2023. Li Y, Zhao Y, Xia X, Jiang D. ARPGNet: Appearance- and relation-aware parallel graph attention fusion network for facial expression recognition, arXiv preprint arXiv:2511.22188 , 2025. Ezati A, Dezyani M, Rana R, Rajabi R, Ayatollahi A. 
A lightweight attention-based deep network via multi-scale feature fusion for multi-view facial expression recognition, arXiv preprint arXiv:2403.14318, 2024. Mao J, Xu R, Yin X, Chang Y, Nie B, Huang A. POSTER++: A simpler and stronger facial expression recognition network, arXiv preprint arXiv:2301.12149, 2023. Khan MS, Hussain J, Aboalsamh H, Bebis G. Attention-based facial expression recognition: A survey. IEEE Access. 2021;9:165002–28. Song D, Liu C. A facial expression recognition network using hybrid feature extraction. PLoS ONE, 19, 1, e0280213, 2024. Gailan MJ. Performance evaluation of facial emotion recognition through convolutional neural network model. Int J Eng Res Technol (IJERT). 2025;14(3):1–6. Gan Y. Facial expression recognition using convolutional neural networks, Proc. 2nd International Conference on Intelligent Computing and Applications, pp. 1–6, 2018. Additional Declarations No competing interests reported. Status: Under Review Version 1 posted Reviewers agreed at journal 01 May, 2026 Reviewers agreed at journal 24 Apr, 2026 Reviewers agreed at journal 24 Apr, 2026 Reviewers invited by journal 24 Apr, 2026 Editor assigned by journal 15 Apr, 2026 Submission checks completed at journal 14 Apr, 2026 First submitted to journal 14 Apr, 2026 
© Research Square 2026 | ISSN 2693-5015 (online) {"props":{"pageProps":{"initialData":{"identity":"rs-9295088","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":634274533,"identity":"bc9177c4-4f17-48c2-ba35-7ce26c27a19e","order_by":0,"name":"Himanshu Verma","email":"","orcid":"","institution":"Manipal University Jaipur","correspondingAuthor":false,"prefix":"","firstName":"Himanshu","middleName":"","lastName":"Verma","suffix":""},{"id":634274534,"identity":"3cb2d87e-12fe-4bbc-b395-287e263694eb","order_by":1,"name":"Nimish Kumar","email":"","orcid":"","institution":"B K Birla Institute of Engineering \u0026 Technology","correspondingAuthor":false,"prefix":"","firstName":"Nimish","middleName":"","lastName":"Kumar","suffix":""},{"id":634274535,"identity":"9398efe6-1f68-4723-a802-d4e9fd259a51","order_by":2,"name":"Pankaj Vyas","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABR0lEQVRIie2RP0vDUBDALwSTJW3XC9boR0h5UFPwwyR0cAlBCIiDlEIgU3R2UPwKGR1TAs0SmzWhi13qkkBBkIp/8KWhgiQ6C77f8Lh3937c8Q6Awfib8ADSJuAeEHCbpYEI28qPCq9+V/jflAoB64UG5VC8D9V1Fww/ipZn2rlmqfNgx3+506yOQ0ct3Joy8Kyh4UlUic1+ilO01ZkuZBcx2hgC37upK2pgkkCSgNBASFFA6oKQtlw0xiEIu60GJcnJ5K1UksflCX5USvZOldsQxNcmJTXJkHZR1FTvg+xWyrzs4tMufIMyuMoJ6UqoyGlOUL5EW445d77not0LOUe+ntV/rGMSufCOpHZyvHjC55HVjvlpVrgjS4mcySo/rQ9WHpyHcBDQRdDF6PQ6LpNfQaMCa4B9WuZWm5cVetNzBoPB+Jd8ApbMacKtmtR9AAAAAElFTkSuQmCC","orcid":"","institution":"Manipal University Jaipur","correspondingAuthor":true,"prefix":"","firstName":"Pankaj","middleName":"","lastName":"Vyas","suffix":""}],"badges":[],"createdAt":"2026-04-01 
18:08:45","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-9295088/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-9295088/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":108978619,"identity":"54380afa-13fe-4b76-b11a-a38e77791947","added_by":"auto","created_at":"2026-05-11 11:46:56","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":114181,"visible":true,"origin":"","legend":"\u003cp\u003eArchitecture of the ResNet-based convolutional backbone used for feature extraction, illustrating the residual block structure and hierarchical feature learning.\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-9295088/v1/52c2c9229e807c805dfdbd7d.png"},{"id":108982154,"identity":"4baf279f-2dd3-4264-8fc9-c8408f04a079","added_by":"auto","created_at":"2026-05-11 12:23:18","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":168521,"visible":true,"origin":"","legend":"\u003cp\u003eArchitecture of the MobileNet-based convolutional backbone employing depthwise separable convolutions for lightweight feature extraction.\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-9295088/v1/d6de8cdce4d542ae67bece00.png"},{"id":108978620,"identity":"54b7e701-76d5-400a-9c15-b60d659909f5","added_by":"auto","created_at":"2026-05-11 11:46:57","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":198145,"visible":true,"origin":"","legend":"\u003cp\u003eAccuracy comparison of models\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-9295088/v1/7fdb12eff27de33f40934e15.png"},{"id":108979480,"identity":"c4d8f11a-5a93-4414-8970-e411cefd885d","added_by":"auto","created_at":"2026-05-11 
11:59:18","extension":"jpeg","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":255328,"visible":true,"origin":"","legend":"\u003cp\u003eCross Validation (k = 4) of MobileNetV2+RF and ResNet18+RF\u003c/p\u003e","description":"","filename":"floatimage4.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-9295088/v1/05d47c740e628865776f7fce.jpeg"},{"id":108978613,"identity":"6624b1f1-672f-438d-94ee-475a0544ae9d","added_by":"auto","created_at":"2026-05-11 11:46:40","extension":"jpeg","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":208522,"visible":true,"origin":"","legend":"\u003cp\u003eConfusion matrices of MobileNetV2 (left) and ResNet18 (right)\u003c/p\u003e","description":"","filename":"floatimage5.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-9295088/v1/51b94ba54aa847f5e1fcdc43.jpeg"},{"id":108984305,"identity":"c201c93e-9d45-4dca-8574-4433adc338df","added_by":"auto","created_at":"2026-05-11 12:39:40","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1038276,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-9295088/v1/1f4f4688-1be5-43ba-bbe5-39d118da518e.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"A High-Fidelity Facial Expression Representation Framework Using Curated Multi- Dataset Learning","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eFacial expressions provide observable signals of human affective and behavioural states. Automatic FER can therefore be regarded as one of the central tasks of affective computing and behavioural analysis [1], [2]. In practice, FER faces multiple challenges, such as large variations in facial appearance, varying illumination and pose, and inconsistent image quality [3]. 
Deep learning, and CNNs in particular, has significantly improved FER performance over hand-crafted feature-based methods [4], [5], [17]. This improvement is due to the capacity of CNNs to learn hierarchical facial representations that are, to some extent, invariant to changes in appearance. Nevertheless, FER models trained on publicly accessible data tend to generalize poorly and give unstable results across datasets [6]. One contributing factor is dataset quality. FER datasets vary in acquisition conditions, resolution, annotation protocols, and class distributions [22]. Controlled datasets such as CK+ offer high image quality but contain exaggerated expressions [7], while datasets such as FER2013 reflect realistic conditions but suffer from low resolution, label noise, and ambiguous expressions [8]. Training on such data propagates noise into the learned representations. Prior studies indicate that label noise and low-quality samples severely degrade classification accuracy and feature discriminability [9], [10]. Combining multiple datasets can increase variety and reduce dataset-related bias. However, naive dataset aggregation introduces annotation inconsistencies and visual artefacts [11]. This indicates that dataset curation and automated quality control are required in FER systems. Beyond classification accuracy, representation quality also matters. High-fidelity facial embeddings stabilize downstream tasks and improve separability between classes. Deep hierarchical representations obtained by residual network architectures are known to be effective at capturing subtle differences in faces [12]. 
Hybrid systems that pair deep feature extractors with classical classifiers also tend to outperform end-to-end training, typically when the dataset is moderate in size or partially noisy [13], [14].\u003c/p\u003e\n\u003cp\u003eThis work is motivated by the limitations of current FER systems and aims to learn robust facial expression representations from carefully curated multi-dataset data. Multiple datasets (FER2013, Humans Face Emotions, and CK+) are merged to increase variety, and the merged set is screened by an automated pipeline that filters out incorrect or mislabelled samples. Deep embeddings obtained from standard CNN backbones are evaluated systematically with a range of machine-learning classifiers. The paper shows that dataset curation combined with representation-oriented learning yields consistent and discriminative facial features that are robust to dataset shift and annotation noise, and that performance is driven by data quality rather than architectural complexity.\u003c/p\u003e"},{"header":"2. Related Work","content":"\u003cp\u003eEarly FER models were based on geometric features and texture descriptors such as Local Binary Patterns and Gabor filters. These methods achieved moderate performance under controlled conditions but deteriorated markedly under pose changes, illumination variation, and occlusion [11], [12]. Their performance was highly sensitive to face alignment and to manually tuned parameters. The introduction of deep CNNs marked an important turn in FER research. CNNs learn hierarchical representations directly from pixel data, reducing the reliance on hand-designed features [4], [13]. VGG-style networks extracted effective features through stacked convolutions and achieved strong performance on FER2013 [5], [14], [24]. 
Residual networks further increased representational depth by addressing the vanishing-gradient problem with identity mappings [12].\u003c/p\u003e\n\u003cp\u003eDataset characteristics are major determinants of FER performance. The CK+ dataset offers high-resolution, labelled images, but its emotions are posed and exaggerated [7]. Models trained only on CK+ therefore tend to fail when applied to unconstrained settings. FER2013 introduced in-the-wild facial images at large scale, but its resolution is very low and its annotations are noisy, which limits achievable accuracy; human performance on it is reported as 65-68% [8]. On FER2013, single-network CNN models usually plateau between 70% and 73% without auxiliary information [14], [19]. According to a number of studies, dataset noise has a direct impact on the learned decision boundaries. Label noise reduces classification accuracy by 5-10% or more when left unmitigated [9]. Analysis of FER-specific data shows that mislabelled and ambiguous samples distort class separability and embedding geometry [10]. Noise-aware learning and sample filtering approaches improve robustness without full manual relabelling [9], [21], [15].\u003c/p\u003e\n\u003cp\u003eMulti-dataset learning has been investigated as a way to increase expression diversity and reduce dataset bias. However, naive dataset merging introduces annotation inconsistencies and duplication of samples [11]. Cross-dataset tests show accuracy drops of over 15% when models are tested outside their training distribution [18]. These results highlight that dataset harmonization and quality control should precede representation learning. Recent advancements in FER methods, benchmarks, and datasets further emphasize the importance of standardized evaluation protocols [20]. 
Automated data curation is a relatively recent development. Laplacian variance-based blur detection is effective at removing low-quality facial images and improves CNN training stability [32]. Classifier-confidence-based filtering detects samples that repeatedly receive low confidence across models [21]. Such strategies have been shown to improve downstream accuracy by 2–5% in noisy datasets [9], [10]. The choice of model architecture also affects representation fidelity. ResNet18 balances depth and computational efficiency while preserving fine-grained facial features [12]. MobileNetV2 reduces model size and inference cost but can lose discriminative power, especially for subtle expressions [37]. According to comparative studies, ResNet18 and MobileNetV2 differ in accuracy by 2 to 4% under the same training conditions [28].\u003c/p\u003e\n\u003cp\u003eHybrid FER pipelines combining deep embeddings with classical machine-learning classifiers have demonstrated competitive performance. Gradient-boosting classifiers, Random Forests, and Support Vector Machines usually outperform softmax classifiers when trained on fixed deep features [13], [23]. Ensemble approaches, specifically Random Forests, are robust to residual noise and can model nonlinear decision boundaries on moderate-sized data [18], [33]. Reported gains of 1–3% over end-to-end CNN classifiers are common in hybrid settings [23], [34]. Optimization strategy further influences representation quality and convergence behavior. Adaptive optimizers such as RMSProp and Adam converge faster but can generalize unstably [26]. Decoupled weight-decay methods such as AdamW regularize models more effectively and give more predictable validation performance [6], [27]. 
FER studies report that AdamW improves accuracy by an average of 1-2% over standard Adam [6]. Recent efforts have extended FER research to handle dynamic as well as static emotion representations. A systematic review categorizes methods as image-based or sequence-based and identifies challenges related to expression ambiguity, intensity variation, cross-dataset inconsistency, and temporal uncertainty [25]. To address annotation noise, ReSup learns from clean and noisy labels concurrently using agreement between two networks, and demonstrates measurable accuracy gains on FERPlus [26].\u003c/p\u003e\n\u003cp\u003eRepresentation diversity has also been shown to make FER more robust. Models that learn diversified global and local feature cues achieve high accuracy on RAF-DB, FER+, and AffectNet, demonstrating the relevance of deep embeddings in unconstrained conditions [27]. Ensemble learning also improves robustness. Multi-architectural ensemble models using distillation are competitive at a lower computational cost [28]. Dataset construction remains a critical issue in FER evaluation. RAF-DB provides large-scale annotations of both basic and compound emotions across varied demographic conditions [29]. Refinement work, such as relabelling and category expansion of FER datasets, has been shown to improve performance on benchmark tasks [30]. Attention-based architectural improvements have also been proposed. 
Convolutional block attention modules enhance discriminative feature learning, and improved accuracy is reported on common benchmarks [31]. Comparative studies on real-time FER highlight trade-offs between accuracy and computational efficiency, favouring lightweight models for low-latency deployment [32].\u003c/p\u003e\n\u003cp\u003eAffectNet-HQ, a large-scale dataset, enables additional benchmarking, with deep CNN models reaching over 85% recognition accuracy [33]. Recognition of expression dynamics is improved by temporal modeling methods that integrate CNNs with RNN-LSTM networks, which underlines the importance of temporal context [34]. Multi-loss and ensemble strategies are also used to make FER more robust. Feature fusion models and voting-based ensembles outperform single-model baselines on both RAF-DB and imbalanced versions of AffectNet [35]. Label distribution learning reduces annotation ambiguity and improves performance on multiple datasets [36]. Self-supervised learning systems disentangle identity and expression properties and achieve similar accuracy gains on in-the-wild datasets [37]. Graph attention networks encode spatial-temporal relationships and achieve competitive results across FER datasets [38].\u003c/p\u003e\n\u003cp\u003eLightweight attention architectures address computational limitations while remaining robust to pose variation [39]. Simplified cross-attention networks achieve state-of-the-art results with lower implementation complexity [40]. Hybrid attention mechanisms further improve feature discriminability by combining local and global attention cues [41], [42]. Comparative evaluations on benchmark datasets remain essential. 
CNN-based FER studies confirm strong performance on FER2013 and CK+ under real-world variability, providing practical baselines for research and deployment [43], [44].\u003c/p\u003e"},{"header":"3. Experiment","content":"\u003cp\u003e\u003cstrong\u003eA. Dataset and Preprocessing\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eExperiments are conducted on a filtered multi-dataset corpus of facial expressions, created by combining CK+, FER2013, and other publicly available datasets of labelled images of human facial emotions. An initial pool of 28,272 images is reduced to 16,828 by an automated quality filtering process that removes blurred, low-resolution, and ambiguous images. Six emotion categories are used in this study: anger, disgust, fear, happiness, sadness, and surprise. Neutral expressions are not included. All images are resized to 224\u0026times;224 in RGB format and normalized using ImageNet statistics. During training, horizontal flipping, \u0026plusmn;15\u0026deg; rotation, and brightness variation are applied as data augmentation. The dataset is divided into 80% training data and 20% test data. To evaluate performance stability, four-fold cross-validation (k = 4) is used with a constant random seed.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eB. Feature Extraction and Classification\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eResNet18 and MobileNetV2 are used as feature extractors with pre-trained ImageNet weights, and all layers are kept frozen to avoid any bias introduced by fine-tuning. This ensures a fair comparison of the intrinsic representational abilities of the two architectures under the same experimental conditions. As a result, the extracted deep features reflect general-purpose visual representations rather than task-specific adaptation. 
The architectures of the two backbones are depicted in Figure 1 and Figure 2 to clarify the structural differences between them. Figure 1 shows the ResNet18 architecture, which is distinguished by residual connections that let information bypass intermediate layers. These skip connections help preserve gradient flow during training and allow the network to learn deeper, more discriminative feature representations. Consequently, ResNet18 is well suited to capturing subtle differences in facial expressions, which is essential for distinguishing closely related emotional states.\u003c/p\u003e\n\u003cp\u003eFigure 2, in turn, shows the MobileNetV2 architecture, which is designed for computational efficiency. It uses depthwise separable convolutions and inverted residual blocks to greatly reduce the number of parameters and the computational cost. Although this design makes MobileNetV2 suitable for real-time and resource-constrained scenarios, it can limit the capacity to represent fine-grained facial features compared with deeper residual networks. Comparing the two architectures brings out the trade-off between representational power and computational efficiency, which is also seen in the experimental results.\u003c/p\u003e\n\u003cp\u003eFeature embeddings are extracted from the global pooling layer of each network. Classification is performed using Support Vector Machine, Random Forest, XGBoost, K-Nearest Neighbors, Gaussian Naive Bayes, and Decision Tree models trained on the extracted features. Among all classifiers, Random Forest achieves the best performance, with the highest test accuracy of 70.34% using ResNet18 embeddings, and ResNet18 consistently outperforms MobileNetV2 across classifiers.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eC. 
Optimization and Training Setup\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe following optimizers are evaluated under identical conditions: SGD, SGD with momentum, Nesterov, Adagrad, RMSProp, Adadelta, Adam, Nadam, AMSGrad, and AdamW. Training is conducted using weighted cross-entropy loss with ReduceLROnPlateau scheduling. Based on validation performance and convergence behavior, AdamW is used for final reporting. Although the CNN backbones are frozen, the optimizers are compared on training the classifier head and on the stability of loss convergence. All experiments are conducted in TensorFlow under a CPU-only, deterministic execution setup.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eResults\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTable 1 shows the benchmark comparison of the models considered in this work. The proposed hybrid scheme, which pairs ResNet18 embeddings with a Random Forest classifier, has the highest accuracy of 70.34% and therefore outperforms both the classical and the deep learning baselines. Although some deep CNN-based models such as VGG, ResNet50, and Inception V3 also perform competitively, their accuracy is slightly lower under the same conditions. This underscores the effectiveness of representation-based learning in combination with sound classical classifiers. Fig. 3 compares accuracy across models. There is a clear performance gap between lightweight architectures and deeper residual networks. Specifically, ResNet18 consistently generates more discriminative embeddings than MobileNetV2, which results in a better classification score for all the considered classifiers. Fig. 5 presents the confusion matrices of both backbone models. 
ResNet18 achieves better class separation, with fewer misclassifications between visually similar emotions such as fear and surprise.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 1.\u003c/strong\u003e Model benchmark\u003c/p\u003e\n\u003ctable border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"100%\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eModel Name\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eAccuracy (%)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eARM (ResNet)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e69.5\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eVGG + SMI\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e69.1\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eAttentional ConvNet\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e68.8\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eCNN + SVM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e68.2\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eMobileNetV2 + RF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e67.49\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eVGG16\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e67.05\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eGoogLeNet\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e66.84\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eResNet50\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd\u003e\n \u003cp\u003e66.37\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eInception V3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e66.1\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eBag of Words\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e63.5\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eCustom CNN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e55.44\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eResNet18 + RF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003e70.34\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eMobileNetV2, by contrast, shows comparatively more confusion between visually similar classes such as fear and surprise, which suggests a weakness in recognizing subtle facial changes. The consistency of the proposed approach is also supported by cross-validation.\u003c/p\u003e\n\u003cp\u003ePerformance is similar across all folds, with low variance, as illustrated in Fig. 4, which confirms the robustness of the learned representations to varying data splits. This shows that the model generalizes well and is not highly reliant on a particular training subset.\u003c/p\u003e\n\u003cp\u003eTable 2 shows the validation accuracy obtained with the various optimizers. Of all optimizers considered, AdamW is the most accurate (70.94%) and the most stable. Adaptive optimizers such as Adam and RMSProp converge quickly but show slightly more variance in validation performance.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 2\u003c/strong\u003e. 
Validation accuracy of optimizers\u003c/p\u003e\n\u003ctable border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"100%\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eOptimizer\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eBest Validation Accuracy (%)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eSGD\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e68.86\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eSGD + Momentum\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e69.18\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eNAG\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e68.62\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eAdagrad\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e68.32\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eRMSprop\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e69.96\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eAdadelta\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e69.18\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eAdam\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e69.87\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eAdamW\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003e70.94\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eNadam\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e70.61\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eAMSGrad\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e70.73\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eBy contrast, SGD-based methods converge more slowly and reach lower overall accuracy. Overall, the findings indicate that the quality of feature representation and dataset curation matter more for FER performance than architectural complexity itself. Curated multi-dataset learning with high-quality learned embeddings and classical classifiers offers a dependable and stable system for facial expression recognition in noisy and heterogeneous settings.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eOptimizer Comparison\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eOptimizer performance is evaluated under identical training conditions using weighted cross-entropy loss and ReduceLROnPlateau scheduling. Among the tested optimizers (SGD, SGD with momentum, Nesterov, Adagrad, RMSProp, Adadelta, Adam, Nadam, AMSGrad, and AdamW), AdamW consistently achieves the highest test accuracy and the most stable convergence behavior across folds. SGD-based optimizers show slower convergence and higher variance. Adam and RMSProp converge faster than SGD but exhibit less stable validation performance than AdamW. Nadam and AMSGrad provide no consistent improvement over Adam. Based on accuracy stability and convergence consistency, AdamW is selected for all final experiments.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eBackbone Comparison\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eBackbone evaluation is performed using frozen feature extractors under identical data splits and classifiers. ResNet18 consistently outperforms MobileNetV2 across all evaluated classifiers and cross-validation folds. 
The highest overall accuracy of 70.34% is obtained using ResNet18 embeddings with a Random Forest classifier, while MobileNetV2 achieves lower peak accuracy under the same settings. ResNet18 embeddings exhibit improved class separability and reduced confusion between closely related emotion classes. Results indicate that deeper residual representations provide higher discriminative capacity than lightweight architectures for multi-dataset FER under frozen-feature constraints.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThis work presents a high-fidelity facial expression representation framework evaluated under controlled multi-dataset conditions. Automated data curation reduces noise and improves representation reliability. Frozen CNN feature extraction enables fair backbone comparison without fine-tuning bias. ResNet18 produces more discriminative embeddings than MobileNetV2 across classifiers. Random Forest gives the best classification results of all the classifiers considered. AdamW shows the most consistent and stable optimization behavior. The findings indicate that representation quality and data integrity are the major factors driving FER performance in limited training environments.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cbr\u003e\u003cstrong\u003eConsent for publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe author confirms that the image presented in Figure 2 is of the author themself and provides consent for its publication.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe datasets analyzed in this study are publicly available facial expression recognition datasets including CK+ and FER2013. 
These datasets are accessible from their respective official repositories subject to their licensing terms. The curated dataset generated through automated quality filtering is derived from these publicly available datasets.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that they have no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors received no specific funding for this work.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors\u0026apos; contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eH.V. conceptualized the study, designed the methodology, conducted the experiments, analyzed the results, and wrote the main manuscript text.\u003c/p\u003e\n\u003cp\u003eP.V. supervised the research work and contributed to conceptual guidance and manuscript review.\u003c/p\u003e\n\u003cp\u003eN.K. co-supervised the research and contributed to methodological refinement and technical validation.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors acknowledge the use of publicly available facial expression datasets used in this research.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eBartlett MS, Littlewort G, Fasel I, Movellan JR. Real-Time Face Detection and Facial Expression Recognition: Development and Applications to Human\u0026ndash;Computer Interaction, \u003cem\u003eProc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops\u003c/em\u003e, pp. 53\u0026ndash;60, 2003.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAbdat F, Maaoui C, Pruski A. Human\u0026ndash;computer interaction using emotion recognition from facial expression, \u003cem\u003eProc. 
UKSim 5th European Modelling Symposium on Computer Modelling and Simulation\u003c/em\u003e, pp. 196\u0026ndash;201, 2011.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFasel B, Luettin J. Automatic facial expression analysis: A survey. Pattern Recogn. 2003;36(1):259\u0026ndash;75.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKrizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Commun ACM. 2017;60(6):84\u0026ndash;90.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSimonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition, \u003cem\u003eProc. International Conference on Learning Representations (ICLR)\u003c/em\u003e, 2015.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi S, Deng W. Deep facial expression recognition: A survey. IEEE Trans Affect Comput. 2020;11(1):3\u0026ndash;18.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLucey P, Cohn JF, Kanade T, Saragih J, Ambadar Z, Matthews I. The extended Cohn\u0026ndash;Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression, \u003cem\u003eProc. IEEE CVPR Workshops\u003c/em\u003e, pp. 94\u0026ndash;101, 2010.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGoodfellow IJ, et al. Challenges in representation learning: A report on three machine learning contests. Neural Netw. 2015;64:59\u0026ndash;63.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFrenay F, Verleysen M. Classification in the presence of label noise: A survey. Pattern Recogn. 2014;45(9):1\u0026ndash;28.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePatrini G, Rozza A, Menon AK, Nock R, Qu L. Making deep neural networks robust to label noise: A loss correction approach, \u003cem\u003eProc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR)\u003c/em\u003e, pp. 
1944\u0026ndash;1952, 2017.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCohn JF, De la Torre F. Automated face analysis for affective computing. in The Oxford Handbook of Affective Computing. Oxford University Press; 2014. pp. 131\u0026ndash;50.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHe K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition, \u003cem\u003eProc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR)\u003c/em\u003e, pp. 770\u0026ndash;778, 2016.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChoi M, Lee J, Kim S. Facial expression recognition using deep features and SVM classifiers. Sensors, 21, 9, Article 3179, 2021.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHasani B, Mahoor MH. Facial expression recognition using enhanced deep 3D convolutional neural networks, \u003cem\u003eProc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR)\u003c/em\u003e, pp. 30\u0026ndash;40, 2017.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang K, Peng X, Yang J, Lu S, Qiao Y. Suppressing uncertainties for large-scale facial expression recognition, \u003cem\u003earXiv preprint arXiv\u003c/em\u003e:2007.\u003cem\u003e03149\u003c/em\u003e, 2020.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOguine OC, Oguine KJ, Bisallah HI, Ofuani D. Hybrid facial expression recognition (FER2013) model for real-time emotion classification, \u003cem\u003earXiv preprint arXiv:2203.08901\u003c/em\u003e, 2022.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSajjad M, et al. A comprehensive survey on deep facial expression recognition: Challenges, applications and future guidelines. IEEE Access. 2023;11:11245\u0026ndash;70.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi S, Deng W, Du J. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. Neural Netw. 
2016;73:70\u0026ndash;81.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eReghunathan RK. Facial expression recognition using pre-trained ensemble models for FER2013. Sensors, 24, 3, Article 812, 2024.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKopalidis T. Advances in facial expression recognition: Methods, benchmarks, models, and datasets, \u003cem\u003eInformation\u003c/em\u003e, vol. 15, no. 2, Article 98, 2024.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiang C, Dong J. A survey of deep learning-based facial expression recognition research. Front Comput Intell Syst. 2025;4(1):1\u0026ndash;20.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAgung ES et al. Image-based facial emotion recognition using deep learning with extended emotion categories. Sci Rep, 14, Article 6123, 2024.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eUllah S. Facial expression recognition survey: Edge computing deep learning approaches, \u003cem\u003ePeerJ Computer Science\u003c/em\u003e, vol. 10, e1913, 2024.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHasani B, Mahoor MH. Facial expression recognition using enhanced deep 3D convolutional neural networks. IEEE Trans Affect Comput. 2020;11(4):653\u0026ndash;66.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang Z, Zhao X, Li Y. A survey on facial expression recognition of static and dynamic emotions. IEEE Access. 2023;11:32415\u0026ndash;38.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang K, Peng X, Yang J, Lu S, Qiao Y. ReSup: Reliable label noise suppression for facial expression recognition, \u003cem\u003earXiv preprint arXiv:2305.17895\u003c/em\u003e, 2023.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi J, Zhang Y, Liu Z. 
Learning diversified feature representations for facial expression recognition in the wild, \u003cem\u003earXiv preprint arXiv:2210.09381\u003c/em\u003e, 2022.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKollias D, Zafeiriou S, Tefas A. Recognizing facial expressions in the wild using multi-architectural ensembles with distillation, \u003cem\u003earXiv preprint arXiv:2106.16126\u003c/em\u003e, 2021.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi S, Deng W, Du J. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. Neural Netw. 2016;73:70\u0026ndash;81.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBen A, Ammar, et al. Introducing a novel dataset for facial emotion recognition. Heliyon. 2024;10(2):e21345.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKhan MS, Hussain J, Aboalsamh H, Bebis G. Attention-based facial expression recognition: A survey. IEEE Access. 2021;9:165002\u0026ndash;28.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen Y, et al. Real-time facial expression recognition: Advances and comparative analysis. Int J Comput Vision. 2023;131(4):987\u0026ndash;1006.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi H et al. Deep learning-based facial emotion recognition on AffectNet-HQ. Int J Pattern recognit Artif Intell, 38, 5, 2024.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSharma R, Singh P. Integrating CNN and RNN-LSTM for facial expression recognition, \u003cem\u003eProc. International Conference on Computer Vision Theory and Applications\u003c/em\u003e, pp. 214\u0026ndash;221, 2024.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhou G, Xie Y, Fu Y, Wang Z. 
Multi-loss based feature fusion and top-two voting ensemble decision strategy for facial expression recognition in the wild, \u003cem\u003earXiv preprint arXiv:2311.03478\u003c/em\u003e, 2023.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu S, Xu Y, Wan T, Kui X. Ada-DF: An adaptive label distribution fusion network for facial expression recognition. Inf Sci. 2024;677:1\u0026ndash;14.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHe R, Xing Z, Tan W, Yan B. A generative framework for self-supervised facial representation learning, \u003cem\u003earXiv preprint arXiv:2309.08273\u003c/em\u003e, 2023.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi Y, Zhao Y, Xia X, Jiang D. ARPGNet: Appearance- and relation-aware parallel graph attention fusion network for facial expression recognition, \u003cem\u003earXiv preprint arXiv:2511.22188\u003c/em\u003e, 2025.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEzati A, Dezyani M, Rana R, Rajabi R, Ayatollahi A. A lightweight attention-based deep network via multi-scale feature fusion for multi-view facial expression recognition, \u003cem\u003earXiv preprint arXiv:2403.14318\u003c/em\u003e, 2024.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMao J, Xu R, Yin X, Chang Y, Nie B, Huang A. POSTER++: A simpler and stronger facial expression recognition network, \u003cem\u003earXiv preprint arXiv:2301.12149\u003c/em\u003e, 2023.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKhan MS, Hussain J, Aboalsamh H, Bebis G. Attention-based facial expression recognition: A survey. IEEE Access. 2021;9:165002\u0026ndash;28.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSong D, Liu C. A facial expression recognition network using hybrid feature extraction. PLoS ONE, 19, 1, e0280213, 2024.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGailan MJ. 
Performance evaluation of facial emotion recognition through convolutional neural network model. Int J Eng Res Technol (IJERT). 2025;14(3):1\u0026ndash;6.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGan Y. Facial expression recognition using convolutional neural networks, \u003cem\u003eProc. 2nd International Conference on Intelligent Computing and Applications\u003c/em\u003e, pp. 1\u0026ndash;6, 2018.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"bmc-artificial-intelligence","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"","sideBox":"Learn more about [BMC Artificial Intelligence](https://bmcartificialintel.biomedcentral.com)","snPcode":"44398","submissionUrl":"https://submission.nature.com/new-submission/44398/3","title":"BMC Artificial Intelligence","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-9295088/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-9295088/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eCurrent facial expression recognition (FER) models have a tendency to be unstable when trained on heterogeneous public data because of annotation noise, visual artifacts, and bias depending on the data set. 
This paper proposes a representation-based FER model for psychological stress measurement that prioritizes automated data cleaning and architecture-agnostic feature learning over architectural complexity. Several publicly available FER datasets are combined to increase expression diversity and then filtered by an automated quality pipeline that removes blurred, low-resolution, and ambiguous images, producing a curated corpus of 16,828 images across six emotion categories. Frozen convolutional backbones, such as ResNet18 and MobileNetV2, are used to extract deep facial embeddings so that the backbones can be compared fairly, without the effects of fine-tuning. These embeddings are evaluated with several classical machine-learning classifiers under similar experimental conditions. The findings indicate that ResNet18 embeddings are more discriminative than their lightweight counterparts and that Random Forest classifiers give the best overall results, reaching a peak accuracy of 70.34%. A further analysis of optimizers shows that AdamW has the most consistent convergence and validation behavior across folds. The results indicate that dataset integrity and representation fidelity are the dominant factors in FER performance, outweighing the effects of end-to-end architectural complexity. 
The proposed framework provides a reliable and reproducible stress-oriented FER system that functions in limited and noisy data environments.\u003c/p\u003e","manuscriptTitle":"A High-Fidelity Facial Expression Representation Framework Using Curated Multi- Dataset Learning","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-05-09 00:40:58","doi":"10.21203/rs.3.rs-9295088/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"reviewerAgreed","content":"142311220467577430165718198658454035811","date":"2026-05-01T18:14:57+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"56573431309225680473685984414121183990","date":"2026-04-25T03:56:10+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"225170305631600622915865325147071283604","date":"2026-04-24T16:35:53+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2026-04-24T15:41:23+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-04-15T05:24:06+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-04-14T12:59:24+00:00","index":"","fulltext":""},{"type":"submitted","content":"BMC Artificial Intelligence","date":"2026-04-14T11:55:56+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"bmc-artificial-intelligence","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"","sideBox":"Learn more about [BMC Artificial Intelligence](https://bmcartificialintel.biomedcentral.com)","snPcode":"44398","submissionUrl":"https://submission.nature.com/new-submission/44398/3","title":"BMC Artificial Intelligence","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"BMC 
Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"0270713e-5c73-4379-a21d-26998df66d25","owner":[],"postedDate":"May 9th, 2026","published":true,"recentEditorialEvents":[{"type":"reviewerAgreed","content":"142311220467577430165718198658454035811","date":"2026-05-01T18:14:57+00:00","index":26,"fulltext":""}],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[],"tags":[],"updatedAt":"2026-05-09T00:40:58+00:00","versionOfRecord":[],"versionCreatedAt":"2026-05-09 00:40:58","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-9295088","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-9295088","identity":"rs-9295088","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
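
The Optimizer Comparison text in the payload above says training used a weighted cross-entropy loss, but it does not say how the class weights were chosen. The sketch below, in plain Python, assumes inverse-frequency weights (an assumption, not the authors' stated method) and divides the weighted sum by the total weight of the samples in the batch, which is the normalisation PyTorch's CrossEntropyLoss applies with class weights and reduction='mean'. The function names inverse_frequency_weights and weighted_cross_entropy are hypothetical, and the four-sample batch is made up purely for illustration.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Return one weight per class, proportional to 1 / class frequency.

    Assumption: weight = n_samples / (num_classes * count), so a perfectly
    balanced label set gives every class a weight of 1.0.
    """
    counts = Counter(labels)
    n = len(labels)
    weights = []
    for c in range(num_classes):
        if counts[c] == 0:
            weights.append(0.0)  # class absent from the data: no contribution
        else:
            weights.append(n / (num_classes * counts[c]))
    return weights


def weighted_cross_entropy(probs, labels, class_weights):
    """Weighted mean of -log p(true class) over a batch.

    probs: list of per-sample probability lists (each sums to 1).
    labels: list of integer class indices.
    The weighted sum is divided by the total weight of the samples,
    not by the batch size.
    """
    total_loss = 0.0
    total_weight = 0.0
    for p, y in zip(probs, labels):
        w = class_weights[y]
        total_loss += -w * math.log(p[y])
        total_weight += w
    return total_loss / total_weight


# Illustrative batch: class 0 is three times as frequent as class 1.
labels = [0, 0, 0, 1]
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.4, 0.6]]

weights = inverse_frequency_weights(labels, num_classes=2)
loss = weighted_cross_entropy(probs, labels, weights)
plain = weighted_cross_entropy(probs, labels, [1.0, 1.0])
```

In this toy batch the minority-class sample has the lowest true-class probability, so up-weighting it makes `loss` larger than the unweighted `plain` value. That is the intended effect of class weighting: errors on rare classes count for more.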

