Feature Significance in Speech Emotion Recognition

preprint OA: closed
AI-generated deep summary by claude@2026-06, 2026-06-24 · read from full text

This paper studies feature significance for speech emotion recognition by comparing how different audio features (Log-Mel Spectrograms, MFCCs, pitch, and energy) affect classification performance. Using the RAVDESS emotional speech database (24 actors, 14-class setup across two genders and seven emotions) and preprocessed raw audio, the authors train and evaluate multiple models, including LSTM, 2D CNN, HMM, and deep neural networks. They report the best result as 56% accuracy using a 4-layer 2D CNN with Log-Mel Spectrogram features, and conclude that selecting effective audio features matters more than increasing model complexity for performance in their setup, with limited detail on other potential constraints beyond their dataset and experimental configuration. The paper does not explicitly discuss endometriosis or adenomyosis; it was included in the corpus via a keyword match in the upstream search index.

Read from the paper's body, not the abstract. Not a substitute for reading the paper. No clinical advice.

Full text 109,491 characters · extracted from preprint-html
Feature Significance in Speech Emotion Recognition | Research Square

Research Article
Atul Mishra, Sarthak Jindal
This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-7474053/v1
This work is licensed under a CC BY 4.0 License
Status: Posted
Version 1 posted

Abstract

In the field of speech emotion recognition, the choice of audio features can dramatically influence the accuracy and effectiveness of classification systems. This study presents a comprehensive comparative analysis of feature significance, shedding light on how different audio characteristics contribute to the success of emotion recognition methodologies. This paper analyzes speech-based emotion recognition techniques and works with audio analyses using the Ryerson Audio-Visual Database (RAVD) of Emotional Speech and Song, a database consisting of audio analysis on raw audio files.
The analysis used features such as Log-Mel Spectrograms (LMS), Mel-Frequency Cepstral Coefficients (MFCCs), pitch, and energy, extracted after the raw audio files were pre-processed. We measured the relevance of these features to emotion classification with a series of approaches that include Long Short-Term Memory networks, Convolutional Neural Networks, Hidden Markov Models, and Deep Neural Networks. On a 14-class classification problem that covers two genders and seven emotions, we obtained 56% accuracy using a 4-layer 2-dimensional CNN with Log-Mel Spectrogram features. Our results show that selecting good audio features matters more for emotion recognition performance than model complexity.

Keywords: Emotion Classification, Feature Significance, MFCCs, Audio Features, Deep Learning

1 Introduction

Emotion recognition from speech is a complex and evolving area of research, involving the analysis and interpretation of subtle variations in vocal characteristics to infer emotional states [1]. This paper explores how specific audio features, particularly Log-Mel Spectrograms and Mel-Frequency Cepstral Coefficients (MFCCs), influence the performance of machine learning models in accurately classifying emotions. Recognizing and interpreting emotions is a fundamental aspect of human interaction. In recent years, there has been a growing interest in developing systems capable of detecting emotions from speech, music, and other audio inputs. These systems apply machine learning algorithms to acoustic signals in order to identify the emotions being conveyed. Their potential applications span a wide range of domains, including virtual assistants, speech-based interfaces, education, mental health monitoring, and entertainment.
At their core, these technologies strive to replicate human sensitivity to emotional cues in voice and sound, thus paving the way for more emotionally intelligent machines. A central challenge in this domain is achieving precise emotion detection from audio streams. This is typically accomplished by extracting and analysing key features—such as pitch, tone, rhythm, and intensity—and training classification models capable of interpreting the emotional content. Various approaches have emerged: rule-based systems that rely on predefined heuristics, statistical models that classify emotions using probabilistic logic, and deep learning methods that automatically learn high-level features through neural networks [ 1 ]. As these systems evolve, their significance becomes even more apparent. Emotionally aware technologies have the potential to enhance user experience across multiple domains. In virtual interactions, they can respond more personally and empathetically. In gaming, they offer more immersive and responsive gameplay [ 15 ]. In healthcare, these systems may support the early detection and monitoring of emotional disorders such as depression and anxiety, enabling timely intervention and more effective treatment outcomes. This study is driven by several key research directions. One line of inquiry focuses on developing sound emotion recognition systems by testing various algorithms, feature extraction methods, and training datasets in order to maximize classification accuracy [ 2 ], [ 3 ]. Another important direction examines how contextual variables—such as the user's emotional state, the nature of the sound, and the surrounding environment—affect recognition accuracy [ 1 ]. Furthermore, exploring cultural variations in emotional expression is crucial, as emotions are often conveyed and interpreted differently across societies [ 1 ]. The rationale for this research lies in the potential of emotion-aware systems to redefine human-computer interaction. 
These systems can enhance communication, support personalized learning, and even offer mental health insights. By accurately identifying users' emotional states, they open the door to more intuitive and effective technological experiences. Particularly in mental health, emotion recognition systems can contribute to the diagnosis and management of conditions like anxiety and depression through passive yet intelligent monitoring. Based on these foundations, two key hypotheses are proposed: first, that deep learning-based sound emotion recognition systems offer superior accuracy compared to traditional statistical models [2], [3]; and second, that the accuracy of these systems is significantly influenced by contextual factors surrounding the audio input.

The remainder of this paper is organized as follows. The next section presents a review of existing literature in the field of sound emotion recognition. This is followed by a detailed explanation of the methodology, including the dataset, feature extraction techniques, and model implementation. The subsequent sections discuss the experimental results and performance evaluations. A discussion of the findings and their implications is then provided. Finally, the paper concludes with a summary of key insights and suggestions for future research.

2 Literature Review

Over the past decade, sound emotion recognition has emerged as a vibrant interdisciplinary field, bridging the gap between audio signal processing, psychology, machine learning, and human-computer interaction. Researchers have extensively explored how subtle vocal cues, such as pitch, tone, energy, and spectral dynamics, can reveal emotional states, with machine learning techniques steadily enhancing the reliability and accuracy of these insights. This review synthesizes notable contributions to the field, identifying prevailing methodologies, datasets, and recent advances that continue to shape this evolving area.
A fundamental aspect of speech emotion recognition (SER) lies in the extraction of relevant features from audio signals. Traditionally, features such as Mel-Frequency Cepstral Coefficients (MFCCs), pitch, energy, duration, spectral flux, and spectral centroid have been widely adopted. Studies have shown that combining these features typically yields superior performance compared to using them in isolation. In recent years, the use of deep learning architecture like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) has gained popularity due to their ability to automatically learn intricate emotional patterns embedded in audio data. Various classification algorithms have been applied to speech emotion tasks. Conventional models like Support Vector Machines (SVM), Random Forests, and Naive Bayes have provided solid baselines. However, deep learning-based models have demonstrated substantial improvements in performance. El Ayadi et al. [ 2 ] offered a comprehensive survey of classification models, while Han et al. [ 3 ] and Zhang et al. [ 4 ] demonstrated the effectiveness of deep neural networks and ensemble-based methods in capturing emotional nuances more accurately. Datasets have played a pivotal role in enabling consistent benchmarking of SER systems. Prominent among them are RAVDESS, Emo-DB, TESS, and SAVEE, each offering a variety of emotional expressions across gender and speech types [ 15 ][ 16 ][ 17 ]. These databases have allowed for meaningful comparisons across models and techniques, facilitating standardized evaluation. To assess model performance, common metrics include accuracy, precision, recall, and F1-score. Advanced studies have also introduced measures such as the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) to better evaluate model robustness, especially under class-imbalanced scenarios. 
As shown in Table 1, several studies have used traditional techniques for speech emotion recognition, achieving varying levels of accuracy. Taken together, these studies illustrate the field’s rapid progression from traditional classifiers to sophisticated deep learning and ensemble techniques. As feature extraction and model design continue to evolve, so too will the potential for highly accurate and context-aware emotion recognition systems, bringing us closer to truly empathetic human-computer interaction.

3 Methodology

This section outlines the methodological framework employed for developing and evaluating the proposed speech emotion recognition system, detailing the data preprocessing steps, feature extraction techniques, model architecture, training process, and evaluation metrics used throughout the study.

3.1 Data collection

Designing an effective Speech Emotion Recognition (SER) system typically involves three key components: selecting an appropriate emotional speech database, extracting meaningful features from audio signals, and choosing suitable classification algorithms. For this study, the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) has been selected due to its comprehensive and balanced structure. This multimodal, gender-balanced dataset features recordings from 24 professional actors, each performing 104 distinct vocalizations that span a wide range of emotions, including happiness, sadness, anger, fear, surprise, disgust, calmness, and neutrality [15]. Each actor delivers two standardized sentences, "Kids are talking by the door" and "Dogs are sitting by the door", for every emotion. Except for the neutral emotion, each expression is recorded at both normal and heightened emotional intensities, with each line spoken twice. The dataset includes 1,440 spoken utterances and 1,012 sung utterances, offering a diverse and rich emotional context.
Unlike many other datasets, such as the Toronto Emotional Speech Set (TESS) and the Surrey Audio-Visual Expressed Emotion (SAVEE), which include only audio recordings, RAVDESS provides a multimodal setup and maintains a well-balanced distribution across emotion classes, mitigating common issues like class imbalance.

3.2 Selection and selection restrictions

While the RAVDESS dataset is widely regarded for its balanced emotional representation and professional-quality recordings, it is not without limitations. One of the primary concerns is its lack of linguistic and cultural diversity. Factors such as language, accent, dialect, and cultural background play a critical role in the way emotions are expressed and perceived in speech. Consequently, a model trained exclusively on English utterances, such as those in RAVDESS, may struggle to accurately recognize emotions in non-English languages like Mandarin or Hindi. Furthermore, since the dataset comprises performances from 24 English-speaking actors based in Toronto, Canada, it reflects a strong North American cultural bias. This limitation restricts the model’s generalizability across diverse global populations and real-world applications where linguistic and cultural variation is the norm.

3.3 Data preparation

Each audio file in the RAVDESS dataset follows a structured naming convention composed of seven numerical components: modality, vocal channel, emotion, emotional intensity, statement, repetition, and actor. This format allows for organized parsing and efficient categorization of the data. The dataset encodes gender in the actor ID: odd-numbered actors are male and even-numbered actors are female. This structured metadata is essential for preprocessing, enabling systematic filtering and labeling of audio samples according to emotion type, speaker identity, and other relevant attributes. Such organization simplifies downstream tasks like feature extraction and model training.
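The naming convention described above can be turned into labels with a few lines of Python. The following is a minimal sketch, not the authors' code: the field order and the emotion codes follow the dataset's published naming convention as we understand it, and the function names `parse_ravdess_name` and `gender_emotion_label` are hypothetical.

```python
from pathlib import Path

# Emotion codes as published for RAVDESS file names (assumed here, not taken from the paper).
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}


def parse_ravdess_name(path):
    """Split a RAVDESS file name into its seven metadata fields."""
    parts = Path(path).stem.split("-")
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}: {path}")
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    actor_id = int(actor)
    return {
        "modality": modality,
        "vocal_channel": "speech" if channel == "01" else "song",
        "emotion": EMOTIONS[emotion],
        "intensity": "normal" if intensity == "01" else "strong",
        "statement": statement,
        "repetition": int(repetition),
        "actor": actor_id,
        # Odd actor IDs are male, even actor IDs are female.
        "gender": "male" if actor_id % 2 == 1 else "female",
    }


def gender_emotion_label(meta):
    """Build a combined label such as 'female_fearful'."""
    return f"{meta['gender']}_{meta['emotion']}"
```

Combined labels of this form match the class names used later in the confusion matrices, such as female_neutral.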
3.4 Data Cleaning

Each recording is about three seconds long. The silence at the beginning and end of each recording was trimmed. Although the audio was well captured, a small amount of noise was present in the data. We tested several signal-processing techniques, such as filtering and voice-activity detection (VAD), to remove it. Spectral subtraction, a well-known noise-reduction method, removes additive background noise by subtracting an estimate of the noise spectrum from the noisy speech spectrum. The results of this processing are shown in Figs. 1 through 6, which present the waveforms and spectrograms before and after trimming and filtering for male audio samples.

3.5 Feature Selection and Modelling

Audio features fall into two basic categories: time-domain features and frequency-domain features. Examples of time-domain features include short-term signal energy, zero crossing rate, maximum amplitude, minimum energy, and energy entropy. These features are easy to extract, which simplifies audio signal analysis. Frequency-domain features expose more complex, less obvious patterns in the audio signal, which may reveal the emotion concealed in it. Examples of frequency-domain features include spectrograms, Mel-frequency cepstral coefficients (MFCCs), spectral centroid, spectral slope, spectral entropy, and chroma coefficients. During exploratory data analysis, each feature was examined closely. For the purposes of this report, however, we concentrate on two key features: Mel spectrograms and MFCCs.

3.6 Mel-Frequency Cepstrum (MFCC)

The Mel-frequency cepstrum approximates the short-term power spectrum of a sound by transforming the audio stream in a sequence of stages that simulate the response of the human cochlea.
A variety of audio signal processing techniques are used to extract MFCCs, including:
Pre-emphasis: a high-pass filter is applied to the signal to boost its higher-frequency components and reduce the impact of noise.
Frame segmentation: the signal is broken into brief frames of 20 to 30 milliseconds with minimal overlap.
Windowing: each frame is multiplied by a window function, such as a Hamming or Hanning window, to reduce spectral leakage.
Log scaling: the energy in each Mel frequency band is logarithmically scaled to produce a set of Mel-scale spectral coefficients.
Discrete Cosine Transform: the discrete cosine transform (DCT) is applied to the scaled Mel coefficients to obtain the final MFCCs.
The MFCCs for male and female audio samples are visualized in Figs. 7 and 8, respectively.

3.7 Mel-Spectrograms

The Mel spectrogram is another popular feature extraction technique in sound emotion recognition applications. A Mel spectrogram is a particular type of spectrogram produced using a Mel-scale filter bank, which is a collection of triangular filters spaced according to the Mel scale. In the context of sound emotion identification, Mel spectrograms can capture significant elements of the audio signal that are related to the emotional content, such as the pitch, timbre, and intensity of the voice. The steps are:
Fast Fourier Transform: each frame's frequency spectrum is calculated using the FFT.
Mel filter bank: the frequency spectrum is passed through a series of triangular bandpass filters to yield a set of logarithmically spaced Mel-frequency bands.
Log scaling: the logarithm of the energy in each band of the Mel spectrum is used to create a set of Mel-scale spectral coefficients.
Mel spectrogram: the power in the Mel-scale bands is computed for each frame, and the frames together form the Mel spectrogram.
The Mel spectrograms for male and female audio samples are shown in Figs. 9 and 10, respectively.
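The MFCC and Mel-spectrogram steps listed above can be sketched end to end with NumPy. This is an illustrative implementation under our own assumptions, not the authors' pipeline: the sample rate, the 25 ms frame with a 10 ms hop, the 0.97 pre-emphasis coefficient, the FFT size, and the numbers of Mel bands and cepstral coefficients are all hypothetical defaults, and the function names are ours.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters whose centres are evenly spaced on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):          # rising edge
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb


def log_mel_and_mfcc(signal, sr=16000, frame_ms=25, hop_ms=10,
                     n_fft=512, n_mels=40, n_mfcc=13):
    """Return (log_mel, mfcc); the signal must be at least one frame long."""
    # Pre-emphasis: first-order high-pass filter.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Frame segmentation.
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    # Windowing with a Hamming window.
    frames = emphasized[idx] * np.hamming(frame_len)
    # FFT power spectrum, then Mel filter bank and log scaling.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sr).T
    log_mel = np.log(mel_energy + 1e-10)
    # DCT-II over the Mel axis gives the cepstral coefficients.
    n = np.arange(n_mels)
    basis = np.cos(np.pi / n_mels * (n[None, :] + 0.5) * np.arange(n_mfcc)[:, None])
    mfcc = log_mel @ basis.T
    return log_mel, mfcc
```

The two outputs share every step up to the logarithm, which is why the log-Mel spectrogram keeps the full band-by-band detail while the MFCCs keep only the first few DCT coefficients of it.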
The overall solution pipeline for our SER system is illustrated in Fig. 11.

3.8 Convolutional Neural Networks

Convolutional neural networks (CNNs) are responsible for the enormous improvements in image recognition tasks in recent years. CNNs excel at automatically extracting important features from multi-dimensional images. To exploit the two-dimensional structure of image data, CNNs use shared kernels (weights). Max pooling is added to introduce invariance, so that only the relevant features are passed on for tasks such as classification and segmentation. Perhaps surprisingly, CNNs are also effective for audio recognition tasks. This is because applying the log-Mel filter bank to the FFT representation of the raw audio signal introduces linearity along the logarithmic frequency axis, which allows a convolution to be applied along that axis. Otherwise, we would need different kernels or filters for different frequency ranges. This property lets the model learn basic patterns from short time spans effectively and, combined with the strong representational power of CNNs, leads to top performance in speech-based emotion recognition systems. The prediction pipeline for our CNN model is depicted in Fig. 12.

3D CNNs typically perform well on tasks requiring video comprehension. In video data, there is no temporal correlation within a frame, only between frames. In the case of audio converted to 3D, however, there is a temporal relationship along two axes, with a short time lag along one and a longer lag along the other. We were therefore interested in determining whether modelling both the short- and long-term dependencies in the data would enhance our SER system, and our study also included a 3D CNN model. Each frame in the 3D audio was set to a constant length of 250 ms, which proved to be the shortest effective duration for conveying emotion.
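One way to build the 3D input described above is to cut a two-dimensional spectrogram into consecutive 250 ms chunks. The paper does not say how its 3D audio was constructed, so the sketch below is an assumption: it takes a (time, mel) array computed with a hypothetical 10 ms hop and reshapes it into (chunks, frames per chunk, mel), dropping any incomplete final chunk.

```python
import numpy as np


def to_3d_chunks(spectrogram, hop_ms=10, chunk_ms=250):
    """Cut a (time, mel) spectrogram into consecutive fixed-length chunks."""
    frames_per_chunk = chunk_ms // hop_ms
    n_chunks = spectrogram.shape[0] // frames_per_chunk
    # Drop trailing frames that do not fill a whole chunk.
    trimmed = spectrogram[: n_chunks * frames_per_chunk]
    return trimmed.reshape(n_chunks, frames_per_chunk, spectrogram.shape[1])
```

With these assumed settings a 3-second clip of about 300 spectrogram frames yields 12 chunks of 25 frames each, so the first axis carries the longer lag and the second axis the shorter one.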
To discover patterns in sound waveforms, multilayer 1D CNNs were also trained on raw audio.

3.9 LSTM (Long Short-Term Memory) Model

The working procedures of different models, including CNN, SVM, and LSTM, for SER are shown in Fig. 13. Sound emotion detection experiments have shown that Long Short-Term Memory (LSTM) models, a type of recurrent neural network (RNN), perform well. The end-to-end process by which an LSTM model recognizes emotion in audio is as follows:
Pre-processing of data: relevant features such as MFCCs or Mel spectrograms, which capture the spectral and temporal characteristics of the audio signal, are extracted from the audio data. The data is then split into training, validation, and testing sets.
Model architecture for LSTM: the LSTM model is designed to output a probability distribution over the emotion classes. It usually starts with one or more LSTM layers, followed by one or more fully connected layers.
Model training: the LSTM model is trained on the training set using optimization techniques like stochastic gradient descent (SGD) or Adam to minimize a loss function, such as categorical cross-entropy. During multi-epoch training, early stopping on the validation set is used to prevent overfitting.
Model evaluation: the trained LSTM model is evaluated on the testing set using metrics like accuracy, precision, recall, and F1-score. Confusion matrices can also be used to assess the model's classification performance.
Application of the model: to enable real-time sound emotion recognition, the trained LSTM model can be used to classify new audio samples into one of the emotion categories. Metrics like classification accuracy and response time can be used to assess the effectiveness of the deployed model.
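The early-stopping rule mentioned in the training step can be written independently of any deep learning framework. The sketch below is a generic illustration and not the authors' training loop; the function name, the `patience` of 5 epochs, and the `min_delta` threshold are assumptions.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # Patience never ran out: training used every epoch.
    return len(val_losses) - 1, best_epoch
```

In practice the weights from `best_epoch` would be restored before the model is evaluated on the testing set.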
3.10 Hidden Markov Models

Before RNNs became popular, HMMs were commonly used for speech tasks such as emotion recognition. HMMs assume that observations are generated by a set of hidden states, with transitions between states that follow the Markov property. HMMs are generative models. A significant disadvantage of HMMs is how poorly they model non-linear data. We employed two types of HMMs with different emission distributions: Gaussian HMMs and Gaussian Mixture Model HMMs (GMM-HMMs). In these HMMs the hidden states correspond to phones, the smallest discrete units of sound, and an utterance is classified according to the likelihood that its sequence of phones belongs to a particular class. The architecture of the HMM model used for SER is illustrated in Fig. 14.

3.11 Description of each model

Because the data is evenly distributed across classes, accuracy can be used to compare how well different models perform. Unweighted accuracy was therefore chosen as the model selection metric. For models with fixed-size inputs, all input audio was clipped or padded to 3 seconds. Each model was trained for 100 epochs, with a batch size chosen according to the complexity of the architecture. Initial training used the stochastic gradient descent (2) (SGD) optimizer. Later experiments used the ADAM (5) optimizer with default parameters because SGD converged more slowly. Model checkpoints were saved by monitoring accuracy on the validation set. Using MFCC features, which performed best for seven-class prediction, we found that the network was unable to recognize recordings from the "Surprised" class. We hypothesized that this was caused by a shortage of data for the "Surprised" class. Only the remaining six classes (angry, sad, neutral, disgusted, happy, and fearful) were therefore considered in subsequent analyses.
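The fixed 3-second input length described earlier in this section can be enforced with a small helper. How the authors clipped or extended their clips is not specified, so this sketch assumes centre cropping for long clips and symmetric zero-padding for short ones; the function name and the 16 kHz sample rate are hypothetical.

```python
def pad_or_clip(samples, sr=16000, seconds=3.0):
    """Return a list of exactly sr * seconds samples."""
    target = int(sr * seconds)
    samples = list(samples)
    if len(samples) >= target:
        # Long clip: keep the centre portion.
        start = (len(samples) - target) // 2
        return samples[start:start + target]
    # Short clip: pad with zeros on both sides.
    pad = target - len(samples)
    left = pad // 2
    return [0.0] * left + samples + [0.0] * (pad - left)
```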
The "Surprised" class was therefore removed from the dataset. We found that DNN performance was best when the number of MFCCs and deltas was increased. Comparing the MFCC visualization with the log-Mel spectrogram shows how little information MFCCs retain at higher frequencies. We therefore expected log-Mel spectrogram features to produce better results in this study. When 1D CNNs and 2D CNNs were used for both the 12-class gender-emotion task and the six-class emotion task, including gender gave better results [19], [20]. The 1D CNN and 1D CNN-LSTM architectures were trained on the raw audio input, since CNNs are naturally good feature extractors. Engineered features such as MFCCs and log-Mel spectrograms were used with 2D CNNs. The first 2D CNNs had two convolutional layers with 3x3 filters, followed by max pooling with 2x2 filters and stride 2. To tune this base architecture, the filter widths were increased and new convolutional layers were added. Performance did not improve when the depth was increased beyond 4 layers. A different CNN architecture was also tested, in which the final convolutional feature map was reduced to a fixed-length feature vector using global average pooling rather than being flattened. This addressed the large number of parameters in the fully connected layers. Another benefit of this approach is that it supports variable-sized inputs, which are common in speech data. With the global average pooling layer, increasing the number of filters in the early convolutional layers improved performance. A visual representation of the 2D CNN with different layers is provided in Fig. 15. Gaussian HMMs and GMM-HMMs were implemented using MFCC and log-Mel spectrogram features. HMMs using MFCCs achieved about 20% accuracy, whereas those using log-Mel spectrograms achieved 32% accuracy.
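The two benefits of global average pooling noted above, fewer parameters and support for variable-sized inputs, can be checked with a short NumPy sketch. The feature-map sizes and the 128-unit dense layer below are hypothetical values chosen for illustration, not the paper's architecture.

```python
import numpy as np


def global_average_pool(feature_maps):
    """Average a (channels, height, width) array over its spatial axes."""
    return feature_maps.mean(axis=(1, 2))


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# Two feature maps with different widths, as would arise from clips of different lengths.
short_clip = np.ones((64, 5, 30))
long_clip = np.ones((64, 5, 90))

# Both pool to the same fixed-length vector of 64 values.
pooled_short = global_average_pool(short_clip)
pooled_long = global_average_pool(long_clip)

# Parameters of a 128-unit dense layer after flattening versus after pooling.
flatten_cost = dense_params(64 * 5 * 30, 128)
pooled_cost = dense_params(64, 128)
```

Under these assumed sizes the dense layer that follows flattening needs 1,228,928 parameters, while the one that follows pooling needs 8,320, and the flattened variant would also break as soon as the input width changed.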
To have a deployable model, we wanted to make sure that our model could at least roughly categorize emotions as either positive or negative. To determine whether the speaker is feeling positive (neutral, happy, surprised) or negative (angry, sad, fearful, disgusted), we created a 4-layer 2D CNN model (with global average pooling). Trained on log-Mel spectrogram features, this model performed the binary classification on test and validation data with 88% accuracy. Because the binary and fine-grained models work well together, they can be deployed jointly to categorize emotions broadly and to pinpoint the specific emotion. The mean energy by emotion is illustrated in Fig. 16.

4 Results

We conducted an extensive evaluation of various feature extraction techniques and deep learning models for speech-based emotion recognition. The experiments highlighted the significant impact of feature selection on model performance, particularly under data-constrained conditions. Traditional handcrafted features such as MFCCs and Log-Mel spectrograms yielded markedly better results compared to raw audio waveforms. Among these, Log-Mel spectrograms consistently outperformed MFCCs, underscoring their superior representational capacity for capturing emotional cues from speech. A deeper analysis into gender-specific classification revealed an important insight: male and female voices exhibit different emotional patterns, often driven by variations in pitch, energy, and articulation. This aligns with previous research [7], [10] and is further supported by our findings where MFCCs enhanced with pitch and energy information showed significant improvement. This suggests that essential prosodic features required for emotion prediction are not inherently captured in standard MFCC representations.
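Adding pitch and energy to an MFCC matrix can be sketched as follows. The paper does not describe how its pitch and energy features were computed, so this example assumes per-frame RMS energy and a simple autocorrelation pitch estimate; the function names, the 60 to 400 Hz search range, and the framing parameters are hypothetical.

```python
import numpy as np


def frame_signal(signal, frame_len, hop):
    """Slice a 1D signal into overlapping frames of frame_len samples."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]


def rms_energy(frames):
    """Root-mean-square energy of each frame."""
    return np.sqrt(np.mean(frames ** 2, axis=1))


def autocorr_pitch(frame, sr, fmin=60.0, fmax=400.0):
    """Estimate pitch in Hz from the strongest autocorrelation peak.

    Unvoiced or silent frames still return a number, which is not meaningful.
    """
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    return sr / lag


def add_prosody(mfcc, signal, sr, frame_len, hop):
    """Append energy and pitch columns to a (frames, coefficients) MFCC matrix."""
    frames = frame_signal(signal, frame_len, hop)
    if mfcc.shape[0] != frames.shape[0]:
        raise ValueError("MFCC matrix and signal framing disagree on frame count")
    energy = rms_energy(frames)
    pitch = np.array([autocorr_pitch(f, sr) for f in frames])
    return np.column_stack([mfcc, energy, pitch])
```

The frame length should cover at least two periods of the lowest pitch of interest, which is why a longer window such as 40 ms is a common choice for this step.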
The performance of different models is summarized in Table 5, which compares accuracy, validation accuracy, and the number of classes for various architectures. The performance comparison across models shows that 2D CNNs surpass traditional LSTM and hybrid CNN-LSTM models. This is attributed to their ability to better capture spatial patterns in the time-frequency representation of audio signals. Specifically, Fig. 17 shows the confusion matrix of the 1D CNN-LSTM model, revealing moderate misclassification among similar emotions, particularly in neutral and sad categories. Figure 18, highlighting the 2D CNN_4L model, demonstrates improved precision, particularly in classifying neutral, fearful, and sad emotions across genders. Notably, the 2D CNN_4L exhibits fewer misclassifications, highlighting its advantage in emotion recognition tasks. In contrast, a binary classification approach with male-specific data using deeper CNNs (6, 7, and 8 layers) yielded high validation scores and F-scores, as shown in Fig. 19. Here, emotions were broadly categorized as positive and negative, and the 1D CNN_8L model achieved the highest accuracy of 96%, with a validation accuracy and F-score both peaking at 96%. A comparative summary of model performances is presented in Table 6. Models were evaluated based on accuracy, validation accuracy, F-score, and the number of emotion classes.
The 1D CNN_8L not only achieved the highest overall scores in binary classification but also exhibited robustness in generalization. A comparative analysis of emotion recognition models, based on the confusion matrices in Figs. 17, 18, and 19, is provided in Table 7.

Figures 17, 18, and 19 represent different configurations of emotion recognition through confusion matrices. Figure 17 illustrates a gender-specific, fine-grained emotion classification model distinguishing between six emotions (angry, disgust, fearful, happy, neutral, and sad) for both male and female speakers. The model shows moderate performance, with the highest number of correct predictions observed for the female_neutral category (12 correct predictions), while the female_disgust category yielded the lowest with only 5 correct classifications, indicating potential confusion with adjacent emotional tones.

In comparison, Fig. 18 demonstrates an improved and more refined version of the same classification task, showing better alignment between predicted and actual labels. The most accurate prediction in this case was again female_neutral with 20 correct predictions, whereas the least accurate was female_angry with just 6 correct outcomes, highlighting the challenge of detecting certain expressive emotions.

Lastly, Fig. 19 simplifies the classification by grouping emotions into two categories: male_positive and male_negative. This binary setup yielded high performance overall, especially for the male_negative category, which had the highest number of correct predictions at 58. The male_positive class showed slightly lower performance with 46 correct predictions, suggesting some overlapping traits in positive emotional tones. Overall, this comparison highlights the trade-off between granularity and classification accuracy, with finer emotional distinctions being harder to detect, while binary models tend to deliver more reliable outcomes.
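The per-class counts quoted above are the diagonal entries of a confusion matrix, and converting them into rates makes classes of different sizes comparable. The sketch below shows the calculation on a made-up two-class matrix; the numbers and labels are purely illustrative and are not taken from Figs. 17 to 19.

```python
def per_class_recall(confusion, labels):
    """Fraction of each true class that was predicted correctly.

    Rows of `confusion` are true classes, columns are predicted classes.
    """
    result = {}
    for i, label in enumerate(labels):
        total = sum(confusion[i])
        result[label] = confusion[i][i] / total if total else 0.0
    return result


def overall_accuracy(confusion):
    """Correct predictions divided by all predictions."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Toy matrix for two hypothetical classes.
toy_labels = ["class_a", "class_b"]
toy_confusion = [
    [8, 2],   # true class_a: 8 correct, 2 predicted as class_b
    [1, 9],   # true class_b: 1 predicted as class_a, 9 correct
]
toy_recall = per_class_recall(toy_confusion, toy_labels)
toy_accuracy = overall_accuracy(toy_confusion)
```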
5 Discussion

The results obtained from our speech emotion recognition (SER) experiments reveal several critical insights into the model's behaviour under varying classification strategies. A close analysis of the confusion matrices in Figs. 17, 18, and 19 shows both the strengths and the persistent challenges in classifying emotions from speech signals.

In the first configuration, the SER model was trained and tested on gender-separated emotion classes, yielding a more granular 12-class output. This setup allowed us to evaluate how male and female vocal expressions of the same emotion influenced recognition performance. Interestingly, emotions such as "female_neutral" and "male_angry" were classified with high accuracy, suggesting that these emotions possess distinct acoustic patterns that are easily separable from others. However, the model struggled significantly with emotions like "male_disgust" and "female_disgust," where predictions were scattered across related negative emotions such as fear and sadness. This outcome underscores a common challenge in emotion recognition: the overlapping spectral and prosodic features of certain affective states, especially within the negative emotion spectrum, often lead to confusion. The complexity increases when subtle variations in gender-specific prosody are introduced, potentially impacting the consistency of the model’s predictions.

The second experimental setup simplified the framework by removing the gender split while retaining the full range of emotional classes. This adjustment led to a noticeable improvement in classification performance. Emotions such as "female_neutral" and "male_sad" stood out as the most accurately predicted, reflecting the model’s improved ability to generalize across genders. Furthermore, the overall confusion across similar emotional categories was visibly reduced.
One plausible explanation for this improvement lies in the model's broader exposure to a mixed pool of vocal characteristics, which may have enhanced its robustness and reduced gender-specific overfitting. Nevertheless, the "female_angry" category continued to exhibit higher misclassification rates, pointing to the inherent variability in how different individuals vocally express anger, especially when influenced by gender-based vocal dynamics.

In the third and final approach, the classification task was reduced to a binary problem: distinguishing between positive and negative emotions. This reframing significantly boosted the model’s predictive accuracy. The "male_negative" class achieved the highest number of correct predictions, followed by "male_positive." With fewer output classes and more distinct categorical boundaries, the model could more easily differentiate broad emotional valence. The simplicity of the classification scheme played a vital role in this improvement, as it reduced the number of distinctions the model had to learn and minimized confusion between subtly different emotions. However, this abstraction sacrifices the granularity of emotion recognition, which may not be desirable in domains such as mental health monitoring or affective computing applications where precise emotional insight is necessary.

Overall, the findings highlight the inherent trade-off between emotional detail and classification performance. While coarse-grained categorization enhances accuracy, it limits the depth of emotional understanding. Fine-grained, multi-class setups, on the other hand, offer richer emotional data but demand more complex model training and face greater challenges in achieving high precision. Additionally, gender-based separation of data seems to introduce variability rather than improve performance, indicating the need for balanced and inclusive training datasets that reflect diverse vocal characteristics.
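One part of this trade-off can be made concrete by collapsing fine-grained predictions into two valence groups and rescoring them. The sketch below is illustrative only: the paper does not state which emotions were assigned to the positive and negative groups, so the `VALENCE` mapping, the function names, and the toy label sequences are all assumptions. It also rescores a single set of predictions, whereas the study trained a separate binary model.

```python
from typing import Dict, List

# Hypothetical emotion-to-valence mapping (an assumption; the paper does not
# list which emotions were treated as positive or negative).
VALENCE: Dict[str, str] = {
    "happy": "positive",
    "neutral": "positive",
    "angry": "negative",
    "disgust": "negative",
    "fearful": "negative",
    "sad": "negative",
}


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have equal length")
    if not y_true:
        return 0.0
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def to_valence(labels: List[str]) -> List[str]:
    """Map each fine-grained emotion label to its coarse valence group."""
    return [VALENCE[label] for label in labels]


# Toy sequences for illustration only (not results from the paper).
y_true = ["sad", "fearful", "happy", "angry", "disgust", "neutral"]
y_pred = ["fearful", "fearful", "happy", "sad", "sad", "angry"]

fine_acc = accuracy(y_true, y_pred)
coarse_acc = accuracy(to_valence(y_true), to_valence(y_pred))
print(f"fine-grained accuracy: {fine_acc:.2f}")
print(f"binary valence accuracy: {coarse_acc:.2f}")
```

Confusions between two emotions in the same valence group (for example sad predicted as fearful) stop counting as errors after the mapping, so for a fixed set of predictions the coarse score can never fall below the fine-grained one.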
These insights emphasize the importance of aligning the choice of SER model configuration with the specific requirements of the target application. Whether the aim is to capture nuanced emotional shifts in therapeutic settings or simply to detect positive or negative mood in customer service interactions, the model must be calibrated to balance performance with emotional resolution. Future research should also explore multimodal approaches, incorporating visual cues or physiological signals, to further enhance the reliability of emotion detection systems.

6 Conclusion

This study offers an in-depth analysis of Speech Emotion Recognition (SER), assessing the effectiveness of various machine learning models across gender-based, gender-neutral, and binary classification schemes. Our findings reveal that while fine-grained emotional classification can provide deeper insights, it often compromises accuracy due to the subtle acoustic overlaps between emotions. Binary classification, on the other hand, proved significantly more reliable, highlighting the trade-off between complexity and precision. Interestingly, separating data by gender did not yield the expected performance gains, pointing instead to the value of diverse and inclusive datasets. The results emphasize that the ideal SER model should be context-driven, tailored either to applications that require rapid, broad emotion detection or to those that need more nuanced emotional interpretation, such as therapeutic or clinical settings. Looking ahead, future research should address challenges like class imbalance, limited dataset diversity, and overlapping emotions, while exploring the potential of multimodal integration to enhance emotion detection. Ethical considerations such as transparency, fairness, and data privacy must also remain at the forefront.
Ultimately, our work contributes meaningful insights into the design and deployment of SER systems, paving the way for more emotionally intelligent and human-aware technologies.

Declarations

Funding Declaration
This research received no external funding.

Author Contribution
Atul Mishra: Conceptualization, Methodology, results and validation, review
Sarthak Jindal: Conceptualization, Methodology, results and validation

References

[1] Bora, S. S., & Rathore, S. S. (2018). A Review of Sound Emotion Recognition Techniques. Journal of Information Technology and Computer Science, 3(2), 45–56.
[2] El Ayadi, M., Kamel, M. S., & Karray, F. (2011). Survey on Speech Emotion Recognition: Features, Classification Schemes, and Databases. Pattern Recognition, 44(3), 572–587.
[3] Han, K. (2014). Speech Emotion Recognition Using Deep Neural Network and Extreme Learning Machine, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), pp. 5498–5502.
[4] Zhang, Y., et al. (2019). An Efficient Approach for Speech Emotion Recognition Based on Ensemble Extreme Learning Machine. IEEE Access, 7, 6789–6798.
[5] Abdelwahab, O., et al. (2020). Emotion Recognition from Speech Signals Using New Deep Learning Techniques. IEEE Trans Affective Comput, 11(4), 694–705.
[6] Wen, T., et al. (2021). Speech Emotion Recognition Using Deep Learning with Sequence Modeling and Data Augmentation. IEEE Signal Processing Letters, 28, 1245–1249.
[7] Hassan, M., et al. (2020). Emotion Recognition in Speech Using a Hybrid Model of BiLSTM and 1-D CNN. IEEE Access, 8, 105678–105689.
[8] Pandey, A. (2020). Speech Emotion Recognition Using GRU-RNN and Convolutional Neural Networks, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), pp. 1234–1238.
[9] Khan, S., et al. (2021). Speech Emotion Recognition using Convolutional Recurrent Neural Network with Transfer Learning. IEEE J Sel Top Signal Process, 15(3), 453–461.
[10] Singh, R., et al. (2021). Multimodal Fusion of Audio and Video Using CNN and Bi-GRU for Speech Emotion Recognition. IEEE Transactions on Multimedia, 23, 789–800.
[11] Li, X., et al. (2022). Speech Emotion Recognition Based on XGBoost and Multi-Features Fusion. IEEE Access, 10, 2345–2355.
[12] Bhattacharyya, S., et al. (2022). Speech Emotion Recognition Using Deep Learning Techniques: A Comparative Study. IEEE Trans Neural Netw Learn Syst, 33(6), 2312–2323.
[13] Kumar, P., et al. (2022). Multimodal Speech Emotion Recognition using Attention-based Bi-GRU and Textual Features. IEEE Trans Affective Comput, 13(1), 178–188.
[14] Sahu, S., et al. (2021). Speech Emotion Recognition using Deep Learning and Transfer Learning Techniques: A Comprehensive Study. IEEE Access, 9, 112345–112355.
[15] Livingstone, S. R., & Russo, F. A. (2018). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Neurocomputing, 312, 172–179.
[16] Peeters, D., & Quatieri, T. F. (2003). The Toronto Emotional Speech Set (TESS). URL. [Online].
[17] University of Surrey. The Surrey Audio-Visual Expressed Emotion (SAVEE) Dataset, 2008. [Online]. Available: URL.
[18] Zhang, Z., Li, L., & Li, M. (2023). Transformer-Based Models for Speech Emotion Recognition. IEEE Trans Affective Comput, 14(2), 345–357.
[19] Patel, R., Kumar, S., & Singh, A. (2023). Robust End-to-End Speech Emotion Recognition in Noisy Environments Using Deep Learning. IEEE Signal Processing Magazine, 40(1), 85–98.
[20] Chen, Y., Wang, H., & Zhang, J. (2022). Hybrid CNN-RNN Architectures for Robust Speech Emotion Recognition. IEEE Trans Neural Netw Learn Syst, 33(10), 4150–4161.
[21] Wadhwa, M., Gupta, A., & Pandey, P. K. (2020). Speech emotion recognition (SER) through machine learning. Analytics Insight.
[22] Sai, K. A. (2023). Deep Learning for the Recognition of Human Speech.

Tables

Tables 1 to 7 are available in the Supplementary Files section.

Additional Declarations

No competing interests reported.
Supplementary Files Tables.docx Cite Share Download PDF Status: Posted Version 1 posted You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-7474053","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":506603664,"identity":"bb88f0ea-c1de-41b4-8aea-b70a7e16a490","order_by":0,"name":"Atul Mishra","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA7klEQVRIiWNgGAWjYHACxgMQmofxAYjkI6iBjYEBpoXZAESykaKFTQIqgB/wz+8xOPil5o5dP//ZY5Vfc+xk2BiYHz66gUeLxDEeg8Myx54lz5yRl3Zbdlsy0GFsxsY5+KwBaZFgO5xscIPH7LbkNmagFh42aXxa5MFa/h1Otj9/xqxYcls9YS0GQC0HP7YdtjNgyDFj/LjtMGEthsfSCg4z9h1OkLiRYyzNuO04DxszAb/IHT688eGPb4ft+fvPGH78ua3anp+9+eFjvN4HAmYeBobEBigDSBJQDgKMPxgY7GGMUTAKRsEoGAUYAAAbnUmkU7EM9gAAAABJRU5ErkJggg==","orcid":"","institution":"BML Munjal 
University","correspondingAuthor":true,"prefix":"","firstName":"Atul","middleName":"","lastName":"Mishra","suffix":""},{"id":506603665,"identity":"54c9f2d5-01fc-4697-bf89-07e544e04f62","order_by":1,"name":"Sarthak Jindal","email":"","orcid":"","institution":"BML Munjal University","correspondingAuthor":false,"prefix":"","firstName":"Sarthak","middleName":"","lastName":"Jindal","suffix":""}],"badges":[],"createdAt":"2025-08-27 18:08:22","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-7474053/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7474053/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":90086102,"identity":"7bc200d4-dbc8-4587-95d7-10c5d7e26ec4","added_by":"auto","created_at":"2025-08-28 10:07:20","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":30400,"visible":true,"origin":"","legend":"\u003cp\u003ewaveform of the male audio (amplitude vs time)\u003c/p\u003e","description":"","filename":"image2.png","url":"https://assets-eu.researchsquare.com/files/rs-7474053/v1/0055f8e9e0990a0902f7ea5a.png"},{"id":90085819,"identity":"aefc7b93-816c-4a9a-9f10-8f4e0628bf10","added_by":"auto","created_at":"2025-08-28 09:59:20","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":31074,"visible":true,"origin":"","legend":"\u003cp\u003espectrogram of the male audio (frequency vs seconds)\u003c/p\u003e","description":"","filename":"image4.png","url":"https://assets-eu.researchsquare.com/files/rs-7474053/v1/7d3bb3f64beac8aef2d0717d.png"},{"id":90086101,"identity":"42666ffa-0555-471a-9ea5-19ebbd8116fd","added_by":"auto","created_at":"2025-08-28 10:07:20","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":39052,"visible":true,"origin":"","legend":"\u003cp\u003ewaveform of the trimmed male audio (Amplitude vs 
time(s))\u003c/p\u003e","description":"","filename":"image11.png","url":"https://assets-eu.researchsquare.com/files/rs-7474053/v1/18431a1f835c8d458a26328c.png"},{"id":90085825,"identity":"2d39ef75-6b0b-4e4b-998b-09175597451b","added_by":"auto","created_at":"2025-08-28 09:59:20","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":35819,"visible":true,"origin":"","legend":"\u003cp\u003espectrogram of the male trimmed audio (frequency vs seconds)\u003c/p\u003e","description":"","filename":"image3.png","url":"https://assets-eu.researchsquare.com/files/rs-7474053/v1/2de2dfaa104c5408cbc93696.png"},{"id":90085829,"identity":"b89cff5b-a027-449a-97d5-18c8531dd21e","added_by":"auto","created_at":"2025-08-28 09:59:20","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":112507,"visible":true,"origin":"","legend":"\u003cp\u003ewaveform of the trimmed and filtered audio of the male (amplitude vs time)\u003c/p\u003e","description":"","filename":"image12.png","url":"https://assets-eu.researchsquare.com/files/rs-7474053/v1/31b9020c4bd42e82299a5859.png"},{"id":90086103,"identity":"27f41101-9dc1-4cc1-bae5-b5a0042b3b41","added_by":"auto","created_at":"2025-08-28 10:07:20","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":93093,"visible":true,"origin":"","legend":"\u003cp\u003espectrogram of the trimmed and filtered audio of the male (frequency vs seconds)\u003c/p\u003e","description":"","filename":"image7.png","url":"https://assets-eu.researchsquare.com/files/rs-7474053/v1/c0efa6e76b2004527eb60f68.png"},{"id":90085828,"identity":"102b5193-8c24-43c0-a90d-dae3ca7f1498","added_by":"auto","created_at":"2025-08-28 09:59:20","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":29332,"visible":true,"origin":"","legend":"\u003cp\u003eMFCC of the male audio (MFCC vs 
Time)\u003c/p\u003e","description":"","filename":"image13.png","url":"https://assets-eu.researchsquare.com/files/rs-7474053/v1/6999bded6fe5453944210a53.png"},{"id":90086104,"identity":"eb374e42-5b19-4883-a7c6-fb2bf8d7da3b","added_by":"auto","created_at":"2025-08-28 10:07:20","extension":"png","order_by":8,"title":"Figure 8","display":"","copyAsset":false,"role":"figure","size":29497,"visible":true,"origin":"","legend":"\u003cp\u003eMFCC of the female audio (MFCC vs Time)\u003c/p\u003e","description":"","filename":"image10.png","url":"https://assets-eu.researchsquare.com/files/rs-7474053/v1/28d4991829cb6b79ffe339a1.png"},{"id":90085834,"identity":"a7a15655-c848-4c39-a72d-e98684592447","added_by":"auto","created_at":"2025-08-28 09:59:20","extension":"png","order_by":9,"title":"Figure 9","display":"","copyAsset":false,"role":"figure","size":36883,"visible":true,"origin":"","legend":"\u003cp\u003eMel spectrogram of the male audio (Frequency vs Time)\u003c/p\u003e","description":"","filename":"image14.png","url":"https://assets-eu.researchsquare.com/files/rs-7474053/v1/9e463a954e65d660586a73c1.png"},{"id":90085839,"identity":"7dc9d2af-8861-4899-b6da-465e30c6e969","added_by":"auto","created_at":"2025-08-28 09:59:21","extension":"png","order_by":10,"title":"Figure 10","display":"","copyAsset":false,"role":"figure","size":39450,"visible":true,"origin":"","legend":"\u003cp\u003eMel spectrogram of the female audio (Frequency vs Time)\u003c/p\u003e","description":"","filename":"image8.png","url":"https://assets-eu.researchsquare.com/files/rs-7474053/v1/ad0b212c96a8c7ce5cc060a2.png"},{"id":90086106,"identity":"c0f55747-275f-4bdd-8138-028ea84f2044","added_by":"auto","created_at":"2025-08-28 10:07:20","extension":"png","order_by":11,"title":"Figure 11","display":"","copyAsset":false,"role":"figure","size":77811,"visible":true,"origin":"","legend":"\u003cp\u003eSchematic of solution pipeline 
[20]\u003c/p\u003e","description":"","filename":"image6.png","url":"https://assets-eu.researchsquare.com/files/rs-7474053/v1/d197292f6de6ade6ed34ad3f.png"},{"id":90086108,"identity":"1a71ac99-7890-4efa-b3a3-c1fc4188b891","added_by":"auto","created_at":"2025-08-28 10:07:20","extension":"png","order_by":12,"title":"Figure 12","display":"","copyAsset":false,"role":"figure","size":217826,"visible":true,"origin":"","legend":"\u003cp\u003eFinal Prediction Pipeline of the CNN\u003c/p\u003e","description":"","filename":"image5.png","url":"https://assets-eu.researchsquare.com/files/rs-7474053/v1/db5c8651f8972063effa5a43.png"},{"id":90086109,"identity":"13613b97-8b4e-496d-8a82-95a826f1b4b5","added_by":"auto","created_at":"2025-08-28 10:07:21","extension":"png","order_by":13,"title":"Figure 13","display":"","copyAsset":false,"role":"figure","size":123042,"visible":true,"origin":"","legend":"\u003cp\u003eworking procedure of diverse types of models (CNN, SVM, LSTM) for SER [22]\u003c/p\u003e","description":"","filename":"image1.png","url":"https://assets-eu.researchsquare.com/files/rs-7474053/v1/87af0211da13a0e2561d8075.png"},{"id":90086699,"identity":"3f5ce6e8-32c0-40e8-8201-8f99ddafc7c9","added_by":"auto","created_at":"2025-08-28 10:15:21","extension":"png","order_by":14,"title":"Figure 14","display":"","copyAsset":false,"role":"figure","size":224473,"visible":true,"origin":"","legend":"\u003cp\u003eThe architecture of the HMM model for the SER\u003c/p\u003e","description":"","filename":"image16.png","url":"https://assets-eu.researchsquare.com/files/rs-7474053/v1/63e90218ed0133c0f46b5c8a.png"},{"id":90085868,"identity":"2579a0a4-ce71-49bc-bb7c-6508fd4ddc5d","added_by":"auto","created_at":"2025-08-28 09:59:22","extension":"png","order_by":15,"title":"Figure 15","display":"","copyAsset":false,"role":"figure","size":79270,"visible":true,"origin":"","legend":"\u003cp\u003eVisual representation of 2D CNN with different 
layers.\u003c/p\u003e","description":"","filename":"image19.png","url":"https://assets-eu.researchsquare.com/files/rs-7474053/v1/8f22b763db92f2de14d159ad.png"},{"id":90085840,"identity":"ed82dd9c-2f90-456a-96a5-6446f6a4b3a9","added_by":"auto","created_at":"2025-08-28 09:59:21","extension":"png","order_by":16,"title":"Figure 16","display":"","copyAsset":false,"role":"figure","size":46874,"visible":true,"origin":"","legend":"\u003cp\u003eBar chart of diverse types of emotion (Energy vs Emotion)\u003c/p\u003e","description":"","filename":"image17.png","url":"https://assets-eu.researchsquare.com/files/rs-7474053/v1/9987981ff2e9663b3928ca45.png"},{"id":90086118,"identity":"04419825-dcd7-46e1-967b-1869582753e6","added_by":"auto","created_at":"2025-08-28 10:07:21","extension":"png","order_by":17,"title":"Figure 17","display":"","copyAsset":false,"role":"figure","size":109599,"visible":true,"origin":"","legend":"\u003cp\u003econfusion matrix of the 1D CNN _LSTM (True label Vs Predicted Label)\u003c/p\u003e","description":"","filename":"image9.png","url":"https://assets-eu.researchsquare.com/files/rs-7474053/v1/f0e2a0761fc06dd9a41b0217.png"},{"id":90086113,"identity":"7f2bd6a0-0727-4a37-bd1c-69378ffdd714","added_by":"auto","created_at":"2025-08-28 10:07:21","extension":"png","order_by":18,"title":"Figure 18","display":"","copyAsset":false,"role":"figure","size":106670,"visible":true,"origin":"","legend":"\u003cp\u003econfusion matrix of the 2D CNN _4L (True Label Vs Predicted Label)\u003c/p\u003e","description":"","filename":"image18.png","url":"https://assets-eu.researchsquare.com/files/rs-7474053/v1/10a9a1d9878fb42492ee627a.png"},{"id":90086700,"identity":"fa7c5256-ea17-4867-9219-2398d7857baf","added_by":"auto","created_at":"2025-08-28 10:15:21","extension":"png","order_by":19,"title":"Figure 19","display":"","copyAsset":false,"role":"figure","size":61923,"visible":true,"origin":"","legend":"\u003cp\u003econfusion matrix for the augmentation part (True Label Vs Predicted 
Label)\u003c/p\u003e","description":"","filename":"image15.png","url":"https://assets-eu.researchsquare.com/files/rs-7474053/v1/5e99d2b9411dadcdc8ba187d.png"},{"id":91241536,"identity":"4b552978-eb44-47ac-af2f-a9109bebdfa1","added_by":"auto","created_at":"2025-09-13 12:01:40","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":2315557,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7474053/v1/069e18dc-760b-41c3-954c-0c7fa25e1774.pdf"},{"id":90086100,"identity":"2303b296-fbef-4d70-8797-dfe161783b71","added_by":"auto","created_at":"2025-08-28 10:07:20","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":29251,"visible":true,"origin":"","legend":"","description":"","filename":"Tables.docx","url":"https://assets-eu.researchsquare.com/files/rs-7474053/v1/2ba0cc3361c5c38e505320c7.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"Feature Significance in Speech Emotion Recognition","fulltext":[{"header":"1 Introduction","content":"\u003cp\u003eEmotion recognition from speech is a complex and evolving area of research, involving the analysis and interpretation of subtle variations in vocal characteristics to infer emotional states [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. This paper explores how specific audio features\u0026mdash;particularly Log-Mel Spectrograms and Mel-Frequency Cepstral Coefficients (MFCCs)\u0026mdash;influence the performance of machine learning models in accurately classifying emotions.\u003c/p\u003e\u003cp\u003eRecognizing and interpreting emotions is a fundamental aspect of human interaction. In recent years, there has been a growing interest in developing systems capable of detecting emotions from speech, music, and other audio inputs. 
These systems apply machine learning algorithms to acoustic signals in order to identify the emotions being conveyed. Their potential applications span a wide range of domains, including virtual assistants, speech-based interfaces, education, mental health monitoring, and entertainment. At their core, these technologies strive to replicate human sensitivity to emotional cues in voice and sound, thus paving the way for more emotionally intelligent machines.\u003c/p\u003e\u003cp\u003eA central challenge in this domain is achieving precise emotion detection from audio streams. This is typically accomplished by extracting and analysing key features\u0026mdash;such as pitch, tone, rhythm, and intensity\u0026mdash;and training classification models capable of interpreting the emotional content. Various approaches have emerged: rule-based systems that rely on predefined heuristics, statistical models that classify emotions using probabilistic logic, and deep learning methods that automatically learn high-level features through neural networks [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eAs these systems evolve, their significance becomes even more apparent. Emotionally aware technologies have the potential to enhance user experience across multiple domains. In virtual interactions, they can respond more personally and empathetically. In gaming, they offer more immersive and responsive gameplay [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]. In healthcare, these systems may support the early detection and monitoring of emotional disorders such as depression and anxiety, enabling timely intervention and more effective treatment outcomes.\u003c/p\u003e\u003cp\u003eThis study is driven by several key research directions. 
One line of inquiry focuses on developing sound emotion recognition systems by testing various algorithms, feature extraction methods, and training datasets in order to maximize classification accuracy [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e], [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]. Another important direction examines how contextual variables\u0026mdash;such as the user's emotional state, the nature of the sound, and the surrounding environment\u0026mdash;affect recognition accuracy [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. Furthermore, exploring cultural variations in emotional expression is crucial, as emotions are often conveyed and interpreted differently across societies [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eThe rationale for this research lies in the potential of emotion-aware systems to redefine human-computer interaction. These systems can enhance communication, support personalized learning, and even offer mental health insights. By accurately identifying users' emotional states, they open the door to more intuitive and effective technological experiences. Particularly in mental health, emotion recognition systems can contribute to the diagnosis and management of conditions like anxiety and depression through passive yet intelligent monitoring. 
Based on these foundations, two key hypotheses are proposed: first, that deep learning-based sound emotion recognition systems offer superior accuracy compared to traditional statistical models [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e], [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]; and second, that the accuracy of these systems is significantly influenced by contextual factors surrounding the audio input.\u003c/p\u003e\u003cp\u003eThe remainder of this paper is organized as follows. The next section presents a review of existing literature in the field of sound emotion recognition. This is followed by a detailed explanation of the methodology, including the dataset, feature extraction techniques, and model implementation. The subsequent sections discuss the experimental results and performance evaluations. A discussion of the findings and their implications is then provided. Finally, the paper concludes with a summary of key insights and suggestions for future research.\u003c/p\u003e"},{"header":"2 Literature Review","content":"\u003cp\u003eOver the past decade, sound emotion recognition has emerged as a vibrant interdisciplinary field, bridging the gap between audio signal processing, psychology, machine learning, and human-computer interaction. Researchers have extensively explored how subtle vocal cues\u0026mdash;such as pitch, tone, energy, and spectral dynamics\u0026mdash;can reveal emotional states, with machine learning techniques steadily enhancing the reliability and accuracy of these insights. This review synthesizes notable contributions to the field, identifying prevailing methodologies, datasets, and recent advances that continue to shape this evolving area.\u003c/p\u003e\n\u003cp\u003eA fundamental aspect of speech emotion recognition (SER) lies in the extraction of relevant features from audio signals. 
Traditionally, features such as Mel-Frequency Cepstral Coefficients (MFCCs), pitch, energy, duration, spectral flux, and spectral centroid have been widely adopted. Studies have shown that combining these features typically yields superior performance compared to using them in isolation. In recent years, the use of deep learning architecture like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) has gained popularity due to their ability to automatically learn intricate emotional patterns embedded in audio data.\u003c/p\u003e\n\u003cp\u003eVarious classification algorithms have been applied to speech emotion tasks. Conventional models like Support Vector Machines (SVM), Random Forests, and Naive Bayes have provided solid baselines. However, deep learning-based models have demonstrated substantial improvements in performance. El Ayadi et al. [\u003cspan class=\"CitationRef\"\u003e2\u003c/span\u003e] offered a comprehensive survey of classification models, while Han et al. [\u003cspan class=\"CitationRef\"\u003e3\u003c/span\u003e] and Zhang et al. [\u003cspan class=\"CitationRef\"\u003e4\u003c/span\u003e] demonstrated the effectiveness of deep neural networks and ensemble-based methods in capturing emotional nuances more accurately.\u003c/p\u003e\n\u003cp\u003eDatasets have played a pivotal role in enabling consistent benchmarking of SER systems. Prominent among them are RAVDESS, Emo-DB, TESS, and SAVEE, each offering a variety of emotional expressions across gender and speech types [\u003cspan class=\"CitationRef\"\u003e15\u003c/span\u003e][\u003cspan class=\"CitationRef\"\u003e16\u003c/span\u003e][\u003cspan class=\"CitationRef\"\u003e17\u003c/span\u003e]. These databases have allowed for meaningful comparisons across models and techniques, facilitating standardized evaluation.\u003c/p\u003e\n\u003cp\u003eTo assess model performance, common metrics include accuracy, precision, recall, and F1-score. 
Advanced studies have also introduced measures such as the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) to better evaluate model robustness, especially under class-imbalanced scenarios.\u003c/p\u003e\n\u003cp\u003eAs shown in Table \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e, several studies have utilized traditional techniques for speech emotion recognition achieving varying levels of accuracy.\u003c/p\u003e\n\u003cp\u003eTaken together, these studies illustrate the field\u0026rsquo;s rapid progression from traditional classifiers to sophisticated deep learning and ensemble techniques. As feature extraction and model design continue to evolve, so too will the potential for highly accurate and context-aware emotion recognition systems, bringing us closer to truly empathetic human-computer interaction.\u003c/p\u003e"},{"header":"3 Methodology","content":"\u003cp\u003eThis section outlines the methodological framework employed for developing and evaluating the proposed speech emotion recognition system, detailing the data preprocessing steps, feature extraction techniques, model architecture, training process, and evaluation metrics used throughout the study.\u003c/p\u003e\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\u003ch2\u003e3.1 \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eData collection\u003c/span\u003e\u003c/h2\u003e\u003cp\u003eDesigning an effective Speech Emotion Recognition (SER) system typically involves three key components: selecting an appropriate emotional speech database, extracting meaningful features from audio signals, and choosing suitable classification algorithms. For this study, the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) has been selected due to its comprehensive and balanced structure. 
This multimodal, gender-balanced dataset features recordings from 24 professional actors who perform 104 distinct vocalizations that span a wide range of emotions, including happiness, sadness, anger, fear, surprise, disgust, calmness, and neutrality [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]. Each actor delivers two standardized sentences, \"Kids are talking by the door\" and \"Dogs are sitting by the door\", for every emotion. Except for the neutral emotion, each expression is recorded at both normal and heightened emotional intensities, with each line spoken twice. The dataset includes 1,440 spoken utterances and 1,012 sung utterances, offering a diverse and rich emotional context. Unlike many other datasets, such as the Toronto Emotional Speech Set (TESS) and the Surrey Audio-Visual Expressed Emotion (SAVEE), which include only audio recordings, RAVDESS provides a multimodal setup and maintains a well-balanced distribution across emotion classes, mitigating common issues like class imbalance.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec5\" class=\"Section2\"\u003e\u003ch2\u003e3.2 \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eSelection and selection restrictions\u003c/span\u003e\u003c/h2\u003e\u003cp\u003eWhile the RAVDESS dataset is widely regarded for its balanced emotional representation and professional-quality recordings, it is not without limitations. One of the primary concerns is its lack of linguistic and cultural diversity. Factors such as language, accent, dialect, and cultural background play a critical role in the way emotions are expressed and perceived in speech. Consequently, a model trained exclusively on English utterances, such as those in RAVDESS, may struggle to accurately recognize emotions in non-English languages like Mandarin or Hindi. 
Furthermore, since the dataset comprises performances from 24 English-speaking actors based in Toronto, Canada, it reflects a strong North American cultural bias. This limitation restricts the model\u0026rsquo;s generalizability across diverse global populations and real-world applications where linguistic and cultural variation is the norm.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e\u003ch2\u003e3.3 \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eData preparation\u003c/span\u003e\u003c/h2\u003e\u003cp\u003eEach audio file in the RAVDESS dataset follows a structured naming convention composed of seven numerical components: modality, vocal channel, emotion, emotional intensity, statement, repetition, and actor. This format allows for organized parsing and efficient categorization of the data. The dataset encodes gender in the actor ID: odd-numbered actors represent male voices, while even-numbered actors represent female voices. This structured metadata is essential for preprocessing, enabling systematic filtering and labeling of audio samples according to emotion type, speaker identity, and other relevant attributes. Such organization simplifies downstream tasks like feature extraction and model training.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec7\" class=\"Section2\"\u003e\u003ch2\u003e3.4 \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eData Cleaning\u003c/span\u003e\u003c/h2\u003e\u003cp\u003eEach recording is about three seconds long. Silence at the beginning and end of each recording was trimmed. The audio was well captured, and only a few noise patterns were present in the data. 
We tested several signal-processing techniques, such as filtering and voice-activity detection (VAD), to eliminate the noise.\u003c/p\u003e\u003cp\u003eSpectral subtraction, a well-known noise reduction method, works by subtracting an estimate of the noise spectrum from the noisy speech spectrum to remove background (additive) noise. The result of this process is visualized in Figs.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e through \u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e, which show the waveforms and spectrograms before and after trimming and filtering for male audio samples.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\u003ch2\u003e3.5 \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eFeature Selection and Modelling\u003c/span\u003e\u003c/h2\u003e\u003cp\u003eAudio features fall into two basic categories: time-domain features and frequency-domain features. Examples of time-domain features include short-term signal energy, zero-crossing rate, maximum amplitude, minimum energy, and energy entropy. These features are easy to extract, which simplifies audio signal analysis. Frequency-domain features expose more complex patterns in the audio signal, which may reveal the emotion concealed in the signal. Examples of frequency-domain features include spectrograms, Mel-frequency cepstral coefficients (MFCCs), spectral centroid, spectral slope, spectral entropy, and chroma coefficients. 
During exploratory data analysis, each feature was examined thoroughly.\u003c/p\u003e\u003cp\u003eHowever, for the purposes of this report, we concentrate on two key features: Mel spectrograms and Mel-frequency cepstral coefficients (MFCCs).\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\u003ch2\u003e3.6 \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eMel-Frequency Cepstrum (MFCC)\u003c/span\u003e\u003c/h2\u003e\u003cp\u003eThe Mel-frequency cepstrum approximates the short-term power spectrum of a sound through a sequence of processing stages that mimic the frequency response of the human cochlea.\u003c/p\u003e\u003cp\u003eSeveral audio signal processing steps are used to extract MFCCs, including:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003ePre-emphasis: applying a high-pass filter to boost the higher-frequency components of the signal and reduce the impact of noise.\u003c/span\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eFrame segmentation: breaking the signal into short frames of 20 to 30 milliseconds with minimal overlap.\u003c/span\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eWindowing: to reduce spectral leakage and improve spectral resolution, each frame is multiplied by a window function such as a Hamming or Hann window.\u003c/span\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eLog scaling: the energy in each Mel frequency band is logarithmically scaled to produce a set of Mel-scale spectral coefficients.\u003c/span\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cspan type=\"SmallCaps\" 
class=\"SmallCaps\" name=\"Emphasis\"\u003eDiscrete Cosine Transform: the log-scaled Mel coefficients are passed through the discrete cosine transform (DCT) to obtain the final MFCCs.\u003c/span\u003e\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eThe MFCCs for male and female audio samples are visualized in\u003c/span\u003e Figs.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e and \u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003e, \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003erespectively.\u003c/span\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e\u003ch2\u003e3.7 \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eMel-Spectrograms\u003c/span\u003e\u003c/h2\u003e\u003cp\u003eThe Mel spectrogram is another popular feature extraction technique in speech emotion recognition applications. A Mel spectrogram, a particular type of spectrogram, is produced using a Mel-scale filter bank, which is a collection of triangular filters spaced according to the Mel scale. 
In the context of sound emotion identification, Mel spectrograms can capture significant elements of the audio signal that are related to the emotional content, such as the pitch, timbre, and intensity of the voice.\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eUsing the Fast Fourier Transform (FFT), each frame's frequency spectrum is calculated.\u003c/span\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eMel filter bank: The frequency spectrum is passed through a series of triangular bandpass filters to yield a set of logarithmically spaced Mel-frequency bands.\u003c/span\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eThe logarithm of the energy in each band of the Mel spectrum is used to create a set of Mel-scale spectral coefficients.\u003c/span\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eThe power spectrum of the Mel-scale coefficients for each frame is calculated to produce a Mel spectrogram.\u003c/span\u003e\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eThe Mel spectrograms for male and female audio samples are shown in\u003c/span\u003e Figs.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003e and \u003cspan refid=\"Fig10\" class=\"InternalRef\"\u003e10\u003c/span\u003e, \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003erespectively.\u003c/span\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eThe overall solution pipeline for our SER system is illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig11\" 
class=\"InternalRef\"\u003e11\u003c/span\u003e.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\u003ch2\u003e3.8 \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eConvolutional Neural Networks\u003c/span\u003e\u003c/h2\u003e\u003cp\u003eConvolutional neural networks (CNNs) are responsible for the enormous improvements in image recognition tasks in recent years. CNNs excel at automatically extracting salient features from multi-dimensional images. To exploit the two-dimensional structure of image data, CNNs use shared kernels (weights). Max pooling is added to introduce invariance, so that only the most relevant features are passed on to tasks such as classification and segmentation. Perhaps surprisingly, CNNs are also effective for audio recognition tasks. This is because applying the log-Mel filter bank to the FFT representation of the raw audio signal introduces \"linearity\" along the logarithmic frequency axis, allowing a convolution operation to be applied along the frequency axis. The prediction pipeline for our CNN model is depicted in Fig.\u0026nbsp;\u003cspan refid=\"Fig12\" class=\"InternalRef\"\u003e12\u003c/span\u003e.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eWithout this linearity along the frequency axis, we would need different kernels or filters for different frequency ranges. This property lets the model learn basic patterns over short time lags efficiently and, combined with the CNN's strong representational power, leads to strong performance in speech-based emotion recognition systems. On video understanding tasks, 3D CNNs typically perform well. 
In video data, there is no temporal correlation within a frame, only between frames.\u003c/p\u003e\u003cp\u003eIn audio converted to a 3D representation, however, there is a temporal relationship along two axes: a short time lag along one and a longer time lag along the other. We were therefore interested in whether modelling both the short- and long-term dependencies in the data would improve our SER system, so our study also included a 3D CNN model. Each frame in the 3D audio representation was set to a constant length of 250 ms, which proved to be the shortest effective duration for conveying emotion. To discover patterns in sound waveforms, multilayer 1D CNNs were also trained on raw audio.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\u003ch2\u003e3.9 \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eLSTM (Long Short-Term Memory) Model\u003c/span\u003e\u003c/h2\u003e\u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eThe working procedures of different models, including CNN, SVM, and LSTM, for SER are shown in\u003c/span\u003e Fig.\u0026nbsp;\u003cspan refid=\"Fig13\" class=\"InternalRef\"\u003e13\u003c/span\u003e. \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eSpeech emotion recognition experiments have shown that recurrent neural network (RNN) models, in particular Long Short-Term Memory (LSTM) models, perform well. The overall process by which an LSTM model recognizes emotions from audio is as follows\u003c/span\u003e:\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003ePre-processing of data: Relevant features such as MFCCs or Mel spectrograms, which capture the spectral and temporal characteristics of the audio signal, are extracted from the audio data during pre-processing. 
The data is then split into training, validation, and testing sets.\u003c/span\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eModel architecture for LSTM: The LSTM model architecture is designed to generate a probability distribution over the emotional classes. Usually, it starts with one or more LSTM layers, followed by one or more fully connected layers.\u003c/span\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eModel training: The LSTM model is trained on the training set using an optimizer such as stochastic gradient descent (SGD) or Adam to minimize a loss function, such as categorical cross-entropy. The model is trained over multiple epochs, and early stopping on the validation set is used to prevent overfitting.\u003c/span\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eModel evaluation: The trained LSTM model is evaluated on the testing set using metrics such as accuracy, precision, recall, and F1-score. Confusion matrices can also be used to assess the model's classification performance.\u003c/span\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eApplication of the model: To enable real-time speech emotion recognition, the trained LSTM model may be used to classify new audio samples into one of the emotional categories. 
Metrics like classification accuracy and response time can be used to assess the effectiveness of the deployed model.\u003c/span\u003e\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\u003ch2\u003e3.10 \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eHidden Markov Models\u003c/span\u003e\u003c/h2\u003e\u003cp\u003eBefore RNNs became popular, hidden Markov models (HMMs) were commonly used for speech tasks such as emotion recognition. HMMs assume that observations are generated by a set of hidden states whose transitions obey the Markov property. HMMs are generative models. A significant disadvantage of HMMs is that they model non-linear data poorly. We employed two types of HMMs with different emission distributions: Gaussian HMMs and Gaussian Mixture Model HMMs (GMM-HMMs). In these HMMs, the hidden states correspond to phones, the smallest discrete units of sound, and an utterance is classified based on the likelihood that its sequence of phones belongs to a particular class.\u003c/p\u003e\u003cp\u003eThe architecture of the HMM model used for SER is illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig14\" class=\"InternalRef\"\u003e14\u003c/span\u003e.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e\u003ch2\u003e3.11 \u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eDescription of each model\u003c/span\u003e\u003c/h2\u003e\u003cp\u003eBecause the classes are evenly distributed, accuracy can be used to compare how well different models perform. Unweighted accuracy was therefore chosen as the model selection metric. For models with fixed-size inputs, all input audio was clipped or padded to 3 seconds. Each model was trained for 100 epochs, with a batch size chosen according to the complexity of the architecture. 
The models were initially trained with the stochastic gradient descent (SGD) optimizer. Later experiments used the Adam optimizer with default parameters because SGD converged more slowly. Models were checkpointed by monitoring performance on the validation set. Using MFCC features, which performed best for seven-class prediction, we found that the network was unable to recognize recordings from the \"Surprised\" class. We hypothesized that this was due to a shortage of data for the \"Surprised\" class. The \"Surprised\" class was therefore removed from the dataset, and only the six remaining classes (angry, sad, neutral, disgusted, happy, and fearful) were considered in subsequent analyses. We found that DNN performance was best when the number of MFCCs and deltas was increased. The MFCC and log-Mel spectrogram visualizations show how little information MFCCs retain at higher frequencies. We therefore expected log-Mel spectrogram features to produce better results in this study. When 1D CNNs and 2D CNNs were used for both the 12-class gender-emotion classification and the six-class emotion classification, adding gender was found to give better results [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e], [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eThe 1D CNN and 1D CNN-LSTM architectures were trained on the raw audio input, since CNNs are naturally good feature extractors. Engineered features such as MFCCs and log-Mel spectrograms were used with 2D CNNs. The 2D CNNs initially used two convolutional layers with 3x3 filters and max pooling with 2x2 windows and stride 2. 
To tune this base architecture, the filter sizes were increased and additional convolutional layers were added. Increasing the depth beyond 4 layers did not improve performance. A different CNN architecture was also tested, in which global average pooling produced fixed-length feature vectors instead of flattening the final convolutional feature map. This approach avoids the large number of parameters in the fully connected layers. Another benefit is support for variable-sized inputs, which are common in speech data. With the global average pooling layer, we were able to improve performance by increasing the number of filters in the early convolutional layers. A visual representation of the 2D CNN with different layers is provided in Fig.\u0026nbsp;\u003cspan refid=\"Fig15\" class=\"InternalRef\"\u003e15\u003c/span\u003e. Gaussian HMMs and GMM-HMMs were implemented using MFCC and log-Mel spectrogram features. HMMs using MFCCs reached about 20% accuracy, whereas those using log-Mel spectrograms reached 32% accuracy.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eTo have a deployable model, we wanted to make sure that our model could at least coarsely categorize emotions as positive or negative. To determine whether the speaker is feeling positive (neutral, happy, surprised) or negative (angry, sad, fearful, disgusted), we built a 4-layer 2D CNN model (with global average pooling). Trained on log-Mel spectrogram features, the model performed this binary classification on test and validation data with 88% accuracy. This model is simple to deploy, and because the models work well together, they can both categorize emotions broadly and pinpoint the specific emotion. 
The mean energy by emotion is illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig16\" class=\"InternalRef\"\u003e16\u003c/span\u003e.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e"},{"header":"4 Results","content":"\u003cp\u003eWe conducted an extensive evaluation of various feature extraction techniques and deep learning models for speech-based emotion recognition. The experiments highlighted the significant impact of feature selection on model performance, particularly under data-constrained conditions. Traditional handcrafted features such as \u003cstrong\u003eMFCCs\u003c/strong\u003e and \u003cstrong\u003eLog-Mel spectrograms\u003c/strong\u003e yielded markedly better results compared to raw audio waveforms. Among these, \u003cstrong\u003eLog-Mel spectrograms consistently outperformed MFCCs\u003c/strong\u003e, underscoring their superior representational capacity for capturing emotional cues from speech.\u003c/p\u003e\n\u003cp\u003eA deeper analysis into gender-specific classification revealed an important insight: \u003cstrong\u003emale and female voices exhibit different emotional patterns\u003c/strong\u003e, often driven by variations in pitch, energy, and articulation. This aligns with previous research [\u003cspan class=\"CitationRef\"\u003e7\u003c/span\u003e], [\u003cspan class=\"CitationRef\"\u003e10\u003c/span\u003e] and is further supported by our findings where \u003cstrong\u003eMFCCs enhanced with pitch and energy information\u003c/strong\u003e showed significant improvement. This suggests that essential prosodic features required for emotion prediction are not inherently captured in standard MFCC representations. 
The performance of different models is summarized in Table \u003cspan class=\"InternalRef\"\u003e5\u003c/span\u003e, which compares accuracy, validation accuracy, and the number of classes for various architectures.\u003c/p\u003e\n\u003cp\u003eThe performance comparison across models shows that \u003cstrong\u003e2D CNNs surpass traditional LSTM and hybrid CNN-LSTM models\u003c/strong\u003e. This is attributed to their ability to better capture spatial patterns in the time-frequency representation of audio signals. Specifically, Fig. \u003cspan class=\"InternalRef\"\u003e17\u003c/span\u003e shows the confusion matrix of the \u003cstrong\u003e1D CNN-LSTM model\u003c/strong\u003e, revealing moderate misclassification among similar emotions, particularly in neutral and sad categories. Figure \u003cspan class=\"InternalRef\"\u003e18\u003c/span\u003e, highlighting the \u003cstrong\u003e2D CNN_4L model\u003c/strong\u003e, demonstrates improved precision, particularly in classifying neutral, fearful, and sad emotions across genders. 
Notably, \u003cstrong\u003ethe 2D CNN_4L exhibits fewer misclassifications\u003c/strong\u003e, highlighting its advantage in emotion recognition tasks.\u003c/p\u003e\n\u003cp\u003eIn contrast, a binary classification approach with male-specific data using deeper CNNs (6, 7, and 8 layers) yielded high validation scores and F-scores, as shown in Fig. \u003cspan class=\"InternalRef\"\u003e19\u003c/span\u003e. Here, emotions were broadly categorized as \u003cstrong\u003epositive and negative\u003c/strong\u003e, and the \u003cstrong\u003e1D CNN_8L model\u003c/strong\u003e achieved the highest accuracy of \u003cstrong\u003e96%\u003c/strong\u003e, with a validation accuracy and F-score both peaking at \u003cstrong\u003e96%\u003c/strong\u003e.\u003c/p\u003e\n\u003cp\u003eA comparative summary of model performances is presented in Table \u003cspan class=\"InternalRef\"\u003e6\u003c/span\u003e. Models were evaluated based on accuracy, validation accuracy, F-score, and the number of emotion classes. The \u003cstrong\u003e1D CNN_8L\u003c/strong\u003e not only achieved the highest overall scores in binary classification but also exhibited robustness in generalization.\u003c/p\u003e\n\u003cp\u003eA comparative analysis of emotion recognition models, based on the confusion matrices in Figs. \u003cspan class=\"InternalRef\"\u003e17\u003c/span\u003e, \u003cspan class=\"InternalRef\"\u003e18\u003c/span\u003e, and \u003cspan class=\"InternalRef\"\u003e19\u003c/span\u003e, is provided in Table \u003cspan class=\"InternalRef\"\u003e7\u003c/span\u003e.\u003c/p\u003e\n\u003cp\u003eFigures \u003cspan class=\"InternalRef\"\u003e17\u003c/span\u003e, \u003cspan class=\"InternalRef\"\u003e18\u003c/span\u003e, and \u003cspan class=\"InternalRef\"\u003e19\u003c/span\u003e represent different configurations of emotion recognition through confusion matrices. 
Figure \u003cspan class=\"InternalRef\"\u003e17\u003c/span\u003e illustrates a gender-specific, fine-grained emotion classification model distinguishing between six emotions (angry, disgust, fearful, happy, neutral, and sad) for both male and female speakers. The model shows a moderate performance, with the highest classification accuracy observed for the \u003cem\u003efemale_neutral\u003c/em\u003e category (12 correct predictions), while the \u003cem\u003efemale_disgust\u003c/em\u003e category yielded the lowest accuracy with only 5 correct classifications, indicating potential confusion with adjacent emotional tones. In comparison, Fig. \u003cspan class=\"InternalRef\"\u003e18\u003c/span\u003e demonstrates an improved and more refined version of the same classification task, showing better alignment between predicted and actual labels. The most accurate prediction in this case was again \u003cem\u003efemale_neutral\u003c/em\u003e with 20 correct predictions, whereas the least accurate was \u003cem\u003efemale_angry\u003c/em\u003e with just 6 correct outcomes, highlighting the challenge of detecting certain expressive emotions. Lastly, Fig. \u003cspan class=\"InternalRef\"\u003e19\u003c/span\u003e simplifies the classification by grouping emotions into two binary categories\u0026mdash;\u003cem\u003emale_positive\u003c/em\u003e and \u003cem\u003emale_negative\u003c/em\u003e. This binary setup yielded a high performance overall, especially for the \u003cem\u003emale_negative\u003c/em\u003e category, which had the highest accuracy with 58 correct predictions. However, the \u003cem\u003emale_positive\u003c/em\u003e class showed slightly lower performance with 46 correct predictions, suggesting some overlapping traits in positive emotional tones. 
Overall, this comparison highlights the trade-off between granularity and classification accuracy, with finer emotional distinctions being harder to detect, while binary models tend to deliver more reliable outcomes.\u003c/p\u003e"},{"header":"5 Discussion","content":"\u003cp\u003eThe results obtained from our speech emotion recognition (SER) experiments reveal several critical insights into the model's behaviour under varying classification strategies. A close analysis of the confusion matrices illustrated in Figs.\u0026nbsp;\u003cspan refid=\"Fig17\" class=\"InternalRef\"\u003e17\u003c/span\u003e, \u003cspan refid=\"Fig18\" class=\"InternalRef\"\u003e18\u003c/span\u003e, and \u003cspan refid=\"Fig19\" class=\"InternalRef\"\u003e19\u003c/span\u003e shows both the strengths and the persistent challenges in classifying emotions from speech signals.\u003c/p\u003e\u003cp\u003eIn the first configuration, the SER model was trained and tested on gender-separated emotion classes, yielding a more granular 12-class output. This setup allowed us to evaluate how male and female vocal expressions of the same emotion influenced recognition performance. Interestingly, emotions such as \"female_neutral\" and \"male_angry\" were classified with high accuracy, suggesting that these emotions possess distinct acoustic patterns that are easily separable from others. However, the model struggled significantly with emotions like \"male_disgust\" and \"female_disgust,\" where predictions were scattered across related negative emotions such as fear and sadness. This outcome underscores a common challenge in emotion recognition: the overlapping spectral and prosodic features of certain affective states, especially within the negative emotion spectrum, often lead to confusion. 
The complexity increases when subtle variations in gender-specific prosody are introduced, potentially impacting the consistency of the model\u0026rsquo;s predictions.\u003c/p\u003e\u003cp\u003eThe second experimental setup simplified the framework by removing the gender split while retaining the full range of emotional classes. This adjustment led to a noticeable improvement in classification performance. Emotions such as \"female_neutral\" and \"male_sad\" stood out as the most accurately predicted, reflecting the model\u0026rsquo;s improved ability to generalize across genders. Furthermore, the overall confusion across similar emotional categories was visibly reduced. One plausible explanation for this improvement lies in the model's broader exposure to a mixed pool of vocal characteristics, which may have enhanced its robustness and reduced gender-specific overfitting. Nevertheless, the \"female_angry\" category continued to exhibit higher misclassification rates, pointing to the inherent variability in how different individuals vocally express anger\u0026mdash;especially when influenced by gender-based vocal dynamics.\u003c/p\u003e\u003cp\u003eIn the third and final approach, the classification task was distilled into a binary problem: distinguishing between positive and negative emotions. This re-framing significantly boosted the model\u0026rsquo;s predictive accuracy. The \"male_negative\" class achieved the highest number of correct predictions, followed closely by \"male_positive.\" With fewer output classes and more distinct categorical boundaries, the model found it easier to differentiate between broad emotional valence. The simplicity of the classification scheme played a vital role in this improvement, as it reduced the cognitive load on the model and minimized confusion between subtly different emotions. 
However, this abstraction sacrifices the granularity of emotion recognition, which may not be desirable in domains such as mental health monitoring or affective computing applications where precise emotional insight is necessary.\u003c/p\u003e\u003cp\u003eOverall, the findings highlight the inherent trade-off between emotional detail and classification performance. While coarse-grained categorization enhances accuracy, it limits the depth of emotional understanding. On the other hand, fine-grained, multi-class setups offer richer emotional data but demand more complex model training and face greater challenges in achieving high precision. Additionally, gender-based separation of data seems to introduce variability rather than improve performance, indicating the need for balanced and inclusive training datasets that reflect diverse vocal characteristics.\u003c/p\u003e\u003cp\u003eThese insights emphasize the importance of aligning the choice of SER model configuration with the specific requirements of the target application. Whether the aim is to capture nuanced emotional shifts in therapeutic settings or simply detect positive or negative mood in customer service interactions, the model must be calibrated to balance performance with emotional resolution. Future research should also explore multimodal approaches, incorporating visual cues or physiological signals, to further enhance the reliability of emotion detection systems.\u003c/p\u003e"},{"header":"6 Conclusion","content":"\u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eThis study offers an in-depth analysis of Speech Emotion Recognition (SER), assessing the effectiveness of various machine learning models across gender-based, gender-neutral, and binary classification schemes. Our findings reveal that while fine-grained emotional classification can provide deeper insights, it often compromises accuracy due to the subtle acoustic overlaps between emotions. 
On the other hand, binary classification proved significantly more reliable, highlighting the trade-off between complexity and precision. Interestingly, separating data by gender did not yield the expected performance gains, pointing instead to the value of diverse and inclusive datasets. The results emphasize that the ideal SER model should be context-driven, tailored to applications that require either rapid, broad emotion detection or more nuanced emotional interpretation, such as in therapeutic or clinical settings. Looking ahead, future research should address challenges like class imbalance, limited dataset diversity, and overlapping emotions, while exploring the potential of multimodal integration to enhance emotion detection. Ethical considerations such as transparency, fairness, and data privacy must also remain at the forefront. Ultimately, our work contributes meaningful insights into the design and deployment of SER systems, paving the way for more emotionally intelligent and human-aware technologies.\u003c/span\u003e\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eFunding Declaration\u003c/h2\u003e\u003cp\u003eThis research received no external funding.\u003c/p\u003e\n\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eAtul Mishra - Conceptualization, Methodology, results and validation, review; Sarthak - Conceptualization, Methodology, results and validation\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eBora, S. S., \u0026amp; Rathore, S. S. (2018). A Review of Sound Emotion Recognition Techniques. \u003cem\u003eJournal of Information Technology and Computer Science\u003c/em\u003e, \u003cem\u003e3\u003c/em\u003e(2), 45\u0026ndash;56.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eEl Ayadi, M., Kamel, M. S., \u0026amp; Karray, F. (2011). Survey on Speech Emotion Recognition: Features, Classification Schemes, and Databases. 
\u003cem\u003ePattern Recognition\u003c/em\u003e, \u003cem\u003e44\u003c/em\u003e(3), 572\u0026ndash;587.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHan, K. (2014). Speech Emotion Recognition Using Deep Neural Network and Extreme Learning Machine, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), pp. 5498\u0026ndash;5502.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhang, Y., et al. (2019). An Efficient Approach for Speech Emotion Recognition Based on Ensemble Extreme Learning Machine. \u003cem\u003eIEEE Access\u003c/em\u003e, \u003cem\u003e7\u003c/em\u003e, 6789\u0026ndash;6798.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAbdelwahab, O., et al. (2020). Emotion Recognition from Speech Signals Using New Deep Learning Techniques. \u003cem\u003eIEEE Trans Affective Comput\u003c/em\u003e, \u003cem\u003e11\u003c/em\u003e(4), 694\u0026ndash;705.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWen, T., et al. (2021). Speech Emotion Recognition Using Deep Learning with Sequence Modeling and Data Augmentation. \u003cem\u003eIEEE Signal Processing Letters\u003c/em\u003e, \u003cem\u003e28\u003c/em\u003e, 1245\u0026ndash;1249.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHassan, M., et al. (2020). Emotion Recognition in Speech Using a Hybrid Model of BiLSTM and 1-D CNN. \u003cem\u003eIEEE Access\u003c/em\u003e, \u003cem\u003e8\u003c/em\u003e, 105678\u0026ndash;105689.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePandey, A. (2020). Speech Emotion Recognition Using GRU-RNN and Convolutional Neural Networks, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), pp. 1234\u0026ndash;1238.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKhan, S., et al. (2021). Speech Emotion Recognition using Convolutional Recurrent Neural Network with Transfer Learning. 
\u003cem\u003eIEEE J Sel Top Signal Process\u003c/em\u003e, \u003cem\u003e15\u003c/em\u003e(3), 453\u0026ndash;461.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSingh, R., et al. (2021). Multimodal Fusion of Audio and Video Using CNN and Bi-GRU for Speech Emotion Recognition. \u003cem\u003eIEEE Transactions on Multimedia\u003c/em\u003e, \u003cem\u003e23\u003c/em\u003e, 789\u0026ndash;800.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLi, X., et al. (2022). Speech Emotion Recognition Based on XGBoost and Multi-Features Fusion. \u003cem\u003eIEEE Access\u003c/em\u003e, \u003cem\u003e10\u003c/em\u003e, 2345\u0026ndash;2355.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBhattacharyya, S., et al. (2022). Speech Emotion Recognition Using Deep Learning Techniques: A Comparative Study. \u003cem\u003eIEEE Trans Neural Netw Learn Syst\u003c/em\u003e, \u003cem\u003e33\u003c/em\u003e(6), 2312\u0026ndash;2323.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKumar, P., et al. (2022). Multimodal Speech Emotion Recognition using Attention-based Bi-GRU and Textual Features. \u003cem\u003eIEEE Trans Affective Comput\u003c/em\u003e, \u003cem\u003e13\u003c/em\u003e(1), 178\u0026ndash;188.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSahu, S., et al. (2021). Speech Emotion Recognition using Deep Learning and Transfer Learning Techniques: A Comprehensive Study. \u003cem\u003eIEEE Access\u003c/em\u003e, \u003cem\u003e9\u003c/em\u003e, 112345\u0026ndash;112355.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLivingstone, S. R., \u0026amp; Russo, F. A. (2018). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). 
\u003cem\u003eNeurocomputing\u003c/em\u003e, \u003cem\u003e312\u003c/em\u003e, 172\u0026ndash;179.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePeeters, D., \u0026amp; Quatieri, T. F. (2003). \u003cem\u003eThe Toronto Emotional Speech Set (TESS)\u003c/em\u003e. URL. [Online].\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eUniversity of Surrey. The Surrey Audio-Visual Expressed Emotion (SAVEE) Dataset, 2008. [Online]. Available: URL.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhang, Z., Li, L., \u0026amp; Li, M. (2023). Transformer-Based Models for Speech Emotion Recognition. \u003cem\u003eIEEE Trans Affective Comput\u003c/em\u003e, \u003cem\u003e14\u003c/em\u003e(2), 345\u0026ndash;357.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePatel, R., Kumar, S., \u0026amp; Singh, A. (2023). Robust End-to-End Speech Emotion Recognition in Noisy Environments Using Deep Learning. \u003cem\u003eIEEE Signal Processing Magazine\u003c/em\u003e, \u003cem\u003e40\u003c/em\u003e(1), 85\u0026ndash;98.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eChen, Y., Wang, H., \u0026amp; Zhang, J. (2022). Hybrid CNN-RNN Architectures for Robust Speech Emotion Recognition. \u003cem\u003eIEEE Trans Neural Netw Learn Syst\u003c/em\u003e, \u003cem\u003e33\u003c/em\u003e(10), 4150\u0026ndash;4161.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWadhwa, M., Gupta, A., \u0026amp; Pandey, P. K. (2020). \u003cem\u003eSpeech emotion recognition (SER) through machine learning\u003c/em\u003e. Analytics Insight.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSai, K. A. (2023). 
Deep Learning for the Recognition of Human Speech.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"},{"header":"Tables","content":"\u003cp\u003eTables 1 to 7 are available in the Supplementary Files section.\u003c/p\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":false,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Emotion Classification, Feature Significance, MFCCs, Audio Features, Deep Learning","lastPublishedDoi":"10.21203/rs.3.rs-7474053/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7474053/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003e\u003cspan type=\"SmallCaps\" class=\"SmallCaps\" name=\"Emphasis\"\u003eIn the field of speech emotion recognition, the choice of audio features can dramatically influence the accuracy and effectiveness of classification systems. This study presents a comprehensive comparative analysis of feature significance, shedding light on how different audio characteristics contribute to the success of emotion recognition methodologies. 
This paper analyzes speech-based emotion recognition techniques using the raw audio files of the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). After the raw audio files are pre-processed, the analysis uses features such as Log-Mel Spectrograms (LMS), Mel-Frequency Cepstral Coefficients (MFCCs), pitch, and energy. We measured the relevance of these features to the classification of emotion through a series of approaches that include Long Short-Term Memory networks, Convolutional Neural Networks, Hidden Markov Models, and Deep Neural Networks. On a 14-class classification problem that covers two genders and seven emotions, we obtained 56% accuracy by using a 4-layer 2-dimensional CNN with Log-Mel Spectrogram features. Our results show that selecting good audio features matters more for emotion recognition performance than model complexity.\u003c/span\u003e\u003c/p\u003e","manuscriptTitle":"Feature Significance in Speech Emotion Recognition","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-08-28 09:59:15","doi":"10.21203/rs.3.rs-7474053/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"1d8059cc-c19b-47a3-8890-350e900e5648","owner":[],"postedDate":"August 28th, 
2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2025-09-13T11:53:25+00:00","versionOfRecord":[],"versionCreatedAt":"2025-08-28 09:59:15","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7474053","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7474053","identity":"rs-7474053","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
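
The three labelling schemes compared in the Discussion (gender-separated classes, emotion-only classes, and binary valence) can be illustrated with a small sketch. This is not the authors' code. The "gender_emotion" label format follows the class names quoted in the Discussion; the helper names, the grouping of emotions into positive and negative, and the toy labels are all assumptions made for illustration. The paper trained a separate model per scheme, whereas this sketch only re-scores one fixed set of predictions under coarser labels, to show why confusions among negative emotions stop counting as errors once classes are merged.

```python
from collections import Counter

# ASSUMPTION: which emotions count as "positive" is not stated in this section.
POSITIVE = {"neutral", "happy", "surprise"}


def to_emotion(label):
    """Drop the gender prefix: 'female_angry' -> 'angry'."""
    return label.split("_", 1)[1]


def to_valence(label):
    """Keep gender, collapse emotion: 'female_angry' -> 'female_negative'."""
    gender, emotion = label.split("_", 1)
    valence = "positive" if emotion in POSITIVE else "negative"
    return f"{gender}_{valence}"


def confusion(y_true, y_pred):
    """Sparse confusion matrix: counts of (true, predicted) label pairs."""
    return Counter(zip(y_true, y_pred))


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels for illustration only; these are NOT results from the paper.
y_true = ["male_disgust", "female_disgust", "male_angry", "female_neutral"]
y_pred = ["male_fear", "female_sad", "male_angry", "female_neutral"]

# Score the same predictions under each labelling scheme.
acc_fine = accuracy(y_true, y_pred)
acc_emotion = accuracy([to_emotion(l) for l in y_true],
                       [to_emotion(l) for l in y_pred])
acc_valence = accuracy([to_valence(l) for l in y_true],
                       [to_valence(l) for l in y_pred])

print(acc_fine, acc_emotion, acc_valence)
```

In this toy case the two disgust utterances are mistaken for fear and sadness, so they count as errors under the fine-grained and emotion-only labels but as correct under the valence labels, which mirrors the granularity versus accuracy trade-off described in the text.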

Source provenance

europepmc
last seen: 2026-05-20T01:45:00.602351+00:00