Listening Beyond The Labels

doi:10.1101/2025.07.01.661595

2025 · doi:10.1101/2025.07.01.661595

preprint OA: closed

Aryaman Gajrani
Independent Researcher, San Ramon, United States
For correspondence: arya-gaj{at}proton.me
bioRxiv, New Results. doi: https://doi.org/10.1101/2025.07.01.661595

Abstract

Alzheimer’s Disease (AD), a progressive neurodegenerative
condition of cognitive decline, presents formidable challenges to patients, caregivers, and healthcare systems. Early identification is essential for successful intervention, but traditional diagnosis requires expensive neuroimaging and lengthy clinical assessments, which limit access. This study proposes a semi-supervised machine learning strategy for AD diagnosis based on acoustic features from brief speech samples. The approach uses mel-spectrogram features to capture vocal patterns without manual transcription or linguistic preprocessing. Informative sound patterns are detected using a convolutional neural network (CNN) that gradually incorporates unlabeled speech data during training through pseudo-labeling. By emphasizing scalable, non-invasive methods that rely solely on unprocessed vocal inputs, this work presents a practical solution for large-scale cognitive screening in resource-limited settings.

1 Introduction

Alzheimer’s Disease (AD) is the most common cause of dementia, which affects more than 55 million individuals globally, a number expected to rise to 139 million by 2050 as a result of population aging [1]. AD is a neurodegenerative disease characterized by progressive memory impairment, disrupted language function, and cognitive decline, with a severe impact on quality of life and substantial societal and economic burdens [3]. Early diagnosis is necessary for treatment; however, current diagnostic procedures (cognitive tests and neuroimaging) are expensive and resource-intensive, with neuroimaging often costing more than $1,000 per test [4]. These constraints pose a challenge, particularly in settings where resources are limited, and are driving the need for alternative methods of detection. Non-invasive biomarkers such as speech analysis have shown promise.
Longitudinal research suggests that patients with early-stage Alzheimer’s disease show persistent changes in speech patterns, such as increased hesitation, prosodic impairment, and diminished lexical diversity [5] [7]. These vocal indicators frequently emerge before overt cognitive decline and may therefore serve as early markers [12].

Recent advances in deep learning have made automated speech analysis usable in cognitive assessment. The ADReSSo Challenge showed that acoustic and linguistic features from audio recordings could reach baseline classification accuracies of 78.87% for Alzheimer’s disease [2]. Nonetheless, conventional machine learning methods depend on large quantities of labeled data, which are scarce and expensive in healthcare environments. Semi-supervised learning offers a remedy by leveraging small quantities of labeled data alongside large quantities of unlabeled data [6].

2 Data and Methods

2.1 Dataset Sources

In this work, we combined two public datasets to obtain a larger speech corpus for semi-supervised training. The labeled data comprise 200 recordings (100 from clinically diagnosed Alzheimer’s patients and 100 from healthy controls) taken from the DementiaBank Pitt Corpus, which is commonly used in speech and cognition studies [14]. For the unlabeled data, we manually selected 500 recordings from Mozilla Common Voice [15]. The recordings vary in accent, speaker age, and background noise, which increases model robustness under real-world conditions. Pseudo-labeling these samples facilitates scalable supervised learning with little need for annotated data, consistent with earlier semi-supervised Alzheimer’s detection methods [7].

2.2 Audio Preprocessing

To ensure consistency across inputs, all audio files were first converted from .mp3 to .wav format, because the .wav format preserves the raw waveform, which is essential for accurate feature extraction.
Because of the large number of audio files, we used multi-threading to speed up the conversion. The converted files were all resampled to 16 kHz so that every sample shares the same sampling rate. Trimming and zero-padding were then applied to normalize each clip to exactly 5 seconds. This preprocessing step kept input lengths uniform across all samples and helped reduce potential bias from variations in audio duration. The procedure follows methodologies from prior research, for instance Haider et al., who illustrated the importance of uniform audio preprocessing in detecting Alzheimer’s dementia from paralinguistic acoustic features [8]. Likewise, the ADReSSo Challenge indicated the value of standardized audio processing pipelines for reliable AD detection [2].

2.3 Feature Extraction

Next, the preprocessed audio was transformed into Mel-spectrograms, a representation that closely mirrors human auditory perception. A sliding-window Short-Time Fourier Transform (STFT) was used to decompose the audio signal into 128 Mel-frequency bands across time, retaining fine acoustic detail while remaining perceptually relevant. The spectrograms were converted to a decibel (dB) scale to reflect changes in loudness, after which min-max normalization was applied to standardize the value range. The resulting features were stored as .npy files for efficient storage and loading. This method leverages spectrogram-based features that recent work has shown to be effective in Alzheimer’s disease detection. Meghanani et al. showed that log-Mel spectrograms combined with deep neural networks can capture the subtle acoustic differences between AD and control speech samples [4]. Haider et al.
also showed that paralinguistic acoustic features can identify Alzheimer’s dementia in spontaneous speech [8].

2.4 Model Architecture

After extracting Mel-spectrograms, we used a lightweight convolutional neural network (CNN) to differentiate Alzheimer’s cases from healthy controls. The design includes three convolutional blocks, each followed by ReLU activation and max-pooling layers to reduce dimensionality while retaining important features. Batch normalization follows each block to stabilize learning and speed up convergence. The output of the last convolutional layer is flattened and passed through a fully connected dense layer, with dropout applied to mitigate overfitting given the limited dataset size. The last layer consists of a single neuron with a sigmoid activation function, which produces a score representing the probability that the input belongs to an Alzheimer’s disease patient. This compact architecture allows for efficient training while capturing patterns typical of cognitive impairment. Comparable CNN-based models have shown success in past research on detecting Alzheimer’s, supporting the use of such an approach to detect the disease from speech-based features [5] [10].

2.5 Training Strategy

Training proceeded in two stages, starting with supervised learning on the 200 labeled instances, with stratified sampling providing balanced representation of the two classes (control and Alzheimer’s). Binary cross-entropy was used as the loss function, optimized with the Adam optimizer. Following the initial supervised phase, semi-supervised learning was applied to the 500 unlabeled samples. Predictions with high confidence (≥85%) were pseudo-labeled and incorporated into the training set.
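The confidence-based selection step can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes the sigmoid output is read as P(Alzheimer's) and that the 85% threshold is applied symmetrically, so scores at or above 0.85 receive the Alzheimer's label and scores at or below 0.15 receive the control label. The function name `select_pseudo_labels` and the example scores are hypothetical.

```python
from typing import List, Tuple

CONFIDENCE_THRESHOLD = 0.85  # the ">=85%" cut-off quoted in the text


def select_pseudo_labels(
    probabilities: List[float],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> List[Tuple[int, int]]:
    """Return (sample_index, pseudo_label) pairs for confident predictions.

    Assumption: each probability is a sigmoid output read as P(Alzheimer's).
    A score >= threshold becomes label 1, a score <= 1 - threshold becomes
    label 0, and everything in between is left unlabeled.
    """
    selected = []
    for index, p in enumerate(probabilities):
        if p >= threshold:
            selected.append((index, 1))  # confident Alzheimer's prediction
        elif p <= 1.0 - threshold:
            selected.append((index, 0))  # confident control prediction
    return selected


# Hypothetical scores for five unlabeled clips (not data from the paper).
scores = [0.97, 0.13, 0.60, 0.85, 0.20]
pseudo_labeled = select_pseudo_labels(scores)
print(pseudo_labeled)  # [(0, 1), (1, 0), (3, 1)]
```

Only the selected pairs would be merged into the training set; the clips scored 0.60 and 0.20 stay unlabeled under this reading. If the threshold were instead applied to the raw score alone, only the first branch would be kept.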
This process substantially augmented the training data without human annotation, allowing the model to better handle variation in speech characteristics such as accent, background noise, and recording quality, and thus making it more reliable in real-world applications. This semi-supervised methodology extends earlier work by Cascante-Bonilla et al., which demonstrated that pseudo-labeling can be effectively employed to leverage unlabeled speech samples for Alzheimer’s diagnosis [7]. In related work, Wankerl et al. showed that statistical language models trained on speech transcriptions perform strongly in detecting AD, further supporting the potential of machine learning solutions in this domain [6].

3 Results

3.1 Model Performance Metrics

Our model showed strong discriminative power in identifying early Alzheimer’s from extracted speech features. The performance metrics indicate that acoustic features alone can distinguish between control and dementia groups, supporting their applicability to early disease identification [4] [8].

3.2 Results on Labeled Data

During validation, the model reached an accuracy of 84.62% with a validation loss of 0.4323, indicating that it learned meaningful patterns from the training data. During training, it reached an accuracy of 96.88%, with a mean training accuracy across epochs of 95.14%. These results suggest not only effective learning from the training data but also a strong ability to generalize to unseen validation data, with little evidence of overfitting. They align with Fraser et al., who showed that linguistic markers in narrative speech can serve as effective indicators of Alzheimer’s, reinforcing the broader utility of speech-based diagnostics [13]. They are also consistent with earlier work in the area.
For instance, the MUET-RMIT system developed for the ADReSSo Challenge achieved an accuracy of 84.51% for AD classification from speech recordings [5].

Table 1: Summary of validation and training metrics, including pseudo-label accuracy and high-confidence prediction distribution on unlabeled samples.

3.3 Results on Unlabeled Data

The model was then applied to the unlabeled samples to assess its scalability. Of the 500 unlabeled recordings, 302 received confident predictions (probability above the chosen threshold), allowing pseudo-labeling at a reported accuracy of 87.75%. The class distribution was markedly imbalanced, with 244 samples labeled as dementia and 58 as controls, indicating inherent class imbalance within the available data.

Figure 1 shows the confidence levels of the model predictions on the unlabeled dataset. Many predictions exhibited extremely high confidence (approaching a probability of 1.0), while others showed lower confidence (approximately 0.13). This distribution highlights the model’s ability to identify uncertain cases, a critical characteristic for real-world deployment where data quality may vary. The model’s ability to produce good-quality pseudo-labels for unseen data while maintaining high accuracy on labeled data indicates its flexibility under changing data conditions, a requirement for real-world use involving speech data with varying recording conditions, quality, and accents [7] [9].

Figure 1: Histogram showing the model’s prediction confidence on unlabeled data. The x-axis represents the predicted probability scores, and the y-axis indicates the frequency of samples within each probability range. A notable concentration of predictions near 1.0 suggests the model exhibits high confidence on a substantial portion of the unlabeled dataset.
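As a quick check of the counts reported in Section 3.3, the short sketch below recomputes the share of unlabeled recordings that were retained and the class split among them. The numbers are taken from the text; the helper name `share` is illustrative only.

```python
# Counts reported in Section 3.3.
TOTAL_UNLABELED = 500
CONFIDENT = 302
LABELED_DEMENTIA = 244
LABELED_CONTROL = 58


def share(part: int, whole: int) -> float:
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


# The two pseudo-label classes should add up to the confident subset.
assert LABELED_DEMENTIA + LABELED_CONTROL == CONFIDENT

retained = share(CONFIDENT, TOTAL_UNLABELED)
dementia_share = share(LABELED_DEMENTIA, CONFIDENT)
control_share = share(LABELED_CONTROL, CONFIDENT)

print(f"retained: {retained:.1f}%")        # retained: 60.4%
print(f"dementia: {dementia_share:.1f}%")  # dementia: 80.8%
print(f"control:  {control_share:.1f}%")   # control:  19.2%
```

Roughly four in five confident pseudo-labels fall in the dementia class, which is the imbalance the section describes.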
4 Conclusion

These findings underscore the growing shift toward non-invasive, voice-based methods for detecting neurodegenerative diseases. By using acoustic characteristics only, the method circumvents linguistic, cultural, and literacy constraints, making it readily transferable to different populations. In areas where clinical diagnosis is scarce or is devalued and stigmatized, this method provides an inexpensive and convenient option for early cognitive assessment. In addition, the use of semi-supervised learning demonstrates the untapped diagnostic potential of large amounts of unlabeled speech, with much lower dependence on annotated clinical data. The consistent performance of the model across different accents, environments, and speaker populations also attests to its resilience and suitability for real-world application [15]. Beyond a proof-of-concept study, this research lays the groundwork for passive cognitive screening, in which natural, unscripted speech serves as a biometric indicator of neurological status. To move this strategy toward real-world adoption, future work will need to incorporate advances in temporal modeling, cross-lingual generalization, and ethically informed deployment systems. Furthermore, integrating complementary data streams, such as neuroimaging or semantic-linguistic features, could enhance diagnostic precision and enable more robust, longitudinal tracking at the individual level. As speech continues to emerge as a scalable proxy for cognitive state, its responsible and privacy-conscious application has the potential to reshape public health interventions for early detection and continuous monitoring of dementia [10] [11].

Conflict of interests

The authors declare no conflicts of interest.

Footnotes

https://talkbank.org/dementia/
https://commonvoice.mozilla.org/en

References

[1] World Health Organization, “Dementia,” Fact Sheet, Mar. 15, 2023. [Online].
Available: https://www.who.int/news-room/fact-sheets/detail/dementia
[2] S. Luz, F. Haider, S. de la Fuente, D. Fromm, and B. MacWhinney, “Detecting cognitive decline using speech only: The ADReSSo Challenge,” Proc. Interspeech 2021, pp. 3829–3833, 2021, doi: 10.21437/Interspeech.2021-1220. [Online]. Available: https://www.isca-archive.org/interspeech_2021/luz21_interspeech.html
[3] Alzheimer’s Association, “2023 Alzheimer’s disease facts and figures,” Alzheimer’s & Dementia, vol. 19, no. 4, pp. 1598–1695, Apr. 2023, doi: 10.1002/alz.13016.
[4] M. Meghanani, S. A. Hussain, and S. S. H. Naqvi, “Log-Mel spectrogram and deep learning based Alzheimer’s disease recognition using speech,” J. King Saud Univ. - Comput. Inf. Sci., vol. 33, no. 8, pp. 915–923, Oct. 2021, doi: 10.1016/j.jksuci.2019.06.013.
[5] Z. S. Samani, M. Shahin, Y. Zo, I. Inayatullah, D. Shahi, and R. J. Haddad, “Tackling the ADReSSo Challenge 2021: The MUET-RMIT System for Alzheimer’s Dementia Recognition from Spontaneous Speech,” Proc. Interspeech 2021, pp. 3830–3834, 2021, doi: 10.21437/Interspeech.2021-1220. [Online]. Available: https://www.isca-archive.org/interspeech_2021/syed21_interspeech.html
[6] S. Wankerl, E. Nöth, and S. Evert, “Automatic Diagnosis of Alzheimer’s Disease Using Neural Network Language Models,” Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 5841–5845, 2019, doi: 10.1109/ICASSP.2019.8683110.
[7] J. Cascante-Bonilla, S. R. Aragon, J. J. Murillo-Fuentes, and J. M. Górriz, “Semi-supervised learning for Alzheimer’s disease diagnosis using speech data,” Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 2847–2851, 2019, doi: 10.1109/ICASSP.2019.8683524.
[8] S. Haider, S. de la Fuente, and S.
Luz, “An assessment of paralinguistic acoustic features for detection of Alzheimer’s dementia in spontaneous speech,” IEEE J. Sel. Top. Signal Process., vol. 14, no. 2, pp. 272–281, Feb. 2020, doi: 10.1109/JSTSP.2020.2973612. [Online]. Available: https://signalprocessingsociety.org/publications-resources/ieee-journal-selected-topics-signal-processing/assessment-paralinguistic
[9] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, et al., “WaveNet: A generative model for raw audio,” arXiv preprint, arXiv:1609.03499, Sep. 2016. [Online]. Available: https://arxiv.org/abs/1609.03499
[10] L. Chi, A. Sharma, A. Gebhardt, and J. T. Colonel, “Predicting Cognitive Decline: A Multimodal AI Approach to Dementia Screening from Speech,” arXiv preprint, arXiv:2502.08862, Feb. 2025. [Online]. Available: https://arxiv.org/abs/2502.08862
[11] J. Kang, D. Han, L. Meng, J. Zhou, J. Li, X. Wu, and H. Meng, “Towards Within-Class Variation in Alzheimer’s Disease Detection from Spontaneous Speech,” arXiv preprint, arXiv:2409.16322, Sep. 2024. [Online]. Available: https://arxiv.org/abs/2409.16322
[12] F. Rudzicz, S. Wang, M. Begum, and P. Tan, “Efficient Pause Extraction and Encode Strategy for Alzheimer’s Disease Detection Using Only Acoustic Features from Spontaneous Speech,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 31, pp. 1149–1162, 2023, doi: 10.1109/TASLP.2023.3237641.
[13] K. C. Fraser, J. A. Meltzer, and F. Rudzicz, “Linguistic features identify Alzheimer’s disease in narrative speech,” J. Alzheimers Dis., vol. 49, no. 2, pp. 407–422, 2016, doi: 10.3233/JAD-150520. [Online]. Available: https://pubmed.ncbi.nlm.nih.gov/26484921/
[14] P. MacWhinney, B. Fromm, M. Forbes, and D. Holland, “The DementiaBank corpus: Principles and practices,” Proc. LREC, pp. 3223–3227, 2011.
[15] Mozilla Foundation, “Mozilla Common Voice Dataset,” Version 13.0, 2024. [Online]. Available: https://commonvoice.mozilla.org/en/datasets

Posted July 07, 2025.
