Large-Scale Analytical Validation of Voice-Derived Digital Biomarkers Using Automated Speech Elicitation

Research Article (Research Square preprint)

Adrian Attard Trevisan, Frederick R Carrick, Andrea Sprio

This is a preprint; it has not been peer reviewed by a journal.
DOI: https://doi.org/10.21203/rs.3.rs-8810881/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted. Version 1.

Abstract

Voice-based digital biomarkers offer a non-invasive and inherently scalable approach to monitoring physiological and psychological state. A substantial body of foundational work in speech science, respiratory physiology, and psychophysiology demonstrates that vocal production systematically reflects underlying biological processes, including respiratory mechanics, autonomic regulation, cognitive demand, and affective state.
Despite this strong theoretical basis, quantitative analytical validation of voice-derived biomarkers at population scale remains limited, particularly with respect to consistency, robustness, and distributional behaviour across large biomarker portfolios. This study presents a large-scale analytical validation of the Voice Biota programme using audited model performance outputs derived from standardised automated speech elicitation. Across 46 independently validated composite voice biomarkers and more than 1.5 million voice-derived data points, discrimination performance was assessed using the area under the receiver operating characteristic curve (AUC). Across all biomarkers, performance was consistently high (mean AUC = 0.899, SD = 0.029; range 0.851–0.946), with all biomarkers exceeding predefined analytical acceptance thresholds commonly adopted in digital biomarker evaluation. In addition to summary statistics, we performed distributional, cumulative, and normality analyses to characterise performance across the entire biomarker portfolio; these indicated stable, unimodal performance with no evidence of pathological skewness, heavy tails, or weak-performing outliers. Taken together, these results provide evidence of the analytical validity and potential scalability of voice-derived biomarkers and establish a robust empirical basis for subsequent clinical, longitudinal, and regulatory validation studies.

Keywords: analytical validation, voice biomarkers, digital biomarkers, speech science, large-scale evaluation, automated speech elicitation

1. Introduction

The human voice is a complex biological signal produced through the tightly coordinated interaction of multiple physiological and psychological systems.
Speech production results from controlled respiratory airflow, stable and finely regulated vocal fold oscillation, precise neuromotor coordination, and continuous modulation by higher-order cognitive and affective processes [1, 2]. Consequently, vocal production is an integrated expression of systemic state and not a discrete signal generated by a single organ or pathway.

Because of this integrative character, changes in physical health, autonomic tone, mental workload, fatigue, or emotional regulation can become visible as changes in acoustic structure, temporal organisation, and vocal stability [3, 4]. Empirical research has demonstrated systematic voice changes in respiratory disorders such as chronic obstructive pulmonary disease and asthma, where altered lung mechanics disrupt speech breathing patterns and phonatory control [5–7]. Parallel lines of investigation in psychophysiology and cognitive science have shown that autonomic activation, cognitive load, mental fatigue, stress, and affective state modulate speech rate, pause structure, pitch dynamics, and spectral variability [8–11].

These converging literatures provide a strong conceptual rationale for the use of voice as a digital biomarker modality. However, despite increasing interest, much of the contemporary voice biomarker literature remains limited in translational relevance. Common limitations include small and homogeneous samples, reliance on laboratory-based or scripted speech tasks, opaque feature engineering, and narrow validation strategies focused on isolated performance metrics [12]. In particular, relatively few studies have examined analytical robustness and consistency across large portfolios of biomarkers evaluated at population scale, a requirement increasingly emphasised in methodological and regulatory guidance for digital health technologies [14, 15].

Analytical validation refers to the demonstration that a measurement reliably and consistently captures the intended signal under defined conditions. For digital biomarkers, analytical validation precedes and underpins clinical validation, providing evidence that observed performance is stable, reproducible, and not driven by artefacts or selective optimisation [13]. The Voice Biota project was created to address these issues through standardised automated speech capture, population-based data collection, and ongoing performance monitoring of voice-derived markers (also referred to as 'voice biomarkers').

This paper focuses specifically on the analytical validity layer of the framework. It is intended to provide a detailed quantitative description of how well voice biomarkers discriminate between diseased and non-diseased individuals; it does not make any disease-specific or clinically relevant claims. The specific goals of this analysis are to establish a rigorous empirical base for future stages of validation and translation, and to assess not only average discrimination performance but also distributional shape, cumulative behaviour, and normality.

2. Methods

2.1 Data Source and Automated Speech Elicitation

This paper studies voice data collected through standardised automated speech prompting protocols over real-world communication channels, including telephone-based conversations and asynchronous voice interfaces. Automated speech elicitation combines protocol-level uniformity with natural speech, thereby addressing the long-standing tension between experimental control and ecological validity in voice research (Coravos et al., 2019).

The prompts were designed to be as non-intrusive and task-neutral as possible, eliciting minimally prepared speech while keeping speakers comparable across individuals, contexts, and time.
A single, consistent analytical pipeline was then applied to every speech interaction to obtain composite voice-derived biomarkers grounded in physiological and psychological theory. Automated prompting also facilitates scalability and auditability, two requisites for the large-scale deployment of digital biomarkers.

The analysed dataset contained over 1.5 million voice-derived data points gathered from a broad range of populations and use cases. Forty-six biomarkers with complete and audited validation reports were selected for this study. Each biomarker was assessed independently under the same analytical validation framework, so that portfolio-level comparison is possible without confounding differences in methodology.

2.2 Analytical Validation Framework

The primary metric for discrimination performance in this analysis is the area under the receiver operating characteristic (ROC) curve (AUC). The AUC is a widely accepted, threshold-independent measure of separability between groups that is frequently used to evaluate digital biomarkers and health metrics derived from machine learning. It is considered a robust measure where class imbalance exists and does not rely on arbitrary threshold decisions [13, 14].

Performance metrics were derived from validated performance artefacts associated with audited validation conducted within the Voice Biota programme. This supports the traceability, transparency, and reproducibility of all reported findings.

Rather than relying on a single summary statistic, multiple complementary analyses were run in parallel to evaluate the overall performance characteristics of the biomarker collection. Distributional analysis was used to examine the central tendency and spread of AUC values in order to identify weak-performing biomarkers or outliers.
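
As a brief illustration of the primary metric described in this section, the sketch below computes an AUC from raw model scores using the pairwise (Mann-Whitney) formulation: the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. This is a minimal sketch and not the Voice Biota pipeline; the function name `auc_from_scores` and the example scores are hypothetical and invented purely for illustration.

```python
from typing import Sequence


def auc_from_scores(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Return the AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair,
    which matches the Mann-Whitney U formulation of the AUC.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical model scores (not taken from the study).
positives = [0.9, 0.8, 0.7, 0.6]
negatives = [0.65, 0.5, 0.4, 0.3, 0.2]

auc = auc_from_scores(positives, negatives)

# Any strictly increasing transformation of the scores leaves the AUC unchanged,
# which is what "threshold-independent" means in practice.
auc_squared = auc_from_scores([s ** 2 for s in positives], [s ** 2 for s in negatives])

print(round(auc, 3), round(auc_squared, 3))
```

Because only the ordering of scores matters, no decision threshold has to be chosen before the metric can be computed. The quadratic pairwise loop is kept for readability; for large samples a rank-based implementation would be preferable.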

Empirical cumulative distribution functions (ECDFs) were calculated to indicate the percentage of biomarkers meeting commonly used thresholds of analytical performance. The overall shape of the performance distribution was studied with kernel density estimation, while quantile-quantile analysis was applied to check normality and to detect potential skewness or heavy tails. Taken together, these techniques describe not only average performance but also the consistency of performance across a large and varied portfolio.

3. Results

3.1 Distribution of Discrimination Performance

The 46 validated voice-derived biomarkers showed a narrow range of discrimination performance concentrated at the upper end of the AUC scale, as shown in Fig. 1. Most biomarkers had AUC values between roughly 0.87 and 0.93, meaning that discrimination performance differed little between individual models. The lowest AUC recorded was 0.851 and the highest was 0.946, indicating that none of the biomarkers performed close to chance level.

Descriptive statistics for the full biomarker portfolio are given in Table 1. The mean AUC across all biomarkers was 0.899 with a standard deviation of 0.029, showing only modest variation around the average performance level. The median AUC was 0.900, almost equal to the mean, which indicates that AUC values were roughly symmetrically distributed. The interquartile range, from 0.871 (25th percentile) to 0.925 (75th percentile), further shows that a large share of biomarkers had very similar discriminative ability.

Figure 1 gives a graphical summary of the AUC distribution, combining a histogram of AUC values with measures of central tendency and variability.
The relatively small spread of data points, with no extreme values, indicates that discrimination performance varied little across the biomarker portfolio and was not shaped by a small number of outliers.

Table 1. Descriptive statistics of AUC values for the 46 voice-derived biomarkers, including mean, standard deviation, minimum, maximum, and selected percentile values.

Statistic             AUC
Count                 46
Mean                  0.899
Standard deviation    0.029
Minimum               0.851
25th percentile       0.871
Median                0.900
75th percentile       0.925
Maximum               0.946

3.2 Threshold and Cumulative Performance

An empirical cumulative distribution function (ECDF) was used to assess cumulative discrimination performance across the biomarker portfolio and to compare it with AUC thresholds commonly accepted in standardised analytical evaluations. As shown in Fig. 2, the ECDF indicates the percentage of biomarkers that met or exceeded each AUC threshold. All 46 biomarkers (100.0%) had an AUC of at least 0.85, 31 biomarkers (67.4%) had an AUC of at least 0.88, 23 biomarkers (50.0%) had an AUC of at least 0.90, and 13 biomarkers (28.3%) had an AUC of at least 0.92. The number of biomarkers at or above each threshold is presented in Table 2.

Together, the ECDF and the tabulated data indicate that elevated discrimination performance is spread across the portfolio and is not confined to a few high-performing models. Many models meet or exceed the performance criteria commonly used across the field.

Table 2. Number and proportion of biomarkers at or above the selected AUC thresholds (≥ 0.85, ≥ 0.88, ≥ 0.90, and ≥ 0.92).

AUC threshold    Biomarkers ≥ threshold (n)    Biomarkers ≥ threshold (%)
≥ 0.85           46                            100.0
≥ 0.88           31                            67.4
≥ 0.90           23                            50.0
≥ 0.92           13                            28.3

3.3 Distributional Shape and Statistical Regularity

Kernel density estimation (KDE) provided a clearer view of the overall shape of the AUC distribution (Fig. 3A). As expected, the largest concentration of values was found in the vicinity of an AUC of 0.90. The distribution showed no evidence of multiple modes or significant skewness, and therefore no evidence that the biomarkers fall into separate high-performing and low-performing groups of models.

The distribution also met criteria for approximate normality on a quantile-quantile (Q-Q) plot (Fig. 3B). Empirical quantiles of the AUC distribution closely followed the theoretical 45-degree reference line over most of the distribution; only small deviations (and some non-normality) were present at the very low and very high ends. The quantiles reported in Table 3 show that the AUC is 0.853 at the 5th percentile and 0.943 at the 95th percentile. Overall, the combination of the KDE and Q-Q plots indicates that the discrimination performance of the full biomarker array exhibited very little skew and no evidence of heavy tails.

Table 3. Selected quantile values of AUC across the 46 validated voice-derived biomarkers.

Quantile                    AUC
5th percentile              0.853
10th percentile             0.862
25th percentile             0.871
Median (50th percentile)    0.900
75th percentile             0.925
90th percentile             0.937
95th percentile             0.943

3.4 Ranking and Portfolio Structure

To examine performance differences among individual biomarkers further, the AUC values were ordered from largest to smallest. The fifteen biomarkers with the highest AUC values are presented in Table 4; their AUC values lie between 0.918 and 0.946.
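
The threshold counts in Table 2 and the ranking in Table 4 can be cross-checked against each other with a few lines of code. The sketch below is an illustration only, not the study's analysis code: it uses the fifteen AUC values printed in Table 4, and the helper names `count_at_or_above` and `adjacent_gaps` are hypothetical. Because Table 4 is sorted in descending order and its last two entries (0.918) already fall below 0.92, the count at or above 0.92 taken from these fifteen values also holds for the full portfolio of 46; the lower thresholds cannot be checked this way, since the remaining 31 values are not listed.

```python
from typing import List, Sequence

# AUC values of the fifteen top-ranked biomarkers, copied from Table 4 (descending).
top15_auc = [
    0.946, 0.944, 0.944, 0.943, 0.937,
    0.937, 0.935, 0.933, 0.929, 0.927,
    0.926, 0.926, 0.923, 0.918, 0.918,
]
PORTFOLIO_SIZE = 46  # total number of biomarkers reported in the study


def count_at_or_above(values: Sequence[float], threshold: float) -> int:
    """Number of values that meet or exceed the threshold."""
    return sum(1 for v in values if v >= threshold)


def adjacent_gaps(ranked_desc: Sequence[float]) -> List[float]:
    """Differences between consecutive values in a descending ranking."""
    return [round(a - b, 3) for a, b in zip(ranked_desc, ranked_desc[1:])]


n_092 = count_at_or_above(top15_auc, 0.92)
share_092 = round(100 * n_092 / PORTFOLIO_SIZE, 1)  # percentage of the full portfolio
gaps = adjacent_gaps(top15_auc)
largest_gap = max(gaps)

print(n_092, share_092, largest_gap)
```

With the Table 4 values, the count at or above 0.92 is 13 (28.3% of 46), which agrees with Table 2, and the largest step between adjacent ranks among the top fifteen is 0.006.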

Differences in AUC between adjacent ranks were small, and performance decreased gently throughout the ranking. There were no sudden cut-offs or jumps that would suggest separate performance tiers. Rather, the ranked AUC values formed a continuous distribution, indicating a gradual change in discrimination performance among the biomarkers.

Table 4. Biomarkers ranked by AUC value, listing the highest-performing models in descending order.

Rank    Biomarker                                 AUC
1       Cognitive Load                            0.946
2       Formant Dispersion                        0.944
3       Depression Indicators                     0.944
4       Neural Speech Coherence                   0.943
5       Respiratory Speech Efficiency             0.937
6       MFCC Dynamics                             0.937
7       Speech Rate                               0.935
8       Vocal Tremor                              0.933
9       Composite Respiratory Clarity Index       0.929
10      Readiness and Fatigue Assessment          0.927
11      Spectral Flux                             0.926
12      Respiratory Patterns                      0.926
13      Speech Clarity                            0.923
14      Cognitive–Autonomic Dissociation Index    0.918
15      Phonation Stability Index                 0.918

4. Discussion

In this study, we performed an analytical validation of voice-derived digital biomarkers by analysing performance metrics at the portfolio level, using metrics collected through an automated and standardised speech assessment process. The validation examined the distributional and cumulative behaviour of performance across the portfolio, rather than evaluating isolated models or comparing results from randomised and non-randomised datasets. This approach addresses a limitation of prior peer-reviewed research, which has largely validated voice-derived biomarkers through proof-of-concept studies in limited or highly controlled experimental contexts.

One key observation was the high level of discrimination performance across all 46 voice-derived digital biomarkers included in the analysis. In contrast to many health-related applications of machine learning, which are characterised by high variability in performance and long-tailed distributions of "weak" models (Topol, 2019; Beam & Kohane, 2018), the AUC scores for all voice-derived digital biomarkers fell within a narrow range around 0.90. These results indicate that the limited variability in performance reflects the stability of the measurement and modelling framework rather than selective optimisation of any individual construct.

The portfolio-level scope of the present results differentiates them from previous voice biomarker research, which usually reports the performance of only a few features or task-specific models (Low et al., 2020; Schuller et al., 2018). Although such studies have shown that the voice can reveal changes in physiological and psychological condition, they usually give very little information on how performance behaves across many constructs assessed at large scale. By contrast, the distributional analyses presented here provide evidence that discrimination ability remains consistent across a diverse set of biomarkers, supporting the potential for scalable deployment.

Cumulative performance analyses put these results into perspective by showing that a large number of biomarkers surpass commonly used analytical acceptance thresholds.
From a translational perspective, this cumulative view is important because regulatory and methodological guidelines increasingly emphasise consistency and robustness across measurement portfolios rather than a single peak performance [14, 15].

Beyond summary metrics, statistical regularity in the observed performance distributions gives further reassurance about analytical stability. The unimodal and symmetric nature of the AUC distribution, and the lack of heavy tails and extreme skewness, contrast with the performance distributions seen in typical high-dimensional machine learning systems, which show a wide performance spread due to data heterogeneity or model overfitting [16]. This regularity is also evidence of predictable behaviour in deployment, a critical attribute for the successful implementation of digital health technologies [17].

In addition to the summary data, the ranked performance of individual biomarkers showed smooth rather than discontinuous changes. This regularity suggests that differences in discrimination performance are more likely due to intrinsic properties, such as the complexity of the underlying signal, the definition of the construct, or the degree of physiological coupling, than to differences in testing methodology. Notably, even the lowest-ranked biomarkers performed well above chance level, which is important evidence of the overall analytical strength of the framework.

Overall, these results position automated speech elicitation, supported by audited analytical pipelines, as a solid basis for the large-scale development of voice biomarkers at the population level.
The present study does not, however, address clinical validity, longitudinal sensitivity, or disease specificity; rather, it establishes an essential analytical baseline for those evaluations. This staged approach is in line with developing regulatory frameworks that treat analytical validation as a precondition for later clinical and regulatory assessment of digital biomarkers (Coravos et al., 2019; Goldsack et al., 2020; FDA, 2023).

5. Conclusion

An extensive analytical validation study of voice-derived digital biomarkers was performed on a dataset of over 1.5 million voice-derived data points and 46 independently validated composite indices. The analysis of the distribution of discrimination performance, cumulative metric behaviour, and statistical regularity indicates that voice-derived digital biomarkers can achieve very high analytical performance and remain consistently reliable at a population level. Multiple analytical perspectives, including central tendency, dispersion, threshold exceedance, and distributional shape, give a fuller understanding of performance across the biomarker portfolio than any single analysis or measure.

Overall, the performance of the entire portfolio of voice-derived biomarkers is stable and unimodal, with no weak-performing or pathologically performing outliers, indicating that analytical performance is a portfolio-level characteristic of the Voice Biota framework and not an artefact of selective optimisation. These data also provide a solid analytical foundation for further validation from a translational perspective. Clinical validity, longitudinal sensitivity, and physiological specificity still require further validation and investigation; nevertheless, analytical validation is a critical initial step towards regulatory evaluation and real-world use.
The methods used in this study illustrate how standardised automated speech tasks combined with audited analytical pipelines can be used for scalable, population-level digital biomarker development.

Declarations

Corresponding Author
Dr Adrian Attard Trevisan. Email: [email protected]

Ethics Declaration
Not applicable.

Ethics, Consent to Participate, and Consent to Publish
Not applicable.

Funding
This research received no external funding.

Data Availability
Derived data supporting the findings of this study are available from the corresponding author on reasonable request. Raw voice recordings are not publicly available due to the proprietary and ongoing validation framework of the Voice Biota programme.

Competing Interests
Dr Adrian Attard Trevisan developed the Voice Biota programme. The remaining authors declare that they have no competing interests.

Author Contributions
Dr Adrian Attard Trevisan conceived the study, designed the analytical framework, and drafted the manuscript. Andrea Sprio contributed to methodological refinement, data interpretation, and critical manuscript revision. Frederick Robert Carrick contributed to conceptual development, clinical framing, and manuscript review. All authors read and approved the final manuscript.

References

1. Hixon TJ, Weismer G, Hoit JD. Preclinical speech science. Plural Publishing; 2008.
2. Sapienza CM, Hoffman-Ruddy B. Voice disorders. 3rd ed. Plural Publishing; 2021.
3. Scherer KR. Vocal communication of emotion. Speech Communication. 2003;40:227–256.
4. Kreibig SD. Autonomic nervous system activity in emotion. Biological Psychology. 2010;84:394–421.
5. Meek PM, Carding PN, Howard RS. Vocal function in chronic obstructive pulmonary disease. Thorax. 2001;56:334–339.
6. Hoit JD, Plassman BL, Lansing RW, Hixon TJ. Abdominal muscle activity during speech production. Journal of Applied Physiology. 2003;94:665–673.
7. Celli BR, Wedzicha JA. Update on clinical aspects of chronic obstructive pulmonary disease. New England Journal of Medicine. 2019;381:1257–1266.
8. Kahneman D. Attention and effort. Prentice Hall; 1973.
9. Porges SW. The polyvagal perspective. Biological Psychology. 2007;74:116–143.
10. Fairclough SH, Houston K. A metabolic measure of mental effort. Biological Psychology. 2004;66:177–190.
11. Cummins N, Schuller B, Krajewski J. A review of depression and suicide risk assessment using speech analysis. Speech Communication. 2015;71:10–49.
12. Low DM, Bentley KH, Ghosh SS. Automated assessment of psychiatric disorders using speech: A systematic review. Biological Psychiatry. 2020;88:42–50.
13. Food and Drug Administration. Digital health technologies for remote data acquisition in clinical investigations. 2023. Available from: https://www.fda.gov/regulatory-information/search-fda-guidance-documents
14. Goldsack JC, Coravos A, Bakker JP, et al. Verification, analytical validation, and clinical validation of digital biomarkers. npj Digital Medicine. 2020;3:55.
15. Coravos A, Khozin S, Mandl KD. Developing and adopting safe and effective digital biomarkers to improve patient outcomes. npj Digital Medicine. 2019;2:14.
16. Beam AL, Kohane IS. Big data and machine learning in health care. JAMA. 2018.
17. Topol E. High-performance medicine. Nature Medicine. 2019.

Additional Declarations
No competing interests reported.
Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-8810881","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":592536056,"identity":"64831611-7a42-493c-9551-0692ed6614fd","order_by":0,"name":"Adrian Attard Trevisan","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABEElEQVRIiWNgGAWjYFCCBIYDMKYEEMvZtzcAKQPCWiR4QASQZWzAc4CwFgZkLYkbJBLwO4u/PffhgZ877Ors2c8+vP1xjw3jdsm3Dz/+KGCQ5xc7gFWLxJnnBgd7zyRL8PCkG1sceJbGbDk73Viax4DBcOZsHNbdSGM4wNvGDHRYGpvEgQOH2Rhup7ExA/2SYHAbuxZ5oJaDf9vqJXj4n4G0/OdhuHmMjfEHHi0GQC2HedsOS/BIgG05IGFwg42NgQePFsMzzxgOy7Ydl+y58YzZ4syBZAPJnjRmoF8kcPpF7nga88e3bdX87P1pjDcqDtjV97MfY/z444+NPL80Du/jAhKkKR8Fo2AUjIJRgAIA0A9cWJgh0skAAAAASUVORK5CYII=","orcid":"","institution":"Asomi College of Sciences","correspondingAuthor":true,"prefix":"","firstName":"Adrian","middleName":"Attard","lastName":"Trevisan","suffix":""},{"id":592536057,"identity":"5b3cee99-b112-424b-b313-19c77cff3305","order_by":1,"name":"Frederick R Carrick","email":"","orcid":"","institution":"University of Central Florida","correspondingAuthor":false,"prefix":"","firstName":"Frederick","middleName":"R","lastName":"Carrick","suffix":""},{"id":592536059,"identity":"aa43c5ad-80db-40d9-b919-397e588968aa","order_by":2,"name":"Andrea Sprio","email":"","orcid":"","institution":"Asomi College of 
Sciences","correspondingAuthor":false,"prefix":"","firstName":"Andrea","middleName":"","lastName":"Sprio","suffix":""}],"badges":[],"createdAt":"2026-02-06 21:38:02","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-8810881/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8810881/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":102892440,"identity":"5af719a3-dffc-42f1-b476-5acf6b9188c1","added_by":"auto","created_at":"2026-02-18 05:25:27","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":132412,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eDistribution of discrimination performance across the biomarker portfolio showing \u003c/strong\u003ehistogram of AUC values for the 46 validated voice-derived biomarkers. The vertical line denotes the median AUC, and the shaded region indicates the interquartile range (IQR).\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-8810881/v1/d6a38ee25cbf660a4e71a507.png"},{"id":102892432,"identity":"1726a0d9-c45d-499d-a869-3ade3074515e","added_by":"auto","created_at":"2026-02-18 05:25:25","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":131125,"visible":true,"origin":"","legend":"\u003cp\u003eCumulative distribution of discrimination performance across the biomarker portfolio. Empirical cumulative distribution function (ECDF) of AUC values for the 46 validated voice-derived biomarkers. 
Dashed vertical lines indicate selected AUC thresholds used to summarise cumulative performance.\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-8810881/v1/1b04409a3c5020406f207a1a.png"},{"id":102892408,"identity":"55ef3779-bc75-4abf-ab23-08a3ee80f4e6","added_by":"auto","created_at":"2026-02-18 05:25:21","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":243852,"visible":true,"origin":"","legend":"\u003cp\u003eDistributional regularity of discrimination performance across the biomarker portfolio. (A) Kernel density estimate of AUC values for the 46 validated voice-derived biomarkers, with rug marks indicating individual biomarkers and a dashed vertical line denoting the median AUC.(B) Quantile–quantile plot comparing empirical AUC quantiles with theoretical normal quantiles.\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-8810881/v1/029348d26d934df8448a38de.png"},{"id":103404066,"identity":"218678e3-5921-4740-b205-cf50031819e3","added_by":"auto","created_at":"2026-02-25 09:44:55","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1095935,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8810881/v1/8fae8deb-84ad-4ee5-8f92-b23ac3bbb6dd.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Large-Scale Analytical Validation of Voice- Derived Digital Biomarkers Using Automated Speech Elicitation","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eThe human voice is a complex biological signal emitted by the human communication system through the tightly coordinated interaction of multiple physiological and psychological systems. 
Speech production is a result of controlled respiratory airflow, stable and finely regulated vocal fold oscillation, precise neuromotor coordination, and continuous modulation by higher, order cognitive and affective processes [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e, \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. Consequently, vocal production is an integrated expression of systemic state and not a discrete signal that is generated by a single organ or pathway.\u003c/p\u003e \u003cp\u003eDue to such an integrative character, changes in physical health, autonomic tone, mental workload, fatigue, or emotional regulation can become visible through the changes in acoustic structure, temporal organization, and vocal stability [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e, \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. Empirical research has demonstrated systematic voice changes in respiratory disorders such as chronic obstructive pulmonary disease and asthma, where altered lung mechanics disrupt speech breathing patterns and phonatory control [\u003cspan additionalcitationids=\"CR6\" citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e]. Parallel lines of investigation in psychophysiology and cognitive science have shown that autonomic activation, cognitive load, mental fatigue, stress, and affective state modulate speech rate, pause structure, pitch dynamics, and spectral variability [\u003cspan additionalcitationids=\"CR9 CR10\" citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThese converging literatures provide a strong conceptual rationale for the use of voice as a digital biomarker modality. 
However, despite increasing interest, much of the contemporary voice biomarker literature remains limited in translational relevance. Common limitations include small and homogeneous samples, reliance on laboratory-based or scripted speech tasks, opaque feature engineering, and narrow validation strategies focused on isolated performance metrics [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]. In particular, relatively few studies have examined analytical robustness and consistency across large portfolios of biomarkers evaluated at population scale, a requirement increasingly emphasised in methodological and regulatory guidance for digital health technologies [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e, \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eAnalytical validation refers to the demonstration that a measurement reliably and consistently captures the intended signal under defined conditions. For digital biomarkers, analytical validation precedes and underpins clinical validation, providing evidence that observed performance is stable, reproducible, and not driven by artefacts or selective optimisation [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]. The Voice Biota project was created to help with these issues, using standardized automated speech capture, population-based data capture, and ongoing performance monitoring of voice-derived markers (also referred to as 'voice biomarkers').\u003c/p\u003e \u003cp\u003eThis paper looks specifically at the analytic validity layer of the framework and is intended to provide a detailed quantitative description of how well voice biomarkers discriminate between diseased and non-diseased individuals; it does not make any disease-specific or clinically relevant claims. 
The specific goals of this analysis are to establish a rigorous empirical base for future stages of validation and/or translation, and to assess not only average performance (i.e., discrimination between diseased and non-diseased individuals), but also distributional shape, cumulative behaviour, and normality.\u003c/p\u003e"},{"header":"2. Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e2.1 Data Source and Automated Speech Elicitation\u003c/h2\u003e \u003cp\u003eThis paper studies voice data collected through standardized automated speech prompting protocols over real-world communication channels, including telephone-based conversations and asynchronous voice interfaces. Automated speech elicitation provides protocol-level uniformity while preserving natural speech, thereby addressing the tension between experimental control and ecological validity in voice research (Coravos et al., 2019).\u003c/p\u003e \u003cp\u003eThe prompts were designed to be as non-intrusive and task-neutral as possible, eliciting minimally prepared speech while keeping recordings comparable across individuals, contexts, and time. A single, consistent analytical pipeline was then applied to every speech interaction to obtain composite voice-derived biomarkers grounded in physiological and psychological theory. Automated prompting also supports scalability and auditability, two prerequisites for large-scale deployment of digital biomarkers.\u003c/p\u003e \u003cp\u003eThe analyzed dataset contained over 1.5\u0026nbsp;million voice-derived data points gathered from a broad range of populations and use cases. Forty-six biomarkers with complete and audited validation reports were selected for this study. 
Each biomarker was evaluated independently under the same analytical validation framework, which allows portfolio-level comparison without confounding differences in methodology.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e2.2 Analytical Validation Framework\u003c/h2\u003e \u003cp\u003eThe primary metric of discrimination performance in this analysis is the area under the receiver operating characteristic (ROC) curve (AUC). The AUC is a widely accepted, threshold-independent measure of separability between groups that is used frequently for evaluating digital biomarkers and health metrics derived from machine learning. The AUC is considered to be a robust measure in situations where class imbalance exists and does not rely upon arbitrary threshold decisions [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e, \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e].\u003c/p\u003e \u003cp\u003ePerformance metrics were derived from validated performance artefacts associated with audited validation conducted within the Voice Biota programme. This supports the traceability, transparency and reproducibility of all reported findings. Rather than relying on a single summary statistic, several complementary analyses were run in parallel to evaluate the overall performance characteristics of the biomarker collection.\u003c/p\u003e \u003cp\u003eDistributional analysis was used to examine the central tendency and spread of AUC values and to identify weak-performing outliers. Empirical cumulative distribution functions (ECDFs) were calculated to show the percentage of biomarkers meeting commonly used analytical acceptance thresholds. 
The overall shape of the performance distribution was examined with kernel density estimation, while quantile-quantile analysis was used to check normality and identify potential skewness or heavy tails.\u003c/p\u003e \u003cp\u003eTaken together, these techniques give a detailed view of analytical performance, capturing not only average performance but also its consistency across a large and varied portfolio.\u003c/p\u003e \u003c/div\u003e"},{"header":"3. Results","content":"\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e3.1 Distribution of Discrimination Performance\u003c/h2\u003e \u003cp\u003eThe 46 validated voice-derived biomarkers showed a narrow range of discrimination performance concentrated at the upper end of the AUC scale, as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. Most biomarkers had AUC values between roughly 0.87 and 0.93, which means that the discrimination performance of individual models differed little from one another. The lowest AUC recorded was 0.851 and the highest was 0.946, indicating that none of the biomarkers performed close to the chance level (AUC = 0.5).\u003c/p\u003e \u003cp\u003eDescriptive statistics of the total biomarker portfolio performance are given in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. The mean AUC across all biomarkers was 0.899, with a standard deviation of 0.029, indicating limited variation around the average performance level. The median AUC was 0.900, almost equal to the mean, which is consistent with a roughly symmetric distribution of AUC values. 
The interquartile range, from 0.871 (25th percentile) to 0.925 (75th percentile), further indicates that the middle half of the biomarkers had very similar discriminative ability.\u003c/p\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e shows a graphical summary of the AUC value distribution, combining a bar chart of AUC values with measures of central tendency and variability. The relatively small spread of data points, with no extreme values, suggests that discrimination performance varied little across the biomarker portfolio, rather than being driven by a few outliers.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eDescriptive statistics of AUC values for the 46 voice-derived biomarkers, including mean, standard deviation, minimum, maximum, and selected percentile values.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"2\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eStatistic\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAUC\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCount\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e46\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMean\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c2\"\u003e \u003cp\u003e0.899\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eStandard deviation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.029\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMinimum\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.851\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e25th percentile\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.871\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMedian\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.900\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e75th percentile\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.925\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMaximum\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.946\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003e3.2 Threshold and Cumulative Performance\u003c/h2\u003e \u003cp\u003eAn empirical cumulative distribution function (ECDF) was used to assess cumulative discrimination performance across the biomarker portfolio and to compare it with commonly used analytical acceptance thresholds for AUC. 
As shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, the ECDF indicates the percentage of biomarkers that met or exceeded each AUC threshold.\u003c/p\u003e \u003cp\u003eAll 46 biomarkers (100.0%) had an AUC of at least 0.85, 31 biomarkers (67.4%) had an AUC of at least 0.88, 23 biomarkers (50.0%) had an AUC of at least 0.90, and 13 biomarkers (28.3%) had an AUC of at least 0.92. The number and proportion of biomarkers meeting each threshold are presented in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e.\u003c/p\u003e \u003cp\u003eTogether, the ECDF and the tabulated counts indicate that high discrimination performance is spread across the portfolio rather than confined to a few high-performing models. Many models meet or exceed commonly used performance criteria, which suggests that performance is broadly distributed across the portfolio.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eNumber and proportion of biomarkers meeting or exceeding the selected AUC thresholds (\u0026ge;\u0026thinsp;0.85, \u0026ge;\u0026thinsp;0.88, \u0026ge;\u0026thinsp;0.90, and \u0026ge;\u0026thinsp;0.92).\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e 
\u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAUC threshold\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eBiomarkers\u0026thinsp;\u0026ge;\u0026thinsp;threshold (n)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eBiomarkers\u0026thinsp;\u0026ge;\u0026thinsp;threshold (%)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u0026ge;\u0026thinsp;0.85\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e46\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e100.0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u0026ge;\u0026thinsp;0.88\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e31\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e67.4\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u0026ge;\u0026thinsp;0.90\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e23\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e50.0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u0026ge;\u0026thinsp;0.92\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e13\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e28.3\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv 
id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003e3.3 Distributional Shape and Statistical Regularity\u003c/h2\u003e \u003cp\u003eKernel density estimation (KDE) was used to characterise the overall shape of the AUC distribution (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eA). The highest density of AUC values was found around 0.90. The distribution showed no evidence of multiple modes or marked skewness, and therefore no evidence that the biomarkers separate into distinct high- and low-performing groups of models.\u003c/p\u003e \u003cp\u003eThe distribution was also approximately normal as judged from a quantile-quantile (Q-Q) plot (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eB). Empirical quantiles of the AUC distribution closely followed the theoretical 45-degree reference line over most of the distribution; only small deviations from normality were present at the very low and high ends of the distribution. 
The quantiles reported in Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e indicate that the AUC at the 5th percentile is 0.853 and at the 95th percentile is 0.943.\u003c/p\u003e \u003cp\u003eOverall, the KDE and Q-Q plots together indicate that the discrimination performance of the biomarker portfolio exhibited very little skew and no evidence of heavy tails.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eSelected quantile values of AUC across the 46 validated voice-derived biomarkers.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"2\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eQuantile\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAUC\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e5th percentile\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.853\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e10th percentile\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.862\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e25th percentile\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c2\"\u003e \u003cp\u003e0.871\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMedian (50th percentile)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.900\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e75th percentile\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.925\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e90th percentile\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.937\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e95th percentile\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.943\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003e3.4 Ranking and Portfolio Structure\u003c/h2\u003e \u003cp\u003eTo examine performance differences among individual biomarkers, the AUC values were ranked from largest to smallest. The fifteen biomarkers with the highest AUC values are presented in Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e, with AUC values between 0.918 and 0.946.\u003c/p\u003e \u003cp\u003eDifferences in AUC between adjacent ranks were small, and performance decreased gently throughout the ranking. There were no sudden cut-offs or jumps that would suggest the existence of separate performance tiers. 
Rather, the ranked AUC values represented a continuous distribution, indicating a gradual change of discrimination performance among the biomarkers.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eBiomarkers ranked by AUC values, listing the highest-performing models in descending order.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRank\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eBiomarker\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAUC\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCognitive Load\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.946\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFormant Dispersion\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.944\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDepression Indicators\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.944\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNeural Speech Coherence\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.943\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRespiratory Speech Efficiency\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.937\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMFCC Dynamics\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.937\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSpeech Rate\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.935\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eVocal Tremor\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.933\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003e9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eComposite Respiratory Clarity Index\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.929\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eReadiness and Fatigue Assessment\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.927\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e11\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSpectral Flux\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.926\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e12\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRespiratory Patterns\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.926\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e13\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSpeech Clarity\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.923\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e14\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCognitive\u0026ndash;Autonomic Dissociation Index\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.918\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e 
\u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e15\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePhonation Stability Index\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.918\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"4. Discussion","content":"\u003cp\u003eIn this study, we performed an analytical validation of voice-derived digital biomarkers by analysing portfolio-level performance metrics collected through an automated and standardised process of speech assessment. The validation examined distributional and cumulative performance across the portfolio, rather than relying on the results of isolated models or on comparisons between randomised and non-randomised datasets. This approach addresses a limitation of peer-reviewed research, which has primarily validated voice-derived biomarkers through proof-of-concept studies in limited or highly controlled experimental contexts.\u003c/p\u003e \u003cp\u003eOne key observation was the high level of discrimination performance across all 46 voice-derived digital biomarkers included in the analysis. 
In contrast to most health-related applications of machine learning, which are characterised by high levels of variability in performance and inconsistent long-tailed distributions of \"weak\" models (Topol, 2019; Beam \u0026amp; Kohane, 2018), the results of this analysis showed that AUC scores for all voice-derived digital biomarkers were within a narrow range around 0.90. These results suggest that the limited variability in the performance of voice-derived digital biomarkers reflects the stable nature of the measurement and modelling framework, rather than the selective optimisation of any individual construct.\u003c/p\u003e \u003cp\u003eThe portfolio-level scope of these results distinguishes the present study from previous voice biomarker research, which usually reports the performance of only a few features or task-specific models (Low et al., 2020; Schuller et al., 2018).\u003c/p\u003e \u003cp\u003eAlthough such studies have shown that the voice can reveal changes in physiological and psychological conditions, they usually provide little information on how performance behaves across many constructs assessed at scale.\u003c/p\u003e \u003cp\u003eBy contrast, the distributional analyses here provide evidence that discrimination ability remains consistent across a diverse set of biomarkers, supporting the potential for scalable deployment.\u003c/p\u003e \u003cp\u003eCumulative performance analyses also put these results into perspective by showing that a large proportion of biomarkers surpass the commonly used analytical acceptance thresholds.\u003c/p\u003e \u003cp\u003eFrom a translational perspective, this cumulative view is important because regulatory and methodological guidance increasingly emphasises consistency and robustness across measurement portfolios rather than the peak performance of a single model [\u003cspan 
citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e, \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eBeyond summary metrics, the statistical regularity of the observed performance distributions provides further reassurance about analytical stability. The unimodal and symmetric AUC distribution, together with the lack of heavy tails and extreme skewness, contrasts with the performance distributions seen in typical high-dimensional ML systems, which show a wide performance spread due to data heterogeneity or model overfitting [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]. This regularity is also consistent with predictable behaviour in deployment, a critical attribute for successful implementation of digital health technologies [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eIn addition to the summary statistics, the ranked performance of individual biomarkers showed smooth rather than discontinuous changes in performance. This regularity suggests that differences in discrimination performance are likely due to intrinsic properties, such as the complexity of the underlying signal, the definition of the construct, or the degree of physiological coupling, rather than to differences in testing methodology. Notably, even the lowest-ranked biomarkers performed well above chance level (minimum AUC = 0.851), which supports the overall analytical strength of the framework.\u003c/p\u003e \u003cp\u003eOverall, these results position automated speech elicitation, supported by audited analytical pipelines, as a solid basis for large-scale development of voice biomarkers at the population level. 
The present study, however, does not address clinical validity, longitudinal sensitivity, or disease specificity; rather, it lays down an essential analytical baseline for those evaluations. This staged approach is in line with developing regulatory frameworks that treat analytical validation as a precondition for later clinical and regulatory assessment of digital biomarkers (Coravos et al., 2019; Goldsack et al., 2020; FDA, 2023).\u003c/p\u003e"},{"header":"5. Conclusion","content":"\u003cp\u003eA large-scale analytical validation of voice-derived digital biomarkers was performed on a dataset consisting of over 1.5\u0026nbsp;million voice-derived data points and 46 independently validated composite indices. The study\u0026rsquo;s analysis of the distribution of discrimination performance, cumulative threshold behaviour, and statistical regularity indicates that voice-derived digital biomarkers can achieve high and consistent analytical performance at a population level.\u003c/p\u003e \u003cp\u003eMultiple analytical perspectives, including central tendency, dispersion, threshold exceedance, and distributional shape, characterise performance across the biomarker portfolio more fully than any single measure. Overall, the performance of the entire portfolio of voice-derived biomarkers is stable and unimodal, with no weak-performing or pathological outliers, which suggests that analytical performance is a portfolio-level characteristic of the Voice Biota framework and not an artefact of selective optimisation.\u003c/p\u003e \u003cp\u003eThese data also provide a solid analytical foundation for further validation from a translational perspective. 
Clinical validity, longitudinal sensitivity, and physiological specificity still require further validation and investigation, but analytical validation is a critical initial step for regulatory evaluation and real-world use. The methods used in this study illustrate how standardised automated speech tasks, combined with audited analytical pipelines, can support scalable, population-level digital biomarker development.\u003c/p\u003e "},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003e*\u003c/strong\u003eDerived data supporting the findings of this study are available from the corresponding author on request.\u003c/p\u003e\n\u003ch2\u003eCorresponding Author\u003c/h2\u003e\n\u003cp\u003eCorresponding author: Dr Adrian Attard Trevisan\u003c/p\u003e\n\u003cp\u003eEmail:
[email protected]\u003c/p\u003e\n\u003ch2\u003eEthics Declaration\u003c/h2\u003e\n\u003cp\u003eEthics declaration: not applicable.\u003c/p\u003e\n\u003ch2\u003eEthics, Consent to Participate, and Consent to Publish\u003c/h2\u003e\n\u003cp\u003eEthics, Consent to Participate, and Consent to Publish declarations: not applicable.\u003c/p\u003e\n\u003ch2\u003eFunding\u003c/h2\u003e\n\u003cp\u003eThis research received no external funding.\u003c/p\u003e\n\u003ch2\u003eData Availability\u003c/h2\u003e\n\u003cp\u003eDerived data supporting the findings of this study are available from the corresponding author on reasonable request. Raw voice recordings are not publicly available due to the proprietary and ongoing validation framework of the Voice Biota programme.\u003c/p\u003e\n\u003ch2\u003eCompeting Interests\u003c/h2\u003e\n\u003cp\u003eDr Adrian Attard Trevisan developed the Voice Biota programme. The remaining authors declare that they have no competing interests.\u003c/p\u003e\n\u003ch2\u003eAuthor Contributions\u003c/h2\u003e\n\u003cp\u003eDr Adrian Attard Trevisan conceived the study, designed the analytical framework and drafted the manuscript. Andrea Sprio contributed to methodological refinement, data interpretation, and critical manuscript revision. Frederick Robert Carrick contributed to conceptual development, clinical framing, and manuscript review. All authors read and approved the final manuscript.\u003c/p\u003e"},{"header":"References","content":" \u003col\u003e\n\u003cli\u003eHixon TJ, Weismer G, Hoit JD. Preclinical speech science. Plural Publishing; 2008.\u003c/li\u003e\n\u003cli\u003eSapienza CM, Hoffman-Ruddy B. Voice disorders. 3rd ed. Plural Publishing; 2021.\u003c/li\u003e\n\u003cli\u003eScherer KR. Vocal communication of emotion. Speech Communication. 2003;40:227\u0026ndash;256.\u003c/li\u003e\n\u003cli\u003eKreibig SD. Autonomic nervous system activity in emotion. Biological Psychology. 
2010;84:394\u0026ndash;421.\u003c/li\u003e\n\u003cli\u003eMeek PM, Carding PN, Howard RS. Vocal function in chronic obstructive pulmonary disease. Thorax. 2001;56:334\u0026ndash;339.\u003c/li\u003e\n\u003cli\u003eHoit JD, Plassman BL, Lansing RW, Hixon TJ. Abdominal muscle activity during speech production. Journal of Applied Physiology. 2003;94:665\u0026ndash;673.\u003c/li\u003e\n\u003cli\u003eCelli BR, Wedzicha JA. Update on clinical aspects of chronic obstructive pulmonary disease. New England Journal of Medicine. 2019;381:1257\u0026ndash;1266.\u003c/li\u003e\n\u003cli\u003eKahneman D. Attention and effort. Prentice Hall; 1973.\u003c/li\u003e\n\u003cli\u003ePorges SW. The polyvagal perspective. Biological Psychology. 2007;74:116\u0026ndash;143.\u003c/li\u003e\n\u003cli\u003eFairclough SH, Houston K. A metabolic measure of mental effort. Biological Psychology. 2004;66:177\u0026ndash;190.\u003c/li\u003e\n\u003cli\u003eCummins N, Schuller B, Krajewski J. A review of depression and suicide risk assessment using speech analysis. Speech Communication. 2015;71:10\u0026ndash;49.\u003c/li\u003e\n\u003cli\u003eLow DM, Bentley KH, Ghosh SS. Automated assessment of psychiatric disorders using speech: A systematic review. Biological Psychiatry. 2020;88:42\u0026ndash;50.\u003c/li\u003e\n\u003cli\u003eFood and Drug Administration. Digital health technologies for remote data acquisition in clinical investigations. 2023. Available from: https://www.fda.gov/regulatory-information/search-fda-guidance-documents\u003c/li\u003e\n\u003cli\u003eGoldsack JC, Coravos A, Bakker JP, et al. Verification, analytical validation, and clinical validation of digital biomarkers. npj Digital Medicine. 2020;3:55.\u003c/li\u003e\n\u003cli\u003eCoravos A, Khozin S, Mandl KD. Developing and adopting safe and effective digital biomarkers to improve patient outcomes. npj Digital Medicine. 2019;2:14.\u003c/li\u003e\n\u003cli\u003eBeam AL, Kohane IS. Big data and machine learning in health care. JAMA. 
2018.\u003c/li\u003e\n\u003cli\u003eTopol E. High-performance medicine. Nature Medicine. 2019.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Analytical validation, voice biomarkers, digital biomarkers, speech science, large-scale evaluation, automated speech elicitation","lastPublishedDoi":"10.21203/rs.3.rs-8810881/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8810881/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eVoice-based digital biomarkers offer a non-invasive and inherently scalable approach to monitoring physiological and psychological state. A substantial body of foundational work in speech science, respiratory physiology, and psychophysiology demonstrates that vocal production systematically reflects underlying biological processes, including respiratory mechanics, autonomic regulation, cognitive demand, and affective state. Despite this strong theoretical basis, quantitative analytical validation of voice-derived biomarkers at population scale remains limited, particularly with respect to consistency, robustness, and distributional behaviour across large biomarker portfolios.\u003c/p\u003e \u003cp\u003eThis study presents a large-scale analytical validation of the Voice Biota programme using audited model performance outputs derived from standardised automated speech elicitation. Across 46 independently composite voice biomarkers and more than 1.5\u0026nbsp;million voice-derived data points, discrimination performance was assessed using the area under the receiver operating characteristic curve (AUC). 
Across all biomarkers, performance was consistently high (mean AUC\u0026thinsp;=\u0026thinsp;0.899, SD\u0026thinsp;=\u0026thinsp;0.029; range 0.851\u0026ndash;0.946), with all biomarkers exceeding predefined analytical acceptance thresholds commonly adopted in digital biomarker evaluation.\u003c/p\u003e \u003cp\u003eIn addition to summary statistics, we performed a full set of distributional, cumulative, and normality analyses to characterise performance across the entire biomarker portfolio. These analyses confirmed stable, unimodal performance with no evidence of pathological skewness, heavy-tailed distributions, or weak outliers. Taken together, these results provide evidence of the analytical validity and potential scalability of voice-derived biomarkers and establish a robust empirical basis for subsequent clinical, longitudinal, and regulatory validation studies.\u003c/p\u003e","manuscriptTitle":"Large-Scale Analytical Validation of Voice- Derived Digital Biomarkers Using Automated Speech Elicitation","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-02-18 05:24:07","doi":"10.21203/rs.3.rs-8810881/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"6d3abf9e-7ce7-4876-9ece-21cf6d187726","owner":[],"postedDate":"February 18th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2026-02-25T09:44:20+00:00","versionOfRecord":[],"versionCreatedAt":"2026-02-18 05:24:07","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8810881","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8810881","identity":"rs-8810881","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}