Pitfalls in using ML to predict cognitive function performance

doi:10.21203/rs.3.rs-4745684/v1

2024 · doi:10.21203/rs.3.rs-4745684/v1

preprint OA: closed CC-BY-4.0

Gianna Kuhles, Sami Hamdan, Stefan Heim, Simon Eickhoff, Kaustubh R. Patil, and 2 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-4745684/v1. This work is licensed under a CC BY 4.0 License. Status: Published. Journal publication published 29 Oct, 2025 (published version in Scientific Reports). Preprint version 1.

Abstract

Machine learning analyses are widely used for predicting cognitive abilities, yet there are pitfalls that need to be considered during their implementation and interpretation of the results. Hence, the present study aimed at drawing attention to the risks of erroneous conclusions incurred by confounding variables illustrated by a case example predicting executive function performance by prosodic features. Healthy participants (n = 231) performed speech tasks and EF tests.
From 264 prosodic features, we predicted 66 EF performance variables, controlling for confounding effects of age, sex, and education. A reasonable model fit was apparently achieved for EF variables of the Trail Making Test. However, in-depth analyses revealed indications of confound leakage, leading to inflated prediction accuracies, due to a strong relationship between confounds and targets. These findings highlight the need to control confounding variables in ML pipelines and caution against potential pitfalls in ML predictions.

Health sciences/Biomarkers; Biological sciences/Neuroscience; Biological sciences/Neuroscience/Computational neuroscience; Biological sciences/Psychology; Biological sciences/Psychology/Human behaviour

Introduction

Prediction of cognitive performance is a central goal in neuroscience and related areas of research. Predicting cognitive performance is relevant for several reasons. Firstly, it enables the identification of individuals who may be at risk of cognitive decline or neurodegenerative diseases at an early stage [ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ] [ 6 ]. This, in turn, allows for preventative measures and early treatment. Secondly, predicting cognitive performance can help us understand the underlying mechanisms of cognitive function and identify potential biomarkers for cognitive abilities [ 7 ] [ 8 ]. Thirdly, it can aid in the development of personalised training programs based on an individual's cognitive capabilities [ 9 ].

With the rising number of variables potentially related to cognitive performance, methods for predicting cognitive functions also increase in complexity. Machine learning (ML) offers a way to study individual differences by inspecting many different possible influencing factors. ML is a field of artificial intelligence in which models are trained on data, allowing them to uncover intricate relationships and improve over time.
It involves advanced statistical algorithms, which learn patterns from feature-target data with the aim to generalise to previously unseen data [ 10 ]. Such methods are of practical use for exploratory research in various fields because unknown relationships among a large number of variables, linear and above all non-linear, can be inspected easily and quickly. ML approaches are gaining more importance as they are able to predict the target value of an unseen individual using their features. For instance, when impaired prosodic abilities are related to a disorder, an ML model could be useful for early detection and diagnosis.

However, ML can be problematic when applied inappropriately, leading to inaccurate results and misleading conclusions. One of the main challenges in ML relates to preventing models from displaying prediction values that are overly high in comparison to their actual predictive power [ 10 ] [ 11 ]. Barring other reasons, this is usually the case when information that should be kept strictly separate is unintentionally fed into the ML pipeline. This process is referred to as leakage [ 11 ] [ 12 ]. One form of leakage is the incorporation of information from confounding variables through the procedure of confound removal, i.e. confound leakage [ 13 ]. Confounding variables share variance with both the dependent (target) and the independent (explanatory or predictive) variable. This means that they are associated with both variables in the analysis and can potentially have an impact on the relationship between them. It is desirable to remove the confounding information such that the model's predictions are not influenced by it. However, it is plausible that the standard confound removal procedure using linear regression might inadvertently introduce confounding information rather than removing it, causing confound leakage [ 14 ].
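To make the standard procedure mentioned above concrete, here is a minimal sketch of linear confound removal. It is not the pipeline used in this study: it assumes a single numeric confound (age), a single feature, hypothetical toy values, and that the regression is fitted on the training fold only and then applied to the test fold. All function and variable names are illustrative.

```python
from statistics import fmean


def fit_confound_model(confound, feature):
    """Ordinary least squares for feature ~ intercept + slope * confound."""
    c_mean, f_mean = fmean(confound), fmean(feature)
    sxx = sum((c - c_mean) ** 2 for c in confound)
    sxy = sum((c - c_mean) * (f - f_mean) for c, f in zip(confound, feature))
    slope = sxy / sxx
    intercept = f_mean - slope * c_mean
    return slope, intercept


def remove_confound(confound, feature, slope, intercept):
    """Residuals: the part of the feature not linearly explained by the confound."""
    return [f - (intercept + slope * c) for c, f in zip(confound, feature)]


def covariance(x, y):
    """Population covariance, used here only to check the residuals."""
    x_mean, y_mean = fmean(x), fmean(y)
    return sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y)) / len(x)


# Hypothetical toy data: age as the confound and one z-scored prosodic feature.
age_train = [22.0, 30.0, 41.0, 48.0, 55.0]
feature_train = [0.9, 0.4, 0.1, -0.5, -0.8]
age_test = [25.0, 50.0]
feature_test = [0.7, -0.4]

# Assumption: fit on the training fold only, then reuse the coefficients for the test fold.
slope, intercept = fit_confound_model(age_train, feature_train)
resid_train = remove_confound(age_train, feature_train, slope, intercept)
resid_test = remove_confound(age_test, feature_test, slope, intercept)

# By construction, the training residuals have no linear association with age.
print(abs(covariance(age_train, resid_train)) < 1e-9)
```

A vanishing linear association in the training residuals does not by itself rule out the leakage discussed above; it only shows what the linear step guarantees.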
In the following, we demonstrate this issue using a specific example from our research, which aimed to predict cognitive performance based on prosodic variables. As executive functions are crucial cognitive capabilities in everyday human life and constitute a basic requirement for speech and communication [ 15 ] [ 16 ] [ 17 ], we focused on predicting executive function performance in this particular application.

The term “executive functions” represents a heterogeneous set of distinguishable processes [ 18 ]. According to Ward, executive functions represent complex abilities with which people optimise their performance in situations that require the organisation of a series of cognitive processes [ 19 ]. In spite of the lack of a universal definition of executive function performance and its subordinated domains [ 20 ], the grouping of working memory, inhibition, and cognitive flexibility [ 21 ] [ 22 ] is still the most popular [ 23 ]. Executive functions are of great relevance in relation to various pathologies, as their impairment can be observed in numerous neurological and psychiatric disorders [ 24 ] [ 25 ] [ 26 ] [ 27 ] [ 28 ]. For this reason, their investigation, both in healthy people and in different patient groups, constitutes a central component of research and diagnostics.

Despite great efforts, examination and characterisation of executive functions have proven to be extremely difficult [ 29 ]. Not only is data acquisition time-consuming and costly, but the results are also dependent on subjective application factors, such as the qualification of the test conductor and the current condition of the person being tested. In addition, the measured performance depends on the individual's motivation.
In the context of testing executive functions (EF), we can take advantage of what is known about the relationship between executive functions and language: it is assumed that executive functions act as a cognitive control mechanism for the syntactic processing of sentences [ 30 ]. Moreover, a large variety of disorders in communication ability are associated with impaired executive functions, including dysarthria, aphasia, language pragmatic disturbances, and verbal reasoning impairments [ 15 ]. In addition to the symptoms shown on the linguistic levels of phonetics and phonology, morphology and syntax, semantics and pragmatics, the described aspects of the impaired language function also relate to the level of prosody.

Prosody can be defined as the totality of all acoustically perceptible forms of expression of speech [ 31 ]. Since prosody belongs to the realm of the phonetic structures of language and is not tied to the categories of lexeme, morpheme or phoneme, prosodic subfunctions belong to the class of suprasegmentals of language. Although several classifications of prosody have been proposed, four main domains can be distinguished: frequency-related parameters, energy/amplitude-related parameters, spectral parameters, and temporal parameters [ 32 ]. Fundamental frequency refers to the F0 frequency and is described as the middle pitch. Intensity of speech relates to loudness, whereas duration is defined as the quantity of speech [ 31 ].

Against the background of current literature regarding the connections between linguistic and cognitive processes, methods can be developed to draw conclusions about underlying cognitive performance with the help of speech variables. In particular, the analysis of prosodic features from speech samples provides advantages, as it offers a high external validity as well as time and cost efficiency compared to classical diagnostic procedures [ 33 ] [ 34 ] [ 35 ].
This is why procedures for objective speech analysis are gaining increasing popularity and are already in use in clinical diagnostics [ 36 ] [ 37 ].

Studies suggest that prosodic impairments may occur due to immature executive functions [ 38 ]. In addition, earlier patient studies have already shown a connection between right-hemispheric frontal brain damage and impairments of prosody [ 39 ] [ 40 ]. Recent studies also demonstrated a relation between suprasegmental disorders and impaired executive functions in foreign accent syndrome [ 41 ] [ 42 ]. Moreover, impaired working memory and impairment in prosody were observed in Parkinson’s Disease [ 43 ], while reduced performance of fundamental frequency in connection with executive function damage was shown in frontotemporal dementia [ 44 ]. Furthermore, a link between prosody and divided attention, working memory and inhibition was shown in Autism Spectrum Disorder [ 45 ]. There is also clinical evidence that formant frequencies and Mel Frequency Cepstral Coefficients are associated with depressive disorders and potentially act as a biomarker [ 46 ] [ 47 ] [ 48 ] [ 49 ]. A relationship between prosodic performance, specifically disfluencies, and inhibition in healthy participants was also reported by Engelhardt and colleagues [ 50 ].

In summary, a link between deficient executive subfunctions and impaired prosodic skills is reported in different pathologies [ 38 ] [ 37 ] [ 48 ] [ 36 ]. These associations can be utilised to predict cognitive functions. However, these findings are primarily based on patient studies. Therefore, our initial aim was to test whether the reported correlations could predict cognitive performance in a healthy sample.

Methods

Participants

Participants were recruited at the Forschungszentrum Jülich and through social networks. Testing took place at the Forschungszentrum Jülich (Germany) in 2018.
Each test session lasted between 150 and 180 minutes, depending on the participants’ speed and the duration of the instructions. 231 healthy participants without a diagnosis of neurological or mental impairment were included in the present study (138 female, 93 male). The mean age of the sample at testing time was 35.2 years (standard deviation = 11.1, minimum = 20, maximum = 55). All participants were monolingual German speakers. The sample varied in level of education: secondary school (n = 8), professional school/job training (n = 62), high school with a university-entrance diploma (n = 69), and university with a university degree (n = 92). All participants were paid an expense allowance of 50 EUR. The study was approved by the ethics committee of Heinrich Heine University Düsseldorf under the registration number 2017064341. Informed consent was obtained from all participants. All experiments were performed in accordance with relevant guidelines and regulations. Part of the data used in this study is available upon request, as not all participants consented to data sharing [ 51 ].

Design

The test sessions were divided into two parts: Firstly, the executive function performance of the participants was assessed. Secondly, spontaneous speech performance was recorded in order to extract prosodic features from speech samples. Executive function performance was assessed with the computerized test batteries Vienna Testsystem [ 52 ] and Psytoolkit [ 53 ], which contain common standard tests for measuring executive function performance. In total, 66 variables from 14 different assessments of executive function performance were collected: Trail Making Test (TMT) [ 54 ], Raven’s Standard Progressive Matrices [ 55 ], Wisconsin Card Sorting Test [ 56 ], Tower of London [ 57 ], and Cued Task Switching [ 58 ] are related to cognitive flexibility.
The N-back Non-verbal [ 59 ], Non-verbal Learning Test [ 60 ], and Corsi Block Tapping Test [ 61 ] were used to assess working memory. Inhibition was tested with the Stop Signal Task [ 62 ], Simon Task [ 63 ], and Stroop Test [ 64 ]. The Divided Attention Test [ 65 ], Spatial Attention Test [ 65 ], and Mackworth Clock Test [ 66 ] were used to measure divided and spatial attention as well as vigilance. An overview of the assessed tests and the exact variables from these is shown in Table 1 (see Appendix A for the descriptions of the tests).

Spontaneous speech was tested based on a collection of three different speech samples per participant. Firstly, the participants were asked to describe the Cookie Theft Picture [ 67 ] within 90 seconds in as much detail as possible. Secondly, the participants were asked to talk about what they had watched on television or what kind of book they had read the day before. Thirdly, the participants were asked to describe what their favourite holiday trip would look like if money and time were no limiting factors. For the narrative tasks, retelling a story and fictional storytelling, they were asked to talk for five minutes. Participants conducted all tests via a laptop, an external keyboard, and a headset-microphone.

Table 1 Assessed executive function variables adapted from Amunts et al. [ 68 ] [ 69 ]

Test | Abbreviation | Variables
COGNITIVE FLEXIBILITY
Trail Making Test | TMT | Processing time part A; Processing time part B; Difference part B-A [seconds]; Quotient B/A; Errors part A; Errors part B
Raven’s Standard Progressive Matrices | SPM | Correct items; Processing time
Wisconsin Card Sorting Test | WCST | Number of errors; Number of perseveration errors; Number of errors (non perseveration); Timeouts
Tower of London | TOL | Planning ability; Number of correct responses; Changed his/her mind, self-correction; Choice of wrong pole; Choice of blocked pole; Choice of impossible position
Cued Task Switching | SWITCH | Number of errors; Timeouts; Errors of items which are incongruent
WORKING MEMORY
N-back Non-Verbal | NBN | Correct items; Number of commission errors; Number of errors; Mean reaction time of correct items [seconds]; Mean reaction time of errors [seconds]
Non-Verbal Learning Test | NVLT | Sum of correct responses; Sum of false responses; Sum of difference between correct minus false responses; Processing time
Corsi Block Tapping Test | CORSI | Block span; Correct items; False items; Missed items; Sequence errors
INHIBITION
Stop Signal Task | INHIB | Reaction time [seconds]; Mean stop signal delay [seconds]; Stop signal reaction time [seconds]; Number of commission errors; Number of omission errors
Simon Task | SIMON | Number of errors in compatible items; Number of errors in incompatible items
Stroop Test | STROOP | Reading interference [seconds]; Naming interference [seconds]; Interference-difference [seconds]; Number of false reactions (reading-baseline); Number of false reactions (naming-baseline); Number of false reactions (reading-interference); Number of false reactions (naming-interference); Processing time
ATTENTION / VIGILANCE
Divided Attention Test | WAF-G | Number of missed items (unimodal visual); Number of false alarms (unimodal visual); Mean reaction time (unimodal visual) [ms]; Number of missed items (crossmodal visual/auditive); Number of false alarms (crossmodal visual/auditive); Mean reaction time (crossmodal) [ms]
Spatial Attention Test | WAF-R | Mean reaction time (unannounced items) [ms]; Number of missed items (correct announced items); Mean reaction time (correct announced items) [ms]; Number of missed items (wrong announced items); Mean reaction time (wrong announced items) [ms]; Mean reaction time (short SOA) [ms]; Mean reaction time (long SOA) [ms]; Number of errors
Mackworth Clock Test | MACK | Number of missed jumps; Number of false alarms

Feature extraction

To generate the prosodic features from the audio files collected in the speech tasks, the toolbox openSMILE (open-Source Media Interpretation by Large feature-space Extraction) [ 70 ], version 2.1.3, was used to extract the suprasegmental parameters. Although the extraction and analysis of prosodic parameters for research purposes have been done for decades in various fields and are currently a topic of great interest in the context of speech biomarkers in different pathologies [ 33 ], a lack of standardisation and thus comparability has been observed [ 70 ]. The benefit of using the open-source toolbox openSMILE is its standardised automatic computation of the prosodic features, resulting in a fixed feature set. It offers the extraction of prosodic features within a set that corresponds to the main categories frequency (representing the fundamental frequency), energy/amplitude (representing the intensity), spectral parameters, and temporal parameters (representing the duration). The choice of parameters was guided by the criteria of potentially indexing physiological changes in voice production and their theoretical significance in previous literature [ 32 ]. The feature set extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) was chosen, which contains 88 prosodic features. In order to keep the extraction comparable, the first 90 seconds from each audio file were chosen as input.
As there are three audio samples per participant, a total of 264 prosodic features (3 × 88) were generated per participant. All features were z-scored, i.e. the mean value was removed, and the variance was scaled to one unit. An overview of the extracted features and their descriptions, as well as the corresponding prosodic category, is shown in Table 2.

Table 2 Grouped listing of the prosodic features extracted by the toolbox openSMILE [ 70 ].

Prosodic feature | Variables | Count | Description
FREQUENCY RELATED PARAMETERS
F0semitone | Mean, standard deviation, percentiles, range, rising slope, falling slope | 10 | Pitch, logarithmic F0 on a semitone frequency scale, starting at 27.5 Hz (semitone 0)
Jitter | Mean, standard deviation | 2 | Deviations in individual consecutive F0 period lengths
F1–3 frequency & bandwidth | Mean, standard deviation | 12 | Centre frequency of the 1st, 2nd, and 3rd formant; bandwidth of formants 1, 2, and 3
ENERGY / AMPLITUDE RELATED PARAMETERS
Loudness | Mean, standard deviation, percentiles, range, rising slope, falling slope | 10 | Estimation of perceived signal intensity from an auditory spectrum
Shimmer | Mean, standard deviation | 2 | Difference in peak amplitudes of consecutive F0 periods
Harmonics to Noise Ratio | Mean, standard deviation | 2 | Relation of energy in harmonic components to energy in noise-like components
SPECTRAL PARAMETERS
Spectral Flux | Mean, standard deviation | 3 | Difference of the spectra of two consecutive frames
Mel Frequency Cepstral Coefficients 1–4 | Mean, standard deviation | 16 | Perceived pitch of the frequency spectrum
Harmonic differences | Mean, standard deviation | 4 | Ratio of energy of the first F0 harmonic (H1) to the energy of the second F0 harmonic (H2) / to the energy of the highest harmonic in the third formant range (A3)
Alpha Ratio | Mean, standard deviation | 3 | Ratio of summed energy from 50–1000 Hz and 1–5 kHz
Hammarberg Index | Mean, standard deviation | 3 | Ratio of the strongest energy peak in the 0–2 kHz region to the strongest peak in the 2–5 kHz region
Spectral Slopes | Mean, standard deviation | 6 | Linear regression slope of the logarithmic power spectrum in the specified bands
F1–3 Energy | Mean, standard deviation | 6 | Formant 1, 2, and 3 relative energy
TEMPORAL PARAMETERS
Loudness peaks per second | - | 1 | Number of volume highlights per second
Voiced segments | Mean, standard deviation | 3 | Amount of continuously voiced regions
Unvoiced segments | Mean, standard deviation | 2 | Amount of the continuously unvoiced regions
Equivalent Sound Level | - | 1 | Sound pressure level which has the same total energy as the actual fluctuating noise

Machine learning and statistical analyses

Data management and analysis were performed using Python 3 [ 71 ]. An ML approach was applied to the data with the machine learning library JuLearn [ 72 ]. The 264 extracted prosodic feature variables were specified as features and the 66 executive function variables as targets. The initial goal of our analyses was to predict each of the executive function targets using all of the prosodic features.

Firstly, cross-validation was used to determine the model performance. In cross-validation, the data set is randomly partitioned into equally sized folds. All folds except for one are used for training the model. The held-out fold is then used to determine the trained model’s performance on unseen data. This process is repeated so that each fold serves once as the validation fold. Then, the average of the validation performances is calculated [ 73 ]. Cross-validation was applied with ten folds (Fig. 1). Since all of the prosodic features were used to predict each of the 66 targets, 66 independent cross-validation analyses were performed. In order to keep the folds balanced, stratification by target was implemented into the cross-validation pipeline, meaning that the different folds approximately followed the same distribution of the respective target [ 14 ].
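The idea of stratifying folds by a continuous target can be illustrated with a short sketch. This is one possible scheme (sort by target, then spread neighbouring samples over different folds), written in plain Python for self-containedness; it is an assumption for illustration and not necessarily how the library used in the study implements stratification. The target values and function names are hypothetical.

```python
import random


def stratified_fold_ids(targets, n_folds=10, seed=0):
    """Assign a fold id to every sample so that each fold covers the target range."""
    rng = random.Random(seed)
    # Sample indices sorted by target value.
    order = sorted(range(len(targets)), key=lambda i: targets[i])
    fold_ids = [0] * len(targets)
    # Neighbouring samples (similar targets) are spread over different folds.
    for start in range(0, len(order), n_folds):
        block = order[start:start + n_folds]
        shuffled = list(range(n_folds))
        rng.shuffle(shuffled)
        for sample, fold in zip(block, shuffled):
            fold_ids[sample] = fold
    return fold_ids


def train_test_indices(fold_ids, test_fold):
    """Split sample indices into training and held-out indices for one fold."""
    train = [i for i, f in enumerate(fold_ids) if f != test_fold]
    test = [i for i, f in enumerate(fold_ids) if f == test_fold]
    return train, test


# Hypothetical continuous target for 231 participants (e.g. a processing time).
data_rng = random.Random(1)
targets = [data_rng.gauss(30.0, 10.0) for _ in range(231)]

fold_ids = stratified_fold_ids(targets, n_folds=10)
fold_sizes = [fold_ids.count(k) for k in range(10)]
for k in range(10):
    train, test = train_test_indices(fold_ids, k)
    # A model would be fitted on `train` and scored on `test` here.
    assert len(train) + len(test) == len(targets)
```

With 231 samples and ten folds, this scheme yields nine folds of 23 samples and one fold of 24, each drawing one sample from every block of neighbouring target values.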
Stratification can usually improve the success of model training by ensuring that the training and test data have similar distributions, which reduces the risk of bias or error in the evaluation of the model.

Knowing the influence of different demographic aspects on prosodic performance [ 74 ] [ 75 ], we regressed out the effects of the confounding variables sex, age, and education from the features with a linear regression model. This is standard practice since the goal is to shed light on the relationship between executive functions and prosodic features, independently of factors that are additionally related to the constructs [ 76 ] [ 12 ].

There are several regression models to choose from for usage in machine learning approaches. With his No Free Lunch theorem, Wolpert postulated that there is no general best machine learning algorithm for all predictive modeling problems such as classification and regression [ 77 ]. We chose the Random Forest Regressor as it has already been shown to predict executive functions in previous studies [ 69 ] [ 78 ] [ 79 ] and is commonly used [ 80 ] [ 81 ]. Random Forest is an ensemble estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and to control over-fitting. The decisions made by each tree carry equal weight, while the order of the decisions is random [ 82 ].

Following Poldrack et al. [ 83 ], accuracy was assessed by the coefficient of determination (R²) [ 84 ], which measures how well the regression predictions approximate the real data points. It can be interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variables. R² has an upper bound of 1, which indicates that the regression model perfectly predicts the data, while a value of 0 is obtained by a model that always predicts the mean of the observed data. On held-out data, R² can also be negative; in that case, the mean of the data alone fits the results better than the predicted values. Thus, negative values mean a very poor generalisation of the model.
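A small worked example may help with the interpretation of R², in particular its negative values. The sketch below uses the common definition R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot the sum of squared deviations of the observed values from their mean. The numbers are hypothetical and the helper name `r2_score` is chosen for illustration only.

```python
from statistics import fmean


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical held-out target values (e.g. processing times in seconds).
y_true = [20.0, 25.0, 30.0, 35.0, 40.0]

perfect = r2_score(y_true, y_true)                          # predictions equal the data
mean_only = r2_score(y_true, [30.0] * 5)                    # always predict the mean
reversed_order = r2_score(y_true, [40.0, 35.0, 30.0, 25.0, 20.0])  # worse than the mean

print(perfect, mean_only, reversed_order)  # 1.0 0.0 -3.0
```

In the third case SS_res is 1000 and SS_tot is 250, so R² = 1 − 4 = −3: the predictions are further from the data than the plain mean would be.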
For the cross-validation results, the mean of R² was calculated over 10 folds.

Secondly, we aimed to investigate which of the many prosodic features were most important, relative to all features, for training the model successfully. For this purpose, the feature importance was calculated by the impurity-based feature importance of Random Forest, also known as the Gini importance [ 85 ] [ 86 ]. When building a decision tree, features are selected at each node in order to divide the data into subsets that are as “pure” as possible with regard to the target variable. Gini Impurity measures how often a randomly chosen data point within a subset would be incorrectly labeled, reflecting the degree of disorder or “impurity” within the data. In contrast, Gini Importance assesses the overall decrease in node impurity resulting from splits based on a specific feature. It considers the probability of reaching each node and calculates the weighted reduction in impurity. Features with higher Gini importance are considered more important for predicting the target variable [ 85 ]. Feature importance was computed for the final estimator, as well as for each fold to estimate the variability of the importance. The feature importance scores sum to 1.

Thirdly, detailed analyses were conducted to examine the effects of confound removal and stratification. Here, we used other models such as Random Forest Regressor, ExtraTree Regressor, and Ridge Regression to regress out the confounds from the features in order to compare model performance depending on how the confounds were removed. Moreover, we employed an approximate permutation test approach, suggested by North and colleagues [ 87 ], to disentangle predictive information of the features from that of the confounds. To achieve this, we permuted each feature separately. Here, the association between features and targets is randomised, while the association between confounds and targets remains unchanged.
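The feature-wise permutation described above, together with the empirical p-value defined in the sentences that follow, can be sketched in a few lines. This is an illustrative outline with hypothetical values and helper names, not the code used in the study; the cross-validated model fit that would produce each null score is deliberately left out.

```python
import random


def permute_features(features, seed):
    """Shuffle every feature column independently across participants.

    Confounds and targets are not passed in, so their pairing stays intact.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(features), len(features[0])
    columns = []
    for j in range(n_cols):
        column = [features[i][j] for i in range(n_rows)]
        rng.shuffle(column)
        columns.append(column)
    return [[columns[j][i] for j in range(n_cols)] for i in range(n_rows)]


def empirical_p_value(observed_score, null_scores):
    """Proportion of permuted scores at least as large as the observed score."""
    hits = sum(score >= observed_score for score in null_scores)
    return hits / len(null_scores)


# Hypothetical feature matrix: 4 participants x 2 features.
features = [[0.1, 1.0], [0.2, 2.0], [0.3, 3.0], [0.4, 4.0]]
permuted = permute_features(features, seed=0)

# Hypothetical R² scores from five permuted runs and one observed R².
null_r2 = [-0.10, 0.02, 0.25, -0.05, 0.00]
p_value = empirical_p_value(0.20, null_r2)
print(p_value)  # 0.2, since one of five null scores reaches 0.20
```

In a full analysis, each permuted feature matrix would be passed through the same cross-validation pipeline as the original data, and the number of permutations would be far larger than five.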
10-fold cross-validation was performed for each permutation, and R² scores for 1000 permutations were used to construct an empirical null distribution, from which p-values were computed as the proportion of permuted R² scores greater than or equal to the R² score of the original non-permuted data. The threshold value for the two-tailed test was set to p = 0.05. Significant p-values indicate that predictive information stems from the features rather than the confounds alone.

Results

In cross-validation, the models were trained to predict each of the EF targets using all of the prosodic features. Regression of the confounding variables sex, age, and education, and stratification by target distribution were performed. Performance was evaluated using the coefficient of determination R² averaged over the 10 folds. Out of 66 executive function targets, 53 variables did not show positive R² values, indicating no predictive power for these targets using our modeling approach. Thirteen executive function targets showed positive R² values (Fig. 2). However, only two targets, TMT BTA (processing time part A) and TMT BTB (processing time part B), showed R² values > 0.1, representing a reasonable model fit. The described TMT variables belong to the cognitive flexibility domain. An overview of R² of all 66 EF targets can be found in the supplements.

Feature importance was calculated in order to determine which of the prosodic features were particularly important for successfully predicting the EF targets. Since we observed good prediction performance (R² > 0.1) for TMT BTA and TMT BTB, we only computed feature importance for these targets. Figures 3 and 4 present the ten most important features predicting the EF targets TMT BTA and TMT BTB (see Appendix B for the feature importance of all prosodic variables). The majority of features identified as most important belong to the spectral prosodic domain.
The most frequently appearing prosodic features were the Mel Frequency Cepstral Coefficients.

For the purpose of validation, we contrasted the effects of confound removal and stratification on the prediction performance for the targets TMT BTA and TMT BTB. To begin with, we compared the prediction results with the performance of the cross-validation model without regressing out the confounding variables sex, age, and education. These results indicated a worse prediction compared to the results with confound removal. Results are displayed in Fig. 5. For both TMT targets, prediction performance decreased when not removing the confounding variables. This is true for the stratified setup as well as for the non-stratified setup. Prediction performance also decreases when not stratifying the cross-validation folds.

To further explore the mechanism behind the decrease in prediction performance for the pipeline without confound removal, and to examine whether it is related to the specific confound removal model used, we exchanged the standard confound removal model Linear Regression with other models, such as Random Forest Regressor, ExtraTree Regressor and Ridge Regression. As demonstrated in Fig. 6, the prediction performance varies depending on the choice of the confound removal model. The pipelines with the confound removal models Linear Regression and Ridge Regression indicate higher R² values than the pipelines with the confound removal models Random Forest Regressor and ExtraTree Regressor.

Finally, we evaluated the conditions with different confound removal models by using permutation tests. For the EF target TMT BTA, with the cross-validation regressor Random Forest and the confound removal model Random Forest, the R² of 0.057 is significant (p = 0.001).
For the EF target TMT BTB, with the cross-validation regressor Random Forest and the confound removal model Ridge Regression, the R² of 0.196 is significant (p = 0.032), as is the R² of 0.205 (p = 0.017) with the cross-validation regressor Random Forest and the confound removal model Linear Regression. As shown in Table 3, all other positive prediction performances, measured by R² values, are not significant.

Table 3 Comparison of different confound removal models complemented by the p-value. CRmodel = Confound removal model, RF = Random Forest Regressor, ET = ExtraTree Regressor, LG = Linear Regression, Ridge = Ridge Regression, withoutCR = without confound removal, strat = stratified.

Condition | TMT BTA R² | TMT BTA p-value | TMT BTB R² | TMT BTB p-value
CRmodelRF | -0.142 | 0.009 | -0.343 | 0.161
CRmodelRF_strat | 0.057 | 0.001 | -0.171 | 0.069
CRmodelET | -0.172 | 0.001 | -0.156 | 0.001
CRmodelET_strat | -0.003 | 0.005 | -0.082 | 0.001
CRmodelRidge | 0.097 | 0.691 | 0.106 | 0.058
CRmodelRidge_strat | 0.262 | 0.188 | 0.196 | 0.032
CRmodelLG | 0.102 | 0.633 | 0.081 | 0.162
CRmodelLG_strat | 0.260 | 0.200 | 0.205 | 0.017

To summarise, we initially found a moderate predictive power of prosodic features for TMT BTA and TMT BTB. However, considering all results, there is a decrease in predictive power when not removing the confounding variables sex, age, and education, indicating confound leakage. In addition, the predictive power increases when stratification is performed. Pipelines with different models for removing confounding factors perform differently. Ultimately, two out of 20 models are significant, which suggests that the prediction is at least partly driven by the features in these models.

Discussion

This study is based on an investigation of the relationship between executive functions and prosody through examining whether prosodic features can predict executive functions.
In summary, we initially found a moderate predictive power of prosodic features for TMT BTA and TMT BTB. However, considering all results, predictive power decreases when the confounding variables sex, age, and education are not removed, indicating confound leakage for most of the models.

First, we evaluated 66 models, each predicting one executive function variable from the prosodic features. We employed 10-fold cross-validation with stratification by target variable and confound removal of sex, age, and education. The results showed poor or no prediction performance for 64 out of 66 EF targets. Only the models for the targets TMT BTA and TMT BTB, which relate to cognitive flexibility, initially appeared to have a moderately valid predictive performance.

Without the additional analyses that we conducted for validation, these results could have been interpreted as follows: Our results would have confirmed findings from previous studies on a close correlation between executive functions and language in general [88] [17], and would have been in line with research conducted in different patient cohorts [43] [45] [50] reporting connections between cognitive flexibility and prosody [34]. In our study, we would have found these associations in healthy participants. Consistent with the literature, this study would have shown that features from various prosodic domains are important for the models to learn. This would have validated that prosodic features of different kinds are closely related to executive functions, as described in previous studies [89] [90] [91]. Furthermore, predominantly spectral prosodic parameters would have shown importance for the model fits, especially the Mel Frequency Cepstral Coefficients, which are already used as a biomarker in depressive disorders [47] [49]. As described in Table 2, the Mel Frequency Cepstral Coefficients are defined as the perceived pitch of the frequency spectrum.
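For reference, the Mel scale underlying these coefficients is often written in the analytic form below, which maps a frequency f in Hz to a perceptual value m in mels. The constants 2595 and 700 belong to one widely used convention and are an assumption here, since the text does not state which variant the feature extraction used.

```latex
m(f) = 2595 \, \log_{10}\!\left(1 + \frac{f}{700}\right)
```

Under this form, 1000 Hz maps to approximately 1000 mel, and equal steps in m correspond to progressively larger steps in f at higher frequencies.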
More precisely, these are coefficients of the Mel scale, which relates the perceived frequency of a tone to the actual measured frequency. It scales the frequency to match more closely what the human ear can hear [92]. It would therefore have been deduced from the study that spectral parameters, in particular the Mel Frequency Cepstral Coefficients, are closely related to executive functions. Furthermore, the findings would have confirmed that easy-to-capture spontaneous speech derived from different tasks is suitable for the extraction of prosodic features. In summary, the present research would have raised the possibility that this predictive power of prosodic features could be an important biomarker for executive function impairment or its future decline.

However, given the additional in-depth analyses of the ML pipeline that partly invalidate the initial results, our findings need to be reinterpreted as follows: We would expect models to perform better if the effects of the confounding variables are not excluded, given that this would provide more information for the algorithm to learn. However, the prediction performance decreases for both TMT targets when the confounding variables sex, age, and education are not removed. This is not in line with our expectation because, in our scenario, the prediction performance should be worse if the confounding variables are removed, as the algorithm can then only learn from the association between confound-free features and the target. We found that information from these confounds, namely sex, age, and education, leaked into the predictions through the confound removal procedure. This inadvertent injection of information occurs particularly when the confounding variables and the targets are strongly correlated and a high number of features is used, as explained by Hamdan et al. [13] and Sasse & Nicolaisen-Sobesky et al. [12]. This is indeed the case in our data set (see Appendix C).
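To make the procedure under discussion concrete, the following sketch shows one way to regress confounds out of the features within each cross-validation fold, with folds stratified by a binned continuous target. It is a simplified scikit-learn illustration, not the authors' pipeline: the function names, the quantile binning, the joint (multi-output) fit of the confound model, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import StratifiedKFold


def residualize(conf_train, X_train, conf_test, X_test, cr_model):
    """Fit the confound model on training data only and subtract its predictions."""
    # Multi-output fit: for linear models this equals one regression per feature.
    model = clone(cr_model).fit(conf_train, X_train)
    return X_train - model.predict(conf_train), X_test - model.predict(conf_test)


def fold_r2_scores(X, y, confounds, remove_confounds=True, cr_model=None,
                   n_splits=5, n_bins=5, seed=0):
    """Out-of-fold R² per fold, with folds stratified by quantile bins of y."""
    cr_model = LinearRegression() if cr_model is None else cr_model
    # Bin the continuous target so that each fold covers its whole range.
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    y_bins = np.digitize(y, edges)
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train, test in cv.split(X, y_bins):
        X_tr, X_te = X[train], X[test]
        if remove_confounds:
            X_tr, X_te = residualize(confounds[train], X_tr,
                                     confounds[test], X_te, cr_model)
        reg = RandomForestRegressor(n_estimators=50, random_state=seed)
        reg.fit(X_tr, y[train])
        scores.append(r2_score(y[test], reg.predict(X_te)))
    return np.array(scores)


# Synthetic data (illustration only): the target depends on one confound,
# and the features carry confound information plus independent noise.
rng = np.random.default_rng(0)
n = 150
confounds = rng.normal(size=(n, 3))  # hypothetical stand-ins for sex, age, education
X = rng.normal(size=(n, 10)) + confounds @ rng.normal(size=(3, 10))
y = confounds[:, 1] + rng.normal(scale=0.5, size=n)

scores_with_cr = fold_r2_scores(X, y, confounds, remove_confounds=True)
scores_without_cr = fold_r2_scores(X, y, confounds, remove_confounds=False)
```

In this toy setting, removing the confounds lowers R², which is the pattern one would normally expect; the opposite pattern, as reported in this study, is what points to leakage. Passing another scikit-learn regressor that supports multi-output targets as `cr_model` swaps the confound removal model.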
There is a strong correlation between the TMT targets and the confounding variables. In addition, we use a high number of features within the cross-validation pipeline, because we wanted to investigate EF and prosody in an exploratory manner. While our data set was relatively small compared to most ML studies, which typically increases the risk of leakage [93], it represents a reasonable size when compared to studies investigating speech biomarkers [33].

The results also confirm that these observations occur in both stratified and non-stratified conditions. As expected, stratification by target distribution generally increases the predictive performance. This is in line with Diamantidis et al. [94] and Hastie et al. [14], who show that equally representative cross-validation folds lead to improved predictive power. Additionally, the results demonstrate that stratification can also increase confound leakage. This can be derived from the fact that the difference in predictive power between the pipelines with and without confound removal is even greater in the stratified condition (Fig. 6). Furthermore, the results illustrate that the observed confound leakage is not bound to the use of Linear Regression as the confound removal model, but also occurs when other models are employed.

Overall, these observations raise concerns about the trustworthiness of the primary results. Nonetheless, one cannot definitively rule out that information from the features also influenced the predictive power of the present results. We therefore conducted permutation testing for the different cross-validation models. Since the permutation tests for the two TMT targets each identified models that can be interpreted as significant, we speculate that predictive power is partly due to the information contained in the features, despite the confounding variables also contributing to the prediction.
However, this was only observed in two of 66 EF targets, and for these two targets only with specific confound removal models. For this reason, we infer the predictive power of prosodic features only conditionally. Further analyses of this type with other data sets would need to be carried out to verify this.

In conclusion, the present results highlight the challenges and pitfalls of conducting ML analyses with the aim of predicting variables of interest, including cognitive performance. This example shows which misinterpretations could have been deduced from the initial results. This can be particularly dangerous if the findings match previous studies, as is the case here. This is crucial, as ML studies are becoming increasingly important and widely employed, especially with the accessibility of large amounts of data. In this respect, we caution and recommend that when using ML analyses to predict cognitive performance, quality controls should be performed to prevent false results. This is also true when interpreting ML results of other researchers. This study has contributed to uncovering more insight into a pitfall in ML analysis arising from confound leakage. As confounding is ubiquitous in the social and biological sciences, it should be further deciphered how confound leakage occurs and which contributing factors can be taken into account. Additionally, our analysis framework provides a blueprint for further research investigating whether prosody can serve as a predictive biomarker of executive dysfunction.

Declarations

Data Availability

Part of the data used in this study is publicly available upon request. Researchers who wish to acquire access to the data are kindly asked to contact Julia A. Camilleri at [email protected] , as described in the related publication Camilleri, J.A., Volkening, J. et al. SpEx: a German-language dataset of speech and executive function performance. Sci Rep 14, 9431 (2024).
https://doi.org/10.1038/s41598-024-58617-3

Acknowledgements

This study was supported by the Deutsche Forschungsgemeinschaft (DFG, GE 2835/2–1, EI 816/16-1 and EI 816/21-1), the National Institute of Mental Health (R01-MH074457), the Helmholtz Portfolio Theme "Supercomputing and Modeling for the Human Brain", the Virtual Brain Cloud (EU H2020, no. 826421) & the National Institute on Aging (R01AG067103).

Author information

These authors contributed equally: Julia A. Camilleri and Susanne Weis.

Institute of Systems Neuroscience, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany: Gianna Kuhles, Sami Hamdan, Simon B. Eickhoff, Kaustubh R. Patil, Julia A. Camilleri, Susanne Weis

Institute of Neuroscience and Medicine, Brain and Behaviour (INM-7), Research Centre Jülich, Jülich, Germany: Gianna Kuhles, Sami Hamdan, Simon B. Eickhoff, Kaustubh R. Patil, Julia A. Camilleri, Susanne Weis

Department of Psychiatry, Psychotherapy and Psychosomatics, Medical Faculty, RWTH Aachen University, Aachen, Germany: Stefan Heim

Institute of Neuroscience and Medicine, Structural and functional Organization of the Brain (INM-1), Research Center Jülich, Jülich, Germany: Stefan Heim

Author contributions

G.K., J.A.C., S.W. conceived the project and designed the study. S.H., S.H., S.B.E., K.R.P. contributed essential resources. G.K. with contributions from S.W. and all other authors wrote the manuscript.

Competing interests

The authors declare no competing interests.

References

1. Karako, K. Predictive deep learning models for cognitive risk using accessible data. BioSci. Trends 18, 66-72 (2024).
2. Bzdok, D., Varoquaux, G., & Steyerberg, E. W. Prediction, not association, paves the road to precision medicine. JAMA Psychiatry 78, 127-128 (2021).
3. Cotta Ramusino, M. et al. Diagnostic performance of molecular imaging methods in predicting the progression from mild cognitive impairment to dementia: an updated systematic review. Eur. J Nucl. Med. Mol. Imaging 51, 1876-1890 (2024).
4. Roheger, M., Liebermann-Jordanidis, H., Krohm, F., Adams, A., & Kalbe, E. Prognostic factors and models for changes in cognitive performance after multi-domain cognitive training in healthy older adults: A systematic review. Front. Hum. Neurosci. 15, 636355; https://doi.org/10.3389/fnhum.2021.636355 (2021).
5. Dwyer, D. B., Falkai, P., & Koutsouleris, N. Machine learning approaches for clinical psychology and psychiatry. Annu. Rev. Clin. Psychol. 14, 91-118 (2018).
6. Arbabshirani, M. R., Plis, S., Sui, J., & Calhoun, V. D. Single subject prediction of brain disorders in neuroimaging: Promises and pitfalls. Neuroimage 145, 137-165 (2017).
7. Rankin, D. et al. Identifying key predictors of cognitive dysfunction in older people using supervised machine learning techniques: observational study. JMIR Med. Inform. 8, 20995; https://doi.org/10.2196/20995 (2020).
8. Ansart, M. et al. Predicting the progression of mild cognitive impairment using machine learning: a systematic, quantitative and critical review. Med. Image Anal. 67, 101848; https://doi.org/10.1016/j.media.2020.101848 (2021).
9. Ahmad, S., El-Affendi, M. A., Anwar, M. S., & Iqbal, R. Potential future directions in optimization of students’ performance prediction system. Comput. Intell. Neurosci. 1, 6864955; https://doi.org/10.1155/2022/6864955 (2022).
10. Domingos, P. A few useful things to know about machine learning. Comm. ACM. 55, 78-87 (2012).
11. Kapoor, S., & Narayanan, A. Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4, 100804; https://doi.org/10.1016/j.patter.2023.100804 (2023).
12. Sasse, L., & Nicolaisen-Sobesky, E. On Leakage in Machine Learning Pipelines. arXiv preprint arXiv:2311.04179 (2024).
13. Hamdan, S. et al. Confound-leakage: confound removal in machine learning leads to leakage. GigaScience 12, giad071; https://doi.org/10.1093/gigascience/giad071 (2023).
14. Hastie, T., Tibshirani, R., & Friedman, J. H. The elements of statistical learning: data mining, inference, and prediction. 1-758 (Springer, 2009).
15. Ardila, A. The executive functions in language and communication in Cognition and acquired language disorders (ed. Peach, R. K. & Shapiro, L. P.) 147-166 (Mosby, 2012).
16. Baddeley, A. Working memory: looking back and looking forward. Nat. Rev. Neurosci. 4, 829-839 (2003).
17. Levelt, W. J. Accessing words in speech production: Stages, processes, and representations. Cogn. 42, 1-22 (1992).
18. Goldstein, S., Naglieri, J. A., Princiotta, D., & Otero, T. M. Introduction: A history of executive functioning as a theoretical and clinical construct. in Handbook of executive functioning. (ed. Goldstein, S. & Naglieri, J. A.) 3-12 (Springer Science, 2014).
19. Ward, J. The Student's Guide to Cognitive Neuroscience. (Psychology Press, 2015).
20. Friedman, N. et al. Individual differences in executive functions are almost entirely genetic in origin. J. Exper. Psychol. 137, 201-225 (2008).
21. Diamond, A. Executive functions. An. Rev. Psy. 64, 135-168 (2013).
22. Miyake, A. et al. The unity and diversity of executive functions and their contributions to complex ‘Frontal Lobe’ tasks: a latent variable analysis. Cognit. Psychol. 41, 49-100 (2000).
23. Löffler, C., Frischkorn, G. T., Hagemann, D., Sadus, K., & Schubert, A. L. The common factor of executive functions measures nothing but speed of information uptake. Psychol. Res. 88, 1092-1114 (2024).
24. Barch, D. M. The cognitive neuroscience of schizophrenia. Annu. Rev. Clin. Psychol. 1, 321-353 (2005).
25. Guarino, A. et al. Executive functions in Alzheimer disease: a systematic review. Front. Neurosci. 10, 437 (2019).
26. Kudlicka, A., Clare, L., & Hindle, J. V. Executive functions in Parkinson’s disease: Systematic review and meta-analysis. Mov. Disord. 26, 2305-2315 (2011).
27. Nigg, J. T., Blaskey, L. G., Huang-pollock, C. L., & Rappley, M. D. Neuropsychological Executive Functions and DSM-IV ADHD Subtypes. J. Am. Acad. Child Adolesc. Psych. 41, 59-66 (2002).
28. Tavares, J. V. T. et al. Distinct profiles of neurocognitive function in unmedicated unipolar depression and bipolar II depression. Biol. Psychol. 62, 917-924 (2007).
29. Salthouse, T., Atkinson, T., & Berish, D. Executive functioning as a potential mediator of age-related cognitive decline in normal adults. J. Exper. Psychol. 132, 566-594 (2003).
30. Novick, J. M., Trueswell, J. C., & Thompson, S. L. Cognitive control and parsing: Reexamining the role of Broca’s area in sentence comprehension. Cogn, Affec. & Behav. Neurosci. 5, 263-281 (2005).
31. Laver, J. Principles of phonetics. (Cambridge University Press, 1994).
32. Eyben, F. et al. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing. Transac. Affect. Com. 7, 190-202 (2015).
33. Hecker, P., Steckhan, N., Eyben, F., Schuller, B. W., & Arnrich, B. Voice analysis for neurological disorder recognition – A systematic review and perspective on emerging trends. Front. Digit. Health 4, 842301; https://doi.org/10.3389/fdgth.2022.842301 (2022).
34. Ramanarayanan, V., Lammert, A. C., Rowe, H. P., Quatieri, T. F., & Green, J. R. Speech as a biomarker: Opportunities, interpretability, and challenges. Perspect. ASHA Spec. Interest Groups 7, 276-283 (2022).
35. Robin, J. et al. Evaluation of speech-based digital biomarkers: review and recommendations. Digl Biomark. 4, 99-108 (2020).
36. Martínez-Sánchez, F., Meilán, J. J. G., Carro, J., & Ivanova, O. A prototype for the voice analysis diagnosis of Alzheimer's disease. J. Alzheimers. Dis. 64, 473-481 (2018).
37. Parola, A., Simonsen, A., Bliksted, V., & Fusaroli, R. Voice patterns in schizophrenia: A systematic review and Bayesian meta-analysis. Schizo. Res. 216, 24-40 (2020).
38. Speer, S. R., & Ito, K. Prosody in first language acquisition–Acquiring intonation as a tool to organize information in conversation. Lang & Ling Com. 3, 90-110 (2009).
39. Alexander, M. P., Benson, D. F., & Stuss, D. T. Frontal lobes and language. Brain & Lang. 37, 656-691 (1989).
40. Ross, E. D. The aprosodias: Functional-anatomical organization of the affective components of language in the right hemisphere. Arch Neurol. 140, 695-710 (1981).
41. Keulen, S. et al. Psychogenic foreign accent syndrome: a new case. Front. Neurosci. 10, 143 (2016).
42. Roy, A., Allain, P., Roulin, J. L., Fournet, N., & Le Gall, D. Ecological approach of executive functions using the behavioural assessment of the dysexecutive syndrome for children (BADS-C): Developmental and validity study. J. Neuropsych. 37, 956-971 (2015).
43. Breitenstein, C., Van Lancker, D., Daum, I., & Waters, C. H. Impaired perception of vocal emotions in Parkinson's disease: influence of speech time processing and executive functioning. Brain & Cogn. 45, 277-314 (2001).
44. Nevler, N. et al. Automatic measurement of prosody in behavioral variant FTD. Neurol. 89, 650-656 (2017).
45. Filipe, M. G., Frota, S., & Vicente, S. G. Executive functions and prosodic abilities in children with high-functioning autism. Front. Psych. 9, 359 (2018).
46. Alghowinem, S., Gedeon, T., Goecke, R., Cohn, J. F., & Parker, G. Interpretation of depression detection models via feature selection methods. IEEE Trans. Affect. Comput. 14, 133-152 (2020).
47. Cummins, N., Epps, J., Sethu, V., Breakspear, M., & Goecke, R. Modeling Spectral Variability for the Classification of Depressed Speech. Proc. Interspeech. 857-861 (2013).
48. Moore, I. I. E., Clements, M. A., Peifer, J. W., & Weisser, L. Critical analysis of the impact of glottal features in the classification of clinical depression in speech. Transact. Biomedic. 55, 96-107 (2007).
49. Williamson, J. R. et al. Vocal biomarkers of depression based on motor incoordination. Proc. Aud. 3, 41-48 (2013).
50. Engelhardt, P. E., Nigg, J. T., & Ferreira, F. Is the fluency of language outputs related to individual differences in intelligence and executive function? Acta Psychol. 144, 424-432 (2013).
51. Camilleri, J. A. et al. SpEx: a German-language dataset of speech and executive function performance. Sci. Rep. 14, 9431; https://doi.org/10.1038/s41598-024-58617-3 (2024).
52. Wiener Testsystem. (SCHUHFRIED GmbH, 2016).
53. Stoet, G. PsyToolkit: A software package for programming psychological experiments using Linux. Behav. Res. Methods 42, 1096-1104 (2010).
54. Reitan, R. M. Validity of the trail making test as an indicator of organic brain damage. Percept. Mot. Skills 8, 271–276 (1958).
55. Raven, J. C., Raven, J. & Court, J. H. SPM Manual (Deutsche Bearbeitung und Normierung von St. Bulheller und H. Häcker). (Swets & Zeitlinger B.V, 1998).
56. Grant, D. A. & Berg, E. A. A behavioral analysis of degree of reinforcement and ease of shifting to new responses in a Weigl-type card-sorting problem. J. Exp. Psychol. 38, 404-411 (1948).
57. Kaller, C. P., Unterrainer, J. M. & Stahl, C. Assessing planning ability with the Tower of London task: Psychometric properties of a structurally balanced problem set. Psychol. Assess. 24, 46-53 (2012).
58. Meiran, N. Reconfiguration of processing mode to task performance. J. Exp. Psychol. Learn. Mem. Cogn. 22, 1423-1442 (1996).
59. Schellig, D., Schuri, U. & Arendasy, M. NBN-NBACK-nonverbal. (SCHUHFRIED GmbH, 2009).
60. Sturm, W. & Willmes, K. NVLT Non-Verbal Learning Test. (SCHUHFRIED GmbH, 2016).
61. Schellig, D. & Hättig, H. A. Die Bestimmung der visuellen Merkspanne mit dem Block-Board. Z. Neuropsychol. 4, 104-112 (1993).
62. Kaiser, S., Aschenbrenner, S., Pfüller, U., Roesch-Ely, D., & Weisbrod, M. Response Inhibition. (SCHUHFRIED GmbH, 2016).
63. Simon, J. R. & Wolf, J. D. Choice reaction time as a function of angular stimulus-response correspondence and age. Ergonomics 6, 99-105 (1963).
64. Schuhfried, G. Interferenz nach Stroop. (SCHUHFRIED GmbH, 2016).
65. Sturm, W. Wahrnehmungs- und Aufmerksamkeitsfunktionen: Geteilte Aufmerksamkeiten. (SCHUHFRIED GmbH, 2016).
66. Mackworth, N. H. The breakdown of vigilance during prolonged visual search. J. Exper. Psych. 1, 6-21 (1948).
67. Goodglass, H., & Kaplan, E. The assessment of aphasia and related disorders. (Lea & Febiger, 1972).
68. Amunts, J., Camilleri, J. A., Eickhoff, S. B., Heim, S., & Weis, S. Executive functions predict verbal fluency scores in healthy participants. Sci. Rep. 10, 1-11 (2020).
69. Amunts, J. et al. Comprehensive verbal fluency features predict executive function performance. Sci. Rep. 11, 1-14 (2021).
70. Eyben, F., Wöllmer, M., & Schuller, B. Opensmile: the munich versatile and fast open-source audio feature extractor. Proc. Multimed. 18, 1459-1462 (2010).
71. Van Rossum, G., & Drake, F. L. Python 3 Reference Manual. (CreateSpace, 2009).
72. Hamdan, S. et al. Julearn: An Easy-to-Use Library for Leakage-Free Evaluation and Inspection of ML Models. Gigabyte, gigabyte 113; https://doi.org/10.46471/gigabyte.113 (2024).
73. Molinaro, A. M., Simon, R., & Pfeiffer, R. M. Prediction error estimation: a comparison of resampling methods. Bioinform. 21, 3301-3307 (2005).
74. Dromey, C., Silveira, J., & Sandor, P. Recognition of affective prosody by speakers of English as a first or foreign language. Speech comm. 47, 351-359 (2005).
75. Volin, J., Tykalová, T., & Boril, T. Stability of Prosodic Characteristics Across Age and Gender Groups. Inter Speech 3902-3906 (2017).
76. Kaufman, S., Rosset, S., Perlich, C. & Stitelman, O. Leakage in data mining: formulation, detection, and avoidance. ACM Trans. Knowl. Discov. Data 6, 1–21 (2012).
77. Wolpert, D. H. The lack of a priori distinctions between learning algorithms. Neural. Com. 8, 1341-1390 (1996).
78. Byeon, H. Is the Random Forest Algorithm Suitable for Predicting Parkinson's Disease with Mild Cognitive Impairment out of Parkinson's Disease with Normal Cognition? Int. J. Enviro. 17, 2594 (2020).
79. Cordova, M. et al. Heterogeneity of executive function revealed by a functional random forest approach across ADHD and ASD. Neuro Im. Clin. 26, 102245; https://doi.org/10.1016/j.nicl.2020.102245 (2020).
80. Adnan, M. N., Ip, R. H., Bewong, M., & Islam, M. Z. BDF: A new decision forest algorithm. Inform. Sci. 569, 687-705 (2021).
81. Grinsztajn, L., Oyallon, E. & Varoquaux, G. Why do tree-based models still outperform deep learning on typical tabular data? Adv. Neural Inf. Process. Syst. 35, 507–520 (2022).
82. Breiman, L. Random forests. Mach. Learn. 45, 5-32 (2001).
83. Poldrack, R. A., Huckins, G. & Varoquaux, G. Establishment of best practices for evidence for prediction: a review. JAMA Psychiatry 77, 534–540 (2020).
84. Wright, S. Correlation and Causation. J. Agric. 20, 557-585 (1921).
85. Nembrini, S., König, I. R. & Wright, M. N. The revival of the Gini importance? Bioinformatics 34, 3711–3718 (2018).
86. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 2, 2825–2830 (2011).
87. North, B. V., Curtis, D. & Sham, P. C. A note on the calculation of empirical P values from Monte Carlo procedures. Am. J. Hum. Genet. 71, 439–441 (2002).
88. Baddeley, A. D., & Hitch, G. Working memory. Psych. of learn & motiv. 8, 47-89 (1974).
89. Yap, P. et al. Development trends of white matter connectivity in the first years of life. Plos one. 6, e24678; https://doi.org/10.1371/journal.pone.0024678 (2011).
90. Tamarit, L., Goudbeek, M., & Scherer, K. R. Spectral slope measurements in emotionally expressive speech in Proc. of Speech. 7, 169-183 (2008).
91. Le, P., Ambikairajah, E., Epps, J., Sethu, V., & Choi, E. H. C. Investigation of spectral centroid features for cognitive load classification. Speech Comm. 54, 540-551 (2011).
92. Hasan, M. R., Jamil, M., & Rahman, M. G. R. M. S. Speaker identification using mel frequency cepstral coefficients. Variat. 1, 565-568 (2004).
93. Rosenblatt, M., Tejavibulya, L., Jiang, R., Noble, S. & Scheinost, D. Data leakage inflates prediction performance in connectome-based machine learning models. Nat. Commun. 15, 1829 (2024).
94. Diamantidis, N. A., Karlis, D. & Giakoumakis, E. A. Unsupervised stratification of cross-validation for accuracy estimation. Artif. Intell. 1–16 (2000).

Additional Declarations

No competing interests reported.
Supplementary Files

AppendixPitfallsinusingMLtopredictcognitivefunctionperformance.docx

Status: Published
Journal Publication published 29 Oct, 2025. Read the published version in Scientific Reports.
Version 1 posted
Editorial decision: Revision requested 03 Apr, 2025
Reviews received at journal 12 Jan, 2025
Reviews received at journal 09 Jan, 2025
Reviewers agreed at journal 29 Dec, 2024
Reviewers agreed at journal 15 Dec, 2024
Reviewers invited by journal 05 Aug, 2024
Editor assigned by journal 29 Jul, 2024
Editor invited by journal 22 Jul, 2024
Submission checks completed at journal 18 Jul, 2024
First submitted to journal 15 Jul, 2024
Figure 1. 10-fold cross-validation design for each executive function target.

Figure 2. Prediction of executive function targets by prosodic features. Cross-validation model with confound removal and stratified by target distribution. Only targets with positive R² values are displayed. TMT BTA = Trail Making Test - processing time part A, TMT BTB = Trail Making Test - processing time part B.

Figure 3. Feature importance for TMT BTA. TMT BTA = Trail Making Test - processing time part A, SD = standard deviation, M = mean, PD = picture description, RS = retelling a story, FS = fictional storytelling.

Figure 4. Feature importance for TMT BTB. TMT BTB = Trail Making Test - processing time part B, SD = standard deviation, M = mean, PD = picture description, RS = retelling a story, FS = fictional storytelling.

Figure 5. Prediction of TMT targets in different conditions regarding confound removal and stratification. TMT BTA = Trail Making Test - processing time part A, TMT BTB = Trail Making Test - processing time part B, confounding variables (sex, age, education) and stratification: with CR = with confound removal, strat = stratified, without CR = without confound removal.

Figure 6. Prediction of TMT targets in different conditions regarding different confound removal models. TMT BTA = Trail Making Test - processing time part A, TMT BTB = Trail Making Test - processing time part B, confounding variables (sex, age, education) and stratification. CRmodel = Confound removal model, RF = Random Forest Regressor, ET = ExtraTree Regressor, Ridge = Ridge Regression, LG = Linear Regression, withoutCR = without confound removal, strat = stratified.
Such methods are of practical use for exploratory research in various fields because unknown, linear, but most importantly non-linear, relationships of a large number of variables can be inspected easily and quickly. ML approaches are gaining more importance as they are able to predict the target value of an unseen individual using their features. For instance, when impaired prosodic abilities are related to a disorder, an ML model could be useful for early detection and diagnosis. However, ML can be problematic when applied inappropriately, leading to inaccurate results and misleading conclusions.\u003c/p\u003e \u003cp\u003eOne of the main challenges in ML relates to preventing models from displaying prediction values that are overly high in comparison to their actual predictive power [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e] [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. Barring other reasons, this is usually the case when information that should be kept strictly separate is unintentionally fed into the ML pipeline. This process is referred to as leakage [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e] [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]. One form of leakage is the incorporation of information from confounding variables through the procedure of confound removal, i.e. confound leakage [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]. Confounding variables share variance with both the dependent (target) and the independent (explanatory or predictive) variable. This means that they are associated with both variables in the analysis and can potentially have an impact on the relationship between them. It is desirable to remove the confounding information such that the model's predictions are not influenced by it. 
However, it is plausible that the standard confound removal procedure using linear regression might inadvertently introduce confounding information rather than removing it, causing confound leakage [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eIn the following, we demonstrate this issue using a specific example from our research, which aimed to predict cognitive performance based on prosodic variables. As executive functions are crucial cognitive capabilities in everyday human life and constitute a basic requirement for speech and communication [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e] [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e] [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e], we focused on predicting executive function performance in this particular application.\u003c/p\u003e \u003cp\u003eThe term \u0026ldquo;executive functions\u0026rdquo; represents a heterogeneous set of distinguishable processes [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]. According to Ward, executive functions represent complex abilities, with which people optimise their performance in situations that require the organisation of a series of cognitive processes [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e]. 
In spite of the lack of a universal definition of executive function performance and its subordinated domains [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e], the grouping of working memory, inhibition, and cognitive flexibility [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e] [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e] is still the most popular [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eExecutive functions are of great relevance in relation to various pathologies, as their impairment can be observed in numerous neurological and psychiatric disorders [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e] [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e] [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e] [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e] [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e]. For this reason, their investigation, both in healthy people and in different patient groups, constitutes a central component of research and diagnostics. Despite great efforts, examination and characterisation of executive functions have proven to be extremely difficult [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e]. Not only is data acquisition time-consuming and costly, but the results are also dependent on subjective application factors, such as the qualification of the test conductor and the current condition of the person being tested. 
In addition, the measured performance depends on the individual's motivation.\u003c/p\u003e \u003cp\u003eWhat we can take advantage of in the context of testing EF is the knowledge about the relationship between executive functions and language: It is assumed that executive functions act as a cognitive control mechanism for the syntactic processing of sentences [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e]. Moreover, a large variety of disorders in communication ability are associated with impaired executive functions, including dysarthria, aphasia, language pragmatic disturbances, and verbal reasoning impairments [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]. In addition to the symptoms shown on the linguistic levels of phonetics and phonology, morphology and syntax, semantics and pragmatics, the described aspects of the impaired language function also relate to the level of prosody.\u003c/p\u003e \u003cp\u003eProsody can be defined as the totality of all acoustically perceptible forms of expression of speech [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e]. Since prosody belongs to the realm of the phonetic structures of language and is not tied to the categories of lexeme, morpheme or phoneme, prosodic subfunctions belong to the class of suprasegmentals of language. Although several classifications of prosody have been proposed, four main domains can be distinguished: frequency related parameters, energy/amplitude related parameters, spectral parameters, and temporal parameters [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e]. Fundamental frequency refers to the F0 frequency and is described as the middle pitch. 
Intensity of speech relates to loudness, whereas duration is defined as the quantity of speech [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eAgainst the background of current literature regarding the connections between linguistic and cognitive processes, methods can be developed to draw conclusions about underlying cognitive performance with the help of speech variables. In particular, the analysis of prosodic features by speech samples provides advantages, as it offers a high external validity as well as time and cost efficiency compared to classical diagnostic procedures [\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e] [\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e] [\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e]. This is why procedures for objective speech analysis are gaining increasing popularity and are already in use in clinical diagnostics [\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e] [\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eStudies suggest that prosodic impairments may occur due to immature executive functions [\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e]. In addition, earlier patient studies have already shown a connection between right-hemispheric frontal brain damage and impairments of prosody [\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e] [\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e]. Recent studies also demonstrated a relation between suprasegmental disorders, regarding impaired executive functions, in foreign accent syndrome [\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e] [\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e]. 
Moreover, impaired working memory and impairment in prosody were observed in Parkinson\u0026rsquo;s Disease [\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e], while reduced performance of fundamental frequency in connection with executive function damage was shown in frontotemporal dementia [\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e]. Furthermore, a link between prosody and divided attention, working memory and inhibition was shown in Autism Spectrum Disorder [\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e]. There is also clinical evidence that formant frequencies and Mel Frequency Cepstral Coefficients are associated with depressive disorders and potentially act as a biomarker [\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e] [\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e] [\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e] [\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e]. A relationship between prosodic performance, specifically disfluencies, and inhibition in healthy participants was also reported by Engelhardt and colleagues [\u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e50\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eIn summary, a link between deficient executive subfunctions and impaired prosodic skills is reported in different pathologies [\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e] [\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e] [\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e] [\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e]. These associations can be utilised to predict cognitive functions. However, these findings are primarily based on patient studies. 
Therefore, our initial aim was to test whether the reported correlations could predict cognitive performance in a healthy sample.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eParticipants\u003c/h2\u003e \u003cp\u003eParticipants were recruited at the Forschungszentrum J\u0026uuml;lich and through social networks. Testing took place at the Forschungszentrum J\u0026uuml;lich (Germany) in 2018. Each test session lasted between 150 and 180 minutes, depending on the participants\u0026rsquo; speed and the duration of the instructions. 231 healthy participants without a diagnosis of neurological or mental impairment were included in the present study (138 female, 93 male). The mean age of the sample at testing time was 35.2 years (standard deviation\u0026thinsp;=\u0026thinsp;11.1, minimum\u0026thinsp;=\u0026thinsp;20, maximum\u0026thinsp;=\u0026thinsp;55). All participants were monolingual German speakers. The sample varies regarding the level of education, ranging from participants who finished secondary school (n\u0026thinsp;=\u0026thinsp;8), professional school/job training (n\u0026thinsp;=\u0026thinsp;62), high school with a university-entrance diploma (n\u0026thinsp;=\u0026thinsp;69), and university with a university degree (n\u0026thinsp;=\u0026thinsp;92). All participants were paid an expense allowance of 50 EUR. The study was approved by the ethics committee of Heinrich Heine University D\u0026uuml;sseldorf under the registration number 2017064341. Informed consent was obtained from all participants. All experiments were performed in accordance with relevant named guidelines and regulations. 
Part of the data used in this study is publicly available upon request, as not all participants consented to data sharing [\u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e51\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003eDesign\u003c/h2\u003e \u003cp\u003eThe test sessions were divided into two parts: Firstly, the executive performance of the participants was assessed. Secondly, spontaneous speech performance was recorded in order to extract prosodic features from speech samples.\u003c/p\u003e \u003cp\u003eThe executive function performance was assessed by the computerized test batteries \u003cem\u003eVienna Testsystem\u003c/em\u003e [\u003cspan citationid=\"CR52\" class=\"CitationRef\"\u003e52\u003c/span\u003e] and \u003cem\u003ePsytoolkit\u003c/em\u003e [\u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e53\u003c/span\u003e], containing common standard tests for measuring executive function performance. In total, 66 variables from 14 different assessments of executive function performance were collected: Trail Making Test (TMT) [\u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e54\u003c/span\u003e], Raven\u0026rsquo;s Standard Progressive Matrices [\u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e55\u003c/span\u003e], Wisconsin Card Sorting Test [\u003cspan citationid=\"CR56\" class=\"CitationRef\"\u003e56\u003c/span\u003e], Tower of London [\u003cspan citationid=\"CR57\" class=\"CitationRef\"\u003e57\u003c/span\u003e], and Cued Task Switching [\u003cspan citationid=\"CR58\" class=\"CitationRef\"\u003e58\u003c/span\u003e] are related to cognitive flexibility. 
Performance of N-back Non-verbal [\u003cspan citationid=\"CR59\" class=\"CitationRef\"\u003e59\u003c/span\u003e], Non-verbal Learning Test [\u003cspan citationid=\"CR60\" class=\"CitationRef\"\u003e60\u003c/span\u003e], and Corsi Block Tapping Test [\u003cspan citationid=\"CR61\" class=\"CitationRef\"\u003e61\u003c/span\u003e] were used in relation to working memory. Inhibition was tested by Stop Signal Task [\u003cspan citationid=\"CR62\" class=\"CitationRef\"\u003e62\u003c/span\u003e], Simon Task [\u003cspan citationid=\"CR63\" class=\"CitationRef\"\u003e63\u003c/span\u003e], and Stroop Test [\u003cspan citationid=\"CR64\" class=\"CitationRef\"\u003e64\u003c/span\u003e]. Divided Attention Test [\u003cspan citationid=\"CR65\" class=\"CitationRef\"\u003e65\u003c/span\u003e], Spatial Attention Test [\u003cspan citationid=\"CR65\" class=\"CitationRef\"\u003e65\u003c/span\u003e], and Mackworth Clock Test [\u003cspan citationid=\"CR66\" class=\"CitationRef\"\u003e66\u003c/span\u003e] were used to measure divided and spatial attention as well as vigilance. An overview of the assessed tests and the exact variables from these are shown in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e (see Appendix A for the descriptions of the tests).\u003c/p\u003e \u003cp\u003e Spontaneous speech was tested based on a collection of three different speech samples per participant. Firstly, the participants were asked to describe the \u003cem\u003eCookie Theft Picture\u003c/em\u003e [\u003cspan citationid=\"CR67\" class=\"CitationRef\"\u003e67\u003c/span\u003e] within 90 seconds in as much detail as possible. Secondly, the participants were asked to talk about what they had watched on television / what kind of book they had read the day before. Thirdly, the participants were asked to describe what their favourite holiday trip would look like if money and time were no limiting factors. 
For the narrative tasks retelling a story and fictional storytelling, they were asked to talk for five minutes. Participants conducted all tests via a laptop, an external keyboard, and a headset-microphone.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eAssessed executive function variables adapted from Amunts et al. [\u003cspan citationid=\"CR68\" class=\"CitationRef\"\u003e68\u003c/span\u003e] [\u003cspan citationid=\"CR69\" class=\"CitationRef\"\u003e69\u003c/span\u003e]\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTest\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAbbreviation\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eVariables\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"3\" nameend=\"c3\" namest=\"c1\"\u003e \u003cp\u003eCOGNITIVE FLEXIBILITY\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTrail Making Test\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTMT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eProcessing time part A, Processing time part B, Difference part B-A [seconds], Quotient B/A, Errors part A, 
Errors part B\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRaven\u0026rsquo;s Standard Progressive Matrices\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSPM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCorrect items, Processing time\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eWisconsin Card Sorting Test\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWCST\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eNumber of errors, Number of perseveration errors, Number of errors (non perseveration), Timeouts\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTower Of London\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTOL\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePlanning ability, Number of correct responses, Changed his/her mind, self-correction, Choice of wrong pole, Choice of blocked pole, Choice of impossible position\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCued Task Switching\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSWITCH\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eNumber of errors, Timeouts, Errors of items which are incongruent\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"3\" nameend=\"c3\" namest=\"c1\"\u003e \u003cp\u003eWORKING MEMORY\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eN-back Non-Verbal\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNBN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCorrect items, Number of commission errors, Number of errors, Mean reaction time of correct items [seconds], Mean reaction time of errors [seconds]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNon-Verbal Learning Test\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNVLT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSum of correct responses, Sum of false responses, Sum of difference between correct minus false responses, Processing time\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCorsi Block Tapping Test\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCORSI\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eBlock span, Correct items, False items, Missed items, Sequence errors\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"3\" nameend=\"c3\" namest=\"c1\"\u003e \u003cp\u003eINHIBITION\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eStop Signal Task\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eINHIB\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eReaction time [seconds], Mean stop signal delay [seconds], Stop signal reaction time [seconds], Number of commission errors, Number of omission errors\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSimon Task\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e 
\u003cp\u003eSIMON\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eNumber of errors in compatible items, Number of errors in incompatible items\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eStroop Test\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSTROOP\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eReading interference [seconds], Naming interference [seconds], Interference-difference [seconds], Number of false reactions (reading-baseline), Number of false reactions (naming-baseline), Number of false reactions (reading-interference), Number of false reactions (naming-interference), Processing time\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"3\" nameend=\"c3\" namest=\"c1\"\u003e \u003cp\u003eATTENTION / VIGILANCE\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDivided Attention Test\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWAF-G\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eNumber of missed items (unimodal visual), Number of false alarms (unimodal visual), Mean reaction time (unimodal visual) [ms], Number of missed items (crossmodal visual/auditive), Number of false alarms (crossmodal visual/auditive), Mean reaction time (crossmodal) [ms]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSpatial Attention Test\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWAF-R\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eMean reaction time (unannounced items) [ms], Number of missed items (correct announced items), Mean reaction 
time (correct announced items) [ms], Number of missed items (wrong announced items), Mean reaction time (wrong announced items) [ms], Mean reaction time (short SOA) [ms], Mean reaction time (long SOA) [ms], Number of errors\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMackworth Clock Test\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMACK\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eNumber of missed jumps, Number of false alarms\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003eFeature extraction\u003c/h2\u003e \u003cp\u003eTo generate the prosodic features from the audio files collected from the speech tasks, the toolbox openSmile (\u003cb\u003eopen\u003c/b\u003e-\u003cb\u003eS\u003c/b\u003eource \u003cb\u003eM\u003c/b\u003eedia \u003cb\u003eI\u003c/b\u003enterpretation by \u003cb\u003eL\u003c/b\u003earge feature-space \u003cb\u003eE\u003c/b\u003extraction) [\u003cspan citationid=\"CR70\" class=\"CitationRef\"\u003e70\u003c/span\u003e], version 2.1.3, was used to extract the suprasegmental parameters. Although the extraction and analysis of prosodic parameters for research purposes have been carried out for decades in various fields and are currently a topic of great interest in the context of speech biomarkers in different pathologies [\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e], a lack of standardisation and thus comparability was observed [\u003cspan citationid=\"CR70\" class=\"CitationRef\"\u003e70\u003c/span\u003e]. The benefit of using the open-source toolbox openSmile is its standardised automatic computation of the prosodic features resulting in a fixed feature set. 
It offers the extraction of prosodic features within a set that corresponds to the main categories frequency (representing the fundamental frequency), energy/amplitude (representing the intensity), spectral parameters, and temporal parameters (representing the duration). The choice of parameters was guided by the criteria of potentially indexing physiological changes in voice production and its theoretical significance in previous literature [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e]. The feature set \u003cem\u003eextended Geneva Minimalistic Acoustic Parameter Set\u003c/em\u003e (eGeMAPS) was chosen, which contains 88 prosodic features. In order to keep the extraction comparable, the first 90 seconds from each audio file were chosen as input. As there are three audio samples per participant, a total of 264 prosodic features were generated per participant. All features were z-scored, i.e. the mean value was removed, and the variance was scaled to one unit. 
An overview of the extracted features and their descriptions, as well as the corresponding prosodic category, are shown in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eGrouped listing of the prosodic features extracted by the toolbox OpenSmile [\u003cspan citationid=\"CR70\" class=\"CitationRef\"\u003e70\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eProsodic feature\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eVariables\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDescription\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"3\" nameend=\"c3\" namest=\"c1\"\u003e \u003cp\u003eFREQUENCY RELATED PARAMETERS\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eF0semitone\u003c/p\u003e \u003cp\u003eMean, standard deviation, percentiles, range, rising slope, falling slope\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePitch, logarithmic F0 on a semitone frequency scale, starting at 
27.5 Hz (semitone 0)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eJitter\u003c/p\u003e \u003cp\u003eMean, standard deviation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDeviations in individual consecutive F0 period lengths\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eF 1\u0026ndash;3 frequency \u0026amp; bandwidth\u003c/p\u003e \u003cp\u003eMean, standard deviation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e12\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCentre frequency and bandwidth of the first, second, and third formants\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"3\" nameend=\"c3\" namest=\"c1\"\u003e \u003cp\u003eENERGY / AMPLITUDE RELATED PARAMETERS\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLoudness\u003c/p\u003e \u003cp\u003eMean, standard deviation, percentiles, range, rising slope, falling slope\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eEstimation of perceived signal intensity from an auditory spectrum\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eShimmer\u003c/p\u003e \u003cp\u003eMean, standard deviation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDifference in peak amplitudes of consecutive F0 periods\u003c/p\u003e 
\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHarmonics to Noise Ratio\u003c/p\u003e \u003cp\u003eMean, standard deviation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRelation of energy in harmonic components to energy in noise-like components\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"3\" nameend=\"c3\" namest=\"c1\"\u003e \u003cp\u003eSPECTRAL PARAMETERS\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSpectral Flux\u003c/p\u003e \u003cp\u003eMean, standard deviation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDifference of the spectra of two consecutive frames\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMel Frequency Cepstral Coefficients 1\u0026ndash;4\u003c/p\u003e \u003cp\u003eMean, standard deviation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e16\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePerceived pitch of the frequency spectrum\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHarmonic differences\u003c/p\u003e \u003cp\u003eMean, standard deviation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRatio of energy of the first F0 harmonic (H1) to the energy of the second F0 harmonic (H2) / to the energy of the highest harmonic in the third formant range 
(A3)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAlpha Ratio\u003c/p\u003e \u003cp\u003eMean, standard deviation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRatio of summed energy from 50-1000 Hz and 1\u0026ndash;5 kHz\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHammerberg Index\u003c/p\u003e \u003cp\u003eMean, standard deviation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRatio of the strongest energy peak in the 0\u0026ndash;2 kHz region to the strongest peak in the 2\u0026ndash;5 kHz region\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSpectral Slopes\u003c/p\u003e \u003cp\u003eMean, standard deviation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLinear regression slope of the logarithmic power spectrum in the specified bands\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eF 1\u0026ndash;3 Energy\u003c/p\u003e \u003cp\u003eMean, standard deviation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eFormant 1, 2, and 3 relative energy\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"3\" nameend=\"c3\" namest=\"c1\"\u003e \u003cp\u003eTEMPORAL PARAMETERS\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e 
\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLoudness peaks per second\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eNumber of volume highlights per second\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eVoiced segments\u003c/p\u003e \u003cp\u003eMean, standard deviation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAmount of continuously voiced regions\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eUnvoiced segments\u003c/p\u003e \u003cp\u003eMean, standard deviation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAmount of the continuously unvoiced regions\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eEquivalent Sound Level\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSound pressure level which has the same total energy as the actual fluctuating noise\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003eMachine learning and statistical analyses\u003c/h2\u003e \u003cp\u003eData management and analysis were performed using Python 3 [\u003cspan citationid=\"CR71\" class=\"CitationRef\"\u003e71\u003c/span\u003e]. 
An ML approach was applied to the data using the machine learning library JuLearn [\u003cspan citationid=\"CR72\" class=\"CitationRef\"\u003e72\u003c/span\u003e]. The 264 extracted prosodic feature variables were specified as features and the 66 executive function variables as targets. The initial goal of our analyses was to predict each of the executive function targets using all of the prosodic features.\u003c/p\u003e \u003cp\u003eFirstly, cross-validation was used to determine the model performance. In cross-validation, the data set is randomly partitioned into equally sized folds. All folds except one are used for training the model. The held-out fold is then used to determine the trained model\u0026rsquo;s performance on unseen data. This process is repeated so that each fold serves once as the validation fold. Then, the average of the validation performances is calculated [\u003cspan citationid=\"CR73\" class=\"CitationRef\"\u003e73\u003c/span\u003e]. Cross-validation was applied with ten folds (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). Since all of the prosodic features were used to predict each of the 66 targets, 66 independent cross-validated models were fitted.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eIn order to keep the folds balanced, stratification by target was implemented in the cross-validation pipeline, meaning that the different folds approximately followed the same distribution of the respective target [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. Stratification can usually improve the success of model training by ensuring that the training and test data have similar distributions, which reduces the risk of bias or error in the evaluation of the model. 
Given the known influence of different demographic aspects on prosodic performance [\u003cspan citationid=\"CR74\" class=\"CitationRef\"\u003e74\u003c/span\u003e] [\u003cspan citationid=\"CR75\" class=\"CitationRef\"\u003e75\u003c/span\u003e], we regressed out the effects of the confounding variables sex, age, and education from the features with a linear regression model. This is standard practice since the goal is to shed light on the relationship between executive functions and prosodic features, independently of factors that are additionally related to the constructs [\u003cspan citationid=\"CR76\" class=\"CitationRef\"\u003e76\u003c/span\u003e] [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eSeveral regression models are available for use in machine learning approaches. With his \u003cem\u003eNo Free Lunch\u003c/em\u003e theorem, Wolpert postulated that there is no general best machine learning algorithm for all predictive modeling problems such as classification and regression [\u003cspan citationid=\"CR77\" class=\"CitationRef\"\u003e77\u003c/span\u003e]. We chose the Random Forest Regressor as it has already been shown to predict executive functions in previous studies [\u003cspan citationid=\"CR69\" class=\"CitationRef\"\u003e69\u003c/span\u003e] [\u003cspan citationid=\"CR78\" class=\"CitationRef\"\u003e78\u003c/span\u003e] [\u003cspan citationid=\"CR79\" class=\"CitationRef\"\u003e79\u003c/span\u003e] and is commonly used [\u003cspan citationid=\"CR80\" class=\"CitationRef\"\u003e80\u003c/span\u003e] [\u003cspan citationid=\"CR81\" class=\"CitationRef\"\u003e81\u003c/span\u003e]. Random Forest is an ensemble estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and to control over-fitting. 
The decisions made by each tree carry equal weight, while the order of the decisions is random [\u003cspan citationid=\"CR82\" class=\"CitationRef\"\u003e82\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eFollowing Poldrack et al. [\u003cspan citationid=\"CR83\" class=\"CitationRef\"\u003e83\u003c/span\u003e], accuracy was assessed by the coefficient of determination (R\u0026sup2;) [\u003cspan citationid=\"CR84\" class=\"CitationRef\"\u003e84\u003c/span\u003e], which measures how well the regression predictions approximate the real data points. It can be interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variables. R\u0026sup2; has an upper bound of 1, which indicates that the regression model perfectly predicts the data, while a value of 0 corresponds to a model that does no better than always predicting the mean. On held-out data, R\u0026sup2; can also be negative; in that case, the mean of the data alone fits the observations better than the predicted values. Thus, negative values indicate very poor generalisation of the model. For the cross-validation results, the mean of R\u0026sup2; was calculated over 10 folds.\u003c/p\u003e \u003cp\u003eSecondly, we aimed to investigate which of the many prosodic features were most important, relative to all other features, for training the model successfully. For this purpose, feature importance was calculated using the impurity-based feature importance of Random Forest, also known as the Gini importance [\u003cspan citationid=\"CR85\" class=\"CitationRef\"\u003e85\u003c/span\u003e] [\u003cspan citationid=\"CR86\" class=\"CitationRef\"\u003e86\u003c/span\u003e]. When building a decision tree, features are selected at each node in order to divide the data into subsets that are as \u0026ldquo;pure\u0026rdquo; as possible with regard to the target variable. Gini Impurity measures how often a randomly chosen data point within a subset would be incorrectly labeled, reflecting the degree of disorder or \u0026ldquo;impurity\u0026rdquo; within the data. 
In contrast, Gini Importance assesses the overall decrease in node impurity resulting from splits based on a specific feature. It considers the probability of reaching each node and calculates the weighted reduction in impurity. Features with higher Gini importance are considered more important for predicting the target variable [\u003cspan citationid=\"CR85\" class=\"CitationRef\"\u003e85\u003c/span\u003e]. Feature importance was computed for the final estimator, as well as for each fold to estimate the variability of the importance. The sum of all feature importance scores adds up to 1.\u003c/p\u003e \u003cp\u003eThirdly, detailed analyses were conducted to examine the effects of confound removal and stratification. Here, we used other models such as Random Forest Regressor, ExtraTree Regressor, and Ridge Regression to regress out the confounds from the features in order to compare model performance depending on how the confounds were removed.\u003c/p\u003e \u003cp\u003eMoreover, we employed an approximate permutation test approach, suggested by North and colleagues [\u003cspan citationid=\"CR87\" class=\"CitationRef\"\u003e87\u003c/span\u003e], to disentangle predictive information of the features from that of the confounds. To achieve this, we permuted each feature separately. Here, the association between features and targets is randomised, while the association between confounds and targets remains unchanged. 10-fold cross-validation was performed for each permutation, and R\u0026sup2; scores for 1000 permutations were used to construct an empirical null distribution, from which p-values were computed as the proportion of permuted R\u0026sup2; scores greater than or equal to the R\u0026sup2; score of the original non-permuted data. The threshold value for the two-tailed test was set to \u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.05. 
Significant p-values indicate that predictive information stems from the features rather than the confounds alone.\u003c/p\u003e \u003c/div\u003e"},{"header":"Results","content":"\u003cp\u003eIn cross-validation, the models were trained to predict each of the EF targets using all of the prosodic features. The confounding variables sex, age, and education were regressed out of the features, and the folds were stratified by target distribution. Performance was evaluated using the coefficient of determination R\u0026sup2; averaged over the 10 folds.\u003c/p\u003e\n\u003cp\u003eOut of 66 executive function targets, 53 variables did not show positive R\u0026sup2; values, indicating no predictive power for these targets using our modeling approach. Thirteen executive function targets showed positive R\u0026sup2; values (Fig. \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e). However, only two targets, TMT BTA (processing time part A) and TMT BTB (processing time part B), showed R\u0026sup2; values\u0026thinsp;\u0026gt;\u0026thinsp;0.1, representing a reasonable model fit. These TMT variables belong to the cognitive flexibility domain. An overview of R\u0026sup2; of all 66 EF targets can be found in the supplements.\u003c/p\u003e\n\u003cp\u003eFeature importance was calculated in order to determine which of the prosodic features were particularly important for successfully predicting the EF targets. Since we observed good prediction performance (R\u0026sup2; \u0026gt; 0.1) only for TMT BTA and TMT BTB, we computed feature importance for these two targets only. Figures \u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003e and \u003cspan class=\"InternalRef\"\u003e4\u003c/span\u003e present the ten most important features predicting the EF targets TMT BTA and TMT BTB (see Appendix B for the feature importance of all prosodic variables). The majority of features identified as most important belong to the spectral prosodic domain. 
The most frequently appearing prosodic features were the Mel Frequency Cepstral Coefficients.\u003c/p\u003e\n\u003cp\u003eFor the purpose of validation, we contrasted the effects of confound removal and stratification on the prediction performance for the targets TMT BTA and TMT BTB. To begin with, we compared the prediction results with the performance of the cross-validation model without regressing out the confounding variables sex, age, and education. For both TMT targets, prediction performance was worse when the confounding variables were not removed than when they were removed (Fig. \u003cspan class=\"InternalRef\"\u003e5\u003c/span\u003e). This is true for the stratified setup as well as for the non-stratified setup. Prediction performance also decreased when the cross-validation folds were not stratified.\u003c/p\u003e\n\u003cp\u003eTo further explore the mechanism behind the decrease in prediction performance for the pipeline without confound removal, and to examine whether it is related to the specific confound removal model used, we replaced the standard confound removal model Linear Regression with other models, such as Random Forest Regressor, ExtraTree Regressor and Ridge Regression. As demonstrated in Fig. \u003cspan class=\"InternalRef\"\u003e6\u003c/span\u003e, the prediction performance varies depending on the choice of the confound removal model. The pipelines with the confound removal models Linear Regression and Ridge Regression yield higher R\u0026sup2; values than the pipelines with the confound removal models Random Forest Regressor and ExtraTree Regressor.\u003c/p\u003e\n\u003cp\u003eFinally, we evaluated the conditions with different confound removal models by using permutation tests. 
For the EF target TMT BTA, the stratified pipeline with the cross-validation regressor Random Forest and the confound removal model Random Forest yields a significant R\u0026sup2; of 0.057 (\u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.001). For the EF target TMT BTB, the stratified pipelines with the cross-validation regressor Random Forest are significant for the confound removal model Ridge Regression, with an R\u0026sup2; of 0.196 (\u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.032), and for the confound removal model Linear Regression, with an R\u0026sup2; of 0.205 (\u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.017). As shown in Table \u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003e, all other positive prediction performances, measured by R\u0026sup2; values, are not significant.\u0026nbsp;\u003c/p\u003e\n\u003ctable id=\"Tab3\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eComparison of different confound removal models complemented by the p-value. 
CRmodel\u0026thinsp;=\u0026thinsp;Confound removal model, RF\u0026thinsp;=\u0026thinsp;Random Forest Regressor, ET\u0026thinsp;=\u0026thinsp;ExtraTree Regressor, LG\u0026thinsp;=\u0026thinsp;Linear Regression, Ridge\u0026thinsp;=\u0026thinsp;Ridge Regression, withoutCR\u0026thinsp;=\u0026thinsp;without confound removal, strat\u0026thinsp;=\u0026thinsp;stratified.\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\" colspan=\"3\"\u003e\n \u003cp\u003eTMT BTA\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\" colspan=\"3\"\u003e\n \u003cp\u003eTMT BTB\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eCondition\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eR\u003csup\u003e2\u003c/sup\u003e\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003ep-value\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eCondition\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eR\u003csup\u003e2\u003c/sup\u003e\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003ep-value\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCRmodelRF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e-0.142\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.009\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCRmodelRF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e-0.343\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.161\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCRmodelRF_strat\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.057\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCRmodelRF_strat\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e-0.171\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.069\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCRmodelET\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e-0.172\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCRmodelET\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e-0.156\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCRmodelET_strat\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e-0.003\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.005\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCRmodelET_strat\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e-0.082\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCRmodelRidge\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.097\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.691\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCRmodelRidge\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n 
\u003cp\u003e0.106\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.058\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCRmodelRidge_strat\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.262\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.188\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCRmodelRidge_strat\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.196\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.032\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCRmodelLG\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.102\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.633\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCRmodelLG\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.081\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.162\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCRmodelLG_strat\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.260\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.200\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCRmodelLG_strat\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.205\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.017\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cbr\u003e\u003c/p\u003e\n\u003cp\u003eTo summarise, we initially found a moderate predictive 
power of prosodic features for TMT BTA and TMT BTB. However, considering all results, predictive power decreases when the confounding variables sex, age, and education are not removed, indicating confound leakage. In addition, the predictive power increases when stratification is performed. Pipelines with different models for removing confounding factors perform differently. Ultimately, two out of 20 models are significant, which suggests that the prediction is at least partly driven by the features in these models.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eThis study investigated the relationship between executive functions and prosody by examining whether prosodic features can predict executive functions. In summary, we initially found a moderate predictive power of prosodic features for TMT BTA and TMT BTB. However, considering all results, predictive power decreases when the confounding variables sex, age, and education are not removed, indicating confound leakage for most of the models.\u003c/p\u003e \u003cp\u003eFirstly, we evaluated 66 models, each predicting one executive function variable from the prosodic features. We employed 10-fold cross-validation with stratification by target variable and confound removal of sex, age, and education. The results showed poor or no prediction performance for 64 out of 66 EF targets.\u003c/p\u003e \u003cp\u003eOnly the models for the TMT targets TMT BTA and TMT BTB, relating to cognitive flexibility, initially appeared to have a moderately valid predictive performance. 
Without the additional analyses that we conducted for validation, these results could have been interpreted as follows: Our results would have confirmed findings from previous studies on a close correlation between executive functions and language in general [\u003cspan citationid=\"CR88\" class=\"CitationRef\"\u003e88\u003c/span\u003e] [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e], and would have been in line with research conducted in different patient cohorts [\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e] [\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e] [\u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e50\u003c/span\u003e], reporting connections between cognitive flexibility and prosody [\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e]. In our study, we would have found these associations in healthy participants. Consistent with the literature, this study would have shown that features from various prosodic domains are important for the models to learn. This would have validated that prosodic features of different kinds are closely related to executive functions, as described in previous studies [\u003cspan citationid=\"CR89\" class=\"CitationRef\"\u003e89\u003c/span\u003e] [\u003cspan citationid=\"CR90\" class=\"CitationRef\"\u003e90\u003c/span\u003e] [\u003cspan citationid=\"CR91\" class=\"CitationRef\"\u003e91\u003c/span\u003e]. Furthermore, predominantly spectral prosodic parameters would have shown importance for the model fits, especially the Mel Frequency Cepstral Coefficients, which are already used as a biomarker in depressive disorders [\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e] [\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e]. 
As described in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, the Mel Frequency Cepstral Coefficients are defined as the perceived pitch of the frequency spectrum. More precisely, these are coefficients of the Mel scale, which relates the perceived frequency of a tone to the actual measured frequency. It scales the frequency in order to match more closely what the human ear can hear [\u003cspan citationid=\"CR92\" class=\"CitationRef\"\u003e92\u003c/span\u003e]. It therefore would have been deduced from the study that spectral parameters, in particular the Mel Frequency Cepstral Coefficients, are closely related to executive functions. Furthermore, the findings would have confirmed that easy-to-capture spontaneous speech derived from different tasks is suitable for the extraction of prosodic features. In summary, the present research would have raised the possibility that this predictive power of prosodic features could be an important biomarker for executive function impairment or its future decline.\u003c/p\u003e \u003cp\u003eHowever, given the additional in-depth analyses of the ML pipeline that partly invalidate the initial results, our findings need to be reinterpreted as follows:\u003c/p\u003e \u003cp\u003eWe expect models to perform better if the effects of the confounding variables are not excluded, given that this would provide more information for the algorithm to learn. However, the prediction performance decreases for both TMT targets when not removing the confounding variables sex, age, and education. This is not in line with our expectation because in our scenario, the prediction performance should be worse if the confounding variables are removed, as the algorithm can then only learn from the association between confound-free features and the target. We found that information from these confounds, namely sex, age, and education leaked into the predictions through the confound removal procedure. 
The inadvertent injection of this information occurs particularly when the confounding variables and the targets show a strong correlation and this is coupled with the use of a high number of features, as explained by Hamdan et al. [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e] and Sasse \u0026amp; Nicolaisen-Sobesky et al. [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]. This is indeed the case in our data set (see Appendix C). There is a strong correlation between the TMT targets and the confounding variables. In addition, we used a high number of features within the cross-validation pipeline, because we wanted to investigate EF and prosody in an exploratory manner. While our data set was relatively small compared to most ML studies, which typically increases the risk of leakage [\u003cspan citationid=\"CR93\" class=\"CitationRef\"\u003e93\u003c/span\u003e], it represents a reasonable size when compared to studies investigating speech biomarkers [\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e]. The results also confirm that these observations occur in both stratified and non-stratified conditions. As expected, stratification by target distribution generally increased the predictive performance. This is in line with Diamantidis et al. [\u003cspan citationid=\"CR94\" class=\"CitationRef\"\u003e94\u003c/span\u003e] and Hastie et al. [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e], who show that equally representative cross-validation folds lead to improved predictive power. Additionally, stratification can also increase confound leakage. This can be derived from the fact that the difference in predictive power between the pipelines with and without confound removal is even greater in the stratified condition (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e). 
Furthermore, the results illustrate that the observed confound leakage is not bound to the use of Linear Regression as the confound removal model, but also occurs when other models are employed.\u003c/p\u003e \u003cp\u003eOverall, these observations raise concerns about the trustworthiness of the primary results. Nonetheless, one cannot definitively rule out that information from the features also contributed to the predictive power of the present results. We therefore conducted permutation testing for the different cross-validation models. Since the permutation tests for the two TMT targets each identified models that can be interpreted as significant, we speculate that the predictive power is partly due to the information contained in the features, even though the confounding variables also contribute to the prediction. However, this was only observed in two of 66 EF targets, and for these two targets only in specific confound removal models. For this reason, we can infer the predictive power of prosodic features only with reservations. Further analyses of this type with other data sets would need to be carried out to verify this.\u003c/p\u003e \u003cp\u003eIn conclusion, the present results highlight the challenges and pitfalls of conducting ML analyses with the aim of predicting variables of interest, including cognitive performance. This example shows which misinterpretations could have been deduced from the initial results. This can be particularly dangerous if the findings match previous studies, as is the case here. This is crucial, as ML studies are becoming increasingly important and widely employed, especially with the accessibility of large amounts of data. In this respect, we caution and recommend that when using ML analyses to predict cognitive performance, quality controls should be performed to prevent false results. This is also true when interpreting ML results of other researchers. 
This study has contributed to uncovering more insight into a pitfall in ML analysis arising due to confound leakage. As confounding is ubiquitous in social and biological sciences, it should be further deciphered how confound leakage occurs and which contributing factors can be taken into account. Additionally, our analysis framework provides a blueprint for further research investigating whether prosody can serve as a predictive biomarker of executive dysfunction.\u003c/p\u003e "},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eData Availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003ePart of the data used in this study is publicly available upon request. Researchers who wish to acquire access to the data are kindly asked to contact Julia A. Camilleri at [email protected], as described in the related publication Camilleri, J.A., Volkening, J. et al. SpEx: a German-language dataset of speech and executive function performance. Sci Rep 14, 9431 (2024). https://doi.org/10.1038/s41598-024-58617-3\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study was supported by\u003c/p\u003e\n\u003cul class=\"decimal_type\"\u003e\n \u003cli\u003ethe Deutsche Forschungsgemeinschaft (DFG, GE 2835/2\u0026ndash;1, EI 816/16-1 and EI 816/21-1),\u0026nbsp;\u003c/li\u003e\n \u003cli\u003ethe National Institute of Mental Health (R01-MH074457),\u003c/li\u003e\n \u003cli\u003ethe Helmholtz Portfolio Theme \u0026quot;Supercomputing and Modeling for the Human Brain\u0026quot;,\u0026nbsp;\u003c/li\u003e\n \u003cli\u003ethe Virtual Brain Cloud (EU H2020, no. 826421) \u0026amp;\u0026nbsp;\u003c/li\u003e\n \u003cli\u003ethe National Institute on Aging (R01AG067103).\u0026nbsp;\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor information\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThese authors contributed equally: Julia A. 
Camilleri and Susanne Weis.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eInstitute of Systems Neuroscience, Medical Faculty, Heinrich Heine University D\u0026uuml;sseldorf, D\u0026uuml;sseldorf, Germany.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eGianna Kuhles, Sami Hamdan, Simon B. Eickhoff, Kaustubh R. Patil, Julia A. Camilleri, Susanne Weis\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eInstitute of Neuroscience and Medicine, Brain and Behaviour (INM-7), Research Centre J\u0026uuml;lich, J\u0026uuml;lich, Germany.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eGianna Kuhles, Sami Hamdan, Simon B. Eickhoff, Kaustubh R. Patil, Julia A. Camilleri, Susanne Weis\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDepartment of Psychiatry, Psychotherapy and Psychosomatics, Medical Faculty, RWTH Aachen University, Aachen, Germany.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eStefan Heim\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eInstitute of Neuroscience and Medicine, Structural and functional Organization of the Brain (INM-1), Research Center Jülich, Jülich, Germany.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eStefan Heim\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eG.K., J.A.C., S.W. conceived the project and designed the study. S.H., S.H., S.B.E., K.R.P. contributed essential resources. G.K. with contributions from S.W. and all other authors wrote the manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no competing interests.\u0026nbsp;\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eKarako, K., Predictive deep learning models for cognitive risk using accessible data. BioSci. Trends 18, 66-72 (2024). \u003c/li\u003e\n\u003cli\u003eBzdok, D., Varoquaux, G., \u0026amp; Steyerberg, E. W. Prediction, not association, paves the road to precision medicine. 
JAMA Psychiatry 78, 127-128 (2021). \u003c/li\u003e\n\u003cli\u003eCotta Ramusino, M. et al. Diagnostic performance of molecular imaging methods in predicting the progression from mild cognitive impairment to dementia: an updated systematic review. Eur. J Nucl. Med. Mol. Imaging 51, 1876-1890 (2024).\u003c/li\u003e\n\u003cli\u003eRoheger, M., Liebermann-Jordanidis, H., Krohm, F., Adams, A., \u0026amp; Kalbe, E. Prognostic factors and models for changes in cognitive performance after multi-domain cognitive training in healthy older adults: A systematic review. Front. Hum. Neurosci. 15, 636355; https://doi.org/10.3389/fnhum.2021.636355 (2021). \u003c/li\u003e\n\u003cli\u003eDwyer, D. B., Falkai, P., \u0026amp; Koutsouleris, N. Machine learning approaches for clinical psychology and psychiatry. Annu. Rev. Clin. Psychol. 14, 91-118 (2018). \u003c/li\u003e\n\u003cli\u003eArbabshirani, M. R., Plis, S., Sui, J., \u0026amp; Calhoun, V. D. Single subject prediction of brain disorders in neuroimaging: Promises and pitfalls. Neuroimage 145, 137-165 (2017). \u003c/li\u003e\n\u003cli\u003eRankin, D. et al. Identifying key predictors of cognitive dysfunction in older people using supervised machine learning techniques: observational study. JMIR Med. Inform. 8, 20995; https://doi.org/10.2196/20995 (2020). \u003c/li\u003e\n\u003cli\u003eAnsart, M. et al. Predicting the progression of mild cognitive impairment using machine learning: a systematic, quantitative and critical review. Med. Image Anal. 67, 101848; https://doi.org/10.1016/j.media.2020.101848 (2021). \u003c/li\u003e\n\u003cli\u003eAhmad, S., El-Affendi, M. A., Anwar, M. S., \u0026amp; Iqbal, R. Potential future directions in optimization of students\u0026rsquo; performance prediction system. Comput. Intell. Neurosci. 1, 6864955; https://doi.org/10.1155/2022/6864955 (2022). \u003c/li\u003e\n\u003cli\u003eDomingos, P. A few useful things to know about machine learning. Comm. ACM. 
55, 78-87 (2012).\u003c/li\u003e\n\u003cli\u003eKapoor, S., \u0026amp; Narayanan, A. Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4, 100804; https://doi.org/10.1016/j.patter.2023.100804 (2023). \u003c/li\u003e\n\u003cli\u003eSasse, L., \u0026amp; Nicolaisen-Sobesky, E. On Leakage in Machine Learning Pipelines. arXiv preprint arXiv:2311.04179 (2024). \u003c/li\u003e\n\u003cli\u003eHamdan, S. et al. Confound-leakage: confound removal in machine learning leads to leakage. GigaScience 12, giad071; https://doi.org/10.1093/gigascience/giad071 (2023). \u003c/li\u003e\n\u003cli\u003eHastie, T., Tibshirani, R., \u0026amp; Friedman, J. H. The elements of statistical learning: data mining, inference, and prediction. 1-758 (Springer, 2009).\u003c/li\u003e\n\u003cli\u003eArdila, A. The executive functions in language and communication in Cognition and acquired language disorders (ed. Peach, R. K. \u0026amp; Shapiro, L. P.) 147-166 (Mosby, 2012).\u003c/li\u003e\n\u003cli\u003eBaddeley, A. Working memory: looking back and looking forward. Nat. Rev. Neurosci. 4, 829-839 (2003).\u003c/li\u003e\n\u003cli\u003eLevelt, W. J. Accessing words in speech production: Stages, processes, and representations. Cogn. 42, 1-22 (1992).\u003c/li\u003e\n\u003cli\u003eGoldstein, S., Naglieri, J. A., Princiotta, D., \u0026amp; Otero, T. M. Introduction: A history of executive functioning as a theoretical and clinical construct. in Handbook of executive functioning. (ed. Goldstein, S. \u0026amp; Naglieri, J. A.) 3-12 (Springer Science, 2014). \u003c/li\u003e\n\u003cli\u003eWard, J. The Student\u0026lsquo;s Guide to Cognitive Neuroscience. (Psychology Press, 2015). \u003c/li\u003e\n\u003cli\u003eFriedman, N. et al. Individual differences in executive functions are almost entirely genetic in origin. J. Exper. Psychol. 137, 201-225 (2008).\u003c/li\u003e\n\u003cli\u003eDiamond, A. Executive functions. An. Rev. Psy. 
64, 135-168 (2013).\u003c/li\u003e\n\u003cli\u003eMiyake, A. et al. The unity and diversity of executive functions and their contributions to complex \u0026lsquo;Frontal Lobe\u0026rsquo; tasks: a latent variable analysis. Cognit. Psychol. 41, 49-100 (2000). \u003c/li\u003e\n\u003cli\u003eL\u0026ouml;ffler, C., Frischkorn, G. T., Hagemann, D., Sadus, K., \u0026amp; Schubert, A. L. The common factor of executive functions measures nothing but speed of information uptake. Psychol. Res. 88, 1092-1114 (2024). \u003c/li\u003e\n\u003cli\u003eBarch, D. M. The cognitive neuroscience of schizophrenia. Annu. Rev. Clin. Psychol. 1, 321-353 (2005). \u003c/li\u003e\n\u003cli\u003eGuarino, A. et al. Executive functions in Alzheimer disease: a systematic review. Front. Neurosci. 10, 437 (2019).\u003c/li\u003e\n\u003cli\u003eKudlicka, A., Clare, L., \u0026amp; Hindle, J. V. Executive functions in Parkinson\u0026rsquo;s disease: Systematic review and meta-analysis. Mov. Disord. 26, 2305-2315 (2011).\u003c/li\u003e\n\u003cli\u003eNigg, J. T., Blaskey, L. G., Huang-pollock, C. L., \u0026amp; Rappley, M. D. Neuropsychological Executive Functions and DSM-IV ADHD. Subtypes. J. Am. Acad. Child Adolesc. Psych. 41, 59-66 (2002).\u003c/li\u003e\n\u003cli\u003eTavares, J. V. T. et al. Distinct profiles of neurocognitive function in unmedicated unipolar depression and bipolar II depression. Biol. Psychol. 62, 917-924 (2007).\u003c/li\u003e\n\u003cli\u003eSalthouse, T., Atkinson, T., \u0026amp; Berish, D. Executive functioning as a potential mediator of age-related cognitive decline in normal adults. J. Exper. Psychol. 132, 566-594 (2003).\u003c/li\u003e\n\u003cli\u003eNovick, J. M., Trueswell, J. C., \u0026amp; Thompson, S. L. Cognitive control and parsing: Reexamining the role of Broca\u0026rsquo;s area in sentence comprehension. Cogn, Affec. \u0026amp; Behav. Neurosci. 5, 263-281 (2005).\u003c/li\u003e\n\u003cli\u003eLaver, J. Principles of phonetics. (Cambridge University Press, 1994). 
\u003c/li\u003e\n\u003cli\u003eEyben, F. et al. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing. Transac. Affect. Com. 7, 190-202 (2015).\u003c/li\u003e\n\u003cli\u003eHecker, P., Steckhan, N., Eyben, F., Schuller, B. W., \u0026amp; Arnrich, B. Voice analysis for neurological disorder recognition \u0026ndash; A systematic review and perspective on emerging trends. Front. Digit. Health 4, 842301; https://doi.org/10.3389/fdgth.2022.842301 (2022). \u003c/li\u003e\n\u003cli\u003eRamanarayanan, V., Lammert, A. C., Rowe, H. P., Quatieri, T. F., \u0026amp; Green, J. R. Speech as a biomarker: Opportunities, interpretability, and challenges. Perspect. ASHA Spec. Interest Groups 7, 276-283 (2022). \u003c/li\u003e\n\u003cli\u003eRobin, J. et al. Evaluation of speech-based digital biomarkers: review and recommendations. Digl Biomark. 4, 99-108 (2020). \u003c/li\u003e\n\u003cli\u003eMart\u0026iacute;nez-S\u0026aacute;nchez, F., Meil\u0026aacute;n, J. J. G., Carro, J., and Ivanova, O. A prototype for the voice analysis diagnosis of Alzheimer\u0026apos;s disease. J. Alzheimers. Dis. 64, 473-481 (2018).\u003c/li\u003e\n\u003cli\u003eParola, A., Simonsen, A., Bliksted, V., \u0026amp; Fusaroli, R. Voice patterns in schizophrenia: A systematic review and Bayesian meta-analysis. Schizo. Res. 216, 24-40 (2020).\u003c/li\u003e\n\u003cli\u003eSpeer, S. R., \u0026amp; Ito, K. Prosody in first language acquisition\u0026ndash;Acquiring intonation as a tool to organize information in conversation. Lang \u0026amp; Ling Com. 3, 90-110 (2009).\u003c/li\u003e\n\u003cli\u003eAlexander, M. P., Benson, D. F., \u0026amp; Stuss, D. T. Frontal lobes and language. Brain \u0026amp; Lang. 37, 656-691 (1989).\u003c/li\u003e\n\u003cli\u003eRoss, E. D. The aprosodias: Functional-anatomical organization of the affective components of language in the right hemisphere. Arch Neurol. 140, 695-710 (1981).\u003c/li\u003e\n\u003cli\u003eKeulen, S. et al. 
Psychogenic foreign accent syndrome: a new case. Front. Neurosci. 10, 143 (2016).\u003c/li\u003e\n\u003cli\u003eRoy, A., Allain, P., Roulin, J. L., Fournet, N., \u0026amp; Le Gall, D. Ecological approach of executive functions using the behavioural assessment of the dysexecutive syndrome for children (BADS-C): Developmental and validity study. J. Neuropsych. 37, 956-971 (2015).\u003c/li\u003e\n\u003cli\u003eBreitenstein, C., Van Lancker, D., Daum, I., \u0026amp; Waters, C. H. Impaired perception of vocal emotions in Parkinson\u0026apos;s disease: influence of speech time processing and executive functioning. Brain \u0026amp; Cogn. 45, 277-314 (2001).\u003c/li\u003e\n\u003cli\u003eNevler, N. et al. Automatic measurement of prosody in behavioral variant FTD. Neurol. 89, 650-656 (2017). \u003c/li\u003e\n\u003cli\u003eFilipe, M. G., Frota, S., \u0026amp; Vicente, S. G. Executive functions and prosodic abilities in children with high-functioning autism. Front. Psych. 9, 359 (2018).\u003c/li\u003e\n\u003cli\u003eAlghowinem, S., Gedeon, T., Goecke, R., Cohn, J. F., \u0026amp; Parker, G. Interpretation of depression detection models via feature selection methods. IEEE Trans. Affect. Comput. 14, 133-152 (2020). \u003c/li\u003e\n\u003cli\u003eCummins, N., Epps, J., Sethu, V., Breakspear, M., \u0026amp; Goecke, R., Modeling Spectral Variability for the Classification of Depressed Speech. Proc. Interspeech. 857-861 (2013). \u003c/li\u003e\n\u003cli\u003eMoore, I. I. E., Clements, M. A., Peifer, J. W., \u0026amp; Weisser, L. Critical analysis of the impact of glottal features in the classification of clinical depression in speech. Transact. Biomedic. 55, 96-107 (2007). \u003c/li\u003e\n\u003cli\u003eWilliamson, J. R. et al. Vocal biomarkers of depression based on motor incoordination. Proc. Aud. 3, 41-48 (2013).\u003c/li\u003e\n\u003cli\u003eEngelhardt, P. E., Nigg, J. T., \u0026amp; Ferreira, F. 
Is the fluency of language outputs related to individual differences in intelligence and executive function? Acta Psychol. 144, 424-432 (2013).\u003c/li\u003e\n\u003cli\u003eCamilleri, J. A. et al. SpEx: a German-language dataset of speech and executive function performance. Sci. Rep. 14, 9431; https://doi.org/10.1038/s41598-024-58617-3 (2024). \u003c/li\u003e\n\u003cli\u003eWiener Testsystem. (SCHUHFRIED GmbH, 2016). \u003c/li\u003e\n\u003cli\u003eStoet, G. PsyToolkit: A software package for programming psychological experiments using Linux. Behav. Res. Methods 42, 1096-1104 (2010).\u003c/li\u003e\n\u003cli\u003eReitan, R. M. Validity of the trail making test as an indicator of organic brain damage. Percept. Mot. Skills 8, 271\u0026ndash;276 (1958). \u003c/li\u003e\n\u003cli\u003eRaven, J. C., Raven, J. \u0026amp; Court, J. H. SPM Manual (Deutsche Bearbeitung und Normierung von St. Bulheller und H. Häcker). (Swets \u0026amp; Zeitlinger B.V, 1998).\u003c/li\u003e\n\u003cli\u003eGrant, D. A. \u0026amp; Berg, E. A. A behavioral analysis of degree of reinforcement and ease of shifting to new responses in a Weigl-type card-sorting problem. J. Exp. Psychol. 38, 404-411 (1948).\u003c/li\u003e\n\u003cli\u003eKaller, C. P., Unterrainer, J. M. \u0026amp; Stahl, C. Assessing planning ability with the Tower of London task: Psychometric properties of a structurally balanced problem set. Psychol. Assess. 24, 46-53 (2012).\u003c/li\u003e\n\u003cli\u003eMeiran, N. Reconfiguration of processing mode to task performance. J. Exp. Psychol. Learn. Mem. Cogn. 22, 1423-1442 (1996). \u003c/li\u003e\n\u003cli\u003eSchellig, D., Schuri, U. \u0026amp; Arendasy, M. NBN-NBACK-nonverbal. (SCHUHFRIED GmbH, 2009).\u003c/li\u003e\n\u003cli\u003eSturm, W. \u0026amp; Willmes, K. NVLT Non-Verbal Learning Test. (SCHUHFRIED GmbH, 2016).\u003c/li\u003e\n\u003cli\u003eSchellig, D. \u0026amp; Hättig, H. A. Die Bestimmung der visuellen Merkspanne mit dem Block-Board. Z. Neuropsychol. 
4, 104-112 (1993).\u003c/li\u003e\n\u003cli\u003eKaiser, S., Aschenbrenner, S., Pfüller, U., Roesch-Ely, D., \u0026amp; Weisbrod, M. Response Inhibition. (SCHUHFRIED GmbH, 2016). \u003c/li\u003e\n\u003cli\u003eSimon, J. R. \u0026amp; Wolf, J. D. Choice reaction time as a function of angular stimulus-response correspondence and age. Ergonomics 6, 99-105 (1963).\u003c/li\u003e\n\u003cli\u003eSchuhfried, G. Interferenz nach Stroop. (SCHUHFRIED GmbH, 2016).\u003c/li\u003e\n\u003cli\u003eSturm, W. Wahrnehmungs- und Aufmerksamkeitsfunktionen: Geteilte Aufmerksamkeiten. (SCHUHFRIED GmbH, 2016).\u003c/li\u003e\n\u003cli\u003eMackworth, N. H. The breakdown of vigilance during prolonged visual search. J. Exper. Psych. 1, 6-21 (1948).\u003c/li\u003e\n\u003cli\u003eGoodglass, H., \u0026amp; Kaplan, E. The assessment of aphasia and related disorders. (Lea \u0026amp; Febiger, 1972).\u003c/li\u003e\n\u003cli\u003eAmunts, J., Camilleri, J. A., Eickhoff, S. B., Heim, S., \u0026amp; Weis, S. Executive functions predict verbal fluency scores in healthy participants. Sci. Rep. 10, 1-11 (2020).\u003c/li\u003e\n\u003cli\u003eAmunts, J. et al. Comprehensive verbal fluency features predict executive function performance. Sci. Rep. 11, 1-14 (2021).\u003c/li\u003e\n\u003cli\u003eEyben, F., W\u0026ouml;llmer, M., \u0026amp; Schuller, B. Opensmile: the munich versatile and fast open-source audio feature extractor. Proc. Multimed. 18, 1459-1462 (2010).\u003c/li\u003e\n\u003cli\u003eVan Rossum, G., \u0026amp; Drake, F. L. Python 3 Reference Manual. (CreateSpace, 2009).\u003c/li\u003e\n\u003cli\u003eHamdan, S. et al. Julearn: An Easy-to-Use Library for Leakage-Free Evaluation and Inspection of ML Models. Gigabyte, gigabyte 113; https://doi.org/10.46471%2Fgigabyte.113 (2024). \u003c/li\u003e\n\u003cli\u003eMolinaro, A. M, Simon, R., \u0026amp; Pfeiffer, R. M. Prediction error estimation: a comparison of resampling methods. Bioinform. 
21, 3301-3307 (2005).\u003c/li\u003e\n\u003cli\u003eDromey, C., Silveira, J., \u0026amp; Sandor, P. Recognition of affective prosody by speakers of English as a first or foreign language. Speech comm. 47, 351-359 (2005).\u003c/li\u003e\n\u003cli\u003eVolin, J., Tykalov\u0026aacute;, T., \u0026amp; Boril, T. Stability of Prosodic Characteristics Across Age and Gender Groups. Inter Speech 3902-3906 (2017).\u003c/li\u003e\n\u003cli\u003eKaufman, S., Rosset, S., Perlich, C. \u0026amp; Stitelman, O. Leakage in data mining: formulation, detection, and avoidance. ACM Trans. Knowl. Discov. Data 6, 1\u0026ndash;21 (2012).\u003c/li\u003e\n\u003cli\u003eWolpert, D. H. The lack of a priori distinctions between learning algorithms. Neural. Com. 8, 1341-1390 (1996).\u003c/li\u003e\n\u003cli\u003eByeon H. Is the Random Forest Algorithm Suitable for Predicting Parkinson\u0026apos;s Disease with Mild Cognitive Impairment out of Parkinson\u0026apos;s Disease with Normal Cognition?. Int. J. Enviro. 17, 2594 (2020).\u003c/li\u003e\n\u003cli\u003eCordova, M. et al. Heterogeneity of executive function revealed by a functional random forest approach across ADHD and ASD. Neuro Im. Clin. 26, 102245; https://doi.org/10.1016/j.nicl.2020.102245 (2020).\u003c/li\u003e\n\u003cli\u003eAdnan, M. N., Ip, R. H., Bewong, M., \u0026amp; Islam, M. Z. BDF: A new decision forest algorithm. Inform. Sci. 569, 687-705 (2021).\u003c/li\u003e\n\u003cli\u003eGrinsztajn, L., Oyallon, E. \u0026amp; Varoquaux, G. Why do tree-based models still outperform deep learning on typical tabular data? Adv. Neural Inf. Process. Syst. 35, 507\u0026ndash;520 (2022).\u003c/li\u003e\n\u003cli\u003eBreiman, L. Random forests. Mach. Learn. 45, 5-32 (2001).\u003c/li\u003e\n\u003cli\u003ePoldrack, R. A., Huckins, G. \u0026amp; Varoquaux, G. Establishment of best practices for evidence for prediction: a review. JAMA Psychiatry 77, 534\u0026ndash;540 (2020).\u003c/li\u003e\n\u003cli\u003eWright, S. Correlation and Causation. J. 
Agric. 20, 557-585 (1921).\u003c/li\u003e\n\u003cli\u003eNembrini, S., K\u0026ouml;nig, I. R. \u0026amp; Wright, M. N. The revival of the Gini importance? Bioinformatics 34, 3711\u0026ndash;3718 (2018).\u003c/li\u003e\n\u003cli\u003ePedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825\u0026ndash;2830 (2011).\u003c/li\u003e\n\u003cli\u003eNorth, B. V., Curtis, D. \u0026amp; Sham, P. C. A note on the calculation of empirical P values from Monte Carlo procedures. Am. J. Hum. Genet. 71, 439\u0026ndash;441 (2002).\u003c/li\u003e\n\u003cli\u003eBaddeley, A. D., \u0026amp; Hitch, G. Working memory. Psych. of learn \u0026amp; motiv. 8, 47-89 (1974).\u003c/li\u003e\n\u003cli\u003eYap, P. et al. Development trends of white matter connectivity in the first years of life. Plos one. 6, e24678; https://doi.org/10.1371/journal.pone.0024678 (2011). \u003c/li\u003e\n\u003cli\u003eTamarit, L., Goudbeek, M., \u0026amp; Scherer, K. R. Spectral slope measurements in emotionally expressive speech in Proc. of Speech. 7, 169-183 (2008). \u003c/li\u003e\n\u003cli\u003eLe, P., Ambikairajah, E., Epps, J., Sethu, V., \u0026amp; Choi, E. H. C. Investigation of spectral centroid features for cognitive load classification. Speech Comm. 54, 540-551 (2011). \u003c/li\u003e\n\u003cli\u003eHasan, M. R., Jamil, M., \u0026amp; Rahman, M. G. R. M. S. Speaker identification using mel frequency cepstral coefficients. Variat. 1, 565-568 (2004).\u003c/li\u003e\n\u003cli\u003eRosenblatt, M., Tejavibulya, L., Jiang, R., Noble, S. \u0026amp; Scheinost, D. Data leakage inflates prediction performance in connectome-based machine learning models. Nat. Commun. 15, 1829 (2024).\u003c/li\u003e\n\u003cli\u003eDiamantidis, N. A., Karlis, D. \u0026amp; Giakoumakis, E. A. Unsupervised stratification of cross-validation for accuracy estimation. Artif. Intell. 
1\u0026ndash;16 (2000).\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-4745684/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4745684/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eMachine learning analyses are widely used for predicting cognitive abilities, yet there are pitfalls that need to be considered during their implementation and interpretation of the results. Hence, the present study aimed at drawing attention to the risks of erroneous conclusions incurred by confounding variables illustrated by a case example predicting executive function performance by prosodic features. Healthy participants (n\u0026thinsp;=\u0026thinsp;231) performed speech tasks and EF tests. From 264 prosodic features, we predicted EF performance using 66 variables, controlling for confounding effects of age, sex, and education. A reasonable model fit was apparently achieved for EF variables of the Trail Making Test. 
However, in-depth analyses revealed indications of confound leakage, leading to inflated prediction accuracies, due to a strong relationship between confounds and targets. These findings highlight the need to control confounding variables in ML pipelines and caution against potential pitfalls in ML predictions.\u003c/p\u003e","manuscriptTitle":"Pitfalls in using ML to predict cognitive function performance","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-08-17 02:15:06","doi":"10.21203/rs.3.rs-4745684/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2025-04-03T10:28:36+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-01-12T19:19:10+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-01-09T21:25:38+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"92279549728191745996868529929664491520","date":"2024-12-29T18:18:47+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"179502868720921213952317192528540830155","date":"2024-12-15T15:41:49+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2024-08-05T14:29:58+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2024-07-29T19:57:42+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2024-07-22T12:31:36+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2024-07-18T12:33:32+00:00","index":"","fulltext":""},{"type":"submitted","content":"Scientific Reports","date":"2024-07-15T21:56:50+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific 
Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"3f360d21-9d56-440c-99fe-4dd20be8f07b","owner":[],"postedDate":"August 17th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[{"id":35864829,"name":"Health sciences/Biomarkers"},{"id":35864830,"name":"Biological sciences/Neuroscience"},{"id":35864831,"name":"Biological sciences/Neuroscience/Computational neuroscience"},{"id":35864833,"name":"Biological sciences/Psychology"},{"id":35864835,"name":"Biological sciences/Psychology/Human behaviour"}],"tags":[],"updatedAt":"2025-11-03T15:59:42+00:00","versionOfRecord":{"articleIdentity":"rs-4745684","link":"https://doi.org/10.1038/s41598-025-24325-9","journal":{"identity":"scientific-reports","isVorOnly":false,"title":"Scientific Reports"},"publishedOn":"2025-10-29 15:57:02","publishedOnDateReadable":"October 29th, 2025"},"versionCreatedAt":"2024-08-17 02:15:06","video":"","vorDoi":"10.1038/s41598-025-24325-9","vorDoiUrl":"https://doi.org/10.1038/s41598-025-24325-9","workflowStages":[]},"version":"v1","identity":"rs-4745684","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4745684","identity":"rs-4745684","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}



License: CC-BY-4.0