Psychometric Evidence of Digital and Online Executive Function Tests: A Systematic Review

doi:10.21203/rs.3.rs-8543356/v1

2026 · doi:10.21203/rs.3.rs-8543356/v1

preprint OA: closed

Systematic Review

Telesmagno Neves-Teles, Jonatha Berguer de Souza, Cristian Zanon, and 1 more

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-8543356/v1

This work is licensed under a CC BY 4.0 License

Status: Published. Journal publication published 12 Mar, 2026. Published version: International Journal of Testing. Version 1 posted.

Abstract

Objective

This review maps psychometric evidence from the past decade on digital and online tools assessing executive functions (EF) in healthy adults. It focuses on six modern domains of validity: content, structure, external, response processes, consequences, and reliability.

Methods

Searches were conducted across PsycNet, PubMed, Embase, Web of Science, and the Virtual Health Library, guided by PRISMA.
Core terms included executive functions, digital tools, healthy adults, and psychometric properties. Risk of bias was systematically evaluated.

Results

Thirty-one studies met inclusion criteria, encompassing 11,246 participants. Most tools assessed core EF domains (working memory, inhibition, and cognitive flexibility) through performance-based tasks or online questionnaires. Reliability was reported in 23 studies, though often via single indices. Content validity appeared in 26 studies but lacked methodological rigor. Structural and external validity were reported in 8 and 17 studies, respectively. Response process evidence (n = 22) and consequential validity (n = 31) were frequently cited but rarely examined in depth. No study addressed all six domains comprehensively. The risk of bias was low for administration but high for sampling. Applicability concerns included unrepresentative samples and weak construct alignment.

Conclusion

While the field is expanding, it lacks methodological depth. Despite growing interest in digital EF tools, essential domains, particularly structural modeling and consequence analysis, remain underdeveloped. This review underscores the need for comprehensive validation frameworks that integrate theoretical coherence, empirical rigor, and equity-based implementation.

Subjects: Psychology, Cognitive Neuroscience

Keywords: executive function, digital tools, psychometric evidence, online testing, healthy adults

Introduction

Executive functions (EF) refer to a set of higher-order cognitive processes responsible for goal-directed behavior, including working memory, inhibitory control, cognitive flexibility, decision-making, and emotional self-regulation (Diamond, 2013; Dias & Malloy-Diniz, 2023; Friedman & Miyake, 2017; Zelazo & Carlson, 2023).
These skills are essential for adaptive functioning across the lifespan and play a critical role in education, employment, social relationships, and mental health (Diamond, 2013; Ferguson et al., 2021; Zelazo, 2015; Zelazo & Carlson, 2023). As such, accurate assessment of EF is fundamental to both clinical decision-making and scientific research (Burgess & Stuss, 2017; Kessels & Hendriks, 2023; Zucchella et al., 2018).

Traditionally, EF assessments have been classified as either objective performance-based tasks or subjective self- and informant-report questionnaires (Dias & Malloy-Diniz, 2024; Soto et al., 2020). While both formats yield valuable insights, they also present practical limitations, such as the need for trained administrators, limited ecological validity, and reduced accessibility for geographically or socially vulnerable populations. Digital and online tools have thus emerged as scalable alternatives to traditional paper-and-pencil or lab-based formats. Importantly, they can support both objective and subjective assessment modes, offering advantages such as automated scoring, improved precision (e.g., in capturing response latencies), and remote administration (Aalbers et al., 2013; Arioli et al., 2022; Feenstra et al., 2018; White et al., 2018).

The adoption of digital EF assessments has grown considerably in recent years, driven by technological advances and increasing demand for flexible, accessible testing environments (Feenstra et al., 2018; Park & Schott, 2022; Wang et al., 2023; White et al., 2018). These tools are particularly useful in settings where face-to-face assessment is impractical or resource-intensive. Digital platforms can enhance performance-based tests through precise stimulus control and latency tracking, and can benefit self-report formats by enabling standardized administration, minimizing social desirability effects, and supporting large-scale deployment.
These features expand the utility of EF assessments in both research and applied contexts (Naglieri et al., 2004; Park & Schott, 2022; Parsey & Schmitter-Edgecombe, 2013).

Nevertheless, the transition from traditional to digital formats introduces specific challenges that warrant careful scrutiny, as highlighted in modern psychometric standards (American Educational Research Association et al., 2014; Clark & Watson, 2019; Revelle & Condon, 2019). Variability in devices, internet connectivity, environmental distractions, and user familiarity with technology may threaten reliability and standardization. Moreover, adaptations of traditional instruments to digital formats may alter task demands or response modalities, potentially affecting construct validity and comparability (Thomas, 2019). Several studies included in this review echo these concerns, reporting inconsistencies across platforms, reduced measurement precision in uncontrolled settings, and limited evidence of construct invariance (Aalbers et al., 2013; Feenstra et al., 2018; Iverson et al., 2009; Wang et al., 2023). These findings underscore the need for rigorous validation tailored to the specific characteristics of digital and online assessments.

Despite their growing availability, many digital EF tools lack comprehensive validation aligned with contemporary psychometric frameworks (American Educational Research Association et al., 2014; Sellbom & Tellegen, 2019; Thomas, 2019). While internal consistency and test–retest reliability are commonly reported (Revelle & Condon, 2019), other key domains, such as content representativeness, factorial structure, external validity, response processes, and the consequences of test use, remain underexplored (Clark & Watson, 2019). This limited scope, evident in several studies reviewed here, restricts the ability of clinicians, educators, and researchers to make informed decisions about the scientific adequacy of these instruments.
To address this gap, the present systematic review aimed to map and synthesize the psychometric evidence produced over the past decade for digital and online EF assessments targeting healthy adults. Anchored in five core dimensions of contemporary psychometrics, namely content validity, structural validity, external validity (relations with other variables), response processes, and consequential validity (American Educational Research Association et al., 2014; Field, 2020; Tabachnick & Fidell, 2019), this review evaluates the extent to which each domain has been examined in validation studies. By doing so, it seeks to guide evidence-based decisions and promote higher psychometric standards in the design and application of digital EF tools.

Ultimately, by consolidating findings across a wide range of instruments and validation approaches, this review intends to provide a comprehensive and up-to-date resource for researchers, clinicians, and test developers. It also aims to highlight persistent gaps and outline future research directions to strengthen the scientific foundation of EF assessment in the digital era.

Methods

This review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Page et al., 2021).

Transparency and openness

To enhance methodological transparency and reduce bias, the protocol was prospectively registered in PROSPERO (registration number CRD420251027891).
Eligibility criteria

Studies were included if they: (a) involved healthy adults (≥ 18 years); (b) evaluated digital or online tools specifically designed to assess EF, conceptualized within a componential framework (Diamond, 2013; Friedman & Miyake, 2017), including working memory, inhibitory control, and cognitive flexibility; and (c) reported original psychometric data addressing at least one domain of contemporary validity theory: reliability, content validity, structural validity, external validity (i.e., convergent, discriminant, or criterion validity), response processes validity, or consequential validity. Peer-reviewed articles published between December 2013 and February 2025, in English, Portuguese, or Spanish, with full-text access were eligible.

Studies were excluded if they: (a) targeted clinical or neurological populations; (b) used only paper-based tools; (c) lacked psychometric results; (d) were review articles, theoretical papers, case studies, conference abstracts, dissertations, or editorials; or (e) focused solely on usability or user experience without evaluating psychometric properties. Studies with a general cognitive focus but no explicit EF construct were also excluded.

Information source and search strategy

A comprehensive search was conducted on February 17, 2025, across Embase, PubMed, PsycNet, Web of Science, and the Virtual Health Library (VHL). Search strategies used controlled descriptors from each database’s thesaurus (EMTREE, MeSH, APA Thesaurus, Web of Science Core Collection, and DeCS) and Boolean operators to combine five main concepts: executive functions, digital/online tools, assessment instruments, psychometric properties, and healthy adults. No publication status filters were applied. The full search strategy, including all adapted syntaxes and term combinations, is provided in the supplementary materials.

Study selection process

All references were imported into Rayyan (Ouzzani et al., 2016) for initial screening.
Duplicates were removed, and titles and abstracts were independently reviewed in blinded mode by two authors (T.N.T. and J.B.S.). Discrepancies were resolved through discussion, with a third author (R.M.M.A.) consulted when necessary. Full texts of potentially eligible studies were then assessed by T.N.T. and independently verified by J.B.S.; any remaining disagreements were resolved in consultation with R.M.M.A. Reasons for exclusion were documented at all stages. Finally, a PRISMA flow diagram was constructed to visually summarize the study selection process.

Data collection process and synthesis methods

Data extraction was performed by T.N.T., supported by Rayyan (Ouzzani et al., 2016) and NVivo (v1.7.1), following a structured protocol. Any inconsistencies were discussed with J.B.S. and R.M.M.A. Data were extracted along five psychometric domains:

Content validity, which examines whether all relevant dimensions of the construct are represented, the coherence of items in measuring the same construct, the balance of items across dimensions, and their relevance to the target population.

Structural validity, which assesses the internal structure of the instrument, typically through exploratory or confirmatory factor analysis, to verify whether the data support the theoretical model of EF.

External validity, which includes convergent and discriminant validity (i.e., correlations with related or unrelated constructs) and criterion validity (i.e., concurrent or predictive relationships with relevant external variables).

Response processes validity, which investigates whether respondents engage with the instrument as theoretically expected and how external factors (e.g., device use, familiarity, or cognitive strategy) may influence performance.

Consequential validity, which considers the broader implications of instrument use, including its practical utility, theoretical contributions, and potential ethical or social impacts.

A two-axis synthesis was conducted.
First, a descriptive profile summarized study reference, country, tool name, sample size, mean age, and education. Second, psychometric evidence was mapped by EF domain, validity dimensions, and reliability metrics (e.g., Cronbach’s alpha, ICC). Results were organized into summary tables to facilitate comparison. Due to heterogeneity in instruments, constructs, and statistical methods, meta-analysis was not viable. Instead, a narrative synthesis was applied to highlight methodological trends, gaps, and strengths across studies.

Study risk-of-bias assessment

Study quality was assessed by T.N.T. using an adapted version of the QUADAS-2 tool (Whiting et al., 2011). Signaling questions were modified to align with the review’s scope (see Appendix A). Although originally developed for diagnostic accuracy studies, QUADAS-2 was selected for its structured, domain-based approach and adaptability to different research contexts. This flexibility made it particularly suitable for evaluating methodological quality in studies involving digital and online executive function assessments. Studies were rated as: low risk of bias, when all criteria were clearly met; high risk of bias, when the study’s methods or procedures (e.g., participant selection or test administration) could reasonably introduce bias; unclear risk, when insufficient information was available to make a definitive judgment; and not applicable, when the domain or item did not pertain to the study. All ratings were reviewed by J.B.S., with R.M.M.A. mediating unresolved cases. To further ensure rigor, the review team examined potential bias from missing data, such as unreported sample characteristics or psychometric results, and its implications for interpretation. Relevant limitations are noted in the results and discussion.

Results

Study selection

The study selection process is illustrated in Fig. 1 (PRISMA Flow Diagram).
A total of 4,019 records were identified across five databases: PubMed (n = 1,488), VHL (n = 845), PsycNet (n = 810), Embase (n = 526), and Web of Science (n = 350). After the removal of 656 duplicate records, 3,363 titles and abstracts were screened independently. This initial screening led to the exclusion of 3,286 records that did not meet the eligibility criteria. Of the remaining 77 studies, 75 were successfully retrieved, most from open-access sources and some through direct author contact via email. Two reports could not be obtained due to lack of response from the corresponding authors. After full-text review of the 75 retrieved studies, 44 were excluded for the following reasons: lack of psychometric data (n = 22), wrong population, such as children, adolescents, or clinical samples (n = 11), use of non-EF tests (n = 8), or ineligible publication type (n = 3). As a result, 31 studies met all inclusion criteria and were included in the final synthesis.

Figure 1

Study characteristics

The characteristics of each study are presented in Table 1. A total of 31 studies published over the last decade were included in this review. Collectively, these studies investigated digital or online tools aimed at assessing EF and comprised a combined sample of 11,246 healthy adult participants (6,283 females). Sample sizes ranged from 27 to 4,600 individuals.

Table 1 Study characteristics

The average age across studies was approximately 45.1 years, with some studies focusing on specific age groups such as young adults (18–34 years) and older adults (60+ years). The average educational level, when reported, was about 14.2 years, indicating that most samples were composed of adults with at least secondary education. These samples underscore the relevance of EF assessment across the adult lifespan, while also highlighting the general focus on cognitively healthy, non-clinical populations.
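The record counts reported under Study selection can be checked with simple arithmetic. The following is a minimal sketch in Python that recomputes each stage of the flow from the numbers quoted in the text; the function name `prisma_flow` and the stage labels are illustrative choices, not part of the PRISMA guidance or of the authors' workflow.

```python
"""Arithmetic check of the study selection counts quoted in the text."""

# Records identified per database, as reported in the Results.
IDENTIFIED = {
    "PubMed": 1488,
    "Virtual Health Library": 845,  # the fifth database named in the Methods
    "PsycNet": 810,
    "Embase": 526,
    "Web of Science": 350,
}
DUPLICATES = 656              # duplicate records removed before screening
EXCLUDED_AT_SCREENING = 3286  # excluded on title/abstract
NOT_RETRIEVED = 2             # reports that could not be obtained
FULL_TEXT_EXCLUSIONS = {
    "lack of psychometric data": 22,
    "wrong population": 11,
    "non-EF tests": 8,
    "ineligible publication type": 3,
}


def prisma_flow():
    """Return the count remaining at each stage (hypothetical helper)."""
    identified = sum(IDENTIFIED.values())
    screened = identified - DUPLICATES
    sought = screened - EXCLUDED_AT_SCREENING
    assessed = sought - NOT_RETRIEVED
    included = assessed - sum(FULL_TEXT_EXCLUSIONS.values())
    return {
        "identified": identified,
        "screened": screened,
        "sought": sought,
        "assessed": assessed,
        "included": included,
    }


if __name__ == "__main__":
    for stage, count in prisma_flow().items():
        print(f"{stage}: {count}")
```

Under these inputs every stage reproduces the figure given in the text, ending with the 31 included studies.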
From a geographical perspective, most studies were concentrated in high-income regions, with Europe (n = 14) and North America (n = 8) leading in publication volume. Smaller yet relevant contributions originated from Asia (n = 5), Latin America (n = 2), and Oceania (n = 2), with the included studies spanning 20 countries. This broad international distribution reflects the increasing momentum toward digital EF assessment worldwide. The demand for remote and scalable testing alternatives, fueled in part by the constraints imposed by the COVID-19 pandemic, has accelerated the shift toward digital methodologies, particularly in contexts requiring reduced in-person contact and greater logistical adaptability.

Regarding the types of instruments employed, the studies revealed a wide diversity of digital and online tools for EF assessment. These included both established instruments, such as the National Institutes of Health Toolbox Cognition Battery (NIHTB-CB) and the Mindmore Digital Cognitive Assessment (MINDMORE), and newly developed tools designed specifically for the respective studies. In terms of structure, the instruments could be broadly categorized into: (a) Comprehensive digital batteries (e.g., NIHTB-CB, Cambridge Neuropsychological Test Automated Battery [CANTAB], NeuroUX Cognitive Platform [NeuroUX], Neuropsychological Online Platform [NeurOn]); (b) Digitized versions of classical paper-and-pencil (P&P) EF tasks (e.g., Trail Making Test [TMT], Stroop, Tower of London [TOL], N-back); (c) Serious games or gamified platforms; and (d) Questionnaires or rating scales adapted for online use (e.g., Executive Function Scale for Adults [EFSA]). Among these, only two tools were used in more than one study: the NIHTB-CB (n = 3) and digital versions of the TMT (n = 2).
This pattern reflects a field marked by methodological heterogeneity, with researchers drawing on both standardized instruments and novel, context-specific tools to assess EF domains in digital and online environments.

Concerning the nature of the assessment, most of the studies (n = 27) focused on performance-based (objective) EF measures, typically involving computerized cognitive tasks that assess core executive components such as working memory, inhibitory control, and cognitive flexibility. A smaller subset of studies (n = 4) relied on subjective assessments, including self-report or informant-report questionnaires specifically designed to capture everyday executive functioning (e.g., Executive Function Scale for Adults and adaptations of traditional paper-based inventories for online use).

Altogether, this diversity in instruments, populations, and assessment strategies reflects both the versatility and the current lack of standardization in digital EF measurement practices. While digital tools offer clear advantages, such as flexible administration, cost-effectiveness, and potential for remote deployment, their widespread adoption remains limited, partly due to concerns regarding the sufficiency of their psychometric validation (Bergman et al., 2025). As emphasized by contemporary psychometric guidelines, including the American Educational Research Association (AERA) Standards for Educational and Psychological Testing (American Educational Research Association et al., 2014) and recent theoretical contributions (Clark & Watson, 2019; Revelle & Condon, 2019), ensuring both methodological quality and construct validity is essential before test results can be interpreted or generalized with confidence. These considerations underscore the importance of critically evaluating the instruments reviewed, as addressed in the following subsections.
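Since the sufficiency of psychometric validation is the concern raised here, it may help to make one commonly reported reliability index concrete. The sketch below computes Cronbach's alpha, the internal-consistency coefficient named among the reliability metrics in the Methods, using the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of total scores). The function name `cronbach_alpha` and the toy score matrix are illustrative assumptions and are not drawn from any reviewed study.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items score matrix.

    `scores` is a list of rows, one per respondent, each holding that
    respondent's score on every item. Sample variances (n - 1) are used.
    """
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    if any(len(row) != n_items for row in scores):
        raise ValueError("all rows must have the same number of items")

    # Variance of each item (column) across respondents.
    item_variances = [variance(column) for column in zip(*scores)]
    # Variance of each respondent's total score.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")

    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


if __name__ == "__main__":
    # Toy data (made up): three respondents, two items.
    toy_scores = [[1, 2], [2, 1], [3, 3]]
    print(round(cronbach_alpha(toy_scores), 3))
```

A single coefficient like this summarizes only internal consistency; it says nothing about test–retest stability or about the validity domains discussed in the following subsections.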
Risk of bias in the reviewed studies

Risk of bias was assessed using an adapted version of the QUADAS-2 tool (Appendix A), which evaluates four key domains: D1 – Participant Selection, D2 – Index Test, D3 – Reference Standard, and D4 – Flow and Timing. Table 2 presents the detailed risk-of-bias ratings for each individual study, while Fig. 2(a) summarizes the overall distribution of ratings across these domains. Below, we provide a domain-by-domain interpretation of the observed patterns; the full set of domain-level judgments and textual justifications for each study is provided in the supplementary materials. Domain D3 was considered applicable only when the study employed an external criterion measure to validate the index test. In its absence, D3 was marked as “not applicable”.

Table 2

The domain related to participant selection (D1) revealed the most critical source of bias. A total of 23 studies were rated as high risk, primarily due to convenience sampling, lack of randomization procedures, or insufficient reporting of recruitment strategies. These issues raise concerns about selection bias and limit the representativeness of the samples. Only four studies were classified as low risk, typically those with transparent and rigorous inclusion procedures. An additional four studies were marked as unclear due to limited information.

As for the assessment procedures themselves (D2), most studies received favorable evaluations, with 29 rated as low risk. This reflects the frequent use of standardized administration protocols and consistent procedural reporting. However, in two studies, the absence of detail regarding blinding procedures or uniform testing conditions raised concerns, leading to their classification as high risk. It is important to note that these ratings reflect procedural adequacy rather than the depth of theoretical validation.
In many cases, the instruments were designed to assess EF and included task elements broadly consistent with the target constructs, at least at a surface level. This distinction is important, as several tools lacked evidence of thorough construct validation, an issue addressed in the subsequent sections.

With respect to the use of external criterion measures (D3), 19 studies were deemed applicable, all of which were rated as low risk. These studies employed validated EF instruments, clinical benchmarks, or robust scoring frameworks to support the interpretability of findings. The absence of high-risk ratings in this domain likely reflects the selective inclusion of studies with clearly defined comparators.

The domain evaluating flow and timing (D4) likewise showed strong consistency, with all 31 studies rated as low risk. This uniformity suggests strong procedural control, characterized by consistent sequencing and timing of assessments and minimal data loss. The automated nature of most digital tools likely contributed to the reliability of testing procedures in this domain.

Figure 2

In addition to identifying potential sources of bias, this review also assessed concerns related to applicability, that is, the extent to which each study’s findings can be generalized to the intended populations and contexts of EF assessment. Table 2 displays the detailed applicability ratings for each individual study, while Fig. 2(b) summarizes the distribution of ratings across the three applicable domains (D1, D2, and D3). The following is a domain-wise interpretation of these patterns.

Concerns related to participant selection (D1) were particularly prominent. A total of 17 studies were rated as high concern, often due to participant profiles that did not reflect the broader populations targeted by the instruments, for example, samples composed predominantly of highly educated or digitally proficient individuals.
Thirteen studies were rated as low concern, while one was marked as unclear due to insufficient demographic detail.

Regarding the index tests (D2), most studies were judged to have low concern in terms of applicability (n = 29). This was generally based on the formal alignment between the stated purpose of the test and the constructs it intended to measure. However, as noted earlier, this evaluation reflects surface-level coherence and does not necessarily indicate strong theoretical grounding or robust construct representation. In two studies, poor alignment between the instrument and the intended EF domains led to high concern ratings due to potential construct misfit.

The applicability of reference standards (D3) was assessed in 19 studies. Of these, 18 were rated as low concern. One study raised high concern due to the use of poorly justified comparators or evaluation criteria misaligned with EF theory, which could compromise the interpretability of findings. In 12 studies, the domain was not applicable, as no external criterion was employed for instrument validation.

Taken together, these results reinforce the overall pattern observed in the risk-of-bias assessment. While most studies demonstrated procedural rigor, especially in test administration and timing, critical challenges remain regarding the representativeness of samples and the theoretical coherence between the instruments used and the constructs they aim to assess. Addressing these limitations is essential for improving the ecological validity, interpretability, and practical utility of digital and online EF assessment tools.

Results of the reviewed studies

In line with contemporary psychometric frameworks, the results are organized according to six domains of evidence: reliability, content validity, structural validity, external validity, response process validity, and consequential validity.
Table 3 presents the psychometric evidence of the reviewed studies, serving as a visual synthesis of the findings discussed below.

Table 3 Psychometric evidence in the reviewed studies

| Study | Digital/Online Test | Construct assessed | Reliability | Content validity | Structural validity | External validity | Response Process Validity | Consequential validity |
| Bergman et al. (2025) | Mindmore | ATT, MEM, LANG, EF | TRt-R: ICC = .60–.88 across 22 core scores (11 ≥ .70; 6 < .60). | Digital adaptations of established neuropsychological instruments, preserving task structure and administration procedures. | NR | NR | Practice effects in speeded tasks (e.g., TMT-A, Stroop) suggest expected cognitive sensitivity and task engagement (indirect evidence). | Demonstrates clinical utility for early detection of cognitive deficits and remote monitoring; supports informed test selection and interpretation. |
| Bruno et al. (2024) | AFLT | VIM | NR | Designed as a nonverbal analog to the RAVLT, with matched structure, memory demands, and administration procedures. | NR | CV: r = .20–.61 with ROCF (immediate, delayed, recognition). | NR | Clinically relevant for identifying material-specific memory deficits (e.g., in epilepsy); enables differentiation between verbal and nonverbal profiles via RAVLT comparison. |
| Byrne et al. (2024) | N-BR | WM | NR | Tasks selected to reflect distinct WM paradigms (updating vs. serial recall), with theoretical support for a two-factor model. | EFA: RMSEA = .026; CFI = .996; χ² = 23.58 (2-paradigm model); CFA: RMSEA = .027; CFI = .994; χ² = 10.66 (2-paradigm model). | NR | NR | Supports theoretical differentiation of working memory paradigms; applicable to hierarchical and network models. |
| Depauw et al. (2024) | Axon-TMT | PS, EF | NR | Digital version of the traditional TMT, targeting visuospatial and executive functions. | NR | CV: r = .51–.69 with TMT-A/B; CrV: performance with CAPTCHA (IT task). | RTs and error profiles analyzed to support sensitivity to executive demands (indirect evidence). | Enables rapid (~ 5 min) cognitive screening for diverse populations (neurotypical adults, older adults, children, stroke survivors); suitable for remote use. |
| Giorgini et al. (2024) | TMS | MEM, EF | NR | Covers distinct EF and memory domains; content grounded in gold-standard neuropsychological measures. | CFA: RMSEA = .000; CFI = 1.000; TLI = 1.007 (3-factor model); MI: metric invariance across countries. | CV: r = .30–.45 with Digit Span, Stroop, COWAT, SVF; DV: r < .18 with MMSE. | NR | Useful for distinguishing EF and memory deficits; supports neuropsychological screening and cross-cultural applications. |
| Hassmén et al. (2024) | CHEFS | WM, CF, IC | NR | Built from classic tasks (e.g., Matrix Reasoning, Number–Letter); aligned with Miyake et al.’s (2000) EF framework. | NR | CV: adj. R² = .11 with D-KEFS (MRA); CrV: 87% classification (Λ = .75, χ² = 14.76, p = .001). | Variation in RT and accuracy aligned with EF subcomponents (e.g., inhibition, switching), offering indirect support for response process validity. | Applicable for non-clinical EF assessment; supports use in driving fitness evaluations and referral decision-making. |
| Hatahet & Seghier (2024) | NIHTB-CB | EF, WM, PS | IC-R: α = .62–.86 (by age group). | Uses a widely recognized tool for executive assessment, grounded in established cognitive domains. | CFA: RMSEA = .052–.069; CFI = .932–.94; TLI = .914–.924 (3-factor model across age groups). | NR | Response pattern differences across age groups identified via measurement invariance; behavioral scores predicted age in older adults (indirect evidence). | Recommended for age-based differentiation (younger vs. older adults) in clinical contexts. |
| Hurtado-Pomares et al. (2024) | FAB-E | EF | IC-R: α = .60; TRt-R: ICC = .72; ρ = .70. | Derived from the FAB, with six subscales covering distinct EF components. | NR | CrV: r = .43 with MMSE; DV: r = –.52 with TMT. | NR | Enables rapid EF screening with cutoffs tailored to age and education level. |
| Morrissey et al. (2024) | NeurOn | RT, PS, EF, EM, WM, ATT, EP | TRt-R: ICC = .50–.80 (repeated tasks). | Adapted from traditional neuropsychological tests for online administration. | NR | CrV: ρ = .60 with MoCA; ρ = .61 with TMT. | Administration mode and device-type effects examined; equivalent performance across formats supported task comparability (indirect evidence). | Supports large-scale screening using nonverbal tasks; applicable to population studies and clinical trial eligibility. |
| Paolillo et al. (2024) | NeuroUX | EF, WM, RT | TRt-R: ICC > .76 (9 of 12 core measures). | Includes tasks targeting executive cognition (processing speed, memory, EF). | NR | NR | Performance influenced by smartphone type and testing environment (indirect evidence). | Suitable for longitudinal cognitive monitoring; demonstrates usability in real-world environments. |
| Perzl et al. (2024) | DSST/SART/PVT | PS, ATT, RT, SusATT | TRt-R: r = .42–.84 (within-person stability in DSST, SART, PVT). | Digital adaptations of validated tasks (DSST, SART, PVT) for occupational contexts. | NR | CV: r = .30–.46 with attention and arousal measures. | Performance reflected sensitivity to prior cognitive and emotional states (indirect evidence). | Applicable to occupational cognitive monitoring and performance tracking. |
| Toh & Yang (2023) | TSP | CF, ER | NR | Task structure supported by second-order EF model (inhibition, WM, flexibility); aligned with unity/diversity framework (Friedman & Miyake, 2017). | CFA: RMSEA = .012; CFI = .997; SRMR = .039 (second-order model). | NR | Intraindividual variability and EF-specific effects predicted regulatory behavior, aligning with task engagement (indirect evidence). | Relevant for research on executive-emotional control and regulation mechanisms. |
| Zhang et al. (2024) | OCST | CF | SH-R: r = .72–.95 (OCST tasks). | Digital WCST version with automated scoring aligned with expert consensus. | NR | NR | Task engagement influenced by age and digital literacy, suggesting construct-relevant response variation (indirect evidence). | Potentially suitable for remote EF screening and longitudinal cognitive monitoring. |
| Kruger et al. (2023) | EFSA | WM, IC, CF | IC-R: α/ω = .88/.89 (total), .90/.90 (WM), .79/.76 (IC), .62/.64 (CF). | Based on Diamond’s model; expert ratings showed κ = .55–.73 and CVI > .30 for semantic clarity. | EFA: RMSEA = .031; CFI = .990; TLI = .987 (3-factor model; 40.1% variance explained). | CV: ρ = .17–.71 with Dysexecutive Questionnaire (WM, IC, CF). | NR | Promising clinical utility as a complementary EF assessment in applied settings; grounded in a contemporary theoretical model. |
| Wang et al. (2023) | FISHERMAN | WM, IC, CF | IC-R: α = .83–.89; SH-R: r = .77–.88 (subgames). | Grounded in Miyake & Friedman’s (2012) model; subgames reflect core EF domains via classic analogues. | CFA: RMSEA = .073; CFI = .982; TLI = .955; GFI = .972; χ²/df = 1.57 (3-factor model). | CrV: r = .37–.75 with stop-signal, number switch, and Corsi block-tapping tasks (by subgame). | NR | Promising for older adult screening; age and sex effects support normative application relevance. |
| Arioli et al. (2022) | RCM | MEM, ATT, VF, SS | TRt-R: r = .45–.61 (memory, fluency, shifting); ns for WM. | Replicates key components of CVLT-II and TMT-B via digital speech interface. | NR | CrV: d = .26–1.28 with P&P tasks (CVLT-II, Trails B, Fluency, Digit Span). | NR | Enhances access to cognitive screening in older adults; supports remote research and inclusive norming. |
| Karlsen et al. (2022) | CANTAB | VIM, EF, VisATT | TRt-R: r = .39–.79; practice effects in 6/14 (g = .15–.40). | Established tool for memory and EF assessment. | NR | NR | Practice effects and age influenced response patterns across sessions (indirect evidence). | Enables probabilistic interpretation of clinically meaningful cognitive change. |
| Ott et al. (2022) | NIHTB-CB (iPad) | GF | NR | Covers NIHTB-CB domains; limitations observed in attention and processing speed. | NR | CV: r = .35–.80 with gold-standard tests (attention/EF, WM, language, motor). | NR | Aligns with NIH’s iPad-only shift; supports standardization and integration into research workflows. |
| Park & Schott (2022) | dTMT | EF, PS, FM | IR-R: ICC = .90–.95 (TMT-M, A, B). | Preserves TMT structure, adding interaction metrics (e.g., inter-touch, pause duration). | NR | CrV: r = .82–.90 with paper-based TMT. | Touch and pause metrics showed consistent variation, reflecting fatigue or task familiarity (indirect evidence). | Promising for early detection of cognitive decline. |
| Wahyuningrum et al. (2022) | INTB | MEM, ATT, EF | TRt-R: ICC = .60–.91 (subtests); low ICCs < .25 for 3 RAVLT indices. | Adapts 10 international tests to assess cognitive domains in Indonesian context. | PCA: 7 factors (62.8% var.; KMO = .83); CFA: RMSEA = .040; TLI = .947. | NR | Practice effects and response time variability indicated sensitivity to repetition and performance fluctuation (indirect evidence). | Adapted for Indonesian cultural context; supports clinical use with preliminary normative data. |
| Feenstra et al. (2018) | ACS | ATT, MEM, PS, EF | TRt-R: ICC = .45–.80 (subtests); .83 (total score). | Based on conventional neuropsychological tests (Rey, Corsi, TOL, WAIS). | NR | CV: r = .42–.70 with TMT, Digit Span, Corsi, TOL, Pegboard, RAVLT. | Performance influenced by users’ computer proficiency (indirect evidence). | Suitable for remote screening of general cognitive functions; normative data support preliminary ACS interpretation. |
| White et al. (2018) | CCTB | EF, SelATT | TRt-R: ICC = .34–.93 (Pro, Anti, Simon, Flanker, 2-back); low for Corsi. | NR | NR | NR | Fatigue and practice effects observed, particularly in early retest sessions (indirect evidence). | Applicable to aging research; practice effects may obscure cognitive decline detection. |
| Rijnen et al. (2018) | CNS-VS | VBM, VIM, PS, ATT, CF | TRt-R: ICC = .40–.89 (higher in speed & flexibility; lower in memory/attention). | NR | NR | NR | Practice effects reported in cognitive flexibility and reaction time tasks (indirect evidence). |
Provides RCI formulae for reliable change assessment; highlights need to adjust for practice effects. Soveri et al. (2018) BEFT IC, CF, WM TRt-R: r = .35–.93 (higher for WM speed; lower for inhibition and N-back accuracy). NR NR NR Practice effects observed for response times across retest sessions (indirect evidence). Highlights limitations in interpreting longitudinal/intervention effects due to variability and practice effects. Parsons & Barnett (2017) VAST IC NR Modeled after traditional Stroop paradigms to support construct representation. NR CrV: η² = .09–.66 with D-KEFS and ANAM (ANOVA). Stimulus modality (VR vs. P&P) influenced cognitive response patterns (indirect evidence). Ecologically grounded; useful for dissociating inhibition from distractor resistance in older adults. Ishigami et al. (2016) ANT-I ATT SH-R: r = .29 (alerting), .70 (orienting), .68 (executive). Attention networks assessed separately, aligned with Petersen & Posner’s model. NR CrV: β = –.17 (conflict resolution), β = –.18 (verbal memory) with D-KEFS, SDMT, BSR. NR Supports evaluation of attentional changes in aging and clinical populations. Kaller et al. (2016) TOL-F PL, PRB IC-R: α = .71–.74; SH-R: r = .71–.76; GLB = .73–.76; TRt-R: r = .72. Built on gold-standard TOL; items balanced for complexity, rule structure, and search depth. NR NR Planning/execution times and rule violations used as behavioral indicators (indirect evidence). Suitable for clinical and lifespan planning assessment. Köstering et al. (2015) TOL-F PL TRt-R: ICC = .69 (accuracy), .27–.52 (latency). 24 tasks balanced for goal hierarchy and search depth to reflect planning demands. NR NR Initial latency deemed a noisy planning metric; accuracy considered a more valid response outcome (indirect evidence). Appropriate for group research and clinical use via accuracy-based planning metrics. Heaton et al. (2014) NIHTB-CB MEM, PS, EF IC-R: α = .77–.84 (composite, fluid, crystallized); TRt-R: ICC = .86–.92. 
NR NR CrV: r = .78–.90 with GS composites (PPVT, CWI, WCST, PASAT, RAVLT); d = .20–.50 with functional/health markers; DV: r = .17–.39 with distinct constructs. High test–retest consistency interpreted as reliable response engagement (indirect evidence). Supports clinical and research screening; findings may guide diagnostic and triage strategies. Troyer et al. (2014) SAOCA MEM, ATT IC-R: α = .96 (Stroop); SH-R: r = .62 (Face-Name); TRt-R: r = .49–.83; AF-R: r = .48–.82. Draws from validated tests (spatial WM, Stroop, Face–Name, Letter–Number Alternation). PCA: λ = 1.61; loadings = .58–.75 (unidimensional structure) NR High completion rate (87%) and expected score distribution (94%) suggest task engagement (indirect evidence). Useful for early cognitive decline detection and screening. Aalbers et al. (2013) BAM WM, PL, EP, VIM AF-R: ICC = .42 (WM), .43 (VSM), .17 (EM), .65 (Planning). NR NR CV: ρ = .40–.67 with WAIS, WMS, BADS, CVMT; DV: ρ = –.03 to –.13 with NART-IQ. RT and error variation across task complexity supported construct-relevant engagement (indirect evidence). Supports long-term self-monitoring in both clinical and digital environments. 
Note: ACS: Amsterdam cognition scan; AFLT: auditory figural learning test; AF-R: alternate-form reliability; ANAM: automated neuropsychological assessment metrics; ANOVA: analysis of variance; ANT-I: attention network test – interaction; ATT: attention; BADS: Behavioral Assessment of the Dysexecutive Syndrome; BEFT: Battery of Executive Function Tasks (Simon, visuoverbal n-back, visuospatial n-back, letter-memory, and number-letter); BSR: Buschke selective reminding; CANTAB: Cambridge neuropsychological test automated battery; CAPTCHA: completely automated public Turing test to tell computers and humans apart; CCTB: computerized cognitive test battery (Pro, Anti, Simon, Flanker, 2-back, Corsi); CF: cognitive flexibility; CFA: confirmatory factor analysis; CFI: comparative fit index; CHEFS: Coffs Harbour executive functioning screen; COWAT: controlled oral word association test; CrV: criterion validity; CV: convergent validity; CVI: content validity index; CVLT: California verbal learning test; CVMT: continuous visual memory test; CWI: color-word interference test; DSST: digit symbol substitution test; DV: discriminant validity; EF: executive functions; EFA: exploratory factor analysis; EFSA: executive function scale for adults; EM: episodic memory; EP: executive planning; ER: emotional regulation; FAB-E: frontal assessment battery - Spanish version; FISHERMAN: fisherman task (serious game); FMC: fine motor control; GF: general cognitive functions; GFI: goodness of fit index; GLB: greatest lower bound; GS: gold standard; IC: inhibitory control; IC-R: internal consistency – reliability; ICC: intraclass correlation coefficient; INTB: Indonesian neuropsychological test battery; IQ: intelligence quotient; IR-R: inter-rater reliability; IT: information theory; D-KEFS: Delis-Kaplan executive function system; KMO: Kaiser-Meyer-Olkin measure of sampling adequacy; LANG: language; MEM: memory; MI: measurement invariance; MINDMORE: Mindmore digital cognitive assessment tool; 
MMSE: mini-mental state examination; MoCA: Montreal cognitive assessment; MRA: multivariate regression analysis; NART: national adult reading test; N-BR: n-back and backward recall tasks; NIH: national institutes of health; NIHTB-CB: NIH toolbox cognition battery; NR: not reported; OCST: online card sorting task; PASAT: paced auditory serial addition test; PCA: principal component analysis; P&P: paper and pencil; PLPS: planning and problem-solving; PPVT: Peabody picture vocabulary test; PS: processing speed; PV: predictive validity; PVT: psychomotor vigilance test; RAVLT: Rey auditory verbal learning test; RCI: reliable change indices; RCM: remote characterization module; RMSEA: root mean square error of approximation; ROCF: Rey-Osterrieth complex figure; RT: reaction time; SAOCA: self-administered online cognitive assessment; SART: sustained attention to response task; SDMT: symbol digit modalities test; SelATT: selective attention; SH-R: split-half reliability; SRMR: standardized root mean square residual; SS: set-shifting; SVF: semantic verbal fluency; TLI: Tucker-Lewis index; TMS: test of memory strategies; TMT: trail making test; TOL: tower of London; TRt-R: test-retest reliability; TSP: task switching paradigm; VAST: virtual apartment-based Stroop task; VBM: verbal memory; VF: verbal fluency; VisATT: visual attention; VIM: visual memory; VR: virtual reality; VS: visuospatial skills; VSM: visuospatial memory; WAIS: Wechsler adult intelligence scale; WCST: Wisconsin card sorting test; WM: working memory; and WMS: Wechsler memory scale.

Reliability evidence

Reliability was assessed in 23 of the 31 studies, though with considerable variation in method and reporting practices. The most common form was test–retest reliability (TRt-R), investigated in 16 studies, typically using intraclass correlation coefficients (ICCs) or Pearson’s r.
High coefficients were reported for instruments such as the dTMT-B (ICC = .95; Park & Schott, 2022), NIHTB-CB (ICC = .86–.92; Hatahet & Seghier, 2024; Heaton et al., 2014), the TOL Freiburg version (TOL-F; r = .69–.72 for accuracy; Kaller et al., 2016; Köstering et al., 2015), and the Indonesian Neuropsychological Test Battery (INTB; ICCs = .60–.91; Wahyuningrum et al., 2022). In contrast, lower stability was observed in latency-based outcomes from the TOL-F (ICCs = .27–.52) and in some Brain Aging Monitor-Cognitive Assessment Battery (BAM) subtests (ICC = .17–.42; Aalbers et al., 2013). Accuracy-based scores generally yielded more consistent estimates than latency-based ones. Additionally, one study (Park & Schott, 2022) examined inter-run reliability (IR-R), with ICCs above .90 across conditions.

Internal consistency was assessed in 12 studies, including 7 that reported Cronbach’s alpha or McDonald’s omega (IC-R) and 5 that reported split-half reliability (SH-R). The EFSA demonstrated strong internal consistency for the total, working memory, and inhibitory control scores, with α and ω ranging from .76 to .90, but lower values for cognitive flexibility (α/ω = .62/.64; Kruger et al., 2023). The Stroop task in the Self-Administered Online Cognitive Assessment (SAOCA) showed α = .96 (Troyer et al., 2014), while the Online Card Sorting Task (OCST; Zhang et al., 2024) reported SH-R between r = .72 and .95. The Face–Name task in the SAOCA also showed moderate split-half reliability (r = .62). However, few instruments triangulated different reliability indices, limiting opportunities for convergence across methods and interpretations.

In sum, while many instruments demonstrated at least moderate score stability, reporting practices were heterogeneous and often limited to a single reliability index. Triangulation across complementary methods, such as combining internal consistency and temporal stability, was rare.
Notably, alternate-form reliability (AF-R) and inter-run reliability (IR-R) were reported in only two and one study, respectively, suggesting underexplored areas in the validation of digital EF tools. The adoption of broader, multi-indicator reliability frameworks, especially in instruments designed for repeated or longitudinal use, remains essential to enhance interpretability, generalizability, and psychometric robustness.

Content validity evidence

Content validity was addressed in 26 of the 31 reviewed studies, making it one of the most frequently mentioned domains. However, formal methodological procedures were rarely applied. While many instruments were conceptually grounded in established theoretical models or classical paradigms, only a minority described explicit strategies to ensure comprehensive construct coverage. No study employed formal indices, such as the Content Validity Index (CVI), or conducted structured expert judgment evaluations.

Some instruments were clearly anchored in solid theoretical foundations. The TOL-F (Kaller et al., 2016; Köstering et al., 2015) was designed to reflect goal hierarchy and search depth, key elements in planning. The EFSA (Kruger et al., 2023) and the Test of Memory Strategies (TMS; Giorgini et al., 2024) drew on multidimensional EF models (e.g., Diamond, 2013) and incorporated expert consultation during item development. The BAM battery (Aalbers et al., 2013), although conceptually aligned with cognitive aging domains, did not include item-level validation procedures. Similarly, instruments such as the Frontal Assessment Battery - Spanish Version (FAB-E; Hurtado-Pomares et al., 2024), CANTAB (Karlsen et al., 2022), and NeurOn (Morrissey et al., 2024) were based on traditional batteries but provided limited details on how theoretical constructs were preserved during digital adaptation.
Several studies reported adaptations of classical tasks, such as the Digit Symbol Substitution Task (DSST; Perzl et al., 2024), OCST (Zhang et al., 2024), SAOCA (Troyer et al., 2014), and the Remote Characterization Module (RCM; Arioli et al., 2022), but did not clarify how construct representation was maintained. Others, including the Attention Network Test-Interaction (ANT-I; Ishigami et al., 2016), the Aggie Figures Learning Test (AFLT; Bruno et al., 2024), and the Virtual apartment-based Stroop test (VAST; Parsons & Barnett, 2017), limited their justification to structural similarity with legacy tasks.

In summary, although many instruments referenced well-established paradigms or models, most lacked systematic evaluation of whether their items fully represented the intended constructs. This gap limits confidence in the content validity of digital EF assessments and highlights a critical need for more rigorous design and documentation practices in future instrument development.

Structural validity evidence

Structural validity was assessed in 8 of the 31 reviewed studies, with considerable variation in methodological rigor and analytic approach. Confirmatory factor analysis (CFA) was employed in six studies, four as standalone analyses and two in combination with either exploratory factor analysis (EFA) or principal component analysis (PCA). EFA was used in two studies (one exclusively and one alongside CFA), while PCA appeared in two studies (one standalone and one combined with CFA). Although PCA supports data reduction, it does not assess latent construct structure and therefore provides limited evidence of structural validity (American Educational Research Association et al., 2014). Among the CFA-based studies, several demonstrated robust model fit. The EFSA (Kruger et al., 2023) confirmed a three-factor structure encompassing working memory, inhibitory control, and cognitive flexibility (e.g., RMSEA = .031; CFI = .990).
The TMS supported a similar three-factor solution involving executive and memory-related components and reported partial measurement invariance across countries (Giorgini et al., 2024). The NIHTB-CB showed acceptable fit for a three-factor solution across age groups, covering crystallized, fluid, and composite cognitive domains (Hatahet & Seghier, 2024). The INTB (Wahyuningrum et al., 2022) also reported CFA results that confirmed the seven-factor structure initially extracted via PCA (RMSEA = .040; TLI = .947), supporting its internal dimensional consistency. The PCA-exclusive study, SAOCA (Troyer et al., 2014), reported a unidimensional solution (λ = 1.61; loadings = .58–.75) but did not follow up with confirmatory modeling, limiting the interpretability of its structural assumptions. Several instruments provided theoretical justification for their multidimensional design but did not empirically test it. This includes the TOL-F (Kaller et al., 2016; Köstering et al., 2015), BAM (Aalbers et al., 2013), and FISHERMAN (Wang et al., 2023). Although grounded in theoretical frameworks, the lack of structural analysis restricts interpretability of their dimensional claims. Notably, most studies based on classical EF tasks, such as the Stroop, DSST, TMT, Go/No-Go, and N-back, did not report any internal structure assessment. This reflects a persistent gap in the digital EF literature, where dimensionality is frequently assumed rather than empirically verified. In summary, although a few instruments provided robust structural evidence through CFA, most relied on exploratory approaches or omitted structural validation entirely. Moreover, even among studies that conducted CFA, it was common for authors not to report whether factor loadings, variances, or covariances were statistically significant — a limitation that undermines the interpretability and replicability of the proposed models. A notable exception was the study by Byrne et al. 
(2024), which reported the significance of residual variances and contributed more comprehensive parameter estimates. Future research should prioritize confirmatory modeling, such as CFA or exploratory structural equation modeling (ESEM), to ensure consistency between theoretical frameworks and the empirical structure of digital EF assessments.

External validity evidence

External validity was assessed in 17 of the 31 studies, though the type and rigor of analyses varied considerably. Convergent (CV) and criterion-related validity (CrV) were each reported in 10 studies, with four also including discriminant validity (DV). The NIHTB-CB – iPad (Ott et al., 2022) and the Amsterdam Cognition Scan (ACS; Feenstra et al., 2018) showed moderate to strong convergent evidence with gold-standard tests (r = .35–.80 and r = .46–.70, respectively). The NIHTB-CB (Heaton et al., 2014) and EFSA (Kruger et al., 2023) demonstrated strong CrV, with correlations above r = .78 and ρ = .71, respectively. The FAB-E (Hurtado-Pomares et al., 2024) revealed both CV (r = .426 with MMSE) and DV (r = −.523 with TMT), while the BAM battery (Aalbers et al., 2013) showed modest CV (ρ = .40–.67) and nonsignificant correlations with unrelated constructs (ρ = –.03 to –.13), supporting DV. Other tools, such as TMS (Giorgini et al., 2024) and FISHERMAN (Wang et al., 2023), reported moderate CV and CrV, respectively, but often lacked precision or breadth in their comparative frameworks. Predictive validity (PV) was addressed only in the Axon-TMT (Depauw et al., 2024), which predicted CAPTCHA performance with moderate to large effects (η² = .09–.66). A recurring limitation was the substitution of demographic correlations (e.g., age, education) for external validation, as seen in SAOCA (Troyer et al., 2014) and INTB (Wahyuningrum et al., 2022), which did not benchmark scores against validated instruments.
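The validity coefficients summarized above are point estimates. As a minimal sketch of how an interval could accompany such a coefficient, the following uses the Fisher z-transformation to build an approximate 95% confidence interval for a Pearson correlation. The function name `pearson_ci` and the sample size `n = 100` are hypothetical and chosen only for illustration; the reviewed studies' actual sample sizes are not used here. The value r = .43 echoes the FAB-E and MMSE correlation listed in the table.

```python
import math


def pearson_ci(r, n, z_crit=1.959964):
    """Approximate CI for a Pearson r via the Fisher z-transformation.

    r: observed correlation; n: sample size (must exceed 3);
    z_crit: standard normal quantile (default gives a 95% interval).
    """
    z = math.atanh(r)                # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)      # approximate standard error of z
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper


# Hypothetical sample size; only the coefficient comes from the table.
low, high = pearson_ci(0.43, 100)
print(f"r = .43, 95% CI [{low:.2f}, {high:.2f}]")
```

The interval narrows as n grows, which is why the same coefficient can carry quite different evidential weight across studies of different size.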
In sum, while a subset of studies presented robust and multidimensional evidence of external validity, especially the NIHTB-CB (iPad), ACS, NIHTB-CB, and EFSA, many others relied on limited or indirect indicators. Future validation studies should prioritize conceptually anchored hypotheses, systematic comparisons with gold-standard instruments, and inclusion of predictive criteria to strengthen the interpretability and clinical utility of digital EF assessments.

Response process validity

Response process validity was inconsistently addressed across the reviewed studies. Although 22 instruments reported some evidence relevant to user interaction, task engagement, or behavioral performance, most relied on indirect or observational indicators rather than systematic analysis. Several studies assessed behavioral engagement using metrics such as completion rates, error patterns, and reaction time (RT). SAOCA (Troyer et al., 2014), for example, reported a 94% valid response rate in unsupervised settings, while BAM (Aalbers et al., 2013) showed increasing RT and error rates as task complexity rose, suggesting sensitivity to executive load. TOL-F studies (Kaller et al., 2016; Köstering et al., 2015) examined planning versus execution times and rule violations to infer cognitive strategy use. Other tools used RT dynamics more analytically. Axon-TMT (Depauw et al., 2024) tracked error trajectories across trials, and dTMT (Park & Schott, 2022) recorded latency features like touch intervals and pauses as markers of fatigue. The INTB (Wahyuningrum et al., 2022) demonstrated expected performance shifts by task difficulty and age. NeurOn and NeuroUX detected performance variation across testing environments and devices, while the study on the Task-Switching Paradigm (TSP; Toh & Yang, 2024) explored intraindividual variability in relation to emotion regulation.
In contrast, 9 studies offered no relevant process evidence, and several others (e.g., OCST, ANT-I) may have collected but did not report it. Across all studies, no use of advanced psychometric models such as Item Response Theory (IRT) or time-dependent modeling was observed.

In sum, although some instruments incorporated meaningful behavioral data, process validity remains underexplored. Future studies should adopt more rigorous reporting and exploit digital trace data to better understand user-task interactions in EF assessments.

Consequential validity evidence

All 31 studies included in this review reported some form of consequential validity evidence, reflecting a growing interest in the applicability of digital EF assessments. However, the nature and methodological rigor of such evidence varied considerably. Based on the level of elaboration, we classified these findings into three qualitative categories: narrative claims, supported use cases, and substantiated consequences.

Narrative claims were observed in a subset of studies that merely asserted potential applications, typically in aging or clinical contexts, without offering performance-based evidence or real-world implementation data. These studies generally highlighted theoretical relevance or ecological validity but lacked empirical follow-up. Examples include Ishigami et al. (2016), who referenced attentional monitoring in aging, and Parsons & Barnett (2017), who emphasized ecological plausibility without validation in applied settings.

A second group of studies presented supported use cases, in which the test's applicability was linked to population characteristics, normative datasets, or test design features, though without direct evidence of practical impact. For instance, Wang et al. (2023) discussed age- and sex-based performance patterns to support screening relevance; Wahyuningrum et al. (2022) emphasized adaptation for the Indonesian cultural context; and Ott et al.
(2022) framed the NIHTB-CB as aligned with NIH’s iPad-based workflows. Although these claims were contextually grounded, they remained largely inferential, relying on assumed benefits rather than demonstrated outcomes in applied settings.

Finally, several instruments offered substantiated consequences, with evidence of real-world use, screening utility, or performance-based applicability. The SAOCA (Troyer et al., 2014) was recommended for early detection of cognitive decline; the EFSA (Kruger et al., 2023) and TMS (Giorgini et al., 2024) were positioned as complementary to clinical evaluation; and the BAM battery (Aalbers et al., 2013) supported remote self-monitoring in aging. These studies linked test outcomes to practical decision-making scenarios or demonstrated diagnostic relevance through applied metrics.

In sum, although all studies contributed some form of consequential insight, most lacked formal evaluation of outcomes, impact metrics, or potential adverse effects such as misclassification or digital exclusion. Future research should advance from post hoc assertions to prospective validation of implementation impact, ideally incorporating diagnostic utility studies, stakeholder feedback, and equity considerations to support the responsible use of digital EF assessments in real-world contexts.

Discussion

This review aimed to map and synthesize the psychometric evidence available over the past decade for digital and online tools designed to assess EF in healthy adults. Findings were organized according to five key dimensions of modern psychometrics: content validity, structural validity, external validity (relations to other variables), response processes, and consequential validity. The 31 included studies were conducted across diverse global regions, with the majority originating from high-income countries in North America and Europe (n = 22).
Additional contributions came from Asia (n = 5), Latin America (n = 2), and Oceania (n = 2), representing a total of 20 countries.

Although reliability was among the most frequently reported psychometric properties in the reviewed studies (n = 23), its operationalization often fell short of the standards set forth in contemporary psychometric literature. As emphasized by Revelle and Condon (2019) and reflected in the AERA Standards (American Educational Research Association et al., 2014), reliability should not be viewed as a fixed property of the test itself, but rather as a feature of observed scores that is specific to context, population, and purpose, requiring the use of multiple indicators and replications over time. Nevertheless, most studies reported only a single index, typically test–retest correlations or Cronbach’s alpha, without justifying its conceptual alignment with the instrument’s design or intended application. Only a few studies (e.g., Kruger et al., 2023; Zhang et al., 2024) triangulated across multiple indices, such as internal consistency, test–retest stability, and split-half reliability, as recommended for tools intended for longitudinal or repeated use.

Cronbach’s alpha was often used without verifying its underlying assumptions. Specifically, alpha requires tau-equivalence (that is, equal factor loadings across all items). When this assumption is violated, alpha can misrepresent reliability by either inflating or underestimating the true consistency of scores. Despite this limitation, alternative estimators such as McDonald’s omega (ω), which allows for unequal loadings and provides a more accurate estimate in most conditions, were rarely reported. Ideally, both α and ω should be presented, particularly in newly developed or adapted instruments, to ensure transparency and capture distinct aspects of internal structure (Clark & Watson, 2019; Revelle & Condon, 2019).
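To make the tau-equivalence point concrete, the sketch below compares the alpha and omega implied by a one-factor model under equal and unequal loadings. It is an illustration under stated assumptions, not an analysis from any reviewed study: the loadings are hypothetical standardized values, residuals are assumed uncorrelated, and the function names `omega_total` and `alpha_implied` are invented for this example. Under these assumptions only underestimation by alpha can appear; inflation typically requires correlated residuals, which this sketch does not model.

```python
def omega_total(loadings, residual_variances):
    """McDonald's omega for a one-factor model with uncorrelated residuals."""
    common = sum(loadings) ** 2                      # variance due to the factor
    return common / (common + sum(residual_variances))


def alpha_implied(loadings, residual_variances):
    """Population Cronbach's alpha implied by the same one-factor model."""
    k = len(loadings)
    # Each item's variance is loading**2 plus its residual variance.
    item_variance_sum = sum(l * l + u for l, u in zip(loadings, residual_variances))
    total_variance = sum(loadings) ** 2 + sum(residual_variances)
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical standardized loadings (residual variance = 1 - loading**2).
scenarios = {
    "equal loadings (tau-equivalent)": [0.7, 0.7, 0.7, 0.7],
    "unequal loadings (congeneric)": [0.9, 0.7, 0.5, 0.3],
}

for label, lam in scenarios.items():
    resid = [1 - l * l for l in lam]
    a = alpha_implied(lam, resid)
    w = omega_total(lam, resid)
    print(f"{label}: alpha = {a:.3f}, omega = {w:.3f}")
```

With equal loadings the two coefficients coincide; with unequal loadings alpha falls below omega, which is one reason reporting both can be informative.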
Latency-based outcomes, common in digital EF tasks, were also underexamined in terms of score stability, even though they are particularly vulnerable to contextual and device-related variability. Moreover, less common but informative indices such as alternate-form reliability (AF-R) and inter-run reliability (IR-R) were observed in only a few studies, highlighting additional underexplored avenues in the validation of digital assessments. These limitations reflect a broader disconnect between the psychometric complexity of digital EF instruments and the simplicity of their reliability reporting. Addressing this gap requires a shift toward a reliability reasoning framework, one that integrates statistical indicators with theoretical justification and contextual awareness, as advocated by Revelle and Condon (2019) and endorsed by the AERA Standards.

Although foundational in contemporary validity theory, content validity was the most neglected domain across the reviewed studies, revealing a sharp departure from the principles outlined in the AERA Standards (American Educational Research Association et al., 2014) and emphasized by Clark and Watson (2019). According to these frameworks, content validation must extend beyond conceptual alignment; it requires systematic evidence that items or tasks comprehensively represent the intended construct and are suitable for the target population and testing purpose. Yet, none of the studies in this review employed formal content validation procedures, such as expert panel ratings, quantitative indices (e.g., Content Validity Index), or structured mapping techniques like test blueprints or construct specification equations (XXX). While several instruments were based on well-established paradigms (e.g., TOL, Stroop, WCST) or theoretical models (e.g., Diamond, 2013), referencing such sources is not sufficient to establish content representativeness, particularly in digital contexts where presentation format, timing, and interaction modalities may substantially alter the construct being assessed. Clark and Watson (2019) warn against assuming content validity by association and call for empirical evidence of representativeness and relevance, especially in instruments used for high-stakes decisions or cross-cultural comparisons. The absence of such procedures across the reviewed studies raises concerns about potential construct underrepresentation or contamination, which compromises score interpretability and the defensibility of decisions based on these measures.

Structural validity was addressed in only 8 of the 31 studies included in this review. Even when reported, analyses were often limited to exploratory techniques such as principal component analysis (PCA), which are not designed to test latent structure or account for measurement error. According to Standard 1.13 of the AERA Standards (American Educational Research Association et al., 2014), when a construct is presumed to have a multidimensional structure, as is typically the case in executive function (EF) assessment, confirmatory factor analysis (CFA) or equivalent model-based approaches are expected to evaluate the correspondence between theoretical dimensions and empirical data. This position is echoed by Sellbom and Tellegen (2019), who argue that a psychometric instrument cannot validly claim to measure a construct until its internal structure has been empirically verified using appropriate statistical techniques. Although a few instruments, such as the EFSA (Kruger et al., 2023), TMS (Giorgini et al., 2024), and INTB (Wahyuningrum et al., 2022), employed CFA and reported acceptable to excellent fit indices, the majority of studies either relied on PCA without model testing or omitted structural analysis altogether.
This gap is particularly concerning given that many digital and online EF tools are designed around multiple theoretically distinct subcomponents (e.g., working memory, cognitive flexibility, inhibition), yet fail to demonstrate the empirical distinctiveness or interrelatedness of these domains. Furthermore, none of the studies tested alternative or competing models, such as bifactor or hierarchical CFA frameworks, which, as emphasized by Sellbom and Tellegen (2019), are critical for assessing the integrity of multidimensional constructs and reducing the risk of overfitting or misinterpretation. The absence of rigorous structural validation is especially problematic in instruments intended for use across diverse populations or in clinical decision-making contexts. Without empirical confirmation of internal structure, the interpretability of test scores remains limited, and the potential for misleading inferences increases. Future research should move beyond exploratory approaches and incorporate model comparison, fit evaluation, and measurement invariance testing, as recommended in contemporary psychometric literature, to ensure that digital EF assessments accurately capture the constructs they aim to measure.

External validity was addressed in 17 of the reviewed studies, though often with limited scope and theoretical grounding. As outlined in Standard 1.16 of the AERA Standards (American Educational Research Association et al., 2014), this domain includes convergent (CV), discriminant (DV), and criterion-related validity (CrV), which support score interpretations through associations with external constructs, behaviors, or outcomes. While CV and CrV appeared with similar frequency, many studies lacked clear hypotheses or justification for the chosen benchmarks. Some instruments, such as the NIHTB-CB, EFSA, and ACS, reported moderate-to-strong correlations with established cognitive tests.
However, discriminant validity was rarely assessed, and predictive evidence was limited to a single study (Axon-TMT). Moreover, several studies substituted demographic correlations (e.g., age, education) for true external validation, despite the risk of reflecting construct-irrelevant variance. Another common limitation was the omission of confidence intervals or error estimates for correlation coefficients, which leaves the precision of the reported associations unknown and weakens claims of generalizability. As emphasized by Clark and Watson (2019), external validity evidence should be statistically sound, theory-driven, and appropriate to the intended use, especially in clinical and applied settings.

Evidence regarding response process validity was inconsistently addressed, echoing a broader gap noted in the psychometric literature. According to Standards 1.10 and 1.12 of the AERA Standards, response process evidence should demonstrate alignment between the cognitive operations elicited by the task and the theoretical constructs being measured (American Educational Research Association et al., 2014). Yet most studies offered only superficial analyses, limited to descriptive metrics such as reaction time (RT) or error rates. A few instruments (e.g., SAOCA, TOL-F, Axon-TMT) incorporated indicators like completion rates, latency distributions, or planning time to examine engagement and strategy use. However, even these analyses remained fragmented and lacked formal modeling of intraindividual variability or task–trait interactions. None employed process-tracing techniques (e.g., mouse tracking, eye-tracking, think-aloud protocols) or statistical approaches like Item Response Theory (IRT) and time-dependent modeling, despite their suitability for digital testing. As Thomas (2019) emphasizes, the omission of such modeling is particularly problematic given the rich behavioral data digital platforms can capture and the growing availability of analytic tools.
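
To make concrete what a model-based approach of this kind involves, the sketch below implements the two-parameter logistic (2PL) IRT model, in which the probability of a correct response depends on a latent ability (theta), an item discrimination (a), and an item difficulty (b). It is an illustration only: none of the reviewed studies reported such a model, and the function names and parameter values are hypothetical.

```python
import math


def p_correct_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct_2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)


# Hypothetical item: discrimination 1.5, difficulty 0.0.
for theta in (-2.0, 0.0, 2.0):
    p = p_correct_2pl(theta, a=1.5, b=0.0)
    info = item_information_2pl(theta, a=1.5, b=0.0)
    print(f"theta = {theta:+.1f}  P(correct) = {p:.3f}  information = {info:.3f}")
```

Under this model an item is most informative for respondents whose ability lies near its difficulty, which is one reason item-level modeling can show where along the ability range a digital task measures precisely.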
Without structured process analysis, the interpretability of performance in digital EF tasks, especially in unsupervised or adaptive contexts, remains compromised. Response process validity is not peripheral to test interpretation; it is foundational when cognitive assessments rely on timing, interaction, or dynamic response patterns. Its underutilization across the reviewed studies highlights a critical methodological gap that future research must address.

Consequential validity emerged as the least empirically developed domain across the reviewed studies, despite its central role in contemporary validity theory and its explicit emphasis in Standard 1.25 of the AERA Standards (American Educational Research Association et al., 2014). This standard asserts that test developers and users share responsibility not only for the intended uses of scores but also for anticipating and mitigating unintended consequences, such as misinterpretation, inequitable access, or harm resulting from invalid inferences. Although all 31 studies included some consequential claims, only a minority went beyond narrative assertions to examine practical implications. No study systematically evaluated the real-world impact of test use or its influence on decisions and service delivery. Some instruments, such as SAOCA (Troyer et al., 2014), CHEFS (Hassmén et al., 2024), and NeuroUX (Paolillo et al., 2024), proposed utility for cognitive screening or large-scale monitoring in aging populations, yet lacked empirical data on diagnostic accuracy, clinical effectiveness, or long-term outcomes. Other tools, including the EFSA (Kruger et al., 2023) and TMS (Giorgini et al., 2024), emphasized clinical relevance based on conceptual fit but offered no follow-up regarding consequences in applied settings.
As Clark and Watson (2019) caution, consequential claims must be substantiated by evidence that test use improves decision quality, access to care, or intervention outcomes, rather than remaining aspirational. Critically, none of the studies addressed potential adverse effects, such as overreliance on automated scores, algorithmic opacity, or inequities stemming from differences in literacy, socioeconomic status, or digital access. These risks are especially relevant for remote or self-administered platforms. The absence of implementation metrics, equity-focused analyses, and stakeholder consultation raises concerns about the ethical robustness and fairness of digital and online EF tools. As highlighted in the AERA Standards, consequences are not peripheral; they are foundational to the interpretive argument that underlies responsible test use (American Educational Research Association et al., 2014).

In sum, while the field of digital EF assessment has made strides in proposing tools with practical utility, the lack of empirical investigation into their downstream impact represents a critical limitation. Future validation efforts must go beyond conceptual alignment and include diagnostic utility studies, real-world implementation data, and equity-oriented evaluations. Only then can digital and online EF instruments fulfill their promise of supporting not just accurate measurement, but also ethical, effective, and inclusive decision-making.

Conclusion

This review revealed a field in methodological transition: digital and online assessments of executive functions (EF) have expanded in scope, theoretical grounding, and technological sophistication over the past decade.
However, the psychometric landscape remains uneven: while reliability and external validity were addressed in a growing number of studies, domains such as content validity, structural modeling, response process analysis, and consequential evaluation continue to be underexplored or methodologically limited. Notably, no single study presented a validation framework encompassing all core domains of contemporary validity theory, as articulated by the Standards (American Educational Research Association et al., 2014).

A major strength of this review lies in its comprehensive mapping, across a diverse set of 31 studies, of six psychometric domains: reliability, content validity, structural validity, external validity (relations with other variables), response processes, and consequential validity. By adopting a multidimensional lens and applying current theoretical standards, this review offers a critical overview of the state of digital and online EF assessment and highlights areas of both progress and fragility. It also provides researchers and clinicians with a synthesized reference for identifying tools that are better supported by psychometric evidence.

Nevertheless, despite its methodological rigor, this systematic review is not without limitations. First, the analysis focused exclusively on studies reporting psychometric evidence, thereby excluding those centered on usability, implementation science, or neural validation. In addition, heterogeneity in study design and reporting limited the ability to perform meta-analytic comparisons or extract effect sizes uniformly. Furthermore, due to the inclusion criteria, several relevant instruments that lacked psychometric investigations within the specified timeframe or target population (i.e., healthy adults) were excluded, potentially overlooking promising tools in early development or applied to clinical populations.
Across the included studies, we also observed several recurring limitations that constrain the robustness of available evidence. These include frequent reliance on convenience samples with limited representativeness; underreporting of structural validation methods, such as confirmatory factor analyses; inadequate application of formal procedures for content validity assessment; superficial analyses of response processes; limited external validity evidence, often based on demographic variables rather than benchmark measures; and insufficient evaluation of consequential validity, particularly regarding real-world implementation and ethical implications. These limitations highlight critical areas for psychometric improvement in future research.

Future validation efforts must move beyond single-index reporting and adopt more integrative strategies, combining confirmatory factor analysis, item response theory models, response behavior analysis, and real-world impact studies. Equity considerations, stakeholder engagement, and context-specific implementation research should also be embedded into validation protocols. Bridging the gap between psychometric rigor and digital innovation will be essential to ensure that EF assessments are not only theoretically sound, but also ethically responsible, contextually meaningful, and practically useful across diverse populations and settings.

References

Aalbers T, Baars MAE, Rikkert MGMO, Kessels RPC (2013) Puzzling with online games (BAM-COG): Reliability, validity, and feasibility of an online self-monitor for cognitive performance in aging adults. J Med Internet Res 15(12):183–193. https://doi.org/10.2196/jmir.2860
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014) The Standards for Educational and Psychological Testing.
https://www.apa.org/science/programs/testing/standards
Arioli M, Rini J, Anguera-Singla R, Gazzaley A, Wais PE (2022) Validation of At-Home Application of a Digital Cognitive Screener for Older Adults. Frontiers in Aging Neuroscience, 14. https://doi.org/10.3389/fnagi.2022.907496
Bergman I, Franke Föyen L, Gustavsson A, Van den Hurk W (2025) Test–retest reliability, practice effects and estimates of change: A study on the Mindmore digital cognitive assessment tool. Scand J Psychol 66(1):1–14. https://doi.org/10.1111/sjop.13054
Bruno D, Sánchez Rueda D, Lopez E, Pinasco C, Torralva T, Alfredo T, Sierra Sanjurjo N, Roca M (2024) Validity and norms for young adults for the Aggie Figures Learning Test. Appl Neuropsychology: Adult 0(0):1–7. https://doi.org/10.1080/23279095.2024.2354856
Burgess PW, Stuss DT (2017) Fifty Years of Prefrontal Cortex Research: Impact on Assessment. J Int Neuropsychol Soc 23(9–10):755–767. https://doi.org/10.1017/S1355617717000704
Byrne EM, Gilbert RA, Kievit RA, Holmes J (2024) Evidence for separate backward recall and n-back working memory factors: A large-scale latent variable analysis. Memory. https://www.tandfonline.com/doi/abs/10.1080/09658211.2024.2393388
Clark LA, Watson D (2019) Constructing validity: New developments in creating objective measuring instruments. Psychol Assess 31(12):1412–1427. https://doi.org/10.1037/pas0000626
Depauw T, Boasen J, Léger PM, Sénécal S (2024) Assessing the Relationship Between Digital Trail Making Test Performance and IT Task Performance: Empirical Study. JMIR Hum Factors 11(1):e49992. https://doi.org/10.2196/49992
Diamond A (2013) Executive Functions. Ann Rev Psychol 64:135–168. https://doi.org/10.1146/annurev-psych-113011-143750
Dias NM, Malloy-Diniz LF (eds) (2023) Tratado de funções executivas: Modelos teóricos, construtos associados e desenvolvimento (1ª ed.). Editora Ampla
Dias NM, Malloy-Diniz LF (2024) Tratado de funções executivas: Avaliação e Intervenção (1ª ed.). Editora Ampla
Feenstra HEM, Vermeulen IE, Murre JMJ, Schagen SB (2018) Online self-administered cognitive testing using the Amsterdam Cognition Scan: Establishing psychometric properties and normative data. J Med Internet Res 20(5). https://doi.org/10.2196/jmir.9298
Ferguson H, Brunsdon V, Bradford E (2021) The developmental trajectories of executive function from adolescence to old age. Scientific Reports, 11. https://doi.org/10.1038/s41598-020-80866-1
Field A (2020) Descobrindo a Estatística Usando o SPSS, 5th edn. Penso Editora
Friedman NP, Miyake A (2017) Unity and diversity of executive functions: Individual differences as a window on cognitive structure. Cortex 86:186–204. https://doi.org/10.1016/j.cortex.2016.04.023
Giorgini R, Maestu F, Sara FM, Pastore M, Abellan M, Quattrone A, Caparello S, Quattrone A, Vaccaro MG (2024) Measurement invariance across countries of the Test of Memory Strategies (TMS): A contribution to the cross-national validity study. Acta Psychol 246:104291. https://doi.org/10.1016/j.actpsy.2024.104291
Hassmén P, Hindman E, Keiller T, Blair D (2024) Piloting the Coffs Harbour Executive Functioning Screen (CHEFS): An off-road tool to predict fitness to drive. Appl Neuropsychology: Adult. https://www.tandfonline.com/doi/abs/10.1080/23279095.2024.2418031
Hatahet O, Seghier ML (2024) The validity of studying healthy aging with cognitive tests measuring different constructs. Sci Rep 14(1):23880. https://doi.org/10.1038/s41598-024-74488-0
Heaton RK, Akshoomoff N, Tulsky D, Mungas D, Weintraub S, Dikmen S, Beaumont J, Casaletto KB, Conway K, Slotkin J, Gershon R (2014) Reliability and validity of composite scores from the NIH Toolbox Cognition Battery in adults. J Int Neuropsychol Soc 20(6):588–598. https://doi.org/10.1017/S1355617714000241
Hurtado-Pomares M, Juárez-Leal I, Company-Devesa V, Sánchez-Pérez A, Peral-Gómez P, Espinosa-Sempere C, Valera-Gran D, Navarrete-Muñoz E-M (2024) Psychometric properties of the Spanish version of the Frontal Assessment Battery (FAB-E) and normative values in a representative adult population sample. Neurologia 39(8):694–700. https://doi.org/10.1016/j.nrleng.2022.09.004
Ishigami Y, Eskes GA, Tyndall AV, Longman RS, Drogos LL, Poulin MJ (2016) The Attention Network Test-Interaction (ANT-I): Reliability and validity in healthy older adults. Exp Brain Res 234(3):815–827. https://doi.org/10.1007/s00221-015-4493-4
Iverson GL, Brooks BL, Ashton VL, Johnson LG, Gualtieri CT (2009) Does familiarity with computers affect computerized neuropsychological test performance? J Clin Exp Neuropsychol. https://doi.org/10.1080/13803390802372125
Kaller CP, Debelak R, Köstering L, Egle J, Rahm B, Wild PS, Blettner M, Beutel ME, Unterrainer JM (2016) Assessing planning ability across the adult life Span: Population-representative and age-adjusted reliability estimates for the Tower of London (TOL-F). Arch Clin Neuropsychol 31(2):148–164
Karlsen RH, Karr JE, Saksvik SB, Lundervold AJ, Hjemdal O, Olsen A, Iverson GL, Skandsen T (2022) Examining 3-month test-retest reliability and reliable change using the Cambridge Neuropsychological Test Automated Battery. Appl Neuropsychology: Adult 29(2):146–154. https://doi.org/10.1080/23279095.2020.1722126
Kessels RPC, Hendriks MPH (2023) Neuropsychological assessment. In H. S. Friedman & C. H. Markey (Eds.), Encyclopedia of Mental Health (Third Edition) (pp. 622–628). Academic Press. https://doi.org/10.1016/B978-0-323-91497-0.00017-5
Köstering L, Nitschke K, Schumacher FK, Weiller C, Kaller CP (2015) Test-retest reliability of the Tower of London Planning Task (TOL-F). Psychol Assess 27(3):925–931. https://doi.org/10.1037/pas0000097
Kruger OE, Zibetti MR, Schlindwein R, Lopes FM (2023) Preliminary validity evidence for the Executive Function Scale for Adults (EFSA). Psychology & Neuroscience. https://doi.org/10.1037/pne0000321
Morrissey S, Gillings R, Hornberger M (2024) Feasibility and reliability of online vs in-person cognitive testing in healthy older people. PLoS ONE 19(8):e0309006. https://doi.org/10.1371/journal.pone.0309006
Naglieri JA, Drasgow F, Schmit M, Handler L, Prifitera A, Margolis A, Velasquez R (2004) Psychological Testing on the Internet: New Problems, Old Issues. Am Psychol 59(3):150–162. https://doi.org/10.1037/0003-066X.59.3.150
Ott LR, Schantell M, Willett MP, Johnson HJ, Eastman JA, Okelberry HJ, Wilson TW, Taylor BK, May PE (2022) Construct Validity of the NIH Toolbox Cognitive Domains: A Comparison With Conventional Neuropsychological Assessments. Neuropsychology 36(5):468–481. https://doi.org/10.1037/neu0000813
Ouzzani M, Hammady H, Fedorowicz Z, Elmagarmid A (2016) Rayyan: A web and mobile app for systematic reviews. Syst Reviews 5(1). https://doi.org/10.1186/s13643-016-0384-4
Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, Shamseer L, Tetzlaff JM, Akl EA, Brennan SE, Chou R, Glanville J, Grimshaw JM, Hróbjartsson A, Lalu MM, Li T, Loder EW, Mayo-Wilson E, McDonald S, Moher D (2021) The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ (Clinical Res Ed) 372:n71. https://doi.org/10.1136/bmj.n71
Paolillo EW, Bomyea J, Depp CA, Henneghan AM, Raj A, Moore RC (2024) Characterizing Performance on a Suite of English-Language NeuroUX Mobile Cognitive Tests in a US Adult Sample: Ecological Momentary Cognitive Testing Study. J Med Internet Res 26(1):e51978. https://doi.org/10.2196/51978
Park SY, Schott N (2022) The trail-making-test: Comparison between paper-and-pencil and computerized versions in young and healthy older adults. Appl Neuropsychol Adult 29(5):1208–1220. https://doi.org/10.1080/23279095.2020.1864374
Parsey CM, Schmitter-Edgecombe M (2013) Applications of technology in neuropsychological assessment. Clin Neuropsychol 27(8):1328–1361. https://doi.org/10.1080/13854046.2013.834971
Parsons T, Barnett M (2017) Virtual apartment stroop task: Comparison with computerized and traditional stroop tasks. J Neurosci Methods 309:35–40. https://doi.org/10.1016/j.jneumeth.2018.08.022
Perzl J, Riedl EM, Thomas J (2024) Measuring Situational Cognitive Performance in the Wild: A Psychometric Evaluation of Three Brief Smartphone-Based Test Procedures. Assessment 31(6):1270–1291. https://doi.org/10.1177/10731911231213845
Revelle W, Condon DM (2019) Reliability from α to ω: A tutorial. Psychol Assess 31(12):1395–1411. https://doi.org/10.1037/pas0000754
Rijnen SJM, van der Linden SD, Emons WHM, Sitskoorn MM, Gehring K (2018) Test-retest reliability and practice effects of a computerized neuropsychological battery: A solution-oriented approach. Psychol Assess 30(12):1652–1662. https://doi.org/10.1037/pas0000618
Sellbom M, Tellegen A (2019) Factor analysis in psychological assessment research: Common pitfalls and recommendations. Psychol Assess 31(12):1428–1441. https://doi.org/10.1037/pas0000623
Soto EF, Kofler MJ, Singh LJ, Wells EL, Irwin LN, Groves NB, Miller CE (2020) Executive functioning rating scales: Ecologically valid or construct invalid? Neuropsychology 34(6):605–619. https://doi.org/10.1037/neu0000681
Soveri A, Lehtonen M, Karlsson LC, Lukasik K, Antfolk J, Laine M (2018) Test-retest reliability of five frequently used executive tasks in healthy adults. Appl Neuropsychol Adult 25(2):155–165. https://doi.org/10.1080/23279095.2016.1263795
Tabachnick BG, Fidell LS (2019) Using Multivariate Statistics. Pearson
Thomas ML (2019) Advances in applications of item response theory to clinical assessment. Psychol Assess 31(12):1442–1455. https://doi.org/10.1037/pas0000597
Toh WX, Yang H (2024) To switch or not to switch? Individual differences in executive function and emotion regulation flexibility. Emotion 24(1):52–66. https://doi.org/10.1037/emo0001250
Troyer AK, Rowe G, Murphy KJ, Levine B, Leach L, Hasher L (2014) Development and evaluation of a self-administered on-line test of memory and attention for middle-aged and older adults. Front Aging Neurosci 6:335. https://doi.org/10.3389/fnagi.2014.00335
Wahyuningrum SE, Sulastri A, Hendriks MPH, van Luijtelaar G (2022) The Indonesian Neuropsychological Test Battery (INTB): Psychometric properties, preliminary normative scores, the underlying cognitive constructs, and the effects of age and education. Acta Neuropsychologica 20(4):445–470. https://doi.org/10.5604/01.3001.0016.1339
Wang P, Fang Y, Qi JY, Li HJ (2023) FISHERMAN: A Serious Game for Executive Function Assessment of Older Adults. Assessment 30(5):1499–1513. https://doi.org/10.1177/10731911221105648
White N, Flannery L, McClintock A, Machado L (2018) Repeated computerized cognitive testing: Performance shifts and test–retest reliability in healthy older adults. J Clin Exp Neuropsychol 41(2):179–191. https://doi.org/10.1080/13803395.2018.1526888
Whiting PF, Rutjes AWS, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, Leeflang MMG, Sterne JAC, Bossuyt PMM, QUADAS-2 Group (2011) QUADAS-2: A revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 155(8):529–536. https://doi.org/10.7326/0003-4819-155-8-201110180-00009
Zelazo PD (2015) Executive function: Reflection, iterative reprocessing, complexity, and the developing brain. Dev Rev 38:55–68. https://doi.org/10.1016/j.dr.2015.07.001
Zelazo PD, Carlson SM (2023) Reconciling the Context-Dependency and Domain-Generality of Executive Function Skills from a Developmental Systems Perspective. J Cognition Dev 24(2):205–222.
https://doi.org/10.1080/15248372.2022.2156515
Zhang Z, Yang LZ, Vékony T, Wang C, Li H (2024) Split-half reliability estimates of an online card sorting task in a community sample of young and elderly adults. Behav Res Methods 56(2):1039–1051. https://doi.org/10.3758/s13428-023-02104-6
Zucchella C, Federico A, Martini A, Tinazzi M, Bartolo M, Tamburin S (2018) Neuropsychological testing. Pract Neurol 18(3). https://doi.org/10.1136/practneurol-2017-001743

Tables 1 and 2

Tables 1 and 2 are available in the Supplementary Files section.

Additional Declarations

The authors declare no competing interests.

Supplementary Files

AppendixA.docx
Suppl.MaterialFullsearchstrategy.pdf: Supplementary material - Full search strategy
Suppl.MaterialQUADASIIQuestionsforeachstudy.pdf: Supplementary material - QUADAS II questions for each study
Table12.docx
Importantly, they can support both objective and subjective assessment modes, offering advantages such as automated scoring, improved precision (e.g., in capturing response latencies), and remote administration (Aalbers et al., \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2013\u003c/span\u003e; Arioli et al., \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Feenstra et al., \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2018\u003c/span\u003e; White et al., \u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e2018\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eThe adoption of digital EF assessments has grown considerably in recent years, driven by technological advances and increasing demand for flexible, accessible testing environments (Feenstra et al., \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2018\u003c/span\u003e; Park \u0026amp; Schott, \u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Wang et al., \u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; White et al., \u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e2018\u003c/span\u003e). These tools are particularly useful in settings where face-to-face assessment is impractical or resource-intensive. Digital platforms can enhance performance-based tests through precise stimulus control and latency tracking and benefit self-report formats by enabling standardized administration, minimizing social desirability effects, and supporting large-scale deployment. 
These features expand the utility of EF assessments in both research and applied contexts (Naglieri et al., \u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e2004\u003c/span\u003e; Park \u0026amp; Schott, \u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Parsey \u0026amp; Schmitter-Edgecombe, \u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e2013\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eNevertheless, the transition from traditional to digital formats introduces specific challenges that warrant careful scrutiny, as highlighted in modern psychometric standards (American Educational Research Association et al., 2014; Clark \u0026amp; Watson, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2019\u003c/span\u003e; Revelle \u0026amp; Condon, \u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). Variability in devices, internet connectivity, environmental distractions, and user familiarity with technology may threaten reliability and standardization. Moreover, adaptations of traditional instruments to digital formats may alter task demands or response modalities, potentially affecting construct validity and comparability (Thomas, \u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). Several studies included in this review echo these concerns, reporting inconsistencies across platforms, reduced measurement precision in uncontrolled settings, and limited evidence of construct invariance (Aalbers et al., \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2013\u003c/span\u003e; Feenstra et al., \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2018\u003c/span\u003e; Iverson et al., \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e2009\u003c/span\u003e; Wang et al., \u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). 
These findings underscore the need for rigorous validation tailored to the specific characteristics of digital and online assessments.\u003c/p\u003e \u003cp\u003eDespite their growing availability, many digital EF tools lack comprehensive validation aligned with contemporary psychometric frameworks (American Educational Research Association et al., 2014; Sellbom \u0026amp; Tellegen, \u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e2019\u003c/span\u003e; Thomas, \u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). While internal consistency and test\u0026ndash;retest reliability are commonly reported (Revelle \u0026amp; Condon, \u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e2019\u003c/span\u003e), other key domains, such as content representativeness, factorial structure, external validity, response processes, and the consequences of test use, remain underexplored (Clark \u0026amp; Watson, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). This limited scope, evident in several studies reviewed here, restricts the ability of clinicians, educators, and researchers to make informed decisions about the scientific adequacy of these instruments.\u003c/p\u003e \u003cp\u003e To address this gap, the present systematic review aimed to map and synthesize the psychometric evidence produced over the past decade for digital and online EF assessments targeting healthy adults. 
Anchored in five core dimensions of contemporary psychometrics, namely content validity, structural validity, external validity (relations with other variables), response processes, and consequential validity (American Educational Research Association et al., 2014; Field, \u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e2020\u003c/span\u003e; Tabachnick \u0026amp; Fidell, \u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e2019\u003c/span\u003e), this review evaluates the extent to which each domain has been examined in validation studies. By doing so, it seeks to guide evidence-based decisions and promote higher psychometric standards in the design and application of digital EF tools.\u003c/p\u003e \u003cp\u003eUltimately, by consolidating findings across a wide range of instruments and validation approaches, this review intends to provide a comprehensive and up-to-date resource for researchers, clinicians, and test developers. It also aims to highlight persistent gaps and outline future research directions to strengthen the scientific foundation of EF assessment in the digital era.\u003c/p\u003e"},{"header":"Methods","content":"\u003cp\u003eThis review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Page et al., \u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e2021\u003c/span\u003e).\u003c/p\u003e \u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eTransparency and openness\u003c/h2\u003e \u003cp\u003eTo enhance methodological transparency and reduce bias, the protocol was prospectively registered in PROSPERO (registration number CRD420251027891).\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eEligibility criteria\u003c/h3\u003e\n\u003cp\u003eStudies were included if they: (a) involved healthy adults (\u0026ge;\u0026thinsp;18 years); (b) evaluated digital or online tools specifically designed to assess EF, conceptualized within a componential framework 
(Diamond, \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e2013\u003c/span\u003e; Friedman \u0026amp; Miyake, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e2017\u003c/span\u003e), including working memory, inhibitory control, and cognitive flexibility; and (c) reported original psychometric data addressing at least one domain of contemporary validity theory: reliability, content validity, structural validity, external validity (i.e., convergent, discriminant, or criterion validity), response processes validity, or consequential validity. Peer-reviewed articles published between December 2013 and February 2025, in English, Portuguese, or Spanish, with full-text access were eligible.\u003c/p\u003e \u003cp\u003eStudies were excluded if they: (a) targeted clinical or neurological populations; (b) used only paper-based tools; (c) lacked psychometric results; (d) were review articles, theoretical papers, case studies, conference abstracts, dissertations, or editorials; or (e) focused solely on usability or user experience without evaluating psychometric properties. Studies with a general cognitive focus but no explicit EF construct were also excluded.\u003c/p\u003e\n\u003ch3\u003eInformation source and search strategy\u003c/h3\u003e\n\u003cp\u003eA comprehensive search was conducted on February 17, 2025, across Embase, PubMed, PsycNet, Web of Science, and the Virtual Health Library (VHL). Search strategies used controlled descriptors from each database\u0026rsquo;s thesaurus (EMTREE, MeSH, APA Thesaurus, Web of Science Core Collection, and DeCS) and Boolean operators to combine five main concepts: executive functions, digital/online tools, assessment instruments, psychometric properties, and healthy adults. 
No publication status filters were applied.\u003c/p\u003e \u003cp\u003eThe full search strategy, including all adapted syntaxes and term combinations, is provided in the supplementary materials.\u003c/p\u003e\n\u003ch3\u003eStudy selection process\u003c/h3\u003e\n\u003cp\u003eAll references were imported into Rayyan (Ouzzani et al., \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2016\u003c/span\u003e) for initial screening. Duplicates were removed, and titles and abstracts were independently reviewed in blinded mode by two authors (T.N.T. and J.B.S.). Discrepancies were resolved through discussion, with a third author (R.M.M.A.) consulted when necessary. Full texts of potentially eligible studies were then assessed by T.N.T. and independently verified by J.B.S.; any remaining disagreements were resolved in consultation with R.M.M.A. Reasons for exclusion were documented at all stages. Finally, a PRISMA flow diagram was constructed to visually summarize the study selection process.\u003c/p\u003e\n\u003ch3\u003eData collection process and synthesis methods\u003c/h3\u003e\n\u003cp\u003eData extraction was performed by T.N.T., supported by Rayyan (Ouzzani et al., \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2016\u003c/span\u003e) and NVivo (v1.7.1), following a structured protocol. Any inconsistencies were discussed with J.B.S. and R.M.M.A. 
Data were extracted along five psychometric domains:\u003c/p\u003e \u003cp\u003e \u003col\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eContent validity\u003c/b\u003e, which examines whether all relevant dimensions of the construct are represented, the coherence of items in measuring the same construct, the balance of items across dimensions, and their relevance to the target population.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eStructural validity\u003c/b\u003e, which assesses the internal structure of the instrument, typically through exploratory or confirmatory factor analysis, to verify whether the data support the theoretical model of EF.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eExternal validity\u003c/b\u003e, which includes convergent and discriminant validity (i.e., correlations with related or unrelated constructs) and criterion validity (i.e., concurrent or predictive relationships with relevant external variables).\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eResponse processes validity\u003c/b\u003e, which investigates whether respondents engage with the instrument as theoretically expected and how external factors (e.g., device use, familiarity, or cognitive strategy) may influence performance.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eConsequential validity\u003c/b\u003e, which considers the broader implications of instrument use, including its practical utility, theoretical contributions, and potential ethical or social impacts.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003c/ol\u003e \u003c/p\u003e \u003cp\u003eA two-axis synthesis was conducted. First, a descriptive profile summarized study reference, country, tool name, sample size, mean age, and education. 
Second, psychometric evidence was mapped by EF domain, validity dimensions, and reliability metrics (e.g., Cronbach\u0026rsquo;s alpha, ICC). Results were organized into summary tables to facilitate comparison.\u003c/p\u003e \u003cp\u003eDue to heterogeneity in instruments, constructs, and statistical methods, meta-analysis was not viable. Instead, a narrative synthesis was applied to highlight methodological trends, gaps, and strengths across studies.\u003c/p\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eStudy risk-of-bias assessment\u003c/h2\u003e \u003cp\u003eStudy quality was assessed by T.N.T. using an adapted version of the QUADAS-2 tool (Whiting et al., \u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e2011\u003c/span\u003e). Signaling questions were modified to align with the review\u0026rsquo;s scope (see Appendix A). Although originally developed for diagnostic accuracy studies, QUADAS-2 was selected for its structured, domain-based approach and adaptability to different research contexts. This flexibility made it particularly suitable for evaluating methodological quality in studies involving digital and online executive function assessments. Studies were rated as: low risk of bias, when all criteria were clearly met; high risk of bias, when the study\u0026rsquo;s methods or procedures (e.g., participant selection or test administration) could reasonably introduce bias; unclear risk, when insufficient information was available to make a definitive judgment; and not applicable, when the domain or item did not pertain to the study.\u003c/p\u003e \u003cp\u003eAll ratings were reviewed by J.B.S., with R.M.M.A. mediating unresolved cases. To further ensure rigor, the review team examined potential bias from missing data, such as unreported sample characteristics or psychometric results, and their implications for interpretation. 
Relevant limitations are noted in the results and discussion.\u003c/p\u003e \u003c/div\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003eStudy selection\u003c/h2\u003e \u003cp\u003eThe study selection process is illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e (PRISMA Flow Diagram). A total of 4,019 records were identified across five databases: PubMed (n\u0026thinsp;=\u0026thinsp;1,488), VHL (n\u0026thinsp;=\u0026thinsp;845), PsycNet (n\u0026thinsp;=\u0026thinsp;810), Embase (n\u0026thinsp;=\u0026thinsp;526), and Web of Science (n\u0026thinsp;=\u0026thinsp;350). After the removal of 656 duplicate records, 3,363 titles and abstracts were screened independently. This initial screening led to the exclusion of 3,286 records that did not meet the eligibility criteria.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eOf the remaining 77 studies, 75 were successfully retrieved, most from open-access sources, and some through direct author contact via email. Two reports could not be obtained due to lack of response from the corresponding authors. After full-text review of the 75 retrieved studies, 44 were excluded for the following reasons: lack of psychometric data (n\u0026thinsp;=\u0026thinsp;22), wrong population, such as children, adolescents, or clinical samples (n\u0026thinsp;=\u0026thinsp;11), use of non-EF tests (n\u0026thinsp;=\u0026thinsp;8), or ineligible publication type (n\u0026thinsp;=\u0026thinsp;3). 
As a result, 31 studies met all inclusion criteria and were included in the final synthesis.\u003c/p\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eStudy characteristics\u003c/h2\u003e \u003cp\u003eThe characteristics of each study are presented in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. A total of 31 studies published over the last decade were included in this review. Collectively, these studies investigated digital or online tools aimed at assessing EF and comprised a combined sample of 11,246 healthy adult participants (6,283 females). Sample sizes ranged from 27 to 4,600 individuals.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003e\u003cem\u003eStudy characteristics\u003c/em\u003e\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e\u003c/p\u003e \u003cp\u003eThe average age across studies was approximately 45.1 years, with some studies focusing on specific age groups such as young adults (18\u0026ndash;34 years) and older adults (60\u0026thinsp;+\u0026thinsp;years). The average educational level, when reported, was about 14.2 years, indicating that most samples were composed of adults with at least secondary education. 
These samples underscore the relevance of EF assessment across the adult lifespan, while also highlighting the general focus on cognitively healthy, non-clinical populations.\u003c/p\u003e \u003cp\u003eFrom a geographical perspective, most studies were concentrated in high-income regions, with Europe (n\u0026thinsp;=\u0026thinsp;14) and North America (n\u0026thinsp;=\u0026thinsp;8) leading in publication volume. Smaller yet relevant contributions originated from Asia (n\u0026thinsp;=\u0026thinsp;5), Latin America (n\u0026thinsp;=\u0026thinsp;2), and Oceania (n\u0026thinsp;=\u0026thinsp;2), totaling studies from 20 countries. This broad international distribution reflects the increasing momentum toward digital EF assessment worldwide. The demand for remote and scalable testing alternatives, fueled in part by the constraints imposed by the COVID-19 pandemic, has accelerated the shift toward digital methodologies, particularly in contexts requiring reduced in-person contact and greater logistical adaptability.\u003c/p\u003e \u003cp\u003eRegarding the types of instruments employed, the studies revealed a wide diversity of digital and online tools for executive function (EF) assessment. 
These included both established instruments, such as the National Institutes of Health Toolbox Cognition Battery (NIHTB-CB) and the Mindmore Digital Cognitive Assessment (MINDMORE), and newly developed tools designed specifically for the respective studies.\u003c/p\u003e \u003cp\u003eIn terms of structure, the instruments could be broadly categorized into:\u003c/p\u003e\u003cp\u003e(a) Comprehensive digital batteries (e.g., NIHTB-CB, Cambridge Neuropsychological Test Automated Battery [CANTAB], NeuroUX Cognitive Platform [NeuroUX], Neuropsychological Online Platform [NeurOn]);\u003cbr\u003e(b) Digitized versions of classical paper-and-pencil (P\u0026amp;P) EF tasks (e.g., Trail Making Test [TMT], Stroop, Tower of London [TOL], N-back);\u003cbr\u003e(c) Serious games or gamified platforms; and\u003cbr\u003e(d) Questionnaires or rating scales adapted for online use (e.g., Executive Function Scale for Adults [EFSA]).\u003cbr\u003e\u003c/p\u003e\n\u003cp\u003eAmong these, only two tools were used in more than one study: the NIHTB-CB (n = 3) and digital versions of the TMT (n = 2). This pattern reflects a field marked by methodological heterogeneity, with researchers drawing on both standardized instruments and novel, context-specific tools to assess EF domains in digital and online environments.\u003c/p\u003e\n\u003cp\u003eConcerning the nature of the assessment, most of the studies (n = 27) focused on performance-based (objective) EF measures, typically involving computerized cognitive tasks that assess core executive components such as working memory, inhibitory control, and cognitive flexibility. 
A smaller subset of studies (n = 4) relied on subjective assessments, including self-report or informant-report questionnaires specifically designed to capture everyday executive functioning (e.g., Executive Function Scale for Adults and adaptations of traditional paper-based inventories for online use).\u003c/p\u003e\n\u003cp\u003eAltogether, this diversity in instruments, populations, and assessment strategies reflects both the versatility and the current lack of standardization in digital EF measurement practices. While digital tools offer clear advantages, such as flexible administration, cost-effectiveness, and potential for remote deployment, their widespread adoption remains limited, partly due to concerns regarding the sufficiency of their psychometric validation (Bergman et al., 2025).\u003c/p\u003e\n\u003cp\u003eAs emphasized by contemporary psychometric guidelines, including the American Educational Research Association (AERA) Standards for Educational and Psychological Testing (American Educational Research Association et al., 2014) and recent theoretical contributions (Clark \u0026amp; Watson, 2019; Revelle \u0026amp; Condon, 2019), ensuring both methodological quality and construct validity is essential before test results can be interpreted or generalized with confidence. These considerations underscore the importance of critically evaluating the instruments reviewed, as addressed in the following subsections.\u003c/p\u003e\n\u003cdiv id=\"Sec12\"\u003e\n \u003ch2\u003eRisk of bias in the reviewed studies\u003c/h2\u003e\n \u003cp\u003eRisk of bias was assessed using an adapted version of the QUADAS-2 tool (Appendix A), which evaluates four key domains: D1 – Participant Selection, D2 – Index Test, D3 – Reference Standard, and D4 – Flow and Timing. Table 2 presents the detailed risk-of-bias ratings for each individual study, while Fig. 2(a) summarizes the overall distribution of ratings across these domains. 
Below, we provide a domain-by-domain interpretation of the observed patterns, and the full set of domain-level judgments and textual justifications for each study is provided in the supplementary materials. Domain D3 was considered applicable only when the study employed an external criterion measure to validate the index test. In its absence, D3 was marked as “not applicable”.\u003c/p\u003e\n \u003cp\u003eTable 2\u003c/p\u003e\n \u003cp\u003eThe domain related to participant selection (D1) revealed the most critical source of bias. A total of 23 studies were rated as high risk, primarily due to convenience sampling, lack of randomization procedures, or insufficient reporting of recruitment strategies. These issues raise concerns about selection bias and limit the representativeness of the samples. Only four studies were classified as low risk, typically those with transparent and rigorous inclusion procedures. An additional four studies were marked as unclear due to limited information.\u003c/p\u003e\n \u003cp\u003eAs for the assessment procedures themselves (D2), most studies received favorable evaluations, with 29 rated as low risk. This reflects the frequent use of standardized administration protocols and consistent procedural reporting. However, in two studies, the absence of detail regarding blinding procedures or uniform testing conditions raised concerns, leading to their classification as high risk. It is important to note that these ratings reflect procedural adequacy rather than the depth of theoretical validation. In many cases, the instruments were designed to assess EF and included task elements broadly consistent with the target constructs, at least at a surface level. 
This distinction is important, as several tools lacked evidence of thorough construct validation, an issue addressed in the subsequent sections.\u003c/p\u003e\n \u003cp\u003eWith respect to the use of external criterion measures (D3), 19 studies were deemed applicable, all of which were rated as low risk. These studies employed validated EF instruments, clinical benchmarks, or robust scoring frameworks to support the interpretability of findings. The absence of high-risk ratings in this domain likely reflects the selective inclusion of studies with clearly defined comparators.\u003c/p\u003e\n \u003cp\u003eIn contrast, the domain evaluating flow and timing showed strong consistency, with all 31 studies rated as low risk. This uniformity suggests strong procedural control, characterized by consistent sequencing and timing of assessments and minimal data loss. The automated nature of most digital tools likely contributed to the reliability of testing procedures in this domain.\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003eFigure 3\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003eIn addition to identifying potential sources of bias, this review also assessed concerns related to applicability—that is, the extent to which each study’s findings can be generalized to the intended populations and contexts of EF assessment. Table\u0026nbsp;2 displays the detailed applicability ratings for each individual study, while Fig.\u0026nbsp;2(b) summarizes the distribution of ratings across the three applicable domains (D1, D2, and D3). The following is a domain-wise interpretation of these patterns.\u003c/p\u003e\n \u003cp\u003eConcerns related to participant selection (D1) were particularly prominent. A total of 17 studies were rated as high concern, often due to participant profiles that did not reflect the broader populations targeted by the instruments, for example, samples composed predominantly of highly educated or digitally proficient individuals. 
Thirteen studies were rated as low concern, while one was marked as unclear due to insufficient demographic detail.\u003c/p\u003e\n \u003cp\u003eRegarding the index tests (D2), most studies were judged to have low concern in terms of applicability (n = 29). This was generally based on the formal alignment between the stated purpose of the test and the constructs it intended to measure. However, as noted earlier, this evaluation reflects surface-level coherence and does not necessarily indicate strong theoretical grounding or robust construct representation. In two studies, poor alignment between the instrument and the intended EF domains led to high concern ratings due to potential construct misfit.\u003c/p\u003e\n \u003cp\u003eThe applicability of reference standards (D3) was assessed in 19 studies. Of these, 18 were rated as low concern. One study raised high concern due to the use of poorly justified comparators or evaluation criteria misaligned with EF theory, which could compromise the interpretability of findings. In 12 studies, the domain was not applicable, as no external criterion was employed for instrument validation.\u003c/p\u003e\n \u003cp\u003eTaken together, these results reinforce the overall pattern observed in the risk-of-bias assessment. While most studies demonstrated procedural rigor, especially in test administration and timing, critical challenges remain regarding the representativeness of samples and the theoretical coherence between the instruments used and the constructs they aim to assess. 
Addressing these limitations is essential for improving the ecological validity, interpretability, and practical utility of digital and online EF assessment tools.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec13\"\u003e\n \u003ch2\u003eResults of the reviewed studies\u003c/h2\u003e\n \u003cp\u003eIn line with contemporary psychometric frameworks, the results are organized according to six domains of evidence: reliability, content validity, structural validity, external validity, response process validity, and consequential validity. Table 3 presents the psychometric evidence of the reviewed studies, serving as a visual synthesis of the findings discussed below.\u0026nbsp;\u003c/p\u003e\u0026nbsp;\u003ctable id=\"Tab3\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv\u003eTable 3\u003c/div\u003e\n \u003cdiv\u003e\n \u003cp\u003e\u003cem\u003ePsychometric evidence in the reviewed studies\u003c/em\u003e\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eStudy\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eDigital/Online Test\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eConstruct assessed\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eReliability\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eContent validity\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eStructural validity\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eExternal validity\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eResponse Process Validity\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eConsequential validity\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd 
align=\"left\"\u003e\n \u003cp\u003eBergman et al. (2025)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMindmore\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eATT, MEM, LANG, EF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTRt-R: ICC = .60–.88 across 22 core scores (11 ≥ .70; 6 \u0026lt; .60).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eDigital adaptations of established neuropsychological instruments, preserving task structure and administration procedures.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePractice effects in speeded tasks (e.g., TMT-A, Stroop) suggest expected cognitive sensitivity and task engagement (indirect evidence).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eDemonstrates clinical utility for early detection of cognitive deficits and remote monitoring; supports informed test selection and interpretation.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eBruno et al. 
(2024)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAFLT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eVIM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eDesigned as a nonverbal analog to the RAVLT, with matched structure, memory demands, and administration procedures.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCV: r = .20–.61 with ROCF (immediate, delayed, recognition).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eClinically relevant for identifying material-specific memory deficits (e.g., in epilepsy); enables differentiation between verbal and nonverbal profiles via RAVLT comparison.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eByrne et al. (2024)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eN-BR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eWM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTasks selected to reflect distinct WM paradigms (updating vs. 
serial recall), with theoretical support for a two-factor model.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEFA: RMSEA = .026; CFI = .996; χ² = 23.58 (2-paradigm model); CFA: RMSEA = .027; CFI = .994; χ² = 10.66 (2-paradigm model).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSupports theoretical differentiation of working memory paradigms; applicable to hierarchical and network models.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eDepauw et al. (2024)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAxon-TMT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePS, EF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eDigital version of the traditional TMT, targeting visuospatial and executive functions.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCV: r = .51–.69 with TMT-A/B; CrV: performance with CAPTCHA (IT task).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRTs and error profiles analyzed to support sensitivity to executive demands (indirect evidence).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEnables rapid (~5 min) cognitive screening for diverse populations (neurotypical adults, older adults, children, stroke survivors); suitable for remote use.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGiorgini et al. 
(2024)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTMS\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMEM, EF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCovers distinct EF and memory domains; content grounded in gold-standard neuropsychological measures.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCFA: RMSEA = .000; CFI = 1.000; TLI = 1.007 (3-factor model); MI: metric invariance across countries.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCV: r = .30–.45 with Digit Span, Stroop, COWAT, SVF; DV: r \u0026lt; .18 with MMSE.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eUseful for distinguishing EF and memory deficits; supports neuropsychological screening and cross-cultural applications.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHassmén et al. (2024)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCHEFS\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eWM, CF, IC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eBuilt from classic tasks (e.g., Matrix Reasoning, Number–Letter); aligned with Miyake et al.’s (2000) EF framework.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCV: adj. 
R² = .11 with D-KEFS (MRA); CrV: 87% classification (Λ = .75, χ² = 14.76, p = .001).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eVariation in RT and accuracy aligned with EF subcomponents (e.g., inhibition, switching), offering indirect support for response process validity.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eApplicable for non-clinical EF assessment; supports use in driving fitness evaluations and referral decision-making.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHatahet \u0026amp; Seghier (2024)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNIHTB-CB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEF, WM, PS\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eIC-R: α = .62–.86 (by age group).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eUses a widely recognized tool for executive assessment, grounded in established cognitive domains.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCFA: RMSEA = .052–.069; CFI = .932–.94; TLI = .914–.924 (3-factor model across age groups).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eResponse pattern differences across age groups identified via measurement invariance; behavioral scores predicted age in older adults (indirect evidence).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRecommended for age-based differentiation (younger vs. older adults) in clinical contexts.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHurtado-Pomares et al. 
(2024)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eFAB-E\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eIC-R: α = .60; TRt-R: ICC = .72; ρ = .70.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eerived from the FAB, with six subscales covering distinct EF components.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCrV: r = .43 with MMSE; DV: r = –.52 with TMT.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEnables rapid EF screening with cutoffs tailored to age and education level.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMorrissey et al. 
(2024)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNeurOn\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRT, PS, EF, EM, WM, ATT, EP\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTRt-R: ICC = .50–.80 (repeated tasks).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAdapted from traditional neuropsychological tests for online administration.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCrV: ρ = .60 with MoCA; ρ = .61 with TMT.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAdministration mode and device-type effects examined; equivalent performance across formats supported task comparability (indirect evidence).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSupports large-scale screening using nonverbal tasks; applicable to population studies and clinical trial eligibility.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePaolillo et al. 
(2024)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNeuroUX\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEF, WM, RT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTRt-R: ICC \u0026gt; .76 (9 of 12 core measures).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eIncludes tasks targeting executive cognition (processing speed, memory, EF).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePerformance influenced by smartphone type and testing environment (indirect evidence).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSuitable for longitudinal cognitive monitoring; demonstrates usability in real-world environments.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePerzl et al. 
(2024)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eDSST/SART/PVT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePS, ATT, RT, SusATT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTRt-R: r = .42–.84 (within-person stability in DSST, SART, PVT).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eDigital adaptations of validated tasks (DSST, SART, PVT) for occupational contexts.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCV: r = .30–.46 with attention and arousal measures.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePerformance reflected sensitivity to prior cognitive and emotional states (indirect evidence).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eApplicable to occupational cognitive monitoring and performance tracking.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eToh \u0026amp; Yang (2023)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTSP\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCF, ER\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTask structure supported by second-order EF model (inhibition, WM, flexibility); aligned with unity/diversity framework (Friedman \u0026amp; Miyake, 2017).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCFA: RMSEA = .012; CFI = .997; SRMR = .039 (second-order model).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eIntraindividual 
variability and EF-specific effects predicted regulatory behavior, aligning with task engagement (indirect evidence).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRelevant for research on executive-emotional control and regulation mechanisms.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eZhang et al. (2024)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eOCST\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSH-R: r = .72–.95 (OCST tasks).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eDigital WCST version with automated scoring aligned with expert consensus.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTask engagement influenced by age and digital literacy, suggesting construct-relevant response variation (indirect evidence).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePotentially suitable for remote EF screening and longitudinal cognitive monitoring.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eKruger et al. 
(2023)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEFSA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eWM, IC, CF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eIC-R: α/ω = .88/.89 (total), .90/.90 (WM), .79/.76 (IC), .62/.64 (CF).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eBased on Diamond’s model; expert ratings showed κ = .55–.73 and CVI \u0026gt; .30 for semantic clarity.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEFA: RMSEA = .031; CFI = .990; TLI = .987 (3-factor model; 40.1% variance explained).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCV: ρ = .17–.71 with Dysexecutive Questionnaire (WM, IC, CF).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePromising clinical utility as a complementary EF assessment in applied settings; grounded in a contemporary theoretical model.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eWang et al. 
(2023)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eFISHERMAN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eWM, IC, CF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eIC-R: α = .83–.89; SH-R: r = .77–.88 (subgames).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGrounded in Miyake \u0026amp; Friedman’s (2012) model; subgames reflect core EF domains via classic analogues.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCFA: RMSEA = .073; CFI = .982; TLI = .955; GFI = .972; χ²/df = 1.57 (3-factor model).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCrV: r = .37–.75 with stop-signal, number switch, and Corsi block-tapping tasks (by subgame).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePromising for older adult screening; age and sex effects support normative application relevance.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eArioli et al. 
(2022)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRCM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMEM, ATT, VF, SS\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTRt-R: r = .45–.61 (memory, fluency, shifting); ns for WM.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eReplicates key components of CVLT-II and TMT-B via digital speech interface.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCrV: d = .26–1.28 with P\u0026amp;P tasks (CVLT-II, Trails B, Fluency, Digit Span).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEnhances access to cognitive screening in older adults; supports remote research and inclusive norming.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eKarlsen et al. 
(2022)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCANTAB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eVIM, EF, VisATT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTRt-R: r = .39–.79; practice effects in 6/14 (g = .15–.40).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEstablished tool for memory and EF assessment.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePractice effects and age influenced response patterns across sessions (indirect evidence).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEnables probabilistic interpretation of clinically meaningful cognitive change.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eOtt et al. 
(2022)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNIHTB-CB (iPad)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCovers NIHTB-CB domains; limitations observed in attention and processing speed.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCV: r = .35–.80 with gold-standard tests (attention/EF, WM, language, motor).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAligns with NIH’s iPad-only shift; supports standardization and integration into research workflows.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePark \u0026amp; Schott (2022)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003edTMT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEF, PS, FM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eIR-R: ICC = .90–.95 (TMT-M, A, B).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePreserves TMT structure, adding interaction metrics (e.g., inter-touch, pause duration).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCrV: r = .82–.90 with paper-based TMT.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTouch and pause metrics showed consistent variation, reflecting fatigue or task familiarity (indirect evidence).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n 
\u003cp\u003ePromising for early detection of cognitive decline.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eWahyuningrum et al. (2022)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eINTB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMEM, ATT, EF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTRt-R: ICC = .60–.91 (subtests); low ICCs \u0026lt; .25 for 3 RAVLT indices.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAdapts 10 international tests to assess cognitive domains in Indonesian context.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePCA: 7 factors (62.8% var.; KMO = .83); CFA: RMSEA = .040; TLI = .947.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePractice effects and response time variability indicated sensitivity to repetition and performance fluctuation (indirect evidence).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAdapted for Indonesian cultural context; supports clinical use with preliminary normative data.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eFeenstra et al. 
(2018)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eACS\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eATT, MEM, PS, EF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTRt-R: ICC = .45–.80 (subtests); .83 (total score).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eBased on conventional neuropsychological tests (Rey, Corsi, TOL, WAIS).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCV: r = .42–.70 with TMT, Digit Span, Corsi, TOL, Pegboard, RAVLT.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePerformance influenced by users’ computer proficiency (indirect evidence).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSuitable for remote screening of general cognitive functions; normative data support preliminary ACS interpretation.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eWhite et al. 
(2018)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCCTB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEF, SelATT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTRt-R: ICC = .34–.93 (Pro, Anti, Simon, Flanker, 2-back); low for Corsi.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eFatigue and practice effects observed, particularly in early retest sessions (indirect evidence).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eApplicable to aging research; practice effects may obscure cognitive decline detection.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRijnen et al. 
(2018)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCNS-VS\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eVBM, VIM, PS, ATT, CF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTRt-R: ICC = .40–.89 (higher in speed \u0026amp; flexibility; lower in memory/attention).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePractice effects reported in cognitive flexibility and reaction time tasks (indirect evidence).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eProvides RCI formulae for reliable change assessment; highlights need to adjust for practice effects.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSoveri et al. 
(2018)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eBEFT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eIC, CF, WM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTRt-R: r = .35–.93 (higher for WM speed; lower for inhibition and N-back accuracy).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePractice effects observed for response times across retest sessions (indirect evidence).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHighlights limitations in interpreting longitudinal/intervention effects due to variability and practice effects.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eParsons \u0026amp; Barnett (2017)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eVAST\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eIC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eModeled after traditional Stroop paradigms to support construct representation.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCrV: η² = .09–.66 with D-KEFS and ANAM (ANOVA).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eStimulus modality (VR vs. 
P\u0026amp;P) influenced cognitive response patterns (indirect evidence).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEcologically grounded; useful for dissociating inhibition from distractor resistance in older adults.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eIshigami et al. (2016)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eANT-I\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eATT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSH-R: r = .29 (alerting), .70 (orienting), .68 (executive).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAttention networks assessed separately, aligned with Petersen \u0026amp; Posner’s model.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCrV: β = –.17 (conflict resolution), β = –.18 (verbal memory) with D-KEFS, SDMT, BSR.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSupports evaluation of attentional changes in aging and clinical populations.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eKaller et al. 
(2016)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTOL-F\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePL, PRB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eIC-R: α = .71–.74; SH-R: r = .71–.76; GLB = .73–.76; TRt-R: r = .72.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eBuilt on gold-standard TOL; items balanced for complexity, rule structure, and search depth.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePlanning/execution times and rule violations used as behavioral indicators (indirect evidence).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSuitable for clinical and lifespan planning assessment.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eKöstering et al. 
(2015)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTOL-F\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePL\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTRt-R: ICC = .69 (accuracy), .27–.52 (latency).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e24 tasks balanced for goal hierarchy and search depth to reflect planning demands.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eInitial latency deemed a noisy planning metric; accuracy considered a more valid response outcome (indirect evidence).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAppropriate for group research and clinical use via accuracy-based planning metrics.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHeaton et al. 
(2014)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNIHTB-CB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMEM, PS, EF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eIC-R: α = .77–.84 (composite, fluid, crystallized); TRt-R: ICC = .86–.92.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCrV: r = .78–.90 with GS composites (PPVT, CWI, WCST, PASAT, RAVLT); d = .20–.50 with functional/health markers; DV: r = .17–.39 with distinct constructs.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHigh test–retest consistency interpreted as reliable response engagement (indirect evidence).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSupports clinical and research screening; findings may guide diagnostic and triage strategies.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTroyer et al. 
(2014)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSAOCA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMEM, ATT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eIC-R: α = .96 (Stroop); SH-R: r = .62 (Face-Name); TRt-R: r = .49–.83; AF-R: r = .48–.82.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eDraws from validated tests (spatial WM, Stroop, Face–Name, Letter–Number Alternation).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePCA: λ = 1.61; loadings = .58–.75 (unidimensional structure)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHigh completion rate (87%) and expected score distribution (94%) suggest task engagement (indirect evidence).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eUseful for early cognitive decline detection and screening.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAalbers et al. 
(2013)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eBAM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eWM, PL, EP, VIM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAF-R: ICC = .42 (WM), .43 (VSM), .17 (EM), .65 (Planning).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCV: ρ = .40–.67 with WAIS, WMS, BADS, CVMT; DV: ρ = –.03 to –.13 with NART-IQ.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRT and error variation across task complexity supported construct-relevant engagement (indirect evidence).\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSupports long-term self-monitoring in both clinical and digital environments.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003ctfoot\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"9\"\u003eNote: ACS: Amsterdam cognition scan; AFLT: auditory figural learning test; AF-R: alternate-form reliability; ANAM: automated neuropsychological assessment metrics; ANOVA: analysis of variance; ANT-I: attention network test – interaction; ATT: attention; BADS: Behavioral Assessment of the Dysexecutive Syndrome; BEFT: Battery of Executive Function Tasks (Simon, visuoverbal n-back, visuospatial n-back, letter-memory, and number-letter); BSR: Buschke selective reminding; CANTAB: Cambridge neuropsychological test automated battery; CAPTCHA: completely automated public turing test to tell computers and humans apart; CCTB: computerized cognitive test battery (Pro, Anti, Simon, Flanker, 2-back, Corsi); CF: cognitive flexibility; CFA: confirmatory factor analysis; CFI: Comparative fit index; CHEFS: Coffs Harbour executive functioning screen; COWAT: controlled oral word 
association test; CrV: criterion validity; CV: convergent validity; CVI: content validity index; CVLT: California verbal learning test; CVMT: continuous visual memory test; CWI: color-word interference test; DSST: digit symbol substitution test; DV: discriminant validity; EF: executive functions; EFA: exploratory factor analysis; EFSA: executive function scale for adults; EM: episodic memory; EP: executive planning; ER: emotional regulation; FAB-E: frontal assessment battery - Spanish version; FISHERMAN: fisherman Task (serious game); FMC: fine motor control; GF: general cognitive functions; GFI: goodness of fit index; GLB: greatest lower bound; GS: gold standard; IC: inhibitory control; IC-R: internal consistency – reliability; ICC: intraclass correlation coefficient; INTB: Indonesian neuropsychological test battery; IQ: intelligence quotient; IR-R: inter-rater reliability; IT: information theory; D-KEFS: Delis-Kaplan executive function system; KMO: Kaiser-Meyer-Olkin measure of sampling adequacy; LANG: Language; MEM: memory; MI: measurement invariance; MINDMORE: Mindmore digital cognitive assessment tool; MMSE: mini-mental state examination; MoCA: Montreal cognitive assessment; MRA: multivariate regression analysis; NART: national adult reading test; N-BR: n-back and backward recall tasks; NIH: national institutes of health; NIHTB-CB: NIH toolbox cognition battery; NR: not reported; OCST: online card sorting task; PASAT: paced auditory serial addition test; PCA: principal component analysis; P\u0026amp;P: paper and pencil; PLPS: planning and problem-solving; PPVT: Peabody picture vocabulary test; PS: processing speed; PV: predictive validity; PVT: psychomotor vigilance test; RAVLT: Rey auditory verbal learning test; RCI: reliable change indices; RCM: remote characterization module; RMSEA: root mean square error of approximation; ROCF: Rey-Osterrieth complex figure; RT: reaction time; SAOCA: self-administered online cognitive assessment; SART: sustained 
attention to response task; SDMT: symbol digit modalities test; SelATT: selective attention; SH-R: split-half reliability; SRMR: standardized root mean square residual; SS: set-shifting; SVF: semantic verbal fluency; TLI: Tucker-Lewis index; TMS: test of memory strategies; TMT: trail making test; TOL: tower of London; TRt-R: test-retest reliability; TSP: task switching paradigm; VAST: virtual apartment-based Stroop task; VBM: verbal memory; VF: verbal fluency; VisATT: visual attention; VIM: visual memory; VR: virtual reality; VS: visuospatial skills; VSM: visuospatial memory; WAIS: Wechsler adult intelligence scale; WCST: Wisconsin card sorting test; WM: working memory; and WMS: Wechsler memory scale.\u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tfoot\u003e\n \u003c/table\u003e\n \u003cp\u003e\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec14\"\u003e\n \u003ch2\u003eReliability evidence\u003c/h2\u003e\n \u003cp\u003eReliability was assessed in 23 of the 31 studies, though with considerable variation in method and reporting practices. The most common form was test–retest reliability (TRt-R), investigated in 16 studies, typically using intraclass correlation coefficients (ICCs) or Pearson’s r. High coefficients were reported for instruments such as the dTMT-B (ICC = .95; Park \u0026amp; Schott, 2022), NIHTB-CB (ICC = .86 − .92; Hatahet \u0026amp; Seghier, 2024; Heaton et al., 2014), the TOL Freiburg version (TOL-F; r = .69 − .72 for accuracy; Kaller et al., 2016; Köstering et al., 2015), and the Indonesian Neuropsychological Test Battery (INTB; ICCs = .60–.91; Wahyuningrum et al., 2022).\u003c/p\u003e\n \u003cp\u003eIn contrast, lower stability was observed in latency-based outcomes from the TOL-F (ICCs = .27–.52) and in some Brain Aging Monitor-Cognitive Assessment Battery (BAM) subtests (ICC = .17 − .42; Aalbers et al., 2013). Accuracy-based scores generally yielded more consistent estimates than latency-based ones. 
Additionally, one study (Park \u0026amp; Schott, 2022) examined inter-run reliability (IR-R), with ICCs above .90 across conditions.\u003c/p\u003e\n \u003cp\u003eInternal consistency was assessed in 12 studies, including 7 that reported Cronbach’s alpha or McDonald’s omega (IC-R) and 5 that reported split-half reliability (SH-R). The EFSA demonstrated strong internal consistency, with α and ω ranging from .76 to .90 across subscales (Kruger et al., 2023). The Stroop task in the Self-Administered Online Cognitive Assessment (SAOCA) showed α = .96 (Troyer et al., 2014), while the Online Card Sorting Task (OCST; Zhang et al., 2024) reported SH-R between r = .72 and .95. The Face–Name task in the SAOCA also showed moderate split-half reliability (r = .62). However, few instruments triangulated different reliability indices, limiting opportunities for convergence across methods and interpretations.\u003c/p\u003e\n \u003cp\u003eIn sum, while many instruments demonstrated at least moderate score stability, reporting practices were heterogeneous and often limited to a single reliability index. Triangulation across complementary methods - such as combining internal consistency and temporal stability - was rare. Notably, alternate-form reliability (AF-R) and inter-run reliability (IR-R) were reported in only two and one study, respectively, suggesting underexplored areas in the validation of digital EF tools. The adoption of broader, multi-indicator reliability frameworks, especially in instruments designed for repeated or longitudinal use, remains essential to enhance interpretability, generalizability, and psychometric robustness.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec15\"\u003e\n \u003ch2\u003eContent validity evidence\u003c/h2\u003e\n \u003cp\u003eContent validity was addressed in 26 of the 31 reviewed studies, making it one of the most frequently mentioned domains. However, formal methodological procedures were rarely applied. 
While many instruments were conceptually grounded in established theoretical models or classical paradigms, only a minority described explicit strategies to ensure comprehensive construct coverage. No study employed formal indices, such as the Content Validity Index (CVI), or conducted structured expert judgment evaluations.\u003c/p\u003e\n \u003cp\u003eSome instruments were clearly anchored in solid theoretical foundations. The TOL-F (Kaller et al., 2016; Köstering et al., 2015) was designed to reflect goal hierarchy and search depth, key elements in planning. The EFSA (Kruger et al., 2023) and the Test of Memory Strategies (TMS; Giorgini et al., 2024) drew on multidimensional EF models (e.g., Diamond, 2013) and incorporated expert consultation during item development. The BAM battery (Aalbers et al., 2013), although conceptually aligned with cognitive aging domains, did not include item-level validation procedures.\u003c/p\u003e\n \u003cp\u003eSimilarly, instruments such as the Frontal Assessment Battery - Spanish Version (FAB-E; Hurtado-Pomares et al., 2024), CANTAB (Karlsen et al., 2022), and NeurOn (Morrissey et al., 2024) were based on traditional batteries but provided limited details on how theoretical constructs were preserved during digital adaptation. Several studies reported adaptations of classical tasks, such as the Digit Symbol Substitution Task (DSST; Perzl et al., 2024), OCST (Zhang et al., 2024), SAOCA (Troyer et al., 2014), and the Remote Characterization Module (RCM; Arioli et al., 2022), but did not clarify how construct representation was maintained. 
Others, including the Attention Network Test-Interaction (ANT-I; Ishigami et al., 2016), the Aggie Figures Learning Test (AFLT; Bruno et al., 2024), and the Virtual apartment-based Stroop test (VAST; Parsons \u0026amp; Barnett, 2017), limited their justification to structural similarity with legacy tasks.\u003c/p\u003e\n \u003cp\u003eIn summary, although many instruments referenced well-established paradigms or models, most lacked systematic evaluation of whether their items fully represented the intended constructs. This gap limits confidence in the content validity of digital EF assessments and highlights a critical need for more rigorous design and documentation practices in future instrument development.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec16\"\u003e\n \u003ch2\u003eStructural validity evidence\u003c/h2\u003e\n \u003cp\u003eStructural validity was assessed in 8 of the 31 reviewed studies, with considerable variation in methodological rigor and analytic approach. Confirmatory factor analysis (CFA) was employed in six studies, four as standalone analyses and two in combination with either exploratory factor analysis (EFA) or principal component analysis (PCA). EFA was used in two studies (one exclusively and one alongside CFA), while PCA appeared in two studies (one standalone and one combined with CFA). Although PCA supports data reduction, it does not assess latent construct structure and therefore provides limited evidence of structural validity (American Educational Research Association et al., 2014).\u003c/p\u003e\n \u003cp\u003eAmong the CFA-based studies, several demonstrated robust model fit. The EFSA (Kruger et al., 2023) confirmed a three-factor structure encompassing working memory, inhibitory control, and cognitive flexibility (e.g., RMSEA = .031; CFI = .990). 
The TMS supported a similar three-factor solution involving executive and memory-related components and reported partial measurement invariance across countries (Giorgini et al., 2024). The NIHTB-CB showed acceptable fit for a three-factor solution across age groups, covering crystallized, fluid, and composite cognitive domains (Hatahet \u0026amp; Seghier, 2024). The INTB (Wahyuningrum et al., 2022) also reported CFA results that confirmed the seven-factor structure initially extracted via PCA (RMSEA = .040; TLI = .947), supporting its internal dimensional consistency.\u003c/p\u003e\n \u003cp\u003eThe PCA-exclusive study, SAOCA (Troyer et al., 2014), reported a unidimensional solution (λ = 1.61; loadings = .58–.75) but did not follow up with confirmatory modeling, limiting the interpretability of its structural assumptions.\u003c/p\u003e\n \u003cp\u003eSeveral instruments provided theoretical justification for their multidimensional design but did not empirically test it. This includes the TOL-F (Kaller et al., 2016; Köstering et al., 2015), BAM (Aalbers et al., 2013), and FISHERMAN (Wang et al., 2023). Although grounded in theoretical frameworks, the lack of structural analysis restricts interpretability of their dimensional claims.\u003c/p\u003e\n \u003cp\u003eNotably, most studies based on classical EF tasks, such as the Stroop, DSST, TMT, Go/No-Go, and N-back, did not report any internal structure assessment. This reflects a persistent gap in the digital EF literature, where dimensionality is frequently assumed rather than empirically verified.\u003c/p\u003e\n \u003cp\u003eIn summary, although a few instruments provided robust structural evidence through CFA, most relied on exploratory approaches or omitted structural validation entirely. 
Moreover, even among studies that conducted CFA, it was common for authors not to report whether factor loadings, variances, or covariances were statistically significant — a limitation that undermines the interpretability and replicability of the proposed models. A notable exception was the study by Byrne et al. (2024), which reported the significance of residual variances and contributed more comprehensive parameter estimates. Future research should prioritize confirmatory modeling, such as CFA or exploratory structural equation modeling (ESEM), to ensure consistency between theoretical frameworks and the empirical structure of digital EF assessments.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec17\"\u003e\n \u003ch2\u003eExternal validity evidence\u003c/h2\u003e\n \u003cp\u003eExternal validity was assessed in 17 of the 31 studies, though the type and rigor of analyses varied considerably. Convergent (CV) and criterion-related validity (CrV) were each reported in 10 studies, with four also including discriminant validity (DV). The NIHTB-CB – iPad (Ott et al., 2022) and the Amsterdam Cognition Scan (ACS; Feenstra et al., 2018) showed moderate to strong convergent evidence with gold-standard tests (r = .35–.80 and r = .46–.70, respectively). The NIHTB-CB (Heaton et al., 2014) and EFSA (Kruger et al., 2023) demonstrated strong CrV, with correlations above r = .78 and ρ = .71, respectively.\u003c/p\u003e\n \u003cp\u003eThe FAB-E (Hurtado-Pomares et al., 2024) revealed both CV (r = .426 with MMSE) and DV (r = − .523 with TMT), while the BAM battery (Aalbers et al., 2013) showed modest CV (ρ = .40–.67) and nonsignificant correlations with unrelated constructs (ρ = –.03 to –.13), supporting DV. 
Other tools, such as the TMS (Giorgini et al., 2024) and FISHERMAN (Wang et al., 2023), reported moderate CV and CrV, respectively, but often lacked precision or breadth in their comparative frameworks.\u003c/p\u003e\n \u003cp\u003ePredictive validity (PV) was addressed only in the Axon-TMT (Depauw et al., 2024), which predicted CAPTCHA performance with moderate to large effects (η² = .09–.66). A recurring limitation was the substitution of demographic correlations (e.g., age, education) for external validation, as seen in the SAOCA (Troyer et al., 2014) and INTB (Wahyuningrum et al., 2022), which did not benchmark scores against validated instruments.\u003c/p\u003e\n \u003cp\u003eIn sum, while a subset of studies presented robust and multidimensional evidence of external validity, especially the NIHTB-CB (iPad), ACS, NIHTB-CB, and EFSA, many others relied on limited or indirect indicators. Future validation studies should prioritize conceptually anchored hypotheses, systematic comparisons with gold-standard instruments, and inclusion of predictive criteria to strengthen the interpretability and clinical utility of digital EF assessments.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec18\"\u003e\n \u003ch2\u003eResponse process validity\u003c/h2\u003e\n \u003cp\u003eResponse process validity was inconsistently addressed across the reviewed studies. Although 22 instruments reported some evidence relevant to user interaction, task engagement, or behavioral performance, most relied on indirect or observational indicators rather than systematic analysis.\u003c/p\u003e\n \u003cp\u003eSeveral studies assessed behavioral engagement using metrics such as completion rates, error patterns, and reaction time (RT). SAOCA (Troyer et al., 2014), for example, reported a 94% valid response rate in unsupervised settings, while BAM (Aalbers et al., 2013) showed increasing RT and error rates as task complexity rose, suggesting sensitivity to executive load. 
TOL-F studies (Kaller et al., 2016; Köstering et al., 2015) examined planning versus execution times and rule violations to infer cognitive strategy use.\u003c/p\u003e\n \u003cp\u003eOther tools used RT dynamics more analytically. Axon-TMT (Depauw et al., 2024) tracked error trajectories across trials, and dTMT (Park \u0026amp; Schott, 2022) recorded latency features like touch intervals and pauses as markers of fatigue. The INTB (Wahyuningrum et al., 2022) demonstrated expected performance shifts by task difficulty and age. NeurOn and NeuroUX detected performance variation across testing environments and devices, while the study on the Task-Switching Paradigm (TSP; Toh \u0026amp; Yang, 2024) explored intraindividual variability in relation to emotion regulation.\u003c/p\u003e\n \u003cp\u003eIn contrast, 9 studies offered no relevant process evidence, and several others (e.g., OCST, ANT-I) may have collected but did not report it. Across all studies, no use of advanced psychometric models such as Item Response Theory (IRT) or time-dependent modeling was observed.\u003c/p\u003e\n \u003cp\u003eIn sum, although some instruments incorporated meaningful behavioral data, process validity remains underexplored. Future studies should adopt more rigorous reporting and exploit digital trace data to better understand user-task interactions in EF assessments.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec19\"\u003e\n \u003ch2\u003eConsequential validity evidence\u003c/h2\u003e\n \u003cp\u003eAll 31 studies included in this review reported some form of consequential validity evidence, reflecting a growing interest in the applicability of digital EF assessments. However, the nature and methodological rigor of such evidence varied considerably. 
Based on the level of elaboration, we classified these findings into three qualitative categories: narrative claims, supported use cases, and substantiated consequences.\u003c/p\u003e\n \u003cp\u003eNarrative claims were observed in a subset of studies that merely asserted potential applications, typically in aging or clinical contexts, without offering performance-based evidence or real-world implementation data. These studies generally highlighted theoretical relevance or ecological validity but lacked empirical follow-up. Examples include Ishigami et al. (2016), who referenced attentional monitoring in aging, and Parsons \u0026amp; Barnett (2017), who emphasized ecological plausibility without validation in applied settings.\u003c/p\u003e\n \u003cp\u003eA second group of studies presented supported use cases, in which the test's applicability was linked to population characteristics, normative datasets, or test design features, though without direct evidence of practical impact. For instance, Wang et al. (2023) discussed age- and sex-based performance patterns to support screening relevance; Wahyuningrum et al. (2022) emphasized adaptation for the Indonesian cultural context; and Ott et al. (2022) framed the NIHTB-CB as aligned with NIH’s iPad-based workflows. Although these claims were contextually grounded, they remained largely inferential, relying on assumed benefits rather than demonstrated outcomes in applied settings.\u003c/p\u003e\n \u003cp\u003eFinally, several instruments offered substantiated consequences, with evidence of real-world use, screening utility, or performance-based applicability. The SAOCA (Troyer et al., 2014) was recommended for early detection of cognitive decline; the EFSA (Kruger et al., 2023) and TMS (Giorgini et al., 2024) were positioned as complementary to clinical evaluation; and the BAM battery (Aalbers et al., 2013) supported remote self-monitoring in aging. 
These studies linked test outcomes to practical decision-making scenarios or demonstrated diagnostic relevance through applied metrics.\u003c/p\u003e\n \u003cp\u003eIn sum, although all studies contributed some form of consequential insight, most lacked formal evaluation of outcomes, impact metrics, or potential adverse effects such as misclassification or digital exclusion. Future research should advance from \u003cem\u003epost hoc\u003c/em\u003e assertions to prospective validation of implementation impact, ideally incorporating diagnostic utility studies, stakeholder feedback, and equity considerations to support the responsible use of digital EF assessments in real-world contexts.\u003c/p\u003e\n\u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eThis review aimed to map and synthesize the psychometric evidence available over the past decade for digital and online tools designed to assess EF in healthy adults. Findings were organized according to five key dimensions of modern psychometrics: content validity, structural validity, external validity (relations to other variables), response processes, and consequential validity. The 31 included studies were conducted across diverse global regions, with the majority originating from high-income countries in North America and Europe (n\u0026thinsp;=\u0026thinsp;22). Additional contributions came from Asia (n\u0026thinsp;=\u0026thinsp;5), Latin America (n\u0026thinsp;=\u0026thinsp;2), and Oceania (n\u0026thinsp;=\u0026thinsp;2), representing a total of 20 countries.\u003c/p\u003e \u003cp\u003eAlthough reliability was among the most frequently reported psychometric properties in the reviewed studies (n\u0026thinsp;=\u0026thinsp;23), its operationalization often fell short of the standards set forth in contemporary psychometric literature. 
As emphasized by Revelle and Condon (\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e2019\u003c/span\u003e) and reflected in the AERA Standards (American Educational Research Association et al., 2014), reliability should not be viewed as a fixed property of the test itself, but rather as a feature of observed scores that is specific to context, population, and purpose, requiring the use of multiple indicators and replications over time. Nevertheless, most studies reported only a single index, typically test\u0026ndash;retest correlations or Cronbach\u0026rsquo;s alpha, without justifying its conceptual alignment with the instrument\u0026rsquo;s design or intended application. Only a few studies (e.g., Kr\u0026uuml;ger et al., 2023; Zhang et al., \u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) triangulated across multiple indices, such as internal consistency, test\u0026ndash;retest stability, and split-half reliability, as recommended for tools intended for longitudinal or repeated use.\u003c/p\u003e \u003cp\u003eCronbach\u0026rsquo;s alpha was often used without verifying its underlying assumptions. Specifically, alpha requires tau-equivalence (that is, equal factor loadings across all items). When this assumption is violated, alpha can misrepresent reliability by either inflating or underestimating the true consistency of scores. Despite this limitation, alternative estimators such as McDonald\u0026rsquo;s omega (ω), which allows for unequal loadings and provides a more accurate estimate in most conditions, were rarely reported. 
Ideally, both α and ω should be presented, particularly in newly developed or adapted instruments, to ensure transparency and capture distinct aspects of internal structure (Clark \u0026amp; Watson, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2019\u003c/span\u003e; Revelle \u0026amp; Condon, \u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e2019\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eLatency-based outcomes, common in digital EF tasks, were also underexamined in terms of score stability, even though they are particularly vulnerable to contextual and device-related variability. Moreover, less common but informative indices such as alternate-form reliability (AF-R) and inter-run reliability (IR-R) were observed in only a few studies, highlighting additional underexplored avenues in the validation of digital assessments.\u003c/p\u003e \u003cp\u003eThese limitations reflect a broader disconnect between the psychometric complexity of digital EF instruments and the simplicity of their reliability reporting. Addressing this gap requires a shift toward a reliability reasoning framework, one that integrates statistical indicators with theoretical justification and contextual awareness, as advocated by Revelle and Condon (\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e2019\u003c/span\u003e) and endorsed by the AERA Standards.\u003c/p\u003e \u003cp\u003eAlthough foundational in contemporary validity theory, content validity was the most neglected domain across the reviewed studies, revealing a sharp departure from the principles outlined in the AERA Standards (American Educational Research Association et al., 2014) and emphasized by Clark and Watson (\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). 
According to these frameworks, content validation must extend beyond conceptual alignment; it requires systematic evidence that items or tasks comprehensively represent the intended construct and are suitable for the target population and testing purpose. Yet, none of the studies in this review employed formal content validation procedures, such as expert panel ratings, quantitative indices (e.g., Content Validity Index), or structured mapping techniques like test blueprints or construct specification equations (XXX).\u003c/p\u003e \u003cp\u003eWhile several instruments were based on well-established paradigms (e.g., TOL, Stroop, WCST) or theoretical models (e.g., Diamond, \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e2013\u003c/span\u003e), referencing such sources is not sufficient to establish content representativeness, particularly in digital contexts where presentation format, timing, and interaction modalities may substantially alter the construct being assessed. Clark and Watson (\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2019\u003c/span\u003e) warn against assuming content validity by association and call for empirical evidence of representativeness and relevance, especially in instruments used for high-stakes decisions or cross-cultural comparisons. The absence of such procedures across the reviewed studies raises concerns about potential construct underrepresentation or contamination, which compromises score interpretability and the defensibility of decisions based on these measures.\u003c/p\u003e \u003cp\u003eStructural validity was addressed in only 8 of the 31 studies included in this review. Even when reported, analyses were often limited to exploratory techniques such as principal component analysis (PCA), which are not designed to test latent structure or account for measurement error. 
According to Standard 1.13 of the AERA Standards (American Educational Research Association et al., 2014), when a construct is presumed to have a multidimensional structure, as is typically the case in executive function (EF) assessment, confirmatory factor analysis (CFA) or equivalent model-based approaches are expected to evaluate the correspondence between theoretical dimensions and empirical data. This position is echoed by Sellbom and Tellegen (\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e2019\u003c/span\u003e), who argue that a psychometric instrument cannot validly claim to measure a construct until its internal structure has been empirically verified using appropriate statistical techniques.\u003c/p\u003e \u003cp\u003eAlthough a few instruments, such as the EFSA (Kruger et al., \u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e2023\u003c/span\u003e), TMS (Giorgini et al., \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e2024\u003c/span\u003e), and INTB (Wahyuningrum et al., \u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e2022\u003c/span\u003e), employed CFA and reported acceptable to excellent fit indices, the majority of studies either relied on PCA without model testing or omitted structural analysis altogether. This gap is particularly concerning given that many digital and online EF tools are designed around multiple theoretically distinct subcomponents (e.g., working memory, cognitive flexibility, inhibition), yet fail to demonstrate the empirical distinctiveness or interrelatedness of these domains. 
Furthermore, none of the studies tested alternative or competing models, such as bifactor or hierarchical CFA frameworks, which, as emphasized by Sellbom and Tellegen (\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e2019\u003c/span\u003e), are critical for assessing the integrity of multidimensional constructs and reducing the risk of overfitting or misinterpretation.\u003c/p\u003e \u003cp\u003eThe absence of rigorous structural validation is especially problematic in instruments intended for use across diverse populations or in clinical decision-making contexts. Without empirical confirmation of internal structure, the interpretability of test scores remains limited, and the potential for misleading inferences increases. Future research should move beyond exploratory approaches and incorporate model comparison, fit evaluation, and measurement invariance testing, as recommended in contemporary psychometric literature, to ensure that digital EF assessments accurately capture the constructs they aim to measure.\u003c/p\u003e \u003cp\u003eExternal validity was addressed in 17 of the reviewed studies, though often with limited scope and theoretical grounding. As outlined in Standard 1.16 of the AERA Standards (American Educational Research Association et al., 2014), this domain includes convergent (CV), discriminant (DV), and criterion-related validity (CrV), which support score interpretations through associations with external constructs, behaviors, or outcomes. While CV and CrV appeared with similar frequency, many studies lacked clear hypotheses or justification for the chosen benchmarks.\u003c/p\u003e \u003cp\u003eSome instruments, such as the NIHTB-CB, EFSA, and ACS, reported moderate-to-strong correlations with established cognitive tests. However, discriminant validity was rarely assessed, and predictive evidence was limited to a single study (Axon-TMT). 
Moreover, several studies substituted demographic correlations (e.g., age, education) for true external validation, despite the risk of reflecting construct-irrelevant variance.\u003c/p\u003e \u003cp\u003eAnother common limitation was the omission of confidence intervals or error estimates for correlation coefficients, which leaves the precision of the reported associations unclear and weakens generalizability. As emphasized by Clark and Watson (2019), external validity evidence should be statistically sound, theory-driven, and appropriate to the intended use, especially in clinical and applied settings.\u003c/p\u003e \u003cp\u003eEvidence regarding response process validity was inconsistently addressed, echoing a broader gap noted in the psychometric literature. According to Standards 1.10 and 1.12 of the AERA Standards, response process evidence should demonstrate alignment between the cognitive operations elicited by the task and the theoretical constructs being measured (American Educational Research Association et al., 2014). Yet, most studies offered only superficial analyses, limited to descriptive metrics such as reaction time (RT) or error rates.\u003c/p\u003e \u003cp\u003eA few instruments (e.g., SAOCA, TOL-F, Axon-TMT) incorporated indicators like completion rates, latency distributions, or planning time to examine engagement and strategy use. However, even these analyses remained fragmented and lacked formal modeling of intraindividual variability or task\u0026ndash;trait interactions. None employed process-tracing techniques (e.g., mouse tracking, eye-tracking, think-aloud protocols) or statistical approaches like Item Response Theory (IRT) and time-dependent modeling, despite their suitability for digital testing.\u003c/p\u003e \u003cp\u003eAs Thomas (\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e2019\u003c/span\u003e) emphasizes, the omission of such modeling is particularly problematic given the rich behavioral data digital platforms can capture and the growing availability of analytic tools. 
Without structured process analysis, the interpretability of performance in digital EF tasks, especially in unsupervised or adaptive contexts, remains compromised.\u003c/p\u003e \u003cp\u003eResponse process validity is not peripheral to test interpretation; it is foundational when cognitive assessments rely on timing, interaction, or dynamic response patterns. Its underutilization across the reviewed studies highlights a critical methodological gap that future research must address.\u003c/p\u003e \u003cp\u003eConsequential validity emerged as the least empirically developed domain across the reviewed studies, despite its central role in contemporary validity theory and its explicit emphasis in Standard 1.25 of the AERA Standards (American Educational Research Association et al., 2014). This standard asserts that test developers and users share responsibility not only for the intended uses of scores but also for anticipating and mitigating unintended consequences, such as misinterpretation, inequitable access, or harm resulting from invalid inferences. Although all 31 studies included some consequential claims, only a minority went beyond narrative assertions to examine practical implications. No study systematically evaluated the real-world impact of test use or its influence on decisions and service delivery.\u003c/p\u003e \u003cp\u003eSome instruments, such as SAOCA (Troyer et al., \u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e2014\u003c/span\u003e), CHEFS (Hassm\u0026eacute;n et al., \u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e2024\u003c/span\u003e), and NeuroUX (Paolillo et al., \u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e2024\u003c/span\u003e), proposed utility for cognitive screening or large-scale monitoring in aging populations, yet lacked empirical data on diagnostic accuracy, clinical effectiveness, or long-term outcomes. 
Other tools, including the EFSA (Kruger et al., \u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) and TMS (Giorgini et al., \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e2024\u003c/span\u003e), emphasized clinical relevance based on conceptual fit but offered no follow-up regarding consequences in applied settings. As Clark and Watson (\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2019\u003c/span\u003e) caution, consequential claims must be substantiated by evidence that test use improves decision quality, access to care, or intervention outcomes, rather than remaining aspirational.\u003c/p\u003e \u003cp\u003eCritically, none of the studies addressed potential adverse effects, such as overreliance on automated scores, algorithmic opacity, or inequities stemming from differences in literacy, socioeconomic status, or digital access. These risks are especially relevant for remote or self-administered platforms. The absence of implementation metrics, equity-focused analyses, and stakeholder consultation raises concerns about the ethical robustness and fairness of digital and online EF tools. As highlighted in the AERA Standards, consequences are not peripheral; they are foundational to the interpretive argument that underlies responsible test use (American Educational Research Association et al., 2014).\u003c/p\u003e \u003cp\u003eIn sum, while the field of digital EF assessment has made strides in proposing tools with practical utility, the lack of empirical investigation into their downstream impact represents a critical limitation. Future validation efforts must go beyond conceptual alignment and include diagnostic utility studies, real-world implementation data, and equity-oriented evaluations. 
Only then can digital and online EF instruments fulfill their promise of supporting not just accurate measurement, but also ethical, effective, and inclusive decision-making.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThis review revealed a field in methodological transition: digital and online assessments of executive functions (EF) have expanded in scope, theoretical grounding, and technological sophistication over the past decade. However, the psychometric landscape remains uneven. While reliability and external validity were addressed in a growing number of studies, domains such as content validity, structural modeling, response process analysis, and consequential evaluation continue to be underexplored or methodologically limited. Notably, no single study presented a validation framework encompassing all core domains of contemporary validity theory, as articulated by the Standards (American Educational Research Association et al., 2014).\u003c/p\u003e \u003cp\u003eA major strength of this review lies in its comprehensive mapping, across a diverse set of 31 studies, of six psychometric domains: reliability, content validity, structural validity, external validity (relations with other variables), response processes, and consequential validity. By adopting a multidimensional lens and applying current theoretical standards, this review offers a critical overview of the state of digital and online EF assessment and highlights areas of both progress and fragility. It also provides researchers and clinicians with a synthesized reference for identifying tools that are better supported by psychometric evidence.\u003c/p\u003e \u003cp\u003eNevertheless, despite its methodological rigor, this systematic review is not without limitations. First, the analysis focused exclusively on studies reporting psychometric evidence, thereby excluding those centered on usability, implementation science, or neural validation. 
In addition, heterogeneity in study design and reporting limited the ability to perform meta-analytic comparisons or extract effect sizes uniformly. Furthermore, due to the inclusion criteria, several relevant instruments that lacked psychometric investigations within the specified timeframe or target population (i.e., healthy adults) were excluded, potentially overlooking promising tools in early development or applied to clinical populations.\u003c/p\u003e \u003cp\u003eAcross the included studies, we also observed several recurring limitations that constrain the robustness of available evidence. These include frequent reliance on convenience samples with limited representativeness; underreporting of structural validation methods, such as confirmatory factor analyses; inadequate application of formal procedures for content validity assessment; superficial analyses of response processes; limited external validity evidence, often based on demographic variables rather than benchmark measures; and insufficient evaluation of consequential validity, particularly regarding real-world implementation and ethical implications. These limitations highlight critical areas for psychometric improvement in future research.\u003c/p\u003e \u003cp\u003eFuture validation efforts must move beyond single-index reporting and adopt more integrative strategies, combining confirmatory factor analysis, item response theory models, response behavior analysis, and real-world impact studies. Equity considerations, stakeholder engagement, and context-specific implementation research should also be embedded into validation protocols. 
Bridging the gap between psychometric rigor and digital innovation will be essential to ensure that EF assessments are not only theoretically sound, but also ethically responsible, contextually meaningful, and practically useful across diverse populations and settings.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eAalbers T, Baars MAE, Rikkert MGMO, Kessels RPC (2013) Puzzling with online games (BAM-COG): Reliability, validity, and feasibility of an online self-monitor for cognitive performance in aging adults. J Med Internet Res 15(12):183\u0026ndash;193. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.2196/jmir.2860\u003c/span\u003e\u003cspan address=\"10.2196/jmir.2860\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAmerican Educational Research Association, American Psychological Association, \u0026amp; National Council on Measurement in Education (2014) \u003cem\u003eThe Standards for Educational and Psychological Testing\u003c/em\u003e. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.apa.org/science/programs/testing/standards\u003c/span\u003e\u003cspan address=\"https://www.apa.org/science/programs/testing/standards\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eArioli M, Rini J, Anguera-Singla R, Gazzaley A, Wais PE (2022) Validation of At-Home Application of a Digital Cognitive Screener for Older Adults. \u003cem\u003eFrontiers in Aging Neuroscience\u003c/em\u003e, \u003cem\u003e14\u003c/em\u003e. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3389/fnagi.2022.907496\u003c/span\u003e\u003cspan address=\"10.3389/fnagi.2022.907496\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBergman I, Franke F\u0026ouml;yen L, Gustavsson A, Van den Hurk W (2025) Test\u0026ndash;retest reliability, practice effects and estimates of change: A study on the Mindmore digital cognitive assessment tool. Scand J Psychol 66(1):1\u0026ndash;14. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1111/sjop.13054\u003c/span\u003e\u003cspan address=\"10.1111/sjop.13054\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBruno D, S\u0026aacute;nchez Rueda D, Lopez E, Pinasco C, Torralva T, Alfredo T, Sierra Sanjurjo N, Roca M (2024) Validity and norms for young adults for the Aggie Figures Learning Test. Appl Neuropsychology: Adult 0(0):1\u0026ndash;7. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1080/23279095.2024.2354856\u003c/span\u003e\u003cspan address=\"10.1080/23279095.2024.2354856\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBurgess PW, Stuss DT (2017) Fifty Years of Prefrontal Cortex Research: Impact on Assessment. J Int Neuropsychol Soc 23(9\u0026ndash;10):755\u0026ndash;767. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1017/S1355617717000704\u003c/span\u003e\u003cspan address=\"10.1017/S1355617717000704\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eByrne EM, Gilbert RA, Kievit RA, Holmes J (2024) Evidence for separate backward recall and n-back working memory factors: A large-scale latent variable analysis. Memory. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.tandfonline.com/doi/abs/\u003c/span\u003e\u003cspan address=\"https://www.tandfonline.com/doi/abs/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1080/09658211.2024.2393388\u003c/span\u003e\u003cspan address=\"10.1080/09658211.2024.2393388\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eClark LA, Watson D (2019) Constructing validity: New developments in creating objective measuring instruments. Psychol Assess 31(12):1412\u0026ndash;1427. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1037/pas0000626\u003c/span\u003e\u003cspan address=\"10.1037/pas0000626\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDepauw T, Boasen J, L\u0026eacute;ger PM, S\u0026eacute;n\u0026eacute;cal S (2024) Assessing the Relationship Between Digital Trail Making Test Performance and IT Task Performance: Empirical Study. JMIR Hum Factors 11(1):e49992. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.2196/49992\u003c/span\u003e\u003cspan address=\"10.2196/49992\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDiamond A (2013) Executive Functions. Ann Rev Psychol 64:135\u0026ndash;168. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1146/annurev-psych-113011-143750\u003c/span\u003e\u003cspan address=\"10.1146/annurev-psych-113011-143750\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDias NM, Malloy-Diniz LF (eds) (2023) \u003cem\u003eTratado de fun\u0026ccedil;\u0026otilde;es executivas: Modelos te\u0026oacute;ricos, construtos associados e desenvolvimento\u003c/em\u003e (1\u003csup\u003ea\u003c/sup\u003e). Editora Ampla\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDias NM, Malloy-Diniz LF (2024) \u003cem\u003eTratado de fun\u0026ccedil;\u0026otilde;es executivas: Avalia\u0026ccedil;\u0026atilde;o e Interven\u0026ccedil;\u0026atilde;o\u003c/em\u003e (1\u003csup\u003ea\u003c/sup\u003e). Editora Ampla\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFeenstra HEM, Vermeulen IE, Murre JMJ, Schagen SB (2018) Online self-administered cognitive testing using the Amsterdam Cognition Scan: Establishing psychometric properties and normative data. J Med Internet Res 20(5). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.2196/jmir.9298\u003c/span\u003e\u003cspan address=\"10.2196/jmir.9298\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFerguson H, Brunsdon V, Bradford E (2021) The developmental trajectories of executive function from adolescence to old age. 
\u003cem\u003eScientific Reports\u003c/em\u003e, \u003cem\u003e11\u003c/em\u003e. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41598-020-80866-1\u003c/span\u003e\u003cspan address=\"10.1038/s41598-020-80866-1\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eField A (2020) Descobrindo a Estat\u0026iacute;stica Usando o SPSS, 5th edn. Penso Editora\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFriedman NP, Miyake A (2017) Unity and diversity of executive functions: Individual differences as a window on cognitive structure. Cortex 86:186\u0026ndash;204. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.cortex.2016.04.023\u003c/span\u003e\u003cspan address=\"10.1016/j.cortex.2016.04.023\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGiorgini R, Maestu F, Sara FM, Pastore M, Abellan M, Quattrone A, Caparello S, Quattrone A, Vaccaro MG (2024) Measurement invariance across countries of the Test of Memory Strategies (TMS): A contribution to the cross-national validity study. Acta Psychol 246:104291. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.actpsy.2024.104291\u003c/span\u003e\u003cspan address=\"10.1016/j.actpsy.2024.104291\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHassm\u0026eacute;n P, Hindman E, Keiller T, Blair D (2024) Piloting the Coffs Harbour Executive Functioning Screen (CHEFS): An off-road tool to predict fitness to drive. Appl Neuropsychology: Adult. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.tandfonline.com/doi/abs/\u003c/span\u003e\u003cspan address=\"https://www.tandfonline.com/doi/abs/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1080/23279095.2024.2418031\u003c/span\u003e\u003cspan address=\"10.1080/23279095.2024.2418031\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHatahet O, Seghier ML (2024) The validity of studying healthy aging with cognitive tests measuring different constructs. Sci Rep 14(1):23880. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41598-024-74488-0\u003c/span\u003e\u003cspan address=\"10.1038/s41598-024-74488-0\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHeaton RK, Akshoomoff N, Tulsky D, Mungas D, Weintraub S, Dikmen S, Beaumont J, Casaletto KB, Conway K, Slotkin J, Gershon R (2014) Reliability and validity of composite scores from the NIH Toolbox Cognition Battery in adults. J Int Neuropsychol Soc 20(6):588\u0026ndash;598. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1017/S1355617714000241\u003c/span\u003e\u003cspan address=\"10.1017/S1355617714000241\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHurtado-Pomares M, Ju\u0026aacute;rez-Leal I, Company-Devesa V, S\u0026aacute;nchez-P\u0026eacute;rez A, Peral-G\u0026oacute;mez P, Espinosa-Sempere C, Valera-Gran D, Navarrete-Mu\u0026ntilde;oz E-M (2024) Psychometric properties of the Spanish version of the Frontal Assessment Battery (FAB-E) and normative values in a representative adult population sample. Neurologia 39(8):694\u0026ndash;700. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.nrleng.2022.09.004\u003c/span\u003e\u003cspan address=\"10.1016/j.nrleng.2022.09.004\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eIshigami Y, Eskes GA, Tyndall AV, Longman RS, Drogos LL, Poulin MJ (2016) The Attention Network Test-Interaction (ANT-I): Reliability and validity in healthy older adults. Exp Brain Res 234(3):815\u0026ndash;827. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s00221-015-4493-4\u003c/span\u003e\u003cspan address=\"10.1007/s00221-015-4493-4\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eIverson GL, Brooks BL, Ashton VL, Johnson LG, Gualtieri CT (2009) Does familiarity with computers affect computerized neuropsychological test performance? J Clin Exp Neuropsychol. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1080/13803390802372125\u003c/span\u003e\u003cspan address=\"10.1080/13803390802372125\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKaller CP, Debelak R, K\u0026ouml;stering L, Egle J, Rahm B, Wild PS, Blettner M, Beutel ME, Unterrainer JM (2016) Assessing planning ability across the adult life Span: Population-representative and age-adjusted reliability estimates for the Tower of London (TOL-F). Arch Clin Neuropsychol 31(2):148\u0026ndash;164\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKarlsen RH, Karr JE, Saksvik SB, Lundervold AJ, Hjemdal O, Olsen A, Iverson GL, Skandsen T (2022) Examining 3-month test-retest reliability and reliable change using the Cambridge Neuropsychological Test Automated Battery. Appl Neuropsychology: Adult 29(2):146\u0026ndash;154. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1080/23279095.2020.1722126\u003c/span\u003e\u003cspan address=\"10.1080/23279095.2020.1722126\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKessels RPC, Hendriks MPH (2023) Neuropsychological assessment. In H. S. Friedman \u0026amp; C. H. Markey (Eds.), \u003cem\u003eEncyclopedia of Mental Health (Third Edition)\u003c/em\u003e (pp. 622\u0026ndash;628). Academic Press. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/B978-0-323-91497-0.00017-5\u003c/span\u003e\u003cspan address=\"10.1016/B978-0-323-91497-0.00017-5\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eK\u0026ouml;stering L, Nitschke K, Schumacher FK, Weiller C, Kaller CP (2015) Test-retest reliability of the Tower of London Planning Task (TOL-F). Psychol Assess 27(3):925\u0026ndash;931. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1037/pas0000097\u003c/span\u003e\u003cspan address=\"10.1037/pas0000097\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKruger OE, Zibetti MR, Schlindwein R, Lopes FM (2023) Preliminary validity evidence for the Executive Function Scale for Adults (EFSA). Psychology \u0026amp; Neuroscience. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1037/pne0000321\u003c/span\u003e\u003cspan address=\"10.1037/pne0000321\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMorrissey S, Gillings R, Hornberger M (2024) Feasibility and reliability of online vs in-person cognitive testing in healthy older people. PLoS ONE 19(8):e0309006. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1371/journal.pone.0309006\u003c/span\u003e\u003cspan address=\"10.1371/journal.pone.0309006\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNaglieri JA, Drasgow F, Schmit M, Handler L, Prifitera A, Margolis A, Velasquez R (2004) Psychological Testing on the Internet: New Problems, Old Issues. Am Psychol 59(3):150\u0026ndash;162. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1037/0003-066X.59.3.150\u003c/span\u003e\u003cspan address=\"10.1037/0003-066X.59.3.150\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOtt LR, Schantell M, Willett MP, Johnson HJ, Eastman JA, Okelberry HJ, Wilson TW, Taylor BK, May PE (2022) Construct Validity of the NIH Toolbox Cognitive Domains: A Comparison With Conventional Neuropsychological Assessments. Neuropsychology 36(5):468\u0026ndash;481. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1037/neu0000813\u003c/span\u003e\u003cspan address=\"10.1037/neu0000813\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOuzzani M, Hammady H, Fedorowicz Z, Elmagarmid A (2016) Rayyan\u0026mdash;A web and mobile app for systematic reviews. Syst Reviews 5(1). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1186/s13643-016-0384-4\u003c/span\u003e\u003cspan address=\"10.1186/s13643-016-0384-4\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePage MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, Shamseer L, Tetzlaff JM, Akl EA, Brennan SE, Chou R, Glanville J, Grimshaw JM, Hr\u0026oacute;bjartsson A, Lalu MM, Li T, Loder EW, Mayo-Wilson E, McDonald S, Moher D (2021) The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ (Clinical Res Ed) 372:n71. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1136/bmj.n71\u003c/span\u003e\u003cspan address=\"10.1136/bmj.n71\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePaolillo EW, Bomyea J, Depp CA, Henneghan AM, Raj A, Moore RC (2024) Characterizing Performance on a Suite of English-Language NeuroUX Mobile Cognitive Tests in a US Adult Sample: Ecological Momentary Cognitive Testing Study. J Med Internet Res 26(1):e51978. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.2196/51978\u003c/span\u003e\u003cspan address=\"10.2196/51978\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePark SY, Schott N (2022) The trail-making-test: Comparison between paper-and-pencil and computerized versions in young and healthy older adults. Appl Neuropsychol Adult 29(5):1208\u0026ndash;1220. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1080/23279095.2020.1864374\u003c/span\u003e\u003cspan address=\"10.1080/23279095.2020.1864374\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eParsey CM, Schmitter-Edgecombe M (2013) Applications of technology in neuropsychological assessment. Clin Neuropsychol 27(8):1328\u0026ndash;1361. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1080/13854046.2013.834971\u003c/span\u003e\u003cspan address=\"10.1080/13854046.2013.834971\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eParsons T, Barnett M (2017) Virtual apartment stroop task: Comparison with computerized and traditional stroop tasks. J Neurosci Methods 309:35\u0026ndash;40. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.jneumeth.2018.08.022\u003c/span\u003e\u003cspan address=\"10.1016/j.jneumeth.2018.08.022\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePerzl J, Riedl EM, Thomas J (2024) Measuring Situational Cognitive Performance in the Wild: A Psychometric Evaluation of Three Brief Smartphone-Based Test Procedures. Assessment 31(6):1270\u0026ndash;1291. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1177/10731911231213845\u003c/span\u003e\u003cspan address=\"10.1177/10731911231213845\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRevelle W, Condon DM (2019) Reliability from α to ω: A tutorial. Psychol Assess 31(12):1395\u0026ndash;1411. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1037/pas0000754\u003c/span\u003e\u003cspan address=\"10.1037/pas0000754\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRijnen SJM, van der Linden SD, Emons WHM, Sitskoorn MM, Gehring K (2018) Test-retest reliability and practice effects of a computerized neuropsychological battery: A solution-oriented approach. Psychol Assess 30(12):1652\u0026ndash;1662. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1037/pas0000618\u003c/span\u003e\u003cspan address=\"10.1037/pas0000618\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSellbom M, Tellegen A (2019) Factor analysis in psychological assessment research: Common pitfalls and recommendations. Psychol Assess 31(12):1428\u0026ndash;1441. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1037/pas0000623\u003c/span\u003e\u003cspan address=\"10.1037/pas0000623\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSoto EF, Kofler MJ, Singh LJ, Wells EL, Irwin LN, Groves NB, Miller CE (2020) Executive functioning rating scales: Ecologically valid or construct invalid? Neuropsychology 34(6):605\u0026ndash;619. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1037/neu0000681\u003c/span\u003e\u003cspan address=\"10.1037/neu0000681\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSoveri A, Lehtonen M, Karlsson LC, Lukasik K, Antfolk J, Laine M (2018) Test-retest reliability of five frequently used executive tasks in healthy adults. 
Appl Neuropsychol Adult 25(2):155\u0026ndash;165. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1080/23279095.2016.1263795\u003c/span\u003e\u003cspan address=\"10.1080/23279095.2016.1263795\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTabachnick BG, Fidell LS (2019) \u003cem\u003eUsing Multivariate Statistics\u003c/em\u003e. Pearson\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eThomas ML (2019) Advances in applications of item response theory to clinical assessment. Psychol Assess 31(12):1442\u0026ndash;1455. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1037/pas0000597\u003c/span\u003e\u003cspan address=\"10.1037/pas0000597\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eToh WX, Yang H (2024) To switch or not to switch? Individual differences in executive function and emotion regulation flexibility. Emotion 24(1):52\u0026ndash;66. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1037/emo0001250\u003c/span\u003e\u003cspan address=\"10.1037/emo0001250\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTroyer AK, Rowe G, Murphy KJ, Levine B, Leach L, Hasher L (2014) Development and evaluation of a self-administered on-line test of memory and attention for middle-aged and older adults. Front Aging Neurosci 6:335. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3389/fnagi.2014.00335\u003c/span\u003e\u003cspan address=\"10.3389/fnagi.2014.00335\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWahyuningrum SE, Sulastri A, Hendriks MPH, van Luijtelaar G (2022) The Indonesian Neuropsychological Test Battery (INTB): Psychometric properties, preliminary normative scores, the underlying cognitive constructs, and the effects of age and education. Acta Neuropsychologica 20(4):445\u0026ndash;470. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.5604/01.3001.0016.1339\u003c/span\u003e\u003cspan address=\"10.5604/01.3001.0016.1339\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang P, Fang Y, Qi JY, Li HJ (2023) FISHERMAN: A Serious Game for Executive Function Assessment of Older Adults. Assessment 30(5):1499\u0026ndash;1513. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1177/10731911221105648\u003c/span\u003e\u003cspan address=\"10.1177/10731911221105648\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWhite N, Flannery L, McClintock A, Machado L (2018) Repeated computerized cognitive testing: Performance shifts and test\u0026ndash;retest reliability in healthy older adults. J Clin Exp Neuropsychol 41(2):179\u0026ndash;191. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1080/13803395.2018.1526888\u003c/span\u003e\u003cspan address=\"10.1080/13803395.2018.1526888\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWhiting PF, Rutjes AWS, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, Leeflang MMG, Sterne JAC, Bossuyt PMM, QUADAS-2 Group (2011) QUADAS-2: A revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 155(8):529\u0026ndash;536. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.7326/0003-4819-155-8-201110180-00009\u003c/span\u003e\u003cspan address=\"10.7326/0003-4819-155-8-201110180-00009\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZelazo PD (2015) Executive function: Reflection, iterative reprocessing, complexity, and the developing brain. Dev Rev 38:55\u0026ndash;68. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.dr.2015.07.001\u003c/span\u003e\u003cspan address=\"10.1016/j.dr.2015.07.001\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZelazo PD, Carlson SM (2023) Reconciling the Context-Dependency and Domain-Generality of Executive Function Skills from a Developmental Systems Perspective. J Cognition Dev 24(2):205\u0026ndash;222. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1080/15248372.2022.2156515\u003c/span\u003e\u003cspan address=\"10.1080/15248372.2022.2156515\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang Z, Yang LZ, V\u0026eacute;kony T, Wang C, Li H (2024) Split-half reliability estimates of an online card sorting task in a community sample of young and elderly adults. Behav Res Methods 56(2):1039\u0026ndash;1051. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3758/s13428-023-02104-6\u003c/span\u003e\u003cspan address=\"10.3758/s13428-023-02104-6\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZucchella C, Federico A, Martini A, Tinazzi M, Bartolo M, Tamburin S (2018) Neuropsychological testing. Pract Neurol 18(3). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1136/practneurol-2017-001743\u003c/span\u003e\u003cspan address=\"10.1136/practneurol-2017-001743\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"},{"header":"Table 1 and 2","content":"\u003cp\u003eTable 1, 2 are available in the Supplementary Files section.\u003c/p\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":false,"highlight":"","institution":"Federal University of Rio Grande do Sul","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":true,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email 
protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"executive function, digital tools, psychometric evidence, online testing, healthy adults","lastPublishedDoi":"10.21203/rs.3.rs-8543356/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8543356/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eObjective\u003c/h2\u003e \u003cp\u003eThis review maps psychometric evidence from the past decade on digital and online tools assessing executive functions (EF) in healthy adults. It focuses on six modern domains of validity: content, structure, external, response processes, consequences, and reliability.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eSearches were conducted across PsycNet, PubMed, Embase, Web of Science, and the Virtual Health Library, guided by PRISMA. Core terms included executive functions, digital tools, healthy adults, and psychometric properties. Risk of bias was systematically evaluated.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eThirty-one studies met inclusion criteria, encompassing 11,246 participants. Most tools assessed core EF domains\u0026mdash;working memory, inhibition, and cognitive flexibility\u0026mdash;through performance-based tasks or online questionnaires. Reliability was reported in 23 studies, though often via single indices. Content validity appeared in 26 studies but lacked methodological rigor. Structural and external validity were reported in 8 and 17 studies, respectively. 
Response process evidence (n\u0026thinsp;=\u0026thinsp;22) and consequential validity (n\u0026thinsp;=\u0026thinsp;31) were frequently cited but rarely examined in depth. No study addressed all six domains comprehensively. The risk of bias was low for administration but high for sampling. Applicability concerns included unrepresentative samples and weak construct alignment.\u003c/p\u003e\u003ch2\u003eConclusion\u003c/h2\u003e \u003cp\u003eWhile the field is expanding, it lacks methodological depth. Despite growing interest in digital EF tools, essential domains\u0026mdash;particularly structural modeling and consequence analysis\u0026mdash;remain underdeveloped. This review underscores the need for comprehensive validation frameworks that integrate theoretical coherence, empirical rigor, and equity-based implementation.\u003c/p\u003e","manuscriptTitle":"Psychometric Evidence of Digital and Online Executive Function Tests: A Systematic Review","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-01-12 06:21:15","doi":"10.21203/rs.3.rs-8543356/v1","editorialEvents":[{"type":"communityComments","content":2}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"75d02658-207e-47ac-994d-8d37e0842925","owner":[],"postedDate":"January 12th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[{"id":60759859,"name":"Psychology"},{"id":60759860,"name":"Cognitive 
Neuroscience"}],"tags":[],"updatedAt":"2026-03-24T16:29:00+00:00","versionOfRecord":{"articleIdentity":"rs-8543356","link":"https://doi.org/10.1080/15305058.2026.2639759","journal":{"identity":"international-journal-of-testing","isVorOnly":true,"title":"International Journal of Testing"},"publishedOn":"2026-03-13 00:00:00","publishedOnDateReadable":"March 13th, 2026"},"versionCreatedAt":"2026-01-12 06:21:15","video":"","vorDoi":"10.1080/15305058.2026.2639759","vorDoiUrl":"https://doi.org/10.1080/15305058.2026.2639759","workflowStages":[]},"version":"v1","identity":"rs-8543356","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8543356","identity":"rs-8543356","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}


