Exploring the Construct Validity of Integrated Speaking Tasks: The Case of a Large-scale High-stakes Computer-based Listening-Speaking Task

doi:10.21203/rs.3.rs-7563159/v1

2025 · doi:10.21203/rs.3.rs-7563159/v1

Yan Zhou, Ke Bin, Lawrence Jun Zhang

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-7563159/v1

This work is licensed under a CC BY 4.0 License.

Status: Posted (Version 1)

Abstract

Integrated speaking tasks have been widely used in many large-scale high-stakes tests. However, little is known about their application among low- or intermediate-level English second language learners, such as in the Guangdong Version of the Computer-based English Listening and Speaking Test of the National Matriculation English Test.
This gap is particularly problematic given the substantial impact of integrated speaking tasks on the teaching, learning, and assessment of English in China and internationally. To address it, the present study employed bi-factor exploratory structural equation modeling (ESEM) and hierarchical multiple regression analyses, with data from a sample of 360 participants in a real test, and probed whether and to what extent the test actually assesses the ability it is supposed to test, as set out in the official test specifications issued by the Education Examinations Authority of Guangdong Province. Findings indicated that the test did measure students’ ability to accomplish certain tasks in specific contexts by acquiring and applying various knowledge sources in English (e.g., task prompts, encyclopedic knowledge of English and the world, source material, and communicative strategies). Specifically, parameters from the five domain-specific textual factors and two communicative strategies, extracted from the participants’ oral output, jointly contributed to the participants’ performance in the test, with varying weights across factors. This variability highlights the comprehensiveness and contextual specificity of the test. These findings provide empirical evidence supporting the validity of score interpretations and offer important implications for the teaching, learning, and assessment of integrated speaking tasks in senior high schools in China.

Subjects: Social science/Education; Humanities/Language and linguistics; Social science/Language and linguistics; Biological sciences/Psychology; Social science/Psychology

Keywords: integrated speaking tasks; national matriculation test; CELST; construct validity

1. Introduction

It is a challenge to conceptualize, define, and assess speaking in a reliable and valid way (de Jong, 2023; Fan and Yan, 2020), particularly in relation to integrated speaking tasks, which are intended to be performed with reliance on given reading and/or listening resources. Integrated speaking tasks are increasingly endorsed by language testing scholars and educators for their authenticity in approximating real-world communicative practices (Barkaoui et al., 2013; Brown et al., 2005), their contribution to the cultivation of academic multiliteracy, and their role in aligning English for Academic Purposes instruction with disciplinary content demands or the designated test construct (Brown and Ducasse, 2019; Frost et al., 2021; Hirai and Koizumi, 2013). Meanwhile, the advancement of communicative technologies and the reform of assessment formats have led to the emergence of computer-based assessment, particularly computer-based integrated speaking tests, which in turn pose new challenges for construct validation research (Iwashita, 2022, p.130–133), such as whether these new test formats faithfully capture their intended constructs and to what extent they do so.

In integrated speaking tasks, test-takers are required to process and transform a cognitively complex stimulus (e.g., a written text or a lecture) and integrate information from this source into the speaking performance (Brown et al., 2005, p.1; Zhang et al., 2022). Most existing studies on integrated speaking tests have focused on college or university-level English language learners in L2 contexts (Brown and Ducasse, 2019; Frost et al., 2021; Kormos et al., 2022; Suzuki and Kormos, 2023). In contrast, relatively few studies have examined intermediate or low-level ESL learners (Tsang and Lee, 2023; Xu et al., 2019, 2020, 2021, 2023a, 2023b, 2024, 2025; Zhou and Zeng, 2016), despite the fact that a substantial proportion of ESL learners fall into this category.
Even fewer studies have explored how these learners perform on computer-based integrated speaking tasks. Continued validation research that draws on diverse theoretical perspectives and a wide range of research methods is therefore needed (Iwashita, 2022, p.139). To address this research gap, the present study conducted a construct validation study of a large-scale, high-stakes computer-based integrated speaking task, the Guangdong Version of the Computer-based English Listening and Speaking Test (hereafter, CELST) of the National Matriculation English Test. Employing bi-factor exploratory structural equation modeling (hereafter, ESEM) and hierarchical multiple regression analyses, the study aimed to investigate the underlying constructs measured by CELST.

2. Literature review

2.1 Integrated speaking tasks

Currently, integrated speaking tasks are widely used in many large-scale high-stakes tests, such as the TOEFL iBT speaking tests, the Versant™ Speak Test, the TEM-4 Oral Test, the CET-SET, and the CELST. Various theoretical motivations have driven the emergence of integrated speaking tasks. For instance, communicative language ability should target the capacity to implement knowledge in communicative language use (Bachman, 1990), and integrated tasks entail the amalgamation of multiple L2 skills (Huang and Hung, 2018). Among the various studies conducted on integrated speaking tasks, those on TOEFL iBT integrated speaking tasks account for a large share.
They focused mainly on the comparison between different types of TOEFL iBT integrated speaking tasks and independent speaking tasks (Huang et al., 2018; Zhang et al., 2022) or between TOEFL iBT integrated speaking tasks and other academic speaking tasks (Brown and Ducasse, 2019; Farnsworth, 2013); rating issues (Wei and Llosa, 2015); textual properties and strategy use (Crossley and Kim, 2019; Inoue and Lam, 2021; Zhang et al., 2021); and source material use (Frost et al., 2019; Frost et al., 2021). Studies on other integrated speaking tasks have examined source material use (Kormos et al., 2022; Lin, 2023; Pusey, 2020), rating (Hirai and Koizumi, 2013; Kim, 2015), affective factors (Ishikawa, 2020), strategy use (Rukthong, 2021; Rukthong and Brunfaut, 2020), fluency of oral products (Suzuki and Kormos, 2023), test fairness (Yan et al., 2019), and task acceptance (Zhang and Zhang, 2022).

Compared with the large number of studies on integrated speaking tasks conducted internationally, research on such tasks in the Chinese context remains relatively limited (Jin, 2012; Rui and Ji, 2017; Zeng, 2011), which is out of proportion to China’s substantial population of ESL learners and the great effort devoted to English teaching and learning. Jin (2012), Zeng (2011), and Zhou (2005) probed into the influence of source material input in integrated speaking tasks on college students’ oral performance, and the results confirmed its positive impact in relieving anxiety and classroom reticence, facilitating autonomous learning, and promoting oral language output. Meanwhile, Zhang and Elder (2009) examined the reliability and validity of the CET-SET from multiple perspectives, namely, authenticity, interactiveness, fairness, and washback effects.
Recently, Tsang and Lee (2023) showed that foreign language-related emotions (anxiety, boredom, and enjoyment), speaking motivation, and spoken input beyond the classroom were directly connected to the speaking proficiency of Year 3 and Year 4 EFL learners in Hong Kong primary schools, with enjoyment and spoken input beyond the classroom serving as direct predictors. Among the studies in the Chinese context, those most relevant to the present study concern the National Matriculation English Test (Shanghai Version) (hereafter, NMET(SH)) (Hou, 2018; Liu and Chen, 2018; Xu, 2016, 2021; Zhang, 2019), especially when the characteristics of test takers (e.g., overall language ability, number of participants) are taken into consideration. These studies confirmed that the NMET(SH) assessed students’ ability to apply various sources of information in completing the integrated listening-speaking tasks and exerted positive washback effects.

2.2 CELST

Since as early as World War II, political needs have influenced the form and scoring of speaking tests in various ways (Fulcher, 2003, p.1). China is a case in point. The General Senior High School Curriculum Standards (hereafter, GSHSCS) (Ministry of Education of the PRC, 2020) provide comprehensive guidance on the general teaching goals, content standards, and teaching and assessment approaches for senior high school English courses. According to the GSHSCS (2020), the general aim of the senior high school English curriculum is to help students cultivate and develop their subject core competencies, which include language abilities, cultural awareness, thinking capacity, and learning ability (Ministry of Education of the PRC, 2020, p.5). Meanwhile, students should be able to acquire English learning resources through multiple channels, choose appropriate strategies and methods, and monitor, evaluate, reflect on, and adjust learning content and progress (Ministry of Education of the PRC, 2020, p.6).
CELST, a response to the aim of cultivating senior high school students’ comprehensive English ability, has been implemented since 2011. There are three sub-tasks in CELST¹, namely, reading aloud, role play, and story retelling. The test is designed for selection purposes. According to the Test Syllabus and Sample Paper Disk for Computer-based English Listening and Speaking Test (EEA-GD, 2016), CELST aims at assessing students’ ability to accomplish communicative tasks in specific contexts by using English, acquiring and applying their knowledge of phonology, vocabulary, grammar, etc. to comprehend and express themselves effectively. Over the past decade, the number of candidates taking the CELST has consistently exceeded 630,000 annually. In both 2022 and 2023, this figure reached 710,000, reflecting the sustained demand for English proficiency assessment in Guangdong province’s higher education admissions process. These figures also indicate that a substantive study of CELST is of great significance considering the large candidate population and the competitive nature of the college entrance examination.

Despite its large number of test takers, few studies have been conducted on CELST. Cheng’s (2011) doctoral dissertation focused solely on the validity issues of the story-retelling task in CELST, leaving the first two sub-tasks unexamined. The same is true of the original studies conducted by Wang et al. (2018) and Xu and his co-workers (2019, 2020, 2021, 2023a, 2023b, 2024, 2025). Zhan and Wan (2016) investigated Senior III students’ attitudes, test preparation practices, and test-taking processes when completing CELST. Zhou and Zeng (2016) compared the rating results of human raters and computer automated scoring of CELST by using many-facet Rasch models and found that, despite the differences in rater severity between these two scoring approaches, computer automated scoring showed better internal consistency due to lower bias rates.
To sum up, the validation studies on integrated speaking tasks reviewed above are helpful in conceptualizing the construct and the validation procedure. However, language test taking is a process that is task-bound and context-specific (Cohen, 2014). Test response is a function not only of the items, tasks, or stimulus conditions, but also of the participants’ responding and the context of measurement (Messick, 1987). In this light, different types of integrated speaking tasks set different requirements on the amount, degree, and form of information in the listening/video clip that can be integrated into the test-takers’ oral response. Besides, though diverse results have been reported for different types of integrated speaking tests, research on CELST remains scarce, especially on the first two sub-tasks. What is more, most existing studies have largely relied on correlational approaches, such as examining the relationship between textual features and test performance or between strategy use and test performance. Few, if any, have probed into its construct validity directly through in-depth analysis of test-takers’ actual performance. On top of that, studies addressing Chinese senior high school EFL learners’ listening-speaking performance in the NMET context are badly needed, given the great influence this large-scale high-stakes test exerts on education both nationally and internationally. Therefore, collecting validity evidence for CELST using a more mathematically grounded validation theory and methodology is necessary for the sound and comprehensive interpretation of its test scores as well as its test constructs.

2.3 Models of integrated speaking tasks and their conceptualizations

Theoretical models serve various functions in testing, such as score interpretation, test development, and curriculum design (Luoma, 2004, p.96).
The necessity of grounding language test development and application in a theory of language proficiency calls for the incorporation of a theoretical framework that defines what language proficiency is (Bachman, 1990, p.81). Besides, a task-specific theoretical model not only represents and encompasses the task construct but also functions as the foundation of evaluation and assessment (Luoma, 2004, p.107). In light of the various frameworks related to listening and/or speaking, it can be concluded that a model of listening-speaking ability should first take an integrative, interactive, or communicative stance, for language use in both natural and academic contexts is inherently integrative, interactive, or communicative. Meanwhile, it should also serve as a guideline for the assessment of both language product parameters and speaking process parameters: the inclusion of the former could guide learners’ self-learning, teachers’ teaching, and raters’ rating practices, and the inclusion of the latter could help examine participants’ cognitive and strategic processes, which would help identify factors influencing task completion (Bachman, 1990). Therefore, an operationalized working model of the construct of CELST will be developed based on the analysis of existing theoretical models, the Test Specifications, and the specific test purposes of CELST (EEA-GD, 2016), which will be presented in Section 2.5.1.

2.4 Theory of validity & validation

Aware of the shift from multiple types of validity to a unitary understanding and from a focus on prediction to one on explanation, Messick (1987) conceptualized validity as a unitary concept and proposed a comprehensive framework consisting of six interrelated aspects, namely, content, substantive, structural, generalizability, external, and consequential validity.
Among them, the external aspect of construct validity, which includes convergent and discriminant evidence obtained through multitrait-multimethod comparisons, as well as evidence of criterion relevance and applied utility (Messick, 1987), delineates the extent to which the construct represented in the assessment accounts for the external pattern of correlations. Besides, Chapelle argued that a validity argument should integrate both evidence and rationales to support conclusions about the appropriateness and soundness of score-based inferences and uses of a test (1999, p.263) and held that validation should begin with formulating a hypothesis concerning the appropriateness of testing outcomes (1999, p.259). For the external aspect of validity, Chapelle (1999) stated that data analyzed through a multitrait-multimethod research design could be used to examine the relationships between the target test and other tests or quantifiable performance indicators.

Earlier, influenced by psychological structuralism and the idea of explaining performance through the systems and subsystems of underlying processes, Embretson (1983) introduced construct modeling to construct validation research. According to construct modeling, the internal structure and substance of a test can be addressed more directly by means of causal modeling of item or task performance. Embretson (1983) suggested that two kinds of construct validation research should be separated: construct representation, that is, identifying the theoretical mechanisms underlying task performance (a task decomposition process), and nomothetic span, that is, detecting the relationships of a test to other measures (e.g., the strength, frequency, and pattern of significant relations). This could be achieved by mathematical modeling, psychological modeling, or multicomponent latent trait modeling, which combines the features of the former two (1983, p.181).
Moreover, the Standards for Educational and Psychological Testing (hereafter the Standards) (AERA et al., 2014) provide criteria for the development and evaluation of tests and testing practices and guidelines for assessing the validity of test score interpretations for their intended uses. According to the Standards, evidence based on test content may involve logical or empirical analyses to determine how adequately the content of a test represents the targeted domain in relation to the proposed interpretations of test scores (AERA et al., 2014, p.14). However, Xu et al. (2024) pointed out that, to date, no public report has examined the content validity of the CELST. Following the guidance of the Standards, decomposing performance into coded construct-relevant textual measures and strategic measures, and employing a multicomponent model based on the testing purpose of CELST, the present study explores to what extent these data-driven measures represent and are related to the specific inferences to be made from test scores.

2.5 The present study

2.5.1 Working taxonomy of CELST

None of the existing theoretical models of speaking ability assessment is fully appropriate as a theoretical framework for the CELST, considering the integrative nature of the task itself and its targeted participants and context. Hence, a task-specific theoretical framework of CELST that reflects the task construct and guides evaluation and assessment by offering wordings and providing criteria (Luoma, 2004, p.107) is needed. An operationalized working model of the construct of CELST (Figure 1) was worked out based on a series of considerations.
These include the analysis of the GSHSCS (Ministry of Education of the PRC, 2020), the test aims issued in the Test Syllabus and Sample Paper Disk for Computer-based English Listening and Speaking Test (EEA-GD, 2016), personal communication with the test designer, a review of the theoretical models related to listening and/or speaking, the notion that combining the assessment of the product with that of the process underlying it (Bygate, 1987) could make construct representation much sounder, and the adoption of the internal operation process from the COE model (Chapelle et al., 1997). Thus, it is supposed that the construct model of CELST should entail the following principles and conceptions: 1) the context part, the left part of the figure, refers to the various factors related to task characteristics, candidates, raters, etc. that affect candidates’ performance; 2) the internal operation part, the core of the working model symbolized by the largest block, is the underlying process running through task completion; and 3) performance is the result of the oral production (Gist & Britol, 2020).

2.5.2 SEM & Bi-factor ESEM

Structural equation modeling (SEM) is an a priori hypothesis-testing approach grounded in theoretical knowledge or the findings of previous research. It examines the nature of the relationships between the observed variables and the latent variables by using a confirmatory and hypothesis-testing approach (Bollen, 1989; Byrne, 2013) and has been widely used in education and psychology, including the language testing field (Dörnyei, 2007; Phakiti, 2008). Powerful as it is, traditional SEM has met its bottleneck. For example, fixing factor loadings at zero led to problems with believability and replicability (Asparouhov and Muthén, 2009). Exploratory Structural Equation Modeling (ESEM), a relatively novel approach, has recently emerged as a promising alternative that overcomes some of these challenges (van Zyl and ten Klooster, 2022).
ESEM is an exploratory factor analysis measurement model with rotations used in a structural equation model and functions as a combination of confirmatory factor analysis and exploratory factor analysis (Marsh et al., 2009, 2010). All the SEM parameters are accessible in ESEM, such as residual correlations, regressions of factors on covariates, regressions among factors, standard errors for all rotated parameters, and overall model fit indices (Asparouhov and Muthén, 2009, p.398-399). Meanwhile, bifactor models test the presence of a global unitary construct underlying the answers to all items (G-factor) and whether this global construct co-exists with meaningful specificities (S-factors) defined by the part of the items not explained by the G-factor (Swami et al., 2023). Since one of the contexts in which bi-factor ESEM can be used is when no clear a priori structure exists (Swami et al., 2023), the present paper employs bi-factor ESEM, a more flexible, data-driven approach, to examine the extent to which the data, as reflected by the test scores and coded values, align with the proposed theoretical construct of CELST.

2.5.3 Research questions

As mentioned above, previous studies have shown that participants’ performances, as well as the strategic behaviors involved, illustrate the construct of integrated speaking tasks. However, these studies did not involve a large number of ESL learners of low to intermediate English language ability, nor did they test learners’ performance on a computer-based integrated speaking task such as the CELST. To bridge this gap, the present study drew on bi-factor ESEM and hierarchical multiple regression techniques to identify to what extent the CELST can tease out the supposed construct. Specifically, we addressed the following two research questions.

To what extent do participants’ performances activate and reflect the textual measures in the construct of CELST?
To what extent do participants’ performances activate and reflect the communicative strategy use in the construct of CELST?

3. Method

3.1 Participants

3.1.1 Student participants

With the assistance of the Education Examinations Authority of Guangdong Province (EEA-GD), a total of 360 students’ performance data in the CELST were included. To protect participants’ personal privacy, EEA-GD only provided the overall score and detailed subtask scores in an anonymous manner, without including any personal information such as names or student IDs. Influential factors such as overall oral English proficiency level (Brown et al., 2005; Iwashita et al., 2008; Swain et al., 2009) and geographical region (Jin and Wu, 2010) were taken into consideration when sampling participants. Firstly, for convenience, only Test A was sampled, although six parallel tests were used in that year. Secondly, all students were stratified into three groups (high, medium, and low) based on their total score reported by EEA-GD. Students whose CELST scores ranked in the top 20 out of 100 were assigned to the high-level group, those in the bottom 50 were assigned to the low-level group, and the remaining students were placed in the middle-level group. This stratification criterion corresponds to the enrollment differences among key universities, higher vocational colleges, and general colleges or universities. Participants of each proficiency group were parceled and named as a separate file. Thirdly, all the participants in each proficiency level file were classified into four geographical region groups, namely north Guangdong, east Guangdong, west Guangdong, and the Pearl River Delta. To be specific, 30 participants were sampled randomly from each geographical region group in each proficiency level. In other words, 360 participants (30 × 4 × 3 = 360) were finally recruited. Table 1 describes the composition of the student participants.
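The sampling design described above (three proficiency strata crossed with four regions, 30 draws per cell) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the percentile-rank input, the fixed seed, and the synthetic candidate pool are hypothetical and are not taken from the study's actual procedure or data.

```python
import random
from collections import defaultdict

LEVELS = ("high", "medium", "low")
REGIONS = ("north Guangdong", "east Guangdong", "west Guangdong", "Pearl River Delta")
PER_CELL = 30  # participants drawn per region within each proficiency level


def assign_level(rank_from_top):
    """Map a rank out of 100 (0 = best score) to a proficiency group.

    Top 20 -> high, bottom 50 -> low, the remainder -> medium.
    """
    if rank_from_top < 20:
        return "high"
    if rank_from_top >= 50:
        return "low"
    return "medium"


def stratified_sample(candidates, per_cell=PER_CELL, seed=0):
    """Draw per_cell candidates at random from every (level, region) cell.

    candidates: iterable of (candidate_id, rank_from_top, region) tuples.
    The seed is an assumption added here for reproducibility.
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for candidate_id, rank, region in candidates:
        cells[(assign_level(rank), region)].append(candidate_id)
    return {
        (level, region): rng.sample(cells[(level, region)], per_cell)
        for level in LEVELS
        for region in REGIONS
    }


# Synthetic pool (hypothetical): 500 anonymous candidates per region.
pool = [(f"{region}-{i}", i % 100, region) for region in REGIONS for i in range(500)]
sample = stratified_sample(pool)
total = sum(len(ids) for ids in sample.values())
print(total)  # 12 cells of 30 candidates each
```

Sampling within each cell, rather than from the pooled candidates, is what guarantees the balanced 30 × 4 × 3 layout regardless of how unevenly the regions or score bands are populated.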
3.1.2 Transcribers and coders

Apart from the first author of the present study, another seven M.A. students majoring in English language testing participated in the transcription work. All of them had at least two years’ experience of rating CELST before doing the present transcription work. Training and trial transcription were conducted before formal transcription. The pilot transcription recordings, ten pieces in total, were sampled randomly from students’ performance in the CELST conducted one year earlier than the present study. Besides, the final transcription results were checked randomly by the second author and another Ph.D. candidate in English language testing, and forty transcriptions in total were double-checked.

The coding work of the present study was divided into two parts. One part was done manually with the help of WinMax² to collect data on textual and communicative strategy use measures; the other part could be done automatically with WordSmith Tools, Praat, and the online resources of Lu Xiaofei (http://aihaiyang.com/software/). Apart from the seven M.A. transcribers mentioned above, another two Ph.D. candidates in English language testing, one in her first year and the other in her second year, participated in the first part of the coding work. All nine coders were trained together on how to do the coding work with the software WinMax. Thirty-six samples were double-coded to ensure the reliability of the coding results, following the approach of Brown et al. (2005). The second part of the coding work was done by the first author and then checked by the second author for issues such as missing data and input accuracy.
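A consistency check on double-coded samples of this kind can be summarized with Cronbach's α, treating each coder as an "item". The sketch below is a minimal illustration using only the Python standard library; the function name and the two short rating lists are hypothetical and do not reproduce the study's data or software.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for k coders who each coded the same n samples.

    ratings: list of k lists, one per coder, each with one value per sample.
    alpha = k / (k - 1) * (1 - sum of per-coder variances / variance of totals)
    """
    k = len(ratings)
    if k < 2:
        raise ValueError("at least two coders are required")
    n = len(ratings[0])
    if any(len(r) != n for r in ratings):
        raise ValueError("every coder must code the same samples")
    coder_variances = sum(variance(r) for r in ratings)  # sample variances
    totals = [sum(values) for values in zip(*ratings)]   # per-sample sums
    return k / (k - 1) * (1 - coder_variances / variance(totals))


# Hypothetical counts of one coding parameter from two coders on six samples.
coder_a = [3, 5, 2, 4, 6, 1]
coder_b = [3, 4, 2, 5, 6, 1]
alpha = cronbach_alpha([coder_a, coder_b])
print(f"alpha = {alpha:.3f}")
```

In practice the same function would be applied once per coding parameter, so that each parameter receives its own coefficient.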
Table 1 A Description of the Student Participants

Region               High   Medium   Low   Total
north Guangdong        30       30    30      90
east Guangdong         30       30    30      90
west Guangdong         30       30    30      90
Pearl River Delta      30       30    30      90
Total                 120      120   120     360

3.2 Instruments and materials

3.2.1 Integrated speaking tasks

The oral data were collected from the three subtasks of Test A of CELST in a certain year. Figure 2 presents the screen interface of CELST. Appendix A presents the details of Test A of CELST used in the present study.

3.2.2 Rating rubrics

The rating rubrics for CELST (Appendix B) were used in the present study not for scoring performance, as they usually are, but to help establish the working framework of the construct of the CELST and the taxonomy of coding.

3.2.3 Coding scheme

Test takers’ performances were coded from the perspectives of oral product textual measures and communicative strategies used, as presented in Figure 1. The final framework of test takers’ oral product textual measures comprises 25 coding items under five aspects, namely, accuracy, complexity, coherence, fluency, and phonology, while the strategies used were assessed from the reduction and achievement perspectives. The detailed variables that measure test takers’ oral products and strategies used are listed in Appendix C.

4. Results and Discussions

4.1 Preliminary analysis

Firstly, the first author of this paper rechecked whether total scores, sub-scores for each subtask, and the corresponding video tape recordings were included, and the results were confirmed. Secondly, during data checking, it was found that the frequency of several coding parameters³ was zero for all participants. Thus, the parameter number of internals was deleted due to zero occurrence across all three subtasks. Thirdly, the observed variable SpM was no longer used due to its high correlation with the observed variable MSpM, which has more linguistic implication.
The inter-coder reliability Cronbach’s α values for all parameters were over 0.8 (see Appendix D), indicating that the two coders were consistent with each other in the coding process. This further supported the reliability of the coding results in the main analysis.

4.2 The competence model of CELST

Based on the results of previous literature on integrated speaking assessment and second language assessment, a set of indexes was used in several alternative hypothesized models of CELST to explore the relationship between the theoretical model proposed for CELST and the working construct (see Figure 3) from the perspective of language competence parameters, addressing Research Question 1. As stated in Section 3.2.3, five factors, namely, accuracy, complexity, coherence, fluency, and phonology, were employed as latent factors in the competence model of CELST. Initially, the researchers assessed the model fit indexes of hypothesized competence Model 1 using traditional second-order SEM. However, the analysis failed to converge because the maximum number of iterations was exceeded. A detailed inspection of the data found that most values of CpT were smaller than 2 (CpT appeared less than twice in each participant’s performance), indicating that CpT is inappropriate as a measure for differentiating senior high school students’ oral proficiency. Iwashita et al. (2008: 45) also raised doubts about the use of ratio measures to represent complexity: as participants’ overall oral language proficiency goes up, NoT and NoC in their oral production would increase, but not the value of the ratio. Besides, the negative correlation between CpT and the total score (-.018) also runs counter to expectation, because a higher CpT usually indicates higher overall language proficiency and a higher score. Therefore, the observed variables NoC and NoT are used instead of CpT and MLT to measure language complexity.
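The concern about ratio measures can be made concrete with a small worked example. It assumes, as the variable names suggest, that CpT is the number of clauses (NoC) divided by the number of T-units (NoT); the two sets of counts are hypothetical and chosen only to show that both counts can double while the ratio stays the same.

```python
def clauses_per_t_unit(n_clauses, n_t_units):
    """Return CpT, assumed here to be NoC divided by NoT."""
    if n_t_units <= 0:
        raise ValueError("n_t_units must be positive")
    return n_clauses / n_t_units


# Hypothetical (NoC, NoT) counts for a shorter and a longer response.
speakers = {"shorter response": (6, 5), "longer response": (12, 10)}
ratios = {
    name: clauses_per_t_unit(noc, n_t) for name, (noc, n_t) in speakers.items()
}

for name, (noc, n_t) in speakers.items():
    # The counts separate the two responses; the ratio does not.
    print(f"{name}: NoC={noc}, NoT={n_t}, CpT={ratios[name]:.2f}")
```

Under these assumed counts, the raw frequencies NoC and NoT distinguish the two responses while CpT is identical, which is consistent with the decision to model the counts rather than the ratio.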
Besides, the variable RMS was discarded in the new model because of its non-significant relationship with the total score (-.063). However, the new model still could not converge. On inspecting the observed variables in the new model, those whose correlation with the total score was smaller than 0.3, specifically DysM and RRSU, were deleted. Thus, a hypothesized exploratory bi-factor SEM (Model 2; see Figure 4), instead of the traditional SEM, was proposed based on the results for Model 1.

Results of the bi-factor ESEM indicated that the model (Figure 4: Model 2) fits the data well (see Footnote 4). The chi-square of the model was 116.489, with 85 degrees of freedom; χ²/df was 1.37, indicating a good fit between the model and the data. The values of RMSEA and SRMR were well below 0.05, also indicating good model fit. All three absolute fit indices thus demonstrate that Model 2 fits the present data well. CFI and TLI are two indices of comparative fit. They both range from 0 to 1; the higher the value, the better the fit, and a value larger than 0.9 indicates good model fit. These two indices also indicated that Model 2 fits the present data well.

Besides, almost all observed variables loaded on the general factor, listening-speaking ability, and several of them clustered and loaded on one domain-specific factor, such as accuracy, complexity, coherence, fluency, or phonology. This loading pattern supports the use of exploratory bi-factor ESEM for the present data. However, some details deserve closer examination. For one thing, RMP (ratio of mispronounced phonemes) loaded positively on overall listening-speaking ability, which runs against the theory: the more phonemes one mispronounces, the poorer one’s oral English would be.
For another thing, RUiF (0.674) loaded effectively on its domain-specific factor, phonology, but not on the general listening-speaking ability factor. Finally, the loading of PRO on its domain-specific factor (-0.204) was lower in magnitude than the accepted value (0.3) for a meaningful relationship between an observed variable and a latent variable, while its loading on the general listening-speaking ability factor was strong (-0.747). Therefore, modifications were made. First, the index RMP was discarded. Even though the factor loadings in the modified model (Model 2.1) were somewhat better than those in Model 2, RUiF still did not load on general listening-speaking ability, which is against the hypothesis of the bi-factor ESEM analysis. Besides, PRO still loaded poorly on phonology (-0.207), though significantly. However, the modification indices did not suggest any modification involving PRO. Thus, RUiF was deleted in the modified Model 2.2 to see whether its deletion would result in better model fit and interpretation. As can be seen in Figure 5 (Model 2.2), all observed indicators in the new model loaded on overall listening-speaking ability and on their own domain-specific latent factors. As a whole, the finalized model showed a good model-to-data fit (χ² = 79.97, df = 60; χ²/df = 1.33; RMSEA = 0.04; SRMR = 0.02; CFI = 1.00; TLI = 0.976). All factor loadings of the observed variables on their corresponding latent variables were significant at the 5% level.

4.3 Strategy use in CELST

As illustrated in Section 2.4.1 and Section 3.2.3, the communicative strategies examined in the present study consist of two macro-parameters, namely reductive strategies and achievement strategies, representing two different model approaches.
Hierarchical multiple regression is used when the independent variables are entered cumulatively in an order specified by the researchers on theoretical grounds (Gaciu, 2021; Pallant, 2020). Therefore, hierarchical multiple regression was employed to examine whether and to what extent the two macro-parameters of communicative strategies used by participants influenced and predicted their overall performance, as well as their performance on individual aspects.

The assumptions of multiple linear regression, including sample size, multicollinearity and singularity, outliers, normality, and linearity, were all checked and met. First, the total of 360 participants met the sample size requirement for multiple linear regression (N ≥ 50 + 8m, where m is the number of independent variables) (Tabachnick & Fidell, 2013). Second, with all Pearson correlation values falling between -0.2 and 0.3, Tolerance values above 0.10, and VIF values well below the cut-off of 10, multicollinearity was not a concern. Third, visual inspection of the normal probability plot of the regression standardized residuals confirmed the normality of the data (see Appendix E).

As shown in Table 2, the R square of the reductive model is 0.065. The value rose to 0.168 after the four achievement strategy parameters were entered; that is, the R square change was 0.103. In other words, the present communicative strategy model explained only 16.8 per cent of the variance in participants’ scores. Results from the ANOVA indicated that this model achieved statistical significance (F = 33.35, Sig. = .00). Detailed inspection of the coefficients showed that, apart from App, the other five parameters made significant unique contributions to the prediction of participants’ overall CELST score (Sig. < .05).
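The sample-size rule and the R square change reported above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the F-change formula is the standard one for nested regression models and is not given in the paper, and the number of observations entering the regression is assumed here to be the 360 participants, which the paper does not state explicitly. The function names are hypothetical.

```python
def min_sample_size(num_predictors):
    """Rule of thumb cited from Tabachnick & Fidell (2013): N >= 50 + 8m."""
    return 50 + 8 * num_predictors


def f_change(r2_reduced, r2_full, num_added, num_full, n_obs):
    """F statistic for the R-square change between two nested regression models.

    num_added: predictors entered in the second step
    num_full:  predictors in the full model
    """
    df_residual = n_obs - num_full - 1
    if df_residual <= 0:
        raise ValueError("not enough observations for the full model")
    return ((r2_full - r2_reduced) / num_added) / ((1 - r2_full) / df_residual)


N_OBS = 360                      # ASSUMPTION: one observation per participant
R2_REDUCTIVE = 0.065             # step 1: MA, SR (from the text)
R2_FULL = 0.168                  # step 2: plus App, GUS, PAR, RES (from the text)

required = min_sample_size(6)    # six strategy predictors in the full model
r2_change = R2_FULL - R2_REDUCTIVE
f_value = f_change(R2_REDUCTIVE, R2_FULL, num_added=4, num_full=6, n_obs=N_OBS)

print(f"required N >= {required}; available N = {N_OBS}")
print(f"R-square change = {r2_change:.3f}; F change = {f_value:.2f} under the assumed N")
```

Because the F-change value depends directly on the number of observations, it should be read only as an illustration of the computation, not as a re-analysis of the study.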
Table 2 Multiple regression analysis of communicative strategy variables and score

Model summary (c) and ANOVA

| Predictor set | R | R Square | Adjusted R Square | R Square Change | F | Sig. |
|---|---|---|---|---|---|---|
| a | .25 | .07 | .06 | .07 | 37.27 | .00 |
| b | .41 | .17 | .16 | .10 | 33.35 | .00 |

Coefficients

| Predictor set | Variable | Standardized β | t | Sig. |
|---|---|---|---|---|
| a | MA | -.17 | -5.89 | .00 |
| a | SR | -.18 | -5.98 | .00 |
| b | MA | -.21 | -7.16 | .00 |
| b | SR | -.22 | -7.87 | .00 |
| b | App | .00 | .16 | .87 |
| b | GUS | .07 | 2.45 | .02 |
| b | PAR | .09 | 3.17 | .00 |
| b | RES | .27 | 9.05 | .00 |

a. Predictors: (Constant), MA, SR
b. Predictors: (Constant), MA, SR, App, GUS, PAR, RES
c. Dependent Variable: score

5. Discussions

5.1 Textual measures perspective

Such a loading pattern (see Fig. 5: Model 2.2) confirms that CELST does examine participants’ ability to fulfil three different types of listening-speaking tasks by applying their knowledge of accuracy, complexity, coherence, fluency, and phonology. This accords with its test purpose of testing students’ ability to accomplish tasks in specific contexts by acquiring and applying their knowledge of English phonetics, vocabulary, grammar, etc. to comprehend and express themselves (EEA-GD, 2016). The relatively high loadings on general listening-speaking ability also reflect that CELST implements the requirement of the GSHCSC that senior high school English courses should stress cultivating students’ ability to acquire and process information in English and to analyze and solve problems in English, especially the ability to think and express themselves in English, thus developing students’ overall ability to use English. Besides, the different numbers of observed indicators of these five domain-specific factors and their different weights not only manifest the comprehensiveness of CELST but also indicate that its emphasis on the dimensions of language ability might differ across sub-tasks, that is, the contextual specificity of its three subtasks.
As manifested by Model 2.2, accuracy is loaded by REFC, RCVF, and RRSC; complexity by TTR, NoC, and NoT; fluency by MSpM, AR, MLR, and DysM; phonology by PRA, PRO, and PR; and coherence by COM. Each of the loadings on accuracy and fluency is over 0.4, with some larger than 0.7, indicating that these textual measures are strong indicators of accuracy and fluency. For complexity and coherence, however, the picture is weaker, since the loadings of TTR and COM on their corresponding domain-specific factors are just over 0.3, a borderline value. Moreover, the factor loading of RPO on phonology is only −0.208. These different loadings reflect, to some extent, that CELST gives different weights to the five dimensions because it aims to test participants’ ability to obtain and apply information from diverse source materials and their background knowledge to accomplish given tasks. This might also be attributed to the integrative nature of listening-speaking tasks (Brown et al., 2005; Farnsworth, 2013). Meanwhile, the three sub-tasks (reading aloud, role play, and story retelling) and their corresponding rating criteria might also play a role. The different requirements for senior high school students’ language ability development in terms of grammar, vocabulary, pronunciation, etc., across listening ability, speaking ability and affective attitude (see Section 2.2) are another factor that could account for the different weights. It should be noted that phoneme omission or swallowing, particularly the omission of medial or final consonants in words, is a common phenomenon among students from Guangdong when speaking English (Xu et al., 2025).
Such consonant simplification may be related to their dialect, Cantonese, which does not have complex consonant clusters at the end of syllables and tends to rely on tonal pitch changes rather than on consonants and their endings. Therefore, to some extent, the significant negative impact of RPO on phonology and listening-speaking ability indicates that raters can recognize and evaluate participants’ phoneme omission effectively, which is consistent with the findings of Xu et al. (2025).

Moreover, the unidimensional factor loadings of RRSU, ADD, and TEM on general listening-speaking ability and the components of the domain-specific complexity factor deserve exploration. Whether RRSU should be used as a parameter of accuracy or as a factor representing content remains a question (Zhou, 2005; Brown et al., 2005). The high and unidimensional factor loading of RRSU (0.965) on general listening-speaking ability shows that RRSU is an indispensable and decisive variable accounting for overall listening-speaking proficiency; however, it should not be regarded as a part of accuracy. As for the coherence coding parameters, such as ADD and TEM, their pattern could be due to the fact that the reading aloud task only requires test takers to read after the video clip with appropriate pronunciation and intonation, and no cohesive devices are needed in this task. As a result, when CELST is considered as a whole, only the COM variable clustered on the domain-specific coherence factor. The results of Model 2.2 also suggest that it is inappropriate to use ratio measures, like CpT, to assess L2 learners’ oral production, which gives empirical support to the doubts raised by Iwashita et al. (2008, p. 45).
Meanwhile, the high factor loadings of NoC and NoT on both the general listening-speaking ability factor (NoC: 0.694; NoT: 0.729) and the domain-specific complexity factor (NoC: 0.692; NoT: 0.447) indicate that NoC and NoT could be used as measures of textual complexity.

5.2 Communicative strategy use perspective

Results of the hierarchical multiple regression analysis demonstrated that the combination of all six indices explained 16.8% of the score variance. Though not very large numerically, this value is considerable since language competence was not included in the regression model. On the one hand, the four achievement strategy predictors contributed 10.3% out of the total of 16.8%, indicating that, when they have difficulty expressing themselves, senior high school participants prefer achievement strategies such as guessing, paraphrasing, and restructuring to reductive strategies. This is quite different from Yang’s (2009) finding that participants tended to rely more on strategies like verbatim source use, which was considered unfavorable for their language development. It might also be because participants in the present study were well trained before taking such a large-scale high-stakes test, so that they were familiar with how their performances would be assessed. Hence, to avoid being assessed poorly, they would not use the reductive strategies of message abandonment and semantic reduction. Detailed inspection of the coding data showed that no participant used message abandonment in the first sub-task (reading aloud), which requires test takers to repeat immediately what the computer has just broadcast, with the text presented on the computer screen.
On the other hand, approximation was the only variable that did not contribute significantly to score variance, in contrast to the other five communicative strategy variables (see Table 2, t = .16, Sig. > .05). Swain et al. (2009, p. 23) also reported that approximation was the type of communicative strategy used least by participants. According to Fulcher (2003) and Swain et al. (2009), approximation refers to using strategies like lexical substitution, over-generalization, and exemplification to replace an unknown word or a word that cannot be recalled. The vocabulary and expressions given in the source material in the three subtasks of CELST probably facilitated participants’ oral production by supplying the words and expressions they needed (Brown et al., 2005; Frost et al., 2021).

6. Conclusions

Employing a convergent mixed-methods design and guided by Messick’s unitary concept of construct validity (Messick, 1987), the present research explored whether and to what extent CELST tests the language competences it claims to test. The results indicated that CELST does test students’ ability to accomplish certain tasks in specific contexts by acquiring and applying various sources of knowledge in English, such as their encyclopedic knowledge of English (accuracy, complexity, coherence, fluency, and phonology) and of the world, source material content, and communicative strategies (message abandonment, semantic reduction, approximation, guessing, paraphrasing, restructuring). Such a test construct is similar to that of the integrated speaking tasks in TOEFL as shown in Brown et al. (2005).
That is to say, integrated speaking tasks, or CELST in the present study, examine participants’ ability to acquire information from various source materials and their encyclopedic background knowledge and then use it to fulfil oral output tasks (reading aloud, role play, story retelling) under the guidance of the task prompt and with the help of their communicative strategic knowledge (see Fig. 1), which is consistent with the teaching goals set out in the GSHCSC (Ministry of Education of the PRC, 2020).

However, this study has several limitations that should be carefully considered when interpreting or generalizing the findings. First, only participants’ performances were examined in the present study, not the factors of the task and the candidate, that is, the left part of the theoretical model of CELST (see Fig. 1), which limits the generalizability of the results. Second, other approaches to data collection, such as interviews or questionnaires on participants’ self-reported task completion processes or on senior high school teachers’ feedback, are worth exploring, despite the irreplaceability of authentic NMET data in the present study.

Keeping these limitations in mind, the researchers interpret the findings and propose implications for L2 theory and practice. Theoretically, the context-bound, task-specific model adapted from Chapelle et al. (1997) brings the integrative view of listening-speaking tasks into CELST. Task characteristics and candidates’ individual differences together play a role in the internal operation of candidates’ language competence, communicative strategies, and world knowledge in the performances they produce.
Data on the influences of task characteristics and other participant factors, such as geographical region, gender, and motivation (Tsang and Lee, 2023), could be collected to gather evidence based on relations to other variables and so examine the generalizability of CELST (AERA et al., 2014). Methodologically, the bottom-up research paradigm, a discourse-based approach, together with bi-factor ESEM, complements existing research methodology for construct validation in applied linguistics. Future comparative studies across the three subtasks of CELST, concerning their task-specific constructs as well as the psychological processes during task fulfilment, would enrich research on CELST based on internal structure and response processes (AERA et al., 2014). Pedagogically, the contributing coding parameters of the specific factors and their loadings differed dramatically from one another, indicating that different types of integrated speaking tasks should be adopted for different educational purposes. For example, the reading aloud task and the role play task might not be suitable for testing students’ ability to achieve textual coherence, and it would be wiser to assess senior high school students’ oral language proficiency on how many clauses or T-units they produce rather than on how many clauses or T-units they produce per sentence.

Declarations

Funding Declaration

The authors declare that no financial support was received for the conduct of this study, the preparation of this manuscript, or its publication. This research was carried out independently without any external funding from public, commercial, or non-profit organizations.

Ethics Approval

The study was reviewed and approved by the Ethics Review Committee of the School of Foreign Languages, Southern Medical University (No. 20230910) on September 10, 2023.
This study was conducted in accordance with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Informed consent

The study “Exploring the Construct Validity of Integrated Speaking Tasks: The Case of a Large-scale High-stakes Computer-based Listening-Speaking Task” collected 360 senior high school students’ performance data from Test A of the Guangdong Version of the Computer-based English Listening and Speaking Test of the National Matriculation English Test (hereafter, CELST). All the data were provided by the Education Examination Authority of Guangdong Province (hereafter, EEA-GD), since the first author participated in a Guangdong provincial educational and scientific research program (TJW2013001) that aimed at validating the reliability and validity of the automatic scoring of CELST. In line with provincial regulations and institutional policies, written (signed) consent was not required. EEA-GD provided only the overall scores and detailed subtask scores in anonymous form, without any personal information such as names or student IDs. Anonymity was rigorously guaranteed to ensure that all collected data would be used solely for academic research purposes. All participants in this provincial program were required to keep the obtained data confidential and to use them only for scientific research. Additionally, we applied for an exemption from informed consent, which was approved by the Ethics Committee of the School of Foreign Languages, Southern Medical University.

Data availability

Due to an agreement with the Education Examination Authority of Guangdong Province, the data used and/or analyzed during the current study are restricted to research purposes only and cannot be made publicly available.

Competing interest

The authors declare that they have no competing interests.
References

American Educational Research Association, American Psychological Association, National Council on Measurement in Education (2014) Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
Asparouhov T, Muthén B (2009) Exploratory structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 16(3):397-438. https://doi.org/10.1080/10705510903008204
Bachman LF (1990) Fundamental considerations in language testing. Cambridge, UK: Cambridge University Press.
Barkaoui K, Brooks L, Swain M, Lapkin S (2013) Test-takers’ strategic behaviors in independent and integrated speaking tasks. Applied Linguistics, 34(3):304-324. https://doi.org/10.1093/applin/ams046
Bollen KA (1989) Structural equations with latent variables. New York, NY: Wiley.
Brown A, Ducasse AM (2019) An equal challenge? Comparing TOEFL iBT™ Speaking Tasks with Academic Speaking Tasks. Language Assessment Quarterly, 16(2):253-270. https://doi.org/10.1080/15434303.2019.1628240
Brown A, Iwashita N, McNamara T (2005) An examination of rater orientations and test-taker performance on English-for-academic purposes speaking tasks. TOEFL® Monograph Series MS-29. Princeton: Educational Testing Service.
Bygate M (1987) Speaking. Oxford, UK: Oxford University Press.
Byrne BM (2013) Structural equation modeling with Mplus: Basic concepts, applications, and programming. Routledge. https://doi.org/10.4324/9780203807644
Chapelle CA (1999) Validity in language assessment. Annual Review of Applied Linguistics, 19:254-272. https://doi.org/10.1017/S0267190599190135
Chapelle CA, Grabe W, Berns M (1997) Communicative language proficiency: Definition and implications for TOEFL 2000. TOEFL® Monograph Series MS-10. Educational Testing Service.
Cheng FX (2011) Justifying the interpretations about a Listening-to-retell task in CELST in NMET(GD). Guangzhou, China: Guangdong University of Foreign Studies.
Cohen AD (2014) Strategies in learning and using a second language. Longman.
Crossley SA, Kim YJ (2019) Text integration and speaking proficiency: Linguistic, individual differences, and strategy use considerations. Language Assessment Quarterly, 16(2):217-235. https://doi.org/10.1080/15434303.2019.1628239
de Jong NH (2023) Assessing second language speaking proficiency. Annual Review of Linguistics, 9:541-60. https://doi.org/10.1146/annurev-linguistics-030521052114
Dörnyei Z (2007) Research methods in applied linguistics. Oxford, UK: Oxford University Press.
Education Examinations Authority of Guangdong Province (2016) Test syllabus and sample paper disk for Computer-based English Listening and Speaking Test (CELST) of National Matriculation English Test (Guangdong Version). Guangzhou, China: Guangdong Pacific Electronic Press.
Embretson S (1983) Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93(1):179-197. https://www.researchgate.net/publication/289963742
Fan J, Yan X (2020) Assessing speaking proficiency: A narrative review of speaking assessment research within the argument-based validation framework. Frontiers in Psychology, 11, 330. https://doi.org/10.3389/fpsyg.2020.00330
Farnsworth TL (2013) An investigation into the validity of the TOEFL iBT speaking test for international teaching assistant certification. Language Assessment Quarterly, 10(3):274-291. https://doi.org/10.1080/15434303.2013.769548
Frost K, Clothier J, Huisman A, Wigglesworth G (2019) Responding to a TOEFL iBT integrated speaking task: Mapping task demands and test takers’ use of stimulus content. Language Testing, 37(1):133-155. https://doi.org/10.1177/0265532219860750
Frost K, Wigglesworth G, Clothier J (2021) Relationships between comprehension, strategic behaviours and content-related aspects of test performances in integrated speaking tasks. Language Assessment Quarterly, 18(2):133-153.
https://doi.org/10.1080/15434303.2020.1835918
Fulcher G (2003) Testing second language speaking. Routledge.
Gaciu N (2021) Understanding quantitative data in educational research. SAGE Publications.
Gist CD, Bristol TJ (eds) (2020) Fairness in Educational and Psychological Testing. Washington, DC: American Educational Research Association.
Hirai A, Koizumi R (2013) Validation of empirically derived rating scales for a story retelling speaking test. Language Assessment Quarterly, 10(4):398-422. https://doi.org/10.1080/15434303.2013.824973
Huang HD, Hung SA (2018) Investigating the strategic behaviors in integrated speaking assessment. System, 78(1):201-212. https://doi.org/10.1016/j.system.2018.09.007
Huang HD, Hung SA, Plakans L (2018) Topical knowledge in L2 speaking assessment: Comparing independent and integrated speaking test tasks. Language Testing, 35(1):27-49. https://doi.org/10.1177/0265532216677106
Hou YP (2018) A study on the washback effect of the reform of SHNMET listening and speaking test. TEFLE, 183(05):25-31.
Inoue C, Lam DMK (2021) The effects of extended planning time on candidates’ performance, processes, and strategy use in the lecture listening-into-speaking tasks of the TOEFL iBT® test (TOEFL Research Report No. RR-93). Princeton, NJ: Educational Testing Service. https://doi.org/10.1002/ets2.12322
Ishikawa S (2020) Influence of learner attributes on complexity, accuracy, and fluency in English oral outputs of Japanese learners. In: Mentz O, Papaja K (eds) Focus on language: Challenging language learning and language teaching in peace and global education (pp. 43-68). LIT Verlag.
Iwashita N (2022) Speaking assessment. In: Derwing TM, Munro MJ, Thomson RI (eds) The Routledge handbook of second language acquisition and speaking (pp. 130-140). New York, NY: Routledge.
Iwashita N, Brown A, McNamara T, O’Hagan S (2008) Assessed levels of second language speaking proficiency: How distinct? Applied Linguistics, 29(1):24-49.
https://doi.org/10.1093/applin/amm017
Jin X (2012) Working memory constraints on L2 learners’ speech production. Foreign Language Teaching and Research, 44(4):523-535.
Jin Y, Wu J (2010) A preliminary study of the validity of the Internet-Based CET-4: Factors affecting test-takers’ perception of the performance on the test. Technology Enhanced Foreign Language Education, 132(2):3-10.
Kim HJ (2015) A qualitative analysis of rater behavior on an L2 speaking assessment. Language Assessment Quarterly, 12(3):239-261. https://doi.org/10.1080/15434303.2015.1049353
Lin R (2023) Examining the scoring of content integration in a listening-speaking test: A G-theory analysis. Language Assessment Quarterly, 20(3):319-338. https://doi.org/10.1080/15434303.2023.2242334
Liu S, Chen YJ (2018) A practical exploration on NMET (Shanghai)-based English listening and speaking teaching. TEFLE, 183(05):32-36.
Luoma S (2004) Assessing speaking. Cambridge, UK: Cambridge University Press.
Kormos J, Suzuki S, Eguchi M (2022) The role of input modality and vocabulary knowledge in alignment in reading-to-speaking tasks. System, 108, 102854. https://doi.org/10.1016/j.system.2022.102854
Marsh HW, Muthén B, Asparouhov T, Lüdtke O, Robitzsch A, Morin AJS, Trautwein U (2009) Exploratory structural equation modeling, integrating CFA and EFA: Application to students’ evaluations of university teaching. Structural Equation Modeling: A Multidisciplinary Journal, 16(3):439-476. https://doi.org/10.1080/10705510903008220
Marsh HW, Lüdtke O, Muthén B, Asparouhov T, Morin AJS, Trautwein U, Nagengast B (2010) A new look at the big five factor structure through exploratory structural equation modeling. Psychological Assessment, 22(3):471-491. https://doi.org/10.1037/a0019227
Messick S (1987) Validity (TOEFL Report). Princeton, NJ: Educational Testing Service.
Ministry of Education of the People’s Republic of China (2020) General senior high school curriculum standards. People’s Education Press.
Pallant J (2020) SPSS survival manual: A step by step guide to data analysis using IBM SPSS (7th ed.). Routledge.
Phakiti A (2008) Construct validation of Bachman and Palmer’s (1996) strategic competence model over time in EFL reading tests. Language Testing, 25(2):237-272. https://doi.org/10.1177/0265532207086783
Pusey K (2020) Assessing L2 listening at a Japanese university: Effects of input type and response format. Language Education and Assessment, 3(1):13-35. https://doi.org/10.29140/lea.v3n1.193
Rui YP, Ji HJ (2017) The impact of multimodal listening & speaking teaching on English speaking anxiety and classroom reticence. TEFLE, 178(6):50-55.
Rukthong A (2021) MC listening questions vs. integrated listening-to-summarize tasks: What listening abilities do they assess? System, 97(1), 102439. https://doi.org/10.1016/j.system.2020.102439
Rukthong A, Brunfaut T (2020) Is anybody listening? The nature of second language listening in integrated listening-to-summarize tasks. Language Testing, 37(1):31-53. https://doi.org/10.1177/0265532219871470
Swain M, Huang L, Barkaoui K, Brooks L, Lapkin S (2009) The speaking section of the TOEFL iBT™ (SSTiBT): Test-takers’ reported strategic behaviors. TOEFL iBT-10. Educational Testing Service.
Swami V, Maïano C, Morin AJS (2023) A guide to exploratory structural equation modeling (ESEM) and bifactor-ESEM in body image research. Body Image, 47, 101641. https://doi.org/10.1016/j.bodyim.2023.101641
Suzuki S, Kormos J (2023) The multidimensionality of second language oral fluency: Interfacing cognitive fluency and utterance fluency. Studies in Second Language Acquisition, 45(1):38-64. https://doi.org/10.1017/S0272263121000899
Tabachnick BG, Fidell LS (2013) Using Multivariate Statistics (6th ed.). Pearson Education.
Tsang A, Lee JS (2023) The making of proficient young FL speakers: The role of emotions, speaking motivation, and spoken input beyond the classroom. System, 115, 103047.
https://doi.org/10.1016/j.system.2023.103047
Van Zyl LE, ten Klooster PM (2022) Exploratory structural equation modeling: Practical guidelines and tutorial with a convenient online tool for Mplus. Frontiers in Psychiatry, 12(1), 1-28. https://doi.org/10.3389/fpsyt.2021.795672
Wang H, Fan TT, Zeng YQ (2018) Investigating the construct of speaking proficiency under the listening-to-speak integrated task. Modern Foreign Languages, 41(3):413-424.
Wei J, Llosa L (2015) Investigating differences between American and Indian raters in assessing TOEFL iBT speaking tasks. Language Assessment Quarterly, 12(3):283-304. https://doi.org/10.1080/15434303.2015.1037446
Xu W (2016) Analysis of National Matriculation English Test (Shanghai) under the new reform of examination and enrollment system: Innovation, elucidation and prospection. Foreign Language Testing and Teaching, 4:24-31.
Xu W (2021) Practice of a speaking assessment task in a high-stake test: Taking NMET(Shanghai) as an example. Foreign Language Testing and Teaching, (1):21-27.
Xu Y, Huang M, Chen J, Zhang Y (2023a) Investigating a shared-dialect effect between raters and candidates in English speaking tests. Frontiers in Psychology, 14, 1143031. https://doi.org/10.3389/fpsyg.2023.1143031
Xu Y, Li XD, Chen J (2024) The review: Computer-based English Listening and Speaking Test (CELST) of National Matriculation English Test (NMET) Guangdong version in China. Language Testing, 42(2):238-249. https://doi.org/10.1177/02655322241255712
Xu Y, Li XD, Wang PC (2023b) Validating an empirically developed rating scale of story retelling task. Journal of PLA University of Foreign Languages, 46(5):11-19.
Xu Y, Liao TH, Han S, Wang YQ (2019) Development and validation of the content rubric of a story retelling task. Foreign Language Testing and Teaching, 4:21-30.
Xu Y, Liao TH, Han S, Wang YQ (2020) Investigating language features for the listening-to-speak integrated task: A corpus-based approach.
Foreign Language Research, 1:56-63.
Xu Y, Yang MN, Li XD (2025) Investigating the relationships between listening strategies and speaking performance in integrated listening-to-speak tasks. System, 129, 103586. https://doi.org/10.1016/j.system.2024.103586
Xu Y, Zhang YQ (2021) Investigating pronunciation features of the integrated listening-to-speak task construct. Foreign Language Testing and Teaching, 3:39-48.
Yan X, Cheng LX, Ginther A (2019) Factor analysis for fairness: Examining the impact of task type and examinee L1 background on scores of an ITA speaking test. Language Testing, 36(2):207-234. https://doi.org/10.1177/0265532218775764
Yang HC (2009) Exploring the complexity of second language writers’ strategy use and performance on an integrated writing test through structural equation modeling and qualitative approaches. Unpublished doctoral dissertation. The University of Texas.
Zhan Y, Wan ZH (2016) Test takers’ beliefs and experiences of a high-stakes Computer-based English Listening and Speaking Test. RELC Journal, 47(3):363-376. https://doi.org/10.1177/0033688216631174
Zeng QM (2011) The efficacy of multi-modal teaching on the development of L2 listening and speaking abilities. Journal of PLA University of Foreign Languages, 6:72-76.
Zhang R (2019) Washback effect analysis of NMET(Shanghai) listening and speaking test: Taking J school as an example. Foreign Language Testing and Teaching, 4:47-53.
Zhou WJ (2005) Effects of input modes on oral English production. Journal of PLA University of Foreign Languages, 28(6):53-58.
Zhang Y, Elder C (2009) Measuring the speaking proficiency of advanced EFL learners in China: The CET-SET solution. Language Assessment Quarterly, 6(4):298-314. https://doi.org/10.1080/15434300902990967
Zhou Y, Zeng YQ (2016) Many-facet Rasch model analysis on computer automatic scoring of a computer-based English listening-speaking test. Foreign Language Testing and Teaching, 1:22-31.
Zhang WW, Zhang LJ (2022) Understanding assessment tasks: Learners’ and teachers’ perceptions of cognitive load of integrated speaking tasks for TBLT implementation. System, 111:102951. https://doi.org/10.1016/j.system.2022.102951
Zhang WW, Zhang DL, Zhang LJ (2021) Metacognitive instruction for sustainable learning: Learners’ perceptions of task difficulty and use of metacognitive strategies in completing integrated speaking tasks. Sustainability, 13:6275. https://doi.org/10.3390/su13116275
Zhang WW, Zhao MJ, Zhu Y (2022) Understanding individual differences in metacognitive strategy use, task demand, and performance in integrated L2 speaking assessment tasks. Frontiers in Psychology, 13:876208. https://doi.org/10.3389/fpsyg.2022.876208

Footnotes

For a detailed description of the three sub-tasks of CELST and their scoring criteria, please refer to the “General description” section in Xu et al. (2024).

These measures are the numbers of: error-free clauses, verb forms, correct verb forms, reported semantic units, additive conjunctions, comparative conjunctions, temporal conjunctions, consequential conjunctions, internal conjunctions, unmeaningful syllables (including repetition, reformulation, and replacement), filled pauses, unfilled pauses, self-corrections, phoneme additions, phoneme omissions, mispronounced phonemes, unintelligible fragments, misstressed phonemes, abandoned messages, reduced semantic units, guesses, approximations, paraphrases, restructurings, and coined words.

Specifically, they are: in the reading-aloud task, all sub-measures of coherence, the number of internal conjunctions, all sub-measures of achievement strategy use, and the number of abandoned messages; in the role-play task, all sub-measures of achievement strategy use and the numbers of comparative and temporal conjunctions; and, across all three CELST sub-tasks, the numbers of internal conjunctions and coined words.

Generally speaking, a χ²/df value smaller than 2 indicates good fit, and a value of 2 < χ²/df < 5 indicates acceptable fit (Hou, Wen, & Cheng, 2003, pp. 177–179). Two other commonly used absolute indices of model fit are RMSEA and SRMR, which stand for Root Mean Square Error of Approximation and Standardized Root Mean Square Residual, respectively. An RMSEA value smaller than 0.08 is usually taken to indicate acceptable model fit, and a value smaller than 0.05 to indicate good model fit. For SRMR, a value smaller than 0.05 indicates good model fit.

Additional Declarations
No competing interests reported.

Supplementary Files
8Sept2025Appendices.docx
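As an illustration of how the fit-index cutoffs described in the footnote on model fit above might be applied, the following is a minimal sketch rather than the authors' procedure. It assumes the commonly used RMSEA point estimate, sqrt(max(χ² − df, 0) / (df · (N − 1))) (some software divides by N instead of N − 1), and the conventional 0.05 and 0.08 cutoffs. The function names `rmsea` and `classify_fit` and all input values are hypothetical.

```python
import math


def rmsea(chi2: float, df: int, n: int) -> float:
    """Point estimate of RMSEA from a model chi-square, its df, and sample size n.

    Assumption: the (n - 1) denominator; some programs use n instead.
    """
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def classify_fit(chi2: float, df: int, n: int, srmr: float) -> dict:
    """Label each index using the cutoffs summarized in the footnote."""
    ratio = chi2 / df
    r = rmsea(chi2, df, n)
    return {
        # chi2/df: below 2 is good, between 2 and 5 is acceptable
        "chi2_df": "good" if ratio < 2 else "acceptable" if ratio < 5 else "poor",
        # RMSEA: below 0.05 is good, below 0.08 is acceptable
        "rmsea": "good" if r < 0.05 else "acceptable" if r < 0.08 else "poor",
        # SRMR: only the 'good' cutoff (0.05) is given in the footnote
        "srmr": "good" if srmr < 0.05 else "above cutoff",
    }


if __name__ == "__main__":
    # Hypothetical values, not taken from the study.
    print(classify_fit(chi2=250.0, df=100, n=301, srmr=0.04))
```

Keeping the three indices separate, rather than collapsing them into one verdict, mirrors the usual practice of reporting several fit indices side by side.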
10:22:36","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1720819,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7563159/v1/d8b98fe1-324c-4c97-a29a-8276df8fd92c.pdf"},{"id":94480367,"identity":"3fd13355-b2af-4fba-9873-352865ec5984","added_by":"auto","created_at":"2025-10-27 16:10:50","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":79433,"visible":true,"origin":"","legend":"","description":"","filename":"8Sept2025Appendices.docx","url":"https://assets-eu.researchsquare.com/files/rs-7563159/v1/107163f3f03fa8f47bd43f5d.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"Exploring the Construct Validity of Integrated Speaking Tasks: The Case of a Large-scale High-stakes Computer-based Listening-Speaking Task","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eIt is a challenge to conceptualize, define and assess speaking in a reliable and valid way (de Jong, \u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Fan and Yan, \u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e2020\u003c/span\u003e), particularly in relation to integrated speaking tasks, which are intended to be performed with reliance on given reading and/or listening resources. 
Integrated speaking tasks are increasingly endorsed by language testing scholars and educators for their authenticity in approximating real-world communicative practices (Barkaoui et al., \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2013\u003c/span\u003e; Brown et al., \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2005\u003c/span\u003e), their contribution to the cultivation of academic multiliteracy and their role in aligning English for Academic Purposes instruction with disciplinary content demands or designated test construct (Brown and Ducasse, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2019\u003c/span\u003e; Frost et al., \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e2021\u003c/span\u003e; Hirai and Koizumi, \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e2013\u003c/span\u003e). Meanwhile, the advancement of communicative technologies and the reform of assessment formats have led to the emergence of computer-based assessment, particularly computer-based integrated speaking tests, which in turn pose new challenges to its construct validation researches (Iwashita, \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2022\u003c/span\u003e, p.130\u0026ndash;133), such as whether these new test formats faithfully capture its intended constructs and to what extent they do so.\u003c/p\u003e\u003cp\u003eIn integrated speaking tasks, test-takers are required to process and transform a cognitively complex stimulus (e.g. a written test or a lecture) and integrate information from this source into the speaking performance (Brown et al., \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2005\u003c/span\u003e, p.1; Zhang et al., \u003cspan citationid=\"CR78\" class=\"CitationRef\"\u003e2022\u003c/span\u003e). 
Most existing studies on integrated speaking tests have focused on college or university-level English language learners in L2 contexts (Brown and Ducasse, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2019\u003c/span\u003e; Frost et al., \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e2021\u003c/span\u003e; Kormos et al., \u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Suzuki and Kormos, \u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). In contrast, relatively few studies have examined intermediate or low-level ESL learners (Tsang and Lee, \u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Xu et al., \u003cspan citationid=\"CR64\" class=\"CitationRef\"\u003e2019\u003c/span\u003e, \u003cspan citationid=\"CR65\" class=\"CitationRef\"\u003e2020\u003c/span\u003e, 2021, \u003cspan citationid=\"CR61\" class=\"CitationRef\"\u003e2023a\u003c/span\u003e, \u003cspan citationid=\"CR63\" class=\"CitationRef\"\u003e2023b\u003c/span\u003e, \u003cspan citationid=\"CR62\" class=\"CitationRef\"\u003e2024\u003c/span\u003e, \u003cspan citationid=\"CR66\" class=\"CitationRef\"\u003e2025\u003c/span\u003e; Zhou and Zeng, \u003cspan citationid=\"CR75\" class=\"CitationRef\"\u003e2016\u003c/span\u003e), despite the fact that a substantial proportion of ESL learners fall into this category. Even fewer studies have explored how these learners perform on computer-based integrated speaking tasks. 
Therefore, continued validation researches are expected by drawing on diverse theoretical perspectives and using a wide range of research methods (Iwashita, \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2022\u003c/span\u003e, p.139).\u003c/p\u003e\u003cp\u003eTo address this research gap, the present study conducted a construct validation study of a large-scale, high-stakes computer-based integrated speaking task, the Guangdong Version of the Computer-based English Listening and Speaking Test (hereafter, CELST) of the National Matriculation English Test. Employing bi-factor exploratory structure equation modeling (hereafter, ESEM) and hierarchical multiple regression analyses, the study aimed to investigate the underlying constructs measured by CELST.\u003c/p\u003e"},{"header":"2. Literature review","content":"\u003ch2\u003e2.1 Integrative speaking tasks\u003c/h2\u003e\n\u003cp\u003eCurrently, integrated speaking tasks are widely used in many large-scale high-stakes tests, such as, the TOEFL iBT speaking tests, the Versant\u003csup\u003eTM\u003c/sup\u003e Speak Test, the TEM-4 Oral Test, the CET-SET, and the CELST. Various theoretical motivations have driven the emergence of integrated speaking tasks. For instance, communicative language ability should target at the capacity to implement knowledge in communicative language use (Bachman, 1990); integrated tasks entail the amalgamation of multiple L2 skills (Huang and Hung, 2018). Among the various studies conducted on integrated speaking tasks, those on TOEFL iBT integrated speaking tasks accounted for a great portion. 
They focused mainly on the comparison between different types of TOEFL iBT integrated speaking tasks and independent speaking tasks (Huang et al., 2018; Zhang et al., 2022) or between TOEFL iBT integrated speaking tasks and other academic speaking tasks (Brown and Ducasse, 2019; Farnsworth, 2013); rating issues (Wei and Liosa, 2015); textual properties and strategy uses (Crossley and Kim, 2019; Inoue and Lam, 2021; Zhang et al., 2021); and source material use (Frost et al., 2019; Frost et al., 2021). Apart from the studies conducted on TOEFL iBT integrated speaking tasks, there are also some studies performed on other integrated speaking tasks and their research perspectives include source material use (Kormos et al., 2022; Lin, 2023; Pusey, 2020), rating (Hirai and Koizumi, 2013; Kim, 2015), affective factors (Ishikawa, 2020), strategy use (Rukthong, 2021; Rukthong and Brunfaut, 2020), fluency of oral products (Suzuki and Kormos, 2023), test fairness (Yan et al., 2019), and task acceptance (Zhang and Zhang, 2022), etc.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eCompared with the large number of studies on integrated speaking tasks conducted internationally, researches on such tasks in Chinese context remains relatively limited (Jin, 2012; Rui and Ji, 2017; Zeng, 2011), which is not well aligned with its substantial population of ESL learners and the great effort devoted to English teaching and learning. Jin (2012), Zeng (2011), and Zhou (2005) probed into the influences of source material input in integrated speaking tasks on college students\u0026rsquo; oral performance and the results confirmed its positive impacts in relieving anxiety and classroom reticence, facilitating autonomous learning, and promoting oral language output. Meanwhile, Zhang and Elder (2009) examined the reliability and validity of CET-SET from multi-perspectives, namely, authenticity, interactiveness, fairness, and washback effects. 
Recently, Tsang and Lee (2023) proved that foreign language-related emotions (anxiety, boredom, and enjoyment), speaking motivation, and spoken input beyond the classroom connected directly to Year 3 and Year 4 EFL leaners\u0026rsquo; speaking proficiency in Hongkong primary schools, with enjoyment and spoken input beyond classroom serving as direct predictive power. Among them, those particularly related with the present study were those on the National Matriculation English Test (Shanghai Version) (hereafter, NMET(SH)) (Hou, 2018; Liu and Chen, 2018; Xu, 2016, 2021; Zhang, 2019), especially when the characteristics of test takers (e.g. overall language ability, number of participants) were taken into consideration. Their studies confirmed that the NMET(SH) assessed students\u0026rsquo; ability of applying various sources of information into completing the integrated listening-speaking tasks and exerted positive washback effects.\u0026nbsp;\u003c/p\u003e\n\u003ch2\u003e2.2 CELST\u0026nbsp;\u003c/h2\u003e\n\u003cp\u003eIt is early since World War II that political needs have been exerting various kinds of influences on the form and scoring of speaking test (Fulcher, 2003, p.1). China is a case in point. \u003cem\u003eGeneral Senior High School Curriculum Standards\u003c/em\u003e (hereafter, GSHSCS) (Ministry of Education of the PRC, 2020) provided a comprehensive guidance on the general teaching goals, content standards, and teaching and assessment approaches for senior high school English courses. According to\u003cem\u003e\u0026nbsp;the GSHSCS\u0026nbsp;\u003c/em\u003e(2020), the general aim of senior high school English curriculum is to help students to cultivate and develop students\u0026rsquo; subject core competencies which include language abilities, cultural awareness, thinking capacity, and learning ability (Ministry of Education of the PRC, 2020, p.5). 
Meanwhile, students should be able to acquire English learning resources through multiple channels, choose appropriate strategies and methods, monitor, evaluate, reflect on, and adjust learning content and progress (Ministry of Education of the PRC, 2020, p.6).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eCELST, a response of the aim of cultivating senior high school students\u0026rsquo; comprehensive English ability, has been implemented since 2011. There are three sub-tasks in CELST\u003ca href=\"#_ftn1\" name=\"_ftnref1\" title=\"\"\u003e\u003c/a\u003e\u003csup\u003e1\u003c/sup\u003e, namely, reading aloud, role play, and story retelling. Test design of CELST works for the selective purposes. According to \u003cem\u003ethe Test Syllabus and Sample Paper Disk for Computer-based English Listening and Speaking Test\u003c/em\u003e (EEA-GD, 2016), CELST aims at assessing students\u0026rsquo; ability of accomplishing communicative tasks in specific contexts by using English, acquiring and applying their knowledge of phonology, vocabulary, grammar, etc. to comprehend and express effectively. Over the past decade, the number of candidates taking the CELST has consistently exceeded 630,000 annually. In both 2022 and 2023, this figure reached 710,000, reflecting the sustained demand for English proficiency assessment in Guangdong province\u0026rsquo;s higher education admissions process. These figures also indicate that a substantive study of CELST is of great significance considering the large candidate population and the competitive nature of the college entrance examination.\u003c/p\u003e\n\u003cp\u003eLarge amount of test takers it owns, few studies were conducted on it. Cheng\u0026rsquo;s (2011) doctoral dissertation focused solely on the validity issues of the story-retelling task in CELST, leaving the first two sub-tasks unexamined. So do the original researches conducted by Wang et al. (2018) and Xu and his co-workers (2019, 2020, 2021, 2023a, 2023b, 2024, 2025). 
Zhan and Wan (2016) investigated into Senior III students\u0026rsquo; attitudes, test preparation practices and test taking processes when completing CELST. Zhou and Zeng (2016) compared the rating results between human raters and computer automated scoring of CELST by using many-facet RASCH models and found that despite of the differences in rater severity between these two scoring approaches, computer automated scoring was of better inner-consistency due to lower bias rates.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eTo sum up, the above reviewed validation studies on integrated speaking tasks are helpful in conceptualizing construct and the validation procedure. However, language test taking is a process that is task-bound and context specific (Cohen, 2014). Test response is a function not only of the items, tasks, or stimulus conditions, but also of the participants\u0026rsquo; responding and the context of measurement (Messick, 1987). In this light, different types of integrated speaking tasks would set different requirements on the amount, degree, and form of information in the listening/video clip that can be integrated into the test-takers\u0026rsquo; oral response. Besides, though diverse results have been reported for different types of integrated speaking tests, researches on CELST remain scare, especially the first two sub-tasks. What\u0026rsquo;s more, most existing studies have largely relied on correlational approaches, such as examining the relationship between textual features and test performance or between strategy use and test performance. Few, if any, have probed into its construct validity directly through in-depth analysis of test-takers\u0026rsquo; actual performance. On top of that, as a large-scale high-stake test, studies addressing Chinese senior high school EFL learners\u0026rsquo; listening-speaking performance in NMET context are in dire need concerning the great influence it exerts on education both nationally and internationally. 
Therefore, collection of validity evidence of CELST using a more mathematically-grounded validation testing theory and methodology is of great necessity for the sound and comprehensive interpretation of its test scores as well as test constructs.\u003c/p\u003e\n\u003ch2\u003e2.3\u0026nbsp;Models of integrated speaking tasks\u0026nbsp;and\u0026nbsp;their conceptualizations\u003c/h2\u003e\n\u003cp\u003eTheoretical models serve various functions in test validity development and validation, such as score interpretation, test development, curriculum design, etc. (Luoma, 2004, p.96). The necessity of grounding language test development and application in a theory of language proficiency calls for the incorporation of a theoretical framework that defines what language proficiency is (Bachman, 1990, p.81). Besides, a task-specific theoretical model does not only represent and encompass task construct but also function as the foundation of evaluation and assessment (Luoma, 2004, p.107).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eEnlightened by the various frameworks related to listening and/or speaking, it could be concluded that a model of listening-speaking ability should firstly take an integrative, interactive, or communicative stand, for language use both in natural and academic contexts are inherently integrative, interactive, or communicative. Meanwhile, it should also serve as a guideline for the assessment of both language product parameters and speaking process parameters, for the inclusion of the former could guide leaners\u0026rsquo; self-learning, teachers\u0026rsquo; teaching, and raters\u0026rsquo; rating practices, and the encompass of the latter could help examine participants\u0026rsquo; cognitive and strategic processes, which would help finding factors influencing task completion (Bachman, 1990). 
Therefore, an operationalized working model of the construct of CELST will be developed based on the analysis of existing theoretical models, the \u003cem\u003eTest Specifications\u003c/em\u003e, and the specific test purposes of CELST (EEA-GD, 2016), which will be presented in Section 2.5.1.\u003c/p\u003e\n\u003ch2\u003e2.4\u0026nbsp;Theory of validity \u0026amp; validation\u003c/h2\u003e\n\u003cp\u003eBeing aware of the shift from multiple types of validity to a unitary understanding and from a focus on prediction to one on explanation, Messick (1987) conceptualized validity as a unitary concept and proposed a comprehensive framework consisting of six interrelated aspects, namely, content, substantive, structural, generalizability, external, and consequential validity. Among them, the external aspect of construct validity, which includes convergent and discriminant evidence obtained through multitrait-multimethod comparisons, as well as evidence of criterion relevance applied utility (Messick, 1987), delineates the extent to which the construct represented in the assessment accounts for the external pattern of correlations.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eBesides, Chapelle put that a validity argument should integrate both evidence and rationales to support conclusions about the appropriateness and soundness of score-based inferences and uses of a test (1999, p.263) and held that validation should begin with formulating a hypothesis concerning the appropriateness of testing outcomes (1999, p.259). For the external aspect of validity, Chapelle (1999) stated that data analyzed through a multitrait-multimethod research design could be used to examine the relationships between the target test and other tests or quantifiable performance indicators. 
Earlier, influenced by psychological structuralism and the idea of explaining performance through the systems and subsystems of underlying processes, Embretson (1983) introduced construct modeling to construct validation research. According to construct modeling, the internal structure and substance of a test can be addressed more directly by means of causal modeling of item or task performance. Embretson (1983) suggested separating two kinds of construct validation research: construct representation, which identifies the theoretical mechanisms underlying task performance (a task decomposition process), and nomothetic span, which examines the relationships of a test to other measures (e.g., the strength, frequency, and pattern of significant relations). This separation can be achieved through mathematical modeling, psychological modeling, or multicomponent latent trait modeling, which combines features of the former two (1983, p.181).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eMoreover, the \u003cem\u003eStandards for Educational and Psychological Testing\u0026nbsp;\u003c/em\u003e(hereafter the \u003cem\u003eStandards\u003c/em\u003e) (AERA et al., 2014) provides criteria for the development and evaluation of tests and testing practices, as well as guidelines for assessing the validity of test score interpretations for their intended uses. According to the \u003cem\u003eStandards\u003c/em\u003e, evidence based on test content may involve logical or empirical analyses to determine how adequately the content of a test represents the targeted domain in relation to the proposed interpretations of test scores (AERA et al., 2014, p.14). However, Xu et al. 
(2024) pointed out that, to date, no public report has examined the content validity of the CELST.\u003c/p\u003e\n\u003cp\u003eFollowing the guidance of the \u003cem\u003eStandards\u003c/em\u003e, the present study decomposes test performance into coded construct-relevant textual measures and strategic measures and employs a multicomponent model based on the testing purpose of CELST, in order to explore the extent to which these data-driven measures represent and are related to the specific inferences to be made from test scores.\u003c/p\u003e\n\u003ch3\u003e2.5 The present study\u003c/h3\u003e\n\u003ch3\u003e2.5.1 Working taxonomy of CELST\u003c/h3\u003e\n\u003cp\u003eNone of the existing theoretical models of speaking ability assessment is fully appropriate as a theoretical framework for the CELST, given the integrative nature of the task and its target participants and context. Hence, a task-specific theoretical framework of CELST that reflects the task construct and guides evaluation and assessment by offering wordings and providing criteria (Luoma, 2004, p.107) is needed.\u003c/p\u003e\n\u003cp\u003eAn operationalized working model of the construct of CELST (Figure 1) was developed based on a series of considerations. These include an analysis of the GSHSCS (Ministry of Education of the PRC, 2020), the test aims issued in \u003cem\u003ethe Test Syllabus and Sample Paper Disk for Computer-based English Listening and Speaking Test\u003c/em\u003e (EEA-GD, 2016), personal communication with the test designer, a review of the theoretical models related to listening and/or speaking, the notion that combining the assessment of the product with that of the process of producing it (Bygate, 1987) makes construct representation sounder, and the adoption of the internal operation process from the COE model (Chapelle et al., 1997). 
Thus, the construct model of CELST is assumed to entail the following principles and conceptions: 1) the context part, on the left of the figure, refers to factors related to task characteristics, candidates, raters, etc. that affect candidates\u0026rsquo; performance; 2) the internal operation part, the core of the working model symbolized by the largest block, is the underlying process running through task completion; and 3) performance is the result of the oral production (Gist \u0026amp; Britol, 2020).\u0026nbsp;\u003c/p\u003e\n\u003ch3\u003e2.5.2 SEM \u0026amp; Bi-factor E-SEM\u003c/h3\u003e\n\u003cp\u003eStructural equation modeling (SEM) is an \u003cem\u003ea priori\u0026nbsp;\u003c/em\u003ehypothesis-testing approach based on theoretical knowledge or previous research findings. It examines the nature of the relationships between the observed variables and the latent variables by using a confirmatory and hypothesis-testing approach (Bollen, 1989; Byrne, 2013) and has been widely used in education and psychology, including the field of language testing (D\u0026ouml;rnyei, 2007; Phakiti, 2008). Powerful as it is, traditional SEM has limitations. For example, fixing factor loadings to zero can undermine the believability and replicability of the resulting models (Asparouhov and Muth\u0026eacute;n, 2009).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eExploratory Structural Equation Modeling (ESEM), a relatively novel approach, has emerged as a promising alternative that overcomes some of these challenges (van Zyl and ten Klooster, 2022). ESEM is an exploratory factor analysis measurement model with rotations used in a structural equation model and functions as a combination of confirmatory factor analysis and exploratory factor analysis (Marsh et al., 2009, 2010). 
All the SEM parameters are accessible in ESEM, such as residual correlations, regressions of factors on covariates, regressions among factors, standard errors for all rotated parameters, and overall model fit indices (Asparouhov and Muth\u0026eacute;n, 2009, pp.398-399). Meanwhile, bifactor models test the presence of a global unitary construct underlying the answers to all items (G-factor) and whether this global construct co-exists with meaningful specificities (S-factors) defined by the part of the items not explained by the G-factor (Swami et al., 2023). Since bi-factor ESEM is suited to contexts in which no clear \u003cem\u003ea priori\u003c/em\u003e structure exists (Swami et al., 2023), the present paper employs a bi-factor ESEM, a more flexible, data-driven approach, to examine the extent to which the data, as reflected by the test scores and coded values, align with the proposed theoretical construct of CELST.\u003c/p\u003e\n\u003ch3\u003e2.5.3 Research questions\u003c/h3\u003e\n\u003cp\u003eAs mentioned above, participants\u0026rsquo; performances as well as the strategic behaviors involved illustrate the construct of integrated speaking tasks. However, previous studies neither involved a large number of ESL learners of low to intermediate English language ability nor tested their performance on a computer-based integrated speaking task such as the CELST. To bridge this gap, the present study drew on bi-factor ESEM and hierarchical multiple regression techniques to identify the extent to which the CELST can tease out the supposed construct. 
Specifically, we addressed the following two research questions.\u003c/p\u003e\n\u003col\u003e\n \u003cli\u003eTo what extent do participants\u0026rsquo; performances activate and reflect the textual measures in the construct of CELST?\u003c/li\u003e\n \u003cli\u003eTo what extent do participants\u0026rsquo; performances activate and reflect the communicative strategy use in the construct of CELST?\u003c/li\u003e\n\u003c/ol\u003e"},{"header":"3. Method","content":"\u003ch2\u003e3.1 Participants\u003c/h2\u003e\n\u003ch3\u003e3.1.1\u0026nbsp;Student participants\u003c/h3\u003e\n\u003cp\u003eWith the assistance of the Education Examinations Authority of Guangdong Province (EEA-GD), the CELST performance data of a total of 360 students were included. To protect participants\u0026rsquo; personal privacy, EEA-GD provided only the overall score and detailed subtask scores in an anonymous manner, without any personal information such as names or student IDs. Influential factors such as overall oral English proficiency level (Brown et al., 2005; Iwashita et al., 2008; Swain et al., 2009) and geographical region (Jin and Wu, 2010) were taken into consideration when sampling participants. Firstly, for convenience, only Test A was sampled, although six parallel tests were used in that year. Secondly, all students were stratified into three groups (high, medium and low) based on their total score reported by EEA-GD. Students whose CELST scores ranked in the top 20 out of 100 were assigned to the high-level group, those in the bottom 50 were assigned to the low-level group, and the remaining students were placed in the medium-level group. This stratification criterion corresponds to the enrollment differences among key universities, higher vocational colleges, and general colleges or universities. The participants in each proficiency group were saved as a separate file. 
Thirdly, all the participants in each proficiency level file were classified into four geographical region groups, namely north Guangdong, east Guangdong, west Guangdong, and the Pearl River Delta. To be specific, 30 participants were sampled randomly from each geographical region group at each proficiency level. In other words, 360 participants (30 \u0026times; 4 \u0026times; 3 = 360) were finally recruited. Table 1 describes the composition of the student participants.\u0026nbsp;\u003c/p\u003e\n\u003ch3\u003e3.1.2 Transcribers\u0026nbsp;and coders\u003c/h3\u003e\n\u003cp\u003eApart from the first author of the present study, another seven M.A. students majoring in English language testing participated in the transcription work. All of them had at least two years of experience in rating CELST before doing the present transcription work. Training and trial transcription were conducted before formal transcription. The ten recordings used for the pilot transcription were sampled randomly from students\u0026rsquo; performances in the CELST conducted one year earlier than the test used in the present study. In addition, the second author and another Ph.D. candidate in English language testing randomly checked the final transcriptions; forty transcriptions in total were double-checked.\u003c/p\u003e\n\u003cp\u003eThe coding work of the present study was divided into two parts. One part had to be done manually with the help of WinMax\u003ca href=\"#_ftn1\" name=\"_ftnref1\" title=\"\"\u003e\u003c/a\u003e\u003csup\u003e2\u003c/sup\u003e to collect data on textual and communicative strategy use measures; the other could be done automatically with WordSmith Tools, Praat, and the online resources of Lu Xiaofei (http://aihaiyang.com/software/). Apart from the seven M.A. transcribers mentioned above, another two Ph.D. 
candidates in English language testing, one in her first year and the other in her second year, participated in the first part of the coding work. All nine coders were trained together on how to do the coding work with the software WinMax. Following the approach of Brown et al. (2005), thirty-six samples were double-coded to ensure the reliability of the coding results. The second part of the coding work was done by the first author and then checked by the second author for issues such as missing data and input accuracy.\u003c/p\u003e\n\u003cp\u003eTable 1 A Description of the Student Participants\u003c/p\u003e\n\u003cdiv align=\"\"\u003e\n \u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 152px;\"\u003e\n \u003cp\u003e\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; Proficiency\u003c/p\u003e\n \u003cp\u003eRegion\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 43px;\"\u003e\n \u003cp\u003ehigh\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 67px;\"\u003e\n \u003cp\u003emedium\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 38px;\"\u003e\n \u003cp\u003elow\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 43px;\"\u003e\n \u003cp\u003etotal\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 152px;\"\u003e\n \u003cp\u003enorth Guangdong\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 43px;\"\u003e\n \u003cp\u003e30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 67px;\"\u003e\n \u003cp\u003e30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 38px;\"\u003e\n \u003cp\u003e30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 43px;\"\u003e\n \u003cp\u003e90\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n 
\u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 152px;\"\u003e\n \u003cp\u003eeast Guangdong\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 43px;\"\u003e\n \u003cp\u003e30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 67px;\"\u003e\n \u003cp\u003e30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 38px;\"\u003e\n \u003cp\u003e30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 43px;\"\u003e\n \u003cp\u003e90\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 152px;\"\u003e\n \u003cp\u003ewest Guangdong\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 43px;\"\u003e\n \u003cp\u003e30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 67px;\"\u003e\n \u003cp\u003e30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 38px;\"\u003e\n \u003cp\u003e30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 43px;\"\u003e\n \u003cp\u003e90\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 152px;\"\u003e\n \u003cp\u003ePearl River Delta\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 43px;\"\u003e\n \u003cp\u003e30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 67px;\"\u003e\n \u003cp\u003e30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 38px;\"\u003e\n \u003cp\u003e30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 43px;\"\u003e\n \u003cp\u003e90\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 152px;\"\u003e\n \u003cp\u003eTotal\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 43px;\"\u003e\n \u003cp\u003e120\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 
67px;\"\u003e\n \u003cp\u003e120\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 38px;\"\u003e\n \u003cp\u003e120\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 43px;\"\u003e\n \u003cp\u003e360\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n\u003c/div\u003e\n\u003ch2\u003e3.2 Instruments and materials\u003c/h2\u003e\n\u003ch3\u003e3.2.1 Integrated\u0026nbsp;speaking tasks\u003c/h3\u003e\n\u003cp\u003eThe oral data were collected from the three subtasks of Test A of CELST in a certain year. Figure 2 is a presentation of the screen interface of CELST. Appendix A presented the details of Test A of CELST used in the present study.\u0026nbsp;\u003c/p\u003e\n\u003ch3\u003e3.2.2 Rating rubrics\u003c/h3\u003e\n\u003cp\u003eRating rubrics for CELST (Appendix B) were used in the present study not for the purpose of scoring performance as it usually employed, but contributing to the establishment of the working framework of the construct of the CELST and the taxonomy of coding.\u0026nbsp;\u003c/p\u003e\n\u003ch3\u003e3.2.3 Coding scheme\u003c/h3\u003e\n\u003cp\u003eTest takers\u0026rsquo; performances were coded from the perspectives of oral product textual measures and communicative strategies used as presented in Figure 1. The final framework of test takers\u0026rsquo; oral product textual measures comprises 25 coding items under five aspects, namely, accuracy, complexity, coherence, fluency, and phonology, while their strategies used would be assessed from reduction and achievement perspectives. The detailed variables that measure test takers\u0026rsquo; oral products and strategies used could be referred in Appendix C.\u003c/p\u003e"},{"header":"4. 
Results and Discussions","content":"\u003ch2\u003e4.1 Preliminary analysis\u0026nbsp;\u003c/h2\u003e\n\u003cp\u003eFirstly, the first author of this paper rechecked that the total scores, the sub-scores for each subtask, and the corresponding video tape recordings were all included, and this was confirmed. Secondly, during data checking, it was found that the frequencies of several coding parameters\u003ca href=\"#_ftn1\" name=\"_ftnref1\" title=\"\"\u003e\u003c/a\u003e\u003csup\u003e3\u003c/sup\u003e were zero for all participants. Thus, the parameter \u0026ldquo;number of internals\u0026rdquo; was deleted due to zero occurrence across all three subtasks. Thirdly, the observed variable SpM was dropped because of its high correlation with the observed variable MSpM, which is more linguistically informative.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe inter-coder reliability (Cronbach\u0026rsquo;s \u0026alpha;) for all parameters was over 0.8 (see Appendix D), indicating that the two coders were consistent with each other in the coding process. This further supported the reliability of the coding results in the main analysis.\u003c/p\u003e\n\u003ch2\u003e4.2 The competence model of CELST\u0026nbsp;\u003c/h2\u003e\n\u003cp\u003eBased on the previous literature on integrated speaking assessment and second language assessment, a set of indexes was used in several alternative hypothesized models of CELST to explore the relationship between the theoretical model proposed for CELST and the working construct (see Figure 3) from the perspective of language competence parameters, addressing Research Question 1. As stated in Section 3.2.3, five factors, namely, accuracy, complexity, coherence, fluency, and phonology, were employed as latent factors in the competence model of CELST. Initially, the researchers assessed the model fit indexes of the hypothesized competence Model 1 using traditional second-order SEM. 
However, the analysis failed to converge because the maximum number of iterations was exceeded.\u003c/p\u003e\n\u003cp\u003eA detailed inspection of the data found that most CpT values were smaller than 2 (CpT appeared less than twice in each participant\u0026rsquo;s performance), indicating that CpT is inappropriate as a measure for differentiating senior high school students\u0026rsquo; oral proficiency. Iwashita et al. (2008: 45) also raised doubts about using ratio measures to represent complexity, since NoT and NoC in participants\u0026rsquo; oral production increase as their overall oral language proficiency goes up, but the value of the ratio does not. Besides, the negative correlation between CpT and the total score (-.018) also runs counter to expectation, because a higher CpT usually indicates higher overall language proficiency and a higher score. Therefore, the observed variables NoC and NoT were used instead of CpT and MLT to measure language complexity. In addition, the variable RMS was discarded in the new model because of its non-significant relationship with the total score (-.063). However, the new model still could not converge. After inspection of the observed variables in the new model, those whose correlation with the total score was smaller than 0.3, specifically DysM and RRSU, were deleted. Thus, a hypothesized bi-factor ESEM (Model 2) (see Figure 4), instead of the traditional SEM, was proposed based on the data feedback from Model 1. \u0026nbsp;\u003c/p\u003e\n\u003cp\u003eResults of the bi-factor ESEM indicated that the model (Figure 4: Model 2) fits\u003ca href=\"#_ftn2\" name=\"_ftnref2\" title=\"\"\u003e\u003c/a\u003e\u003csup\u003e4\u003c/sup\u003e the data well. The Chi-square in the present model was 116.489, with 85 degrees of freedom. \u0026chi;\u003csup\u003e2\u003c/sup\u003e/df was 1.37, indicating a good fit between the present model and the data. 
The values of RMSEA and SRMR were much smaller than 0.5, also indicating good model fit. Thus far, all three absolute fit indexes demonstrated that Model 2 fits the present data well. CFI and TLI are two indexes of comparative fit. The higher these two values are, the better the model fits. They both range from 0 to 1, and a value larger than 0.9 indicates good model fit. These two fit indexes also indicated that Model 2 fits the present data well. Besides, almost all observed variables loaded on the general factor, listening-speaking ability, and groups of them clustered and loaded on one of the domain-specific factors (accuracy, complexity, coherence, fluency, and phonology). Such a loading pattern supported the use of bi-factor ESEM for the present data. However, some minor issues in the details deserve closer examination. For one thing, RMP (ratio of mispronounced phonemes) loaded on overall listening-speaking ability with a positive value, which is against the theory: the more phonemes one mispronounces, the poorer one\u0026rsquo;s oral English would be. For another, RUiF loaded effectively (0.674) on its domain-specific factor, phonology, but not on the general listening-speaking ability factor. Finally, the factor loading of PRO on its domain-specific factor (-0.204) was lower in magnitude than the accepted value (0.3) indicating a meaningful relationship between observed variable(s) and a latent variable, while its contribution to the general listening-speaking ability factor was rather strong (-0.747). Therefore, modifications were made. Firstly, the index RMP was discarded. Even though the factor loading values in the modified model (Model 2.1) were somewhat better than those in Model 2, RUiF still did not load on general listening-speaking ability, which contradicts the assumption of bi-factor ESEM analysis. 
Besides, PRO still loaded poorly on phonology (-0.207), though significantly. However, the modification indexes did not suggest any modification involving PRO. Thus, RUiF was deleted in the modified Model 2.2 to see whether the deletion of this observed variable would result in better model fit and interpretations.\u003c/p\u003e\n\u003cp\u003eAs can be seen in Figure 5 (Model 2.2), all observed indicators in the new model loaded on the overall listening-speaking ability and their own domain-specific latent factors.\u0026nbsp;As a whole, the finalized model presented a good level of model-to-data fit (\u0026chi;\u003csup\u003e2\u003c/sup\u003e=79.97, df=60; \u0026chi;\u003csup\u003e2\u003c/sup\u003e/df=1.33; RMSEA=0.04; SRMR=0.02; CFI =1.00; TLI=0.976). All factor loadings of the observed variables on their corresponding latent variables were significant at the 5% level.\u0026nbsp;\u003c/p\u003e\n\u003ch2\u003e4.3 Strategy use in CELST\u003c/h2\u003e\n\u003cp\u003eAs illustrated in Section 2.5.1 and Section 3.2.3, the communicative strategies adopted in the present study consist of two macro-parameters, namely reductive strategies and achievement strategies, representing two different model approaches. Hierarchical multiple regression is often used when the independent variables are entered cumulatively according to a hierarchy specified by the researchers on theoretical grounds (Gaciu, 2021; Pallant, 2020). Therefore, hierarchical multiple regression was employed to examine whether and to what extent the two macro-parameters of communicative strategies used by participants influenced and predicted their overall performance, as well as their performance on individual aspects separately. The assumptions of multiple linear regression analysis, including sample size, multicollinearity and singularity, outliers, normality, and linearity, were all checked and confirmed. 
Firstly, a total of 360 participants met the sample size requirement for conducting multiple linear regression analysis (N \u0026ge; 50 + 8m, where m refers to the number of independent variables) (Tabachnick \u0026amp; Fidell, 2013). Secondly, with all Pearson correlation values falling between -0.2 and 0.3, Tolerance values of more than 0.10, and VIF values well below the cut-off of 10, multicollinearity was not a concern. Thirdly, visual inspection of the normal probability plot of the regression standardized residuals confirmed the normality of the data (see Appendix E).\u003c/p\u003e\n\u003cp\u003eAs shown in Table 2, R square of the reductive model is 0.065. The value rose to 0.168 after the four achievement strategy parameters were entered; that is, the R square change was 0.103. In other words, the present communicative strategy model explained only 16.8 per cent of the variance in participants\u0026rsquo; scores. Results from ANOVA indicated that this model did achieve statistical significance (F=33.35, Sig.=.00). 
Detailed inspection of the coefficients analysis demonstrated that apart from App, all the other five parameters made significant unique contributions to the prediction of participants\u0026rsquo; overall CELST score (Sig.\u0026lt;.05).\u003c/p\u003e\n\u003cp\u003eTable 2 multiple regression analysis on communicative strategy variables and score\u003c/p\u003e\n\u003cdiv align=\"\"\u003e\n \u003ctable border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"671\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd colspan=\"3\" valign=\"top\" style=\"width: 199px;\"\u003e\n \u003cp\u003emodel summary\u003csup\u003ec\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 28px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" valign=\"top\" style=\"width: 95px;\"\u003e\n \u003cp\u003eANOVA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" valign=\"top\" style=\"width: 56px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"3\" valign=\"top\" style=\"width: 199px;\"\u003e\n \u003cp\u003ecoefficients\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003eR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 66px;\"\u003e\n \u003cp\u003eR Square\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003eAdjusted R Square\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eR square change\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 28px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
style=\"width: 57px;\"\u003e\n \u003cp\u003eF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 38px;\"\u003e\n \u003cp\u003eSig\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 47px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003eStandardized\u0026beta;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 57px;\"\u003e\n \u003cp\u003et\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 47px;\"\u003e\n \u003cp\u003eSig.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"2\" style=\"width: 47px;\"\u003e\n \u003cp\u003e.25\u003csup\u003ea\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd rowspan=\"2\" style=\"width: 66px;\"\u003e\n \u003cp\u003e.07\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd rowspan=\"2\" style=\"width: 85px;\"\u003e\n \u003cp\u003e.06\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd rowspan=\"2\" style=\"width: 76px;\"\u003e\n \u003cp\u003e.07\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 28px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd rowspan=\"2\" style=\"width: 57px;\"\u003e\n \u003cp\u003e37.27\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd rowspan=\"2\" style=\"width: 38px;\"\u003e\n \u003cp\u003e.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 47px;\"\u003e\n \u003cp\u003eMA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003e-.17\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 57px;\"\u003e\n \u003cp\u003e-5.89\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
valign=\"top\" style=\"width: 47px;\"\u003e\n \u003cp\u003e.00\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 28px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 47px;\"\u003e\n \u003cp\u003eSR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003e-.18\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 57px;\"\u003e\n \u003cp\u003e-5.98\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 47px;\"\u003e\n \u003cp\u003e.00\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"6\" style=\"width: 47px;\"\u003e\n \u003cp\u003e.41\u003csup\u003eb\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd rowspan=\"6\" style=\"width: 66px;\"\u003e\n \u003cp\u003e.17\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd rowspan=\"6\" style=\"width: 85px;\"\u003e\n \u003cp\u003e.16\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd rowspan=\"6\" style=\"width: 76px;\"\u003e\n \u003cp\u003e.10\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 28px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd rowspan=\"6\" style=\"width: 57px;\"\u003e\n \u003cp\u003e33.35\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd rowspan=\"6\" style=\"width: 38px;\"\u003e\n \u003cp\u003e.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 47px;\"\u003e\n \u003cp\u003eMA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003e-.21\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 
57px;\"\u003e\n \u003cp\u003e-7.16\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 47px;\"\u003e\n \u003cp\u003e.00\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 28px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 47px;\"\u003e\n \u003cp\u003eSR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003e-.22\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 57px;\"\u003e\n \u003cp\u003e-7.87\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 47px;\"\u003e\n \u003cp\u003e.00\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 28px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 47px;\"\u003e\n \u003cp\u003eApp\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003e.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 57px;\"\u003e\n \u003cp\u003e.16\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 47px;\"\u003e\n \u003cp\u003e.87\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 28px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 47px;\"\u003e\n \u003cp\u003eGUS\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" valign=\"top\" 
style=\"width: 104px;\"\u003e\n \u003cp\u003e.07\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 57px;\"\u003e\n \u003cp\u003e2.45\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 47px;\"\u003e\n \u003cp\u003e.02\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 28px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 47px;\"\u003e\n \u003cp\u003ePAR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003e.09\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 57px;\"\u003e\n \u003cp\u003e3.17\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 47px;\"\u003e\n \u003cp\u003e.00\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 28px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 47px;\"\u003e\n \u003cp\u003eRES\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" valign=\"top\" style=\"width: 104px;\"\u003e\n \u003cp\u003e.27\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 57px;\"\u003e\n \u003cp\u003e9.05\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 47px;\"\u003e\n \u003cp\u003e.00\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n\u003c/div\u003e\n\u003cp\u003ea. Predictors: (Constant), MA, SR\u003c/p\u003e\n\u003cp\u003eb. Predictors: (Constant), MA, SR, App, GUS, PAR, RES\u003c/p\u003e\n\u003cp\u003ec. Dependent Variable: score\u003c/p\u003e"},{"header":"5. 
Discussions","content":"\u003cdiv id=\"Sec24\" class=\"Section2\"\u003e\u003ch2\u003e5.1 Textual measures perspective\u003c/h2\u003e\u003cp\u003eSuch a loading pattern (see Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e: Model 2.2) confirms that CELST does examine participants\u0026rsquo; ability to fulfill three different types of listening-speaking tasks by applying their knowledge of accuracy, complexity, coherence, fluency, and phonology. This accords with the stated purpose of the test: testing students\u0026rsquo; ability to accomplish tasks in specific contexts by acquiring and applying their knowledge of English phonetics, vocabulary, grammar, etc., particularly for comprehension and expression (EEA-GD, 2016). The relatively high loadings on general listening-speaking ability also suggest that CELST implements the GSHCSC requirement that senior high school English courses should stress cultivating students\u0026rsquo; ability to acquire and process information by means of English and to analyze and solve problems in English, especially the ability to think and express themselves in English, thereby developing students\u0026rsquo; overall ability to use English.\u003c/p\u003e\u003cp\u003eBesides, the different numbers of observed indicators for these five domain-specific factors and their different weights not only reflect the comprehensiveness of CELST but also indicate that its emphasis on the dimensions of language ability may differ across sub-tasks, that is, the contextual specificity of its three sub-tasks. As shown by Model 2.2, REFC, RCVF, and RRSC load on accuracy; TTR, NoC, and NoT on complexity; MSpM, AR, MLR, and DysM on fluency; PRA, PRO, and PR on phonology; and COM on coherence. 
Each of the loadings on accuracy and fluency exceeds 0.4, with some larger than 0.7, indicating that these textual measures are strong indicators of accuracy and fluency. However, the picture is less sound for complexity and coherence: the loadings of TTR and COM on their corresponding domain-specific factors are just over 0.3, a borderline value. What\u0026rsquo;s worse, the factor loading of RPO on phonology is only \u0026minus;\u0026thinsp;0.208. These different loading values suggest, to some extent, that CELST gives different weights to these five dimensions because it aims to test participants\u0026rsquo; ability to obtain and apply information from diverse source materials and their background knowledge to accomplish given tasks. This might also be attributed to the integrative nature of listening-speaking tasks (Brown et al., \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2005\u003c/span\u003e; Farnsworth, \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e2013\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eMeanwhile, the three sub-tasks (reading aloud, role play, and story retelling) and their corresponding rating criteria might also play a role. The different requirements for senior high school students\u0026rsquo; language ability development in terms of grammar, vocabulary, pronunciation, \u003cem\u003eetc.\u003c/em\u003e with respect to listening ability, speaking ability, and affective attitude (see Section \u003cspan refid=\"Sec4\" class=\"InternalRef\"\u003e2.2\u003c/span\u003e) are another factor that could account for the different weights. It should be noted that phoneme omission or swallowing, particularly the omission of medial or final consonants in words, is common among students from Guangdong when speaking English (Xu et al., \u003cspan citationid=\"CR66\" class=\"CitationRef\"\u003e2025\u003c/span\u003e). 
This kind of consonant simplification may be related to their dialect, Cantonese, which does not have complex consonant clusters at the end of syllables and tends to rely on tonal pitch changes rather than on consonants and their endings. Therefore, in a sense, the significant negative impact of RPO on phonology and listening-speaking ability indicates that raters can recognize and evaluate participants\u0026rsquo; phoneme omission effectively, which is consistent with the findings of Xu et al. (\u003cspan citationid=\"CR66\" class=\"CitationRef\"\u003e2025\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eMoreover, the unidimensional loading of RRSU, ADD, and TEM on general listening-speaking ability and the composition of the domain-specific complexity factor deserve exploration. Whether RRSU should be used as a parameter of accuracy or as a factor representing content remains an open question (Zhou, \u003cspan citationid=\"CR73\" class=\"CitationRef\"\u003e2005\u003c/span\u003e; Brown et al., \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2005\u003c/span\u003e). The high and unidimensional loading of RRSU (0.965) on general listening-speaking ability shows that RRSU is an indispensable and decisive variable in accounting for overall listening-speaking proficiency; however, it should not be regarded as a part of accuracy. As for the coding parameters of coherence, such as ADD and TEM, their loading only on the general factor may be because the reading aloud task only requires test takers to read after the video clip with appropriate pronunciation and intonation, so no cohesive devices are needed in this task. As a result, when CELST is considered as a whole, only the COM variable clustered on the domain-specific coherence factor. 
The results of Model 2.2 also suggest that ratio measures such as CpT are inappropriate for assessing L2 learners\u0026rsquo; oral production, which lends empirical support to the doubts raised by Iwashita et al. (\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e2008\u003c/span\u003e, p. 45). Meanwhile, the high loadings of NoC and NoT on both the general listening-speaking ability factor (NoC: 0.694; NoT: 0.729) and the domain-specific complexity factor (NoC: 0.692; NoT: 0.447) indicate that NoC and NoT could be used as measures of textual complexity.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec25\" class=\"Section2\"\u003e\u003ch2\u003e5.2 Communicative strategy use perspective\u003c/h2\u003e\u003cp\u003eThe hierarchical multiple regression analysis showed that the six indices together explained 16.4% of the score variance. Though not large in absolute terms, this value is considerable given that language competence was not included in the regression model. On the one hand, the four achievement-strategy predictors contributed 10.3% of the total 16.4%, indicating that senior high school participants prefer achievement strategies such as guessing, paraphrasing, and restructuring to reductive strategies when they have difficulty expressing themselves. This differs from Yang\u0026rsquo;s (\u003cspan citationid=\"CR69\" class=\"CitationRef\"\u003e2009\u003c/span\u003e) finding that participants tended to rely more on strategies such as verbatim source use, which was considered unfavorable for their language development. The difference might also arise because participants in the present study were well trained before taking such a large-scale high-stakes test and were therefore familiar with how their performances would be assessed. 
Hence, they would avoid reductive strategies (message abandonment and semantic reduction) so as not to be assessed poorly. Detailed inspection of the coding data showed that no participant used message abandonment in the first sub-task (reading aloud), which requires test takers to repeat immediately what the computer has just played, with the text presented on the computer screen.\u003c/p\u003e\u003cp\u003eOn the other hand, approximation was the only variable that did not contribute significantly to score variance, unlike the other five communicative strategy variables (see Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, t\u0026thinsp;=\u0026thinsp;.16, Sig. \u0026gt;\u0026thinsp;.05). Swain et al. (\u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e2009\u003c/span\u003e, p. 23) also reported that approximation was the communicative strategy used least by participants. According to Fulcher (\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e2003\u003c/span\u003e) and Swain et al. (\u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e2009\u003c/span\u003e), approximation refers to using strategies such as lexical substitution, over-generalization, and exemplification to replace an unknown word or a word that cannot be recalled. Probably, the vocabulary and expressions provided in the source materials of the three CELST sub-tasks facilitated participants\u0026rsquo; oral production by supplying the words and expressions they needed, reducing the need for approximation (Brown et al., \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2005\u003c/span\u003e; Frost et al., \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e2021\u003c/span\u003e).\u003c/p\u003e\u003c/div\u003e"},{"header":"6. 
Conclusions","content":"\u003cp\u003eEmploying a convergent mixed-methods design under the guidance of Messick\u0026rsquo;s unitary concept of construct validity (Messick, \u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e1987\u003c/span\u003e), the present research explored whether and to what extent CELST tests the language competences it claims to test. The results indicated that CELST does test students\u0026rsquo; ability to accomplish certain tasks in specific contexts by acquiring and applying various sources of knowledge in English, such as their encyclopedic knowledge of English (accuracy, complexity, coherence, fluency, and phonology) and of the world, source material content, and communicative strategies (message abandonment, semantic reduction, approximation, guessing, paraphrasing, restructuring). This test construct is similar to that of the integrated speaking tasks in TOEFL as reported in Brown et al. (\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2005\u003c/span\u003e). That is to say, integrated speaking tasks, here CELST, examine participants\u0026rsquo; ability to acquire information from various source materials and their encyclopedic background knowledge and then use it to fulfill oral output tasks (reading aloud, role play, story retelling) under the guidance of the task prompt and with the help of their communicative strategic knowledge (see Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e), which is consistent with the teaching goals of the GSHCSC (Ministry of Education of the PRC, 2020).\u003c/p\u003e\u003cp\u003eHowever, this study has limitations that should be carefully considered when interpreting or generalizing the findings. 
Firstly, the present study examined only participants\u0026rsquo; performances, not the task and candidate factors, that is, the left part of the theoretical model of CELST (see Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e), which limits the generalizability of the results. Secondly, other approaches to data collection, such as interviews or questionnaires eliciting participants\u0026rsquo; self-reports of the task completion process or senior high school teachers\u0026rsquo; feedback, are worth exploring, despite the irreplaceability of the authentic NMET data used in the present study.\u003c/p\u003e\u003cp\u003eKeeping these limitations in mind, the researchers interpret the findings and propose implications for L2 theory and practice. Theoretically, the context-bound, task-specific model adapted from Chapelle et al. (\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e1997\u003c/span\u003e) brings the integrative view of listening-speaking tasks into CELST. Task characteristics and candidates\u0026rsquo; individual differences together shape how candidates deploy their language competence, communicative strategies, and world knowledge in the performances they produce. Data on the influence of task characteristics and of participant factors such as geographical region, gender, and motivation (Tsang and Lee, \u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) could be collected to gather evidence based on relations to other variables and thus examine the generalizability of CELST (AERA et al., 2014). Methodologically, the bottom-up research paradigm, a discourse-based approach, together with bi-factor ESEM can complement existing methods of construct validation in applied linguistics. 
Future comparative studies across the three sub-tasks of CELST, addressing their task-specific constructs as well as the psychological processes during task fulfillment, would enrich research on CELST based on internal structure and response processes (AERA et al., 2014). Pedagogically, the coding parameters contributing to each specific factor and their loading values differed markedly from one another, indicating that different types of integrated speaking tasks should be adopted for different educational purposes. For example, the reading aloud and role play tasks might not be suitable for testing students\u0026rsquo; ability to achieve textual coherence, and it would be wiser to assess senior high school students\u0026rsquo; oral language proficiency on the number of clauses or T-units they produce rather than on ratio measures such as the number of clauses or T-units per sentence.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eFunding Declaration\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that no financial support was received for the conduct of this study, the preparation of this manuscript, or its publication. This research was carried out independently without any external funding from public, commercial, or non-profit organizations.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthics Approval\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe study was reviewed and approved by the Ethics Review Committee of the School of Foreign Languages, Southern Medical University (No. 20230910) on September 10, 2023. 
This study was conducted in accordance with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eInformed consent\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe study \u0026ldquo;Exploring the Construct Validity of Integrated Speaking Tasks: The Case of a Large-scale High-stakes Computer-based Listening-Speaking Task\u0026rdquo; collected 360 senior high school students\u0026rsquo; performance data on Test A of the Guangdong Version of the Computer-based English Listening and Speaking Test of the National Matriculation English Test (hereafter, CELST). All the data were provided by the Education Examination Authority of Guangdong Province (hereafter, EEA-GD) because the first author participated in a Guangdong provincial educational and scientific research program (TJW2013001) that aimed to examine the reliability and validity of the automatic scoring of CELST. In line with provincial regulations and institutional policies, written (signed) consent was not required. EEA-GD provided the overall score and detailed sub-task scores anonymously, without any personal information such as names or student IDs. Anonymity was rigorously guaranteed to ensure that all collected data would be used solely for academic research purposes. All participants in this provincial program are required to keep the obtained data confidential and to use them only for scientific research purposes. 
Additionally, we applied for an exemption from informed consent, which was approved by the Ethics Committee at the School of Foreign Languages, Southern Medical University.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eDue to an agreement with the Education Examination Authority of Guangdong Province, the data used and/or analyzed during the current study are restricted to research purposes only and cannot be made publicly available.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interest\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that they have no competing interests.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eAmerican Educational Research Association, American Psychological Association, and National Council on Measurement in Education (2014) \u003cem\u003eStandards for Educational and Psychological Testing. \u003c/em\u003eWashington, DC: American Educational Research Association. \u003c/li\u003e\n\u003cli\u003eAsparouhov T, Muth\u0026eacute;n B (2009) Exploratory structural equation modeling. \u003cem\u003eStructural Equation Modeling: A Multidisciplinary Journal, 16\u003c/em\u003e(3):397-438. https://doi.org/10.1080/10705510903008204\u003c/li\u003e\n\u003cli\u003eBachman LF (1990) \u003cem\u003eFundamental considerations in language testing.\u003c/em\u003e Cambridge, UK: Cambridge University Press.\u003c/li\u003e\n\u003cli\u003eBarkaoui K, Brooks L, Swain M, Lapkin S (2013) Test-takers\u0026rsquo; strategic behaviors in independent and integrated speaking tasks. \u003cem\u003eApplied Linguistics, 34\u003c/em\u003e(3):304-324. https://doi.org/10.1093/applin/ams046 \u003c/li\u003e\n\u003cli\u003eBollen KA (1989) \u003cem\u003eStructural equations with latent variables\u003c/em\u003e. New York, NY: Wiley.\u003c/li\u003e\n\u003cli\u003eBrown A, Ducasse AM (2019) An equal challenge? 
Comparing TOEFL iBT\u0026trade; Speaking Tasks with Academic Speaking Tasks. \u003cem\u003eLanguage Assessment Quarterly, 16\u003c/em\u003e(2):253-270. https://doi.org/10.1080/15434303.2019.1628240 \u003c/li\u003e\n\u003cli\u003eBrown A, Iwashita N, McNamara T (2005) \u003cem\u003eAn examination of rater orientations and test-taker performance on English-for-academic purposes speaking tasks. TOEFL\u003c/em\u003e\u003cem\u003e\u003csup\u003eR\u003c/sup\u003e\u003c/em\u003e Monograph Series. MS-29. Princeton: Educational Testing Service. \u003c/li\u003e\n\u003cli\u003eBygate M (1987) \u003cem\u003eSpeaking.\u003c/em\u003e Oxford, UK: Oxford University Press. \u003c/li\u003e\n\u003cli\u003eByrne BM (2013) \u003cem\u003eStructural equation modeling with Mplus: Basic concepts, applications, and programming.\u003c/em\u003e Routledge. https://doi.org/10.4324/9780203807644\u003c/li\u003e\n\u003cli\u003eChapelle CA (1999) Validity in language assessment. Annual Review of Applied Linguistics,19:254-272. https://doi.org/10.1017/S0267190599190135 \u003c/li\u003e\n\u003cli\u003eChapelle CA, Grabe W, Berns M (1997) \u003cem\u003eCommunicative language proficiency: Definition and implications for TOEFL 2000. TOEFL\u003c/em\u003e\u003cem\u003e\u003csup\u003eR\u003c/sup\u003e\u003c/em\u003e Monograph Series. MS-10. Educational Testing Service. \u003c/li\u003e\n\u003cli\u003eCheng F X (2011) \u003cem\u003eJustifying the interpretations about a Listening-to-retell task in CELST in NMET(GD).\u003c/em\u003e Guangzhou, China: Guangdong University of Foreign Studies.\u003c/li\u003e\n\u003cli\u003eCohen AD (2014) \u003cem\u003eStrategies in learning and using a second language.\u003c/em\u003e Longman. \u003c/li\u003e\n\u003cli\u003eCrossley SA, Kim YJ (2019) Text integration and speaking proficiency: Linguistic, individual differences, and strategy use considerations. \u003cem\u003eLanguage Assessment Quarterly, 16\u003c/em\u003e(2):217-235. 
https://doi.org/10.1080/15434303.2019.1628239 \u003c/li\u003e\n\u003cli\u003ede Jong NH (2023) Assessing second language speaking proficiency. \u003cem\u003eAnnual Review of Linguistics, 9:\u003c/em\u003e541-560. https://doi.org/10.1146/annurev-linguistics-030521052114 \u003c/li\u003e\n\u003cli\u003eD\u0026ouml;rnyei Z (2007) \u003cem\u003eResearch methods in applied linguistics.\u003c/em\u003e Oxford, UK: Oxford University Press.\u003c/li\u003e\n\u003cli\u003eEducation Examinations Authority of Guangdong Province (2016) \u003cem\u003eTest syllabus and sample paper disk for Computer-based English Listening and Speaking Test (CELST) of National Matriculation English Test (Guangdong Version). \u003c/em\u003eGuangzhou, China: Guangdong Pacific Electronic Press.\u003c/li\u003e\n\u003cli\u003eEmbretson S (1983) Construct validity: Construct representation versus nomothetic span. \u003cem\u003ePsychological Bulletin, 93\u003c/em\u003e(1):179-197. https://www.researchgate.net/publication/289963742 \u003c/li\u003e\n\u003cli\u003eFan J, Yan X (2020) Assessing speaking proficiency: A narrative review of speaking assessment research within the argument-based validation framework.\u003cem\u003e Frontiers in Psychology, 11, \u003c/em\u003e330. https://doi.org/10.3389/fpsyg.2020.00330\u003c/li\u003e\n\u003cli\u003eFarnsworth TL (2013) An investigation into the validity of the TOEFL iBT speaking test for international teaching assistant certification. \u003cem\u003eLanguage Assessment Quarterly, 10\u003c/em\u003e(3):274-291. https://doi.org/10.1080/15434303.2013.769548 \u003c/li\u003e\n\u003cli\u003eFrost K, Clothier J, Huisman A, Wigglesworth G (2019) Responding to a TOEFL iBT integrated speaking task: Mapping task demands and test takers\u0026rsquo; use of stimulus content. \u003cem\u003eLanguage Testing, 37\u003c/em\u003e(1):133-155. 
https://doi.org/10.1177/0265532219860750 \u003c/li\u003e\n\u003cli\u003eFrost K, Wigglesworth G, Clothier J (2021) Relationships between comprehension, strategic behaviours and content-related aspects of test performances in integrated speaking tasks.\u003cem\u003e Language Assessment Quarterly, 18\u003c/em\u003e(2):133-153. https://doi.org/10.1080/15434303.2020.1835918 \u003c/li\u003e\n\u003cli\u003eFulcher G (2003) \u003cem\u003eTesting second language speaking.\u003c/em\u003e Routledge. \u003c/li\u003e\n\u003cli\u003eGaciu N (2021) Understanding quantitative data in educational research. SAGE Publications. \u003c/li\u003e\n\u003cli\u003eGist, C. D., \u0026amp; Bristol, T. J. (Eds.). (2020). \u003cem\u003eFairness in Educational and Psychological Testing.\u003c/em\u003e Washington DC: American Educational Research Association.\u003c/li\u003e\n\u003cli\u003eHirai A, Koizumi R (2013) Validation of empirically derived rating scales for a story retelling speaking test. Language Assessment Quarterly, 10(4):398-422. https://doi.org/10.1080/15434303.2013.824973 \u003c/li\u003e\n\u003cli\u003eHuang HD, Hung SA (2018) Investigating the strategic behaviors in integrated speaking assessment. System, 78(1):201-212. https://doi.org/10.1016/j.system.2018.09.007 \u003c/li\u003e\n\u003cli\u003eHuang HD, Hung SA, Plakans L (2018) Topical knowledge in L2 speaking assessment: Comparing independent and integrated speaking test tasks. \u003cem\u003eLanguage Testing, 35\u003c/em\u003e(1):27-49. https://doi.org/10.1177/0265532216677106\u003c/li\u003e\n\u003cli\u003eHou YP (2018) A study on the washback effect of the reform of SHNMET listening and speaking test. \u003cem\u003eTEFLE, 183\u003c/em\u003e(05):25-31.\u003c/li\u003e\n\u003cli\u003eInoue C, Lam DMK (2021) \u003cem\u003eThe effects of extended planning time on candidates\u0026rsquo; performance, processes, and strategy use in the lecture listening-into-speaking tasks of the TOEFL iBT\u0026reg; test (TOEFL Research Report No. 
RR-93).\u003c/em\u003e Princeton, NJ: Educational Testing Service. https://doi.org/10.1002/ets2.12322 \u003c/li\u003e\n\u003cli\u003eIshikawa S (2020) Influence of learner attributes on complexity, accuracy, and fluency in English oral outputs of Japanese learners. In: Mentz O, Papaja K (eds) \u003cem\u003eFocus on language: Challenging language learning and language teaching in peace and global education\u003c/em\u003e (pp. 43-68). LIT Verlag.\u003c/li\u003e\n\u003cli\u003eIwashita N (2022) Speaking assessment. In: Derwing TM, Munro MJ, Thomson RI (eds) \u003cem\u003eThe Routledge handbook of second language acquisition and speaking\u003c/em\u003e (pp.130-140). New York, NY: Routledge.\u003c/li\u003e\n\u003cli\u003eIwashita N, Brown A, McNamara T, O\u0026rsquo;Hagan S (2008) Assessed levels of second language speaking proficiency: How distinct? \u003cem\u003eApplied Linguistics, 29\u003c/em\u003e(1):24-49. https://doi.org/10.1093/applin/amm017 \u003c/li\u003e\n\u003cli\u003eJin X (2012) Working memory constraints on L2 learners\u0026rsquo; speech production. \u003cem\u003eForeign Language Teaching and Research,\u003c/em\u003e 44(4):523-535.\u003c/li\u003e\n\u003cli\u003eJin Y, Wu J (2010) A preliminary study of the validity of the Internet-Based CET-4 \u0026mdash;\u0026mdash; Factors Affecting Test-takers\u0026rsquo; Perception of the Performance on the Test. \u003cem\u003eTechnology Enhanced Foreign Language Education, 132\u003c/em\u003e(2): 3-10. \u003c/li\u003e\n\u003cli\u003eKim HJ (2015) A qualitative analysis of rater behavior on an L2 speaking assessment. \u003cem\u003eLanguage Assessment Quarterly,\u003c/em\u003e \u003cem\u003e12\u003c/em\u003e(3):239-261. https://doi.org/10.1080/15434303.2015.1049353 \u003c/li\u003e\n\u003cli\u003eLin R (2023) Examining the scoring of content integration in a listening-speaking test: A G-theory analysis. Language Assessment Quarterly, 20(3):319-338. 
https://doi.org/10.1080/15434303.2023.2242334 \u003c/li\u003e\n\u003cli\u003eLiu S, Chen YJ (2018) A practical exploration on NMET (Shanghai)-based English listening and speaking teaching. TEFLE, 183(05): 32-36.\u003c/li\u003e\n\u003cli\u003eLuoma S (2004) \u003cem\u003eAssessing speaking.\u003c/em\u003e Cambridge, UK: Cambridge University Press.\u003c/li\u003e\n\u003cli\u003eKormos J, Suzuki S, Eguchi M (2022) The role of input modality and vocabulary knowledge in alignment in reading-to-speaking tasks. \u003cem\u003eSystem, 108,\u003c/em\u003e 102854. https://doi.org/10.1016/j.system.2022.102854 \u003c/li\u003e\n\u003cli\u003eMarsh HW, Muth\u0026eacute;n B, Asparouhov T, L\u0026uuml;dtke O, Robitzsch A, Morin AJS, Trautwein U (2009) Exploratory structural equation modeling, integrating CFA and EFA: Application to students\u0026rsquo; evaluations of university teaching. \u003cem\u003eStructural Equation Modeling: A Multidisciplinary Journal, 16\u003c/em\u003e(3):439-476. https://doi.org/10.1080/10705510903008220 \u003c/li\u003e\n\u003cli\u003eMarsh HW, L\u0026uuml;dtke O, Muth\u0026eacute;n B, Asparouhov T, Morin AJS, Trautwein U, Nagengast B (2010) A new look at the big five factor structure through exploratory structural equation modeling. \u003cem\u003ePsychological Assessment, 22\u003c/em\u003e(3):471-491. https://doi.org/10.1037/a0019227 \u003c/li\u003e\n\u003cli\u003eMessick S (1987) \u003cem\u003eValidity (TOEFL Report).\u003c/em\u003e Princeton, NJ: Educational Testing Service.\u003c/li\u003e\n\u003cli\u003eMinistry of Education of the People\u0026rsquo;s Republic of China. (2020) \u003cem\u003eGeneral senior high school curriculum standards.\u003c/em\u003e People\u0026rsquo;s Education Press.\u003c/li\u003e\n\u003cli\u003ePallant J (2020) SPSS survival manual: A step by step guide to data analysis using IBM SPSS (7\u003csup\u003eth\u003c/sup\u003e ed.). Routledge. 
\u003c/li\u003e\n\u003cli\u003ePhakiti A (2008) Construct validation of Bachman and Palmer\u0026rsquo;s (1996) strategic competence model over time in EFL reading tests. \u003cem\u003eLanguage Testing, 25\u003c/em\u003e(2): 237-272. https://doi.org/10.1177/0265532207086783 \u003c/li\u003e\n\u003cli\u003ePusey K (2020) Assessing L2 listening at a Japanese university: Effects of input type and response format. \u003cem\u003eLanguage Education and Assessment, 3\u003c/em\u003e(1): 13-35. https://doi.org/10.29140/lea.v3n1.193 \u003c/li\u003e\n\u003cli\u003eRui YP, Ji HJ (2017) The impact of multimodal listening \u0026amp; speaking teaching on English speaking anxiety and classroom reticence. \u003cem\u003eTEFLE,178\u003c/em\u003e(6): 50-55. \u003c/li\u003e\n\u003cli\u003eRukthong A (2021) MC listening questions vs. integrated listening-to-summarize tasks: What listening abilities do they assess? \u003cem\u003eSystem, 97\u003c/em\u003e(1), 102439. https://doi.org/10.1016/j.system.2020.102439 \u003c/li\u003e\n\u003cli\u003eRukthong A, Brunfaut T (2020) Is anybody listening? The nature of second language listening in integrated listening-to-summarize tasks. \u003cem\u003eLanguage Testing, 37\u003c/em\u003e(1): 31-53. https://doi.org/10.1177/0265532219871470 \u003c/li\u003e\n\u003cli\u003eSwain M, Huang L, Barkaoui K, Brooks L, Lapkin S (2009) The speaking section of the TOEFL iBT\u003csup\u003eTM\u003c/sup\u003e (SSTiBT): Test-takers\u0026rsquo; reported strategic behaviors\u003cem\u003e. \u003c/em\u003eTOEFL iBT-10. Educational Testing Service.\u003c/li\u003e\n\u003cli\u003eSwami V, Ma\u0026iuml;ano C, Morin AJS (2023) A guide to exploratory structural equation modeling (ESEM) and bifactor-ESEM in body image research\u003cem\u003e. Body Image, 47, \u003c/em\u003e101641. 
https://doi.org/10.1016/j.bodyim.2023.101641 \u003c/li\u003e\n\u003cli\u003eSuzuki S, Kormos J (2023) The multidimensionality of second language oral fluency: Interfacing cognitive fluency and utterance fluency. \u003cem\u003eStudies in Second Language Acquisition, 45\u003c/em\u003e(1):38-64. https://doi.org/10.1017/S0272263121000899 \u003c/li\u003e\n\u003cli\u003eTabachnick BG, Fidell LS (2013) Using Multivariate Statistics (6th ed.). Pearson Education.\u003c/li\u003e\n\u003cli\u003eTsang A, Lee JS (2023) The making of proficient young FL speakers: The role of emotions, speaking motivation, and spoken input beyond the classroom. \u003cem\u003eSystem, 115,\u003c/em\u003e 103047. https://doi.org/10.1016/j.system.2023.103047 \u003c/li\u003e\n\u003cli\u003eVan Zyl LE, ten Klooster PM (2022) Exploratory structural equation modeling: Practical guidelines and tutorial with a convenient online tool for Mplus. \u003cem\u003eFrontiers in Psychiatry, 12\u003c/em\u003e(1), 1-28. https://doi.org/10.3389/fpsyt.2021.795672 \u003c/li\u003e\n\u003cli\u003eWang H, Fan TT, Zeng YQ (2018) Investigating the construct of speaking proficiency under the listening-to-speak integrated task. \u003cem\u003eModern Foreign Languages, 41\u003c/em\u003e(3), 413-424.\u003c/li\u003e\n\u003cli\u003eWei J, Llosa L (2015) Investigating differences between American and Indian raters in assessing TOEFL iBT speaking tasks. \u003cem\u003eLanguage Assessment Quarterly, 12\u003c/em\u003e(3):283-304. https://doi.org/10.1080/15434303.2015.1037446 \u003c/li\u003e\n\u003cli\u003eXu W (2016) Analysis of National Matriculation English Test (Shanghai) under the new reform of examination and enrollment system: Innovation, elucidation and prospection. 
\u003cem\u003eForeign Language Testing and Teaching, 4:\u003c/em\u003e24-31.\u003c/li\u003e\n\u003cli\u003eXu W (2021) Practice of a speaking assessment task in a high-stake test: Taking NMET(Shanghai) as an example. Foreign Language Testing and Teaching, (1): 21-27.\u003c/li\u003e\n\u003cli\u003eXu Y, Huang M, Chen J, Zhang Y (2023a) Investigating a shared-dialect effect between raters and candidates in English speaking tests. \u003cem\u003eFrontiers in Psychology, 14\u003c/em\u003e, 1143031. https://doi.org/10.3389/fpsyg.2023.1143031 \u003c/li\u003e\n\u003cli\u003eXu Y, Li XD, Chen J (2024) The review: Computer-based English Listening and Speaking Test (CELST) of National Matriculation English Test (NMET) Guangdong version in China. Language Testing, 42(2):238-249. https://doi.org/10.1177/02655322241255712 \u003c/li\u003e\n\u003cli\u003eXu Y, Li XD, Wang PC (2023b) Validating an empirically developed rating scale of story retelling task. \u003cem\u003eJournal of PLA University of Foreign Languages, 46\u003c/em\u003e(5): 11-19.\u003c/li\u003e\n\u003cli\u003eXu Y, Liao TH, Han S, Wang YQ (2019) Development and validation of the content rubric of a story retelling task. \u003cem\u003eForeign Language Testing and Teaching, 4:\u003c/em\u003e21-30. \u003c/li\u003e\n\u003cli\u003eXu Y, Liao TH, Han S, Wang YQ (2020) Investigating language features for the listening-to-speak integrated task: A corpus-based approach. \u003cem\u003eForeign Language Research, 1:\u003c/em\u003e56-63.\u003c/li\u003e\n\u003cli\u003eXu Y, Yang MN, Li XD (2025) Investigating the relationships between listening strategies and speaking performance in integrated listening-to-speak tasks. \u003cem\u003eSystem, 129, \u003c/em\u003e103586. https://doi.org/10.1016/j.system.2024.103586 \u003c/li\u003e\n\u003cli\u003eXu Y, Zhang YQ (2021) Investigating pronunciation features of the integrated listening-to-speak task construct. 
\u003cem\u003eForeign Language Testing and Teaching, 3:\u003c/em\u003e39-48.\u003c/li\u003e\n\u003cli\u003eYan X, Cheng LX, Ginther A (2019) Factor analysis for fairness: Examining the impact of task type and examinee L1 background on scores of an ITA speaking test. \u003cem\u003eLanguage Testing, 36\u003c/em\u003e(2):207-234. https://doi.org/10.1177/0265532218775764\u003c/li\u003e\n\u003cli\u003eYang HC (2009) \u003cem\u003eExploring the complexity of second language writers\u0026rsquo; strategy use and performance on an integrated writing test through structural equation modeling and qualitative approaches.\u003c/em\u003e Unpublished doctoral dissertation. The University of Texas. \u003c/li\u003e\n\u003cli\u003eZhan Y, Wan ZH (2016) Test takers\u0026rsquo; beliefs and experiences of a high-stakes Computer-based English Listening and Speaking Test. \u003cem\u003eRELC Journal, 47\u003c/em\u003e(3):363-376. https://doi.org/10.1177/0033688216631174\u003c/li\u003e\n\u003cli\u003eZeng QM (2011) The efficacy of multi-modal teaching on the development of L2 listening and speaking abilities. \u003cem\u003eJournal of PLA University of Foreign languages, 6:\u003c/em\u003e72-76.\u003c/li\u003e\n\u003cli\u003eZhang R (2019) Washback effect analysis of NMET(Shanghai) listening and speaking test: Taking J school as an example. Foreign Language Testing and Teaching, 4:47-53.\u003c/li\u003e\n\u003cli\u003eZhou WJ (2005) Effects of input modes on oral English production. \u003cem\u003eJournal of PLA University of Foreign Languages, 28\u003c/em\u003e(6):53-58.\u003c/li\u003e\n\u003cli\u003eZhang Y, Elder C (2009) Measuring the speaking proficiency of advanced EFL learners in China: The CET-SET solution. \u003cem\u003eLanguage Assessment Quarterly, 6\u003c/em\u003e(4):298-314. https://doi.org/10.1080/15434300902990967\u003c/li\u003e\n\u003cli\u003eZhou Y, Zeng YQ (2016) Many-facet Rasch model analysis on computer automatic scoring of a computer-based English listening-speaking test. 
\u003cem\u003eForeign Language Testing and Teaching, 1:\u003c/em\u003e22-31.\u003c/li\u003e\n\u003cli\u003eZhang WW, Zhang LJ (2022) Understanding assessment tasks: Learners\u0026rsquo; and teachers\u0026rsquo; perceptions of cognitive load of integrated speaking tasks for TBLT implementation. \u003cem\u003eSystem, 111,\u003c/em\u003e 102951. https://doi.org/10.1016/j.system.2022.102951 \u003c/li\u003e\n\u003cli\u003eZhang WW, Zhang DL, Zhang LJ (2021) Metacognitive instruction for sustainable learning: Learners\u0026rsquo; perceptions of task difficulty and use of metacognitive strategies in completing integrated speaking tasks. \u003cem\u003eSustainability, 13,\u003c/em\u003e 6275. https://doi.org/10.3390/su13116275 \u003c/li\u003e\n\u003cli\u003eZhang WW, Zhao MJ, Zhu Y (2022) Understanding individual differences in metacognitive strategy use, task demand, and performance in integrated L2 speaking assessment tasks. Frontiers in Psychology, 13, 876208. https://doi.org/10.3389/fpsyg.2022.876208 \u003c/li\u003e\n\u003c/ol\u003e"},{"header":"Footnotes","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003e For the detailed description and scoring criteria of the three subtasks of CELST please refer to the \u0026ldquo;General description\u0026rdquo; section in Xu et al. 
(\u003cspan citationid=\"CR62\" class=\"CitationRef\"\u003e2024\u003c/span\u003e).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003e These measures include: the number of error-free clauses, the number of verb forms, the number of correct verb forms, the number of reported semantic units, the number of additive conjunctions, the number of comparative conjunctions, the number of temporal conjunctions, the number of consequential conjunctions, the number of internal conjunctions, the number of unmeaningful syllables (including repetition, reformulation, and replacement), the number of filled pauses, the number of unfilled pauses, the number of self-corrections, the number of phoneme additions, the number of phoneme omissions, the number of mispronounced phonemes, the number of unintelligible fragments, the number of misstressed phonemes, the number of abandoned messages, the number of reduced semantic units, the number of guesses, the number of approximations, the number of paraphrases, the number of restructurings, and the number of coined words.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003e Specifically, they are: all the sub-measures of coherence, the number of internal conjunctions, all the sub-measures of achievement strategy use, and the number of abandoned messages in the reading-aloud task; all the sub-measures of achievement strategy use and the numbers of comparative and temporal conjunctions in the role-play task; and the number of internal conjunctions and the number of coined words across all three subtasks of CELST.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003e Generally speaking, a χ\u003csup\u003e2\u003c/sup\u003e/df value smaller than 2 indicates good fit, and a value of 2\u0026thinsp;\u0026lt;\u0026thinsp;χ\u003csup\u003e2\u003c/sup\u003e/df\u0026thinsp;\u0026lt;\u0026thinsp;5 indicates acceptable fit (Hou, Wen, \u0026amp; Cheng, 2003, pp. 177\u0026ndash;179). 
Two other commonly used absolute indices of model fit are RMSEA and SRMR, the Root Mean Square Error of Approximation and the Standardized Root Mean Square Residual, respectively. An RMSEA value smaller than 0.08 is usually taken to indicate acceptable model fit, and a value smaller than 0.05 to indicate good model fit. For SRMR, a value smaller than 0.05 indicates good model fit.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"integrated speaking tasks, national matriculation test, CELST, construct validity","lastPublishedDoi":"10.21203/rs.3.rs-7563159/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7563159/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eIntegrated speaking tasks have been widely used in many large-scale high-stakes tests. However, little is known about their application among low- or intermediate-level English second language learners, such as in the Guangdong Version of the Computer-based English Listening and Speaking Test of the National Matriculation English Test. 
This gap in knowledge is particularly problematic given the substantial impact of integrated speaking tasks on the teaching, learning, and assessment of English in China and internationally. To address this gap, the present study employed bi-factor Exploratory Structural Equation Modeling (ESEM) and hierarchical multiple regression analyses, with data from a sample of 360 participants in a real test, to probe whether and to what extent the test actually assesses the ability it is supposed to assess, as set out in the official test specifications issued by the Education Examinations Authority of Guangdong Province. Findings indicated that the test did measure students\u0026rsquo; ability to accomplish certain tasks in specific contexts by acquiring and applying various knowledge sources (e.g., task prompts, encyclopedic knowledge of English and the world, source material, and communicative strategies) in English. Specifically, parameters from the five domain-specific textual factors and two communicative strategies, extracted from the participants\u0026rsquo; oral output, jointly contributed to the participants\u0026rsquo; performance in the test, with varying weights across factors. This variability highlights the comprehensiveness and contextual specificity of the test. 
These findings could provide empirical evidence supporting the validity of score interpretations and offer important implications for the teaching, learning, and assessment of integrated speaking tasks in senior high schools in China.\u003c/p\u003e","manuscriptTitle":"Exploring the Construct Validity of Integrated Speaking Tasks: The Case of a Large-scale High-stakes Computer-based Listening-Speaking Task","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-10-27 15:20:11","doi":"10.21203/rs.3.rs-7563159/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"29558863-1311-4a6e-b58a-1aec60c1725d","owner":[],"postedDate":"October 27th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":56843374,"name":"Social science/Education"},{"id":56843375,"name":"Humanities/Language and linguistics"},{"id":56843376,"name":"Social science/Language and linguistics"},{"id":56843377,"name":"Biological sciences/Psychology"},{"id":56843378,"name":"Social science/Psychology"}],"tags":[],"updatedAt":"2026-02-09T08:26:19+00:00","versionOfRecord":[],"versionCreatedAt":"2025-10-27 
15:20:11","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7563159","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7563159","identity":"rs-7563159","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
