Potential Use of ChatGPT for Automated Essay Scoring Based | Research Square

Research Article

Roghaye Torki, Fariba Rahimi Esfahani, Farshad Kiyoumarsi

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-7533498/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted. Version 1 posted.

Abstract

The rapid advancements in Artificial Intelligence (AI) have significantly influenced educational practices, particularly in writing assessment. Automated Essay Scoring (AES) systems offer a promising alternative to traditional scoring methods by enhancing consistency, efficiency, and scalability. However, the integration of AI in high-stakes assessments like IELTS Writing Task 2 requires rigorous evaluation to ensure reliability and alignment with human judgment.
This study explores the potential of ChatGPT, an advanced AI language model, as a tool for scoring essays based on the IELTS Writing Task 2 criteria: Task Response, Coherence and Cohesion, Lexical Resource, and Grammatical Range and Accuracy. Employing a quantitative Associational Ex Post Facto Design, 30 essays were scored by both certified human raters and ChatGPT, using intra-class correlation coefficients (ICC) for reliability and MANOVA for comparative accuracy. The findings reveal that while ChatGPT demonstrates high internal consistency in scoring, significant discrepancies persist when compared to human raters, particularly in Coherence and Cohesion. These results highlight both the potential and limitations of ChatGPT in AES, suggesting that it can complement, but not yet replace, human evaluators in complex writing tasks. The study contributes to the ongoing discourse on the role of AI in education, emphasizing the need for further refinements to optimize AI-assisted assessments for fairness and precision. Beyond its theoretical contributions, this study provides practical insights for language educators, testing bodies, and policymakers on how AI can be responsibly integrated into large-scale writing assessments.

Keywords: Artificial Intelligence (AI); Automated Essay Scoring (AES); ChatGPT; IELTS Writing Task 2 criterion; Writing Assessment

Introduction

Writing is widely recognized as one of the most demanding skills in language education, and for good reason. It does not merely involve producing text; it requires learners to sustain intellectual effort, organize their ideas, and refine them into a coherent argument (Imran & Almusharraf, 2023). Unlike oral interaction, where meaning can be clarified instantly, writing results in a permanent record. This permanence turns writing into an extended piece of discourse that is cognitively taxing and difficult to master.
As Emig ( 1997 ) persuasively argued, writing is not only a means of communication but also a distinct mode of learning—yet one that imposes high cognitive costs. Because of these demands, the assessment of writing has long been problematic. Evaluation practices are often criticized for inter-rater variability, lack of transparency, and the inconsistent application of band descriptors. These features make traditional assessment a weaker basis for applied linguistics research concerned with fairness and reliability (Ahmadi Shirazi, 2013 ; Beck et al., 2018). After all, even trained examiners interpret descriptors differently, which means that scores may vary depending on who evaluates a script. For this reason, scholars have repeatedly called for more systematic approaches to writing evaluation. Automated Essay Scoring (AES) has emerged as one response to these challenges. Drawing on Artificial Intelligence (AI) and Natural Language Processing (NLP), AES promises efficiency, consistency, and reduced rater fatigue (Doğan et al., 2014 ; Mizumoto & Eguchi, 2023 ). Indeed, it has been embraced in large-scale testing environments, where thousands of essays must be processed within limited time frames (Barrot, 2024 ; Zawacki-Richter et al., 2019 ). However, reliability is not equivalent to validity. Although AES can stabilize scoring, it does not automatically guarantee fairness, and concerns about criterion-related validity remain central (Mizumoto & Eguchi, 2023 ). The release of ChatGPT by OpenAI in late 2022 intensified these debates. As a transformer-based AI model, ChatGPT has demonstrated remarkable ability to generate coherent text, structure arguments, and provide context-sensitive support (Kohnke et al., 2023 ). In principle, such features could relieve examiners by producing consistent scores, thereby allowing teachers to devote more energy to qualitative feedback (Barrot, 2024 ; Essel, 2023 ). 
It has therefore been suggested that ChatGPT could serve as an AES system, complementing or even approximating human raters (Parker et al., 2023 ; Mizumoto & Eguchi, 2023 ; Yancey et al., 2023 ). Yet, enthusiasm must be moderated. While ChatGPT handles surface-level features such as grammar and lexical range with relative accuracy, it is less reliable in capturing discourse-level coherence, rhetorical effectiveness, and idea development (Bui & Barrot, 2024 ; Ramesh & Sanampudi, 2022 ; Yancey et al., 2023 ). This creates a tension. On the one hand, ChatGPT appears consistent and efficient; on the other, it risks overlooking the very dimensions that define writing proficiency in extended discourse. Although prompt design and input quality have been shown to improve its performance (Huang et al., 2022 ; Enright & Quinlan, 2010 ; Barrot, 2024 ), they cannot fully eliminate bias or inconsistency. Moreover, few studies have systematically examined its ability to evaluate crucial aspects such as global cohesion, appropriateness to audience, or stylistic variation—dimensions that applied linguistics research considers essential for validity in academic writing assessment. Against this backdrop, the present study aims to investigate ChatGPT’s role in Automated Essay Scoring by comparing its performance with that of certified human examiners on IELTS Writing Task 2 essays. This task was chosen because its scoring rubric is transparent and widely documented, consisting of four analytic criteria: Task Response (TR), Coherence and Cohesion (CC), Lexical Resource (LR), and Grammatical Range and Accuracy (GRA) (Daneshvar et al., 2021 ). TR assesses the degree to which the prompt is addressed with well-supported arguments; CC captures the logical flow and effective use of cohesive devices; LR measures lexical precision, variety, and appropriateness; while GRA reflects control of syntax, complexity, and grammatical accuracy. 
These descriptors are intended to operationalize the construct of writing proficiency in a structured way. However, interpretive flexibility cannot be eliminated entirely, which is why IELTS Writing Task 2 provides an ideal context to test whether ChatGPT can deliver not only stable scores but also ones that meaningfully align with expert human judgment.

Literature review

Automated Writing Evaluation and Automated Essay Scoring

Automated Writing Evaluation (AWE) and AES systems have increasingly been positioned as central instruments in the assessment of L2 writing. These systems are designed not only to capture macro-level dimensions such as argument structure and organization but also to identify micro-level aspects like grammar and syntax (Zhai & Ma, 2022). Drawing on the Complexity, Accuracy, and Fluency (CAF) framework, they are frequently applied to distinguish proficiency levels in a more systematic and standardized way (Phuoc & Barrot, 2022). Although AES and AWE are often promoted as complementary tools to human raters in high-stakes tests such as TOEFL and IELTS, their actual alignment with human judgment has been repeatedly questioned, particularly with respect to construct validity and the consistency of rater interpretation.

Different theoretical perspectives have informed the conceptual foundations of AWE. From the interaction hypothesis, the emphasis is placed on modified linguistic input occurring during exchanges between learners and automated systems, or between teachers and such systems (Long, 1996; Wilson & Czik, 2016). In contrast, sociocultural theory interprets AWE feedback as mediated assistance that scaffolds learning within the learner’s zone of proximal development (Zhai & Ma, 2022). Cognitive information-processing perspectives highlight yet another dimension: AWE and AES function as metalinguistic feedback providers, offering learners immediate opportunities to notice, reflect on, and restructure linguistic forms.
In this way, explicit knowledge is gradually converted into implicit competence (Schmidt, 1995). Taken together, these theoretical perspectives underscore that automated systems do more than generate scores; they operate as pedagogical mediators with the potential to support learning in extended pieces of discourse. From a psychometric standpoint, AES attempts to replicate human scoring practices while simultaneously promising enhanced inter-rater reliability, scalability, and efficiency (Ramineni & Williamson, 2013 ; Shermis, 2014 ). In large-scale testing environments, this scalability is particularly attractive, since thousands of essays must be scored with consistency under strict timelines. It is often claimed that AES offers objectivity by minimizing rater fatigue and bias. Yet this assumption of neutrality is not absolute. Algorithmic outputs remain constrained by training corpora and the design of scoring rubrics, meaning that claims of objectivity should be treated cautiously. Despite these advantages, AES systems reveal serious limitations. They are frequently less capable of capturing rhetorical effectiveness, reasoning, creativity, or stylistic nuance—features that applied linguistics research regards as central to advanced writing proficiency (Barrot, 2024 ; Lee et al., 2023 ). Context can also be misinterpreted, leading to questionable judgments (Barrot, 2024 ). Moreover, because AES systems can sometimes be manipulated through superficial strategies such as keyword stuffing, learners may attain artificially inflated scores without authentic improvement in their writing quality (Higgins & Heilman, 2014 ). These shortcomings reinforce the point that statistical reliability cannot be equated with construct representation or validity. Scholars have therefore attempted to classify and critique existing AES models. Hussein et al. ( 2019 ) distinguished between handcrafted feature-based models and automatic feature-extraction approaches. 
While the former stress objectivity and transparency, the latter enable wider coverage but can also introduce new forms of bias. Cox (2022) similarly pointed out that crucial dimensions, such as lexical sophistication and cohesion, remain underdeveloped in many existing rubrics, weakening claims of validity. Ramesh and Sanampudi (2021), in a systematic review of 26 AES-related studies, identified persistent shortcomings in measuring coherence and completeness. Many AES systems were shown to rely on general-purpose datasets, such as the Kaggle ASAP corpus, which lack domain-specific features essential for fine-grained analysis. They concluded that without domain-tailored corpora and improved feature extraction, AES will continue to struggle with criterion-related validity and comparability to human raters. Research on AWE has also emphasized feedback functions rather than final scores. Crossley ( 2020 ), Fan and Ma ( 2022 ), and Zhai and Ma ( 2022 ) demonstrated that automated feedback can enhance writing quality, though the extent of improvement varies depending on learners’ educational level and the genre of writing. Yet, methodological issues persist: small sample sizes and a predominant focus on overall writing quality, rather than on fine-grained dimensions such as syntactic complexity, limit the generalizability of findings. Fan and Ma’s ( 2022 ) review of 22 studies suggested that AWE feedback tends to be most effective in within-group experimental designs, pointing to further methodological caveats. Direct comparisons between automated systems and human raters have provided additional insights. Almusharraf and Alotaibi ( 2023 ), for instance, analyzed 197 essays scored both by Grammarly and by human examiners, and observed a moderate correlation. However, Grammarly not only detected more errors but also consistently assigned lower scores. 
This outcome illustrates a trade-off: while AES enhances inter-rater consistency, it does not necessarily align with human interpretations of band descriptors, leaving a gap in construct validity.

ChatGPT as an AI Model in Writing and Education

In more recent years, scholarly attention has shifted specifically to ChatGPT and its potential applications in writing and education. Broader discussions of AI in education have already emphasized learner support, reduced teacher workload, and the potential to bridge achievement gaps (Afzaal et al., 2022; Holmes et al., 2022; Imran & Almusharraf, 2023). With its capacity to process vast datasets and mimic human-like responses, AI has pushed AES beyond the boundaries of traditional feature-based models.

A pivotal innovation underpinning this shift is the rise of deep learning. Neural networks, especially transformer architectures, have transformed a wide array of tasks ranging from speech recognition to natural language processing (Dong et al., 2017; Hussein et al., 2019). By replacing the recurrence of Recurrent Neural Networks (RNNs) and the convolutions of Convolutional Neural Networks (CNNs) with self-attention, transformers allow for faster and more effective training on sequential data (Giacaglia, 2019; Mizumoto & Eguchi, 2023). Building on this foundation, GPT models were trained on massive corpora to perform diverse NLP tasks, from text generation to translation (Radford et al., 2018). ChatGPT, as a conversational adaptation of GPT, has therefore redefined human–machine interaction and inevitably raised questions about its potential role in AES (Essel, 2023; Lee et al., 2017).

One of ChatGPT’s unique contributions is its ability to deliver Automated Written Corrective Feedback (AWCF). Unlike traditional grammar checkers, it can evaluate not only grammar but also content, organization, clarity, and style, while simultaneously suggesting revisions (Barrot, 2024).
Its scoring functions operate at both holistic and analytic levels, often providing explanations alongside band scores. In principle, this dual capability should allow for richer criterion-referenced feedback. Nevertheless, concerns about validity and fairness remain unresolved (Barrot, 2024 ; Han & Sari, 2024 ). Empirical studies highlight both benefits and limitations. Alberth ( 2023 ) observed that ChatGPT may accelerate research productivity and support writing assistance, but cautioned against overreliance and ethical misuse. Imran and Almusharraf ( 2023 ) echoed this caution, underscoring the need for careful human oversight to mitigate risks of plagiarism and overdependence. Mizumoto and Eguchi ( 2023 ) further demonstrated that ChatGPT’s scoring can be strengthened by incorporating linguistic features such as lexical diversity and syntactic complexity. Other comparative studies provide additional nuance. Guo and Wang ( 2024 ) examined ChatGPT’s feedback on argumentative writing and found that it generated more comments than human teachers, distributing attention more evenly across content, organization, and language. Teachers valued these strengths but pointed to weaknesses such as excessive verbosity, limited context sensitivity, and accessibility issues. Schmidt-Fajlik ( 2023 ) compared ChatGPT with Grammarly and ProWritingAid, reporting that learners appreciated its intuitive interface and detailed feedback, even though its evaluations remained largely linguistic rather than rhetorical. Yet significant concerns about reliability persist. Yancey et al. ( 2023 ) reported that while ChatGPT aligned well with human ratings on linguistic measures, it failed to capture rhetorical effectiveness and audience awareness. Parker et al. ( 2023 ) observed that ChatGPT tended to be stricter than human raters, consistently assigning lower scores. 
Extending this line of inquiry, Bui and Barrot ( 2024 ) identified instability in ChatGPT’s scoring across different time points, partly due to algorithmic drift and version updates (Ray, 2023 ; Gonzalez Torres & Sawhney, 2023 ; Suppadungsuk et al., 2023 ). Although these updates are designed to enhance model performance, they complicate longitudinal validity and comparability. Taken together, these findings converge on a familiar theme: ChatGPT is reliable at surface-level judgments but struggles with higher-order aspects of writing, including discourse-level cohesion, argumentative depth, and creativity (Ramesh & Sanampudi, 2022 ; Mizumoto & Eguchi, 2023 ). While earlier AES systems were criticized on similar grounds, ChatGPT’s probabilistic nature adds another layer of variability. As Schade ( 2023 ) has noted, even when identical prompts are used, slight differences can occur in the outputs due to randomness in large language models. In sum, current research portrays ChatGPT as a promising but still imperfect AES tool. It is effective in detecting grammatical accuracy and lexical range, in generating immediate feedback, and in reducing examiner workload. However, it remains less capable of capturing the full construct of academic writing proficiency. Whereas human raters interpret band descriptors with reference to rhetorical effectiveness and contextual appropriateness, ChatGPT often under-represents these higher-order dimensions. This persistent gap underscores the need for further validation studies, especially in high-stakes contexts such as IELTS Writing Task 2, where criterion-related validity and fairness are paramount. Filling these research gaps, the present study aims to offer a more detailed examination of ChatGPT’s performance as an AES tool compared with established human scoring practices. 
While earlier investigations have identified both potential and limitations, very few studies have directly compared ChatGPT’s scores with those of certified examiners under standardized, high-stakes conditions. Moreover, little systematic attention has been paid to whether discrepancies can be minimized through careful prompt design and consistent evaluation procedures. For this reason, the current study is guided by the following research questions:

1. Is ChatGPT a reliable language model in scoring essays according to the IELTS Writing Task 2 benchmarks?
2. Is there a statistically significant difference between ChatGPT’s essay scores and human raters’ scores when both follow IELTS Writing Task 2 band descriptors?

By addressing these questions, the study aims not only to evaluate ChatGPT’s reliability and validity but also to clarify whether it can serve as a credible complement, rather than a substitute, for human raters in high-stakes writing assessment.

Methods

Research Design

This study followed a quantitative, non-experimental design in the tradition of ex post facto research. Since each essay was evaluated by both a certified IELTS examiner and ChatGPT, the analysis adopted a within-subject framework, treating the type of rater (human vs. AI) as the independent factor. The dependent variables were the four analytic IELTS Writing Task 2 criteria, namely Task Response (TR), Coherence and Cohesion (CC), Lexical Resource (LR), and Grammatical Range and Accuracy (GRA), together with the overall band score. To examine reliability, intra-class correlation coefficients, ICC(2,1), were calculated, as this statistic is considered more appropriate than Cronbach’s alpha for continuous ratings across raters and allows for absolute agreement estimates. To address differences in scoring patterns, a repeated-measures MANOVA was employed, followed by paired-samples t-tests and effect size calculations where necessary.
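The ICC(2,1) statistic named above can be sketched from its standard two-way ANOVA decomposition. This is a minimal illustration, not the study's analysis script: the function name `icc_2_1` and the `toy_scores` values are hypothetical, and the formula is the textbook two-way random-effects, absolute-agreement, single-measurement form.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    `ratings` is an (n_essays, k_ratings) array: one row per essay and one
    column per rater or scoring session.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()

    # Sums of squares for essays (rows), raters/sessions (columns), and residual.
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    # Absolute agreement: systematic column (rater) differences lower the ICC.
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Hypothetical band scores for 5 essays across 3 scoring sessions
# (illustrative values only, not data from the study).
toy_scores = [
    [5.0, 5.0, 5.5],
    [6.0, 6.0, 6.0],
    [4.5, 4.5, 4.5],
    [7.0, 6.5, 7.0],
    [5.5, 5.5, 5.5],
]
print(round(icc_2_1(toy_scores), 3))
```

Because the absolute-agreement form penalizes a constant offset between columns, a rater who is uniformly stricter lowers this coefficient even when the rank ordering of essays is identical.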
All assumptions for multivariate analysis (e.g., normality, linearity, and homogeneity of covariance matrices) were checked before interpreting results. While this design does not permit causal inference, it provides a rigorous means of comparing the alignment and divergence between human and AI raters.

Participants and Sample Selection

The dataset consisted of thirty IELTS Writing Task 2 essays, deliberately selected to cover a wide range of proficiency levels. The essays were drawn from three sources: the Cambridge IELTS practice book series, the “How to do IELTS” website, and Exir Academy’s online platform. A stratified random sampling procedure was applied to ensure representation of different proficiency bands, thereby enhancing the fairness of the comparison. Human ratings were provided by certified IELTS examiners with prior experience in test preparation and assessment.

Materials and Instruments

The primary material for analysis was IELTS Writing Task 2, chosen because it requires candidates to produce extended argumentative essays that demand idea development, organization, and support with relevant examples. In contrast to Task 1, which mainly involves data description, Task 2 offers a more comprehensive measure of academic writing proficiency. Both human raters and ChatGPT evaluated the essays using the official IELTS scoring descriptors, covering TR, CC, LR, GRA, and overall band performance.

Procedures

All essays were anonymized prior to scoring to reduce potential bias. Each script was independently scored by a human examiner and by ChatGPT, with ratings recorded separately for each of the four analytic categories and the overall band score. No qualitative feedback or comments were provided; only numeric band scores were considered for analysis.

Data Analysis

Data analysis proceeded in several stages. First, descriptive statistics (means and standard deviations) were computed for all variables to illustrate central tendencies and dispersion.
Inter-rater reliability between human and AI scores was then assessed using two-way random effects intra-class correlation coefficients, ICC(2,1), reported with 95% confidence intervals. To evaluate systematic differences between human and AI scores, a repeated-measures MANOVA was performed. Where multivariate results were significant, follow-up paired-samples t-tests were conducted for each criterion, with effect sizes (Cohen’s d) calculated to gauge the magnitude of differences. A significance threshold of α = .05 was applied, and results were interpreted with reference to both statistical and practical significance.

Results

To evaluate the stability of ChatGPT’s scoring across repeated evaluations of the same scripts, we estimated intra-rater reliability using a two-way mixed-effects, absolute-agreement ICC(2,1). Reliability was uniformly excellent across all outcomes (TR = 0.984, CC = 0.968, LR = 0.980, GRA = 0.967, Overall = 0.964), indicating near-perfect repeatability under identical prompting conditions (Table 1). Descriptively, means and standard deviations changed only negligibly from session to session, reinforcing this stability.

Table 1. Intra-class correlation coefficients for ChatGPT scores across three sessions

| Criterion | ICC(2,1) | Interpretation* |
| --- | --- | --- |
| Task Response (TR) | 0.984 | Excellent |
| Coherence and Cohesion (CC) | 0.968 | Excellent |
| Grammatical Range and Accuracy (GRA) | 0.967 | Excellent |
| Lexical Resource (LR) | 0.980 | Excellent |
| Overall Band Score | 0.964 | Excellent |

Having established that ChatGPT yields highly stable scores across repeated measurements, we next examined whether its ratings systematically diverged from those assigned by human raters. This addressed the second research question and required an omnibus multivariate test followed by criterion-level paired comparisons. A repeated-measures MANOVA with rater type (human vs.
ChatGPT) as the within-subjects factor and the five IELTS outcomes as dependent variables revealed a robust multivariate effect of rater type: Pillai’s Trace = 0.897, F(5, 29) = 50.30, p < .001. Convergent statistics from the Hotelling–Lawley Trace, Roy’s Largest Root, and Hotelling’s T² corroborated this omnibus difference, indicating that the combined score profile reliably differed between human and AI ratings (Table 2).

Table 2. Repeated-measures multivariate test results (Hotelling’s T²)

| Test | Statistic | F | df1 | df2 | p |
| --- | --- | --- | --- | --- | --- |
| Pillai's Trace | 0.897 | 50.30 | 5 | 29 | < .001 |
| Hotelling-Lawley Trace | 8.672 | 44.86 | 5 | 25 | < .001 |
| Roy's Largest Root | 8.672 | 251.49 | 1 | 29 | < .001 |
| Hotelling's T² | 251.487 | 43.36 | 5 | 25 | < .001 |

Follow-up paired comparisons localized the multivariate effect to specific criteria (Table 3). Human raters consistently awarded higher scores than ChatGPT across all five outcomes. The largest discrepancies were observed for Coherence and Cohesion (Mean Diff = 1.50, 95% CI [1.19, 1.81], t(29) = 10.02, p < .001, d_z = 1.83) and for Grammatical Range and Accuracy (Mean Diff = 0.90, 95% CI [0.58, 1.22], t(29) = 5.83, p < .001, d_z = 1.07). Substantial differences were also found for Lexical Resource (Mean Diff = 0.80, 95% CI [0.52, 1.08], t(29) = 5.76, p < .001, d_z = 1.05) and for the Overall band (Mean Diff = 0.90, 95% CI [0.61, 1.19], t(29) = 6.31, p < .001, d_z = 1.15). Even Task Response showed a smaller but significant gap (Mean Diff = 0.70, 95% CI [0.16, 1.24], t(29) = 2.66, p = .013, d_z = 0.49). Holm-adjusted p-values remained significant across all criteria, confirming systematic human–AI differences in level rather than mere rank ordering.
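The statistics quoted above are linked by a few standard identities, sketched here as a reader-side cross-check. This is an illustration under stated assumptions rather than the authors' procedure: the function names are hypothetical, the T²-to-F conversion assumes Hotelling's T² was computed on the 30 × 5 matrix of human-minus-ChatGPT difference scores, and the `0.0005` entries are placeholder stand-ins for the p-values reported only as "< .001".

```python
import math

def cohens_dz(t, n):
    """Cohen's d_z for a paired design, recovered from the paired t statistic."""
    return t / math.sqrt(n)

def partial_eta_squared(t, df):
    """Partial eta squared for a single-df effect, recovered from t."""
    return t * t / (t * t + df)

def hotelling_t2_to_f(t2, n, p):
    """Convert a one-sample Hotelling's T^2 (p variables, n cases) to F(p, n - p)."""
    return (n - p) / (p * (n - 1)) * t2

def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * pvals[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

n_essays, n_outcomes = 30, 5

# Coherence and Cohesion: t(29) = 10.02 as reported in the text.
print(round(cohens_dz(10.02, n_essays), 2))
print(round(partial_eta_squared(10.02, n_essays - 1), 2))

# Hotelling's T^2 = 251.487 as reported in Table 2.
print(round(hotelling_t2_to_f(251.487, n_essays, n_outcomes), 2))

# Task Response p = .013; the other four are placeholders for "< .001".
raw_p = [0.013, 0.0005, 0.0005, 0.0005, 0.0005]
print(holm_adjust(raw_p))
```

Under Holm's procedure the largest raw p-value is multiplied by 1, which is why an adjusted value for Task Response can equal its unadjusted value.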
Table 3. Paired comparisons (Human – ChatGPT) for IELTS criteria and overall

| Criterion | Human Mean ± SD | ChatGPT Mean ± SD | Mean Diff (H–C) | SE (Diff) | 95% CI Low | 95% CI High | t (paired) | df | p | Cohen's d_z | Partial η² | p (Holm) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TR | 5.50 ± 1.46 | 4.80 ± 0.89 | 0.70 | 0.26 | 0.16 | 1.24 | 2.66 | 29 | 0.013 | 0.49 | 0.20 | 0.013 |
| CC | 5.80 ± 1.00 | 4.30 ± 0.65 | 1.50 | 0.15 | 1.19 | 1.81 | 10.02 | 29 | < .001 | 1.83 | 0.78 | < .001 |
| LR | 5.10 ± 0.84 | 4.30 ± 0.79 | 0.80 | 0.14 | 0.52 | 1.08 | 5.76 | 29 | < .001 | 1.05 | 0.53 | < .001 |
| GRA | 4.90 ± 0.96 | 4.00 ± 0.64 | 0.90 | 0.15 | 0.58 | 1.22 | 5.83 | 29 | < .001 | 1.07 | 0.54 | < .001 |
| Overall | 5.40 ± 0.93 | 4.50 ± 0.72 | 0.90 | 0.14 | 0.61 | 1.19 | 6.31 | 29 | < .001 | 1.15 | 0.58 | < .001 |

Taken together, ChatGPT’s scoring is highly stable across repeated sessions (ICC: .964–.984), yet its mean scores do not fully converge with human judgments; the omnibus multivariate effect is large, and criterion-level gaps are most pronounced for coherence/organization and grammatical control.

Discussion

The present study investigated ChatGPT’s potential as an AES tool by comparing its performance with that of human raters using IELTS Writing Task 2 criteria. The results revealed two key findings: first, ChatGPT demonstrated high internal reliability across three sessions, as indicated by ICC values exceeding .96 for all criteria, establishing its scoring stability. Second, despite this reliability, significant systematic differences were observed between human and AI scores across all four analytic criteria and the overall band score, as evidenced by repeated-measures MANOVA and follow-up paired comparisons. These results highlight ChatGPT’s consistency but also its divergence from human evaluators, raising critical implications for its role in high-stakes assessment.

These findings extend prior research on AES systems, which has consistently reported discrepancies between machine-generated and human ratings (Almusharraf & Alotaibi, 2023; Crossley, 2020; Fan & Ma, 2022; Ramesh & Sanampudi, 2021; Zhai & Ma, 2022).
In line with Almusharraf and Alotaibi ( 2023 ) and Parker et al. ( 2023 ), our results show that ChatGPT tended to assign lower scores than human raters, particularly for Coherence and Cohesion and Lexical Resource. These findings also invite closer comparison with previous research. Unlike Crossley ( 2020 ), who underscored the importance of linguistic indicators such as syntactic complexity and lexical sophistication in shaping writing quality, the present study shows that ChatGPT’s judgments were more heavily weighted toward surface-level accuracy, often overlooking deeper discourse-level dimensions. Similarly, while Guo and Wang ( 2024 ) reported that ChatGPT provided extensive feedback distributed across content, organization, and language, our results indicate that its evaluative focus in scoring remained narrower, with less sensitivity to rhetorical effectiveness and creative idea development—areas where human raters demonstrated greater tolerance and recognition. This contrasts with Bui and Barrot ( 2024 ), who emphasized inconsistencies across different ChatGPT versions, whereas our study—by employing a single, stable version—found consistently replicated scores. Thus, while previous work questioned the reproducibility of ChatGPT’s evaluations (Bui & Barrot, 2024 ), the present study confirms that the system is internally stable when conditions are held constant. One possible explanation for the lower scores assigned by ChatGPT lies in its algorithmic focus on error detection rather than nuanced assessment of idea development and creativity (Guo & Wang, 2024 ; Schmidt-Fajlik, 2023 ). Similar to the observations of Yancey et al. ( 2023 ), ChatGPT struggles with interpreting unconventional structures or subtle argumentative moves, often penalizing essays where human raters recognize originality or rhetorical effectiveness. 
Human raters, familiar with second-language learner writing, may apply more lenient judgments, especially for coherence and lexical sophistication, leading to the observed score gaps. Another contributing factor is training data bias, as generative AI systems reflect the patterns of their training corpus (Bui & Barrot, 2024 ). This bias can result in systematic underestimation of L2 writing performance, particularly in dimensions such as cohesion and vocabulary richness. Our findings resonate with prior research showing that AES systems often fail to capture creativity and discourse-level fluency (Ramesh & Sanampudi, 2021; Yan et al., 2023 ). The substantial and statistically significant differences in Coherence and Cohesion (p < .001) and Lexical Resource (p < .001) underscore these limitations. However, a notable departure from Bui and Barrot’s ( 2024 ) conclusions lies in the consistency of ChatGPT scores across repeated evaluations. Whereas they reported variability attributed to algorithmic updates and model drift (Ray, 2023 ; Gonzalez Torres & Sawhney, 2023 ; Suppadungsuk et al., 2023 ), our study shows that ChatGPT produced nearly identical results across three sessions. This suggests that when model updates are controlled, ChatGPT is capable of providing reproducible scores, thus offering potential utility in large-scale assessment scenarios where reliability is paramount. Despite these strengths, the weak correlations with human ratings raise concerns about construct validity. AES models like ChatGPT appear reliable but not yet valid proxies for human judgment, particularly in high-stakes contexts like IELTS. As Mizumoto and Eguchi ( 2023 ) argue, enhancing scoring accuracy requires integrating linguistic features such as syntactic complexity and lexical diversity. Furthermore, improving prompt design and fine-tuning models specifically for essay scoring may help bridge the gap between AI and human evaluators. 
Conclusion and Implications

This study evaluated ChatGPT’s performance as an AES tool for IELTS Writing Task 2. Two conclusions are clear. First, ChatGPT produced highly consistent scores across repeated evaluations, with excellent intra-rater reliability, indicating that, under stable conditions, it can generate reproducible outcomes. Second, despite this stability, systematic differences emerged between ChatGPT and human raters for all analytic criteria (TR, CC, LR, GRA) and the overall band score, with the largest gaps in Coherence and Cohesion, followed by the overall band, Grammatical Range and Accuracy, and Lexical Resource, and a smaller yet reliable gap in Task Response. In short, reliability is strong, but validity, defined here as alignment with expert human judgment, remains limited.

Placed against prior work, these findings both converge with and depart from the literature. They converge with studies that document persistent human–AI discrepancies and the difficulty AES systems face with discourse-level organization and nuanced lexical control (Almusharraf & Alotaibi, 2023; Crossley, 2020; Fan & Ma, 2022; Ramesh & Sanampudi, 2022; Zhai & Ma, 2022; Yan et al., 2023). They also align with reports that ChatGPT tends to assign lower scores than human raters in multiple contexts (Bui & Barrot, 2024; Guo & Wang, 2024; Schmidt-Fajlik, 2023; Yancey et al., 2023). At the same time, our findings diverge from Bui and Barrot’s (2024) claim of unstable scoring across ChatGPT versions: by holding the version constant and repeating evaluations under identical conditions, we observed stable internal reliability, suggesting that previously reported variability likely reflected version changes and model drift (Ray, 2023; Gonzalez Torres & Sawhney, 2023; Suppadungsuk et al., 2023) rather than inherent randomness.

Taken together, these results recommend a complementary, rather than substitutive, role for ChatGPT in high-stakes assessment.
In practical terms:

Human-in-the-loop scoring. Use ChatGPT for preliminary scoring and rapid triage, while reserving final judgments for trained examiners. This arrangement leverages AI’s efficiency without sacrificing validity and fairness (Almusharraf & Alotaibi, 2023).

Targeted deployment. Allow ChatGPT to highlight surface-level issues (grammatical slips, local cohesion) and provide criterion-linked prompts for revision, while teachers adjudicate discourse-level qualities (global coherence, argument development, rhetorical effectiveness) (Yancey et al., 2023; Schmidt-Fajlik, 2023).

Calibration and standard setting. Periodically calibrate AI outputs to human band descriptors, adopt cut-scores for automatic acceptance vs. mandatory human review, and perform bias audits for L2 populations.

Model improvement. Consistent with Mizumoto and Eguchi (2023), integrate linguistic features (syntactic complexity, lexical diversity, and discourse markers) into model tuning; task-specific fine-tuning for IELTS-style responses is especially warranted.

Monitoring drift. Establish version control and drift monitoring so that any system update triggers re-validation before operational use (Ray, 2023; Gonzalez Torres & Sawhney, 2023; Suppadungsuk et al., 2023).

Beyond summative scoring, there is strong potential for formative applications. ChatGPT can support personalized learning by delivering immediate, criterion-referenced feedback that helps learners iterate on organization, vocabulary choice, and grammatical control (Alberth, 2023; Su & Yang, 2023). That said, because the model can underestimate performance relative to human raters, particularly on discourse and lexical dimensions, teacher mediation remains essential to prevent construct under-representation and preserve stakeholder trust.

In conclusion, ChatGPT currently offers reliability without full validity in mirroring human scoring for IELTS Writing Task 2.
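The calibration and human-in-the-loop recommendations above could be operationalized roughly as follows. This is a minimal sketch under stated assumptions, not a validated procedure: the function names, the mean-offset calibration, the cut-score of 6.5, and the margin and spread tolerances are all hypothetical, and the scores are invented.

```python
import math
from statistics import mean


def round_to_half_band(score: float) -> float:
    """Round to the nearest half band (0.5 steps); exact ties round up."""
    return math.floor(score * 2 + 0.5) / 2


def fit_offset(human: list[float], ai: list[float]) -> float:
    """Mean human-minus-AI gap on a calibration set scored by both.

    A constant offset is the simplest possible calibration and is an
    assumption of this sketch; a real system might need a richer model.
    """
    return mean(h - a for h, a in zip(human, ai))


def route(
    ai_runs: list[float],
    offset: float,
    cut_score: float,
    margin: float = 0.5,      # hypothetical: how close to the cut counts as borderline
    max_spread: float = 0.5,  # hypothetical: tolerated range across repeated AI runs
) -> tuple[float, str]:
    """Return (calibrated band, routing decision) for one essay."""
    calibrated = round_to_half_band(mean(ai_runs) + offset)
    unstable = max(ai_runs) - min(ai_runs) > max_spread
    borderline = abs(calibrated - cut_score) <= margin
    decision = "human_review" if (unstable or borderline) else "auto_accept"
    return calibrated, decision


# Invented calibration set: three essays scored by a human and by the AI.
offset = fit_offset(human=[6.5, 6.0, 7.0], ai=[5.5, 5.5, 6.0])

clear_case = route([4.0, 4.0, 4.0], offset, cut_score=6.5)
unstable_case = route([5.5, 6.5, 5.5], offset, cut_score=6.5)
print(clear_case, unstable_case)
```

The design choice here is conservative: an essay goes to a trained examiner whenever repeated AI runs disagree or the calibrated band falls near the cut-score, so automatic acceptance is reserved for stable, clear-cut cases.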
When integrated thoughtfully (human-in-the-loop, calibrated to band descriptors, audited for drift and bias), it can improve throughput and feedback cycles. However, until refinements address discourse-level interpretation and alignment with expert judgments, ChatGPT should be positioned as a pedagogical aid and auxiliary scorer, not a stand-alone substitute, in high-stakes writing assessment (Bui & Barrot, 2024; Guo & Wang, 2024; Yancey et al., 2023; Mizumoto & Eguchi, 2023).

Abbreviations

AES: Automated Essay Scoring
AI: Artificial Intelligence
AWCF: Automated Written Corrective Feedback
AWE: Automated Writing Evaluation
CAF: Complexity, Accuracy, and Fluency
CC: Coherence and Cohesion
CNNs: Convolutional Neural Networks
GRA: Grammatical Range and Accuracy
IELTS: International English Language Testing System
ICC: Intra-class correlation coefficient
LR: Lexical Resource
MANOVA: Multivariate analysis of variance
NLP: Natural Language Processing
TR: Task Response

Declarations

Ethics approval and consent to participate: not applicable.
Clinical trial number: not applicable.
Consent for publication: not applicable.
Funding: not applicable.

Author Contribution

R.T. contributed to the conceptualization of the study, literature review, data collection, and initial drafting of the manuscript. F.R. supervised the research process, contributed to study design and methodology, performed data analysis, and provided critical revisions. F.K. contributed to the design and validation of the automated essay scoring procedures, assisted with statistical analyses, and participated in manuscript editing and technical review. All authors reviewed and approved the final version of the manuscript for submission.

References

Afzaal, M., Imran, M., Du, X., & Almusharraf, N. (2022). Automated and human interaction in written discourse: A contrastive parallel corpus-based investigation of meta discourse features in machine-human translations. SAGE Open, 12(4), 1–18. https://doi.org/10.1177/21582440221142210
Ahmadi Shirazi, M. (2013). Using an analytic dichotomous evaluation checklist to increase inter- and intra-rater reliability of EFL writing evaluation. Iranian Journal of Applied Linguistics, 16(1), 25–57.
Alberth. (2023). The use of ChatGPT in academic writing: A blessing or a curse in disguise? TEFLIN Journal: A Publication on the Teaching and Learning of English, 34(2), 337–352. https://doi.org/10.15639/teflinjournal.v34i2/337-352
Almusharraf, N., & Alotaibi, H. (2023). An error-analysis study from an EFL writing context: Human and automated essay scoring approaches. Technology, Knowledge and Learning, 28(3), 1015–1031. https://doi.org/10.1007/s10758-022-09592-z
Barrot, J. S. (2023). Using ChatGPT for second language writing: Pitfalls and potentials. Assessing Writing, 57, 100745.
Barrot, J. S. (2024). Trends in automated writing evaluation systems research for teaching, learning, and assessment: A bibliometric analysis. Education and Information Technologies, 29(6), 7155–7179.
British Council. (2023). IELTS writing test – Task 2. https://www.ielts.org/about-the-test/sample-test-questions
Bui, N. M., & Barrot, J. S. (2024). ChatGPT as an automated essay scoring tool in the writing classrooms: How it compares with human scoring. Education and Information Technologies. https://doi.org/10.1007/s10639-024-12891-w
Cerero, J. F., Rueda, M. M., Batanero, J. M. F., & Meneses, E. L. (2023). Impact of the implementation of ChatGPT in education: A systematic review. Computers, 12(8), 153. https://doi.org/10.3390/computers12080153
Cox, J. (2020). Writing rubrics: Samples of basic, expository, and narrative rubrics. ThoughtCo. https://www.thoughtco.com/writing-rubric-examples-2081369
Crossley, S. (2020). Linguistic features in writing quality and development: An overview. Journal of Writing Research, 11(3), 415–443. https://doi.org/10.17239/jowr-2020.11.03.01
Daneshvar, A., Sadegh Bagheri, M., Sadighi, F., Yarmohammadi, L., & Yamini, M. (2021). A probe into Iranian learners’ performance on IELTS Academic Writing Task 2: Operationalizing two models of dynamic assessment versus static assessment. Journal of Modern Research in English Language Studies, 8(2), 25–58.
Doğan, A., Akbarova, A. A., Aydoğan, H., Gönen, K., & Tuncdemir, E. (2014). Automated essay scoring versus human scoring: A reliability check. International Journal of Linguistics, Literature and Translation, 3, 1.
Dong, F., Zhang, Y., & Yang, J. (2017). Attention-based recurrent convolutional neural network for automatic essay scoring. Proceedings of the 21st Conference on Computational Natural Language Learning, 153–162. https://doi.org/10.18653/v1/K17-1017
Drigas, A. S., Argyri, K., & Vrettaros, J. (2009). Artificial intelligence techniques in student modeling. In Best practices for the knowledge society. Knowledge, learning, development and technology for all: Second world summit on the knowledge society, 2 (pp. 552–564).
Emig, J. (1997). Writing as a mode of learning. In V. Villanueva (Ed.), Cross talk in composition theory. National Council of Teachers of English.
Enright, M. K., & Quinlan, T. (2010). Complementing human judgment of essays written by English language learners with e-rater scoring. Language Testing, 27(3), 317–334. https://doi.org/10.1177/0265532210363144
Essel, H. B. (2023). 7 things you should know about GPT. ResearchGate. https://www.researchgate.net/publication/367377300
Fan, N., & Ma, Y. (2022). The effects of automated writing evaluation (AWE) feedback on students’ English writing quality: A systematic literature review. Language Teaching Research Quarterly, 28, 53–73. https://doi.org/10.32038/ltrq.2022.28.03
Giacaglia, G. (2019). How transformers work. Medium. https://towardsdatascience.com/transformers-141e32e69591
Gonzalez Torres, A. P., & Sawhney, N. (2023). Role of regulatory sandboxes and MLOps for AI-enabled public sector services. The Review of Socionetwork Strategies, 17, 297–318.
Guo, K., & Wang, D. (2024). To resist it or to embrace it? Examining ChatGPT’s potential to support teacher feedback in EFL writing. Education and Information Technologies, 29, 8435–8463.
Han, T., & Sari, E. (2024). An investigation on the use of automated feedback in Turkish EFL students’ writing classes. Computer Assisted Language Learning, 37(4), 961–985.
Higgins, D., & Heilman, M. (2014). Managing what we can measure: Quantifying the susceptibility of automated scoring systems to gaming behavior. Educational Measurement: Issues and Practice, 33(3), 36–46.
Holmes, W., Persson, J., Chounta, I. A., Wasson, B., & Dimitrova, V. (2022). Artificial intelligence and education: A critical view through the lens of human rights, democracy and the rule of law. Council of Europe. https://rm.coe.int/artificial-intelligence-and-education-a-critical-view-through-the-lens/1680a886bd
Huang, W., Hew, K., & Fryer, L. (2022). Chatbots for language learning: Are they really useful? A systematic review of chatbot-supported language learning. Journal of Computer Assisted Learning, 38(1), 237–257. https://doi.org/10.1111/jcal.12610
Hussein, M. A., Hassan, H., & Nassef, M. (2019). Automated language essay scoring systems: A literature review. PeerJ Computer Science, 5, e208. https://doi.org/10.7717/peerj-cs.208
Imran, M., & Almusharraf, N. (2023). Review of teaching innovation in university education: Case studies and main practices. The Social Science Journal. https://doi.org/10.1080/03623319.2023.2201973
Kohnke, L., Moorhouse, B. L., & Zou, D. (2023). ChatGPT for language teaching and learning. RELC Journal, 54(2). https://doi.org/10.1177/00336882231162868
Lee, A. V. Y., Luco, A. C., & Tan, S. C. (2023). A human-centric automated essay scoring and feedback system for the development of ethical reasoning. Educational Technology & Society, 26(1), 147–159.
Long, M. H. (1996). The role of the linguistic environment in second language acquisition. In W. Ritchie & T. K. Bhatia (Eds.), Handbook of second language acquisition (pp. 413–468). Academic Press.
Mizumoto, A., & Eguchi, M. (2023). Exploring the potential of using an AI language model for automated essay scoring. Research Methods in Applied Linguistics, 2(2). https://doi.org/10.1016/j.rmal.2023.100050
Parker, J. L., Becker, K., & Carroca, C. (2023). ChatGPT for automated writing evaluation in scholarly writing instruction. Journal of Nursing Education, 62(12), 721–727.
Phuoc, V. D., & Barrot, J. S. (2022). Complexity, accuracy, and fluency in L2 writing across proficiency levels: A matter of L1 background? Assessing Writing, 54. https://doi.org/10.1016/j.asw.2022.100673
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. https://s3-us-west-2.amazonaws.com/openaiassets/research-covers/language-unsupervised/language_understanding_paper.pdf
Ramesh, D., & Sanampudi, S. K. (2022). An automated essay scoring system: A systematic literature review. Artificial Intelligence Review, 55, 2495–2527. https://doi.org/10.1007/s10462-021-10068-2
Ramineni, C., & Williamson, D. M. (2013). Automated essay scoring: Psychometric guidelines and practices. Assessing Writing, 18(1), 25–39.
Ray, P. P. (2023). ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems, 3, 121–154.
Schade, M. (2023). How ChatGPT and our language models are developed. Retrieved October 28, 2023, from https://help.openai.com/en/articles/7842364-how-chatgpt-and-our-language-models-are-developed
Schmidt-Fajlik, R. (2023). ChatGPT as a grammar checker for Japanese English language learners: A comparison with Grammarly and ProWritingAid. Asia CALL Online Journal, 14(1), 105–119.
Shermis, M. D. (2014). State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing Writing, 20, 53–76.
Su, J., & Yang, W. (2023). Unlocking the power of ChatGPT: A framework for applying generative AI in education. ECNU Review of Education, 6(3), 355–366.
Suppadungsuk, S., Thongprayoon, C., Miao, J., Krisanapan, P., Qureshi, F., Kashani, K., & Cheungpasitporn, W. (2023). Exploring the potential of chatbots in critical care nephrology. Medicines, 10(10), 58.
Wilson, J., & Czik, A. (2016). Automated essay evaluation software in English Language Arts classrooms: Effects on teacher feedback, student motivation, and writing quality. Computers & Education, 100(1), 94–109.
Yan, D., Fauss, M., Hao, J., & Cui, W. (2023). Detection of AI-generated essays in writing assessments. Psychological Test and Assessment Modeling, 65, 125–144.
Yancey, K. P., Lafair, G., Verardi, A., & Burstein, J. (2023). Rating short L2 essays on the CEFR scale with GPT-4. In E. Kochmar, J. Burstein, A. Horbach, R. Laarmann-Quante, N. Madnani, A. Tack, V. Yaneva, Z. Yuan, & T. Zesch (Eds.), Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (pp. 576–584). https://aclanthology.org/2023.bea-1.49
Zawacki-Richter, O., Marín, V. I., Bond, M., & Gouverneur, F. (2019). Systematic review of research on artificial intelligence applications in higher education: Where are the educators? International Journal of Educational Technology in Higher Education, 16(1), 1–27.
Zhai, N., & Ma, X. (2022). The effectiveness of automated writing evaluation on writing quality: A meta-analysis. Journal of Educational Computing Research, 1–26.

Additional Declarations

No competing interests reported.
As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-7533498","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":517880072,"identity":"ba36efc3-1ebb-4d4a-95bf-9fc1a330dd30","order_by":0,"name":"Roghaye Torki","email":"","orcid":"","institution":"ShK.C, Islamic Azad University","correspondingAuthor":false,"prefix":"","firstName":"Roghaye","middleName":"","lastName":"Torki","suffix":""},{"id":517880073,"identity":"2eb84ba6-0bf5-4b78-9a8c-2da9204e4d86","order_by":1,"name":"Fariba Rahimi Esfahani","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA8UlEQVRIiWNgGAWjYBAC9gYGhsNglgRjA0NChQRU3AC3Fp4DKFrOEKmFGaIFiBnbiHAYD/vZh4cLahjy+Wc3t254OM9C3uAA88MPDAX3cGvhSTc4POMYg+WMOwfbbiRukzDccIDNWILBoBinFnuGNIbDPGxAt99IBGth3HCAwQzolwTctvA/A2r5x2AgD9YyR8J+wwH2b/i1SABt4W1jMDAAa2mQSNxwgIeALRJAW2b2SRgYgrQkHJNInnmYp1giAa/D0pg/F3yzMZC7kf7s5o+aOtu+4+0bP3z4g1sLFEggsUHRRFDDKBgFo2AUjAK8AADPZlIWYYN/qwAAAABJRU5ErkJggg==","orcid":"","institution":"ShK.C, Islamic Azad 
University","correspondingAuthor":true,"prefix":"","firstName":"Fariba","middleName":"Rahimi","lastName":"Esfahani","suffix":""},{"id":517880074,"identity":"b00428ce-b828-4528-94c9-8a5c1a85f09a","order_by":2,"name":"Farshad Kiyoumarsi","email":"","orcid":"","institution":"ShK.C, Islamic Azad University","correspondingAuthor":false,"prefix":"","firstName":"Farshad","middleName":"","lastName":"Kiyoumarsi","suffix":""}],"badges":[],"createdAt":"2025-09-04 07:53:39","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-7533498/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7533498/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":92201500,"identity":"9e9e7492-18f3-48b7-9a9b-e631871c188c","added_by":"auto","created_at":"2025-09-25 17:12:32","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":54526,"visible":true,"origin":"","legend":"","description":"","filename":"paper.docx","url":"https://assets-eu.researchsquare.com/files/rs-7533498/v1/b5541b800eea6bbfe25a6e11.docx"},{"id":92201501,"identity":"3470a168-330b-4434-8187-cb81b84b2cbc","added_by":"auto","created_at":"2025-09-25 17:12:32","extension":"json","order_by":1,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":5492,"visible":true,"origin":"","legend":"","description":"","filename":"a32cf534857a4816a8a540457b8a247d.json","url":"https://assets-eu.researchsquare.com/files/rs-7533498/v1/f75788fd68d291e3b8ec8549.json"},{"id":92201502,"identity":"4b83ebe2-fb43-4870-b257-bd19de701700","added_by":"auto","created_at":"2025-09-25 
17:12:32","extension":"xml","order_by":2,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":116688,"visible":true,"origin":"","legend":"","description":"","filename":"a32cf534857a4816a8a540457b8a247d1enriched.xml","url":"https://assets-eu.researchsquare.com/files/rs-7533498/v1/7264dde3809e9a6ce7192a1d.xml"},{"id":92201504,"identity":"806b430a-0711-4654-b649-832b05f22b69","added_by":"auto","created_at":"2025-09-25 17:12:32","extension":"xml","order_by":3,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":116518,"visible":true,"origin":"","legend":"","description":"","filename":"a32cf534857a4816a8a540457b8a247d1structuring.xml","url":"https://assets-eu.researchsquare.com/files/rs-7533498/v1/30b7c8d4ee92d80ca7e2838c.xml"},{"id":92201503,"identity":"c45a1e36-0eea-4c61-ab6d-d03b6b716cfe","added_by":"auto","created_at":"2025-09-25 17:12:32","extension":"html","order_by":4,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":124307,"visible":true,"origin":"","legend":"","description":"","filename":"earlyproof.html","url":"https://assets-eu.researchsquare.com/files/rs-7533498/v1/2f655a1c67bd68d6a893ed6a.html"},{"id":92950637,"identity":"5d8d988e-c12b-4874-bbe9-c0cdd29cfd64","added_by":"auto","created_at":"2025-10-07 13:06:35","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":713834,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7533498/v1/a4b4c4a4-afa4-47f6-b5b9-ffe6cfe7d240.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Potential Use of ChatGPT for Automated Essay Scoring Based","fulltext":[{"header":"Introduction","content":"\u003cp\u003eWriting is widely recognized as one of the most demanding skills in language education, and for good reason. 
It does not merely involve producing text; it requires learners to sustain intellectual effort, organize their ideas, and refine them into a coherent argument (Irman \u0026amp; Almusharraf, \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). Unlike oral interaction, where meaning can be clarified instantly, writing results in a permanent record. This permanence turns writing into an extended piece of discourse that is cognitively taxing and difficult to master. As Emig (\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e1997\u003c/span\u003e) persuasively argued, writing is not only a means of communication but also a distinct mode of learning\u0026mdash;yet one that imposes high cognitive costs.\u003c/p\u003e\u003cp\u003eBecause of these demands, the assessment of writing has long been problematic. Evaluation practices are often criticized for inter-rater variability, lack of transparency, and the inconsistent application of band descriptors. These features make traditional assessment a weaker basis for applied linguistics research concerned with fairness and reliability (Ahmadi Shirazi, \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2013\u003c/span\u003e; Beck et al., 2018). After all, even trained examiners interpret descriptors differently, which means that scores may vary depending on who evaluates a script. For this reason, scholars have repeatedly called for more systematic approaches to writing evaluation.\u003c/p\u003e\u003cp\u003eAutomated Essay Scoring (AES) has emerged as one response to these challenges. Drawing on Artificial Intelligence (AI) and Natural Language Processing (NLP), AES promises efficiency, consistency, and reduced rater fatigue (Doğan et al., \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2014\u003c/span\u003e; Mizumoto \u0026amp; Eguchi, \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). 
Indeed, it has been embraced in large-scale testing environments, where thousands of essays must be processed within limited time frames (Barrot, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2024\u003c/span\u003e; Zawacki-Richter et al., \u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). However, reliability is not equivalent to validity. Although AES can stabilize scoring, it does not automatically guarantee fairness, and concerns about criterion-related validity remain central (Mizumoto \u0026amp; Eguchi, \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2023\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eThe release of ChatGPT by OpenAI in late 2022 intensified these debates. As a transformer-based AI model, ChatGPT has demonstrated remarkable ability to generate coherent text, structure arguments, and provide context-sensitive support (Kohnke et al., \u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). In principle, such features could relieve examiners by producing consistent scores, thereby allowing teachers to devote more energy to qualitative feedback (Barrot, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2024\u003c/span\u003e; Essel, \u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). It has therefore been suggested that ChatGPT could serve as an AES system, complementing or even approximating human raters (Parker et al., \u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Mizumoto \u0026amp; Eguchi, \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Yancey et al., \u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). Yet, enthusiasm must be moderated. 
While ChatGPT handles surface-level features such as grammar and lexical range with relative accuracy, it is less reliable in capturing discourse-level coherence, rhetorical effectiveness, and idea development (Bui \u0026amp; Barrot, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2024\u003c/span\u003e; Ramesh \u0026amp; Sanampudi, \u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Yancey et al., \u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e2023\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eThis creates a tension. On the one hand, ChatGPT appears consistent and efficient; on the other, it risks overlooking the very dimensions that define writing proficiency in extended discourse. Although prompt design and input quality have been shown to improve its performance (Huang et al., \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Enright \u0026amp; Quinlan, \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e2010\u003c/span\u003e; Barrot, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2024\u003c/span\u003e), they cannot fully eliminate bias or inconsistency. Moreover, few studies have systematically examined its ability to evaluate crucial aspects such as global cohesion, appropriateness to audience, or stylistic variation\u0026mdash;dimensions that applied linguistics research considers essential for validity in academic writing assessment.\u003c/p\u003e\u003cp\u003eAgainst this backdrop, the present study aims to investigate ChatGPT\u0026rsquo;s role in Automated Essay Scoring by comparing its performance with that of certified human examiners on IELTS Writing Task 2 essays. 
This task was chosen because its scoring rubric is transparent and widely documented, consisting of four analytic criteria: Task Response (TR), Coherence and Cohesion (CC), Lexical Resource (LR), and Grammatical Range and Accuracy (GRA) (Daneshvar et al., \u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). TR assesses the degree to which the prompt is addressed with well-supported arguments; CC captures the logical flow and effective use of cohesive devices; LR measures lexical precision, variety, and appropriateness; while GRA reflects control of syntax, complexity, and grammatical accuracy. These descriptors are intended to operationalize the construct of writing proficiency in a structured way. However, interpretive flexibility cannot be eliminated entirely, which is why IELTS Writing Task 2 provides an ideal context to test whether ChatGPT can deliver not only stable scores but also ones that meaningfully align with expert human judgment.\u003c/p\u003e"},{"header":"Literature review","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\u003ch2\u003eAutomated Writing Evaluation and Automated Essay Scoring\u003c/h2\u003e\u003cp\u003eAutomated Writing Evaluation (AWE) and AES systems have increasingly been positioned as central instruments in the assessment of L2 writing. These systems are designed not only to capture macro-level dimensions such as argument structure and organization but also to identify micro-level aspects like grammar and syntax (Zhai \u0026amp; Ma, \u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e2022\u003c/span\u003e). Drawing on the Complexity, Accuracy, and Fluency (CAF) framework, they are frequently applied to distinguish proficiency levels in a more systematic and standardized way (Phuoc \u0026amp; Barrot, \u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e2022\u003c/span\u003e). 
Although AES and AWE are often promoted as complementary tools to human raters in high-stakes tests such as TOEFL and IELTS, their actual alignment with human judgment has been repeatedly questioned, particularly with respect to construct validity and the consistency of rater interpretation.\u003c/p\u003e\u003cp\u003eDifferent theoretical perspectives have informed the conceptual foundations of AWE. From the interaction hypothesis, the emphasis is placed on modified linguistic input occurring during exchanges between learners and automated systems, or between teachers and such systems (Long, \u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e1996\u003c/span\u003e; Wilson \u0026amp; Czik, \u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e2016\u003c/span\u003e). In contrast, sociocultural theory interprets AWE feedback as mediated assistance that scaffolds learning within the learner\u0026rsquo;s zone of proximal development (Zhai \u0026amp; Ma, \u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e2022\u003c/span\u003e). Cognitive information-processing perspectives highlight yet another dimension: AWE and AES function as metalinguistic feedback providers, offering learners immediate opportunities to notice, reflect on, and restructure linguistic forms. In this way, explicit knowledge is gradually converted into implicit competence (Schmidt, 1995). 
Taken together, these theoretical perspectives underscore that automated systems do more than generate scores; they operate as pedagogical mediators with the potential to support learning in extended pieces of discourse.\u003c/p\u003e\u003cp\u003eFrom a psychometric standpoint, AES attempts to replicate human scoring practices while simultaneously promising enhanced inter-rater reliability, scalability, and efficiency (Ramineni \u0026amp; Williamson, \u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e2013\u003c/span\u003e; Shermis, \u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e2014\u003c/span\u003e). In large-scale testing environments, this scalability is particularly attractive, since thousands of essays must be scored with consistency under strict timelines. It is often claimed that AES offers objectivity by minimizing rater fatigue and bias. Yet this assumption of neutrality is not absolute. Algorithmic outputs remain constrained by training corpora and the design of scoring rubrics, meaning that claims of objectivity should be treated cautiously.\u003c/p\u003e\u003cp\u003eDespite these advantages, AES systems reveal serious limitations. They are frequently less capable of capturing rhetorical effectiveness, reasoning, creativity, or stylistic nuance\u0026mdash;features that applied linguistics research regards as central to advanced writing proficiency (Barrot, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2024\u003c/span\u003e; Lee et al., \u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). Context can also be misinterpreted, leading to questionable judgments (Barrot, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2024\u003c/span\u003e). 
Moreover, because AES systems can sometimes be manipulated through superficial strategies such as keyword stuffing, learners may attain artificially inflated scores without authentic improvement in their writing quality (Higgins \u0026amp; Heilman, \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e2014\u003c/span\u003e). These shortcomings reinforce the point that statistical reliability cannot be equated with construct representation or validity.\u003c/p\u003e\u003cp\u003eScholars have therefore attempted to classify and critique existing AES models. Hussein et al. (\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e2019\u003c/span\u003e) distinguished between handcrafted feature-based models and automatic feature-extraction approaches. While the former stress objectivity and transparency, the latter enable wider coverage but can also introduce new forms of bias. Cox (2022) similarly pointed out that crucial dimensions, such as lexical sophistication and cohesion, remain underdeveloped in many existing rubrics, weakening claims of validity.\u003c/p\u003e\u003cp\u003eRamesh and Sanampudi (2021), in a systematic review of 26 AES-related studies, identified persistent shortcomings in measuring coherence and completeness. Many AES systems were shown to rely on general-purpose datasets, such as the Kaggle ASAP corpus, which lack domain-specific features essential for fine-grained analysis. They concluded that without domain-tailored corpora and improved feature extraction, AES will continue to struggle with criterion-related validity and comparability to human raters.\u003c/p\u003e\u003cp\u003eResearch on AWE has also emphasized feedback functions rather than final scores. 
Crossley (\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2020\u003c/span\u003e), Fan and Ma (\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e2022\u003c/span\u003e), and Zhai and Ma (\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e2022\u003c/span\u003e) demonstrated that automated feedback can enhance writing quality, though the extent of improvement varies depending on learners\u0026rsquo; educational level and the genre of writing. Yet, methodological issues persist: small sample sizes and a predominant focus on overall writing quality, rather than on fine-grained dimensions such as syntactic complexity, limit the generalizability of findings. Fan and Ma\u0026rsquo;s (\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e2022\u003c/span\u003e) review of 22 studies suggested that AWE feedback tends to be most effective in within-group experimental designs, pointing to further methodological caveats.\u003c/p\u003e\u003cp\u003eDirect comparisons between automated systems and human raters have provided additional insights. Almusharraf and Alotaibi (\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2023\u003c/span\u003e), for instance, analyzed 197 essays scored both by Grammarly and by human examiners, and observed a moderate correlation. However, Grammarly not only detected more errors but also consistently assigned lower scores. This outcome illustrates a trade-off: while AES enhances inter-rater consistency, it does not necessarily align with human interpretations of band descriptors, leaving a gap in construct validity.\u003c/p\u003e\u003c/div\u003e\n\u003ch3\u003eChatGPT as an AI Model in Writing and Education\u003c/h3\u003e\n\u003cp\u003eIn more recent years, scholarly attention has shifted specifically to ChatGPT and its potential applications in writing and education. 
Broader discussions of AI in education have already emphasized learner support, reduced teacher workload, and the potential to bridge achievement gaps (Afzaal et al., \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Holmes et al., \u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Imran \u0026amp; Almusharraf, \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). With its capacity to process vast datasets and mimic human-like responses, AI has pushed AES beyond the boundaries of traditional feature-based models.\u003c/p\u003e\u003cp\u003eA pivotal innovation underpinning this shift is the rise of deep learning. Neural networks\u0026mdash;especially transformer architectures\u0026mdash;have reshaped a wide array of tasks ranging from speech recognition to natural language processing (Dong et al., \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e2017\u003c/span\u003e; Hussein et al., \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). By relying on self-attention in place of the recurrence of Recurrent Neural Networks (RNNs) and the convolutions of Convolutional Neural Networks (CNNs), transformers can process a sequence in parallel, which allows for faster and more effective training on sequential data (Giacaglia, \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e2019\u003c/span\u003e; Mizumoto \u0026amp; Eguchi, \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). Building on this foundation, GPT models were trained on massive corpora to perform diverse NLP tasks, from text generation to translation (Radford et al., \u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e2018\u003c/span\u003e). 
ChatGPT, as a conversational adaptation of GPT, has therefore redefined human\u0026ndash;machine interaction and inevitably raised questions about its potential role in AES (Essel, \u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Lee et al., 2017).\u003c/p\u003e\u003cp\u003eOne of ChatGPT\u0026rsquo;s unique contributions is its ability to deliver Automated Written Corrective Feedback (AWCF). Unlike traditional grammar checkers, it can evaluate not only grammar but also content, organization, clarity, and style, while simultaneously suggesting revisions (Barrot, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2024\u003c/span\u003e). Its scoring functions operate at both holistic and analytic levels, often providing explanations alongside band scores. In principle, this dual capability should allow for richer criterion-referenced feedback. Nevertheless, concerns about validity and fairness remain unresolved (Barrot, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2024\u003c/span\u003e; Han \u0026amp; Sari, \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e2024\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eEmpirical studies highlight both benefits and limitations. Alberth (\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) observed that ChatGPT may accelerate research productivity and support writing assistance, but cautioned against overreliance and ethical misuse. Imran and Almusharraf (\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) echoed this caution, underscoring the need for careful human oversight to mitigate risks of plagiarism and overdependence. 
Mizumoto and Eguchi (\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) further demonstrated that ChatGPT\u0026rsquo;s scoring can be strengthened by incorporating linguistic features such as lexical diversity and syntactic complexity.\u003c/p\u003e\u003cp\u003eOther comparative studies provide additional nuance. Guo and Wang (\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) examined ChatGPT\u0026rsquo;s feedback on argumentative writing and found that it generated more comments than human teachers, distributing attention more evenly across content, organization, and language. Teachers valued these strengths but pointed to weaknesses such as excessive verbosity, limited context sensitivity, and accessibility issues. Schmidt-Fajlik (\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) compared ChatGPT with Grammarly and ProWritingAid, reporting that learners appreciated its intuitive interface and detailed feedback, even though its evaluations remained largely linguistic rather than rhetorical.\u003c/p\u003e\u003cp\u003eYet significant concerns about reliability persist. Yancey et al. (\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) reported that while ChatGPT aligned well with human ratings on linguistic measures, it failed to capture rhetorical effectiveness and audience awareness. Parker et al. (\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) observed that ChatGPT tended to be stricter than human raters, consistently assigning lower scores. 
Extending this line of inquiry, Bui and Barrot (\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) identified instability in ChatGPT\u0026rsquo;s scoring across different time points, partly due to algorithmic drift and version updates (Ray, \u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Gonzalez Torres \u0026amp; Sawhney, \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Suppadungsuk et al., \u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). Although these updates are designed to enhance model performance, they complicate longitudinal validity and comparability.\u003c/p\u003e\u003cp\u003eTaken together, these findings converge on a familiar theme: ChatGPT is reliable at surface-level judgments but struggles with higher-order aspects of writing, including discourse-level cohesion, argumentative depth, and creativity (Ramesh \u0026amp; Sanampudi, \u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Mizumoto \u0026amp; Eguchi, \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). While earlier AES systems were criticized on similar grounds, ChatGPT\u0026rsquo;s probabilistic nature adds another layer of variability. As Schade (\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) has noted, even when identical prompts are used, slight differences can occur in the outputs due to randomness in large language models.\u003c/p\u003e\u003cp\u003eIn sum, current research portrays ChatGPT as a promising but still imperfect AES tool. It is effective in detecting grammatical accuracy and lexical range, in generating immediate feedback, and in reducing examiner workload. However, it remains less capable of capturing the full construct of academic writing proficiency. 
Whereas human raters interpret band descriptors with reference to rhetorical effectiveness and contextual appropriateness, ChatGPT often under-represents these higher-order dimensions. This persistent gap underscores the need for further validation studies, especially in high-stakes contexts such as IELTS Writing Task 2, where criterion-related validity and fairness are paramount.\u003c/p\u003e\u003cp\u003eFilling these research gaps, the present study aims to offer a more detailed examination of ChatGPT\u0026rsquo;s performance as an AES tool compared with established human scoring practices. While earlier investigations have identified both potential and limitations, very few studies have directly compared ChatGPT\u0026rsquo;s scores with those of certified examiners under standardized, high-stakes conditions. Moreover, little systematic attention has been paid to whether discrepancies can be minimized through careful prompt design and consistent evaluation procedures.\u003c/p\u003e\u003cp\u003eFor this reason, the current study is guided by the following research questions:\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eIs ChatGPT a reliable language model in scoring essays according to the IELTS Writing Task 2 benchmarks?\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eIs there a statistically significant difference between ChatGPT\u0026rsquo;s essay scores and human raters\u0026rsquo; scores when both follow IELTS Writing Task 2 band descriptors?\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003cp\u003eBy addressing these questions, the study aims not only to evaluate ChatGPT\u0026rsquo;s reliability and validity but also to clarify whether it can serve as a credible complement\u0026mdash;rather than a substitute\u0026mdash;for human raters in high-stakes writing assessment.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec6\" 
class=\"Section2\"\u003e\u003ch2\u003eResearch Design\u003c/h2\u003e\u003cp\u003eThis study followed a quantitative, non-experimental design in the tradition of ex post facto research. Since each essay was evaluated by both a certified IELTS examiner and ChatGPT, the analysis adopted a within-subject framework, treating the type of rater (human vs. AI) as the independent factor. The dependent variables were the four analytic IELTS Writing Task 2 criteria\u0026mdash;Task Response (TR), Coherence and Cohesion (CC), Lexical Resource (LR), and Grammatical Range and Accuracy (GRA)\u0026mdash;together with the overall band score. To examine reliability, intra-class correlation coefficients (ICC 2,1) were calculated, as this statistic is considered more appropriate than Cronbach\u0026rsquo;s alpha for continuous ratings across raters and allows for absolute agreement estimates. To address differences in scoring patterns, a repeated-measures MANOVA was employed, followed by paired-samples t-tests and effect size calculations where necessary. All assumptions for multivariate analysis (e.g., normality, linearity, and homogeneity of covariance matrices) were checked before interpreting results. While this design does not permit causal inference, it provides a rigorous means of comparing the alignment and divergence between human and AI raters.\u003c/p\u003e\u003c/div\u003e\n\u003ch3\u003eParticipants and Sample Selection\u003c/h3\u003e\n\u003cp\u003eThe dataset consisted of thirty IELTS Writing Task 2 essays, deliberately selected to cover a wide range of proficiency levels. The essays were drawn from three sources: the Cambridge IELTS practice book series, the \u0026ldquo;How to do IELTS\u0026rdquo; website, and Exir Academy\u0026rsquo;s online platform. A stratified random sampling procedure was applied to ensure representation of different proficiency bands, thereby enhancing the fairness of the comparison. 
Human ratings were provided by certified IELTS examiners with prior experience in test preparation and assessment.\u003c/p\u003e\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\u003ch2\u003eMaterials and Instruments\u003c/h2\u003e\u003cp\u003eThe primary material for analysis was IELTS Writing Task 2, chosen because it requires candidates to produce extended argumentative essays that demand idea development, organization, and support with relevant examples. In contrast to Task 1, which mainly involves data description, Task 2 offers a more comprehensive measure of academic writing proficiency. Both human raters and ChatGPT evaluated the essays using the official IELTS scoring descriptors, covering TR, CC, LR, GRA, and overall band performance.\u003c/p\u003e\u003c/div\u003e\n\u003ch3\u003eProcedures\u003c/h3\u003e\n\u003cp\u003eAll essays were anonymized prior to scoring to reduce potential bias. Each script was independently scored by a human examiner and by ChatGPT, with ratings recorded separately for each of the four analytic categories and the overall band score. No qualitative feedback or comments were provided; only numeric band scores were considered for analysis.\u003c/p\u003e\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e\u003ch2\u003eData Analysis\u003c/h2\u003e\u003cp\u003eData analysis proceeded in several stages. First, descriptive statistics (means and standard deviations) were computed for all variables to illustrate central tendencies and dispersion. Inter-rater reliability between human and AI scores was then assessed using two-way random effects intra-class correlation coefficients (ICC 2,1), reported with 95% confidence intervals. To evaluate systematic differences between human and AI scores, a repeated-measures MANOVA was performed. Where multivariate results were significant, follow-up paired-samples t-tests were conducted for each criterion, with effect sizes (Cohen\u0026rsquo;s d) calculated to gauge the magnitude of differences. 
A significance threshold of α\u0026thinsp;=\u0026thinsp;.05 was applied, and results were interpreted with reference to both statistical and practical significance.\u003c/p\u003e\u003c/div\u003e"},{"header":"Results","content":"\u003cp\u003eTo evaluate the stability of ChatGPT\u0026rsquo;s scoring across three repeated scoring sessions on the same scripts, we estimated intra-rater reliability using a two-way random-effects, absolute-agreement ICC(2,1). Reliability was uniformly excellent across all outcomes\u0026mdash;TR\u0026thinsp;=\u0026thinsp;0.984, CC\u0026thinsp;=\u0026thinsp;0.968, LR\u0026thinsp;=\u0026thinsp;0.980, GRA\u0026thinsp;=\u0026thinsp;0.967, Overall\u0026thinsp;=\u0026thinsp;0.964\u0026mdash;indicating near-perfect repeatability under identical prompting conditions (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). Descriptively, means and standard deviations changed only negligibly from session to session, reinforcing this stability.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eIntra-class correlation coefficients for ChatGPT scores across three sessions\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eCriterion\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eICC(2,1)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" 
colname=\"c3\"\u003e\u003cp\u003eInterpretation*\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eTask Response (TR)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.984\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eExcellent\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eCoherence and Cohesion (CC)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.968\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eExcellent\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGrammatical Range and Accuracy (GRA)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.967\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eExcellent\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLexical Resource (LR)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.980\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eExcellent\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eOverall Band Score\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.964\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eExcellent\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eHaving established that ChatGPT yields highly stable scores across repeated measurements, we 
next examined whether its ratings systematically diverged from those assigned by human raters. This addressed the second research question and required an omnibus multivariate test followed by criterion-level paired comparisons. A repeated-measures MANOVA with rater type (human vs. ChatGPT) as the within-subjects factor and the five IELTS outcomes as dependent variables revealed a significant multivariate effect of rater type: Pillai\u0026rsquo;s Trace\u0026thinsp;=\u0026thinsp;0.897, F(5, 29)\u0026thinsp;=\u0026thinsp;50.30, p\u0026thinsp;\u0026lt;\u0026thinsp;.001. Convergent statistics from the Hotelling\u0026ndash;Lawley Trace, Roy\u0026rsquo;s Largest Root, and Hotelling\u0026rsquo;s T\u0026sup2; corroborated this omnibus difference, indicating that the combined score profile reliably differed between human and AI ratings (Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e).\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eRepeated-measures multivariate test results (Hotelling\u0026rsquo;s T\u0026sup2;)\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"6\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth 
align=\"left\" colname=\"c1\"\u003e\u003cp\u003eTest\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eStatistic\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eF\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003edf1\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003edf2\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003ep\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePillai's Trace\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.897\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e50.30\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e29\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eHotelling-Lawley Trace\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e8.672\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e44.86\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e25\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRoy's Largest Root\u003c/p\u003e\u003c/td\u003e\u003ctd 
align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e8.672\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e251.49\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e29\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eHotelling's T\u0026sup2;\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e251.487\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e43.36\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e25\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eFollow-up paired comparisons localized the multivariate effect to specific criteria (Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e). Human raters consistently awarded higher scores than ChatGPT across all five outcomes. 
The largest discrepancies were observed for Coherence and Cohesion (Mean Diff\u0026thinsp;=\u0026thinsp;1.50, 95% CI [1.19, 1.81], t(29)\u0026thinsp;=\u0026thinsp;10.02, p\u0026thinsp;\u0026lt;\u0026thinsp;.001, d\u003csub\u003ez\u003c/sub\u003e\u0026thinsp;=\u0026thinsp;1.83) and for Grammatical Range and Accuracy (Mean Diff\u0026thinsp;=\u0026thinsp;0.90, 95% CI [0.58, 1.22], t(29)\u0026thinsp;=\u0026thinsp;5.83, p\u0026thinsp;\u0026lt;\u0026thinsp;.001, d\u003csub\u003ez\u003c/sub\u003e\u0026thinsp;=\u0026thinsp;1.07). Substantial differences were also found for Lexical Resource (Mean Diff\u0026thinsp;=\u0026thinsp;0.80, 95% CI [0.52, 1.08], t(29)\u0026thinsp;=\u0026thinsp;5.76, p\u0026thinsp;\u0026lt;\u0026thinsp;.001, d\u003csub\u003ez\u003c/sub\u003e\u0026thinsp;=\u0026thinsp;1.05) and for the Overall band (Mean Diff\u0026thinsp;=\u0026thinsp;0.90, 95% CI [0.61, 1.19], t(29)\u0026thinsp;=\u0026thinsp;6.31, p\u0026thinsp;\u0026lt;\u0026thinsp;.001, d\u003csub\u003ez\u003c/sub\u003e\u0026thinsp;=\u0026thinsp;1.15). Even Task Response showed a smaller but significant gap (Mean Diff\u0026thinsp;=\u0026thinsp;0.70, 95% CI [0.16, 1.24], t(29)\u0026thinsp;=\u0026thinsp;2.66, p\u0026thinsp;=\u0026thinsp;.013, d\u003csub\u003ez\u003c/sub\u003e\u0026thinsp;=\u0026thinsp;0.49). 
Holm-adjusted p-values remained significant across all criteria, confirming that human and AI scores differ systematically in level and not merely in rank ordering.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003ePaired comparisons (Human \u0026ndash; ChatGPT) for IELTS criteria and overall band score\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"13\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c9\" colnum=\"9\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c10\" colnum=\"10\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c11\" colnum=\"11\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c12\" colnum=\"12\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c13\" colnum=\"13\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth 
align=\"left\" colname=\"c1\"\u003e\u003cp\u003eCriterion\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eHuman Mean\u0026thinsp;\u0026plusmn;\u0026thinsp;SD\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eChatGPT\u003c/p\u003e\u003cp\u003eMean\u0026thinsp;\u0026plusmn;\u0026thinsp;SD\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eMean Diff (H\u0026ndash;C)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eSE (Diff)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003e95% CI Low\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c7\"\u003e\u003cp\u003e95% CI High\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c8\"\u003e\u003cp\u003et (paired)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c9\"\u003e\u003cp\u003edf\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c10\"\u003e\u003cp\u003ep\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c11\"\u003e\u003cp\u003eCohen's d_z\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c12\"\u003e\u003cp\u003ePartial η\u0026sup2;\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c13\"\u003e\u003cp\u003ep (Holm)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eTR\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e\u003cp\u003e5.50\u0026thinsp;\u0026plusmn;\u0026thinsp;1.46\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e\u003cp\u003e4.80\u0026thinsp;\u0026plusmn;\u0026thinsp;0.89\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.70\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c5\"\u003e\u003cp\u003e0.26\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.16\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e1.24\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e2.66\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e29\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e0.013\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c11\"\u003e\u003cp\u003e0.49\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c12\"\u003e\u003cp\u003e0.20\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c13\"\u003e\u003cp\u003e0.013\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eCC\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e\u003cp\u003e5.80\u0026thinsp;\u0026plusmn;\u0026thinsp;1.00\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e\u003cp\u003e4.30\u0026thinsp;\u0026plusmn;\u0026thinsp;0.65\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e1.50\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.15\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e1.19\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e1.81\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e10.02\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e29\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c10\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c11\"\u003e\u003cp\u003e1.83\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c12\"\u003e\u003cp\u003e0.78\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c13\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLR\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e\u003cp\u003e5.10\u0026thinsp;\u0026plusmn;\u0026thinsp;0.84\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e\u003cp\u003e4.30\u0026thinsp;\u0026plusmn;\u0026thinsp;0.79\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.80\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.14\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.52\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e1.08\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e5.76\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e29\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c11\"\u003e\u003cp\u003e1.05\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c12\"\u003e\u003cp\u003e0.53\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c13\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" 
colname=\"c1\"\u003e\u003cp\u003eGRA\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e\u003cp\u003e4.90\u0026thinsp;\u0026plusmn;\u0026thinsp;0.96\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e\u003cp\u003e4.00\u0026thinsp;\u0026plusmn;\u0026thinsp;0.64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.90\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.15\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.58\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e1.22\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e5.83\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e29\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c11\"\u003e\u003cp\u003e1.07\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c12\"\u003e\u003cp\u003e0.54\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c13\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eOverall\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e\u003cp\u003e5.40\u0026thinsp;\u0026plusmn;\u0026thinsp;0.93\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e\u003cp\u003e4.50\u0026thinsp;\u0026plusmn;\u0026thinsp;0.72\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.90\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" 
char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.14\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.61\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e1.19\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e6.31\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e29\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c11\"\u003e\u003cp\u003e1.15\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c12\"\u003e\u003cp\u003e0.58\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c13\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eTaken together, ChatGPT\u0026rsquo;s scoring is highly stable across repeated sessions (ICC: .964\u0026ndash;.984), yet its mean scores do not fully converge with human judgments; the omnibus multivariate effect is large, and criterion-level gaps are most pronounced for coherence/organization and grammatical control.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eThe present study investigated ChatGPT’s potential as an AES tool by comparing its performance with that of human raters using IELTS Writing Task 2 criteria. The results revealed two key findings: first, ChatGPT demonstrated high internal reliability across three sessions, as indicated by ICC values exceeding .96 for all criteria, establishing its scoring stability. 
Second, despite this reliability, significant systematic differences were observed between human and AI scores across all four analytic criteria and the overall band score, as evidenced by repeated-measures MANOVA and follow-up paired comparisons. These results highlight ChatGPT’s consistency but also its divergence from human evaluators, raising critical implications for its role in high-stakes assessment.\u003c/p\u003e\u003cp\u003eThese findings extend prior research on AES systems, which has consistently reported discrepancies between machine-generated and human ratings (Almusharraf \u0026amp; Alotaibi, \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Crossley, \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2020\u003c/span\u003e; Fan \u0026amp; Ma, \u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Ramesh \u0026amp; Sanampudi, 2021; Zhai \u0026amp; Ma, \u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e2022\u003c/span\u003e). In line with Almusharraf and Alotaibi (\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) and Parker et al. (\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e2023\u003c/span\u003e), our results show that ChatGPT tended to assign lower scores than human raters, particularly for Coherence and Cohesion and Lexical Resource. These findings also invite closer comparison with previous research. 
Unlike Crossley (\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2020\u003c/span\u003e), who underscored the importance of linguistic indicators such as syntactic complexity and lexical sophistication in shaping writing quality, the present study shows that ChatGPT’s judgments were more heavily weighted toward surface-level accuracy, often overlooking deeper discourse-level dimensions.\u003c/p\u003e\u003cp\u003eSimilarly, while Guo and Wang (\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) reported that ChatGPT provided extensive feedback distributed across content, organization, and language, our results indicate that its evaluative focus in scoring remained narrower, with less sensitivity to rhetorical effectiveness and creative idea development—areas where human raters demonstrated greater tolerance and recognition.\u003c/p\u003e\u003cp\u003eThis contrasts with Bui and Barrot (\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2024\u003c/span\u003e), who emphasized inconsistencies across different ChatGPT versions, whereas our study—by employing a single, stable version—found consistently replicated scores. Thus, while previous work questioned the reproducibility of ChatGPT’s evaluations (Bui \u0026amp; Barrot, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2024\u003c/span\u003e), the present study confirms that the system is internally stable when conditions are held constant.\u003c/p\u003e\u003cp\u003eOne possible explanation for the lower scores assigned by ChatGPT lies in its algorithmic focus on error detection rather than nuanced assessment of idea development and creativity (Guo \u0026amp; Wang, \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e2024\u003c/span\u003e; Schmidt-Fajlik, \u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). Similar to the observations of Yancey et al. 
(\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e2023\u003c/span\u003e), ChatGPT struggles with interpreting unconventional structures or subtle argumentative moves, often penalizing essays where human raters recognize originality or rhetorical effectiveness. Human raters, familiar with second-language learner writing, may apply more lenient judgments, especially for coherence and lexical sophistication, leading to the observed score gaps.\u003c/p\u003e\u003cp\u003eAnother contributing factor is training data bias, as generative AI systems reflect the patterns of their training corpus (Bui \u0026amp; Barrot, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2024\u003c/span\u003e). This bias can result in systematic underestimation of L2 writing performance, particularly in dimensions such as cohesion and vocabulary richness. Our findings resonate with prior research showing that AES systems often fail to capture creativity and discourse-level fluency (Ramesh \u0026amp; Sanampudi, 2021; Yan et al., \u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). The substantial and statistically significant differences in Coherence and Cohesion (p \u0026lt; .001) and Lexical Resource (p \u0026lt; .001) underscore these limitations.\u003c/p\u003e\u003cp\u003eHowever, a notable departure from Bui and Barrot’s (\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) conclusions lies in the consistency of ChatGPT scores across repeated evaluations. 
Whereas they reported variability attributed to algorithmic updates and model drift (Ray, \u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Gonzalez Torres \u0026amp; Sawhney, \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Suppadungsuk et al., \u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e2023\u003c/span\u003e), our study shows that ChatGPT produced nearly identical results across three sessions. This suggests that when model updates are controlled, ChatGPT can provide reproducible scores, thus offering potential utility in large-scale assessment scenarios where reliability is paramount.\u003c/p\u003e\u003cp\u003eDespite these strengths, the weak correlations with human ratings raise concerns about construct validity. AES models like ChatGPT appear to be reliable but not yet valid proxies for human judgment, particularly in high-stakes contexts like IELTS. As Mizumoto and Eguchi (\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) argue, enhancing scoring accuracy requires integrating linguistic features such as syntactic complexity and lexical diversity. Furthermore, improving prompt design and fine-tuning models specifically for essay scoring may help bridge the gap between AI and human evaluators.\u003c/p\u003e"},{"header":"Conclusion and Implications","content":"\u003cp\u003eThis study evaluated ChatGPT’s performance as an AES tool for IELTS Writing Task 2. Two conclusions are clear. First, ChatGPT produced highly consistent scores across repeated evaluations, with excellent intra-rater reliability, indicating that, under stable conditions, it can generate reproducible outcomes. 
Second, despite this stability, systematic differences emerged between ChatGPT and human raters for all analytic criteria (TR, CC, LR, GRA) and the overall band score, with the largest gaps in Coherence and Cohesion, followed by the overall band, Grammatical Range and Accuracy, and Lexical Resource, and a smaller yet reliable gap in Task Response. In short, reliability is strong, but validity, defined here as alignment with expert human judgment, remains limited.\u003c/p\u003e\u003cp\u003ePlaced against prior work, these findings both converge with and depart from the literature. They converge with studies that document persistent human–AI discrepancies and the difficulty AES systems face with discourse-level organization and nuanced lexical control (Almusharraf \u0026amp; Alotaibi, \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Crossley, \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2020\u003c/span\u003e; Fan \u0026amp; Ma, \u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Ramesh \u0026amp; Sanampudi, 2021; Zhai \u0026amp; Ma, \u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Yan et al., \u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). They also align with reports that ChatGPT tends to assign lower scores than human raters in multiple contexts (Bui \u0026amp; Barrot, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2024\u003c/span\u003e; Guo \u0026amp; Wang, \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e2024\u003c/span\u003e; Schmidt-Fajlik, \u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Yancey et al., \u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). 
At the same time, our design diverges from Bui and Barrot’s (\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) claim of unstable scoring across ChatGPT versions: by holding the version constant and repeating evaluations under identical conditions, we observed stable internal reliability, suggesting that previously reported variability likely reflected version changes and model drift (Ray, \u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Gonzalez Torres \u0026amp; Sawhney, \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Suppadungsuk et al., \u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) rather than inherent randomness.\u003c/p\u003e\u003cp\u003eTaken together, these results recommend a complementary—rather than substitutive—role for ChatGPT in high-stakes assessment. In practical terms:\u003c/p\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eHuman-in-the-loop scoring. Use ChatGPT for preliminary scoring and rapid triage, while reserving final judgments for trained examiners. This arrangement leverages AI’s efficiency without sacrificing validity and fairness (Almusharraf \u0026amp; Alotaibi, \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2023\u003c/span\u003e).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eTargeted deployment. Allow ChatGPT to highlight surface-level issues (grammatical slips, local cohesion) and provide criterion-linked prompts for revision, while teachers adjudicate discourse-level qualities (global coherence, argument development, rhetorical effectiveness) (Yancey et al., \u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Schmidt-Fajlik, \u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e2023\u003c/span\u003e).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eCalibration and standard setting. 
Periodically calibrate AI outputs to human band descriptors, adopt cut-scores for automatic acceptance vs. mandatory human review, and perform bias audits for L2 populations.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eModel improvement. Consistent with Mizumoto and Eguchi (\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2023\u003c/span\u003e), integrate linguistic features—syntactic complexity, lexical diversity, and discourse markers—into model tuning; task-specific fine-tuning for IELTS-style responses is especially warranted.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eMonitoring drift. Establish version control and drift monitoring so that any system update triggers re-validation before operational use (Ray, \u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Gonzalez Torres \u0026amp; Sawhney, \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Suppadungsuk et al., \u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e2023\u003c/span\u003e).\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003cp\u003eBeyond summative scoring, there is strong potential for formative applications. ChatGPT can support personalized learning by delivering immediate, criterion-referenced feedback that helps learners iterate on organization, vocabulary choice, and grammatical control (Alberth, \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Su \u0026amp; Yang, \u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). That said, because the model can underestimate performance relative to human raters—particularly on discourse and lexical dimensions—teacher mediation remains essential to prevent construct under-representation and preserve stakeholder trust.\u003c/p\u003e\u003cp\u003eIn conclusion, ChatGPT currently offers reliability without full validity in mirroring human scoring for IELTS Writing Task 2. 
When integrated thoughtfully (human-in-the-loop, calibrated to band descriptors, audited for drift and bias), it can improve throughput and feedback cycles. However, until refinements address discourse-level interpretation and alignment with expert judgments, ChatGPT should be positioned as a pedagogical aid and auxiliary scorer, not a stand-alone substitute, in high-stakes writing assessment (Bui \u0026amp; Barrot, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2024\u003c/span\u003e; Guo \u0026amp; Wang, \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e2024\u003c/span\u003e; Yancey et al., \u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Mizumoto \u0026amp; Eguchi, \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2023\u003c/span\u003e).\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cp\u003e\u003cstrong\u003eAES: \u003c/strong\u003eAutomated Essay Scoring\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAI: \u003c/strong\u003eArtificial Intelligence\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAWCF: \u003c/strong\u003eAutomated Written Corrective Feedback\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAWE:\u003c/strong\u003e\u0026nbsp;Automated Writing Evaluation\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCAF: \u003c/strong\u003eComplexity, Accuracy, and Fluency\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCC: \u003c/strong\u003eCoherence and Cohesion\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCNNs: \u003c/strong\u003eConvolutional Neural Networks\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eGRA: \u003c/strong\u003eGrammatical Range and Accuracy\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eIELTS: \u003c/strong\u003eInternational English Language Testing System\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eICC: \u003c/strong\u003eIntra-class correlation coefficient\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eLR: \u003c/strong\u003eLexical Resource\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMANOVA: 
\u003c/strong\u003eMultivariate analysis of variance\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eNLP: \u003c/strong\u003eNatural Language Processing\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTR:\u0026nbsp;\u003c/strong\u003eTask Response\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate:\u003c/strong\u003e\u003cp\u003enot applicable.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eClinical trial number:\u003c/strong\u003e\u003cp\u003enot applicable.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eConsent for publication:\u003c/strong\u003e\u003cp\u003enot applicable.\u003c/p\u003e\u003c/p\u003e\u003ch2\u003eFunding:\u003c/h2\u003e\u003cp\u003enot applicable.\u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eR.T. contributed to the conceptualization of the study, literature review, data collection, and initial drafting of the manuscript. F.R. supervised the research process, contributed to study design and methodology, performed data analysis, and provided critical revisions. F.K. contributed to the design and validation of the automated essay scoring procedures, assisted with statistical analyses, and participated in manuscript editing and technical review. All authors reviewed and approved the final version of the manuscript for submission.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eAfzaal, M., Imran, M., Du, X., \u0026amp; Almusharraf, N. (2022). Automated and human interaction in written discourse: A contrastive parallel corpus-based investigation of meta discourse features in machine-human translations. \u003cem\u003eSAGE Open\u003c/em\u003e, 12(4), 1-18. https://doi.org/21582440221142210\u003c/li\u003e\n\u003cli\u003eAhmadi Shirazi, M. (2013). Using an analytic dichotomous evaluation checklist to increase inter- and intra-rater reliability of EFL writing evaluation. 
\u003cem\u003eIranian Journal of Applied Linguistics, 16\u003c/em\u003e(1), 25\u0026ndash;57.\u003c/li\u003e\n\u003cli\u003eAlberth (2023). The use of ChatGPT in academic writing: A blessing or a curse in disguise? TEFLIN Journal: \u003cem\u003eA Publication on the Teaching and Learning of English\u003c/em\u003e \u0026middot; 34(2):337-352. \u003cu\u003eDOI:\u003c/u\u003e10.15639/teflinjournal.v34i2/337-352\u003c/li\u003e\n\u003cli\u003eAlmusharraf, N., \u0026amp; Alotaibi, H. (2023). An error-analysis study from an EFL writing context: Human and automated essay scoring approaches. \u003cem\u003eTechnology Knowledge and Learning\u003c/em\u003e,28(3), 1015\u0026ndash;1031. DOI:10.1007/s10758-022-09592-z\u003c/li\u003e\n\u003cli\u003eBarrot, J. S. (2023). Using ChatGPT for second language writing: Pitfalls and potentials. Assessing Writing, 57, 100745. \u003c/li\u003e\n\u003cli\u003eBarrot, J. S. (2024). Trends in automated writing evaluation systems research for teaching, learning, and assessment: A bibliometric analysis. \u003cem\u003eEducation and Information Technologies\u003c/em\u003e, 29(6), 7155\u0026ndash;7179.\u003c/li\u003e\n\u003cli\u003eBritish Council. (2023). \u003cem\u003eIELTS writing test \u0026ndash; Task 2\u003c/em\u003e. https://www.ielts.org/about-the-test/sample-test-questions\u003c/li\u003e\n\u003cli\u003eBui, N. M., \u0026amp; Barrot, J. S. (2024). ChatGPT as an automated essay scoring tool in the writing classrooms: how it compares with human scoring. \u003cem\u003eEducation and Information Technologies\u003c/em\u003e. https://doi.org/10.1007/s10639-024-12891-w\u003c/li\u003e\n\u003cli\u003eCerero, J. F., Rueda, M. M., Batanero, J. M. F., \u0026amp; Meneses, E. L. 
(2023). Impact of the implementation of ChatGPT in education: A systematic review. \u003cem\u003eComputers\u003c/em\u003e, \u003cem\u003e12\u003c/em\u003e(8), 153. https://doi.org/10.3390/computers12080153\u003c/li\u003e\n\u003cli\u003eCox, J. (2020). \u003cem\u003eWriting rubrics: Samples of basic, expository, and narrative rubrics\u003c/em\u003e. ThoughtCo. https://www.thoughtco.com/writing-rubric-examples-2081369\u003c/li\u003e\n\u003cli\u003eCrossley, S. (2020). Linguistic features in writing quality and development: An overview. \u003cem\u003eJournal of Writing Research\u003c/em\u003e, 11(3), 415\u0026ndash;443. https://doi.org/10.17239/jowr-2020.11.03.01\u003c/li\u003e\n\u003cli\u003eDaneshvar, A., Sadegh Bagheri, M., Sadighi, F., Yarmohammadi, L., \u0026amp; Yamini, M. (2021). A Probe into Iranian Learners\u0026rsquo; Performance on IELTS Academic Writing Task 2: Operationalizing Two Models of Dynamic Assessment versus Static Assessment. \u003cem\u003eJournal of Modern Research in English Language Studies\u003c/em\u003e, 8(2), 25\u0026ndash;58.\u003c/li\u003e\n\u003cli\u003eDoğan, A., Akbarova, A. A., Aydoğan, H., G\u0026ouml;nen, K., \u0026amp; Tuncdemir, E. (2014). Automated essay scoring versus human scoring: a reliability check. \u003cem\u003eInternational Journal of Linguistics, Literature and Translation\u003c/em\u003e, 3,1.\u003c/li\u003e\n\u003cli\u003eDong, F., Zhang, Y., \u0026amp; Yang, J. (2017). Attention-based recurrent convolutional neural network for automatic essay scoring. \u003cem\u003eProceedings of the 21st Conference on Computational Natural Language Learning\u003c/em\u003e, 153\u0026ndash;162. https://doi.org/10.18653/v1/K17-1017\u003c/li\u003e\n\u003cli\u003eDrigas, A. S., Argyri, K., \u0026amp; Vrettaros, J. (2009). Artificial intelligence techniques in student modeling. \u003cem\u003eIn Best practices for the knowledge society. 
Knowledge, learning, development and technology for all: Second world summit on the knowledge society\u003c/em\u003e, 2 (pp. 552-564). \u003c/li\u003e\n\u003cli\u003eEmig, J. (1997). Writing as a mode of learning. In Villanueva, V. (Ed.), Cross talk in composition theory. \u003cem\u003eUrbana, IL: National Council of Teachers of English\u003c/em\u003e.\u003c/li\u003e\n\u003cli\u003eEnright, M. K., \u0026amp; Quinlan, T. (2010). Complementing human judgment of essays written by English language learners with e-rater scoring. \u003cem\u003eLanguage Testing\u003c/em\u003e, 27(3), 317-334. https://doi.org/10.1177/0265532210363144 \u003c/li\u003e\n\u003cli\u003eEssel, H. B. (2023). \u003cem\u003e7 things you should know about GPT\u003c/em\u003e. Research\u003cspan dir=\"RTL\"\u003e \u003c/span\u003eGate.\u003cspan dir=\"RTL\"\u003e \u003c/span\u003e\u003cspan dir=\"RTL\"\u003e\u003cspan dir=\"LTR\"\u003ehttps://www.researchgate.net/publication/367377300\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\n\u003cli\u003eFan, N., \u0026amp; Ma, Y. (2022). The effects of automated writing evaluation (AWE) feedback on students\u0026rsquo; English writing quality: a systematic literature review. \u003cem\u003eLanguage Teaching Research Quarterly\u003c/em\u003e, 28, 53-73. https://doi:10.32038/ltrq.2022.28.03\u003c/li\u003e\n\u003cli\u003eGiacaglia, G. (2019). How transformers work. \u003cem\u003eMedium\u003c/em\u003e. https://towardsdatascience.com/transformers-141e32e69591\u003c/li\u003e\n\u003cli\u003eGonzalez Torres, A. P., \u0026amp; Sawhney, N. (2023). Role of regulatory sandboxes and MLOps for AI-enabled public sector services. \u003cem\u003eThe Review of Socionetwork Strategies\u003c/em\u003e,17, 297\u0026ndash;318. \u003c/li\u003e\n\u003cli\u003eGuo, K., \u0026amp; Wang, D. (2024). To resist it or to embrace it? Examining ChatGPT\u0026rsquo;s potential to support teacher feedback in EFL writing. 
\u003cem\u003eEducation and Information Technologies\u003c/em\u003e, 29, 8435\u0026ndash;8463.\u003c/li\u003e\n\u003cli\u003eHan, T., \u0026amp; Sari, E. (2024). An investigation on the use of automated feedback in Turkish EFL students\u0026rsquo; writing classes. \u003cem\u003eComputer Assisted Language Learning\u003c/em\u003e, 37(4), 961\u0026ndash;985.\u003c/li\u003e\n\u003cli\u003eHiggins, D., \u0026amp; Heilman, M. (2014). Managing what we can measure: Quantifying the susceptibility of automated scoring systems to gaming behavior. \u003cem\u003eEducational Measurement: Issues and Practice\u003c/em\u003e, 33(3), 36\u0026ndash;46.\u003c/li\u003e\n\u003cli\u003eHolmes, W., Persson, J., Chounta, I. A., Wasson, B., \u0026amp; Dimitrova, V. (2022). Artificial intelligence and education: A critical view through the lens of human rights, democracy and the rule of law. \u003cem\u003eCouncil of Europe\u003c/em\u003e. \u003cu\u003ehttps://rm.coe.int/artificial-intelligence-and-education-a-critical-view-through-the-lens/1680a886bd\u003c/u\u003e\u003c/li\u003e\n\u003cli\u003eHuang, W., Hew, K., \u0026amp; Fryer, L. (2022). Chatbots for language learning: Are they really useful? A systematic review of chatbot-supported language learning. \u003cem\u003eJournal of Computer Assisted Learning\u003c/em\u003e, 38(1), 237\u0026ndash;257. https://doi.org/10.1111/jcal.12610\u003c/li\u003e\n\u003cli\u003eHussein, M. A., Hassan, H., \u0026amp; Nassef, M. (2019). Automated language essay scoring systems: A literature review. \u003cem\u003ePeerJ Computer Science\u003c/em\u003e, 5, e208. https://doi.org/10.7717/peerj-cs.208\u003c/li\u003e\n\u003cli\u003eImran, M., \u0026amp; Almusharraf, N. (2023). Review of teaching innovation in university education: Case studies and main practices. \u003cem\u003eThe Social Science Journal\u003c/em\u003e. https://doi.org/10.1080/03623319.2023.2201973\u003c/li\u003e\n\u003cli\u003eKohnke, L., Moorhouse, B. 
L., \u0026amp; Zou, D. (2023). ChatGPT for language teaching and learning. \u003cem\u003eRELC Journal\u003c/em\u003e, 54(2). https://doi.org/10.1177/00336882231162868\u003c/li\u003e\n\u003cli\u003eLee, A. V. Y., Luco, A. C., \u0026amp; Tan, S. C. (2023). A human-centric automated essay scoring and feedback system for the development of ethical reasoning. \u003cem\u003eEducational Technology \u0026amp; Society\u003c/em\u003e, 26(1), 147\u0026ndash;159.\u003c/li\u003e\n\u003cli\u003eLong, M. H. (1996). The role of the linguistic environment in second language acquisition. In W. Ritchie \u0026amp; T. K. Bhatia (Eds.), \u003cem\u003eHandbook of second language acquisition\u003c/em\u003e (pp. 413\u0026ndash;468). Academic Press.\u003c/li\u003e\n\u003cli\u003eMizumoto, A., \u0026amp; Eguchi, M. (2023). Exploring the potential of using an AI language model for automated essay scoring. \u003cem\u003eResearch Methods in Applied Linguistics\u003c/em\u003e, 2(2). https://doi.org/10.1016/j.rmal.2023.100050\u003c/li\u003e\n\u003cli\u003eParker, J. L., Becker, K., \u0026amp; Carroca, C. (2023). ChatGPT for automated writing evaluation in scholarly writing instruction. \u003cem\u003eJournal of Nursing Education\u003c/em\u003e, 62(12), 721\u0026ndash;727.\u003c/li\u003e\n\u003cli\u003ePhuoc, V. D., \u0026amp; Barrot, J. S. (2022). Complexity, accuracy, and fluency in L2 writing across proficiency levels: A matter of L1 background? \u003cem\u003eAssessing Writing\u003c/em\u003e, 54. https://doi.org/10.1016/j.asw.2022.100673\u003c/li\u003e\n\u003cli\u003eRadford, A., Narasimhan, K., Salimans, T., \u0026amp; Sutskever, I. (2018). Improving language understanding by generative pre-training. https://s3-us-west-2.amazonaws.com/openaiassets/research-covers/language-unsupervised/language_understanding_paper.pdf\u003c/li\u003e\n\u003cli\u003eRamesh, D., \u0026amp; Sanampudi, S. K. (2022). 
An automated essay scoring system: a systematic literature review. \u003cem\u003eArtificial Intelligence Review\u003c/em\u003e, 55:2495\u0026ndash;2527. https://doi.org/10.1007/s10462-021-10068-2\u003c/li\u003e\n\u003cli\u003eRamineni, C., \u0026amp; Williamson, D. M. (2013). Automated essay scoring: Psychometric guidelines and practices. \u003cem\u003eAssessing Writing\u003c/em\u003e,18(1), 25\u0026ndash;39. \u003c/li\u003e\n\u003cli\u003eRay, P. P. (2023). ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. \u003cem\u003eInternet of Things and Cyber-Physical Systems\u003c/em\u003e,3, 121\u0026ndash;154. \u003c/li\u003e\n\u003cli\u003eSchade, M. (2023). How ChatGPT and our language models are developed. Retrieved October 28, 2023, from https://help.openai.com/en/articles/7842364-how-chatgpt-and-our-language-models-are-developed \u003c/li\u003e\n\u003cli\u003eSchmidt-Fajlik, R. (2023). ChatGPT as a grammar checker for Japanese English language learners: A comparison with Grammarly and ProWritingAid. \u003cem\u003eAsia CALL Online Journal\u003c/em\u003e,14(1), 105\u0026ndash;119.\u003c/li\u003e\n\u003cli\u003eShermis, M. D. (2014). State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. \u003cem\u003eAssessing Writing\u003c/em\u003e,20, 53\u0026ndash;76.\u003c/li\u003e\n\u003cli\u003eSu, J., \u0026amp; Yang, W. (2023). Unlocking the power of ChatGPT: a framework for applying generative AI in education. \u003cem\u003eECNU Review of Education, 6(3) 355\u0026ndash;366. \u003c/em\u003e\u003c/li\u003e\n\u003cli\u003eSuppadungsuk, S., Thongprayoon, C., Miao, J., Krisanapan, P., Qureshi, F., Kashani, K., \u0026amp; Cheungpasitporn, W. (2023). Exploring the potential of chatbots in critical care nephrology. \u003cem\u003eMedicines\u003c/em\u003e,10(10), 58.\u003c/li\u003e\n\u003cli\u003eWilson, J., \u0026amp; Czik, A. (2016). 
Automated essay evaluation software in English Language Arts classrooms: Effects on teacher feedback, student motivation, and writing quality. \u003cem\u003eComputers \u0026amp; Education,\u003c/em\u003e 100(1), 94\u0026ndash;109. \u003c/li\u003e\n\u003cli\u003eYan, D., Fauss, M., Hao J., \u0026amp; Cui W. (2023). Detection of AI-generated essays in writing assessments. \u003cem\u003ePsychological Test and Assessment Modeling\u003c/em\u003e, 65,125-144.\u003c/li\u003e\n\u003cli\u003eYancey, K. P., Lafair, G., Verardi, A., \u0026amp; Burstein, J. (2023). Rating short L2 essays on the CEFR scale with GPT-4. In E. Kochmar, J. Burstein, A. Horbach, R. Laarmann-Quante, N. Madnani, A. Tack, V. Yaneva, Z. Yuan, \u0026amp; T. Zesch (Eds.), Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (pp. 576\u0026ndash;584). https://aclanthology.org/2023.bea-1.49\u003c/li\u003e\n\u003cli\u003eZawacki-Richter, O., Mar\u0026iacute;n, V. I., Bond, M., \u0026amp; Gouverneur, F. (2019). Systematic review of research on artificial intelligence applications in higher education\u0026ndash;where are the educators? \u003cem\u003eInternational Journal of Educational Technology in Higher Education\u003c/em\u003e,16(1), 1\u0026ndash;27.\u003c/li\u003e\n\u003cli\u003eZhai, N., \u0026amp; Ma, X. (2022). The effectiveness of automated writing evaluation on writing quality: a meta-analysis. Educational Computing Research, 1\u0026ndash;26. \u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"