Evaluating the Accuracy and Reliability of AI Content Detectors in Academic Contexts

Research Article

Mohammad Hadra, Karleen Cambridge, Mostefa Mesbah

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-7359956/v1. This work is licensed under a CC BY 4.0 License.

Status: Published. Journal publication published 01 Feb, 2026 in International Journal for Educational Integrity. Version 1 posted.

Abstract

Generative Artificial Intelligence (GenAI) tools capable of producing human-like text have raised considerable concerns regarding academic integrity. In response, AI content detectors such as Turnitin and Originality are increasingly employed in higher education. However, empirical evidence regarding their accuracy, reliability, and fairness, particularly in the context of English as a Foreign Language (EFL) writing, remains limited.
This study evaluates the performance of both detectors across variations in text length, genre, and authorship type. A balanced dataset of 192 texts was constructed, comprising authentic EFL student writing, professionally authored human texts, AI-generated outputs, and hybrid compositions. Based on the percentage of AI content identified by each detector, texts were categorized as Human, Hybrid, or AI. Detector performance was assessed against ground truth labels using precision, recall, specificity, F1 score, and accuracy. Statistical significance was tested using Pearson’s chi-square and Fisher’s Exact Test. Originality outperformed Turnitin in overall accuracy (0.69 vs. 0.61) and macro-average recall (0.60 vs. 0.51). However, both detectors performed poorly on Hybrid texts, with recall scores of 0.31 for Turnitin and 0.02 for Originality. Performance declined significantly with longer texts (p < 0.015 for Turnitin; p < 0.002 for Originality) and varied across genres, with higher accuracy observed in humanities than in science (p < 0.0001 for both detectors). Originality also exhibited a borderline statistically significant bias favoring professionally authored texts over EFL texts (p = 0.058). These findings suggest that neither detector is sufficiently reliable to serve as the sole basis for academic misconduct decisions. Institutions are advised to supplement AI detection tools with human judgment, incorporate AI literacy into academic curricula, and encourage detector developers to pursue further research into bias mitigation.

Keywords: Generative Artificial Intelligence; AI Content Detection; English as a Foreign Language (EFL); Academic Integrity; Detection Bias; Detection Reliability
1 Introduction

The increasing use of Artificial Intelligence (AI) tools to generate human-like text and other content, along with their ability to adapt to different human writing styles, has introduced new financial, societal, and academic challenges. In academia, these challenges include integrity, intellectual property rights, students' overreliance on technology, and the impact on critical thinking. The potential misuse of generative AI (Gen-AI) by students in producing essay assignments and presenting them as original work is a major concern in higher education institutions (Currie, 2023) (Perkins, 2023). The use of Gen-AI to cheat in online exams is another persistent concern (Cotton, Cotton and Shipway, 2024). The possibility of Gen-AI misuse by students and researchers may have a profound impact on academic integrity (Zhong et al., 2023) (Wu, Duan and Ni, 2023).

Challenges are not restricted to the academic setting. The dissemination of fake news to redirect public opinion and the generation of deceptive product reviews are also expected to increase manyfold with the invention of Gen-AI tools, particularly those capable of producing not only text but also synthetic images and videos (Saheb, Sidaoui and Schmarzo, 2024) (Chakraborty et al., 2023). These challenges raise significant ethical concerns (Bommasani et al., 2021).

To combat the negative consequences of the above-mentioned issues, automatic detection of Gen-AI generated content has become essential. In their report, T. Waltzer et al. (Waltzer, Pilegard and Heyman, 2024) demonstrated that, on average, instructors correctly identified ChatGPT-written work only 70% of the time. A study by S. Gehrmann et al. (Gehrmann, Strobelt and Rush, 2019) concluded that humans are unable to differentiate between texts generated by Gen-AI and those generated by humans. Hence, an accurate and reliable automatic tool is required for this detection.
Developers of Gen-AI detection tools mainly exploit the differences between human-generated text distribution and those generated by Gen-AI, along with other emerging techniques, such as watermarking signatures, to automatically distinguish between human and Gen-AI writing (Fariello et al., 2025 ). Human text distribution and Gen-AI text distribution refer to statistical patterns of texts written by humans versus those generated by Gen-AI models (Wu et al., 2025 ). These distributions describe how linguistic features, such as word choice, sentence structure, and contextual coherence, are distributed across a dataset or text corpus. Human text features diversity, imperfect patterns, context awareness, non-randomness, and perplexity or surprise in the text (Liu et al., 2024 ). In contrast, the Gen-AI text features statistical optimization, repetition, high fluency and limited creativity, tighter pattern (frequency of specific words and sentence length), and low perplexity in text and limited surprise (Akram, 2023 ). Detection tools extract these characteristics and feed them to classification algorithms. Some developers of Large Language Models (LLMs), like OpenAI, aimed to maintain accountability for the misuse of Gen-AI tools in text creation. Hence, they proposed and adopted the watermarking technique. This technique works by embedding invisible statistical patterns or signals in the text during generation. Watermarking typically involves altering the choice of words, punctuation, or phrasing in subtle ways to encode a recognizable pattern. These patterns can later be analyzed to determine if the content originated from an AI model with watermarking enabled (Kirchenbauer et al. , 2023)(Zhao et al. , 2024). Accuracy and reliability are highly significant for the proposed detection tools since the user of these tools is supposed to judge a student's or researcher's work as authentic or plagiarized based on their results. 
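To make the notion of perplexity used above more concrete, the following is a minimal sketch that scores a text against a toy unigram model with add-one smoothing. It is purely illustrative: the function name `unigram_perplexity`, the reference corpus, and the sample sentences are assumptions introduced here, and commercial detectors such as Turnitin and Originality do not disclose their methods and almost certainly rely on far richer language models.

```python
import math
from collections import Counter


def unigram_perplexity(text, reference_corpus):
    """Perplexity of `text` under a unigram model estimated from `reference_corpus`.

    Hypothetical helper for illustration only. Tokens are lower-cased,
    whitespace-separated words; add-one smoothing gives unseen words a
    small non-zero probability.
    """
    ref_tokens = reference_corpus.lower().split()
    tokens = text.lower().split()
    counts = Counter(ref_tokens)
    vocab_size = len(set(ref_tokens) | set(tokens))
    total = len(ref_tokens)

    log_prob = 0.0
    for tok in tokens:
        p = (counts[tok] + 1) / (total + vocab_size)  # add-one smoothing
        log_prob += math.log(p)

    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob / len(tokens))


reference = "the cat sat on the mat the dog sat on the rug"
predictable = unigram_perplexity("the cat sat on the mat", reference)
surprising = unigram_perplexity("quantum zebras juggle violet thunder", reference)
print(predictable < surprising)  # the text made of unseen words is more "surprising"
```

In this toy setting, text that closely follows the reference distribution receives a lower perplexity than text made of words the model has never seen, which is the intuition behind treating low perplexity as one possible signal of machine generation.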
The current literature offers contradictory views on the accuracy and reliability of Gen-AI detection tools, and some studies claim that it is not possible to create a tool that accurately detects Gen-AI content (Weber-Wulff et al., 2023). Numerous studies have tested the performance of detection tools. In one such study, a group of researchers evaluated twelve publicly available tools and two commercial ones. They systematically examined the general functionality of the detection tools and evaluated their accuracy and error types. They concluded that the tested tools were neither accurate nor reliable (Weber-Wulff et al., 2023). Another study (Elkhatat, Elsaid and Almeer, 2023) evaluated five publicly available tools, including a tool developed by OpenAI. The authors used fifteen paragraphs generated by Gen-AI and five human responses as controls. They concluded that the detection tools were more accurate in detecting content generated by ChatGPT3.5 than by ChatGPT4, and that, when applied to human-generated text, they were inconsistent, producing false positives and uncertain classifications. Another study, carried out by Sadasivan et al. (2023), suggested that current Gen-AI detection tools relying on the watermarking technique (Kirchenbauer et al., 2023) and zero-shot classifiers are not reliable. They showed that a simple paraphraser, such as PEGASUS (Zhang et al., 2020), designed with a lightweight neural network, drastically reduced the detection accuracy. In contrast to the conclusions drawn by Sadasivan et al. (2023), Chakraborty et al. (2023) asserted that the claim that detecting Gen-AI writing is impossible is not supported by real-world evidence. They argued that effective and practical Gen-AI detection remains possible as long as human and machine-generated texts do not exhibit similar distributional characteristics.
Drawing on information theory, they established sample complexity bounds, showing that as machine-generated text improves in quality, a larger sample size is required, which reduces the practicality of the approach. They evaluated their suggestions across multiple datasets, including XSum, SQuAD, IMDb, and Kaggle FakeNews. Their study tested various state-of-the-art Gen-AI models against detectors such as RoBERTa-Large/Base-Detector and GPTZero.

Gen-AI detectors face several key challenges. One such challenge is the wide range of attacks designed to evade detection or confuse the detection model. These attacks are schemes used to mislead the Gen-AI detector into believing that the tested texts were generated by a human. Such schemes include: 1) paraphrasing attacks; 2) synonym replacement, in which users replace some words with their synonyms; and 3) backtranslation, where the user machine-translates the generated text into another language and then back into the original language. In some instances, a combination of these attacks is used, which is known as an Ensemble Attack (Creo and Pudasaini, 2024). Sadasivan et al. (2023) showed that a recursive paraphrasing attack was able to bypass a range of detectors, including those based on watermarking and neural networks, zero-shot classifiers, and retrieval-based detectors. Anderson et al. [17] were able to mislead a GPT-2 Output Detector, changing its output decision from 0.02% human-generated to 99% human-generated through paraphrasing. In their report, M. Perkins et al. (Perkins et al., 2024) demonstrated that detector accuracy was reduced by 17.4% when simple techniques were applied to manipulate AI-generated content.

From the above-mentioned literature, it is clear that the accuracy and reliability of Gen-AI detection tools are yet to be conclusively proven.
The studies addressing this issue are few, and most of the papers and reports are available in Gray Literature repositories, such as arXiv, as preprints. Furthermore, most of the existing studies either used very limited data prepared by humans or used publicly available datasets generated by fluent professional or native speakers. Studies specifically targeting undergraduate students learning English as a Foreign Language (EFL) are non-existent. This gap motivated the current study. Additionally, several of the previous studies used a limited number of metrics to evaluate the performance of a detector, while others did not clearly describe the methodology used to enable replication. In this study, we attempt to systematically estimate the accuracy and reliability of two very popular AI detection tools in higher education institutes, i.e. Turnitin and Originality . Our work includes the use of original Middle Eastern student work, within an EFL context, written before the introduction of Gen-AI tools, as one of the datasets used for evaluation, which guarantees that it is an authentic sample of student writing. This use of the dataset allows the examination of the alleged bias of the current AI detection tools towards non-professional writers (Liang et al., 2023 ). Furthermore, the types of datasets used in this study allow for deep insights into the detectors' performance in different writing styles. Our work aims to provide a recommendation based on rigorous and data-driven evaluation for higher education institutes’ educators and policymakers for using Gen-AI text detection tools. A complete description of the datasets used in this work is found in the methodology section of this paper. The rest of the paper is organized into five sections. In the methodology section, we present a comprehensive description of the data and data analysis employed in the study. 
The findings of the study are then reported in the results section, followed by an in-depth discussion of the findings in the discussion section. The next section outlines the limitations of the study and offers recommendations for future research. Finally, the conclusion summarizes the key findings of the study.

2 Methodology

2.1 Data Description

For this research, we used a dataset of 192 texts, equally distributed across four categories: 48 authentic English as a Foreign Language (EFL) student-written texts submitted for grading before the emergence of AI-based text generation technology, 48 authentic human-written texts on diverse topics by professional writers, 48 AI-generated texts produced by ChatGPT and Claude AI, and 48 hybrid texts created by merging content from the EFL student texts and AI-generated texts while maintaining textual coherence. To cover different text lengths, we divided the texts into three categories: texts with a minimum of 300 words, texts around 500 words (with a ±10% margin), and texts around 1000 words (with a ±10% margin). Additionally, we considered texts from both the humanities and science genres.

The EFL student texts were drawn from final drafts of reports submitted for grading as a requirement of specific Foundation Program English for Sciences (FPES) courses at the Centre for Preparatory Studies, Sultan Qaboos University in Oman. The selected student reports were submitted in or before August 2022 to eliminate the possibility of Gen-AI usage. All necessary approvals were obtained from the university to use the students’ submissions for the purpose of this research.

The authentic texts by professional writers were sourced from the XSum (Extreme Summarization) dataset, a well-known open-source resource frequently used in Natural Language Processing (NLP) research.
The XSum dataset is a large-scale corpus developed by the University of Edinburgh’s NLP group for abstractive text summarization (marsyas/gtzan · Datasets at Hugging Face, no date) (EdinburghNLP/XSum: Topic-Aware Convolutional Neural Networks for Extreme Summarization, no date). It consists of over 226,000 news articles from the BBC, each accompanied by a single-sentence summary that provides a highly condensed version of the original content. The articles, collected from BBC content (2010–2017), span a variety of domains, including news, politics, sports, weather, business, technology, science, education, entertainment, and the arts. Due to its concise and information-dense nature, XSum is a strong representation of human-authored text, making it well-suited for evaluating AI-generated text detectors. Its use in benchmarking summarization models has established it as a widely recognized resource in NLP research.

The collected data covers four categories of written text: AI-generated, hybrid, professional-writer, and EFL student-written texts. The inclusion of the last two categories allows us to investigate whether detector performance differs between EFL-written texts and professionally written texts. This comparison helps address claims in the literature regarding potential biases in detectors against non-professional writers.

2.2 Classification Approach

Turnitin reports the percentage of AI authorship, while Originality provides a likelihood score indicating whether the text is AI or human-authored (referred to as "Original"). To enable meaningful evaluation of the AI content detectors, these probabilistic outputs, typically expressed as percentage likelihoods, have been converted into discrete classification categories. In this study, we adopt a thresholding strategy that maps the detector's output into three classes: Human-Generated, AI-Generated, and Hybrid.
More specifically, texts with an AI-content probability between 0–20% are labelled as Human, those between 21–79% as Hybrid, and those equal to or above 80% as AI. This threshold schema is grounded in both the output behavior of existing AI content detectors and practices documented in related studies (Weber-Wulff et al., 2023). For instance, tools such as GPTZero and OpenAI's AI classifier (now deprecated) have used a high-confidence threshold of 80% or more to indicate AI authorship, while probabilities below 20% are considered strong indicators of human authorship (Buchert, 2024) (OpenAI, 2023). The intermediate range is often associated with mixed or ambiguous authorship, justifying its alignment with the Hybrid category. Furthermore, the schema is consistent with Turnitin’s published guidelines and with common practice among AI content detection tools, as described on its website (iThenticate, 2024). Turnitin no longer reports specific detection percentages below 20% due to a higher risk of false positives, while results above 80% are regarded as strong indicators of AI authorship. Originality has the same guidelines (How Does AI Content Detection Work? – Originality.AI, no date). These thresholds were not derived from the present dataset but were instead based on external standards, allowing for a fair assessment of each detector’s current performance against an accepted norm. The classification results are compared to the ground-truth labels to compute standard performance metrics, such as accuracy, precision, recall, specificity, and F1-score.

2.3 Data Analysis

2.3.1 Performance Analysis

The confusion matrix is a standard tool used to evaluate the performance of classification models. While it is commonly applied to binary classification problems, it can be extended to multi-class classification tasks.
It is widely adopted in Machine Learning (ML) research and practice to assess the performance of ML models and to guide the hyperparameter tuning of ML algorithms. In our context, we treat AI detectors as text-to-label classifier models that categorize text into one of three classes: AI, Human, or Hybrid. The confusion matrix provides a detailed breakdown of how predicted labels align with the actual labels. It is typically represented as a table containing all possible combinations of actual and predicted class labels. Figure 1 illustrates the three-class confusion matrix, where each cell represents the number or percentage of instances assigned to a specific predicted class by the classifier.

The confusion matrix serves as the foundation for calculating several critical performance metrics, such as Recall (Sensitivity), Specificity, Precision, Accuracy, and F1-score. These metrics are essential for understanding classification quality beyond overall accuracy. To illustrate how these metrics are derived, consider the class AI and how the classifier correctly or incorrectly labels instances of this class. For this explanation, we treat AI-generated texts as the positive class. Figure 2 depicts the classification outcomes related to the AI-generated class in terms of TP, TN, FP, and FN, defined below.

True Positives (TP): the number of AI-authored texts that the detector correctly classifies as AI-authored.

True Negatives (TN): the number of non-AI texts (Human or Hybrid) that the detector correctly classifies as not AI-authored.

False Positives (FP): the number of non-AI-authored texts that the detector incorrectly classifies as AI-authored.

False Negatives (FN): the number of AI-authored texts that the detector incorrectly classifies as non-AI.
It is important to note that in binary or one-vs-rest classifications, the positive or negative designation depends on the class under consideration; in this case, it is AI-generated. The above-mentioned performance measures are mathematically expressed as follows.

Sensitivity (Sen) / Recall: This measure describes, out of all actual AI-authored texts, how many were correctly identified by the detector.

$$Sen = \frac{TP}{TP + \sum FN} \tag{1}$$

Specificity (Spec): Specificity indicates, out of all non-AI texts, how many were correctly identified as not AI-authored.

$$Spec = \frac{\sum TN}{\sum TN + \sum FP} \tag{2}$$

Precision: Precision indicates, out of all texts predicted as AI-authored, how many were actually AI-authored.

$$Precision = \frac{TP}{TP + \sum FP} \tag{3}$$

F1-Score: The F1-score is the harmonic mean of Precision and Recall. It balances both false positives and false negatives.

$$F1\text{-}score = 2 \times \frac{Precision \times Sen}{Precision + Sen} \tag{4}$$

Accuracy (Acc): The accuracy indicates the overall rate at which the detector correctly classifies both AI and non-AI texts.

$$Acc = \frac{TP + \sum TN}{TP + \sum TN + \sum FP + \sum FN} \tag{5}$$

These metrics together provide a comprehensive evaluation of the model's performance. Relying on accuracy alone, for example, can be misleading, especially in the presence of class imbalance. We repeated the above calculations for all three classes (AI, Human, Hybrid), treating each one in turn as the positive class versus the rest.
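The thresholding scheme and Equations (1) to (5) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names and the toy confusion-matrix counts are hypothetical, and scores falling strictly between 20 and 21 or between 79 and 80 are assumed here to map to Hybrid, a boundary case the text does not specify.

```python
LABELS = ("Human", "Hybrid", "AI")


def label_from_ai_percent(ai_percent):
    """Map a detector's AI-content percentage to a class (0-20 Human, 21-79 Hybrid, >=80 AI)."""
    if ai_percent <= 20:
        return "Human"
    if ai_percent >= 80:
        return "AI"
    return "Hybrid"  # assumption: fractional scores just above 20 or below 80 count as Hybrid


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs; matrix[actual][predicted] is a count."""
    matrix = {a: {p: 0 for p in LABELS} for a in LABELS}
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


def one_vs_rest_metrics(matrix, positive):
    """Equations (1)-(5), treating `positive` as the positive class and the rest as negative."""
    tp = matrix[positive][positive]
    fn = sum(matrix[positive][p] for p in LABELS if p != positive)
    fp = sum(matrix[a][positive] for a in LABELS if a != positive)
    total = sum(sum(row.values()) for row in matrix.values())
    tn = total - tp - fn - fp
    sen = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * prec * sen / (prec + sen) if prec + sen else 0.0
    acc = (tp + tn) / total
    return {"sensitivity": sen, "specificity": spec, "precision": prec,
            "f1": f1, "accuracy": acc}


def macro_average(per_class, metric):
    """Unweighted mean of one metric over the three classes."""
    return sum(per_class[c][metric] for c in LABELS) / len(LABELS)


# Hypothetical detector scores (percent AI) with their ground-truth labels.
scores = [5, 50, 95, 10]
truth = ["Human", "Hybrid", "AI", "AI"]
small_matrix = confusion_matrix(truth, [label_from_ai_percent(s) for s in scores])

# Hypothetical 20-text confusion matrix (rows: actual, columns: predicted).
toy_matrix = {
    "Human":  {"Human": 8, "Hybrid": 1, "AI": 1},
    "Hybrid": {"Human": 3, "Hybrid": 1, "AI": 1},
    "AI":     {"Human": 1, "Hybrid": 1, "AI": 3},
}
per_class = {c: one_vs_rest_metrics(toy_matrix, c) for c in LABELS}
macro_recall = macro_average(per_class, "sensitivity")

# Cross-check against the paper's own numbers: the mean of Turnitin's three
# per-class sensitivities in Table 1 gives the macro recall in Table 2.
turnitin_macro_recall = round((0.93 + 0.29 + 0.31) / 3, 2)
```

For the toy matrix, the AI class has TP = 3, FN = 2, FP = 2, and TN = 13, so sensitivity and precision are both 0.6 and the one-vs-rest accuracy is 0.8.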
Given that the dataset exhibits moderate imbalance (with Human-authored texts being approximately twice as frequent as the AI and Hybrid classes), we report the macro-averaged performance metrics. Macro-averaging gives equal weight to each class, offering a balanced and conservative view of performance in imbalanced settings (De Angeli et al., 2022).

2.3.2 Statistical Analysis

In our analysis, we systematically evaluate the performance of AI content detectors across several key variables relevant to academic contexts. These variables are (1) Text length, categorized into three ranges: (A) 300–330 words, (B) 450–550 words, and (C) 900–1100 words; (2) Text genre, comparing Science and Humanities; and (3) Authorship group, EFL student writing vs. professional writing.

To determine whether detector performance varied significantly across the study variables, we applied Pearson’s chi-square test of independence. For each analysis, we (I) constructed an r × c contingency table, (II) verified that the assumption of adequate expected cell counts was satisfied (no expected frequency < 5), and (III) calculated the χ² statistic. We report the χ² value and its associated p-value for every comparison (Bewick, Cheek and Ball, 2003) (McHugh, 2013).

In cases where the assumption of the chi-square test is violated, we employ its exact counterpart, namely Fisher’s Exact Test (Fisher, Marshall and Mitchell, 2011). Unlike the chi-square test, Fisher’s Exact Test does not rely on large-sample approximations and is therefore appropriate when expected cell frequencies are small. It is particularly suited for 2×2 contingency tables with small or unevenly distributed cell counts, as it calculates the exact hypergeometric probability of observing the obtained frequencies under the null hypothesis of independence.
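The testing procedure described above can be sketched with the Python standard library alone. This is an illustrative sketch, not the authors' analysis code: all function names are hypothetical, the p-value helper covers only the 1 and 2 degrees of freedom needed here, and the continuity (Yates) correction for the 2×2 genre table is an inference from the reported statistic rather than something the paper states. The counts are taken from the contingency tables reported in the Results section (Table 4 and Table 6).

```python
import math


def chi_square_independence(table, yates=False):
    """Pearson chi-square statistic for an r x c table of counts.

    Returns (statistic, degrees_of_freedom, expected_counts).
    `yates=True` applies the continuity correction, usually reserved for 2x2 tables.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = 0.0
    for obs_row, exp_row in zip(table, expected):
        for o, e in zip(obs_row, exp_row):
            diff = abs(o - e)
            if yates:
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / e
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof, expected


def chi2_p_value(stat, dof):
    """Upper-tail chi-square probability, using closed forms for dof 1 and 2 only."""
    if dof == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if dof == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch only handles 1 or 2 degrees of freedom")


def expected_counts_ok(expected, minimum=5.0):
    """Step (II): every expected cell count must reach the minimum."""
    return all(e >= minimum for row in expected for e in row)


def fisher_exact_greater(table):
    """One-sided Fisher's exact test for [[a, b], [c, d]]: P(X >= a) under independence."""
    (a, b), (c, d) = table
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    favourable = sum(math.comb(col1, k) * math.comb(n - col1, row1 - k)
                     for k in range(a, upper + 1))
    return favourable / math.comb(n, row1)


# Turnitin, text length (Table 4): rows Correct/Incorrect, columns A, B, C.
length_stat, length_dof, length_expected = chi_square_independence(
    [[20, 81, 17], [3, 63, 8]])
length_p = chi2_p_value(length_stat, length_dof)

# Turnitin, genre (Table 6): rows Humanities/Science, columns Incorrect/Correct.
genre_stat, genre_dof, _ = chi_square_independence([[8, 49], [66, 69]], yates=True)
genre_p = chi2_p_value(genre_stat, genre_dof)

print(f"length: chi2({length_dof}) = {length_stat:.2f}, p = {length_p:.4f}")
print(f"genre:  chi2({genre_dof}) = {genre_stat:.2f}, p = {genre_p:.2g}")
```

With these inputs the sketch gives approximately χ²(2) = 8.41 for Turnitin's text-length table and approximately χ²(1) = 19.11 for its genre table, in line with the values reported in the Results section; the latter agreement is what suggests the continuity correction was applied.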
In our context, the test assessed whether the distribution of classification outcomes (Correct vs. Incorrect) was statistically independent of the author group, with a directional alternative hypothesis positing that Open Database texts would yield significantly higher correct classification rates compared to EFL-authored texts.

All performance metrics (e.g., accuracy, precision, recall, F1-score) are reported alongside the corresponding statistical test results and p-values, providing a comprehensive and rigorous evaluation of how each variable affects the detector's classification performance. This multi-method approach ensures both statistical robustness and interpretability of the findings.

3 Results

3.1 Overall Detection Performance

Figure 3 shows the confusion matrices of both detectors for per-class and multi-class performance. Table 1 shows the overall per-class detection performance for each detector.

Table 1 Per-class performance results for Turnitin and Originality

| Detector | Class | Sensitivity | Specificity | Precision | F1-Score | Accuracy |
|---|---|---|---|---|---|---|
| Turnitin | Human | 0.93 | 0.53 | 0.66 | 0.77 | 0.73 |
| Turnitin | AI | 0.29 | 0.98 | 0.82 | 0.43 | 0.81 |
| Turnitin | Hybrid | 0.31 | 0.82 | 0.37 | 0.34 | 0.69 |
| Originality | Human | 0.96 | 0.55 | 0.68 | 0.80 | 0.76 |
| Originality | AI | 0.83 | 0.90 | 0.74 | 0.78 | 0.89 |
| Originality | Hybrid | 0.02 | 0.99 | 0.33 | 0.04 | 0.74 |

The table summarizes how Turnitin and Originality perform across three text classes, namely Human-generated, AI-generated, and Hybrid, using standard classification metrics.

Turnitin

Class Human-generated: The results show high Sensitivity (0.93) and good Precision (0.66), but low Specificity (0.53). This indicates that non-human texts are frequently labelled as Human. F1-score (0.77) and Accuracy (0.73) reflect the overall performance.

Class AI-generated: Here, the results show very high Specificity (0.98) and Precision (0.82). This means that the detector rarely mislabels other texts as AI. However, low Sensitivity (0.29) indicates that many AI texts are missed.
F1-score (0.43) shows a limited balance between Precision and Recall.

Class Hybrid: Performance is weak across most metrics, especially Sensitivity (0.31) and F1-score (0.34), suggesting poor detection of Hybrid texts.

Originality

Class Human: Excellent Sensitivity (0.96) and good Precision (0.68) yield a high F1-score (0.80). Specificity is low (0.55), while Accuracy (0.76) remains moderate.

Class AI: High performance with Sensitivity (0.83), Specificity (0.90), and F1-score (0.78), indicating good binary classification performance.

Class Hybrid: Sensitivity is extremely low (0.02), with minimal correct detection. While Specificity (0.99) is high, the F1-score (0.04) confirms poor classification for this class.

Table 2 below shows the macro-averaged results for both detectors. The macro-averaged results reflect overall detector performance by computing the unweighted mean of the metrics (e.g., precision, recall, F1-score) across all classes, treating each class equally regardless of its frequency. Macro-averaging is particularly suitable for imbalanced class distributions (Takahashi et al., 2022). These metrics provide a clearer picture of each system's consistency across Human, AI, and Hybrid classifications.

Table 2 Macro-average of all performance metrics for Turnitin and Originality

| Detector | Overall Accuracy | Macro Avg Precision | Macro Avg Specificity | Macro Avg Recall (Sensitivity) | Macro Avg F1-Score |
|---|---|---|---|---|---|
| Turnitin | 0.61 | 0.62 | 0.78 | 0.51 | 0.51 |
| Originality | 0.69 | 0.59 | 0.81 | 0.60 | 0.54 |

Turnitin

Overall Accuracy (0.61) and Macro F1-score (0.51) indicate limited overall performance. The gap between Precision (0.62) and Recall (0.51) suggests that while Turnitin may be cautious in its positive predictions, it fails to adequately capture true instances across all classes. This aligns with the low recall values observed in the per-class results for AI (0.29) and Hybrid (0.31), confirming inconsistent and unreliable detection across categories.
A specificity of 0.78 shows a moderate ability to avoid false positives, with strong specificity for AI and Hybrid but weak specificity for the Human class, which again reflects inconsistency across classes.

Originality

Overall Accuracy (0.69) and Macro F1-score (0.54) are unacceptably low for reliable use, especially in academic applications requiring consistent detection. Despite relatively balanced Precision (0.59) and Recall (0.60), the low F1-score reflects weak overall classification effectiveness. The Hybrid class, with a near-zero recall (0.02), heavily impacts these averages, exposing a critical failure in handling this category. Specificity (0.81) is slightly better than Turnitin's (0.78), driven by high values for AI and Hybrid. Performance is more consistent but still not strong.

Overall, both detectors show limited macro-level performance, with insufficiencies clearly aligned with poor per-class detection, particularly in the Hybrid category. These results highlight significant limitations in their current state.

3.2 The Effect of Text Length on Performance

Table 3 shows the comparison results of the detector performance for different text lengths, and Fig. 4 visualizes the results. In the figure, we only present and compare the results of accuracy, precision, and recall for clarity.

Table 3 Detectors' performance among the three different text lengths

| Detector | Text length (words) | Accuracy | Precision | Recall | Specificity | F1-Score |
|---|---|---|---|---|---|---|
| Turnitin | 300–330 | 0.87 | 0.71 | 0.94 | 0.95 | 0.77 |
| Turnitin | 450–550 | 0.56 | 0.60 | 0.48 | 0.75 | 0.46 |
| Turnitin | 900–1100 | 0.68 | 0.69 | 0.76 | 0.88 | 0.54 |
| Originality | 300–330 | 0.96 | 0.65 | 0.67 | 0.93 | 0.66 |
| Originality | 450–550 | 0.63 | 0.59 | 0.60 | 0.79 | 0.51 |
| Originality | 900–1100 | 0.84 | 0.60 | 0.58 | 0.90 | 0.58 |

These results indicate that text length affects the performance of both Turnitin and Originality in detecting AI-generated content:

Short texts (300–330 words) were better classified by both systems.
Medium-length texts (450–550 words) show a drop in all metrics, especially for Turnitin.

Long texts (900–1100 words) perform closer to medium-length texts and still worse than short texts.

This implies that both tools struggle more as text length increases and show inconsistent performance when text length changes.

We further checked with Pearson’s chi-square test of independence whether the observed performance difference is statistically significant; the contingency table for correct vs. incorrect detection created for the test is presented in Table 4.

Table 4 Contingency table for text length effect

| | Turnitin A (300–329) | Turnitin B (450–549) | Turnitin C (900–1100) | Originality A (300–329) | Originality B (450–549) | Originality C (900–1100) |
|---|---|---|---|---|---|---|
| Correct | 20 | 81 | 17 | 22 | 90 | 21 |
| Incorrect | 3 | 63 | 8 | 1 | 54 | 4 |

Pearson’s chi-square test of independence yielded χ²(2) = 8.41, p = 0.0149 for Turnitin and χ²(2) = 13.17, p = 0.0014 for Originality, indicating a statistically significant association between text length and detection outcome for both systems. This result has implications for academic settings and may be attributable to several linguistic features, which are considered in the discussion section.

3.3 The Effect of Text Genre on Performance

Table 5 shows the results of the detector performance related to text genres (Science vs. Humanities). Table 6 shows the contingency table used for statistical tests of significance.
Table 5 Detectors’ performance across text genres

| Detector | Genre | Accuracy | Precision | Recall | Specificity | F1-Score |
|---|---|---|---|---|---|---|
| Turnitin | Sci | 0.51 | 0.59 | 0.50 | 0.75 | 0.47 |
| Turnitin | Hum | 0.86 | 0.60 | 0.46 | 0.95 | 0.51 |
| Originality | Sci | 0.58 | 0.52 | 0.59 | 0.79 | 0.49 |
| Originality | Hum | 0.96 | 0.65 | 0.59 | 0.93 | 0.62 |

The results in Table 5 show that both Turnitin and Originality perform markedly better on humanities texts than on science texts on most evaluation criteria:

• Turnitin: Accuracy drops clearly (from 0.86 in humanities to 0.51 in science), and specificity (0.95 vs. 0.75), F1-score, and, marginally, precision are also lower in science. Recall is the exception (0.46 in humanities vs. 0.50 in science).
• Originality: Accuracy is high in humanities (0.96) and drops clearly in science (0.58), while recall is the same in both genres (0.59). Specificity and precision, however, still decline notably in science.

Statistical tests

We further used Pearson’s chi-square test of independence to check whether the observed performance difference is statistically significant. The contingency table of correct vs. incorrect detections used for the test is presented in Table 6.

Table 6 Contingency table to test statistical differences among text genres

| Text genre | Turnitin Incorrect | Turnitin Correct | Originality Incorrect | Originality Correct |
|---|---|---|---|---|
| Humanities | 8 | 49 | 2 | 55 |
| Science | 66 | 69 | 57 | 78 |

Pearson’s chi-square test of independence yielded χ²(1) = 19.109, p < 0.001 for Turnitin and χ²(1) = 26.429, p < 0.001 for Originality, indicating a statistically significant association between text genre and detection outcome for both detectors. The chi-square assumption that expected cell counts are greater than or equal to 5 was examined and found to be satisfied. Both Turnitin and Originality exhibit genre-dependent performance, with significantly better detection accuracy on humanities texts than on science texts.
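The genre test can be recomputed from the counts in Table 6. The sketch below is a minimal illustration, not the authors' procedure: the function names are hypothetical, and the `yates` flag is an assumption introduced here. The text does not say whether a continuity correction was applied, but the reported statistics coincide with the Yates-corrected values to three decimals, whereas the uncorrected statistics are somewhat larger.

```python
import math

def chi2_2x2(table, yates=False):
    """Chi-square statistic for a 2 x 2 table, optionally Yates-corrected."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:
                # Continuity correction: shrink each deviation by 0.5.
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected
    return stat

def chi2_sf_dof1(x):
    """Upper-tail probability of a chi-square variable with exactly 1 dof."""
    return math.erfc(math.sqrt(x / 2.0))

# Rows: Humanities, Science. Columns: Incorrect, Correct (counts from Table 6).
TABLE6 = {
    "Turnitin": [[8, 49], [66, 69]],
    "Originality": [[2, 55], [57, 78]],
}

for name, counts in TABLE6.items():
    plain = chi2_2x2(counts)
    corrected = chi2_2x2(counts, yates=True)
    print(f"{name}: uncorrected = {plain:.3f}, Yates = {corrected:.3f}, "
          f"p (Yates) = {chi2_sf_dof1(corrected):.2e}")
```

Either variant gives p well below 0.0001 for both detectors, so the conclusion about genre does not depend on the choice of correction.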
The consistency of this pattern across most metrics in both systems, supported by highly significant statistical tests, points to a systematic limitation in genre generalization, with implications for academic settings. These findings underscore the need to evaluate AI content detectors not only overall but also within specific academic domains.

3.4 The Effect of Text Type on Performance

In the analysis of detector performance across authorship groups (EFL vs. Open Database), the available ground-truth data consisted exclusively of human-written texts. This constraint limits the classification framework to a single actual class (“Human”) and prevents the construction of the full multi-class confusion matrix needed to compute metrics such as precision, specificity, and F1-score (Skaik, 2008). Consequently, the evaluation was confined to a binary distinction between correctly and incorrectly classified Human texts, and accuracy, defined as the proportion of correctly identified Human texts out of the total, was adopted as the sole performance metric. This choice is methodologically justified: in this case accuracy is mathematically equivalent to recall (sensitivity) for the Human class, and it indicates the detector’s ability to recognize true Human authorship without misclassification. Accuracy under this design is a constrained but valid and interpretable way to assess detector consistency across author backgrounds.

Table 7 presents detector accuracy for the two authorship groups, i.e. EFL students and professional writers from the Open Database. Table 8 is used for the significance test. The tables are followed by a summary of performance trends and statistical findings for both detectors.
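The equivalence between accuracy and Human-class recall stated above can be written out explicitly. In the sketch below the symbols are introduced here for illustration: N_Human is the number of human-written texts in a group, TP_Human the number labelled Human by the detector, and FN_Human the number labelled Hybrid or AI. The `\text` command assumes the `amsmath` package.

```latex
\[
\text{Accuracy}
  = \frac{TP_{\text{Human}}}{N_{\text{Human}}}
  = \frac{TP_{\text{Human}}}{TP_{\text{Human}} + FN_{\text{Human}}}
  = \text{Recall}_{\text{Human}},
\qquad
N_{\text{Human}} = TP_{\text{Human}} + FN_{\text{Human}} .
\]
\[
\text{Specificity}_{\text{Human}}
  = \frac{TN_{\text{Human}}}{TN_{\text{Human}} + FP_{\text{Human}}}
  \quad \text{is undefined, since } TN_{\text{Human}} + FP_{\text{Human}} = 0 .
\]
```

Because no text in this subset has a true label other than Human, there are no false positives for the Human class, so precision is trivially 1 whenever at least one text is labelled Human and carries no information.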
Table 7 Detectors’ performance with different authorships

| Detector | Author | Correct | Incorrect | Total | Accuracy (%) |
|---|---|---|---|---|---|
| Turnitin | Open Database | 45 | 3 | 48 | 93.8 |
| Turnitin | EFL | 44 | 4 | 48 | 91.7 |
| Originality | Open Database | 48 | 0 | 48 | 100 |
| Originality | EFL | 44 | 4 | 48 | 91.7 |

To determine whether detector performance differs significantly between texts authored by EFL students and those drawn from the Open Database, Fisher’s Exact Test was employed. Classification outcomes (correct vs. incorrect detection) were cross-tabulated by text source (EFL vs. Open Database), and the test assessed whether the distribution of outcomes was independent of the author group. Given the directional hypothesis that AI detectors may perform more favorably on texts authored by Open Database writers, a one-sided Fisher’s Exact Test was conducted for each detector. Table 8 presents the contingency table, followed by the results of Fisher’s Exact Test.

Table 8 Contingency table used for calculating the statistical test

| Author | Turnitin Incorrect | Turnitin Correct | Originality Incorrect | Originality Correct |
|---|---|---|---|---|
| Open Database | 3 | 45 | 0 | 48 |
| EFL | 4 | 44 | 4 | 44 |

For Turnitin, the test yielded a non-significant result (one-sided p = 0.500 for Open DB > EFL; odds ratio = 1.364), indicating no evidence of better performance for the Open Database group. For Originality, the one-sided p-value was 0.058 (Open DB > EFL; odds ratio undefined because one cell has a zero count), suggesting a trend toward higher accuracy in the Open Database group. In short, Turnitin shows no evidence of a performance difference between authorship groups, whereas the Originality result approaches statistical significance and points to a potential advantage for texts from the Open Database group.
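The one-sided Fisher’s Exact Test on Table 8 can be reproduced with exact integer arithmetic. The following is a minimal sketch under stated assumptions rather than the authors' code: the function names are hypothetical, group A is taken to be the Open Database, and the tested alternative is that group A has fewer misclassifications than group B.

```python
from math import comb

def fisher_one_sided(incorrect_a, correct_a, incorrect_b, correct_b):
    """P(group A has at most `incorrect_a` errors) with all margins fixed."""
    n_a = incorrect_a + correct_a
    n_b = incorrect_b + correct_b
    k_total = incorrect_a + incorrect_b
    lowest = max(0, k_total - n_b)
    # Hypergeometric tail: sum over tables at least as favourable to group A.
    favourable = sum(
        comb(n_a, k) * comb(n_b, k_total - k)
        for k in range(lowest, incorrect_a + 1)
    )
    return favourable / comb(n_a + n_b, k_total)

def odds_ratio(incorrect_a, correct_a, incorrect_b, correct_b):
    """Odds of a correct result in group A relative to group B."""
    denominator = incorrect_a * correct_b
    if denominator == 0:
        return None  # undefined when a cell in the denominator is zero
    return (correct_a * incorrect_b) / denominator

# (Open DB incorrect, Open DB correct, EFL incorrect, EFL correct), from Table 8.
TABLE8 = {
    "Turnitin": (3, 45, 4, 44),
    "Originality": (0, 48, 4, 44),
}

for name, cells in TABLE8.items():
    print(name, round(fisher_one_sided(*cells), 4), odds_ratio(*cells))
```

With these inputs the sketch gives p = 0.5 and an odds ratio of about 1.364 for Turnitin, and p ≈ 0.0586 with an undefined odds ratio for Originality, in line with the values reported above.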
Specifically, Turnitin’s accuracy decreased slightly from 93.8% for texts written by professional authors (Open Database) to 91.7% for those written by EFL students. For Originality, accuracy dropped more noticeably, from 100% for professional texts to 91.7% for EFL texts. These outcomes, combined with the Fisher’s test results, suggest that Turnitin performs consistently across authorship types and shows no detectable bias. In contrast, Originality shows a slight tendency to favor professional writing, with a near-significant advantage (one-sided p = 0.0586) for the Open Database texts. This comparison indicates that authorship type has no detectable influence on Turnitin’s performance, while Originality displays mild author-dependent behavior. The borderline result (p = 0.058) for Originality across authorship groups highlights a possible bias that warrants further investigation.

4 Discussion

This study offers a critical empirical assessment of Turnitin and Originality across varied text genres, lengths, and authorship types, revealing clear patterns of inconsistency, technical limitations, and signs of potential bias. Based on the accumulated evidence, neither Turnitin nor Originality can currently be considered sufficiently reliable as an AI detection solution in academic settings.

The overall classification results (using macro-average metrics) indicate that Originality slightly outperforms Turnitin (Table 2). Turnitin achieved an overall accuracy of 61% and Originality 69%, and Originality also scored higher on macro-average recall (60% vs. 51%) and F1-score (54% vs. 51%). Nevertheless, both systems failed to demonstrate robust detection across the scenarios considered, i.e. text length, text genre, and authorship type.
Most notably, performance on the Hybrid class (a mix of human and AI authorship) was very low: Turnitin and Originality achieved recalls of 31% and 2% respectively. Given that Hybrid compositions reflect the emerging reality of student engagement with AI tools, this failure has serious implications for the credibility of detection judgments. Furthermore, the Hybrid results indicate that these tools are vulnerable to common tactics used by students and academics, such as paraphrasing, synonym substitution, and rephrasing.

Although both tools achieved high accuracy on short texts (300–330 words), accuracy degraded on longer texts (900–1100 words), a category more typical of authentic student assessments. Accuracy fell from 87% to 68% for Turnitin and from 96% to 84% for Originality, with statistically significant differences across length bands confirmed by Pearson’s chi-square test (p = 0.0149 for Turnitin and p = 0.0014 for Originality). This instability raises practical concerns for academic use, where essay-length submissions are the norm. These results are consistent with observations that many AI detectors rely on features such as perplexity that become less stable in longer texts (Weber-Wulff et al., 2023).

Detection reliability must be evaluated not only by technical metrics but also by fairness across demographic and disciplinary lines. The evaluation of AI content detectors across authorship types reveals disparities with significant implications for how academic integrity policies are implemented, and it offers insight into potential bias in automated detection systems. For Originality, the one-sided p-value of 0.0586 approached the conventional significance threshold (α = 0.05), suggesting a borderline, marginally non-significant difference in favor of the professional writers’ texts (Open Database group).
While not statistically significant at the 0.05 level, this result indicates a directional trend: Originality achieved perfect classification accuracy (100%) in the Open Database group, while it misclassified four texts in the EFL group (accuracy ≈ 91.7%). Fisher’s Exact Test is appropriate in this context because of the small sample size and the presence of a zero cell (no misclassifications in the Open Database group). Unlike asymptotic tests (e.g., chi-square), Fisher’s test does not rely on large-sample approximations and provides an exact p-value, making the result more reliable for small or unbalanced groups. Given the observed difference (four incorrect instances for EFL students vs. zero for professional writers) and the marginal p-value (0.058), it is reasonable to expect that a difference of this size would be more likely to reach statistical significance in a larger sample. Because the test is exact, this reading rests on conservative statistical inference. Although the result did not cross the traditional threshold for significance, the borderline p-value (0.0586) and the perfect performance in one group versus errors in the other suggest a potentially meaningful difference in detection accuracy between EFL and professional writers’ texts.

Turnitin, on the other hand, exhibited no statistically significant difference in accuracy between the two groups (p = 0.5). This consistency suggests that Turnitin applies its classification logic uniformly, without systematically favoring or disadvantaging EFL writers. In the case of Originality, EFL-authored texts were more likely to be incorrectly flagged than those from professional writers. This may be attributed to lower language variability among non-professional writers, which makes their writing follow more repetitive patterns (Liang et al., 2023).

This possible disparity is particularly problematic in academic contexts, where detection outcomes carry disciplinary and reputational consequences. A detector that disproportionately misclassifies EFL writing, or fails to validate its authenticity, can contribute to structural inequities and may stigmatize students based not on the originality of their work but on linguistic patterns or syntactic markers correlated with non-professional writing. Authorship-based bias in an AI detector undermines its utility as an objective arbiter of academic misconduct. For institutions that serve linguistically diverse populations, especially those with high proportions of EFL learners, such bias can reduce trust in detection outcomes, lead to unfair accusations, and reinforce barriers to academic inclusion. The Originality findings point to a need for detector model retraining, diversification of training data, and the incorporation of bias mitigation strategies. This finding aligns with concerns raised in recent AI ethics literature about linguistic disadvantage in algorithmic systems (Helm et al., 2024; Andersen et al., 2025).

Similarly, genre-based performance differences were noticeable (Table 5). Turnitin achieved 86% accuracy on humanities texts but only 51% on scientific writing. Originality, despite its stronger overall metrics, exhibited similar discrepancies: 96% accuracy on humanities versus 58% on science. Both detectors appear to struggle with technical, domain-specific language, which may resemble the lexical density and structure of AI-generated outputs. The significance of these differences was confirmed by Pearson’s chi-square test of independence (p < 0.0001) for both detectors. These findings suggest that the detectors may be detecting stylistic deviation rather than actual machine authorship.
The most consequential insight from this study is that both tools, even if deployed together, do not offer a defensible level of accuracy or reliability for autonomous AI authorship detection. Their classification is inconsistent, context-sensitive, and particularly weak in edge cases such as Hybrid texts or genre-variant structures. As such, they should not be used in isolation for any disciplinary action or academic integrity decision.

Instead of escalating the race between detection technologies and Gen-AI, academia must embrace a holistic rethinking of writing, authorship, and assessment in the Gen-AI era. Rather than being treated as a threat, Gen-AI should be approached as a tool to be understood, critiqued, used responsibly, and ethically integrated. This calls for:

• curricular reform that embeds AI literacy and digital authorship ethics;
• assessment redesign that prioritizes process over product and enables transparency of AI usage;
• institutional policies that differentiate between acceptable augmentation and unethical authorship substitution; and
• faculty training to interpret AI detection scores critically and contextually, not deterministically (Cotton, Cotton and Shipway, 2024).

Detection tools may still serve as supplementary aids, particularly when used as part of a broader review process. However, their limitations demand transparency in their deployment and restraint in their interpretation. Just as plagiarism software never replaced human judgment, AI detectors must not become unquestioned arbiters of academic misconduct.

5 Limitations and Future Research

While the sample size of 192 texts is moderate and sufficient for meaningful statistical analysis, a larger sample may yield more robust findings. Additionally, the analysis is limited to two detectors. Given the rapid evolution and diversification of detection technologies, future research should extend the scope to include a wider range of AI detectors.
Expanding the dataset and incorporating more detection systems will enhance the generalizability of findings and support more comprehensive benchmarking of detection reliability in diverse academic settings. In our Hybrid class, we combined AI-generated and human-written texts in roughly equal proportions (50/50), ensuring cohesion and coherence throughout the content. In future work, we plan to investigate the impact of other mixture ratios, such as 80/20 or 60/40, on detector performance. This may offer deeper insight into how different levels of hybridization influence the accuracy and reliability of AI content detection systems.

6 Conclusion

This study offers a multi-perspective evaluation of two leading commercial AI content detectors and concludes that current approaches to AI detection in education need to be reassessed. Turnitin and Originality, while technically sophisticated, are neither sufficiently accurate nor equitably reliable for exclusive use. Their effectiveness is noticeably affected by text length, genre, and, to some extent, authorship type, and their poor performance on hybrid content reveals fundamental conceptual limitations. Academic institutions must resist the temptation to treat AI as a threat to be neutralized and instead embrace it as an emerging pedagogical innovation to be navigated with care, creativity, and integrity.

Declarations

• The authors have no relevant financial or non-financial interests to disclose.
• The authors have no conflicts of interest to declare that are relevant to the content of this article.
• All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
• The authors have no financial or proprietary interests in any material discussed in this article.

Author Contribution

M.H.
was responsible for the study design, data analysis, and preparation of the manuscript. K.C. contributed to data collection, dataset creation and management, execution of the experiments, and editing and reviewing the manuscript. M.M. provided subject matter expertise and supervision, and contributed to editing and reviewing the manuscript. All authors approved the final version of the manuscript and agree to be accountable for all aspects of the work.

Data Availability

Data and materials created for this research are available upon request. Please direct all inquiries to the corresponding author.

References

Akram, A. (2023) ‘An Empirical Study of AI-Generated Text Detection Tools’, Advances in Machine Learning & Artificial Intelligence, 4(2). doi: 10.33140/amlai.04.02.03.
Andersen, N. et al. (2025) ‘Algorithmic Fairness in Automatic Short Answer Scoring’, International Journal of Artificial Intelligence in Education. doi: 10.1007/s40593-025-00495-5.
De Angeli, K. et al. (2022) ‘Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types’, Journal of Biomedical Informatics, 125, p. 103957. doi: 10.1016/j.jbi.2021.103957.
Bewick, V., Cheek, L. and Ball, J. (2003) ‘Statistics review 8: Qualitative data – tests of association’, Critical Care, 8(1), p. 46. doi: 10.1186/cc2428.
Bommasani, R. et al. (2021) ‘On the Opportunities and Risks of Foundation Models’, ArXiv, abs/2108.0. Available at: https://api.semanticscholar.org/CorpusID:237091588.
Buchert, J. (2024) How do AI Detectors Work? Do they Work?, Intellectualead. Available at: https://gptzero.me/news/how-ai-detectors-work/ (Accessed: 23 June 2025).
Chakraborty, S. et al. (2023) ‘On the Possibilities of AI-Generated Text Detection’, ArXiv, abs/2304.0. Available at: https://api.semanticscholar.org/CorpusID:258048481.
Cotton, D. R. E., Cotton, P. A. and Shipway, J. R. (2024) ‘Chatting and cheating: Ensuring academic integrity in the era of ChatGPT’, Innovations in Education and Teaching International, 61(2), pp. 228–239. doi: 10.1080/14703297.2023.2190148.
Creo, A. and Pudasaini, S. (2024) ‘Evading AI-Generated Content Detectors using Homoglyphs’, CoRR. doi: 10.48550/ARXIV.2406.11239.
Currie, G. M. (2023) ‘Academic integrity and artificial intelligence: is ChatGPT hype, hero or heresy?’, Seminars in Nuclear Medicine, 53(5), pp. 719–730. doi: 10.1053/J.SEMNUCLMED.2023.04.008.
EdinburghNLP/XSum: Topic-Aware Convolutional Neural Networks for Extreme Summarization (no date). Available at: https://github.com/EdinburghNLP/XSum (Accessed: 8 February 2025).
Elkhatat, A. M., Elsaid, K. and Almeer, S. (2023) ‘Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text’, International Journal for Educational Integrity, 19(1), pp. 1–16. doi: 10.1007/s40979-023-00140-5.
Fariello, S. et al. (2025) ‘Distinguishing Human From Machine: A Review of Advances and Challenges in AI-Generated Text Detection’, International Journal of Interactive Multimedia and Artificial Intelligence, 9(3), pp. 6–18. doi: 10.9781/ijimai.2024.12.002.
Fisher, M. J., Marshall, A. P. and Mitchell, M. (2011) ‘Testing differences in proportions’, Australian Critical Care, 24(2), pp. 133–138. doi: 10.1016/j.aucc.2011.01.005.
Gehrmann, S., Strobelt, H. and Rush, A. M. (2019) ‘GLTR: Statistical detection and visualization of generated text’, in ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of System Demonstrations. Association for Computational Linguistics (ACL), pp. 111–116. doi: 10.18653/v1/p19-3019.
Helm, P. et al. (2024) ‘Diversity and language technology: how language modeling bias causes epistemic injustice’, Ethics and Information Technology, 26(1), pp. 1–15. doi: 10.1007/s10676-023-09742-6.
How Does AI Content Detection Work? – Originality.AI (no date). Available at: https://originality.ai/blog/how-does-ai-content-detection-work?utm_source=chatgpt.com (Accessed: 23 June 2025).
iThenticate (2024) AI writing detection in the new, enhanced Similarity Report view, iThenticate. Available at: https://guides.turnitin.com/hc/en-us/articles/22774058814093-AI-writing-detection-in-the-new-enhanced-Similarity-Report (Accessed: 23 June 2025).
Kirchenbauer, J. et al. (2023) ‘A Watermark for Large Language Models’, Proceedings of Machine Learning Research, 202, pp. 17061–17084.
Liang, W. et al. (2023) ‘GPT detectors are biased against non-native English writers’, Patterns, 4(7), p. 100779. doi: 10.1016/j.patter.2023.100779.
Liu, J. Q. J. et al. (2024) ‘The great detectives: humans versus AI detectors in catching large language model-generated medical writing’, International Journal for Educational Integrity, 20(1), pp. 1–14. doi: 10.1007/s40979-024-00155-6.
marsyas/gtzan · Datasets at Hugging Face (no date). Available at: https://huggingface.co/datasets/EdinburghNLP/xsum/viewer (Accessed: 8 February 2025).
McHugh, M. L. (2013) ‘The Chi-square test of independence’, Biochemia Medica, 23(2), pp. 143–149. doi: 10.11613/BM.2013.018.
OpenAI (2023) New AI classifier for indicating AI-written text, OpenAI. Available at: https://openai.com/index/new-ai-classifier-for-indicating-ai-written-text/ (Accessed: 23 June 2025).
Perkins, M. (2023) ‘Academic Integrity considerations of AI Large Language Models in the post-pandemic era: ChatGPT and beyond’, Journal of University Teaching and Learning Practice, 20(2). doi: 10.53761/1.20.02.07.
Perkins, M. et al. (2024) ‘Simple techniques to bypass GenAI text detectors: implications for inclusive education’, International Journal of Educational Technology in Higher Education, 21(1), p. 53. doi: 10.1186/s41239-024-00487-w.
Sadasivan, V. S. et al. (2023) ‘Can AI-Generated Text be Reliably Detected?’ Available at: http://arxiv.org/abs/2303.11156 (Accessed: 16 January 2025).
Saheb, T., Sidaoui, M. and Schmarzo, B. (2024) ‘Convergence of artificial intelligence with social media: A bibliometric & qualitative analysis’, Telematics and Informatics Reports, 14, p. 100146. doi: 10.1016/j.teler.2024.100146.
Skaik, Y. (2008) ‘Understanding and using sensitivity, specificity and predictive values’, Indian Journal of Ophthalmology, 56(4), p. 341. doi: 10.4103/0301-4738.41424.
Takahashi, K. et al. (2022) ‘Confidence interval for micro-averaged F1 and macro-averaged F1 scores’, Applied Intelligence (Dordrecht, Netherlands), 52(5), pp. 4961–4972. doi: 10.1007/s10489-021-02635-5.
Waltzer, T., Pilegard, C. and Heyman, G. D. (2024) ‘Can you spot the bot? Identifying AI-generated writing in college essays’, International Journal for Educational Integrity, 20(1), p. 11. doi: 10.1007/s40979-024-00158-3.
Weber-Wulff, D. et al. (2023) ‘Testing of detection tools for AI-generated text’, International Journal for Educational Integrity, 19(1), pp. 1–39. doi: 10.1007/s40979-023-00146-z.
Wu, J. et al. (2025) ‘A Survey on LLM-Generated Text Detection: Necessity, Methods, and Future Directions’, Computational Linguistics, 51(1), pp. 275–338. doi: 10.1162/coli_a_00549.
Wu, X., Duan, R. and Ni, J. (2023) ‘Unveiling Security, Privacy, and Ethical Concerns of ChatGPT’, ArXiv, abs/2307.1. Available at: https://api.semanticscholar.org/CorpusID:260164746.
Zhang, J. et al. (2020) ‘PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization’, in Daumé III, H. and Singh, A. (eds) Proceedings of the 37th International Conference on Machine Learning. PMLR (Proceedings of Machine Learning Research), pp. 11328–11339. Available at: https://proceedings.mlr.press/v119/zhang20ae.html.
Zhao, Y. et al. (2024) ‘Leveraging Past Assignments to Determine If Students Are Using ChatGPT for Their Essays’, in L@S 2024 - Proceedings of the 11th ACM Conference on Learning @ Scale. Association for Computing Machinery, Inc, pp. 320–324. doi: 10.1145/3657604.3664707.
Zhong, H. et al. (2023) ‘Copyright Protection and Accountability of Generative AI: Attack, Watermarking and Attribution’, in Companion Proceedings of the ACM Web Conference 2023. New York, NY, USA: Association for Computing Machinery (WWW ’23 Companion), pp. 94–98. doi: 10.1145/3543873.3587321.

Additional Declarations

No competing interests reported.
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-7359956","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":500854858,"identity":"edfbe78b-1a1d-4109-aa0b-ab6ea0bc69c7","order_by":0,"name":"Mohammad Hadra","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAwklEQVRIiWNgGAWjYFACHjCSg3DYSNBiTLqWxAaiteg28B788HaPXfr89hwDhg9lhxn4xQ7g12J2gC9Zcs6z5NwNZ94YMM44d5hBcnYCIS08BtI8B5hzN0jkGDDzth1mMLhNWIvxb54D9enyM4Ba/hKpxQxoy+EEhhtALYxEaTnMY2Y558Bxww1nnhUc7DmXzkPYL8d7jG+8OVAtL9+evPHBjzJrOX5pAloYmOGsBIYDDOA4Ih4QMnwUjIJRMApGLAAAHxBBVkyTovUAAAAASUVORK5CYII=","orcid":"","institution":"Sultan Qaboos University","correspondingAuthor":true,"prefix":"","firstName":"Mohammad","middleName":"","lastName":"Hadra","suffix":""},{"id":500854859,"identity":"8d516ef0-d054-4846-8e1a-bb446d97dd62","order_by":1,"name":"Karleen Cambridge","email":"","orcid":"","institution":"Sultan Qaboos University","correspondingAuthor":false,"prefix":"","firstName":"Karleen","middleName":"","lastName":"Cambridge","suffix":""},{"id":500854860,"identity":"641f9cc7-aa1e-45b6-a06e-b5e6e83bdcf5","order_by":2,"name":"Mostefa Mesbah","email":"","orcid":"","institution":"Sultan Qaboos University","correspondingAuthor":false,"prefix":"","firstName":"Mostefa","middleName":"","lastName":"Mesbah","suffix":""}],"badges":[],"createdAt":"2025-08-13 
02:23:07","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-7359956/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7359956/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1007/s40979-026-00213-1","type":"published","date":"2026-02-02T00:00:00+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":91400137,"identity":"d35ead45-e080-4e53-800c-6b64ef6bbf7b","added_by":"auto","created_at":"2025-09-16 06:51:45","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":16686,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eConfusion matrix illustration for three three-class problem\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-7359956/v1/57e75ea2cd19b4e84c744eea.png"},{"id":91400140,"identity":"97717175-7f08-41f2-ae76-ae75c92d3eec","added_by":"auto","created_at":"2025-09-16 06:51:45","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":17459,"visible":true,"origin":"","legend":"\u003cp\u003eConfusion matrix for class AI and all possible classification outcomes\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-7359956/v1/942aa047c6f50153ac74f2c6.png"},{"id":91401126,"identity":"cc603702-e376-4e36-b05b-6ef3d3316f00","added_by":"auto","created_at":"2025-09-16 06:59:45","extension":"jpeg","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":370124,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eConfusion matrix for Turnitin and Originality for per-class and multiclass 
performance\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"floatimage3.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7359956/v1/6ae8fbeca8895b93b28879eb.jpeg"},{"id":91401479,"identity":"f503f1fc-5208-455f-9094-2ec0950f58af","added_by":"auto","created_at":"2025-09-16 07:07:45","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":38864,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003ePlot of detector performance among three different text lengths\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-7359956/v1/c0ece5cfb446fb04d04e2652.png"},{"id":101775734,"identity":"7992ad60-79da-489b-9ee6-4cc3e3ceff1d","added_by":"auto","created_at":"2026-02-03 13:59:08","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1902895,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7359956/v1/6bbd7aaa-f15b-4a7e-aaf7-9b0dda68bc79.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Evaluating the Accuracy and Reliability of AI Content Detectors in Academic Contexts","fulltext":[{"header":"1 Introduction","content":"\u003cp\u003eThe increasing use of Artificial Intelligence (AI) tools to generate human-like text and other content along with their ability to adapt to different human writing styles, has introduced new financial, societal and academic challenges. In academia, integrity, intellectual property rights, students' overreliance on technology, and the impact on critical thinking are some of these challenges. 
The potential misuse of generative AI (Gen-AI) by students in producing essay assignments and presenting them as original work is a major concern in higher education institutions (Currie, \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) (Perkins, \u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). The use of Gen-AI to cheat in online exams is another persistent concern (Cotton, Cotton and Shipway, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2024\u003c/span\u003e). The possibility of Gen-AI misuse by students and researchers may have a profound impact on academic integrity (Zhong \u003cem\u003eet al.\u003c/em\u003e, 2023)(Wu, Duan and Ni, \u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). Challenges are not restricted to the academic setting. Dissemination of fake news to redirect public opinions and generate deceptive product reviews are also expected to increase many folds with the invention of Gen-AI tools, particularly those capable of producing not only text, but also synthetic images and videos (Saheb, Sidaoui and Schmarzo, \u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e2024\u003c/span\u003e)(Chakraborty \u003cem\u003eet al.\u003c/em\u003e, 2023). These challenges raise significant ethical concerns and issues (Bommasani \u003cem\u003eet al.\u003c/em\u003e, 2021).\u003c/p\u003e\u003cp\u003eCombating the negative consequences of the above-mentioned issues, automatic identification/detection of Gen-AI generated content has become essential. In their report, T. Waltzer et al. (Waltzer, Pilegard and Heyman, \u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) demonstrated that, on average, instructors correctly identified ChatGPT-written work only 70% of the time. A study by S. Gehrmann et al. 
(Gehrmann, Strobelt and Rush, \u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e2019\u003c/span\u003e) concluded that humans are unable to differentiate between texts generated by Gen-AI and those generated by humans. Hence, an accurate and reliable automatic tool is required for this detection.\u003c/p\u003e\u003cp\u003eDevelopers of Gen-AI detection tools mainly exploit the differences between the distributions of human-written and Gen-AI-generated text, along with other emerging techniques, such as watermarking signatures, to automatically distinguish between human and Gen-AI writing (Fariello et al., \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2025\u003c/span\u003e). \u003cem\u003eHuman text distribution\u003c/em\u003e and \u003cem\u003eGen-AI text distribution\u003c/em\u003e refer to the statistical patterns of texts written by humans and of those generated by Gen-AI models (Wu et al., \u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e2025\u003c/span\u003e). These distributions describe how linguistic features, such as word choice, sentence structure, and contextual coherence, are distributed across a dataset or text corpus. Human text is characterized by diversity, imperfect patterns, context awareness, non-randomness, and higher perplexity, or surprise (Liu et al., \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2024\u003c/span\u003e). In contrast, Gen-AI text is characterized by statistical optimization, repetition, high fluency, limited creativity, tighter patterns (in the frequency of specific words and in sentence length), and low perplexity, or limited surprise (Akram, \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). Detection tools extract these characteristics and feed them to classification algorithms.\u003c/p\u003e\u003cp\u003eSome developers of Large Language Models (LLMs), such as OpenAI, have aimed to maintain accountability for the misuse of Gen-AI tools in text creation. 
Hence, they proposed and adopted the watermarking technique. This technique works by embedding invisible statistical patterns or signals in the text during generation. Watermarking typically involves altering the choice of words, punctuation, or phrasing in subtle ways to encode a recognizable pattern. These patterns can later be analyzed to determine if the content originated from an AI model with watermarking enabled (Kirchenbauer \u003cem\u003eet al.\u003c/em\u003e, 2023)(Zhao \u003cem\u003eet al.\u003c/em\u003e, 2024).\u003c/p\u003e\u003cp\u003eAccuracy and reliability are critical for detection tools because their users are expected to judge a student's or researcher's work as authentic or plagiarized based on the results. The current literature offers contradictory views on the accuracy and reliability of Gen-AI detection tools, and some studies claim that it is not possible to create a tool that accurately detects Gen-AI content (Weber-Wulff et al., \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). Numerous studies have been conducted to test the performance of detection tools. In one such study, a group of researchers evaluated twelve publicly available tools and two commercial ones. They systematically examined the general functionality of the detection tools and evaluated their accuracy and error types. They concluded that the tested tools were neither accurate nor reliable (Weber-Wulff et al., \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2023\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eAnother study (Elkhatat, Elsaid and Almeer, \u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) evaluated five publicly available tools, including a tool developed by OpenAI. They used fifteen paragraphs generated by Gen-AI and five human responses as controls. 
They concluded that the detection tools were more accurate in detecting content generated by ChatGPT3.5 than by ChatGPT4 and that, when applied to human-generated text, they were inconsistent, producing false positives and uncertain classifications.\u003c/p\u003e\u003cp\u003eAnother study, carried out by Sadasivan et al. (Sadasivan \u003cem\u003eet al.\u003c/em\u003e, 2023), suggested that current Gen-AI detection tools relying on the watermarking technique (Kirchenbauer \u003cem\u003eet al.\u003c/em\u003e, 2023) and zero-shot classifiers are not reliable. They showed that a simple paraphraser, such as PEGASUS (Zhang \u003cem\u003eet al.\u003c/em\u003e, 2020), designed with a lightweight neural network, drastically reduced the detection accuracy. In contrast to the conclusions drawn by Sadasivan et al. (Sadasivan \u003cem\u003eet al.\u003c/em\u003e, 2023), Chakraborty et al. (Chakraborty \u003cem\u003eet al.\u003c/em\u003e, 2023) asserted that the claim that the detection of Gen-AI in writing is impossible is not supported by real-world evidence.\u003c/p\u003e\u003cp\u003eThe authors in (Chakraborty \u003cem\u003eet al.\u003c/em\u003e, 2023) argued that effective and practical Gen-AI detection remains possible, as long as human and machine-generated texts do not exhibit similar distributional characteristics. Drawing on information theory, they established sample complexity bounds, showing that as machine-generated text improves in quality, a larger sample size is required, which reduces the practicality of the approach. They evaluated their suggestions across multiple datasets, including \u003cb\u003eXSum\u003c/b\u003e, \u003cb\u003eSQuAD\u003c/b\u003e, \u003cb\u003eIMDb\u003c/b\u003e, and \u003cb\u003eKaggle FakeNews\u003c/b\u003e. Their study tested various state-of-the-art Gen-AI models against detectors such as RoBERTa-Large/Base-Detector and GPTZero.\u003c/p\u003e\u003cp\u003eGen-AI detectors face several key challenges. 
One such challenge is the wide range of attacks designed to evade detection or confuse the detection model. These attacks are schemes used to mislead the Gen-AI detector into believing that the tested texts were generated by a human. Such schemes include: 1) Paraphrasing attacks; 2) Synonym Replacement, in which users replace some words with their synonyms; and 3) Backtranslation, where the user translates the generated text into another language and then translates it back to the original language by machine translation. In some instances, a combination of these attacks is used, which is known as an Ensemble Attack (Creo and Pudasaini, \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e2024\u003c/span\u003e). V. S. Sadasivan et al. (Sadasivan \u003cem\u003eet al.\u003c/em\u003e, 2023) showed that a recursive paraphrasing attack was able to bypass a range of detectors, including those based on watermarking and neural networks, zero-shot classifiers, and retrieval-based detectors. Anderson et al. [17] were able to mislead a GPT-2 Output Detector by changing its output decision from 0.02% human-generated to 99% human-generated through paraphrasing. In their report, M. Perkins et al. (Perkins et al., \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) demonstrated that detector accuracy was reduced by 17.4% when simple techniques were applied to manipulate AI-generated content.\u003c/p\u003e\u003cp\u003eFrom the above-mentioned literature, it is clear that the accuracy and reliability of Gen-AI detection tools are yet to be conclusively proven. The studies addressing this issue are few, and most of the papers and reports are available as preprints in gray literature repositories, such as arXiv. Furthermore, most of the existing studies either used very limited data prepared by humans or used publicly available datasets generated by fluent professional or native speakers. 
Studies specifically targeting undergraduate students learning English as a Foreign Language (EFL) are non-existent. This gap motivated the current study. Additionally, several of the previous studies used a limited number of metrics to evaluate the performance of a detector, while others did not clearly describe the methodology used to enable replication.\u003c/p\u003e\u003cp\u003eIn this study, we attempt to systematically estimate the accuracy and reliability of two very popular AI detection tools in higher education institutes, i.e. \u003cb\u003eTurnitin\u003c/b\u003e and \u003cb\u003eOriginality\u003c/b\u003e. Our work includes the use of original Middle Eastern student work, within an EFL context, written before the introduction of Gen-AI tools, as one of the datasets used for evaluation, which guarantees that it is an authentic sample of student writing. This use of the dataset allows the examination of the alleged bias of the current AI detection tools towards non-professional writers (Liang et al., \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). Furthermore, the types of datasets used in this study allow for deep insights into the detectors' performance in different writing styles. Our work aims to provide a recommendation based on rigorous and data-driven evaluation for higher education institutes\u0026rsquo; educators and policymakers for using Gen-AI text detection tools. A complete description of the datasets used in this work is found in the methodology section of this paper.\u003c/p\u003e\u003cp\u003eThe rest of the paper is organized into five sections. In the methodology section, we present a comprehensive description of the data and data analysis employed in the study. The findings of the study are then reported in the results section, followed by an in-depth discussion of the findings in the discussion section. The next section outlines the limitations of the study and offers recommendations for future research. 
Finally, the conclusion summarizes the key findings of the study.\u003c/p\u003e"},{"header":"2 Methodology","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\u003ch2\u003e2.1 Data Description\u003c/h2\u003e\u003cp\u003eFor this research, we used a dataset of 192 texts, equally distributed across four categories: 48 authentic English as a Foreign Language (EFL) student-written texts submitted for grading before the emergence of AI-based text generation technology, 48 authentic human-written texts on diverse topics by professional writers, 48 AI-generated texts produced by ChatGPT and Claude AI, and 48 hybrid texts created by merging content from the EFL student texts and AI-generated texts while maintaining textual coherence. To cover different text lengths, we divided the texts into three categories: texts with a minimum of 300 words, texts around 500 words (with a \u0026plusmn;10% margin), and texts around 1000 words (with a \u0026plusmn;10% margin). Additionally, we considered texts from both the humanities and science genres.\u003c/p\u003e\u003cp\u003eThe EFL student texts were drawn from final drafts of reports submitted for grading as a requirement of specific Foundation Program English for Sciences (FPES) courses at the Centre for Preparatory Studies, Sultan Qaboos University in Oman. The selected student reports were submitted in or before August 2022 to eliminate the possibility of Gen-AI usage. All necessary approvals were obtained from the university to use the students\u0026rsquo; submissions for the purpose of this research.\u003c/p\u003e\u003cp\u003eThe authentic texts by professional writers were sourced from the XSum (Extreme Summarization) dataset, a well-known open-source resource frequently used in Natural Language Processing (NLP) research. 
The XSum dataset is a large-scale corpus developed by the University of Edinburgh\u0026rsquo;s NLP group for abstractive text summarization (\u003cem\u003emarsyas/gtzan \u0026middot; Datasets at Hugging Face\u003c/em\u003e, no date)(\u003cem\u003eEdinburghNLP/XSum: Topic-Aware Convolutional Neural Networks for Extreme Summarization\u003c/em\u003e, no date). It consists of over 226,000 news articles from the BBC, each accompanied by a single-sentence summary that provides a highly condensed version of the original content. The articles, collected from BBC content (2010\u0026ndash;2017), include a variety of domains, including news, politics, sports, weather, business, technology, science, education, entertainment, and the arts.\u003c/p\u003e\u003cp\u003eDue to its concise and information-dense nature, XSum is a strong representation of human-authored text, making it well-suited for evaluating AI-generated text detectors. Its use in benchmarking summarization models has established it as a widely recognized resource in NLP research.\u003c/p\u003e\u003cp\u003eThe collected data covers four categories of written text: AI-generated, hybrid text, professional writer text, and EFL student-written texts. The inclusion of the last two categories allows us to investigate whether there is a performance difference in detectors when distinguishing between EFL-written texts and professional-written texts. This comparison helps address claims in the literature regarding potential biases in detectors against non-professional writers.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\u003ch2\u003e2.2 Classification Approach\u003c/h2\u003e\u003cp\u003eTurnitin reports the percentage of AI authorship, while Originality provides a likelihood score indicating whether the text is AI or human-authored (referred to as \"Original\"). 
To enable meaningful evaluation of the AI content detectors, these probabilistic outputs, typically expressed as percentage likelihoods, were converted into discrete classification categories.\u003c/p\u003e\u003cp\u003eIn this study, we adopt a thresholding strategy that maps the detector's output into three classes: \u003cem\u003eHuman-Generated\u003c/em\u003e, \u003cem\u003eAI-Generated\u003c/em\u003e, and \u003cem\u003eHybrid\u003c/em\u003e. More specifically, texts with an AI-content probability of 0\u0026ndash;20% are labelled as \u003cem\u003eHuman\u003c/em\u003e, those of 21\u0026ndash;79% as \u003cem\u003eHybrid\u003c/em\u003e, and those of 80% or above as \u003cem\u003eAI\u003c/em\u003e. This threshold schema is grounded in both the output behavior of existing AI content detectors and practices documented in related studies (Weber-Wulff et al., \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). For instance, tools such as GPTZero and OpenAI's AI classifier (now deprecated) have used a high-confidence threshold of 80% or more to indicate AI authorship, while probabilities below 20% are considered strong indicators of human authorship (Buchert, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) (OpenAI, \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). The intermediate range is often associated with mixed or ambiguous authorship, justifying its alignment with the \u003cem\u003eHybrid\u003c/em\u003e category. Furthermore, the schema is consistent with Turnitin\u0026rsquo;s published guidelines and with common practice among AI content detection tools (iThenticate, \u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e2024\u003c/span\u003e). Turnitin no longer reports specific detection percentages below 20% because of the higher risk of false positives, while results above 80% are regarded as strong indicators of AI authorship. 
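\u003c/p\u003e\u003cp\u003eThe thresholding rule described above can be sketched in Python as follows. This is a minimal illustration, not the authors\u0026rsquo; implementation: the function name label_from_ai_percentage is hypothetical, and the handling of fractional scores that fall strictly between 20 and 21 or between 79 and 80 (treated here as Hybrid) is an assumption, since the text specifies only whole-number ranges.\u003c/p\u003e

```python
def label_from_ai_percentage(ai_percent: float) -> str:
    """Map a detector's AI-content percentage to Human, Hybrid, or AI.

    Thresholds follow the text: 0-20% Human, 21-79% Hybrid, 80% or above AI.
    Assumption: fractional values strictly between 20 and 80 count as Hybrid.
    """
    if not 0 <= ai_percent <= 100:
        raise ValueError("ai_percent must lie between 0 and 100")
    if ai_percent >= 80:
        return "AI"
    if ai_percent <= 20:
        return "Human"
    return "Hybrid"


if __name__ == "__main__":
    # Hypothetical detector scores, for illustration only.
    for score in (5, 20, 21, 79, 80, 100):
        print(score, label_from_ai_percentage(score))
```

\u003cp\u003e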
Originality follows the same guidelines as Turnitin (\u003cem\u003eHow Does AI Content Detection Work? \u0026ndash; Originality.AI\u003c/em\u003e, no date).\u003c/p\u003e\u003cp\u003eThese thresholds were not derived from the present dataset but were instead based on external standards, allowing for a fair assessment of each detector\u0026rsquo;s current performance against an accepted norm. The classification results are compared with the ground-truth labels to compute standard performance metrics, such as accuracy, precision, recall, specificity, and F1-score.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec5\" class=\"Section2\"\u003e\u003ch2\u003e2.3 Data Analysis\u003c/h2\u003e\u003cdiv id=\"Sec6\" class=\"Section3\"\u003e\u003ch2\u003e2.3.1 Performance Analysis\u003c/h2\u003e\u003cp\u003eThe confusion matrix is a standard tool used to evaluate the performance of classification models. While it is commonly applied to binary classification problems, it can be extended to multi-class classification tasks. It is widely adopted in Machine Learning (ML) research and practice to assess model performance and to guide the hyperparameter tuning of ML algorithms. In our context, we treat AI detectors as text-to-label classifier models that categorize text into one of three classes: AI, Human, or Hybrid. The confusion matrix provides a detailed breakdown of how predicted labels align with the actual labels. A confusion matrix is typically represented as a table containing all possible combinations of actual and predicted class labels. 
Figure\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e illustrates the three-class confusion matrix, where each cell represents the number or percentage of instances assigned to a specific predicted class by the classifier.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eThe confusion matrix serves as the foundation for calculating several critical performance metrics, such as Recall (Sensitivity), Specificity, Precision, Accuracy, and F1-score. These metrics are essential for understanding classification quality beyond overall accuracy.\u003c/p\u003e\u003cp\u003eTo illustrate how these metrics are derived, consider the class AI and how the classifier correctly or incorrectly labels instances of this class. For this explanation, we treat AI-generated texts as the positive class.\u003c/p\u003e\u003cp\u003eFigure\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e depicts the classification outcomes related to the AI-generated class in terms of TP, TN, FP, and FN, defined below.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eTrue Positives (TP)\u003c/b\u003e represents the number of AI-authored texts that the detector correctly identifies as AI-authored.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eTrue Negatives (TN)\u003c/b\u003e represents the number of non-AI texts (Human or Hybrid) that the detector correctly classifies as not AI-authored.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eFalse Positives (FP)\u003c/b\u003e represents the number of non-AI-authored texts that the detector incorrectly classifies as AI-authored.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eFalse Negatives (FN)\u003c/b\u003e represents the number of AI-authored texts that the detector incorrectly classifies as 
non-AI.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eIt is important to note that in binary or one-vs-rest classifications, the positive or negative designation depends on the class under consideration; in this case, it is AI-generated.\u003c/p\u003e\u003cp\u003eThe above-mentioned performance measures are mathematically expressed as follows:\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eSensitivity (Sen) / Recall\u003c/strong\u003e\u003cp\u003eThis measure describes, out of all actual AI-authored texts, how many were correctly identified by the detector.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv id=\"Equa\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equa\" name=\"EquationSource\"\u003e\n$$Sen = \\frac{TP}{TP+\\sum FN} \\tag{1}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eSpecificity\u003c/b\u003e (\u003cb\u003eSpec\u003c/b\u003e): Specificity indicates, out of all non-AI texts, how many were correctly identified as not AI-authored.\u003cdiv id=\"Equb\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equb\" name=\"EquationSource\"\u003e\n$$Spec = \\frac{\\sum TN}{\\sum TN+\\sum FP} \\tag{2}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003ePrecision\u003c/strong\u003e\u003cp\u003ePrecision indicates, out of all texts predicted as AI-authored, how many were actually AI-authored.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv id=\"Equc\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equc\" 
name=\"EquationSource\"\u003e\n$$Precision = \\frac{TP}{TP+\\sum FP} \\tag{3}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eF1-Score\u003csub\u003eAI\u003c/sub\u003e\u003c/strong\u003e\u003cp\u003eF1-score is the harmonic mean of Precision and Recall. It balances both false positives and false negatives.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv id=\"Equd\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equd\" name=\"EquationSource\"\u003e\n$$F1\\text{-}score = 2\\times\\frac{Precision\\times Sen}{Precision+Sen} \\tag{4}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eAccuracy\u003c/b\u003e (\u003cb\u003eAcc\u003c/b\u003e): The accuracy indicates the overall rate at which the detector correctly classifies both AI and non-AI texts.\u003cdiv id=\"Eque\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Eque\" name=\"EquationSource\"\u003e\n$$Acc = \\frac{TP+\\sum TN}{TP+\\sum TN+\\sum FP+\\sum FN} \\tag{5}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eThese metrics together provide a comprehensive evaluation of the model's performance. Relying on accuracy alone, for example, can be misleading, especially in the presence of class imbalance.\u003c/p\u003e\u003cp\u003eWe repeated the above calculations for all three classes (AI, Human, Hybrid), treating each one in turn as the positive class versus the rest. 
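\u003c/p\u003e\u003cp\u003eA minimal Python sketch of the one-vs-rest calculations in Eqs. (1) to (5), followed by an unweighted mean over the three classes, is given here. The 3 \u0026times; 3 confusion matrix uses made-up counts for illustration only and is not the study data; the function names one_vs_rest_metrics and macro_average are hypothetical, and rows are assumed to hold actual classes while columns hold predicted classes.\u003c/p\u003e

```python
CLASSES = ["Human", "AI", "Hybrid"]


def _ratio(numerator, denominator):
    """Safe division: return 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def one_vs_rest_metrics(matrix, positive):
    """Compute Eqs. (1)-(5) for one class treated as positive.

    matrix[i][j] is the number of texts whose actual class is i and whose
    predicted class is j (assumed layout).
    """
    total = sum(sum(row) for row in matrix)
    tp = matrix[positive][positive]
    fn = sum(matrix[positive]) - tp                 # actual positive, predicted other
    fp = sum(row[positive] for row in matrix) - tp  # actual other, predicted positive
    tn = total - tp - fn - fp
    sen = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    return {
        "sensitivity": sen,
        "specificity": _ratio(tn, tn + fp),
        "precision": precision,
        "f1": _ratio(2 * precision * sen, precision + sen),
        "accuracy": _ratio(tp + tn, total),
    }


def macro_average(matrix):
    """Unweighted mean of each metric across all classes."""
    per_class = [one_vs_rest_metrics(matrix, k) for k in range(len(matrix))]
    return {name: sum(m[name] for m in per_class) / len(per_class)
            for name in per_class[0]}


# Hypothetical counts for illustration only (not the study data).
example = [
    [8, 1, 1],  # actual Human
    [2, 6, 2],  # actual AI
    [4, 1, 5],  # actual Hybrid
]

if __name__ == "__main__":
    for index, name in enumerate(CLASSES):
        print(name, one_vs_rest_metrics(example, index))
    print("macro", macro_average(example))
```

\u003cp\u003e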
Given that the dataset exhibits moderate imbalance (with Human-authored texts being approximately twice as frequent as either the AI or the Hybrid class), we report the macro-averaged performance metrics. Macro-averaging gives equal weight to each class, offering a balanced and conservative view of performance in imbalanced settings (De Angeli et al., \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2022\u003c/span\u003e).\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec7\" class=\"Section3\"\u003e\u003ch2\u003e2.3.2 Statistical Analysis\u003c/h2\u003e\u003cp\u003eIn our analysis, we systematically evaluate the performance of AI content detectors across several key variables relevant to academic contexts. These variables are \u003cb\u003e(1) Text length\u003c/b\u003e, categorized into three ranges: (A) 300\u0026ndash;330 words, (B) 450\u0026ndash;550 words, and (C) 900\u0026ndash;1100 words; \u003cb\u003e(2) Text genre\u003c/b\u003e, comparing Science and Humanities; and \u003cb\u003e(3) Authorship group\u003c/b\u003e, comparing EFL student writing with professional writing.\u003c/p\u003e\u003cp\u003eTo determine whether detector performance varied significantly across the study variables, we applied Pearson\u0026rsquo;s chi-square test of independence. For each analysis, we (I) constructed an r \u0026times; c contingency table, (II) verified that the assumption of adequate expected cell counts was satisfied (no expected frequency\u0026thinsp;\u0026lt;\u0026thinsp;5), and (III) calculated the χ\u0026sup2; statistic. 
We report the χ\u0026sup2; value and its associated p-value for every comparison (Bewick, Cheek and Ball, \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2003\u003c/span\u003e)(McHugh, \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e2013\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eIn cases where the expected-count assumption of the Chi-square test is violated, we employ an exact alternative, namely Fisher\u0026rsquo;s Exact Test (Fisher, Marshall and Mitchell, \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e2011\u003c/span\u003e). Unlike the Chi-square test, Fisher\u0026rsquo;s Exact Test does not rely on large-sample approximations and is therefore appropriate when expected cell frequencies are small (Fisher, Marshall and Mitchell, \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e2011\u003c/span\u003e). It is particularly suited to 2\u0026times;2 contingency tables with small or unevenly distributed cell counts, as it calculates the exact hypergeometric probability of observing the obtained frequencies under the null hypothesis of independence. In our context, the test assessed whether the distribution of classification outcomes (Correct vs. Incorrect) was statistically independent of the author group, with a directional alternative hypothesis positing that professionally authored texts from the open XSum database would yield significantly higher correct classification rates than EFL-authored texts.\u003c/p\u003e\u003cp\u003eAll performance metrics (e.g., accuracy, precision, recall, F1-score) are reported alongside the corresponding statistical test results and p-values, providing a comprehensive and rigorous evaluation of how each variable affects the detector's classification performance. 
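\u003c/p\u003e\u003cp\u003eThe testing procedure can be sketched with the Python standard library alone. This is an illustrative sketch under stated assumptions rather than the software used in the study: the function names chi_square, needs_exact_test, and fisher_exact_greater are hypothetical, and the example counts are invented. The sketch computes the χ\u0026sup2; statistic and expected counts for an r \u0026times; c table, flags tables with any expected frequency below 5, and evaluates a one-sided Fisher p-value for a 2\u0026times;2 table from the hypergeometric distribution. Converting the χ\u0026sup2; statistic to a p-value is left to a statistics package.\u003c/p\u003e

```python
from math import comb


def chi_square(table):
    """Return (statistic, degrees of freedom, expected counts) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


def needs_exact_test(expected, minimum=5):
    """True when any expected frequency falls below the minimum (5 in the text)."""
    return any(value < minimum for row in expected for value in row)


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher p-value for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null, i.e. the probability of
    at least as many first-row, first-column counts as were observed.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    favourable = sum(comb(col1, k) * comb(n - col1, row1 - k)
                     for k in range(a, upper + 1))
    return favourable / comb(n, row1)


if __name__ == "__main__":
    # Invented counts: rows = author group, columns = (Correct, Incorrect).
    table = [[10, 20], [20, 10]]
    statistic, dof, expected = chi_square(table)
    print(statistic, dof, needs_exact_test(expected))
    print(fisher_exact_greater(3, 1, 1, 3))
```

\u003cp\u003e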
This multi-method approach ensures both statistical robustness and interpretability of the findings.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e"},{"header":"3 Results","content":"\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\u003ch2\u003e3.1 Overall Detection Performance\u003c/h2\u003e\u003cp\u003eFigure\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e shows the confusion matrix results of both detectors for per-class and multi-class performance.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e shows the results of the overall per-class detection performance for each detector.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003ePer-class performance metrics for Turnitin and Originality\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"7\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDetector\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" 
colname=\"c2\"\u003e\u003cp\u003eClass\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eSensitivity\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eSpecificity\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003ePrecision\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003eF1-Score\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c7\"\u003e\u003cp\u003eAccuracy\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e\u003cb\u003eTurnitin\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003eHuman\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.93\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.53\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.66\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.77\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.73\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003eAI\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.29\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.98\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.82\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.43\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c7\"\u003e\u003cp\u003e0.81\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003eHybrid\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.31\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.82\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.37\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.34\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.69\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e\u003cb\u003eOriginality\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003eHuman\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.96\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.55\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.68\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.80\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.76\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003eAI\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.83\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.90\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.74\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" 
char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.78\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.89\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003eHybrid\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.02\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.99\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.33\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.04\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.74\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eThe table summarizes how Turnitin and Originality perform across three text classes, namely Human-generated, AI-generated, and Hybrid, using standard classification metrics.\u003c/p\u003e\u003cp\u003e\u003cb\u003eTurnitin\u003c/b\u003e\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eClass Human-generated\u003c/b\u003e: The results show high Sensitivity (0.93) and good Precision (0.66), but low Specificity (0.53). The low Specificity indicates that many non-Human texts are misclassified as Human. F1-score (0.77) and Accuracy (0.73) reflect the overall performance.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eClass AI-generated\u003c/b\u003e: Here, the results show very high Specificity (0.98) and Precision (0.82). This means that the detector rarely mislabels other texts as AI. However, low Sensitivity (0.29) indicates that many AI texts are missed. 
F1-score (0.43) shows a limited balance between Precision and Recall.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eClass Hybrid\u003c/b\u003e: Performance is weak on the key detection metrics, with low Sensitivity (0.31), Precision (0.37), and F1-score (0.34), suggesting poor detection of Hybrid texts.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eOriginality\u003c/b\u003e\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eClass Human\u003c/b\u003e: Excellent Sensitivity (0.96) and good Precision (0.68) yield a high F1-score (0.80). Specificity is low (0.55), while Accuracy (0.76) remains moderate.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eClass AI\u003c/b\u003e: High performance with Sensitivity (0.83), Specificity (0.90), and F1-score (0.78), indicating good classification performance for this class.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eClass Hybrid\u003c/b\u003e: Sensitivity is extremely low (0.02), with minimal correct detection. While Specificity (0.99) is high, the F1-score (0.04) confirms poor classification for this class.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e shows the macro-averaged results for both detectors. Macro-averaging summarizes overall detector performance by computing the unweighted mean of each metric (e.g., precision, recall, F1-score) across all classes, treating each class equally regardless of its frequency. It is particularly suitable for imbalanced class distributions (Takahashi \u003cem\u003eet al.\u003c/em\u003e, 2022). 
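\u003c/p\u003e\u003cp\u003eAs a minimal sketch of the macro-averaging described above, the following Python snippet takes the unweighted mean of the per-class values reported in the per-class results table. The dictionary layout and the helper names (PER_CLASS, macro_average, macro_summary) are illustrative assumptions rather than part of the study's tooling. Because the inputs are already rounded to two decimals, a recomputed value may differ from the published macro-average by about 0.01.\u003c/p\u003e

```python
from statistics import mean

# Per-class values copied from the per-class results table (already rounded
# to two decimals). Order within each list: Human, AI, Hybrid.
# The names below are hypothetical and chosen for illustration only.
PER_CLASS = {
    "Turnitin": {
        "precision": [0.66, 0.82, 0.37],
        "recall": [0.93, 0.29, 0.31],
        "specificity": [0.53, 0.98, 0.82],
        "f1": [0.77, 0.43, 0.34],
    },
    "Originality": {
        "precision": [0.68, 0.74, 0.33],
        "recall": [0.96, 0.83, 0.02],
        "specificity": [0.55, 0.90, 0.99],
        "f1": [0.80, 0.78, 0.04],
    },
}


def macro_average(per_class_values):
    """Unweighted mean across classes: every class counts equally."""
    return mean(per_class_values)


def macro_summary(detector):
    """Return the macro-average of each metric, rounded to two decimals."""
    return {
        metric: round(macro_average(values), 2)
        for metric, values in PER_CLASS[detector].items()
    }


if __name__ == "__main__":
    for name in PER_CLASS:
        print(name, macro_summary(name))
```

\u003cp\u003e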
These metrics provide a clearer picture of each system's consistency across Human, AI, and Hybrid classifications.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eMacro-average of all performance metrics for Turnitin and Originality\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"6\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDetector\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eOverall Accuracy\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMacro Avg Precision\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eMacro Avg Specificity\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eMacro Avg Recall (Sensitivity)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003eMacro Avg F1-Score\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eTurnitin\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd 
align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003e0.61\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e\u003cb\u003e0.62\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e\u003cb\u003e0.78\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u003cb\u003e0.51\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e\u003cb\u003e0.51\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eOriginality\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003e0.69\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e\u003cb\u003e0.59\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e\u003cb\u003e0.81\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u003cb\u003e0.60\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e\u003cb\u003e0.54\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eTurnitin\u003c/b\u003e\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eOverall Accuracy (0.61) and Macro F1-score (0.51) indicate limited overall performance.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eThe gap between \u003cb\u003ePrecision (0.62)\u003c/b\u003e and \u003cb\u003eRecall (0.51)\u003c/b\u003e suggests that while Turnitin may be cautious in its positive predictions, it fails to adequately capture true 
instances across all classes. This aligns with the low recall values observed in the per-class results for AI (0.29) and Hybrid (0.31), confirming inconsistent and unreliable detection across categories.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eA macro-average specificity of 0.78 shows a moderate ability to avoid false positives: specificity is strong for AI (0.98) and Hybrid (0.82) but weak for the Human class (0.53), which reflects inconsistency across classes.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eOriginality\u003c/b\u003e\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eOverall Accuracy (0.69) and Macro F1-score (0.54) are unacceptably low for reliable use, especially in academic applications requiring consistent detection.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eDespite relatively balanced Precision (0.59) and Recall (0.60), the low F1-score reflects weak overall classification effectiveness.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eThe Hybrid class, with a near-zero recall (0.02), heavily impacts these averages, exposing a critical failure in handling this category.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eMacro-average specificity (0.81) is slightly higher than Turnitin's (0.78), driven by high values for AI (0.90) and Hybrid (0.99). Performance is more consistent but still not strong.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eOverall, both detectors show limited macro-level performance, and the shortfalls align with poor per-class detection, particularly in the Hybrid category. 
These results highlight significant limitations in their current state.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e\u003ch2\u003e3.2 The Effect of Text Length on Performance\u003c/h2\u003e\u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e shows the comparison results of the detector performance for different text lengths, and Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e visualizes the results. In the figure, we only present and compare the results of accuracy, precision, and recall for clarity.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eDetectors' performance among the three different text lengths\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"7\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDetector\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eText length\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" 
colname=\"c3\"\u003e\u003cp\u003eAccuracy\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003ePrecision\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eRecall\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003eSpecificity\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c7\"\u003e\u003cp\u003eF1-Score\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eTurnitin\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003e300\u0026ndash;330\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e\u003cb\u003e0.87\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e\u003cb\u003e0.71\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u003cb\u003e0.94\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e\u003cb\u003e0.95\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e\u003cb\u003e0.77\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eTurnitin\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003e450\u0026ndash;550\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e\u003cb\u003e0.56\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e\u003cb\u003e0.60\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u003cb\u003e0.48\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e\u003cb\u003e0.75\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e\u003cb\u003e0.46\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eTurnitin\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd 
align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003e900\u0026ndash;1100\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e\u003cb\u003e0.68\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e\u003cb\u003e0.69\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u003cb\u003e0.76\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e\u003cb\u003e0.88\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e\u003cb\u003e0.54\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eOriginality\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003e300\u0026ndash;330\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e\u003cb\u003e0.96\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e\u003cb\u003e0.65\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u003cb\u003e0.67\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e\u003cb\u003e0.93\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e\u003cb\u003e0.66\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eOriginality\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003e450\u0026ndash;550\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" 
char=\".\" colname=\"c3\"\u003e\u003cp\u003e\u003cb\u003e0.63\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e\u003cb\u003e0.59\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u003cb\u003e0.60\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e\u003cb\u003e0.79\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e\u003cb\u003e0.51\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eOriginality\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003e900\u0026ndash;1100\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e\u003cb\u003e0.84\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e\u003cb\u003e0.60\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u003cb\u003e0.58\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e\u003cb\u003e0.90\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e\u003cb\u003e0.58\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eThese results indicate that text length affects the performance of both Turnitin and Originality in detecting AI-generated content:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eShort texts (300\u0026ndash;330 words) were better classified by 
both systems.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eMedium-length texts (450\u0026ndash;550 words) show a drop in all metrics, especially for Turnitin.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eLong texts (900\u0026ndash;1100 words) recover part of this drop on most metrics but still perform worse than short texts.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThis implies that both tools struggle more once texts exceed the short range and that their performance is inconsistent as text length changes. We then used Pearson\u0026rsquo;s chi-square test of independence to check whether the observed performance difference is statistically significant; the contingency table of correct vs. incorrect detections created for the test is presented in Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eContingency table for text length effect\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"8\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c8\" 
colnum=\"8\"\u003e\u003c/div\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colspan=\"4\" nameend=\"c4\" namest=\"c1\"\u003e\u003cp\u003eTurnitin\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\" morerows=\"3\" rowspan=\"4\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colspan=\"3\" nameend=\"c8\" namest=\"c6\"\u003e\u003cp\u003eOriginality\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eA (300\u0026ndash;329)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eB (450\u0026ndash;549)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eC (900\u0026ndash;1100)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eA (300\u0026ndash;329)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003eB (450\u0026ndash;549)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003eC (900\u0026ndash;1100)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eCorrect\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e20\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e81\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e17\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e22\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e90\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e21\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eIncorrect\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003e3\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e\u003cb\u003e63\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e\u003cb\u003e8\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e\u003cb\u003e1\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e\u003cb\u003e54\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e\u003cb\u003e4\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003ePearson\u0026rsquo;s chi-square test of independence for Turnitin results yielded χ\u0026sup2;(2)\u0026thinsp;=\u0026thinsp;8.41, \u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.0149, and for Originality χ\u0026sup2;(2)\u0026thinsp;=\u0026thinsp;13.17, \u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.0014, indicating a statistically significant association between text length and detection outcome for both systems. This result has implications for academic settings and may be attributable to several linguistic features, which are addressed in the Discussion section.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\u003ch2\u003e3.3 The Effect of Text Genre on Performance\u003c/h2\u003e\u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e shows detector performance by text genre (Science vs. Humanities). 
Table\u0026nbsp;\u003cspan refid=\"Tab6\" class=\"InternalRef\"\u003e6\u003c/span\u003e shows the contingency table used for statistical tests of significance.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eDetectors performance across text genre\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"7\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDetector\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eGenre\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eAccuracy\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003ePrecision\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eRecall\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003eSpecificity\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" 
colname=\"c7\"\u003e\u003cp\u003eF1-Score\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eTurnitin\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003eSci\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.51\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.59\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.50\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.75\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.47\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eTurnitin\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003eHum\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.86\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.60\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.46\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.95\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.51\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eOriginality\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003eSci\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c3\"\u003e\u003cp\u003e0.58\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.52\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.59\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.79\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.49\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eOriginality\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003eHum\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.96\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.65\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.59\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.93\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.62\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eThe results presented in Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e show that both \u003cb\u003eTurnitin\u003c/b\u003e and \u003cb\u003eOriginality\u003c/b\u003e perform better on humanities texts than on science texts in accuracy, precision, specificity, and F1-score, whereas recall does not follow this pattern:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eTurnitin\u003c/b\u003e: Shows a clear drop in accuracy (from 0.86 in humanities to 0.51 in science), with lower specificity and F1-score and marginally lower precision in 
science.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eOriginality\u003c/b\u003e: Demonstrates high accuracy in humanities (0.96) and a clear drop in science (0.58) but maintains the same recall (0.59) across genres. Specificity and precision, however, still show notable declines in science.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eStatistical tests\u003c/b\u003e\u003c/p\u003e\u003cp\u003eWe used Pearson\u0026rsquo;s chi-square test of independence to check whether the observed performance difference is statistically significant. The contingency table of correct vs. incorrect detections created for the test is presented in Table\u0026nbsp;\u003cspan refid=\"Tab6\" class=\"InternalRef\"\u003e6\u003c/span\u003e.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab6\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 6\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eContingency table to test statistical differences between text genres\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"6\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colspan=\"3\" nameend=\"c3\" namest=\"c1\"\u003e\u003cp\u003eTurnitin\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" 
colname=\"c4\"\u003e\u0026nbsp;\u003c/th\u003e\u003cth align=\"left\" colspan=\"2\" nameend=\"c6\" namest=\"c5\"\u003e\u003cp\u003eOriginality\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eText Genre\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eIncorrect\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCorrect\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eIncorrect\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003eCorrect\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eHumanities\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003e8\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e\u003cb\u003e49\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u003cb\u003e2\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e\u003cb\u003e55\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eScience\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003e66\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e\u003cb\u003e69\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u003cb\u003e57\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c6\"\u003e\u003cp\u003e\u003cb\u003e78\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003ePearson\u0026rsquo;s chi-square test of independence yielded χ\u0026sup2;(1)\u0026thinsp;=\u0026thinsp;19.109, \u003cem\u003ep\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;0.001 for Turnitin and χ\u0026sup2;(1)\u0026thinsp;=\u0026thinsp;26.429, \u003cem\u003ep\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;0.001 for Originality, indicating a statistically significant association between text genre and detection outcome for both detectors. The chi-square assumption that all expected cell counts are greater than or equal to 5 was examined and found to be satisfied.\u003c/p\u003e\u003cp\u003eBoth Turnitin and Originality detectors exhibit genre-dependent performance, with significantly better detection accuracy on humanities texts compared to science. The consistency of this pattern across all metrics in both systems, supported by highly significant statistical tests, points to a systematic limitation in genre generalization, which has a number of implications in academic settings. These findings underscore the need for evaluating AI content detectors not just overall, but within specific academic domains.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\u003ch2\u003e3.4 The Effect of Text Type on Performance\u003c/h2\u003e\u003cp\u003eIn the analysis of detector performance across different authorship groups (EFL vs. Open Database), the available ground-truth data consisted exclusively of texts that were written by humans. 
This constraint limits the classification framework to a single actual class (\u0026ldquo;Human\u0026rdquo;), which prevents the construction of the full multi-class confusion matrix needed to compute metrics such as precision, specificity, and F1-score (Skaik, \u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e2008\u003c/span\u003e). Consequently, the evaluation of performance in this context was confined to a binary distinction between correctly and incorrectly classified Human texts. As a result, accuracy, defined as the proportion of correctly identified Human texts out of the total, was adopted as the sole performance metric. This choice is methodologically justified, as accuracy in this case is mathematically equivalent to recall (sensitivity) for the Human class, and serves as an appropriate indicator of the detector\u0026rsquo;s ability to detect true Human authorship without misclassification. The use of accuracy under this design reflects a constrained but valid and interpretable approach to assessing detector consistency concerning variations in author background.\u003c/p\u003e\u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab7\" class=\"InternalRef\"\u003e7\u003c/span\u003e presents detector accuracy by authorship group, i.e., EFL students and professional writers from the Open Database. Table\u0026nbsp;\u003cspan refid=\"Tab8\" class=\"InternalRef\"\u003e8\u003c/span\u003e gives the contingency table used for the significance test. 
Tables are followed by a summary of performance trends and statistical findings for both detectors.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab7\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 7\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eDetectors\u0026rsquo; performance with different authorships\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"6\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDetector\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAuthor\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCorrect\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eIncorrect\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eTotal\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003eAccuracy (%)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eTurnitin\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eOpen Database\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" 
colname=\"c3\"\u003e\u003cp\u003e45\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003e3\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003e48\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003e93.8%\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eEFL\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003e44\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003e4\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003e48\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003e91.7%\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eOriginality\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003eOpen Database\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e\u003cb\u003e48\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e\u003cb\u003e0\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u003cb\u003e48\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e\u003cb\u003e100%\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003eEFL\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c3\"\u003e\u003cp\u003e\u003cb\u003e44\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e\u003cb\u003e4\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u003cb\u003e48\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e\u003cb\u003e91.7%\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eTo determine whether there is a statistically significant difference in detectors\u0026rsquo; performance between texts authored by EFL students and those drawn from the Open Database, Fisher\u0026rsquo;s Exact Test was employed. Classification outcomes (correct vs. incorrect detection) were cross-tabulated by text source (EFL vs. Open Database), and the test assessed whether the distribution of these outcomes was independent of the author group. Given the directional hypothesis that AI detectors may perform more favorably on texts authored by Open Database writers, a one-sided Fisher\u0026rsquo;s Exact Test was conducted for each detector. 
Table\u0026nbsp;\u003cspan refid=\"Tab8\" class=\"InternalRef\"\u003e8\u003c/span\u003e presents the contingency table, followed by the results of Fisher\u0026rsquo;s Exact Test.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab8\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 8\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eContingency table used for Fisher\u0026rsquo;s Exact Test by authorship group\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"6\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colspan=\"3\" nameend=\"c3\" namest=\"c1\"\u003e\u003cp\u003eTurnitin\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/th\u003e\u003cth align=\"left\" colspan=\"2\" nameend=\"c6\" namest=\"c5\"\u003e\u003cp\u003eOriginality\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAuthor\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eIncorrect\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCorrect\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eIncorrect\u003c/p\u003e\u003c/th\u003e\u003cth 
align=\"left\" colname=\"c6\"\u003e\u003cp\u003eCorrect\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eOpen Database\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003e3\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003e45\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003e48\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eEFL\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003e4\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e\u003cb\u003e44\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u003cb\u003e4\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e\u003cb\u003e44\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eFor Turnitin, the test yielded a non-significant result (one-sided p-value for Open DB\u0026thinsp;\u0026gt;\u0026thinsp;EFL\u0026thinsp;=\u0026thinsp;0.500; odds ratio\u0026thinsp;=\u0026thinsp;1.364), indicating no evidence of better performance for the Open Database group. 
For Originality, the one-sided p-value of Fisher\u0026rsquo;s Exact Test (Open DB\u0026thinsp;\u0026gt;\u0026thinsp;EFL) was 0.058, with the odds ratio undefined because of a zero count in one cell, suggesting a trend toward higher accuracy in the Open Database group.\u003c/p\u003e\u003cp\u003eThe above analysis shows that for Turnitin, Fisher\u0026rsquo;s Exact Test yielded a non-significant result, indicating no evidence of a performance difference between authorship groups. In contrast, for Originality, the result approached statistical significance, suggesting a potential trend toward higher accuracy for texts from the Open Database group. Specifically, Turnitin's accuracy decreased slightly from 93.8% for texts written by professional authors (Open Database) to 91.7% for those written by EFL students. For Originality, the accuracy dropped more noticeably, from 100% for professional texts to 91.7% for EFL texts. These outcomes, combined with Fisher\u0026rsquo;s test results, suggest that Turnitin maintains consistent performance across authorship types, showing no detectable bias. In contrast, Originality demonstrates a slight tendency toward favoring professional writing, with a near-significant advantage (one-sided p\u0026thinsp;=\u0026thinsp;0.0586) for the Open Database texts. This comparative analysis indicates that authorship type has no detectable influence on Turnitin\u0026rsquo;s performance, while Originality displays mild author-dependent behavior, primarily in its handling of errors and classification decisions. 
The borderline significant result (p\u0026thinsp;=\u0026thinsp;0.058) in Originality\u0026rsquo;s performance across authorship groups highlights a possible bias that warrants further investigation.\u003c/p\u003e\u003c/div\u003e"},{"header":"4 Discussion","content":"\u003cp\u003eThis study offers a critical empirical assessment of both Turnitin and Originality tools across varied text genres, lengths, and authorship types, revealing clear patterns of inconsistency, technical limitations, and signs of potential bias. Based on the accumulated evidence, it must be concluded that neither Turnitin nor Originality can currently be considered sufficiently reliable as AI detection solutions in academic settings.\u003c/p\u003e\u003cp\u003eThe overall classification results (using macro-average metrics) indicate that Originality slightly outperforms Turnitin (Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). Turnitin achieved an accuracy of 61% and Originality 69%; Originality also had higher macro-average recall (60% vs. 51%) and F1-score (54% vs. 51%). Despite this advantage, both systems failed to demonstrate robust detection capabilities across all considered scenarios, i.e. text length, text genre and authorship type. Most notably, performance on the Hybrid class (a mix of human and AI authorship) was low: Turnitin and Originality achieved recalls of 31% and 2%, respectively. Given that Hybrid compositions reflect the emerging reality of student engagement with AI tools, this performance failure has serious implications for the credibility of detection judgments. 
Furthermore, the performance on the Hybrid class indicates that these tools are vulnerable to common tactics used by students and academics, such as paraphrasing, synonym substitution, and rephrasing.\u003c/p\u003e\u003cp\u003eAlthough both tools achieved high accuracy on short texts (300\u0026ndash;330 words), this accuracy degraded with longer texts (900\u0026ndash;1100 words), a category more typical of authentic student assessments. Accuracy fell from 87% to 68% for Turnitin and from 96% to 84% for Originality as text length increased, with statistically significant differences confirmed through Pearson\u0026rsquo;s chi-square test with \u003cem\u003ep-value\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.0149 for Turnitin and \u003cem\u003ep-value\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.0014 for Originality. This instability raises practical concerns for academic use, where essay-length submissions are the norm. These results are consistent with observations that many AI detectors rely on features like perplexity that become less stable in longer texts (Weber-Wulff et al., \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2023\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eDetection reliability must be evaluated not only by technical metrics but also by fairness across demographic and disciplinary lines. The evaluation of AI content detectors across authorship types reveals important disparities with significant implications for academic integrity implementation. The results offer insight into potential bias in automated detection systems.\u003c/p\u003e\u003cp\u003eFor Originality, the one-sided p-value of 0.0586 approached the conventional significance threshold (α\u0026thinsp;=\u0026thinsp;0.05), suggesting a borderline or marginally non-significant difference in favor of the expert professional writers\u0026rsquo; texts (Open Database group). 
While not statistically significant at the 0.05 level, this result indicates a directional trend: Originality achieved perfect classification accuracy (100%) in the Open Database group, while it misclassified four texts in the EFL group (accuracy\u0026thinsp;\u0026asymp;\u0026thinsp;91.7%).\u003c/p\u003e\u003cp\u003eFisher\u0026rsquo;s Exact Test is particularly appropriate in this context because of the small sample size and the presence of a zero cell (no misclassifications in the Open Database group). Unlike asymptotic tests (e.g., chi-square), Fisher's test does not rely on large-sample approximations and provides an exact p-value, making the result more reliable for small or unbalanced groups. Given the observed difference (four misclassified texts from EFL students vs. zero from professional writers) and the marginal p-value (0.058), it is reasonable to infer that the likelihood of detecting a statistically significant difference would increase with a larger sample size. The exact nature of the test ensures that this conclusion is grounded in conservative statistical inference. Although the result did not cross the traditional threshold for significance, the borderline p-value (0.0586) and the perfect performance in one group versus errors in another suggest a potentially meaningful difference in detection accuracy between EFL and expert professional writers\u0026rsquo; texts. On the other hand, Turnitin exhibited no statistically significant difference in accuracy between the two groups (p-value\u0026thinsp;=\u0026thinsp;0.5). This consistency suggests that Turnitin applies its classification logic uniformly, without systematic favor or disadvantage to EFL writers.\u003c/p\u003e\u003cp\u003eIn the case of Originality, EFL-authored texts were more likely to be incorrectly flagged or mislabeled compared to those from professional writers. 
This may be attributed to the lower language variability among non-professional writers, which in turn makes their writing follow a repetitive pattern (Liang et al., \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e2023\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eThis possible disparity is particularly problematic in academic contexts, where detection outcomes carry disciplinary and reputational consequences. A detector that disproportionately misclassifies or fails to validate the authenticity of EFL writing can contribute to structural inequities, potentially stigmatizing students based not on the originality of their work, but on linguistic patterns or syntactic markers correlated with non-professional writing. The presence of authorship-based bias in an AI detector undermines its utility as an objective arbiter of academic misconduct. For institutions that serve linguistically diverse populations, especially those with high proportions of EFL learners, such bias can reduce trust in detection outcomes, lead to unfair accusations, and reinforce barriers to academic inclusion. The findings from the Originality detector point to a need for detector model retraining, diversification of training data, and the incorporation of bias mitigation strategies. This finding aligns with concerns raised in recent AI ethics literature about linguistic disadvantage in algorithmic systems (Helm et al., \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e2024\u003c/span\u003e; Andersen et al., \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2025\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eSimilarly, genre-based performance differences were noticeable (Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e). Turnitin achieved 86% accuracy on humanities texts but only 51% on scientific writing. 
Originality, despite its stronger overall metrics, exhibited similar discrepancies: 96% accuracy on humanities versus 58% on science. Both detectors appear to struggle with technical, domain-specific language, which may resemble the lexical density and structure of AI-generated outputs. The significance of these differences was confirmed by Pearson\u0026rsquo;s chi-square test of independence (p\u0026thinsp;\u0026lt;\u0026thinsp;0.0001) for both detectors. These findings suggest that the detectors may be detecting stylistic deviation rather than actual machine authorship.\u003c/p\u003e\u003cp\u003eThe most consequential insight from this study is that both tools, even if deployed together, do not offer a defensible level of accuracy or reliability for autonomous AI authorship detection. Their classification is inconsistent, context-sensitive, and particularly weak in edge cases such as Hybrid texts or genre-variant structures. As such, they should not be used in isolation for any disciplinary action or academic integrity implementation.\u003c/p\u003e\u003cp\u003eInstead of escalating the race between detection technologies and Gen-AI, academia must embrace a holistic rethinking of writing authorship and assessment in the Gen-AI era. Rather than being treated as a threat, Gen-AI should be approached as a tool to be understood, used responsibly, critiqued, and ethically integrated. 
This calls for a curricular reform that embeds AI literacy and digital authorship ethics, assessment redesign that prioritizes process over product and enables transparency of AI usage, institutional policies that differentiate between acceptable augmentation and unethical authorship substitution, and faculty training to interpret AI detection scores critically and contextually, not deterministically (Cotton, Cotton and Shipway, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2024\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eDetection tools may still serve as supplementary aids, particularly when used as part of a broader review process. However, their limitations demand transparency in their deployment and restraint in their interpretation. Just as plagiarism software never replaced human judgment, AI detectors must not become unquestioned arbiters of academic misconduct.\u003c/p\u003e"},{"header":"5 Limitations and Future Research","content":"\u003cp\u003eWhile the sample size of 192 texts is moderate and sufficient for meaningful statistical analysis, a larger sample size may yield more robust findings. Additionally, the analysis is limited to two detectors. Given the rapid evolution and diversification of detection technologies, future research should extend the scope to include a wider range of AI detectors. Expanding the dataset and incorporating more detection systems will enhance the generalizability of findings and support more comprehensive benchmarking of detection reliability in diverse academic settings.\u003c/p\u003e\u003cp\u003eIn our hybrid class, we combined AI-generated and human-written texts in equal proportions (roughly 50/50), ensuring cohesion and coherence throughout the content. For future work, we plan to investigate the impact of varying mixture ratios, such as 80/20 or 60/40, on detector performance. 
This investigation may offer deeper insights into how different levels of hybridization influence the accuracy and reliability of AI-content detection systems.\u003c/p\u003e"},{"header":"6 Conclusion","content":"\u003cp\u003eIn conclusion, this study offers a multi-perspective evaluation of two leading commercial AI content detectors. It concludes that a reassessment of current approaches to AI detection in education is required. Turnitin and Originality, while technically sophisticated, are neither sufficiently accurate nor equitably reliable for exclusive use. Their effectiveness is noticeably affected by \u003cb\u003etext length\u003c/b\u003e, \u003cb\u003egenre\u003c/b\u003e, and to some extent by \u003cb\u003eauthorship type\u003c/b\u003e, and their poor performance on hybrid content reveals fundamental conceptual limitations. Academic institutions must resist the temptation to treat AI as a threat to be neutralized and instead embrace it as an emerging pedagogical innovation to be navigated with care, creativity, and integrity.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u0026bull; The authors have no relevant financial or non-financial interests to disclose.\u003c/p\u003e\u003cp\u003e\u0026bull; The authors have no conflicts of interest to declare that are relevant to the content of this article.\u003c/p\u003e\u003cp\u003e\u0026bull; All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.\u003c/p\u003e\u003cp\u003e\u0026bull; The authors have no financial or proprietary interests in any material discussed in this article.\u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eM.H. was responsible for the study design, data analysis, and preparation of the manuscript. K.C. 
contributed to data collection, dataset creation and management, execution of the experiments, and editing and reviewing the manuscript. M.M. provided subject matter expertise and supervision, and contributed to editing and reviewing the manuscript. All authors approved the final version of the manuscript and agree to be accountable for all aspects of the work.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eData and materials created for this research are available upon request. Please direct all inquiries to the corresponding author.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eAkram, A. (2023) \u0026lsquo;An Empirical Study of AI-Generated Text Detection Tools\u0026rsquo;, \u003cem\u003eAdvances in Machine Learning \u0026amp; Artificial Intelligence\u003c/em\u003e, 4(2). doi: 10.33140/amlai.04.02.03.\u003c/li\u003e\n\u003cli\u003eAndersen, N. \u003cem\u003eet al.\u003c/em\u003e (2025) \u0026lsquo;Algorithmic Fairness in Automatic Short Answer Scoring\u0026rsquo;, \u003cem\u003eInternational Journal of Artificial Intelligence in Education\u003c/em\u003e. doi: 10.1007/s40593-025-00495-5.\u003c/li\u003e\n\u003cli\u003eDe Angeli, K. \u003cem\u003eet al.\u003c/em\u003e (2022) \u0026lsquo;Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types\u0026rsquo;, \u003cem\u003eJournal of Biomedical Informatics\u003c/em\u003e, 125, p. 103957. doi: https://doi.org/10.1016/j.jbi.2021.103957.\u003c/li\u003e\n\u003cli\u003eBewick, V., Cheek, L. and Ball, J. (2003) \u0026lsquo;Statistics review 8: Qualitative data \u0026ndash; tests of association\u0026rsquo;, \u003cem\u003eCritical Care\u003c/em\u003e, 8(1), p. 46. doi: 10.1186/cc2428.\u003c/li\u003e\n\u003cli\u003eBommasani, R. \u003cem\u003eet al.\u003c/em\u003e (2021) \u0026lsquo;On the Opportunities and Risks of Foundation Models\u0026rsquo;, \u003cem\u003eArXiv\u003c/em\u003e, abs/2108.0. 
Available at: https://api.semanticscholar.org/CorpusID:237091588.\u003c/li\u003e\n\u003cli\u003eBuchert, J. (2024) \u003cem\u003eHow do AI Detectors Work? Do they Work?\u003c/em\u003e, \u003cem\u003eIntellectualead\u003c/em\u003e. Available at: https://gptzero.me/news/how-ai-detectors-work/ (Accessed: 23 June 2025).\u003c/li\u003e\n\u003cli\u003eChakraborty, S. \u003cem\u003eet al.\u003c/em\u003e (2023) \u0026lsquo;On the Possibilities of AI-Generated Text Detection\u0026rsquo;, \u003cem\u003eArXiv\u003c/em\u003e, abs/2304.0. Available at: https://api.semanticscholar.org/CorpusID:258048481.\u003c/li\u003e\n\u003cli\u003eCotton, D. R. E., Cotton, P. A. and Shipway, J. R. (2024) \u0026lsquo;Chatting and cheating: Ensuring academic integrity in the era of ChatGPT\u0026rsquo;, \u003cem\u003eInnovations in Education and Teaching International\u003c/em\u003e, 61(2), pp. 228\u0026ndash;239. doi: 10.1080/14703297.2023.2190148.\u003c/li\u003e\n\u003cli\u003eCreo, A. and Pudasaini, S. (2024) \u0026lsquo;Evading AI-Generated Content Detectors using Homoglyphs.\u0026rsquo;, \u003cem\u003eCoRR\u003c/em\u003e. doi: 10.48550/ARXIV.2406.11239.\u003c/li\u003e\n\u003cli\u003eCurrie, G. M. (2023) \u0026lsquo;Academic integrity and artificial intelligence: is ChatGPT hype, hero or heresy?\u0026rsquo;, \u003cem\u003eSeminars in nuclear medicine\u003c/em\u003e, 53(5), pp. 719\u0026ndash;730. doi: 10.1053/J.SEMNUCLMED.2023.04.008.\u003c/li\u003e\n\u003cli\u003e\u003cem\u003eEdinburghNLP/XSum: Topic-Aware Convolutional Neural Networks for Extreme Summarization\u003c/em\u003e (no date). Available at: https://github.com/EdinburghNLP/XSum (Accessed: 8 February 2025).\u003c/li\u003e\n\u003cli\u003eElkhatat, A. M., Elsaid, K. and Almeer, S. (2023) \u0026lsquo;Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text\u0026rsquo;, \u003cem\u003eInternational Journal for Educational Integrity\u003c/em\u003e, 19(1), pp. 1\u0026ndash;16. 
doi: 10.1007/s40979-023-00140-5.\u003c/li\u003e\n\u003cli\u003eFariello, S. \u003cem\u003eet al.\u003c/em\u003e (2025) \u0026lsquo;Distinguishing Human From Machine: A Review of Advances and Challenges in AI-Generated Text Detection\u0026rsquo;, \u003cem\u003eInternational Journal of Interactive Multimedia and Artificial Intelligence\u003c/em\u003e, 9(3), pp. 6\u0026ndash;18. doi: 10.9781/ijimai.2024.12.002.\u003c/li\u003e\n\u003cli\u003eFisher, M. J., Marshall, A. P. and Mitchell, M. (2011) \u0026lsquo;Testing differences in proportions\u0026rsquo;, \u003cem\u003eAustralian Critical Care\u003c/em\u003e, 24(2), pp. 133\u0026ndash;138. doi: https://doi.org/10.1016/j.aucc.2011.01.005.\u003c/li\u003e\n\u003cli\u003eGehrmann, S., Strobelt, H. and Rush, A. M. (2019) \u0026lsquo;GLTR: Statistical detection and visualization of generated text\u0026rsquo;, in \u003cem\u003eACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of System Demonstrations\u003c/em\u003e. Association for Computational Linguistics (ACL), pp. 111\u0026ndash;116. doi: 10.18653/v1/p19-3019.\u003c/li\u003e\n\u003cli\u003eHelm, P. \u003cem\u003eet al.\u003c/em\u003e (2024) \u0026lsquo;Diversity and language technology: how language modeling bias causes epistemic injustice\u0026rsquo;, \u003cem\u003eEthics and Information Technology\u003c/em\u003e, 26(1), pp. 1\u0026ndash;15. doi: 10.1007/s10676-023-09742-6.\u003c/li\u003e\n\u003cli\u003e\u003cem\u003eHow Does AI Content Detection Work? \u0026ndash; Originality.AI\u003c/em\u003e (no date). Available at: https://originality.ai/blog/how-does-ai-content-detection-work?utm_source=chatgpt.com (Accessed: 23 June 2025).\u003c/li\u003e\n\u003cli\u003eiThenticate (2024) \u003cem\u003eAI writing detection in the new, enhanced Similarity Report view\u003c/em\u003e, \u003cem\u003eiThenticate\u003c/em\u003e. 
Available at: https://guides.turnitin.com/hc/en-us/articles/22774058814093-AI-writing-detection-in-the-new-enhanced-Similarity-Report (Accessed: 23 June 2025).\u003c/li\u003e\n\u003cli\u003eKirchenbauer, J. \u003cem\u003eet al.\u003c/em\u003e (2023) \u0026lsquo;A Watermark for Large Language Models\u0026rsquo;, \u003cem\u003eProceedings of Machine Learning Research\u003c/em\u003e, 202, pp. 17061\u0026ndash;17084.\u003c/li\u003e\n\u003cli\u003eLiang, W. \u003cem\u003eet al.\u003c/em\u003e (2023) \u0026lsquo;GPT detectors are biased against non-native English writers\u0026rsquo;, \u003cem\u003ePatterns\u003c/em\u003e, 4(7), p. 100779. doi: 10.1016/j.patter.2023.100779.\u003c/li\u003e\n\u003cli\u003eLiu, J. Q. J. \u003cem\u003eet al.\u003c/em\u003e (2024) \u0026lsquo;The great detectives: humans versus AI detectors in catching large language model-generated medical writing\u0026rsquo;, \u003cem\u003eInternational Journal for Educational Integrity\u003c/em\u003e, 20(1), pp. 1\u0026ndash;14. doi: 10.1007/s40979-024-00155-6.\u003c/li\u003e\n\u003cli\u003e\u003cem\u003emarsyas/gtzan \u0026middot; Datasets at Hugging Face\u003c/em\u003e (no date). Available at: https://huggingface.co/datasets/EdinburghNLP/xsum/viewer (Accessed: 8 February 2025).\u003c/li\u003e\n\u003cli\u003eMcHugh, M. L. (2013) \u0026lsquo;The Chi-square test of independence\u0026rsquo;, \u003cem\u003eBiochemia Medica\u003c/em\u003e, 23(2), pp. 143\u0026ndash;149. doi: 10.11613/BM.2013.018.\u003c/li\u003e\n\u003cli\u003eOpenAI (2023) \u003cem\u003eNew AI classifier for indicating AI-written text\u003c/em\u003e, \u003cem\u003eOpenAI\u003c/em\u003e. Available at: https://openai.com/index/new-ai-classifier-for-indicating-ai-written-text/ (Accessed: 23 June 2025).\u003c/li\u003e\n\u003cli\u003ePerkins, M. 
(2023) \u0026lsquo;Academic Integrity considerations of AI Large Language Models in the post-pandemic era: ChatGPT and beyond\u0026rsquo;, \u003cem\u003eJournal of University Teaching and Learning Practice\u003c/em\u003e, 20(2). doi: 10.53761/1.20.02.07.\u003c/li\u003e\n\u003cli\u003ePerkins, M. \u003cem\u003eet al.\u003c/em\u003e (2024) \u0026lsquo;Simple techniques to bypass GenAI text detectors: implications for inclusive education\u0026rsquo;, \u003cem\u003eInternational Journal of Educational Technology in Higher Education\u003c/em\u003e, 21(1), p. 53. doi: 10.1186/s41239-024-00487-w.\u003c/li\u003e\n\u003cli\u003eSadasivan, V. S. \u003cem\u003eet al.\u003c/em\u003e (2023) \u0026lsquo;Can AI-Generated Text be Reliably Detected?\u0026rsquo; Available at: http://arxiv.org/abs/2303.11156 (Accessed: 16 January 2025).\u003c/li\u003e\n\u003cli\u003eSaheb, T., Sidaoui, M. and Schmarzo, B. (2024) \u0026lsquo;Convergence of artificial intelligence with social media: A bibliometric \u0026amp; qualitative analysis\u0026rsquo;, \u003cem\u003eTelematics and Informatics Reports\u003c/em\u003e, 14, p. 100146. doi: 10.1016/j.teler.2024.100146.\u003c/li\u003e\n\u003cli\u003eSkaik, Y. (2008) \u0026lsquo;Understanding and using sensitivity, specificity and predictive values\u0026rsquo;, \u003cem\u003eIndian Journal of Ophthalmology\u003c/em\u003e, 56(4), p. 341. doi: 10.4103/0301-4738.41424.\u003c/li\u003e\n\u003cli\u003eTakahashi, K. \u003cem\u003eet al.\u003c/em\u003e (2022) \u0026lsquo;Confidence interval for micro-averaged F1 and macro-averaged F1 scores\u0026rsquo;, \u003cem\u003eApplied Intelligence (Dordrecht, Netherlands)\u003c/em\u003e, 52(5), pp. 4961\u0026ndash;4972. doi: 10.1007/s10489-021-02635-5.\u003c/li\u003e\n\u003cli\u003eWaltzer, T., Pilegard, C. and Heyman, G. D. (2024) \u0026lsquo;Can you spot the bot? 
Identifying AI-generated writing in college essays\u0026rsquo;, \u003cem\u003eInternational Journal for Educational Integrity\u003c/em\u003e, 20(1), p. 11. doi: 10.1007/s40979-024-00158-3.\u003c/li\u003e\n\u003cli\u003eWeber-Wulff, D. \u003cem\u003eet al.\u003c/em\u003e (2023) \u0026lsquo;Testing of detection tools for AI-generated text\u0026rsquo;, \u003cem\u003eInternational Journal for Educational Integrity\u003c/em\u003e, 19(1), pp. 1\u0026ndash;39. doi: 10.1007/s40979-023-00146-z.\u003c/li\u003e\n\u003cli\u003eWu, J. \u003cem\u003eet al.\u003c/em\u003e (2025) \u0026lsquo;A Survey on LLM-Generated Text Detection: Necessity, Methods, and Future Directions\u0026rsquo;, \u003cem\u003eComputational Linguistics\u003c/em\u003e, 51(1), pp. 275\u0026ndash;338. doi: 10.1162/coli_a_00549.\u003c/li\u003e\n\u003cli\u003eWu, X., Duan, R. and Ni, J. (2023) \u0026lsquo;Unveiling Security, Privacy, and Ethical Concerns of ChatGPT\u0026rsquo;, \u003cem\u003eArXiv\u003c/em\u003e, abs/2307.1. Available at: https://api.semanticscholar.org/CorpusID:260164746.\u003c/li\u003e\n\u003cli\u003eZhang, J. \u003cem\u003eet al.\u003c/em\u003e (2020) \u0026lsquo;PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization\u0026rsquo;, in Daum\u0026eacute; III, H. and Singh, A. (eds) \u003cem\u003eProceedings of the 37th International Conference on Machine Learning\u003c/em\u003e. PMLR (Proceedings of Machine Learning Research), pp. 11328\u0026ndash;11339. Available at: https://proceedings.mlr.press/v119/zhang20ae.html.\u003c/li\u003e\n\u003cli\u003eZhao, Y. \u003cem\u003eet al.\u003c/em\u003e (2024) \u0026lsquo;Leveraging Past Assignments to Determine If Students Are Using ChatGPT for Their Essays\u0026rsquo;, in \u003cem\u003eL@S 2024 - Proceedings of the 11th ACM Conference on Learning @ Scale\u003c/em\u003e. Association for Computing Machinery, Inc, pp. 320\u0026ndash;324. doi: 10.1145/3657604.3664707.\u003c/li\u003e\n\u003cli\u003eZhong, H. 
\u003cem\u003eet al.\u003c/em\u003e (2023) \u0026lsquo;Copyright Protection and Accountability of Generative AI: Attack, Watermarking and Attribution\u0026rsquo;, in \u003cem\u003eCompanion Proceedings of the ACM Web Conference 2023\u003c/em\u003e. New York, NY, USA: Association for Computing Machinery (WWW \u0026rsquo;23 Companion), pp. 94\u0026ndash;98. doi: 10.1145/3543873.3587321.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Generative Artificial Intelligence, AI Content Detection, English as a Foreign Language (EFL) - Academic Integrity, Detection Bias, Detection Reliability.","lastPublishedDoi":"10.21203/rs.3.rs-7359956/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7359956/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eGenerative Artificial Intelligence (GenAI) tools capable of producing human-like text have raised considerable concerns regarding academic integrity. In response, AI content detectors such as Turnitin and Originality are increasingly employed in higher education. However, empirical evidence regarding their accuracy, reliability, and fairness, particularly in the context of English as a Foreign Language (EFL) writing, remains limited. This study evaluates the performance of both detectors across variations in text length, genre, and authorship type. A balanced dataset of 192 texts was constructed, comprising authentic EFL student writing, professionally authored human texts, AI-generated outputs, and hybrid compositions. Based on the percentage of AI content identified by each detector, texts were categorized as Human, Hybrid, or AI. Detector performance was assessed against ground truth labels using precision, recall, specificity, F1 score, and accuracy. Statistical significance was tested using Pearson\u0026rsquo;s chi-square and Fisher\u0026rsquo;s Exact Test.\u003c/p\u003e\u003cp\u003eOriginality outperformed Turnitin in overall accuracy (0.69 vs. 0.61) and macro-average recall (0.60 vs. 
0.51). However, both detectors performed poorly on Hybrid texts, with recall scores of 0.31 for Turnitin and 0.02 for Originality. Performance declined significantly with longer texts (p\u0026thinsp;\u0026lt;\u0026thinsp;0.015 for Turnitin; p\u0026thinsp;\u0026lt;\u0026thinsp;0.002 for Originality) and varied across genres, with higher accuracy observed in humanities than in science (p\u0026thinsp;\u0026lt;\u0026thinsp;0.0001 for both detectors). Originality also exhibited a borderline statistically significant bias favoring professionally authored texts over EFL texts (p\u0026thinsp;=\u0026thinsp;0.058). These findings suggest that neither detector is sufficiently reliable to serve as the sole basis for academic misconduct decisions. Institutions are advised to supplement AI detection tools with human judgment, incorporate AI literacy into academic curricula, and encourage detector developers to pursue further research into bias mitigation.\u003c/p\u003e","manuscriptTitle":"Evaluating the Accuracy and Reliability of AI Content Detectors in Academic Contexts","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-09-16 06:51:40","doi":"10.21203/rs.3.rs-7359956/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"267d1af0-ae9a-4bf2-ae06-964888ca91cc","owner":[],"postedDate":"September 16th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2026-02-03T13:58:58+00:00","versionOfRecord":{"articleIdentity":"rs-7359956","link":"https://doi.org/10.1007/s40979-026-00213-1","journal":{"identity":"international-journal-for-educational-integrity","isVorOnly":false,"title":"International Journal for Educational Integrity"},"publishedOn":"2026-02-02 00:00:00","publishedOnDateReadable":"February 2nd, 2026"},"versionCreatedAt":"2025-09-16 06:51:40","video":"","vorDoi":"10.1007/s40979-026-00213-1","vorDoiUrl":"https://doi.org/10.1007/s40979-026-00213-1","workflowStages":[]},"version":"v1","identity":"rs-7359956","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7359956","identity":"rs-7359956","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}