The King Saud bin Abdulaziz University for Health Sciences Learner Corpus: A Longitudinal Resource for Researching the Development of Academic Writing Proficiency in EFL Learners

Research Article

Eman Al Nafjan, Alaa Alfelaij, Nouf Ali Binsuwadan, Norah Alfawaz, Banan Mohammed Althowaini, Sarah Mohammed, and Sabria Jawhar

This is a Research Square preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4755662/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

The King Saud bin Abdulaziz University for Health Sciences (KSAUHS) Learner Corpus is a new longitudinal corpus of English as a Foreign Language (EFL) academic writing at the university level that complies with the FAIR principles. Collection began in 2022, so the corpus captures writing development during a period of emerging language technologies.
The corpus currently contains over a million words from 158 learners across four proficiency levels (CEFR A1-B2). Texts span four trimesters and multiple academic genres, including paragraphs and essays in rhetorical modes such as cause-effect, argumentation, and summarization. Metadata includes learners' parents’ first language background, other foreign languages, academic focus, and reference tool preferences. The corpus enables research into multiple dimensions of second language writing development to inform EAP pedagogy and assessment. Avenues for investigation include lexico-grammatical development, cross-linguistic influence, individual differences, instructional interventions, impacts of task conditions and language technologies, and longitudinal trajectories. Ongoing work aims to increase the corpus's size, representation, and accessibility. This resource promises data-driven insights into the textual features and factors that characterize and facilitate intermediate and advanced EFL writing proficiency in university contexts.

Keywords: corpus compilation; corpus design; corpus study design; learner corpus research; metadata; second language acquisition

1. Introduction

This article introduces the King Saud bin Abdulaziz University for Health Sciences (KSAUHS) Learner Corpus, a longitudinal collection of academic writing by Arabic L1 EFL/EAP learners at the university level. To the best of our knowledge, it is the first longitudinal Arabic L1 EFL/EAP learner corpus (Al Nafjan and Jawhar, 2024).

Launched in 2022, the corpus provides a unique window into academic writing development before and during a time of emerging language technologies such as ChatGPT, which became widely available in this period. By enabling research into multiple dimensions of writing proficiency, the KSAUHS Learner Corpus aims to inform EAP writing instruction and assessment.
We begin by discussing the status of learner corpora and the five features that make the KSAUHS Learner Corpus distinctive. We then describe the corpus’s design and compilation. Finally, we discuss our aspirations for growing the corpus and some of the avenues of investigation that it opens to the wider research community.

2. Five Key Features of the Learner Corpus

Learner corpora, or systematic collections of authentic language production from second language (L2) learners, are valuable resources for studying the development of language competencies (Granger, 2008). The value of such corpora has been well established over the past two decades, especially in second language acquisition research, so much so that there is a journal dedicated to them, the International Journal of Learner Corpus Research, published by John Benjamins. Amidst the expanding collection of learner corpora globally, the KSAUHS Learner Corpus is set apart by five features.

Despite this growth, most learner corpora are cross-sectional, which makes it difficult to examine interlanguage development. Hence, longitudinal learner corpora are particularly useful for researching the linguistic characteristics of L2 proficiency at different developmental stages (Myles, 2009; Granger & Paquot, 2013). In domains like EAP, such corpora can shed light on the lexical, grammatical, and rhetorical features that both facilitate and hinder L2 learners' academic writing success. However, Biber et al. (2020) note that longitudinal corpora are not only scarce but also tend to have fewer participants. This is not the case with the KSAUHS Learner Corpus, which brings us to the first feature: the corpus spans four trimesters and currently includes 158 participants.
The second feature concerns register variety. Historically, most learner corpora accessible to researchers have consisted of texts written by language learners in general language courses, often in argumentative or narrative genres, whereas EFL and EAP students are required to write in a variety of registers and tasks (Biber et al., 2020). The International Corpus of Learner English (ICLE) (Granger et al., 2020) exemplifies this focus on argumentative or narrative genres. In recent years, corpora of learner writing from standardized language assessments have also emerged, such as the ETS Corpus of Non-Native Written English (Blanchard et al., 2013) and the Open Cambridge Learner Corpus (2017). The KSAUHS Learner Corpus, by contrast, is not only longitudinal but also includes writing produced under both classwork and exam conditions. It covers a wider variety of registers, all produced by the same learners, ranging from first-person writing on general topics to an academic summary and critique of a published empirical paper. This variety can be used to examine multiple aspects of learner language and its development.

The third feature is that collection for the KSAUHS Learner Corpus began when ChatGPT was not yet available anywhere and ended after it had been available for several months (see Table 1). The first assignments were collected from participants in September 2022, while ChatGPT only became available in November 2022. Initially, however, it was not available in Saudi Arabia. Some users in Saudi Arabia bypassed this delay by using VPNs to access the service, simulating their presence in countries where ChatGPT was accessible. Not until August 2023 did ChatGPT become easily accessible to users located in Saudi Arabia. Thus, this corpus allows researchers to examine the same student’s writing before and after the availability of ChatGPT and the many other writing tools built on similar generative large language models.
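The timeline above suggests a simple way to label each text by the ChatGPT access conditions in force when it was written. The following is a minimal sketch only: the function name, the period labels, and the idea that each text carries a submission date are illustrative assumptions, not part of the corpus documentation. The text gives the two dates at month level; the sketch assumes 30 November 2022 for the public release and the first day of August 2023 for open access in Saudi Arabia.

```python
from datetime import date

# Cut-off dates. The article states these only at month level, so the exact
# days used here are assumptions made for illustration.
CHATGPT_RELEASE = date(2022, 11, 30)      # public release ("November 2022")
CHATGPT_SAUDI_ACCESS = date(2023, 8, 1)   # easily accessible in Saudi Arabia ("August 2023")


def access_period(submitted: date) -> str:
    """Return a hypothetical period label for a text submitted on `submitted`."""
    if submitted < CHATGPT_RELEASE:
        return "pre-ChatGPT"
    if submitted < CHATGPT_SAUDI_ACCESS:
        return "VPN-only access"
    return "open access"


if __name__ == "__main__":
    # Invented submission dates, one per period.
    for d in (date(2022, 9, 20), date(2023, 2, 14), date(2023, 10, 5)):
        print(d.isoformat(), access_period(d))
```

A three-way split is used rather than a simple before/after division because, as described above, some learners may have reached the tool through a VPN during the intermediate period.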
The fourth feature is temporary but significant. Collection began in the participants’ freshman year, 2022. This means that for the next few years, many of these participants remain available for voluntary follow-up and the provision of more writing samples. Although this requires coordination with the in-house research team and, depending on the type of sampling or study, may need approval from the KSAUHS Institutional Review Board, it remains a viable and potentially exciting avenue of research.

The fifth and final feature is that the KSAUHS Learner Corpus is committed to following the FAIR Guiding Principles for scientific data management and stewardship (Wilkinson et al., 2016), which emphasize that data should be Findable, Accessible, Interoperable, and Reusable. In terms of findability, we commit to regularly publishing updates about the corpus in indexed journals. We also intend to disseminate information about the availability of the corpus through various channels, including social media platforms and professional mailing lists in the fields of Teaching English as a Foreign Language (TEFL), linguistics, and corpus linguistics.

For accessibility, the corpus data is currently shared in a widely used format (Excel) to ensure compatibility with standard software tools. Using Excel means that researchers with limited technological expertise can access and examine the corpus without third-party assistance or specialized training. The Excel sheet captures the evolution of the participants’ academic writing skills over multiple trimesters, enabling a comprehensive analysis of each student's progress, and its structure allows researchers to examine various linguistic and educational variables systematically. The corpus is available upon request, ensuring responsible sharing while protecting participant confidentiality. Access requires signing an agreement that restricts use to non-commercial research purposes.
Interoperability is maintained through standard vocabularies and formats, which facilitate integration with other datasets and simplify format conversion. Researchers who prefer using platforms such as GitHub, R, or Python can convert the Excel file into a more suitable format for their needs. Lastly, the corpus is reusable, with detailed documentation and annotations making it easier for future researchers to understand and use the data. The ethical safeguards, including agreements that restrict use to non-commercial purposes and prevent distribution of the original or modified data, ensure responsible reuse in future research.

3. Corpus Design and Composition

The KSAUHS Learner Corpus currently comprises over 1 million words from 158 students ranging in initial proficiency from CEFR levels A1 to B2. The participants are relatively homogeneous. In the corpus's current form, all participants attended a women-only college, and they range in age from 17 to 21 years. Proficiency levels can be ordered by specialization, from least to most proficient: Nursing, Applied Medical Sciences (AMS), Pharmacology, Dentistry, and Medicine.

Metadata on participants includes their first language background, secondary school type (public or international), other languages spoken, study focus (pre-nursing, pre-medicine, dentistry, pre-pharmacology, or applied medical sciences), and preferred reference tools and dictionaries. This data was collected via an online learner profile survey. See Appendix A for a copy of the learner profile form.

Corpus collection spanned four trimesters from Fall 2022 to Fall 2023. The corpus was compiled by collecting students' writing tasks via the university learning management system and an online storage drive, along with handwritten exams that were scanned and later transcribed. Table 1 summarizes the writing tasks collected by trimester along with key events.
The corpus is currently maintained in an Excel spreadsheet, with each row representing an individual participant. For the sake of anonymity, each participant is assigned a unique five-digit number, while the research team retains access to the participants' names via a private list. Metadata are presented in the initial columns and include information from the learner profile forms and the names of the instructors who taught each student each trimester. To further protect anonymity, instructor names are also coded and linked to an internal key list.

Table 1. KSAUHS Learner Corpus Collection by Trimester: Writing Tasks and Key Events

Trimester 1 (September-December 2022)
  Writing tasks: Practice task (paragraph); Task 1.1 general paragraph; Task 1.2 cause/effect paragraph. Each ranges between 150–200 words on student life topics.
  Key events: ChatGPT became available in many countries but not in Saudi Arabia without a VPN.

Trimester 2 (December 2022-March 2023)
  Writing tasks: Task 2.1 compare-contrast paragraph; Task 2.2 definition paragraph; Exam 1, in which students produce two paragraphs, each on a different subject.
  Key events: Students filled out the learner profile forms.

Trimester 3 (March-June 2023)
  Writing tasks: Task 3.1 Essay 1, on topics in education and trends in digital learning; Task 3.2 Essay 2, on topics in marketing, consumerism, and social image; Exam 2, in which students produce an essay.
  Key events: (none listed)

Trimester 4 (September-December 2023)
  Writing tasks: Task 4.1 summary of a medical journal article; Task 4.2 critique of the same medical journal article with additional references; Task 4.3 one revision/synthesis of summary and critique.
  Key events: ChatGPT became available in Saudi Arabia.

The first trimester focused on paragraph writing. Initially, for the practice paragraph and the first task, the focus was on producing a topic sentence, its supporting ideas, and a concluding sentence. The topics for the practice paragraph included time management, volunteer work, and the use of smartphones. Instructors gave feedback on the practice paragraphs but did not grade them.
The topics for the first assessed task were general, such as Being a Good Listener and Reasons to Buy an iPad. The second task was a cause-effect paragraph, with topics related to students' lives such as College Life and Family Issues.

In the second trimester, students were required to produce two assignments. The first was a paragraph comparing two concepts or entities, such as Gym Versus Home Workouts, A Mother and Father’s Roles, or Childhood Fifty Years Ago Versus Now. The second was a paragraph defining a concept or entity, such as Mental Health, Snapchat, or An Ideal Roommate. At the end of the trimester, students took a timed three-hour exam in which they wrote by hand, without access to language resources, two paragraphs similar to those produced during the trimester. In the exam, students were given four topics for a definition paragraph and four topics for a comparison paragraph. They were instructed to choose one topic for each and write 12–16 sentences with a topic sentence that includes both a topic and a controlling idea, supporting sentences that provide the main points with specific details, and a concluding sentence. See Appendix C for scanned student samples from these exams.

In the third trimester, students progressed to argumentative essay writing, producing two 450–500-word essays and an additional one under exam conditions. The fourth trimester tasked students with basing their writing on a medical journal article of their choice, first summarizing it, then critiquing it using additional references, and finally revising and synthesizing the summary and critique into one paper. See Appendix B for a sample of tagged writing by a higher-proficiency pharmacology student across the first three trimesters.

All texts were collected electronically and pre-processed to remove non-linguistic content before being XML tagged for spelling errors, transliterations, in-text references, and reference lists.
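To illustrate how such XML tagging might be queried, here is a minimal sketch using Python's standard library. The article does not specify the tag set, so the element names (`sp`, `translit`, `ref`), the `corr` attribute, the `text` wrapper, and the sample sentence are all invented for illustration and may differ from the corpus's actual markup.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample: hypothetical tags for a spelling error (with a correction
# attribute), a transliteration, and an in-text reference.
SAMPLE = (
    '<text id="00000" task="2.1">'
    'Students bring <sp corr="their">there</sp> books and drink '
    '<translit>gahwa</translit> before class <ref>(Author, 2020)</ref>.'
    '</text>'
)


def tag_counts(xml_string: str) -> Counter:
    """Count every annotation element below the root element."""
    root = ET.fromstring(xml_string)
    return Counter(el.tag for el in root.iter() if el is not root)


def plain_text(xml_string: str) -> str:
    """Return the learner's text with all markup removed."""
    root = ET.fromstring(xml_string)
    return "".join(root.itertext())


def spelling_pairs(xml_string: str) -> list:
    """Return (written form, corrected form) pairs for spelling tags."""
    root = ET.fromstring(xml_string)
    return [(el.text, el.get("corr")) for el in root.iter("sp")]


if __name__ == "__main__":
    print(tag_counts(SAMPLE))
    print(plain_text(SAMPLE))
    print(spelling_pairs(SAMPLE))
```

Keeping the learner's original form as element text, with any correction in an attribute, means the untouched learner text can always be recovered by stripping the tags; whether the corpus follows this convention is an assumption.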
Where a spelling error is ambiguous, either because of illegible handwriting or because of similar words (such as "there" used in place of "their"), the transcriber sends a screenshot to a messaging group shared with the research team, and the issue is resolved by consensus.

4. Avenues for Investigation

The King Saud bin Abdulaziz University for Health Sciences (KSAUHS) Learner Corpus offers a rich resource for examining multiple dimensions of English as a Foreign Language (EFL) academic writing development. Because its data are longitudinal, the corpus enables comprehensive research into various aspects of learners' writing processes and outcomes.

One significant area of investigation is the analysis of learners' lexical, grammatical, and semantic choices. By tracking how language use evolves over time and in response to different task demands, researchers can identify specific areas where learners exhibit growth or face challenges. This analysis is crucial for understanding the progression of writing proficiency and the factors that influence it. The linguistic and textual features of interest to these research strands include verb-argument constructions, inanimate subjects, rhetorical phrases, reporting verbs, and lexical bundles. These features are considered "late acquired" and remain challenging even at advanced proficiency levels. Potentially, they can be isolated across the corpus for data-driven approaches like cluster analysis and vectorization, which can identify proficiency benchmarks to inform EAP writing needs analysis and assessment (Wulff & Gries, 2011).

The influence of the first language (L1) and culture on EFL academic writing is another key research avenue. Given that Arabic is the native language of all 158 participants, the corpus provides a unique opportunity to study cross-linguistic effects.
Researchers can explore how L1 influences manifest in learners' English writing, offering insights into the complexities of bilingual education and language transfer phenomena. The inclusion of transliteration tags within the corpus facilitates a detailed examination of the specific linguistic elements that students struggle to translate into their second language. This enables researchers to identify patterns and areas of difficulty, thereby contributing valuable insights into the cognitive processes involved in second language acquisition and translation.

Individual differences among learners also play a significant role in writing development. The corpus includes metadata on participants' educational backgrounds, areas of study, and preferred reference tools. This information allows for an in-depth exploration of how these factors impact writing proficiency. For example, researchers can investigate whether students from different schooling backgrounds exhibit varying levels of writing competence or how the use of specific reference tools correlates with writing quality.

The corpus is also instrumental in examining the effects of instructional interventions. By comparing the writing outcomes of students taught by different instructors or experiencing changes in instructional methods, researchers can derive valuable pedagogical insights. This comparison can shed light on the effectiveness of various teaching strategies and the potential benefits of maintaining consistent instruction versus introducing new instructional approaches across terms.

Another area for exploration is the impact of different task conditions on writing performance. The corpus contains samples of both exam and homework writing, enabling researchers to contrast these contexts. Such comparisons can highlight the effects of timed conditions and the availability of reference materials on students' writing outputs.
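As one illustration of an exam-versus-homework contrast, the sketch below computes a mean type-token ratio (TTR) per task condition. TTR is chosen here purely as a simple, well-known lexical diversity measure; it is not a measure proposed in this article. The condition labels, the tokenizer, and the two sample sentences are invented for the example.

```python
import re
from statistics import mean


def type_token_ratio(text: str) -> float:
    """Distinct word forms divided by total word tokens (0.0 for empty text)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mean_ttr_by_condition(records):
    """Average TTR over (condition, text) pairs, grouped by condition."""
    grouped = {}
    for condition, text in records:
        grouped.setdefault(condition, []).append(type_token_ratio(text))
    return {condition: mean(values) for condition, values in grouped.items()}


# Invented sample records; real input would come from the corpus texts.
RECORDS = [
    ("homework", "Mental health is important because mental health affects study."),
    ("exam", "A good roommate is clean and a good roommate is quiet."),
]

if __name__ == "__main__":
    for condition, value in mean_ttr_by_condition(RECORDS).items():
        print(f"{condition}: {value:.3f}")
```

Raw TTR falls as texts get longer, so any real comparison between conditions would need to control for text length, for example by sampling equal-sized windows of tokens.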
The advent of emerging language technologies, such as ChatGPT, presents an additional dimension for investigation. By dividing the corpus into pre- and post-ChatGPT eras, researchers can examine how access to such technologies influences writing development. This analysis can provide insights into the role of artificial intelligence in educational settings and its impact on learners' writing practices.

Furthermore, the corpus allows for ongoing development and the exploration of learners' perceptions. Since many participants are still accessible, the corpus can be expanded with more recent writing samples for longitudinal comparisons. Additionally, interviews with learners can elicit their perceptions of their writing development and the use of technology, offering a qualitative complement to the quantitative data.

5. Future Directions and Conclusion

The KSAUHS Learner Corpus captures the academic writing development of Arabic L1 EFL learners within the context of emerging language technologies. By longitudinally tracking writing proficiency across various tasks, the corpus provides data-driven insights into the lexical, grammatical, and rhetorical features that characterize and advance second language writing expertise. The ongoing expansions of the corpus aim to enhance its representativeness and analytical potential as a resource for EFL and EAP research and pedagogy.

The KSAUHS Learner Corpus is available to researchers upon signing a data use agreement that permits only non-commercial use, prohibits full-text reproduction and redistribution, and requires acknowledgement. While the publicly shared corpus is anonymized, the research team can collect follow-up writing and interviews from participants for the next four years, enabling studies of continuing writing development after the initial collection.
Starting with the Fall semester of 2024, data collection will be extended to include both the male and female campuses. This expansion will address previous shortcomings by ensuring timely uploads of student papers to avoid missing assignments. Additionally, efforts will be made to obtain draft versions of assignments along with the final submissions, providing a more comprehensive view of the writing process and development.

We also invite feedback from the wider research community interested in learner corpus research. Collaborative input will be essential for improving the corpus, making it more accessible, and enhancing its utility for producing meaningful insights. Engaging with the research community will help us refine our methodologies and expand the corpus's impact on understanding EFL writing development.

By continuing to expand and refine the KSAUHS Learner Corpus, we aim to contribute significantly to the field of EAP and provide a robust foundation for future research into the complexities of second language writing development.

Declarations

6.1 Ethics approval and consent to participate

The collection of the participants’ written production and metadata was approved by the Institutional Review Board (IRB) of King Abdullah International Medical Research Center (KAIMRC), Ministry of National Guard - Health Affairs, Kingdom of Saudi Arabia. The IRB approval number for this study is IRB/1344/22, and the study number is NRC22R/312/06. The approval was granted on 27 July 2022. All participants provided informed consent prior to their inclusion in the study. Participation was voluntary, and participants were assured of the confidentiality and anonymity of their data. The study did not involve any interventions or treatments.

6.2 Consent for publication

Not applicable.

6.3 Availability of data and material

The KSAUHS Learner Corpus is available to researchers upon signing a data use agreement.
This agreement permits only non-commercial use of the corpus, prohibits full-text reproduction and redistribution, and requires proper acknowledgment of the corpus in any resulting publications. Researchers interested in accessing the corpus should contact the corresponding author to obtain further details and initiate the data use agreement process.

6.4 Competing interests

The authors declare no competing interests.

6.5 Funding

No funding was received for this research.

Authors' contributions

Eman Al Nafjan conceptualized the study, oversaw data collection, and drafted the manuscript. All other authors, except Sabria Jawhar, contributed to the transcription, compilation, and refinement of the corpus. Sabria Jawhar coordinated logistical arrangements and secured the necessary IRB approval for data and metadata collection. All authors reviewed and approved the final version of the manuscript.

Acknowledgements

We extend our gratitude to Prof. Magali Paquot for her invaluable advice during the stages of corpus compilation and for providing critical insights that enhanced this corpus. We also acknowledge the administrative assistants at KSAUHS, Norah Alowerdhi and Mashael Alhumayd, for their assistance in scanning participants’ handwritten exam papers. Their support was instrumental in the successful compilation of this learner corpus.

References

Al Nafjan, E., & Jawhar, S. (in press). Learner corpus research and data-driven learning in Saudi Arabia. In A. H. Al-Hoorie, C. Mitchell, & T. Elyas (Eds.), Language education in Saudi Arabia: Integrating technology in the classroom. Springer.

Biber, D., Reppen, R., Staples, S., & Egbert, J. (2020). Exploring the longitudinal development of grammatical complexity in the disciplinary writing of L2-English university students. International Journal of Learner Corpus Research, 6(1), 38–71.

Callies, M., & Zaytseva, E. (2013). The Corpus of Academic Learner English (CALE): A new resource for the assessment of writing proficiency in the academic register. Dutch Journal of Applied Linguistics, 2(1), 126–132.

Granger, S., Dupont, M., Meunier, F., Naets, H., & Paquot, M. (2020). The International Corpus of Learner English (version 3). Presses universitaires de Louvain.

Granger, S., & Paquot, M. (2013). Language for specific purposes learner corpora. In C. A. Chapelle (Ed.), The Encyclopedia of Applied Linguistics. Blackwell-Wiley.

Granger, S. (2008). Learner corpora. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics: An international handbook (pp. 259–275). Walter de Gruyter.

Myles, F. (2009). Investigating learner language development with electronic longitudinal corpora: Theoretical and methodological issues. The longitudinal study of advanced L2 capacities (pp. 74–88). Routledge.

Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., ... Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3(1), 1–9.

Wulff, S., & Gries, S. T. (2011). Corpus-driven methods for assessing accuracy in learner production. In P. Robinson (Ed.), Second language task complexity: Researching the cognition hypothesis of language learning and performance (pp. 61–87). John Benjamins.

Zaytseva, E. (2011). Register, genre, rhetorical functions: Variation in English native-speaker and learner writing. In H. Hedeland, T. Schmidt, & K. Wörner (Eds.), Multilingual resources and multilingual applications (pp. 239–242). University of Hamburg.

Supplementary Files

AppendixA.docx
As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-4755662","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":332369202,"identity":"398c9f48-a27a-43be-8a7c-e7b9cf3b1963","order_by":0,"name":"Eman Al Nafjan","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA/0lEQVRIiWNgGAWjYFACxgdAQkJOAsxuAJEJhLQwG4C0GJOshSFxBtFazNkPM7/4ucMifWb/4WcSjDvqGPjZcwwYf9Tg1mLZk8xm2XtGIne2RJqZBOOZwwySPW8MmHmO4dZicCD/mAFvm0TuPAkGM+m/bQcYDG7kGDAzsOHRcv4xm+HfNol0Of7j3yQY2+oY7G+AHPYPj5YbycyPgbYkSDPkAB3WxsxgIJFjwMDbhk/LYzZm2TYJw5kzcootGNsO80iceVZwmLcPn8OSmT++bauTlzh/fOMNoMPk+NuTNz788Q23FiBgk0Dm8YCIA3g1AOPyAwEFo2AUjIJRMNIBAFRES3h616VaAAAAAElFTkSuQmCC","orcid":"","institution":"King Saud bin Abdulaziz University for Health Sciences","correspondingAuthor":true,"prefix":"","firstName":"Eman","middleName":"Al","lastName":"Nafjan","suffix":""},{"id":332369203,"identity":"6b84f634-c7cf-43f0-a660-e6bdfcb53886","order_by":1,"name":"Alaa Alfelaij","email":"","orcid":"","institution":"King Saud bin Abdulaziz University for Health 
Sciences","correspondingAuthor":false,"prefix":"","firstName":"Alaa","middleName":"","lastName":"Alfelaij","suffix":""},{"id":332369206,"identity":"e5788fb3-7f39-41f4-9051-934a7aa6c206","order_by":2,"name":"Nouf Ali Binsuwadan","email":"","orcid":"","institution":"King Saud bin Abdulaziz University for Health Sciences","correspondingAuthor":false,"prefix":"","firstName":"Nouf","middleName":"Ali","lastName":"Binsuwadan","suffix":""},{"id":332369207,"identity":"756961c4-8699-4b92-9b7e-dbc6f74ee695","order_by":3,"name":"Norah Alfawaz","email":"","orcid":"","institution":"King Saud bin Abdulaziz University for Health Sciences","correspondingAuthor":false,"prefix":"","firstName":"Norah","middleName":"","lastName":"Alfawaz","suffix":""},{"id":332369208,"identity":"85bc0a9b-3d4c-42ae-a022-115e769c438f","order_by":4,"name":"Banan Mohammed Althowaini","email":"","orcid":"","institution":"King Saud bin Abdulaziz University for Health Sciences","correspondingAuthor":false,"prefix":"","firstName":"Banan","middleName":"Mohammed","lastName":"Althowaini","suffix":""},{"id":332369209,"identity":"f5cda8c8-4f66-45d0-99bb-a981ae4a6dcb","order_by":5,"name":"Sarah Mohammed","email":"","orcid":"","institution":"King Saud bin Abdulaziz University for Health Sciences","correspondingAuthor":false,"prefix":"","firstName":"Sarah","middleName":"","lastName":"Mohammed","suffix":""},{"id":332369210,"identity":"979bacd9-4949-4fe6-bad0-24eaf11ecceb","order_by":6,"name":"Sabria Jawhar","email":"","orcid":"","institution":"King Saud bin Abdulaziz University for Health Sciences","correspondingAuthor":false,"prefix":"","firstName":"Sabria","middleName":"","lastName":"Jawhar","suffix":""}],"badges":[],"createdAt":"2024-07-17 
10:56:11","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-4755662/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-4755662/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":63675852,"identity":"2b33564e-2525-438a-a7b2-f2afc18e5d6c","added_by":"auto","created_at":"2024-08-31 03:47:16","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":339727,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4755662/v1/f425c3c0-48fd-4433-be7d-6c10a83d11b3.pdf"},{"id":62398502,"identity":"a7ab150f-3e9c-46f4-9c55-506bad90e439","added_by":"auto","created_at":"2024-08-13 18:00:38","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":1851357,"visible":true,"origin":"","legend":"","description":"","filename":"AppendixA.docx","url":"https://assets-eu.researchsquare.com/files/rs-4755662/v1/f6a172ed476f4543665a076d.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":" The King Saud bin Abdulaziz University for Health Sciences Learner Corpus: A Longitudinal Resource for Researching the Development of Academic Writing Proficiency in EFL Learners ","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eThis article introduces the King Saud bin Abdulaziz University for Health Sciences (KSAUHS) Learner Corpus, a longitudinal collection of academic writing by Arabic L1 EFL/EAP learners at the university level, which, to the best of our knowledge, is the first of its kind (Al Nafjan and Jawhar, 2024), considering that it is a longitudinal Arabic L1 EFL/EAP learner corpus.\u003c/p\u003e \u003cp\u003eLaunched in 2022, the corpus provides a unique window into academic writing development before and during a time of emerging language technologies such as ChatGPT, which became widely available in this period. 
By enabling research into multiple dimensions of writing proficiency, the KSAUHS Learner Corpus aims to inform EAP writing instruction and assessment. We begin by discussing the status of learner corpora and how the KSAUHS Learner Corpus is distinctive due to five features. Then we describe the corpus\u0026rsquo;s design and compilation and finally we discuss our aspirations for growing the corpus and some of the multiple avenues of investigation that the availability of this corpus allows the wider research community.\u003c/p\u003e"},{"header":"2. Five Key Features of the Learner Corpus","content":"\u003cp\u003eLearner corpora, or systematic collections of authentic language production from second language (L2) learners, are valuable resources for studying the development of language competencies (Granger, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2008\u003c/span\u003e). The value of such corpora has been well established over the past two decades especially in second language acquisition research, so much so that there is a journal dedicated to learner corpora, the \u003cem\u003eInternational Journal of Learner Corpus Research\u003c/em\u003e published by John Benjamins. Amidst the expanding collection of learner corpora globally, the KSAUHS Learner Corpus is set apart through five features.\u003c/p\u003e \u003cp\u003eDespite the growing body of learner corpora around the world, most learner corpora tend to be cross-sectional which makes any examinations of interlanguage development problematic. Hence, longitudinal learner corpora are particularly useful for researching the linguistic characteristics of L2 proficiency at different developmental stages (Myles, \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2009\u003c/span\u003e; Granger \u0026amp; Paquot, \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2013\u003c/span\u003e). 
In domains like EAP, such corpora can shed light on the lexical, grammatical, and rhetorical features that facilitate or hinder L2 learners' academic writing success. However, Biber et al. (\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2020\u003c/span\u003e) note that not only are longitudinal corpora scarce but also that they tend to have fewer participants. This is not the case with the KSAUHS Learner Corpus, which leads us to the first feature. The KSAUHS Learner Corpus spans four trimesters and currently includes 158 participants.\u003c/p\u003e \u003cp\u003eThe second feature concerns register variety. Historically, most learner corpora accessible to researchers have consisted of texts written by language learners in general language courses, often in argumentative or narrative genres, while EFL and EAP students are required to write in a variety of registers and tasks (Biber et al., \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). The International Corpus of Learner English (ICLE) (Granger et al., \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2020\u003c/span\u003e) exemplifies this trend toward argumentative and narrative genres. In recent years, corpora of learner writing from standardized language assessments have also emerged, such as the ETS Corpus of Non-Native Written English (Blanchard et al., 2013) and the Open Cambridge Learner Corpus (2017). By contrast, the current KSAUHS Learner Corpus is not only longitudinal but also includes writing produced under both classwork and exam conditions. The corpus covers a wider variety of registers, all produced by the same learners, ranging from first-person paragraphs on general topics to an academic summary and critique of a published empirical paper. 
This variety can be used to examine multiple aspects of learner language and its development.\u003c/p\u003e \u003cp\u003eThe third feature is that collection of the current KSAUHS Learner Corpus began when ChatGPT was not yet available anywhere and ended after it had been available for several months (see Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). The first assignments were collected from participants in September 2022, whereas ChatGPT only became available in November 2022. Initially, however, it was not available in Saudi Arabia. Some users in Saudi Arabia bypassed this delay in availability by using VPNs to access the service, simulating their presence in countries where ChatGPT was accessible. Not until August 2023 did ChatGPT become easily accessible to users located in Saudi Arabia. Thus, this corpus allows researchers to examine the same student\u0026rsquo;s writing before and after the availability of ChatGPT and the many other writing tools built on similar generative large language models.\u003c/p\u003e \u003cp\u003eThe fourth feature is temporary but significant. Collection began in the participants\u0026rsquo; freshman year in 2022. This means that for the next few years, many of these participants remain available for voluntary follow-up and the provision of more writing samples. 
While this does require facilitation with the in-house research team and depending on the type of sampling or study, might need approval from the KSAUHS Institutional Review Board, it remains a viable and potentially exciting avenue of research.\u003c/p\u003e \u003cp\u003eThe fifth and final feature is that the KSAUHS Learner Corpus is committed to following the FAIR Guiding Principles for scientific data management and stewardship (Wilkinson et al., \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2016\u003c/span\u003e), which emphasize that data should be Findable, Accessible, Interoperable, and Reusable. In terms of findability, we commit to regularly publishing updates about the corpus in indexed journals. We also intend to disseminate information regarding the availability of the corpus through various channels, including social media platforms and professional mailing lists in the fields of Teaching English as a Foreign Language (TEFL), linguistics, and corpus linguistics.\u003c/p\u003e \u003cp\u003eFor accessibility, the corpus data is currently shared in a widely used format (Excel) to ensure compatibility with standard software tools. Utilizing Excel ensures that researchers with limited technological expertise can access and examine the corpus without requiring third-party assistance or specialized training. The current corpus\u0026rsquo;s Excel sheet format captures the evolution of the participants\u0026rsquo; academic writing skills over multiple trimesters, enabling a comprehensive analysis of each student's progress. The structure allows researchers to examine various linguistic and educational variables systematically. It is available upon request, ensuring responsible sharing while protecting participant confidentiality. 
Access requires signing an agreement that restricts use to non-commercial research purposes.\u003c/p\u003e \u003cp\u003eInteroperability is maintained through standard vocabularies and formats, facilitating integration with other datasets and simplifying format conversion. Researchers who prefer using platforms such as GitHub, R, or Python can readily convert the Excel file into a more suitable format for their needs.\u003c/p\u003e \u003cp\u003eLastly, the corpus is reusable, with detailed documentation and annotations making it easier for future researchers to understand and utilize the data. The ethical considerations, including the agreements that restrict use to non-commercial purposes and prevent distribution of the original or modified data, ensure responsible reuse in future research.\u003c/p\u003e"},{"header":"3. Corpus Design and Composition","content":"\u003cp\u003eThe KSAUHS Learner Corpus currently comprises over 1\u0026nbsp;million words from 158 students ranging in initial proficiency from CEFR levels A1 to B2. The participants are relatively homogeneous. In the current version of the corpus, all participants attended a women-only college and range in age from 17 to 21 years. The proficiency levels can be categorized by specialization, from least proficient to most proficient: Nursing, Applied Medical Sciences (AMS), Pharmacology, Dentistry, and Medicine.\u003c/p\u003e \u003cp\u003eMetadata on participants includes their first language background, secondary school type (public or international), other languages spoken, study focus (pre-nursing, pre-medicine, dentistry, pre-pharmacology, or applied medical sciences), and preferred reference tools and dictionaries. This data was collected via an online learner profile survey. See Appendix A for a copy of the learner profile form.\u003c/p\u003e \u003cp\u003eCorpus collection spanned four trimesters from Fall 2022 to Fall 2023. 
The corpus was compiled by collecting students' writing tasks via the university learning management system and an online storage drive, along with handwritten exams that were scanned and later transcribed. Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e summarizes the writing tasks collected by trimester along with key events.\u003c/p\u003e \u003cp\u003eThe corpus is currently maintained in an Excel spreadsheet, with each row representing an individual participant. For the sake of anonymity, each participant is assigned a unique five-digit number, while the research team retains access to the participants' names via a private list. Metadata are presented in the initial columns and include information from the learner profile forms and the names of the instructors who taught each student each trimester. To further protect anonymity, instructor names are also coded and linked to an internal key list.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003e\u003cem\u003eKSAUHS Learner Corpus Collection by Trimester - Writing Tasks - Key Events\u003c/em\u003e\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTrimester\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWriting Tasks\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eKey 
Events\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e1 (September-December 2022)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTask Practice paragraph\u003c/p\u003e \u003cp\u003eTask 1.1 General paragraph\u003c/p\u003e \u003cp\u003eTask 1.2 Cause/Effect paragraph\u003c/p\u003e \u003cp\u003eEach range between 150\u0026ndash;200 words on student life topics\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eChatGPT became available in many countries but not in Saudi Arabia without a VPN.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e2 (December-March 2022\u0026ndash;2023)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTask 2.1 compare-contrast paragraph\u003c/p\u003e \u003cp\u003eTask 2.2 a definition paragraph.\u003c/p\u003e \u003cp\u003eExam 1. An exam in which students produce two paragraphs, each on a different subject.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eStudents filled out the learner profile forms.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e3 (March-June 2023)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTask 3.1 Essay 1 topics are in education and trends in digital learning\u003c/p\u003e \u003cp\u003eTask 3.2 Essay 2 topics on marketing, consumerism, social image.\u003c/p\u003e \u003cp\u003eExam 2. 
An exam in which students produce an essay.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e4 (September-December 2023)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTask 4.1 summary of a medical journal article\u003c/p\u003e \u003cp\u003eTask 4.2 critique of the same medical journal article with additional references\u003c/p\u003e \u003cp\u003eTask 4.3 1 revision/synthesis of summary and critique\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eChatGPT became available in Saudi Arabia.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eThe first trimester focused on paragraph writing. Initially, the focus was on producing a topic sentence, its supporting ideas, and a concluding sentence for the practice and the first task. The topics for the practice paragraph included \u003cem\u003etime management\u003c/em\u003e, \u003cem\u003evolunteer work\u003c/em\u003e, and the \u003cem\u003euse of smartphones\u003c/em\u003e. Instructors give feedback on the practice paragraphs but do not grade them. The topics for the first assessed task are general, such as \u003cem\u003eBeing a Good Listener\u003c/em\u003e and \u003cem\u003eReasons to Buy an iPad\u003c/em\u003e. The second task is a cause-effect paragraph, with topics related to students' lives, such as \u003cem\u003eCollege Life\u003c/em\u003e and \u003cem\u003eFamily Issues\u003c/em\u003e. In the second trimester, students are required to produce two assignments. 
The first is a paragraph comparing two concepts or entities such as \u003cem\u003eGym Versus Home Workouts\u003c/em\u003e, \u003cem\u003eA Mother and Father\u0026rsquo;s Roles\u003c/em\u003e, or \u003cem\u003eChildhood Fifty Years Ago Versus Now\u003c/em\u003e. The second is a paragraph defining a concept or entity such as \u003cem\u003eMental Health\u003c/em\u003e, \u003cem\u003eSnapchat\u003c/em\u003e, or \u003cem\u003eAn Ideal Roommate\u003c/em\u003e. At the end of the trimester, students take a timed three-hour exam in which they write by hand, without access to language resources, two paragraphs similar to those they produced during the trimester. In the exam, the students are given the option of four topics for a definition paragraph and four topics for a comparison paragraph. They are instructed to choose one of the four topics for each and write 12\u0026ndash;16 sentences with a topic sentence that includes both a topic and a controlling idea, supporting sentences that provide the main points with specific details, and a concluding sentence. See Appendix C for scanned student samples from these exams.\u003c/p\u003e \u003cp\u003eIn the third trimester, students progressed to argumentative essay writing, producing two 450\u0026ndash;500-word essays and an additional one under exam conditions. The fourth trimester tasked students with basing their writing on a medical journal article of their choice, first summarizing it, then critiquing it using additional references, and finally revising and synthesizing the summary and critique into one paper. See Appendix B for a sample of tagged writing by a higher-proficiency pharmacology student across the first three trimesters.\u003c/p\u003e \u003cp\u003eAll texts were collected electronically and pre-processed to remove non-linguistic content before being XML-tagged for spelling errors, transliterations, in-text references, and reference lists. 
In cases where the spelling errors are ambiguous either due to illegible handwriting or similar words such as using \u003cem\u003ethere\u003c/em\u003e in place of \u003cem\u003etheir\u003c/em\u003e, the transcriber sends a screenshot to a shared messaging group with the research group and the issue is resolved via consensus.\u003c/p\u003e"},{"header":"4. Avenues for Investigation","content":"\u003cp\u003eThe King Saud bin Abdulaziz University for Health Sciences (KSAUHS) Learner Corpus offers a rich resource for examining multiple dimensions of English as a Foreign Language (EFL) academic writing development. The corpus, which comprises longitudinal data, enables comprehensive research into various aspects of learners' writing processes and outcomes.\u003c/p\u003e \u003cp\u003eOne significant area of investigation is the analysis of learners' lexical, grammatical, and semantic choices. By tracking how language use evolves over time and in response to different task demands, researchers can identify specific areas where learners exhibit growth or face challenges. This analysis is crucial for understanding the progression of writing proficiency and the factors that influence it. The linguistic and textual features of interest to these research strands include verb-argument constructions, inanimate subjects, rhetorical phrases, reporting verbs, and lexical bundles. These features are considered \"late acquired\" and remain challenging even at advanced proficiency levels. Potentially, they can be isolated across the corpus for data-driven approaches like cluster analysis and vectorization, which can identify proficiency benchmarks to inform EAP writing needs analysis and assessment (Wulff \u0026amp; Gries, \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e2011\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eThe influence of the first language (L1) and culture on English foreign and academic writing is another key research avenue. 
Given that Arabic is the native language of all 158 participants, the corpus provides a unique opportunity to study cross-linguistic effects. Researchers can explore how L1 influences manifest in learners' English writing, offering insights into the complexities of bilingual education and language transfer phenomena.\u003c/p\u003e \u003cp\u003eThe inclusion of transliteration tags within the corpus facilitates a detailed examination of the specific linguistic elements that students struggle to translate into their second language. This enables researchers to identify patterns and areas of difficulty, thereby contributing valuable insights into the cognitive processes involved in second language acquisition and translation.\u003c/p\u003e \u003cp\u003eIndividual differences among learners also play a significant role in writing development. The corpus includes metadata on participants' educational backgrounds, areas of study, and preferred reference tools. This information allows for an in-depth exploration of how these factors impact writing proficiency. For example, researchers can investigate whether students from different schooling backgrounds exhibit varying levels of writing competence or how the use of specific reference tools correlates with writing quality.\u003c/p\u003e \u003cp\u003eThe corpus can also be used to examine the effects of instructional interventions. By comparing the writing outcomes of students who were taught by different instructors or who experienced changes in instructional methods, researchers can derive valuable pedagogical insights. This comparison can shed light on the effectiveness of various teaching strategies and the potential benefits of maintaining consistent instruction versus introducing new instructional approaches across terms.\u003c/p\u003e \u003cp\u003eAnother area for exploration is the impact of different task conditions on writing performance. 
The corpus contains samples of both exam and homework writing, enabling researchers to contrast these contexts. Such comparisons can highlight the effects of timed conditions and the availability of reference materials on students' writing outputs.\u003c/p\u003e \u003cp\u003eThe advent of emerging language technologies, such as ChatGPT, presents an additional dimension for investigation. By dividing the corpus into pre- and post-ChatGPT eras, researchers can examine how access to such technologies influences writing development. This analysis can provide insights into the role of artificial intelligence in educational settings and its impact on learners' writing practices.\u003c/p\u003e \u003cp\u003eFurthermore, the corpus allows for ongoing development and the exploration of learners' perceptions. Since many participants are still accessible, the corpus can be expanded with more recent writing samples for longitudinal comparisons. Additionally, interviews with learners can elicit their perceptions of their writing development and the use of technology, offering a qualitative complement to the quantitative data.\u003c/p\u003e"},{"header":"5. Future Directions and Conclusion","content":"\u003cp\u003eThe KSAUHS Learner Corpus represents a groundbreaking effort in capturing the academic writing development of Arabic L1 EFL learners within the context of emerging language technologies. By longitudinally tracking writing proficiency across various tasks, the corpus provides valuable data-driven insights into the lexical, grammatical, and rhetorical features that characterize and advance second language writing expertise. 
The ongoing expansions of the corpus aim to enhance its representativeness and analytical potential, making it an invaluable resource for EFL and EAP research and pedagogy.\u003c/p\u003e \u003cp\u003eThe KSAUHS Learner Corpus is available to researchers upon signing a data use agreement that permits only non-commercial use, prohibits full-text reproduction and redistribution, and requires acknowledgement. While the publicly shared corpus is anonymized, the research team can collect follow-up writing and interviews from participants for the next four years, enabling studies of continuing writing development post initial collection.\u003c/p\u003e \u003cp\u003eStarting with the Fall semester of 2024, data collection will be extended to include both the male and female campuses. This expansion will address previous shortcomings by ensuring timely uploads of student papers to avoid missing assignments. Additionally, efforts will be made to obtain draft versions of assignments along with the final submissions, providing a more comprehensive view of the writing process and development.\u003c/p\u003e \u003cp\u003eWe also invite feedback from the wider research community interested in learner corpus research. Collaborative input will be essential for improving the corpus, making it more accessible, and enhancing its utility for producing meaningful insights. 
Engaging with the research community will help us refine our methodologies and expand the corpus's impact on understanding EFL writing development.\u003c/p\u003e \u003cp\u003eBy continuing to expand and refine the KSAUHS Learner Corpus, we aim to contribute significantly to the field of EAP and provide a robust foundation for future research into the complexities of second language writing development.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003e6.1 Ethics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe collection of the participants\u0026rsquo; written production and metadata was approved by the Institutional Review Board (IRB) of King Abdullah International Medical Research Center (KAIMRC), Ministry of National Guard - Health Affairs, Kingdom of Saudi Arabia. The IRB approval number for this study is IRB/1344/22, and the study number is NRC22R/312/06. The approval was granted on 27 July 2022.\u003c/p\u003e\n\u003cp\u003eAll participants provided informed consent prior to their inclusion in the study. Participation was voluntary, and participants were assured of the confidentiality and anonymity of their data. The study did not involve any interventions or treatments.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e6.2 Consent for publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e6.3 Availability of data and material\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe KSAUHS Learner Corpus is available to researchers upon signing a data use agreement. This agreement permits only non-commercial use of the corpus, prohibits full-text reproduction and redistribution, and requires proper acknowledgment of the corpus in any resulting publications. 
Researchers interested in accessing the corpus should contact the corresponding author to obtain further details and initiate the data use agreement process.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e6.4 Competing interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e6.5 Funding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNo funding was received for this research.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors\u0026apos; contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eEman Al Nafjan conceptualized the study, oversaw data collection, and drafted the manuscript. All other authors, except Sabria Jawhar, contributed to the transcription, compilation, and refinement of the corpus. Sabria Jawhar coordinated logistical arrangements and secured the necessary IRB approval for data and metadata collection. All authors reviewed and approved the final version of the manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe extend our gratitude to Prof. Magali Paquot for her invaluable advice during the stages of corpus compilation and for providing critical insights that enhanced this corpus. We also acknowledge the administrative assistants at KSAUHS, Norah Alowerdhi and Mashael Alhumayd, for their assistance in scanning participants\u0026rsquo; handwritten exam papers. Their support was instrumental in the successful compilation of this learner corpus.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eAl Nafjan, E., \u0026amp; Jawhar, S. (in press). Learner corpus research and data-driven learning in Saudi Arabia. In A. H. Al-Hoorie, C. Mitchell, \u0026amp; T. Elyas (Eds.), \u003cem\u003eLanguage education in Saudi Arabia: Integrating technology in the classroom\u003c/em\u003e. 
Springer.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBiber, D., Reppen, R., Staples, S., \u0026amp; Egbert, J. (2020). Exploring the longitudinal development of grammatical complexity in the disciplinary writing of L2-English university students. \u003cem\u003eInternational Journal of Learner Corpus Research\u003c/em\u003e, \u003cem\u003e6\u003c/em\u003e(1), 38\u0026ndash;71.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCallies, M., \u0026amp; Zaytseva, E. (2013). The Corpus of Academic Learner English (CALE): A new resource for the assessment of writing proficiency in the academic register. \u003cem\u003eDutch Journal of Applied Linguistics\u003c/em\u003e, \u003cem\u003e2\u003c/em\u003e(1), 126\u0026ndash;132.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGranger, S., Dupont, M., Meunier, F., Naets, H., \u0026amp; Paquot, M. (2020). \u003cem\u003eThe International Corpus of Learner English (version 3)\u003c/em\u003e. Presses universitaires de Louvain.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGranger, S., \u0026amp; Paquot, M. (2013). Language for specific purposes learner corpora. In C. A. Chapelle (Ed.), \u003cem\u003eThe Encyclopedia of Applied Linguistics\u003c/em\u003e. Blackwell-Wiley.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGranger, S. (2008). Learner corpora. In A. L\u0026uuml;deling, \u0026amp; M. Kyt\u0026ouml; (Eds.), \u003cem\u003eCorpus linguistics: An international handbook\u003c/em\u003e (pp. 259\u0026ndash;275). Walter de Gruyter.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMyles, F. (2009). Investigating learner language development with electronic longitudinal corpora: Theoretical and methodological issues. \u003cem\u003eThe longitudinal study of advanced L2 capacities\u003c/em\u003e (pp. 74\u0026ndash;88). Routledge.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWilkinson, M. D., Dumontier, M., Aalbersberg, I. 
J., Appleton, G., Axton, M., Baak, A., \u0026amp; Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. \u003cem\u003eScientific Data\u003c/em\u003e, \u003cem\u003e3\u003c/em\u003e(1), 1\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWulff, S., \u0026amp; Gries, S. T. (2011). Corpus-driven methods for assessing accuracy in learner production. In P. Robinson (Ed.), \u003cem\u003eSecond language task complexity: Researching the cognition hypothesis of language learning and performance\u003c/em\u003e (pp. 61\u0026ndash;87). John Benjamins.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZaytseva, E. (2011). Register, genre, rhetorical functions: Variation in English native-speaker and learner writing. In H. Hedeland, T. Schmidt, \u0026amp; K. W\u0026ouml;rner (Eds.), \u003cem\u003eMultilingual resources and multilingual applications\u003c/em\u003e (pp. 239\u0026ndash;242). University of Hamburg.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"corpus compilation, corpus design, corpus study design, learner corpus research, metadata, second language acquisition","lastPublishedDoi":"10.21203/rs.3.rs-4755662/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4755662/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eThe King Saud bin Abdulaziz University for Health Sciences (KSAUHS) Learner Corpus is a new longitudinal corpus of English as a Foreign Language (EFL) academic writing at the university level that complies with the FAIR principles. Collection began in 2022, and the corpus captures writing development during a period of emerging language technologies. The corpus currently contains over a million words from 158 learners across four proficiency levels (CEFR A1-B2). Texts span four trimesters and multiple academic genres, including paragraphs and essays in rhetorical modes like cause-effect, argumentation, and summarization. Metadata includes learners' parents\u0026rsquo; first language background, other foreign languages, academic focus, and reference tool preferences. The corpus enables research into multiple dimensions of second language writing development to inform EAP pedagogy and assessment. Avenues for investigation include lexico-grammatical development, cross-linguistic influence, individual differences, instructional interventions, impacts of task conditions and language technologies, and longitudinal trajectories. Ongoing corpus expansion aims to increase size, representation, and accessibility. 
This unique resource promises data-driven insights into the textual features and factors that characterize and facilitate intermediate and advanced EFL writing proficiency in university contexts.\u003c/p\u003e","manuscriptTitle":" The King Saud bin Abdulaziz University for Health Sciences Learner Corpus: A Longitudinal Resource for Researching the Development of Academic Writing Proficiency in EFL Learners ","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-08-13 18:00:33","doi":"10.21203/rs.3.rs-4755662/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"3dd597c3-6f3d-4760-b281-9f12c6651a51","owner":[],"postedDate":"August 13th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2024-09-01T04:38:14+00:00","versionOfRecord":[],"versionCreatedAt":"2024-08-13 18:00:33","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-4755662","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4755662","identity":"rs-4755662","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}