Optimizing GPT-Based Distractor Generation for the Korean CSAT English Exam

doi:10.21203/rs.3.rs-6680435/v1

Optimizing GPT-Based Distractor Generation for the Korean CSAT English Exam

2025 · doi:10.21203/rs.3.rs-6680435/v1

preprint OA: closed

Full text JSON View at publisher

Full text 134,031 characters · extracted from preprint-html · click to expand

Optimizing GPT-Based Distractor Generation for the Korean CSAT English Exam | Research Square window.SnipcartSettings = { analytics: { enabled: false } }; (function() { var accessVector = localStorage.getItem('access_vector') || ''; window.dataLayer = window.dataLayer || []; if (accessVector) { window.dataLayer.push({ user: { profile: { profileInfo: { snid: accessVector } } } }); } })(); (function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-K279D39R'); Browse Preprints In Review Journals COVID-19 Preprints AJE Video Bytes Research Tools Research Promotion AJE Professional Editing AJE Rubriq About Preprint Platform In Review Editorial Policies Our Team Advisory Board Help Center Sign In Submit a Preprint Cite Share Download PDF Research Article Optimizing GPT-Based Distractor Generation for the Korean CSAT English Exam Chan Young Jung, Sanghoun Song This is a preprint; it has not been peer reviewed by a journal. https://doi.org/ 10.21203/rs.3.rs-6680435/v1 This work is licensed under a CC BY 4.0 License Status: Posted Version 1 posted You are reading this latest preprint version Abstract High-quality distractors are essential in multiple-choice questions to assess student understanding and diagnose misconceptions; however, constructing these distractors manually is labor-intensive. This study presents the first large-scale investigation of automated distractor generation (ADG) for the English section of Korea’s College Scholastic Ability Test (CSAT), a high-stakes exam of English as a Foreign Language (EFL) characterized by consistent item design and linguistic constraints. We implement and evaluate three ADG approaches using GPT-4.1: supervised fine-tuning on a curated CSAT dataset, in-context learning with a novel distractor attractiveness metric to guide exemplar retrieval, and Chain-of-Scaffolds, a prompting strategy inspired by educational scaffolding theory that decomposes distractor generation into reasoning stages. Across 80 unseen items from recent CSAT administrations, supervised fine-tuning achieves the highest semantic and lexical alignment with ground-truth distractors. In-context learning retrieves more pragmatically effective examples, producing distractor sets that best approximate realistic answer distributions. The Chain-of-Scaffolds method yields distractors that simulate test-taker misconceptions while minimizing confusion with the correct answer. These findings underscore the value of pedagogically grounded prompting and data-informed retrieval in high-stakes language assessment and suggest that ADG strategies should align with instructional contexts—for example, prioritizing fine-tuning for nationwide standardized exams, or selecting in-context learning for classroom diagnostics that require adaptability and rapid deployment. distractor generation language models GPT-4.1 EFL Korean CSAT Figures Figure 1 Figure 2 Figure 3 Figure 4 1 Introduction The College Scholastic Ability Test (CSAT; Suneung ) is South Korea’s high-stakes, standardized college entrance examination that plays a pivotal role in determining students’ access to higher education and career opportunities (Kwon, Lee & Shin, 2015 ). English holds particular significance among its core subjects, with educational policies assigning it comparable weight to Korean and mathematics (Brutt-Griffler & Kim, 2023 ), intensifying competition and encouraging widespread reliance on private education. Given the CSAT’s societal importance and recent advances in Generative Pre-trained Transformers (GPT) (Bubeck et al., 2023 ; Shahriar et al., 2024 ), we conduct a comparative, experimental analysis of GPT-based automatic distractor generation (ADG) techniques tailored to the CSAT’s English section. Distractors refer to incorrect but plausible answer choices presented alongside the correct answer in multiple-choice questions (MCQs). They play a crucial role in diagnosing the misconceptions of test takers and their areas for improvement (Gierl, Bulut, Guo, & Zhang, 2017 ). Given the labor-intensive nature of manual distractor generation, scalable ADG methods using PLMs such as GPT have gained popularity in recent years (Doughty et al., 2024 ; Feng et al., 2024 ; Maity, Deroy, & Sarkar, 2024 ; Tran et al., 2023; Zu, Choi, & Hao, 2023 ). Various ADG strategies have been assessed on English exam datasets, including CLOTH (Xie, Lai, Dai, & Hovy, 2018 ), DREAM (Sun et al., 2019 ), and RACE (Lai, Xie, Liu, Yang, & Hovy, 2017 ); however, the assessment of these ADG strategies has not been extended to the English section of the CSAT. Three main directions have emerged in PLM-based ADG: fine-tuning, in-context learning (ICL), and template-based prompting (Alhazmi, Sheng, Zhang, Zaib, & Alhazmi, 2024 ). Fine-tuning a PLM adapts the model to perform better in a domain or task by training it further on a task-specific dataset. Offerijns, Verberne, and Verhoef ( 2020 ) fine-tuned GPT-2 using the RACE dataset to generate three semantically correct and educationally relevant distractors for a given question and context. Conversely, Taslimipoor, Benedetto, Felice, and Buttery (2024) and Yu et al. ( 2024 ) fine-tuned the T5 model for cloze tasks and cross-domain MCQs. The latter two approaches are prompting-based—specialized prompts guide the PLM’s behavior at inference time without modifying its parameters. ICL uses examples embedded in the input prompt. Bitew, Deleu, Develder, and Demeester ( 2023 ) used a BERT-based question similarity model to retrieve a ranked list of sample questions similar to the target test question; however, McNichols et al. (2023) used the K-nearest neighbor (K-NN) retrieval for mathematics MCQs. Meanwhile, template-based prompting , which relies on carefully designed prompts without using training examples, is also promising in ADG (Doughty et al., 2024 ). The three approaches differ in the amount of ground-truth data required: fine-tuning requires substantial training data, ICL uses a few curated examples, and template-based prompting relies solely on the PLM’s innate reasoning capabilities. Educational specialists developed the English section of the CSAT to assess a specific population of EFL learners—namely, senior high school students (aged 17–18) preparing for university admission in Korea. Consequently, the exam exhibits consistent question formats, target skills, and stylistic features across its official and mock administrations. Our study focuses on high-difficulty CSAT items that hinge on English-language distractors (as detailed in Section 3.1 ). To address the CSAT’s distinctive challenges, we evaluate three tailored approaches, each aligned with one of the three main directions in PLM-based ADG. Our contributions are as follows. First, we propose a novel metric, Distractor Attractiveness Rate (DAR) , which incorporates entropy and empirically observed answer selection rates from CSAT administrations. DAR is an alternative to similarity-based K-NN retrieval strategies in ADG. Second, we introduce Chain-of-Scaffolds (CoS) , a pedagogically motivated adaptation of Chain-of-Thought (CoT) prompting (Kojima, Gu, Reid, Matsuo, & Iwasawa, 2023 ) that operationalizes educational scaffolding principles for ADG in high-stakes language assessment. Third, we quantitatively demonstrate the respective strengths of three ADG approaches for assessment design—respectively grounded in supervised fine-tuning, ICL, and template-based prompting—by measuring their semantic and lexical alignment with ground-truth CSAT distractors, and through a model-internal check for plausibility. Our results reveal that supervised fine-tuning consistently outperforms prompting approaches on automatic evaluation metrics, while Chain-of-Scaffolds and ICL with DAR-based retrieval outperform the GPT-4.1 baseline respectively in semantic alignment and plausibility with minimal reliance on training data. 2 Methods 2.1 Task Definition We formally define a CSAT MCQ Q as Q = {t, s, c, D, R} . Each MCQ comprises a question type t (where t ∈ T , | T | = 6), a stem s , a correct answer c , a set of distractors D (| D | = 4), and a set of selection rates R where r i ∈ R and r c respectively correspond to the reported selection rate of d i ∈ D and c among test takers in official CSAT and senior-year mock CSAT administrations. All questions—originally written in Korean—were replaced with one of six standardized English question templates at the time of data collection (Table 1 ). The questions were thus categorized by their type to minimize the influence of multilingual interference prevalent in transformer-based language models (Held & Yang, 2023 ; Shaham, Elbayad, Goswami, Levy, & Bhosale, 2023 ). Accordingly, each question stem was also reformatted, where necessary, to align with the question templates. The resulting t, s, c, d i components are all text sequences in English, each similar in length and style to equivalents across all CSAT administrations. The objective is to generate a new set of distractors D gen (| D gen | = 4), given ( t, s, c ) and optionally r 1 –r 4 , such that the generated distractors d gen_1 –d gen_4 exhibit semantic and lexical alignment to the ground-truth distractors d 1 –d 4 . Aggregating individually plausible distractors does not ensure the overall effectiveness of the set; thus, we adopt a joint generation strategy for D gen , following the approach of Rodriguez-Torrealba, Garcia-Lopez, and Garcia-Cabot ( 2025 ). Table 1 Six CSAT question types with distractors written in English, not embedded in the question stem. Note that the corresponding question numbers may vary slightly across test administrations prior to September 2019. English Question Template Original Question (Korean/English Translation) CSAT Question Numbers Change in sentiment 다음 글에 드러난 _______ 의 심경 변화로 가장 적절한 것은? (Which is the most appropriate description of the change in _______’s sentiment in the following passage?) 19 Meaning of [CTXT] in context 밑줄 친 _______ 이 다음 글에서 의미하는 바로 가장 적절한 것은? (Which is the most appropriate interpretation of the underlined _______ in the following passage?) 21 Topic of the passage 다음 글의 주제로 가장 적절한 것은? (Which is the most appropriate topic of the following passage?) 23 Title for the passage {다음 글의, 윗글의} 제목으로 가장 적절한 것은? (Which is the most appropriate title for the {following, above} passage?) 24 41 Fill in the [BLANK] 다음 빈칸에 들어갈 말로 가장 적절한 것을 고르시오. (Choose the most appropriate phrase to fill in the following blank.) 31 32 33 34 Pair of (A), (B) that best completes the [SUMMARY] to the passage 다음 글의 내용을 한 문장으로 요약하고자 한다. 빈칸 (A), (B)에 들어갈 말로 가장 적절한 것은? (To summarize the following passage in one sentence, which words best fit blanks (A) and (B)?) 40 2.2 Approaches Across all three approaches, we used the 2025-04-14 snapshot of GPT-4.1 as our base PLM. As of April 2025, GPT-4.1 was the latest version of OpenAI’s flagship chat model that supported fine-tuning via the application programming interface. Having used GPT-4o for preliminary experiments, we observed that GPT-4.1 outperformed GPT-4o across all three approaches, as well as the zero-shot baseline. Table 2 provides an overview of the three approaches. Table 2 Overview of GPT-based distractor generation approaches used in this study. All methods were use GPT-4.1 (2025-04-14 snapshot) as the base model. Approach Theoretical Basis Training / Example Data Used Base Model Supervised Fine Tuning Supervised Learning ( t, s, c, D ) (n = 419) GPT-4.1- 2025-04-14 In-Context Learning with DAR-Based Retrieval In-Context Learning (Few-Shot), Entropy-Based Selection ( t, s, c, D, R ) (n = 30; 5 per question type) GPT-4.1- 2025-04-14 Chain-of-Scaffolds Chain-of-Thought Prompting, Scaffolding None GPT-4.1- 2025-04-14 2.2.1 Supervised Fine Tuning The first approach involves adopting supervised fine-tuning to adapt the base PLM for the CSAT distractor generation task. The model is trained to jointly generate four distractors conditioned on a triplet input—question type, stem, and correct answer—using annotated examples from past CSAT items. We expose the model to a diverse set of verified distractor examples across all six question types to align its generation behavior with the implicit pedagogical and stylistic norms of the CSAT. The generalization capabilities of the base PLM are thus leveraged while its parameters are adjusted for task-specific performance. Section 3 provides further details on dataset composition and training configuration. 2.2.2 In-Context Learning with DAR-Based Retrieval The second approach involves building on the ICL framework, where examples are provided directly in the prompt to guide the model’s generation behavior at inference time. Various strategies for retrieving items similar to the target question have been explored in prior ADG research, including BERT-based ranking models (Bitew et al., 2023 ) and cosine similarity between Angle vectors representing encoded textual content, answers, and questions (Li & Li, 2024 ; Luo, Deng, Shen, Ng, & Chua, 2024 ). We introduce an ICL approach grounded in DAR, a novel metric based on entropy that quantifies the collective plausibility of distractors using real-world selection rates R from past CSAT administrations. In this way, this approach ensures that the retrieved distractor sets are contextually similar to the target question and sufficiently plausible. We retrieve five distractor sets with the highest DAR scores within the same question type for each question in the test set from the training set and concatenate them into the prompt at inference time. The number of sets retrieved (n = 5) was determined empirically through trial and error. Whereas the supervised fine-tuning approach exposes the model to a broad distribution of training examples to internalize generalizable patterns, this retrieval-based approach prioritizes a smaller set of highly effective examples for each question type, offering the model more targeted and high-impact input at inference time. Our definition of DAR incorporates three core heuristics: (i) distractor sets with low correctness rates are favored, (ii) distractor selections that are evenly distributed among test-takers indicate the comparable plausibility of all options, and (iii) sets in which a single distractor disproportionately dominates—suggesting that the remaining distractors serve merely as placeholders—are penalized. The first two heuristics are applicable to the CSAT, where every item includes four distractors. The third is motivated by Ma and Du ( 2023 ), who demonstrate that prompting PLMs to simulate process-of-elimination reasoning—central to human test-taking—enhances performance on MCQs. Their findings suggest that PLMs are sensitive to subtle plausibility gradients even among incorrect options, reinforcing the need to avoid exemplar sets that contain obviously implausible distractors. Hence, DAR is generically for any distractor set where | D | ≥ 2 as follows: $$\:DAR\left(Q\right)=\left(1-{r}_{c}\right)\times\:\left(\frac{-\sum\:_{{d}_{i}\in\:D}\stackrel{\sim}{{r}_{i}}{\text{log}}_{2}\stackrel{\sim}{{r}_{i}}}{{\text{log}}_{2}\left|D\right|}\times\:\frac{1-\text{max}\stackrel{\sim}{{r}_{i}}}{1-\frac{1}{\left|D\right|}}\right)$$ $$\:where\hspace{1em}\stackrel{\sim}{{r}_{i}}=\frac{{r}_{i}}{1-{r}_{c}}$$ Note that the range of DAR is [0,1]: a correctness rate of 100% yields a DAR of 0, while the maximum value of 1 applies when no test-taker selects the correct answer and distractor selections are perfectly balanced. Regarding CSAT English MCQs, where | D | = 4, the above formula is instantiated as follows: $$\:DAR\left(Q\right)=\left(1-{r}_{c}\right)\times\:\left(\frac{-\sum\:_{i=1}^{4}\stackrel{\sim}{{r}_{i}}{\text{log}}_{2}\stackrel{\sim}{{r}_{i}}}{2}\times\:\frac{1-\text{max}\{\stackrel{\sim}{{r}_{1}},\stackrel{\sim}{{r}_{2}},\stackrel{\sim}{{r}_{3}},\stackrel{\sim}{{r}_{4}}\}}{0.75}\right)$$ $$\:where\hspace{1em}\stackrel{\sim}{{r}_{i}}=\frac{{r}_{i}}{1-{r}_{c}}$$ 2.2.3 Chain-of-Scaffolds Our final approach, CoS, is a multi-step prompting strategy that generates pedagogically effective distractors without relying on any training examples. The approach involves exploiting the reported utility of CoT prompting (Kojima et al., 2023 ; Wei et al., 2023 ) to decompose distractor generation into four steps: i) correct answer rationale generation, ii) misconception generation, iii) distractor generation based on misconceptions, and iv) syntactic and lexical refinement. The primary objective is to stimulate internal reasoning pathways related to commonly cited difficulty factors among CSAT test-takers by eliciting intermediary outputs from the base PLM at each stage. These difficulty factors are i) processing long phrases before the main verb, ii) handling negation, conjunctions, and connectives, and iii) lexical diversity and unfamiliar vocabulary (Kim, 2024 ). This strategy is theoretically grounded in Vygotsky’s concept of the Zone of Proximal Development , operationalized through the concept of scaffolding (Vygotsky, 1978 ; Wood, Bruner, & Ross, 1976 ). Each step guides the model beyond its baseline capability, encouraging it to utilize real-world linguistic cues rather than self-detected patterns. The utility of zero-shot Chain-of-X prompting strategies has been verified in recent work. Maity et al. ( 2024 ) reported improvements in grammaticality, answerability, and difficulty of distractors generated by a multi-stage prompting approach incorporating multilingual paraphrasing, keyword extraction, and question generation. Table 3 provides an overview of the input prompt and expected output format at each generation stage within the CoS framework. Every stage operates in a zero-shot manner (i.e., without training examples). Table 3 An overview of our Chain-of-Scaffolds framework. The corresponding content for each test item replaces the placeholders. Generation Stage Input Prompt Expected Output Correct Answer Rationale Generation You are an expert in English education for Korean CSAT preparation. Question: {question} Stem: {stem} Correct Answer: {correct_answer} Task: Briefly explain in one sentence why the provided correct answer is the best choice. {rationale} Misconception Generation You are analyzing test-taker errors based on known CSAT difficulties: 1. Processing long phrases before the main verb 2. Handling negations/conjunctions/connectives 3. Lexical diversity and unfamiliar vocabulary Question: {question} Stem: {stem} Correct Answer: {correct_answer} Correct Answer Rationale: {rationale} List four misconceptions, one for each of the three difficulties (repeat one if needed). Format each as: - [Difficulty X]: [Misconception sentence] {misconceptions} , represented as [ {Difficulty X} ]: [ {Misconception sentence} ] (n = 4) Distractor Generation Generate one distractor per misconception below. Each distractor should: - Directly reflect the reasoning error in the misconception. - Be grammatically and contextually appropriate. - Be clearly incorrect but attractive to a test-taker making the given mistake. List only the four distractors. Do NOT number or explain them. Question: {question} Stem: {stem} Correct Answer: {correct_answer} Misconceptions: {misconceptions} {initial_distractors} (n = 4) Syntactic and Lexical Refinement Refine the following distractors to ensure they: - Are similar in length, vocabulary difficulty, and syntactic complexity to the correct answer. - Do not resemble each other or the correct answer too closely. Output exactly four distractors, one per line. Do NOT add explanations or numbering. Correct Answer: {correct_answer} Initial Distractors: {initial_distractors} {final_distractors} (n = 4) 3 Experiments 3.1 CSAT Dataset Our dataset comprises 499 MCQs sourced from the English section of the CSAT and its official mock examinations, which are administered six times annually in preparation for the November CSAT. The Korea Institute of Curriculum and Evaluation administers the CSAT and the June and September mock exams. Conversely, Regional Offices of Education administer the remaining mock exams in rotation across Korea. Where available, item-level response rate distributions for the top 15 most frequently missed questions from each exam are retrieved from the Educational Broadcasting System, a publicly accessible online learning platform.[2] We restricted our data source to exams administered since March 2018 as earlier versions of the exam employed a now-discontinued relative evaluation system and differed in their question type composition due to curricular revisions. We excluded questions that contained Korean distractors to avoid multilingual interference. Furthermore, the question types for which distractors could be trivially generated—such as those requiring paragraph reordering or factual matching—were also excluded. The resulting dataset comprised six question types across 50 administrations (Table 4). As described in Section 2, all the questions were translated into one of six English templates for consistency. When necessary, stems that originally contained intentional grammatical errors—owing to their linkage with other questions—were corrected to ensure standalone interpretability. To approximate an 80:20 train-to-test ratio, we designated the 419 MCQs from administrations between 2018 and 2023 as the training set and the 80 MCQs between 2024 and March 2025 administrations as the test set. We deliberately avoided a randomized train:test split to preserve the proportional distribution of question types because CSAT question numbers are mapped to specific question types. Furthermore, we assume that recently administered items were less likely to have appeared in the pre-training data of language models, given that data age is associated with performance degradation in language models (Longpre et al., 2024). Table 4: Overview of CSAT Dataset curated in this study. Items in the training set were used as training data for supervised fine-tuning and the candidate pool for exemplar retrieval in ICL. Question Type Items in Training Set Items in Test Set Total Change in sentiment 40 8 48 Meaning of [CTXT] in context 39 8 47 Topic of the passage 42 8 50 Title for the passage 85 16 101 Fill in the [BLANK] 171 32 203 Pair of (A) , (B) that best completes the [SUMMARY] to the passage 42 8 50 3.2 Baselines We used zero-shot GPT (GPT-4.1-2025-04-14, consistent with our three proposed approaches), following prior work by Bitew et al. (2023), McNichols et al. (2023), and Taslimipoor et al. (2024). Zero-shot prompting with off-the-shelf GPT models provides a robust benchmark across diverse natural language processing tasks (Meshkin et al., 2024). Further, it represents a practical and accessible solution for non-expert users, such as educators, curriculum designers, and test developers. We evaluate this baseline against our three proposed approaches in terms of semantic and lexical alignment with ground-truth distractors, defined as the official distractors used in the 2024–2025 test administrations within our held-out test set. 3.3 Evaluation Metrics We evaluated the quality of generated distractors along two key dimensions: semantic and lexical alignment to the ground-truth distractors. This two-dimensional approach enables a comprehensive assessment of the efficient approximation of distractor characteristics in authentic test items by each system. All the metrics were computed on a per-question basis by comparing each set of four generated distractors with the corresponding set of four ground-truth distractors. We report seven automatic evaluation metrics: SBERT-based Cosine Similarity, BERTScore (for semantic alignment), BLEU-1 through BLEU-4, and ROGUE-L (for lexical alignment). Further details are provided in sections 3.3.1 and 3.3.2. To complement these alignment-based automatic metrics, we also prompted our base model to select the best answer and identify all plausible distractors, given the correct answer and distractor set. This diagnostic procedure was repeated for the ground-truth distractors, baseline, and the three proposed approaches because GPT-4.1’s ability to discriminate between correct, plausible, and implausible options has not been evaluated previously on the CSAT. Therefore, we posit that an effective set of distractors should enable the model to identify the correct answer while also flagging one or more distractors as plausible. This approach serves as a proxy for evaluating pragmatic and syntactic alignment, the latter of which is difficult to otherwise quantify given the short length of CSAT answer choices. 3.3.1 Semantic Alignment We adopted two metrics to assess semantic alignment at the sentence and token levels. To evaluate sentence-level semantic alignment, we use Sentence-BERT (all-MiniLM-L6-v2 (Reimers & Gurevych, 2019)), a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) model which produces semantically rich sentence embeddings. We computed a 4 × 4 cosine similarity matrix for each set of generated distractors, comparing the embeddings of the generated and ground-truth distractors. Subsequently, we calculated the best-match average, defined as the mean of the highest cosine similarity scores for each generated distractor. Although cosine similarity theoretically ranges from -1 to 1, practical values typically fall between 0 and 1, as SBERT embeddings tend to lie in the non-negative region of the embedding space. To evaluate token-level semantic alignment, we used BERTScore (Zhang, Kishore, Wu, Weinberger, & Artzi, 2020), which compares contextual embeddings from a pre-trained BERT model. BERTScore values range from 0 to 1. Higher scores of cosine similarity and BERTScore indicate stronger semantic alignment. We report both metrics as percentages—i.e., original values multiplied by 100—for ease of interpretation. 3.3.2 Lexical Alignment We first reported BLEU-1 through BLEU-4 to assess lexical alignment, following Taslimipoor et al. (2024). BLEU (Bilingual Evaluation Understudy; Papineni, Roukos, Ward, & Zhu, 2002) is a precision-oriented metric that evaluates surface-level lexical alignment by calculating n-gram overlap between the candidate (i.e., generated distractors) and reference (i.e., ground-truth distractors) sequences. We calculated the BLEU scores using the Natural Language Toolkit implementation, with smoothing techniques applied to mitigate zero-precision scores in short sequences. BLEU-1 considers unigram overlaps, while BLEU-2 through BLEU-4 measure contextual alignment over longer n-grams. We also reported ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation (Lin, 2004)), a recall-focused metric that evaluates lexical alignment using the longest common subsequence between the candidate and reference sequences, to better capture partial matches that preserve semantic cohesion. For each generated distractor, we calculated the F1 version of ROGUE-L, which balances precision and recall, and reported the highest score across all references. The BLEU and ROGUE-L scores range between 0 and 1, with a higher value indicating greater lexical alignment. As with semantic metrics, we reported BLEU-1 through BLEU-4 and ROGUE-L as percentages for interpretability. 3.4 Implementation Details In this section, we report key implementation details to support the replicability of our research. We collected our training and test data by downloading available PDF versions of past test questions through the EBSi website. We systematically extracted the relevant questions, stems, and distractors using the pdfplumber Python library and regular expressions. PDFs unsuccessfully retrieved through this procedure were manually curated. The answer selection rates were also retrieved from the same source. We used pandas to convert question sets into data frames and appended computed DAR values using Microsoft Excel macros. We fine-tuned the GPT-4.1-2025-04.14 model using OpenAI’s default fine-tuning pipeline. A temperature of 0.0 was selected to ensure deterministic output across all approaches. The model was trained on 419 JSONL -formatted training instances over three epochs. The max_tokens parameter was set to 250 and top_p to 1. The training instances were randomly shuffled prior to training to mitigate overfitting on early examples. The fine-tuning process was completed in 21 minutes, and Figure 1 visualizes the training loss over steps. We used the following system prompt for the supervised fine-tuning and ICL approaches: “ Generate four distractors (i.e., plausible but incorrect answer choices) for the given multiple choice question. Do not include any indices .” The questions, stems, and correct answers from the test set were inserted into the user prompt. In the ICL setup, retrieved examples were also appended to the user prompt under the header [EXAMPLES]. Section 2.2.3 describes the prompt designs for the CoS approach. [2] https://www.ebsi.co.kr/ebs/xip/xipa/retrievePastGrdCutWrongAnswerRate.ebs?tab=1 4 Results and Discussion 4.1 Semantic and Lexical Alignment Table 5 details the semantic and lexical alignment results for the three proposed approaches compared to the zero-shot baseline. Supervised fine-tuning consistently outperforms the baseline across all metrics, achieving substantial improvements in SBERT cosine similarity, BLEU, and ROUGE-L scores. These findings indicate that exposure to all 419 training instances enabled GPT-4.1 to internalize the semantic and lexical features of ground-truth CSAT distractors and that patterns observed in exams between 2018 and 2023 remained robustly predictive of distractor style in more recent items. The alternative approaches, which relied on smaller amounts of relevant data or GPT-4.1’s scaffolded reasoning abilities, produced more nuanced results. ICL with DAR-based retrieval revealed modest gains in semantic alignment but underperformed the baseline in lexical alignment. This suggests that a few high-quality examples may enhance the model’s contextual understanding of distractors’ meaning; however, they may not sufficiently improve surface-level lexical similarity. Notably, the CoS approach, likely due to incorporating correct answer rationale generation and misconception generation stages, achieved SBERT cosine similarity and BERTScore-F1 scores comparable to those of supervised fine-tuning. This suggests that GPT-4.1 was capable of modeling plausible misconceptions that align with real-world test-taker reasoning errors. However, the lexical overlap remained relatively low, as reflected in the slightly lower BLEU-1 (unigram match) and ROUGE-L scores compared to the baseline. Overall, the results highlight potential in both explicit model guidance and data-driven model adjustment in distractor generation. The consistent supremacy of supervised fine-tuning demonstrates the benefits of full-data exposure in capturing conceptual plausibility in lexical patterns. In contrast, selecting a few particularly effective samples at inference time proved less effective. Meanwhile, the high semantic alignment of CoS-generated distractors reveals that semantic plausibility can be attained despite an absence of task-specific training data by steering the model’s reasoning process toward the types of errors human test-takers are likely to make. Table 5 Automatic evaluation metrics measuring semantic and lexical alignment. All values are reported as percentages (i.e., original value multiplied by 100). Higher values indicate stronger alignment with ground-truth distractors for all the reported metrics. Approach Semantic Lexical SBERT Cosine Sim. ↑ BERTScore-F1 ↑ BLEU-1 ↑ BLEU-2 ↑ BLEU-3 ↑ BLEU-4 ↑ ROGUE-L ↑ Zero-Shot (baseline) 44.78 88.54 26.76 10.36 7.19 5.59 19.27 Supervised Fine- Tuning 51.45 89.00 34.18 17.42 10.22 7.34 24.71 ICL with DAR-based Retrieval 47.61 88.62 25.96 10.66 6.8 5.23 18.11 Chain-of- Scaffolds 50.74 88.8 26.2 11.23 7.51 5.81 18.52 4.2 Plausibility-based Diagnostic Evaluation Plausibility-based diagnostic evaluation results in Table 6 provide additional insight into the practical quality of the generated distractors. As expected, the ground-truth distractors produced the highest rate of correct answers (79 out of 80), affirming GPT-4.1’s ability to solve CSAT questions crafted by human experts. Among the generated sets, ICL with DAR-based retrieval yielded the highest number of instances (49) in which the model selected the correct answer and identified at least one plausible distractor—surpassing both supervised fine-tuning (37) and CoS (38). Despite their relatively low semantic and lexical alignment with ground-truth distractors, ICL-generated distractors effectively approximated distractor sets with near-optimal choice distributions, highlighting DAR’s value as a retrieval signal for ADG. Meanwhile, CoS strikes a strong balance between correctness and plausibility, yielding only three incorrect responses across the test set—comparable to the ground truth (1) and outperforming supervised fine-tuning (12). These findings suggest that scaffolded reasoning enhances semantic plausibility and reduces model confusion more effectively than fine-tuning alone. These results demonstrate the individual strengths of each modeling approach. Fine-tuning excels at surface-level alignment, ICL at leveraging plausibility signals, and CoS at promoting both interpretive reasoning and alignment. Crucially, the three approaches outperform the zero-shot baseline in at least one dimension, reinforcing the value of targeted strategies in high-stakes distractor generation tasks like the CSAT. Table 6 GPT-4.1’s problem-solving and plausibility identification results with each set of distractors for all questions in the test set (n = 80). The base PLM was also prompted to answer each question with its ground-truth distractors to measure its innate competence on the CSAT. Approach Correct answer & plausible distractors identified Correct answer & no distractors identified Wrong answer Ground truth 65 14 1 Zero-Shot (baseline) 32 45 3 Supervised Fine-Tuning 37 31 12 ICL with DAR-based Retrieval 49 29 2 Chain-of- Scaffolds 38 39 3 5 Conclusions and Future Work This paper evaluates three major ADG strategies tailored to the English section of the Korean CSAT, a linguistically constrained high-stakes EFL exam. We introduce a CoT-inspired pedagogical approach to reinforce GPT-4.1’s reasoning capability in ADG, define a novel metric for relevant item retrieval in ICL, and conduct extensive experiments using authentic materials on a dataset hitherto unexplored in ADG. Our comparative analysis reveals that supervised fine-tuning effectively reproduces surface-level semantic and lexical conventions; meanwhile, ICL with DAR-based retrieval and CoS respectively enhance the collective plausibility of generated distractors and their semantic alignment with ground-truth distractors. Collectively, these findings illustrate how different ADG strategies serve distinct roles in replicating human-like distractor generation, offering practical applicability in structured multiple-choice assessments under varying resource conditions. Despite the promising results of this study, some limitations should be acknowledged. First, the training dataset was relatively small by modern fine-tuning standards, comprising only 419 instances, which may present challenges in the generalizability of the supervised fine-tuning results to broader or more varied English MCQ contexts beyond the CSAT. However, this limitation is unavoidable because the CSAT is administered only six times annually under strict question quality control, inherently capping the amount of available data. Second, efforts were made to standardize question formatting and minimize multilingual interference; however, translating stems and options from Korean into English templates may have introduced subtle changes in meaning or difficulty level that could affect distractor generation performance. Lastly, automatic evaluation metrics and plausibility-based diagnostic evaluation provided a multifaceted assessment; nonetheless, human expert evaluation of distractor quality—such as plausibility, grammaticality, and educational value—was not incorporated, leaving room for future studies to complement quantitative metrics with qualitative human judgment. Building on our findings, future research should explore two key directions. First, while this study focused on EFL assessments, applying these ADG strategies to non-language domains—such as STEM subjects, social sciences, or professional certification—could test their adaptability in reasoning-intensive contexts. Second, future work may investigate the role of rationale-augmented generation, as an extension of CoS, to produce not only distractors but also accompanying justifications or feedback to support formative assessment. Additionally, adaptive weighting or hybridization of fine-tuning, retrieval, and scaffolded prompting could be explored to optimize distractor generation across different task types and learner populations, supporting broader applications in personalized learning and educational technology systems. Declarations Acknowledgments Disclosure statement No potential conflict of interest was reported by the author(s). Notes on contributors The first author is … The second author is … References Alhazmi, E., Sheng, Q. Z., Zhang, W. E., Zaib, M., & Alhazmi, A. (2024). Distractor generation in multiple-choice tasks: A survey of methods, datasets, and evaluation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 14437–14458). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.emnlp-main.799 Bitew, S. K., Deleu, J., Develder, C., & Demeester, T. (2023). Distractor generation for multiple-choice questions with predictive prompting and large language models. arXiv Preprint arXiv:2307.16338. https://arxiv.org/abs/2307.16338 Brutt-Griffler, J., & Kim, S. (2023). The testing culture and the role of private education. Language, Culture and Curriculum, 36 (3), 293–309. https://doi.org/10.1080/07908318.2022.2148686 Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., Nori, H., Palangi, H., Ribeiro, M. T., & Zhang, Y. (2023). Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv Preprint arXiv:2303.12712. https://arxiv.org/abs/2303.12712 Doughty, J., Wan, Z., Bompelli, A., Qayum, J., Wang, T., Zhang, J., Zheng, Y., Doyle, A., Sridhar, P., Agarwal, A., Bogart, C., Keylor, E., Kultur, C., Savelka, J., & Sakr, M. (2024). A comparative study of AI-generated (GPT-4) and human-crafted MCQs in programming education. In Proceedings of the 26th Australasian Computing Education Conference (pp. 114–123). ACM. https://doi.org/10.1145/3636243.3636256 Feng, W., Lee, J., McNichols, H., Scarlatos, A., Smith, D., Woodhead, S., Ornelas, N., & Lan, A. (2024). Exploring automated distractor generation for math multiple-choice questions via large language models. Findings of the Association for Computational Linguistics: NAACL 2024 , 3067–3082. https://doi.org/10.18653/v1/2024.findings-naacl.193 Gierl, M. J., Bulut, O., Guo, Q., & Zhang, X. (2017). Developing, analyzing, and using distractors for multiple-choice tests in education: A comprehensive review. Review of Educational Research, 87 (6), 1082–1116. https://doi.org/10.3102/0034654317726529 Held, W., & Yang, D. (2023). Shapley head pruning: Identifying and removing interference in multilingual transformers. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (pp. 2416–2427). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.eacl-main.177 Kim, S.-Y. (2024). A corpus-based analysis of variables influencing reading question difficulty on the College Scholastic Ability Test (CSAT) English section. Yung-hap Yeongeo Yeongmunhak [Convergence English and American Literature], 9 (2), 353–378. Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2023). Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916. https://arxiv.org/abs/2205.11916 Kwon, S. K., Lee, M., & Shin, D. (2015). Educational assessment in the Republic of Korea: Lights and shadows of high-stake exam-based education system. Assessment in Education: Principles, Policy & Practice, 24 (1), 60–77. https://doi.org/10.1080/0969594X.2015.1074540 Lai, G., Xie, Q., Liu, H., Yang, Y., & Hovy, E. (2017). RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 785–794). Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1082 Li, X., & Li, J. (2024). AnglE-optimized text embeddings. arXiv preprint arXiv:2309.12871. https://arxiv.org/abs/2309.12871 Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out (pp. 74–81). Association for Computational Linguistics. https://aclanthology.org/W04-1013/ Longpre, S., Yauney, G., Reif, E., Lee, K., Roberts, A., Zoph, B., Zhou, D., Wei, J., Robinson, K., Mimno, D., & Ippolito, D. (2024). A pretrainer’s guide to training data: Measuring the effects of data age, domain coverage, quality, & toxicity. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (pp. 3245–3276). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.naacl-long.179 Luo, H., Deng, Y., Shen, Y., Ng, S.-K., & Chua, T.-S. (2024). Chain-of-Exemplar: Enhancing distractor generation for multimodal educational question generation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 7978–7993). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.432 Ma, C., & Du, X. (2023). POE: Process of elimination for multiple choice reasoning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 4487–4496). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.273 Maity, S., Deroy, A., & Sarkar, S. (2024). A novel multi-stage prompting approach for language agnostic MCQ generation using GPT. arXiv Preprint arXiv:2401.07098. https://arxiv.org/abs/2401.07098 McNichols, H., Feng, W., Lee, J., Scarlatos, A., Smith, D., Woodhead, S., & Lan, A. (2024). Automated distractor and feedback generation for math multiple-choice questions via in-context learning. arXiv Preprint arXiv:2308.03234. https://arxiv.org/abs/2308.03234 Meshkin, H., Zirkle, J., Arabidarrehdor, G., Chaturbedi, A., Chakravartula, S., Mann, J., Thrasher, B., & Li, Z. (2024). Harnessing large language models’ zero-shot and few-shot learning capabilities for regulatory research. Briefings in Bioinformatics, 25 (5), Article bbae354. https://doi.org/10.1093/bib/bbae354 Offerijns, J., Verberne, S., & Verhoef, T. (2020). Better distractions: Transformer-based distractor generation and multiple choice question filtering. arXiv Preprint arXiv:2010.09598. https://arxiv.org/abs/2010.09598 Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318). Association for Computational Linguistics. https://doi.org/10.3115/1073083.1073135 Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 3982–3992). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1410 Rodriguez-Torrealba, R., Garcia-Lopez, E., & Garcia-Cabot, A. (2025). Joint generation of distractors for multiple-choice questions: A text-to-text approach. Computers, Materials & Continua, 83 (2), 1683–1705. https://doi.org/10.32604/cmc.2025.062004 Shaham, U., Elbayad, M., Goswami, V., Levy, O., & Bhosale, S. (2023). Causes and cures for interference in multilingual translation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 15849–15863). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.883 Shahriar, S., Lund, B., Mannuru, N. R., Arshad, M. A., Hayawi, K., Bevara, R. V. K., Mannuru, A., & Batool, L. (2024). Putting GPT-4o to the sword: A comprehensive evaluation of language, vision, speech, and multimodal proficiency. arXiv Preprint arXiv:2407.09519. https://arxiv.org/abs/2407.09519 Sun, K., Yu, D., Chen, J., Yu, D., Choi, Y., & Cardie, C. (2019). DREAM: A challenge data set and models for dialogue-based reading comprehension. Transactions of the Association for Computational Linguistics, 7 , 217–231. https://doi.org/10.1162/tacl_a_00264 Taslimipoor, S., Benedetto, L., Felice, M., & Buttery, P. (2024, May). Distractor generation using generative and discriminative capabilities of transformer-based models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 5052–5063). ELRA and ICCL. https://aclanthology.org/2024.lrec-main.452/ Tran, A., Angelikas, K., Rama, E., Okechukwu, C., Smith, D., & Macneil, S. (2023, October). Generating multiple choice questions for computing courses using large language models. In 2023 IEEE Frontiers in Education Conference (FIE) . https://doi.org/10.1109/FIE58773.2023.10342898 Vygotsky, L. S. (1978). Mind in society: The development of higher psychological processes . Harvard University Press. https://doi.org/10.2307/j.ctvjf9vz4 Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2023). Chain-of-thought prompting elicits reasoning in large language models. arXiv Preprint arXiv:2201.11903. https://arxiv.org/abs/2201.11903 Wood, D. J., Bruner, J. S., & Ross, G. (1976). The role of tutoring in problem solving . Journal of Child Psychology and Psychiatry, 17 , 89–100. http://dx.doi.org/10.1111/j.1469-7610.1976.tb00381.x Xie, Q., Lai, G., Dai, Z., & Hovy, E. (2018). Large-scale cloze test dataset created by teachers. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 2344–2356). Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1257 Yu, H. C., Shih, Y. A., Law, K. M., Hsieh, K., Cheng, Y. C., Ho, H. C., Lin, Z. A., Hsu, W.-C., & Fan, Y.-C. (2024). Enhancing distractor generation for multiple-choice questions with retrieval augmented pretraining and knowledge graph integration. Findings of the Association for Computational Linguistics: ACL 2024 , 11019–11029. https://doi.org/10.18653/v1/2024.findings-acl.655 Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). BERTScore: Evaluating text generation with BERT. arXiv Preprint arXiv:1904.09675. https://arxiv.org/abs/1904.09675 Zu, J., Choi, I., & Hao, J. (2023). Automated distractor generation for fill-in-the-blank items using a prompt-based learning approach. Psychological Test and Assessment Modeling, 65 (1), 55–75. Additional Declarations No competing interests reported. Cite Share Download PDF Status: Posted Version 1 posted You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-6680435","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":483640353,"identity":"fb61d675-8b11-4871-8016-0a6b80674a91","order_by":0,"name":"Chan Young Jung","email":"","orcid":"","institution":"Korea University","correspondingAuthor":false,"prefix":"","firstName":"Chan","middleName":"Young","lastName":"Jung","suffix":""},{"id":483640354,"identity":"aac95197-70cf-4886-b4be-40a574a82781","order_by":1,"name":"Sanghoun Song","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA3ElEQVRIiWNgGAWjYJACZhDBzwPnsxGpRbKHZC0GZ4jVIt9+9vDrgorDecZnDh/78KGGQZ6/gS3tAz4tBmfy0qxnnDlcbHa2LXnmjGMMhjMOsB2egVcLQ46ZMW/b4cRt53mMmXnYGBg3MLA343dY/xuIls39QC1//jHYE9TCcCPH+DFIywbeHmNmxjaGxA0MbIfx6jC48caMmedMeuKMM8eSGXv7JJJnHGZLJuCwHOPPPBXWif09yYcZfnyzse1vbzPG7zBgNEggcSSg0YQfMOONhVEwCkbBKBgFDAC+RUSc+FFD6wAAAABJRU5ErkJggg==","orcid":"","institution":"Korea University","correspondingAuthor":true,"prefix":"","firstName":"Sanghoun","middleName":"","lastName":"Song","suffix":""}],"badges":[],"createdAt":"2025-05-16 11:38:26","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-6680435/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6680435/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":91627153,"identity":"e6faec16-5888-42f9-ba53-5e59c2767754","added_by":"auto","created_at":"2025-09-18 12:21:51","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":106990,"visible":true,"origin":"","legend":"\u003cp\u003eTraining loss over steps in our GPT-4.1 fine-tuning process.\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-6680435/v1/f37774e538912d6cc0eec9d2.png"},{"id":91627152,"identity":"24a5562a-a051-4204-b138-fada2d8cae7e","added_by":"auto","created_at":"2025-09-18 12:21:51","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":30003,"visible":true,"origin":"","legend":"\u003cp\u003eDistribution of per-item semantic alignment scores by approach, visualized as bar plots.\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-6680435/v1/492e213303a11d365e4cb20a.png"},{"id":91626453,"identity":"78244030-3a22-4e52-b101-5025e17d718c","added_by":"auto","created_at":"2025-09-18 12:13:51","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":56856,"visible":true,"origin":"","legend":"\u003cp\u003eDistribution of per-item BLEU scores for lexical alignment based on approach.\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-6680435/v1/d16372370ac5b37de4d89685.png"},{"id":91626455,"identity":"63c864d9-cd49-410a-a146-aaae42e56cae","added_by":"auto","created_at":"2025-09-18 12:13:51","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":13757,"visible":true,"origin":"","legend":"\u003cp\u003eDistribution of per-item ROGUE-L scores for lexical alignment based on approach.\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-6680435/v1/cf9d5373f998964d6e8d7e2a.png"},{"id":104965522,"identity":"8d421f49-0ff7-42f4-b066-a121607452e0","added_by":"auto","created_at":"2026-03-19 09:42:36","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1304253,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6680435/v1/fadd9716-7438-4958-bcad-2f3976f61624.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Optimizing GPT-Based Distractor Generation for the Korean CSAT English Exam","fulltext":[{"header":"1 Introduction","content":"\u003cp\u003eThe College Scholastic Ability Test (CSAT; \u003cem\u003eSuneung\u003c/em\u003e) is South Korea\u0026rsquo;s high-stakes, standardized college entrance examination that plays a pivotal role in determining students\u0026rsquo; access to higher education and career opportunities (Kwon, Lee \u0026amp; Shin, \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2015\u003c/span\u003e). English holds particular significance among its core subjects, with educational policies assigning it comparable weight to Korean and mathematics (Brutt-Griffler \u0026amp; Kim, \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2023\u003c/span\u003e), intensifying competition and encouraging widespread reliance on private education. Given the CSAT\u0026rsquo;s societal importance and recent advances in Generative Pre-trained Transformers (GPT) (Bubeck et al., \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Shahriar et al., \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e2024\u003c/span\u003e), we conduct a comparative, experimental analysis of GPT-based automatic distractor generation (ADG) techniques tailored to the CSAT\u0026rsquo;s English section.\u003c/p\u003e\u003cp\u003eDistractors refer to incorrect but plausible answer choices presented alongside the correct answer in multiple-choice questions (MCQs). They play a crucial role in diagnosing the misconceptions of test takers and their areas for improvement (Gierl, Bulut, Guo, \u0026amp; Zhang, \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2017\u003c/span\u003e). Given the labor-intensive nature of manual distractor generation, scalable ADG methods using PLMs such as GPT have gained popularity in recent years (Doughty et al., \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2024\u003c/span\u003e; Feng et al., \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2024\u003c/span\u003e; Maity, Deroy, \u0026amp; Sarkar, \u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e2024\u003c/span\u003e; Tran et al., 2023; Zu, Choi, \u0026amp; Hao, \u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). Various ADG strategies have been assessed on English exam datasets, including CLOTH (Xie, Lai, Dai, \u0026amp; Hovy, \u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e2018\u003c/span\u003e), DREAM (Sun et al., \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e2019\u003c/span\u003e), and RACE (Lai, Xie, Liu, Yang, \u0026amp; Hovy, \u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e2017\u003c/span\u003e); however, the assessment of these ADG strategies has not been extended to the English section of the CSAT.\u003c/p\u003e\u003cp\u003eThree main directions have emerged in PLM-based ADG: \u003cem\u003efine-tuning, in-context learning (ICL), and template-based prompting\u003c/em\u003e (Alhazmi, Sheng, Zhang, Zaib, \u0026amp; Alhazmi, \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2024\u003c/span\u003e). \u003cem\u003eFine-tuning\u003c/em\u003e a PLM adapts the model to perform better in a domain or task by training it further on a task-specific dataset. Offerijns, Verberne, and Verhoef (\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2020\u003c/span\u003e) fine-tuned GPT-2 using the RACE dataset to generate three semantically correct and educationally relevant distractors for a given question and context. Conversely, Taslimipoor, Benedetto, Felice, and Buttery (2024) and Yu et al. (\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) fine-tuned the T5 model for cloze tasks and cross-domain MCQs. The latter two approaches are prompting-based\u0026mdash;specialized prompts guide the PLM\u0026rsquo;s behavior at inference time without modifying its parameters. \u003cem\u003eICL\u003c/em\u003e uses examples embedded in the input prompt. Bitew, Deleu, Develder, and Demeester (\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) used a BERT-based question similarity model to retrieve a ranked list of sample questions similar to the target test question; however, McNichols et al. (2023) used the K-nearest neighbor (K-NN) retrieval for mathematics MCQs. Meanwhile, \u003cem\u003etemplate-based prompting\u003c/em\u003e, which relies on carefully designed prompts without using training examples, is also promising in ADG (Doughty et al., \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2024\u003c/span\u003e). The three approaches differ in the amount of ground-truth data required: \u003cem\u003efine-tuning\u003c/em\u003e requires substantial training data, \u003cem\u003eICL\u003c/em\u003e uses a few curated examples, and \u003cem\u003etemplate-based prompting\u003c/em\u003e relies solely on the PLM\u0026rsquo;s innate reasoning capabilities.\u003c/p\u003e\u003cp\u003eEducational specialists developed the English section of the CSAT to assess a specific population of EFL learners\u0026mdash;namely, senior high school students (aged 17\u0026ndash;18) preparing for university admission in Korea. Consequently, the exam exhibits consistent question formats, target skills, and stylistic features across its official and mock administrations. Our study focuses on high-difficulty CSAT items that hinge on English-language distractors (as detailed in Section \u003cspan refid=\"Sec9\" class=\"InternalRef\"\u003e3.1\u003c/span\u003e). To address the CSAT\u0026rsquo;s distinctive challenges, we evaluate three tailored approaches, each aligned with one of the three main directions in PLM-based ADG.\u003c/p\u003e\u003cp\u003eOur contributions are as follows. First, we propose a novel metric, \u003cem\u003eDistractor Attractiveness Rate (DAR)\u003c/em\u003e, which incorporates entropy and empirically observed answer selection rates from CSAT administrations. DAR is an alternative to similarity-based K-NN retrieval strategies in ADG. Second, we introduce \u003cem\u003eChain-of-Scaffolds (CoS)\u003c/em\u003e, a pedagogically motivated adaptation of \u003cem\u003eChain-of-Thought (CoT)\u003c/em\u003e prompting (Kojima, Gu, Reid, Matsuo, \u0026amp; Iwasawa, \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) that operationalizes educational scaffolding principles for ADG in high-stakes language assessment. Third, we quantitatively demonstrate the respective strengths of three ADG approaches for assessment design\u0026mdash;respectively grounded in supervised fine-tuning, ICL, and template-based prompting\u0026mdash;by measuring their semantic and lexical alignment with ground-truth CSAT distractors, and through a model-internal check for plausibility. Our results reveal that supervised fine-tuning consistently outperforms prompting approaches on automatic evaluation metrics, while Chain-of-Scaffolds and ICL with DAR-based retrieval outperform the GPT-4.1 baseline respectively in semantic alignment and plausibility with minimal reliance on training data.\u003c/p\u003e"},{"header":"2 Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\u003ch2\u003e2.1 Task Definition\u003c/h2\u003e\u003cp\u003eWe formally define a CSAT MCQ \u003cem\u003eQ\u003c/em\u003e as \u003cem\u003eQ = {t, s, c, D, R}\u003c/em\u003e. Each MCQ comprises a question type \u003cem\u003et\u003c/em\u003e (where \u003cem\u003et\u003c/em\u003e \u0026isin; \u003cem\u003eT\u003c/em\u003e, |\u003cem\u003eT\u003c/em\u003e| = 6), a stem \u003cem\u003es\u003c/em\u003e, a correct answer \u003cem\u003ec\u003c/em\u003e, a set of distractors \u003cem\u003eD\u003c/em\u003e (|\u003cem\u003eD\u003c/em\u003e| = 4), and a set of selection rates \u003cem\u003eR\u003c/em\u003e where \u003cem\u003er\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e \u0026isin; \u003cem\u003eR\u003c/em\u003e and \u003cem\u003er\u003c/em\u003e\u003csub\u003e\u003cem\u003ec\u003c/em\u003e\u003c/sub\u003e respectively correspond to the reported selection rate of \u003cem\u003ed\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e \u0026isin; \u003cem\u003eD\u003c/em\u003e and \u003cem\u003ec\u003c/em\u003e among test takers in official CSAT and senior-year mock CSAT administrations. All questions\u0026mdash;originally written in Korean\u0026mdash;were replaced with one of six standardized English question templates at the time of data collection (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). The questions were thus categorized by their type to minimize the influence of multilingual interference prevalent in transformer-based language models (Held \u0026amp; Yang, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Shaham, Elbayad, Goswami, Levy, \u0026amp; Bhosale, \u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). Accordingly, each question stem was also reformatted, where necessary, to align with the question templates. The resulting \u003cem\u003et, s, c, d\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e components are all text sequences in English, each similar in length and style to equivalents across all CSAT administrations.\u003c/p\u003e\u003cp\u003eThe objective is to generate a new set of distractors \u003cem\u003eD\u003c/em\u003e\u003csub\u003e\u003cem\u003egen\u003c/em\u003e\u003c/sub\u003e (|\u003cem\u003eD\u003c/em\u003e\u003csub\u003e\u003cem\u003egen\u003c/em\u003e\u003c/sub\u003e| = 4), given (\u003cem\u003et, s, c\u003c/em\u003e) and optionally \u003cem\u003er\u003c/em\u003e\u003csub\u003e\u003cem\u003e1\u003c/em\u003e\u003c/sub\u003e\u003cem\u003e\u0026ndash;r\u003c/em\u003e\u003csub\u003e\u003cem\u003e4\u003c/em\u003e\u003c/sub\u003e, such that the generated distractors \u003cem\u003ed\u003c/em\u003e\u003csub\u003e\u003cem\u003egen_1\u003c/em\u003e\u003c/sub\u003e\u003cem\u003e\u0026ndash;d\u003c/em\u003e\u003csub\u003e\u003cem\u003egen_4\u003c/em\u003e\u003c/sub\u003e exhibit semantic and lexical alignment to the ground-truth distractors \u003cem\u003ed\u003c/em\u003e\u003csub\u003e\u003cem\u003e1\u003c/em\u003e\u003c/sub\u003e\u003cem\u003e\u0026ndash;d\u003c/em\u003e\u003csub\u003e\u003cem\u003e4\u003c/em\u003e\u003c/sub\u003e. Aggregating individually plausible distractors does not ensure the overall effectiveness of the set; thus, we adopt a joint generation strategy for \u003cem\u003eD\u003c/em\u003e\u003csub\u003e\u003cem\u003egen\u003c/em\u003e\u003c/sub\u003e, following the approach of Rodriguez-Torrealba, Garcia-Lopez, and Garcia-Cabot (\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e2025\u003c/span\u003e).\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eSix CSAT question types with distractors written in English, not embedded in the question stem. Note that the corresponding question numbers may vary slightly across test administrations prior to September 2019.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eEnglish Question Template\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eOriginal Question\u003c/p\u003e\u003cp\u003e(Korean/English Translation)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCSAT Question Numbers\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eChange in sentiment\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e다음 글에 드러난 _______ 의 심경 변화로 가장 적절한 것은?\u003c/p\u003e\u003cp\u003e(Which is the most appropriate description of the change in _______\u0026rsquo;s sentiment in the following passage?)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e19\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMeaning of [CTXT] in context\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e밑줄 친 _______ 이 다음 글에서 의미하는 바로 가장 적절한 것은?\u003c/p\u003e\u003cp\u003e(Which is the most appropriate interpretation of the underlined _______ in the following passage?)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e21\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eTopic of the passage\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e다음 글의 주제로 가장 적절한 것은?\u003c/p\u003e\u003cp\u003e(Which is the most appropriate topic of the following passage?)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e23\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eTitle for the passage\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e{다음 글의, 윗글의} 제목으로 가장 적절한 것은?\u003c/p\u003e\u003cp\u003e(Which is the most appropriate title for the {following, above} passage?)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e24 41\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFill in the [BLANK]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e다음 빈칸에 들어갈 말로 가장 적절한 것을 고르시오.\u003c/p\u003e\u003cp\u003e(Choose the most appropriate phrase to fill in the following blank.)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e31 32 33 34\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePair of (A), (B) that best completes the [SUMMARY] to the passage\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e다음 글의 내용을 한 문장으로 요약하고자 한다. 빈칸 (A), (B)에 들어갈 말로 가장 적절한 것은?\u003c/p\u003e\u003cp\u003e(To summarize the following passage in one sentence, which words best fit blanks (A) and (B)?)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e40\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\u003ch2\u003e2.2 Approaches\u003c/h2\u003e\u003cp\u003eAcross all three approaches, we used the 2025-04-14 snapshot of GPT-4.1 as our base PLM. As of April 2025, GPT-4.1 was the latest version of OpenAI\u0026rsquo;s flagship chat model that supported fine-tuning via the application programming interface. Having used GPT-4o for preliminary experiments, we observed that GPT-4.1 outperformed GPT-4o across all three approaches, as well as the zero-shot baseline. Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e provides an overview of the three approaches.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eOverview of GPT-based distractor generation approaches used in this study. All methods were use GPT-4.1 (2025-04-14 snapshot) as the base model.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"4\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eApproach\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eTheoretical Basis\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eTraining /\u003c/p\u003e\u003cp\u003eExample Data Used\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eBase Model\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eSupervised Fine Tuning\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSupervised Learning\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e(\u003cem\u003et, s, c, D\u003c/em\u003e)\u003c/p\u003e\u003cp\u003e(n\u0026thinsp;=\u0026thinsp;419)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eGPT-4.1-\u003c/p\u003e\u003cp\u003e2025-04-14\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eIn-Context Learning with DAR-Based Retrieval\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eIn-Context Learning (Few-Shot), \u003c/p\u003e\u003cp\u003eEntropy-Based Selection\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e(\u003cem\u003et, s, c, D, R\u003c/em\u003e)\u003c/p\u003e\u003cp\u003e(n\u0026thinsp;=\u0026thinsp;30;\u003c/p\u003e\u003cp\u003e5 per question type)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eGPT-4.1-\u003c/p\u003e\u003cp\u003e2025-04-14\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eChain-of-Scaffolds\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eChain-of-Thought Prompting, \u003c/p\u003e\u003cp\u003eScaffolding\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eNone\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eGPT-4.1-\u003c/p\u003e\u003cp\u003e2025-04-14\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cdiv id=\"Sec5\" class=\"Section3\"\u003e\u003ch2\u003e2.2.1 Supervised Fine Tuning\u003c/h2\u003e\u003cp\u003eThe first approach involves adopting supervised fine-tuning to adapt the base PLM for the CSAT distractor generation task. The model is trained to jointly generate four distractors conditioned on a triplet input\u0026mdash;question type, stem, and correct answer\u0026mdash;using annotated examples from past CSAT items. We expose the model to a diverse set of verified distractor examples across all six question types to align its generation behavior with the implicit pedagogical and stylistic norms of the CSAT. The generalization capabilities of the base PLM are thus leveraged while its parameters are adjusted for task-specific performance. Section \u003cspan refid=\"Sec8\" class=\"InternalRef\"\u003e3\u003c/span\u003e provides further details on dataset composition and training configuration.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec6\" class=\"Section3\"\u003e\u003ch2\u003e2.2.2 In-Context Learning with DAR-Based Retrieval\u003c/h2\u003e\u003cp\u003eThe second approach involves building on the ICL framework, where examples are provided directly in the prompt to guide the model\u0026rsquo;s generation behavior at inference time. Various strategies for retrieving items similar to the target question have been explored in prior ADG research, including BERT-based ranking models (Bitew et al., \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) and cosine similarity between Angle vectors representing encoded textual content, answers, and questions (Li \u0026amp; Li, \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2024\u003c/span\u003e; Luo, Deng, Shen, Ng, \u0026amp; Chua, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e2024\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eWe introduce an ICL approach grounded in DAR, a novel metric based on entropy that quantifies the collective plausibility of distractors using real-world selection rates \u003cem\u003eR\u003c/em\u003e from past CSAT administrations. In this way, this approach ensures that the retrieved distractor sets are contextually similar to the target question and sufficiently plausible. We retrieve five distractor sets with the highest DAR scores within the same question type for each question in the test set from the training set and concatenate them into the prompt at inference time. The number of sets retrieved (n\u0026thinsp;=\u0026thinsp;5) was determined empirically through trial and error.\u003c/p\u003e\u003cp\u003eWhereas the supervised fine-tuning approach exposes the model to a broad distribution of training examples to internalize generalizable patterns, this retrieval-based approach prioritizes a smaller set of highly effective examples for each question type, offering the model more targeted and high-impact input at inference time.\u003c/p\u003e\u003cp\u003eOur definition of DAR incorporates three core heuristics: (i) distractor sets with low correctness rates are favored, (ii) distractor selections that are evenly distributed among test-takers indicate the comparable plausibility of all options, and (iii) sets in which a single distractor disproportionately dominates\u0026mdash;suggesting that the remaining distractors serve merely as placeholders\u0026mdash;are penalized. The first two heuristics are applicable to the CSAT, where every item includes four distractors. The third is motivated by Ma and Du (\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e2023\u003c/span\u003e), who demonstrate that prompting PLMs to simulate process-of-elimination reasoning\u0026mdash;central to human test-taking\u0026mdash;enhances performance on MCQs. Their findings suggest that PLMs are sensitive to subtle plausibility gradients even among incorrect options, reinforcing the need to avoid exemplar sets that contain obviously implausible distractors.\u003c/p\u003e\u003cp\u003eHence, DAR is generically for any distractor set where |\u003cem\u003eD\u003c/em\u003e| \u0026ge; 2 as follows:\u003cdiv id=\"Equa\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equa\" name=\"EquationSource\"\u003e\n$$\\:DAR\\left(Q\\right)=\\left(1-{r}_{c}\\right)\\times\\:\\left(\\frac{-\\sum\\:_{{d}_{i}\\in\\:D}\\stackrel{\\sim}{{r}_{i}}{\\text{log}}_{2}\\stackrel{\\sim}{{r}_{i}}}{{\\text{log}}_{2}\\left|D\\right|}\\times\\:\\frac{1-\\text{max}\\stackrel{\\sim}{{r}_{i}}}{1-\\frac{1}{\\left|D\\right|}}\\right)$$\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Equb\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equb\" name=\"EquationSource\"\u003e\n$$\\:where\\hspace{1em}\\stackrel{\\sim}{{r}_{i}}=\\frac{{r}_{i}}{1-{r}_{c}}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eNote that the range of DAR is [0,1]: a correctness rate of 100% yields a DAR of 0, while the maximum value of 1 applies when no test-taker selects the correct answer and distractor selections are perfectly balanced. Regarding CSAT English MCQs, where |\u003cem\u003eD\u003c/em\u003e| = 4, the above formula is instantiated as follows:\u003cdiv id=\"Equc\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equc\" name=\"EquationSource\"\u003e\n$$\\:DAR\\left(Q\\right)=\\left(1-{r}_{c}\\right)\\times\\:\\left(\\frac{-\\sum\\:_{i=1}^{4}\\stackrel{\\sim}{{r}_{i}}{\\text{log}}_{2}\\stackrel{\\sim}{{r}_{i}}}{2}\\times\\:\\frac{1-\\text{max}\\{\\stackrel{\\sim}{{r}_{1}},\\stackrel{\\sim}{{r}_{2}},\\stackrel{\\sim}{{r}_{3}},\\stackrel{\\sim}{{r}_{4}}\\}}{0.75}\\right)$$\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Equd\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equd\" name=\"EquationSource\"\u003e\n$$\\:where\\hspace{1em}\\stackrel{\\sim}{{r}_{i}}=\\frac{{r}_{i}}{1-{r}_{c}}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec7\" class=\"Section3\"\u003e\u003ch2\u003e2.2.3 Chain-of-Scaffolds\u003c/h2\u003e\u003cp\u003eOur final approach, CoS, is a multi-step prompting strategy that generates pedagogically effective distractors without relying on any training examples. The approach involves exploiting the reported utility of CoT prompting (Kojima et al., \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Wei et al., \u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) to decompose distractor generation into four steps: i) correct answer rationale generation, ii) misconception generation, iii) distractor generation based on misconceptions, and iv) syntactic and lexical refinement. The primary objective is to stimulate internal reasoning pathways related to commonly cited difficulty factors among CSAT test-takers by eliciting intermediary outputs from the base PLM at each stage. These difficulty factors are i) processing long phrases before the main verb, ii) handling negation, conjunctions, and connectives, and iii) lexical diversity and unfamiliar vocabulary (Kim, \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e2024\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eThis strategy is theoretically grounded in Vygotsky\u0026rsquo;s concept of the \u003cem\u003eZone of Proximal Development\u003c/em\u003e, operationalized through the concept of \u003cem\u003escaffolding\u003c/em\u003e (Vygotsky, \u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e1978\u003c/span\u003e; Wood, Bruner, \u0026amp; Ross, \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e1976\u003c/span\u003e). Each step guides the model beyond its baseline capability, encouraging it to utilize real-world linguistic cues rather than self-detected patterns.\u003c/p\u003e\u003cp\u003eThe utility of zero-shot \u003cem\u003eChain-of-X\u003c/em\u003e prompting strategies has been verified in recent work. Maity et al. (\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) reported improvements in grammaticality, answerability, and difficulty of distractors generated by a multi-stage prompting approach incorporating multilingual paraphrasing, keyword extraction, and question generation. Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e provides an overview of the input prompt and expected output format at each generation stage within the CoS framework. Every stage operates in a zero-shot manner (i.e., without training examples).\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eAn overview of our \u003cem\u003eChain-of-Scaffolds\u003c/em\u003e framework. The corresponding content for each test item replaces the placeholders.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGeneration Stage\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eInput Prompt\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eExpected Output\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eCorrect Answer Rationale Generation\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eYou are an expert in English education for Korean CSAT preparation.\u003c/p\u003e\u003cp\u003eQuestion: \u003cem\u003e{question}\u003c/em\u003e\u003c/p\u003e\u003cp\u003eStem: \u003cem\u003e{stem}\u003c/em\u003e\u003c/p\u003e\u003cp\u003eCorrect Answer: \u003cem\u003e{correct_answer}\u003c/em\u003e\u003c/p\u003e\u003cp\u003eTask: Briefly explain in one sentence why the provided correct answer is the best choice.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e\u003cem\u003e{rationale}\u003c/em\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eMisconception Generation\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eYou are analyzing test-taker errors based on known CSAT difficulties:\u003c/p\u003e\u003cp\u003e1. Processing long phrases before the main verb\u003c/p\u003e\u003cp\u003e2. Handling negations/conjunctions/connectives\u003c/p\u003e\u003cp\u003e3. Lexical diversity and unfamiliar vocabulary\u003c/p\u003e\u003cp\u003eQuestion: \u003cem\u003e{question}\u003c/em\u003e\u003c/p\u003e\u003cp\u003eStem: \u003cem\u003e{stem}\u003c/em\u003e\u003c/p\u003e\u003cp\u003eCorrect Answer: \u003cem\u003e{correct_answer}\u003c/em\u003e\u003c/p\u003e\u003cp\u003eCorrect Answer Rationale: \u003cem\u003e{rationale}\u003c/em\u003e\u003c/p\u003e\u003cp\u003eList four misconceptions, one for each of the three difficulties (repeat one if needed). Format each as:\u003c/p\u003e\u003cp\u003e- [Difficulty X]: [Misconception sentence]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e\u003cem\u003e{misconceptions}\u003c/em\u003e, represented as\u003c/p\u003e\u003cp\u003e[\u003cem\u003e{Difficulty X}\u003c/em\u003e]: [\u003cem\u003e{Misconception sentence}\u003c/em\u003e]\u003c/p\u003e\u003cp\u003e(n\u0026thinsp;=\u0026thinsp;4)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eDistractor Generation\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eGenerate one distractor per misconception below. Each distractor should:\u003c/p\u003e\u003cp\u003e- Directly reflect the reasoning error in the misconception.\u003c/p\u003e\u003cp\u003e- Be grammatically and contextually appropriate.\u003c/p\u003e\u003cp\u003e- Be clearly incorrect but attractive to a test-taker making the given mistake.\u003c/p\u003e\u003cp\u003eList only the four distractors. Do NOT number or explain them.\u003c/p\u003e\u003cp\u003eQuestion: \u003cem\u003e{question}\u003c/em\u003e\u003c/p\u003e\u003cp\u003eStem: \u003cem\u003e{stem}\u003c/em\u003e\u003c/p\u003e\u003cp\u003eCorrect Answer: \u003cem\u003e{correct_answer}\u003c/em\u003e\u003c/p\u003e\u003cp\u003eMisconceptions:\u003c/p\u003e\u003cp\u003e\u003cem\u003e{misconceptions}\u003c/em\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e\u003cem\u003e{initial_distractors}\u003c/em\u003e\u003c/p\u003e\u003cp\u003e(n\u0026thinsp;=\u0026thinsp;4)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eSyntactic and Lexical Refinement\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eRefine the following distractors to ensure they:\u003c/p\u003e\u003cp\u003e- Are similar in length, vocabulary difficulty, and syntactic complexity to the correct answer.\u003c/p\u003e\u003cp\u003e- Do not resemble each other or the correct answer too closely.\u003c/p\u003e\u003cp\u003eOutput exactly four distractors, one per line. Do NOT add explanations or numbering.\u003c/p\u003e\u003cp\u003eCorrect Answer: \u003cem\u003e{correct_answer}\u003c/em\u003e\u003c/p\u003e\u003cp\u003eInitial Distractors:\u003c/p\u003e\u003cp\u003e\u003cem\u003e{initial_distractors}\u003c/em\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e\u003cem\u003e{final_distractors}\u003c/em\u003e\u003c/p\u003e\u003cp\u003e(n\u0026thinsp;=\u0026thinsp;4)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e"},{"header":"3 Experiments","content":"\u003cp\u003e\u003cstrong\u003e3.1 CSAT Dataset\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eOur dataset comprises 499 MCQs sourced from the English section of the CSAT and its official mock examinations, which are administered six times annually in preparation for the November CSAT. The Korea Institute of Curriculum and Evaluation administers the CSAT and the June and September mock exams. Conversely, Regional Offices of Education administer the remaining mock exams in rotation across Korea. Where available, item-level response rate distributions for the top 15 most frequently missed questions from each exam are retrieved from the Educational Broadcasting System, a publicly accessible online learning platform.[2]\u003c/p\u003e\n\u003cp\u003eWe restricted our data source to exams administered since March 2018 as earlier versions of the exam employed a now-discontinued relative evaluation system and differed in their question type composition due to curricular revisions. We excluded questions that contained Korean distractors to avoid multilingual interference. Furthermore, the question types for which distractors could be trivially generated\u0026mdash;such as those requiring paragraph reordering or factual matching\u0026mdash;were also excluded. The resulting dataset comprised six question types across 50 administrations (Table 4). As described in Section 2, all the questions were translated into one of six English templates for consistency. When necessary, stems that originally contained intentional grammatical errors\u0026mdash;owing to their linkage with other questions\u0026mdash;were corrected to ensure standalone interpretability.\u003c/p\u003e\n\u003cp\u003eTo approximate an 80:20 train-to-test ratio, we designated the 419 MCQs from administrations between 2018 and 2023 as the training set and the 80 MCQs between 2024 and March 2025 administrations as the test set. We deliberately avoided a randomized train:test split to preserve the proportional distribution of question types because CSAT question numbers are mapped to specific question types. Furthermore, we assume that recently administered items were less likely to have appeared in the pre-training data of language models, given that data age is associated with performance degradation in language models (Longpre et al., 2024).\u003c/p\u003e\n\u003cp\u003eTable 4: Overview of CSAT Dataset curated in this study. Items in the training set were used as training data for supervised fine-tuning and the candidate pool for exemplar retrieval in ICL.\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"624\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003e\u003cstrong\u003eQuestion Type\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003e\u003cstrong\u003eItems in Training Set\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003e\u003cstrong\u003eItems in Test Set\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003e\u003cstrong\u003eTotal\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003e\u003cstrong\u003eChange in sentiment\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e40\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e48\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003e\u003cstrong\u003eMeaning of [CTXT] in context\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e39\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e47\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003e\u003cstrong\u003eTopic of the passage\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e42\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e50\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003e\u003cstrong\u003eTitle\u0026nbsp;\u003c/strong\u003e\u003cstrong\u003efor\u003c/strong\u003e\u003cstrong\u003e\u0026nbsp;the passage\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e85\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e16\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e101\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003e\u003cstrong\u003eFill in the [BLANK]\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e171\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e32\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e203\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003e\u003cstrong\u003ePair of (A)\u003c/strong\u003e\u003cstrong\u003e,\u003c/strong\u003e\u003cstrong\u003e\u0026nbsp;(B) that best completes the [SUMMARY] to the passage\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e42\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e50\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003e3.2 Baselines\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe used zero-shot GPT (GPT-4.1-2025-04-14, consistent with our three proposed approaches), following prior work by Bitew et al. (2023), McNichols et al. (2023), and Taslimipoor et al. (2024). Zero-shot prompting with off-the-shelf GPT models provides a robust benchmark across diverse natural language processing tasks (Meshkin et al., 2024). Further, it represents a practical and accessible solution for non-expert users, such as educators, curriculum designers, and test developers. We evaluate this baseline against our three proposed approaches in terms of semantic and lexical alignment with ground-truth distractors, defined as the official distractors used in the 2024\u0026ndash;2025 test administrations within our held-out test set.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e3.3 Evaluation Metrics\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe evaluated the quality of generated distractors along two key dimensions: semantic and lexical alignment to the ground-truth distractors. This two-dimensional approach enables a comprehensive assessment of the efficient approximation of distractor characteristics in authentic test items by each system. All the metrics were computed on a per-question basis by comparing each set of four generated distractors with the corresponding set of four ground-truth distractors. We report seven automatic evaluation metrics: SBERT-based Cosine Similarity, BERTScore (for semantic alignment), BLEU-1 through BLEU-4, and ROGUE-L (for lexical alignment). Further details are provided in sections 3.3.1 and 3.3.2.\u003c/p\u003e\n\u003cp\u003eTo complement these alignment-based automatic metrics, we also prompted our base model to select the best answer and identify all plausible distractors, given the correct answer and distractor set. This diagnostic procedure was repeated for the ground-truth distractors, baseline, and the three proposed approaches because GPT-4.1\u0026rsquo;s ability to discriminate between correct, plausible, and implausible options has not been evaluated previously on the CSAT. Therefore, we posit that an effective set of distractors should enable the model to identify the correct answer while also flagging one or more distractors as plausible. This approach serves as a proxy for evaluating pragmatic and syntactic alignment, the latter of which is difficult to otherwise quantify given the short length of CSAT answer choices.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e3.3.1 Semantic Alignment\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe adopted two metrics to assess semantic alignment at the sentence and token levels. To evaluate sentence-level semantic alignment, we use Sentence-BERT (all-MiniLM-L6-v2 (Reimers \u0026amp; Gurevych, 2019)), a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) model which produces semantically rich sentence embeddings. We computed a 4 \u0026times; 4 cosine similarity matrix for each set of generated distractors, comparing the embeddings of the generated and ground-truth distractors. Subsequently, we calculated the best-match average, defined as the mean of the highest cosine similarity scores for each generated distractor. Although cosine similarity theoretically ranges from -1 to 1, practical values typically fall between 0 and 1, as SBERT embeddings tend to lie in the non-negative region of the embedding space.\u003c/p\u003e\n\u003cp\u003eTo evaluate token-level semantic alignment, we used BERTScore (Zhang, Kishore, Wu, Weinberger, \u0026amp; Artzi, 2020), which compares contextual embeddings from a pre-trained BERT model. BERTScore values range from 0 to 1. Higher scores of cosine similarity and BERTScore indicate stronger semantic alignment. We report both metrics as percentages\u0026mdash;i.e., original values multiplied by 100\u0026mdash;for ease of interpretation.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e3.3.2 Lexical Alignment\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe first reported BLEU-1 through BLEU-4 to assess lexical alignment, following Taslimipoor et al. (2024). BLEU (Bilingual Evaluation Understudy; Papineni, Roukos, Ward, \u0026amp; Zhu, 2002) is a precision-oriented metric that evaluates surface-level lexical alignment by calculating n-gram overlap between the candidate (i.e., generated distractors) and reference (i.e., ground-truth distractors) sequences. We calculated the BLEU scores using the Natural Language Toolkit implementation, with smoothing techniques applied to mitigate zero-precision scores in short sequences. BLEU-1 considers unigram overlaps, while BLEU-2 through BLEU-4 measure contextual alignment over longer n-grams.\u003c/p\u003e\n\u003cp\u003eWe also reported ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation (Lin, 2004)), a recall-focused metric that evaluates lexical alignment using the longest common subsequence between the candidate and reference sequences, to better capture partial matches that preserve semantic cohesion. For each generated distractor, we calculated the F1 version of ROGUE-L, which balances precision and recall, and reported the highest score across all references. The BLEU and ROGUE-L scores range between 0 and 1, with a higher value indicating greater lexical alignment. As with semantic metrics, we reported BLEU-1 through BLEU-4 and ROGUE-L as percentages for interpretability.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e3.4 Implementation Details\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eIn this section, we report key implementation details to support the replicability of our research. We collected our training and test data by downloading available PDF versions of past test questions through the EBSi website. We systematically extracted the relevant questions, stems, and distractors using the \u003cem\u003epdfplumber\u003c/em\u003e Python library and regular expressions. PDFs unsuccessfully retrieved through this procedure were manually curated. The answer selection rates were also retrieved from the same source. We used \u003cem\u003epandas\u003c/em\u003e to convert question sets into data frames and appended computed DAR values using Microsoft Excel macros.\u003c/p\u003e\n\u003cp\u003eWe fine-tuned the GPT-4.1-2025-04.14 model using OpenAI\u0026rsquo;s default fine-tuning pipeline. A \u003cem\u003etemperature\u0026nbsp;\u003c/em\u003eof 0.0 was selected to ensure deterministic output across all approaches. The model was trained on 419 \u003cem\u003eJSONL\u003c/em\u003e-formatted training instances over three epochs. The \u003cem\u003emax_tokens\u003c/em\u003e parameter was set to 250 and\u003cem\u003e\u0026nbsp;top_p\u003c/em\u003e to 1. The training instances were randomly shuffled prior to training to mitigate overfitting on early examples. The fine-tuning process was completed in 21 minutes, and Figure 1 visualizes the training loss over steps.\u003c/p\u003e\n\u003cp\u003eWe used the following system prompt for the supervised fine-tuning and ICL approaches: \u0026ldquo;\u003cem\u003eGenerate four distractors (i.e., plausible but incorrect answer choices) for the given multiple\u003c/em\u003e\u003cem\u003echoice question. Do not include any indices\u003c/em\u003e.\u0026rdquo; The questions, stems, and correct answers from the test set were inserted into the user prompt. In the ICL setup, retrieved examples were also appended to the user prompt under the header [EXAMPLES]. Section 2.2.3 describes the prompt designs for the CoS approach.\u003c/p\u003e\n\u003cp\u003e[2] https://www.ebsi.co.kr/ebs/xip/xipa/retrievePastGrdCutWrongAnswerRate.ebs?tab=1\u003c/p\u003e"},{"header":"4 Results and Discussion","content":"\u003cdiv id=\"Sec16\" class=\"Section2\"\u003e\n \u003ch2\u003e4.1 Semantic and Lexical Alignment\u003c/h2\u003e\n \u003cp\u003eTable \u003cspan class=\"InternalRef\"\u003e5\u003c/span\u003e details the semantic and lexical alignment results for the three proposed approaches compared to the zero-shot baseline. Supervised fine-tuning consistently outperforms the baseline across all metrics, achieving substantial improvements in SBERT cosine similarity, BLEU, and ROUGE-L scores. These findings indicate that exposure to all 419 training instances enabled GPT-4.1 to internalize the semantic and lexical features of ground-truth CSAT distractors and that patterns observed in exams between 2018 and 2023 remained robustly predictive of distractor style in more recent items.\u003c/p\u003e\n \u003cp\u003eThe alternative approaches, which relied on smaller amounts of relevant data or GPT-4.1\u0026rsquo;s scaffolded reasoning abilities, produced more nuanced results. ICL with DAR-based retrieval revealed modest gains in semantic alignment but underperformed the baseline in lexical alignment. This suggests that a few high-quality examples may enhance the model\u0026rsquo;s contextual understanding of distractors\u0026rsquo; meaning; however, they may not sufficiently improve surface-level lexical similarity. Notably, the CoS approach, likely due to incorporating correct answer rationale generation and misconception generation stages, achieved SBERT cosine similarity and BERTScore-F1 scores comparable to those of supervised fine-tuning. This suggests that GPT-4.1 was capable of modeling plausible misconceptions that align with real-world test-taker reasoning errors. However, the lexical overlap remained relatively low, as reflected in the slightly lower BLEU-1 (unigram match) and ROUGE-L scores compared to the baseline.\u003c/p\u003e\n \u003cp\u003eOverall, the results highlight potential in both explicit model guidance and data-driven model adjustment in distractor generation. The consistent supremacy of supervised fine-tuning demonstrates the benefits of full-data exposure in capturing conceptual plausibility in lexical patterns. In contrast, selecting a few particularly effective samples at inference time proved less effective. Meanwhile, the high semantic alignment of CoS-generated distractors reveals that semantic plausibility can be attained despite an absence of task-specific training data by steering the model\u0026rsquo;s reasoning process toward the types of errors human test-takers are likely to make.\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\n \u003ctable id=\"Tab5\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eAutomatic evaluation metrics measuring semantic and lexical alignment. All values are reported as percentages (i.e., original value multiplied by 100). Higher values indicate stronger alignment with ground-truth distractors for all the reported metrics.\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\" rowspan=\"2\"\u003e\n \u003cp\u003eApproach\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\" colspan=\"2\"\u003e\n \u003cp\u003eSemantic\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\" colspan=\"5\"\u003e\n \u003cp\u003eLexical\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eSBERT Cosine Sim. \u0026uarr;\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eBERTScore-F1 \u0026uarr;\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eBLEU-1 \u0026uarr;\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eBLEU-2 \u0026uarr;\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eBLEU-3 \u0026uarr;\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eBLEU-4 \u0026uarr;\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eROGUE-L \u0026uarr;\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eZero-Shot (baseline)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e44.78\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e88.54\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e26.76\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e10.36\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e7.19\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e5.59\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e19.27\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSupervised Fine- Tuning\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e51.45\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e89.00\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e34.18\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e17.42\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e10.22\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e7.34\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e24.71\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eICL with DAR-based Retrieval\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e47.61\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e88.62\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e25.96\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e10.66\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e6.8\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e5.23\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e18.11\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eChain-of-\u003c/p\u003e\n \u003cp\u003eScaffolds\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e50.74\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e88.8\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e26.2\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e11.23\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e7.51\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e5.81\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e18.52\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec17\" class=\"Section2\"\u003e\n \u003ch2\u003e4.2 Plausibility-based Diagnostic Evaluation\u003c/h2\u003e\n \u003cp\u003ePlausibility-based diagnostic evaluation results in Table \u003cspan class=\"InternalRef\"\u003e6\u003c/span\u003e provide additional insight into the practical quality of the generated distractors. As expected, the ground-truth distractors produced the highest rate of correct answers (79 out of 80), affirming GPT-4.1\u0026rsquo;s ability to solve CSAT questions crafted by human experts. Among the generated sets, ICL with DAR-based retrieval yielded the highest number of instances (49) in which the model selected the correct answer and identified at least one plausible distractor\u0026mdash;surpassing both supervised fine-tuning (37) and CoS (38). Despite their relatively low semantic and lexical alignment with ground-truth distractors, ICL-generated distractors effectively approximated distractor sets with near-optimal choice distributions, highlighting DAR\u0026rsquo;s value as a retrieval signal for ADG. Meanwhile, CoS strikes a strong balance between correctness and plausibility, yielding only three incorrect responses across the test set\u0026mdash;comparable to the ground truth (1) and outperforming supervised fine-tuning (12). These findings suggest that scaffolded reasoning enhances semantic plausibility and reduces model confusion more effectively than fine-tuning alone.\u003c/p\u003e\n \u003cp\u003eThese results demonstrate the individual strengths of each modeling approach. Fine-tuning excels at surface-level alignment, ICL at leveraging plausibility signals, and CoS at promoting both interpretive reasoning and alignment. Crucially, the three approaches outperform the zero-shot baseline in at least one dimension, reinforcing the value of targeted strategies in high-stakes distractor generation tasks like the CSAT.\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\n \u003ctable id=\"Tab6\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 6\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eGPT-4.1\u0026rsquo;s problem-solving and plausibility identification results with each set of distractors for all questions in the test set (n\u0026thinsp;=\u0026thinsp;80). The base PLM was also prompted to answer each question with its ground-truth distractors to measure its innate competence on the CSAT.\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eApproach\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eCorrect answer \u0026amp; plausible distractors identified\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eCorrect answer \u0026amp; no distractors identified\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eWrong answer\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGround truth\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e65\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e14\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eZero-Shot (baseline)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e32\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e45\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSupervised Fine-Tuning\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e37\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e31\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e12\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eICL with DAR-based Retrieval\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e49\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e29\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eChain-of-\u003c/p\u003e\n \u003cp\u003eScaffolds\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e38\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e39\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec19\" class=\"Section2\"\u003e\u003cbr\u003e\u003c/div\u003e"},{"header":"5 Conclusions and Future Work","content":"\u003cp\u003eThis paper evaluates three major ADG strategies tailored to the English section of the Korean CSAT, a linguistically constrained high-stakes EFL exam. We introduce a CoT-inspired pedagogical approach to reinforce GPT-4.1\u0026rsquo;s reasoning capability in ADG, define a novel metric for relevant item retrieval in ICL, and conduct extensive experiments using authentic materials on a dataset hitherto unexplored in ADG. Our comparative analysis reveals that supervised fine-tuning effectively reproduces surface-level semantic and lexical conventions; meanwhile, ICL with DAR-based retrieval and CoS respectively enhance the collective plausibility of generated distractors and their semantic alignment with ground-truth distractors. Collectively, these findings illustrate how different ADG strategies serve distinct roles in replicating human-like distractor generation, offering practical applicability in structured multiple-choice assessments under varying resource conditions.\u003c/p\u003e\u003cp\u003eDespite the promising results of this study, some limitations should be acknowledged. First, the training dataset was relatively small by modern fine-tuning standards, comprising only 419 instances, which may present challenges in the generalizability of the supervised fine-tuning results to broader or more varied English MCQ contexts beyond the CSAT. However, this limitation is unavoidable because the CSAT is administered only six times annually under strict question quality control, inherently capping the amount of available data. Second, efforts were made to standardize question formatting and minimize multilingual interference; however, translating stems and options from Korean into English templates may have introduced subtle changes in meaning or difficulty level that could affect distractor generation performance. Lastly, automatic evaluation metrics and plausibility-based diagnostic evaluation provided a multifaceted assessment; nonetheless, human expert evaluation of distractor quality\u0026mdash;such as plausibility, grammaticality, and educational value\u0026mdash;was not incorporated, leaving room for future studies to complement quantitative metrics with qualitative human judgment.\u003c/p\u003e\u003cp\u003eBuilding on our findings, future research should explore two key directions. First, while this study focused on EFL assessments, applying these ADG strategies to non-language domains\u0026mdash;such as STEM subjects, social sciences, or professional certification\u0026mdash;could test their adaptability in reasoning-intensive contexts. Second, future work may investigate the role of rationale-augmented generation, as an extension of CoS, to produce not only distractors but also accompanying justifications or feedback to support formative assessment. Additionally, adaptive weighting or hybridization of fine-tuning, retrieval, and scaffolded prompting could be explored to optimize distractor generation across different task types and learner populations, supporting broader applications in personalized learning and educational technology systems.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eAcknowledgments\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDisclosure statement\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNo potential conflict of interest was reported by the author(s).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eNotes on contributors\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe first author is …\u003c/p\u003e\n\u003cp\u003eThe second author is …\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eAlhazmi, E., Sheng, Q. Z., Zhang, W. E., Zaib, M., \u0026amp; Alhazmi, A. (2024). Distractor generation in multiple-choice tasks: A survey of methods, datasets, and evaluation. In \u003cem\u003eProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing\u003c/em\u003e (pp. 14437\u0026ndash;14458). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.emnlp-main.799\u003c/li\u003e\n\u003cli\u003eBitew, S. K., Deleu, J., Develder, C., \u0026amp; Demeester, T. (2023). Distractor generation for multiple-choice questions with predictive prompting and large language models. \u003cem\u003earXiv Preprint\u003c/em\u003e arXiv:2307.16338. https://arxiv.org/abs/2307.16338\u003c/li\u003e\n\u003cli\u003eBrutt-Griffler, J., \u0026amp; Kim, S. (2023). The testing culture and the role of private education. \u003cem\u003eLanguage, Culture and Curriculum, 36\u003c/em\u003e(3), 293\u0026ndash;309. https://doi.org/10.1080/07908318.2022.2148686\u003c/li\u003e\n\u003cli\u003eBubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., Nori, H., Palangi, H., Ribeiro, M. T., \u0026amp; Zhang, Y. (2023). Sparks of artificial general intelligence: Early experiments with GPT-4. \u003cem\u003earXiv Preprint\u003c/em\u003e arXiv:2303.12712. https://arxiv.org/abs/2303.12712\u003c/li\u003e\n\u003cli\u003eDoughty, J., Wan, Z., Bompelli, A., Qayum, J., Wang, T., Zhang, J., Zheng, Y., Doyle, A., Sridhar, P., Agarwal, A., Bogart, C., Keylor, E., Kultur, C., Savelka, J., \u0026amp; Sakr, M. (2024). A comparative study of AI-generated (GPT-4) and human-crafted MCQs in programming education. In \u003cem\u003eProceedings of the 26th Australasian Computing Education Conference\u003c/em\u003e (pp. 114\u0026ndash;123). ACM. https://doi.org/10.1145/3636243.3636256 \u003c/li\u003e\n\u003cli\u003eFeng, W., Lee, J., McNichols, H., Scarlatos, A., Smith, D., Woodhead, S., Ornelas, N., \u0026amp; Lan, A. (2024). Exploring automated distractor generation for math multiple-choice questions via large language models. \u003cem\u003eFindings of the Association for Computational Linguistics: NAACL 2024\u003c/em\u003e, 3067\u0026ndash;3082. https://doi.org/10.18653/v1/2024.findings-naacl.193\u003c/li\u003e\n\u003cli\u003eGierl, M. J., Bulut, O., Guo, Q., \u0026amp; Zhang, X. (2017). Developing, analyzing, and using distractors for multiple-choice tests in education: A comprehensive review. \u003cem\u003eReview of Educational Research, 87\u003c/em\u003e(6), 1082\u0026ndash;1116. https://doi.org/10.3102/0034654317726529\u003c/li\u003e\n\u003cli\u003eHeld, W., \u0026amp; Yang, D. (2023). Shapley head pruning: Identifying and removing interference in multilingual transformers. In \u003cem\u003eProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics\u003c/em\u003e (pp. 2416\u0026ndash;2427). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.eacl-main.177\u003c/li\u003e\n\u003cli\u003eKim, S.-Y. (2024). A corpus-based analysis of variables influencing reading question difficulty on the College Scholastic Ability Test (CSAT) English section. \u003cem\u003eYung-hap Yeongeo Yeongmunhak [Convergence English and American Literature], 9\u003c/em\u003e(2), 353\u0026ndash;378.\u003c/li\u003e\n\u003cli\u003eKojima, T., Gu, S. S., Reid, M., Matsuo, Y., \u0026amp; Iwasawa, Y. (2023). Large language models are zero-shot reasoners. \u003cem\u003earXiv preprint \u003c/em\u003earXiv:2205.11916. https://arxiv.org/abs/2205.11916\u003c/li\u003e\n\u003cli\u003eKwon, S. K., Lee, M., \u0026amp; Shin, D. (2015). Educational assessment in the Republic of Korea: Lights and shadows of high-stake exam-based education system. \u003cem\u003eAssessment in Education: Principles, Policy \u0026amp; Practice, 24\u003c/em\u003e(1), 60\u0026ndash;77. https://doi.org/10.1080/0969594X.2015.1074540 \u003c/li\u003e\n\u003cli\u003eLai, G., Xie, Q., Liu, H., Yang, Y., \u0026amp; Hovy, E. (2017). RACE: Large-scale ReAding comprehension dataset from examinations. In \u003cem\u003eProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing\u003c/em\u003e (pp. 785\u0026ndash;794). Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1082\u003c/li\u003e\n\u003cli\u003eLi, X., \u0026amp; Li, J. (2024). AnglE-optimized text embeddings. \u003cem\u003earXiv \u003c/em\u003epreprint arXiv:2309.12871. https://arxiv.org/abs/2309.12871\u003c/li\u003e\n\u003cli\u003eLin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In \u003cem\u003eText Summarization Branches Out\u003c/em\u003e (pp. 74\u0026ndash;81). Association for Computational Linguistics. https://aclanthology.org/W04-1013/\u003c/li\u003e\n\u003cli\u003eLongpre, S., Yauney, G., Reif, E., Lee, K., Roberts, A., Zoph, B., Zhou, D., Wei, J., Robinson, K., Mimno, D., \u0026amp; Ippolito, D. (2024). A pretrainer\u0026rsquo;s guide to training data: Measuring the effects of data age, domain coverage, quality, \u0026amp; toxicity. \u003cem\u003eProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)\u003c/em\u003e (pp. 3245\u0026ndash;3276). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.naacl-long.179\u003c/li\u003e\n\u003cli\u003eLuo, H., Deng, Y., Shen, Y., Ng, S.-K., \u0026amp; Chua, T.-S. (2024). Chain-of-Exemplar: Enhancing distractor generation for multimodal educational question generation. \u003cem\u003eIn Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)\u003c/em\u003e (pp. 7978\u0026ndash;7993). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.432\u003c/li\u003e\n\u003cli\u003eMa, C., \u0026amp; Du, X. (2023). POE: Process of elimination for multiple choice reasoning. In \u003cem\u003eProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing \u003c/em\u003e(pp. 4487\u0026ndash;4496). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.273\u003c/li\u003e\n\u003cli\u003eMaity, S., Deroy, A., \u0026amp; Sarkar, S. (2024). A novel multi-stage prompting approach for language agnostic MCQ generation using GPT. \u003cem\u003earXiv Preprint\u003c/em\u003e arXiv:2401.07098. https://arxiv.org/abs/2401.07098\u003c/li\u003e\n\u003cli\u003eMcNichols, H., Feng, W., Lee, J., Scarlatos, A., Smith, D., Woodhead, S., \u0026amp; Lan, A. (2024). Automated distractor and feedback generation for math multiple-choice questions via in-context learning. \u003cem\u003earXiv Preprint\u003c/em\u003e arXiv:2308.03234. https://arxiv.org/abs/2308.03234\u003c/li\u003e\n\u003cli\u003eMeshkin, H., Zirkle, J., Arabidarrehdor, G., Chaturbedi, A., Chakravartula, S., Mann, J., Thrasher, B., \u0026amp; Li, Z. (2024). Harnessing large language models\u0026rsquo; zero-shot and few-shot learning capabilities for regulatory research. \u003cem\u003eBriefings in Bioinformatics, 25\u003c/em\u003e(5), Article bbae354. https://doi.org/10.1093/bib/bbae354\u003c/li\u003e\n\u003cli\u003eOfferijns, J., Verberne, S., \u0026amp; Verhoef, T. (2020). Better distractions: Transformer-based distractor generation and multiple choice question filtering. \u003cem\u003earXiv Preprint \u003c/em\u003earXiv:2010.09598. https://arxiv.org/abs/2010.09598\u003c/li\u003e\n\u003cli\u003ePapineni, K., Roukos, S., Ward, T., \u0026amp; Zhu, W.-J. (2002). Bleu: A method for automatic evaluation of machine translation. In \u003cem\u003eProceedings of the 40th Annual Meeting of the Association for Computational Linguistics\u003c/em\u003e (pp. 311\u0026ndash;318). Association for Computational Linguistics. https://doi.org/10.3115/1073083.1073135\u003c/li\u003e\n\u003cli\u003eReimers, N., \u0026amp; Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In \u003cem\u003eProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)\u003c/em\u003e (pp. 3982\u0026ndash;3992). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1410\u003c/li\u003e\n\u003cli\u003eRodriguez-Torrealba, R., Garcia-Lopez, E., \u0026amp; Garcia-Cabot, A. (2025). Joint generation of distractors for multiple-choice questions: A text-to-text approach. \u003cem\u003eComputers, Materials \u0026amp; Continua, 83\u003c/em\u003e(2), 1683\u0026ndash;1705. https://doi.org/10.32604/cmc.2025.062004\u003c/li\u003e\n\u003cli\u003eShaham, U., Elbayad, M., Goswami, V., Levy, O., \u0026amp; Bhosale, S. (2023). Causes and cures for interference in multilingual translation. In \u003cem\u003eProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) \u003c/em\u003e(pp. 15849\u0026ndash;15863). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.883\u003c/li\u003e\n\u003cli\u003eShahriar, S., Lund, B., Mannuru, N. R., Arshad, M. A., Hayawi, K., Bevara, R. V. K., Mannuru, A., \u0026amp; Batool, L. (2024). Putting GPT-4o to the sword: A comprehensive evaluation of language, vision, speech, and multimodal proficiency. \u003cem\u003earXiv Preprint\u003c/em\u003e arXiv:2407.09519. https://arxiv.org/abs/2407.09519\u003c/li\u003e\n\u003cli\u003eSun, K., Yu, D., Chen, J., Yu, D., Choi, Y., \u0026amp; Cardie, C. (2019). DREAM: A challenge data set and models for dialogue-based reading comprehension. \u003cem\u003eTransactions of the Association for Computational Linguistics, 7\u003c/em\u003e, 217\u0026ndash;231. https://doi.org/10.1162/tacl_a_00264\u003c/li\u003e\n\u003cli\u003eTaslimipoor, S., Benedetto, L., Felice, M., \u0026amp; Buttery, P. (2024, May). Distractor generation using generative and discriminative capabilities of transformer-based models. In \u003cem\u003eProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)\u003c/em\u003e (pp. 5052\u0026ndash;5063). ELRA and ICCL. https://aclanthology.org/2024.lrec-main.452/\u003c/li\u003e\n\u003cli\u003eTran, A., Angelikas, K., Rama, E., Okechukwu, C., Smith, D., \u0026amp; Macneil, S. (2023, October). Generating multiple choice questions for computing courses using large language models. In \u003cem\u003e2023 IEEE Frontiers in Education Conference (FIE)\u003c/em\u003e. https://doi.org/10.1109/FIE58773.2023.10342898\u003c/li\u003e\n\u003cli\u003eVygotsky, L. S. (1978). \u003cem\u003eMind in society: The development of higher psychological processes\u003c/em\u003e. Harvard University Press. https://doi.org/10.2307/j.ctvjf9vz4\u003c/li\u003e\n\u003cli\u003eWei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., \u0026amp; Zhou, D. (2023). Chain-of-thought prompting elicits reasoning in large language models. \u003cem\u003earXiv Preprint \u003c/em\u003earXiv:2201.11903. https://arxiv.org/abs/2201.11903\u003c/li\u003e\n\u003cli\u003eWood, D. J., Bruner, J. S., \u0026amp; Ross, G. (1976). The role of tutoring in problem solving\u003cem\u003e. Journal of Child Psychology and Psychiatry, 17\u003c/em\u003e, 89\u0026ndash;100. http://dx.doi.org/10.1111/j.1469-7610.1976.tb00381.x\u003c/li\u003e\n\u003cli\u003eXie, Q., Lai, G., Dai, Z., \u0026amp; Hovy, E. (2018). Large-scale cloze test dataset created by teachers. In \u003cem\u003eProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing\u003c/em\u003e (pp. 2344\u0026ndash;2356). Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1257\u003c/li\u003e\n\u003cli\u003eYu, H. C., Shih, Y. A., Law, K. M., Hsieh, K., Cheng, Y. C., Ho, H. C., Lin, Z. A., Hsu, W.-C., \u0026amp; Fan, Y.-C. (2024). Enhancing distractor generation for multiple-choice questions with retrieval augmented pretraining and knowledge graph integration. \u003cem\u003eFindings of the Association for Computational Linguistics: ACL 2024\u003c/em\u003e, 11019\u0026ndash;11029. https://doi.org/10.18653/v1/2024.findings-acl.655\u003c/li\u003e\n\u003cli\u003eZhang, T., Kishore, V., Wu, F., Weinberger, K. Q., \u0026amp; Artzi, Y. (2020). BERTScore: Evaluating text generation with BERT. \u003cem\u003earXiv Preprint\u003c/em\u003e arXiv:1904.09675. https://arxiv.org/abs/1904.09675\u003c/li\u003e\n\u003cli\u003eZu, J., Choi, I., \u0026amp; Hao, J. (2023). Automated distractor generation for fill-in-the-blank items using a prompt-based learning approach. \u003cem\u003ePsychological Test and Assessment Modeling, 65\u003c/em\u003e(1), 55\u0026ndash;75.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"distractor generation, language models, GPT-4.1, EFL, Korean CSAT","lastPublishedDoi":"10.21203/rs.3.rs-6680435/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6680435/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eHigh-quality distractors are essential in multiple-choice questions to assess student understanding and diagnose misconceptions; however, constructing these distractors manually is labor-intensive. This study presents the first large-scale investigation of automated distractor generation (ADG) for the English section of Korea’s College Scholastic Ability Test (CSAT), a high-stakes exam of English as a Foreign Language (EFL) characterized by consistent item design and linguistic constraints. We implement and evaluate three ADG approaches using GPT-4.1: supervised fine-tuning on a curated CSAT dataset, in-context learning with a novel distractor attractiveness metric to guide exemplar retrieval, and Chain-of-Scaffolds, a prompting strategy inspired by educational scaffolding theory that decomposes distractor generation into reasoning stages. Across 80 unseen items from recent CSAT administrations, supervised fine-tuning achieves the highest semantic and lexical alignment with ground-truth distractors. In-context learning retrieves more pragmatically effective examples, producing distractor sets that best approximate realistic answer distributions. The Chain-of-Scaffolds method yields distractors that simulate test-taker misconceptions while minimizing confusion with the correct answer. These findings underscore the value of pedagogically grounded prompting and data-informed retrieval in high-stakes language assessment and suggest that ADG strategies should align with instructional contexts—for example, prioritizing fine-tuning for nationwide standardized exams, or selecting in-context learning for classroom diagnostics that require adaptability and rapid deployment.\u003c/p\u003e","manuscriptTitle":"Optimizing GPT-Based Distractor Generation for the Korean CSAT English Exam","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-09-18 12:13:46","doi":"10.21203/rs.3.rs-6680435/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"5ec60681-d041-48dc-ad1d-834da9331921","owner":[],"postedDate":"September 18th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2026-03-19T09:41:19+00:00","versionOfRecord":[],"versionCreatedAt":"2025-09-18 12:13:46","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-6680435","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6680435","identity":"rs-6680435","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

⚙ Ask this paper AI returns verbatim quotes from the full text · source: preprint-html ⓘ

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2025) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00