Evaluation of Large Language Models on the Chinese Dental Licensing Examination

preprint OA: closed
Full text JSON View at publisher
Full text 112,825 characters · extracted from preprint-html · click to expand
Evaluation of Large Language Models on the Chinese Dental Licensing Examination | Research Square window.SnipcartSettings = { analytics: { enabled: false } }; (function() { var accessVector = localStorage.getItem('access_vector') || ''; window.dataLayer = window.dataLayer || []; if (accessVector) { window.dataLayer.push({ user: { profile: { profileInfo: { snid: accessVector } } } }); } })(); (function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-K279D39R'); Browse Preprints In Review Journals COVID-19 Preprints AJE Video Bytes Research Tools Research Promotion AJE Professional Editing AJE Rubriq About Preprint Platform In Review Editorial Policies Our Team Advisory Board Help Center Sign In Submit a Preprint Cite Share Download PDF Article Evaluation of Large Language Models on the Chinese Dental Licensing Examination jie ji, weini xin This is a preprint; it has not been peer reviewed by a journal. https://doi.org/ 10.21203/rs.3.rs-7968968/v1 This work is licensed under a CC BY 4.0 License Status: Posted Version 1 posted You are reading this latest preprint version Abstract Objective: This study aimed to evaluate the performance of large language models (LLMs) on the Chinese Dental Licensing Examination (CDLE). It also examined whether including an ‘unknown’ option in prompts—or combining this option with a penalty for incorrect answers—could improve model accuracy and reduce hallucinations. Methods: The official preparation book, titled Historical Chinese Dental Licensing Examinations , authored by the Chinese National Licensed Physician Qualification Examination Proposition Research Group, was used as the data source. Three cloud-based models (Qwen3-Max, Qwen-Plus, DeepSeek-V3.1) and two locally deployed models (Qwen3-32B and GPT-OSS-120B) were evaluated on the CDLE. A custom-designed program was developed to automatically conduct the CDLE by leveraging the OpenAI API to communicate with both locally deployed and cloud-based LLMs. Model performance was evaluated at both the exam and question levels. Exam-level performance was assessed by mean accuracy (± standard deviation (SD)) and pass/fail outcomes, while question-level performance was evaluated primarily by accuracy with 95% and 99% confidence intervals (CIs). Results: A dataset comprising four CDLEs (2,400 questions in total) was constructed. Each question was a five-option, single-answer multiple-choice question. Qwen3-Max, Qwen-Plus, DeepSeek-V3.1, Qwen3-32B, and GPT-OSS-120B achieved exam-level mean accuracies ±SD of 0.866±0.089, 0.851±0.0767, 0.737±0.0738, 0.748±0.0868, 0.652±0.0799, respectively. At the question level, the accuracies with 95% CIs were 0.865 (0.852–0.878), 0.851 (0.837–0.865), 0.727 (0.709–0.745), 0.741 (0.724–0.756), and 0.651 (0.634–0.671), respectively. Prompts that included an ‘unknown’ option—or combined it with a penalty for incorrect answers—did not improve model accuracy. Conclusion: All models successfully passed the CDLEs, with some achieving remarkably high scores. Among them, Qwen3-Max demonstrated the best overall performance across all evaluated metrics. Other uncertainty estimation methods should be considered instead of simply adding an ‘unknown’ option to the input prompt. In the future, LLMs are expected to play an important role in dental education, particularly in supporting medical students’ self-directed learning. Health sciences/Diseases Health sciences/Health care Physical sciences/Mathematics and computing Health sciences/Medical research Chinese Dental Licensing Examination large language models Qwen3-Max Qwen-Plus Qwen3 DeepSeek-V3.1 GPT-OSS hallucinations in large language models dental education Figures Figure 1 Figure 2 Introduction Over the past few decades, artificial narrow intelligence (ANI)—defined as AI systems designed to perform specific, well-defined tasks within a limited domain—has made substantial progress in medicine. For example, deep learning models now match or even surpass human experts in many specialized tasks when large labeled datasets are available. However, real-world clinical applications encompass many specialized domains and tasks, and developing separate AI systems for each task is prohibitively costly. Moreover, in cases involving rare diseases or substantial missing data, acquiring labeled data can be not only labor-intensive but, in some cases, impossible. In recent years, large language models (LLMs) have advanced rapidly and, despite ongoing debate, are considered to possess certain general artificial-intelligence-like capabilities. LLMs exhibit emergent abilities (1–3) that are absent in smaller models but present in larger ones. Most importantly, LLMs can generalize, to some extent, to new tasks for which they were not specifically trained. As a result, LLMs have opened new avenues for artificial intelligence applications in healthcare. Numerous previous studies have focused on evaluating the ability of LLMs to pass medical exams, answer medical questions, or provide recommendations (4–12). Some studies have utilized LLMs for clinical decision-making, including disease diagnosis and outcome prediction (13–20). In dentistry, several studies have assessed the performance of LLMs on dental examinations conducted in various languages(21–25); however, the results have been inconsistent. In oral and maxillofacial surgery examinations, Quah, B. (22) reported that LLMs were capable of achieving a passing score of 62.5% in the oral and maxillofacial surgery multiple-choice questions, and GPT-4 (26) and Copilot performed the best of the included LLMs. A systematic review and meta-analysis study (21), which included 11 studies, found that GPT-4 achieved an integrated accuracy of 73%, outperforming GPT-3.5 (54%) and Bard (56%). However, compared with medical licensing examinations, LLMs performed worse and faced greater challenges in dental licensing examinations(21). Revilla-León, Marta evaluated the performance of ChatGPT on the European certification in implant dentistry exam(23) and reported that ChatGPT was able to pass the exam for the 2022 Certification in Implant Dentistry of the European Association for Osseointegration. Conversely, a Japanese study reported that GPT-4 did not pass the Japanese National Dentist Examination and all LLMs demonstrated significantly lower accuracy for dentistry questions compared with other types of questions(24). A study that evaluated the performance of AI on the Turkish dental specialization exam(25) found that AI-powered chatbots, namely ChatGPT-4.0 and Gemini Advanced, passed the DUS by exceeding the threshold score of 45. In addition, several studies have evaluated the question-answering capabilities of LLMs, and these scenarios are very similar to exams. Özbay, Yağız conducted a study of evaluation of the performance of LLMs in clinical decision-making in endodontics (27) and found that ChatGPT-4 demonstrated the highest score and the lowest misinformation rate (P = 0.008) compared to other LLMs. Zhu, Guohui conducted a study about assessing and enhancing the reliability of Chinese LLMs in dental implantology (28) and confirmed that Qwen 2.5 and ERNIE Bot 3.5 demonstrated exceptional reliability in dental implantology, excelling in answer accuracy and minimizing misinformation across question types. Birkan Eyup Yilmaz conducted a comparative analysis about Artificial intelligence performance in answering multiple-choice oral pathology questions (29) and reported LLMs demonstrated variable proficiency in oral pathology questions, with ChatGPT o1 showing higher accuracy. Although some studies have examined the performance of LLMs in both dental examinations and other question answer scenarios across various languages, their results have been inconsistent. The languages investigated in dental licensure examinations include English, Japanese, and Turkish; however, Chinese, which is the most spoken language in the world, has not been investigated. The Chinese Dental Licensing Examination(CDLE) was established under the “Medical Practitioners Law of the People’s Republic of China(30)”. The exam consists of two parts: a comprehensive written examination and a practical skills assessment. This study focuses on the first part of the exam, which can be automatically conducted by LLMs. The comprehensive written examination covers five major modules: basic medicine, clinical medicine, preventive medicine, medical humanities, and clinical dental medicine. It is administered via computer and includes four question types: A1, A2, A3, and B1. All questions are five-choice, single-answer questions. Passing the exam requires answering 60% of the questions correctly. In medical domains, mispredictions can have serious consequences. However, it is well known that LLMs are prone to hallucinations(31–33), which represent one of the greatest risks limiting their application, particularly in the medical field. Hallucinations are often defined as instances where LLMs generate plausible but factually incorrect or nonsensical information (32, 34). Therefore, it would be beneficial for a model to be cautious in situations where it is uncertain about its predictions(35). One way to accomplish this is to use machine learning models with rejection, and this is true for both traditional machine learning models and LLMs. Most machine learning models can output predicted probability values, and by adding threshold-based post-processing, a rejection option can be incorporated. Additionally, there are some relatively sophisticated uncertainty estimation methods such as Bayesian neural networks, Bayesian Monte Carlo simulation method(36–38), Dirichlet-based models, separated rejectors, test-time data augmentation(39, 40), among others. Because LLMs are computationally expensive and intermediate results or internal representations from cloud-based models are often inaccessible, a simple method is needed to implement a rejection option. A previous study adopted an uncertainty estimation method to address hallucinations in LLMs(33); however, there remains a need for lightweight and specialized methods for LLMs. A few studies suggest that explicitly adding an 'unknown' option and imposing a penalty for incorrect answers in the prompt can help reduce hallucinations(41). However, the effectiveness of this approach has not been widely demonstrated, especially in the medical domain. To address these research gaps, this study evaluates the performance of multiple LLMs on the CDLE. Because each multiple-choice question in this exam has a definitive correct answer, it serves as an effective benchmark for assessing hallucinations in LLMs. Furthermore, this study conducts a comparative analysis of three prompting strategies: (i) a standard prompt template without an 'unknown' option, (ii) a modified prompt including an explicit ‘unknown’ response option, and (iii) a comprehensive prompting approach combining both the ‘unknown’ option and a penalty for incorrect answers. Methods Data processing The official preparation book, titled Historical Chinese Professional Dental Licensing Examinations (including questions, answers, and detailed explanations) , contains four past CDLEs from the past ten years, authored by the Chinese National Licensed Physician Qualification Examination Proposition Research Group and published by Liaoning University Press in November 2024 with ISBN 978-7-5610-8453-3, was used as the data source. All examinations and questions from the book were retained. Models Three cloud-based models—Qwen3-Max, Qwen-Plus(42), and DeepSeek-V3.1(43)—along with two locally deployed models, Qwen3-32B and GPT-OSS-120B(44)—were evaluated on the CDLE. These models were selected for the following reasons. The Qwen and DeepSeek model families have demonstrated strong multilingual capabilities, with particularly robust performance on Chinese(42, 43). For example, on Chinese language benchmarks such as C-Eval(45) and AlignBench(9), Qwen3 models achieved higher scores than GPT-4o(26, 46), Gemma-3, LLaMA-3.1, LLama-4, and even DeepSeek-R1. DeepSeek performed on par with state-of-the-art proprietary models in clinical decision-making(18). The GPT series models developed by OpenAI, including GPT-4 and GPT-4o, and the newly released GPT-5 (released on August 7, 2025) are among the most prominent LLMs in the world. The GPT-OSS-120B, the most powerful open-source model released by OpenAI to date, achieves near parity with GPT-o4-mini across multiple tasks (47). Among the five models, the first two are proprietary models, while the latter three are open source. Qwen3-Max has over 1 trillion parameters; however, the exact number of parameters for Qwen-Plus has not been publicly disclosed. DeepSeek-V3.1 is a powerful Mixture-of-Experts (MoE) language model with a total of 685 billion parameters, including 671 billion from the main model and 14 billion from the Multi-Token Prediction (MTP) module. However, only about 37 billion parameters are activated per token. GPT-OSS-120B has 117 billion parameters in total, with approximately 5.1 billion active parameters per forward pass, and it operates using 4-bit quantization. Qwen3-Max was released by Alibaba on September 24, 2025, and DeepSeek-V3.1 was released by Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Company on August 21, 2025. OpenAI released the open-source model gpt-oss-120b on August 5, 2025. The GPT-OSS-120B model achieves near parity with OpenAI o4-mini on multiple tasks(47). Alibaba Cloud is the largest cloud computing company in China and one of the largest in the world. The Qwen-Plus, Qwen3-Max, and DeepSeek-V3.1 models were evaluated on Alibaba Cloud, a highly stable platform that provides an OpenAI-compatible API for model inference. The afore-mentioned Qwen3-32B and GPT-OSS-120B models were deployed locally using Ollama, which provides built-in support for the OpenAI Chat Completions API. Customized program for automatically conducting exams A custom-designed program was developed to automatically conduct the CDLE by leveraging the OpenAI API to communicate with both locally deployed and cloud-based LLMs. The program utilizes the Chat Completions API—specifically, the chat.completions.create method in the OpenAI Python library—to communicate with LLMs. The messages sent to each LLM include a system message and a user message. To determine whether using prompts that include an ‘unknown’ response option can reduce hallucinations in LLMs—and thereby improve answer accuracy—three types of system messages were implemented. The first is the traditional message: "You are an experienced dentist. Please select the most correct answer from the five given options based on the question description, and respond directly with the number of the correct answer." The second message introduces an "I don't know" option: "You are an experienced dentist. Among the five options provided for the following question, only one is the most correct. If you are fairly certain about which option is correct, respond directly with its number; otherwise, respond with 'I don't know.'" The third message not only includes the "I don't know" option but also imposes a penalty for incorrect answers: "You are an experienced dentist. Among the five options provided for the following question, only one is the most correct. If you are more than 90% confident in your answer, respond directly with the number of the option; otherwise, respond with 'I don't know.' You will receive 1 point for a correct answer, 0 points for responding 'I don't know,' and − 1 point for an incorrect answer." Parsing LLM results These LLMs are generative AI systems that produce text sequences in response to a given input prompt. Although carefully designed prompts can guide LLMs to output question numbers in most cases, the responses sometimes contain incorrect formats or invalid values. Nonetheless, in most cases, humans can readily interpret the outputs to identify the correct answers. To extract the predicted answer numbers from the model generated text, rule-based parsing techniques were applied. When the system failed to extract a valid answer number, the instance was classified as a parsing error. In this study, all parsing errors were treated as incorrect answers. Statistics Analysis Model performance was evaluated at both the exam and the question levels. At the exam level, overall mean accuracy rates ± standard deviation (SD), treating both parsing errors and unknown cases as incorrect, and pass/fail outcomes were used as performance metrics. At the question level—aggregating all questions across exams—accuracy rates, along with 95% and 99% confidence intervals (CIs), were computed. CIs were estimated using a nonparametric bootstrap resampling method with 500 replications(48). When unknown cases were present, accuracy rates were calculated in two ways: (1) excluding those cases and (2) treating them as incorrect. Additionally, the number and proportion of parsing errors and unknown cases were reported. Differences in means and accuracy rates comparison were performed using the t-test. P values were computed from CIs (49), and a p -value of < 0.05 was considered statistically significant. Statistics analysis was conducted using a customized python program. Experimental settings Hardware: Intel E5-2620 V4 * 2, 256GB Memory, Nvidia GTX 4090 * 2 48GB Software: Ubuntu 20.04, CUDA 12.4, Anaconda 23.1, Ollama 0.11. The programming language and libraries: Python 3.10, the OpenAI Python library, Pandas, NumPy, SciPy, and among others. Detailed information about these software libraries can be found in the file requirements.txt of the source code. Results Dataset A dataset comprising four CDLEs (2,400 questions in total) was constructed. Each question was a five-option, single-answer multiple-choice question. Exam-level Performance Exam-level model performance comparison for the CDLE is shown in Fig. 1 . Table 1 shows the mean accuracy (± SD) and pass/fail outcome of each model Table 1 Mean accuracy (± SD) and pass/fail outcome of each model Exam type Model name Mean accuracy(± SD) pass/fail outcomes CDLE Qwen3-Max 0.866 ± 0.089 pass CDLE Qwen-Plus 0.851 ± 0.0767 pass CDLE DeepSeek-V3.1 0.737 ± 0.0737 pass CDLE Qwen3-32B 0.748 ± 0.0868 pass CDLE GPT-OSS-120B 0.652 ± 0.0799 pass As shown in Fig. 1 at the exam level, Qwen3-Max achieved higher accuracy rates than Qwen-Plus (p < 0.01). Additionally, both Qwen3-Max and Qwen-Plus significantly outperformed DeepSeek-V3.1, Qwen3-32B, and GPT-OSS-120B (p < 0.01). All models passed the CDLE. Question-level Performance Figure 2 shows the question-level accuracy rates, with 95% CIs, for different models using instruction template No. 1. Tables 2 , 3 , and 4 present detailed performance metrics for different models using Instruction Templates No. 1, No. 2, and No. 3, respectively. The reported metrics include parsing error rate, wrong answer rate, and correct answer rate. CIs (95% and 99%) are provided for correct rates. For prompts involving an unknown option, the unknown rate and adjusted correct answer rates were calculated using two approaches: (1) excluding unknown cases, and (2) treating unknown responses as incorrect answers. CIs (95% and 99%) are provided for correct rates. Table 2 Performance metrics using instruction template No. 1. Model name Parsing errors Correct answers Wrong answer Correct rate Qwen3-Max 1 (0.04%) 2077 (86.5%) 322 (13.4%) 0.865 (95%CI: 0.852–0.878, 99%CI: 0.847–0.884) Qwen-Plus 0 2043 (85.1%) 357 (14.9%) 0.851 (95%CI: 0.837–0.865, 99%CI: 0.832–0.870) DeepSeek-V3.1 34 (1.42%) 1745 (72.7%) 621 (25.9%) 0.727 (95%CI: 0.709–0.745, 99%CI: 0.703–0.752) Qwen3-32B 24 (1.0%) 1778 (74.1%) 598 (24.9%) 0.741 (95%CI: 0.724–0.756, 99%CI: 0.715–0.763) GPT-OSS-120B 1 (0.04%) 1563 (65.1%) 836 (34.8%) 0.651 (95%CI: 0.634–0.671, 99%CI: 0.624–0.675) Table 3 Performance metrics using instruction template No. 2. Model name Parsing errors Unknown cases Correct answers Wrong answer Correct rate (exclude unknown cases) Correct rate (unknowns as errors) Qwen3-Max 1 (0.04%) 15 (0.63%) 2078 (86.6%) 306 (12.8%) 0.871 (95%CI: 0.858–0.883, 99%CI: 0.852–0.886) 0.8658 (95%CI: 0.851–0.879, 99% CI: 0.848–0.883) Qwen-Plus 0 13 (0.54%) 2027 (84.5%) 360 (15%) 0.849 (95%CI: 0.834–0.863, 99%: 0.829–0.865) 0.8446 (95%CI: 0.830–0.857, 99%: 0.826–0.862) DeepSeek-V3.1 62 (2.58%) 86 (3.58%) 1629 (67.9%) 623 (26.0%) 0.704 (95%CI:0.686–0.723, 99%CI: 0.679–0.734) 0.6787 (95%CI: 0.660–0.698, 99% CI:0.658–0.705) Qwen3-32B 18 (0.75%) 0 1783 (74.2%) 599 (25.0%) 0.743 (95%CI: 0.725–0.759, 99%CI: 0.7208–0.7658) 0.7429 (95%CI: 0.725–0.759, 99%CI: 0.721–0.766) GPT-OSS-120B 2 (0.08%) 29 (1.20%) 1551 (64.6%) 818 (34.1%) 0.6542 (95%CI: 0.634–0.674, 99%CI: 0.632–0.678) 0.646 (95%CI: 0.625–0.665, 99% CI:0.623–0.668) Table 4 Performance metrics using instruction template No. 3. Model name Parsing errors Unknown cases Correct answers Wrong answer Correct rate (exclude unknown cases) Correct rate (unknowns as errors) Qwen3-Max 1 (0.04%) 15 (0.63%) 2075 (86.46%) 309 (12.88%) 0.870 (95%CI: 0.857–0.885, 99%CI: 0.853–0.889) 0.8646 (95%CI: 0.851–0.878, 99%CI: 0.847–0.885) Qwen-Plus 0 13 (0.54%) 2026 (84.4%) 361 (15.04%) 0.849 (95%CI: 0.832–0.862, 99%CI: 0.832–0.866) 0.844 (95%CI: 0.829–0.858, 99%CI: 0.827–0.861) DeepSeek-V3.1 0 121 (5.04%) 1572 (65.50%) 707 (29.46%) 0.6898 (95%CI: 0.671–0.709, 99%CI: 0.665–0.717) 0.655 (95%CI: 0.634–0.674, 99%CI: 0.632–0.676) Qwen3-32B 24 (1.00%) 3 (0.13%) 1782 (74.25%) 591 (24.63%) 0.743 (95%CI: 0.726–0.761, 99%CI: 0.720–0.769) 0.7425 (95%CI: 0.725–0.762, 99%CI: 0.723–0.766) GPT-OSS-120B 0 53 (2.21%) 1551 (64.63%) 796 (33.17%) 0.661 (95%CI: 0.641–0.681, 99%CI: 0.638–0.687) 0.646 (95%CI: 0.626–0.663, 99%CI: 0.620–0.672) Tables 2 – 4 show that Qwen3-Max achieved a higher question-level accuracy than Qwen-Plus (P < 0.05). Furthermore, both Qwen3-Max and Qwen-Plus significantly outperformed DeepSeek-V3.1, Qwen3-32B, and GPT-OSS-120B in terms of question-level accuracy (P < 0.01). Table 5 presents the execution time of different models Table 5 Execution time of models Model name Execution time(s) Average each question time (s) Qwen3-Max 1478.31 0.62 Qwen-Plus 1120.26 0.47 DeepSeek-V3.1 4142.36 1.73 Qwen3-32B 46259.44 19.27 GPT-OSS-120B 11502.46 4.79 As shown in Tables 5 , cloud-based models performed inference significantly faster than locally deployed models. Among the local models, although GPT-OSS-120B has nearly four times as many parameters as Qwen3-32B, it ran considerably faster. This is because GPT-OSS-120B is a sparse model and employs 4-bit quantization, whereas Qwen3-32B is a dense model without additional quantization. All experiments were conducted using the default settings, including those for the OpenAI API and Ollama. It should be noted that although this comparison result is meaningful, the comparison is not entirely equivalent across models. Qwen3-32B enables thinking mode by default, whereas for commercial Qwen models such as Qwen-Plus, thinking mode is off by default and must be manually enabled. Compared with the non-thinking mode, LLMs operating in thinking mode generate far more tokens and run more slowly. Discussion All models successfully passed the CDLE. Qwen3-Max achieved the highest scores across all metrics, and both Qwen3-Max and Qwen-Plus performed significantly better than the other models. Although cloud-based models have considerably more parameters and higher computational demands than locally deployed models, they nevertheless operated considerably faster. While using local LLMs is theoretically cost-free, using cloud-based models is also highly cost-effective—the experiments conducted in this study cost only a few U.S. dollars (when converted from RMB). Therefore, in most cases, using cloud-based LLMs represents a more practical choice than deploying models locally, except for a few companies or institutions with access to extensive computational resources. Contrary to the findings of a previous study (41), using prompts that included an ‘unknown’ option—or combining the ‘unknown’ option with a penalty for incorrect answers—did not apparently improve model accuracy or reduce hallucinations. Consequently, mitigating hallucinations in LLMs should focus on enhancing training data, model architecture, and training processes, rather than relying solely on prompt engineering. This study has several notable strengths. To the best of our knowledge, it is the first to evaluate the performances of multiple LLMs on the CDLE or any other Chinese dental examination. It is also the first to compare the performance of using a standard prompt, incorporating an ‘unknown’ option, and combining the ‘unknown’ option with a penalty for incorrect answers on medical examinations. As a result, other uncertainty estimation methods should be considered. Exam-level and question-level performances were assessed, and the results were significant and consistent across both levels. Despite these promising results, several limitations should be acknowledged. OpenAI’s ChatGPT-4 and ChatGPT-5, as well as Google Gemini, are among the most widely used LLMs to date; however, due to certain constraints, these models were not included in the present evaluation. Although the open-source GPT-OSS-120B could serve as a surrogate for ChatGPT-4, their relative performance—particularly on the CDLE—remains unknown. The comprehensive written examination covers five major modules and includes four question types: A1, A2, A3, and B1. However, subgroup analyses were not conducted in this study. Although the data in this study are drawn from officially published books by authoritative institutions and the examinations are selected in their entirety from recent real-world examinations, possibly due to concerns regarding exam security and prevention of data leakage, the authors and the publishing agency did not disclose the specific year or date of the examinations. Conclusion In conclusion, five LLMs were evaluated on the CDLE, all of which successfully achieved passing scores and some models achieved remarkably high scores. Qwen3-Max demonstrated the best overall performance across all evaluated metrics. Moreover, using prompts that included an ‘unknown’ option—or combining the ‘unknown’ option with a penalty for incorrect answers—did not improve model accuracy. As a result, other uncertainty estimation methods should be considered. Cloud-based LLMs were significantly faster than locally deployed models. In most cases, using cloud-based LLMs represents a better choice than using locally deployed models. In the future, LLMs are expected to play an important role in dental education, particularly in supporting medical students’ self-directed learning. Abbreviations The following abbreviations are used in this manuscript: LLM large language model CDLE Chinese Dental Licensing Examination SD standard deviation CI confidence interval ANI artificial narrow intelligence Declarations Ethics approval and consent to participate This study was exempt from institutional review board approval owing to the use of a published book. Consent for publication Not applicable. Clinical trial number: not applicable. Availability of data and materials The code, dataset, and prediction results are available at: https://github.com/linchundan88/Chinese_professional_dental_licensing_examinations. Competing interests The authors declare no conflicts of interest. Funding This research was funded by the 2019 Guangdong Science and Technology Special Fund “Medical Education Talent Training and Clinical Technology Improvement Plan” (Research Project Number 2019113134). Acknowledgements Not applicable. Authors’ Contributions Conceptualization: JJ and WX; data curation, JJ; formal analysis, JJ; investigation, WX; resources, WX; validation, JJ; writing—original draft preparation, JJ; writing—review and editing, JJ, WX; visualization, JJ; supervision, WX; project administration, WX; funding acquisition, WX. All authors have read and agreed to the published version of the manuscript. References Wei J, Tay Y, Bommasani R, Raffel C, Zoph B, Borgeaud S, et al. Emergent Abilities of Large Language Models. arXiv e-prints. 2022:arXiv:2206.07682. Schaeffer R, Miranda B, Koyejo S. Are Emergent Abilities of Large Language Models a Mirage? arXiv e-prints. 2023:arXiv:2304.15004. Lu S, Bigoulaeva I, Sachdeva R, Tayyar Madabushi H, Gurevych I, editors. Are Emergent Abilities in Large Language Models just In-Context Learning?2024 August; Bangkok, Thailand: Association for Computational Linguistics. Singhal K, Tu T, Gottweis J, Sayres R, Wulczyn E, Amin M, et al. Toward expert-level medical question answering with large language models. Nature Medicine. 2025;31(3):943-50. Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023;9:e45312. Yaneva V, Baldwin P, Jurich DP, Swygert K, Clauser BE. Examining ChatGPT Performance on USMLE Sample Items and Implications for Assessment. Academic Medicine. 2024;99(2):192-7. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health. 2023;2(2):e0000198. Brin D, Sorin V, Vaid A, Soroush A, Glicksberg BS, Charney AW, et al. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Scientific Reports. 2023;13(1):16492. Liu X, Lei X, Wang S, Huang Y, Feng A, Wen B, et al., editors. AlignBench: Benchmarking Chinese Alignment of Large Language Models2024 August; Bangkok, Thailand: Association for Computational Linguistics. Hsieh C-H, Hsieh H-Y, Lin H-P. Evaluating the performance of ChatGPT-3.5 and ChatGPT-4 on the Taiwan plastic surgery board examination. Heliyon. 2024;10(14). Sarraju A, Bruemmer D, Van Iterson E, Cho L, Rodriguez F, Laffin L. Appropriateness of Cardiovascular Disease Prevention Recommendations Obtained From a Popular Online Chat-Based Artificial Intelligence Model. JAMA. 2023;329(10):842-4. Zong H, Li J, Wu E, Wu R, Lu J, Shen B. Performance of ChatGPT on Chinese national medical licensing examinations: a five-year examination evaluation study for physicians, pharmacists and nurses. BMC Medical Education. 2024;24(1). Zhu M, Lin H, Jiang J, Jinia AJ, Jee J, Pichotta K, et al. Large language model trained on clinical oncology data predicts cancer progression. npj Digital Medicine. 2025;8(1):397. Shashikumar SP, Mohammadi S, Krishnamoorthy R, Patel A, Wardi G, Ahn JC, et al. Development and prospective implementation of a large language model based system for early sepsis prediction. npj Digital Medicine. 2025;8(1):290. Cano-Besquet S, Rice-Canetto T, Abou-El-Hassan H, Alarcon S, Zimmerman J, Issagholian L, et al. ChatGPT4’s diagnostic accuracy in inpatient neurology: A retrospective cohort study. Heliyon. 2024;10(24):e40964. Patel A, Ruoff C, Helgeson SA, Carvalho DZ, Castillo PR, Cheung J. Diagnostic performance of Large Language Models (LLMs) compared with physicians in sleep medicine. Sleep Medicine. 2025;134:106677. Han C, Kim DW, Kim S, Chan You S, Park JY, Bae S, et al. Evaluation of GPT-4 for 10-year cardiovascular risk prediction: Insights from the UK Biobank and KoGES data. iScience. 2024;27(2). Open-source LLM DeepSeek on a par with proprietary models in clinical decision making. Nature Medicine. 2025. Hager P, Jungmann F, Holland R, Bhagat K, Hubrecht I, Knauer M, et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nature Medicine. 2024;30(9):2613-22. Ayers JW, Poliak A, Dredze M, Leas EC, Zhu Z, Kelley JB, et al. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Internal Medicine. 2023;183(6):589-96. Liu M, Okuhara T, Huang W, Ogihara A, Nagao HS, Okada H, et al. Large Language Models in Dental Licensing Examinations: Systematic Review and Meta-Analysis. International Dental Journal. 2025;75(1):213-22. Quah B, Yong CW, Lai CWM, Islam I. Performance of large language models in oral and maxillofacial surgery examinations. International Journal of Oral and Maxillofacial Surgery. 2024;53(10):881-6. Revilla-León M, Barmak AB, Sailer I, Kois J, Att W. Performance of an Artificial Intelligence-Based Chatbot (ChatGPT) Answering the European Certification in Implant Dentistry Exam. The International Journal of Prosthodontics. 2024:1-5. Ohta K, Ohta S. The Performance of GPT-3.5, GPT-4, and Bard on the Japanese National Dentist Examination: A Comparison Study. Cureus. 2023;15. Sismanoglu S, Capan BS. Performance of artificial intelligence on Turkish dental specialization exam: can ChatGPT-4.0 and gemini advanced achieve comparable results to humans? BMC Medical Education. 2025;25(1):214. OpenAi, Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, et al. GPT-4 Technical Report. arXiv e-prints. 2023:arXiv:2303.08774. Özbay Y, Erdoğan D, Dinçer GA. Evaluation of the performance of large language models in clinical decision-making in endodontics. BMC Oral Health. 2025;25(1):648. Zhu G, Zhang X, Chen C. Assessing and enhancing the reliability of Chinese large language models in dental implantology. BMC Oral Health. 2025;25(1):1242. Yilmaz BE, Gokkurt Yilmaz BN, Ozbey F. Artificial intelligence performance in answering multiple-choice oral pathology questions: a comparative analysis. BMC Oral Health. 2025;25(1):573. China TNPsCotPsRo. Medical Practitioners Law of the People's Republic of China 2021. Available from: http://en.npc.gov.cn.cdurl.cn/2021-08/20/c_875935.htm. Huang L, Yu W, Ma W, Zhong W, Feng Z, Wang H, et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. arXiv e-prints. 2023:arXiv:2311.05232. Xu Z, Jain S, Kankanhalli M. Hallucination is Inevitable: An Innate Limitation of Large Language Models. arXiv e-prints. 2024:arXiv:2401.11817. Farquhar S, Kossen J, Kuhn L, Gal Y. Detecting hallucinations in large language models using semantic entropy. Nature. 2024;630(8017):625-30. Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, et al. Survey of Hallucination in Natural Language Generation. arXiv e-prints. 2022:arXiv:2202.03629. Hendrickx K, Perini L, Van der Plas D, Meert W, Davis J. Machine learning with a reject option: a survey. Machine Learning. 2024;113(5):3073-110. Papadopoulos CE, Yeung H. Uncertainty estimation and Monte Carlo simulation method. Flow Measurement and Instrumentation. 2001;12(4):291-8. van Ravenzwaaij D, Cassey P, Brown SD. A simple introduction to Markov Chain Monte–Carlo sampling. Psychonomic Bulletin & Review. 2018;25(1):143-54. Hüllermeier E, Waegeman W. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Machine Learning. 2021;110(3):457-506. Kopetzki A-K, Charpentier B, Zügner D, Giri S, Günnemann S. Evaluating Robustness of Predictive Uncertainty Estimation: Are Dirichlet-based Models Reliable? arXiv e-prints. 2020:arXiv:2010.14986. Gawlikowski J, Tassi CRN, Ali M, Lee J, Humt M, Feng J, et al. A survey of uncertainty in deep neural networks. Artificial Intelligence Review. 2023;56(1):1513-89. Tauman Kalai A, Nachum O, Vempala SS, Zhang E. Why Language Models Hallucinate2025 September 01, 2025:[arXiv:2509.04664 p.]. Available from: https://ui.adsabs.harvard.edu/abs/2025arXiv250904664T. Yang A, Li A, Yang B, Zhang B, Hui B, Zheng B, et al. Qwen3 Technical Report. arXiv e-prints. 2025:arXiv:2505.09388. DeepSeek AI, Liu A, Feng B, Xue B, Wang B, Wu B, et al. DeepSeek-V3 Technical Report. arXiv e-prints. 2024:arXiv:2412.19437. OpenAi, Agarwal S, Ahmad L, Ai J, Altman S, Applebaum A, et al. gpt-oss-120b & gpt-oss-20b Model Card. arXiv e-prints. 2025:arXiv:2508.10925. Huang Y, Bai Y, Zhu Z, Zhang J, Zhang J, Su T, et al. C-EVAL: a multi-level multi-discipline Chinese evaluation suite for foundation models. Proceedings of the 37th International Conference on Neural Information Processing Systems; New Orleans, LA, USA: Curran Associates Inc.; 2023. p. Article 2749. OpenAi, Hurst A, Lerer A, Goucher AP, Perelman A, Ramesh A, et al. GPT-4o System Card. arXiv e-prints. 2024:arXiv:2410.21276. Rossettini G, Bargeri S, Cook C, Guida S, Palese A, Rodeghiero L, et al. Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study. Frontiers in Digital Health. 2025;Volume 7 - 2025. Thomas JD, Bradley E. Bootstrap confidence intervals. Statistical Science. 1996;11(3):189-228. Altman DG, Bland JM. How to obtain the P value from a confidence interval. Bmj. 2011;343:d2304. Additional Declarations No competing interests reported. Cite Share Download PDF Status: Posted Version 1 posted You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-7968968","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":550945270,"identity":"fb763603-7ccb-4f83-bf7e-49a4092b127c","order_by":0,"name":"jie ji","email":"","orcid":"","institution":"Network and Information Center of Shantou University","correspondingAuthor":false,"prefix":"","firstName":"jie","middleName":"","lastName":"ji","suffix":""},{"id":550945271,"identity":"158d16a0-7a81-46b3-8f50-38b04127e764","order_by":1,"name":"weini xin","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAsElEQVRIiWNgGAWjYJCCAwwVEIYECVrOQFQTr4WBsY0ULQa3ewwPF86rqzM4wHzwNg+DXR5hLXeOJRyeue2whMEBtmRrHobkYoJazG4kHzjMu+0AUAuPmTQPw4HEBsJaEhsO886pA2rh/0asFpAtDcwgW9iI02J/Iy3hMM+xw5IzD7MZW84xSCasRXJGjvFnnpo6fr7jzQ9vvKmwI6wFAZhBhAHx6kfBKBgFo2AU4AEAzcA6Hu0WN8AAAAAASUVORK5CYII=","orcid":"","institution":"Hospital of Stomatology, Shantou University Medical College","correspondingAuthor":true,"prefix":"","firstName":"weini","middleName":"","lastName":"xin","suffix":""}],"badges":[],"createdAt":"2025-10-28 16:41:58","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-7968968/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7968968/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":96995681,"identity":"75b86fa7-7fad-4627-bedd-711685586eb9","added_by":"auto","created_at":"2025-11-28 12:11:22","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":252942,"visible":true,"origin":"","legend":"","description":"","filename":"manuscriptCDLE20251017.docx","url":"https://assets-eu.researchsquare.com/files/rs-7968968/v1/2f1c79a7693431d34dfda9ad.docx"},{"id":96995678,"identity":"c343b90f-3c9d-48dc-a9a1-813028da41ee","added_by":"auto","created_at":"2025-11-28 12:11:22","extension":"json","order_by":1,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":5239,"visible":true,"origin":"","legend":"","description":"","filename":"d57b6751793c45d8aa67ff389f89d662.json","url":"https://assets-eu.researchsquare.com/files/rs-7968968/v1/10e0e55d32b5b9046188ec30.json"},{"id":96995680,"identity":"d151abde-3a5d-44c9-9f67-6b312f211e1b","added_by":"auto","created_at":"2025-11-28 12:11:22","extension":"xml","order_by":2,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":68468,"visible":true,"origin":"","legend":"","description":"","filename":"d57b6751793c45d8aa67ff389f89d6621enriched.xml","url":"https://assets-eu.researchsquare.com/files/rs-7968968/v1/b254d756f4569360762c6f7e.xml"},{"id":96995683,"identity":"0ed624f1-baec-4638-bb22-27544d224dda","added_by":"auto","created_at":"2025-11-28 12:11:22","extension":"png","order_by":5,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":34310,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-7968968/v1/c376552aaa180360d6a896b4.png"},{"id":97139449,"identity":"b4797b8a-7344-47ed-860f-050dffc5b9a8","added_by":"auto","created_at":"2025-12-01 10:00:24","extension":"png","order_by":6,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":17819,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-7968968/v1/e0b32b5b9a3d3df6607d6782.png"},{"id":96995685,"identity":"e9122bcf-1ba0-4e6c-8da3-f21390fbb3d4","added_by":"auto","created_at":"2025-11-28 12:11:22","extension":"xml","order_by":7,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":70006,"visible":true,"origin":"","legend":"","description":"","filename":"d57b6751793c45d8aa67ff389f89d6621structuring.xml","url":"https://assets-eu.researchsquare.com/files/rs-7968968/v1/85b37cf5d6a8abcead362496.xml"},{"id":96995686,"identity":"dd20ff8b-82aa-4863-9cfd-fefd04bb381b","added_by":"auto","created_at":"2025-11-28 12:11:22","extension":"html","order_by":8,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":78677,"visible":true,"origin":"","legend":"","description":"","filename":"earlyproof.html","url":"https://assets-eu.researchsquare.com/files/rs-7968968/v1/b25cd08b5482bcd9fe60131b.html"},{"id":96995679,"identity":"3198fb24-e34f-4a53-9a57-163f5cf4b451","added_by":"auto","created_at":"2025-11-28 12:11:22","extension":"jpeg","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":247868,"visible":true,"origin":"","legend":"\u003cp\u003eExam-level Model Performance Comparison\u003c/p\u003e","description":"","filename":"floatimage1.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7968968/v1/cb13a4d463172a87f1475df9.jpeg"},{"id":97138779,"identity":"53246bb2-9ae7-4743-b7ff-5d491c386bc8","added_by":"auto","created_at":"2025-12-01 09:59:20","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":53289,"visible":true,"origin":"","legend":"\u003cp\u003eQuestion accuracy rates including 95% CIs using instruction template No. 1\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-7968968/v1/f5d9e564befd3017da2d6721.png"},{"id":105443574,"identity":"a73d8f71-50d9-4c50-9798-cf115830b6ef","added_by":"auto","created_at":"2026-03-26 06:42:26","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1152885,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7968968/v1/dea99076-7424-4159-8b84-688ce5b05985.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Evaluation of Large Language Models on the Chinese Dental Licensing Examination","fulltext":[{"header":"Introduction","content":"\u003cp\u003eOver the past few decades, artificial narrow intelligence (ANI)\u0026mdash;defined as AI systems designed to perform specific, well-defined tasks within a limited domain\u0026mdash;has made substantial progress in medicine. For example, deep learning models now match or even surpass human experts in many specialized tasks when large labeled datasets are available. However, real-world clinical applications encompass many specialized domains and tasks, and developing separate AI systems for each task is prohibitively costly. Moreover, in cases involving rare diseases or substantial missing data, acquiring labeled data can be not only labor-intensive but, in some cases, impossible.\u003c/p\u003e\u003cp\u003eIn recent years, large language models (LLMs) have advanced rapidly and, despite ongoing debate, are considered to possess certain general artificial-intelligence-like capabilities. LLMs exhibit emergent abilities (1\u0026ndash;3) that are absent in smaller models but present in larger ones. Most importantly, LLMs can generalize, to some extent, to new tasks for which they were not specifically trained. As a result, LLMs have opened new avenues for artificial intelligence applications in healthcare.\u003c/p\u003e\u003cp\u003eNumerous previous studies have focused on evaluating the ability of LLMs to pass medical exams, answer medical questions, or provide recommendations (4\u0026ndash;12). Some studies have utilized LLMs for clinical decision-making, including disease diagnosis and outcome prediction (13\u0026ndash;20). In dentistry, several studies have assessed the performance of LLMs on dental examinations conducted in various languages(21\u0026ndash;25); however, the results have been inconsistent. In oral and maxillofacial surgery examinations, Quah, B. (22) reported that LLMs were capable of achieving a passing score of 62.5% in the oral and maxillofacial surgery multiple-choice questions, and GPT-4 (26) and Copilot performed the best of the included LLMs. A systematic review and meta-analysis study (21), which included 11 studies, found that GPT-4 achieved an integrated accuracy of 73%, outperforming GPT-3.5 (54%) and Bard (56%). However, compared with medical licensing examinations, LLMs performed worse and faced greater challenges in dental licensing examinations(21). Revilla-Le\u0026oacute;n, Marta evaluated the performance of ChatGPT on the European certification in implant dentistry exam(23) and reported that ChatGPT was able to pass the exam for the 2022 Certification in Implant Dentistry of the European Association for Osseointegration. Conversely, a Japanese study reported that GPT-4 did not pass the Japanese National Dentist Examination and all LLMs demonstrated significantly lower accuracy for dentistry questions compared with other types of questions(24). A study that evaluated the performance of AI on the Turkish dental specialization exam(25) found that AI-powered chatbots, namely ChatGPT-4.0 and Gemini Advanced, passed the DUS by exceeding the threshold score of 45. In addition, several studies have evaluated the question-answering capabilities of LLMs, and these scenarios are very similar to exams. \u0026Ouml;zbay, Yağız conducted a study of evaluation of the performance of LLMs in clinical decision-making in endodontics (27) and found that ChatGPT-4 demonstrated the highest score and the lowest misinformation rate (P\u0026thinsp;=\u0026thinsp;0.008) compared to other LLMs. Zhu, Guohui conducted a study about assessing and enhancing the reliability of Chinese LLMs in dental implantology (28) and confirmed that Qwen 2.5 and ERNIE Bot 3.5 demonstrated exceptional reliability in dental implantology, excelling in answer accuracy and minimizing misinformation across question types. Birkan Eyup Yilmaz conducted a comparative analysis about Artificial intelligence performance in answering multiple-choice oral pathology questions (29) and reported LLMs demonstrated variable proficiency in oral pathology questions, with ChatGPT o1 showing higher accuracy.\u003c/p\u003e\u003cp\u003eAlthough some studies have examined the performance of LLMs in both dental examinations and other question answer scenarios across various languages, their results have been inconsistent. The languages investigated in dental licensure examinations include English, Japanese, and Turkish; however, Chinese, which is the most spoken language in the world, has not been investigated.\u003c/p\u003e\u003cp\u003eThe Chinese Dental Licensing Examination(CDLE) was established under the \u0026ldquo;Medical Practitioners Law of the People\u0026rsquo;s Republic of China(30)\u0026rdquo;. The exam consists of two parts: a comprehensive written examination and a practical skills assessment. This study focuses on the first part of the exam, which can be automatically conducted by LLMs. The comprehensive written examination covers five major modules: basic medicine, clinical medicine, preventive medicine, medical humanities, and clinical dental medicine. It is administered via computer and includes four question types: A1, A2, A3, and B1. All questions are five-choice, single-answer questions. Passing the exam requires answering 60% of the questions correctly.\u003c/p\u003e\u003cp\u003eIn medical domains, mispredictions can have serious consequences. However, it is well known that LLMs are prone to hallucinations(31\u0026ndash;33), which represent one of the greatest risks limiting their application, particularly in the medical field. Hallucinations are often defined as instances where LLMs generate plausible but factually incorrect or nonsensical information (32, 34). Therefore, it would be beneficial for a model to be cautious in situations where it is uncertain about its predictions(35). One way to accomplish this is to use machine learning models with rejection, and this is true for both traditional machine learning models and LLMs. Most machine learning models can output predicted probability values, and by adding threshold-based post-processing, a rejection option can be incorporated. Additionally, there are some relatively sophisticated uncertainty estimation methods such as Bayesian neural networks, Bayesian Monte Carlo simulation method(36\u0026ndash;38), Dirichlet-based models, separated rejectors, test-time data augmentation(39, 40), among others. Because LLMs are computationally expensive and intermediate results or internal representations from cloud-based models are often inaccessible, a simple method is needed to implement a rejection option. A previous study adopted an uncertainty estimation method to address hallucinations in LLMs(33); however, there remains a need for lightweight and specialized methods for LLMs. A few studies suggest that explicitly adding an 'unknown' option and imposing a penalty for incorrect answers in the prompt can help reduce hallucinations(41). However, the effectiveness of this approach has not been widely demonstrated, especially in the medical domain.\u003c/p\u003e\u003cp\u003eTo address these research gaps, this study evaluates the performance of multiple LLMs on the CDLE. Because each multiple-choice question in this exam has a definitive correct answer, it serves as an effective benchmark for assessing hallucinations in LLMs. Furthermore, this study conducts a comparative analysis of three prompting strategies: (i) a standard prompt template without an 'unknown' option, (ii) a modified prompt including an explicit \u0026lsquo;unknown\u0026rsquo; response option, and (iii) a comprehensive prompting approach combining both the \u0026lsquo;unknown\u0026rsquo; option and a penalty for incorrect answers.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\u003ch2\u003eData processing\u003c/h2\u003e\u003cp\u003eThe official preparation book, titled \u003cem\u003eHistorical Chinese Professional Dental Licensing Examinations (including questions, answers, and detailed explanations)\u003c/em\u003e, contains four past CDLEs from the past ten years, authored by the Chinese National Licensed Physician Qualification Examination Proposition Research Group and published by Liaoning University Press in November 2024 with ISBN 978-7-5610-8453-3, was used as the data source. All examinations and questions from the book were retained.\u003c/p\u003e\u003c/div\u003e\n\u003ch3\u003eModels\u003c/h3\u003e\n\u003cp\u003eThree cloud-based models\u0026mdash;Qwen3-Max, Qwen-Plus(42), and DeepSeek-V3.1(43)\u0026mdash;along with two locally deployed models, Qwen3-32B and GPT-OSS-120B(44)\u0026mdash;were evaluated on the CDLE. These models were selected for the following reasons. The Qwen and DeepSeek model families have demonstrated strong multilingual capabilities, with particularly robust performance on Chinese(42, 43). For example, on Chinese language benchmarks such as C-Eval(45) and AlignBench(9), Qwen3 models achieved higher scores than GPT-4o(26, 46), Gemma-3, LLaMA-3.1, LLama-4, and even DeepSeek-R1. DeepSeek performed on par with state-of-the-art proprietary models in clinical decision-making(18). The GPT series models developed by OpenAI, including GPT-4 and GPT-4o, and the newly released GPT-5 (released on August 7, 2025) are among the most prominent LLMs in the world. The GPT-OSS-120B, the most powerful open-source model released by OpenAI to date, achieves near parity with GPT-o4-mini across multiple tasks (47).\u003c/p\u003e\u003cp\u003eAmong the five models, the first two are proprietary models, while the latter three are open source. Qwen3-Max has over 1 trillion parameters; however, the exact number of parameters for Qwen-Plus has not been publicly disclosed. DeepSeek-V3.1 is a powerful Mixture-of-Experts (MoE) language model with a total of 685\u0026nbsp;billion parameters, including 671\u0026nbsp;billion from the main model and 14\u0026nbsp;billion from the Multi-Token Prediction (MTP) module. However, only about 37\u0026nbsp;billion parameters are activated per token. GPT-OSS-120B has 117\u0026nbsp;billion parameters in total, with approximately 5.1\u0026nbsp;billion active parameters per forward pass, and it operates using 4-bit quantization. Qwen3-Max was released by Alibaba on September 24, 2025, and DeepSeek-V3.1 was released by Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Company on August 21, 2025. OpenAI released the open-source model gpt-oss-120b on August 5, 2025. The GPT-OSS-120B model achieves near parity with OpenAI o4-mini on multiple tasks(47).\u003c/p\u003e\u003cp\u003eAlibaba Cloud is the largest cloud computing company in China and one of the largest in the world. The Qwen-Plus, Qwen3-Max, and DeepSeek-V3.1 models were evaluated on Alibaba Cloud, a highly stable platform that provides an OpenAI-compatible API for model inference.\u003c/p\u003e\u003cp\u003eThe afore-mentioned Qwen3-32B and GPT-OSS-120B models were deployed locally using Ollama, which provides built-in support for the OpenAI Chat Completions API.\u003c/p\u003e\n\u003ch3\u003eCustomized program for automatically conducting exams\u003c/h3\u003e\n\u003cp\u003eA custom-designed program was developed to automatically conduct the CDLE by leveraging the OpenAI API to communicate with both locally deployed and cloud-based LLMs. The program utilizes the Chat Completions API\u0026mdash;specifically, the chat.completions.create method in the OpenAI Python library\u0026mdash;to communicate with LLMs. The messages sent to each LLM include a system message and a user message.\u003c/p\u003e\u003cp\u003eTo determine whether using prompts that include an \u0026lsquo;unknown\u0026rsquo; response option can reduce hallucinations in LLMs\u0026mdash;and thereby improve answer accuracy\u0026mdash;three types of system messages were implemented. The first is the traditional message: \u003cem\u003e\"You are an experienced dentist. Please select the most correct answer from the five given options based on the question description, and respond directly with the number of the correct answer.\"\u003c/em\u003e\u003c/p\u003e\u003cp\u003eThe second message introduces an \"I don't know\" option: \u003cem\u003e\"You are an experienced dentist. Among the five options provided for the following question, only one is the most correct. If you are fairly certain about which option is correct, respond directly with its number; otherwise, respond with 'I don't know.'\"\u003c/em\u003e\u003c/p\u003e\u003cp\u003eThe third message not only includes the \"I don't know\" option but also imposes a penalty for incorrect answers: \u003cem\u003e\"You are an experienced dentist. Among the five options provided for the following question, only one is the most correct. If you are more than 90% confident in your answer, respond directly with the number of the option; otherwise, respond with 'I don't know.' You will receive 1 point for a correct answer, 0 points for responding 'I don't know,' and \u0026minus;\u0026thinsp;1 point for an incorrect answer.\"\u003c/em\u003e\u003c/p\u003e\n\u003ch3\u003eParsing LLM results\u003c/h3\u003e\n\u003cp\u003eThese LLMs are generative AI systems that produce text sequences in response to a given input prompt. Although carefully designed prompts can guide LLMs to output question numbers in most cases, the responses sometimes contain incorrect formats or invalid values. Nonetheless, in most cases, humans can readily interpret the outputs to identify the correct answers. To extract the predicted answer numbers from the model generated text, rule-based parsing techniques were applied. When the system failed to extract a valid answer number, the instance was classified as a parsing error. In this study, all parsing errors were treated as incorrect answers.\u003c/p\u003e\n\u003ch3\u003eStatistics Analysis\u003c/h3\u003e\n\u003cp\u003eModel performance was evaluated at both the exam and the question levels. At the exam level, overall mean accuracy rates\u0026thinsp;\u0026plusmn;\u0026thinsp;standard deviation (SD), treating both parsing errors and unknown cases as incorrect, and pass/fail outcomes were used as performance metrics. At the question level\u0026mdash;aggregating all questions across exams\u0026mdash;accuracy rates, along with 95% and 99% confidence intervals (CIs), were computed. CIs were estimated using a nonparametric bootstrap resampling method with 500 replications(48). When unknown cases were present, accuracy rates were calculated in two ways: (1) excluding those cases and (2) treating them as incorrect. Additionally, the number and proportion of parsing errors and unknown cases were reported. Differences in means and accuracy rates comparison were performed using the t-test. P values were computed from CIs (49), and a \u003cem\u003ep\u003c/em\u003e-value of \u0026lt;\u0026thinsp;0.05 was considered statistically significant. Statistics analysis was conducted using a customized python program.\u003c/p\u003e\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\u003ch2\u003eExperimental settings\u003c/h2\u003e\u003cp\u003eHardware: Intel E5-2620 V4 * 2, 256GB Memory, Nvidia GTX 4090 * 2 48GB\u003c/p\u003e\u003cp\u003eSoftware: Ubuntu 20.04, CUDA 12.4, Anaconda 23.1, Ollama 0.11.\u003c/p\u003e\u003cp\u003eThe programming language and libraries: Python 3.10, the OpenAI Python library, Pandas, NumPy, SciPy, and among others. Detailed information about these software libraries can be found in the file requirements.txt of the source code.\u003c/p\u003e\u003c/div\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e\u003ch2\u003eDataset\u003c/h2\u003e\u003cp\u003eA dataset comprising four CDLEs (2,400 questions in total) was constructed. Each question was a five-option, single-answer multiple-choice question.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\u003ch2\u003eExam-level Performance\u003c/h2\u003e\u003cp\u003eExam-level model performance comparison for the CDLE is shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e shows the mean accuracy (\u0026plusmn;\u0026thinsp;SD) and pass/fail outcome of each model\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eMean accuracy (\u0026plusmn;\u0026thinsp;SD) and pass/fail outcome of each model\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"4\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eExam type\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eModel name\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMean accuracy(\u0026plusmn;\u0026thinsp;SD)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003epass/fail outcomes\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eCDLE\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eQwen3-Max\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e\u003cp\u003e0.866\u0026thinsp;\u0026plusmn;\u0026thinsp;0.089\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003epass\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eCDLE\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eQwen-Plus\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e\u003cp\u003e0.851\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0767\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003epass\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eCDLE\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eDeepSeek-V3.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e\u003cp\u003e0.737\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0737\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003epass\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eCDLE\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eQwen3-32B\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e\u003cp\u003e0.748\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0868\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003epass\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eCDLE\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eGPT-OSS-120B\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e\u003cp\u003e0.652\u0026thinsp;\u0026plusmn;\u0026thinsp;0.0799\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003epass\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eAs shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e at the exam level, Qwen3-Max achieved higher accuracy rates than Qwen-Plus (p\u0026thinsp;\u0026lt;\u0026thinsp;0.01). Additionally, both Qwen3-Max and Qwen-Plus significantly outperformed DeepSeek-V3.1, Qwen3-32B, and GPT-OSS-120B (p\u0026thinsp;\u0026lt;\u0026thinsp;0.01). All models passed the CDLE.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\u003ch2\u003eQuestion-level Performance\u003c/h2\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e shows the question-level accuracy rates, with 95% CIs, for different models using instruction template No. 1.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eTables\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, \u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, and \u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e present detailed performance metrics for different models using Instruction Templates No. 1, No. 2, and No. 3, respectively. The reported metrics include parsing error rate, wrong answer rate, and correct answer rate. CIs (95% and 99%) are provided for correct rates. For prompts involving an unknown option, the unknown rate and adjusted correct answer rates were calculated using two approaches: (1) excluding unknown cases, and (2) treating unknown responses as incorrect answers. CIs (95% and 99%) are provided for correct rates.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003ePerformance metrics using instruction template No. 1.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"5\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eModel name\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eParsing\u003c/p\u003e\u003cp\u003eerrors\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCorrect\u003c/p\u003e\u003cp\u003eanswers\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eWrong answer\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eCorrect rate\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eQwen3-Max\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e1\u003c/p\u003e\u003cp\u003e(0.04%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e2077\u003c/p\u003e\u003cp\u003e(86.5%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e322\u003c/p\u003e\u003cp\u003e(13.4%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e0.865\u003c/p\u003e\u003cp\u003e(95%CI: 0.852\u0026ndash;0.878,\u003c/p\u003e\u003cp\u003e99%CI: 0.847\u0026ndash;0.884)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eQwen-Plus\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e2043\u003c/p\u003e\u003cp\u003e(85.1%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e357\u003c/p\u003e\u003cp\u003e(14.9%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e0.851\u003c/p\u003e\u003cp\u003e(95%CI: 0.837\u0026ndash;0.865, 99%CI: 0.832\u0026ndash;0.870)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDeepSeek-V3.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e34\u003c/p\u003e\u003cp\u003e(1.42%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e1745\u003c/p\u003e\u003cp\u003e(72.7%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e621\u003c/p\u003e\u003cp\u003e(25.9%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e0.727\u003c/p\u003e\u003cp\u003e(95%CI: 0.709\u0026ndash;0.745,\u003c/p\u003e\u003cp\u003e99%CI: 0.703\u0026ndash;0.752)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eQwen3-32B\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e24\u003c/p\u003e\u003cp\u003e(1.0%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e1778\u003c/p\u003e\u003cp\u003e(74.1%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e598\u003c/p\u003e\u003cp\u003e(24.9%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e0.741\u003c/p\u003e\u003cp\u003e(95%CI: 0.724\u0026ndash;0.756,\u003c/p\u003e\u003cp\u003e99%CI: 0.715\u0026ndash;0.763)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGPT-OSS-120B\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e1\u003c/p\u003e\u003cp\u003e(0.04%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e1563\u003c/p\u003e\u003cp\u003e(65.1%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e836\u003c/p\u003e\u003cp\u003e(34.8%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e0.651\u003c/p\u003e\u003cp\u003e(95%CI: 0.634\u0026ndash;0.671,\u003c/p\u003e\u003cp\u003e99%CI: 0.624\u0026ndash;0.675)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003ePerformance metrics using instruction template No. 2.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"7\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eModel name\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eParsing\u003c/p\u003e\u003cp\u003eerrors\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eUnknown\u003c/p\u003e\u003cp\u003ecases\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eCorrect\u003c/p\u003e\u003cp\u003eanswers\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eWrong answer\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003eCorrect rate\u003c/p\u003e\u003cp\u003e(exclude unknown cases)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c7\"\u003e\u003cp\u003eCorrect rate\u003c/p\u003e\u003cp\u003e(unknowns as errors)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eQwen3-Max\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e1\u003c/p\u003e\u003cp\u003e(0.04%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e15\u003c/p\u003e\u003cp\u003e(0.63%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e2078\u003c/p\u003e\u003cp\u003e(86.6%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e306\u003c/p\u003e\u003cp\u003e(12.8%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e0.871\u003c/p\u003e\u003cp\u003e(95%CI: 0.858\u0026ndash;0.883,\u003c/p\u003e\u003cp\u003e99%CI: 0.852\u0026ndash;0.886)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.8658\u003c/p\u003e\u003cp\u003e(95%CI: 0.851\u0026ndash;0.879,\u003c/p\u003e\u003cp\u003e99% CI: 0.848\u0026ndash;0.883)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eQwen-Plus\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e13\u003c/p\u003e\u003cp\u003e(0.54%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e2027\u003c/p\u003e\u003cp\u003e(84.5%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e360\u003c/p\u003e\u003cp\u003e(15%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e0.849\u003c/p\u003e\u003cp\u003e(95%CI: 0.834\u0026ndash;0.863,\u003c/p\u003e\u003cp\u003e99%: 0.829\u0026ndash;0.865)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.8446\u003c/p\u003e\u003cp\u003e(95%CI: 0.830\u0026ndash;0.857,\u003c/p\u003e\u003cp\u003e99%: 0.826\u0026ndash;0.862)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDeepSeek-V3.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e62\u003c/p\u003e\u003cp\u003e(2.58%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e86\u003c/p\u003e\u003cp\u003e(3.58%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e1629\u003c/p\u003e\u003cp\u003e(67.9%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e623\u003c/p\u003e\u003cp\u003e(26.0%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e0.704\u003c/p\u003e\u003cp\u003e(95%CI:0.686\u0026ndash;0.723,\u003c/p\u003e\u003cp\u003e99%CI: 0.679\u0026ndash;0.734)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.6787\u003c/p\u003e\u003cp\u003e(95%CI: 0.660\u0026ndash;0.698,\u003c/p\u003e\u003cp\u003e99% CI:0.658\u0026ndash;0.705)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eQwen3-32B\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e18\u003c/p\u003e\u003cp\u003e(0.75%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e1783\u003c/p\u003e\u003cp\u003e(74.2%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e599\u003c/p\u003e\u003cp\u003e(25.0%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e0.743\u003c/p\u003e\u003cp\u003e(95%CI: 0.725\u0026ndash;0.759,\u003c/p\u003e\u003cp\u003e99%CI: 0.7208\u0026ndash;0.7658)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.7429\u003c/p\u003e\u003cp\u003e(95%CI: 0.725\u0026ndash;0.759,\u003c/p\u003e\u003cp\u003e99%CI: 0.721\u0026ndash;0.766)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGPT-OSS-120B\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e2\u003c/p\u003e\u003cp\u003e(0.08%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e29\u003c/p\u003e\u003cp\u003e(1.20%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e1551\u003c/p\u003e\u003cp\u003e(64.6%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e818\u003c/p\u003e\u003cp\u003e(34.1%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e0.6542\u003c/p\u003e\u003cp\u003e(95%CI: 0.634\u0026ndash;0.674,\u003c/p\u003e\u003cp\u003e99%CI: 0.632\u0026ndash;0.678)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.646\u003c/p\u003e\u003cp\u003e(95%CI: 0.625\u0026ndash;0.665,\u003c/p\u003e\u003cp\u003e99% CI:0.623\u0026ndash;0.668)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003ePerformance metrics using instruction template No. 3.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"7\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eModel name\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eParsing\u003c/p\u003e\u003cp\u003eerrors\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eUnknown\u003c/p\u003e\u003cp\u003ecases\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eCorrect\u003c/p\u003e\u003cp\u003eanswers\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eWrong answer\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003eCorrect rate\u003c/p\u003e\u003cp\u003e(exclude unknown cases)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c7\"\u003e\u003cp\u003eCorrect rate\u003c/p\u003e\u003cp\u003e(unknowns as errors)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eQwen3-Max\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e1\u003c/p\u003e\u003cp\u003e(0.04%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e15\u003c/p\u003e\u003cp\u003e(0.63%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e2075\u003c/p\u003e\u003cp\u003e(86.46%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e309\u003c/p\u003e\u003cp\u003e(12.88%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e0.870\u003c/p\u003e\u003cp\u003e(95%CI: 0.857\u0026ndash;0.885,\u003c/p\u003e\u003cp\u003e99%CI: 0.853\u0026ndash;0.889)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.8646\u003c/p\u003e\u003cp\u003e(95%CI: 0.851\u0026ndash;0.878,\u003c/p\u003e\u003cp\u003e99%CI: 0.847\u0026ndash;0.885)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eQwen-Plus\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e13\u003c/p\u003e\u003cp\u003e(0.54%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e2026\u003c/p\u003e\u003cp\u003e(84.4%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e361\u003c/p\u003e\u003cp\u003e(15.04%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e0.849\u003c/p\u003e\u003cp\u003e(95%CI: 0.832\u0026ndash;0.862,\u003c/p\u003e\u003cp\u003e99%CI: 0.832\u0026ndash;0.866)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.844\u003c/p\u003e\u003cp\u003e(95%CI: 0.829\u0026ndash;0.858,\u003c/p\u003e\u003cp\u003e99%CI: 0.827\u0026ndash;0.861)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDeepSeek-V3.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e121\u003c/p\u003e\u003cp\u003e(5.04%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e1572\u003c/p\u003e\u003cp\u003e(65.50%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e707\u003c/p\u003e\u003cp\u003e(29.46%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e0.6898\u003c/p\u003e\u003cp\u003e(95%CI: 0.671\u0026ndash;0.709,\u003c/p\u003e\u003cp\u003e99%CI: 0.665\u0026ndash;0.717)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.655\u003c/p\u003e\u003cp\u003e(95%CI: 0.634\u0026ndash;0.674,\u003c/p\u003e\u003cp\u003e99%CI: 0.632\u0026ndash;0.676)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eQwen3-32B\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e24\u003c/p\u003e\u003cp\u003e(1.00%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e3\u003c/p\u003e\u003cp\u003e(0.13%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e1782\u003c/p\u003e\u003cp\u003e(74.25%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e591\u003c/p\u003e\u003cp\u003e(24.63%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e0.743\u003c/p\u003e\u003cp\u003e(95%CI: 0.726\u0026ndash;0.761,\u003c/p\u003e\u003cp\u003e99%CI: 0.720\u0026ndash;0.769)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.7425\u003c/p\u003e\u003cp\u003e(95%CI: 0.725\u0026ndash;0.762,\u003c/p\u003e\u003cp\u003e99%CI: 0.723\u0026ndash;0.766)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGPT-OSS-120B\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e53\u003c/p\u003e\u003cp\u003e(2.21%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e1551\u003c/p\u003e\u003cp\u003e(64.63%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e796\u003c/p\u003e\u003cp\u003e(33.17%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e0.661\u003c/p\u003e\u003cp\u003e(95%CI: 0.641\u0026ndash;0.681,\u003c/p\u003e\u003cp\u003e99%CI: 0.638\u0026ndash;0.687)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.646\u003c/p\u003e\u003cp\u003e(95%CI: 0.626\u0026ndash;0.663,\u003c/p\u003e\u003cp\u003e99%CI: 0.620\u0026ndash;0.672)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eTables\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e\u0026ndash;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e show that Qwen3-Max achieved a higher question-level accuracy than Qwen-Plus (P\u0026thinsp;\u0026lt;\u0026thinsp;0.05). Furthermore, both Qwen3-Max and Qwen-Plus significantly outperformed DeepSeek-V3.1, Qwen3-32B, and GPT-OSS-120B in terms of question-level accuracy (P\u0026thinsp;\u0026lt;\u0026thinsp;0.01).\u003c/p\u003e\u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e presents the execution time of different models\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eExecution time of models\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eModel name\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eExecution time(s)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eAverage each question time (s)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eQwen3-Max\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e1478.31\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.62\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eQwen-Plus\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e1120.26\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.47\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDeepSeek-V3.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e4142.36\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e1.73\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eQwen3-32B\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e46259.44\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e19.27\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGPT-OSS-120B\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e11502.46\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e4.79\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eAs shown in Tables\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e, cloud-based models performed inference significantly faster than locally deployed models. Among the local models, although GPT-OSS-120B has nearly four times as many parameters as Qwen3-32B, it ran considerably faster. This is because GPT-OSS-120B is a sparse model and employs 4-bit quantization, whereas Qwen3-32B is a dense model without additional quantization. All experiments were conducted using the default settings, including those for the OpenAI API and Ollama. It should be noted that although this comparison result is meaningful, the comparison is not entirely equivalent across models. Qwen3-32B enables thinking mode by default, whereas for commercial Qwen models such as Qwen-Plus, thinking mode is off by default and must be manually enabled. Compared with the non-thinking mode, LLMs operating in thinking mode generate far more tokens and run more slowly.\u003c/p\u003e\u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eAll models successfully passed the CDLE. Qwen3-Max achieved the highest scores across all metrics, and both Qwen3-Max and Qwen-Plus performed significantly better than the other models. Although cloud-based models have considerably more parameters and higher computational demands than locally deployed models, they nevertheless operated considerably faster. While using local LLMs is theoretically cost-free, using cloud-based models is also highly cost-effective\u0026mdash;the experiments conducted in this study cost only a few U.S. dollars (when converted from RMB). Therefore, in most cases, using cloud-based LLMs represents a more practical choice than deploying models locally, except for a few companies or institutions with access to extensive computational resources.\u003c/p\u003e\u003cp\u003eContrary to the findings of a previous study (41), using prompts that included an \u0026lsquo;unknown\u0026rsquo; option\u0026mdash;or combining the \u0026lsquo;unknown\u0026rsquo; option with a penalty for incorrect answers\u0026mdash;did not apparently improve model accuracy or reduce hallucinations. Consequently, mitigating hallucinations in LLMs should focus on enhancing training data, model architecture, and training processes, rather than relying solely on prompt engineering.\u003c/p\u003e\u003cp\u003eThis study has several notable strengths. To the best of our knowledge, it is the first to evaluate the performances of multiple LLMs on the CDLE or any other Chinese dental examination. It is also the first to compare the performance of using a standard prompt, incorporating an \u0026lsquo;unknown\u0026rsquo; option, and combining the \u0026lsquo;unknown\u0026rsquo; option with a penalty for incorrect answers on medical examinations. As a result, other uncertainty estimation methods should be considered. Exam-level and question-level performances were assessed, and the results were significant and consistent across both levels.\u003c/p\u003e\u003cp\u003eDespite these promising results, several limitations should be acknowledged. OpenAI\u0026rsquo;s ChatGPT-4 and ChatGPT-5, as well as Google Gemini, are among the most widely used LLMs to date; however, due to certain constraints, these models were not included in the present evaluation. Although the open-source GPT-OSS-120B could serve as a surrogate for ChatGPT-4, their relative performance\u0026mdash;particularly on the CDLE\u0026mdash;remains unknown. The comprehensive written examination covers five major modules and includes four question types: A1, A2, A3, and B1. However, subgroup analyses were not conducted in this study. Although the data in this study are drawn from officially published books by authoritative institutions and the examinations are selected in their entirety from recent real-world examinations, possibly due to concerns regarding exam security and prevention of data leakage, the authors and the publishing agency did not disclose the specific year or date of the examinations.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eIn conclusion, five LLMs were evaluated on the CDLE, all of which successfully achieved passing scores and some models achieved remarkably high scores. Qwen3-Max demonstrated the best overall performance across all evaluated metrics. Moreover, using prompts that included an \u0026lsquo;unknown\u0026rsquo; option\u0026mdash;or combining the \u0026lsquo;unknown\u0026rsquo; option with a penalty for incorrect answers\u0026mdash;did not improve model accuracy. As a result, other uncertainty estimation methods should be considered. Cloud-based LLMs were significantly faster than locally deployed models. In most cases, using cloud-based LLMs represents a better choice than using locally deployed models. In the future, LLMs are expected to play an important role in dental education, particularly in supporting medical students\u0026rsquo; self-directed learning.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cp\u003eThe following abbreviations are used in this manuscript:\u003c/p\u003e\n\u003cp\u003eLLM\u0026nbsp; \u0026nbsp; \u0026nbsp;\u0026nbsp;large language model\u003c/p\u003e\n\u003cp\u003eCDLE\u0026nbsp; \u0026nbsp;\u0026nbsp;Chinese Dental Licensing Examination\u003c/p\u003e\n\u003cp\u003eSD standard deviation\u003c/p\u003e\n\u003cp\u003eCI confidence interval\u003c/p\u003e\n\u003cp\u003eANI artificial narrow intelligence\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study was exempt from institutional review board approval owing to the use of a published book.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eClinical trial number: not applicable.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe code, dataset, and prediction results are available at: https://github.com/linchundan88/Chinese_professional_dental_licensing_examinations.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no conflicts of interest.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis research was funded by the 2019 Guangdong Science and Technology Special Fund “Medical Education Talent Training and Clinical Technology Improvement Plan” (Research Project Number 2019113134).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors’ Contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eConceptualization: JJ and WX; data curation, JJ; formal analysis, JJ; investigation, WX; resources, WX; validation, JJ; writing—original draft preparation, JJ; writing—review and editing, JJ, WX; visualization, JJ; supervision, WX; project administration, WX; funding acquisition, WX. All authors have read and agreed to the published version of the manuscript.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eWei J, Tay Y, Bommasani R, Raffel C, Zoph B, Borgeaud S, et al. Emergent Abilities of Large Language Models. arXiv e-prints. 2022:arXiv:2206.07682.\u003c/li\u003e\n\u003cli\u003eSchaeffer R, Miranda B, Koyejo S. Are Emergent Abilities of Large Language Models a Mirage? arXiv e-prints. 2023:arXiv:2304.15004.\u003c/li\u003e\n\u003cli\u003eLu S, Bigoulaeva I, Sachdeva R, Tayyar Madabushi H, Gurevych I, editors. Are Emergent Abilities in Large Language Models just In-Context Learning?2024 August; Bangkok, Thailand: Association for Computational Linguistics.\u003c/li\u003e\n\u003cli\u003eSinghal K, Tu T, Gottweis J, Sayres R, Wulczyn E, Amin M, et al. Toward expert-level medical question answering with large language models. Nature Medicine. 2025;31(3):943-50.\u003c/li\u003e\n\u003cli\u003eGilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023;9:e45312.\u003c/li\u003e\n\u003cli\u003eYaneva V, Baldwin P, Jurich DP, Swygert K, Clauser BE. Examining ChatGPT Performance on USMLE Sample Items and Implications for Assessment. Academic Medicine. 2024;99(2):192-7.\u003c/li\u003e\n\u003cli\u003eKung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepa\u0026ntilde;o C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health. 2023;2(2):e0000198.\u003c/li\u003e\n\u003cli\u003eBrin D, Sorin V, Vaid A, Soroush A, Glicksberg BS, Charney AW, et al. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Scientific Reports. 2023;13(1):16492.\u003c/li\u003e\n\u003cli\u003eLiu X, Lei X, Wang S, Huang Y, Feng A, Wen B, et al., editors. AlignBench: Benchmarking Chinese Alignment of Large Language Models2024 August; Bangkok, Thailand: Association for Computational Linguistics.\u003c/li\u003e\n\u003cli\u003eHsieh C-H, Hsieh H-Y, Lin H-P. Evaluating the performance of ChatGPT-3.5 and ChatGPT-4 on the Taiwan plastic surgery board examination. Heliyon. 2024;10(14).\u003c/li\u003e\n\u003cli\u003eSarraju A, Bruemmer D, Van Iterson E, Cho L, Rodriguez F, Laffin L. Appropriateness of Cardiovascular Disease Prevention Recommendations Obtained From a Popular Online Chat-Based Artificial Intelligence Model. JAMA. 2023;329(10):842-4.\u003c/li\u003e\n\u003cli\u003eZong H, Li J, Wu E, Wu R, Lu J, Shen B. Performance of ChatGPT on Chinese national medical licensing examinations: a five-year examination evaluation study for physicians, pharmacists and nurses. BMC Medical Education. 2024;24(1).\u003c/li\u003e\n\u003cli\u003eZhu M, Lin H, Jiang J, Jinia AJ, Jee J, Pichotta K, et al. Large language model trained on clinical oncology data predicts cancer progression. npj Digital Medicine. 2025;8(1):397.\u003c/li\u003e\n\u003cli\u003eShashikumar SP, Mohammadi S, Krishnamoorthy R, Patel A, Wardi G, Ahn JC, et al. Development and prospective implementation of a large language model based system for early sepsis prediction. npj Digital Medicine. 2025;8(1):290.\u003c/li\u003e\n\u003cli\u003eCano-Besquet S, Rice-Canetto T, Abou-El-Hassan H, Alarcon S, Zimmerman J, Issagholian L, et al. ChatGPT4\u0026rsquo;s diagnostic accuracy in inpatient neurology: A retrospective cohort study. Heliyon. 2024;10(24):e40964.\u003c/li\u003e\n\u003cli\u003ePatel A, Ruoff C, Helgeson SA, Carvalho DZ, Castillo PR, Cheung J. Diagnostic performance of Large Language Models (LLMs) compared with physicians in sleep medicine. Sleep Medicine. 2025;134:106677.\u003c/li\u003e\n\u003cli\u003eHan C, Kim DW, Kim S, Chan You S, Park JY, Bae S, et al. Evaluation of GPT-4 for 10-year cardiovascular risk prediction: Insights from the UK Biobank and KoGES data. iScience. 2024;27(2).\u003c/li\u003e\n\u003cli\u003eOpen-source LLM DeepSeek on a par with proprietary models in clinical decision making. Nature Medicine. 2025.\u003c/li\u003e\n\u003cli\u003eHager P, Jungmann F, Holland R, Bhagat K, Hubrecht I, Knauer M, et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nature Medicine. 2024;30(9):2613-22.\u003c/li\u003e\n\u003cli\u003eAyers JW, Poliak A, Dredze M, Leas EC, Zhu Z, Kelley JB, et al. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Internal Medicine. 2023;183(6):589-96.\u003c/li\u003e\n\u003cli\u003eLiu M, Okuhara T, Huang W, Ogihara A, Nagao HS, Okada H, et al. Large Language Models in Dental Licensing Examinations: Systematic Review and Meta-Analysis. International Dental Journal. 2025;75(1):213-22.\u003c/li\u003e\n\u003cli\u003eQuah B, Yong CW, Lai CWM, Islam I. Performance of large language models in oral and maxillofacial surgery examinations. International Journal of Oral and Maxillofacial Surgery. 2024;53(10):881-6.\u003c/li\u003e\n\u003cli\u003eRevilla-Le\u0026oacute;n M, Barmak AB, Sailer I, Kois J, Att W. Performance of an Artificial Intelligence-Based Chatbot (ChatGPT) Answering the European Certification in Implant Dentistry Exam. The International Journal of Prosthodontics. 2024:1-5.\u003c/li\u003e\n\u003cli\u003eOhta K, Ohta S. The Performance of GPT-3.5, GPT-4, and Bard on the Japanese National Dentist Examination: A Comparison Study. Cureus. 2023;15.\u003c/li\u003e\n\u003cli\u003eSismanoglu S, Capan BS. Performance of artificial intelligence on Turkish dental specialization exam: can ChatGPT-4.0 and gemini advanced achieve comparable results to humans? BMC Medical Education. 2025;25(1):214.\u003c/li\u003e\n\u003cli\u003eOpenAi, Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, et al. GPT-4 Technical Report. arXiv e-prints. 2023:arXiv:2303.08774.\u003c/li\u003e\n\u003cli\u003e\u0026Ouml;zbay Y, Erdoğan D, Din\u0026ccedil;er GA. Evaluation of the performance of large language models in clinical decision-making in endodontics. BMC Oral Health. 2025;25(1):648.\u003c/li\u003e\n\u003cli\u003eZhu G, Zhang X, Chen C. Assessing and enhancing the reliability of Chinese large language models in dental implantology. BMC Oral Health. 2025;25(1):1242.\u003c/li\u003e\n\u003cli\u003eYilmaz BE, Gokkurt Yilmaz BN, Ozbey F. Artificial intelligence performance in answering multiple-choice oral pathology questions: a comparative analysis. BMC Oral Health. 2025;25(1):573.\u003c/li\u003e\n\u003cli\u003eChina TNPsCotPsRo. Medical Practitioners Law of the People's Republic of China 2021. Available from: http://en.npc.gov.cn.cdurl.cn/2021-08/20/c_875935.htm.\u003c/li\u003e\n\u003cli\u003eHuang L, Yu W, Ma W, Zhong W, Feng Z, Wang H, et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. arXiv e-prints. 2023:arXiv:2311.05232.\u003c/li\u003e\n\u003cli\u003eXu Z, Jain S, Kankanhalli M. Hallucination is Inevitable: An Innate Limitation of Large Language Models. arXiv e-prints. 2024:arXiv:2401.11817.\u003c/li\u003e\n\u003cli\u003eFarquhar S, Kossen J, Kuhn L, Gal Y. Detecting hallucinations in large language models using semantic entropy. Nature. 2024;630(8017):625-30.\u003c/li\u003e\n\u003cli\u003eJi Z, Lee N, Frieske R, Yu T, Su D, Xu Y, et al. Survey of Hallucination in Natural Language Generation. arXiv e-prints. 2022:arXiv:2202.03629.\u003c/li\u003e\n\u003cli\u003eHendrickx K, Perini L, Van der Plas D, Meert W, Davis J. Machine learning with a reject option: a survey. Machine Learning. 2024;113(5):3073-110.\u003c/li\u003e\n\u003cli\u003ePapadopoulos CE, Yeung H. Uncertainty estimation and Monte Carlo simulation method. Flow Measurement and Instrumentation. 2001;12(4):291-8.\u003c/li\u003e\n\u003cli\u003evan Ravenzwaaij D, Cassey P, Brown SD. A simple introduction to Markov Chain Monte\u0026ndash;Carlo sampling. Psychonomic Bulletin \u0026amp; Review. 2018;25(1):143-54.\u003c/li\u003e\n\u003cli\u003eH\u0026uuml;llermeier E, Waegeman W. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Machine Learning. 2021;110(3):457-506.\u003c/li\u003e\n\u003cli\u003eKopetzki A-K, Charpentier B, Z\u0026uuml;gner D, Giri S, G\u0026uuml;nnemann S. Evaluating Robustness of Predictive Uncertainty Estimation: Are Dirichlet-based Models Reliable? arXiv e-prints. 2020:arXiv:2010.14986.\u003c/li\u003e\n\u003cli\u003eGawlikowski J, Tassi CRN, Ali M, Lee J, Humt M, Feng J, et al. A survey of uncertainty in deep neural networks. Artificial Intelligence Review. 2023;56(1):1513-89.\u003c/li\u003e\n\u003cli\u003eTauman Kalai A, Nachum O, Vempala SS, Zhang E. Why Language Models Hallucinate2025 September 01, 2025:[arXiv:2509.04664 p.]. Available from: https://ui.adsabs.harvard.edu/abs/2025arXiv250904664T.\u003c/li\u003e\n\u003cli\u003eYang A, Li A, Yang B, Zhang B, Hui B, Zheng B, et al. Qwen3 Technical Report. arXiv e-prints. 2025:arXiv:2505.09388.\u003c/li\u003e\n\u003cli\u003eDeepSeek AI, Liu A, Feng B, Xue B, Wang B, Wu B, et al. DeepSeek-V3 Technical Report. arXiv e-prints. 2024:arXiv:2412.19437.\u003c/li\u003e\n\u003cli\u003eOpenAi, Agarwal S, Ahmad L, Ai J, Altman S, Applebaum A, et al. gpt-oss-120b \u0026amp; gpt-oss-20b Model Card. arXiv e-prints. 2025:arXiv:2508.10925.\u003c/li\u003e\n\u003cli\u003eHuang Y, Bai Y, Zhu Z, Zhang J, Zhang J, Su T, et al. C-EVAL: a multi-level multi-discipline Chinese evaluation suite for foundation models. Proceedings of the 37th International Conference on Neural Information Processing Systems; New Orleans, LA, USA: Curran Associates Inc.; 2023. p. Article 2749.\u003c/li\u003e\n\u003cli\u003eOpenAi, Hurst A, Lerer A, Goucher AP, Perelman A, Ramesh A, et al. GPT-4o System Card. arXiv e-prints. 2024:arXiv:2410.21276.\u003c/li\u003e\n\u003cli\u003eRossettini G, Bargeri S, Cook C, Guida S, Palese A, Rodeghiero L, et al. Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study. Frontiers in Digital Health. 2025;Volume 7 - 2025.\u003c/li\u003e\n\u003cli\u003eThomas JD, Bradley E. Bootstrap confidence intervals. Statistical Science. 1996;11(3):189-228.\u003c/li\u003e\n\u003cli\u003eAltman DG, Bland JM. How to obtain the P value from a confidence interval. Bmj. 2011;343:d2304.\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Chinese Dental Licensing Examination, large language models, Qwen3-Max, Qwen-Plus, Qwen3, DeepSeek-V3.1, GPT-OSS, hallucinations in large language models, dental education","lastPublishedDoi":"10.21203/rs.3.rs-7968968/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7968968/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003e\u003cstrong\u003eObjective:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study aimed to evaluate the performance of large language models (LLMs) on the Chinese Dental Licensing Examination (CDLE). It also examined whether including an ‘unknown’ option in prompts—or combining this option with a penalty for incorrect answers—could improve model accuracy and reduce hallucinations.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMethods:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe official preparation book, titled \u003cem\u003eHistorical Chinese Dental Licensing Examinations\u003c/em\u003e, authored by the Chinese National Licensed Physician Qualification Examination Proposition Research Group, was used as the data source. Three cloud-based models (Qwen3-Max, Qwen-Plus, DeepSeek-V3.1) and two locally deployed models (Qwen3-32B and GPT-OSS-120B) were evaluated on the CDLE. A custom-designed program was developed to automatically conduct the CDLE by leveraging the OpenAI API to communicate with both locally deployed and cloud-based LLMs. Model performance was evaluated at both the exam and question levels. Exam-level performance was assessed by mean accuracy (± standard deviation (SD)) and pass/fail outcomes, while question-level performance was evaluated primarily by accuracy with 95% and 99% confidence intervals (CIs).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eResults:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eA dataset comprising four CDLEs (2,400 questions in total) was constructed. Each question was a five-option, single-answer multiple-choice question. Qwen3-Max, Qwen-Plus, DeepSeek-V3.1, Qwen3-32B, and GPT-OSS-120B achieved exam-level mean accuracies ±SD of 0.866±0.089, 0.851±0.0767, 0.737±0.0738, 0.748±0.0868, 0.652±0.0799, respectively. At the question level, the accuracies with 95% CIs were 0.865 (0.852–0.878), 0.851 (0.837–0.865), 0.727 (0.709–0.745), 0.741 (0.724–0.756), and 0.651 (0.634–0.671), respectively. Prompts that included an ‘unknown’ option—or combined it with a penalty for incorrect answers—did not improve model accuracy.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConclusion:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll models successfully passed the CDLEs, with some achieving remarkably high scores. Among them, Qwen3-Max demonstrated the best overall performance across all evaluated metrics. Other uncertainty estimation methods should be considered instead of simply adding an ‘unknown’ option to the input prompt. In the future, LLMs are expected to play an important role in dental education, particularly in supporting medical students’ self-directed learning.\u003c/p\u003e","manuscriptTitle":"Evaluation of Large Language Models on the Chinese Dental Licensing Examination","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-11-28 12:11:18","doi":"10.21203/rs.3.rs-7968968/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"2e2a30f4-bb21-4bbb-8bda-3d9f7e89dd43","owner":[],"postedDate":"November 28th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":58624850,"name":"Health sciences/Diseases"},{"id":58624851,"name":"Health sciences/Health care"},{"id":58624852,"name":"Physical sciences/Mathematics and computing"},{"id":58624853,"name":"Health sciences/Medical research"}],"tags":[],"updatedAt":"2026-03-26T06:41:26+00:00","versionOfRecord":[],"versionCreatedAt":"2025-11-28 12:11:18","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7968968","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7968968","identity":"rs-7968968","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

Ask this paper AI returns verbatim quotes from the full text · source: preprint-html

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2025) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc
last seen: 2026-05-19T01:45:01.086888+00:00