Evaluation and Improvement of Test Selection for Large Language Models

Journal of Software: Evolution and Process
This is a preprint and has not been peer reviewed. Data may be preliminary.
19 June 2025, V1 (latest version)

Authors: Lili Quan, Jin Wen, Qiang Hu, Maxime Cordy, Yuheng Huang, Lei Ma, and Xiaohong Li
https://doi.org/10.22541/au.175033647.70907223/v1
Published: Journal of Software: Evolution and Process (version of record)
332 views, 249 downloads

Abstract

Large language models (LLMs) have recently achieved significant success across various application domains, garnering substantial attention from different communities. Unfortunately, many faults remain, that is, inputs on which LLMs fail to predict correctly. Such faults harm the usability of LLMs in general and could introduce safety issues in reliability-critical systems such as autonomous driving systems.
Quickly revealing these faults in the real-world datasets that LLMs may face is important but challenging. The main reason is that ground-truth labels are required, and labeling data is costly in both time and human effort. To address this problem, the conventional deep learning testing field has proposed test selection methods, which evaluate deep learning models efficiently by prioritizing inputs that are likely to reveal faults. However, despite their importance, the usefulness of these methods for LLMs remains unclear and underexplored. In this paper, we conduct the first empirical study of the effectiveness of existing test selection methods for LLMs. Experimental results on four tasks (covering both code tasks and natural language processing tasks) and four LLMs (e.g., LLaMA3 and GPT4) demonstrate that simple methods such as Margin perform well on LLMs, but there is still considerable room for improvement. Based on the study, we further propose MuCS, a prompt Mutation-based prediction Confidence Smoothing framework, to boost test selection capability. Concretely, multiple prompt mutation techniques are proposed to help collect diverse outputs for confidence smoothing. The results show that our proposed framework significantly enhances existing methods, with a test relative coverage improvement of up to 70.53%.

Supplementary Material
File: main document - latex pdf.pdf (1.08 MB)

Information
Version history: V1 (Version 1), 19 June 2025
Published: Journal of Software: Evolution and Process, Version of Record, 8 Oct 2025
Copyright: This work is licensed under a Non Exclusive No Reuse License.
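
As a rough illustration of the ideas named in the abstract, the sketch below ranks unlabeled inputs by a Margin score (the gap between the two highest class probabilities) after smoothing the model's confidence over several prompt variants. The abstract does not say how MuCS combines the outputs of mutated prompts, so the plain averaging used here is an assumption, and the function names (`margin_score`, `smooth_confidence`, `select_tests`) and the toy probabilities are hypothetical. It should be read as a minimal sketch under those assumptions, not as the authors' implementation.

```python
from typing import List, Sequence


def margin_score(probs: Sequence[float]) -> float:
    """Gap between the two largest class probabilities.

    A small gap means the model is torn between two answers,
    so the input is a good candidate for labeling.
    """
    top1, top2 = sorted(probs, reverse=True)[:2]
    return top1 - top2


def smooth_confidence(prob_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average the probability vectors from the original prompt and its mutants.

    ASSUMPTION: simple averaging; the abstract does not specify the
    smoothing rule used by MuCS.
    """
    n = len(prob_vectors)
    return [sum(column) / n for column in zip(*prob_vectors)]


def select_tests(
    per_input_probs: Sequence[Sequence[Sequence[float]]], budget: int
) -> List[int]:
    """Return the indices of the `budget` inputs with the smallest smoothed margin."""
    scores = [margin_score(smooth_confidence(vs)) for vs in per_input_probs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return ranked[:budget]


# Hypothetical toy data: for each input, one probability vector per prompt variant.
toy = [
    [[0.90, 0.05, 0.05], [0.85, 0.10, 0.05]],  # confident under both prompts
    [[0.50, 0.45, 0.05], [0.40, 0.55, 0.05]],  # top class flips between prompts
    [[0.70, 0.20, 0.10], [0.60, 0.30, 0.10]],  # moderately confident
]
print(select_tests(toy, budget=2))  # prints [1, 2]
```

With a labeling budget of two, the input whose top class flips between prompt variants is selected first, followed by the moderately confident one. Only the selected inputs would then be sent for manual labeling.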
Collection: Journal of Software: Evolution and Process
Keywords: deep learning testing, LLMs, test selection

Authors and Affiliations
Lili Quan, Tianjin University
Jin Wen, Universite du Luxembourg - Campus Walferdange
Qiang Hu, Tianjin University
Maxime Cordy, Universite du Luxembourg - Campus Walferdange
Yuheng Huang, University of Tokyo
Lei Ma, University of Tokyo
Xiaohong Li, Tianjin University

Citation
Lili Quan, Jin Wen, Qiang Hu, et al. Evaluation and Improvement of Test Selection for Large Language Models. Authorea. 19 June 2025. DOI: https://doi.org/10.22541/au.175033647.70907223/v1