A Contextual Quality Reward Model For Reliable and Efficient Best of N Sampling

doi:10.21203/rs.3.rs-7594024/v1

2025 · doi:10.21203/rs.3.rs-7594024/v1

preprint OA: closed

Research Article

Hyung Gyu Rho

This is a preprint; it has not been peer reviewed by a journal. Version 1 was posted on Research Square on September 12th, 2025, and is the latest version. This work is licensed under a CC BY 4.0 License.

Abstract

Modern preference alignment techniques, such as Best-of-N (BoN) sampling, rely on reward models trained with pairwise comparison data. While effective at learning relative preferences, this paradigm fails to capture a signal of response acceptability, leaving systems vulnerable to selecting the least bad of many unacceptable options. This is particularly problematic for hard prompts, where the risk of such false acceptances increases with the number of samples.

In this paper, we address this critical reliability gap by introducing a new data collection and modeling framework. By augmenting preference data with an outside option, inspired by discrete choice models, we train a reward model that can distinguish not just what is better, but what is good enough. We leverage this capability to create an adaptive inference strategy, best of mini-N in-loop, which partitions the generation budget into sequential loops with a calibrated early-exit condition.

Our experiments show that when tuned as an alignment guardrail, it reduces reliability failures by 70%, and when tuned as an inference accelerator, it improves average inference speed by over 22% in the IMDB sentiment setting. We thus provide a principled and flexible framework for practitioners to explicitly manage the trade-off between reliability and computational efficiency.

Subject area: Artificial Intelligence and Machine Learning

Keywords: Reward Model, Inference Time Alignment, Discrete Choice

Figures

Figure 1: In standard BoN, the False Positive Count increases with the number of samples (N), even as the mean reward improves. This highlights a critical reliability vulnerability.

Figure 2: Performance of the alignment guardrail configuration compared to the BoN-32 baseline. The Mini-16 in 2 loops setting dramatically reduces the False Positive Count with only a marginal decrease in mean reward.

Figure 3: Performance of the inference accelerator configuration. The Mini-16 in 2 loops setting provides the fastest mean execution time, outperforming the BoN-32 baseline by over 22%.

Additional Declarations

The authors declare no competing interests.
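
The abstract describes best of mini-N in-loop as partitioning the generation budget into sequential loops with a calibrated early-exit condition. The full text is not available on this page, so the following is only a minimal sketch of that idea under stated assumptions, not the author's implementation. The function name, the `generate` and `reward` callables, the `LoopResult` type, and the convention that a reward at or above `threshold` means "good enough" (for example, an outside option normalized to 0) are all hypothetical. The defaults `mini_n=16` and `max_loops=2` simply echo the "Mini-16 in 2 loops" setting named in the figure captions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LoopResult:
    response: Optional[str]  # best response seen so far
    reward: float            # reward of that response
    samples_used: int        # how many candidates were generated
    accepted: bool           # True if the early-exit condition was met


def best_of_mini_n_in_loop(
    prompt: str,
    generate: Callable[[str], str],       # hypothetical sampler: prompt -> response
    reward: Callable[[str, str], float],  # hypothetical reward model: (prompt, response) -> score
    mini_n: int = 16,
    max_loops: int = 2,
    threshold: float = 0.0,               # assumed acceptability cutoff (outside option at 0)
) -> LoopResult:
    """Draw up to max_loops batches of mini_n samples, stopping early once the
    best reward seen so far reaches the acceptability threshold."""
    best_response: Optional[str] = None
    best_reward = float("-inf")
    samples_used = 0

    for _ in range(max_loops):
        # One "mini" best-of-N pass over a fraction of the total budget.
        for _ in range(mini_n):
            candidate = generate(prompt)
            score = reward(prompt, candidate)
            samples_used += 1
            if score > best_reward:
                best_response, best_reward = candidate, score

        # Early exit: stop spending budget once something is good enough.
        if best_reward >= threshold:
            return LoopResult(best_response, best_reward, samples_used, True)

    # Budget exhausted without an acceptable response. The caller decides
    # whether to abstain or fall back to the least bad candidate.
    return LoopResult(best_response, best_reward, samples_used, False)
```

In this sketch, easy prompts tend to exit after the first loop, which is where any speedup over a fixed budget of `mini_n * max_loops` samples would come from, while the `accepted` flag lets a caller treat an exhausted budget as a refusal instead of returning the least bad option. How the threshold is calibrated is not specified here and is left to the paper.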

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00