SAMP: Source-Aware Multi-Prototype Learning for Machine-Generated Text Detection

preprint OA: closed CC-BY-4.0
Research Article
Yan Xu, Wenzhong Yang, Yabo Yin, Hongzhen Lv, Zhenhua Wang, Jingfeng He, Xiangyi Jia, Xianfeng Wang

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-9598516/v1
This work is licensed under a CC BY 4.0 License
Status: Under Review
Version 1 posted May 15th, 2026

Abstract

Machine-generated text detection is commonly formulated as a binary classification problem that separates human-written texts from machine-generated ones. In heterogeneous settings involving diverse generators, domains, languages, and adversarial perturbations, binary classes are often not structurally simple or unimodal. Human-written texts typically exhibit broad dispersion, while machine-generated texts frequently contain both dominant modes and long-tailed substructures. However, standard binary detectors mainly learn a coarse human-machine decision boundary, providing limited constraints on the internal geometry of the learned representation space and leaving intra-class structures insufficiently organized. Moreover, by collapsing all machine-generated texts into a single homogeneous category, such detectors can obscure generator-dependent regularities and weaken generalization under distribution shifts.

To address these issues, we propose SAMP, a Source-Aware Multi-Prototype learning framework for robust machine-generated text detection. SAMP represents each binary class with multiple class-conditional prototypes to capture intra-class heterogeneity, thereby explicitly accommodating complex internal data distributions. It further leverages source-model identity strictly as auxiliary training supervision to preserve generator-dependent relations within the machine-generated region, preventing fine-grained source regularities from being overshadowed by standard binary supervision. Experiments on MAGE, M4-multilingual, and RAID show that SAMP consistently outperforms strong zero-shot and training-based baselines. Notably, on the attacked RAID setting, SAMP achieves 96.35% AUROC, 99.87% AUPR, and 16.86% FPR95, demonstrating improved reliability under challenging perturbation conditions.

Keywords: Machine-Generated Text Detection, Source-Aware Learning, Multi-Prototype Learning, Robust Detection, Representation Learning

Additional Declarations: No competing interests reported.
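The abstract names two mechanisms without giving formulas: several prototypes per binary class, and source-model identity used only as auxiliary supervision during training. The following is a minimal sketch of how such a setup could look, not the authors' implementation. The cosine-similarity scoring, the log-sum-exp pooling over prototypes, the linear `source_head`, and the `temperature` and `aux_weight` values are all assumptions introduced here for illustration.

```python
import numpy as np


def log_sum_exp(x, axis=-1):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis)


def prototype_logits(z, prototypes, temperature=0.1):
    """Score one embedding against K prototypes per class (assumed scoring rule).

    z: (d,) text embedding; prototypes: (C, K, d).
    Returns (C,) logits, a soft maximum over each class's prototypes.
    """
    z = z / np.linalg.norm(z)
    p = prototypes / np.linalg.norm(prototypes, axis=-1, keepdims=True)
    sims = (p @ z) / temperature  # scaled cosine similarities, shape (C, K)
    return log_sum_exp(sims, axis=1)


def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example."""
    return float(log_sum_exp(logits) - logits[label])


def training_loss(z, prototypes, binary_label, source_head, source_label,
                  aux_weight=0.5):
    """Binary prototype loss plus an auxiliary source-identity term (assumed form)."""
    loss = cross_entropy(prototype_logits(z, prototypes), binary_label)
    if source_label is not None:  # generator ids exist only for machine text
        loss += aux_weight * cross_entropy(source_head @ z, source_label)
    return loss


# Toy setup: 2 classes (0 = human, 1 = machine), 2 prototypes each, d = 2.
prototypes = np.array([[[1.0, 0.0], [0.0, 1.0]],
                       [[-1.0, 0.0], [0.0, -1.0]]])
source_head = np.eye(2)  # hypothetical linear head over 2 generators

z_human = np.array([0.9, 0.1])
z_machine = np.array([-1.0, 0.2])

# Inference uses only the prototypes; the source head is a training-time aid.
pred_human = int(np.argmax(prototype_logits(z_human, prototypes)))
pred_machine = int(np.argmax(prototype_logits(z_machine, prototypes)))

loss_binary_only = training_loss(z_machine, prototypes, 1, source_head, None)
loss_with_source = training_loss(z_machine, prototypes, 1, source_head, 0)
```

Pooling with log-sum-exp lets an input match any single prototype of a class strongly without having to be close to all of them, which is one way to represent a multi-modal class. Because the source term enters only through `training_loss`, prediction stays a binary decision, in line with the abstract's description of source identity as strictly auxiliary training supervision.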
Journal: Journal of King Saud University Computer and Information Sciences (https://link.springer.com/journal/44443)
Affiliation: Xinjiang University (all authors); corresponding author: Wenzhong Yang

Editorial timeline
Reviewers agreed at journal: 07 May, 2026
Reviewers invited by journal: 07 May, 2026
Editor assigned by journal: 05 May, 2026
Submission checks completed at journal: 05 May, 2026
First submitted to journal: 03 May, 2026

Source provenance

europepmc
last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall
last seen: 2026-05-23T02:00:01.238055+00:00
License: CC-BY-4.0