Toward Zero-Human-Intervention Autonomous Robot Learning: A Continuous Result-Driven Self-Reward and Correction Framework

Research Article

Hong Su (corresponding author)
Chengdu University of Information Technology

This is a preprint; it has not been peer reviewed by a journal.
Posted on Research Square (ISSN 2693-5015, online). Status: Posted. Version 1, posted April 14th, 2026.
https://doi.org/10.21203/rs.3.rs-9398282/v1
This work is licensed under a CC BY 4.0 License.

Abstract

Autonomous robots operating in complex real-world environments require the ability to continuously improve their behavior without human-provided reward annotation or online intervention. However, robot actions often produce delayed and multi-factor consequences, making it difficult to correctly associate later outcomes with earlier actions and to perform reliable autonomous self-reward and self-correction. In this paper, we propose a continuous result-driven self-reward and correction framework for autonomous robots. The framework enables robots to collect full-process behavioral, environmental, and internal reasoning records, perform delayed outcome discovery and temporal backtracking, generate unified internal-external self-reward signals, and revise earlier reward judgments when later evidence reveals them to be incomplete or incorrect. It also supports self-intervention for improving causal verification and policy reliability. Experimental results show that, compared with conventional baselines, the proposed framework improves delayed-outcome attribution accuracy from 15.93% to 38.36%, increases the safety score from 66.1 to 98.5, and reduces the persistent wrong-reward rate of 48.77%.

Keywords: Autonomous robots, Self-reward learning, Delayed outcome attribution, Zero Human Intervention

Additional Declarations

The authors declare no competing interests.
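The abstract names temporal backtracking and the revision of earlier reward judgments but gives no algorithmic detail. The following is only a minimal sketch of what such a loop could look like, under stated assumptions: the names `Step`, `backtrack_and_revise`, `window`, and `decay` are hypothetical, and the exponential recency weighting is an illustrative stand-in, not the paper's method.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One logged action (hypothetical record layout, not from the paper)."""
    t: float                 # time at which the action was taken
    action: str              # action label
    reward: float = 0.0      # current self-reward estimate
    history: list = field(default_factory=list)  # superseded estimates


def backtrack_and_revise(log, outcome_t, outcome_value, window=10.0, decay=0.5):
    """Distribute a delayed outcome over earlier steps and revise their rewards.

    Assumption: credit falls off exponentially with the age of the action,
    and only actions within `window` time units before the outcome qualify.
    """
    candidates = [s for s in log if 0 <= outcome_t - s.t <= window]
    if not candidates:
        return []
    # More recent actions receive a larger share of the outcome.
    weights = [decay ** (outcome_t - s.t) for s in candidates]
    total = sum(weights)
    for step, w in zip(candidates, weights):
        step.history.append(step.reward)          # keep the earlier judgment
        step.reward += outcome_value * w / total  # revise with new evidence
    return candidates


# Three actions initially judged successful (reward 1.0 each).
log = [
    Step(0.0, "grasp", reward=1.0),
    Step(1.0, "lift", reward=1.0),
    Step(2.0, "place", reward=1.0),
]
# A negative outcome is discovered later, at t = 3.0.
revised = backtrack_and_revise(log, outcome_t=3.0, outcome_value=-3.0,
                               window=2.5, decay=0.5)
for s in log:
    print(s.action, s.reward, s.history)
```

In this toy run, `grasp` falls outside the window and keeps its reward, while `lift` and `place` have their earlier positive judgments lowered and the superseded values stored in `history`. A real system would need a causal check rather than a purely temporal weighting, which is the gap the abstract's self-intervention component appears to address.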

