CLGRPO: Reasoning Ability Enhancement for Small VLMs

Fanyi Wang, Bingzhi Dong, Weijie Zou, Haotian Hu, Jinjin Xu, Chongyang Wang, Zhiwang Zhang

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-7504885/v1
This work is licensed under a CC BY 4.0 License
Status: Under Revision
Version 1

Abstract

Small Vision Language Models (SVLMs) generally refer to models with parameter sizes less than or equal to 2B. Their low cost and low power consumption give them high commercial value. However, their reasoning abilities are limited by the number of parameters. To address this issue, this paper proposes a post-training optimization paradigm called the Incremental Training Strategy to enhance the reasoning ability of SVLMs.
First, we constructed a Self-Supervised Chain-of-Thought (COT) Data Construction System, which uses multiple LVLMs with 7B or more parameters to transform the original data into COT data in a self-supervised manner. The proposed Incremental Training Strategy consists of four stages. Stage 1 injects domain knowledge by performing Supervised Fine-Tuning (SFT) of the pretrained model on the COT data. Stage 2 aligns the model with the COT data format through a small amount of Group Relative Policy Optimization (GRPO) training on the COT data, constrained only by format rewards. Stage 3 enhances reasoning ability by applying GRPO training on the COT data with both format and accuracy rewards; the resulting model shows significant improvement over the baseline. Stage 4 addresses the limited capacity of SVLMs and their weak ability to capture complex patterns by proposing ClipLow GRPO (CLGRPO), which constrains the capture space of the training process. We conducted extensive comparative and ablation experiments on the abstract semantic recognition dataset EMOSet-118K. The results demonstrate that our method significantly improves the reasoning ability of a 1B SVLM: compared with the baseline model fine-tuned on the original data, accuracy increased by 2.77 and recall by 0.69, achieving performance comparable to that of 8B models.

Subject areas: Physical sciences/Engineering; Physical sciences/Mathematics and computing
Keywords: Small Vision Language Models, Reasoning, post-training optimization
Additional Declarations: No competing interests reported.
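Stages 2 and 3 in the abstract describe GRPO training driven first by a format reward alone and then by format plus accuracy rewards. The following is only a minimal sketch of that idea, not the authors' implementation. The `<think>`/`<answer>` template, the 0/1 reward values, the example label strings, and all function names are assumptions made for illustration, since the abstract does not specify them. The group-relative advantage follows the usual GRPO normalization (reward minus group mean, divided by group standard deviation); the use of the population standard deviation and the `eps` term are also assumptions. The ClipLow modification of Stage 4 is not sketched because the abstract gives no details of it.

```python
import re
from statistics import mean, pstdev

# Hypothetical output template: the abstract does not state the real COT format.
FORMAT_PATTERN = re.compile(
    r"^<think>.+?</think>\s*<answer>.+?</answer>$", re.DOTALL
)
ANSWER_PATTERN = re.compile(r"<answer>(.+?)</answer>", re.DOTALL)


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the assumed template, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, label: str) -> float:
    """Return 1.0 if the extracted answer equals the label (case-insensitive)."""
    found = ANSWER_PATTERN.search(completion)
    if found is None:
        return 0.0
    return 1.0 if found.group(1).strip().lower() == label.strip().lower() else 0.0


def score_group(completions, label, use_accuracy):
    """Score one group of sampled completions for a single prompt.

    use_accuracy=False mimics a format-only stage (as in Stage 2);
    use_accuracy=True adds the accuracy term (as in Stage 3).
    """
    rewards = []
    for completion in completions:
        reward = format_reward(completion)
        if use_accuracy:
            reward += accuracy_reward(completion, label)
        rewards.append(reward)
    return rewards


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within the group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative group of three sampled completions for one image prompt.
group = [
    "<think>Bright colors and smiling faces.</think><answer>amusement</answer>",
    "<think>Dark, empty street.</think><answer>sadness</answer>",
    "amusement",  # right label, but it ignores the assumed template
]
stage2_rewards = score_group(group, "amusement", use_accuracy=False)
stage3_rewards = score_group(group, "amusement", use_accuracy=True)
stage3_advantages = group_relative_advantages(stage3_rewards)
```

In this sketch the third completion earns no reward in either stage because it never produces the assumed template, which illustrates why a format-only stage before accuracy training can be useful: accuracy can only be scored once the answer is extractable.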
Editorial decision: Revision requested (24 Oct, 2025)
Reviews received at journal (23 Oct, 2025)
Reviews received at journal (20 Oct, 2025)
Reviews received at journal (17 Oct, 2025)
Reviewers agreed at journal (16 Oct, 2025)
Reviewers agreed at journal (13 Oct, 2025)
Reviewers agreed at journal (12 Oct, 2025)
Reviewers agreed at journal (12 Oct, 2025)
Reviewers agreed at journal (10 Oct, 2025)
Reviewers invited by journal (10 Oct, 2025)
Editor assigned by journal (10 Oct, 2025)
Editor invited by journal (23 Sep, 2025)
Submission checks completed at journal (19 Sep, 2025)
First submitted to journal (19 Sep, 2025)
Author affiliations:
Fanyi Wang (corresponding author), Honor AI Center
Bingzhi Dong, Honor AI Center
Weijie Zou, NingboTech University
Haotian Hu, Zhejiang University
Jinjin Xu, ByteDance (China)
Chongyang Wang, Zhejiang University
Zhiwang Zhang, NingboTech University

Journal: Scientific Reports
Posted: October 24th, 2025