Patch-Wise Self-Supervised Visual Representation Learning: A Fine-Grained Approach | Research Square
Research Article
Ali Javidani, Mohammad Amin Sadeghi, Babak Nadjar Araabi
This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4662935/v1
This work is licensed under a CC BY 4.0 License
Status: Published. Journal Publication published 03 Apr, 2025. Read the published version in Signal, Image and Video Processing. Version 1 posted 13
You are reading this latest preprint version
Abstract
Self-supervised visual representation learning traditionally focuses on image-level instance discrimination. Our study introduces an innovative, fine-grained dimension by integrating patch-level discrimination into these methodologies. This integration allows for the simultaneous analysis of local and global visual features, thereby enriching the quality of the learned representations. Initially, the original images undergo spatial augmentation.
Subsequently, we employ a distinctive photometric patch-level augmentation, in which each patch is augmented individually, independently of the other patches within the same view. This approach generates a diverse training dataset with distinct color variations in each segment. The augmented images are then processed through a self-distillation learning framework that uses the Vision Transformer (ViT) as its backbone. The proposed method minimizes the representation distances at both the image and patch levels to capture details from macro to micro perspectives. To this end, we present a simple yet effective patch-matching algorithm that finds the corresponding patches across the augmented views. Thanks to the efficient structure of the patch-matching algorithm, our method reduces computational complexity compared to similar approaches. Consequently, the model gains a finer-grained understanding of images without significant additional computational requirements. We have extensively pretrained our method on datasets of varied scales, such as CIFAR-10, ImageNet-100, and ImageNet-1K. It demonstrates superior performance over state-of-the-art self-supervised representation learning methods in image classification and in downstream tasks such as copy detection and image retrieval.
Keywords: Self-Supervised Learning, Patch-Wise Representation Learning, Self-Distillation, Patch-level Augmentation, Patch-Matching
Additional Declarations: No competing interests reported.
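The patch-level photometric augmentation described in the abstract can be illustrated with a minimal sketch. The abstract does not specify which photometric operations or parameter ranges the authors use, so the brightness and contrast jitter below, the function name `patchwise_photometric_augment`, and its default ranges are all assumptions chosen for illustration; this is not the authors' implementation. The sketch only shows the core idea that every patch receives its own independently sampled augmentation.

```python
import numpy as np


def patchwise_photometric_augment(image, patch_size, rng, brightness=0.4, contrast=0.4):
    """Jitter brightness and contrast independently in each non-overlapping patch.

    Hypothetical helper for illustration only.
    image: float array of shape (H, W, C) with values in [0, 1];
    H and W must be multiples of patch_size.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    gh, gw = h // patch_size, w // patch_size

    # View the image as a (gh, gw) grid of patches: shape (gh, p, gw, p, C).
    patches = image.reshape(gh, patch_size, gw, patch_size, c)

    # Sample one contrast gain and one brightness offset per patch (assumed ranges).
    gain = rng.uniform(1 - contrast, 1 + contrast, size=(gh, 1, gw, 1, 1))
    shift = rng.uniform(-brightness, brightness, size=(gh, 1, gw, 1, 1))

    # Apply contrast around each patch's own mean intensity, then add the offset.
    mean = patches.mean(axis=(1, 3, 4), keepdims=True)
    out = (patches - mean) * gain + mean + shift
    return np.clip(out, 0.0, 1.0).reshape(h, w, c)


rng = np.random.default_rng(0)
view = rng.uniform(size=(32, 32, 3))  # stand-in for a spatially augmented view
augmented = patchwise_photometric_augment(view, patch_size=16, rng=rng)
```

In a ViT-based pipeline the patch size would presumably match the tokenizer's patch size so that each token sees one consistently augmented region, but that alignment is also an assumption here rather than something the abstract states.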
Editorial decision: Revision requested 12 Aug, 2024
Reviews received at journal 05 Aug, 2024
Reviews received at journal 30 Jul, 2024
Reviews received at journal 22 Jul, 2024
Reviews received at journal 21 Jul, 2024
Reviewers agreed at journal 14 Jul, 2024
Reviewers agreed at journal 10 Jul, 2024
Reviewers agreed at journal 10 Jul, 2024
Reviewers agreed at journal 09 Jul, 2024
Reviewers invited by journal 09 Jul, 2024
Editor assigned by journal 30 Jun, 2024
Submission checks completed at journal 30 Jun, 2024
First submitted to journal 30 Jun, 2024
Author affiliations: Ali Javidani (University of Tehran, corresponding author); Mohammad Amin Sadeghi (Hamad bin Khalifa University); Babak Nadjar Araabi (University of Tehran).
Version of record: https://doi.org/10.1007/s11760-025-04020-y (Signal, Image and Video Processing, published April 3rd, 2025).
Preprint version 1 posted July 23rd, 2024.