3D Feature Distillation with Object-Centric Priors
Research Article | Research Square
Georgios Tziafas, Yucheng Xu, Zhibin Li, Hamidreza Kasaei
This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-7694742/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract
Grounding natural language to the physical world is a ubiquitous topic with a wide range of applications in computer vision and robotics. Recently, 2D vision-language models such as CLIP have been widely popularized, due to their impressive capabilities for open-vocabulary grounding in 2D images. Subsequent works aim to elevate 2D CLIP features to 3D via feature distillation, but either learn neural fields that are scene-specific and hence lack generalization, or focus on indoor room scan data that require access to multiple camera views, which is not practical in robot manipulation scenarios.
Additionally, related methods typically fuse features at pixel level and assume that all camera views are equally informative. In this work, we show that this approach leads to sub-optimal 3D features, in terms of both grounding accuracy and segmentation crispness. To alleviate this, we propose a multi-view feature fusion strategy that employs object-centric priors to eliminate uninformative views based on semantic information, and fuses features at object level via instance segmentation masks. To distill our object-centric 3D features, we generate a large-scale synthetic multi-view dataset of cluttered tabletop scenes, spawning 15k scenes from over 3300 unique object instances, which we make publicly available. We show that our method reconstructs 3D CLIP features with improved grounding capacity and spatial consistency, while doing so from single-view RGB-D, thus departing from the assumption of multiple camera views at test time. Finally, we show that our approach can generalize to novel tabletop domains and be re-purposed for 3D instance segmentation without fine-tuning, and demonstrate its utility for language-guided robotic grasping in clutter. Released assets and supplementary material are made available at the website https://gtziafas.github.io/DROP_project/.

Keywords: 3D Feature Distillation, Open-Vocabulary 3D Segmentation, Language-guided Robot Grasping

Additional Declarations: No competing interests reported.
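The contrast the abstract draws between pixel-level fusion over all views and object-level fusion over informative views can be illustrated with a small sketch. This is not the authors' implementation: the function name `object_level_fusion`, the per-view, per-object `scores` array, and the `min_score` threshold are hypothetical stand-ins for whatever semantic informativeness measure the paper uses. The sketch only assumes that per-pixel 2D features and instance masks are available for every view.

```python
import numpy as np


def object_level_fusion(feats, masks, scores, min_score=0.5):
    """Fuse per-pixel 2D features into one feature vector per object.

    feats:  (V, H, W, D) per-pixel features for V camera views.
    masks:  (V, H, W) integer instance ids, 0 = background, k + 1 = object k.
    scores: (V, K) hypothetical informativeness of each view for each object.
    Views scoring below `min_score` for an object are dropped (assumed rule).
    Returns a (K, D) array; rows stay zero for objects with no usable view.
    """
    num_views, _, _, dim = feats.shape
    num_objects = scores.shape[1]
    fused = np.zeros((num_objects, dim))
    for k in range(num_objects):
        acc = np.zeros(dim)
        total = 0.0
        for v in range(num_views):
            pix = masks[v] == k + 1  # pixels of object k in view v
            if not pix.any() or scores[v, k] < min_score:
                continue  # object not visible, or view judged uninformative
            # Average over the object's mask first, then weight by view score.
            acc += scores[v, k] * feats[v][pix].mean(axis=0)
            total += scores[v, k]
        if total > 0:
            fused[k] = acc / total
    return fused


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(3, 4, 4, 8))      # 3 views, 4x4 pixels, 8-dim features
    masks = rng.integers(0, 3, size=(3, 4, 4))  # background plus 2 objects
    scores = rng.uniform(size=(3, 2))
    print(object_level_fusion(feats, masks, scores).shape)
```

Averaging inside each instance mask before combining views means every point of an object would share one feature, which is one plausible reading of why object-level fusion yields crisper segmentation than per-pixel averaging.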
Author affiliations: Georgios Tziafas (University of Groningen); Yucheng Xu (University of Edinburgh); Zhibin Li (University College London); Hamidreza Kasaei (University of Groningen, corresponding author).
Posted: October 17th, 2025.