Large-scale Clustering via Fast Splitting of a Sparse Representative Tree Based on Local Density

Renmin Wang, Jie Li

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6746982/v1
This work is licensed under a CC BY 4.0 License.
Status: Published. Journal publication published 11 Aug, 2025 in Scientific Reports.
Version 1. You are reading this latest preprint version.

Abstract

Large-scale clustering remains an active yet challenging task in data mining and machine learning, where existing algorithms often struggle to balance efficiency, accuracy, and adaptability. This paper proposes a novel large-scale clustering framework with three key innovations:

(1) Parameter-free cluster discovery: unlike conventional methods requiring predefined cluster numbers, our algorithm autonomously identifies natural cluster structures through dynamic density-based splitting decisions.
(2) Hybrid sampling-partitioning strategy: by integrating randomized sampling with K-means-based partitioning, we extract high-quality representative points that preserve data integrity with linear computational complexity.
(3) Local density-driven MST segmentation: a minimum spanning tree (MST) constructed from representatives is adaptively partitioned using a local density criterion, which dynamically disconnects weakly associated edges by comparing density peaks between adjacent representative points.

Extensive experiments on synthetic and real-world data sets (up to 20 million samples) demonstrate the algorithm's superiority: it achieves higher clustering accuracy than state-of-the-art methods while reducing runtime. Notably, the framework exhibits remarkable robustness to sampling ratios and eliminates dependency on user-specified parameters, making it ideal for real-world applications with complex, arbitrary-shaped data distributions.

Subject areas: Physical sciences/Mathematics and computing/Computer science; Physical sciences/Mathematics and computing/Scientific data
Keywords: K-means, large-scale clustering, local density, minimum spanning tree, representative points
Additional Declarations: No competing interests reported.
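The three-stage pipeline outlined in the abstract (sample and partition to obtain representatives, build an MST over them, split the tree by local density) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`representatives`, `mst_edges`, `density`, `cluster`), the farthest-first initialisation, and the parameters `k`, `sample_ratio`, `iters`, `radius`, and `alpha` are all assumptions introduced here. In particular, the edge test (cut an edge when the point count around its midpoint falls below `alpha` times the smaller endpoint count) is a simple stand-in for the paper's local density criterion, which the abstract does not specify, and the explicit parameters mean this sketch is not parameter-free.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centers):
    """Index of the center closest to p."""
    return min(range(len(centers)), key=lambda i: dist2(p, centers[i]))

def representatives(points, k, sample_ratio, iters, rng):
    """Stage 1: random sample, then a few Lloyd (K-means) iterations on it."""
    m = min(len(points), max(k, int(len(points) * sample_ratio)))
    sample = rng.sample(points, m)
    # Farthest-first initialisation: an assumption of this sketch.
    centers = [sample[0]]
    while len(centers) < k:
        centers.append(max(sample, key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in sample:
            groups[nearest(p, centers)].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [tuple(sum(x) / len(g) for x in zip(*g)) if g else c
                   for g, c in zip(groups, centers)]
    return centers

def mst_edges(reps):
    """Stage 2: Prim's algorithm on the complete graph over representatives."""
    best = {j: (dist2(reps[0], reps[j]), 0) for j in range(1, len(reps))}
    edges = []
    while best:
        j = min(best, key=lambda t: best[t][0])
        _, i = best.pop(j)
        edges.append((i, j))
        for t in best:
            d = dist2(reps[j], reps[t])
            if d < best[t][0]:
                best[t] = (d, j)
    return edges

def density(x, points, radius):
    """Hypothetical local density: number of points within `radius` of x."""
    return sum(1 for p in points if dist2(x, p) <= radius ** 2)

def cluster(points, k=8, sample_ratio=0.6, iters=10, radius=2.0, alpha=0.5, seed=0):
    """Stage 3: cut MST edges that cross a density valley, then label points."""
    rng = random.Random(seed)
    reps = representatives(points, k, sample_ratio, iters, rng)
    parent = list(range(len(reps)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in mst_edges(reps):
        mid = tuple((a + b) / 2 for a, b in zip(reps[i], reps[j]))
        valley = density(mid, points, radius)
        peak = min(density(reps[i], points, radius), density(reps[j], points, radius))
        if valley >= alpha * peak:  # edge stays: merge the two components
            parent[find(i)] = find(j)
    # Each point inherits the component of its nearest representative.
    return [find(nearest(p, reps)) for p in points]

if __name__ == "__main__":
    grid = [(0.1 * i, 0.1 * j) for i in range(10) for j in range(10)]
    points = grid + [(x + 10.0, y + 10.0) for x, y in grid]
    labels = cluster(points)
    print(len(set(labels)))  # 2 for these two well-separated grids
```

The density counts here scan every point for every edge, which is fine for a toy example but not for the data sizes the abstract targets; a scalable variant would derive densities from per-representative statistics instead.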
Editorial timeline at journal:
Editorial decision: Revision requested, 30 Jun, 2025
Reviews received at journal: 20 Jun, 2025
Reviewers agreed at journal: 12 Jun, 2025
Reviewers invited by journal: 12 Jun, 2025
Editor assigned by journal: 11 Jun, 2025
Editor invited by journal: 04 Jun, 2025
Submission checks completed at journal: 04 Jun, 2025
First submitted to journal: 26 May, 2025
Author affiliations: Renmin Wang (corresponding author), Guizhou University of Traditional Chinese Medicine; Jie Li, Shanxi University of Finance and Economics
Version 1 posted: June 16th, 2025
Published version: Scientific Reports, August 11th, 2025, https://doi.org/10.1038/s41598-025-13848-w
License: CC BY 4.0, https://creativecommons.org/licenses/by/4.0/