Awareness Without Synthesis: LLM-Powered Detection and Alignment of Hidden Vocabulary Fragmentation in Scholarly Knowledge Graphs

2026 · doi:10.21203/rs.3.rs-9588669/v1

preprint OA: closed

Research Article · Research Square preprint

Ana Bossler, Enric Bas, Andrés Fullana

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-9588669/v1

This work is licensed under a CC BY 4.0 License.

Status: Posted. Version 1 posted.

Abstract

Citation-based interdisciplinarity metrics treat cross-domain citation flow as evidence of knowledge integration. We identify a condition under which this assumption fails: citation-connected subdomains can encode equivalent concepts in almost entirely non-overlapping vocabularies, an invisible fragmentation pattern we term awareness without synthesis (AWS). We present a reproducible, open-source scientometric tool that operationalizes AWS detection through a hybrid human–LLM pipeline comprising: (1) a Cross-Subdomain Coherence score (CSC-score), combining structural citation flow (S^cross) with flow-weighted vocabulary divergence (D̄_w) via Rank-Biased Overlap; (2) a 2 × 2 factorial LLM evaluation across six model families and three expert role framings providing double-invariant validation; and (3) an LLM-powered vocabulary alignment module enabling automatic SPARQL query expansion with a mean retrieval gain of 636 ± 35 papers (+39.9%; p = 1.95 × 10⁻⁸³), exceeding random substitution by 16.3 standard deviations. Applied to a plastic recycling corpus (3,138 publications), the tool reveals that 61.7% of within-corpus citations cross subdomain boundaries yet vocabularies are almost entirely non-overlapping (mean 1 − RBO = 0.976), while shared terms maintain stable conceptual referents (mean cosine distance = 0.059, p < 0.001; CSC-score = 0.402, 9.3× more fragmented than a homogeneous control). The 2 × 2 LLM panel reveals that role sensitivity is itself a scientometric signal: chemical engineer framings detect finer-grained divergence than policy framings (diff = 0.30, p < 10⁻⁴⁸), with human annotations (Krippendorff α = 0.752) corroborating the same subdomain interfaces. The tool directly addresses the LLM4SCIM challenges of design, communication, and LLM-empowered analytical tools for scientometric research.

Keywords: awareness without synthesis, vocabulary fragmentation, interdisciplinarity measurement, LLM-as-judge, query expansion, scholarly knowledge graphs

Additional Declarations: No competing interests reported.
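
A minimal sketch of the two ingredients the abstract names for the CSC-score: the share of citations that cross subdomain boundaries, and a flow-weighted mean of 1 − RBO over subdomain pairs. The abstract does not give the formulas, so everything here is an assumption: the extrapolated RBO variant with persistence p = 0.9, weighting by raw cross-citation counts, the function names, and the toy data. The sketch deliberately stops short of combining the two quantities into a CSC-score, because the combination rule is not stated in this text.

```python
from collections import Counter


def rbo_ext(ranked_a, ranked_b, p=0.9):
    """Extrapolated Rank-Biased Overlap for two duplicate-free ranked lists.

    Evaluated to the depth of the shorter list; returns 1.0 for identical
    lists and 0.0 for disjoint ones. The value of p is an assumed default.
    """
    k = min(len(ranked_a), len(ranked_b))
    if k == 0:
        return 0.0
    seen_a, seen_b = set(), set()
    weighted_sum = 0.0
    agreement = 0.0
    for d in range(1, k + 1):
        seen_a.add(ranked_a[d - 1])
        seen_b.add(ranked_b[d - 1])
        agreement = len(seen_a & seen_b) / d  # overlap fraction at depth d
        weighted_sum += agreement * p ** d
    return agreement * p ** k + (1 - p) / p * weighted_sum


def cross_subdomain_share(citations, subdomain_of):
    """Fraction of citation edges whose endpoints lie in different subdomains."""
    if not citations:
        return 0.0
    cross = sum(1 for src, dst in citations if subdomain_of[src] != subdomain_of[dst])
    return cross / len(citations)


def flow_weighted_divergence(citations, subdomain_of, top_terms, p=0.9):
    """Mean of (1 - RBO) over subdomain pairs, weighted by cross-citation counts.

    Weighting by raw counts is an assumption, not the paper's stated scheme.
    """
    flows = Counter()
    for src, dst in citations:
        a, b = subdomain_of[src], subdomain_of[dst]
        if a != b:
            flows[frozenset((a, b))] += 1
    total = sum(flows.values())
    if total == 0:
        return 0.0
    weighted = 0.0
    for pair, count in flows.items():
        a, b = tuple(pair)
        weighted += count * (1.0 - rbo_ext(top_terms[a], top_terms[b], p))
    return weighted / total


# Toy, hand-made data (hypothetical; not drawn from the paper's corpus).
subdomain_of = {"p1": "mechanical", "p2": "mechanical", "p3": "chemical", "p4": "policy"}
citations = [("p1", "p2"), ("p1", "p3"), ("p2", "p3"), ("p3", "p4")]
top_terms = {
    "mechanical": ["extrusion", "regranulate", "melt flow"],
    "chemical": ["pyrolysis", "depolymerization", "monomer"],
    "policy": ["circular economy", "extended producer responsibility", "pyrolysis"],
}

share = cross_subdomain_share(citations, subdomain_of)
divergence = flow_weighted_divergence(citations, subdomain_of, top_terms)
print(f"cross-subdomain citation share: {share:.3f}")
print(f"flow-weighted 1 - RBO: {divergence:.3f}")
```

In this toy graph three of the four citations cross a subdomain boundary, and the flow-weighted divergence comes out at roughly 0.91. That is the AWS pattern in miniature: citation links are dense while the ranked vocabularies barely overlap.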

Affiliation (all authors): University of Alicante. Corresponding author: Ana Bossler. Posted: May 13th, 2026.
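
The query-expansion component described in the abstract can be illustrated with a small sketch under stated assumptions. The alignment table below is a hypothetical, hand-written stand-in for what the paper's LLM alignment module would produce, and the graph schema (paper titles exposed through `dcterms:title`) is likewise assumed rather than taken from the paper. The code only builds SPARQL 1.1 query text; it does not contact an endpoint.

```python
def escape_literal(term):
    """Escape backslashes and double quotes for use inside a SPARQL string literal."""
    return term.replace("\\", "\\\\").replace('"', '\\"')


def expand_terms(seed, alignment):
    """Return the seed term followed by its aligned equivalents, without duplicates."""
    expanded = [seed]
    for term in alignment.get(seed, []):
        if term not in expanded:
            expanded.append(term)
    return expanded


def build_query(terms):
    """Build a SELECT query matching papers whose title contains any of the terms.

    The dcterms:title predicate is an assumed schema choice for illustration.
    """
    conditions = " ||\n    ".join(
        f'CONTAINS(LCASE(STR(?title)), "{escape_literal(t.lower())}")' for t in terms
    )
    return (
        "PREFIX dcterms: <http://purl.org/dc/terms/>\n"
        "SELECT DISTINCT ?paper WHERE {\n"
        "  ?paper dcterms:title ?title .\n"
        f"  FILTER(\n    {conditions}\n  )\n"
        "}"
    )


# Hypothetical alignment: one subdomain's term mapped to other subdomains' wording.
ALIGNMENT = {
    "chemical recycling": ["depolymerization", "feedstock recycling"],
}

baseline_query = build_query(["chemical recycling"])
expanded_query = build_query(expand_terms("chemical recycling", ALIGNMENT))
print(expanded_query)
```

Running the baseline and the expanded query against the same endpoint and comparing the result counts would give a retrieval gain in the sense the abstract reports. The figures quoted there come from the authors' own pipeline, not from this sketch.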

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00