DiagramBank: A Large-scale Dataset of Diagram Design Exemplars with Paper Metadata for Retrieval-Augmented Generation

Research Article, preprint on Research Square (ISSN 2693-5015, online)

Authors: Tingwen Zhang, Ling Yue, Zhen Xu, Shaowu Pan (corresponding author; ORCID https://orcid.org/0000-0002-2462-362X)
Affiliation: Rensselaer Polytechnic Institute
Status: Version 1, posted February 20, 2026. This is a preprint; it has not been peer reviewed by a journal.
DOI: https://doi.org/10.21203/rs.3.rs-8917857/v1
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
Subject area: Artificial Intelligence and Machine Learning
Keywords: Scientific diagrams, Dataset curation, Diagram retrieval, Retrieval-augmented generation, Multimodal generation
Funding: U.S. Department of Energy, award DE-SC0025425
Declarations: The authors declare no competing interests.

Abstract

Recent advances in autonomous “AI scientist” systems have demonstrated the ability to automatically write scientific manuscripts and codes with execution. However, producing a publication-grade scientific diagram (e.g., teaser figure) is still a major bottleneck in the “end-to-end” paper generation process. For example, a teaser figure acts as a strategic visual interface and serves a different purpose than derivative data plots. It demands conceptual synthesis and planning to translate complex logic workflow into a compelling graphic that guides intuition and sparks curiosity. Existing AI scientist systems usually omit this component or fall back to an inferior alternative.

To bridge this gap, we present DiagramBank, a large-scale dataset consisting of 89,422 schematic diagrams curated from existing top-tier scientific publications, designed for multimodal retrieval and exemplar-driven scientific figure generation. DiagramBank is developed through our automated curation pipeline that extracts figures and corresponding in-text references, and uses a CLIP-based filter to differentiate schematic diagrams from standard plots or natural images. Each instance is paired with rich context from abstract, caption, to figure-reference pairs, enabling information retrieval under different query granularities. We release DiagramBank in a ready-to-index format and provide a retrieval-augmented generation codebase to demonstrate exemplar-conditioned synthesis of teaser figures. DiagramBank is publicly available at https://huggingface.co/datasets/zhangt20/DiagramBank with code at https://github.com/csml-rpi/DiagramBank.
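The CLIP-based filter mentioned in the abstract can be pictured as zero-shot classification: an image embedding is compared with text-prompt embeddings for each category, and the image is kept when the "schematic diagram" category receives enough probability. The sketch below is a minimal illustration under stated assumptions, not the authors' pipeline. The embeddings are toy 3-dimensional vectors standing in for real CLIP outputs, and the label names, the temperature of 100, and the 0.5 threshold are hypothetical choices.

```python
import math

# Hypothetical category names; the abstract does not give the actual prompts.
LABELS = ["schematic diagram", "data plot", "natural image"]


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image_emb, text_embs, temperature=100.0):
    """Return per-label probabilities from temperature-scaled cosine similarities."""
    img = normalize(image_emb)
    sims = [sum(a * b for a, b in zip(img, normalize(t))) for t in text_embs]
    return softmax([temperature * s for s in sims])


def is_schematic(image_emb, text_embs, threshold=0.5):
    """Keep an image only if the 'schematic diagram' probability reaches the threshold."""
    probs = classify(image_emb, text_embs)
    return probs[LABELS.index("schematic diagram")] >= threshold


# Toy stand-ins for CLIP embeddings (assumption: one text embedding per label).
TEXT_EMBS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
diagram_like = [0.9, 0.1, 0.1]
plot_like = [0.1, 0.9, 0.2]

print(is_schematic(diagram_like, TEXT_EMBS))  # True
print(is_schematic(plot_like, TEXT_EMBS))     # False
```

In a real setting the toy vectors would be replaced by embeddings from an actual CLIP image and text encoder, and the threshold would need tuning against labeled examples.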

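Pairing each diagram with an abstract, a caption, and figure-reference sentences means a query can be matched against whichever field suits its granularity. The following is a rough sketch of that idea, not the released codebase. The records, the field names (`abstract`, `caption`, `references`), and the `search` helper are all hypothetical, and plain bag-of-words cosine similarity stands in for the learned embeddings a real index would likely use.

```python
import math
import re
from collections import Counter

# Hypothetical records; the real DiagramBank schema may use different fields.
RECORDS = [
    {
        "id": "fig-a",
        "abstract": "We propose a transformer model for protein structure prediction.",
        "caption": "Overview of the encoder and decoder architecture.",
        "references": "Figure 1 shows how attention layers pass features to the decoder.",
    },
    {
        "id": "fig-b",
        "abstract": "A reinforcement learning agent for robotic grasping.",
        "caption": "Training loop of the policy and the simulator.",
        "references": "As shown in Figure 2, the policy receives rewards from the simulator.",
    },
]


def bag(text):
    """Lowercase bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count bags; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)


def search(query, field, records=RECORDS, top_k=1):
    """Rank records by similarity between the query and one chosen text field."""
    q = bag(query)
    scored = [(cosine(q, bag(r[field])), r["id"]) for r in records]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [rid for _, rid in scored[:top_k]]


# Coarse query against abstracts, finer queries against captions and references.
print(search("robotic grasping agent", "abstract"))        # ['fig-b']
print(search("encoder decoder architecture", "caption"))   # ['fig-a']
print(search("attention layers", "references"))            # ['fig-a']
```

Keeping one index per field, rather than concatenating all text, lets a paper-level query and a figure-level query be scored against the context that matches their level of detail.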