A Structure-First Paradigm for Morphological Parsing: Synthesizing Discrete Representation and Diffusion Model

Zhan Chen (Beijing Normal University, corresponding author), Fangzhou Liu (Tsinghua University), Martijn Naaijer (University of Zurich), Willem Th. van Peursen (Vrije Universiteit Amsterdam)

Research Square preprint, Article. This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8509249/v1
This work is licensed under a CC BY 4.0 License.
Status: Under Review at Humanities and Social Sciences Communications. Version 1, posted April 21st, 2026.

Abstract

Under the ETCBC encoding system, morphological parsing is a rigorous reconstruction of internal word structures rather than a simple tagging task. While contemporary NLP paradigms emphasize performance gains through data accumulation, we demonstrate that in Classical Syriac, a strictly bounded corpus, scaling training data produces counterintuitive results: expanding the training set with distributionally divergent, out-of-distribution (OOD) data fails to yield positive transfer.

To overcome this, we propose a Structure-First solution, which prioritizes representational and architectural constraints over raw data scale. This paradigm integrates two synergistic interventions: 1) a Discretization Strategy that maps variable-length morphological strings into atomic primitives, establishing the fixed-length structural alignment required by the models used in the next step, and 2) an Encoder-Only Classifier, which leads to a Masked Diffusion Model that employs iterative denoising to capture global morphological dependencies. Unlike autoregressive models, which are limited by linear error propagation, the diffusion mechanism enables the model to resolve ambiguities dynamically through an evolving global context, a process mirroring the non-linear cognitive workflow of expert philologists. Our approach achieves a state-of-the-art Character Error Rate (CER) of 3.42%, overcoming the performance plateau effect. These findings suggest that for the parsing of low-resource historical languages, optimized structural representation and architectural improvement prove more effective than indiscriminate data scaling.

Subject areas: Physical sciences/Engineering; Physical sciences/Mathematics and computing
Keywords: Morphological Parsing, Digital Philology, Scaling Law, Masked Diffusion Models, Low-Resource NLP, Classical Syriac

Additional Declarations: No competing interests reported.
Supplementary Files: supplementarymethods.pdf

Editorial timeline
Reviewers agreed at journal: 04 May, 2026
Reviewers invited by journal: 14 Apr, 2026
Editor assigned by journal: 14 Apr, 2026
Editor invited by journal: 03 Feb, 2026
Submission checks completed at journal: 29 Jan, 2026
First submitted to journal: 29 Jan, 2026
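The iterative denoising that the abstract attributes to the Masked Diffusion Model can be illustrated with a small sketch. This is not the authors' implementation: the confidence-ranked unmasking schedule, the function names (iterative_unmask, toy_scorer), and the slot labels are all assumptions made for illustration, and the toy scorer merely stands in for a trained encoder-only classifier. The sketch assumes only what the abstract states, namely that each word is represented as a fixed number of discrete slots and that predictions are refined over several passes using the evolving global context.

```python
from typing import Callable, List, Tuple

MASK = "<mask>"

# Hypothetical scorer type: given the current, partially filled slots, return
# one (label, confidence) guess per position. In a real system this would be
# a trained encoder-only classifier; here it is a stand-in.
Scorer = Callable[[List[str]], List[Tuple[str, float]]]


def iterative_unmask(length: int, scorer: Scorer, steps: int) -> List[str]:
    """Fill `length` slots in at most `steps` passes, most confident first."""
    slots = [MASK] * length
    for step in range(steps):
        masked = [i for i, s in enumerate(slots) if s == MASK]
        if not masked:
            break
        # Re-score every position against the current global context.
        guesses = scorer(slots)
        # Reveal enough slots per pass that all are filled by the last pass.
        remaining_steps = steps - step
        k = max(1, -(-len(masked) // remaining_steps))  # ceiling division
        most_confident = sorted(
            masked, key=lambda i: guesses[i][1], reverse=True
        )[:k]
        for i in most_confident:
            slots[i] = guesses[i][0]
    return slots


def toy_scorer(slots: List[str]) -> List[Tuple[str, float]]:
    """Invented stand-in for a model: fixed labels, context-dependent confidence."""
    # These labels are made up for the example; they are not ETCBC codes.
    target = ["prefix=none", "stem=KTB", "suffix=W", "state=emphatic"]
    guesses = []
    for i, label in enumerate(target):
        # Confidence grows as more of the other slots are already resolved.
        context = sum(1 for j, s in enumerate(slots) if j != i and s != MASK)
        base = 0.9 if i == 1 else 0.4
        guesses.append((label, min(1.0, base + 0.1 * context)))
    return guesses


if __name__ == "__main__":
    print(iterative_unmask(4, toy_scorer, steps=2))
```

In this sketch no slot is committed in a fixed left-to-right order: the most confident position is resolved first and the others are re-scored with that decision visible, which is the contrast with autoregressive decoding that the abstract draws.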