Overcoming data scarcity in generative language modelling for low-resource languages: a systematic review

doi:10.21203/rs.3.rs-7075948/v1

2025 · doi:10.21203/rs.3.rs-7075948/v1

preprint OA: closed


Research Article · Research Square

Josh McGiff (corresponding author), Nikola S. Nikolov
University of Limerick

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-7075948/v1

This work is licensed under a CC BY 4.0 License.

Version 1 posted September 12th, 2025

Abstract

Generative language modelling has surged in popularity with the emergence of services such as ChatGPT and Google Gemini. While these models have demonstrated transformative potential in productivity and communication, they overwhelmingly cater to high-resource languages like English. This has amplified concerns over linguistic inequality in natural language processing (NLP). This paper presents the first systematic review focused specifically on strategies to address data scarcity in generative language modelling for low-resource languages (LRL). Drawing from 54 studies, we identify, categorise and evaluate technical approaches, including monolingual data augmentation, back-translation, multilingual training and prompt engineering, across generative tasks. We also analyse trends in architecture choices, language family representation, and evaluation methods. Our findings highlight a strong reliance on transformer-based models, a concentration on a small subset of LRLs, and a lack of consistent evaluation across studies. We conclude with recommendations for extending these methods to a wider range of LRLs and outline open challenges in building equitable generative language systems. Ultimately, this review aims to support researchers and developers in building inclusive AI tools for underrepresented languages, a necessary step toward empowering LRL speakers and the preservation of linguistic diversity in a world increasingly shaped by large-scale language technologies.

Keywords: Generative language modelling, Low-resource languages, Natural language generation, Systematic review, Data scarcity

Additional Declarations

No competing interests reported.

Status: Under Review (Language Resources and Evaluation)

Editorial decision: Revision requested, 02 Apr, 2026
Reviews received at journal, 17 Nov, 2025
Reviewers agreed at journal, 28 Sep, 2025
Reviewers agreed at journal, 06 Sep, 2025
Reviewers invited by journal, 06 Sep, 2025
Editor assigned by journal, 19 Aug, 2025
Submission checks completed at journal, 09 Jul, 2025
First submitted to journal, 08 Jul, 2025


Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00