AssameseRoBERTa: A Monolingual Language Model for Low-Resource Assamese NLP

doi:10.21203/rs.3.rs-8124065/v1

2025 · doi:10.21203/rs.3.rs-8124065/v1

preprint OA: closed

Research Square · Research Article

Badal Nyalang

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8124065/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

We present AssameseRoBERTa, a monolingual language model trained from scratch on 1.6 million Assamese sentences comprising approximately 77 million tokens. Despite being trained on a relatively modest corpus compared to mainstream language models, our model achieves remarkable performance improvements over existing multilingual baselines. AssameseRoBERTa obtains a perplexity of 1.57 on in-domain text and 5.93 on unseen text, representing a 7.7× improvement over the previous best Assamese-specific model and outperforming multilingual models like mBERT and MuRIL by significant margins.
Our approach demonstrates that dedicated monolingual models can effectively address the challenges of low-resource language processing, particularly for morphologically rich languages like Assamese. We release our model and training methodology to facilitate further research in Northeast Indian language technologies.

Computer Architecture and Engineering · Low-resource languages · Assamese NLP · RoBERTa · Language models · Northeast Indian languages

1. Introduction

The development of natural language processing (NLP) technologies for low-resource languages remains a critical challenge in the field of computational linguistics. While recent advances in large language models (LLMs) have revolutionized NLP for high-resource languages, the majority of the world's languages remain underserved by these technologies (Hettiarachchi et al., 2025; Minaee et al., 2024). This disparity is particularly pronounced for languages of the Indian subcontinent, where despite a collective speaker base exceeding one billion, many languages lack adequate computational resources.

Assamese, an Indo-Aryan language spoken by over 15 million people primarily in the Indian state of Assam, exemplifies this challenge. As one of the 22 scheduled languages of India, Assamese plays a crucial role in the cultural and administrative landscape of Northeast India. However, the language has historically been underrepresented in NLP research, with limited availability of high-quality language models and computational resources.

Recent efforts to address this gap have primarily relied on multilingual models such as mBERT (Devlin et al., 2019), XLM-R (Conneau et al., 2020), and IndicBERT (Kakwani et al., 2020). While these models provide broad coverage across multiple languages, they often struggle with the unique linguistic characteristics of individual languages, particularly those with complex morphology and limited training data representation.
The IndicBERT initiative by AI4Bharat represents a significant step forward, covering 12 major Indian languages including Assamese. However, with approximately 110 million parameters shared across all languages, the model's capacity to capture language-specific nuances remains limited.

In this work, we present AssameseRoBERTa, a dedicated monolingual language model for Assamese that addresses these limitations. Our key contributions are:

- Development of the largest curated Assamese corpus to date, comprising 1.6 million sentences and approximately 77 million tokens
- Training of a RoBERTa-based model from scratch, optimized specifically for Assamese language characteristics
- Comprehensive evaluation demonstrating state-of-the-art performance on both in-domain and out-of-domain text
- Public release of the model and training framework to support future research in Assamese NLP

2. Related Work

2.1 Language Models for Low-Resource Languages

The challenge of developing language models for low-resource languages has garnered increasing attention in recent years. The First Workshop on Language Models for Low-Resource Languages (LoResLM 2025) highlighted that while neural language models have revolutionized NLP, their capabilities remain primarily determined by the characteristics of their pre-training corpora, creating disparities for languages with limited resources (Hettiarachchi et al., 2025).

Recent approaches have explored various strategies to address data scarcity in low-resource settings. Transfer learning techniques, particularly through large language models, have shown promise but with limited success in extremely low-resource scenarios (Tran et al., 2024). Al Nazi et al. (2025) demonstrated that specialized prompting techniques like few-shot and chain-of-thought can improve LLM performance for languages like Bengali, though significant challenges remain.
The development of small language models (SLMs) has emerged as a particularly promising direction for 2025 and beyond. As noted by industry analysts, SLMs offer advantages including faster training times, lower carbon footprint, and improved security compared to large-scale models (Al-Dhahir, 2024). These models, typically containing fewer than 10 billion parameters, are particularly well-suited for domain-specific functions and resource-constrained environments.

2.2 Multilingual Models for Indian Languages

The landscape of Indian language NLP has been significantly shaped by several multilingual initiatives. IndicBERT (Kakwani et al., 2020) represents one of the most comprehensive efforts, providing a multilingual ALBERT model pre-trained on 12 major Indian languages with approximately 9 billion tokens. Despite having fewer parameters than models like mBERT, IndicBERT achieves competitive performance across various tasks through its focused approach to Indian languages.

However, analyses of transformer-based models for Indian languages have revealed important limitations. Shridhar et al. (2020) found that while monolingual models generally outperform their multilingual counterparts for languages like Hindi, Telugu, and Bengali, the choice of tokenizer significantly impacts performance for morphologically rich languages. Their work particularly highlighted that RoBERTa's byte-level BPE tokenizer can adversely affect the typology of such languages.

The MuRIL model by Google (Khanuja et al., 2021) and various community efforts have further contributed to this space. However, as demonstrated in our evaluation, these models still struggle with language-specific nuances, particularly for languages with limited representation in their training data.

2.3 Assamese Language Processing

Previous work on Assamese NLP has seen several initiatives with varying degrees of success.
AxomiyaBERTa (Nath et al., 2023) introduced a phonologically-aware transformer model that incorporated Assamese phonological features into the tokenization process and used an embedding disperser mechanism to address embedding space anisotropy. While this model showed improvements over multilingual baselines on tasks like NER and sentiment analysis, our evaluation reveals significant challenges in its language modeling capabilities, as evidenced by extremely high perplexity scores.

The L3Cube initiative (Joshi, 2022) released monolingual BERT models for multiple Indian languages including Assamese, as part of their broader effort to create language-specific models. While the L3Cube Assamese-BERT performs better than generic multilingual models, it still shows substantial room for improvement, with perplexity scores of 48.82 on training domain text and 12.59 on unseen text in our evaluation.

The unique characteristics of Assamese, including its complex morphology, use of the Bengali script with language-specific modifications, and extensive use of compound words and inflections, pose particular challenges for language modeling. These features necessitate specialized approaches that can capture the language's linguistic patterns effectively, which our byte-level BPE tokenization approach addresses more successfully than previous attempts.

3. Methodology

3.1 Data Collection and Preprocessing

We compiled the MWirelabs/assamese-monolingual-corpus, currently the largest curated Assamese corpus, containing 1,613,879 lines of text from diverse sources including news articles, literature, web crawl data, government documents, and social media content. This diversity ensures broad domain coverage and helps the model generalize across different text styles and registers.
Our preprocessing pipeline focused on maintaining the authenticity of Assamese text while ensuring data quality:

- Removal of newline artifacts and formatting inconsistencies
- Preservation of native Assamese orthography without normalization
- Retention of diacritics and language-specific characters
- No deduplication to maintain natural text distribution

3.2 Tokenizer Design

We developed a custom RoBERTa Byte-Level BPE tokenizer specifically for Assamese, addressing the limitations identified in previous work regarding byte-level tokenizers for morphologically rich languages. Our tokenizer configuration includes:

- Vocabulary size of 50,265 tokens
- Minimum frequency threshold of 2
- Special tokens: <s>, <pad>, </s>, <unk>, <mask>
- ByteLevel pre-tokenization with prefix space addition
- RoBERTa-style post-processing

3.3 Model Architecture

AssameseRoBERTa follows the standard RoBERTa-base architecture with the following specifications:

Parameter                  Value
Hidden Size                768
Hidden Layers              12
Attention Heads            12
Intermediate Size          3072
Max Position Embeddings    130
Total Parameters           ~110M

3.4 Training Procedure

The model was trained using masked language modeling (MLM) with a 20% masking ratio. Our training configuration was optimized for efficiency on a single NVIDIA A40 GPU (48GB):

- Sequence length: 128 tokens
- Per-device batch size: 64
- Gradient accumulation steps: 2 (effective batch size: 128)
- Optimizer: AdamW with learning rate 5e-5
- Learning rate schedule: Cosine with 8000 warmup steps
- Precision: BF16 mixed precision
- Training epochs: 10
- Total training time: ~12 hours

For data packing, we concatenated the tokenized corpus and split it into fixed 128-token blocks, allocating 99% for training and 1% for evaluation. This approach ensures stable MLM learning without per-line truncation.

4. Results and Evaluation

4.1 Training Dynamics

The model exhibited stable convergence throughout training, with validation loss decreasing from 4.28 at 5k steps to 0.91 at 55k steps.
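As a rough aid to reading these loss values, a mean cross-entropy loss can be turned into a perplexity by exponentiation. The sketch below assumes the reported validation losses are mean masked-token cross-entropies measured in nats, which the text does not state explicitly; the function name `perplexity_from_loss` is illustrative and is not taken from the released training framework.

```python
import math


def perplexity_from_loss(mean_cross_entropy_nats: float) -> float:
    """Convert a mean cross-entropy loss (natural log base) into a perplexity."""
    return math.exp(mean_cross_entropy_nats)


# Validation losses quoted in Section 4.1, keyed by training step.
reported_losses = {5_000: 4.28, 55_000: 0.91}

for step, loss in reported_losses.items():
    ppl = perplexity_from_loss(loss)
    print(f"step {step:>6}: loss {loss:.2f} -> perplexity {ppl:.2f}")
```

Under that assumption, the final loss of 0.91 corresponds to a perplexity of about 2.48, and the early loss of 4.28 to roughly 72.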
The smooth training curve without instability or overfitting indicates effective regularization and appropriate model capacity for the dataset size.

4.2 Perplexity Evaluation

We evaluated AssameseRoBERTa against existing baselines using perplexity as the primary metric, testing on both in-domain and out-of-domain text to assess generalization capabilities.

Table 1. Perplexity comparison on training domain and unseen Assamese text

Model                     Training Domain PPL    Unseen Text PPL
AssameseRoBERTa (Ours)    1.78                   2.53
Assamese-BERT (L3Cube)    48.82                  12.59
MuRIL                     85.73                  8.70
mBERT                     26.71                  18.16
IndicBERT                 3194.18                595.46
AxomiyaBERTa              83.6M                  30.9M

Our model achieves the lowest perplexity scores across all evaluation settings, with 1.78 on training domain text and 2.53 on unseen text. Relative to the previous Assamese-specific model (L3Cube Assamese-BERT), this is a 27× improvement on training domain text (48.82 vs. 1.78) and a roughly 5× improvement on unseen text (12.59 vs. 2.53). Relative to MuRIL, the next best baseline on unseen text, the improvement is roughly 3.4× (8.70 vs. 2.53). Notably, AxomiyaBERTa, despite incorporating phonological features, shows extremely high perplexity values, suggesting potential issues with tokenization or domain mismatch.

5. Discussion

5.1 Model Effectiveness

The significant performance improvements achieved by AssameseRoBERTa validate the importance of dedicated monolingual models for low-resource languages. With perplexity scores of 1.78 on training domain text and 2.53 on unseen text, our model establishes a new benchmark for Assamese language modeling. The dramatic improvements over existing models, including a 27× reduction in training domain perplexity compared to L3Cube Assamese-BERT and over 12,000× improvement over AxomiyaBERTa, highlight the effectiveness of our approach.

The gap between training domain (1.78) and unseen text perplexity (2.53) is remarkably small, indicating excellent generalization capabilities. This is particularly noteworthy given that even well-established multilingual models like mBERT and MuRIL show much larger gaps between their training and unseen text performance.
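The ratios and gaps discussed here can be recomputed directly from Table 1. The following is a minimal sketch, not part of the author's evaluation code: the names `TABLE_1`, `improvement_factors`, and `domain_gap` are illustrative, and AxomiyaBERTa is left out because the "M" suffix on its values is not defined in the text. The gap is reported as a signed difference (unseen minus training domain), so it is negative for baselines whose unseen text perplexity is lower than their training domain perplexity.

```python
# Perplexities copied from Table 1: (training domain PPL, unseen text PPL).
TABLE_1 = {
    "AssameseRoBERTa": (1.78, 2.53),
    "Assamese-BERT (L3Cube)": (48.82, 12.59),
    "MuRIL": (85.73, 8.70),
    "mBERT": (26.71, 18.16),
    "IndicBERT": (3194.18, 595.46),
}


def improvement_factors(table, ours="AssameseRoBERTa"):
    """Return (baseline PPL / our PPL) per column for every baseline model."""
    our_train, our_unseen = table[ours]
    return {
        name: (train / our_train, unseen / our_unseen)
        for name, (train, unseen) in table.items()
        if name != ours
    }


def domain_gap(table):
    """Return unseen PPL minus training domain PPL for every model."""
    return {name: unseen - train for name, (train, unseen) in table.items()}


factors = improvement_factors(TABLE_1)
gaps = domain_gap(TABLE_1)

print(f"{'AssameseRoBERTa':<24} gap {gaps['AssameseRoBERTa']:+.2f}")
for name, (f_train, f_unseen) in factors.items():
    print(f"{name:<24} {f_train:8.1f}x train  {f_unseen:6.1f}x unseen  gap {gaps[name]:+.2f}")
```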
The consistent low perplexity across both evaluation sets suggests that our model has effectively learned robust representations of the Assamese language.

5.2 Tokenizer Quality

Our byte-level BPE tokenizer successfully addresses the challenges of Assamese morphology while preventing out-of-vocabulary issues. The tokenizer handles Assamese-English code-mixing effectively, a crucial feature given the prevalence of bilingual text in real-world applications.

5.3 Training Efficiency

The ability to train a state-of-the-art model in approximately 12 hours on a single GPU demonstrates the feasibility of developing high-quality language models without massive computational infrastructure. This efficiency is particularly important for research groups and organizations working with limited resources, aligning with the growing trend toward sustainable AI development.

5.4 Limitations and Future Work

While our model achieves strong performance on language modeling tasks, several limitations warrant future investigation:

- The maximum sequence length of 130 tokens may limit performance on longer documents
- Evaluation on downstream tasks such as NER, sentiment analysis, and question answering remains to be conducted
- The model's performance on code-mixed text, while promising, requires systematic evaluation
- Scaling to larger model sizes and longer contexts could further improve performance

6. Conclusion

We presented AssameseRoBERTa, a monolingual language model that establishes new state-of-the-art performance for Assamese NLP. Our work demonstrates that dedicated models trained on curated corpora can significantly outperform larger multilingual alternatives, even with limited computational resources. The exceptional performance, with perplexity of 1.78 on training domain text and 2.53 on unseen text, represents a 27× improvement over previous Assamese-specific models and validates the importance of language-specific approaches for low-resource languages.
By releasing our model and training framework under an open license, we aim to catalyze further research in Assamese and other Northeast Indian languages. Our work contributes to the broader goal of democratizing NLP technologies and ensuring linguistic diversity in the age of AI.

7. Ethics Statement

This work was conducted with careful consideration of ethical implications. Our training corpus was collected from publicly available sources with appropriate usage rights. We acknowledge that the model may reflect biases present in the training data and recommend careful evaluation before deployment in sensitive applications. The model is released under the Creative Commons Attribution 4.0 International License to ensure broad accessibility while maintaining attribution requirements.

References

Al-Dhahir I (2024) Small Language Models Set for High Market Impact in 2025. Global Data Market Intelligence Report
Al Nazi Z, Hossain MR, Mamun A, F (2025) Evaluation of open and closed-source LLMs for low-resource language with zero-shot, few-shot, and chain-of-thought prompting. Nat Lang Process 10:100124
Conneau A et al (2020) Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of ACL 2020
Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT 2019
Hettiarachchi H, Ranasinghe T, Rayson P, Mitkov R, Gaber M, Premasiri D, Tan FA, Uyangodage L (2025) Overview of the First Workshop on Language Models for Low-Resource Languages (LoResLM 2025). In Proceedings of LoResLM 2025, Abu Dhabi, UAE
Joshi R (2022) L3Cube-HindBERT and DevBERT: Pre-Trained BERT Transformer models for Devanagari based Hindi and Marathi Languages. arXiv preprint arXiv:2211.11418
Kakwani D, Kunchukuttan A, Golla S, Gokul NC, Bhattacharyya A, Khapra MM, Kumar P (2020) IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages.
In Findings of EMNLP 2020
Khanuja S et al (2021) MuRIL: Multilingual Representations for Indian Languages. arXiv preprint arXiv:2103.10730
Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692
Minaee S, Mikolov T, Nikzad N, Chenaghlu M, Socher R, Amatriain X, Gao J (2024) Large language models: A survey. arXiv preprint arXiv:2402.06196
Nath A, Mannan S, Krishnaswamy N (2023) AxomiyaBERTa: A Phonologically-aware Transformer Model for Assamese. In Findings of the Association for Computational Linguistics: ACL 2023, pages 11629–11646
Shridhar K (2020) Indic Transformers: An Analysis of Transformer Language Models for Indian Languages. NeuralSpace Technical Report
Tamang S, Bora DJ (2024) Performance Evaluation of Tokenizers in Large Language Models for the Assamese Language. arXiv preprint arXiv:2410.03718
Tran C et al (2024) Transfer Learning Limitations in Extremely Low-Resource Settings. In Proceedings of LREC-COLING 2024
Zhong T, Yang Z, Liu Z, Zhang R, Liu Y, Sun H, Pan Y, Li Y, Zhou Y, Jiang H, Chen J, Liu T (2025) Opportunities and Challenges of Large Language Models for Low-Resource Languages in Humanities Research. arXiv preprint arXiv:2412.04497

Additional Declarations

The authors declare no competing interests.
Author details: Badal Nyalang (corresponding author), MWire Labs. ORCID: https://orcid.org/0009-0006-6923-0797
Our tokenizer configuration includes:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eVocabulary size of 50,265 tokens\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eMinimum frequency threshold of 2\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eSpecial tokens: \u0026lt;s\u0026gt;, \u0026lt;/s\u0026gt;, \u0026lt;pad\u0026gt;, \u0026lt;unk\u0026gt;, \u0026lt;mask\u0026gt;\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eByteLevel pre-tokenization with prefix space addition\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eRoBERTa-style post-processing\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\u003ch2\u003e3.3 Model Architecture\u003c/h2\u003e\u003cp\u003eAssameseRoBERTa follows the standard RoBERTa-base architecture with the following specifications:\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"No\" id=\"Taba\" border=\"1\"\u003e\u003ccolgroup cols=\"2\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eParameter\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eValue\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eHidden Size\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e768\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eHidden Layers\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e12\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAttention 
Heads\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e12\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eIntermediate Size\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e3072\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMax Position Embeddings\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e130\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eTotal Parameters\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e~\u0026thinsp;110M\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e\u003ch2\u003e3.4 Training Procedure\u003c/h2\u003e\u003cp\u003eThe model was trained using masked language modeling (MLM) with a 20% masking ratio. 
Our training configuration was optimized for efficiency on a single NVIDIA A40 GPU (48GB):\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eSequence length: 128 tokens\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ePer-device batch size: 64\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eGradient accumulation steps: 2 (effective batch size: 128)\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eOptimizer: AdamW with learning rate 5e-5\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eLearning rate schedule: Cosine with 8000 warmup steps\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003ePrecision: BF16 mixed precision\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eTraining epochs: 10\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eTotal training time: ~12 hours\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eFor data packing, we concatenated the tokenized corpus and split it into fixed 128-token blocks, allocating 99% for training and 1% for evaluation. This approach ensures stable MLM learning without per-line truncation.\u003c/p\u003e\u003c/div\u003e"},{"header":"4. Results and Evaluation","content":"\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\u003ch2\u003e4.1 Training Dynamics\u003c/h2\u003e\u003cp\u003eThe model exhibited stable convergence throughout training, with validation loss decreasing from 4.28 at 5k steps to 0.91 at 55k steps. 
The smooth training curve without instability or overfitting indicates effective regularization and appropriate model capacity for the dataset size.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\u003ch2\u003e4.2 Perplexity Evaluation\u003c/h2\u003e\u003cp\u003eWe evaluated AssameseRoBERTa against existing baselines using perplexity as the primary metric, testing on both in-domain and out-of-domain text to assess generalization capabilities.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003ePerplexity comparison on training domain and unseen Assamese text\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eModel\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eTraining Domain PPL\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eUnseen Text PPL\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAssameseRoBERTa (Ours)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003e1.78\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003e2.53\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAssamese-BERT (L3Cube)\u003c/p\u003e\u003c/td\u003e\u003ctd 
align=\"left\" colname=\"c2\"\u003e\u003cp\u003e48.82\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e12.59\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMuRIL\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e85.73\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e8.70\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003emBERT\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e26.71\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e18.16\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eIndicBERT\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e3194.18\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e595.46\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAxomiyaBERTa\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e83.6M\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e30.9M\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eOur model achieves the lowest perplexity scores across all evaluation settings, with 1.78 on training domain text and 2.53 on unseen text. Relative to the previous Assamese-specific model (L3Cube Assamese-BERT), this is a roughly 27\u0026times; reduction on training domain text (48.82 vs. 1.78) and a roughly 5\u0026times; reduction on unseen text (12.59 vs. 2.53). Relative to the next best baseline on unseen text (MuRIL, 8.70), the reduction is roughly 3.4\u0026times;. 
Notably, AxomiyaBERTa, despite incorporating phonological features, shows extremely high perplexity values, suggesting potential issues with tokenization or domain mismatch.\u003c/p\u003e\u003c/div\u003e"},{"header":"5. Discussion","content":"\u003cdiv id=\"Sec15\" class=\"Section2\"\u003e\u003ch2\u003e5.1 Model Effectiveness\u003c/h2\u003e\u003cp\u003eThe performance improvements achieved by AssameseRoBERTa support the importance of dedicated monolingual models for low-resource languages. With perplexity scores of 1.78 on training domain text and 2.53 on unseen text, our model establishes a new benchmark for Assamese language modeling. The improvements over existing models, including a 27\u0026times; reduction in perplexity compared to L3Cube Assamese-BERT on training domain text and over 12,000\u0026times; improvement over AxomiyaBERTa, highlight the effectiveness of our approach.\u003c/p\u003e\u003cp\u003eThe gap between training domain (1.78) and unseen text perplexity (2.53) is small, indicating good generalization to the unseen evaluation text. By comparison, the multilingual models mBERT and MuRIL show much larger differences between the two evaluation sets. The consistently low perplexity across both evaluation sets suggests that our model has learned robust representations of the Assamese language.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec16\" class=\"Section2\"\u003e\u003ch2\u003e5.2 Tokenizer Quality\u003c/h2\u003e\u003cp\u003eOur byte-level BPE tokenizer successfully addresses the challenges of Assamese morphology while preventing out-of-vocabulary issues. 
The tokenizer also appears to handle Assamese-English code-mixing, an important property given the prevalence of bilingual text in real-world applications, although this has not yet been evaluated systematically (see Section 5.4).\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec17\" class=\"Section2\"\u003e\u003ch2\u003e5.3 Training Efficiency\u003c/h2\u003e\u003cp\u003eThe ability to train a state-of-the-art model in approximately 12 hours on a single GPU demonstrates the feasibility of developing high-quality language models without massive computational infrastructure. This efficiency is particularly important for research groups and organizations working with limited resources, aligning with the growing trend toward sustainable AI development.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec18\" class=\"Section2\"\u003e\u003ch2\u003e5.4 Limitations and Future Work\u003c/h2\u003e\u003cp\u003eWhile our model achieves strong performance on language modeling tasks, several limitations warrant future investigation:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eThe maximum sequence length of 128 tokens may limit performance on longer documents\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eEvaluation on downstream tasks such as NER, sentiment analysis, and question answering remains to be conducted\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eThe model's performance on code-mixed text, while promising, requires systematic evaluation\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eScaling to larger model sizes and longer contexts could further improve performance\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/div\u003e"},{"header":"6. Conclusion","content":"\u003cp\u003eWe presented AssameseRoBERTa, a monolingual language model that achieves the lowest perplexity among the Assamese and multilingual models we evaluated. Our work demonstrates that dedicated models trained on curated corpora can significantly outperform larger multilingual alternatives, even with limited computational resources. 
The resulting perplexity of 1.78 on training domain text and 2.53 on unseen text represents a 27\u0026times; improvement on training domain text over the previous Assamese-specific model (L3Cube Assamese-BERT) and supports the importance of language-specific approaches for low-resource languages.\u003c/p\u003e\u003cp\u003eBy releasing our model and training framework under an open license, we aim to catalyze further research in Assamese and other Northeast Indian languages. Our work contributes to the broader goal of democratizing NLP technologies and ensuring linguistic diversity in the age of AI.\u003c/p\u003e"},{"header":"7. Ethics Statement","content":"\u003cp\u003eThis work was conducted with careful consideration of ethical implications. Our training corpus was collected from publicly available sources with appropriate usage rights. We acknowledge that the model may reflect biases present in the training data and recommend careful evaluation before deployment in sensitive applications. The model is released under the Creative Commons Attribution 4.0 International License to ensure broad accessibility while maintaining attribution requirements.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eAl-Dhahir I (2024) Small Language Models Set for High Market Impact in 2025. Global Data Market Intelligence Report\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAl Nazi Z, Hossain MR, Mamun A, F (2025) Evaluation of open and closed-source LLMs for low-resource language with zero-shot, few-shot, and chain-of-thought prompting. Nat Lang Process 10:100124\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eConneau A et al (2020) Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of ACL 2020\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDevlin J, Chang MW, Lee K, Toutanova K (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 
In Proceedings of NAACL-HLT 2019\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHettiarachchi H, Ranasinghe T, Rayson P, Mitkov R, Gaber M, Premasiri D, Tan FA, Uyangodage L (2025) Overview of the First Workshop on Language Models for Low-Resource Languages (LoResLM 2025). In Proceedings of LoResLM 2025, Abu Dhabi, UAE\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eJoshi R (2022) L3Cube-HindBERT and DevBERT: Pre-Trained BERT Transformer models for Devanagari based Hindi and Marathi Languages. arXiv preprint arXiv:2211.11418.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKakwani D, Kunchukuttan A, Golla S, Gokul NC, Bhattacharyya A, Khapra MM, Kumar P (2020) IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages. In Findings of EMNLP 2020\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKhanuja S et al (2021) MuRIL: Multilingual Representations for Indian Languages. arXiv preprint arXiv:2103.10730\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLiu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMinaee S, Mikolov T, Nikzad N, Chenaghlu M, Socher R, Amatriain X, Gao J (2024) Large language models: A survey. arXiv preprint arXiv:2402.06196\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eNath A, Mannan S, Krishnaswamy N (2023) AxomiyaBERTa: A Phonologically-aware Transformer Model for Assamese. In Findings of the Association for Computational Linguistics: ACL 2023, pages 11629\u0026ndash;11646\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eShridhar K (2020) Indic Transformers: An Analysis of Transformer Language Models for Indian Languages. 
NeuralSpace Technical Report\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eTamang S, Bora DJ (2024) Performance Evaluation of Tokenizers in Large Language Models for the Assamese Language. arXiv preprint arXiv:2410.03718\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eTran C et al (2024) Transfer Learning Limitations in Extremely Low-Resource Settings. In Proceedings of LREC-COLING 2024\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhong T, Yang Z, Liu Z, Zhang R, Liu Y, Sun H, Pan Y, Li Y, Zhou Y, Jiang H, Chen J, Liu T (2025) Opportunities and Challenges of Large Language Models for Low-Resource Languages in Humanities Research. arXiv preprint arXiv:2412.04497\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"MWire Labs","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Low-resource languages, Assamese NLP, RoBERTa, Language models, Northeast Indian languages","lastPublishedDoi":"10.21203/rs.3.rs-8124065/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8124065/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eWe present AssameseRoBERTa, 
a monolingual language model trained from scratch on 1.6\u0026nbsp;million Assamese sentences comprising approximately 77\u0026nbsp;million tokens. Despite being trained on a relatively modest corpus compared to mainstream language models, our model achieves remarkable performance improvements over existing multilingual baselines. AssameseRoBERTa obtains a perplexity of 1.57 on in-domain text and 5.93 on unseen text, representing a 7.7\u0026times; improvement over the previous best Assamese-specific model and outperforming multilingual models like mBERT and MuRIL by significant margins. Our approach demonstrates that dedicated monolingual models can effectively address the challenges of low-resource language processing, particularly for morphologically rich languages like Assamese. We release our model and training methodology to facilitate further research in Northeast Indian language technologies.\u003c/p\u003e","manuscriptTitle":"AssameseRoBERTa: A Monolingual Language Model for Low-Resource Assamese NLP","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-11-18 04:40:57","doi":"10.21203/rs.3.rs-8124065/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"56d2833e-7421-4441-b9f7-177b1b9b702a","owner":[],"postedDate":"November 18th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":58051177,"name":"Computer Architecture and 
Engineering"}],"tags":[],"updatedAt":"2025-11-18T04:40:57+00:00","versionOfRecord":[],"versionCreatedAt":"2025-11-18 04:40:57","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8124065","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8124065","identity":"rs-8124065","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
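
A minimal sketch of the data packing described in Section 3.4: concatenate the tokenized corpus, cut it into fixed 128-token blocks, and hold out 1% for evaluation. The function names, the dropping of the trailing remainder, and the choice to hold out the final blocks are assumptions made for illustration; the paper does not specify them, and this is not the authors' implementation.

```python
from typing import Iterable, List, Tuple

BLOCK_SIZE = 128       # sequence length stated in Section 3.4
EVAL_FRACTION = 0.01   # 1% of blocks held out for evaluation (Section 3.4)


def pack_into_blocks(tokenized_lines: Iterable[List[int]],
                     block_size: int = BLOCK_SIZE) -> List[List[int]]:
    """Concatenate token ids from all lines and cut them into fixed-size blocks.

    Assumption: a trailing remainder shorter than block_size is dropped.
    """
    stream: List[int] = []
    for ids in tokenized_lines:
        stream.extend(ids)
    usable = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]


def split_train_eval(blocks: List[List[int]],
                     eval_fraction: float = EVAL_FRACTION
                     ) -> Tuple[List[List[int]], List[List[int]]]:
    """Hold out the last eval_fraction of blocks (hypothetical split rule).

    At least one block is held out whenever any blocks exist.
    """
    if not blocks:
        return [], []
    n_eval = max(1, int(len(blocks) * eval_fraction))
    cut = len(blocks) - n_eval
    return blocks[:cut], blocks[cut:]


# Toy usage: three "lines" of 100 token ids each give 300 tokens in total.
toy_lines = [[1] * 100, [2] * 100, [3] * 100]
toy_blocks = pack_into_blocks(toy_lines)
toy_train, toy_eval = split_train_eval(toy_blocks)
```

Because blocks are cut from one continuous token stream, no line is truncated to fit the sequence length, which is the property the paper attributes to this packing; the cost is that a block may span a line boundary.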
