Literally Reading behind the Lines: A benchmark for OCR on Cluttered Printed Documents

Rajat Verma (corresponding author), Vriti Sharma, Manikandan Ravikiran, Rohit Saluja, Laxmidhar Behera
Indian Institute of Technology Mandi

Research Article. This is a preprint; it has not been peer reviewed by a journal.
DOI: https://doi.org/10.21203/rs.3.rs-8122105/v1
License: CC BY 4.0
Status: Under Review at the International Journal on Document Analysis and Recognition (IJDAR)
Version 1 posted December 4th, 2025
Additional Declarations: No competing interests reported.

Editorial history
Editorial decision: Revision requested, 01 Apr, 2026
Reviews received at journal, 01 Apr, 2026
Reviewers agreed at journal, 16 Mar, 2026
Reviews received at journal, 12 Feb, 2026
Reviewers agreed at journal, 12 Feb, 2026
Reviewers agreed at journal, 19 Jan, 2026
Reviewers invited by journal, 02 Dec, 2025
Editor assigned by journal, 21 Nov, 2025
Submission checks completed at journal, 19 Nov, 2025
First submitted to journal, 15 Nov, 2025

Abstract
Document clutter is a less-explored problem in stored documents. It arises from accidental spilling or smudging of substances such as sauces, inks, and tea on modern documents, and it is naturally present in historical and legal documents. Such clutter causes loss of information during Optical Character Recognition (OCR) because the cluttered letters or words become unreadable. In this paper, we introduce ClutterOCRBench: a dataset containing 1080 document images with and without clutter, created using a three-step process designed to achieve 100% correct ground truth even though some of the cluttered text is unreadable. In the first step, we print the 1080 pages, covering 12 domains, and scan the printed pages directly. In the second step, we manually add 10 different types of clutter, such as paint, coffee, and mud, to the printed pages at five different levels of degradation. The cluttered pages are scanned in the same orientation as in the first step, which ensures that the sentence-level boxes in the clean images are aligned with those in the cluttered images. In the third step, we manually transcribe the text in the clean documents and use the transcriptions as ground truth for the aligned cluttered documents. We provide a comprehensive comparison of the latest OCR models and Vision Language Models on text extraction from cluttered documents. After fine-tuning on the proposed dataset, the best models achieve a 14% reduction in character error rate (CER) and a 7% reduction in word error rate (WER) on the ClutterOCRBench test set.

Keywords: ClutterOCRBench, Optical Character Recognition, Vision Language Models, CER and WER
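The abstract reports results as CER and WER, scored against a transcription taken from the clean scan and reused for the aligned cluttered scan. The sketch below shows one common way to compute both metrics with a plain Levenshtein distance. It is an illustration under stated assumptions, not the authors' evaluation code: the function names and sample strings are hypothetical, and the text normalization used in the paper (case, whitespace, punctuation) is not given in this excerpt, so none is applied here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical example: one ground-truth line, transcribed from the clean
# scan, scores the OCR output of both the clean and the cluttered scan.
ground_truth = "coffee stains hide text"
ocr_clean = "coffee stains hide text"
ocr_cluttered = "coffee stans hide"

clean_scores = (cer(ground_truth, ocr_clean), wer(ground_truth, ocr_clean))
cluttered_scores = (cer(ground_truth, ocr_cluttered),
                    wer(ground_truth, ocr_cluttered))
```

Because the reference always comes from the clean page, unreadable regions in the cluttered scan still count against the model as deletions or substitutions. The abstract does not say whether the reported 14% and 7% reductions are absolute or relative, so this sketch does not attempt to reproduce them.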

