Entropy-Regularized Joint CTC–Attention Learning for Low-Resource Continuous Sign Language Recognition

Research Article

Hanan A. Taher (Technical College of Duhok, Duhok Polytechnic University; corresponding author), Subhi R. M. Zeebaree (Technical College of Engineering, Duhok Polytechnic University)

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8387768/v1
This work is licensed under a CC BY 4.0 License.
Status: Under Review. Version 1.

Abstract

Continuous Sign Language Recognition (CSLR) seeks to transcribe unsegmented sign language videos into gloss sequences without frame-level supervision, presenting persistent challenges in temporal alignment, long-range dependency modeling, and reliable sequence-level generalization. While recent advances have achieved strong performance in high-resource languages, Kurdish Sign Language (KrdSL) remains largely unexplored due to the absence of sentence-level benchmarks. To address this gap, we introduce KrdSL-1400, the first continuous Kurdish Sign Language dataset, comprising 1,400 annotated video sequences covering 40 linguistically structured sentences performed by seven native signers, providing a standardized benchmark for low-resource CSLR.

We propose a hybrid spatio-temporal CSLR framework that combines deep convolutional visual encoding with sequence-aware temporal modeling and a multi-task joint CTC–attention decoding strategy explicitly designed to address alignment uncertainty. The CTC objective enforces monotonic alignment, while an entropy-regularized multi-head attention mechanism dynamically emphasizes linguistically salient temporal segments, enabling robust sequence prediction without reliance on pose estimation or handcrafted features.

Training dynamics exhibit stable and consistent convergence, with closely aligned training and validation WER curves indicating strong generalization. Quantitative evaluation shows that a CTC baseline achieves a WER of 13.5%, which is reduced to 10.5% using single-head attention, while the proposed model attains the best performance with a WER of 9.5%, corresponding to an approximate 30% relative improvement. Cross-dataset evaluation on the large-scale PHOENIX-2014-T benchmark further demonstrates generalization, achieving a WER of 13.7% and outperforming recent attention-based and transformer-based CSLR approaches.

Keywords: Continuous sign language recognition, Gloss, deep learning, CTC

Additional Declarations: No competing interests reported.
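The WER figures quoted in the abstract can be related to one another with a short calculation. The sketch below is illustrative only and is not the authors' evaluation code: it assumes WER is the usual gloss-level edit distance divided by the reference length, and the function names and the toy gloss sequences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token (gloss) sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent, assuming the standard definition."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline - improved) / baseline


# Toy example with made-up glosses: one substitution and one deletion.
toy_wer = wer(["A", "B", "C", "D"], ["A", "X", "C"])

# Figures from the abstract: CTC baseline 13.5%, single-head 10.5%, proposed 9.5%.
gain_single_head = relative_reduction(13.5, 10.5)
gain_proposed = relative_reduction(13.5, 9.5)
```

Under these assumptions, going from 13.5% to 9.5% is a relative reduction of about 29.6%, which matches the "approximate 30%" stated in the abstract, while the single-head variant corresponds to roughly 22.2%.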
Editorial history (journal: Signal, Image and Video Processing)
First submitted to journal: 17 Dec, 2025
Submission checks completed at journal: 20 Dec, 2025
Editor assigned by journal: 20 Dec, 2025
Editorial decision: Revision requested, 22 Dec, 2025
Version 1 posted: February 9th, 2026


Source provenance

europepmc
last seen: 2026-05-20T01:45:00.602351+00:00