CAT-Net: A Channel and Self-Attention TCN for Robust Frame-Level Overlapping Speech Detection

This is a preprint and has not been peer reviewed. Data may be preliminary.

Version 1 (latest version), 25 August 2025
Authors: Yassin TERRAF (ORCID 0009-0004-4026-5887) and Youssef Iraqi
DOI: https://doi.org/10.22541/au.175615514.42527021/v1
Published: IEEE Transactions on Audio, Speech and Language Processing
Article usage: 136 views, 111 downloads

Abstract

Detecting overlapping speech is essential for improving the performance of speech processing systems such as speaker identification, diarization, and automatic speech recognition. However, existing methods often fail in noisy and reverberant environments, limiting their real-world applicability. To address this challenge, we propose CAT-Net, a novel lightweight architecture for robust frame-level Overlapping Speech Detection (OSD).
CAT-Net first applies multiscale channel-wise attention, which uses parallel temporal convolutions to dynamically weight frequency bins by importance. The weighted features are fed into a Self-Attention Temporal Convolutional Network (SA-TCN), which models short- and long-term temporal dependencies via dilated convolutions. To overcome the limitation of uniform weighting across the receptive field, a self-attention mechanism is applied after each dilated layer to adaptively reweight temporal features based on contextual importance. A final classification module labels each frame as overlapping or single-speaker speech. As part of this work, we construct two comprehensive single-channel OSD datasets: one derived from the GRID corpus for neutral speech, and another from the RAVDESS corpus for emotional speech. For each dataset, we generate multiple versions simulating clean, noisy, reverberant, and combined noise-reverberation conditions across a wide range of Signal-to-Noise Ratio (SNR) levels and noise types. This enables rigorous evaluation of OSD models under realistic acoustic environments. Experimental results demonstrate that CAT-Net outperforms state-of-the-art methods across all conditions while using significantly fewer parameters, underscoring its effectiveness, efficiency, and suitability for deployment in practical speech processing systems. Furthermore, integrating CAT-Net into a standard speaker diarization system results in consistent improvements across both clean and noisy conditions.

Supplementary Material: cat_net.pdf (1.11 MB)
Version history: Version 1, 25 August 2025
Version of Record: 1 Jan 2026, published in IEEE Transactions on Audio, Speech and Language Processing
Copyright: This work is licensed under a Non Exclusive No Reuse License.
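The noisy dataset versions mentioned in the abstract depend on mixing speech with noise at a chosen SNR. The following is a minimal sketch of that standard scaling step, not the authors' actual data pipeline: the helper names (signal_power, mix_at_snr) and the toy signals are hypothetical, and the page does not state how the authors generated or normalised their mixtures.

```python
import math


def signal_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so the speech-to-noise power ratio is snr_db (dB).

    Hypothetical helper: returns (mixture, gain applied to the noise).
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = signal_power(speech)
    p_noise = signal_power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("both signals must have non-zero power")
    # SNR_dB = 10 * log10(p_speech / (gain**2 * p_noise)), solved for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    mixture = [s + gain * n for s, n in zip(speech, noise)]
    return mixture, gain


# Toy signals, made up for illustration (not taken from GRID or RAVDESS).
speech = [math.sin(0.05 * i) for i in range(2000)]
noise = [((i * 37) % 11 - 5) / 5 for i in range(2000)]

for target_db in (-5, 0, 10, 20):
    mixture, gain = mix_at_snr(speech, noise, target_db)
    scaled_noise = [gain * n for n in noise]
    achieved_db = 10 * math.log10(
        signal_power(speech) / signal_power(scaled_noise)
    )
    print(f"target {target_db:>3} dB -> achieved {achieved_db:.2f} dB")
```

Scaling the noise rather than the speech keeps the speech level fixed across SNR conditions, which is one common convention; a real pipeline might instead rescale the mixture to avoid clipping.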
Keywords: acoustic feature extraction, noise robustness, overlapping speech detection, self-attention, temporal convolutional networks

Citation: Yassin TERRAF, Youssef Iraqi. CAT-Net: A Channel and Self-Attention TCN for Robust Frame-Level Overlapping Speech Detection. Authorea. 25 August 2025. DOI: https://doi.org/10.22541/au.175615514.42527021/v1