Wearable intelligent throat enables natural speech in stroke patients with dysarthria

Luigi Occhipinti, Chenyu Tang, Shuo Gao, Cong Li, Wentian Yi, and 18 more

Research Square preprint. This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-5469584/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted. Version 1 posted.

Abstract

Wearable silent speech systems hold significant potential for restoring communication in patients with speech impairments. However, seamless, coherent speech remains elusive, and clinical efficacy is still unproven. Here, we present an AI-driven intelligent throat (IT) system that integrates throat muscle vibrations and carotid pulse signal sensors with large language model (LLM) processing to enable fluent, emotionally expressive communication.
The system utilizes ultrasensitive textile strain sensors to capture high-quality signals from the neck area and supports token-level processing for real-time, continuous speech decoding, enabling seamless, delay-free communication. In tests with five stroke patients with dysarthria, IT’s LLM agents intelligently corrected token errors and enriched sentence-level emotional and logical coherence, achieving low error rates (4.2% word error rate, 2.9% sentence error rate) and a 55% increase in user satisfaction. This work establishes a portable, intuitive communication platform for patients with dysarthria with the potential to be applied broadly across different neurological conditions and in multi-language support systems.

Subjects: Physical sciences/Mathematics and computing/Computational science; Physical sciences/Engineering/Biomedical engineering

I. Main

Neurological diseases such as stroke, amyotrophic lateral sclerosis (ALS), and Parkinson’s disease frequently result in dysarthria, a severe motor-speech disorder that compromises neuromuscular control over the vocal tract. This impairment drastically restricts effective communication, lowers quality of life, substantially impedes the rehabilitation process, and can even lead to severe psychological issues [1, 2, 3, 4]. Augmentative and alternative communication (AAC) technologies have been developed to address these challenges, including letter-by-letter spelling systems utilizing head or eye tracking [5, 6, 7, 8] and neuroprosthetics powered by brain-computer interface (BCI) devices [9, 10, 11, 12]. While head or eye tracking systems are relatively straightforward to implement, they suffer from slow communication speeds. Neuroprosthetics, while transformative for severe paralysis cases, often rely on invasive, complex recordings and processing of neural signals.
For individuals retaining partial control over laryngeal or facial muscles, a strong need remains for solutions that are more intuitive and portable (SNote 1). A promising solution lies in wearable silent speech devices that capture non-acoustic signals, such as subtle skin vibrations [13, 14, 15, 16, 17] or electrophysiological signals from the speech motor cortex [18, 19, 20, 21]. These technologies offer non-invasiveness, comfort, and portability, with potential for seamless daily integration. Yet, despite their promise, current systems remain in their infancy, achieving reliable, discrete word decoding in healthy users but showing limited success in patient trials [13, 14, 15]. More critically, these systems fall short of delivering truly natural communication—requiring both delay-free expression and consistent contextual coherence, capabilities essential for fully effective and meaningful interactions. To advance wearable silent speech systems for real-world dysarthria patient use, we developed an AI-driven intelligent throat (IT) system that captures extrinsic laryngeal muscle vibrations and carotid pulse signals, integrating silent speech and emotional states analysis in real-time. The system generates personalized, contextually appropriate sentences that accurately reflect patients' intended meaning (Figure 1). It employs ultrasensitive textile strain sensors, fabricated using advanced printing techniques, to ensure comfortable, durable, and high-quality signal acquisition [14, 22]. By analyzing speech signals at the token level (~100ms), our approach outperforms traditional time-window methods, enabling continuous, fluent word and sentence expression in real time. Knowledge distillation further reduces computational latency by 76%, significantly enhancing communication fluidity. 
Large language models (LLMs) serve as intelligent agents, automatically correcting token classification errors and generating personalized, context-aware speech by integrating emotional states and environmental cues. Pre-trained on a dataset from 10 healthy individuals, the system achieved a word error rate (WER) of 4.2% and a sentence error rate (SER) of 2.9% when fine-tuned on data from five dysarthric stroke patients. Additionally, the integration of emotional states and contextual cues further personalizes and enriches the decoded sentences, resulting in a 55% increase in user satisfaction and enabling dysarthria patients to communicate with fluency and naturalness comparable to that of healthy individuals. STable 1 provides a comprehensive comparison between the IT system and state-of-the-art wearable silent speech systems.

II. Results

The intelligent throat system

The IT system consists primarily of hardware (a smart choker embedding textile strain sensors and a wireless readout printed circuit board (PCB)) and software components (machine learning models and LLM agents). Silent speech signals generated in real time by the user’s silent expressions are decoded by a token decoding network and synthesized into an initial sentence by the token synthesis agent (TSA). Simultaneously, pulse signals are collected from the smart choker device and processed by an emotion decoding network to determine the user’s real-time emotional status. The sentence expansion agent (SEA) intelligently expands the TSA-generated sentence, incorporating personalized emotion labels and objective contextual background data to produce a refined, emotionally expressive, and logically coherent sentence that captures the user’s intended meaning (Fig. 1, SVideo 2). Each component of the IT system is elaborated upon in the following sections.

Fig. 2a shows the structure of the strain sensing choker screen-printed on an elastic knitted textile.
The choker features two channels located at the front and side of the neck, designed to monitor the strain applied to the skin by the muscles near the throat and the carotid artery (SFig. 1). The graphene layer printed on the textile forms ordered cracks along the stress concentration areas of the textile lattice to detect subtle skin vibrations [14]. Silver electrodes are connected to the integrated PCB on the choker. A rigid strain isolation layer with high Young's modulus is printed around each channel to reduce crosstalk between the two channels and the variable strains caused by wearing. Due to the difference in Young's modulus between the elastic textile substrate and the strain isolation layer, less than 1% of external strain is transmitted to the interior when wearing the choker, while the internal sensing areas remain soft and elastic (SFig. 2) [22]. For uniaxial stretching from 1-10 Hz, the printed textile-based graphene strain sensor shows good linear behaviour, producing a response over 10% to subtle strains of 0.1%, and maintains a gauge factor (GF) over 100 during high-frequency stretching (Fig. 2b). Furthermore, our previous studies have confirmed the reliability of the printed textile-based strain sensors with high robustness, durability and washability, as well as high levels of comfort, biocompatibility and breathability [14, 22]. To operate the system and enable wireless communication between the IT choker and server, the PCB was designed for bi-channel measurements (i.e., silent speech and carotid pulse signals), enabling simultaneous acquisition of speech and emotional cues. The PCB integrates a low-power Bluetooth module (Fig. 2c) for continuous data transmission while optimizing energy efficiency for extended use. Key components of the PCB include an analog-to-digital converter (ADC) for high-fidelity signal digitization and a microcontroller unit (MCU) that manages data processing and transmission (Fig. 2e, SFig. 4, and SFig. 5). 
Power supply, operational amplifiers, and the reference voltage chip are configured to ensure stable signal amplification, catering to the sensitivity requirements of both strain and pulse sensors.

For the energy management system, a comprehensive power budget analysis reveals that the designed PCB operates with a total power consumption of 76.5 mW (Fig. 2f). The main power-consuming components are the Bluetooth module (29.7 mW) and amplification circuits (31.9 mW). To extend operational time and support portable use, an 1800 mWh battery was incorporated. At 76.5 mW this corresponds to roughly 23.5 hours of runtime (1800 mWh / 76.5 mW), providing sufficient capacity for continuous operation throughout an entire day without recharging.

Token-level speech decoding

Current wearable silent speech systems operate by recognizing discrete words or predefined sentences and lack the ability for continuous, real-time expression analysis typical of the human brain [45]. This limitation arises because these systems rely on fixed time windows (typically 1–3 seconds) for word decoding, requiring users to complete each word within a set interval and pause until the next window to continue [13–21]. Such constraints lead to fragmented expression and an unnatural user experience. To address this, we developed a high-resolution tokenization method for signal segmentation (Fig. 2f), dividing speech signals into fine-grained ~100 ms segments for continuous word label recognition. This granular segmentation ensures that each token accurately corresponds to a specific part of a single word and is labeled accordingly. This setup enables users to speak fluidly without worrying about timing constraints, as the system continuously classifies and aggregates tokens into coherent words and sentences. Our optimization determined that a token length of 144 ms offers the ideal balance: it minimizes boundary confusion from longer tokens while avoiding the increased computational demands associated with shorter tokens.
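To make the tokenization step concrete, the following is a minimal sketch of cutting a signal into 144 ms tokens. It assumes the 1 kHz sampling rate given in Methods, non-overlapping tokens, and that trailing samples which do not fill a whole token are dropped; the function name `tokenize` and these boundary choices are illustrative assumptions, not the authors' released code.

```python
import numpy as np

FS_HZ = 1000     # sampling rate after downsampling, as stated in Methods
TOKEN_MS = 144   # token length reported as optimal


def tokenize(signal, fs_hz=FS_HZ, token_ms=TOKEN_MS):
    """Split a 1-D signal into consecutive, non-overlapping tokens.

    Assumption: trailing samples that do not fill a whole token are dropped.
    Returns an array of shape (n_tokens, samples_per_token).
    """
    x = np.asarray(signal, dtype=float)
    samples_per_token = fs_hz * token_ms // 1000
    n_tokens = len(x) // samples_per_token
    return x[: n_tokens * samples_per_token].reshape(n_tokens, samples_per_token)


# One second of dummy data yields six full 144-sample tokens.
tokens = tokenize(np.arange(1000))
print(tokens.shape)
```

Each row can then be classified independently, which is what removes the fixed word-length window that earlier systems impose on the user.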
While high-resolution tokenization improves fluidity, shorter tokens inherently contain limited context, making them less effective for accurate word decoding. Temporal machine learning models, like recurrent neural networks (RNN) or transformers, could capture contextual dependencies, but their complexity and computational cost render them suboptimal for wearable silent speech systems [23, 24, 25], which prioritize real-time operation. To balance context awareness and computational efficiency, we implemented an explicit context augmentation strategy (Fig. 3a), where each sample consists of N tokens: N-1 preceding tokens provide context, and the current token determines the sample’s label. For initial tokens, any missing preceding tokens are padded with blank tokens to ensure completeness. We found N = 15 tokens to be optimal (Fig. 3c): accuracy rises as N increases from low counts, where context is insufficient, and then declines at higher counts, which we attribute to gradient decay or information loss [26]. This strategy enables the use of efficient one-dimensional convolutional neural networks (1D-CNNs) instead of computationally intensive temporal models for token decoding [27, 28]. Attention maps reveal that signals from preceding regions indeed contribute to token decoding, validating the effectiveness of the explicit context augmentation strategy (SFig. 10).

To further enhance model efficiency and accuracy on patients’ data, we designed the training pipeline shown in Fig. 3b. The model was pre-trained on a larger dataset from healthy individuals and then fine-tuned on the limited patients’ data, leveraging shared signal features to enhance patient-specific decoding. After only 25 repetitions per word in few-shot learning, the model achieved a token classification accuracy of 92.2% (Fig. 3d). In contrast, a model trained from scratch using solely patients’ data could only reach an accuracy of 79.8%.
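The explicit context augmentation described above can be sketched as follows. This is an illustration only: the function name `build_context_samples`, the use of all-zero arrays as blank tokens, and the (tokens, context, samples) output layout are assumptions, since the text does not specify how blank tokens are encoded or how samples are laid out for the 1D-CNN.

```python
import numpy as np


def build_context_samples(tokens, labels, n_context=15):
    """Pair every token with its n_context - 1 predecessors.

    tokens: array of shape (n_tokens, token_len)
    labels: one word label per token
    Returns (samples, labels), where samples has shape
    (n_tokens, n_context, token_len) and the last slot of each sample
    is the current token, whose label the sample inherits.
    Assumption: missing predecessors are padded with all-zero blank tokens.
    """
    tokens = np.asarray(tokens, dtype=float)
    n_tokens, token_len = tokens.shape
    blank = np.zeros((n_context - 1, token_len))
    padded = np.vstack([blank, tokens])
    samples = np.stack([padded[i : i + n_context] for i in range(n_tokens)])
    return samples, np.asarray(labels)


# Tiny example: 4 tokens of 3 samples each, context length 3.
demo_tokens = np.arange(12).reshape(4, 3)
demo_samples, demo_labels = build_context_samples(demo_tokens, [0, 1, 1, 0], n_context=3)
print(demo_samples.shape)
```

Flattening the last two axes gives one contiguous 1-D sequence per sample, which is one plausible way to feed such samples to a 1D-CNN.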
Additionally, we employed response-based knowledge distillation [29] to transfer knowledge from a larger 1D ResNet-101 model to a smaller 1D ResNet-18, reducing computational load by 75.6% while maintaining high accuracy, with only a 0.9% drop from the teacher model, achieving 91.3% (Fig. 3e). Fig. 3f and Fig. 3g display the confusion matrix and UMAP feature visualization for token decoding [30]. Over 90% of the classification errors involved confusion between class 0 (blank tokens) and neighbouring word tokens. As shown in later analyses of the LLM agent's performance, such boundary errors can be effectively corrected during token-to-word synthesis by the token synthesis agent (TSA).

Decoding of emotional states

To enrich sentence coherence by providing emotional context, we decode emotional states from carotid pulse signals. Emotional state recognition can typically be achieved through a variety of methods, including analysis of facial images from cameras, audio speech signals, and various physiological indicators such as heart rate and blood pressure [31, 32, 33]. In line with our objective of creating a highly integrated wearable system, we chose carotid pulse signals as a biomarker for emotional decoding. Using 5-second windows, we segmented patients’ pulse signals into samples to construct a dataset, focusing on three common emotion categories for stroke patients: neutral, relieved, and frustrated (data collection protocol detailed in Methods). Fig. 4a shows the discrete Fourier transform (DFT) distributions for each emotion, highlighting distinct frequency characteristics among these emotional states. Accordingly, we incorporated DFT frequency extraction into the decoding pipeline shown in Fig. 4b, where removal of the DC component, Z-score normalization, and DFT are sequentially applied before feeding the values into a classifier for categorization. Fig. 4c illustrates the performance of different classifiers with and without DFT frequency extraction.
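The preprocessing chain named above (DC removal, Z-score normalization, DFT) can be sketched in a few lines. The function name `pulse_dft_features`, the use of the magnitude spectrum, and the 200 Hz rate (taken from Methods) are assumptions for illustration; the text does not state which DFT representation is passed to the classifier.

```python
import numpy as np


def pulse_dft_features(window, fs_hz=200.0):
    """Return (frequencies, magnitude spectrum) for one pulse window.

    Steps follow the order given in the text: remove the DC component,
    apply Z-score normalization, then take the DFT.
    Assumption: the classifier receives the one-sided magnitude spectrum.
    """
    x = np.asarray(window, dtype=float)
    x = x - x.mean()              # remove DC component
    std = x.std()
    if std > 0:
        x = x / std               # Z-score (mean is already zero)
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs_hz)
    return freqs, spectrum


# Synthetic 5 s window: a 1.2 Hz "pulse" riding on a constant offset.
t = np.arange(0, 5, 1 / 200.0)
freqs, spectrum = pulse_dft_features(3.0 + np.sin(2 * np.pi * 1.2 * t))
peak_hz = freqs[np.argmax(spectrum)]
print(peak_hz)
```

With a 5-second window the frequency resolution is 0.2 Hz, so features in the low-frequency band where pulse energy sits are spaced 0.2 Hz apart.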
The results show a significant improvement in decoding accuracy with DFT. The optimal model was the 1D-CNN with DFT, achieving an accuracy of 83.2%, with its confusion matrix displayed in Fig. 4d. The SHAP values reveal that the emotion decoding model primarily focuses on low-frequency signals in the 0-2 Hz range, which is consistent with the pulse signal range demonstrated by the DFT (SFig. 11). In addition to the silent speech and carotid pulse signals analyzed in this study, various physiological activities generate distinct vibrational signals in the neck area, which can introduce artefacts hindering analysis [34, 35]. Fig. 4e shows the frequency and magnitude distributions of several prominent signals in this region. Our observations revealed that silent speech exhibits a relatively strong magnitude, and in applications with the IT, vibration can propagate transversely from the throat center to the carotid artery, introducing crosstalk in the pulse signal. Due to the considerable frequency overlap between silent speech and pulse signals, digital filters are non-ideal for effective artefacts suppression [36]. While adding reference channels could theoretically help, it does not align with the goal of a highly integrated IT [37]. To address this issue, we employed a stress isolation treatment using a polyurethane acrylate (PUA) layer, as shown in Fig. 2a, to prevent strain crosstalk propagation along the IT. The theoretical basis of this isolation strategy has been thoroughly discussed in our previous study [22]. Fig. 4f compares pulse signals with and without strain isolation treatment when silent speech occurs concurrently (the vowel “a” introduced at 2.5s), demonstrating significant crosstalk resilience in the treated IT. 
LLM agents for sentence synthesis and intelligent expansion

To naturally and coherently synthesize sentences that accurately reflect the patient’s intended expression from the decoded token and emotion labels, we introduced two LLM agents based on the GPT-4o-mini API (Fig. 5a): the token synthesis agent (TSA) and the sentence expansion agent (SEA). The TSA merges token labels directly into words silently expressed by the patient and combines them into sentences (Fig. 5a, left). The SEA, on the other hand, leverages emotion labels and objective information, such as time and weather, to expand these basic sentences into logically coherent, personalized expressions that better capture the patient’s true intent. Through a simple interaction (in this study, two consecutive nods), the IT system enables seamless switching between the direct output and the enriched, expanded sentence.

To optimize the performance of the TSA, we refined the prompt design [38]. First, we optimized the prompt length (Fig. 5b), observing that both WER and SER improved with increasing prompt length up to 400 words and then deteriorated at greater lengths. We attribute this trend to the fact that longer prompts provide clearer synthesis instructions, but overly lengthy prompts dilute the model's focus. Additionally, we compared performance with and without example cases, where the agent was provided with five examples of token label sequences and their corrected word outputs. Including examples significantly improved synthesis accuracy (Fig. 5c). Finally, we evaluated the effect of providing empirical constraints, which specify typical token counts for words of various lengths. Performance improved considerably when constraints were included. Under optimal prompt conditions, TSA achieved its best performance with a WER of 4.2% and an SER of 2.9%.

We also assessed and refined the performance of the SEA.
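For readers unfamiliar with the two metrics reported for the TSA above, the sketch below computes WER and SER under their conventional definitions: WER as word-level edit distance divided by the number of reference words, and SER as the fraction of sentences containing at least one error. The text does not spell out its exact definitions or evaluation sets, so this may differ from how the authors computed their figures; the function names and example sentences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer_and_ser(references, hypotheses):
    """Return (word error rate, sentence error rate) over paired sentences."""
    total_words = sum(len(ref) for ref in references)
    word_errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    wrong_sentences = sum(r != h for r, h in zip(references, hypotheses))
    return word_errors / total_words, wrong_sentences / len(references)


# Hypothetical example: the second sentence loses one word.
refs = [["i", "want", "water"], ["open", "the", "window"]]
hyps = [["i", "want", "water"], ["open", "window"]]
wer, ser = wer_and_ser(refs, hyps)
print(wer, ser)
```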
Patient satisfaction with the expanded sentences was evaluated through a questionnaire (see STable 4 for criteria details). Following Chain-of-Thought (CoT) optimization [39] and the inclusion of patient-provided expansion examples, the expanded sentences scored significantly higher across multiple criteria (Fig. 5f). Contribution analysis revealed that emotion labels made a substantial impact on emotion accuracy, while objective information notably improved fluency, jointly contributing to the overall satisfaction with the expanded sentences compared to the basic word-only output (Fig. 5e). Under optimal prompt conditions, the SEA-generated expanded sentences resulted in a 55% increase in overall patient satisfaction compared to the TSA’s direct output, raising satisfaction from “somewhat satisfied” to “fully satisfied” levels (SFig. 12 and SFig. 13).

In both operating modes, sentences generated by the TSA and SEA agents are sent to an open-source text-to-speech model [44], which synthesizes audio that matches the patient’s natural voice for playback. In real-world applications, the delay between the completion of the user’s silent expression and the sentence playback is approximately 1 second (SNote 2). This low latency effectively supports seamless and natural communication in practical settings.

III. Discussion

In this work, we introduce the IT, an advanced wearable system designed to empower dysarthric stroke patients to communicate with the fluidity, intuitiveness, and expressiveness of natural speech. Comprehensive analysis and user feedback affirm the IT’s high performance in fluency, accuracy, emotional expressiveness, and personalization. This success is rooted in its innovative design: ultrasensitive textile strain sensors capture rich and high-quality vibrational signals from the laryngeal muscles and carotid artery, while high-resolution tokenized segmentation enables users to communicate freely and continuously without expression delays.
Additionally, the integration of LLM agents enables intelligent error correction and contextual adaptation, delivering exceptional decoding accuracy (WER < 5%, SER < 3%) and a 55% increase in user satisfaction. The IT thus sets a new benchmark in wearable silent speech systems, offering a naturalistic, user-centered communication aid.

Future efforts in several key areas will guide the continued development of the IT system. First, expanding its adaptability to a wider range of neurological conditions and demographic groups will make the technology more inclusive. Second, enhancing its linguistic diversity and multilingual support will allow for more personalized communication across language barriers. Finally, miniaturizing the system within an edge computing framework will facilitate seamless integration into real-world settings, boosting usability and accessibility.

Looking ahead, the advantages of the IT extend beyond enhancing everyday communication; they contribute to the holistic health of neurological patients, encompassing both physical and psychological well-being. The regained fluency in communication allows patients to re-engage in social interactions, reducing isolation and the associated risk of depression. Moreover, effective communication facilitates real-time, personalized adjustments by rehabilitation therapists, supporting patients’ recovery from motor impairments like hemiplegia. Together, these capabilities position the IT as a comprehensive tool for restoring independence and improving quality of life for individuals with neurological conditions.

IV. Methods

Materials

TIMREX KS 25 Graphite (particle size of 25 µm) was sourced from Imerys. Stretchable conductive silver ink was obtained from Dycotec Materials Ltd. Ethyl cellulose was purchased from Sigma-Aldrich. Flexible UV Resin Clear was acquired from Photocentric Ltd. The textile substrate, composed of 95% polyester and 5% spandex, was procured from Jelly Fabrics Ltd.
Ink formulation

The graphene ink for screen printing was prepared following a reported method. Briefly, 100 g of graphite powder and 2 g of ethyl cellulose (EC) were mixed in 1 L of isopropyl alcohol (IPA) and stirred at 3000 rpm for 30 minutes. The mixture was then added into a high-pressure homogenizer (PSI-40) at 2000 bar pressure for 50 cycles to obtain a graphene dispersion. The graphene dispersion was centrifuged at 5000g for 30 minutes to remove unexfoliated graphite.

Fabrication of textile strain sensor

The textile substrate was washed with detergent, thoroughly dried, and then treated with UV-ozone for 5 minutes to clean the surface. Screen printing was performed using a 165T polyester silk screen on a semi-automatic printer (Kippax & Sons Ltd.) set with a squeegee angle of 45 degrees, a spacer of 2 mm, a coating speed of 10 mm/s, and a printing speed of 40 mm/s. Graphene ink, silver paste, and PUA were successively printed to form the sensing layer, electrodes, and strain isolation layer, respectively. After printing the PUA, the textile was exposed to UV light for 5 minutes. After each printing pass, the textile was air-dried. Following printing, the sensor was dried at 80 °C overnight. A biaxial strain of approximately 10% was then applied to induce the formation of ordered cracks.

Characterization

Scanning Electron Microscopy (SEM) images were taken with a Magellan 400, after sputtering the textile samples with a 5 nm layer of gold to enhance conductivity. Optical images were captured using an Olympus microscope. Tensile properties of the textile strain sensors were evaluated using a Deben Microtest 200N Tensile Stage and an INSTRON universal testing system. Electrical signals were recorded concurrently with a potentiostat (EmStat4X, PalmSens) and a multiplexer (MUX, PalmSens). Copper tape was crimped onto the contact pads of the samples, supplemented with a small amount of silver paste to improve electrical contact.
Wireless PCB for data readout

A custom wireless PCB was developed for efficient, continuous data acquisition and transmission within the IT system. Powered by a TP4065 lithium charger and a 3.3 V regulator, the PCB ensures stable operation via battery or USB. The STM32G431 microcontroller captures silent speech and carotid pulse signals through two ADC channels, with an OPA2192 operational amplifier for high-precision signal conditioning, amplifying low-level signals and enhancing overall data fidelity. A BLE module (BLE-SER-A-ANT) transmits real-time data via UART, enabling seamless, delay-free communication.

Silent speech data acquisition

We recruited 10 healthy subjects (mean age: 25.3 ± 4.1 years; 6 males, 4 females) and 5 stroke patients with dysarthria (mean age: 43.9 ± 8.3 years; 4 males, 1 female) for silent speech signal collection, in compliance with Ethics Committee approval from the First Affiliated Hospital of Henan University of Chinese Medicine, approval no. 2023HL-142-01. A corpus was developed consisting of 47 Chinese words commonly used by stroke patients in daily communication, along with 20 sentences constructed from these words (see STable 2 and STable 3). For the healthy subject dataset, we collected 100 repetitions per word and 50 repetitions per sentence. For the patient dataset, we gathered 50 repetitions per word and 50 per sentence. The healthy subject data serves as a critical baseline for initial model training, enabling the model to establish foundational patterns in silent speech signals. This pre-training facilitates improved generalization and performance when later fine-tuning the model on the limited data from dysarthric patients, ultimately enhancing decoding accuracy and robustness in patient-specific applications.

The silent speech signals were segmented into tokens at 144 ms intervals. Each token was combined with the preceding 14 tokens to form a sample, allowing the model to incorporate context.
The sample’s label corresponds to the word of the current token. The signals were originally recorded at a sampling rate of 10 kHz and subsequently downsampled to 1 kHz before tokenization. Before neural network analysis, each sample was uniformly preprocessed with detrending and z-score normalization.

Protocol for emotion data collection

Emotional pulse data was collected concurrently with silent speech signals, ensuring synchronized datasets that capture both speech-related and underlying physiological responses. To achieve accurate labeling, each emotion (neutral, relieved, and frustrated) was elicited through a carefully structured protocol involving audio-induced emotional states [40, 41, 42]. The emotions were induced via the International Affective Digitized Sounds (2nd Edition; IADS-2) [43]. The three emotions were chosen as they are the most frequently encountered emotions in dysarthric patients’ daily communication. Labeling was verified through collaboration between the participants and the therapist to ensure the successful and reliable induction of each target emotion. To balance sufficient information within each window and achieve the necessary resolution for emotion detection, pulse signals were segmented into 5-second samples. A 50% window overlap was applied to increase the training set size, enhancing model learning and generalization. The signals were originally recorded at a sampling rate of 10 kHz and subsequently downsampled to 200 Hz before analysis.

Software environment and model training

Signal preprocessing was performed on a MacBook Pro equipped with an M1 Max CPU. Network training was conducted using Python 3.8.13, Miniconda 3, and PyTorch 2.0.1 in a performance-optimized environment. Training acceleration was enabled by CUDA on an NVIDIA A100 GPU. The detailed training parameters for all models can be found in SFig. 8 and SFig. 9.
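The pulse segmentation described in the emotion data protocol above (5-second windows, 50% overlap, 200 Hz after downsampling) can be sketched as below. The function name `sliding_windows` and the choice to discard a final partial window are assumptions made for illustration; the downsampling step itself is not shown.

```python
import numpy as np


def sliding_windows(signal, fs_hz=200, window_s=5.0, overlap=0.5):
    """Cut a 1-D signal into fixed-length, overlapping windows.

    Assumption: a trailing segment shorter than one window is discarded.
    Returns an array of shape (n_windows, window_samples).
    """
    x = np.asarray(signal, dtype=float)
    win = int(round(window_s * fs_hz))
    hop = int(round(win * (1.0 - overlap)))
    if hop <= 0:
        raise ValueError("overlap must be smaller than 1")
    if len(x) < win:
        return np.empty((0, win))
    starts = range(0, len(x) - win + 1, hop)
    return np.stack([x[s : s + win] for s in starts])


# 20 s of dummy data at 200 Hz: 1000-sample windows starting every 500 samples.
windows = sliding_windows(np.arange(4000))
print(windows.shape)
```

With 50% overlap, a recording of a given length yields nearly twice as many samples as non-overlapping windows would, which is the stated reason for using it.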
Declarations

Data availability

The datasets supporting this study will be available from the GitHub repository before publication.

Code availability

The code supporting this study will be available from the GitHub repository before publication.

Acknowledgments

This work was partially supported by the British Council (Grant Contract No. 45371261), the UK Engineering and Physical Sciences Research Council (EPSRC, grants No. EP/K03099X/1, EP/W024284/1) and Haleon through the CAPE partnership contract (University of Cambridge Ref. No. G110480).

References

1. Enderby, P. Disorders of communication. Neurological Rehabilitation 110, 273–281 (2013).
2. Tang, C. et al. A roadmap for the development of human body digital twins. Nature Reviews Electrical Engineering 1, 199–207 (2024).
3. Zinn, S. et al. The effect of poststroke cognitive impairment on rehabilitation process and functional outcome. Archives of Physical Medicine and Rehabilitation 85, 1084–1090 (2004).
4. Teshaboeva, F. Literacy education of speech impaired children as a pedagogical psychological problem. Confrencea 5, 299–302 (2023).
5. Ju, X. et al. A systematic review on voiceless patients’ willingness to adopt high-technology augmentative and alternative communication in intensive care units. Intensive and Critical Care Nursing 63, 102948 (2020).
6. Megalingam, R., Manoharan, S. K., Riju, G. & Mohandas, S. M. NETRAVAAD: Interactive Eye Based Communication System For People With Speech Issues. IEEE Access 12, 69838–69852 (2024).
7. Ezzat, M. et al. Blink-To-Live eye-based communication system for users with speech impairments. Scientific Reports 13, 7961 (2023).
8. Tarek, N. et al. Morse glasses: an IoT communication system based on Morse code for users with speech impairments. Computing 104, 789–808 (2021).
9. Silva, A. B., Littlejohn, K. T., Liu, J. R., Moses, D. A. & Chang, E. F. The speech neuroprosthesis. Nature Reviews Neuroscience 25, 473–492 (2024).
10. Card, N. S. et al. An Accurate and Rapidly Calibrating Speech Neuroprosthesis. New England Journal of Medicine 391, 609–618 (2024).
11. Metzger, S. L. et al. A high-performance neuroprosthesis for speech decoding and avatar control. Nature 620, 1–10 (2023).
12. Willett, F. R. et al. A high-performance speech neuroprosthesis. Nature 620, 1031–1036 (2023).
13. Kim, T. et al. Ultrathin crystalline-silicon-based strain gauges with deep learning algorithms for silent speech interfaces. Nature Communications 13, 5815 (2022).
14. Tang, C. et al. Ultrasensitive textile strain sensors redefine wearable silent speech interfaces with high machine learning efficiency. npj Flexible Electronics 8, 27 (2024).
15. Yang, Q. et al. Mixed-modality speech recognition and interaction using a wearable artificial throat. Nature Machine Intelligence 5, 169–180 (2023).
16. Xu, S. et al. Force-induced ion generation in zwitterionic hydrogels for a sensitive silent-speech sensor. Nature Communications 14, 219 (2023).
17. Che, Z. et al. Speaking without vocal folds using a machine-learning-assisted wearable sensing-actuation system. Nature Communications 15, 1873 (2024).
18. Wand, M. et al. Tackling speaking mode varieties in EMG-based speech recognition. IEEE Transactions on Biomedical Engineering 61, 2515–2526 (2014).
19. Liu, H. et al. An epidermal sEMG tattoo-like patch as a new human–machine interface for patients with loss of voice. Microsystems & Nanoengineering 6, 16 (2020).
20. Wang, Y. et al. All-weather, natural silent speech recognition via machine-learning-assisted tattoo-like electronics. npj Flexible Electronics 5, 20 (2021).
21. Tian, H. et al. Bioinspired dual-channel speech recognition using graphene-based electromyographic and mechanical sensors. Cell Reports Physical Science 3, 101075 (2022).
22. Tang, C. et al. A deep learning-enabled smart garment for accurate and versatile sleep conditions monitoring in daily life. arXiv.org https://arxiv.org/abs/2408.00753 (2024).
23. Sherstinsky, A. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Physica D: Nonlinear Phenomena 404, 132306 (2020).
24. Vaswani, A. et al. Attention is all you need. Advances in Neural Information Processing Systems 6000–6010 (2017).
25. Chen, Z. et al. Long sequence time-series forecasting with deep learning: A survey. Information Fusion 97, 101819 (2023).
26. Bengio, Y. et al. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5, 157–166 (1994).
27. Kiranyaz, S. et al. 1D convolutional neural networks and applications: A survey. Mechanical Systems and Signal Processing 151, 107398 (2021).
28. Tang, W. et al. Rethinking 1D-CNN for time series classification: A stronger baseline. arXiv preprint arXiv:2002.10061 (2020).
29. Hinton, G., Vinyals, O. & Dean, J. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531 (2015).
30. McInnes, L. et al. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018).
31. Yu, Y. et al. Cloud-edge collaborative depression detection using negative emotion recognition and cross-scale facial feature analysis. IEEE Transactions on Industrial Informatics 19, 3088–3098 (2022).
32. Yang, K. et al. Behavioral and physiological signals-based deep multimodal approach for mobile emotion recognition. IEEE Transactions on Affective Computing 14, 1082–1097 (2021).
33. Saganowski, S. et al. Emotion recognition for everyday life using physiological signals from wearables: A systematic literature review. IEEE Transactions on Affective Computing 14, 1876–1897 (2022).
34. Yi, W. et al. Ultrasensitive Textile Strain Sensing Choker for Diverse Healthcare Applications. 2024 IEEE BioSensors Conference (BioSensors) (IEEE, 2024).
35. Yin, J. et al. Motion artefact management for soft bioelectronics. Nature Reviews Bioengineering 2, 541–558 (2024).
36. Selesnick, I. et al. Generalized digital Butterworth filter design. IEEE Transactions on Signal Processing 46, 1688–1694 (1998).
37. Kuo, S. M. & Morgan, D. R. Active noise control: a tutorial review. Proceedings of the IEEE 87, 943–973 (1999).
38. Xie, Y. et al. Defending ChatGPT against jailbreak attack via self-reminders. Nature Machine Intelligence 5, 1486–1496 (2023).
39. Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022).
40. Irie, G. et al. Affective audio-visual words and latent topic driving model for realizing movie affective scene classification. IEEE Transactions on Multimedia 12, 523–535 (2010).
41. Zhang, S. et al. Learning affective features with a hybrid deep model for audio–visual emotion recognition. IEEE Transactions on Circuits and Systems for Video Technology 28, 3030–3043 (2017).
42. Qi, Y. et al. Piezoelectric Touch Sensing and Random-Forest-Based Technique for Emotion Recognition. IEEE Transactions on Computational Social Systems 11, 6296–6307 (2024).
43. Yang, W. et al. Affective auditory stimulus database: An expanded version of the International Affective Digitized Sounds (IADS-E). Behavior Research Methods 50, 1415–1429 (2018).
44. Anastassiou, P. et al. Seed-TTS: A Family of High-Quality Versatile Speech Generation Models. arXiv preprint arXiv:2406.02430 (2024).
45. Hickok, G. & Poeppel, D. The cortical organization of speech processing. Nature Reviews Neuroscience 8, 393–402 (2007).

Additional Declarations

There is NO Competing Interest.

Supplementary Files

SupplementaryVideo1.mp4: Stroke patient with dysarthria attempts to speak
SupplementaryVideo2.mp4: Intelligent Throat: system overview and live demonstration
SINov.16.docx: Supplementary Information
© Research Square 2026 | ISSN 2693-5015 (online) {"props":{"pageProps":{"initialData":{"identity":"rs-5469584","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":399027338,"identity":"3d9c22a8-aa60-470b-aa08-8a6ab1c38128","order_by":0,"name":"Luigi Occhipinti","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAvUlEQVRIiWNgGAWjYBACPgYGxocfKuD8BMJa2BgYmI0lzpCohU2Ct40kLfyLD0hIzrOT121gfviBsS2NCC0SzxIMCrclG247wGYswdiWQ4yWMwYJktsOMG47wGDGwNhWQZyWA7xzDthvO8D+jUgt/D2GDbwNBxK3HeAB2UKUw9iSmSWOJSdvO8xTLJFwjgjv8/MfPv7zQ42d7bbj7Rs/fChLJqyFQSIBymBmICpWQNYcIErZKBgFo2AUjGQAAFgCNTFpaMgHAAAAAElFTkSuQmCC","orcid":"https://orcid.org/0000-0002-9067-2534","institution":"University of Cambridge","correspondingAuthor":true,"prefix":"","firstName":"Luigi","middleName":"","lastName":"Occhipinti","suffix":""},{"id":399027339,"identity":"1d757464-4b9b-46a3-800c-fa83207a920f","order_by":1,"name":"Chenyu Tang","email":"","orcid":"https://orcid.org/0000-0002-6368-5639","institution":"University of
Cambridge","correspondingAuthor":false,"prefix":"","firstName":"Chenyu","middleName":"","lastName":"Tang","suffix":""},{"id":399027340,"identity":"67caab28-bf52-457b-b87a-a9a08b5ae9c8","order_by":2,"name":"Shuo Gao","email":"","orcid":"https://orcid.org/0000-0003-3096-4700","institution":"Beihang University","correspondingAuthor":false,"prefix":"","firstName":"Shuo","middleName":"","lastName":"Gao","suffix":""},{"id":399027341,"identity":"64c3364b-4844-44c0-ab96-a506cc49f871","order_by":3,"name":"Cong Li","email":"","orcid":"https://orcid.org/0009-0008-9011-2474","institution":"School of Instrumentation and Optoelectronic Engineering, Beihang University","correspondingAuthor":false,"prefix":"","firstName":"Cong","middleName":"","lastName":"Li","suffix":""},{"id":399027342,"identity":"f6af9206-596a-46eb-a9d5-0fbca1eecea9","order_by":4,"name":"Wentian Yi","email":"","orcid":"https://orcid.org/0000-0002-4044-3063","institution":"University of Cambridge, Department of Engineering","correspondingAuthor":false,"prefix":"","firstName":"Wentian","middleName":"","lastName":"Yi","suffix":""},{"id":399027343,"identity":"7f2e6e32-97fc-4349-8580-10e34a1f6e50","order_by":5,"name":"Yuxuan Jin","email":"","orcid":"","institution":"University of Cambridge, Cavendish Laboratory","correspondingAuthor":false,"prefix":"","firstName":"Yuxuan","middleName":"","lastName":"Jin","suffix":""},{"id":399027344,"identity":"8e032193-3ad7-4823-bf69-4c197b961c5f","order_by":6,"name":"Xiaoxue Zhai","email":"","orcid":"","institution":"Tsinghua University, Department of Rehabilitation Medicine, Beijing Tsinghua Changgung Hospital","correspondingAuthor":false,"prefix":"","firstName":"Xiaoxue","middleName":"","lastName":"Zhai","suffix":""},{"id":399027345,"identity":"7f167104-bac4-4c05-bd05-9922c6913651","order_by":7,"name":"Sixuan Lei","email":"","orcid":"","institution":"Tsinghua University, Shenzhen International Graduate 
School","correspondingAuthor":false,"prefix":"","firstName":"Sixuan","middleName":"","lastName":"Lei","suffix":""},{"id":399027346,"identity":"19431918-0b79-448b-87a6-588caccb2368","order_by":8,"name":"Hongbei Meng","email":"","orcid":"","institution":"Beihang University","correspondingAuthor":false,"prefix":"","firstName":"Hongbei","middleName":"","lastName":"Meng","suffix":""},{"id":399027347,"identity":"1d6b47e1-f87c-49e9-adf4-3e6d46432dfb","order_by":9,"name":"Zibo Zhang","email":"","orcid":"","institution":"University of Cambridge, Department of Engineering","correspondingAuthor":false,"prefix":"","firstName":"Zibo","middleName":"","lastName":"Zhang","suffix":""},{"id":399027348,"identity":"22d36ed0-e8a5-4037-a257-ff875ed65f83","order_by":10,"name":"Muzi Xu","email":"","orcid":"https://orcid.org/0000-0001-6381-9863","institution":"University of Cambridge, Department of Engineering","correspondingAuthor":false,"prefix":"","firstName":"Muzi","middleName":"","lastName":"Xu","suffix":""},{"id":399027349,"identity":"402039b3-062b-4d7a-bbe7-6f47989c0f54","order_by":11,"name":"Shengbo Wang","email":"","orcid":"https://orcid.org/0000-0003-1212-138X","institution":"School of Instrumentation and Optoelectronic Engineering, Beihang University","correspondingAuthor":false,"prefix":"","firstName":"Shengbo","middleName":"","lastName":"Wang","suffix":""},{"id":399027350,"identity":"ce4b07b0-9464-499f-92d2-5d9ad98fd1db","order_by":12,"name":"Xuhang Chen","email":"","orcid":"https://orcid.org/0009-0003-1757-9303","institution":"University of Cambridge","correspondingAuthor":false,"prefix":"","firstName":"Xuhang","middleName":"","lastName":"Chen","suffix":""},{"id":399027351,"identity":"e0244913-269a-4f24-b3f8-e58986d3fd88","order_by":13,"name":"Chenxi Wang","email":"","orcid":"","institution":"School of Instrumentation and Optoelectronic Engineering, Beihang 
University","correspondingAuthor":false,"prefix":"","firstName":"Chenxi","middleName":"","lastName":"Wang","suffix":""},{"id":399027352,"identity":"3b1c5e63-c717-4ed2-a49f-31e636ce577f","order_by":14,"name":"Hongyun Yang","email":"","orcid":"","institution":"School of Instrumentation and Optoelectronic Engineering, Beihang University","correspondingAuthor":false,"prefix":"","firstName":"Hongyun","middleName":"","lastName":"Yang","suffix":""},{"id":399027353,"identity":"beb02f36-4caf-41d9-a883-089287e5badf","order_by":15,"name":"Ningli Wang","email":"","orcid":"https://orcid.org/0000-0002-8933-4482","institution":"Beijing Tongren Hospital","correspondingAuthor":false,"prefix":"","firstName":"Ningli","middleName":"","lastName":"Wang","suffix":""},{"id":399027354,"identity":"33bcdd81-ce0f-4c0c-b12f-704b1fb10f9e","order_by":16,"name":"Wenyu Wang","email":"","orcid":"","institution":"Hong Kong University of Science and Technology, Thrust of Smart Manufacturing","correspondingAuthor":false,"prefix":"","firstName":"Wenyu","middleName":"","lastName":"Wang","suffix":""},{"id":399027355,"identity":"d1eb829e-78cf-40d5-b3b0-623d27e75a7a","order_by":17,"name":"Jin Cao","email":"","orcid":"","institution":"School of Life Sciences, Beijing University of Chinese Medicine","correspondingAuthor":false,"prefix":"","firstName":"Jin","middleName":"","lastName":"Cao","suffix":""},{"id":399027356,"identity":"fa8eb832-72b4-4986-9d90-28f3729f6476","order_by":18,"name":"Xiaodong Feng","email":"","orcid":"","institution":"Department of Rehabilitation Center, The First Affiliated Hospital of Henan University of Chinese Medicine","correspondingAuthor":false,"prefix":"","firstName":"Xiaodong","middleName":"","lastName":"Feng","suffix":""},{"id":399027357,"identity":"6a2020c0-01e7-43a7-970b-23776a2f1dc4","order_by":19,"name":"Peter Smielewski","email":"","orcid":"https://orcid.org/0000-0001-5096-3938","institution":"University of 
Cambridge","correspondingAuthor":false,"prefix":"","firstName":"Peter","middleName":"","lastName":"Smielewski","suffix":""},{"id":399027358,"identity":"01a9db6b-c07f-4b8c-861e-6695c8111b79","order_by":20,"name":"Yu Pan","email":"","orcid":"","institution":"Tsinghua University, Department of Rehabilitation Medicine, Beijing Tsinghua Changgung Hospital","correspondingAuthor":false,"prefix":"","firstName":"Yu","middleName":"","lastName":"Pan","suffix":""},{"id":399027359,"identity":"e75c2e10-ec07-45a2-8954-422acf4c7abd","order_by":21,"name":"Wenhui Song","email":"","orcid":"https://orcid.org/0000-0001-8406-472X","institution":"University College London","correspondingAuthor":false,"prefix":"","firstName":"Wenhui","middleName":"","lastName":"Song","suffix":""},{"id":399027360,"identity":"bd2b2ba8-340a-437d-9ef2-eaaeca8842c5","order_by":22,"name":"Martin Birchall","email":"","orcid":"","institution":"Royal National Ear Nose and Throat and Eastman Dental Hospitals, University College London Hospital","correspondingAuthor":false,"prefix":"","firstName":"Martin","middleName":"","lastName":"Birchall","suffix":""}],"badges":[],"createdAt":"2024-11-17 11:25:13","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-5469584/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-5469584/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":78883478,"identity":"89f12d7d-008c-4a8a-91c5-afd65de3af56","added_by":"auto","created_at":"2025-03-20 09:09:09","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":431285,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eSchematic of the IT developed for stroke patients with dysarthria.\u003c/strong\u003e The system captures extrinsic laryngeal muscle vibrations and carotid pulse signals via textile strain sensors and transmits them to the server through a wireless module. 
Silent speech signals are processed through a token decoding network, which generates token labels for sentence synthesis. Simultaneously, pulse signals are processed by an emotion decoding network to identify emotional states. The system intelligently integrates both emotional states and contextual objective information (e.g., time, environment) to expand the initial decoded sentences. Through a sentence expansion agent, the decoded output is transformed into personalized, fluent, and emotionally expressive sentences, enabling patients to communicate with a fluency and naturalness comparable to healthy individuals. (Note: Due to grammatical differences between Chinese and English, “We go hospital” is a word-for-word translation of the Chinese expression for “Let's go to the hospital”.)\u003c/p\u003e","description":"","filename":"image1.png","url":"https://assets-eu.researchsquare.com/files/rs-5469584/v1/4d9b647fb9c7f07d6fc4d5de.png"},{"id":78883477,"identity":"bee9108b-962f-4105-9477-9069bc8dfb4f","added_by":"auto","created_at":"2025-03-20 09:09:09","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":756912,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eHardware and data collection of the IT. a,\u003c/strong\u003e Schematic of a textile-based strain-sensing choker. Two channels are aligned with the carotid artery and center of throat, respectively. Each channel consists of a two-terminal crack-based resistive strain sensor surrounded by a polyurethane acrylate (PUA) stress isolation layer. The top right SEM image shows the spontaneous ordered crack structure of the graphene coating. \u003cstrong\u003eb,\u003c/strong\u003e Relationship between the response to uniaxial stretching (from 0.1% to 5%) and frequency. \u003cstrong\u003ec, \u003c/strong\u003eExploded view of the internal components of the PCB. \u003cstrong\u003ed,\u003c/strong\u003e Diagram of the system communication. 
\u003cstrong\u003ee,\u003c/strong\u003e Power consumption of each component during system communication. \u003cstrong\u003ef,\u003c/strong\u003e Schematic of the high-resolution tokenization strategy.\u003c/p\u003e","description":"","filename":"image2.png","url":"https://assets-eu.researchsquare.com/files/rs-5469584/v1/ee0720a8ce194d2b73e581ab.png"},{"id":78884023,"identity":"adaa18fb-ddbf-464b-8f03-40551cbf7fd5","added_by":"auto","created_at":"2025-03-20 09:17:12","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":687940,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eToken-level decoding framework and performance evaluation. a\u003c/strong\u003e, Explicit context augmentation strategy designed to incorporate contextual information by combining tokens into token samples. \u003cstrong\u003eb\u003c/strong\u003e, Model training pipeline: the teacher model is pre-trained on healthy samples, then fine-tuned on patient samples; knowledge distillation transfers learned features to a student model for efficient prediction. \u003cstrong\u003ec\u003c/strong\u003e, Comparison of decoding accuracy across different numbers of tokens per sample, showing optimal performance when sufficient contextual information is included. \u003cstrong\u003ed\u003c/strong\u003e, Accuracy improvement with word repetition in transfer learning process, demonstrating a jump from zero-shot inference (43.3%) to few-shot learning (92.2%) as repetitions increase. e, Comparison of model performance across architectures with varying accuracy, FLOPs, and parameter counts; ResNet-101 and ResNet-18 were selected as the teacher and student models, respectively. \u003cstrong\u003ef\u003c/strong\u003e, Confusion matrix for the final student model. 
\u003cstrong\u003eg\u003c/strong\u003e, UMAP visualization of extracted features from the student model, illustrating token clustering patterns that indicate effective decoding and clear separation of different classes.\u003c/p\u003e","description":"","filename":"image3.png","url":"https://assets-eu.researchsquare.com/files/rs-5469584/v1/e4e9435d9304cf9265ce0c49.png"},{"id":78883530,"identity":"6b9af08f-1d8b-4595-9399-67dab1ead9be","added_by":"auto","created_at":"2025-03-20 09:09:14","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":457505,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eEmotion decoding framework and performance evaluation. a\u003c/strong\u003e, Frequency domain characteristics of carotid pulse signals across three emotional states (Neutral, Relieved, and Frustrated), showing distinct amplitude patterns. \u003cstrong\u003eb\u003c/strong\u003e, Emotion classification workflow: preprocessing pipeline (left) involving DC removal, Z-score normalization, and discrete Fourier transform (DFT), feeding into a classifier based on a 1DCNN architecture (right) for emotion decoding. \u003cstrong\u003ec\u003c/strong\u003e, Comparison of classification accuracies across machine learning algorithms (SVM, LDA, RF, MLP, and 1DCNN) with and without DFT preprocessing, highlighting improved performance with DFT. \u003cstrong\u003ed\u003c/strong\u003e, Confusion matrix for emotion classification. \u003cstrong\u003ee\u003c/strong\u003e, Frequency and magnitude range of different vibrational signal sources (voice, silent speech, breath, carotid pulse) at neck area. 
\u003cstrong\u003ef\u003c/strong\u003e, Time-frequency spectrogram of pulse signals with and without strain isolation treatment when vowel “a” both introduced at 2.5s, demonstrating successful mitigation of speech crosstalk interference after applying the isolation technique.\u003c/p\u003e","description":"","filename":"image4.png","url":"https://assets-eu.researchsquare.com/files/rs-5469584/v1/70ca778ca4f723eea5648785.png"},{"id":78883519,"identity":"0082d9e9-20ab-483e-9f45-6b626c8b9c15","added_by":"auto","created_at":"2025-03-20 09:09:13","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":630627,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eLLM agents framework and performance evaluation. a\u003c/strong\u003e, Schematic of the IT’s LLM agents: Token Synthesis Agent (left) directly synthesizes sentences from neural network token labels, while Sentence Expansion Agent (right) enhances outputs with contextual and emotional inputs. \u003cstrong\u003eb\u003c/strong\u003e, Effect of prompt length on word error rate (WER) and sentence error rate (SER) with optimal performance observed at medium lengths. \u003cstrong\u003ec\u003c/strong\u003e, Influence of example-based few-shot learning on WER and SER, showing a significant reduction when examples are provided. \u003cstrong\u003ed\u003c/strong\u003e, Impact of constrained decoding on WER and SER, demonstrating improved accuracy and sentence structure. \u003cstrong\u003ee\u003c/strong\u003e, Contribution of objective information, word, and emotion labels on key user metrics, including fluency, satisfaction, core meaning, and emotional accuracy (evaluated through ablation experiments). 
\u003cstrong\u003ef\u003c/strong\u003e, Radar plot comparing performance across various configurations (Token-only, Context-aware, Chain-of-Thought (CoT), and CoT with personalized demonstration) on fluency, personalization, core meaning, satisfaction, completeness, and emotion accuracy.\u003c/p\u003e","description":"","filename":"image5.png","url":"https://assets-eu.researchsquare.com/files/rs-5469584/v1/c2953f2520691b54574c05f6.png"},{"id":78885360,"identity":"a4d0cf81-5d04-4dc0-9400-899350c4ed96","added_by":"auto","created_at":"2025-03-20 09:33:16","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":3726378,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-5469584/v1/8e499cc3-74d4-4235-a592-1d527828f267.pdf"},{"id":78884025,"identity":"b59ec77f-0f4b-4487-bbb1-6da5c5d02a7e","added_by":"auto","created_at":"2025-03-20 09:17:13","extension":"mp4","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":16368015,"visible":true,"origin":"","legend":"Stroke patient with dysarthria attempts to speak","description":"","filename":"SupplementaryVideo1.mp4","url":"https://assets-eu.researchsquare.com/files/rs-5469584/v1/77c2d7d58fe356535a1765c8.mp4"},{"id":78883524,"identity":"0bb5e489-5e0f-48ae-b8fe-cd122b4879f1","added_by":"auto","created_at":"2025-03-20 09:09:13","extension":"mp4","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":30893237,"visible":true,"origin":"","legend":"\u003cp\u003eIntelligent Throat: system overview and live demonstration\u003c/p\u003e","description":"","filename":"SupplementaryVideo2.mp4","url":"https://assets-eu.researchsquare.com/files/rs-5469584/v1/071a589fc5dc46bec93bba49.mp4"},{"id":78883481,"identity":"cd3a9e6d-3e01-4182-b880-640b5a87e492","added_by":"auto","created_at":"2025-03-20 
09:09:11","extension":"docx","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":21605592,"visible":true,"origin":"","legend":"\u003cp\u003eSupplementary Information\u003c/p\u003e","description":"","filename":"SINov.16.docx","url":"https://assets-eu.researchsquare.com/files/rs-5469584/v1/a217c9baa365bfeebedff4d5.docx"}],"financialInterests":"There is \u003cb\u003eNO\u003c/b\u003e Competing Interest.","formattedTitle":"Wearable intelligent throat enables natural speech in stroke patients with dysarthria","fulltext":[{"header":"I. Main","content":"\u003cp\u003eNeurological diseases such as stroke, amyotrophic lateral sclerosis (ALS), and Parkinson\u0026rsquo;s disease frequently result in dysarthria\u0026mdash;a severe motor-speech disorder that compromises neuromuscular control over the vocal tract. This impairment drastically restricts effective communication, lowers quality of life, substantially impedes the rehabilitation process, and can even lead to severe psychological issues [1, 2, 3, 4]. Augmentative and alternative communication (AAC) technologies have been developed to address these challenges, including letter-by-letter spelling systems utilizing head or eye tracking [5, 6, 7, 8] and neuroprosthetics powered by brain-computer interface (BCI) devices [9, 10, 11, 12]. While head or eye tracking systems are relatively straightforward to implement, they suffer from slow communication speeds. Neuroprosthetics, while transformative for severe paralysis cases, often rely on invasive, complex recordings and processing of neural signals. 
For individuals retaining partial control over laryngeal or facial muscles, a strong need remains for solutions that are more intuitive and portable (SNote 1).\u003c/p\u003e\n\u003cp\u003eA promising solution lies in wearable silent speech devices that capture non-acoustic signals, such as subtle skin vibrations [13, 14, 15, 16, 17] or electrophysiological signals from the speech motor cortex [18, 19, 20, 21]. These technologies offer non-invasiveness, comfort, and portability, with potential for seamless daily integration. Yet, despite their promise, current systems remain in their infancy, achieving reliable, discrete word decoding in healthy users but showing limited success in patient trials [13, 14, 15]. More critically, these systems fall short of delivering truly natural communication, which requires both delay-free expression and consistent contextual coherence, capabilities essential for fully effective and meaningful interactions.\u003c/p\u003e\n\u003cp\u003eTo advance wearable silent speech systems for real-world use by patients with dysarthria, we developed an AI-driven intelligent throat (IT) system that captures extrinsic laryngeal muscle vibrations and carotid pulse signals, integrating silent speech and emotional-state analysis in real time. The system generates personalized, contextually appropriate sentences that accurately reflect patients\u0026apos; intended meaning (Figure 1). It employs ultrasensitive textile strain sensors, fabricated using advanced printing techniques, to ensure comfortable, durable, and high-quality signal acquisition [14, 22]. By analyzing speech signals at the token level (~100 ms), our approach outperforms traditional time-window methods, enabling continuous, fluent word and sentence expression in real time. Knowledge distillation further reduces computational latency by 76%, significantly enhancing communication fluidity.
Large language models (LLMs) serve as intelligent agents, automatically correcting token classification errors and generating personalized, context-aware speech by integrating emotional states and environmental cues. Pre-trained on a dataset from 10 healthy individuals, the system achieved a word error rate (WER) of 4.2% and a sentence error rate (SER) of 2.9% when fine-tuned on data from five dysarthric stroke patients. Additionally, the integration of emotional states and contextual cues further personalizes and enriches the decoded sentences, resulting in a 55% increase in user satisfaction and enabling dysarthria patients to communicate with fluency and naturalness comparable to that of healthy individuals. STable 1 provides a comprehensive comparison between the IT system and state-of-the-art wearable silent speech systems.\u003c/p\u003e"},{"header":"II. Results","content":"\u003cp\u003e\u003cstrong\u003eThe intelligent throat system\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe IT system consists primarily of hardware (a smart choker embedding textile strain sensors and a wireless readout printed circuit board (PCB)) and software components (machine learning models and LLM agents). Silent speech signals generated in real time by the user’s silent expressions are decoded by a token decoding network and synthesized into an initial sentence by the token synthesis agent (TSA). Simultaneously, pulse signals are collected from the smart choker device and processed by an emotion decoding network to determine the user’s real-time emotional status. The sentence expansion agent (SEA) intelligently expands the TSA-generated sentence, incorporating personalized emotion labels and objective contextual background data to produce a refined, emotionally expressive, and logically coherent sentence that captures the user’s intended meaning (Fig. 1, SVideo 2). 
Each component of the IT system is elaborated upon in the following sections.\u003c/p\u003e\n\u003cp\u003eFig. 2a shows the structure of the strain sensing choker screen-printed on an elastic knitted textile. The choker features two channels located at the front and side of the neck, designed to monitor the strain applied to the skin by the muscles near the throat and the carotid artery (SFig. 1). The graphene layer printed on the textile forms ordered cracks along the stress concentration areas of the textile lattice to detect subtle skin vibrations [14]. Silver electrodes are connected to the integrated PCB on the choker. A rigid strain isolation layer with high Young's modulus is printed around each channel to reduce crosstalk between the two channels and the variable strains caused by wearing. Due to the difference in Young's modulus between the elastic textile substrate and the strain isolation layer, less than 1% of external strain is transmitted to the interior when wearing the choker, while the internal sensing areas remain soft and elastic (SFig. 2) [22]. For uniaxial stretching from 1-10 Hz, the printed textile-based graphene strain sensor shows good linear behaviour, producing a response over 10% to subtle strains of 0.1%, and maintains a gauge factor (GF) over 100 during high-frequency stretching (Fig. 2b). Furthermore, our previous studies have confirmed the reliability of the printed textile-based strain sensors with high robustness, durability and washability, as well as high levels of comfort, biocompatibility and breathability [14, 22].\u003c/p\u003e\n\u003cp\u003eTo operate the system and enable wireless communication between the IT choker and server, the PCB was designed for bi-channel measurements (i.e., silent speech and carotid pulse signals), enabling simultaneous acquisition of speech and emotional cues. The PCB integrates a low-power Bluetooth module (Fig. 
2c) for continuous data transmission while optimizing energy efficiency for extended use. Key components of the PCB include an analog-to-digital converter (ADC) for high-fidelity signal digitization and a microcontroller unit (MCU) that manages data processing and transmission (Fig. 2e, SFig. 4, and SFig. 5). The power supply, operational amplifiers, and the reference voltage chip are configured to ensure stable signal amplification, catering to the sensitivity requirements of both strain and pulse sensors. For the energy management system, a comprehensive power budget analysis reveals that the designed PCB operates with a total power consumption of 76.5 mW (Fig. 2e). The main power-consuming components are the Bluetooth module (29.7 mW) and amplification circuits (31.9 mW). To extend operational time and support portable use, an 1800 mWh battery was incorporated, providing sufficient capacity (about 23.5 hours at 76.5 mW) for continuous operation throughout an entire day without recharging.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eToken-level speech decoding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eCurrent wearable silent speech systems operate by recognizing discrete words or predefined sentences and lack the ability for continuous, real-time expression analysis typical of the human brain [45]. This limitation arises because these systems rely on fixed time windows (typically 1–3 seconds) for word decoding, requiring users to complete each word within a set interval and pause until the next window to continue [13–21]. Such constraints lead to fragmented expression and an unnatural user experience. To address this, we developed a high-resolution tokenization method for signal segmentation (Fig. 2f), dividing speech signals into fine-grained ~100 ms segments for continuous word label recognition. This granular segmentation ensures that each token accurately corresponds to a specific part of a single word and is labeled accordingly.
This setup enables users to speak fluidly without worrying about timing constraints, as the system continuously classifies and aggregates tokens into coherent words and sentences. Our optimization determined that a token length of 144 ms offers the ideal balance: it minimizes boundary confusion from longer tokens while avoiding the increased computational demands associated with shorter tokens.\u003c/p\u003e\n\u003cp\u003eWhile high-resolution tokenization improves fluidity, shorter tokens inherently contain limited context, making them less effective for accurate word decoding. Temporal machine learning models, such as recurrent neural networks (RNNs) or transformers, could capture contextual dependencies, but their complexity and computational cost render them suboptimal for wearable silent speech systems [23, 24, 25], which prioritize real-time operation. To balance context awareness and computational efficiency, we implemented an explicit context augmentation strategy (Fig. 3a), where each sample consists of N tokens: N-1 preceding tokens provide context, and the current token determines the sample’s label. For initial tokens, any missing preceding tokens are padded with blank tokens to ensure completeness. We found N=15 tokens to be optimal (Fig. 3c): accuracy first rises as more tokens are included and then declines, because too few tokens provide insufficient context, whereas too many lead to gradient decay or information loss [26]. This strategy enables the use of efficient one-dimensional convolutional neural networks (1D-CNNs) instead of computationally intensive temporal models for token decoding [27, 28]. Attention maps reveal that signals from preceding regions indeed contribute to token decoding, validating the effectiveness of the explicit context augmentation strategy (SFig. 10).\u003c/p\u003e\n\u003cp\u003eTo further enhance model efficiency and accuracy on patients’ data, we designed the training pipeline shown in Fig. 3b.
The model was pre-trained on a larger dataset from healthy individuals and then fine-tuned on the limited patients’ data, leveraging shared signal features to enhance patient-specific decoding. After only 25 repetitions per word in few-shot learning, the model achieved a token classification accuracy of 92.2% (Fig. 3d). In contrast, a model trained from scratch using solely patients’ data reached an accuracy of only 79.8%. Additionally, we employed response-based knowledge distillation [29] to transfer knowledge from a larger 1D ResNet-101 model to a smaller 1D ResNet-18, reducing computational load by 75.6% while maintaining high accuracy: the distilled model reached 91.3%, only 0.9 percentage points below the teacher model (Fig. 3e). Fig. 3f and Fig. 3g display the confusion matrix and UMAP feature visualization for token decoding [30]. Over 90% of the classification errors involved confusion between class 0 (blank tokens) and neighbouring word tokens. As shown in later analyses of the LLM agent's performance, such boundary errors can be effectively corrected during token-to-word synthesis by the token synthesis agent (TSA).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDecoding of emotional states\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo enrich sentence coherence by providing emotional context, we decode emotional states from carotid pulse signals. Emotional state recognition can typically be achieved through a variety of methods, including analysis of facial images from cameras, audio speech signals, and various physiological indicators such as heart rate and blood pressure [31, 32, 33]. In line with our objective of creating a highly integrated wearable system, we chose carotid pulse signals as a biomarker for emotional decoding. Using 5-second windows, we segmented patients’ pulse signals into samples to construct a dataset, focusing on three common emotion categories for stroke patients: neutral, relieved, and frustrated (data collection protocol detailed in Methods). 
Fig. 4a shows the discrete Fourier transform (DFT) distributions for each emotion, highlighting distinct frequency characteristics among these emotional states. Accordingly, we incorporated DFT frequency extraction into the decoding pipeline shown in Fig. 4b, where removal of the DC component, Z-score normalization, and DFT are sequentially applied before feeding the values into a classifier for categorization. Fig. 4c illustrates the performance of different classifiers with and without DFT frequency extraction. The results show a significant improvement in decoding accuracy with DFT. The optimal model was the 1D-CNN with DFT, achieving an accuracy of 83.2%, with its confusion matrix displayed in Fig. 4d. The SHAP values reveal that the emotion decoding model primarily focuses on low-frequency signals in the 0-2 Hz range, which is consistent with the pulse signal range demonstrated by the DFT (SFig. 11).\u003c/p\u003e\n\u003cp\u003eIn addition to the silent speech and carotid pulse signals analyzed in this study, various physiological activities generate distinct vibrational signals in the neck area, which can introduce artefacts that hinder analysis [34, 35]. Fig. 4e shows the frequency and magnitude distributions of several prominent signals in this region. Our observations revealed that silent speech exhibits a relatively strong magnitude, and in applications with the IT, vibration can propagate transversely from the throat center to the carotid artery, introducing crosstalk in the pulse signal. Due to the considerable frequency overlap between silent speech and pulse signals, digital filters are ill-suited to suppressing these artefacts [36]. While adding reference channels could theoretically help, it does not align with the goal of a highly integrated IT [37]. To address this issue, we employed a stress isolation treatment using a polyurethane acrylate (PUA) layer, as shown in Fig. 
2a, to prevent strain crosstalk propagation along the IT. The theoretical basis of this isolation strategy has been thoroughly discussed in our previous study [22]. Fig. 4f compares pulse signals with and without strain isolation treatment when silent speech occurs concurrently (the vowel “a” introduced at 2.5 s), demonstrating significant crosstalk resilience in the treated IT.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eLLM agents for sentence synthesis and intelligent expansion\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo naturally and coherently synthesize sentences that accurately reflect the patient’s intended expression from the decoded token and emotion labels, we introduced two LLM agents based on the GPT-4o-mini API (Fig. 5a): the token synthesis agent (TSA) and the sentence expansion agent (SEA). The TSA merges token labels directly into words silently expressed by the patient and combines them into sentences (left). The SEA, on the other hand, leverages emotion labels and objective information, such as time and weather, to expand these basic sentences into logically coherent, personalized expressions that better capture the patient’s true intent. Through a simple interaction (in this study, two consecutive nods), the IT system enables seamless switching between the direct output and the enriched, expanded sentence.\u003c/p\u003e\n\u003cp\u003eTo optimize the performance of the TSA, we refined the prompt design [38]. First, we optimized the prompt length (Fig. 5b), observing that both WER and SER improved with increasing prompt length up to 400 words before deteriorating at greater lengths. We attribute this trend to the fact that longer prompts provide clearer synthesis instructions, but overly lengthy prompts dilute the model's focus. Additionally, we compared performance with and without example cases, where the agent was provided with five examples of token label sequences and their corrected word outputs. 
Including examples significantly improved synthesis accuracy (Fig. 5c). Finally, we evaluated the effect of providing empirical constraints, which specify typical token counts for words of various lengths. Performance improved considerably when constraints were included. Under optimal prompt conditions, TSA achieved its best performance with a WER of 4.2% and an SER of 2.9%.\u003c/p\u003e\n\u003cp\u003eWe also assessed and refined the performance of the SEA. Patient satisfaction with the expanded sentences was evaluated through a questionnaire (see STable 4 for criteria details). Following Chain-of-Thought (CoT) optimization [39] and the inclusion of patient-provided expansion examples, the expanded sentences scored significantly higher across multiple criteria (Fig. 5f). Contribution analysis revealed that emotion labels made a substantial impact on emotion accuracy, while objective information notably improved fluency, jointly contributing to the overall satisfaction with the expanded sentences compared to the basic word-only output (Fig. 5e). Under optimal prompt conditions, the SEA-generated expanded sentences resulted in a 55% increase in overall patient satisfaction compared to the TSA’s direct output, raising satisfaction from “somewhat satisfied” to “fully satisfied” levels (SFig. 12 and SFig. 13).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eIn both operating modes, sentences generated by the TSA and SEA agents are sent to an open-source text-to-speech model [44], which synthesizes audio that matches the patient’s natural voice for playback. In real-world applications, the delay between the completion of the user’s silent expression and the sentence playback is approximately 1 second (SNote 2). This low latency effectively supports seamless and natural communication in practical settings.\u003c/p\u003e"},{"header":"III. 
Discussion","content":"\u003cp\u003eIn this work, we introduce the IT, an advanced wearable system designed to empower dysarthric stroke patients to communicate with the fluidity, intuitiveness, and expressiveness of natural speech. Comprehensive analysis and user feedback affirm the IT\u0026rsquo;s high performance in fluency, accuracy, emotional expressiveness, and personalization. This success is rooted in its innovative design: ultrasensitive textile strain sensors capture rich and high-quality vibrational signals from the laryngeal muscles and carotid artery, while high-resolution tokenized segmentation enables users to communicate freely and continuously without expression delays. Additionally, the integration of LLM agents enables intelligent error correction and contextual adaptation, delivering exceptional decoding accuracy (WER\u0026thinsp;\u0026lt;\u0026thinsp;5%, SER\u0026thinsp;\u0026lt;\u0026thinsp;3%) and a 55% increase in user satisfaction. The IT thus sets a new benchmark in wearable silent speech systems, offering a naturalistic, user-centered communication aid.\u003c/p\u003e \u003cp\u003eFuture efforts in several key areas will guide the continued development of the IT system. First, expanding its adaptability to a wider range of neurological conditions and demographic groups will make the technology more inclusive. Second, enhancing its linguistic diversity and multilingual support will allow for more personalized communication across language barriers. Finally, miniaturizing the system within an edge computing framework will facilitate seamless integration into real-world settings, boosting usability and accessibility.\u003c/p\u003e \u003cp\u003eLooking ahead, the advantages of the IT extend beyond enhancing everyday communication; they contribute to the holistic health of neurological patients, encompassing both physical and psychological well-being. 
The regained fluency in communication allows patients to re-engage in social interactions, reducing isolation and the associated risk of depression. Moreover, effective communication facilitates real-time, personalized adjustments by rehabilitation therapists, supporting patients\u0026rsquo; recovery from motor impairments like hemiplegia. Together, these capabilities position the IT as a comprehensive tool for restoring independence and improving quality of life for individuals with neurological conditions.\u003c/p\u003e"},{"header":"IV. Methods","content":"\u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003eMaterials\u003c/h2\u003e \u003cp\u003eTIMREX KS 25 Graphite (particle size of 25 \u0026micro;m) was sourced from IMERYS. Stretchable conductive silver ink was obtained from Dycotec Materials Ltd. Ethyl cellulose was purchased from SIGMA-ALDRICH. Flexible UV Resin Clear was acquired from Photocentric Ltd. The textile substrate, composed of 95% polyester and 5% spandex, was procured from Jelly Fabrics Ltd.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eInk formulation\u003c/h2\u003e \u003cp\u003eThe graphene ink for screen printing was prepared following a reported method. Briefly, 100 g of graphite powder and 2 g of ethyl cellulose (EC) were mixed in 1 L of isopropyl alcohol (IPA) and stirred at 3000 rpm for 30 minutes. The mixture was then processed in a high-pressure homogenizer (PSI-40) at 2000 bar for 50 cycles to obtain a graphene dispersion. The dispersion was then centrifuged at 5000 × g for 30 minutes to remove unexfoliated graphite.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eFabrication of textile strain sensor\u003c/h3\u003e\n\u003cp\u003eThe textile substrate was washed with detergent, thoroughly dried, and then treated with UV-ozone for 5 minutes to clean the surface. Screen printing was performed using a 165T polyester silk screen on a semi-automatic printer (Kippax \u0026amp; Sons Ltd.) 
set with a squeegee angle of 45 degrees, a spacer of 2 mm, a coating speed of 10 mm/s, and a printing speed of 40 mm/s. Graphene ink, silver paste, and PUA were successively printed to form the sensing layer, electrodes, and strain isolation layer, respectively. After printing the PUA, the textile was exposed to UV light for 5 minutes. After each printing pass, the textile was air-dried. Following printing, the sensor was dried at 80 ℃ overnight. A biaxial strain of approximately 10% was then applied to induce the formation of ordered cracks.\u003c/p\u003e\n\u003ch3\u003eCharacterization\u003c/h3\u003e\n\u003cp\u003eScanning Electron Microscopy (SEM) images were taken with a Magellan 400, after sputtering the textile samples with a 5 nm layer of gold to enhance conductivity. Optical images were captured using an Olympus microscope. Tensile properties of the textile strain sensors were evaluated using a Deben Microtest 200N Tensile Stage and an INSTRON universal testing system. Electrical signals were recorded concurrently with a potentiostat (EmStat4X, PalmSens) and a multiplexer (MUX, PalmSens). Copper tape was crimped onto the contact pads of the samples, supplemented with a small amount of silver paste to improve electrical contact.\u003c/p\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eWireless PCB for data readout\u003c/h2\u003e \u003cp\u003eA custom wireless PCB was developed for efficient, continuous data acquisition and transmission within the IT system. Powered by a TP4065 lithium charger and a 3.3V regulator, the PCB ensures stable operation via battery or USB. The STM32G431 microcontroller captures silent speech and carotid pulse signals through two ADC channels, with an OPA2192 operational amplifier for high-precision signal conditioning, amplifying low-level signals and enhancing overall data fidelity. 
A BLE module (BLE-SER-A-ANT), which receives data from the microcontroller via UART, transmits the signals wirelessly in real time, enabling low-latency communication.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eSilent speech data acquisition\u003c/h2\u003e \u003cp\u003eWe recruited 10 healthy subjects (mean age: 25.3\u0026thinsp;\u0026plusmn;\u0026thinsp;4.1 years; 6 males, 4 females) and 5 stroke patients with dysarthria (mean age: 43.9\u0026thinsp;\u0026plusmn;\u0026thinsp;8.3 years; 4 males, 1 female) for silent speech signal collection, in compliance with Ethics Committee approval from the First Affiliated Hospital of Henan University of Chinese Medicine, approval no. 2023HL-142-01. A corpus was developed consisting of 47 Chinese words commonly used by stroke patients in daily communication, along with 20 sentences constructed from these words (see STable 2 and STable 3). For the healthy subject dataset, we collected 100 repetitions per word and 50 repetitions per sentence. For the patient dataset, we gathered 50 repetitions per word and 50 per sentence.\u003c/p\u003e \u003cp\u003eThe healthy subject data serves as a critical baseline for initial model training, enabling the model to establish foundational patterns in silent speech signals. This pre-training facilitates improved generalization and performance when later fine-tuning the model on the limited data from dysarthric patients, ultimately enhancing decoding accuracy and robustness in patient-specific applications. The silent speech signals were segmented into tokens at 144 ms intervals. Each token was combined with the preceding 14 tokens to form a sample (15 tokens, about 2.16 s of signal), allowing the model to incorporate context. The sample\u0026rsquo;s label corresponds to the word of the current token. The signals were originally recorded at a sampling rate of 10 kHz and subsequently downsampled to 1 kHz before tokenization. 
Before neural network analysis, each sample was uniformly preprocessed with detrending and z-score normalization.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003eProtocol for emotion data collection\u003c/h2\u003e \u003cp\u003eEmotional pulse data was collected concurrently with silent speech signals, ensuring synchronized datasets that capture both speech-related and underlying physiological responses. To achieve accurate labeling, each emotion\u0026mdash;neutral, relieved, and frustrated\u0026mdash;was elicited through a carefully structured protocol involving audio-induced emotional states [\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e, \u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e, \u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e]. The emotions were induced via the international affective digitized sounds (2nd Edition; IADS-2) [\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e]. The three emotions were chosen as they are the most frequently encountered emotions in dysarthric patients\u0026rsquo; daily communication. Labeling was verified through collaboration between the participants and the therapist to ensure the successful and reliable induction of each target emotion. To balance sufficient information within each window and achieve the necessary resolution for emotion detection, pulse signals were segmented into 5-second samples. A 50% window overlap was applied to increase the training set size, enhancing model learning and generalization. 
The signals were originally recorded at a sampling rate of 10 kHz and subsequently downsampled to 200 Hz before analysis.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003eSoftware environment and model training\u003c/h2\u003e \u003cp\u003eSignal preprocessing was performed on a MacBook Pro equipped with an M1 Max chip. Network training was conducted using Python 3.8.13, Miniconda 3, and PyTorch 2.0.1 in a performance-optimized environment. Training acceleration was enabled by CUDA on an NVIDIA A100 GPU. The detailed training parameters for all models can be found in SFig. 8 and SFig. 9.\u003c/p\u003e \u003c/div\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eData availability \u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe datasets supporting this study will be available from the GitHub repository before publication.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCode availability \u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe code supporting this study will be available from the GitHub repository before publication.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgments\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis work was partially supported by the British Council (Grant Contract No. 45371261), the UK Engineering and Physical Sciences Research Council (EPSRC, grant Nos. EP/K03099X/1, EP/W024284/1) and Haleon through the CAPE partnership contract (University of Cambridge Ref. No. G110480).\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eEnderby, P. Disorders of communication. Neurological Rehabilitation 110, 273\u0026ndash;281 (2013).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTang, C. \u003cem\u003eet al.\u003c/em\u003e A roadmap for the development of human body digital twins. 
Nature Reviews Electrical Engineering 1, 199\u0026ndash;207 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZinn, S. \u003cem\u003eet al.\u003c/em\u003e The effect of poststroke cognitive impairment on rehabilitation process and functional outcome. Archives of Physical Medicine and Rehabilitation 85, 1084\u0026ndash;1090 (2004).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTeshaboeva, F. Literacy education of speech impaired children as a pedagogical psychological problem. \u003cem\u003eConfrencea\u003c/em\u003e 5, 299\u0026ndash;302 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJu, X. \u003cem\u003eet al.\u003c/em\u003e A systematic review on voiceless patients\u0026rsquo; willingness to adopt high-technology augmentative and alternative communication in intensive care units. Intensive and Critical Care Nursing 63, 102948 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMegalingam, R., Manoharan, S. K., Riju, G. \u0026amp; Mohandas, S. M. NETRAVAAD: Interactive Eye Based Communication System For People With Speech Issues. IEEE Access 12, 69838\u0026ndash;69852 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEzzat, M. \u003cem\u003eet al.\u003c/em\u003e Blink-To-Live eye-based communication system for users with speech impairments. Scientific Reports 13, 7961 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTarek, N. \u003cem\u003eet al.\u003c/em\u003e Morse glasses: an IoT communication system based on Morse code for users with speech impairments. Computing 104, 789\u0026ndash;808 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSilva, A. B., Littlejohn, K. T., Liu, J. R., Moses, D. A. \u0026amp; Chang, E. F. The speech neuroprosthesis. 
Nature Reviews Neuroscience 25, 473\u0026ndash;492 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCard, N. S. et al. An Accurate and Rapidly Calibrating Speech Neuroprosthesis. New England Journal of Medicine 391, 609\u0026ndash;618 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMetzger, S. L. \u003cem\u003eet al.\u003c/em\u003e A high-performance neuroprosthesis for speech decoding and avatar control. Nature 620, 1\u0026ndash;10 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWillett, F. R. \u003cem\u003eet al.\u003c/em\u003e A high-performance speech neuroprosthesis. Nature 620, 1031\u0026ndash;1036 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKim, T. \u003cem\u003eet al.\u003c/em\u003e Ultrathin crystalline-silicon-based strain gauges with deep learning algorithms for silent speech interfaces. Nature Communications 13, 5815 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTang, C. \u003cem\u003eet al.\u003c/em\u003e Ultrasensitive textile strain sensors redefine wearable silent speech interfaces with high machine learning efficiency. npj Flexible Electronics 8, 27 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYang, Q. \u003cem\u003eet al.\u003c/em\u003e Mixed-modality speech recognition and interaction using a wearable artificial throat. Nature Machine Intelligence 5, 169\u0026ndash;180 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXu, S. \u003cem\u003eet al.\u003c/em\u003e Force-induced ion generation in zwitterionic hydrogels for a sensitive silent-speech sensor. Nature Communications 14, 219 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChe, Z. \u003cem\u003eet al.\u003c/em\u003e Speaking without vocal folds using a machine-learning-assisted wearable sensing-actuation system. 
Nature Communications 15, 1873 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWand, M. \u003cem\u003eet al.\u003c/em\u003e Tackling speaking mode varieties in EMG-based speech recognition. IEEE Transactions on Biomedical Engineering 61, 2515\u0026ndash;2526 (2014).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu, H. \u003cem\u003eet al.\u003c/em\u003e An epidermal sEMG tattoo-like patch as a new human\u0026ndash;machine interface for patients with loss of voice. Microsystems \u0026amp; Nanoengineering 6, 16 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang, Y. \u003cem\u003eet al.\u003c/em\u003e All-weather, natural silent speech recognition via machine-learning-assisted tattoo-like electronics. npj Flexible Electronics 5, 20 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTian, H. \u003cem\u003eet al.\u003c/em\u003e Bioinspired dual-channel speech recognition using graphene-based electromyographic and mechanical sensors. Cell Reports Physical Science 3, 101075 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTang, C. \u003cem\u003eet al.\u003c/em\u003e A deep learning-enabled smart garment for accurate and versatile sleep conditions monitoring in daily life. \u003cem\u003earXiv.org\u003c/em\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://arxiv.org/abs/2408.00753\u003c/span\u003e\u003cspan address=\"https://arxiv.org/abs/2408.00753\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSherstinsky, A. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Physica D: Nonlinear Phenomena 404, 132306 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVaswani, A. \u003cem\u003eet al\u003c/em\u003e. Attention is all you need. 
Advances in Neural Information Processing Systems 6000\u0026ndash;6010 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen, Z., \u003cem\u003eet al\u003c/em\u003e. Long sequence time-series forecasting with deep learning: A survey. Information Fusion 97, 101819 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBengio, Y., \u003cem\u003eet al\u003c/em\u003e. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5, 157\u0026ndash;166 (1994).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKiranyaz, S., \u003cem\u003eet al.\u003c/em\u003e 1D convolutional neural networks and applications: A survey. Mechanical Systems and Signal Processing 151, 107398 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTang, W., \u003cem\u003eet al.\u003c/em\u003e Rethinking 1D-CNN for time series classification: A stronger baseline. \u003cem\u003earXiv preprint arXiv:2002.10061\u003c/em\u003e (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHinton, G. Distilling the Knowledge in a Neural Network. \u003cem\u003earXiv preprint arXiv:1503.02531\u003c/em\u003e (2015).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMcInnes, L., \u003cem\u003eet al.\u003c/em\u003e UMAP: Uniform manifold approximation and projection for dimension reduction. \u003cem\u003earXiv preprint arXiv:1802.03426\u003c/em\u003e (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYu, Y., \u003cem\u003eet al.\u003c/em\u003e Cloud-edge collaborative depression detection using negative emotion recognition and cross-scale facial feature analysis. 
IEEE Transactions on Industrial Informatics 19, 3088\u0026ndash;3098 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYang, K., \u003cem\u003eet al.\u003c/em\u003e Behavioral and physiological signals-based deep multimodal approach for mobile emotion recognition. IEEE Transactions on Affective Computing 14, 1082\u0026ndash;1097 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSaganowski, S., \u003cem\u003eet al.\u003c/em\u003e Emotion recognition for everyday life using physiological signals from wearables: A systematic literature review. IEEE Transactions on Affective Computing 14, 1876\u0026ndash;1897 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYi, W., \u003cem\u003eet al.\u003c/em\u003e Ultrasensitive Textile Strain Sensing Choker for Diverse Healthcare Applications. \u003cem\u003e2024 IEEE BioSensors Conference (BioSensors)\u003c/em\u003e (IEEE, 2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYin, J., \u003cem\u003eet al.\u003c/em\u003e Motion artefact management for soft bioelectronics. Nature Reviews Bioengineering 2, 541\u0026ndash;558 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSelesnick, I., \u003cem\u003eet al.\u003c/em\u003e Generalized digital Butterworth filter design. IEEE Transactions on Signal Processing 46, 1688\u0026ndash;1694 (1998).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKuo, S. M. \u0026amp; Morgan, D. R. Active noise control: a tutorial review. \u003cem\u003eProceedings of the IEEE\u003c/em\u003e 87, 943\u0026ndash;973 (1999).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXie, Y., \u003cem\u003eet al.\u003c/em\u003e Defending ChatGPT against jailbreak attack via self-reminders. 
Nature Machine Intelligence 5, 1486\u0026ndash;1496 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWei, J., \u003cem\u003eet al.\u003c/em\u003e Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, 24824\u0026ndash;24837 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eIrie, G., \u003cem\u003eet al.\u003c/em\u003e Affective audio-visual words and latent topic driving model for realizing movie affective scene classification. IEEE Transactions on Multimedia 12, 523\u0026ndash;535 (2010).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang, S., \u003cem\u003eet al.\u003c/em\u003e Learning affective features with a hybrid deep model for audio\u0026ndash;visual emotion recognition. IEEE transactions on circuits and systems for video technology 28, 3030\u0026ndash;3043 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eQi, Y. \u003cem\u003eet al.\u003c/em\u003e, Piezoelectric Touch Sensing and Random-Forest-Based Technique for Emotion Recognition. IEEE Transactions on Computational Social Systems 11, 6296\u0026ndash;6307 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYang, W., \u003cem\u003eet al.\u003c/em\u003e Affective auditory stimulus database: An expanded version of the International Affective Digitized Sounds (IADS-E). Behavior Research Methods 50, 1415\u0026ndash;1429 (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAnastassiou, P., \u003cem\u003eet al\u003c/em\u003e. Seed-TTS: A Family of High-Quality Versatile Speech Generation Models. \u003cem\u003earXiv preprint arXiv:2406.02430\u003c/em\u003e (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHickok, G., and Poeppel, D. The cortical organization of speech processing. 
Nature Reviews Neuroscience 8, 393\u0026ndash;402 (2007).\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"lastPublishedDoi":"10.21203/rs.3.rs-5469584/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-5469584/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptTitle":"Wearable intelligent throat enables natural speech in stroke patients with dysarthria","postedDate":"March 20th, 2025","status":"posted","subjectAreas":[{"id":42515666,"name":"Physical sciences/Mathematics and computing/Computational science"},{"id":42515667,"name":"Physical sciences/Engineering/Biomedical engineering"}]},"version":"v1","identity":"rs-5469584"}}}
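The token segmentation and explicit context augmentation described under "Silent speech data acquisition" (144 ms tokens at 1 kHz, each combined with the preceding 14 tokens, blank padding for the first tokens, then detrending and z-score normalization per sample) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the single-channel input, the all-zero blank token, and the linear least-squares detrend are assumptions made for the example, and the standard library is used only to keep it self-contained (an array library would normally be preferred).

```python
from statistics import fmean, pstdev

SAMPLE_RATE_HZ = 1000                            # rate after downsampling (from Methods)
TOKEN_MS = 144                                   # token length (from Methods)
CONTEXT_TOKENS = 15                              # N = 14 preceding tokens + current token
TOKEN_LEN = SAMPLE_RATE_HZ * TOKEN_MS // 1000    # 144 samples per token


def split_into_tokens(signal):
    """Cut a 1-D signal into consecutive, non-overlapping tokens."""
    n_tokens = len(signal) // TOKEN_LEN
    return [signal[i * TOKEN_LEN:(i + 1) * TOKEN_LEN] for i in range(n_tokens)]


def detrend(values):
    """Subtract the least-squares straight line (assumed detrending method)."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = fmean(values)
    denom = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values)) / denom
    return [v - (v_mean + slope * (t - t_mean)) for t, v in enumerate(values)]


def zscore(values):
    """Scale to zero mean and unit standard deviation."""
    mean, std = fmean(values), pstdev(values)
    if std == 0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


def build_samples(signal, token_labels):
    """Return (sample, label) pairs; the label is that of the current token."""
    tokens = split_into_tokens(signal)
    blank = [0.0] * TOKEN_LEN                    # assumed shape of a blank token
    samples = []
    for i, label in enumerate(token_labels[:len(tokens)]):
        context = tokens[max(0, i - CONTEXT_TOKENS + 1):i + 1]
        padding = [blank] * (CONTEXT_TOKENS - len(context))
        flat = [v for tok in padding + context for v in tok]
        samples.append((zscore(detrend(flat)), label))
    return samples
```

Each sample then holds 15 × 144 = 2160 values, which is the fixed-length input a 1D-CNN classifier of the kind described in the paper would consume.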

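The emotion-decoding front end (5-second windows with 50% overlap on the 200 Hz pulse signal, followed by DC removal, z-score normalization, and a DFT) can be sketched in the same spirit. This is an illustrative sketch under stated assumptions, not the published pipeline: the function names are hypothetical, the `max_hz` cutoff is an assumption added to keep the naive DFT cheap (the paper does not state which frequency bins are passed to the classifier), and a real implementation would normally use an FFT routine.

```python
import cmath
import math
from statistics import fmean, pstdev

PULSE_RATE_HZ = 200                         # rate after downsampling (from Methods)
WINDOW_S = 5                                # window length in seconds (from Methods)
OVERLAP = 0.5                               # 50% window overlap (from Methods)
WINDOW_LEN = PULSE_RATE_HZ * WINDOW_S       # 1000 samples per window
HOP = int(WINDOW_LEN * (1 - OVERLAP))       # 500 samples between window starts


def sliding_windows(signal):
    """Split a pulse recording into overlapping 5 s windows."""
    last_start = len(signal) - WINDOW_LEN
    return [signal[s:s + WINDOW_LEN] for s in range(0, last_start + 1, HOP)]


def dft_features(window, max_hz=10.0):
    """DC removal -> z-score -> DFT magnitudes for bins up to max_hz (assumed cutoff)."""
    n = len(window)
    mean = fmean(window)
    centred = [v - mean for v in window]            # remove the DC component
    std = pstdev(centred)
    normed = [v / std for v in centred] if std > 0 else centred
    n_bins = int(max_hz * n / PULSE_RATE_HZ) + 1    # bin k sits at k * rate / n Hz
    mags = []
    for k in range(n_bins):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(normed))
        mags.append(abs(acc) / n)
    return mags
```

With a 5 s window the frequency resolution is 0.2 Hz, so the 0 to 2 Hz band that the SHAP analysis highlights spans the first eleven bins of the returned list.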