Brain dynamics of speech modes encoding: Loud and Whispered speech versus Standard speech

doi:10.21203/rs.3.rs-4977028/v1


2024 · doi:10.21203/rs.3.rs-4977028/v1

preprint OA: gold CC-BY-4.0


Research Article

Bryan Sanders, Monica Lancheros, Marion Bourqui, Marina Laganaro

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-4977028/v1

This work is licensed under a CC BY 4.0 License.

Status: Published. Journal publication published 15 Feb, 2025 in Brain Topography. Preprint version 1.

Abstract

Loud speech and whispered speech are two distinct speech modes that are part of daily verbal exchanges, but that involve a different employment of the speech apparatus. However, a clear account of whether and when the motor speech (or phonetic) encoding of these speech modes differs from standard speech has not been provided yet. 
Here, we addressed this question using Electroencephalography (EEG)/Event related potential (ERP) approaches during a delayed production task to contrast the production of speech sequences (pseudowords) when speaking normally or under a specific speech mode: loud speech in experiment 1 and whispered speech in experiment 2. Behavioral results demonstrated that non-standard speech modes entail a behavioral encoding cost in terms of production latency. Standard speech and speech modes’ ERPs were characterized by the same sequence of microstate maps, suggesting that the same brain processes are involved in producing speech under a specific speech mode. Only loud speech entailed electrophysiological modulations relative to standard speech, in terms of waveform amplitudes but also of the temporal distribution and strength of neural recruitment of the same sequence of microstates, in a large time window (from approximately −220 ms to −100 ms) preceding the vocal onset. In contrast, the electrophysiological activity of whispered speech was similar in nature to that of standard speech. On the whole, speech modes and standard speech seem to be encoded through the same brain processes, but the degree of adjustment required seems to vary across speech modes.

Keywords: Amplitudes, Electroencephalography (EEG), Event related potential (ERP), Microstates, Motor speech control

Introduction

Speech production is a complex cognitive-motor ability which allows humans to transform an abstract linguistic code into the precise motor commands needed to produce an utterance. However, multiple intrinsic and extrinsic factors can interfere with the transmission of the message between a speaker and a listener. Therefore, speakers modulate their speech production to overcome these transmission issues by, for instance, whispering, speaking louder or speaking more clearly. 
In the literature, these modulations have been referred to as “speech modes”, “speech styles” or “speaking styles”. They are defined as specific variations of standard speech (SS), which refers to speech produced with normal vocal effort (Kelly & Hansen, 2021; Tuomainen et al., 2022). Variations of SS (herein, speech modes) are used in daily conversations but have been surprisingly overlooked in the speech motor control literature. As a consequence, very little is known concerning the brain mechanisms that allow speakers to modulate their utterances. In this regard, one may wonder if speaking in a non-standard speech mode involves a motor speech preparation cost that reflects specific encoding processes. In the present study, we exploit the high temporal resolution provided by the electroencephalography (EEG) brain imaging technique to investigate the encoding processes underlying the production of speech sequences under different modes. In the following sections, we first outline the issues underlying the scientific characterization of the motor speech encoding stage. We then present the current knowledge about our target speech modes, whispered and loud speech, before hypothesizing how they may be encoded at the motor level.

Motor speech (phonetic) encoding

Speakers can produce intelligible and accurate utterances almost automatically with a low error rate. Despite decades of investigation, a clear account of the interaction between the neural processes underlying speech production and their dynamics is still needed (Bohland et al., 2010; Laganaro, 2019; Miller & Guenther, 2021; Verwoert et al., 2022). Phonetic encoding (hereafter, motor speech encoding processes) is the label given by some authors in the literature to the process of transforming an abstract linguistic sequence into a motor code readable by the articulators (W. J. Levelt, 1989; W. J. M. Levelt et al., 1999; Indefrey, 2011; Guenther, 2016). 
This encoding stage has been less studied than other language encoding processes, resulting in a poor understanding of the underlying spatio-temporal dynamics (Indefrey, 2011). The four level (FL) model from Van der Merwe (2021) proposed that motor speech encoding could be subdivided into two sequential substages: motor planning (i.e., retrieval of motor plans) and motor programming (i.e., where spatiotemporal and force dimensions are specified). This subdivision has been motivated by clinical symptoms of motor speech disorders such as apraxia of speech and dysarthria. To the best of our knowledge, the FL model is the only speech production model to provide some input on the encoding of speech modes. The model states that all the different speech modes can be grouped as a single entity and would thus be encoded in the same way. Here, speech modes would be encoded through tuning of the unique suprasegmental features during the motor programming stage. Nevertheless, there is no empirical evidence corroborating this proposition. Studying the neural processes underlying speech modes as compared to standard speech can thus provide a relevant way of investigating whether speech modes require adjustments at specific encoding stages or whether they are underpinned by different encoding processes, as further presented below.

Speech modes

Speech modes constitute an omnipresent part of verbal exchanges. The way people speak is continuously influenced by intrinsic factors (speaker related) and extrinsic factors (environment or listener related) (Kelly & Hansen, 2021; Smiljanić & Bradlow, 2009; Whitfield et al., 2021). For instance, a speaker needs to modulate their speech production to tell a friend what food they would like to order in a stadium full of supporters who are loudly singing an anthem. 
Moreover, meaningful cues (e.g., linguistic, affective or social cues) are conveyed to the interlocutor through the modulation of speech (Perkell, 2012; Tourville & Guenther, 2011). As an example, a speaker giving a talk at a conference will adopt a clear speech mode to emphasize the take-home message. Zhang and Hansen (2007) proposed five speech modes with unique articulatory and phonatory features: whispered speech, soft speech, neutral (equivalent to standard) speech, loud speech and shouted speech. In this regard, it has been hypothesized that each speech mode involves its own mechanisms resulting in specific articulatory and phonatory patterns (Zhang & Hansen, 2007; Scott, 2022). We will briefly describe the two speech modes investigated in the present study, namely loud speech (LS) and whispered speech (WS).

Loud Speech (LS)

When speakers struggle to convey a message to an interlocutor, for instance in a noisy environment, they usually modify their speech by increasing their vocal effort. In this case, the increase in loudness leads to phonatory adjustments and changes in speech kinematics (Dromey & Ramig, 1998; Huber & Chandrasekaran, 2006; Whitfield et al., 2021). These manifestations associated with increased Sound Intensity Level (SIL) pertain to a specific speech mode labeled “LS”. Intuitively, one would conceptualize LS as the best speech mode for making oneself heard. However, the results obtained by Whitfield et al. (2021) indicate that the main characteristic of LS is increased vocal intensity, without necessarily any improvement in the articulatory distinctiveness of the message conveyed. In the literature, LS has usually been associated with an increase of 10 dB ± 4 dB over the standard SIL (Huber & Chandrasekaran, 2006; Whitfield et al., 2021). 
The encoding processes responsible for the increase in vocal loudness have not been clarified by functional neuroimaging or computational models, but some hypotheses on the proposed mechanisms will be presented below.

Whispered Speech (WS)

WS is a widespread mode of communication that aims at conveying a message while remaining discreet. This speech mode is convenient in situations requiring silence (e.g., movie, theatre) or to keep the content of a message private (e.g., telling a secret). The ability to whisper is specific to humans (Tsunoda et al., 2011) and is characterized by reduced intelligibility and perceptibility for the listener as well as a more effortful production from the speaker’s point of view (Zhang et al., 2018). During whispered speech, physiological adjustments are applied to specific muscles of the larynx in order to prevent vocal fold vibration (Konnai et al., 2017; Solomon et al., 1989; Tsunoda et al., 2011). This absence of phonation gives WS unique features. In fact, the phonetic features of WS make it the speech mode most distinct from SS (Kelly & Hansen, 2021; Zhang et al., 2018; Zhang & Hansen, 2007). As for LS, no consensus has been reached in the literature regarding the encoding processes responsible for whispering, leading to several proposed hypotheses.

Encoding of speech modes

As anticipated above, several hypotheses stem from the literature regarding the encoding processes associated with the production of loud and whispered utterances. On the one hand, behavioral results (e.g., Huber & Chandrasekaran, 2006) have led Whitfield et al. (2021) to hypothesize that an upregulation in the neuromotor drive is the mechanism at the origin of LS. However, it is unclear when and how this upregulation occurs in the motor speech encoding stage. 
On the other hand, two hypotheses were formulated based on functional Magnetic Resonance Imaging (fMRI) investigations in order to characterize the brain processes underlying WS. Correia et al. (2020) demonstrated that the fMRI response was greater for voiced speech than WS in the dorsal laryngeal motor area (dLMA), located in the primary motor cortex (M1). Under this hypothesis, the same brain mechanisms are at play for SS and WS, with larger recruitment for the former. A different hypothesis has been proposed by Tsunoda et al. (2011) based on a voluntary switching mechanism, in which ordinary speech would be transformed into whispering thanks to functional changes in the frontal lobe. However, their results showed two distinct patterns of brain activation in the frontal lobe involving both increased and decreased brain activation for WS relative to SS, which thus did not clarify how the functional switch would be carried out. In summary, the whispered speech literature offers two distinct accounts of the encoding of WS: one describes a functional difference in a specific motor region responsible for laryngeal control, while the other suggests the involvement of a voluntary functional switching mechanism during the production of whispered utterances. In light of these theoretical propositions, speech modes could be encoded either (1) through neural adjustment of the same brain processes in the motor programming substage, as proposed in the FL model, or (2) through the involvement of an additional mechanism overlaid on regular motor programming encoding processes. This study explores these two hypotheses using behavioral and electrophysiological contrasts between speech modes and normally phonated speech. Specifically, LS (Experiment 1) and WS (Experiment 2) were compared to SS during a delayed production task of non-sense speech sequences (pseudowords). 
This paradigm is ideal to isolate motor speech encoding processes from linguistic encoding processes (Laganaro, 2019, 2023; Piai et al., 2014). Electroencephalography (EEG)/Event-related potential (ERP) correlates of speech modes and SS will be analyzed during a time-window of about 350 ms preceding the vocal onset (hereafter referred to as “response-locked”) corresponding to the motor speech encoding stage. This time window is thus aligned to the vocal onset and analyzed in a backward fashion. In the present study, we exploit the high temporal resolution provided by EEG to match the fast time scale of speech production processes (den Hollander et al., 2019; Laganaro & Perret, 2011; Piai et al., 2015; Verwoert et al., 2022). Specifically, we track the temporal dynamics of brain activations in the different experimental conditions via Microstate analysis (Michel et al., 2009; Michel & Murray, 2012; Murray et al., 2008), which allows us to investigate whether the encoding of speech modes elicits different brain processes relative to SS or whether the same brain processes are engaged but with different dynamics.

Experiment 1 - loud speech

Method

Population

Thirty native French speakers aged 20 to 31 years participated in the experiment. They were all right-handed [Average laterality quotient index = 88.33, range = 60–100] according to the Edinburgh Handedness Inventory (Oldfield, 1971). None of them had any neurological or motor impairment. Furthermore, participants had normal or corrected-to-normal vision. They all agreed to participate and signed the consent form approved by the local ethics committee. They received a small financial compensation for their participation. Six participants were removed due to low production accuracy (i.e., below 75%), an excessively noisy EEG signal or being identified as an outlier in the Ragu Software (Koenig et al., 2011b). 
As a result, 24 participants (Mean (M) = 23.25 years old, Standard Deviation (SD) = 3.3 years, 5 males) were retained for the analyses.

Material

The speech stimuli to be produced consisted of 67 monosyllabic and disyllabic pseudowords (see more details in Appendix A). Pseudowords were selected to avoid any linguistic effect related to words and thus focus on speech production. The pseudowords were composed of phonotactically legal French syllables according to the French database Lexique2 (New et al., 2004). All the items had the following syllabic structures: C1C2V1–C3V2 for the disyllabic items (e.g., trafa) and C1C2V1 for the monosyllabic items (e.g., pra), with C1 being one of the three following voiceless plosives: /p/, /t/ or /k/.

Procedure

The experiment took place in a soundproof room in which participants were seated about 70 cm from the computer screen. The software E-Prime 3.0 (Psychology Software Tools, Pittsburgh, PA) was used to present the stimuli in several experimental blocks and to record participants’ productions. Participants performed a delayed production task (see Fig. 1), in which they were asked to prepare a speech sequence based on a written pseudoword and to produce it aloud when a cue (here a question mark) appeared on the screen. Each trial displayed in succession a fixation cross (350 ms), a pseudoword written in white at the center of a black screen (1200 ms), ellipsis points indicating a variable waiting delay (either 1300 or 1600 ms) and finally a yellow question mark (1700 ms). The question mark was the cue instructing participants to produce the previously presented pseudoword as quickly and as accurately as possible. In some cases, yellow ellipsis points appeared on the screen instead of the question mark, indicating that no production was expected. These “no-go” trials, although not analyzed, were included to maintain participants’ attention and to avoid anticipatory responses. 
On average, no-go items appeared approximately every nine trials. Before the beginning of the experiment, participants read aloud a list containing all the stimuli to ensure they pronounced them correctly. Halfway through, they were asked to produce the rest of the pseudowords by adopting a loud speech mode. In cases of incorrect pronunciation, they were first corrected and then asked to produce the pseudoword in its correct form. A short training session with five pseudowords produced normally (three go and two no-go trials) preceded the experiment to ensure that the participants were comfortable with the experimental procedure. The experiment was divided into eight experimental blocks, including four blocks of standard speech (SS) and four blocks of loud speech (LS), presented in alternation (see Fig. 1 right panel). Each block contained between 48 and 50 stimuli with both pseudowords and no-go items. Across the eight blocks, participants produced the same 180 pseudowords in each condition. Before each SS block, participants were asked to speak as usual. Before each LS block, participants were instructed to speak louder than usual, aiming at being heard from outside the soundproof room. To ensure that participants produced utterances that were loud enough during loud blocks, intensity was checked by the experimenters on a sound level meter which was hidden from the participants. Half of the participants started the experiment with a block in SS (order 1) and the other half with LS (order 2). Short self-paced breaks were given to the participants between blocks. Four lists of the 360 pseudowords were created and were randomly assigned to the participants to avoid order effects. The speech productions were recorded for off-line accuracy (ACC) check and extraction of vocal onsets (or reaction times, RT).

Behavioral analyses

Intensity was extracted and analyzed with the Praat software (Boersma & Van Heuven, 2001). 
Speech intensity of all SS productions was averaged for each participant to establish an individual cut-off threshold. Loud utterances that were not at least 8 dB above the participant’s mean SS intensity were removed from the analyses. RT and ACC were extracted off-line through listening and visual inspection of the individual audio files with the Checkvocal Software (Protopapas, 2007). Incomplete (e.g., /kRat/ instead of /kRati/), uncertain (e.g., /kr/…/kRata/) and incorrectly produced (e.g., /kRotu/ instead of /kRutu/) pseudowords, as well as productions that did not correspond to the target speech mode, were considered erroneous productions and were thus removed from the analyses. The vocal onset of each pseudoword was identified by aligning to the plosion bar produced by C1. Two judges (i.e., first and third authors) listened to the entire speech dataset, resulting in an inter-judge agreement of 98% for ACC and 91% for RT. As a cleaning procedure, RTs deviating by more than 2.5 SD from the mean production latency per participant and per condition were removed. As ACC was not part of our hypotheses, this metric is used as a descriptive statistic. The behavioral results on RT were analyzed using the Mixed Model approach (Bates et al., 2014; Carson & Beeson, 2013) with the R-Software (R Core Team, 2021). We compared multiple nested models that were built up by adding one effect at a time. The best model (see Appendix B) contained RT as dependent variable; speech mode (SS and LS), order of the experimental blocks (loud first or standard first) and length (monosyllabic or disyllabic) as fixed effects; and subjects and items as random variables. Interaction effects between speech mode and order of experimental blocks were also tested in the model.

EEG Recording and Preprocessing

The electrophysiological data were recorded continuously during the experiment with high density EEG using the Active-Two Biosemi EEG system (Biosemi V.O.F. 
Amsterdam, Netherlands) including 128 electrodes on the scalp with a sampling rate fixed at 512 Hz. All the preprocessing steps, including DC removal, filtering at 0.2 Hz (high pass) and 30 Hz (low pass), and Notch Filtering at 50 Hz to remove line current artifact, were done with the Cartool Software (Brunet et al., 2011). Each trial was inspected visually and excluded from the averaging if it was contaminated by any artifact (e.g., blinks, eye movements or noise). After visual inspection, epochs were extracted, matched in number across conditions and averaged per participant. Problematic electrodes were interpolated for each participant using 3-D splines interpolation (Perrin et al., 1987), with the same electrodes interpolated across the two uttering conditions. On average, 15.5 electrodes (range: 6–23) were interpolated per participant. Average reference was applied to the EEG data after interpolation. Finally, we applied a spatial filter as the last step of the preprocessing procedure (see more details in Michel & Brunet, 2019). Response-locked epochs (i.e., aligned to the vocal onset) were extracted backwards with a time window of 175 time frames (TF; i.e., 342 ms at 512 Hz). The epoch duration was selected based on the two reviews from Laganaro (2019, 2023). In the latter, it is suggested that motor speech encoding processes would take up to 300 ms of the planning time rather than the 145 ms proposed in the review of Indefrey (2011).

Waveform analysis

Electrode amplitudes were compared between SS and LS with a mass-univariate approach on each electrode and time-point. This analysis was computed in the R software with the “threshold-free cluster enhancement” (TFCE) approach (Smith & Nichols, 2009) using the permuco4brain R package (Frossard & Renaud, 2021). This test has a high control over family-wise type I error. The analysis is based on 5000 permutation tests for repeated-measures ANOVA. 
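To illustrate the permutation logic behind the waveform comparison, the following is a minimal Python sketch of a paired sign-flip permutation test at a single electrode and time-point. It is a simplified stand-in and not the TFCE procedure of permuco4brain used in the study: there is no cluster enhancement and no correction across electrodes or time-points. The function name, the seed and the toy data are assumptions made for illustration only.

```python
import numpy as np

def paired_permutation_test(cond_a, cond_b, n_perm=5000, seed=0):
    """Two-sided sign-flip permutation test on paired differences.

    cond_a, cond_b: one amplitude per participant for each condition
    (hypothetical inputs, e.g. LS and SS at one electrode and time-point).
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(cond_a, dtype=float) - np.asarray(cond_b, dtype=float)
    observed = abs(diff.mean())
    # Under the null hypothesis, the sign of each participant's
    # difference is exchangeable, so signs are flipped at random.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diff.size))
    perm_stats = np.abs((signs * diff).mean(axis=1))
    # The observed labelling counts as one permutation (+1 in both terms).
    p_value = (np.sum(perm_stats >= observed) + 1) / (n_perm + 1)
    return observed, p_value

# Toy data for 24 participants (not taken from the study).
loud = np.arange(1, 25, dtype=float)
standard = np.zeros(24)
effect, p = paired_permutation_test(loud, standard)
print(f"mean difference = {effect:.2f}, p = {p:.4f}")
```

In a mass-univariate setting this kind of test would be repeated for every electrode and time-point, which is why a procedure such as TFCE is needed to control the family-wise error rate.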

Topography Consistency Test (TCT)

The TCT (Koenig & Melie-García, 2010) aims at disentangling electrical sources from noise in the ERP data with simple randomization techniques. In other words, this test tries to determine if the same brain topographies are obtained for a specific event with repeated measurements. Here, by using the Global Field Power (GFP) of ERPs averaged at the level of participants, the TCT assesses the topographic consistency of the signal throughout the entire time window. This test was computed with the Ragu software (Koenig et al., 2011b) before performing the topographic and microstate (spatio-temporal segmentation) analyses.

Topographic ANOVA (TANOVA) analysis

By computing an index of dissimilarity, the TANOVA uses a non-parametric randomization test to determine at which time points ERP topographies (i.e., the spatial distribution of the electric signal at the scalp at a specific time point) significantly differ across conditions (Koenig et al., 2011a; Murray et al., 2008). The TANOVA analysis is complementary to spatio-temporal segmentation (see next analysis). Indeed, the index of dissimilarity and the GFP capture topography and response strength, respectively, meaning that they can be measured and analyzed orthogonally (Murray et al., 2008b). A minimal duration threshold for significance can be calculated to control for the possible presence of false positives resulting from the time-point-by-time-point dissimilarity analysis (Koenig et al., 2011a).

Microstates (spatio-temporal segmentation) analysis

The spatiotemporal segmentation of ERPs, or microstates analysis, is a two-step procedure aiming at representing conditions with several prototypical topographies, or microstate maps, corresponding to periods of quasi-stable spatial distribution of the electrophysiological signal on the scalp. 
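The two quantities that the TCT and TANOVA rely on, GFP and the index of dissimilarity, can be sketched in a few lines of Python. This is a minimal illustration using the standard definitions for average-referenced maps (GFP as the spatial standard deviation across electrodes, dissimilarity as the root-mean-square difference between GFP-normalized maps). It is not the Ragu or Cartool implementation, and the function names and toy maps are assumptions.

```python
import numpy as np

def global_field_power(topography):
    """GFP: spatial standard deviation of one scalp map across electrodes."""
    v = np.asarray(topography, dtype=float)
    v = v - v.mean()  # re-reference to the average of all electrodes
    return np.sqrt(np.mean(v ** 2))

def global_dissimilarity(map_u, map_v):
    """Index of dissimilarity between two maps, independent of strength.

    Each map is average-referenced and scaled to unit GFP, so the index
    ranges from 0 (same topography) to 2 (inverted topography).
    Assumes neither map is flat (a GFP of 0 would divide by zero).
    """
    u = np.asarray(map_u, dtype=float)
    v = np.asarray(map_v, dtype=float)
    u = u - u.mean()
    v = v - v.mean()
    u_norm = u / global_field_power(u)
    v_norm = v / global_field_power(v)
    return np.sqrt(np.mean((u_norm - v_norm) ** 2))

# Toy 4-electrode maps (hypothetical values, in microvolts).
map_a = np.array([1.0, -1.0, 2.0, -2.0])
print(global_field_power(map_a))
print(global_dissimilarity(map_a, 3.0 * map_a))  # same shape, stronger
print(global_dissimilarity(map_a, -map_a))       # inverted polarity
```

Because the maps are normalized before comparison, a change in response strength alone leaves the dissimilarity at zero, which is why topography and strength can be analyzed separately.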
This type of analysis relies on the GFP to decompose the signal into clusters of stable periods (60–120 ms) of electrophysiological activity (Koenig et al., 2014; Michel & Koenig, 2018; Skrandies, 1990). First, cluster maps are extracted from the ERP conditions and are referred to as “template maps” (Michel & Koenig, 2018). These template maps are then fitted to participants’ individual signal for each condition in order to extract relevant parameters (Michel & Koenig, 2018). These parameters represent several metrics of interest to describe EEG topographies: strength, timing and spatial distribution. In the present study, statistical analyses were carried out on one temporal parameter (duration, DUR) and one global measurement (area under curve, AUC) of occurrence.

Results

Behavioral results

The average intensity of production was 64.60 dB for loud utterances and 51.47 dB for standard utterances. The mean intensity difference was 12.73 dB (SD = 2.95; minimum (Min) = 9.46; maximum (Max) = 22.94) on 148 loud trials on average. Participants produced the pseudowords with high overall accuracy: LS utterances were produced with a 97% accuracy rate (SD = 16) and SS utterances with 96% (SD = 19). The mean production latency for LS and SS was respectively 593.22 ms (SD = 118.41 ms) and 579.74 ms (SD = 124.28 ms). The best nested linear mixed model (see Appendix B for details) demonstrated a significant main effect of the speech mode (t(7365) = 4.85, β = 15.99, standard error (SE) = 3.295, p < .01), with LS yielding longer RT as compared to SS. Furthermore, a significant interaction effect was observed between the speech mode and the experimental block with which the participants started the experiment (t(7363.254) = −2.764, SE = 4.74, p = .005). 
In particular, post-hoc analyses using a Tukey test showed that participants who started with an LS experimental block needed an additional initialization time of 16 ms to produce LS utterances (z = −4.852, SE = 3.3, p < .001). On the contrary, the difference of estimate between LS and SS was not significant for participants who started with an SS experimental block (z = −0.851, SE = 3.4, p = .40).

ERP results

TCT

Response-locked ERPs across conditions had an overall topographic consistency throughout the whole time period (see Appendix D). Therefore, the whole resulting signal from the response-locked ERPs was kept for the following analyses.

Waveform analysis

The results of the test distribution of the TFCE procedure are presented in Fig. 2A. Amplitude differences in response-locked ERPs were observed at three time periods. From approximately −127 ms to the vocal onset (i.e., time 0), a cluster of 14 neighboring electrodes in the central-anterior region was found with lower values for the loud condition. In the time period from −200 ms to −128 ms preceding the vocal onset, different amplitudes appeared on three small clusters: seven left anterior electrodes, four right anterior electrodes and five right central-parietal neighboring electrodes, which tended to be more negative parietally and more positive anteriorly for the loud condition. Additionally, diverging amplitudes were observed in the −340 ms to −260 ms time period preceding the vocal onset on some sparse channels. Here, electrodes that significantly differed between the two uttering conditions had lower values for the loud condition.

TANOVA

Results of the pairwise TANOVA between LS and SS are presented in Fig. 2C. Green time periods correspond to significant time periods (p < .05). Among the three significant periods presented, only one (i.e., from −225 ms to −100 ms) lasted longer than the duration threshold of 113.1 ms. 
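The duration criterion applied to the TANOVA output can be sketched as follows: consecutive time points with p < .05 are grouped into periods, and only periods at least as long as the duration threshold are retained. This is a minimal Python illustration, not the Ragu implementation. The function name, the toy p-values and the choice to measure duration from the first to the last significant sample are assumptions.

```python
import numpy as np

def significant_periods(p_values, times_ms, alpha=0.05, min_duration_ms=0.0):
    """Return (start_ms, end_ms) for each run of consecutive time points
    with p < alpha that lasts at least min_duration_ms.

    Duration is measured from the first to the last significant sample
    (an assumption; other conventions add one sampling interval).
    """
    p = np.asarray(p_values, dtype=float)
    t = np.asarray(times_ms, dtype=float)
    significant = p < alpha
    periods = []
    start = None
    for i, flag in enumerate(significant):
        if flag and start is None:
            start = i  # a new significant run begins here
        if start is not None and (not flag or i == len(significant) - 1):
            end = i if flag else i - 1  # last significant sample of the run
            if t[end] - t[start] >= min_duration_ms:
                periods.append((float(t[start]), float(t[end])))
            start = None
    return periods

# Toy time course sampled every 10 ms (hypothetical p-values).
times = np.arange(0, 100, 10)
pvals = [0.5, 0.01, 0.01, 0.5, 0.01, 0.01, 0.01, 0.01, 0.5, 0.01]
print(significant_periods(pvals, times))
print(significant_periods(pvals, times, min_duration_ms=25))
```

With the values reported above, the period from −225 ms to −100 ms spans 125 ms and therefore passes the 113.1 ms threshold, whereas the two shorter periods do not.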

Microstates analysis

Three clusters were obtained for the response-locked spatio-temporal segmentation analysis, with 97.81% of explained variance (labelled from “A” to “C” in Fig. 2D). To statistically assess the global strength and duration of each map in the participants’ ERPs, the template maps were fitted to participants’ individual ERPs (see details in Appendix E). A non-parametric Friedman test (Cleophas et al., 2016) demonstrated that map A differed significantly across conditions on the two parameters of interest (AUC: χ²(1) = 8.167, p = .004; DUR: χ²(1) = 9.783, p = .004). In particular, map A lasted longer in the SS condition and had a higher AUC value than in the loud condition. On the other hand, microstate map C significantly differed across conditions, with the loud condition entailing a higher AUC value (χ²(1) = 4.55, p = .033) compared to the standard condition.

Discussion

In the present experiment, we investigated the behavioral and electrophysiological signature associated with producing an utterance when speaking normally or louder in a delayed production task. In the following, we first discuss the longer RTs associated with the production of loud utterances before examining the electrophysiological correlates of this speech mode in response-locked ERPs. The behavioral results replicate previous findings suggesting that producing loud utterances entails longer latencies in comparison to standard speech utterances (e.g., Zhang & Hansen, 2007; Bourqui et al., submitted). Here, 16 additional milliseconds were needed for initializing loud utterances. This difference is smaller than the 34 ms difference reported in Bourqui et al. (submitted) but confirms that loud speech entails a behavioral encoding cost. However, this difference in RT depends on the mode with which participants started the experiment. This result will be discussed in the general discussion in the light of the second experiment. 
On EEG/ERP signals, LS and SS differed in amplitudes, TANOVA and spatiotemporal segmentation in a large time window (i.e., from approximately −220 ms to −100 ms before the vocal onset). This time window falls within the last 300 ms preceding the vocal onset, which have been associated with motor speech encoding processes according to previous estimates (Laganaro, 2019, 2023). In particular, producing loud utterances induces modulation of the waveform amplitudes during several time periods throughout the whole response-locked ERP, with larger amplitudes for LS in particular in the last 150 ms (see Fig. 2). Microstates are defined as periods with stable topographic representations suggesting quasi-simultaneity of activity among the brain regions involved in large-scale networks (Michel & Koenig, 2018). Therefore, as the spatio-temporal segmentation and the fitting yielded the same sequence of microstates across conditions, we can claim that the production of loud and standard utterances is supported by the same brain networks and thus identical motor speech encoding processes. Although identical brain networks were found in both conditions, differences in temporal dynamics and strength of neural recruitment were revealed for maps A and C. In particular, map A lasted longer and had a higher AUC value for the standard condition. In turn, map C demonstrated a higher AUC value for the loud condition. In other words, these results suggest that to produce a louder utterance, the same neural processes as in SS are recruited but with a difference in temporal dynamics and in the strength of recruitment. These changes in the temporal dynamics occur at two distant time periods corresponding to two different microstates (i.e., close to and far from the vocal onset), meaning that producing loud utterances probably involves more than just parametrization of muscle commands as proposed by Van der Merwe (2021). 
Before going further in the interpretation of the results obtained in this experiment, we first investigate the behavioral and electrophysiological signatures of another speech mode with distinct phonatory and articulatory properties.

Experiment 2 - Whispered speech

Method

Participants

A different sample of 24 right-handed neurotypical French speakers (average laterality quotient = 90.83, range = 60–100; M = 24.03 years, SD = 3.3 years, 10 men) fulfilling the same criteria as in Experiment 1 was recruited to perform the task.

Material

The material was identical to that of Experiment 1.

Procedure

The experimental procedure was similar to that of Experiment 1, except that WS was the speech mode contrasted with SS. During the training session, participants were instructed to speak without vibrating their vocal folds. Those who struggled to whisper completed several extra items while focusing on not feeling any vibration in their vocal apparatus. As in Experiment 1, there were 8 blocks of approximately 50 stimuli each, 4 performed in whispered mode and 4 in standard mode, in a counterbalanced order across participants.

Analysis

During the offline extraction of ACC and RT, the intensity of the WS recordings was increased to 70 dB to facilitate the alignment on the vocal onset in the Praat software. The inter-rater agreement for ACC and RT was 94% and 90%, respectively. On the behavioral level, the mixed model contained the same variables as in the first experiment. On the EEG/ERP level, the length of the response-locked ERPs across conditions was 342 ms (i.e., 175 TF), as in the first experiment.

Results

Behavioral results

High accuracy was obtained for both standard (M = 93.79%) and whispered utterances (M = 92.58%). Pseudowords in SS were produced with an average latency of 636.70 ms (SD = 167.95), while whispered pseudowords required 652.91 ms (SD = 164.54) to be initiated.
The retained linear mixed model (see Appendix C for more details) showed a main effect of stimulus length, with monosyllabic items yielding longer initialization times than disyllabic stimuli (F(7860) = 5.168, β = 6.410, p = .023), and a significant interaction between speech mode and the experimental block with which the participant started the experiment. Post-hoc comparisons with the Tukey test showed that participants who started with a WS block had significantly longer initialization times, by 33.94 ms, for whispered than for standard utterances (z = −9.013, SE = 3.77, p < .001). In contrast, no difference was observed across conditions in participants starting with an SS block (z = 0.367, SE = 3.74, p = .71).

ERP results

TCT

WS and SS response-locked ERPs did not contain any topographic inconsistency (see Appendix D).

Waveform analysis

The TFCE test comparing ERP amplitudes indicated no differences across conditions in the response-locked ERPs (see Fig. 3a).

TANOVA

The response-locked TANOVA yielded two short time periods of significant difference across conditions (i.e., green time periods in Fig. 3c), which did not exceed the duration threshold of 52.65 ms.

Microstates analysis

The spatio-temporal segmentation of the response-locked ERPs yielded three microstate maps accounting for a global explained variance (GEV) of 97.34%. Template maps were fitted to individual ERPs to perform two statistical analyses. As in Experiment 1, we assessed the AUC and DUR parameters (see details in Appendix F). The non-parametric Friedman test showed that neither the AUC nor the DUR differed across conditions for any of the microstate maps.

Discussion

In this second experiment, we contrasted the production of WS and SS utterances with the same pipeline as in the first experiment. Here, WS utterances required a production latency 33.94 ms longer than SS.
Although only observed in participants starting with a WS block, this result replicates previous findings (see Zhang & Hansen, 2007; Bourqui et al., submitted), and the differences across conditions are even larger than in those studies. As in Experiment 1, the encoding cost observed in the behavioral results cannot be generalized, as it depended on the first experimental block; it will be further discussed in the general discussion. On the electrophysiological level, the same microstate sequences were observed in SS and WS, with only minor differences in the TANOVA analysis. These time windows did not exceed the significance duration threshold, meaning that they could be false positives. The spatio-temporal segmentation and the fitting in the individual ERPs further confirmed the absence of topographic differences across conditions. On the whole, the electrical brain activity underlying the production of whispered utterances does not appear to differ from that underlying standard utterances. The same finding has been reported previously in a smaller group of participants (Sikdar et al., 2017), suggesting that, on the electrophysiological level, whispering and speaking normally are similar in nature. Although consistent with a previous study, these results raise several questions that will be discussed in depth in the light of the first experiment.

General discussion

In this study, we investigated the behavioral and electrophysiological signatures of encoding the production of two distinct speech modes (i.e., loud speech (LS) and whispered speech (WS)) relative to standard speech (SS). Since the same procedure and material were used for both speech modes, we can broadly comment on the discrepancies and similarities across experiments.
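The duration criterion applied to the TANOVA results of Experiment 2 (periods of significant difference shorter than 52.65 ms were not retained) can be sketched as follows. This is a minimal illustration rather than the procedure implemented in the analysis software used in the study: it assumes one p-value per time frame, a significance level of .05 and a hypothetical frame length of 2 ms. The function name and the example values are made up.

```python
def significant_periods(p_values, alpha, ms_per_frame, min_duration_ms):
    """Return (first_frame, last_frame, duration_ms) for each run of
    consecutive frames with p < alpha that lasts longer than min_duration_ms."""
    periods = []
    start = None
    # A trailing sentinel of 1.0 closes a run that reaches the last frame.
    for i, p in enumerate(list(p_values) + [1.0]):
        if p < alpha:
            if start is None:
                start = i
        elif start is not None:
            duration = (i - start) * ms_per_frame
            if duration > min_duration_ms:
                periods.append((start, i - 1, duration))
            start = None
    return periods


# Hypothetical example: one 40 ms run (discarded) and one 60 ms run (kept).
example_p = [0.5] * 10 + [0.01] * 20 + [0.5] * 10 + [0.01] * 30 + [0.5] * 5
example_periods = significant_periods(
    example_p, alpha=0.05, ms_per_frame=2.0, min_duration_ms=52.65
)
```

Under these assumptions only the second run survives the criterion, which mirrors the logic by which the two short TANOVA periods in the whispered condition were treated as possible false positives.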
In the two experiments, behavioral results showed longer initialization times for the non-standard speech mode, with the result driven by participants who started the experiment with a block in the non-standard condition. This intriguing result may be interpreted as a “novelty bias”, as speakers are not accustomed to speaking louder or whispering over such a long period of time. Participants may therefore have been less familiar with the task, and this behavioral encoding cost needs further investigation with a different experimental design. The current behavioral results nevertheless converge with previous studies using different paradigms and showing a cost of encoding non-standard speech modes (Zhang & Hansen, 2007; Bourqui et al., submitted). Some authors have proposed that different speech modes with specific phonatory and articulatory features involve unique encoding processes in comparison to standard speech (Scott, 2022; Zhang & Hansen, 2007). Comparing the electrophysiological results across our two experiments suggests that speech modes cannot be grouped as a whole entity encoded in the motor programming stage (i.e., the last encoding process preceding articulation), as suggested in the FL model (Van der Merwe, 2021). Indeed, the EEG/ERP results do not converge: LS seems to entail substantial electrophysiological modulations, while WS electrophysiological activity is very close to SS, a null result that has been reported previously (Sikdar et al., 2017). In particular, LS electrophysiological activity differed in several time periods that seem to extend beyond the programming time window, one close to and one quite distant from the vocal onset. Our interpretation is that only the significant difference in strength of the last microstate preceding the vocal onset (map C in Fig. 2) could be considered as the “increase in neuromotor drive” proposed in the study of Whitfield et al. (2021).
The present results thus support previous propositions by providing neuroimaging data indicating that speaking loudly entails changes in temporal dynamics and an increase in brain activation during motor encoding. They also replicate the finding of Sikdar et al. (2017) that WS and SS are similar on the electrophysiological level. In this particular case, the microstate results argue against the idea that an additional mechanism is responsible for producing whispered utterances, as proposed in Tsunoda et al. (2011). Indeed, the same microstate maps, and thus the same encoding processes, were found for both WS and SS. Moreover, the dynamics of brain activation underlying these processes did not differ across conditions. In the light of the electrophysiological data, whispering cannot be distinguished from speaking normally, and the literature should perhaps adopt a more nuanced approach to understanding and characterizing this mode. Furthermore, since the time window of ERP modulations for LS seems to encompass a large portion of the time window associated with motor speech encoding (likely planning and programming in the FL model), the present results can also be related to the neurocomputational framework of speech production and acquisition from Guenther (2016), the Directions Into Velocities of Articulators (DIVA) model. Indeed, although the latter does not yet specify the dynamics of brain activation (Tourville & Guenther, 2011), the present findings challenge our understanding of the feedforward control system.
If one assumes that the speech sound map (SSM) corresponds to motor planning in the FL model (i.e., where motor plans are retrieved) while the Articulatory Map corresponds to the motor programming stage (i.e., where spatiotemporal and force dimensions are specified), our outcomes suggest that LS could be encoded all along the process of activating the cells in the SSM and transmitting the motor targets to the Initiation Map and the Articulatory Map. For future studies, investigating speech modes thus seems to provide an interesting window onto the intricate interplay between the functional units of the feedforward control system.

Limitations

Some methodological concerns should be considered in future studies investigating speech modes, especially WS. First, it has been suggested that there is substantial intra-speaker and inter-speaker variability in the production of whispered utterances (Konnai et al., 2017). Second, despite giving the same instructions to every participant, we did not control for the type of whisper or the way participants were actually whispering. Indeed, Solomon et al. (1989) proposed that there are two types of whispering, a quiet whisper (i.e., low-effort manner) and a loud whisper (i.e., high-effort manner), which were not controlled in this experiment. In brief, the analysis of WS data involves several methodological challenges due to the inherent phonatory and articulatory properties of this mode of production.

Conclusion

In the present study, we conducted two experiments which showed that producing utterances under specific speech modes entails a behavioral encoding cost (i.e., increased production latency), although this result may need to be confirmed with a different experimental design.
The EEG/ERP results showed that speech modes with distinct phonatory and articulatory features cannot be grouped as a global entity and entail adjustments in different time windows corresponding to different brain networks. Indeed, the electrophysiological signatures of the two speech modes of interest differed: loud utterances entailed changes in the ERP signal in two mental processes, one close to and one farther away from the vocal onset, while the ERP signal associated with whispered utterances did not differ significantly from the standard ERP signal. These findings have important consequences, as they challenge the current conceptualization of speech modes. Indeed, this study clarifies the statement “each speech mode possesses its own encoding mechanism” (Zhang & Hansen, 2007; Scott, 2022). Speech modes seem to be produced through the same brain networks as standard production, but with a continuum of changes in the temporal dynamics and the strength of recruitment. These changes are observed across the whole motor speech (phonetic) encoding stage and can range from substantial (in this case for loud utterances) to almost non-existent (in this instance for whispered utterances).

Declarations

CRediT authorship contribution statement

Bryan Sanders (Conceptualization, Data curation, Formal analysis, Investigation, Visualization, Writing – original draft), Monica Lancheros (Investigation, Supervision, Validation, Writing – review & editing), Marion Bourqui (Supervision, Validation, Writing – review & editing), Marina Laganaro (Conceptualization, Funding acquisition, Project administration, Resources, Supervision, Validation, Writing – review & editing).

Funding

This work was supported by the Swiss National Science Foundation (grant number CRSII5_202228). The funders played no role in study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the article for publication.
Conflict of interest statement

The authors have no financial or proprietary interests in any material discussed in this article.

References

Bates, D., Mächler, M., Bolker, B. M., & Walker, S. C. (2014). Fitting Linear Mixed-Effects Models using lme4. Journal of Statistical Software, 67(1). https://doi.org/10.18637/jss.v067.i01

Boersma, P., & Van Heuven, V. (2001). Speak and unSpeak with PRAAT. Glot International, 5(9/10), 341–347.

Bohland, J. W., Bullock, D., & Guenther, F. H. (2010). Neural representations and mechanisms for the performance of simple speech sequences. Journal of Cognitive Neuroscience, 22(7), 1504–1529. https://doi.org/10.1162/JOCN.2009.21306

Brunet, D., Murray, M. M., & Michel, C. M. (2011). Spatiotemporal analysis of multichannel EEG: CARTOOL. Computational Intelligence and Neuroscience, 2011. https://doi.org/10.1155/2011/813870

Carson, R. J., & Beeson, C. M. L. (2013). Crossing Language Barriers: Using Crossed Random Effects Modelling in Psycholinguistics Research. Tutorials in Quantitative Methods for Psychology, 9(1), 25–41.

Cleophas, T. J., & Zwinderman, A. H. (2016). Non-parametric tests for three or more samples (Friedman and Kruskal-Wallis). In Clinical data analysis on a pocket calculator: Understanding the scientific methods of statistical reasoning and hypothesis testing (pp. 193–197).

den Hollander, J., Jonkers, R., Mariën, P., & Bastiaanse, R. (2019). Identifying the Speech Production Stages in Early and Late Adulthood by Using Electroencephalography. Frontiers in Human Neuroscience, 13, 298. https://doi.org/10.3389/FNHUM.2019.00298

Dromey, C., & Ramig, L. O. (1998). Intentional Changes in Sound Pressure Level and Rate. Journal of Speech, Language, and Hearing Research, 41(5), 1003–1018. https://doi.org/10.1044/JSLHR.4105.1003

Frossard, J., & Renaud, O. (2021). Permutation tests for regression, ANOVA, and comparison of signals: the permuco package. Journal of Statistical Software, 99, 1–32.

Guenther, F. H. (1994). A neural network model of speech acquisition and motor equivalent speech production. Biological Cybernetics, 72(1), 43–53. https://doi.org/10.1007/BF00206237

Guenther, F. H. (2016). Neural Control of Speech. The MIT Press. https://doi.org/10.7551/MITPRESS/10471.001.0001

Guenther, F. H., & Vladusich, T. (2012). A Neural Theory of Speech Acquisition and Production. Journal of Neurolinguistics, 25(5), 408–422. https://doi.org/10.1016/J.JNEUROLING.2009.08.006

Huber, J. E., & Chandrasekaran, B. (2006). Effects of Increasing Sound Pressure Level on Lip and Jaw Movement Parameters and Consistency in Young Adults. Journal of Speech, Language, and Hearing Research, 49(6), 1368–1379. https://doi.org/10.1044/1092-4388(2006/098)

Indefrey, P. (2011). The spatial and temporal signatures of word production components: A critical update. Frontiers in Psychology, 2, 255. https://doi.org/10.3389/FPSYG.2011.00255

Indefrey, P., & Levelt, W. J. M. (2004). The spatial and temporal signatures of word production components. Cognition, 92(1–2), 101–144. https://doi.org/10.1016/J.COGNITION.2002.06.001

Kelly, F., & Hansen, J. H. L. (2021). Analysis and calibration of Lombard effect and whisper for speaker recognition. IEEE/ACM Transactions on Audio Speech and Language Processing, 29, 927–942. https://doi.org/10.1109/TASLP.2021.3053388

Koenig, T., Kottlow, M., Stein, M., & Melie-García, L. (2011a). Ragu. Computational Intelligence and Neuroscience, 2011, 14. https://doi.org/10.1155/2011/938925

Koenig, T., Kottlow, M., Stein, M., & Melie-García, L. (2011b). Ragu: a free tool for the analysis of EEG and MEG event-related scalp field data using global randomization statistics. Computational Intelligence and Neuroscience, 2011. https://doi.org/10.1155/2011/938925

Koenig, T., & Melie-García, L. (2010). A method to determine the presence of averaged event-related fields using randomization tests. Brain Topography, 23(3), 233–242. https://doi.org/10.1007/S10548-010-0142-1

Koenig, T., Stein, M., Grieder, M., & Kottlow, M. (2014). A tutorial on data-driven methods for statistically assessing ERP topographies. Brain Topography, 27(1), 72–83. https://doi.org/10.1007/S10548-013-0310-1

Konnai, R., Scherer, R. C., Peplinski, A., & Ryan, K. (2017). Whisper and Phonation: Aerodynamic Comparisons Across Adduction and Loudness. Journal of Voice, 31(6), 773.e11–773.e20. https://doi.org/10.1016/J.JVOICE.2017.02.016

Laganaro, M. (2019). Phonetic encoding in utterance production: a review of open issues from 1989 to 2018. Language, Cognition and Neuroscience. https://doi.org/10.1080/23273798.2019.1599128

Laganaro, M. (2023). Time-course of phonetic (motor speech) encoding in utterance production. Cognitive Neuropsychology, 40(5–6), 287–297.

Laganaro, M., & Perret, C. (2011). Comparing electrophysiological correlates of word production in immediate and delayed naming through the analysis of word age of acquisition effects. Brain Topography, 24(1), 19–29. https://doi.org/10.1007/S10548-010-0162-X

Lenth, R. (2020). emmeans: Estimated Marginal Means, aka Least-Squares Means (R package version 1.4.5). The Comprehensive R Archive Network (CRAN). https://cran.r-project.org/package=emmeans

Levelt, W. J. (1989). Speaking: From intention to articulation (Vol. 1). MIT Press. https://www.mpi.nl/publications/item67053/speaking-intention-articulation

Levelt, W. J. M., Roelofs, A., & Meyer, A. S. (1999). A theory of lexical access in speech production. Behavioral and Brain Sciences, 22(1), 1–38. https://doi.org/10.1017/S0140525X99001776

Michel, C. M., & Brunet, D. (2019). EEG source imaging: A practical review of the analysis steps. Frontiers in Neurology, 10(APR), 446653. https://doi.org/10.3389/FNEUR.2019.00325

Michel, C. M., & Koenig, T. (2018). EEG microstates as a tool for studying the temporal dynamics of whole-brain neuronal networks: A review. NeuroImage, 180, 577–593. https://doi.org/10.1016/J.NEUROIMAGE.2017.11.062

Michel, C. M., Koenig, T., Brandeis, D., Gianotti, L. R. R., & Wackermann, J. (Eds.). (2009). Electrical Neuroimaging. Cambridge University Press.

Michel, C. M., & Murray, M. M. (2012). Towards the utilization of EEG as a brain imaging tool. NeuroImage, 61(2), 371–385. https://doi.org/10.1016/J.NEUROIMAGE.2011.12.039

Miller, H. E., & Guenther, F. H. (2021). Modelling speech motor programming and apraxia of speech in the DIVA/GODIVA neurocomputational framework. Aphasiology, 35(4), 424–441. https://doi.org/10.1080/02687038.2020.1765307

Murray, M. M., Brunet, D., & Michel, C. M. (2008a). Topographic ERP analyses: a step-by-step tutorial review. Brain Topography, 20(4), 249–264. https://doi.org/10.1007/S10548-008-0054-5

New, B., Pallier, C., Brysbaert, M., & Ferrand, L. (2004). Lexique 2: A new French lexical database. Behavior Research Methods, Instruments, & Computers, 36(3), 516–524.

Oldfield, R. C. (1971). The assessment and analysis of handedness: the Edinburgh inventory. Neuropsychologia, 9(1), 97–113.

Perkell, J. S. (2012). Movement goals and feedback and feedforward control mechanisms in speech production. Journal of Neurolinguistics, 25(5), 382–407. https://doi.org/10.1016/J.JNEUROLING.2010.02.011

Perrin, F., Pernier, J., Bertrand, O., Giard, M. H., & Echallier, J. F. (1987). Mapping of scalp potentials by surface spline interpolation. Electroencephalography and Clinical Neurophysiology, 66(1), 75–81.

Piai, V., Riès, S. K., & Knight, R. T. (2014). The electrophysiology of language production: What could be improved. Frontiers in Psychology, 5, 1560. https://doi.org/10.3389/FPSYG.2014.01560

Protopapas, A. (2007). CheckVocal: A program to facilitate checking the accuracy and response time of vocal responses from DMDX. Behavior Research Methods, 39(4), 859–862. https://doi.org/10.3758/BF03192979

Ramoo, D. (2021). 9.2 The Standard Model of Speech Production. BCcampus.

R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/

Sikdar, D., Roy, R., & Mahadevappa, M. (2017, May). Multifractal analysis of electroencephalogram for human speech modalities. In 2017 8th International IEEE/EMBS Conference on Neural Engineering (NER) (pp. 637–640). IEEE.

Skrandies, W. (1990). Global field power and topographic similarity. Brain Topography, 3(1), 137–141. https://doi.org/10.1007/BF01128870

Smiljanić, R., & Bradlow, A. R. (2009). Speaking and Hearing Clearly: Talker and Listener Factors in Speaking Style Changes. Language and Linguistics Compass, 3(1), 236–264. https://doi.org/10.1111/J.1749-818X.2008.00112.X

Smith, S. M., & Nichols, T. E. (2009). Threshold-free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference. NeuroImage, 44(1), 83–98.

Solomon, N. P., McCall, G. N., Trosset, M. W., & Gray, W. C. (1989). Laryngeal Configuration and Constriction during Two Types of Whispering. Journal of Speech and Hearing Research, 32(1), 161–174. https://doi.org/10.1044/JSHR.3201.161

Strijkers, K., & Costa, A. (2016). The cortical dynamics of speaking: present shortcomings and future avenues. Language, Cognition and Neuroscience, 31(4), 484–503. https://doi.org/10.1080/23273798.2015.1120878

Tourville, J. A., & Guenther, F. H. (2011). The DIVA model: A neural theory of speech acquisition and production. Language and Cognitive Processes, 26(7), 952–981. https://doi.org/10.1080/01690960903498424

Tsunoda, K., Sekimoto, S., & Baer, T. (2011). An fMRI study of whispering: the role of human evolution in psychological dysphonia. Medical Hypotheses, 77(1), 112–115. https://doi.org/10.1016/J.MEHY.2011.03.040

Van der Merwe, A. (2009). A theoretical framework for the characterization of pathological speech sensorimotor control. In M. R. McNeil (Ed.), Clinical management of sensorimotor speech disorders (2nd ed., pp. 3–18).

Van Der Merwe, A. (2020). New perspectives on speech motor planning and programming in the context of the four-level model and its implications for understanding the pathophysiology underlying apraxia of speech and other motor speech disorders. Aphasiology, 35(4), 397–423. https://doi.org/10.1080/02687038.2020.1765306

Verwoert, M., Ottenhoff, M. C., Goulis, S., Colon, A. J., Wagner, L., Tousseyn, S., van Dijk, J. P., Kubben, P. L., & Herff, C. (2022). Dataset of Speech Production in intracranial Electroencephalography. Scientific Data, 9(1), 1–9. https://doi.org/10.1038/s41597-022-01542-9

Weerathunge, H. R., Alzamendi, G. A., Cler, G. J., Guenther, F. H., Stepp, C. E., & Zañartu, M. (2022). LaDIVA: A neurocomputational model providing laryngeal motor control for speech acquisition and production. PLoS Computational Biology, 18(6). https://doi.org/10.1371/JOURNAL.PCBI.1010159

Whitfield, J. A., Holdosh, S. R., Kriegel, Z., Sullivan, L. E., & Fullenkamp, A. M. (2021). Tracking the Costs of Clear and Loud Speech: Interactions Between Speech Motor Control and Concurrent Visuomotor Tracking. Journal of Speech, Language, and Hearing Research, 64(6s), 2182–2195. https://doi.org/10.1044/2020_JSLHR-20-00264

Zhang, C., & Hansen, J. H. L. (2007). Analysis and classification of speech mode: Whispered through shouted. In Proceedings of the 8th Annual Conference of the International Speech Communication Association, Interspeech 2007 (Vol. 4, pp. 2396–2399). https://doi.org/10.21437/INTERSPEECH.2007-621

Zhang, C., Hansen, J. H. L., & Patil, H. A. (2018). Advancements in whispered speech detection for interactive/speech systems. In Signal and Acoustic Modelling for Speech and Communication Disorders (Vol. 5, pp. 9–32). De Gruyter.

Additional Declarations

No competing interests reported.

Supplementary Files

Appendix.docx

Version 1 editorial history: Editorial decision: Revision requested, 25 Nov, 2024; Reviews received at journal, 22 Nov, 2024; Reviewers agreed at journal, 29 Oct, 2024; Reviewers agreed at journal, 24 Sep, 2024; Reviewers invited by journal, 10 Sep, 2024; Editor assigned by journal, 28 Aug, 2024; Submission checks completed at journal, 28 Aug, 2024; First submitted to journal, 26 Aug, 2024.
Published version: https://doi.org/10.1007/s10548-025-01108-z (published 15 Feb, 2025).

Author affiliations: Bryan Sanders (corresponding author), Monica Lancheros, Marion Bourqui and Marina Laganaro, University of Geneva.

Figure 1. Illustration of the delayed production task on the left panel and of the experimental design on the right panel.

Figure 2. (A) Results from the waveform analyses across conditions on all time points (x axis) and electrodes (y axis) on response-locked ERPs, with red and yellow points indicating significant differences (p < .01 and p < .05, respectively). (B) Illustration of amplitude variations for the Fz, Cz and Pz electrodes. (C) Results of the TANOVA or topographic dissimilarity analysis with significant time periods represented in green (the y-axis represents 1-p values). (D) Results of the spatio-temporal segmentation across conditions represented on the mean GFP in microvolts (μV) per condition. The areas delimited correspond to each microstate with its associated topography (i.e., spatial distribution of the brain activity) on the left panel.

Figure 3. This figure illustrates the contrast between standard and whispered response-locked ERPs. (a) Results of the TFCE method revealing no significant differences in amplitude (p ≥ .05 for white) across conditions. (b) Fz, Cz and Pz exemplars of waveform modulations. (c) Results of the TANOVA analysis demonstrating short-timed differences in green between WS and SS. (d) Results of the spatio-temporal segmentation represented on the mean GFP in microvolts (μV) with the associated topographies on the left panel. Microstates or stable periods of electrophysiological activity are delimited by colors.
Studying the neural processes underlying speech modes as compared to standard speech can thus provide a relevant way of investigating whether speech modes require adjustments at specific encoding stages or whether they are underpinned by different encoding processes, as further presented below.\u003c/p\u003e\n\u003ch3\u003eSpeech modes\u003c/h3\u003e\n\u003cp\u003e Speech modes constitute an omnipresent part of verbal exchanges. The way people speak is continuously influenced by intrinsic factors (speaker related) and extrinsic factors (environment or listener related) (Kelly \u0026amp; Hansen, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e2021\u003c/span\u003e; Smiljanić \u0026amp; Bradlow, \u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e2009\u003c/span\u003e; Whitfield et al., \u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). For instance, speakers need to modulate their speech production to tell a friend what food they would like to order in a stadium full of supporters who are loudly singing an anthem. Moreover, meaningful cues (e.g., linguistic, affective or social cues) are conveyed to the interlocutor through the modulation of speech (Perkell, \u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e2012\u003c/span\u003e; Tourville \u0026amp; Guenther, \u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e2011\u003c/span\u003e). As an example, a speaker giving a talk during a conference will adopt a clear speech mode to emphasize the take-home message. Zhang and Hansen (\u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e2007\u003c/span\u003e) proposed five speech modes with unique articulatory and phonatory features: whispered speech, soft speech, neutral (equivalent to standard) speech, loud speech and shouted speech. 
In this regard, it has been hypothesized that each speech mode involves its own mechanisms resulting in specific articulatory and phonatory patterns (Zhang \u0026amp; Hansen, \u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e2007\u003c/span\u003e; Scott, 2022). We will briefly describe the two speech modes that will be investigated in the present study, namely loud speech (LS) and whispered speech (WS).\u003c/p\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003eLoud Speech (LS)\u003c/h2\u003e \u003cp\u003eWhen speakers struggle to convey a message to an interlocutor, for instance in a noisy environment, they usually modify their speech by increasing their vocal effort. In this case, the increase in loudness leads to phonatory adjustments and changes in speech kinematics (Dromey \u0026amp; Ramig, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e1998\u003c/span\u003e; Huber \u0026amp; Chandrasekaran, \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2006\u003c/span\u003e; Whitfield et al., \u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). These manifestations associated with increased Sound Intensity Level (SIL) pertain to a specific speech mode labeled “LS”. Intuitively, one would conceptualize LS as the best speech mode to make oneself heard by someone else. However, the results obtained by Whitfield et al. (\u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e2021\u003c/span\u003e) indicate that the main characteristic of LS is increased vocal intensity, without necessarily an improvement in the articulatory distinctiveness of the message conveyed. In the literature, LS has usually been associated with an increase of 10 dB ± 4 dB relative to the standard SIL (Huber \u0026amp; Chandrasekaran, \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2006\u003c/span\u003e; Whitfield et al., \u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). 
The encoding processes responsible for the increase in vocal loudness have not been clarified by functional neuroimaging or computational models, but some hypotheses on the proposed mechanisms will be presented below.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eWhispered Speech (WS)\u003c/h3\u003e\n\u003cp\u003eWS is a widespread mode of communication aiming at conveying a message while remaining discreet. This speech mode is convenient in situations requiring silence (e.g., movie, theatre) or to keep private the content of a message (e.g., telling a secret). The ability to whisper is specific to humans (Tsunoda et al., \u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e2011\u003c/span\u003e) and is characterized by reduced intelligibility and perceptibility for the listener as well as a more effortful production from the speaker’s point of view (Zhang et al., \u003cspan citationid=\"CR56\" class=\"CitationRef\"\u003e2018\u003c/span\u003e). During whispered speech, physiological adjustments are applied to specific muscles of the larynx in order to prevent vocal folds vibration (Konnai et al., \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2017\u003c/span\u003e; Solomon et al., \u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e1989\u003c/span\u003e; Tsunoda et al., \u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e2011\u003c/span\u003e). This absence of phonation provides unique features to WS. Actually, among speech modes, phonetic features of WS characterize it as the most distinct speech mode in comparison to SS (Kelly \u0026amp; Hansen, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e2021\u003c/span\u003e; Zhang et al., \u003cspan citationid=\"CR56\" class=\"CitationRef\"\u003e2018\u003c/span\u003e; Zhang \u0026amp; Hansen, \u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e2007\u003c/span\u003e). 
As with LS, no consensus has been reached in the literature regarding the encoding processes responsible for whispering, leading to several proposed hypotheses.\u003c/p\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003eEncoding of speech modes\u003c/h2\u003e \u003cp\u003eAs anticipated above, two possible hypotheses stem from the literature regarding the encoding processes associated with the production of loud and whispered utterances. On the one hand, behavioral results (e.g., Huber \u0026amp; Chandrasekaran, \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2006\u003c/span\u003e) have led Whitfield et al. (\u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e2021\u003c/span\u003e) to hypothesize that an upregulation in the neuromotor drive is the mechanism at the origin of LS. However, it is unclear when and how this upregulation occurs in the motor speech encoding stage. On the other hand, two hypotheses were formulated based on functional Magnetic Resonance Imaging (fMRI) investigations in order to characterize the brain processes underlying WS. Correia et al. (2020) demonstrated that the fMRI response was greater for voiced speech than for WS in the dorsal laryngeal motor area (dLMA), located in the primary motor cortex (M1). Under this hypothesis, the same brain mechanisms are at play for SS and WS, with larger recruitment for the former. A different hypothesis has been proposed by Tsunoda et al. (\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e2011\u003c/span\u003e) based on a voluntary switching mechanism, in which ordinary speech would be transformed into whispering thanks to functional changes in the frontal lobe. However, their results showed two distinct patterns of brain activation in the frontal lobe involving both increased and decreased brain activation for WS relative to SS, which thus did not clarify how the functional switch would be carried out. 
In summary, the literature on whispered speech offers two distinct accounts of the encoding of WS: one describes a functional difference in a specific motor region responsible for laryngeal control, while the other suggests the involvement of a voluntary functional switching mechanism during the production of whispered utterances. In light of these theoretical propositions, speech modes could be encoded either (1) through neural adjustment of the same brain processes in the motor programming substage, as proposed in the FL model, or (2) through the involvement of an additional mechanism overlaid on regular motor programming encoding processes. This study will explore these two hypotheses using behavioral and electrophysiological contrasts between speech modes and normally phonated speech. Specifically, LS (Experiment 1) and WS (Experiment 2) were compared to SS during a delayed production task of nonsense speech sequences (pseudowords). This paradigm is well suited to isolating motor speech encoding processes from linguistic encoding processes (Laganaro, \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e2019\u003c/span\u003e, \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Piai et al., \u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e2014\u003c/span\u003e). Electroencephalography (EEG)/Event-related potential (ERP) correlates of speech modes and SS will be analyzed during a time window of about 350 ms preceding the vocal onset (hereafter referred to as “response-locked”), corresponding to the motor speech encoding stage. This time window is thus aligned to the vocal onset and analyzed backwards from it. 
In the present study, we will exploit the high temporal resolution provided by EEG to match the fast time scale of speech production processes (den Hollander et al., \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2019\u003c/span\u003e; Laganaro \u0026amp; Perret, \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e2011\u003c/span\u003e; Piai et al., 2015; Verwoert et al., \u003cspan citationid=\"CR52\" class=\"CitationRef\"\u003e2022\u003c/span\u003e). Specifically, we track the temporal dynamics of brain activations in the different experimental conditions via microstate analysis (Michel et al., \u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e2009\u003c/span\u003e; Michel \u0026amp; Murray, \u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e2012\u003c/span\u003e; Murray et al., 2008), which will allow us to investigate whether the encoding of speech modes elicits different brain processes relative to SS or whether the same brain processes are engaged but with different dynamics.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003cdiv id=\"Sec8\" class=\"Section3\"\u003e \u003cdiv id=\"Sec9\" class=\"Section4\"\u003e \u003c/div\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e "},{"header":"Experiment 1 - loud speech","content":"\u003ch2\u003eMethod\u003c/h2\u003e\u003ch2\u003ePopulation\u003c/h2\u003e\u003cp\u003eThirty native French speakers aged 20 to 31 years participated in the experiment. They were all right-handed [Average laterality quotient index = 88.33, range = 60–100] according to the Edinburgh Handedness Inventory (Oldfield, \u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e1971\u003c/span\u003e). None of them had any neurological or motor impairment. Furthermore, participants had normal or corrected-to-normal vision. 
They all agreed to participate and signed the consent form approved by the local ethics committee. They received a small financial compensation for their participation. Six participants were removed due to either low production accuracy (i.e., below 75%), an excessively noisy EEG signal or being considered an outlier in the Ragu software (Koenig et al., \u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e2011b\u003c/span\u003e). As a result, 24 participants (Mean (M) = 23.25 years old, Standard Deviation (SD) = 3.3 years, 5 males) were retained for the analyses.\u003c/p\u003e\u003ch2\u003eMaterial\u003c/h2\u003e\u003cp\u003eThe speech stimuli to be produced consisted of 67 monosyllabic and disyllabic pseudowords (see more details in \u003cspan refid=\"Sec44\" class=\"InternalRef\"\u003eAppendix\u003c/span\u003e A). Pseudowords were selected to avoid any linguistic effect related to words and thus focus on speech production. The pseudowords were composed of phonotactically legal French syllables according to the French database Lexique2 (New et al., \u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e2004\u003c/span\u003e). All the items had the following syllabic structures: C\u003csub\u003e1\u003c/sub\u003eC\u003csub\u003e2\u003c/sub\u003eV\u003csub\u003e1\u003c/sub\u003e – C\u003csub\u003e3\u003c/sub\u003eV\u003csub\u003e2\u003c/sub\u003e for the disyllabic items (e.g., trafa) and C\u003csub\u003e1\u003c/sub\u003eC\u003csub\u003e2\u003c/sub\u003eV\u003csub\u003e1\u003c/sub\u003e for the monosyllabic items (e.g., pra), with C\u003csub\u003e1\u003c/sub\u003e being one of the three following voiceless plosives: /p/, /t/ or /k/.\u003c/p\u003e\u003ch2\u003eProcedure\u003c/h2\u003e\u003cp\u003eThe experiment took place in a soundproof room in which participants were seated about 70 cm from the computer screen. 
The software E-Prime 3.0 (Psychology Software Tools, Pittsburgh, PA) was used to present the stimuli in several experimental blocks and to record participants’ productions.\u003c/p\u003e\u003cp\u003eParticipants performed a delayed production task (see Fig.\u0026nbsp; 1), in which they were asked to prepare a speech sequence based on a written pseudoword and to produce it aloud when a cue (here a question mark) appeared on the screen. Each trial displayed in succession a fixation cross (350 ms), a pseudoword written in white at the center of a black screen (1200 ms), ellipsis points indicating a variable waiting delay (either 1300 or 1600 ms) and finally a yellow question mark (1700 ms). The question mark was the cue for participants to produce the previously presented pseudoword as quickly and as accurately as possible. In some cases, yellow ellipsis points appeared on the screen instead of the question mark, indicating that no production was expected. These “no-go” trials, although not analyzed, were included to maintain participants’ attention and to avoid anticipatory responses. On average, a no-go item appeared approximately once every nine trials.\u003c/p\u003e\u003cp\u003e Before the beginning of the experiment, participants read aloud a list containing all the stimuli to ensure they pronounced them correctly. Halfway through, they were asked to produce the rest of the pseudowords by adopting a loud speech mode. In cases of incorrect pronunciation, they were first corrected and then asked to produce the pseudoword in its correct form. A short training session with five pseudowords produced normally (three go and two no-go trials) preceded the experiment to ensure that the participants were comfortable with the experimental procedure. 
The experiment was divided into eight experimental blocks, including four blocks of standard speech (SS) and four blocks of loud speech (LS), presented in alternation (see Fig.\u0026nbsp;1 right panel). Each block contained between 48 and 50 stimuli with both pseudowords and no-go items. Across the eight blocks, participants produced the same 180 pseudowords in each condition. Before each SS block, participants were asked to speak as usual. Before each LS block, participants were instructed to speak louder than usual, aiming at being heard from outside the soundproof room. To ensure that participants produced utterances that were loud enough during loud blocks, intensity was checked by the experimenters on a sound level meter which was hidden from the participants. Half of the participants started the experiment with a block in SS (order 1) and the other half with LS (order 2). Short self-paced breaks were given to the participants between blocks. Four lists of the 360 pseudowords were created and were randomly assigned to the participants to avoid order effects. The speech productions were recorded for off-line accuracy (ACC) check and extraction of vocal onsets (or reaction times, RT).\u003c/p\u003e\u003ch2\u003eBehavioral analyses\u003c/h2\u003e\u003cp\u003eIntensity was extracted and analyzed with the Praat software (Boersma \u0026amp; Van Heuven, \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2001\u003c/span\u003e). Speech intensity of all SS productions was averaged for each participant to establish an individual cut-off threshold. Loud utterances whose intensity did not exceed the participant’s mean SS intensity by more than 8 dB were removed from the analyses. RT and ACC were extracted off-line through listening and visual inspection of the individual audio files with the Checkvocal software (Protopapas, \u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e2007\u003c/span\u003e). 
Incomplete (e.g., /kRat/ instead of /kRati/), uncertain (e.g., /kr/…/kRata/) and incorrectly produced pseudowords (e.g., /kRotu/ instead of /kRutu/), as well as productions that did not correspond to the target speech mode, were considered erroneous and were thus removed from the analyses. The vocal onset of each pseudoword was identified by aligning to the plosion bar produced by C\u003csub\u003e1\u003c/sub\u003e. Two judges (i.e., the first and third authors) listened to the entire speech dataset, resulting in an inter-judge agreement of 98% for ACC and 91% for RT. As a cleaning procedure, RTs deviating by more than 2.5 SD from the mean production latency per participant and per condition were removed. As ACC was not part of our hypotheses, this metric will be used as a descriptive statistic. The behavioral results on RT were analyzed using the mixed model approach (Bates et al., \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2014\u003c/span\u003e; Carson \u0026amp; Beeson, \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2013\u003c/span\u003e) with the R software (R Core Team, \u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). We compared multiple nested models that were built up by adding one effect at a time. The best model (see \u003cspan refid=\"Sec44\" class=\"InternalRef\"\u003eAppendix\u003c/span\u003e B) contained RT as the dependent variable; speech mode (SS and LS), order of the experimental blocks (loud first or standard first) and length (monosyllabic or disyllabic) as fixed effects; and subjects and items as random effects. Interaction effects between speech mode and order of experimental blocks were also tested in the model.\u003c/p\u003e\u003ch2\u003eEEG Recording and Preprocessing\u003c/h2\u003e\u003cp\u003eThe electrophysiological data was recorded continuously during the experiment with high density EEG using the Active-Two Biosemi EEG system (Biosemi V.O.F. 
Amsterdam, Netherlands) including 128 electrodes on the scalp with a sampling rate fixed at 512 Hz. All the preprocessing steps, including DC removal, filtering at 0.2 Hz (high pass) and 30 Hz (low pass), and notch filtering at 50 Hz to remove line current artifact, were done with the Cartool software (Brunet et al., \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2011\u003c/span\u003e). Each trial was inspected visually and excluded from the averaging if it was contaminated by any artifact (e.g., blinks, eye movements or noise). After visual inspection, epochs were extracted, matched in number across conditions and averaged per participant. Problematic electrodes were interpolated for each participant using 3-D splines interpolation (Perrin et al., \u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e1987\u003c/span\u003e), with the same electrodes interpolated across the two uttering conditions. On average, 15.5 electrodes (range: 6–23) were interpolated per participant. Average reference was applied to the EEG data after interpolation. Finally, we applied a spatial filter as the last step of the preprocessing procedure (see more details in Michel \u0026amp; Brunet, \u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). Response-locked epochs (i.e., aligned to the vocal onset) were extracted backwards with a time window of 175 time frames (TF; i.e., 342 ms). Epoch duration was selected based on the two reviews from Laganaro (\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e2019\u003c/span\u003e, \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). 
In the latter, it is suggested that motor speech encoding processes would take up to 300 ms of the planning time rather than the 145 ms proposed in the review of Indefrey (\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e2011\u003c/span\u003e).\u003c/p\u003e\u003ch2\u003eWaveform analysis\u003c/h2\u003e\u003cp\u003eElectrode amplitudes were compared between SS and LS with a mass univariate approach at each electrode and time point. This analysis was computed in the R software with the “threshold-free cluster enhancement” (TFCE) approach (Smith \u0026amp; Nichols, \u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e2009\u003c/span\u003e) using the permuco4brain R package (Frossard \u0026amp; Renaud, \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). This test provides strong control of the family-wise type I error rate. The analysis is based on 5000 permutations for a repeated-measures ANOVA.\u003c/p\u003e\u003ch2\u003eTopography Consistency Test (TCT)\u003c/h2\u003e\u003cp\u003eThe TCT (Koenig \u0026amp; Melie-García, \u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e2010\u003c/span\u003e) aims to disentangle electrical sources from noise in the ERP data with simple randomization techniques. In other words, this test tries to determine whether the same brain topographies are obtained for a specific event with repeated measurements. Here, by using the Global Field Power (GFP) of ERPs averaged at the level of participants, the TCT assesses the topographic consistency of the signal throughout the entire time window. 
This test was computed with the Ragu software (Koenig et al., \u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e2011b\u003c/span\u003e) before performing the topographic and microstate (spatio-temporal segmentation) analyses.\u003c/p\u003e\u003ch2\u003eTopographic ANOVA (TANOVA) analysis\u003c/h2\u003e\u003cp\u003eBy computing an index of dissimilarity, the TANOVA uses a non-parametric randomization test to determine at which time points ERP topographies (i.e., the spatial distribution of the electric signal at the scalp at a specific time point) significantly differ across conditions (Koenig et al., \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e2011a\u003c/span\u003e; Murray et al., 2008). The TANOVA analysis is complementary to spatio-temporal segmentation (see next analysis). Indeed, the index of dissimilarity and the GFP exploit the topographies and the response strength, respectively, meaning that they can be measured and analyzed orthogonally (Murray et al., 2008b). A minimal duration threshold for significance can be calculated to control for the possible presence of false positives resulting from the time-point-by-time-point dissimilarity analysis (Koenig et al., \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e2011a\u003c/span\u003e).\u003c/p\u003e\u003ch2\u003eMicrostates (spatio-temporal segmentation) analysis\u003c/h2\u003e\u003cp\u003eThe spatiotemporal segmentation of ERPs, or microstate analysis, is a two-step procedure that aims to represent conditions with several prototypical topographies, or microstate maps, corresponding to periods of quasi-stable spatial distribution of the electrophysiological signal on the scalp. 
This type of analysis relies on the GFP to decompose the signal into clusters of stable periods (60–120 ms) of electrophysiological activity (Koenig et al., \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e2014\u003c/span\u003e; Michel \u0026amp; Koenig, \u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e2018\u003c/span\u003e; Skrandies, \u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e1990\u003c/span\u003e). First, cluster maps are extracted from the ERP conditions and are referred to as “template maps” (Michel \u0026amp; Koenig, \u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e2018\u003c/span\u003e). Second, these template maps are fitted to participants’ individual signals for each condition in order to extract relevant parameters (Michel \u0026amp; Koenig, \u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e2018\u003c/span\u003e). These parameters represent several metrics of interest to describe EEG topographies: strength, timing and spatial distribution. In the present study, statistical analyses were carried out on one temporal parameter (duration, DUR) and one global measure of occurrence (area under the curve, AUC).\u003c/p\u003e\n\u003ch3\u003eResults\u003c/h3\u003e\n\u003cdiv id=\"Sec19\" class=\"Section2\"\u003e \u003ch2\u003eBehavioral results\u003c/h2\u003e \u003cp\u003eThe average intensity of production was 64.60 dB for loud utterances and 51.47 dB for standard utterances. The mean intensity difference was 12.73 dB (SD\u0026thinsp;=\u0026thinsp;2.95; minimum (Min)\u0026thinsp;=\u0026thinsp;9.46; maximum (Max)\u0026thinsp;=\u0026thinsp;22.94), computed over 148 loud trials on average. Participants produced the pseudowords with high overall accuracy: LS utterances were produced with a 97% accuracy rate (SD\u0026thinsp;=\u0026thinsp;16) and SS utterances with 96% (SD\u0026thinsp;=\u0026thinsp;19). 
The mean production latencies for LS and SS were 593.22 ms (SD\u0026thinsp;=\u0026thinsp;118.41 ms) and 579.74 ms (SD\u0026thinsp;=\u0026thinsp;124.28 ms), respectively. The best nested linear mixed model (see \u003cspan refid=\"Sec44\" class=\"InternalRef\"\u003eAppendix\u003c/span\u003e B for details) revealed a significant main effect of speech mode (t(7365)\u0026thinsp;=\u0026thinsp;4.85, β\u0026thinsp;=\u0026thinsp;15.99, standard error (SE)\u0026thinsp;=\u0026thinsp;3.295, p\u0026thinsp;\u0026lt;\u0026thinsp;.01), with LS yielding longer RTs than SS. Furthermore, a significant interaction was observed between speech mode and the type of experimental block with which participants started the experiment (t(7363.254) = -2.764, SE\u0026thinsp;=\u0026thinsp;4.74, p\u0026thinsp;=\u0026thinsp;.005). Specifically, post-hoc analyses using a Tukey test showed that participants who started with an LS experimental block needed an additional initialization time of 16 ms to produce LS utterances (z = -4.852, SE\u0026thinsp;=\u0026thinsp;3.3, p\u0026thinsp;\u0026lt;\u0026thinsp;\u003cem\u003e.001\u003c/em\u003e). In contrast, the estimated difference between LS and SS was not significant for participants who started with an SS experimental block (z = -0.851, SE\u0026thinsp;=\u0026thinsp;3.4, p\u0026thinsp;=\u0026thinsp;\u003cem\u003e.40\u003c/em\u003e).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec20\" class=\"Section2\"\u003e \u003ch2\u003eERP results\u003c/h2\u003e \u003cdiv id=\"Sec21\" class=\"Section3\"\u003e \u003ch2\u003eTCT\u003c/h2\u003e \u003cp\u003eResponse-locked ERPs across conditions showed overall topographic consistency throughout the whole time period (see \u003cspan refid=\"Sec44\" class=\"InternalRef\"\u003eAppendix\u003c/span\u003e D). 
Therefore, the whole resulting signal from the response-locked ERPs was kept for the following analyses.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec22\" class=\"Section2\"\u003e \u003ch2\u003eWaveform analysis\u003c/h2\u003e \u003cp\u003eThe results of the test distribution of the TFCE procedure are presented in Fig.\u0026nbsp;2A. Amplitude differences in response-locked ERPs were observed in three time periods. From approximately \u0026minus;\u0026thinsp;127 ms to the vocal onset (i.e., time 0), a cluster of 14 neighboring electrodes in the central-anterior region was found with lower values for the loud condition. In the time period from \u0026minus;\u0026thinsp;200 ms to \u0026minus;\u0026thinsp;128 ms preceding the vocal onset, amplitudes differed on three small clusters: seven left anterior electrodes, four right anterior electrodes and five right central-parietal neighboring electrodes, which tended to be more negative parietally and more positive anteriorly for the loud condition. Additionally, diverging amplitudes were observed on some sparse channels in the \u0026minus;\u0026thinsp;340 ms to \u0026minus;\u0026thinsp;260 ms time period preceding the vocal onset. Here, electrodes that significantly differed between the two uttering conditions had lower values for the loud condition.\u003c/p\u003e \u003cdiv id=\"Sec23\" class=\"Section3\"\u003e \u003ch2\u003eTANOVA\u003c/h2\u003e \u003cp\u003eResults of the pairwise TANOVA between LS and SS are presented in Fig.\u0026nbsp;2C. Green time periods correspond to significant time periods (p\u0026thinsp;\u0026lt;\u0026thinsp;.05). 
Among the three significant periods presented, only one (i.e., from \u0026minus;\u0026thinsp;225 ms to \u0026minus;\u0026thinsp;100 ms) lasted longer than the duration threshold of 113.1 ms.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec24\" class=\"Section2\"\u003e \u003ch2\u003eMicrostates analysis\u003c/h2\u003e \u003cp\u003eThree clusters were obtained for the response-locked spatio-temporal segmentation analysis, with 97.81% of explained variance (labelled from \u0026ldquo;A\u0026rdquo; to \u0026ldquo;C\u0026rdquo; in Fig.\u0026nbsp;2D). To statistically assess the global strength and duration of each map in the participants\u0026rsquo; ERPs, the template maps were fitted to participants\u0026rsquo; individual ERPs (see details in \u003cspan refid=\"Sec44\" class=\"InternalRef\"\u003eAppendix\u003c/span\u003e E).\u003c/p\u003e \u003cp\u003eA non-parametric Friedman test (Cleophas et al., \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2016\u003c/span\u003e) demonstrated that map A differed significantly across conditions on the two parameters of interest (AUC: X\u003csup\u003e2\u003c/sup\u003e(1)\u0026thinsp;=\u0026thinsp;8.167, p\u0026thinsp;=\u0026thinsp;.004; DUR: X\u003csup\u003e2\u003c/sup\u003e(1)\u0026thinsp;=\u0026thinsp;9.783, p\u0026thinsp;=\u0026thinsp;.004). In particular, map A lasted longer and had a higher AUC value in the SS condition than in the loud condition. 
On the other hand, the microstate map C significantly differed across conditions, with the loud condition entailing a higher AUC value (X\u003csup\u003e2\u003c/sup\u003e(1)\u0026thinsp;=\u0026thinsp;4.55, p\u0026thinsp;=\u0026thinsp;.033) compared to the standard condition.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eDiscussion\u003c/h3\u003e\n\u003cp\u003eIn the present experiment, we investigated the behavioral and electrophysiological signatures associated with producing an utterance when speaking normally or louder in a delayed production task. In the following, we will first discuss the longer RTs associated with the production of loud utterances before unravelling the electrophysiological correlates of this speech mode in response-locked ERPs. The behavioral results replicate previous findings suggesting that producing loud utterances entails longer latencies in comparison to standard speech utterances (e.g., Zhang \u0026amp; Hansen, \u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e2007\u003c/span\u003e; Bourqui et al., submitted). Here, 16 additional milliseconds were needed for initializing loud utterances. This difference is smaller than the 34 ms difference reported in Bourqui et al. (submitted) but confirms that loud speech entails a behavioral encoding cost. However, this difference in RT depends on the mode with which participants started the experiment. This result will be discussed in the general discussion in the light of the second experiment.\u003c/p\u003e \u003cp\u003eIn the EEG/ERP signals, LS and SS differed in amplitudes, TANOVA and spatiotemporal segmentation in a large time window (i.e., from approximately −220 ms to −100 ms before the vocal onset). 
This time-window falls within the time-window encompassing the last 300 ms preceding the vocal onset that has been associated to the motor speech encoding processes according to previous estimates (Laganaro, \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e2019\u003c/span\u003e, \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). Especially, producing loud utterances induces modulation of the waveform amplitudes during several time periods throughout the whole response-locked ERP, with larger amplitudes for LS in particular in the last 150 ms (see Fig.\u0026nbsp;2). Microstates are defined as periods with stable topographic representations suggesting quasi-simultaneity of activity among the brain regions involved in large-scale networks (Michel \u0026amp; Koenig, \u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e2018\u003c/span\u003e). Therefore, as the spatio-temporal segmentation and the fitting yielded the same sequence of microstates across conditions, we can claim that the production of loud and standard utterances is supported by the same brain networks and thus identical motor speech encoding processes. Although identical brain networks were found in both conditions, differences in temporal dynamics and strength of neural recruitment were revealed for maps A and C. In particular, the map A lasted longer and had a higher AUC value for the standard condition. In turn, map C demonstrated higher AUC value for the loud condition. In other words, these results suggest that to produce a louder utterance, the same neural processes as SS are recruited but with a difference in temporal dynamics and in the strength of recruitment. 
In particular, these changes in the temporal dynamics occur at two distant time periods corresponding to two different microstates (i.e., close to and far from the vocal onset), meaning that producing loud utterances probably involves more than just parametrization of muscle commands as proposed by Van der Merwe (2021). Before going further in the interpretation of the results obtained in this experiment, we will first investigate the behavioral and electrophysiological signatures of another speech mode with distinct phonatory and articulatory properties.\u003c/p\u003e "},{"header":"Experiment 2 - Whispered speech","content":"\u003ch2\u003eMethod\u003c/h2\u003e\u003ch2\u003eParticipants\u003c/h2\u003e\u003cp\u003eA different sample of 24 right-handed [Average laterality quotient index = 90.83, range = 60–100] neurotypical French speakers (M = 24.03 years, SD = 3.3 years, 10 men) fulfilling the same criteria as in Experiment 1 was recruited to perform the task.\u003c/p\u003e\u003ch2\u003eMaterial\u003c/h2\u003e\u003cp\u003eThe material was identical to experiment 1.\u003c/p\u003e\n\u003ch3\u003eProcedure\u003c/h3\u003e\n\u003cp\u003eThe experimental procedure was similar to experiment 1, except that this time WS was the speech mode contrasted with SS. During the training session, participants were instructed to speak without vibrating their vocal folds. Participants who struggled to whisper completed several extra items while focusing on not feeling vibration in their vocal apparatus. As in experiment 1, there were 8 blocks of approximately 50 stimuli each, 4 performed in whispered mode and 4 in standard mode in a counterbalanced order across participants.\u003c/p\u003e \u003cdiv id=\"Sec31\" class=\"Section2\"\u003e \u003ch2\u003eAnalysis\u003c/h2\u003e \u003cp\u003eDuring the offline extraction of ACC and RT, the intensity of WS was increased to 70 dB to facilitate the alignment on the vocal onset using the Praat software. 
The inter-judge agreement for ACC and RT was 94% and 90%, respectively. On the behavioral level, the mixed model contained the same variables as in the first experiment. On the EEG/ERP level, the length of the response-locked ERPs across conditions was 342 ms (i.e., 175 TF), as in the first experiment.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eResults\u003c/h3\u003e\n\u003cdiv id=\"Sec33\" class=\"Section2\"\u003e \u003ch2\u003eBehavioral results\u003c/h2\u003e \u003cp\u003eHigh accuracy was obtained for both standard (M = 93.79%) and whispered utterances (M = 92.58%). Pseudowords in SS were produced with an average latency of 636.70 ms (SD = 167.95) while whispered pseudowords required 652.91 ms (SD = 164.54) to be initialized. The linear mixed model retained (see \u003cspan refid=\"Sec44\" class=\"InternalRef\"\u003eAppendix\u003c/span\u003e C for more details) showed a main effect of the stimuli’s length, with monosyllabic items yielding longer initialization time than disyllabic stimuli (F (7860) = 5.168, β = 6.410, \u003cem\u003ep =\u003c/em\u003e .023), and a significant interaction effect between the speech mode and the experimental block with which the participant started the experiment. The post-hoc comparison with the Tukey test demonstrated that participants who started with a WS block had a significantly longer initialization time, by 33.94 ms (z= -9.013, SE = 3.77, p \u0026lt; .001), for whispered utterances than for standard utterances. 
By contrast, no difference was observed across conditions in participants starting with an SS block (z = 0.367, SE = 3.74, p = .71).\u003c/p\u003e \u003cdiv id=\"Sec34\" class=\"Section3\"\u003e \u003ch2\u003eERP results\u003c/h2\u003e \u003cdiv id=\"Sec35\" class=\"Section4\"\u003e \u003ch2\u003eTCT\u003c/h2\u003e \u003cp\u003eWS and SS response-locked ERPs did not contain any topographic inconsistency (see \u003cspan refid=\"Sec44\" class=\"InternalRef\"\u003eAppendix\u003c/span\u003e D).\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003c/div\u003e\n\u003ch3\u003eWaveform analysis\u003c/h3\u003e\n\u003cp\u003eThe TFCE test comparing ERP amplitudes indicated no differences across conditions in the response-locked ERPs (see Fig.\u0026nbsp;4a).\u003c/p\u003e \u003cdiv id=\"Sec37\" class=\"Section2\"\u003e \u003ch2\u003eTANOVA\u003c/h2\u003e \u003cp\u003eResponse-locked TANOVA yielded two small time periods of significant difference across conditions (i.e., green time periods in Fig.\u0026nbsp;3c), which did not exceed the duration threshold of 52.65 ms.\u003c/p\u003e \u003cdiv id=\"Sec38\" class=\"Section3\"\u003e \u003ch2\u003eMicrostates analysis\u003c/h2\u003e \u003cp\u003eThe spatio-temporal segmentation of response-locked ERPs yielded three microstate maps with a global explained variance (GEV) of 97.34%. Template maps were fitted to individual ERPs to perform two statistical analyses. As in Experiment 1, we assessed the AUC and the DUR parameters (see details in \u003cspan refid=\"Sec44\" class=\"InternalRef\"\u003eAppendix\u003c/span\u003e F). 
The non-parametric Friedman test demonstrated that neither the AUC nor the DUR differed across conditions on any of the microstate maps.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003c/div\u003e\n\u003ch3\u003eDiscussion\u003c/h3\u003e\n\u003cp\u003eIn this second experiment, we contrasted the production of WS and SS utterances with the same pipeline as in the first experiment. In the present case, WS utterances had a production latency 33.94 ms longer than SS utterances. Although only observed in participants starting with a WS block, this result replicates previous findings (see Zhang \u0026amp; Hansen, 2007; Bourqui et al., submitted) and differences across conditions are even larger than in those studies. As in experiment 1, the encoding cost observed in the behavioral results cannot be generalized as it depended on the first experimental block and will be further discussed in the general discussion. On the electrophysiological level, the same microstate sequences were observed in SS and WS, with only minor differences in the TANOVA analysis. These differences, however, did not exceed the duration threshold, meaning that they could be false positives. The spatio-temporal segmentation and the fitting in the individuals further confirmed the absence of topographic differences across conditions. On the whole, it seems that the electrical brain activity underlying the production of whispered utterances does not clearly differ from that underlying the production of standard utterances. The same finding has been reported previously on a smaller group of participants (Sikdar et al., \u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e2017\u003c/span\u003e), suggesting that, on the electrophysiological level, whispering and speaking normally are similar in nature. 
Although coherent with a previous study, these results raise several questions that will be discussed in depth in the light of the first experiment.\u003c/p\u003e "},{"header":"General discussion","content":"\u003cp\u003e In this study, we investigated the behavioral and electrophysiological signatures of encoding the production of two distinct speech modes (i.e., loud speech (LS) and whispered speech (WS)) relative to standard speech (SS). Since the same procedure and material were used for both speech modes, we can broadly comment on the discrepancies and similarities across experiments. In the two experiments, behavioral results demonstrated longer initialization times for non-standard speech mode, with the result driven by participants that started the experiment with a block in the non-standard condition. This intriguing result may be interpreted as a “novelty bias” as speakers are not accustomed to speaking louder and whispering over such a long period of time. As a result, participants are maybe less familiar with the task and this behavioural encoding cost would thus need further investigation with another experimental plan. Current behavioral results however converge with previous studies using different paradigms and showing a cost of encoding non-standard speech modes (Zhang \u0026amp; Hansen, \u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e2007\u003c/span\u003e; Bourqui et al., submitted). Some authors have proposed that different speech modes with specific phonatory and articulatory features would involve unique encoding processes in comparison to standard speech (Scott, 2022; Zhang \u0026amp; Hansen, \u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e2007\u003c/span\u003e). 
Comparing the electrophysiological results across our two experiments suggests that speech modes cannot be grouped as a whole entity encoded in the motor programming stage (i.e., last encoding process preceding articulation) as suggested in the FL model (Van der Merwe, 2021). Indeed, EEG/ERP results do not converge, as LS seems to entail important electrophysiological modulations while WS electrophysiological activity is very close to SS, a null result that has been reported previously (Sikdar et al., \u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e2017\u003c/span\u003e). In particular, LS electrophysiological activity differed in several time periods that seem to extend beyond the programming time-window, one close to and one quite distant from the vocal onset. Our interpretation is that only the significant difference in strength of the last microstate preceding the vocal onset (map C in Fig.\u0026nbsp;2) could be considered as the “increase in neuromotor drive” proposed in the study of Whitfield (2021). The present results thus support previous proposals by providing neuroimaging data indicating that speaking loudly entails changes in temporal dynamics and an increase in brain activation during motor encoding. They also replicate the finding from Sikdar et al. (\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e2017\u003c/span\u003e) showing that WS and SS are similar on the electrophysiological level. In this particular case, the microstate results argue against the idea that an additional mechanism is responsible for producing whispered utterances as proposed in Tsunoda et al. (\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e2011\u003c/span\u003e). Indeed, the same microstate maps, and hence the same encoding processes, were found for both WS and SS. Moreover, the dynamics of brain activation underlying these processes did not differ across conditions. 
In the light of the electrophysiological data, whispering cannot be distinguished from speaking normally and thus the literature should perhaps adopt a more nuanced approach to understand and characterize this mode.\u003c/p\u003e\u003cp\u003eMoreover, while the time-window of ERP modulations for LS seems to encompass a large portion of the time-window associated with motor speech encoding (likely planning and programming in the FL model), the present results can also be related to the neurocomputational framework of speech production and acquisition from Guenther (\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2016\u003c/span\u003e), the Directions Into Velocities of Articulators (DIVA) model. Indeed, although there is no input so far on the dynamics of brain activation in the latter (Tourville \u0026amp; Guenther, \u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e2011\u003c/span\u003e), the present findings challenge our understanding of the feedforward control system. If one assumes that the speech sound map (SSM) corresponds to the motor planning in the FL model (i.e., where motor plans are retrieved) while the Articulatory Map corresponds to the motor programming stage (i.e., where spatiotemporal and force dimensions are specified), our outcomes suggest that LS could be encoded all along the process of activating the cells in the SSM and transmitting the motor targets to the Initiation Map and the Articulatory Map. Investigating speech modes thus seems to offer future studies an interesting window onto the intricate interplay between the functional units in the feedforward control system.\u003c/p\u003e"},{"header":"Limitations","content":"\u003cp\u003eSome methodological concerns could be considered for future studies investigating speech modes, especially WS. 
On the one hand, it has been suggested that there is substantial intra-speaker and inter-speaker variability in the production of whispered utterances (Konnai et al., \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2017\u003c/span\u003e). On the other hand, despite giving the same instructions to every participant, we did not control for the type of whisper or the way participants actually whispered. Indeed, Solomon et al. (\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e1989\u003c/span\u003e) proposed that there are two types of whispering: a quiet whisper (i.e., low effort manner) and a loud whisper (i.e., high effort manner), which were not controlled in this experiment. In brief, analysis of WS data implies several methodological challenges due to the inherent phonatory and articulatory properties of this mode of production.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eIn the present study, we conducted two experiments which demonstrated that producing utterances under specific speech modes entails a behavioral encoding cost (i.e., increased production latency), although this result may need to be confirmed with a different experimental design. The EEG/ERP results demonstrated that speech modes with distinct phonatory and articulatory features cannot be grouped as a global entity and entail adjustments in different time-windows corresponding to different brain networks. Indeed, the electrophysiological signatures of the two speech modes of interest differed, with loud utterances entailing changes in ERP signal in two mental processes, one close to and one farther away from the vocal onset, while the ERP signal associated with whispered utterances did not differ significantly from that of standard utterances. These findings have important consequences as they challenge the current conceptualization of speech modes. 
Indeed, this study clarifies the statement \u0026ldquo;each speech mode possesses its own encoding mechanism\u0026rdquo; (Zhang \u0026amp; Hansen, \u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e2007\u003c/span\u003e; Scott 2022). Speech modes seem to be produced through the same brain networks as standard production but with a continuum of changes concerning the temporal dynamics and the strength of recruitment. These changes are observed in the whole motor speech (phonetic) encoding stage and they can go from important (in this case for loud utterances) to almost inexistent (in this instance for whispered utterances).\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eCRediT authorship contribution statement\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003cstrong\u003eBryan Sanders\u003c/strong\u003e (Conceptualization, Data curation, Formal analysis, Investigation, Visualization, Writing \u0026ndash; original draft), \u003cstrong\u003eMonica Lancheros\u003c/strong\u003e (Investigation, Supervision, Validation, Writing \u0026ndash; review \u0026amp; editing), \u003cstrong\u003eMarion Bourqui\u003c/strong\u003e (Supervision, Validation, Writing \u0026ndash; review \u0026amp; editing), \u003cstrong\u003eMarina Laganaro\u003c/strong\u003e (Conceptualization, Funding acquisition, Project administration, Resources, Supervision, Validation, Writing \u0026ndash; review \u0026amp; editing).\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis work was supported by the \u003cem\u003eSwiss National Science Foundation (\u003c/em\u003eGrant number: \u003cem\u003eCRSII5_202228\u003c/em\u003e). 
The funders play no role in study design; in the collection, analysis, and interpretation of data, in the writing of the report, and in the decision to submit the article for publication.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003cem\u003eConflict of interest statement\u003c/em\u003e: The authors have no financial or proprietary interests in any material discussed in this article.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eBates, D., M\u0026auml;chler, M., Bolker, B. M., \u0026amp; Walker, S. C. (2014). Fitting Linear Mixed-Effects Models using lme4. \u003cem\u003eJournal of Statistical Software\u003c/em\u003e, \u003cem\u003e67\u003c/em\u003e(1). https://doi.org/10.18637/jss.v067.i01\u003c/li\u003e\n\u003cli\u003eBoersma, P., \u0026amp; Van Heuven, V. (2001). Speak and unSpeak with PRAAT. \u003cem\u003eGlot International\u003c/em\u003e, \u003cem\u003e5\u003c/em\u003e(9/10), 341-347.\u003c/li\u003e\n\u003cli\u003eBohland, J. W., Bullock, D., \u0026amp; Guenther, F. H. (2010). Neural representations and mechanisms for the performance of simple speech sequences. \u003cem\u003eJournal of Cognitive Neuroscience\u003c/em\u003e, \u003cem\u003e22\u003c/em\u003e(7), 1504\u0026ndash;1529. https://doi.org/10.1162/JOCN.2009.21306\u003c/li\u003e\n\u003cli\u003eBrunet, D., Murray, M. M., \u0026amp; Michel, C. M. (2011). Spatiotemporal analysis of multichannel EEG: CARTOOL. \u003cem\u003eComputational Intelligence and Neuroscience\u003c/em\u003e, \u003cem\u003e2011\u003c/em\u003e. https://doi.org/10.1155/2011/813870\u003c/li\u003e\n\u003cli\u003eCarson, R. J., \u0026amp; Beeson, C. M. L. (2013). Crossing Language Barriers: Using Crossed Random Effects Modelling in Psycholinguistics Research. \u003cem\u003eTutorials in Quantitative Methods for Psychology\u003c/em\u003e, \u003cem\u003e9\u003c/em\u003e(1), 25\u0026ndash;41.\u003c/li\u003e\n\u003cli\u003eCleophas, T. J., Zwinderman, A. H., Cleophas, T. J., \u0026amp; Zwinderman, A. H. 
(2016). Non-parametric tests for three or more samples (Friedman and Kruskal-Wallis). \u003cem\u003eClinical data analysis on a pocket calculator: understanding the scientific methods of statistical reasoning and hypothesis testing\u003c/em\u003e, 193-197\u003c/li\u003e\n\u003cli\u003eden Hollander, J., Jonkers, R., Mari\u0026euml;n, P., \u0026amp; Bastiaanse, R. (2019). Identifying the Speech Production Stages in Early and Late Adulthood by Using Electroencephalography. \u003cem\u003eFrontiers in Human Neuroscience\u003c/em\u003e, \u003cem\u003e13\u003c/em\u003e, 298. https://doi.org/10.3389/FNHUM.2019.00298/BIBTEX\u003c/li\u003e\n\u003cli\u003eDromey, C., \u0026amp; Ramig, L. O. (1998). Intentional Changes in Sound Pressure Level and Rate. \u003cem\u003eJournal of Speech, Language, and Hearing Research\u003c/em\u003e, \u003cem\u003e41\u003c/em\u003e(5), 1003\u0026ndash;1018. https://doi.org/10.1044/JSLHR.4105.1003\u003c/li\u003e\n\u003cli\u003eFrossard, J., \u0026amp; Renaud, O. (2021). Permutation tests for regression, ANOVA, and comparison of signals: the permuco package. \u003cem\u003eJournal of Statistical Software\u003c/em\u003e, \u003cem\u003e99\u003c/em\u003e, 1-32.\u003c/li\u003e\n\u003cli\u003eGuenther, F. H. (1994). A neural network model of speech acquisition and motor equivalent speech production. \u003cem\u003eBiological Cybernetics 1994 72:1\u003c/em\u003e, \u003cem\u003e72\u003c/em\u003e(1), 43\u0026ndash;53. https://doi.org/10.1007/BF00206237\u003c/li\u003e\n\u003cli\u003eGuenther, F. H. (2016). Neural Control of Speech. In \u003cem\u003eNeural Control of Speech\u003c/em\u003e. The MIT Press. https://doi.org/10.7551/MITPRESS/10471.001.0001\u003c/li\u003e\n\u003cli\u003eGuenther, F. H., \u0026amp; Vladusich, T. (2012). A Neural Theory of Speech Acquisition and Production. \u003cem\u003eJournal of Neurolinguistics\u003c/em\u003e, \u003cem\u003e25\u003c/em\u003e(5), 408\u0026ndash;422. 
https://doi.org/10.1016/J.JNEUROLING.2009.08.006\u003c/li\u003e\n\u003cli\u003eHuber, J. E., \u0026amp; Chandrasekaran, B. (2006). Effects of Increasing Sound Pressure Level on Lip and Jaw Movement Parameters and Consistency in Young Adults. \u003cem\u003eJournal of Speech, Language, and Hearing Research\u003c/em\u003e, \u003cem\u003e49\u003c/em\u003e(6), 1368\u0026ndash;1379. https://doi.org/10.1044/1092-4388(2006/098)\u003c/li\u003e\n\u003cli\u003eIndefrey, P. (2011). The spatial and temporal signatures of word production components: A critical update. \u003cem\u003eFrontiers in Psychology\u003c/em\u003e, \u003cem\u003e2\u003c/em\u003e(OCT), 255. https://doi.org/10.3389/FPSYG.2011.00255/BIBTEX\u003c/li\u003e\n\u003cli\u003eIndefrey, P., \u0026amp; Levelt, W. J. M. (2004). The spatial and temporal signatures of word production components. \u003cem\u003eCognition\u003c/em\u003e, \u003cem\u003e92\u003c/em\u003e(1\u0026ndash;2), 101\u0026ndash;144. https://doi.org/10.1016/J.COGNITION.2002.06.001\u003c/li\u003e\n\u003cli\u003eKelly, F., \u0026amp; Hansen, J. H. L. (2021). Analysis and calibration of lombard effect and whisper for speaker recognition. \u003cem\u003eIEEE/ACM Transactions on Audio Speech and Language Processing\u003c/em\u003e, \u003cem\u003e29\u003c/em\u003e, 927\u0026ndash;942. https://doi.org/10.1109/TASLP.2021.3053388\u003c/li\u003e\n\u003cli\u003eKoenig, T., Kottlow, M., Stein, M., \u0026amp; Melie-Garc\u0026iacute;a, L. (2011a). Ragu. \u003cem\u003eComputational Intelligence and Neuroscience\u003c/em\u003e, \u003cem\u003e2011\u003c/em\u003e, 14. https://doi.org/10.1155/2011/938925\u003c/li\u003e\n\u003cli\u003eKoenig, T., Kottlow, M., Stein, M., \u0026amp; Melie-Garc\u0026iacute;a, L. (2011b). Ragu: a free tool for the analysis of EEG and MEG event-related scalp field data using global randomization statistics. \u003cem\u003eComputational Intelligence and Neuroscience\u003c/em\u003e, \u003cem\u003e2011\u003c/em\u003e. 
https://doi.org/10.1155/2011/938925\u003c/li\u003e\n\u003cli\u003eKoenig, T., \u0026amp; Melie-Garc\u0026iacute;a, L. (2010). A method to determine the presence of averaged event-related fields using randomization tests. \u003cem\u003eBrain Topography\u003c/em\u003e, \u003cem\u003e23\u003c/em\u003e(3), 233\u0026ndash;242. https://doi.org/10.1007/S10548-010-0142-1/FIGURES/5\u003c/li\u003e\n\u003cli\u003eKoenig, T., Stein, M., Grieder, M., \u0026amp; Kottlow, M. (2014). A tutorial on data-driven methods for statistically assessing ERP topographies. \u003cem\u003eBrain Topography\u003c/em\u003e, \u003cem\u003e27\u003c/em\u003e(1), 72\u0026ndash;83. https://doi.org/10.1007/S10548-013-0310-1/FIGURES/7\u003c/li\u003e\n\u003cli\u003eKonnai, R., Scherer, R. C., Peplinski, A., \u0026amp; Ryan, K. (2017). Whisper and Phonation: Aerodynamic Comparisons Across Adduction and Loudness. \u003cem\u003eJournal of Voice\u003c/em\u003e, \u003cem\u003e31\u003c/em\u003e(6), 773.e11-773.e20. https://doi.org/10.1016/J.JVOICE.2017.02.016\u003c/li\u003e\n\u003cli\u003eLaganaro, M. (2019). \u003cem\u003eLanguage, Cognition and Neuroscience Phonetic encoding in utterance production: a review of open issues from 1989 to 2018\u003c/em\u003e. https://doi.org/10.1080/23273798.2019.1599128\u003c/li\u003e\n\u003cli\u003eLaganaro, M. (2023). Time-course of phonetic (motor speech) encoding in utterance production. \u003cem\u003eCognitive Neuropsychology\u003c/em\u003e, \u003cem\u003e40\u003c/em\u003e(5-6), 287-297.\u003c/li\u003e\n\u003cli\u003eLaganaro, M., \u0026amp; Perret, C. (2011). Comparing electrophysiological correlates of word production in immediate and delayed naming through the analysis of word age of acquisition effects. \u003cem\u003eBrain Topography\u003c/em\u003e, \u003cem\u003e24\u003c/em\u003e(1), 19\u0026ndash;29. https://doi.org/10.1007/S10548-010-0162-X/FIGURES/3\u003c/li\u003e\n\u003cli\u003eLenth, R. (2020). emmeans: Estimated Marginal Means, aka Least-Squares Means. 
(R package version 1.4.5). The Comprehensive R Archive Network (CRAN). https://cran.r-project.org/package=emmeans\u003c/li\u003e\n\u003cli\u003eLevelt, W. J. M. (1989). \u003cem\u003eSpeaking: From intention to articulation\u003c/em\u003e. MIT Press. https://www.mpi.nl/publications/item67053/speaking-intention-articulation\u003c/li\u003e\n\u003cli\u003eLevelt, W. J. M., Roelofs, A., \u0026amp; Meyer, A. S. (1999). A theory of lexical access in speech production. \u003cem\u003eBehavioral and Brain Sciences\u003c/em\u003e, \u003cem\u003e22\u003c/em\u003e(1), 1\u0026ndash;38. https://doi.org/10.1017/S0140525X99001776\u003c/li\u003e\n\u003cli\u003eMichel, C. M., \u0026amp; Brunet, D. (2019). EEG source imaging: A practical review of the analysis steps. \u003cem\u003eFrontiers in Neurology\u003c/em\u003e, \u003cem\u003e10\u003c/em\u003e, 325. https://doi.org/10.3389/FNEUR.2019.00325\u003c/li\u003e\n\u003cli\u003eMichel, C. M., \u0026amp; Koenig, T. (2018). EEG microstates as a tool for studying the temporal dynamics of whole-brain neuronal networks: A review. \u003cem\u003eNeuroImage\u003c/em\u003e, \u003cem\u003e180\u003c/em\u003e, 577\u0026ndash;593. https://doi.org/10.1016/J.NEUROIMAGE.2017.11.062\u003c/li\u003e\n\u003cli\u003eMichel, C. M., Koenig, T., Brandeis, D., Gianotti, L. R. R., \u0026amp; Wackermann, J. (Eds.). (2009). \u003cem\u003eElectrical Neuroimaging\u003c/em\u003e. Cambridge University Press.\u003c/li\u003e\n\u003cli\u003eMichel, C. M., \u0026amp; Murray, M. M. (2012). Towards the utilization of EEG as a brain imaging tool. \u003cem\u003eNeuroImage\u003c/em\u003e, \u003cem\u003e61\u003c/em\u003e(2), 371\u0026ndash;385. https://doi.org/10.1016/J.NEUROIMAGE.2011.12.039\u003c/li\u003e\n\u003cli\u003eMiller, H. E., \u0026amp; Guenther, F. H. (2021). Modelling speech motor programming and apraxia of speech in the DIVA/GODIVA neurocomputational framework. 
\u003cem\u003eAphasiology\u003c/em\u003e, \u003cem\u003e35\u003c/em\u003e(4), 424\u0026ndash;441. https://doi.org/10.1080/02687038.2020.1765307\u003c/li\u003e\n\u003cli\u003eMurray, M. M., Brunet, D., \u0026amp; Michel, C. M. (2008a). Topographic ERP analyses: a step-by-step tutorial review. \u003cem\u003eBrain Topography\u003c/em\u003e, \u003cem\u003e20\u003c/em\u003e(4), 249\u0026ndash;264. https://doi.org/10.1007/S10548-008-0054-5\u003c/li\u003e\n\u003cli\u003eNew, B., Pallier, C., Brysbaert, M., \u0026amp; Ferrand, L. (2004). Lexique 2: A new French lexical database. \u003cem\u003eBehavior Research Methods, Instruments, \u0026amp; Computers\u003c/em\u003e, \u003cem\u003e36\u003c/em\u003e(3), 516\u0026ndash;524.\u003c/li\u003e\n\u003cli\u003eOldfield, R. C. (1971). The assessment and analysis of handedness: the Edinburgh inventory. \u003cem\u003eNeuropsychologia\u003c/em\u003e, \u003cem\u003e9\u003c/em\u003e(1), 97\u0026ndash;113.\u003c/li\u003e\n\u003cli\u003ePerrin, F., Pernier, J., Bertrand, O., Giard, M. H., \u0026amp; Echallier, J. F. (1987). Mapping of scalp potentials by surface spline interpolation. \u003cem\u003eElectroencephalography and Clinical Neurophysiology\u003c/em\u003e, \u003cem\u003e66\u003c/em\u003e(1), 75\u0026ndash;81.\u003c/li\u003e\n\u003cli\u003ePerkell, J. S. (2012). Movement goals and feedback and feedforward control mechanisms in speech production. \u003cem\u003eJournal of Neurolinguistics\u003c/em\u003e, \u003cem\u003e25\u003c/em\u003e(5), 382\u0026ndash;407. https://doi.org/10.1016/J.JNEUROLING.2010.02.011\u003c/li\u003e\n\u003cli\u003ePiai, V., Ri\u0026egrave;s, S. K., \u0026amp; Knight, R. T. (2014). The electrophysiology of language production: What could be improved. \u003cem\u003eFrontiers in Psychology\u003c/em\u003e, \u003cem\u003e5\u003c/em\u003e(OCT), 1560. https://doi.org/10.3389/FPSYG.2014.01560\u003c/li\u003e\n\u003cli\u003eProtopapas, A. (2007). 
CheckVocal: A program to facilitate checking the accuracy and response time of vocal responses from DMDX. \u003cem\u003eBehavior Research Methods\u003c/em\u003e, \u003cem\u003e39\u003c/em\u003e(4), 859\u0026ndash;862. https://doi.org/10.3758/BF03192979/METRICS\u003c/li\u003e\n\u003cli\u003eRamoo, D. (2021). \u003cem\u003e9.2 The Standard Model of Speech Production\u003c/em\u003e. BCcampus.\u003c/li\u003e\n\u003cli\u003eR Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/\u003c/li\u003e\n\u003cli\u003eSikdar, D., Roy, R., \u0026amp; Mahadevappa, M. (2017, May). Multifractal analysis of electroencephalogram for human speech modalities. In \u003cem\u003e2017 8th International IEEE/EMBS Conference on Neural Engineering (NER)\u003c/em\u003e (pp. 637-640). IEEE.\u003c/li\u003e\n\u003cli\u003eSkrandies, W. (1990). Global field power and topographic similarity. \u003cem\u003eBrain Topography\u003c/em\u003e, \u003cem\u003e3\u003c/em\u003e(1), 137\u0026ndash;141. https://doi.org/10.1007/BF01128870\u003c/li\u003e\n\u003cli\u003eSmiljanić, R., \u0026amp; Bradlow, A. R. (2009). Speaking and Hearing Clearly: Talker and Listener Factors in Speaking Style Changes. \u003cem\u003eLanguage and Linguistics Compass\u003c/em\u003e, \u003cem\u003e3\u003c/em\u003e(1), 236\u0026ndash;264. https://doi.org/10.1111/J.1749-818X.2008.00112.X\u003c/li\u003e\n\u003cli\u003eSmith, S. M., \u0026amp; Nichols, T. E. (2009). Threshold-free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference. \u003cem\u003eNeuroimage\u003c/em\u003e, \u003cem\u003e44\u003c/em\u003e(1), 83-98.\u003c/li\u003e\n\u003cli\u003eSolomon, N. P., McCall, G. N., Trosset, M. W., \u0026amp; Gray, W. C. (1989). Laryngeal Configuration and Constriction during Two Types of Whispering. 
\u003cem\u003eJournal of Speech and Hearing Research\u003c/em\u003e, \u003cem\u003e32\u003c/em\u003e(1), 161\u0026ndash;174. https://doi.org/10.1044/JSHR.3201.161\u003c/li\u003e\n\u003cli\u003eStrijkers, K., \u0026amp; Costa, A. (2016). The cortical dynamics of speaking: present shortcomings and future avenues. \u003cem\u003eLanguage, Cognition and Neuroscience\u003c/em\u003e, \u003cem\u003e31\u003c/em\u003e(4), 484\u0026ndash;503. https://doi.org/10.1080/23273798.2015.1120878\u003c/li\u003e\n\u003cli\u003eTourville, J. A., \u0026amp; Guenther, F. H. (2011). The DIVA model: A neural theory of speech acquisition and production. \u003cem\u003eLanguage and Cognitive Processes\u003c/em\u003e, \u003cem\u003e26\u003c/em\u003e(7), 952\u0026ndash;981. https://doi.org/10.1080/01690960903498424\u003c/li\u003e\n\u003cli\u003eTsunoda, K., Sekimoto, S., \u0026amp; Baer, T. (2011). An fMRI study of whispering: the role of human evolution in psychological dysphonia. \u003cem\u003eMedical Hypotheses\u003c/em\u003e, \u003cem\u003e77\u003c/em\u003e(1), 112\u0026ndash;115. https://doi.org/10.1016/J.MEHY.2011.03.040\u003c/li\u003e\n\u003cli\u003eVan der Merwe, A. (2009). A theoretical framework for the characterization of pathological speech sensorimotor control. In M. R. McNeil (Ed.), \u003cem\u003eClinical management of sensorimotor speech disorders\u003c/em\u003e (2nd ed., pp. 3\u0026ndash;18).\u003c/li\u003e\n\u003cli\u003eVan Der Merwe, A. (2020). New perspectives on speech motor planning and programming in the context of the four- level model and its implications for understanding the pathophysiology underlying apraxia of speech and other motor speech disorders. \u003cem\u003eAphasiology\u003c/em\u003e, \u003cem\u003e35\u003c/em\u003e(4), 397\u0026ndash;423. https://doi.org/10.1080/02687038.2020.1765306\u003c/li\u003e\n\u003cli\u003eVerwoert, M., Ottenhoff, M. C., Goulis, S., Colon, A. J., Wagner, L., Tousseyn, S., van Dijk, J. P., Kubben, P. L., \u0026amp; Herff, C. (2022). 
Dataset of Speech Production in intracranial Electroencephalography. \u003cem\u003eScientific Data 2022 9:1\u003c/em\u003e, \u003cem\u003e9\u003c/em\u003e(1), 1\u0026ndash;9. https://doi.org/10.1038/s41597-022-01542-9\u003c/li\u003e\n\u003cli\u003eWeerathunge, H. R., Alzamendi, G. A., Cler, G. J., Guenther, F. H., Stepp, C. E., \u0026amp; Za\u0026ntilde;artu, M. (2022). LaDIVA: A neurocomputational model providing laryngeal motor control for speech acquisition and production. \u003cem\u003ePLoS Computational Biology\u003c/em\u003e, \u003cem\u003e18\u003c/em\u003e(6). https://doi.org/10.1371/JOURNAL.PCBI.1010159\u003c/li\u003e\n\u003cli\u003eWhitfield, J. A., Holdosh, S. R., Kriegel, Z., Sullivan, L. E., \u0026amp; Fullenkamp, A. M. (2021). Tracking the Costs of Clear and Loud Speech: Interactions Between Speech Motor Control and Concurrent Visuomotor Tracking. \u003cem\u003eJournal of Speech, Language, and Hearing Research\u003c/em\u003e, \u003cem\u003e64\u003c/em\u003e(6s), 2182\u0026ndash;2195. https://doi.org/10.1044/2020_JSLHR-20-00264\u003c/li\u003e\n\u003cli\u003eZhang, C., \u0026amp; Hansen, J. H. L. (2007). Analysis and classification of speech mode: Whispered through shouted. \u003cem\u003eInternational Speech Communication Association - 8th Annual Conference of the International Speech Communication Association, Interspeech 2007\u003c/em\u003e, \u003cem\u003e4\u003c/em\u003e, 2396\u0026ndash;2399. https://doi.org/10.21437/INTERSPEECH.2007-621\u003c/li\u003e\n\u003cli\u003eZhang, C., Hansen, J. H. L., \u0026amp; Patil, H. A. (2018). Advancements in whispered speech detection for interactive/speech systems. In \u003cem\u003eSignal and Acoustic Modelling for Speech and Communication Disorders\u003c/em\u003e (Vol. 5, pp. 9-32). 
De Gruyter.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"brain-topography","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"btop","sideBox":"Learn more about [Brain Topography](http://link.springer.com/journal/10548)","snPcode":"10548","submissionUrl":"https://submission.nature.com/new-submission/10548/3","title":"Brain Topography","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"Amplitudes, Electroencephalography (EEG), Event related potential (ERP) Microstates, Motor speech control","lastPublishedDoi":"10.21203/rs.3.rs-4977028/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4977028/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003e Loud speech and whispered speech are two distinct speech modes that are part of daily verbal exchanges, but that involve a different employment of the speech apparatus. However, a clear account of whether and when the motor speech (or phonetic) encoding of these speech modes differs from standard speech has not been provided yet. 
Here, we addressed this question using electroencephalography (EEG)/event-related potential (ERP) approaches during a delayed production task to contrast the production of speech sequences (pseudowords) when speaking normally or under a specific speech mode: loud speech in experiment 1 and whispered speech in experiment 2. Behavioral results demonstrated that non-standard speech modes entail a behavioral encoding cost in terms of production latency. Standard speech and speech modes\u0026rsquo; ERPs were characterized by the same sequence of microstate maps, suggesting that the same brain processes are involved in producing speech under a specific speech mode. Only loud speech entailed electrophysiological modulations relative to standard speech, in terms of waveform amplitudes and also of the temporal distribution and strength of neural recruitment of the same sequence of microstates, in a large time window (from approximately \u0026minus;220 ms to \u0026minus;100 ms) preceding the vocal onset. In contrast, the electrophysiological activity of whispered speech was similar in nature to standard speech. 
On the whole, speech modes and standard speech seem to be encoded through the same brain processes but the degree of adjustments required seem to vary subsequently across speech modes.\u003c/p\u003e","manuscriptTitle":"Brain dynamics of speech modes encoding: Loud and Whispered speech versus Standard speech","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-09-27 07:42:13","doi":"10.21203/rs.3.rs-4977028/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2024-11-25T20:44:11+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2024-11-22T09:39:32+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"155865871191788928123592429581755263979","date":"2024-10-29T16:38:37+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"141840979118632114251248627365928736164","date":"2024-09-24T08:58:27+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2024-09-10T20:59:57+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2024-08-28T21:57:49+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2024-08-28T12:10:38+00:00","index":"","fulltext":""},{"type":"submitted","content":"Brain Topography","date":"2024-08-26T09:48:07+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"brain-topography","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"btop","sideBox":"Learn more about [Brain Topography](http://link.springer.com/journal/10548)","snPcode":"10548","submissionUrl":"https://submission.nature.com/new-submission/10548/3","title":"Brain Topography","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"Springer 
Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"7619c23f-a1fe-4b44-a6b3-562fe86c3731","owner":[],"postedDate":"September 27th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2025-02-17T15:59:43+00:00","versionOfRecord":{"articleIdentity":"rs-4977028","link":"https://doi.org/10.1007/s10548-025-01108-z","journal":{"identity":"brain-topography","isVorOnly":false,"title":"Brain Topography"},"publishedOn":"2025-02-15 15:57:06","publishedOnDateReadable":"February 15th, 2025"},"versionCreatedAt":"2024-09-27 07:42:13","video":"","vorDoi":"10.1007/s10548-025-01108-z","vorDoiUrl":"https://doi.org/10.1007/s10548-025-01108-z","workflowStages":[]},"version":"v1","identity":"rs-4977028","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4977028","identity":"rs-4977028","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}



License: CC-BY-4.0