Multimodal Classification of Cognitive Workload Using Eye-Tracking, ECG, and Head Motion Data in Simulated Military Missions

Murat Kucukosmanoglu, Justin Brooks, Catherine Neubauer, Andrea Krausman

Research Square preprint, Version 1. This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-7285350/v1
This work is licensed under a CC BY 4.0 License.

Abstract

Accurately assessing cognitive workload is critical in military operations, where decisions must be made under pressure in complex and dynamic environments. This study presents multimodal machine learning approaches for classifying workload into three levels: low, moderate, and high. Synchronized electrocardiogram (ECG), eye-tracking, and head movement signals from inertial measurement units were collected across 26 simulated missions involving autonomous technologies.
High workload segments were annotated by experts based on task demands and performance. Physiological and behavioral features, including heart rate, heart rate variability, pupil diameter, fixation count, and blink rate, were extracted and normalized per participant to account for individual variability. Classification models were evaluated using subject-independent five-fold cross-validation to ensure generalization. Among the tested models, XGBoost achieved the highest performance, with an accuracy of 0.86 and a macro-averaged F1 score of 0.78, outperforming Random Forest (accuracy: 0.82, F1: 0.73) and Decision Tree (accuracy: 0.74, F1: 0.65). Feature importance analysis revealed pupil size and fixation dispersion as key predictors of cognitive workload. These findings demonstrate the feasibility of real-time, noninvasive cognitive workload monitoring using multimodal physiological signals and support the development of adaptive human-machine systems that dynamically respond to operator cognitive states in high-demand environments.

Subject areas: Biological sciences/Computational biology and bioinformatics; Physical sciences/Engineering; Health sciences/Health care; Physical sciences/Mathematics and computing; Biological sciences/Neuroscience

Keywords: Cognitive Workload, Multimodal Classification, Eye-Tracking, ECG, XGBoost, Human-Machine Interaction

Introduction

Modern military operations require rapid and adaptive decision-making in conditions of uncertainty, time pressure, and high cognitive demand 1. As combat platforms such as next-generation ground vehicles continue to evolve toward greater autonomy, human operators are expected to manage complex systems while maintaining situational awareness across multiple mission components 2. These systems act not just as tools, but as intelligent teammates supporting shared decision-making and dynamic task coordination.
However, reductions in crew size and increases in system complexity place greater mental demands on individual operators 3, making real-time monitoring of cognitive workload essential.

Cognitive workload refers to the portion of an individual’s mental resources engaged during task execution 4, 5. When workload surpasses cognitive capacity, performance can degrade quickly, especially in high-stakes operational contexts 6, 7. Excessive workload has been associated with delayed responses, impaired decision making, and a higher likelihood of mission-critical errors 8. Although subjective assessments such as self‑report surveys are commonly used to evaluate perceived workload, they are inherently retrospective and do not offer the temporal resolution required for real‑time applications 9.

Conversely, physiological signals offer an objective and continuous alternative for assessing cognitive workload. Among the most promising indicators are electrocardiogram (ECG) and eye-tracking signals 10. ECG data provides insight into the autonomic nervous system through features such as heart rate and heart rate variability (HRV), both of which respond to changes in cognitive effort and emotional stress 11, 12. Eye-tracking contributes behavioral markers such as pupil size, fixation duration, blink frequency, and saccade dynamics, all of which have been shown to vary with cognitive demand 13–15. These signals are noninvasive, can be recorded using wearable sensors, and are fieldable, making them suitable for use in complex task environments.

While ECG and eye-tracking are each valuable on their own, they also have limitations in isolation, especially in operational settings where noise or motion can degrade signal quality. Combining multiple physiological modalities offers a more robust solution 16, 17.
Previous research in fields such as aviation, driving, and human-computer interaction has shown that multimodal approaches improve the accuracy and stability of workload classification systems 18, 19. By capturing complementary aspects of the human response to cognitive demand, multimodal sensing enables a more comprehensive understanding of operator state. This is particularly important for adaptive systems that must respond to fluctuations in workload during task execution.

Unlike prior studies that focused on controlled laboratory tasks or individual assessments, this study used unstructured military simulations that incorporate multi-human, multi-agent teams and expert-annotated workload segments to better reflect real-world demands and mission-specific contexts. The findings support the feasibility of real-time, noninvasive monitoring of cognitive workload and provide a foundation for integrating physiological sensing into adaptive human–machine systems.

In this study, we present machine learning frameworks to classify cognitive workload into low, moderate, and high levels using synchronized multimodal data. The dataset includes ECG, eye-tracking, and head movement signals derived from an inertial measurement unit (IMU), collected across 26 unstructured mission simulations involving participant interaction with autonomous technologies. We hypothesize that integrating cardiovascular and oculometric features will enhance workload classification performance compared to unimodal approaches.

Results

Temporal Dynamics of Physiological Signals

Figure 1 presents an example of physiological responses to varying cognitive workload levels during a representative mission, based on data from a single participant (labeled ‘cs01’). Red dashed lines mark high workload events, manually annotated by experts according to operational task demands. This time-aligned visualization illustrates how physiological signals evolve across mission phases and workload transitions.
Notably, both pupil diameter and heart rate increased during high workload periods. The magnetometer Z-axis signal, representing vertical head movement, also exhibited some variability during these intervals compared to lower workload phases.

Condition-Based Distributions of Physiological Features

To assess how ECG-derived features vary across workload levels, we computed the condition-wise means and 95% confidence intervals of the mean for four core metrics: mean heart rate (Mean_HR), maximum heart rate (Max_HR), minimum heart rate (Min_HR), and HRV. Figure 2 presents the standardized mean values of selected features across the three workload classes (low, moderate, and high), with 95% confidence intervals indicating the range within which the true mean is likely to fall for each condition. Heart rate metrics (i.e., Mean_HR, Max_HR, and Min_HR) exhibited a clear monotonic increase from low to high workload conditions. In contrast, HRV showed a decreasing trend with rising workload, though the change was much less pronounced.

To evaluate the sensitivity of eye features to changes in task demand, we examined standardized means of key eye-tracking metrics across low, moderate, and high workload conditions. Figure 3 presents group-level feature distributions with 95% confidence intervals. Several oculometric features varied across workload conditions. Mean pupil diameter increased with workload. Fixation count was higher under moderate and high workload, while fixation duration decreased. Saccade count showed a modest increase with workload, and blink count decreased as workload increased.

To complement these physiological and oculometric insights, we further examined how head movement features derived from the Tobii Pro Glasses 3’s inertial measurement unit (IMU) vary with cognitive workload. The IMU captures translational and rotational head motion through accelerometer, gyroscope, and magnetometer signals.
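
The condition-wise summaries used throughout this section (per-participant standardization followed by a mean and 95% confidence interval for each workload condition) can be sketched as follows. This is a minimal illustration and not the authors' code: the row format, the function names, the sample values, and the normal-approximation critical value of 1.96 are all assumptions, and the sketch treats every sample as independent, which a repeated-measures analysis would not.

```python
import math
import statistics
from collections import defaultdict


def zscore_per_participant(rows):
    """Add a within-participant z-score `z` to each row (hypothetical row format)."""
    values = defaultdict(list)
    for row in rows:
        values[row["participant"]].append(row["value"])
    stats = {p: (statistics.mean(v), statistics.stdev(v)) for p, v in values.items()}
    return [
        {**row, "z": (row["value"] - stats[row["participant"]][0]) / stats[row["participant"]][1]}
        for row in rows
    ]


def mean_ci_by_condition(rows, z_crit=1.96):
    """Return {condition: (mean, ci_low, ci_high)} using a normal approximation."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["condition"]].append(row["z"])
    summary = {}
    for condition, zs in grouped.items():
        mean = statistics.mean(zs)
        sem = statistics.stdev(zs) / math.sqrt(len(zs))  # standard error of the mean
        summary[condition] = (mean, mean - z_crit * sem, mean + z_crit * sem)
    return summary


# Made-up mean heart-rate values (bpm) for two hypothetical participants;
# these are NOT data from the study.
raw = {
    "p1": {"low": [62, 64], "moderate": [70, 72], "high": [80, 82]},
    "p2": {"low": [75, 77], "moderate": [84, 86], "high": [95, 97]},
}
rows = [
    {"participant": p, "condition": c, "value": float(v)}
    for p, conds in raw.items()
    for c, vals in conds.items()
    for v in vals
]
summary = mean_ci_by_condition(zscore_per_participant(rows))
for condition in ("low", "moderate", "high"):
    mean, lo, hi = summary[condition]
    print(f"{condition:9s} mean={mean:+.2f}  95% CI=[{lo:+.2f}, {hi:+.2f}]")
```

Standardizing within each participant first removes differences in resting level (here, p2's higher baseline heart rate), so the condition means reflect relative change rather than between-person offsets.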
We computed the standardized mean and 95% confidence interval for each IMU feature under each workload condition. Several IMU features varied across workload conditions (Fig. 4). Accelerometer signals along the Y and Z axes decreased with increasing workload. In contrast, magnetometer Y- and Z-axis values increased from low to high workload. Gyroscope features remained relatively stable across all conditions.

Figure 5 displays violin plots for six representative features across low, moderate, and high workload conditions: Mean Heart Rate, Mean Pupil Size, Fixation Count, Fixation X-Coordinate Standard Deviation, Saccade Count, and Magnetometer Z-Axis. These plots show the distributions and central tendencies of each feature within each workload class. Heart rate and pupil size increased across workload levels. Fixation count also increased, while Fixation X-Coordinate Standard Deviation (Fix_X_Std) was higher under moderate workload and showed minimal change between moderate and high workload. Saccade count exhibited a modest increase with workload. Magnetometer Z-axis values increased from low to high workload.

Classification Model Performance

We evaluated the performance of three classification models (Decision Tree, Random Forest, and XGBoost) in predicting cognitive workload levels defined as low workload (Class 0), moderate workload (Class 1), and high workload (Class 2). Models were trained using standardized physiological features and assessed via repeated cross-validation, stratified across individual crew stations to ensure subject-independent evaluation.

Overall Results

Among the models evaluated, XGBoost demonstrated the highest overall performance, achieving a mean cross-validation accuracy of 0.86 and a macro-averaged F1 score of 0.78 across folds. Random Forest followed with an accuracy of 0.82 and an F1 score of 0.73. Decision Tree yielded the lowest performance, with an accuracy of 0.74 and an F1 score of 0.65.
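
The subject-independent evaluation described above requires that all samples from a given crew station fall on the same side of each train/test split. A minimal sketch of such a split is shown below. The function name, the round-robin assignment of subjects to folds, and the example identifiers are assumptions made for illustration; the paper does not describe how crew stations were assigned to folds, and this sketch does not balance class proportions across folds.

```python
def subject_independent_folds(subject_ids, n_folds=5):
    """Yield (train_idx, test_idx) pairs in which no subject appears on both sides.

    Subjects are assigned to folds round-robin in sorted order (an assumption;
    any assignment that keeps each subject in exactly one fold would work).
    """
    subjects = sorted(set(subject_ids))
    fold_of = {s: i % n_folds for i, s in enumerate(subjects)}
    for k in range(n_folds):
        test = [i for i, s in enumerate(subject_ids) if fold_of[s] == k]
        train = [i for i, s in enumerate(subject_ids) if fold_of[s] != k]
        yield train, test


# Hypothetical example: 10 crew stations with 3 feature windows each.
subject_ids = [f"cs{n:02d}" for n in range(1, 11) for _ in range(3)]
folds = list(subject_independent_folds(subject_ids))

for k, (train, test) in enumerate(folds):
    held_out = sorted({subject_ids[i] for i in test})
    print(f"fold {k}: {len(train)} train samples, held-out subjects {held_out}")
```

In practice, a library utility such as scikit-learn's `GroupKFold` provides the same guarantee; the pure-Python version here is only meant to make the grouping constraint explicit.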
These metrics represent the mean values averaged across five subject-independent folds. Table 1 shows summary statistics for each model.

Table 1. Overall Cross-Validation Performance Metrics

Metric       Decision Tree   Random Forest   XGBoost
Accuracy     0.74            0.82            0.86
Precision    0.61            0.70            0.77
Recall       0.75            0.77            0.79
F1 Score     0.65            0.73            0.78

To gain deeper insight into model behavior, we evaluated F1 scores by workload class. All models performed well in identifying low workload conditions, and both XGBoost and Random Forest showed strong performance for moderate workload. High workload classification was more difficult. For Class 2, XGBoost achieved a recall of 0.73 and precision of 0.63, yielding an F1 score of 0.68. This value represents the average Class 2 F1 score across the five subject-independent cross-validation folds. This reduced performance likely reflects increased variability in physiological responses during high-demand periods and the effects of class imbalance. In the final dataset, high workload accounted for only 16% of samples, while moderate workload dominated at 78%, contributing to more false positives for Class 2.

Table 2. Model Performance by Workload Class

Workload Class   Description         Decision Tree (F1)   Random Forest (F1)   XGBoost (F1)
Class 0          Low Workload        0.59                 0.69                 0.75
Class 1          Moderate Workload   0.82                 0.88                 0.91
Class 2          High Workload       0.53                 0.61                 0.68

These results provide evidence supporting the superior performance of XGBoost for multimodal cognitive workload classification. Accordingly, XGBoost was selected for all subsequent analyses and interpretation.

Modality-Specific Model Comparison

We trained separate XGBoost classifiers using only ECG features, only eye-tracking features, and only IMU features. All models were evaluated using a consistent five-fold cross-validation approach. Their performance was then compared to a multimodal model that integrated all three modalities.
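
As a quick consistency check on the figures reported in Tables 1 and 2, the short sketch below recomputes the Class 2 F1 score from the stated precision and recall, and the macro-averaged F1 from the per-class scores. The helper names are illustrative. Because the paper averages metrics across folds, recomputing from rounded summary values is only expected to agree approximately.

```python
def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_average(scores):
    """Macro averaging weights every class equally, regardless of class size."""
    return sum(scores) / len(scores)


# XGBoost, Class 2 (high workload): precision 0.63, recall 0.73 as reported.
class2_f1 = f1_from_precision_recall(0.63, 0.73)

# Per-class F1 scores from Table 2, in the order Class 0, Class 1, Class 2.
per_class_f1 = {
    "Decision Tree": [0.59, 0.82, 0.53],
    "Random Forest": [0.69, 0.88, 0.61],
    "XGBoost": [0.75, 0.91, 0.68],
}
macro_f1 = {model: macro_average(scores) for model, scores in per_class_f1.items()}

print(f"Class 2 F1 from P/R: {class2_f1:.3f}")
for model, value in macro_f1.items():
    print(f"{model}: macro F1 = {value:.3f}")
```

To two decimal places these reproduce the reported values (0.68 for Class 2, and macro F1 scores of 0.65, 0.73, and 0.78). Since macro averaging weights the three classes equally, the weaker high workload class pulls the macro F1 below the accuracy, which is dominated by the 78% moderate class.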
Among the unimodal models, the eye-tracking–only model achieved the strongest performance, with an average accuracy of 0.82 and F1 score of 0.75. The IMU-only model followed, with an accuracy of 0.72 and F1 score of 0.56. The ECG-only model showed lower performance, achieving an accuracy of 0.67 and F1 score of 0.43. By comparison, the multimodal model outperformed all unimodal models, yielding the highest accuracy (0.86), F1 score (0.78), recall (0.79), and precision (0.77). These results highlight the complementary value of ECG-, eye-, and IMU-derived signals in cognitive workload classification. While each modality contributes independently, combining them provides greater accuracy and robustness in predicting workload states. This improvement supports our hypothesis that multimodal models offer superior predictive performance compared to unimodal approaches.

Table 3. Overall Cross-Validation Performance by Modality (XGBoost Model)

Metric       ECG Only   Eye-Tracking Only   IMU Only   Multimodal (ECG + Eye-Tracking + IMU)
Accuracy     0.67       0.82                0.72       0.86
Precision    0.42       0.74                0.56       0.77
Recall       0.45       0.78                0.57       0.79
F1 Score     0.43       0.75                0.56       0.78

Feature Importance

To better understand the physiological basis of model predictions, we analyzed feature importance from the multimodal XGBoost classifier (Fig. 6). Features were ranked based on their average gain, reflecting their contribution to improved classification performance. Eye-tracking features ranked highest in the multimodal XGBoost classifier. Pupil Size Mean was the top-ranked feature, followed by Fixation X-Coordinate Standard Deviation (Fix_X_Std). Other eye-related features, including mean and standard deviation of fixation duration, Saccade Count, and Number of Blinks, also appeared among the most important features. IMU-derived features such as Accelerometer Y, Accelerometer Z, Magnetometer Y, and Magnetometer Z were also ranked among the top contributors.
ECG-related features, including Max HR, Mean HR, Min HR, and Signal-to-Noise Ratio (SNR), were present but ranked lower in average gain compared to eye-tracking and IMU features.

Correlation Between Estimated Workload and Reported Mental Demand

We analyzed correlations between the average predicted workload produced by the XGBoost classifier and participants’ self-reported mental demand ratings. These analyses were conducted at the phase, mission, and crew station (subject) levels to capture fluctuations in workload across multiple temporal and operational scales. Each mission was composed of four distinct phases, enabling finer-grained comparisons of workload dynamics over time. This multilevel approach aligns with how cognitive demand is experienced and assessed in team-based settings, allowing us to examine both shared and individual perceptions of task difficulty.

Correlations between model-predicted workload and self-reported mental demand varied by aggregation level. At the individual crew station level, we observed a statistically significant but modest correlation (r = 0.14, p < 0.001). Using median mental demand ratings aggregated across crew stations to obtain team-level estimates, a stronger positive correlation emerged at the mission phase level (r = 0.21, p < 0.05). The strongest association was found at the mission level (r = 0.59, p < 0.05), where both predicted and self-reported workload values were averaged across crew stations to reduce subject-specific variability. For comparison, we also calculated correlations using mean mental demand ratings. While the overall pattern remained consistent, the associations were weaker: r = 0.14 (p < 0.001) at the crew station level, r = 0.16 (p = 0.12) at the phase level, and r = 0.27 (p = 0.189) at the mission level. These findings highlight the advantage of using the median to mitigate the influence of extreme values and uniform response tendencies in subjective workload ratings.
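
The mission-level comparison described above (predicted workload averaged across crew stations, correlated against either the median or the mean mental demand rating) can be sketched as follows. The record format, function names, and all numbers are hypothetical; the toy data are deliberately constructed so that one extreme rating distorts the mean but not the median, and they say nothing about the size of the effect in the study.

```python
import math
import statistics
from collections import defaultdict


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mission_level_pairs(records, rating_agg=statistics.median):
    """Per mission: mean predicted workload and aggregated rating across crew stations."""
    predicted, ratings = defaultdict(list), defaultdict(list)
    for rec in records:
        predicted[rec["mission"]].append(rec["predicted"])
        ratings[rec["mission"]].append(rec["mental_demand"])
    missions = sorted(predicted)
    return (
        [statistics.mean(predicted[m]) for m in missions],
        [rating_agg(ratings[m]) for m in missions],
    )


# Made-up records: (mission, mean predicted class for a crew station, 0-100 rating).
# Mission m1 contains one extreme rating (100) from a single crew station.
toy = [
    ("m1", 0.8, 30), ("m1", 1.0, 35), ("m1", 0.9, 100),
    ("m2", 1.1, 50), ("m2", 1.2, 55), ("m2", 1.0, 45),
    ("m3", 1.4, 70), ("m3", 1.5, 75), ("m3", 1.3, 65),
]
records = [{"mission": m, "predicted": p, "mental_demand": d} for m, p, d in toy]

r_median = pearson_r(*mission_level_pairs(records, statistics.median))
r_mean = pearson_r(*mission_level_pairs(records, statistics.mean))
print(f"median-aggregated r = {r_median:.3f}, mean-aggregated r = {r_mean:.3f}")
```

In this toy case the single outlying rating lowers the mean-aggregated correlation while leaving the median-aggregated one nearly unchanged, which is the mechanism the text attributes to extreme values in subjective ratings.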

Temporal Alignment of Predictions with True Workload Labels

Figure 8 presents a time-series comparison between the model’s predicted workload classes and the ground truth labels for a representative subject during a mission. The model successfully tracks the overall progression of cognitive workload, transitioning from low workload to moderate and occasionally into high workload. Predictions exhibit strong temporal alignment with the true labels during extended segments, particularly for low and moderate phases. However, two types of misclassifications are evident: (1) one brief over-prediction to high workload while the ground truth remains at moderate, likely reflecting the model’s sensitivity to transient physiological fluctuations (e.g., spikes in heart rate, pupil size, or fixation dispersion); and (2) missed detections of actual high workload periods, as indicated by the lower recall for the high class. These results highlight the challenge of distinguishing brief, noisy workload surges from sustained high workload episodes.

Discussion

This study demonstrates the feasibility of classifying cognitive workload using a multimodal physiological framework that integrates ECG, eye-tracking, and IMU data collected during high-fidelity, complex simulated military missions. The findings highlight the benefit of combining complementary biosignals to improve model accuracy under varying cognitive demands. Among the models tested, XGBoost delivered the strongest performance, achieving an overall accuracy of 0.86 and a macro-averaged F1 score of 0.78, outperforming both Random Forest (F1 = 0.73) and Decision Tree (F1 = 0.65). While all models effectively detected low and moderate workload conditions, classification performance was lower for high workload segments. This reduction likely reflects increased physiological variability under stress, as well as limitations in the annotation process.
Manual labeling is subject to temporal uncertainty and may fail to capture brief or subtle instances of high cognitive demand, leading to noisy or incomplete ground truth. To address this, we applied a ±30-second buffer around annotated high workload events, resulting in 60-second modeling windows. This window length was selected empirically: shorter windows led to reduced precision, while longer ones increased precision at the cost of temporal specificity.

Despite high workload periods comprising only 16% of the dataset, XGBoost achieved a recall of 0.73 and a precision of 0.63 for this class, resulting in an F1 score of 0.68. These results suggest the model effectively detects most high workload events, though some moderate segments are misclassified. Given the class imbalance, the reduced precision for Class 2 is expected. Importantly, this performance substantially exceeds chance-level expectations (F1 ≈ 0.16), suggesting that the model learned meaningful physiological patterns associated with elevated cognitive demand. These findings underscore the effectiveness of tree-based ensemble methods for modeling complex, nonlinear relationships in multimodal physiological data.

Physiological and Behavioral Indicators of Workload

Interpretation of the temporal dynamics revealed distinct physiological signatures associated with elevated workload. During annotated high workload periods, pupil diameter increased, consistent with heightened arousal and attentional engagement 20. Concurrent increases in heart rate likely reflect sympathetic nervous system activation in response to cognitive stress 21–23. Additionally, fluctuations in magnetometer Z-axis signals during these periods may reflect postural adjustments or head orientation shifts, which are common during intense visual search or physical engagement. Together, these trends confirm that synchronized eye-tracking, cardiovascular, and movement-related signals capture workload-relevant changes in operator state.
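
The ±30-second buffering of annotated high workload events described earlier in this Discussion, combined with the three-class scheme (baseline as low, annotated windows as high, remaining mission time as moderate), can be sketched as a simple labeling rule. The function names, the shared time axis starting at the beginning of the baseline, and the event times are assumptions for illustration only; the sketch does not reproduce the authors' preprocessing.

```python
LOW, MODERATE, HIGH = 0, 1, 2  # class codes as used in the paper


def high_workload_windows(event_times_s, buffer_s=30.0):
    """Expand each annotated event time into a [t - buffer, t + buffer] window."""
    return [(t - buffer_s, t + buffer_s) for t in event_times_s]


def label_at(t_s, baseline_end_s, windows):
    """Label a timestamp: baseline is low, buffered events are high, the rest moderate."""
    if t_s < baseline_end_s:
        return LOW
    if any(start <= t_s <= end for start, end in windows):
        return HIGH
    return MODERATE


# Hypothetical timeline in seconds: a 5-minute baseline, then a mission with
# two expert-annotated high workload events (made-up times).
baseline_end_s = 300.0
windows = high_workload_windows([1200.0, 2500.0])

for t in (100.0, 1000.0, 1185.0, 1231.0, 2500.0):
    print(f"t = {t:6.0f} s -> class {label_at(t, baseline_end_s, windows)}")
```

Each window spans 60 seconds in total, matching the modeling window length stated in the text; widening `buffer_s` trades temporal specificity for tolerance to annotation timing error.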
Condition-based analyses further revealed how feature distributions change across workload levels. Oculometric features, including pupil size, fixation count, and saccade count, showed consistent increases with task demand, while fixation duration and blink count decreased. These patterns may be consistent with more rapid visual scanning and sustained attention under time pressure. Notably, fixation durations tended to shorten as task complexity increased, as observed in both flight simulator exercises 24 and a video game-based cognitive load task 25. However, there is no clear consensus on the behavior of fixation- and saccade-related metrics under cognitive load, as findings vary considerably depending on task type and experimental design 14. Gaze dispersion (as indexed by Fix_X_Std) increased from low to moderate workload, then stabilized, possibly reflecting a shift to more exploratory behavior during early task engagement, followed by narrowed focus during high-load phases.

Similarly, IMU-derived features exhibited systematic changes across workload conditions. Accelerometer signals along the Y and Z axes decreased with increasing workload, possibly reflecting reduced head motion or more constrained postural adjustments during cognitively demanding tasks 26. In contrast, magnetometer values along the same axes increased, suggesting changes in orientation or environmental magnetic field exposure. Gyroscope signals remained relatively stable, indicating that rotational head movements were less sensitive to workload variations, potentially due to the task structure or physical constraints of the simulation environment.

Feature Importance and Modality Contributions

Feature-level analyses provided interpretable insights into how specific signals reflect changes in workload. As previously described, elevated heart rate and reduced heart rate variability corresponded with increased workload, consistent with heightened sympathetic nervous system activity 21–23.
Eye-tracking metrics, including increased pupil dilation and higher fixation counts, also demonstrated clear sensitivity to workload fluctuations, aligning with prior studies linking these features to attentional engagement and cognitive effort 27. For example, more frequent and shorter fixations suggested rapid scanning behavior under time pressure. Blink rate also emerged as a top-ranked feature in the model’s feature importance plot. Exploratory visual analyses (Fig. 3) revealed a consistent decrease in blink frequency during high workload phases, reinforcing previous findings that associate reduced blink rate with heightened visual processing demands 27. Gaze variability, as indexed by the standard deviation of fixation X-coordinates, increased from low to moderate workload levels and then stabilized at higher workload. This pattern differs from previous studies that reported overall decreases in gaze variability with rising workload 27, highlighting the potential for nonlinear or context-dependent effects in complex operational tasks. Together, these results underscore the value of combining cardiovascular and oculometric indicators for robust workload monitoring.

In addition to cardiovascular and eye-tracking signals, IMU-derived features contributed meaningful information to workload classification. Specifically, magnetometer and accelerometer readings along the Z-axis were consistently ranked among the most important predictors. These features may reflect postural adjustments, movement variability, or head orientation changes during cognitively demanding phases.

A central advantage of the proposed framework is its integration of multimodal data sources. ECG, eye-tracking, and IMU signals each capture distinct but complementary aspects of cognitive and physiological functioning. By fusing these modalities, the model achieved a richer and more robust physiological representation of workload.
This multimodal approach enhanced the model’s ability to discriminate nuanced workload transitions across different mission phases, supporting previous findings that multimodal fusion improves classification 17, 28, 29. Notably, our multimodal classifier achieved a macro-averaged F1 score of 0.78, outperforming the ECG-only model (F1 = 0.43), the eye-tracking–only model (F1 = 0.75), and the IMU-only model (F1 = 0.56), though the gain over eye-tracking alone was modest. These findings confirm the complementary value of IMU-derived head movement signals alongside traditional biosignals.

Subjective Validation and Temporal Alignment

Correlations between model-predicted workload and self-reported mental demand ratings varied by level of aggregation. The strongest alignment was observed at the mission level (r = 0.59), followed by the phase level (r = 0.21), and weakest at the individual crew station level (r = 0.14). This gradient likely reflects the effects of temporal and spatial averaging: broader aggregations smooth out short-term fluctuations and inter-individual variability, resulting in stronger signal consistency.

At the individual level, weaker correlations may be attributed to variability in subjective reporting, inconsistent interpretations of mental demand, or physiological differences in responsiveness. Several participants provided uniform ratings across all phases, limiting sensitivity to within-mission changes. Moreover, because many annotations were made at the section or platoon level, they may reflect collective task load rather than individual experience. For example, a label such as “SECTION 01: HIGH WORKLOAD; Under attack, some confusion” may not represent the cognitive load of each crew member equally. In our labeling procedure, when such section-level annotations were present, we assigned the high workload label to all crew stations within that section during the corresponding window.
This decision was made to preserve operational context but may have introduced label noise for individuals whose workload levels diverged from the group average. These findings highlight the inherent challenges in validating physiological workload predictions against subjective ratings in team-based environments and suggest the value of incorporating finer-grained labeling approaches, such as continuous self-reporting or real-time performance-based metrics.

Implications and Future Directions

This study demonstrates the feasibility and utility of using multimodal, noninvasive physiological sensing to monitor cognitive workload in complex operational settings. The integration of eye-tracking, IMU, and ECG data supports accurate, real-time classification of workload, with direct applications in military, industrial, and other high-stakes environments. The use of within-subject normalization and subject-independent validation enhances generalizability and allows for both individualized and group-level insights.

Several limitations warrant consideration. Although the sample size was sufficient for model development, it may not capture the full spectrum of physiological variability present in broader populations, including older adults or individuals with different training and experience levels. High workload annotations were based on expert judgment, which, despite following structured criteria, remains subjective and may not fully capture nuanced or short-duration workload events. This limitation was particularly evident in fast-paced simulation environments, where rapid task transitions made it challenging for annotators to detect every instance of high workload. Incorporating multiple annotators may help increase labeling reliability in such dynamic settings. Additionally, integrating continuous self-reports or objective performance indicators could further improve label fidelity in future work.
Finally, the current framework models workload as a discrete, three-level construct. However, cognitive load often varies continuously and dynamically. Future studies should explore regression-based or temporal modeling approaches, such as recurrent neural networks, temporal attention mechanisms, or state-space models, to better capture moment-to-moment workload transitions. Additionally, efforts to deploy this framework in real-world operational contexts will be critical to advancing adaptive human-machine teaming systems that respond intelligently to operator state.

Conclusion

This study introduces a validated machine learning framework for classifying cognitive workload using synchronized eye-tracking, ECG, and head IMU data collected during high-fidelity simulated military missions. XGBoost achieved the highest performance, with an accuracy of 0.86 and a macro-averaged F1 score of 0.78. These findings demonstrate that multimodal physiological signals can reliably distinguish between low, moderate, and high workload states, achieving strong classification performance with interpretable feature contributions. Eye-tracking features emerged as dominant predictors, with complementary insights from head movement and cardiovascular data, underscoring the value of multimodal integration. These results lay a strong foundation for real-time cognitive state monitoring and the development of intelligent, adaptive interfaces in mission-critical environments.

Future research should expand this work by incorporating more diverse populations and exploring deployment in real-world operational contexts. Additionally, integrating continuous labeling, temporal modeling approaches, and objective performance measures will be critical for capturing the dynamic and individualized nature of cognitive workload. These advances will support the next generation of adaptive human-machine teaming systems capable of responding fluidly to operator states in complex, high-stakes domains.

Methods

This study was approved by the Army Research Laboratory’s Institutional Review Board. All study procedures involving human participants complied with the ethical standards set by the IRB and the principles of the Belmont Report. All methods were performed in accordance with the relevant guidelines and regulations. Written informed consent was obtained from all participants prior to participation. Participants were given the opportunity to ask questions, and all inquiries were addressed.

This study was conducted in support of the next generation combat vehicle modernization priority. Experiments were performed in the Information for Mixed Squads (INFORMS) Laboratory at Aberdeen Proving Ground, a fully instrumented simulation facility designed for large-scale, platoon-level research. Thirty participants, organized into two platoons, completed 26 simulated missions against a live opposing force (OPFOR) during two separate 10-day experimental sessions. Participants operated both manned and unmanned ground vehicles in dynamic, team-based scenarios involving reconnaissance, target engagement, and coordinated decision-making. Some missions incorporated the Dynamic Task Allocation System (DTAS), an intelligent interface that adaptively supports real-time task management based on operator workload and mission demands.

Additionally, the simulation environment was equipped with autonomous software agents capable of performing various crew functions. These included maneuvering vehicles to designated locations, controlling slewable weapon systems, and enhancing situational awareness via computer vision–based Aided Target Recognition (AiTR), delivered through the Automatic Detection System (ADS). These systems enabled adaptive human-agent teaming and added operational complexity relevant to future battlefield environments.
Physiological data, including ECG, eye-tracking, and head movement signals, were continuously recorded from participants’ crew stations during mission execution to support cognitive workload modeling. Each mission lasted approximately 90 minutes and was divided into four operational phases, delineated by Phase Lines. For modeling purposes, mission segments were classified into three workload conditions:

Low workload (Baseline): a 5-minute pre-mission period during which participants watched a relaxing video of a lava lamp ( https://youtu.be/h_lQ2tMgLVM ) while remaining seated with both feet on the floor, allowing for the collection of physiological baseline data.
Moderate workload: the main mission execution phase, excluding baseline and high workload periods.
High workload: identified by expert raters based on indicators such as task saturation, complex decision-making, and operational urgency.

High workload segments were independently annotated by at least two trained raters using a structured set of pre-defined observational criteria, including markers of workload, team communication, cohesion, and mission-specific events. Final annotations were aligned with 60-second windows surrounding performance-critical events to support robust modeling of workload fluctuations.

Participants and Platoon Structure

Two platoons, each comprising 15 individuals (N = 30 total), participated in separate 10-day simulation sessions. Each platoon included:

Fourteen crew members organized into two 7-person sections. Each section included one Section Commander overseeing the team’s tactical actions and three two-person dyads, each responsible for operating a Manned Control Vehicle (MCV) or a pair of Robotic Combat Vehicles (RCVs).
One Higher Control (HICON) operator per platoon, who served as a mission facilitator, coordinating scenario updates, communication flows, and operational stimuli in collaboration with the research team.

Participants were recruited from active-duty U.S. Army Soldiers. Soldiers ranged in age from 19 to 33 years (mean = 24.3, SD = 4.5) and had served in the U.S. military for an average of 5.1 years (SD = 4.0). All sessions were conducted over two weeks (Monday to Friday), limited to 8 hours per day (including a lunch break), and included comprehensive training and orientation prior to mission engagement.

The simulated missions were conducted in a fully instrumented, fixed-base team simulator designed to replicate the spatial layout and operational demands of real-world vehicle crew stations. As shown in Fig. 9, participants interacted with touchscreen monitors, steering controls, and weapon interfaces, while communicating through push-to-talk headsets within and across sections. Throughout each mission, Soldiers wore physiological monitoring devices to continuously capture cardiovascular and oculometric signals during task execution: Zephyr™ BioHarness 3.0 chest straps (Zephyr Technology, Annapolis, MD, USA) for ECG recording and Tobii Pro Glasses 3 (Tobii AB, Danderyd, Sweden) for eye tracking and head movement.

Subjective State Measures

Participants at all 14 crew stations in each platoon completed the NASA Task Load Index (NASA-TLX) [30] following each mission phase. Among the six subscales (mental demand, physical demand, temporal demand, performance, effort, and frustration), we focused specifically on the mental demand item, which asked: “How much mental and perceptual activity was required, such as thinking, looking, or searching? Was the task easy or demanding?” Responses were rated on a scale from 0 (low) to 100 (high) and captured participants’ perceived workload immediately after each mission phase.
To account for the variability and occasional uniformity in subjective ratings across individuals, we used the median of mental demand scores and the mean of predicted workload values. At the crew station level, we computed values separately for each participant without further aggregation. At the phase and mission levels, we aggregated data by taking the median of mental demand scores across all crew stations for a given phase or mission, while averaging the corresponding predicted workload values. This approach helped reduce the influence of outliers in subjective responses while preserving the continuous output of the model.

Physiological Feature Extraction

To assess moment-to-moment fluctuations in cognitive workload, we extracted features from three modalities. These features were computed over overlapping temporal windows and served as inputs to machine learning models.

ECG Feature Extraction

ECG signals were collected using the Zephyr™ BioHarness system, a wearable chest strap device that provides real-time heart rate and respiratory measurements. Data were sampled at 250 Hz and processed through a custom Python-based pipeline designed to extract features reflecting autonomic nervous system dynamics and physiological arousal. The raw signals were first bandpass filtered using a 4th-order Butterworth filter (cutoff: 0.25–30 Hz) to suppress baseline drift and high-frequency noise. The filter was implemented in Python 3.11 using the butter and filtfilt functions from the scipy.signal module, applying zero-phase forward and reverse filtering to eliminate phase distortion [31]. Following filtering, the signals were further cleaned using the neurokit2 library (v0.2.2), which applies the Engzeemod2012 algorithm for robust R-peak detection [32]. Features extracted over 30-second windows (with 1-second shifts) included:

Mean Heart Rate: the average number of beats per minute, serving as a general indicator of physiological arousal.
Maximum and Minimum Heart Rate: the peak and trough heart rate values observed during the analysis window, providing a range of autonomic reactivity.
Heart Rate Variability: calculated using the root mean square of successive differences in inter-beat intervals (RMSSD), reflecting parasympathetic modulation and cognitive effort.
Signal-to-Noise Ratio (SNR): estimated as the logarithmic ratio between the signal power of ECG segments surrounding R-peaks and the residual noise power from their cleaned counterparts, offering a quantitative assessment of ECG signal quality.

These features were aligned with mission timelines and workload annotations to support fine-grained modeling of workload-related physiological responses.

Eye-Tracking Feature Extraction

Eye-tracking data were recorded using Tobii Pro Glasses 3 at a sampling rate of 100 Hz, capturing timestamped horizontal and vertical gaze coordinates along with pupil diameter measurements from both eyes. To enable high-resolution temporal analysis aligned with mission timelines, a custom Python-based pipeline was developed to extract behavioral and oculometric features from sliding windows of 10 seconds, with a 1-second step size. For each window, a comprehensive set of features was computed across four categories: pupil dynamics, blink activity, saccades, and fixations.

Blink events were defined as gaps in valid eye-tracking data lasting at least 300 milliseconds, following conventions used in the PyGazeAnalyser toolkit and related literature [33]. Pupil metrics included mean and standard deviation of diameter, as well as left–right inter-eye differences. Saccades were identified using inter-sample thresholds on velocity (> 0.2 units/s) and acceleration (> 30 units/s²) computed from gaze positions, which are normalized screen coordinates ranging from 0 to 1 (with [0,0] indicating the top-left and [1,1] the bottom-right of the display). Valid saccades had durations between 20 and 300 ms.
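The velocity criterion for saccades described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `detect_saccades` and its parameter names are hypothetical, and the acceleration criterion is left out because the text does not state how the two thresholds were combined. The sketch assumes evenly sampled gaze positions at 100 Hz in normalized coordinates.

```python
import math

def detect_saccades(x, y, fs=100.0, vel_thresh=0.2, min_dur=0.020, max_dur=0.300):
    """Return (start_index, end_index, amplitude) for each candidate saccade.

    x, y: gaze positions in normalized screen coordinates (0 to 1), one per sample.
    A candidate is a run of consecutive inter-sample speeds above vel_thresh
    whose duration falls between min_dur and max_dur (seconds).
    Assumption: only the velocity threshold is applied (no acceleration test).
    """
    dt = 1.0 / fs
    # Inter-sample speed in normalized units per second.
    speed = [math.hypot(x[i + 1] - x[i], y[i + 1] - y[i]) / dt
             for i in range(len(x) - 1)]
    saccades = []
    start = None
    # The trailing 0.0 is a sentinel that closes a run reaching the last sample.
    for i, v in enumerate(speed + [0.0]):
        if v > vel_thresh and start is None:
            start = i
        elif v <= vel_thresh and start is not None:
            duration = (i - start) * dt
            if min_dur <= duration <= max_dur:
                amplitude = math.hypot(x[i] - x[start], y[i] - y[start])
                saccades.append((start, i, amplitude))
            start = None
    return saccades

# Synthetic gaze trace: steady gaze, a 40 ms horizontal jump, steady gaze again.
gaze_x = [0.2] * 50 + [0.3, 0.4, 0.5] + [0.6] * 50
gaze_y = [0.5] * len(gaze_x)
found = detect_saccades(gaze_x, gaze_y)
```

Per-window saccade count, amplitude, and velocity statistics could then be summarized from the returned tuples; real recordings would also need handling for missing samples, which this sketch ignores.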
For each window, we computed saccade count, amplitude, and velocity statistics. Fixation detection was based on spatial clustering of gaze points. Fixations were defined as sequences of points that remained within 0.05 unit² for a minimum of 100 milliseconds, consistent with established parameters in eye movement research [34]. Fixations rarely last less than 100 ms and typically fall in the range of 200–400 ms [35]. Extracted fixation features included count, mean and variability of fixation duration, and spatial dispersion along the x and y axes. A summary of the extracted eye-tracking features and their descriptions is provided in Table 4.

Table 4 Eye-Tracking Features Extracted Per Time Window

| Feature Category | Feature Name | Description |
| --- | --- | --- |
| Pupil Dynamics | Pupil Size (Mean / Std) | Average and variability of pupil diameter across both eyes |
| Pupil Dynamics | Left–Right Diameter Diff (Mean / Std) | Difference between left and right pupil diameters (mean and variability) |
| Blink Activity | Blink Count | Number of blinks, inferred from data gaps ≥ 300 ms |
| Saccades | Saccade Count | Number of detected saccades per window |
| Saccades | Saccade Distance (Mean / Std) | Amplitude of eye movements between fixations |
| Saccades | Saccade Velocity (Mean / Std) | Speed of saccadic eye movements |
| Fixations | Fixation Count | Number of fixations per window |
| Fixations | Fixation Duration (Mean / Std) | Duration of gaze stabilizations on a single point |
| Fixations | Fixation Coordinates (X/Y Mean / Std) | Spatial distribution of fixations (center and spread in x and y directions) |

Head IMU Feature Extraction

In addition to ECG and eye-tracking data, we extracted head movement signals from the IMU embedded in the Tobii Pro Glasses 3. A Python-based pipeline was developed to compute mean IMU-derived metrics from sliding windows of 10 seconds, with a 1-second step size. The glasses contain three types of sensors: accelerometer, gyroscope, and magnetometer. These sensors capture translational and rotational head movements, which may serve as indirect indicators of workload-induced behavior.
Accelerometer: Measures linear acceleration along the X, Y, and Z axes in m/s², with gravity influencing the Y-axis during rest (approximately −9.8 m/s²). Sampled at 100 Hz.
Gyroscope: Captures angular velocity in degrees/sec. Yaw (Y-axis), pitch (X-axis), and roll (Z-axis) rotations correspond to typical head movements like shaking, nodding, and tilting. Sampled at 100 Hz.
Magnetometer: Measures local magnetic field strength in microteslas (µT) on each axis and is useful for estimating head orientation. Sampled at 10 Hz.

To enable multimodal fusion, ECG, eye-tracking, and head IMU features, represented as time series with one-second resolution, were averaged over 60-second windows aligned with workload annotations. Features were first extracted in shorter windows (10 seconds for eye-tracking and IMU signals, and 30 seconds for ECG) using one-second step sizes to retain high temporal granularity. These short-window features were then aggregated to match the 60-second workload labeling resolution. This approach produced synchronized and smoothed feature vectors while preserving the temporal structure. A full overview of the multimodal preprocessing pipeline is presented in Fig. 10.

Labeling and Normalization Strategy

For modeling purposes, mission data were categorized into three workload levels (low, moderate, and high) based on operational phase and observed task complexity. Low workload corresponded to a pre-mission resting baseline period. Moderate workload encompassed routine mission execution, excluding baseline and high workload segments. High workload segments were identified by two trained researchers physically present during all mission simulations. Annotator 1 labeled Section 1 (7 people); Annotator 2 labeled Section 2 (7 people). Using unstructured observational criteria, including task saturation, rapid decision-making, and intense coordination, raters monitored team interactions and participant behavior in real time.
Annotations were applied in 60-second windows. To account for physiological response lag, a ±30-second buffer was added around each event. Overlapping events occurring within 5 seconds were merged into a single segment; others were treated as distinct. Labels were assigned only to crew stations actively engaged in the corresponding workload episode, ensuring precise alignment with physiological and behavioral data. The 60-second window duration was informed by typical engagement dynamics observed in tactical military simulations, where interactions between OPFOR and Blue forces often unfold within 30 to 60 seconds.

To address inter-individual variability in physiological baselines, all extracted features were z-score normalized within each crew station, preserving relative workload dynamics while minimizing baseline differences. The final dataset reflected an imbalanced distribution across workload levels, with low workload comprising 6% of the samples, moderate workload accounting for 78%, and high workload representing 16%. This distribution underscores the predominance of moderate workload periods during mission execution and the relative sparsity of high workload events. Such class imbalance is a common challenge in modeling operational cognitive states, where periods of intense cognitive demand are naturally less frequent but critically important for accurate detection and intervention.

Machine Learning Framework and Feature Selection

We implemented and evaluated three supervised machine learning classifiers: Decision Tree, Random Forest, and XGBoost. These models were trained and validated using a grouped cross-validation framework designed to preserve independence across crew stations.

Data Preparation and Preprocessing

The dataset comprised multimodal features extracted from synchronized ECG, eye tracking, and head movement signals. These features were merged across resampled segments representing each workload class.
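The windowing rule for high workload annotations described earlier in this section (a ±30-second buffer around each event, with events within 5 seconds of each other merged) can be sketched as follows. This is one possible reading of that rule, not the authors' code: the function name `build_high_workload_segments`, its parameters, and the choice to measure the 5-second gap between consecutive event timestamps are assumptions made for illustration.

```python
def build_high_workload_segments(event_times, buffer_s=30.0, merge_gap_s=5.0):
    """Turn annotated event times (in seconds) into buffered, merged segments.

    Each event at time t becomes the window [t - buffer_s, t + buffer_s].
    Assumption: an event that follows the previous event by at most
    merge_gap_s seconds extends that event's segment instead of starting a new one.
    """
    segments = []
    for t in sorted(event_times):
        if segments and t - segments[-1]["last_event"] <= merge_gap_s:
            # Close to the previous event: extend the existing segment.
            segments[-1]["end"] = t + buffer_s
            segments[-1]["last_event"] = t
        else:
            segments.append({"start": t - buffer_s, "end": t + buffer_s, "last_event": t})
    return [(s["start"], s["end"]) for s in segments]

# Two events 3 s apart merge into one segment; the third stays distinct.
segments = build_high_workload_segments([100.0, 103.0, 400.0])
```

Under this reading, an isolated event yields exactly one 60-second window, which matches the stated annotation resolution.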
We employed an early fusion strategy by concatenating ECG, eye tracking, and IMU features prior to model training, allowing the classifier to jointly learn multimodal representations of cognitive workload. Several preprocessing steps were applied:

Standardization: Features were z-scored within each subject and clipped between the 1st and 99th percentiles to reduce the influence of outliers.
Labeling: Segments were labeled as 0 (Low), 1 (Moderate), or 2 (High Workload), and concatenated into a unified dataset.
Feature Filtering: Non-informative features (e.g., metadata, survey responses, identifiers, signal artifacts) were excluded.
Missing Data Handling: Rows missing more than 20% of their feature values were removed, accounting for 5.1% of the dataset. The remaining missing values in numerical features were imputed using k-Nearest Neighbors (k = 2), based on similarity in feature space.

Cross-Validation and Grouping Strategy

To address class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) [36] was applied exclusively to the training data within each cross-validation fold, ensuring that no synthetic data were included in the test set. This strategy avoided data leakage and preserved the natural class distribution during evaluation. A 5-fold GroupKFold strategy was used to prevent subject overlap between training and testing folds.

Hyperparameter Tuning and Model Optimization

Each model underwent hyperparameter optimization using RandomizedSearchCV with 50 search iterations and parallelized evaluation. Rather than optimizing for general classification performance alone, we prioritized the detection of high workload segments (Class 2), which are rare but operationally critical. Accordingly, we used a custom scoring function (f1_class2_scorer) that computes the F1 score specifically for Class 2, placing emphasis on the model’s ability to correctly identify elevated cognitive load.
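The leakage-avoidance logic of the grouped cross-validation described above can be illustrated with a small, dependency-free sketch. It is not the study's implementation: the study used scikit-learn's GroupKFold and SMOTE, whereas the hypothetical helpers below assign groups to folds round-robin and balance classes by random duplication, which stands in for SMOTE only to show that resampling touches training rows alone.

```python
import random
from collections import Counter, defaultdict

def group_folds(groups, n_splits=5):
    """Yield (train_idx, test_idx) so that no group appears on both sides.

    Assumption: groups are assigned to folds round-robin, a simplification
    of the size-balanced assignment a library splitter would perform.
    """
    unique = sorted(set(groups))
    fold_of = {g: i % n_splits for i, g in enumerate(unique)}
    for k in range(n_splits):
        test_idx = [i for i, g in enumerate(groups) if fold_of[g] == k]
        train_idx = [i for i, g in enumerate(groups) if fold_of[g] != k]
        yield train_idx, test_idx

def oversample_train(train_idx, labels, seed=0):
    """Duplicate minority-class training rows until all classes match the largest.

    Random duplication is a stand-in here; the study applied SMOTE instead.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in train_idx:
        by_class[labels[i]].append(i)
    target = max(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(rows)
        balanced.extend(rng.choices(rows, k=target - len(rows)))
    return balanced

# Toy data: five crew stations with four labeled windows each.
groups = [g for g in ["cs01", "cs02", "cs03", "cs04", "cs05"] for _ in range(4)]
labels = [0, 1, 1, 2] * 5
folds = list(group_folds(groups, n_splits=5))
```

Because the test indices are never passed to `oversample_train`, each held-out fold keeps its natural class distribution, which is the property the text emphasizes.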
For XGBoost, we additionally specified eval_metric='mlogloss', the multiclass log loss that XGBoost reports on evaluation data and uses for early stopping. This metric did not override our primary selection criterion during tuning, which remained based on the Class 2 F1 score. This choice of objective ensured that hyperparameter selection favored configurations that improve recall and precision for high workload episodes, rather than inflating performance on the more common moderate workload class. The hyperparameter combination that yielded the highest mean F1 score for Class 2 across cross-validation folds was selected as the best configuration for each model. The optimal hyperparameters are summarized in Table 5.

Table 5 Best-performing Hyperparameters for Each Classifier

| Classifier | Best Hyperparameters |
| --- | --- |
| Decision Tree | {'min_samples_split': 10, 'min_samples_leaf': 1, 'max_leaf_nodes': 30, 'max_features': None, 'max_depth': 9, 'ccp_alpha': 0.0} |
| Random Forest | {'n_estimators': 100, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_leaf_nodes': None, 'max_features': None, 'max_depth': 11, 'ccp_alpha': 0.0, 'bootstrap': True} |
| XGBoost | {'tree_method': 'hist', 'subsample': 0.6, 'reg_lambda': 1.5, 'reg_alpha': 0.5, 'n_estimators': 300, 'min_child_weight': 3, 'max_depth': 7, 'max_delta_step': 0, 'learning_rate': 0.1, 'gamma': 1.0, 'colsample_bytree': 1.0, 'colsample_bynode': 0.6, 'colsample_bylevel': 1.0} |

Evaluation Metrics and Post-Hoc Analysis

Model performance was assessed using multiple metrics to capture classification quality across imbalanced and multi-class data. These included accuracy, macro-averaged precision, macro-averaged recall, and macro-F1 score, which provide a balanced view by averaging performance equally across all three workload classes.
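As a worked illustration of macro averaging, the sketch below computes one-vs-rest precision, recall, and F1 for each workload class and then averages the F1 scores with equal weight per class. The function names and the toy labels are hypothetical and are not taken from the study's data; the Class 2 entry corresponds to the quantity that a class-specific F1 scorer would report.

```python
def per_class_prf(y_true, y_pred, classes=(0, 1, 2)):
    """Return {class: (precision, recall, f1)}, treating each class one-vs-rest."""
    out = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall: 2PR / (P + R).
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        out[c] = (precision, recall, f1)
    return out

def macro_f1(y_true, y_pred, classes=(0, 1, 2)):
    """Average the per-class F1 scores, weighting every class equally."""
    scores = per_class_prf(y_true, y_pred, classes)
    return sum(f1 for _, _, f1 in scores.values()) / len(classes)

# Toy labels: 0 = low, 1 = moderate, 2 = high workload.
y_true = [0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 1, 2, 2, 1]
scores = per_class_prf(y_true, y_pred)
overall = macro_f1(y_true, y_pred)
```

In this toy case the moderate class has the most samples yet contributes only one third of the macro score, which is why macro averaging is informative when classes are imbalanced.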
In addition, per-class F1 scores were computed to evaluate how well the model distinguished between low, moderate, and high workload conditions. Each metric offers a distinct perspective on model behavior:

Precision is the ratio of true positive predictions to all positive predictions made by the model (Eq. 1). It reflects the model’s ability to minimize false positives, making it especially important when incorrect high workload predictions could trigger unnecessary system interventions.

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \tag{1}$$

Recall measures the proportion of actual positive cases that were correctly identified (Eq. 2). It captures the model’s capacity to detect all relevant instances, ensuring that cognitively demanding periods are not missed.

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \tag{2}$$

F1 Score is the harmonic mean of precision and recall (Eq. 3). It offers a single, balanced metric that accounts for both detection sensitivity and prediction reliability, particularly useful in the presence of class imbalance.

$$F1 = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$

Following classification, predicted workload labels were temporally aligned with mission timelines. Average predicted workload scores were then computed across the four mission phases, enabling trend analysis over time and comparison with self-reported mental demand ratings.

Feature Importance Analysis

To identify the most influential physiological and oculometric features in workload classification, we analyzed feature importance using the XGBoost model. XGBoost computes importance scores based on average gain, reflecting how much each feature contributes to improving the model’s objective function across all decision splits [37]. This approach captures both the frequency and effectiveness of a feature’s use in reducing classification error.
Features were ranked by their gain scores, and the top contributors were selected to balance interpretability and model performance while minimizing redundancy.

Declarations

Acknowledgements

This research was supported by the U.S. Army Combat Capabilities Development Command Army Research Laboratory (DEVCOM ARL) (CN, AK) and Grant No: W911NF2120108 (MK, JB). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of DEVCOM ARL or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

Funding

This study was supported by the U.S. Army Combat Capabilities Development Command Army Research Laboratory Grant No: W911NF2120108.

Author contributions

AK and CN conceptualized the research; MK performed the research; AK and JB supervised the research; MK prepared the figures and wrote the manuscript. All authors reviewed and edited the manuscript.

Data availability

Due to the sensitivity of the data and participant privacy, the dataset is not publicly available. Reasonable requests for access may be considered by the ARL author (CN).

Competing interests

JB is a shareholder of Dprime LLC. All other authors have no competing interests.

References

1. Wen, S. et al. AdaptiveCoPilot: Design and Testing of a NeuroAdaptive LLM Cockpit Guidance System in both Novice and Expert Pilots. arXiv.org https://arxiv.org/abs/2501.04156v1 (2025).
2. Lematta, G. J. et al. Team Interaction Strategies for Human–Autonomy Teaming in Next Generation Combat Vehicles. Proc. Hum. Factors Ergon. Soc. Annu. Meet. 64, 77–81 (2020).
3. Huang, Q., Xu, X., Wei, Y., Zhang, J. & Jin, X. The impacts of level of automation and cognitive secondary task on the cognitive load of armored vehicle crews. Cogn. Technol. Work (2025) doi:10.1007/s10111-025-00806-9.
4. Longo, L., Wickens, C. D., Hancock, G. & Hancock, P. A. Human Mental Workload: A Survey and a Novel Inclusive Definition. Front. Psychol. 13, 883321 (2022).
5. Alexander, A. & Nygren, T. Examining the Relationship Between Mental Workload and Situation Awareness in a Simulated Air Combat Task. https://apps.dtic.mil/sti/citations/ADA387928.
6. Ranchet, M., Morgan, J. C., Akinwuntan, A. E. & Devos, H. Cognitive workload across the spectrum of cognitive impairments: A systematic review of physiological measures. Neurosci. Biobehav. Rev. 80, 516–537 (2017).
7. Sosnowski, M. J. & Brosnan, S. F. Under pressure: the interaction between high-stakes contexts and individual differences in decision-making in humans and non-human species. Anim. Cogn. 26, 1103–1117 (2023).
8. Brady, C., Sawant, S., Madathil, K. C. & McNeese, N. A Systematic Review on the Effect of Cognitive Fatigue in Teams. Proc. Hum. Factors Ergon. Soc. Annu. Meet. 68, 1287–1291 (2024).
9. Almukhtar, A. et al. Objective Assessment of Cognitive Workload in Surgery. Ann. Surg. 281, 942–951 (2025).
10. Ma, X., Monfared, R., Grant, R. & Goh, Y. M. Determining Cognitive Workload Using Physiological Measurements: Pupillometry and Heart-Rate Variability. Sensors 24, 2010 (2024).
11. Solhjoo, S. et al. Heart Rate and Heart Rate Variability Correlate with Clinical Reasoning Performance and Self-Reported Measures of Cognitive Load. Sci. Rep. 9, 14668 (2019).
12. Luque-Casado, A., Perales, J. C., Cárdenas, D. & Sanabria, D. Heart rate variability and cognitive processing: The autonomic response to task demands. Biol. Psychol. 113, 83–90 (2016).
13. Mark, J. A., Curtin, A., Kraft, A. E., Ziegler, M. D. & Ayaz, H. Mental workload assessment by monitoring brain, heart, and eye with six biomedical modalities during six cognitive tasks. Front. Neuroergonomics 5, (2024).
14. Ekin, M., Krejtz, K., Duarte, C., Duchowski, A. T. & Krejtz, I. Prediction of intrinsic and extraneous cognitive load with oculometric and biometric indicators. Sci. Rep. 15, 5213 (2025).
15. Skaramagkas, V. et al. Review of Eye Tracking Metrics Involved in Emotional and Cognitive Processes. IEEE Rev. Biomed. Eng. 16, 260–277 (2023).
16. Multimodal Assessment of Mental Workload During Automated Vehicle Remote Assistance: Modeling of Eye-Tracking-Related, …. http://ouci.dntb.gov.ua/en/works/4yNn00zx/.
17. Charles, R. L. & Nixon, J. Measuring mental workload using physiological measures: A systematic review. Appl. Ergon. 74, 221–232 (2019).
18. Hirachan, N., Mathews, A., Romero, J. & Rojas, R. F. Measuring Cognitive Workload Using Multimodal Sensors. in 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC) 4921–4924 (2022). doi:10.1109/EMBC48229.2022.9871308.
19. Tao, X. et al. A multimodal physiological dataset for driving behaviour analysis. Sci. Data 11, 378 (2024).
20. Li, Q., Luximon, Y., Zhang, J. & Song, Y. Measuring and classifying students’ cognitive load in pen‐based mobile learning using handwriting, touch gestural and eye‐tracking data. Br. J. Educ. Technol. 55, 625–653 (2024).
21. Delliaux, S., Delaforge, A., Deharo, J.-C. & Chaumet, G. Mental Workload Alters Heart Rate Variability, Lowering Non-linear Dynamics. Front. Physiol. 10, (2019).
22. Vuksanović, V. & Gal, V. Heart rate variability in mental stress aloud. Med. Eng. Phys. 29, 344–349 (2007).
23. Hjortskov, N. et al. The effect of mental stress on heart rate variability and blood pressure during computer work. Eur. J. Appl. Physiol. 92, 84–89 (2004).
24. De Rivecourt, M., Kuperus, M. N., Post, W. J. & Mulder, L. J. M. Cardiovascular and eye activity measures as indices for momentary changes in mental effort during simulated flight. Ergonomics 51, 1295–1319 (2008).
25. Mallick, R., Slayback, D., Touryan, J., Ries, A. J. & Lance, B. J. The use of eye metrics to index cognitive workload in video games. in 2016 IEEE Second Workshop on Eye Tracking and Visualization (ETVIS) 60–64 (2016). doi:10.1109/ETVIS.2016.7851168.
26. Lubetzky, A. V., Coker, E., Arie, L., Aharoni, M. M. H. & Krasovsky, T. Postural Control under Cognitive Load: Evidence of Increased Automaticity Revealed by Center-of-Pressure and Head Kinematics. J. Mot. Behav. 54, 466–479 (2022).
27. Marquart, G., Cabrall, C. & de Winter, J. Review of Eye-related Measures of Drivers’ Mental Workload. Procedia Manuf. 3, 2854–2861 (2015).
28. Liu, Y. et al. Cognitive Load Prediction From Multimodal Physiological Signals Using Multiview Learning. IEEE J. Biomed. Health Inform. 29, 3282–3292 (2025).
29. Lobo, J. L. et al. Cognitive workload classification using eye-tracking and EEG data. in Proceedings of the International Conference on Human-Computer Interaction in Aerospace 1–8 (ACM, Paris France, 2016). doi:10.1145/2950112.2964585.
30. Hart, S. G. & Staveland, L. E. Development of NASA-TLX (Task Load Index): Results of Empirical and Theoretical Research. in Advances in Psychology (eds. Hancock, P. A. & Meshkati, N.) vol. 52 139–183 (North-Holland, 1988).
31. Gommers, R. et al. scipy/scipy: SciPy 1.9.0. Zenodo (2022) doi:10.5281/zenodo.6940349.
32. Makowski, D. et al. NeuroKit2: A Python toolbox for neurophysiological signal processing. Behav. Res. Methods 53, 1689–1696 (2021).
33. Volkmann, F. C., Riggs, L. A. & Moore, R. K. Eyeblinks and Visual Suppression. Science 207, 900–902 (1980).
34. Dalmaijer, E. S., Mathôt, S. & Van der Stigchel, S. PyGaze: An open-source, cross-platform toolbox for minimal-effort programming of eyetracking experiments. Behav. Res. Methods 46, 913–921 (2014).
35. Salvucci, D. D. & Goldberg, J. H. Identifying fixations and saccades in eye-tracking protocols. in Proceedings of the symposium on Eye tracking research & applications - ETRA ’00 71–78 (ACM Press, Palm Beach Gardens, Florida, United States, 2000). doi:10.1145/355017.355028.
36. Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 16, 321–357 (2002).
37. Chen, T. & Guestrin, C. XGBoost: A Scalable Tree Boosting System.
in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (ACM, San Francisco California USA, 2016). doi:10.1145/2939672.2939785.
Kucukosmanoglu","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABKUlEQVRIiWNgGAWjYPCCBCBmg3GYGxg+sMH4Btg1HEDVwtjAOINkLcw8bNhVgoDujNyHnz8wpCX2z0hLfPgzxy6Pn39hm7RN2Z18PrHDBxh+FGBoMbuRbixxgCEnccaNtMPGvNuSiyVnPGyTzjn3zLJNOi2BsQfTYWY30hiAWioSN0ikt0kzbmNO3HDjYJt0btthAzbpHAMGHqxamH/AtEj+3FYP0WIJ1pL/gfEPVi1sYIdtkEg7JsG77XDihvONQOsgtjAwY7PlzDM2izMGacYzzjxLBvrleOLMGYzNlj3nngG1pBkclsGi5Xga842KimTZ/vY0w4c/t1Un9vMfPnjjR9kdA/nZyQ8fvvmDI6hRjJJIAJEHGBAkQcB/gATFo2AUjIJRMBIAANRKcZwNaCxmAAAAAElFTkSuQmCC","orcid":"","institution":"D-Prime LLC","correspondingAuthor":true,"prefix":"","firstName":"Murat","middleName":"","lastName":"Kucukosmanoglu","suffix":""},{"id":503283066,"identity":"35ff2d19-ab43-4359-8d52-56c768b11497","order_by":1,"name":"Justin Brooks","email":"","orcid":"","institution":"D-Prime LLC","correspondingAuthor":false,"prefix":"","firstName":"Justin","middleName":"","lastName":"Brooks","suffix":""},{"id":503283070,"identity":"8e32c7d0-9c16-404d-b848-964564d96b09","order_by":2,"name":"Catherine Neubauer","email":"","orcid":"","institution":"U.S. Army Combat Capabilities Development Command Army Research Laboratory","correspondingAuthor":false,"prefix":"","firstName":"Catherine","middleName":"","lastName":"Neubauer","suffix":""},{"id":503283072,"identity":"7dd03003-5e14-411b-ae33-5d0dc7edd429","order_by":3,"name":"Andrea Krausman","email":"","orcid":"","institution":"U.S. 
Army Combat Capabilities Development Command Army Research Laboratory","correspondingAuthor":false,"prefix":"","firstName":"Andrea","middleName":"","lastName":"Krausman","suffix":""}],"badges":[],"createdAt":"2025-08-03 20:38:08","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-7285350/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7285350/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":89987406,"identity":"07843330-a880-48f2-b1d5-b63df91d3980","added_by":"auto","created_at":"2025-08-27 06:59:12","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":922638,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eTemporal Dynamics of Physiological and Movement Signals During a Mission. \u003c/strong\u003eTime series plots of mean pupil diameter (top), heart rate (middle), and magnetometer Z axis readings (bottom) from participant cs01 across the mission timeline. 
Red dashed lines denote manually annotated high workload events.\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-7285350/v1/bd6699e8b2ce5ed373bb80e3.png"},{"id":89987416,"identity":"572edbde-63bd-47bf-b8a2-b321641b403b","added_by":"auto","created_at":"2025-08-27 06:59:13","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":369316,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eCondition-Based Differences in ECG Features.\u003c/strong\u003e Group means (±95% CI) of z-scored ECG features are plotted for low workload (blue), moderate workload (orange), and high workload (green) conditions.\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-7285350/v1/8a84a0b6bdcf2bcf908a95e3.png"},{"id":89987409,"identity":"1901cb84-bd45-4f5c-a100-b64b7620c5c4","added_by":"auto","created_at":"2025-08-27 06:59:13","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":409264,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eCondition-Based Differences in Eye-Tracking Features.\u003c/strong\u003e Mean z-scored values of selected oculometric features are shown for low workload (blue), moderate workload (orange), and high workload (green) conditions, with 95% confidence intervals.\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-7285350/v1/a57b8a2047fe98057592734f.png"},{"id":89987417,"identity":"920215ff-4a5f-4ea4-addb-e7997a54e95a","added_by":"auto","created_at":"2025-08-27 06:59:13","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":403779,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eCondition-Based Differences in Head IMU Features.\u003c/strong\u003e Mean z-scored values (±95% CI) of 
accelerometer, gyroscope, and magnetometer features are shown for low workload (blue), moderate workload (orange), and high workload (green) conditions.\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-7285350/v1/8a9b481fbc870d7d2d98a80a.png"},{"id":89987423,"identity":"9c7a4cee-0548-4f02-a9dc-5a095418240f","added_by":"auto","created_at":"2025-08-27 06:59:13","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":1186475,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eCondition-Based Distributions of Multimodal Physiological Features. \u003c/strong\u003eViolin plots show the z-scored distributions of six representative features across low workload (blue), moderate workload (red), and high workload (green) conditions: (a) Mean Heart Rate, (b) Mean Pupil Diameter, (c) Fixation Count, (d) Fixation X-Coordinate Standard Deviation, (e) Saccade Count, and (f) Magnetometer Z-Axis (IMU). White circles indicate median values, with black outlines for contrast.\u003c/p\u003e","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-7285350/v1/244a173e885fa46d2b871b7e.png"},{"id":89987438,"identity":"f08ff227-a6c5-41fa-9eed-765016b74bbc","added_by":"auto","created_at":"2025-08-27 06:59:14","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":651999,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eTop Feature Importances from the XGBoost Classifier.\u003c/strong\u003e Bar plot showing the top 25 features ranked by their average gain across cross-validation folds in the XGBoost model. 
Feature bars are color-coded by modality: blue for eye-tracking features, red for ECG features, and green for IMU features.\u003c/p\u003e","description":"","filename":"floatimage6.png","url":"https://assets-eu.researchsquare.com/files/rs-7285350/v1/f63df4dc50d8f39895ce4d3d.png"},{"id":89987435,"identity":"b0ed5cea-6638-442d-8058-29b52019b6e4","added_by":"auto","created_at":"2025-08-27 06:59:14","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":365129,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eCorrelation Between Model-Predicted Workload and Subjective Mental Demand Ratings.\u003c/strong\u003e (a) Crew Station-level: r = 0.14, \u003cem\u003ep\u003c/em\u003e \u0026lt; 0.001. (b) Phase-level: r = 0.21,\u003cem\u003e p\u003c/em\u003e \u0026lt; 0.05. (c) Mission-level: r = 0.59, \u003cem\u003ep\u003c/em\u003e \u0026lt; 0.05.\u003c/p\u003e","description":"","filename":"floatimage7.png","url":"https://assets-eu.researchsquare.com/files/rs-7285350/v1/4a5138df78485689787d08a2.png"},{"id":89987410,"identity":"cce0ff15-c0be-4d75-bf34-6356618a9b2d","added_by":"auto","created_at":"2025-08-27 06:59:13","extension":"png","order_by":8,"title":"Figure 8","display":"","copyAsset":false,"role":"figure","size":484995,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eTemporal Alignment of Model Predictions with Ground Truth Workload Labels. \u003c/strong\u003eThe plot displays a time-series comparison of predicted versus true workload class labels (low, moderate, high) for a representative subject-mission pair. 
The solid blue line with circles indicates ground truth labels, while the dashed orange line with crosses shows model predictions.\u003c/p\u003e","description":"","filename":"floatimage8.png","url":"https://assets-eu.researchsquare.com/files/rs-7285350/v1/c8e6c354e05ced956cae5239.png"},{"id":89987430,"identity":"a2381a28-e076-4dd3-ad61-d84406fe2760","added_by":"auto","created_at":"2025-08-27 06:59:14","extension":"png","order_by":9,"title":"Figure 9","display":"","copyAsset":false,"role":"figure","size":427610,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eSimulator environment used during mission execution.\u003c/strong\u003e The image shows the fixed-base mockup for Section A, where participants were seated at instrumented stations to operate combat vehicles, interact with displays and controls, and communicate with teammates and autonomous agents.\u003c/p\u003e","description":"","filename":"floatimage9.png","url":"https://assets-eu.researchsquare.com/files/rs-7285350/v1/03497a09518e62bccf5a59ac.png"},{"id":89987436,"identity":"7f40205d-8d2f-487c-9782-bd68ab858ce6","added_by":"auto","created_at":"2025-08-27 06:59:14","extension":"png","order_by":10,"title":"Figure 10","display":"","copyAsset":false,"role":"figure","size":1232729,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eMultimodal data preprocessing pipeline for cognitive workload modeling.\u003c/strong\u003e Raw data included ECG, eye-tracking and IMU signals, collected continuously during simulated missions.\u003c/p\u003e","description":"","filename":"floatimage10.png","url":"https://assets-eu.researchsquare.com/files/rs-7285350/v1/f91811ae4699dc0d1c59a412.png"},{"id":100036456,"identity":"84236f65-c8eb-42f4-bf30-4c23fd9a50c3","added_by":"auto","created_at":"2026-01-12 
10:25:00","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":7948162,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7285350/v1/06aa884e-c2b1-43b7-9c2a-056a1f072112.pdf"}],"financialInterests":"Competing interest reported. JB is a shareholder of Dprime LLC. All other authors have no competing interests.","formattedTitle":"Multimodal Classification of Cognitive Workload Using Eye-Tracking, ECG, and Head Motion Data in Simulated Military Missions","fulltext":[{"header":"Introduction","content":"\u003cp\u003eModern military operations require rapid and adaptive decision-making in conditions of uncertainty, time pressure, and high cognitive demand \u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u003c/sup\u003e. As combat platforms such as next-generation ground vehicles continue to evolve toward greater autonomy, human operators are expected to manage complex systems while maintaining situational awareness across multiple mission components \u003csup\u003e\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e. These systems act not just as tools, but as intelligent teammates supporting shared decision-making and dynamic task coordination. However, reductions in crew size and increases in system complexity place greater mental demands on individual operators \u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e, making real-time monitoring of cognitive workload essential.\u003c/p\u003e\u003cp\u003eCognitive workload refers to the portion of an individual\u0026rsquo;s mental resources engaged during task execution \u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e,\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e. 
When workload surpasses cognitive capacity, performance can degrade quickly, especially in high-stakes operational contexts \u003csup\u003e\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e,\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e. Excessive workload has been associated with delayed responses, impaired decision-making, and a higher likelihood of mission-critical errors \u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u003c/sup\u003e. Although subjective assessments such as self‑report surveys are commonly used to evaluate perceived workload, they are inherently retrospective and do not offer the temporal resolution required for real‑time applications \u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e\u003cp\u003eConversely, physiological signals offer an objective and continuous alternative for assessing cognitive workload. Among the most promising indicators are electrocardiogram (ECG) and eye-tracking signals \u003csup\u003e\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e. ECG data provides insight into the autonomic nervous system through features such as heart rate and heart rate variability (HRV), both of which respond to changes in cognitive effort and emotional stress \u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e,\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u003c/sup\u003e. 
Eye-tracking contributes behavioral markers such as pupil size, fixation duration, blink frequency, and saccade dynamics, all of which have been shown to vary with cognitive demand \u003csup\u003e\u003cspan additionalcitationids=\"CR14\" citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e. These signals are noninvasive, can be recorded using wearable sensors, are fieldable, and thus, are suitable for use in complex task environments.\u003c/p\u003e\u003cp\u003eWhile ECG and eye-tracking are each valuable on their own, they also have limitations in isolation, especially in operational settings where noise or motion can degrade signal quality. Combining multiple physiological modalities offers a more robust solution \u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e,\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e. Previous research in fields such as aviation, driving, and human-computer interaction has shown that multimodal approaches improve the accuracy and stability of workload classification systems \u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e,\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e. By capturing complementary aspects of the human response to cognitive demand, multimodal sensing enables a more comprehensive understanding of operator state. 
This is particularly important for adaptive systems that must respond to fluctuations in workload during task execution.\u003c/p\u003e\u003cp\u003eUnlike prior studies that focused on controlled laboratory tasks or individual assessments, this study used unstructured military simulations that incorporate multi-human, multi-agent teams, and expert-annotated workload segments to better reflect real-world demands and mission-specific contexts. The findings support the feasibility of real-time, noninvasive monitoring of cognitive workload and provide a foundation for integrating physiological sensing into adaptive human\u0026ndash;machine systems.\u003c/p\u003e\u003cp\u003eIn this study, we present machine learning frameworks to classify cognitive workload into low, moderate, and high levels using synchronized multimodal data. The dataset includes ECG, eye-tracking, and head movement signals derived from an inertial measurement unit (IMU), collected across 26 unstructured mission simulations involving participant interaction with autonomous technologies. We hypothesize that integrating cardiovascular and oculometric features will enhance workload classification performance compared to unimodal approaches.\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003e\u003cb\u003eTemporal Dynamics of Physiological Signals\u003c/b\u003e\u003c/p\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e presents an example of physiological responses to varying cognitive workload levels during a representative mission, based on data from a single participant (labeled \u0026lsquo;cs01\u0026rsquo;). Red dashed lines mark high workload events, manually annotated by experts according to operational task demands. This time-aligned visualization illustrates how physiological signals evolve across mission phases and workload transitions. Notably, both pupil diameter and heart rate increased during high workload periods. 
The magnetometer Z-axis signal, representing vertical head movement, also exhibited some variability during these intervals compared to lower workload phases.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eCondition-Based Distributions of Physiological Features\u003c/b\u003e\u003c/p\u003e\u003cp\u003eTo assess how ECG-derived features vary across workload levels, we computed the condition-wise means and 95% confidence intervals of the mean for four core metrics: mean heart rate (Mean_HR), maximum heart rate (Max_HR), minimum heart rate (Min_HR), and HRV. Figure\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e presents the standardized mean values of selected features across the three workload classes: low, moderate, and high, with 95% confidence intervals indicating the range within which the true mean is likely to fall for each condition.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eHeart rate metrics (i.e., Mean_HR, Max_HR, and Min_HR) exhibited a clear monotonic increase from low to high workload conditions. In contrast, HRV showed a decreasing trend with rising workload, though the change was much less pronounced.\u003c/p\u003e\u003cp\u003eTo evaluate the sensitivity of eye features to changes in task demand, we examined standardized means of key eye-tracking metrics across low, moderate, and high workload conditions. Figure\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e presents group-level feature distributions with 95% confidence intervals.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eSeveral oculometric features varied across workload conditions. Mean pupil diameter increased with workload. Fixation count was higher under moderate and high workload, while fixation duration decreased. 
Saccade count showed a modest increase with workload, and blink count decreased as workload increased.\u003c/p\u003e\u003cp\u003eTo complement these physiological and oculometric insights, we further examined how head movement features derived from the Tobii Pro Glasses 3\u0026rsquo;s inertial measurement unit (IMU) vary with cognitive workload. The IMU captures translational and rotational head motion through accelerometer, gyroscope, and magnetometer signals. We computed the standardized mean and 95% confidence interval for each IMU feature under workload conditions.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eSeveral IMU features varied across workload conditions (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e). Accelerometer signals along the Y and Z axes decreased with increasing workload. In contrast, magnetometer Y- and Z-axis values increased from low to high workload. Gyroscope features remained relatively stable across all conditions.\u003c/p\u003e\u003cp\u003eFigure 5 displays violin plots for six representative features across low, moderate, and high workload conditions: Mean Heart Rate, Mean Pupil Size, Fixation Count, Fixation X-Coordinate Standard Deviation, Saccade Count, and Magnetometer Z-Axis. These plots show the distributions and central tendencies of each feature within each workload class.\u003c/p\u003e\u003cp\u003eHeart rate and pupil size increased across workload levels. Fixation count also increased, while Fixation X-Coordinate Standard Deviation (Fix_X_Std) was higher under moderate workload and showed minimal change between moderate and high workload. Saccade count exhibited a modest increase with workload. 
Magnetometer Z-axis values increased from low to high workload.\u003c/p\u003e\u003cp\u003e\u003cb\u003eClassification Model Performance\u003c/b\u003e\u003c/p\u003e\u003cp\u003eWe evaluated the performance of three classification models (Decision Tree, Random Forest, and XGBoost) in predicting cognitive workload levels defined as low workload (Class 0), moderate workload (Class 1), and high workload (Class 2). Models were trained using standardized physiological features and assessed via repeated cross-validation, stratified across individual crew stations to ensure subject-independent evaluation.\u003c/p\u003e\u003cp\u003e\u003cb\u003eOverall Results\u003c/b\u003e\u003c/p\u003e\u003cp\u003eAmong the models evaluated, XGBoost demonstrated the highest overall performance, achieving a mean cross-validation accuracy of 0.86 and a macro-averaged F1 score of 0.78 across folds. Random Forest followed with an accuracy of 0.82 and an F1 score of 0.73. Decision Tree yielded the lowest performance, with an accuracy of 0.74 and an F1 score of 0.65. These metrics represent the mean values averaged across five subject-independent folds. 
Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e shows summary statistics for each model.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eOverall Cross-Validation Performance Metrics\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"4\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMetric\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eDecision Tree\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eRandom Forest\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eXGBoost\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eAccuracy\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.74\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.82\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.86\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003ePrecision\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" 
char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.61\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.70\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.77\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eRecall\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.75\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.77\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.79\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eF1 Score\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.65\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.73\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.78\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eTo gain deeper insight into model behavior, we evaluated F1 scores by workload class. All models performed well in identifying low workload conditions, and both XGBoost and Random Forest showed strong performance for moderate workload. High workload classification was more difficult. For Class 2, XGBoost achieved a recall of 0.73 and precision of 0.63, yielding an F1 score of 0.68. This value represents the average Class 2 F1 score across the five subject-independent cross-validation folds. This reduced performance likely reflects increased variability in physiological responses during high-demand periods and the effects of class imbalance. 
In the final dataset, high workload accounted for only 16% of samples, while moderate workload dominated at 78%, contributing to more false positives for Class 2.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eModel Performance by Workload Class\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"5\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eWorkload Class\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eDescription\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eDecision Tree (F1)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eRandom Forest (F1)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eXGBoost (F1)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eClass 0\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eLow Workload\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.59\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c4\"\u003e\u003cp\u003e0.69\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.75\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eClass 1\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eModerate Workload\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.82\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.88\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.91\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eClass 2\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eHigh Workload\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.53\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.61\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.68\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eThese results provide evidence supporting the superior performance of XGBoost for multimodal cognitive workload classification. Accordingly, XGBoost was selected for all subsequent analyses and interpretation.\u003c/p\u003e\u003cp\u003e\u003cb\u003eModality-Specific Model Comparison\u003c/b\u003e\u003c/p\u003e\u003cp\u003eWe trained separate XGBoost classifiers using only ECG features, only eye-tracking features, and only IMU features. All models were evaluated using a consistent five-fold cross-validation approach. 
Their performance was then compared to a multimodal model that integrated all three modalities.\u003c/p\u003e\u003cp\u003eAmong the unimodal models, the eye-tracking\u0026ndash;only model achieved the strongest performance, with an average accuracy of 0.82 and F1 score of 0.75. The IMU-only model followed, with an accuracy of 0.72 and F1 score of 0.56. The ECG-only model showed lower performance, achieving an accuracy of 0.67 and F1 score of 0.43. By comparison, the multimodal model outperformed all unimodal models, yielding the highest accuracy (0.86), F1 score (0.78), recall (0.79), and precision (0.77).\u003c/p\u003e\u003cp\u003eThese results highlight the complementary value of ECG-, eye-, and IMU-derived signals in cognitive workload classification. While each modality contributes independently, combining them provides greater accuracy and robustness in predicting workload states. This improvement supports our hypothesis that multimodal models offer superior predictive performance compared to unimodal approaches.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eOverall Cross-Validation Performance by Modality (XGBoost Model)\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"5\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth 
align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMetric\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eECG Only\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eEye-Tracking Only\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eIMU Only\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eMultimodal (ECG\u0026thinsp;+\u0026thinsp;Eye-Tracking\u0026thinsp;+\u0026thinsp;IMU)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eAccuracy\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.67\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.82\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.72\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.86\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003ePrecision\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.42\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.74\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.56\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.77\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eRecall\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.45\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c3\"\u003e\u003cp\u003e0.78\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.57\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.79\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eF1 Score\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.43\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.75\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.56\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.78\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eFeature Importance\u003c/b\u003e\u003c/p\u003e\u003cp\u003eTo better understand the physiological basis of model predictions, we analyzed feature importance from the multimodal XGBoost classifier (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e). Features were ranked based on their average gain, reflecting their contribution to improved classification performance.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eEye-tracking features ranked highest in the multimodal XGBoost classifier. Pupil Size Mean was the top-ranked feature, followed by Fixation X-Coordinate Standard Deviation (Fix X Std). 
Other eye-related features, including mean and standard deviation of fixation duration, Saccade Count, and Number of Blinks, also appeared among the most important features.\u003c/p\u003e\u003cp\u003eIMU-derived features such as Accelerometer Y, Accelerometer Z, Magnetometer Y, and Magnetometer Z were also ranked among the top contributors.\u003c/p\u003e\u003cp\u003eECG-related features, including Max HR, Mean HR, Min HR, and Signal-to-Noise Ratio (SNR), were present but ranked lower in average gain compared to eye-tracking and IMU features.\u003c/p\u003e\u003cp\u003e\u003cb\u003eCorrelation Between Estimated Workload and Reported Mental Demand\u003c/b\u003e\u003c/p\u003e\u003cp\u003eWe analyzed correlations between the average predicted workload produced by the XGBoost classifier and participants\u0026rsquo; self-reported mental demand ratings. These analyses were conducted at the phase, mission, and crew station (subject) levels to capture fluctuations in workload across multiple temporal and operational scales. Each mission was composed of four distinct phases, enabling finer-grained comparisons of workload dynamics over time. This multilevel approach aligns with how cognitive demand is experienced and assessed in team-based settings, allowing us to examine both shared and individual perceptions of task difficulty.\u003c/p\u003e\u003cp\u003eCorrelations between model-predicted workload and self-reported mental demand varied by aggregation level. At the individual crew station level, we observed a statistically significant but modest correlation (r\u0026thinsp;=\u0026thinsp;0.14, p\u0026thinsp;\u0026lt;\u0026thinsp;0.001). Using median mental demand ratings aggregated across crew stations to obtain team-level estimates, a stronger positive correlation emerged at the mission phase level (r\u0026thinsp;=\u0026thinsp;0.21, p\u0026thinsp;\u0026lt;\u0026thinsp;0.05). 
The strongest association was found at the mission level (r\u0026thinsp;=\u0026thinsp;0.59, p\u0026thinsp;\u0026lt;\u0026thinsp;0.05), where both predicted and self-reported workload values were averaged across crew stations to reduce subject-specific variability. For comparison, we also calculated correlations using mean mental demand ratings. While the overall pattern remained consistent, the associations were weaker: r\u0026thinsp;=\u0026thinsp;0.14 (p\u0026thinsp;\u0026lt;\u0026thinsp;0.001) at the crew station level, r\u0026thinsp;=\u0026thinsp;0.16 (p\u0026thinsp;=\u0026thinsp;0.12) at the phase level, and r\u0026thinsp;=\u0026thinsp;0.27 (p\u0026thinsp;=\u0026thinsp;0.189) at the mission level. These findings highlight the advantage of using the median to mitigate the influence of extreme values and uniform response tendencies in subjective workload ratings.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eTemporal Alignment of Predictions with True Workload Labels\u003c/b\u003e\u003c/p\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003e presents a time-series comparison between the model\u0026rsquo;s predicted workload classes and the ground truth labels for a representative subject during a mission. The model successfully tracks the overall progression of cognitive workload, transitioning from low workload to moderate and occasionally into high workload. 
Predictions exhibit strong temporal alignment with the true labels during extended segments, particularly for low and moderate phases.\u003c/p\u003e\u003cp\u003eHowever, two types of misclassifications are evident: (1) one brief over-prediction to high workload while the ground truth remains at moderate, likely reflecting the model\u0026rsquo;s sensitivity to transient physiological fluctuations (e.g., spikes in heart rate, pupil size, or fixation dispersion); and (2) missed detections of actual high workload periods, as indicated by the lower recall for the high class. These results highlight the challenge of distinguishing brief, noisy workload surges from sustained high workload episodes.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eThis study demonstrates the feasibility of classifying cognitive workload using a multimodal physiological framework that integrates ECG, eye-tracking, and IMU data collected during high-fidelity, complex simulated military missions. The findings highlight the benefit of combining complementary biosignals to improve model accuracy under varying cognitive demands. Among the models tested, XGBoost delivered the strongest performance, achieving an overall accuracy of 0.86 and a macro averaged F1 score of 0.78, outperforming both Random Forest (F1\u0026thinsp;=\u0026thinsp;0.73) and Decision Tree (F1\u0026thinsp;=\u0026thinsp;0.65).\u003c/p\u003e\u003cp\u003eWhile all models effectively detected low and moderate workload conditions, classification performance was lower for high workload segments. This reduction likely reflects increased physiological variability under stress, as well as limitations in the annotation process. Manual labeling is subject to temporal uncertainty and may fail to capture brief or subtle instances of high cognitive demand, leading to noisy or incomplete ground truth. 
To address this, we applied a\u0026thinsp;\u0026plusmn;\u0026thinsp;30-second buffer around annotated high workload events, resulting in 60-second modeling windows. This window length was selected empirically: shorter windows led to reduced precision, while longer ones increased precision at the cost of temporal specificity.\u003c/p\u003e\u003cp\u003eDespite high workload periods comprising only 16% of the dataset, XGBoost achieved a recall of 0.73 and a precision of 0.63 for this class, resulting in an F1 score of 0.68. These results suggest the model effectively detects most high workload events, though some moderate segments are misclassified. Given the class imbalance, the reduced precision for Class 2 is expected. Importantly, this performance substantially exceeds chance-level expectations (F1\u0026thinsp;\u0026asymp;\u0026thinsp;0.16), suggesting that the model learned meaningful physiological patterns associated with elevated cognitive demand. These findings underscore the effectiveness of tree-based ensemble methods for modeling complex, nonlinear relationships in multimodal physiological data.\u003c/p\u003e\u003cp\u003e\u003cb\u003ePhysiological and Behavioral Indicators of Workload\u003c/b\u003e\u003c/p\u003e\u003cp\u003eInterpretation of the temporal dynamics revealed distinct physiological signatures associated with elevated workload. During annotated high workload periods, pupil diameter increased, consistent with heightened arousal and attentional engagement.\u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e Concurrent increases in heart rate likely reflect sympathetic nervous system activation in response to cognitive stress\u003csup\u003e\u003cspan additionalcitationids=\"CR22\" citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e. 
Additionally, fluctuations in magnetometer Z-axis signals during these periods may reflect postural adjustments or head orientation shifts, which are common during intense visual search or physical engagement. Together, these trends confirm that synchronized eye-tracking, cardiovascular, and movement-related signals capture workload-relevant changes in operator state.\u003c/p\u003e\u003cp\u003eCondition-based analyses further revealed how feature distributions change across workload levels. Oculometric features, including pupil size, fixation count, and saccade count, showed consistent increases with task demand, while fixation duration and blink count decreased. These patterns are consistent with more rapid visual scanning and sustained attention under time pressure. Notably, fixation durations tended to shorten as task complexity increased, as observed in both flight simulator exercises \u003csup\u003e\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e\u003c/sup\u003e and a video game-based cognitive load task \u003csup\u003e\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e. However, there is no clear consensus on the behavior of fixation- and saccade-related metrics under cognitive load, as findings vary considerably depending on task type and experimental design\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e. Gaze dispersion (as indexed by Fix X Std) increased from low to moderate workload, then stabilized, possibly reflecting a shift to more exploratory behavior during early task engagement, followed by narrowed focus during high-load phases.\u003c/p\u003e\u003cp\u003eSimilarly, IMU-derived features exhibited systematic changes across workload conditions. 
Accelerometer signals along the Y and Z axes decreased with increasing workload, possibly reflecting reduced head motion or more constrained postural adjustments during cognitively demanding tasks.\u003csup\u003e\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u003c/sup\u003e In contrast, magnetometer values along the same axes increased, suggesting changes in orientation or environmental magnetic field exposure. Gyroscope signals remained relatively stable, indicating that rotational head movements were less sensitive to workload variations, potentially due to the task structure or physical constraints of the simulation environment.\u003c/p\u003e\u003cp\u003e\u003cb\u003eFeature Importance and Modality Contributions\u003c/b\u003e\u003c/p\u003e\u003cp\u003eFeature-level analyses provided interpretable insights into how specific signals reflect changes in workload. As previously described, elevated heart rate and reduced heart rate variability corresponded with increased workload, consistent with heightened sympathetic nervous system activity \u003csup\u003e\u003cspan additionalcitationids=\"CR22\" citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e. Eye-tracking metrics, including increased pupil dilation and higher fixation counts, also demonstrated clear sensitivity to workload fluctuations, aligning with prior studies linking these features to attentional engagement and cognitive effort \u003csup\u003e\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e. For example, more frequent and shorter fixations suggested rapid scanning behavior under time pressure. Blink rate also emerged as a top-ranked feature in the model\u0026rsquo;s feature importance plots. 
Exploratory visual analyses (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e) revealed a consistent decrease in blink frequency during high workload phases, reinforcing previous findings that associate reduced blink rate with heightened visual processing demands\u003csup\u003e\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e. Gaze variability, as indexed by the standard deviation of fixation X-coordinates, increased from low to moderate workload levels and then stabilized at higher workload. This pattern differs from previous studies that reported overall decreases in gaze variability with rising workload\u003csup\u003e\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e, highlighting the potential for nonlinear or context-dependent effects in complex operational tasks. Together, these results underscore the value of combining cardiovascular and oculometric indicators for robust workload monitoring.\u003c/p\u003e\u003cp\u003eIn addition to cardiovascular and eye-tracking signals, IMU-derived features contributed meaningful information to workload classification. Specifically, magnetometer and accelerometer readings along the Z-axis were consistently ranked among the most important predictors. These features may reflect postural adjustments, movement variability, or head orientation changes during cognitively demanding phases.\u003c/p\u003e\u003cp\u003eA central advantage of the proposed framework is its integration of multimodal data sources. ECG, eye-tracking, and IMU signals each capture distinct but complementary aspects of cognitive and physiological functioning. By fusing these modalities, the model achieved a richer and more robust physiological representation of workload. 
This multimodal approach enhanced the model\u0026rsquo;s ability to discriminate nuanced workload transitions across different mission phases, supporting previous findings that multimodal fusion improves classification.\u003csup\u003e\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e,\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e,\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u003c/sup\u003e Notably, our multimodal classifier achieved a macro averaged F1 score of 0.78, outperforming the ECG-only model (F1\u0026thinsp;=\u0026thinsp;0.43), the eye-tracking\u0026ndash;only model (F1\u0026thinsp;=\u0026thinsp;0.75), and the IMU-only model (F1\u0026thinsp;=\u0026thinsp;0.56), though the gain over eye-tracking alone was modest. These findings confirm the complementary value of IMU-derived head movement signals alongside traditional biosignals.\u003c/p\u003e\u003cp\u003e\u003cb\u003eSubjective Validation and Temporal Alignment\u003c/b\u003e\u003c/p\u003e\u003cp\u003eCorrelations between model-predicted workload and self-reported mental demand ratings varied by level of aggregation. The strongest alignment was observed at the mission level (r\u0026thinsp;=\u0026thinsp;0.59), followed by the phase level (r\u0026thinsp;=\u0026thinsp;0.21), and weakest at the individual crew station level (r\u0026thinsp;=\u0026thinsp;0.14). This gradient likely reflects the effects of temporal and spatial averaging: broader aggregations smooth out short-term fluctuations and inter-individual variability, resulting in stronger signal consistency.\u003c/p\u003e\u003cp\u003eAt the individual level, weaker correlations may be attributed to variability in subjective reporting, inconsistent interpretations of mental demand, or physiological differences in responsiveness. Several participants provided uniform ratings across all phases, limiting sensitivity to within-mission changes. 
Moreover, because many annotations were made at the section or platoon level, they may reflect collective task load rather than individual experience. For example, a label such as \u003cb\u003e\u0026ldquo;SECTION 01: HIGH WORKLOAD; Under attack, some confusion\u0026rdquo;\u003c/b\u003e may not represent the cognitive load of each crew member equally. In our labeling procedure, when such section-level annotations were present, we assigned the high workload label to all crew stations within that section during the corresponding window. This decision was made to preserve operational context but may have introduced label noise for individuals whose workload levels diverged from the group average. These findings highlight the inherent challenges in validating physiological workload predictions against subjective ratings in team-based environments and suggest the value of incorporating finer-grained labeling approaches, such as continuous self-reporting or real-time performance-based metrics.\u003c/p\u003e\u003cp\u003e\u003cb\u003eImplications and Future Directions\u003c/b\u003e\u003c/p\u003e\u003cp\u003eThis study demonstrates the feasibility and utility of using multimodal, noninvasive physiological sensing to monitor cognitive workload in complex operational settings. The integration of eye-tracking, IMU and ECG data supports accurate, real-time classification of workload, with direct applications in military, industrial, and other high-stakes environments. The use of within-subject normalization and subject-independent validation enhances generalizability and allows for both individualized and group-level insights.\u003c/p\u003e\u003cp\u003eSeveral limitations warrant consideration. Although the sample size was sufficient for model development, it may not capture the full spectrum of physiological variability present in broader populations, including older adults or individuals with different training and experience levels. 
High workload annotations were based on expert judgment, which, despite following structured criteria, remains subjective and may not fully capture nuanced or short-duration workload events. This limitation was particularly evident in fast-paced simulation environments, where rapid task transitions made it challenging for annotators to detect every instance of high workload. Incorporating multiple annotators may help increase labeling reliability in such dynamic settings. Additionally, integrating continuous self-reports or objective performance indicators could further improve label fidelity in future work.\u003c/p\u003e\u003cp\u003eFinally, the current framework models workload as a discrete, three-level construct. However, cognitive load often varies continuously and dynamically. Future studies should explore regression-based or temporal modeling approaches, such as recurrent neural networks, temporal attention mechanisms, or state-space models, to better capture moment-to-moment workload transitions. Additionally, efforts to deploy this framework in real-world operational contexts will be critical to advancing adaptive human-machine teaming systems that respond intelligently to operator state.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThis study introduces a validated machine learning framework for classifying cognitive workload using synchronized eye-tracking, ECG, and Head IMU data collected during high-fidelity simulated military missions. XGBoost achieved the highest performance, with an accuracy of 0.86 and a macro averaged F1 score of 0.78. These findings demonstrate that multimodal physiological signals can reliably distinguish between low, moderate, and high workload states, achieving strong classification performance with interpretable feature contributions. 
Eye-tracking features emerged as dominant predictors, with complementary insights from head movement and cardiovascular data, underscoring the value of multimodal integration.\u003c/p\u003e\u003cp\u003eThese results lay a strong foundation for real-time cognitive state monitoring and the development of intelligent, adaptive interfaces in mission-critical environments. Future research should expand this work by incorporating more diverse populations and exploring deployment in real-world operational contexts. Additionally, integrating continuous labeling, temporal modeling approaches, and objective performance measures will be critical for capturing the dynamic and individualized nature of cognitive workload. These advances will support the next generation of adaptive human-machine teaming systems capable of responding fluidly to operator states in complex, high-stakes domains.\u003c/p\u003e"},{"header":"Methods","content":"\u003cp\u003eThis study was approved by the Army Research Laboratory’s Institutional Review Board. All study procedures involving human participants complied with the ethical standards set by the IRB and the principles of the Belmont Report. All methods were performed in accordance with the relevant guidelines and regulations. Written informed consent was obtained from all participants prior to participation. Participants were given the opportunity to ask questions, and all inquiries were addressed.\u003c/p\u003e\u003cp\u003eThis study was conducted in support of the next generation combat vehicle modernization priority. Experiments were performed in the Information for Mixed Squads (INFORMS) Laboratory at Aberdeen Proving Ground, a fully instrumented simulation facility designed for large-scale, platoon-level research.\u003c/p\u003e\u003cp\u003eThirty participants, organized into two platoons, completed 26 simulated missions against a live opposing force (OPFOR) during two separate 10-day experimental sessions. 
Participants operated both manned and unmanned ground vehicles in dynamic, team-based scenarios involving reconnaissance, target engagement, and coordinated decision-making. Some missions incorporated the Dynamic Task Allocation System (DTAS), an intelligent interface that adaptively supports real-time task management based on operator workload and mission demands.\u003c/p\u003e\u003cp\u003eAdditionally, the simulation environment was equipped with autonomous software agents capable of performing various crew functions. These included maneuvering vehicles to designated locations, controlling slewable weapon systems, and enhancing situational awareness via computer vision–based Aided Target Recognition (AiTR), delivered through the Automatic Detection System (ADS). These systems enabled adaptive human-agent teaming and added operational complexity relevant to future battlefield environments.\u003c/p\u003e\u003cp\u003ePhysiological data, including ECG, eye-tracking, and head movement signals, were continuously recorded from participants’ crew stations during mission execution to support cognitive workload modeling. Each mission lasted approximately 90 minutes and was divided into four operational phases, delineated by Phase Lines. For modeling purposes, mission segments were classified into three workload conditions:\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eLow workload (Baseline): a pre-mission rest period used to collect physiological baseline data. 
During this period, participants watched a relaxing video of a lava lamp (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://youtu.be/h_lQ2tMgLVM\u003c/span\u003e\u003cspan address=\"https://youtu.be/h_lQ2tMgLVM\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) for 5 minutes while remaining seated with both feet on the floor.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eModerate workload: the main mission execution phase excluding baseline and high workload periods.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eHigh workload: identified by expert raters based on indicators such as task saturation, complex decision-making, and operational urgency.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003cp\u003eHigh workload segments were independently annotated by at least two trained raters using a structured set of pre-defined observational criteria, including markers of workload, team communication, cohesion, and mission-specific events. Final annotations were aligned with 60-second windows surrounding performance-critical events to support robust modeling of workload fluctuations.\u003c/p\u003e\u003cp\u003e\u003cb\u003eParticipants and Platoon Structure\u003c/b\u003e\u003c/p\u003e\u003cp\u003eTwo platoons, each comprising 15 individuals (N = 30 total), participated in separate 10-day simulation sessions. Each platoon included:\u003c/p\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eFourteen crew members organized into two 7-person sections. 
Each section included:\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eOne Section Commander overseeing the team’s tactical actions.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eThree two-person dyads, each responsible for operating a Manned Control Vehicle (MCV) or a pair of Robotic Combat Vehicles (RCVs).\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003cp\u003e\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eOne Higher Control (HICON) operator per platoon, who served as a mission facilitator, coordinating scenario updates, communication flows, and operational stimuli in collaboration with the research team.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003cp\u003eParticipants were recruited from active-duty U.S. Army Soldiers. Soldiers ranged in age from 19 to 33 years (mean = 24.3, SD = 4.5). Participants had served in the U.S. military for an average of 5.1 years (SD = 4.0). All sessions were conducted over two weeks (Monday - Friday), limited to 8 hours per day (including a lunch break), and included comprehensive training and orientation prior to mission engagement.\u003c/p\u003e\u003cp\u003eThe simulated missions were conducted in a fully instrumented, fixed-base team simulator designed to replicate the spatial layout and operational demands of real-world vehicle crew stations. As shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003e, participants interacted with touchscreen monitors, steering controls, and weapon interfaces, while communicating through push-to-talk headsets within and across sections. 
Throughout each mission, Soldiers wore physiological monitoring devices, including Zephyr™ BioHarness 3.0 chest straps (Zephyr Technology, Annapolis, MD, USA) for ECG recording and Tobii Pro Glasses 3 (Tobii AB, Danderyd, Sweden) for eye tracking and head movement, to continuously capture cardiovascular and oculometric signals during task execution.\u003c/p\u003e\u003cp\u003e\u003cb\u003eSubjective State Measures\u003c/b\u003e\u003c/p\u003e\u003cp\u003eAll 14 crew stations from each group completed the NASA Task Load Index (NASA-TLX) \u003csup\u003e\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e\u003c/sup\u003e following each mission phase. Among the six subscales, mental demand, physical demand, temporal demand, performance, effort, and frustration, we focused specifically on the mental demand item, which asked: “How much mental and perceptual activity was required, such as thinking, looking, or searching? Was the task easy or demanding?” Responses were rated on a scale from 0 (low) to 100 (high) and captured participants’ perceived workload immediately after each mission phase.\u003c/p\u003e\u003cp\u003eTo account for the variability and occasional uniformity in subjective ratings across individuals, we used the median of mental demand scores and the mean of predicted workload values. At the crew station level, we computed values separately for each participant without further aggregation. At the phase and mission levels, we aggregated data by taking the median of mental demand scores across all crew stations for a given phase or mission, while averaging the corresponding predicted workload values. 
This approach helped reduce the influence of outliers in subjective responses while preserving the continuous output of the model.\u003c/p\u003e\u003cp\u003e\u003cb\u003ePhysiological Feature Extraction\u003c/b\u003e\u003c/p\u003e\u003cp\u003eTo assess moment-to-moment fluctuations in cognitive workload, we extracted features from three modalities. These features were computed over overlapping temporal windows and served as inputs to machine learning models.\u003c/p\u003e\u003cp\u003e\u003cb\u003eECG Feature Extraction\u003c/b\u003e\u003c/p\u003e\u003cp\u003eECG signals were collected using the Zephyr™ BioHarness system, a wearable chest strap device that provides real-time heart rate and respiratory measurements. Data were sampled at 250 Hz and processed through a custom Python-based pipeline designed to extract features reflecting autonomic nervous system dynamics and physiological arousal. The raw signals were first bandpass filtered using a 4th-order Butterworth filter (cutoff: 0.25–30 Hz) to suppress baseline drift and high-frequency noise. The filter was implemented in Python 3.11 using the butter and filtfilt functions from the scipy.signal module, applying zero-phase forward and reverse filtering to eliminate phase distortion \u003csup\u003e\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e\u003c/sup\u003e. 
Following filtering, the signals were further cleaned using the neurokit2 library (v0.2.2), which applies the Engzeemod2012 algorithm for robust R-peak detection \u003csup\u003e\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e\u003cp\u003eFeatures extracted over 30-second windows (with 1-second shifts) included:\u003c/p\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eMean Heart Rate: the average number of beats per minute, serving as a general indicator of physiological arousal.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eMaximum and Minimum Heart Rate: the peak and trough heart rate values observed during the analysis window, providing a range of autonomic reactivity.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eHeart Rate Variability: calculated using the root mean square of successive differences in inter-beat intervals (RMSSD), reflecting parasympathetic modulation and cognitive effort.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eSignal-to-Noise Ratio (SNR): estimated as the logarithmic ratio between the signal power of ECG segments surrounding R-peaks and the residual noise power from their cleaned counterparts, offering a quantitative assessment of ECG signal quality.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003cp\u003eThese features were aligned with mission timelines and workload annotations to support fine-grained modeling of workload-related physiological responses.\u003c/p\u003e\u003cp\u003e\u003cb\u003eEye-Tracking Feature Extraction\u003c/b\u003e\u003c/p\u003e\u003cp\u003eEye-tracking data were recorded using Tobii Pro Glasses at a sampling rate of 100 Hz, capturing timestamped horizontal and vertical gaze coordinates along with pupil diameter measurements from both eyes. 
To enable high-resolution temporal analysis aligned with mission timelines, a custom Python-based pipeline was developed to extract behavioral and oculometric features from sliding windows of 10 seconds, with a 1-second step size.\u003c/p\u003e\u003cp\u003eFor each window, a comprehensive set of features was computed across four categories: pupil dynamics, blink activity, saccades, and fixations. Blink events were defined as gaps in valid eye-tracking data lasting at least 300 milliseconds, following conventions used in the PyGazeAnalyser toolkit and related literature\u003csup\u003e\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e\u003c/sup\u003e. Pupil metrics included mean and standard deviation of diameter, as well as left–right inter-eye differences. Saccades were identified using inter-sample thresholds on velocity (\u0026gt; 0.2 units/s) and acceleration (\u0026gt; 30 units/s²) computed from gaze positions, which are normalized screen coordinates ranging from 0 to 1 (with [0,0] indicating the top-left and [1,1] the bottom-right of the display). Valid saccades had durations between 20 and 300 ms. For each window, we computed saccade count, amplitude, and velocity statistics.\u003c/p\u003e\u003cp\u003eFixation detection was based on spatial clustering of gaze points. Fixations were defined as sequences of points that remained within 0.05 unit² for a minimum of 100 milliseconds, consistent with established parameters in eye movement research\u003csup\u003e\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e\u003c/sup\u003e. Fixations are rarely shorter than 100 ms and are often in the range of 200–400 ms.\u003csup\u003e\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e\u003c/sup\u003e Extracted fixation features included count, mean and variability of fixation duration, and spatial dispersion along the x and y axes. 
A summary of the extracted eye-tracking features and their descriptions is provided in Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e.\u003c/p\u003e\u003cdiv class=\"gridtable\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eEye-Tracking Features Extracted Per Time Window\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003c/colgroup\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFeature Category\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFeature Name\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eDescription\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003e\u003cb\u003ePupil Dynamics\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003ePupil Size Mean / Std\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eAverage and variability of pupil diameter across both eyes\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eLeft–Right Diameter Diff (Mean / Std)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eDifference between left and right pupil diameters (mean and 
variability)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eBlink Activity\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eBlink Count\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eNumber of blinks, inferred from data gaps ≥ 300 ms\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e\u003cb\u003eSaccades\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSaccade Count\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eNumber of detected saccades per window\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSaccade Distance (Mean / Std)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eAmplitude of eye movements between fixations\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSaccade Velocity (Mean / Std)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eSpeed of saccadic eye movements\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e\u003cb\u003eFixations\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFixation Count\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eNumber of fixations per window\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFixation Duration (Mean / Std)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eDuration of gaze stabilizations on 
a single point\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFixation Coordinates (X/Y Mean / Std)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eSpatial distribution of fixations (center and spread in x and y directions)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/table\u003e\u003c/div\u003e\u003cp\u003e\u003cb\u003eHead IMU Feature Extraction\u003c/b\u003e\u003c/p\u003e\u003cp\u003eIn addition to ECG and eye-tracking data, we extracted head movement signals from the IMU embedded in the Tobii Pro Glasses 3. A Python-based pipeline was developed to compute mean IMU-derived metrics from 10-second sliding windows with a 1-second step size. The glasses contain three types of sensors: accelerometer, gyroscope, and magnetometer. These sensors capture translational and rotational head movements, which may serve as indirect indicators of workload-induced behavior.\u003c/p\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eAccelerometer\u003c/b\u003e: Measures linear acceleration along the X, Y, and Z axes in m/s², with gravity acting along the Y-axis at rest (approximately −9.8 m/s²). Sampled at 100 Hz.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eGyroscope\u003c/b\u003e: Captures angular velocity in degrees per second. Yaw (Y-axis), pitch (X-axis), and roll (Z-axis) rotations correspond to typical head movements such as shaking, nodding, and tilting, respectively. Sampled at 100 Hz.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eMagnetometer\u003c/b\u003e: Measures local magnetic field strength in microteslas (µT) on each axis and is useful for estimating head orientation. 
Sampled at 10 Hz.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003cp\u003eTo enable multimodal fusion, ECG, eye-tracking, and head IMU features, represented as time series data with one-second resolution, were averaged over 60-second windows aligned with workload annotations. Features were first extracted in shorter windows: 10 seconds for eye-tracking and IMU signals, and 30 seconds for ECG, using one-second step sizes to retain high temporal granularity. These short-window features were then aggregated to match the 60-second workload labeling resolution. This approach produced synchronized and smoothed feature vectors while preserving the temporal structure. A full overview of the multimodal preprocessing pipeline is presented in Fig.\u0026nbsp;\u003cspan refid=\"Fig10\" class=\"InternalRef\"\u003e10\u003c/span\u003e.\u003c/p\u003e\u003cp\u003e\u003cb\u003eLabeling and Normalization Strategy\u003c/b\u003e\u003c/p\u003e\u003cp\u003eFor modeling purposes, mission data were categorized into three workload levels: low, moderate, and high, based on operational phase and observed task complexity.\u003c/p\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eLow workload corresponded to a pre-mission resting baseline period.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eModerate workload encompassed routine mission execution, excluding baseline and high workload segments.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eHigh workload segments were identified by two trained researchers physically present during all mission simulations. Annotator 1 labeled Section 1 (7 people); Annotator 2 labeled Section 2 (7 people). Using unstructured observational criteria, including task saturation, rapid decision-making, and intense coordination, raters monitored team interactions and participant behavior in real time.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003cp\u003eAnnotations were applied in 60-second windows. 
To account for physiological response lag, a ±30-second buffer was added around each event. Events occurring within 5 seconds of one another were merged into a single segment; others were treated as distinct. Labels were assigned only to crew stations actively engaged in the corresponding workload episode, ensuring precise alignment with physiological and behavioral data. The 60-second window duration was informed by typical engagement dynamics observed in tactical military simulations, where interactions between OPFOR and Blue forces often unfold within 30 to 60 seconds.\u003c/p\u003e\u003cp\u003eTo address inter-individual variability in physiological baselines, all extracted features were z-score normalized within each crew station, preserving relative workload dynamics while minimizing baseline differences.\u003c/p\u003e\u003cp\u003eThe final dataset was imbalanced across workload levels: low workload comprised 6% of the samples, moderate workload 78%, and high workload 16%. This distribution underscores the predominance of moderate workload periods during mission execution and the relative sparsity of high workload events. Such class imbalance is a common challenge in modeling operational cognitive states, where periods of intense cognitive demand are naturally less frequent but are the most important to detect accurately.\u003c/p\u003e\u003cp\u003e\u003cb\u003eMachine Learning Framework and Feature Selection\u003c/b\u003e\u003c/p\u003e\u003cp\u003eWe implemented and evaluated three supervised machine learning classifiers: Decision Tree, Random Forest, and XGBoost. 
These models were trained and validated using a grouped cross-validation framework designed to preserve independence across crew stations.\u003c/p\u003e\u003cp\u003e\u003cb\u003eData Preparation and Preprocessing\u003c/b\u003e\u003c/p\u003e\u003cp\u003eThe dataset comprised multimodal features extracted from synchronized ECG, eye-tracking, and head movement signals. These features were merged across resampled segments representing each workload class. We employed an early fusion strategy by concatenating ECG, eye-tracking, and IMU features prior to model training, allowing the classifier to jointly learn multimodal representations of cognitive workload. Several preprocessing steps were applied:\u003c/p\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eStandardization: Features were z-scored within each subject and clipped to the 1st and 99th percentiles to reduce the influence of outliers.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eLabeling: Segments were labeled as 0 (Low), 1 (Moderate), or 2 (High Workload), and concatenated into a unified dataset.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eFeature Filtering: Non-informative features (e.g., metadata, survey responses, identifiers, signal artifacts) were excluded.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eMissing Data Handling: Rows with missing values in more than 20% of columns were removed; these rows accounted for 5.1% of the dataset. 
The remaining missing values in numerical features were imputed using k-Nearest Neighbors (k = 2), based on similarity in feature space.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003cp\u003e\u003cb\u003eCross-Validation and Grouping Strategy\u003c/b\u003e\u003c/p\u003e\u003cp\u003eTo address class imbalance, the Synthetic Minority Oversampling Technique (SMOTE)\u003csup\u003e\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e\u003c/sup\u003e was applied exclusively to the training data within each cross-validation fold, ensuring that no synthetic data were included in the test set. This strategy avoided data leakage and preserved the natural class distribution during evaluation. A 5-fold GroupKFold strategy was used to prevent subject overlap between training and testing folds.\u003c/p\u003e\u003cp\u003e\u003cb\u003eHyperparameter Tuning and Model Optimization\u003c/b\u003e\u003c/p\u003e\u003cp\u003eEach model underwent hyperparameter optimization using RandomizedSearchCV with 50 search iterations and parallelized evaluation. Rather than optimizing for general classification performance alone, we prioritized the detection of high workload segments (Class 2), which are rare but operationally critical. Accordingly, we used a custom scoring function (f1_class2_scorer) that computes the F1 score specifically for Class 2, placing emphasis on the model’s ability to correctly identify elevated cognitive load.\u003c/p\u003e\u003cp\u003eFor XGBoost, we additionally specified eval_metric='mlogloss', the multiclass log loss that XGBoost reports on evaluation data. 
This evaluation metric is used for monitoring and early stopping rather than for constructing trees, and it did not override our primary selection criterion during tuning, which remained based on the Class 2 F1 score.\u003c/p\u003e\u003cp\u003eThis choice of scoring criterion ensured that hyperparameter selection favored configurations that improve recall and precision for high workload episodes, rather than inflating performance on the more common moderate workload class. The hyperparameter combination that yielded the highest mean F1 score for Class 2 across cross-validation folds was selected as the best configuration for each model. The optimal hyperparameters are summarized in Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e.\u003c/p\u003e\u003cdiv class=\"gridtable\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eBest-performing Hyperparameters for Each Classifier.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"2\"\u003e\u003c/colgroup\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eClassifier\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eBest Hyperparameters\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDecision Tree\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e{'min_samples_split': 10, 'min_samples_leaf': 1, 'max_leaf_nodes': 30, 'max_features': None, 'max_depth': 9, 'ccp_alpha': 0.0}\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" 
colname=\"c1\"\u003e\u003cp\u003eRandom Forest\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e{'n_estimators': 100, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_leaf_nodes': None, 'max_features': None, 'max_depth': 11, 'ccp_alpha': 0.0, 'bootstrap': True}\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eXGBoost\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e{'tree_method': 'hist', 'subsample': 0.6, 'reg_lambda': 1.5, 'reg_alpha': 0.5, 'n_estimators': 300, 'min_child_weight': 3, 'max_depth': 7, 'max_delta_step': 0, 'learning_rate': 0.1, 'gamma': 1.0, 'colsample_bytree': 1.0, 'colsample_bynode': 0.6, 'colsample_bylevel': 1.0}\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/table\u003e\u003c/div\u003e\u003cp\u003e\u003cb\u003eEvaluation Metrics and Post-Hoc Analysis\u003c/b\u003e\u003c/p\u003e\u003cp\u003eModel performance was assessed using multiple metrics to capture classification quality across imbalanced and multi-class data. These included accuracy, macro-averaged precision, macro-averaged recall, and macro-averaged F1 score; the macro-averaged metrics provide a balanced view by weighting all three workload classes equally. In addition, per-class F1 scores were computed to evaluate how well the model distinguished between low, moderate, and high workload conditions.\u003c/p\u003e\u003cp\u003eEach metric offers a distinct perspective on model behavior:\u003c/p\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003ePrecision is the ratio of true positive predictions to all positive predictions made by the model (Eq.\u0026nbsp;\u003cspan refid=\"Equ1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). 
It reflects the model’s ability to minimize false positives, making it especially important when incorrect high workload predictions could trigger unnecessary system interventions.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003cdiv id=\"Equ1\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ1\" name=\"EquationSource\"\u003e\n$$\\:Precision\\:=\\frac{True\\:Positives}{True\\:Positives\\:+\\:False\\:Positives}\\:$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e1\u003c/div\u003e\u003c/div\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eRecall measures the proportion of actual positive cases that were correctly identified (Eq.\u0026nbsp;\u003cspan refid=\"Equ2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). It captures the model’s capacity to detect all relevant instances, ensuring that cognitively demanding periods are not missed.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003cdiv id=\"Equ2\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ2\" name=\"EquationSource\"\u003e\n$$\\:Recall\\:=\\frac{True\\:Positives}{True\\:Positives\\:+\\:False\\:Negatives}\\:$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e2\u003c/div\u003e\u003c/div\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eF1 Score is the harmonic mean of precision and recall (Eq.\u0026nbsp;\u003cspan refid=\"Equ3\" class=\"InternalRef\"\u003e3\u003c/span\u003e). 
It offers a single, balanced metric that accounts for both detection sensitivity and prediction reliability, particularly useful in the presence of class imbalance.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003cdiv id=\"Equ3\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ3\" name=\"EquationSource\"\u003e\n$$\\:F1\\:=2\\frac{\\left(Precision\\:\\times\\:Recall\\right)}{\\left(Precision\\:+\\:Recall\\right)}\\:$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e3\u003c/div\u003e\u003c/div\u003e\u003cp\u003eFollowing classification, predicted workload labels were temporally aligned with mission timelines. Average predicted workload scores were then computed across the four mission phases, enabling trend analysis over time and comparison with self-reported mental demand ratings.\u003c/p\u003e\u003cp\u003e\u003cb\u003eFeature Importance Analysis\u003c/b\u003e\u003c/p\u003e\u003cp\u003eTo identify the most influential physiological and oculometric features in workload classification, we analyzed feature importance using the XGBoost model. XGBoost computes importance scores based on average gain, reflecting how much each feature contributes to improving the model’s objective function across all decision splits\u003csup\u003e\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e\u003c/sup\u003e. This approach captures both the frequency and effectiveness of a feature’s use in reducing classification error. Features were ranked by their gain scores, and the top contributors were selected to balance interpretability and model performance while minimizing redundancy.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/h2\u003e\n\u003cp\u003eThis research was supported by the U.S. Army Combat Capabilities Development Command Army Research Laboratory (DEVCOM ARL) (CN, AK) and Grant No: W911NF2120108 (MK, JB). 
The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of DEVCOM ARL or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.\u003c/p\u003e\n\u003ch2\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/h2\u003e\n\u003cp\u003eThis study was supported by the U.S. Army Combat Capabilities Development Command Army Research Laboratory Grant No: W911NF2120108.\u003c/p\u003e\n\u003ch2\u003e\u003cstrong\u003eAuthor contributions\u003c/strong\u003e\u003c/h2\u003e\n\u003cp\u003eAK and CN conceptualized the research; MK performed the research; AK and JB supervised the research; MK prepared the figures and wrote the manuscript; All authors reviewed and edited the manuscript.\u003c/p\u003e\n\u003ch2\u003e\u003cstrong\u003eData availability\u003c/strong\u003e\u003c/h2\u003e\n\u003cp\u003eDue to the sensitivity of the data and participant privacy, the dataset is not publicly available.\u003cbr\u003e\u0026nbsp;Reasonable requests for access may be considered by the ARL author (CN).\u003c/p\u003e\n\u003ch2\u003e\u003cstrong\u003eCompeting interests\u0026nbsp;\u003c/strong\u003e\u003c/h2\u003e\n\u003cp\u003eJB is a shareholder of Dprime LLC. All other authors have no competing interests.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eWen, S. \u003cem\u003eet al.\u003c/em\u003e AdaptiveCoPilot: Design and Testing of a NeuroAdaptive LLM Cockpit Guidance System in both Novice and Expert Pilots. \u003cem\u003earXiv.org \u003c/em\u003ehttps://arxiv.org/abs/2501.04156v1 (2025).\u003c/li\u003e\n\u003cli\u003eLematta, G. J. \u003cem\u003eet al.\u003c/em\u003e Team Interaction Strategies for Human\u0026ndash;Autonomy Teaming in Next Generation Combat Vehicles. \u003cem\u003eProc. Hum. Factors Ergon. Soc. Annu. Meet. 
\u003c/em\u003e\u003cstrong\u003e64\u003c/strong\u003e, 77\u0026ndash;81 (2020).\u003c/li\u003e\n\u003cli\u003eHuang, Q., Xu, X., Wei, Y., Zhang, J. \u0026amp; Jin, X. The impacts of level of automation and cognitive secondary task on the cognitive load of armored vehicle crews. \u003cem\u003eCogn. Technol. Work\u003c/em\u003e (2025) doi:10.1007/s10111-025-00806-9.\u003c/li\u003e\n\u003cli\u003eLongo, L., Wickens, C. D., Hancock, G. \u0026amp; Hancock, P. A. Human Mental Workload: A Survey and a Novel Inclusive Definition. \u003cem\u003eFront. Psychol. \u003c/em\u003e\u003cstrong\u003e13\u003c/strong\u003e, 883321 (2022).\u003c/li\u003e\n\u003cli\u003eAlexander, A. \u0026amp; Nygren, T. EXAMINING THE RELATIONSHIP BETWEEN MENTAL WORKLOAD AND SITUATION AWARENESS IN A SIMULATED AIR COMBAT TASK. https://apps.dtic.mil/sti/citations/ADA387928.\u003c/li\u003e\n\u003cli\u003eRanchet, M., Morgan, J. C., Akinwuntan, A. E. \u0026amp; Devos, H. Cognitive workload across the spectrum of cognitive impairments: A systematic review of physiological measures. \u003cem\u003eNeurosci. Biobehav. Rev. \u003c/em\u003e\u003cstrong\u003e80\u003c/strong\u003e, 516\u0026ndash;537 (2017).\u003c/li\u003e\n\u003cli\u003eSosnowski, M. J. \u0026amp; Brosnan, S. F. Under pressure: the interaction between high-stakes contexts and individual differences in decision-making in humans and non-human species. \u003cem\u003eAnim. Cogn. \u003c/em\u003e\u003cstrong\u003e26\u003c/strong\u003e, 1103\u0026ndash;1117 (2023).\u003c/li\u003e\n\u003cli\u003eBrady, C., Sawant, S., Madathil, K. C. \u0026amp; McNeese, N. A Systematic Review on the Effect of Cognitive Fatigue in Teams. \u003cem\u003eProc. Hum. Factors Ergon. Soc. Annu. Meet. \u003c/em\u003e\u003cstrong\u003e68\u003c/strong\u003e, 1287\u0026ndash;1291 (2024).\u003c/li\u003e\n\u003cli\u003eAlmukhtar, A. \u003cem\u003eet al.\u003c/em\u003e Objective Assessment of Cognitive Workload in Surgery. \u003cem\u003eAnn. Surg. 
\u003c/em\u003e\u003cstrong\u003e281\u003c/strong\u003e, 942\u0026ndash;951 (2025).\u003c/li\u003e\n\u003cli\u003eMa, X., Monfared, R., Grant, R. \u0026amp; Goh, Y. M. Determining Cognitive Workload Using Physiological Measurements: Pupillometry and Heart-Rate Variability. \u003cem\u003eSensors \u003c/em\u003e\u003cstrong\u003e24\u003c/strong\u003e, 2010 (2024).\u003c/li\u003e\n\u003cli\u003eSolhjoo, S. \u003cem\u003eet al.\u003c/em\u003e Heart Rate and Heart Rate Variability Correlate with Clinical Reasoning Performance and Self-Reported Measures of Cognitive Load. \u003cem\u003eSci. Rep. \u003c/em\u003e\u003cstrong\u003e9\u003c/strong\u003e, 14668 (2019).\u003c/li\u003e\n\u003cli\u003eLuque-Casado, A., Perales, J. C., C\u0026aacute;rdenas, D. \u0026amp; Sanabria, D. Heart rate variability and cognitive processing: The autonomic response to task demands. \u003cem\u003eBiol. Psychol. \u003c/em\u003e\u003cstrong\u003e113\u003c/strong\u003e, 83\u0026ndash;90 (2016).\u003c/li\u003e\n\u003cli\u003eMark, J. A., Curtin, A., Kraft, A. E., Ziegler, M. D. \u0026amp; Ayaz, H. Mental workload assessment by monitoring brain, heart, and eye with six biomedical modalities during six cognitive tasks. \u003cem\u003eFront. Neuroergonomics \u003c/em\u003e\u003cstrong\u003e5\u003c/strong\u003e, (2024).\u003c/li\u003e\n\u003cli\u003eEkin, M., Krejtz, K., Duarte, C., Duchowski, A. T. \u0026amp; Krejtz, I. Prediction of intrinsic and extraneous cognitive load with oculometric and biometric indicators. \u003cem\u003eSci. Rep. \u003c/em\u003e\u003cstrong\u003e15\u003c/strong\u003e, 5213 (2025).\u003c/li\u003e\n\u003cli\u003eSkaramagkas, V. \u003cem\u003eet al.\u003c/em\u003e Review of Eye Tracking Metrics Involved in Emotional and Cognitive Processes. \u003cem\u003eIEEE Rev. Biomed. Eng. 
\u003c/em\u003e\u003cstrong\u003e16\u003c/strong\u003e, 260\u0026ndash;277 (2023).\u003c/li\u003e\n\u003cli\u003eMultimodal Assessment of Mental Workload During Automated Vehicle Remote Assistance: Modeling of Eye-Tracking-Related, \u0026hellip;. http://ouci.dntb.gov.ua/en/works/4yNn00zx/.\u003c/li\u003e\n\u003cli\u003eCharles, R. L. \u0026amp; Nixon, J. Measuring mental workload using physiological measures: A systematic review. \u003cem\u003eAppl. Ergon. \u003c/em\u003e\u003cstrong\u003e74\u003c/strong\u003e, 221\u0026ndash;232 (2019).\u003c/li\u003e\n\u003cli\u003eHirachan, N., Mathews, A., Romero, J. \u0026amp; Rojas, R. F. Measuring Cognitive Workload Using Multimodal Sensors. in \u003cem\u003e2022 44th Annual International Conference of the IEEE Engineering in Medicine \u0026amp; Biology Society (EMBC)\u003c/em\u003e 4921\u0026ndash;4924 (2022). doi:10.1109/EMBC48229.2022.9871308.\u003c/li\u003e\n\u003cli\u003eTao, X. \u003cem\u003eet al.\u003c/em\u003e A multimodal physiological dataset for driving behaviour analysis. \u003cem\u003eSci. Data \u003c/em\u003e\u003cstrong\u003e11\u003c/strong\u003e, 378 (2024).\u003c/li\u003e\n\u003cli\u003eLi, Q., Luximon, Y., Zhang, J. \u0026amp; Song, Y. Measuring and classifying students\u0026rsquo; cognitive load in pen‐based mobile learning using handwriting, touch gestural and eye‐tracking data. \u003cem\u003eBr. J. Educ. Technol. \u003c/em\u003e\u003cstrong\u003e55\u003c/strong\u003e, 625\u0026ndash;653 (2024).\u003c/li\u003e\n\u003cli\u003eDelliaux, S., Delaforge, A., Deharo, J.-C. \u0026amp; Chaumet, G. Mental Workload Alters Heart Rate Variability, Lowering Non-linear Dynamics. \u003cem\u003eFront. Physiol. \u003c/em\u003e\u003cstrong\u003e10\u003c/strong\u003e, (2019).\u003c/li\u003e\n\u003cli\u003eVuksanović, V. \u0026amp; Gal, V. Heart rate variability in mental stress aloud. \u003cem\u003eMed. Eng. Phys. 
\u003c/em\u003e\u003cstrong\u003e29\u003c/strong\u003e, 344\u0026ndash;349 (2007).\u003c/li\u003e\n\u003cli\u003eHjortskov, N. \u003cem\u003eet al.\u003c/em\u003e The effect of mental stress on heart rate variability and blood pressure during computer work. \u003cem\u003eEur. J. Appl. Physiol. \u003c/em\u003e\u003cstrong\u003e92\u003c/strong\u003e, 84\u0026ndash;89 (2004).\u003c/li\u003e\n\u003cli\u003eDe Rivecourt, M., Kuperus, M. N., Post, W. J. \u0026amp; Mulder, L. J. M. Cardiovascular and eye activity measures as indices for momentary changes in mental effort during simulated flight. \u003cem\u003eErgonomics \u003c/em\u003e\u003cstrong\u003e51\u003c/strong\u003e, 1295\u0026ndash;1319 (2008).\u003c/li\u003e\n\u003cli\u003eMallick, R., Slayback, D., Touryan, J., Ries, A. J. \u0026amp; Lance, B. J. The use of eye metrics to index cognitive workload in video games. in \u003cem\u003e2016 IEEE Second Workshop on Eye Tracking and Visualization (ETVIS)\u003c/em\u003e 60\u0026ndash;64 (2016). doi:10.1109/ETVIS.2016.7851168.\u003c/li\u003e\n\u003cli\u003eLubetzky, A. V., Coker, E., Arie, L., Aharoni, M. M. H. \u0026amp; Krasovsky, T. Postural Control under Cognitive Load: Evidence of Increased Automaticity Revealed by Center-of-Pressure and Head Kinematics. \u003cem\u003eJ. Mot. Behav. \u003c/em\u003e\u003cstrong\u003e54\u003c/strong\u003e, 466\u0026ndash;479 (2022).\u003c/li\u003e\n\u003cli\u003eMarquart, G., Cabrall, C. \u0026amp; de Winter, J. Review of Eye-related Measures of Drivers\u0026rsquo; Mental Workload. \u003cem\u003eProcedia Manuf. \u003c/em\u003e\u003cstrong\u003e3\u003c/strong\u003e, 2854\u0026ndash;2861 (2015).\u003c/li\u003e\n\u003cli\u003eLiu, Y. \u003cem\u003eet al.\u003c/em\u003e Cognitive Load Prediction From Multimodal Physiological Signals Using Multiview Learning. \u003cem\u003eIEEE J. Biomed. Health Inform. \u003c/em\u003e\u003cstrong\u003e29\u003c/strong\u003e, 3282\u0026ndash;3292 (2025).\u003c/li\u003e\n\u003cli\u003eLobo, J. L. 
\u003cem\u003eet al.\u003c/em\u003e Cognitive workload classification using eye-tracking and EEG data. in \u003cem\u003eProceedings of the International Conference on Human-Computer Interaction in Aerospace\u003c/em\u003e 1\u0026ndash;8 (ACM, Paris France, 2016). doi:10.1145/2950112.2964585.\u003c/li\u003e\n\u003cli\u003eHart, S. G. \u0026amp; Staveland, L. E. Development of NASA-TLX (Task Load Index): Results of Empirical and Theoretical Research. in \u003cem\u003eAdvances in Psychology\u003c/em\u003e (eds. Hancock, P. A. \u0026amp; Meshkati, N.) vol. 52 139\u0026ndash;183 (North-Holland, 1988).\u003c/li\u003e\n\u003cli\u003eGommers, R. \u003cem\u003eet al.\u003c/em\u003e scipy/scipy: SciPy 1.9.0. \u003cem\u003eZenodo\u003c/em\u003e (2022) doi:10.5281/zenodo.6940349.\u003c/li\u003e\n\u003cli\u003eMakowski, D. \u003cem\u003eet al.\u003c/em\u003e NeuroKit2: A Python toolbox for neurophysiological signal processing. \u003cem\u003eBehav. Res. Methods \u003c/em\u003e\u003cstrong\u003e53\u003c/strong\u003e, 1689\u0026ndash;1696 (2021).\u003c/li\u003e\n\u003cli\u003eVolkmann, F. C., Riggs, L. A. \u0026amp; Moore, R. K. Eyeblinks and Visual Suppression. \u003cem\u003eScience \u003c/em\u003e\u003cstrong\u003e207\u003c/strong\u003e, 900\u0026ndash;902 (1980).\u003c/li\u003e\n\u003cli\u003eDalmaijer, E. S., Math\u0026ocirc;t, S. \u0026amp; Van der Stigchel, S. PyGaze: An open-source, cross-platform toolbox for minimal-effort programming of eyetracking experiments. \u003cem\u003eBehav. Res. Methods \u003c/em\u003e\u003cstrong\u003e46\u003c/strong\u003e, 913\u0026ndash;921 (2014).\u003c/li\u003e\n\u003cli\u003eSalvucci, D. D. \u0026amp; Goldberg, J. H. Identifying fixations and saccades in eye-tracking protocols. in \u003cem\u003eProceedings of the symposium on Eye tracking research \u0026amp; applications - ETRA \u0026rsquo;00\u003c/em\u003e 71\u0026ndash;78 (ACM Press, Palm Beach Gardens, Florida, United States, 2000). 
doi:10.1145/355017.355028.\u003c/li\u003e\n\u003cli\u003eChawla, N. V., Bowyer, K. W., Hall, L. O. \u0026amp; Kegelmeyer, W. P. SMOTE: Synthetic Minority Over-sampling Technique. \u003cem\u003eJ. Artif. Intell. Res. \u003c/em\u003e\u003cstrong\u003e16\u003c/strong\u003e, 321\u0026ndash;357 (2002).\u003c/li\u003e\n\u003cli\u003eChen, T. \u0026amp; Guestrin, C. XGBoost: A Scalable Tree Boosting System. in \u003cem\u003eProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining\u003c/em\u003e 785\u0026ndash;794 (ACM, San Francisco California USA, 2016). doi:10.1145/2939672.2939785.\u003c/li\u003e\n\u003c/ol\u003e"}]

