Scaled Custom Attention for Enhanced Temporal Dependency Modeling in EEG Classification

doi:10.22541/au.174885856.64797320/v1

2025 · doi:10.22541/au.174885856.64797320/v1

preprint OA: closed

Abstract

Accurate classification of electroencephalography (EEG) signals is essential for diagnosing brain disorders such as epilepsy. Although deep learning (DL) models such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks have improved EEG classification performance over traditional methods, existing attention mechanisms such as Additive, Luong, and Multihead attention struggle to capture the complex temporal dependencies of EEG. This study proposes Scaled Custom Attention (SCA), a mechanism for temporal dependency modeling in EEG classification. Unlike traditional Query-Key-Value (QKV) approaches, which rely on semantic weighting schemes, SCA employs a direct feature-weighting strategy that adapts to the temporal dependencies of EEG signals and introduces a scaling strategy that enhances stability. To validate the approach, experiments were conducted on the TUH EEG Epilepsy Corpus (TUEP), where SCA achieved improved classification performance (Accuracy: 98.07%, F1-Score: 98.06%), marginally higher than Additive (97.60%, 97.61%), Multihead (97.66%, 97.66%), and Luong (97.68%, 97.66%) attention when integrated into the LConvNet EEG classification model. Additionally, SCA achieves a balanced performance profile, with a competitive inference time (2.83 s vs. 1.32–3.89 s for the baselines), parameter efficiency (58.5 parameters/sample vs. 58.5–63.7), and comparable generalization, with an average training-validation gap (Avg) of 0.0191, making it a promising enhancement for EEG-based deep learning models.

1 Introduction

EEG classification is essential for brain-computer interfaces (BCIs), disorder detection, and brain state monitoring.
Traditional machine learning approaches such as Support Vector Machines (SVMs), k-Nearest Neighbors (k-NN), and Random Forests rely on hand-crafted features such as power spectral density, wavelet coefficients, and statistical measures, which require extensive preprocessing and are prone to noise and inter-subject variability (Craik, He, & Contreras-Vidal, 2019); (Lotte, et al., 2018); (Roy, et al., 2019). The advancement of deep learning (DL) models, particularly convolutional neural networks (CNNs) and long short-term memory (LSTM) networks, significantly improved EEG classification through their ability to extract spatial and temporal representations directly from EEG signals (Zhang, Zhang, & Wang, 2023). This eliminated the reliance on labor-intensive feature extraction, offering an efficient approach to EEG signal classification (Cai, Zhang, Zhu, & Li, 2025); (Omar, Kimwele, Olowolayemo, & Kaburu, 2024); (Bagchi & Bathula, 2021); (Wang, Huang, Xiao, Cai, & Tai, 2024). Comprehensive studies detail these advances while highlighting the importance of robust feature extraction methods to address EEG variability (Abibullaev, Keutayeva, & Zollanvari, 2023); (Deng, Li, Li, Guo, & Xu, 2024); (Khushiyant, Mathur, Kumar, & Shokeen, 2024). Despite progress with DL techniques in EEG analysis, several structural challenges persist, including the absence of EEG-tailored DL techniques, inter-subject variability, high dimensionality, and the low signal-to-noise ratio inherent in EEG signals (Lotey, Keserwani, Dogra, & Roy, 2023); (Govil, Yao, & Borao, 2024).
Consequently, attention mechanisms have been increasingly integrated into DL models to address these challenges. They enable models to focus on the most relevant spatial and temporal features, including timesteps, frequency bands, spatial regions, and spectral characteristics, thereby improving interpretability and classification performance (Xin, Hu, Liu, Zhao, & Zhang, 2022); (Miao, Zhang, Zhao, & Ming, 2023); (Kuang & Michoski, 2022); (Eldele, et al., 2021). Traditional attention mechanisms such as Additive and Multihead Attention, originally developed for Natural Language Processing (NLP), have been adapted to computer vision and EEG analysis to improve feature selection and classification performance by selectively focusing on relevant spatio-temporal features (Vaswani, 2017); (Tang, Ma, Xiao, Wu, & Zeng, 2025); (Luong, 2015); (Bahdanau, 2014). In particular, studies have shown that attention-based architectures outperform traditional deep learning models in tasks such as seizure detection, motor imagery classification, and emotion recognition (Deng, Li, Li, Guo, & Xu, 2024); (Zarean, Tajally, Tavakkoli-Moghaddam, & Kia, 2025); (Zhong, Wu, Yin, & Liu, 2024); (Cisotto, et al., 2020). Recent advances in frequency-specific attention mechanisms have enhanced spectral EEG analysis by assigning specific attention weights to different EEG bands, improving feature selection (Yang, et al., 2024). Furthermore, hybrid approaches combining attention mechanisms with graph neural networks (GNNs) or Transformers have demonstrated significant improvements in EEG classification by modeling long-range dependencies more effectively (Abibullaev, Keutayeva, & Zollanvari, 2023); (Shi, et al., 2023); (Gao, Jia, Zhou, & Du, 2023); (Das & Menon, 2024).
Despite advancements in attention mechanisms for EEG analysis, several limitations persist. The majority of attention mechanisms were originally designed for NLP tasks and rely on fixed tokens within a sequence for prediction, whereas EEG signals lack a predefined sequence structure, which makes it challenging to capture features effectively (Kuang & Michoski, 2022); (Zhong, Wu, Yin, & Liu, 2024). Additionally, attention mechanisms such as Additive, Luong, and Multihead compute semantic context through special vectors commonly known as Query (Q), Key (K), and Value (V). Because EEG sequences lack semantic context, these operations lead to inefficiencies and unnecessary computational overhead. Moreover, existing attention mechanisms often fail to leverage distinct but crucial EEG rhythms for classification because they lack frequency-specific operations. Furthermore, some prioritize temporal relationships while neglecting spatial and spectral information, limiting their effectiveness in EEG analysis (Zhang, Zhang, & Wang, 2023).

This study addresses these challenges by introducing Scaled Custom Attention (SCA), a novel attention mechanism tailored for temporal modeling in EEG classification. Unlike traditional QKV-based attention methods, SCA applies direct feature weighting to EEG timesteps with an optimized scaling strategy for enhanced performance, efficiency, and stability. By dynamically prioritizing relevant temporal and spectral EEG features, SCA improves classification accuracy. Comprehensive evaluations conducted on the TUH EEG Epilepsy Corpus demonstrate superior performance over existing attention mechanisms in terms of accuracy, parameter efficiency, and scalability. Though customized for EEG analysis, SCA is inspired by existing attention mechanisms (Vaswani, 2017); (Luong, 2015); (Bahdanau, 2014).
2 Methodology

2.1 Dataset and Preprocessing

This study employs the Temple University Hospital EEG Epilepsy Corpus (TUEP v2.0.0), a publicly available dataset collected in accordance with the Declaration of Helsinki and the HIPAA Privacy Rule (Obeid & Picone, 2016). Each recording, stored in EDF format, includes at least 25 scalp EEG channels with variable durations and sampling rates. A balanced subset of 49 sessions per class was selected based on clinical annotations.

Table 1: EEG Preprocessing Steps (Algorithm)

INPUT: Raw EEG data
OUTPUT: Transformed EEG data (epochs)
1: Load EEG data from the EDF file at filepath with all samples preloaded
2: Create EEG channel information:
   • 25 EEG channels
   • Sampling frequency = 128 Hz
3: Set EEG reference to the average of all channels
4: Resample EEG data to 128 Hz
5: Apply bandpass filter between 1 Hz and 45 Hz
6: Crop EEG data between 1 s and 200 s
7: Segment EEG into 2-second epochs with 1-second overlap
8: Convert epochs into a numerical array
9: Apply PCA for dimensionality reduction with 25 components
10: return transformed EEG data (epochs), shape: [num epochs, 25, 256]

Preprocessing was performed in a Paperspace Gradient cloud environment using MNE-Python and TensorFlow (Keras), following the algorithm in Table 1. Raw signals were average-referenced, bandpass filtered (1–45 Hz), and resampled to 128 Hz. Recordings were then cropped to 1–200 seconds and segmented into overlapping 2-second epochs (1 s overlap), yielding 9,702 epochs per class (198 epochs from each 199-second recording × 49 sessions). Dimensionality was reduced using PCA, resulting in standardized inputs of shape (25 channels × 256 time points) per epoch for classification.

2.2 EEG Classification Model: LConvNet

LConvNet (Omar, Kimwele, Olowolayemo, & Kaburu, 2024) was selected for its ability to extract spatial and temporal EEG features using CNN and LSTM layers, enhanced by attention mechanisms.
The model takes preprocessed EEG epochs as input \(X \in \mathbb{R}^{T \times d}\), where \(T\) is the number of EEG timesteps and \(d\) the number of channels. Spatial features are extracted using three convolutional blocks (\(3 \times 3\), \(5 \times 5\), \(7 \times 7\)) with ReLU activation, max-pooling, and dropout. A Time-Distributed layer (TDL) processes timesteps independently before the LSTM captures temporal dependencies. A GlobalAveragePooling1D layer (skip connection) aggregates temporal information, which is concatenated with the LSTM output, followed by binary classification. The model is trained with the Adam optimizer at a learning rate of \(1.0 \times 10^{-5}\). Details of the LConvNet operation are shown in Table 2.

2.3 Proposed Scaled Custom Attention (SCA)

SCA enhances LConvNet by dynamically weighting EEG timesteps while stabilizing gradients through custom scaling. Various scaling factors (\(\sqrt{d},\ d,\ d^{2},\ d^{3},\ d^{4}\), where \(d\) is the number of EEG channels) were tested, with \(d^{2}\) yielding the best performance. Weights are computed by applying an exponential function (\(\exp\)) to the scaled, tanh-activated scores and then normalizing across timesteps (Equation 1). Here, \(x_{i}\) represents the input at the \(i^{\text{th}}\) timestep, with \(W\) and \(b\) as weight and bias terms. SoftMax normalization prioritizes relevant EEG features while suppressing less important ones.

\begin{equation}
o = \sum_{i=1}^{T} \left( \frac{\exp\left(\frac{\tanh\left(x_{i}W+b\right)}{d^{2}}\right)}{\sum_{j=1}^{T}\exp\left(\frac{\tanh\left(x_{j}W+b\right)}{d^{2}}\right)} \cdot x_{i} \right) \tag{1}
\end{equation}

Since EEG lacks semantic context, SCA assigns weights to important temporal features directly from the time-distributed layer (direct feature weighting).
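As a minimal sketch of Equation 1, the pure-Python function below computes the SCA weights and the weighted sum for a single epoch. The text does not give the shapes of W and b, so the sketch assumes W is a vector that maps each timestep's feature vector to one scalar score and b is a scalar. The function name `scaled_custom_attention` and the toy inputs are likewise illustrative assumptions, not the authors' implementation.

```python
import math


def scaled_custom_attention(x, w, b, d):
    """Sketch of Equation 1 for one epoch.

    x: list of T timestep feature vectors (each a list of floats)
    w: weight vector, same length as each timestep vector (assumed shape)
    b: scalar bias (assumed shape)
    d: number of EEG channels; scores are divided by d**2
    Returns (o, weights): the weighted sum over timesteps and the weights.
    """
    scale = d ** 2
    # Scaled, tanh-activated score for each timestep.
    scores = [
        math.tanh(sum(xi * wi for xi, wi in zip(x_t, w)) + b) / scale
        for x_t in x
    ]
    # Softmax across timesteps; subtracting the max leaves the result
    # unchanged and avoids overflow in exp.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # o = sum_i weight_i * x_i, computed feature by feature.
    n_features = len(x[0])
    o = [sum(a * x_t[k] for a, x_t in zip(weights, x)) for k in range(n_features)]
    return o, weights


# Toy input: T = 3 timesteps, 2 features each (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
o, weights = scaled_custom_attention(x, w=[0.5, -0.5], b=0.0, d=25)
```

With d = 25 the tanh scores are divided by 625, so in this toy case the three weights stay within 0.001 of the uniform value 1/3 while preserving the ordering of the underlying scores.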
2.4 Baseline Attention Mechanisms

2.4.1 Additive Attention

In this study, Additive Attention was integrated into the LConvNet model to enhance temporal feature selection by dynamically weighting different timesteps of EEG data. Following CNN-based spatial feature extraction, TimeDistributed Dense layers projected EEG features into a lower-dimensional space, analogous to encoder hidden states in RNN-based NLP techniques. These features are fed to the Additive Attention block, which computes alignment scores as shown in Equation 2, where \(e_{i}\) is the alignment score for the \(i^{\text{th}}\) timestep; \(v^{T}\) is a learnable vector that projects the output to a scalar score; \(Q\) represents the current EEG timestep, similar to the current decoder hidden state in NLP; \(K_{r}\) denotes the reference EEG timestep features (analogous to encoder hidden states in NLP); \(W_{Q}\) and \(W_{K}\) are learnable weight matrices for \(Q\) and \(K\), respectively; \(b\) is the bias term; and \(\alpha_{i}\) is the normalized attention weight.

\begin{equation}
\begin{aligned}
e_{i} &= v^{T}\tanh\left(W_{Q}Q_{i}+W_{K}K_{r}+b\right) \\
\alpha_{i} &= \frac{\exp\left(e_{i}\right)}{\sum_{j}\exp\left(e_{j}\right)} \\
c_{t} &= \sum_{i}\alpha_{i}K
\end{aligned}
\tag{2}
\end{equation}

The context vector \(c_{t}\) is passed to LConvNet's LSTM and subsequent layers, which capture temporal dependencies and perform the final classification.
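The additive scoring of Equation 2 can be illustrated with a small pure-Python sketch. It follows the standard Bahdanau form, in which one query vector is compared against each key vector in turn. Treating the keys as the per-timestep features and weighting each key by its own attention weight are assumptions made for illustration, as are all function names and toy values.

```python
import math


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(query, keys, w_q, w_k, bias, v):
    """Return (context, alphas) for one query against T key vectors."""
    q_proj = matvec(w_q, query)
    scores = []
    for key in keys:
        k_proj = matvec(w_k, key)
        hidden = [math.tanh(q + k + b) for q, k, b in zip(q_proj, k_proj, bias)]
        # Alignment score: v^T tanh(W_Q Q + W_K K_i + b)
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    alphas = softmax(scores)
    # Context vector: attention-weighted sum of the keys.
    context = [
        sum(a * key[j] for a, key in zip(alphas, keys))
        for j in range(len(keys[0]))
    ]
    return context, alphas


# Toy example: 3 timesteps with 2 features, identity projections (hypothetical).
identity = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
context, alphas = additive_attention(
    query=[0.0, 0.0], keys=keys, w_q=identity, w_k=identity,
    bias=[0.0, 0.0], v=[1.0, 1.0],
)
```

In this toy case the first two timesteps receive equal weight and the third receives the least, because its tanh-activated projection is negative in both components.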
2.4.2 Luong Attention

The Luong attention mechanism, shown in Equation 3, scores attention between EEG timesteps based on relevance, where \(e_{i}\) is the alignment score; \(Q\) represents the current time focus, similar to the current decoder hidden state in NLP; \(K\) represents all timestep features used for comparison (encoder hidden states); \(W_{a}\) is a learned weight matrix that models relationships between \(Q\) and \(K\); and \(\alpha_{i}\) is the normalized attention score. The context vector \(c_{t}\) preserves information by applying \(\alpha_{i}\) to \(K\) while refining the selection of attention-based features.

\begin{equation}
\begin{aligned}
e_{i} &= Q^{T}W_{a}K \\
\alpha_{i} &= \frac{\exp\left(e_{i}\right)}{\sum_{j}\exp\left(e_{j}\right)} \\
c_{t} &= \sum_{i}\alpha_{i}K
\end{aligned}
\tag{3}
\end{equation}

2.4.3 Multihead Attention (MHA)

Unlike Additive Attention, which focuses on a single attention context, MHA improves LConvNet performance by computing multiple parallel attention scores in order to capture diverse temporal patterns. In this study, we used 4 attention heads with \(\dim_{k} = 64\); increasing the number of heads did not improve performance. As explained in Section 2.4.1, transformed EEG features from the TimeDistributed Dense layers (TDL) are passed to MHA, which computes attention scores through a QKV-based approach as shown in Equation 4, where \(Q, K, V\) are the query, key, and value matrices derived from EEG feature representations, \(h\) is the transformed EEG feature maps, \(W_{O}\) projects the concatenated outputs back to the feature space, and \(\dim_{k}\) is the dimension used for scaling dot-product attention.
\begin{equation}
\begin{aligned}
Q &= hW_{Q}, \quad K = hW_{K}, \quad V = hW_{V} \\
\text{Attention}\left(Q,K,V\right) &= \text{SoftMax}\left(\frac{QK^{T}}{\sqrt{\dim_{k}}}\right)V \\
\text{MultiHead}\left(Q,K,V\right) &= \text{Concat}\left(\text{head}_{1},\text{head}_{2},\ldots,\text{head}_{h}\right)W_{O}
\end{aligned}
\tag{4}
\end{equation}

In summary, the attention mechanisms operate as follows:
• QKV-based models (Additive, Luong, Multihead) compute attention scores, derive a context vector, and modify the feature representations before passing them to the LSTM.
• SCA applies direct feature weighting at the raw EEG timestep level, bypassing separate score computation and dynamically highlighting critical temporal dependencies.
• Finally, the context vectors are passed through LSTM layers, which capture temporal dependencies before classification through a Dense output layer.

2.5 Integration of Attention Mechanisms into the LConvNet Model

SCA and the baseline attention mechanisms described in the previous sections were each integrated into the LConvNet model at the temporal feature extraction stage (between the Time-Distributed and LSTM layers, as shown in Table 2). The CNN component first extracts spatial features, after which TimeDistributed layers process them independently across timesteps before passing them to the LSTM.
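For reference, the scaled dot-product step at the core of Equation 4 can be sketched for a single head in pure Python. This is only an illustration under simplifying assumptions: matrices are plain lists of rows, the projection matrices are passed in explicitly, and the function names and toy values are hypothetical rather than taken from the study's code.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    top = max(row)
    exps = [math.exp(s - top) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(h, w_q, w_k, w_v):
    """Single-head attention over features h (T rows); returns (output, weights)."""
    q, k, v = matmul(h, w_q), matmul(h, w_k), matmul(h, w_v)
    dim_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    # QK^T / sqrt(dim_k), then softmax over each row (one row per query timestep).
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(dim_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy example: 2 timesteps, 2 features, identity projections (hypothetical).
identity = [[1.0, 0.0], [0.0, 1.0]]
h = [[1.0, 0.0], [0.0, 1.0]]
output, weights = scaled_dot_product_attention(h, identity, identity, identity)
```

A full multi-head layer would run several such heads with separate projection matrices and concatenate their outputs before applying the output projection.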
Table 2: Algorithm - LConvNet Integration with Attention Mechanisms (Attention Block)

1: x ← Input layer
2: Reshape input to (channels, timesteps, 1) for Conv2D compatibility
3: x ← Conv2D(32, 3×3) → LayerNormalization → MaxPooling2D
4: x ← Conv2D(64, 5×5) → LayerNormalization → MaxPooling2D
5: x ← Conv2D(128, 7×7) → LayerNormalization → MaxPooling2D
6: x ← Dropout
7: x ← TimeDistributed(Flatten) → TimeDistributed(Dense(64, ReLU))
8: a ← AttentionBlock(x), where AttentionBlock is one of:
   • ScaledCustomAttention(x)
   • AdditiveAttention(x)
   • LuongAttention(x)
   • MultiHeadAttention(x)
9: a ← x + a (residual connection)
10: a ← LayerNormalization(a)
11: lstm_out ← LSTM(64, return_sequences=False)(a)
12: pooled_out ← GlobalAveragePooling1D(x)
13: concat ← Concatenate([lstm_out, pooled_out])
14: output ← Dense(1, activation='sigmoid')(concat)
15: Compile: Adam optimizer (learning rate 1e−5), binary cross-entropy loss
16: Return model

3 Results and Discussion

3.1 Classification Performance and Generalization

The classification performance of the attention mechanisms was evaluated using Accuracy, Precision, Recall, and F1-score, computed as in Equations 5–8, where \(TP\), \(FP\), \(TN\), and \(FN\) denote True Positives, False Positives, True Negatives, and False Negatives, respectively.

\begin{equation}
\text{Accuracy} = \frac{TP+TN}{TP+FP+FN+TN} \tag{5}
\end{equation}
\begin{equation}
\text{Precision} = \frac{TP}{TP+FP} \tag{6}
\end{equation}
\begin{equation}
\text{Recall} = \frac{TP}{TP+FN} \tag{7}
\end{equation}
\begin{equation}
\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision}+\text{Recall}} \tag{8}
\end{equation}

SCA achieves the highest accuracy (98.07%), precision (98.59%), and F1-score (98.06%). However, Multihead Attention achieves the highest recall (98.04%).
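Equations 5–8 can be checked with a few lines of Python. In the sketch below, TN = 1913 and FP = 27 are the SCA counts quoted in the confusion-matrix analysis (Fig. 3). TP = 1893 and FN = 48 are not stated in the text. They are inferred here under the assumption that the 3,881 analysed samples form the test set, and should be read as an illustration rather than reported results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# TN and FP are the SCA counts quoted with Fig. 3.
# TP and FN are NOT reported; they are inferred, assuming 3,881 test samples.
sca = classification_metrics(tp=1893, fp=27, tn=1913, fn=48)
percentages = {name: round(100 * value, 2) for name, value in sca.items()}
```

Under those assumed counts, the rounded percentages agree with the SCA values reported in this section (accuracy 98.07, precision 98.59, recall 97.53, F1-score 98.06).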
All models perform within 1% of each other in overall metrics (accuracy: 97.60–98.07%; F1-Score: 97.61–98.06%). Table 3 shows the results across attention mechanisms.

Table 3: Summary of Classification Performance (%)

| Metric | SCA | Luong | Additive | Multihead |
| Accuracy | 98.07 | 97.68 | 97.60 | 97.66 |
| Precision | 98.59 | 98.38 | 97.43 | 97.29 |
| Recall | 97.53 | 96.96 | 97.78 | 98.04 |
| F1-Score | 98.06 | 97.66 | 97.61 | 97.66 |

To evaluate the models' learning behavior, training and validation accuracy and loss curves were examined (Fig. 1). The curves indicate stable convergence, with minor validation fluctuations suggesting slight sensitivity.

Fig. 1 Training-Validation Accuracy and Loss Curves

Generalization was evaluated using the Training-Validation Gap (\(\text{Av}_{g}\)) defined in Equation 9, where \(T_{i}\) and \(V_{i}\) represent training and validation performance at epoch \(i\) and \(N\) is the number of epochs. A smaller \(\text{Av}_{g}\) indicates better generalization.

\begin{equation}
\text{Av}_{g} = \frac{1}{N}\sum_{i=1}^{N}\left(T_{i}-V_{i}\right) \tag{9}
\end{equation}

Generalization analysis reveals that Additive Attention has the smallest \(\text{Av}_{g}\) (0.0172), while SCA shows comparable generalization (\(\text{Av}_{g}\) = 0.0191). All models fall within a narrow \(\text{Av}_{g}\) range (Δ < 0.002). Fig. 2 shows the results.

Fig. 2 Epoch-Wise Training-Validation Gap

Confusion matrix analysis (Fig. 3) shows that SCA has the highest True Negatives (TN = 1913) and lowest False Positives (FP = 27), indicating strong specificity. Additive and MHA exhibit slightly higher FP (50 and 53, respectively) than SCA and Luong (27 and 31, respectively). However, Additive achieves fewer False Negatives (FN = 43) than Luong (FN = 59), suggesting a trade-off between sensitivity and specificity across attention mechanisms.

Fig. 3 Comparison of classification performance: Confusion Matrices

Model scalability was assessed using trainable parameters, training time, and inference time (Table 4).
In this context, parameter efficiency evaluates the amount of computational resources required per training sample, as defined in Equation 10, where \(P_{\text{sample}}\) represents the parameters per sample, \(P_{\text{train}}\) denotes the number of trainable parameters, \(b\) is the training batch size, and \(e\) refers to the number of training epochs.

\begin{equation}
P_{\text{sample}} = \frac{P_{\text{train}}}{b \times e} \tag{10}
\end{equation}

SCA and Additive Attention tie for the best parameter efficiency (58.5), while MHA requires the most trainable parameters (815K). Additive Attention exhibits the fastest inference time (1.32 s), while Luong Attention has the slowest (3.89 s).

Table 4: Computational Performance of Attention Mechanisms (K = '000)

| Metric | SCA | Luong | Additive | Multihead |
| \(P_{\text{train}}\) | 749K | 770K | 749K | 815K |
| Training Time (s) | 42.2 | 43.6 | 41.6 | 42.3 |
| Inference Time (s) | 2.83 | 3.89 | 1.32 | 1.99 |
| \(P_{\text{sample}}\) | 58.5 | 60.1 | 58.5 | 63.7 |

3.2 Analyzing the Temporal Dynamics of SCA in EEG Classification

Attention weight distributions were visualized across EEG timesteps using heatmaps (Fig. 4) of randomly selected epileptic and non-epileptic EEG samples.

Fig. 4 Attention weight heatmaps for four representative EEG samples.

Statistical analysis of all 3,881 samples was performed using the mean (\(\mu\)), standard deviation (\(\sigma\)), and 95% confidence interval (CI) defined in Equations 11–13. The attention weight array has a shape of (1, 64), where 64 denotes the number of attention channels.
\begin{equation}
\mu_{i} = \frac{1}{N}\sum_{j=1}^{N}A_{j,i} \tag{11}
\end{equation}
\begin{equation}
\sigma_{i} = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left(A_{j,i}-\mu_{i}\right)^{2}} \tag{12}
\end{equation}
\begin{equation}
\left[\text{CI}_{l},\ \text{CI}_{u}\right] = \left[\mu_{i}-Z\frac{\sigma_{i}}{\sqrt{N}},\ \mu_{i}+Z\frac{\sigma_{i}}{\sqrt{N}}\right] \tag{13}
\end{equation}

Here \(A_{j,i}\) represents the attention weight at sample \(j\) and channel \(i\), \(N\) is the number of samples, and \(Z\) is the critical value for the 95% confidence interval. Statistical analysis reveals a bimodal distribution: most channels (34/64, 53.1%) show consistently low weights (\(\mu\) = 0.21 ± 0.14), while 3 channels (4.7%) demonstrate high activation (\(\mu\) = 2.29 ± 0.33). The remaining 26 channels (40.6%) exhibit intermediate weights (\(\mu\) = 1.14 ± 0.11 to 1.71 ± 0.14) with narrow 95% CIs (± 0.02–0.04). Table 5 details the complete statistics.

Table 5: Statistical Summary of Attention Weights per Channel

| Weight range | Channels | \(\mu\) | \(\sigma\) | CI lower | CI upper |
| 0.0–0.5 | 34 | 0.21 | 0.14 | 0.20 | 0.22 |
| 0.5–1.0 | 9 | 0.72 | 0.16 | 0.71 | 0.74 |
| 1.0–1.5 | 12 | 1.14 | 0.11 | 1.12 | 1.16 |
| 1.5–2.0 | 5 | 1.71 | 0.14 | 1.68 | 1.73 |
| 2.0+ | 3 | 2.29 | 0.33 | 2.26 | 2.31 |

We further characterized SCA attention patterns using entropy, sparsity, and the coefficient of variation (CV). Entropy quantifies randomness in the attention weight distribution (Equation 14), where \(E\left(X\right)\) is the entropy of the attention distribution; \(n\) is the total number of EEG channels; \(i\) is the timestep index; and \(p_{i}\) is the normalized attention weight for the \(i^{\text{th}}\) element. Higher entropy indicates a broader attention distribution, while lower entropy suggests focused attention on fewer channels.
\begin{equation}
E\left(X\right) = -\sum_{i=1}^{n}p_{i}\log p_{i} \tag{14}
\end{equation}

Fig. 5 Entropy Distribution of Attention Weights.

Fig. 5 shows that 57/64 channels (89.1%) exhibit high entropy (6.00–7.99) and 5 channels (7.8%) show moderate entropy (5.00–5.99). Only 1 channel had low entropy (< 1). Table 6 displays the summary.

Table 6: SCA entropy distribution across channels

| Entropy range | Channels |
| 0.0–0.99 | 1 |
| 5.00–5.99 | 5 |
| 6.0–6.99 | 12 |
| 7.0–7.99 | 45 |
| Total | 64 |

Sparsity (Equation 15) evaluates the focus of attention, while the mean measures the level of allocation. In Equation 15, \(a_{i}\) represents the raw attention weights. The sparsity-mean correlation (Fig. 6) indicates generally distributed attention weights.

\begin{equation}
\text{Sparsity} = \frac{\sum_{i=1}^{n}\left|a_{i}\right|}{\max\left(a_{i}\right) \times n} \tag{15}
\end{equation}

Fig. 6 Sparsity vs Mean Attention.

The Coefficient of Variation (CV) measures the variability in attention weights, as given by Equation 16. We set a threshold at the 75th percentile (1.1969), indicated by the dotted red line in Fig. 7.

\begin{equation}
CV = \frac{\sigma}{\mu} \tag{16}
\end{equation}

Most attention channels (48/64, 75%) show low variability (CV < 1.197); 14 channels (21.9%) have moderate variability (CV between 1.197 and 11). This distribution, as depicted in Fig. 7, suggests a fairly uniform attention distribution. For additional insight, we measured the correlation (\(r\)) between the metrics used.

Fig. 7 Distribution of Coefficient of Variation.

Correlation analysis (Fig. 8) indicates a strong entropy-sparsity relationship (\(r\) = 0.62), relatively stronger correlations of the statistical metrics (\(\sigma,\ \mu,\ CI\)) with the other metrics except CV, and weak CV correlations (\(\left|r\right|\) < 0.1) with the other metrics.

Fig. 8 Correlation Between Metrics.

3.3 Ablation Study

An ablation study was conducted to assess the impact of removing key SCA components (scaling, bias, attention, skip connection) (Table 7).
Table 7: Performance impact of removing SCA components

| Metric | No Bias | No Attention | No Skip Connection | No Scaling | Full SCA |
| Accuracy (%) | 97.34 | 94.22 | 97.19 | 97.70 | 98.07 |
| F1-Score (%) | 96.00 | 95.00 | 98.00 | 98.00 | 98.00 |
| \(\text{Av}_{g}\) | 0.0259 | -0.0054 | 0.0306 | 0.012 | 0.0191 |
| Train Time (s) | 43.53 | 31.42 | 41.37 | 43.06 | 42.17 |
| Inference Time (s) | 4.60 | 2.214 | 6.18 | 4.61 | 2.83 |

The following observations were made:
• Attention mechanism: Removal caused the largest accuracy reduction (−3.85% vs. full SCA), suggesting its importance for feature extraction. The negative \(\text{Av}_{g}\) (−0.0054) may indicate underfitting.
• Skip connections: Their absence increased inference time by 118% (6.18 s vs. 2.83 s) while marginally reducing accuracy (−0.88%), highlighting their role in computational efficiency.
• Scaling: Showed minimal impact on accuracy (+0.37%).
• Bias: Removal resulted in decreased accuracy (−0.73%) and F1-Score (−2.00%), with a 62.5% longer inference time (4.60 s vs. 2.83 s for SCA). The increased \(\text{Av}_{g}\) (0.0259 vs. 0.0191) suggests that the bias terms add a minimal generalization benefit.
• Full SCA: Achieved the best balance, with the highest accuracy (98.07%) and a competitive inference time (2.83 s, second only to No Attention).

4 Challenges and Limitations

The SCA mechanism shows promise for epilepsy detection but may not generalize to other EEG applications without adaptation. Because this study focused solely on attention module design rather than full-system integration, SCA's inference time (2.83 s) may hinder real-time clinical use. Preprocessing was customized for the TUEP dataset and overlooked real-life challenges pertaining to EEG machine variations, electrode placements, and class imbalance. The computational demands of SCA, though improved over the benchmark mechanisms, may still hinder implementation on resource-constrained edge devices such as wearable EEG monitors.
Additionally, the clinical interpretability of the attention patterns and comparisons with newer architectures such as Transformer networks remain unverified.

5 Conclusion

This study introduced the SCA mechanism for EEG classification, designed to improve stability through scaling and enhance temporal dependency modeling via direct feature weighting. Evaluations on the TUH EEG Epilepsy Corpus (TUEP v2.0.0) show that SCA achieves competitive performance compared to the Luong, Additive, and Multihead benchmark attention mechanisms, with an accuracy of 98.07% (vs. 97.60–97.68% for the benchmark attention mechanisms) and balanced metrics (precision = 98.59%, recall = 97.53%, and F1-score = 98.06%). Analysis of learning curves and training-validation gaps indicates comparable generalization across mechanisms (e.g., SCA's gap: 0.0191 vs. Additive's 0.0172). Further, SCA demonstrates favorable stability in weight allocation, as supported by the entropy, coefficient of variation, and sparsity metrics. While SCA demonstrates promising results in parameter efficiency (58.5 parameters/sample) and scalability, its advantages over the Additive and Luong attention mechanisms are marginal (≤ 2 parameters/sample difference) and moderate over MHA (63.7 parameters/sample). Future work could explore hybrid architectures for further optimization of temporal modeling, dynamic weight allocation for diverse EEG applications, and independent validation on diverse datasets to confirm robustness.

References

Abdelhady, G. (2025). Enhancing Neuroprosthetic Control Using CNN-LSTM Models: A Simulation Study with EEG-Based Motor Imagery. Artificial Intelligence Information Security, 3, 17–35.
Abibullaev, B., Keutayeva, A., & Zollanvari, A. (2023). Deep learning in EEG-based BCIs: a comprehensive review of transformer models, advantages, challenges, and applications. IEEE Access.
Bagchi, S., & Bathula, D. R. (2021). EEG-ConvTransformer for Single-Trial EEG based Visual Stimuli Classification. arXiv preprint arXiv:2107.03983. Retrieved from https://arxiv.org/abs/2107.03983
Bahdanau, D. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Cai, S., Zhang, R., Zhu, H., & Li, H. (2025). Modeling the Temporal Dynamics of EEG Signals in Selective Listening. IEEE Transactions on Consumer Electronics.
Cisotto, G., Zanga, A., Chlebus, J., Zoppis, I., Manzoni, S., & Markowska-Kaczmar, U. (2020). Comparison of attention-based deep learning models for EEG classification. arXiv preprint arXiv:2012.01074.
Craik, A., He, Y., & Contreras-Vidal, J. L. (2019). Deep learning for electroencephalogram (EEG) classification tasks: a review. Journal of Neural Engineering, 16, 031001.
Das, A., & Menon, V. (2024). Frequency-specific directed connectivity between the hippocampus and parietal cortex during verbal and spatial episodic memory: an intracranial EEG replication. Cerebral Cortex, 34, bhae287.
Deng, H., Li, M., Li, J., Guo, M., & Xu, G. (2024). A robust multi-branch multi-attention-mechanism EEGNet for motor imagery BCI decoding. Journal of Neuroscience Methods, 405, 110108.
Eldele, E., Chen, Z., Liu, C., Wu, M., Kwoh, C.-K., Li, X., & Guan, C. (2021). An attention-based deep learning approach for sleep stage classification with single-channel EEG. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 29, 809–818.
Gao, K., Jia, W., Zhou, Y., & Du, R. (2023). Multi-Head Self-Attention Enhanced Convolutional Neural Network for Driver Fatigue Detection using EEG Signals. 2023 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), (pp. 817–823).
Govil, A., Yao, E., & Borao, C. R. (2024). Multi-Dimensional Framework for EEG Signal Processing and Denoising Through Tensor-based Architecture. arXiv preprint arXiv:2401.05589. Retrieved from https://arxiv.org/abs/2401.05589
Khushiyant, Mathur, V., Kumar, S., & Shokeen, V. (2024). REEGNet: A resource efficient EEGNet for EEG trail classification in healthcare. Intelligent Decision Technologies, 18, 1463–1476.
Kuang, D., & Michoski, C. (2022). KAM – a Kernel Attention Module for Emotion Classification with EEG Data. arXiv preprint arXiv:2208.08161. Retrieved from https://arxiv.org/abs/2208.08161
Lotey, T., Keserwani, P., Dogra, D. P., & Roy, P. P. (2023). Feature Reweighting for EEG-based Motor Imagery Classification. arXiv preprint arXiv:2308.02515. Retrieved from https://arxiv.org/abs/2308.02515
Lotte, F., Bougrain, L., Cichocki, A., Clerc, M., Congedo, M., Rakotomamonjy, A., & Yger, F. (2018). A review of classification algorithms for EEG-based brain–computer interfaces: a 10 year update. Journal of Neural Engineering, 15, 031005.
Luong, M.-T. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
Miao, Z., Zhang, X., Zhao, M., & Ming, D. (2023). LMDA-Net: A lightweight multi-dimensional attention network for general EEG-based brain-computer interface paradigms and interpretability. arXiv preprint arXiv:2303.16407. Retrieved from https://arxiv.org/abs/2303.16407
Obeid, I., & Picone, J. (2016). The Temple University Hospital EEG Data Corpus. Frontiers in Neuroscience, 10, 196. doi:10.3389/fnins.2016.00196
Omar, S. M., Kimwele, M., Olowolayemo, A., & Kaburu, D. M. (2024). Enhancing EEG signals classification using LSTM-CNN architecture. Engineering Reports, 6, e12827.
Roy, Y., Banville, H., Albuquerque, I., Gramfort, A., Falk, T. H., & Faubert, J. (2019). Deep learning-based electroencephalography analysis: a systematic review. Journal of Neural Engineering, 16, 051001.
Shi, X., Li, B., Wang, W., Qin, Y., Wang, H., & Wang, X. (2023). Classification algorithm for electroencephalogram-based motor imagery using hybrid neural network with spatio-temporal convolution and multi-head attention mechanism. Neuroscience, 527, 64–73.
Tang, Y., Ma, Y., Xiao, C., Wu, M., & Zeng, G. (2025). Classification of EEG event-related potentials based on channel attention mechanism. The Journal of Supercomputing, 81, 126.
Vaswani, A. (2017). Attention is all you need. Advances in Neural Information Processing Systems.
Wang, T., Huang, X., Xiao, Z., Cai, W., & Tai, Y. (2024). EEG emotion recognition based on differential entropy feature matrix through 2D-CNN-LSTM network. EURASIP Journal on Advances in Signal Processing, 2024, 49.
Xin, Q., Hu, S., Liu, S., Zhao, L., & Zhang, Y.-D. (2022). An attention-based wavelet convolution neural network for epilepsy EEG classification. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 30, 957–966.
Yang, Y., Wang, Z., Tao, W., Liu, X., Jia, Z., Wang, B., & Wan, F. (2024). Spectral-Spatial Attention Alignment for Multi-Source Domain Adaptation in EEG-Based Emotion Recognition. IEEE Transactions on Affective Computing.
Zarean, J., Tajally, A., Tavakkoli-Moghaddam, R., & Kia, R. (2025). Robust electroencephalogram-based biometric identification against GAN-generated artificial signals using a novel end-to-end attention-based CNN-LSTM neural network. Cluster Computing, 28, 168.
Zhang, Y., Zhang, Y., & Wang, S. (2023). An attention-based hybrid deep learning model for EEG emotion recognition. Signal, Image and Video Processing, 17, 2305–2313.
Zhong, X., Wu, F., Yin, Z., & Liu, G. (2024). An Attention-Enhanced Retentive Broad Learning System for Subject-Generic Emotion Recognition from EEG Signals. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (pp. 2310–2314).

Published: IET Signal Processing, Version of Record, 1 Mar 2026

Copyright: This work is licensed under a Non Exclusive No Reuse License.


Citation: Swaleh Omar, Michael Kimwele, Akeem Olowolayemo, et al. Scaled Custom Attention for Enhanced Temporal Dependency Modeling in EEG Classification. Authorea. 02 June 2025. DOI: https://doi.org/10.22541/au.174885856.64797320/v1

