Hierarchical attention enhanced deep learning achieves high precision motor imagery classification in brain computer interfaces

Machine Learning


Data acquisition and experimental paradigm

Participants and ethical considerations

The experimental protocol was designed in accordance with the Declaration of Helsinki and approved by the institutional review board43,44. A total of \(N = 15\) healthy participants (8 males, 7 females; age: \(\mu = 24.3\) years, \(\sigma = 3.2\) years) were recruited for this study. All participants satisfied the following inclusion criteria:

$$\begin{aligned} \mathscr {I} = \{p \in \mathscr {P} : \text {MMSE}(p) \ge 27 \wedge \text {EHI}(p) > 0.7 \wedge \text {BDI}(p) < 10\} \end{aligned}$$

(17)

where MMSE denotes the Mini-Mental State Examination score, EHI represents the Edinburgh Handedness Inventory laterality quotient45, and BDI indicates the Beck Depression Inventory score. Written informed consent was obtained from all participants prior to the experiment.

EEG recording setup and signal acquisition

Electroencephalographic signals were recorded using a 64-channel active electrode system (actiCAP, Brain Products GmbH) with electrodes positioned according to the extended international 10-20 system46,. The electrode impedances were maintained below 5 k\(\Omega\) throughout the recording session. The continuous EEG signal \(x_c
(18)

where \(I_d
(19)

The Nyquist frequency \(f_N = f_s/2 = 500\) Hz ensures adequate representation of all relevant neurophysiological frequencies. The anti-aliasing filter was implemented as a zero-phase Butterworth filter of order 8:

$$\begin{aligned} H(s) = \frac{1}{\sqrt{1 + \left( \frac{s}{\omega _c}\right) ^{2n}}} \end{aligned}$$

(20)

where \(\omega _c = 2\pi \cdot 450\) rad/s is the cutoff frequency and \(n = 8\) is the filter order.

Motor imagery paradigm design

The experimental paradigm consisted of four distinct motor imagery classes: left hand (LH), right hand (RH), both feet (BF), and tongue (T) movements. The trial structure followed a carefully designed temporal sequence to optimize neural response capture:

$$\begin{aligned} \mathscr {T} = \{t_{\text {fix}}, t_{\text {cue}}, t_{\text {MI}}, t_{\text {rest}}\} \end{aligned}$$

(21)

where: – \(t_{\text {fix}} \in [0, 2]\) s: fixation cross presentation – \(t_{\text {cue}} \in [2, 3]\) s: visual cue presentation – \(t_{\text {MI}} \in [3, 7]\) s: motor imagery execution – \(t_{\text {rest}} \in [7, 9]\) s: inter-trial interval

The probability of each class appearing was balanced according to:

$$\begin{aligned} P(\text {class} = k) = \frac{1}{K} = 0.25, \quad k \in \{\text {LH}, \text {RH}, \text {BF}, \text {T}\} \end{aligned}$$

(22)

Signal preprocessing pipeline

Artifact detection and removal

The preprocessing pipeline begins with automated artifact detection using a multivariate approach48. For each channel c and time window w, we compute the standardized amplitude:

$$\begin{aligned} z_{c,w} = \frac{|x_{c,w} – \mu _{c,w}|}{\sigma _{c,w}} \end{aligned}$$

(23)

where \(\mu _{c,w}\) and \(\sigma _{c,w}\) are the local mean and standard deviation. Channels are marked as artifactual if:

$$\begin{aligned} \mathscr {A}_c = \{c : \max _w(z_{c,w})> \tau _z \vee \text {Kurt}(x_c)> \tau _k \vee \text {PSD}_c(f> 40) > \tau _p\} \end{aligned}$$

(24)

where \(\tau _z = 5\), \(\tau _k = 10\), and \(\tau _p\) are empirically determined thresholds, and Kurt denotes kurtosis.

Algorithm 2
figure b

Attention network training

Independent component analysis for artifact removal

We employ extended Infomax ICA to decompose the multichannel EEG into statistically independent components:

$$\begin{aligned} {X} = {A}{S} \end{aligned}$$

(25)

where \({X} \in 97.2477{R}^{C \times T}\) is the observed EEG data, \({A} \in 97.2477{R}^{C \times C}\) is the mixing matrix, and \({S} \in 97.2477{R}^{C \times T}\) contains the independent components. The unmixing matrix \({W} = {A}^{-1}\) is obtained by maximizing the entropy:

$$\begin{aligned} H({y}) = -E[\log p({y})] = -E\left[ \sum _{i=1}^C \log p_i(y_i)\right] + \log |\det ({W})| \end{aligned}$$

(26)

where \({y} = {W}{x}\) and \(p_i\) is the probability density function of the i-th component.

The weight update rule follows the natural gradient approach:

$$\begin{aligned} \Delta {W} = \eta \left[ {I} – E[{\phi }({y}){y}^T]\right] {W} \end{aligned}$$

(27)

where \(\eta\) is the learning rate and \({\phi }({y})\) is the score function vector. Artifactual components are identified using multiple criteria:

$$\begin{aligned} \text {IC}_{\text {artifact}} = \left\{ i : \text {MI}(s_i, \text {EOG})> \theta _{\text {MI}} \vee \frac{P_{s_i}(f < 4)}{P_{s_i}(f \in [8,13])} > \theta _{\text {ratio}}\right\} \end{aligned}$$

(28)

where MI denotes mutual information with EOG channels, and \(P_{s_i}(f)\) is the power spectral density of component \(s_i\).

Spectral filtering and band-power extraction

For motor imagery analysis, we focus on specific frequency bands known to exhibit task-related modulations. The filter bank approach employs multiple bandpass filters49:

$$\begin{aligned} H_k(z) = \frac{b_0 + b_1z^{-1} + … + b_Mz^{-M}}{1 + a_1z^{-1} + … + a_Nz^{-N}} \end{aligned}$$

(29)

where the coefficients are designed for the following frequency bands: – \(\theta\): [4, 8] Hz – \(\alpha\): [8, 13] Hz – \(\beta _{\text {low}}\): [13, 20] Hz – \(\beta _{\text {high}}\): [20, 30] Hz – \(\gamma _{\text {low}}\): [30, 50] Hz

The instantaneous band power is computed using the Hilbert transform:

$$\begin{aligned} P_k
(30)

where \(\mathscr {H}\) denotes the Hilbert transform operator.

Spatial filtering using optimized common spatial patterns

The traditional CSP algorithm is enhanced through regularization and multi-class extensions. For the two-class case, the optimization problem becomes:

$$\begin{aligned} {w}^* = \arg \max _{{w}} \frac{{w}^T{\varvec{\Sigma }}_1{w}}{{w}^T{\varvec{\Sigma }}_2{w}} \quad \text {subject to} \quad {w}^T{w} = 1 \end{aligned}$$

(31)

This is solved via the generalized eigenvalue problem:

$$\begin{aligned} \varvec{\Sigma }_1{w} = \lambda \varvec{\Sigma }_2{w} \end{aligned}$$

(32)

For improved robustness, we employ Tikhonov regularization:

$$\begin{aligned} \tilde{\varvec{\Sigma }}_k = (1-\gamma )\varvec{\Sigma }_k + \gamma \text {tr}(\varvec{\Sigma }_k){I} \end{aligned}$$

(33)

where \(\gamma \in [0,1]\) is the regularization parameter determined through cross-validation.

The multi-class extension uses the one-versus-rest (OVR) approach:

$$W_{{OVR}} = \bigcup\limits_{{k = 1}}^{K} {W_{k} } ,\quad {\text{where}}\quad W_{k} = CSP\left( {\Sigma _{k} ,\sum\limits_{{j \ne k}} {\Sigma _{j} } } \right)$$

(34)

Feature extraction and dimensionality reduction

After spatial filtering, we extract logarithmic band power features:

$$\begin{aligned} f_i = \log \left( \frac{1}{T_w}\int _{t_0}^{t_0+T_w} ({w}_i^T{x}
(35)

where \(T_w\) is the time window length and \({w}_i\) is the i-th spatial filter. Principal Component Analysis (PCA) is applied for dimensionality reduction:

$$\begin{aligned} {Y} = {U}_d^T({F} – \varvec{\mu }_F) \end{aligned}$$

(36)

where \({U}_d\) contains the top d eigenvectors of the feature covariance matrix:

$$\begin{aligned} {C}_F = \frac{1}{N-1}\sum _{i=1}^N ({f}_i – \varvec{\mu }_F)({f}_i – \varvec{\mu }_F)^T \end{aligned}$$

(37)

The number of components d is selected to retain 95

$$\begin{aligned} d = \arg \min _{k} \left\{ k : \frac{\sum _{i=1}^k \lambda _i}{\sum _{i=1}^D \lambda _i} \ge 0.95\right\} \end{aligned}$$

(38)

Deep learning architectures

Convolutional neural network for spatial feature learning

The convolutional architecture is specifically designed for EEG spatial pattern extraction. The first layer performs spatial convolution across electrode locations:

$$\begin{aligned} h_{i,j}^{(1)} = \sigma \left( \sum _{m=1}^{C} \sum _{n=1}^{T_k} w_{m,n}^{(i)} x_{m,j+n-1} + b_i^{(1)}\right) \end{aligned}$$

(39)

where C is the number of channels, \(T_k\) is the temporal kernel size, and \(\sigma\) is the activation function.

The spatial attention mechanism is incorporated as:

$$\begin{aligned} \alpha _c = \frac{\exp (e_c)}{\sum _{k=1}^C \exp (e_k)}, \quad e_c = {v}_a^T \tanh ({W}_a {h}_c + {b}_a) \end{aligned}$$

(40)

The attended feature representation becomes:

$$\begin{aligned} \tilde{{h}} = \sum _{c=1}^C \alpha _c {h}_c \end{aligned}$$

(41)

Long short-term memory networks for temporal dynamics

The LSTM architecture captures the temporal evolution of motor imagery patterns. The complete LSTM dynamics are governed by:

$$\begin{aligned} {f}_t&= \sigma ({W}_f {x}_t + {U}_f {h}_{t-1} + {V}_f \odot {c}_{t-1} + {b}_f) \end{aligned}$$

(42)

$$\begin{aligned} {i}_t&= \sigma ({W}_i {x}_t + {U}_i {h}_{t-1} + {V}_i \odot {c}_{t-1} + {b}_i) \end{aligned}$$

(43)

$$\begin{aligned} \tilde{{c}}_t&= \tanh ({W}_c {x}_t + {U}_c {h}_{t-1} + {b}_c) \end{aligned}$$

(44)

$$\begin{aligned} {c}_t&= {f}_t \odot {c}_{t-1} + {i}_t \odot \tilde{{c}}_t \end{aligned}$$

(45)

$$\begin{aligned} {o}_t&= \sigma ({W}_o {x}_t + {U}_o {h}_{t-1} + {V}_o \odot {c}_t + {b}_o) \end{aligned}$$

(46)

$$\begin{aligned} {h}_t&= {o}_t \odot \tanh ({c}_t) \end{aligned}$$

(47)

where \(\odot\) denotes element-wise multiplication, and \({V}_f, {V}_i, {V}_o\) are the peephole connection weights.

Fig. 2
figure 2

Schematic diagram of CNN-LSTM model structure.

The gradient flow through LSTM is carefully managed to prevent vanishing gradients:

$$\begin{aligned} \frac{\partial {c}_t}{\partial {c}_{t-k}} = \prod _{j=1}^k {f}_{t-j+1} \end{aligned}$$

(48)

This product of forget gates allows long-term dependencies to be preserved when \({f}_t \approx 1\).

Bidirectional processing for complete temporal context

The bidirectional LSTM processes the sequence in both forward and backward directions:

$$\begin{aligned} \overrightarrow{{h}}_t= & \text {LSTM}_{\rightarrow }({x}_t, \overrightarrow{{h}}_{t-1}, \overrightarrow{{c}}_{t-1}) \end{aligned}$$

(49)

$$\begin{aligned} \overleftarrow{{h}}_t= & \text {LSTM}_{\leftarrow }({x}_t, \overleftarrow{{h}}_{t+1}, \overleftarrow{{c}}_{t+1}) \end{aligned}$$

(50)

The combined representation is:

$$\begin{aligned} {h}_t = [\overrightarrow{{h}}_t; \overleftarrow{{h}}_t] \end{aligned}$$

(51)

This bidirectional processing captures both causal and anti-causal dependencies in the EEG signal.

Hierarchical CNN-LSTM architecture

The hybrid CNN-LSTM model leverages the complementary strengths of convolutional and recurrent architectures in Fig. 2. The hierarchical processing pipeline can be formalized as:

$$\begin{aligned} \mathscr {F}_{\text {hybrid}} = \mathscr {L}_{\text {LSTM}} \circ \mathscr {R} \circ \mathscr {P} \circ \mathscr {C}_{\text {CNN}} \circ \mathscr {S} \end{aligned}$$

(52)

where \(\mathscr {S}\) represents spatial filtering, \(\mathscr {C}_{\text {CNN}}\) denotes convolutional feature extraction, \(\mathscr {P}\) is pooling, \(\mathscr {R}\) is reshaping, and \(\mathscr {L}_{\text {LSTM}}\) represents temporal modeling.

The convolutional block consists of multiple layers with progressively increasing receptive fields:

$$\begin{aligned} {H}^{(l+1)} = \phi \left( \sum _{k=1}^{K_l} {W}_k^{(l)} * {H}_k^{(l)} + {b}^{(l)}\right) \end{aligned}$$

(53)

where \(*\) denotes the convolution operation, \(K_l\) is the number of feature maps in layer l, and \(\phi\) is the activation function.

We employ depthwise separable convolutions to reduce computational complexity:

$$\begin{aligned} \text {DepthwiseSep}({X}) = \text {Pointwise}(\text {Depthwise}({X})) \end{aligned}$$

(54)

The depthwise convolution applies a single filter per input channel:

$$\begin{aligned} {Y}_{c,i,j} = \sum _{m,n} {K}_{c,m,n} \cdot {X}_{c,i+m,j+n} \end{aligned}$$

(55)

Followed by pointwise convolution to combine channels:

$$\begin{aligned} {Z}_{k,i,j} = \sum _{c=1}^{C_{\text {in}}} {W}_{k,c} \cdot {Y}_{c,i,j} \end{aligned}$$

(56)

Multi-head attention mechanism

The attention mechanism enables the model to focus on task-relevant spatiotemporal patterns. We implement multi-head attention with H parallel attention heads:

$$\begin{aligned} \text {MultiHead}({Q}, {K}, {V}) = \text {Concat}(\text {head}_1, …, \text {head}_H){W}^O \end{aligned}$$

(57)

where each head is computed as:

$$\begin{aligned} \text {head}_i = \text {Attention}({Q}{W}_i^Q, {K}{W}_i^K, {V}{W}_i^V) \end{aligned}$$

(58)

The scaled dot-product attention is:

$$\begin{aligned} \text {Attention}({Q}, {K}, {V}) = \text {softmax}\left( \frac{{Q}{K}^T}{\sqrt{d_k}}\right) {V} \end{aligned}$$

(59)

where \(d_k\) is the dimension of the key vectors, and the scaling factor \(1/\sqrt{d_k}\) prevents gradient vanishing in the softmax function. For temporal attention over LSTM hidden states, we compute:

$$\begin{aligned} e_{t,t’} = {v}^T \tanh ({W}_h {h}_t + {W}_s {s}_{t’} + {b}_{\text {att}}) \end{aligned}$$

(60)

The attention weights are normalized:

$$\begin{aligned} \alpha _{t,t’} = \frac{\exp (e_{t,t’})}{\sum _{k=1}^T \exp (e_{t,k})} \end{aligned}$$

(61)

The context vector aggregates the attended features:

$$\begin{aligned} {c}_t = \sum _{t’=1}^T \alpha _{t,t’} {h}_{t’} \end{aligned}$$

(62)

Self-attention for global dependencies

To capture long-range dependencies in EEG signals, we incorporate self-attention layers:

$$\begin{aligned} {SA}({X}) = \text {softmax}\left( \frac{{X}{X}^T}{\sqrt{d}}\right) {X} \end{aligned}$$

(63)

The position-aware self-attention includes positional encodings:

$$\begin{aligned} {PE}_{(pos,2i)}= & \sin \left( \frac{pos}{10000^{2i/d}}\right) \end{aligned}$$

(64)

$$\begin{aligned} {PE}_{(pos,2i+1)}= & \cos \left( \frac{pos}{10000^{2i/d}}\right) \end{aligned}$$

(65)

The input with positional encoding becomes:

$$\begin{aligned} \tilde{{X}} = {X} + {PE} \end{aligned}$$

(66)

Algorithm 3
figure c

Real-time motor imagery classification

Layer normalization placement: We employ a pre-normalization architecture following the recent advances in transformer design. The complete attention block implements the following sequence:

$$\begin{aligned} \begin{aligned} \tilde{H}^{(l)}&= \text {LayerNorm}(H^{(l-1)}) \\ \text {Attn}^{(l)}&= \text {MultiHeadAttention}(\tilde{H}^{(l)}, \tilde{H}^{(l)}, \tilde{H}^{(l)}) \\ H^{(l)}&= H^{(l-1)} + \text {Dropout}(\text {Attn}^{(l)}, p_{attn}) \\ \tilde{H}^{(l+1)}&= \text {LayerNorm}(H^{(l)}) \\ \text {FFN}^{(l)}&= \text {FeedForward}(\tilde{H}^{(l+1)}) \\ H^{(l+1)}&= H^{(l)} + \text {Dropout}(\text {FFN}^{(l)}, p_{ffn}) \end{aligned} \end{aligned}$$

(67)

The LayerNorm operation is applied with learnable parameters \(\gamma\) and \(\beta\):

$$\begin{aligned} \text {LayerNorm}(x) = \gamma \odot \frac{x – \mu }{\sqrt{\sigma ^2 + \epsilon }} + \beta \end{aligned}$$

(68)

where \(\mu\) and \(\sigma ^2\) are computed across the feature dimension, and \(\epsilon = 1e^{-6}\) for numerical stability.

Attention dropout configuration: We implement dropout at three specific locations within the attention mechanism:

  1. 1.

    Attention weight dropout (\(p_{attn} = 0.3\)): Applied to the attention probability matrix after softmax normalization:

    $$\begin{aligned} \text {Attention}(Q,K,V) = \text {Dropout}\left( \text {softmax}\left( \frac{QK^T}{\sqrt{d_k}}\right) , 0.3\right) V \end{aligned}$$

    (69)

  2. 2.

    Output projection dropout (\(p_{proj} = 0.2\)): Applied to the concatenated multi-head output before the final linear projection:

    $$\begin{aligned} \text {MultiHead}(Q,K,V) = \text {Dropout}(\text {Concat}(\text {head}_1, …, \text {head}_H), 0.2) W^O \end{aligned}$$

    (70)

  3. 3.

    Residual connection dropout (\(p_{res} = 0.1\)): Applied to the attention output before the residual addition:

    $$\begin{aligned} H^{(l)} = H^{(l-1)} + \text {Dropout}(\text {Attn}^{(l)}, 0.1) \end{aligned}$$

    (71)

L2 regularization on attention weights: We apply L2 regularization specifically to the attention projection matrices with coefficient \(\lambda _{attn} = 0.001\):

$$\begin{aligned} \mathscr {L}_{attn\_reg} = \lambda _{attn} \sum _{h=1}^{H} \left( \Vert W_Q^{(h)}\Vert _F^2 + \Vert W_K^{(h)}\Vert _F^2 + \Vert W_V^{(h)}\Vert _F^2\right) + \lambda _{attn} \Vert W^O\Vert _F^2 \end{aligned}$$

(72)

The total loss function becomes:

$$\begin{aligned} \mathscr {L}_{total} = \mathscr {L}_{CE} + \lambda _1 \mathscr {L}_{center} + \lambda _2 \mathscr {L}_{reg} + \mathscr {L}_{attn\_reg} \end{aligned}$$

(73)

Attention head initialization: Each attention head is initialized using Xavier uniform initialization:

$$\begin{aligned} W_Q^{(h)}, W_K^{(h)}, W_V^{(h)} \sim \mathscr {U}\left( -\sqrt{\frac{6}{d_{model} + d_k}}, \sqrt{\frac{6}{d_{model} + d_k}}\right) \end{aligned}$$

(74)

where \(d_{model} = 256\) and \(d_k = 32\).

Gradient scaling for attention: To stabilize training, we apply gradient scaling to attention parameters:

$$\begin{aligned} \frac{\partial \mathscr {L}}{\partial W_{attn}} \leftarrow \alpha _{scale} \cdot \frac{\partial \mathscr {L}}{\partial W_{attn}} \end{aligned}$$

(75)

where \(\alpha _{scale} = 0.5\) prevents attention gradients from dominating the optimization.

Training methodology and optimization

Loss function design

For multi-class motor imagery classification, we employ a weighted cross-entropy loss to handle class imbalance:

$$\begin{aligned} \mathscr {L}_{\text {CE}} = -\sum _{i=1}^N \sum _{k=1}^K w_k y_{i,k} \log (\hat{y}_{i,k}) \end{aligned}$$

(76)

where \(w_k = \frac{N}{K \cdot N_k}\) is the class weight, \(N_k\) is the number of samples in class k.

To improve feature discrimination, we add a center loss term:

$$\begin{aligned} \mathscr {L}_{\text {center}} = \frac{1}{2}\sum _{i=1}^N \Vert {f}_i – {c}_{y_i}\Vert _2^2 \end{aligned}$$

(77)

where \({f}_i\) is the feature representation and \({c}_{y_i}\) is the center of class \(y_i\).

The total loss combines multiple objectives:

$$\begin{aligned} \mathscr {L}_{\text {total}} = \mathscr {L}_{\text {CE}} + \lambda _1 \mathscr {L}_{\text {center}} + \lambda _2 \mathscr {L}_{\text {reg}} \end{aligned}$$

(78)

where \(\mathscr {L}_{\text {reg}}\) is the regularization term:

$$\begin{aligned} \mathscr {L}_{\text {reg}} = \sum _{l=1}^L \left( \Vert {W}^{(l)}\Vert _F^2 + \beta \Vert {W}^{(l)}\Vert _1\right) , \end{aligned}$$

(79)

combining L2 (Frobenius norm) and L1 regularization for weight sparsity.

Fig. 3
figure 3

Schematic diagram of RF model structure.

Optimization algorithm

We employ the Adam optimizer with modifications for improved convergence:

$$\begin{aligned} {m}_t&= \beta _1 {m}_{t-1} + (1-\beta _1) {g}_t \end{aligned}$$

(80)

$$\begin{aligned} {v}_t&= \beta _2 {v}_{t-1} + (1-\beta _2) {g}_t^2 \end{aligned}$$

(81)

$$\begin{aligned} \hat{{m}}_t&= \frac{{m}_t}{1-\beta _1^t} \end{aligned}$$

(82)

$$\begin{aligned} \hat{{v}}_t&= \frac{{v}_t}{1-\beta _2^t} \end{aligned}$$

(83)

$$\begin{aligned} \varvec{\theta }_t&= \varvec{\theta }_{t-1} – \alpha \frac{\hat{{m}}_t}{\sqrt{\hat{{v}}_t} + \epsilon } \end{aligned}$$

(84)

where \({g}_t = \nabla _{\varvec{\theta }} \mathscr {L}(\varvec{\theta }_{t-1})\) is the gradient at time t.

The learning rate follows a cosine annealing schedule with warm restarts:

$$\begin{aligned} \alpha _t = \alpha _{\min } + \frac{1}{2}(\alpha _{\max } – \alpha _{\min })\left( 1 + \cos \left( \frac{T_{\text {cur}}}{T_{\max }}\pi \right) \right) \end{aligned}$$

(85)

where \(T_{\text {cur}}\) is the number of epochs since the last restart.

Gradient clipping and normalization

To prevent gradient explosion in deep networks, we apply gradient clipping:

$$\begin{aligned} \tilde{{g}} = {\left\{ \begin{array}{ll} {g} & \text {if } \Vert {g}\Vert _2 \le \tau \\ \tau \frac{{g}}{\Vert {g}\Vert _2} & \text {otherwise} \end{array}\right. } \end{aligned}$$

(86)

Layer normalization is applied to stabilize training:

$$\begin{aligned} \text {LN}({x}) = \gamma \odot \frac{{x} – \mu }{\sqrt{\sigma ^2 + \epsilon }} + \beta \end{aligned}$$

(87)

where \(\mu\) and \(\sigma ^2\) are the mean and variance computed across the feature dimension.

Data augmentation strategies

To improve model generalization, we implement several EEG-specific data augmentation techniques:

  1. 1.

    **Temporal Shifting**: Random temporal shifts within physiologically plausible ranges in Fig. 3:

    $$\begin{aligned} \tilde{{x}}
    (88)

  2. 2.

    **Amplitude Scaling**: Channel-wise amplitude modulation:

    $$\begin{aligned} \tilde{x}_c
    (89)

  3. 3.

    **Gaussian Noise Addition**:

    $$\begin{aligned} \tilde{{x}}
    (90)

    where \(\sigma _n = 0.01 \cdot \text {std}({x})\).

  4. 4.

    **Mixup Regularization**: Linear interpolation between training samples:

    $$\begin{aligned} \tilde{{x}} = \lambda {x}_i + (1-\lambda ) {x}_j, \quad \tilde{y} = \lambda y_i + (1-\lambda ) y_j \end{aligned}$$

    (91)

    where \(\lambda \sim \text {Beta}(\alpha , \alpha )\) with \(\alpha = 0.2\).

Evaluation metrics and statistical analysis

Performance metrics

Beyond accuracy, we employ multiple metrics to comprehensively evaluate model performance:

  1. 1.

    **Cohen’s Kappa**: Accounts for chance agreement:

    $$\begin{aligned} \kappa = \frac{p_o – p_e}{1 – p_e} \end{aligned}$$

    (92)

    where \(p_o\) is the observed agreement and \(p_e\) is the expected agreement by chance.

  2. 2.

    **Matthews Correlation Coefficient**: Balanced measure for multi-class:

    $$\begin{aligned} \text {MCC} = \frac{\sum _{k,l,m} C_{kk}C_{lm} – C_{kl}C_{mk}}{\sqrt{\sum _k(\sum _l C_{kl})(\sum _{l’} C_{l’k})}\sqrt{\sum _k(\sum _l C_{lk})(\sum _{l’} C_{kl’})}} \end{aligned}$$

    (93)

    where C is the confusion matrix.

  3. 3.

    **Area Under the ROC Curve (AUC)**: For multi-class, we use one-vs-rest:

    $$\begin{aligned} \text {AUC}_{\text {macro}} = \frac{1}{K}\sum _{k=1}^K \text {AUC}_k \end{aligned}$$

    (94)

Statistical significance testing

To assess the statistical significance of performance differences, we employ:

  1. 1.

    **Wilcoxon Signed-Rank Test**: For paired comparisons across subjects:

    $$\begin{aligned} W = \sum _{i=1}^n \text {sgn}(x_{2,i} – x_{1,i}) \cdot R_i \end{aligned}$$

    (95)

    where \(R_i\) is the rank of \(|x_{2,i} – x_{1,i}|\).

  2. 2.

    **Friedman Test**: For comparing multiple models:

    $$\begin{aligned} \chi _F^2 = \frac{12n}{k(k+1)}\sum _{j=1}^k R_j^2 – 3n(k+1) \end{aligned}$$

    (96)

    where n is the number of subjects, k is the number of models, and \(R_j\) is the sum of ranks for model j.

  3. 3.

    **Post-hoc Nemenyi Test**: For pairwise comparisons after Friedman test:

    $$\begin{aligned} CD = q_{\alpha } \sqrt{\frac{k(k+1)}{6n}} \end{aligned}$$

    (97)

    where \(q_{\alpha }\) is the critical value from the Studentized range distribution.

Cross-validation strategy

We implement a nested cross-validation scheme to ensure unbiased performance estimation:

$$\begin{aligned} \hat{\mathscr {E}} = \frac{1}{K_{\text {outer}}} \sum _{i=1}^{K_{\text {outer}}} \mathscr {E}_i(\varvec{\theta }_i^*) \end{aligned}$$

(98)

where \(\varvec{\theta }_i^*\) is obtained from inner cross-validation:

$$\begin{aligned} \varvec{\theta }_i^* = \arg \min _{\varvec{\theta }} \frac{1}{K_{\text {inner}}} \sum _{j=1}^{K_{\text {inner}}} \mathscr {L}(\varvec{\theta }; \mathscr {D}_{i,j}^{\text {train}}, \mathscr {D}_{i,j}^{\text {val}}) \end{aligned}$$

(99)

Nested cross-validation protocol

To ensure unbiased performance estimation and robust hyperparameter optimization, we implemented a rigorous nested cross-validation strategy with explicit leave-two-subjects-out (L2SO) outer loop validation. The complete protocol consists of the following components:

Outer loop (model evaluation): We employed a leave-two-subjects-out cross-validation scheme, systematically holding out 2 participants for testing while using the remaining 13 participants for training and validation. This resulted in \(\left( {\begin{array}{c}15\\ 2\end{array}}\right) = 105\) outer folds, ensuring that every possible pair of subjects served as the test set exactly once. The outer loop provides an unbiased estimate of model generalization to unseen subjects, which is critical for BCI applications where cross-subject variability is substantial.

Inner loop (hyperparameter optimization): For each outer fold, we performed hyperparameter optimization using the 13 training subjects through stratified 5-fold cross-validation. The inner training set was split using an 80%-20% ratio, where 80% (approximately 10.4 subjects worth of data) was used for model training and 20% (approximately 2.6 subjects worth of data) was reserved for validation during hyperparameter search.

Parameter grid specification: The hyperparameter search space encompassed the following parameters:

  • Learning rates: \(\alpha \in \{0.0001, 0.0005, 0.001, 0.005, 0.01\}\)

  • Dropout rates: \(p_1 \in \{0.15, 0.2, 0.25, 0.3\}\), \(p_2 \in \{0.4, 0.5, 0.6\}\)

  • LSTM units: \(N_{lstm} \in \{64, 96, 128, 192, 256\}\)

  • Attention heads: \(H \in \{4, 6, 8, 12, 16\}\)

  • Regularization weights: \(\lambda _1 \in \{0.001, 0.003, 0.005\}\), \(\lambda _2 \in \{0.0001, 0.0005, 0.001\}\)

This resulted in \(5 \times 4 \times 3 \times 5 \times 5 \times 3 \times 3 = 13,500\) hyperparameter combinations evaluated through Bayesian optimization using Gaussian Process surrogate models.

Early stopping criteria: To prevent overfitting during hyperparameter search, we implemented early stopping with the following criteria: (1) patience of 20 epochs with no improvement in validation loss, (2) minimum delta of 0.001 for considering an improvement, and (3) restoration of best weights upon early termination. The maximum training duration was capped at 200 epochs per configuration.

Performance aggregation: The final performance estimate was computed as:

$$\begin{aligned} \hat{E} = \frac{1}{105} \sum _{i=1}^{105} E_i(\theta _i^*), \end{aligned}$$

(100)

where \(E_i(\theta _i^*)\) represents the test performance of the optimal hyperparameters \(\theta _i^*\) found for the i-th outer fold. The reported standard deviations reflect the variability across the 105 outer folds, providing a robust estimate of model stability across different subject combinations.

Statistical validation: To ensure the reliability of our nested CV results, we computed 95% confidence intervals using the t-distribution with 104 degrees of freedom (105 folds – 1). Additionally, we performed the Shapiro-Wilk test to verify the normality of performance distributions across folds, confirming that parametric statistical tests were appropriate for our analysis.

Advanced signal processing techniques

Riemannian geometry-based features

EEG covariance matrices lie on the manifold of symmetric positive definite (SPD) matrices. We compute the Riemannian mean of trial covariance matrices:

$$\begin{aligned} \bar{{C}} = \arg \min _{{C} \in \mathscr {S}_{++}^n} \sum _{i=1}^N d_R^2({C}, {C}_i) \end{aligned}$$

(101)

where \(d_R\) is the Riemannian distance:

$$\begin{aligned} d_R({C}_1, {C}_2) = \left\| \log ({C}_1^{-1/2}{C}_2{C}_1^{-1/2})\right\| _F = \sqrt{\sum _{i=1}^n \log ^2(\lambda _i)} \end{aligned}$$

(102)

The tangent space mapping at the Riemannian mean provides vectorized features:

$$\begin{aligned} {s}_i = \text {vec}_u(\log _{\bar{{C}}}({C}_i)) \end{aligned}$$

(103)

where \(\text {vec}_u\) is the upper triangular vectorization and \(\log _{\bar{{C}}}\) is the Riemannian logarithm.

Wavelet packet decomposition

For multi-resolution analysis, we employ wavelet packet decomposition:

$$\begin{aligned} W_{j,n}^{(p)}[k] = \sum _m h_p[m-2k]W_{j-1,n}^{(\lfloor p/2 \rfloor )}[m] \end{aligned}$$

(104)

where \(p \in \{0,1\}\) indicates low-pass (\(h_0\)) or high-pass (\(h_1\)) filtering, and j is the decomposition level.

The best basis selection uses the Shannon entropy criterion:

$$\begin{aligned} E(W) = -\sum _{k} |W[k]|^2 \log (|W[k]|^2) \end{aligned}$$

(105)

The optimal wavelet packet tree \(\mathscr {T}^*\) minimizes:

$$\begin{aligned} \mathscr {T}^* = \arg \min _{\mathscr {T}} \sum _{(j,n) \in \mathscr {T}} E(W_{j,n}) \end{aligned}$$

(106)

Phase-locking value and connectivity analysis

To assess functional connectivity during motor imagery, we compute the phase-locking value (PLV):

$$\begin{aligned} \text {PLV}_{i,j}(f) = \left| \frac{1}{N}\sum _{n=1}^N e^{i(\phi _i^n(f) – \phi _j^n(f))}\right| \end{aligned}$$

(107)

where \(\phi _i^n(f)\) is the instantaneous phase of channel i at frequency f for trial n.

The weighted phase lag index (wPLI) provides a more robust connectivity measure:

$$\begin{aligned} \text {wPLI}_{i,j} = \frac{|E\{|\Im ({S}_{ij})|\text {sgn}(\Im ({S}_{ij}))\}|}{E\{|\Im ({S}_{ij})|\}} \end{aligned}$$

(108)

where \({S}_{ij}\) is the cross-spectral density between channels i and j.

Deep learning implementation details

Network architecture specifications

The complete CNN-LSTM-Attention architecture consists of the following layers: Spatial convolution block

$$\begin{aligned} \begin{aligned} {H}_1&= \text {Conv2D}({X}, F_1, (1, K_t), \text {padding}=\text {‘same’}) \\ {H}_2&= \text {BatchNorm}({H}_1) \\ {H}_3&= \text {DepthwiseConv2D}({H}_2, (C, 1), \text {depth}=D) \\ {H}_4&= \text {BatchNorm}(\text {ELU}({H}_3)) \\ {H}_5&= \text {AveragePool2D}({H}_4, (1, P_t)) \end{aligned} \end{aligned}$$

(109)

where \(F_1 = 8\) initial filters, \(K_t = 64\) temporal kernel size, \(D = 2\) depth multiplier, and \(P_t = 8\) pooling size.

Separable convolution block

$$\begin{aligned} \begin{aligned} {H}_6&= \text {SeparableConv2D}(\text {Dropout}({H}_5, p_1), F_2, (1, K_t’)) \\ {H}_7&= \text {BatchNorm}({H}_6) \\ {H}_8&= \text {ELU}({H}_7) \\ {H}_9&= \text {AveragePool2D}({H}_8, (1, P_t’)) \end{aligned} \end{aligned}$$

(110)

where \(F_2 = 16\), \(K_t’ = 16\), \(p_1 = 0.25\), and \(P_t’ = 8\).

LSTM block

$$\begin{aligned} {H}_{10}= & \text {Reshape}({H}_9, (-1, T_{\text {seq}}, F_{\text {lstm}})) \end{aligned}$$

(111)

$$\begin{aligned} {H}_{11}, ({h}_T, {c}_T)= & \text {LSTM}({H}_{10}, \text {units}=N_{\text {lstm}}, \text {return\_sequences}=\text {True}) \end{aligned}$$

(112)

where \(N_{\text {lstm}} = 128\) LSTM units.

Attention block

$$\begin{aligned} \begin{aligned} {E}&= \tanh ({W}_a {H}_{11} + {b}_a) \\ \varvec{\alpha }&= \text {softmax}({w}_e^T {E}) \\ {c}&= \sum _{t=1}^{T_{\text {seq}}} \alpha _t {h}_t \end{aligned} \end{aligned}$$

(113)

*Classification head

$$\begin{aligned} \begin{aligned} {H}_{12}&= \text {Dense}({c}, N_{\text {dense}}, \text {activation}=\text {‘relu’}) \\ {H}_{13}&= \text {Dropout}({H}_{12}, p_2) \\ \hat{{y}}&= \text {Dense}({H}_{13}, K, \text {activation}=\text {‘softmax’}) \end{aligned} \end{aligned}$$

(114)

where \(N_{\text {dense}} = 64\) and \(p_2 = 0.5\).

Initialization strategies

Proper weight initialization is crucial for deep network training. For convolutional layers, we use He initialization:

$$\begin{aligned} w_{ij} \sim \mathscr {N}\left( 0, \sqrt{\frac{2}{n_{\text {in}}}}\right) \end{aligned}$$

(115)

For LSTM weights, we employ orthogonal initialization:

$$\begin{aligned} {W} = {Q} \text { from QR decomposition of } {A} \sim \mathscr {N}(0,1) \end{aligned}$$

(116)

Forget gate biases are initialized to 1.0 to encourage information flow:

$$\begin{aligned} b_f^{(0)} = 1.0 \end{aligned}$$

(117)

Ensemble methods and model averaging

To improve robustness, we employ ensemble techniques:

Snapshot ensembling: Save models at different training epochs with cosine annealing:

$$\begin{aligned} \hat{y}_{\text {ensemble}} = \frac{1}{M}\sum _{m=1}^M \hat{y}^{(m)} \end{aligned}$$

(118)

Weighted averaging: Based on validation performance:

$$\begin{aligned} \hat{y}_{\text {weighted}} = \sum _{m=1}^M w_m \hat{y}^{(m)}, \quad w_m = \frac{e^{\beta \cdot \text {acc}_m}}{\sum _{j=1}^M e^{\beta \cdot \text {acc}_j}} \end{aligned}$$

(119)

where \(\beta\) is a temperature parameter.

Spatiotemporal attention fusion mechanism

Our spatiotemporal attention employs a parallel dual-stream architecture that simultaneously computes spatial and temporal attention components, followed by adaptive fusion. Given bidirectional LSTM outputs \(H \in 97.2477{R}^{T_{seq} \times d_{hidden}}\), we reshape to \(H_{spatial} \in 97.2477{R}^{T_{seq} \times C \times d_{electrode}}\) to preserve spatial electrode structure, where \(d_{electrode} = d_{hidden}/C\).

The spatial attention operates across electrode channels at each temporal position:

$$\begin{aligned} e_{spatial}^{(t,c)}&= v_s^T \tanh (W_s H_{spatial}^{(t,c)} + b_s) \end{aligned}$$

(120)

$$\begin{aligned} \alpha _{spatial}^{(t,c)}&= \frac{\exp (e_{spatial}^{(t,c)})}{\sum _{c’=1}^{C} \exp (e_{spatial}^{(t,c’)})} \end{aligned}$$

(121)

$$\begin{aligned} c_{spatial}^{
(122)

Parallel temporal attention operates across time steps:

$$\begin{aligned} e_{temporal}^{
(123)

$$\begin{aligned} \alpha _{temporal}^{
(124)

$$\begin{aligned} c_{temporal}&= \sum _{t=1}^{T_{seq}} \alpha _{temporal}^{
(125)

The adaptive fusion mechanism dynamically weights spatial and temporal information:

$$\begin{aligned} g_{spatial}&= \sigma (W_{gate,s} [c_{temporal}; \bar{H}] + b_{gate,s}) \end{aligned}$$

(126)

$$\begin{aligned} g_{temporal}&= 1 – g_{spatial} \end{aligned}$$

(127)

$$\begin{aligned} c_{fused}&= g_{spatial} \odot c_{spatial,avg} + g_{temporal} \odot c_{temporal} \end{aligned}$$

(128)

where \(c_{spatial,avg} = \frac{1}{T_{seq}} \sum _{t=1}^{T_{seq}} c_{spatial}^{
(129)