A scheme combining feature fusion and hybrid deep learning models for epileptic seizure detection and prediction

Dataset

In this study, the CHB-MIT dataset is utilized for detecting and predicting seizures. Developed by the Massachusetts Institute of Technology (MIT) and Children’s Hospital in Boston (CHB), this publicly accessible dataset comprises scalp EEG recordings from various patients and is extensively employed in epilepsy studies. It adopts the International 10–20 System’s bipolar montage method, capturing EEG signals from 22 electrodes at a 256 Hz sampling rate with 16-bit precision. Table 1 details the CHB-MIT dataset, which typically includes 23 EEG signal channels, with a few cases having 18 channels. Cases CHB01 and CHB21 were recorded from the same patient, 1.5 years apart. Each case contains approximately 9 to 42 consecutive EEG files, most of which hold one hour of EEG recordings. The EEG files in this dataset include 198 seizure episodes, and the beginning and end of each episode are labeled⁴⁹.

Table 1 Summary of CHB-MIT epilepsy EEG dataset.

Experimental methods

The schematic representation of our proposed methodology for the detection and prediction of epileptic seizures is depicted in Fig. 1. After filtering, the EEG signal is decomposed through DWT to generate six subbands (D1–D5, A5). Features are extracted from these subbands, including standard deviation (STD), power spectral density (PSD), band energy, and fuzzy entropy (FuzzyEn). Finally, the CNN-GRU-AM model further extracts features and performs classification. Each component of the methodology is elucidated in detail in the subsequent sections.

Pre-processing

The majority of studies focusing on epilepsy diagnosis operate under the assumption that epileptic EEG signals exhibit four distinct continuous states of brain activity. These states include the pre-ictal phase (preceding the seizure), the ictal phase (during the seizure occurrence), the post-ictal phase (following the seizure), and the inter-ictal phase (representing the non-seizure intervals).

In the seizure detection task, the primary goal is to identify the timing of seizure events by distinguishing between ictal and inter-ictal states. In the seizure prediction task, by contrast, the goal is to issue a warning before the onset of a seizure, so the focus lies in distinguishing between inter-ictal and pre-ictal states. The CHB-MIT dataset has a sampling frequency of 256 Hz. For each of the ictal, pre-ictal, and inter-ictal phases, a segment of 76,800 data points was randomly sampled, corresponding to a time length of 300 s. Notably, a significant proportion of EEG recordings in the CHB-MIT dataset are contaminated by 60 Hz power line noise. This interference can be mitigated by eliminating components within the 57–63 Hz and 117–123 Hz frequency ranges, which cover the 60 Hz fundamental and its 120 Hz harmonic. Following this noise reduction step, the filtered EEG signals undergo DWT. The calculation formula for DWT is shown in Eq. (1).
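The segment length and the band removal described above can be illustrated with a minimal sketch. The source does not state which filter design was used, so the FFT-masking approach below is an assumption chosen for brevity, and the function name `remove_bands` is hypothetical.

```python
import numpy as np

FS = 256                                 # CHB-MIT sampling rate in Hz (from the text)
SEGMENT_SECONDS = 300                    # segment duration (from the text)
SEGMENT_SAMPLES = FS * SEGMENT_SECONDS   # 76,800 samples per segment

def remove_bands(signal, fs=FS, bands=((57.0, 63.0), (117.0, 123.0))):
    """Zero the FFT bins inside each (low, high) band and transform back.

    Assumption: FFT masking stands in for the unspecified filter design.
    """
    signal = np.asarray(signal, dtype=float)
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    for low, high in bands:
        spectrum[(freqs >= low) & (freqs <= high)] = 0.0
    return np.fft.irfft(spectrum, n=signal.size)

# Synthetic check: a 10 Hz component plus 60 Hz interference.
t = np.arange(4 * FS) / FS
clean = np.sin(2 * np.pi * 10 * t)
noisy = clean + 0.5 * np.sin(2 * np.pi * 60 * t)
filtered = remove_bands(noisy)
```

On this synthetic signal the 60 Hz component falls inside the first stop band and is removed, while the 10 Hz component is left unchanged. A notch or band-stop IIR filter would be an equally plausible reading of the text.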

$$DWT(j,k) = \frac{1}{\sqrt{2^{j}}}\int\limits_{-\infty}^{\infty} f(t)\,\psi^{*}\left(\frac{t - k2^{j}}{2^{j}}\right)dt,$$

(1)

where $f(t)$ is the input signal and $\psi^{*}$ is the complex conjugate of the mother wavelet function $\psi$ with fluctuating characteristics. $2^{j}$ and $k2^{j}$ denote the scale factor and the translation factor, respectively, where $j$ is the decomposition level and $k$ is an integer.

As a time–frequency analysis method, the DWT dissects the original signal into sub-signals of varying frequencies through a sequence of filters and down-sampling operations during a multi-level decomposition. These sub-signals encompass both approximation coefficients, indicative of the low-frequency part, and detail coefficients, capturing the high-frequency component. For the j-th level of decomposition, the sampling frequency is ${f}_{s}/{2}^{j}$, where ${f}_{s}$ is the original signal’s sampling frequency. The frequency band of the detail coefficients can be represented as $\left[{f}_{s}/{2}^{j+1},{f}_{s}/{2}^{j}\right]$. In this study, we employed the Daubechies-4 wavelet function as the basis function for wavelet decomposition, configuring it as a five-level decomposition. For an EEG signal sampled at 256 Hz, the detail coefficients across the five scales respectively represent signal components within the 64–128 Hz, 32–64 Hz, 16–32 Hz, 8–16 Hz, and 4–8 Hz frequency bands⁵⁰.
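The band edges quoted above follow directly from the halving rule $\left[{f}_{s}/{2}^{j+1},{f}_{s}/{2}^{j}\right]$. The short sketch below reproduces them; the helper name `dwt_bands` is hypothetical, and the A5 range is the nominal band left below the lowest detail band under the same rule.

```python
def dwt_bands(fs, levels):
    """Return nominal detail bands D1..D<levels> and the final approximation band."""
    details = []
    for j in range(1, levels + 1):
        # Detail band at level j: [fs / 2^(j+1), fs / 2^j]
        details.append((fs / 2 ** (j + 1), fs / 2 ** j))
    # The approximation band is what remains below the lowest detail band.
    approximation = (0.0, fs / 2 ** (levels + 1))
    return details, approximation

details, approximation = dwt_bands(256, 5)
for j, (low, high) in enumerate(details, start=1):
    print(f"D{j}: {low:g}-{high:g} Hz")
print(f"A5: {approximation[0]:g}-{approximation[1]:g} Hz")
```

These are idealized edges; real wavelet filters such as Daubechies-4 have finite roll-off, so adjacent subbands overlap somewhat in practice.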

Feature extraction

Feature extraction holds paramount importance in the analysis of epileptic EEG signals. Given the intricacy of the brain’s electrical activity, EEG signals frequently encompass multiple frequency components and time-domain features. Consequently, the analysis and processing of EEG signals necessitate the comprehensive utilization of various signal processing techniques and methods. This section elucidates two categories of features: time–frequency domain features and nonlinear features, both extracted within each subband of the EEG signal. Time–frequency domain analysis is capable of capturing the changes in signals in both time and frequency simultaneously, which is particularly important for processing non-stationary signals such as EEG signals. Epileptic seizures are often manifested as nonlinear and complex dynamic changes, and these changes can be better reflected through nonlinear characteristics.

The standard deviation (STD) effectively describes the amplitude variation of the EEG signal. It is a simple statistic that can be computed quickly from the EEG signal. The calculation of STD is shown in Eq. (2).

$$STD = \sqrt {\frac{{\sum\nolimits_{i = 1}^{N} {(x_{i} - \mu )^{2} } }}{N}} ,$$

(2)

where $x_{i}$ represents the $i^{\text{th}}$ EEG data sample in a signal segment, $\mu$ represents the mean of the segment, and $N$ represents the length of the segment.
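As a worked instance of Eq. (2), the sketch below computes the population standard deviation (division by $N$, not $N-1$) of a short segment. The function name `segment_std` and the sample values are illustrative only.

```python
import math

def segment_std(samples):
    """Population standard deviation of one segment, as in Eq. (2)."""
    n = len(samples)
    mu = sum(samples) / n
    return math.sqrt(sum((x - mu) ** 2 for x in samples) / n)

# Mean is 5, squared deviations sum to 32, 32 / 8 = 4, sqrt(4) = 2.
print(segment_std([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))  # 2.0
```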

The power spectral density (PSD) serves as a visualization tool for depicting the energy distribution of EEG signals across various frequencies, thereby revealing alterations in frequency components during seizures. Concurrently, band energy offers insight into the energy variations within specific frequency bands, extracting the energy distribution across distinct frequency ranges and consequently deriving frequency features. The amalgamation of PSD and frequency band energy features is a common practice, providing a more holistic comprehension of EEG signal characteristics in the frequency domain. The comprehensive analysis yields vital information crucial for the diagnosis and prediction of epilepsy.
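The source does not state which PSD estimator or band-energy definition was used. The sketch below therefore makes two assumptions: the PSD is estimated with a plain one-sided periodogram, and the band energy of a subband is taken as the sum of its squared coefficients. Both function names are hypothetical.

```python
import numpy as np

def periodogram_psd(x, fs):
    """One-sided periodogram PSD estimate (assumed estimator, not from the source)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    psd = np.abs(np.fft.rfft(x)) ** 2 / (fs * n)
    # Fold negative frequencies into the one-sided spectrum
    # (every bin except DC and, for even n, the Nyquist bin).
    if n % 2 == 0:
        psd[1:-1] *= 2.0
    else:
        psd[1:] *= 2.0
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, psd

def band_energy(coeffs):
    """Energy of one subband as the sum of squared coefficients (assumed definition)."""
    coeffs = np.asarray(coeffs, dtype=float)
    return float(np.sum(coeffs ** 2))

fs = 256
t = np.arange(2 * fs) / fs
freqs, psd = periodogram_psd(np.sin(2 * np.pi * 20 * t), fs)
peak_hz = freqs[np.argmax(psd)]   # frequency bin carrying the most power
```

For this 20 Hz test tone the PSD peaks at 20 Hz, and integrating the PSD over frequency recovers the mean power of the signal. Averaged estimators such as Welch's method are a common alternative when the variance of the raw periodogram is a concern.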

EEG signals exhibit a high degree of complexity and nonlinearity, and the incorporation of nonlinear features proves beneficial in capturing intricate nonlinear relationships within the data. These features aid in uncovering the deep nonlinear structures inherent in EEG signals. Notably, among the nonlinear features, entropy serves as a metric for quantifying uncertainty and the information content within the data. Fuzzy entropy, a variant of traditional entropy, stands out for its enhanced capacity to handle data uncertainty. It is particularly advantageous in the context of complex epilepsy datasets, where fuzzy entropy excels in precisely capturing relationships between data features and the associated information content. The capability contributes to a more nuanced understanding of the intricate characteristics embedded within the data⁵¹.
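The source does not give the FuzzyEn formula or its parameters. The sketch below follows one commonly used formulation (baseline-removed templates, Chebyshev distance, exponential membership function); the embedding dimension `m`, tolerance factor `r_factor`, and exponent `n` are assumed defaults, not values taken from this study.

```python
import numpy as np

def fuzzy_entropy(x, m=2, r_factor=0.2, n=2):
    """Fuzzy entropy of a 1-D signal (assumed formulation and parameters)."""
    x = np.asarray(x, dtype=float)
    r = r_factor * x.std()
    if r == 0.0:
        return 0.0  # a constant signal carries no irregularity

    def phi(dim):
        # Use the same number of templates for dim = m and dim = m + 1.
        count = x.size - m
        templates = np.array([x[i:i + dim] for i in range(count)])
        # Remove each template's own baseline (local mean).
        templates = templates - templates.mean(axis=1, keepdims=True)
        # Chebyshev distance between every pair of templates.
        dist = np.abs(templates[:, None, :] - templates[None, :, :]).max(axis=2)
        similarity = np.exp(-(dist ** n) / r)   # fuzzy membership degree
        np.fill_diagonal(similarity, 0.0)       # exclude self-matches
        return similarity.sum() / (count * (count - 1))

    return float(np.log(phi(m)) - np.log(phi(m + 1)))

rng = np.random.default_rng(0)
sine = np.sin(2 * np.pi * np.arange(200) / 25)   # regular signal
noise = rng.standard_normal(200)                 # irregular signal
fe_sine, fe_noise = fuzzy_entropy(sine), fuzzy_entropy(noise)
```

A regular signal such as the sine yields a lower value than white noise, which is the property that makes the measure useful for separating EEG states. The pairwise distance matrix grows quadratically with segment length, so long segments would need a chunked implementation.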

Classification

By leveraging diverse neural network structures, the features inherent in epileptic signals can be thoroughly explored and harnessed to enhance the accuracy and robustness of classification. In this study, we devised a CNN-GRU-AM model tailored for epilepsy detection and prediction, combining the respective strengths of CNN, GRU, and AM. Specifically, the CNN component extracts spatial features from the input data through two convolutional layers (each followed by ReLU activation) and a pooling layer. The first convolutional layer uses 32 filters, while the second convolutional layer uses 64 filters. The output of the pooling layer is weighted through an attention mechanism comprising two fully connected layers and a sigmoid activation layer. The attention-weighted result is then fed into a GRU layer with 10 hidden units, designed to capture the inherent temporal features in the input data and output the result of the last time step of the sequence. Subsequently, the output from the GRU layer is passed to a fully connected layer followed by a softmax layer, which produces the probability distribution over the categories used for classification. The proposed model makes joint use of the time–frequency domain and nonlinear features, enabling the capture of intricate patterns within the signals. The structured integration of CNN, GRU, and AM facilitates the accurate recognition of epileptic seizures. The architecture of the CNN-GRU-AM model is visually represented in Fig. 2, and the details of the network architecture and the hyper-parameter configuration are given in Tables 2 and 3, respectively.

Table 2 Model structure of CNN-GRU-AM.

Table 3 Hyper-parameter configuration.

Convolutional neural network (CNN)

The CNN is a deep learning model that gained widespread application in the early stages of image processing and computer vision. Renowned for its efficacy in extracting and classifying image features, CNN has also demonstrated utility in classifying the states of EEG signals. Unlike traditional machine learning algorithms, CNN eliminates the need for manually designing features. The fundamental components of a CNN include a convolutional layer, a pooling layer, and a fully connected layer. The convolutional layer primarily serves to extract features from input data, generating a feature map. The feature map is subsequently downsampled by the pooling layer to reduce feature dimensionality and computational complexity. Stacking multiple convolutional and pooling layers in a specific order facilitates the extraction of increasingly sophisticated features. The fully connected layer then integrates the learned features from the pooling layer into the sample labeling space through weighted fusion.

The architecture of the CNN network is visually represented in Fig. 3. In the first convolutional layer, a 3 × 1 convolutional kernel is employed to generate 32 convolutional feature maps. The subsequent pooling layer utilizes a pooling window of size 3 × 1 with a step size of 1. In the second convolutional layer, a 3 × 1 convolutional kernel is again utilized, producing 64 convolutional feature maps. The pooling configuration remains the same, with a window size of 3 × 1 and a step size of 1. Each convolutional layer in our model is followed by a batch normalization layer and a Rectified Linear Unit (ReLU) activation function. This combination enhances the model’s generalization capacity, mitigates overfitting, and accelerates training.
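To make the layer sizes concrete, the following is a minimal NumPy forward pass through the two convolution and pooling stages. Several details are assumptions: the input is a hypothetical length-24 single-channel feature vector, no padding is used, pooling is max pooling, batch normalization is omitted, and the weights are random rather than trained.

```python
import numpy as np

def conv1d_valid(x, kernels):
    """Cross-correlate x (length, in_ch) with kernels (k, in_ch, out_ch), no padding."""
    k = kernels.shape[0]
    windows = np.stack([x[i:i + k] for i in range(x.shape[0] - k + 1)])
    return np.einsum("lki,kio->lo", windows, kernels)

def relu(x):
    return np.maximum(x, 0.0)

def max_pool1d(x, window=3, stride=1):
    """Max pooling along the length axis (pooling type is an assumption)."""
    starts = range(0, x.shape[0] - window + 1, stride)
    return np.stack([x[s:s + window].max(axis=0) for s in starts])

rng = np.random.default_rng(0)
features = rng.standard_normal((24, 1))        # hypothetical input size
k1 = 0.1 * rng.standard_normal((3, 1, 32))     # 3 x 1 kernel, 32 feature maps
k2 = 0.1 * rng.standard_normal((3, 32, 64))    # 3 x 1 kernel, 64 feature maps

h = max_pool1d(relu(conv1d_valid(features, k1)))
h = max_pool1d(relu(conv1d_valid(h, k2)))
print(h.shape)  # (16, 64)
```

Under these assumptions each unpadded 3 × 1 convolution and each stride-1 pooling window shortens the sequence by two samples, so a length of 24 becomes 16 after both stages. With "same" padding the length would instead be preserved.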

Gated recurrent unit (GRU) network

The GRU structure is an improved variant of the recurrent neural network (RNN), specifically designed to address the vanishing and exploding gradient problems that arise in traditional RNNs as the number of network layers and iterations grows. A simplification of the Long Short-Term Memory (LSTM) architecture, the GRU streamlines the structure by reducing the number of gates. The GRU employs an update gate and a reset gate to decide whether to retain or discard hidden state information from the previous time step. Each gate uses a sigmoid function, whose output lies between 0 and 1, to set the extent of information retention. This selective updating and forgetting of information enables the GRU to efficiently capture long-term dependencies within the data. The update gate controls how much of the previous hidden state ${h}_{t-1}$ is carried forward. In Eq. (3), ${W}^{\left(z\right)}$ and ${U}^{\left(z\right)}$ are the weight matrices of the update gate.

$$z_{t} = \sigma (W^{(z)} x_{t} + U^{(z)} h_{t - 1} ).$$

(3)

The reset gate selectively forgets the previous information ${h}_{t-1}$. Equation (4) is calculated in the same way as Eq. (3), with ${W}^{\left(r\right)}$ and ${U}^{\left(r\right)}$ being the reset gate’s weight matrices.

$$r_{t} = \sigma (W^{(r)} x_{t} + U^{(r)} h_{t - 1} ).$$

(4)

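The reset gate output ${r}_{t}$ is multiplied element-wise with $U{h}_{t-1}$ to form the candidate hidden state, as shown in Eq. (5). The update gate then blends the previous hidden state with this candidate, as shown in Eq. (6).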

$$h_{t}^{\prime} = \tanh (Wx_{t} + r_{t} \cdot Uh_{t - 1} ),$$

(5)

$$h_{t} = z_{t} \cdot h_{t - 1} + (1 - z_{t} ) \cdot h_{t}^{\prime} .$$

(6)

The internal structure of the GRU is shown in Fig. 4, where ${x}_{t}$ and ${h}_{t}$ are the input vector and the hidden state at time $t$, respectively, and ${h}_{t}^{\prime}$ is the candidate hidden state. The update gate ${z}_{t}$ determines how to update the hidden state using the current EEG information, and the reset gate ${r}_{t}$ determines how much historical information needs to be forgotten. $\sigma \left(\cdot \right)$ and $\text{tanh}(\cdot )$ are the sigmoid function and hyperbolic tangent function, respectively.
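Equations (3)–(6) translate almost line by line into code. The sketch below implements a single GRU step and a loop that returns the last hidden state, as the model description requires. The hidden size of 10 comes from the text; the input size of 64, the sequence length, and the random weights are illustrative assumptions, and bias terms are omitted because the equations do not include them.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU update following Eqs. (3)-(6); p maps names to weight matrices."""
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev)          # Eq. (3), update gate
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev)          # Eq. (4), reset gate
    h_cand = np.tanh(p["W"] @ x_t + r_t * (p["U"] @ h_prev))   # Eq. (5), candidate state
    return z_t * h_prev + (1.0 - z_t) * h_cand                 # Eq. (6), new hidden state

def gru_last_state(sequence, p, hidden_size):
    """Run the GRU over a (time, features) sequence and return the final state."""
    h = np.zeros(hidden_size)
    for x_t in sequence:
        h = gru_step(x_t, h, p)
    return h

rng = np.random.default_rng(0)
hidden_size, input_size = 10, 64   # 10 hidden units from the text; 64 is assumed
params = {name: 0.1 * rng.standard_normal((hidden_size, input_size))
          for name in ("W_z", "W_r", "W")}
params.update({name: 0.1 * rng.standard_normal((hidden_size, hidden_size))
               for name in ("U_z", "U_r", "U")})
sequence = rng.standard_normal((16, input_size))
h_last = gru_last_state(sequence, params, hidden_size)
```

Note that under the convention of Eq. (6), $z_t$ close to 1 keeps the previous state and $z_t$ close to 0 replaces it with the candidate; some libraries use the opposite convention.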

Attention mechanism (AM)

The AM emulates the human attention allocation process and serves as a valuable tool in assisting deep learning models when handling extensive datasets. Designed to enhance efficiency in focusing on crucial information, the AM allows models to prioritize significant details during data processing. This prioritization is achieved through assigning varying weights to different segments of the input data. By incorporating the AM, the model gains flexibility in processing input data of diverse lengths and structures. Moreover, it enhances the model’s ability to discern correlations within the input data. The introduction of the AM into the epilepsy prediction model aims to consider the diverse impacts of different input features on prediction outcomes, ultimately contributing to improved prediction accuracy.

The schematic diagram of the AM is illustrated in Fig. 5. In the AM process, the model first preserves the outputs from the preceding network layer and then correlates them with the values of the output sequence. This approach enables the model to learn which input features require focused attention. Consequently, higher weights are assigned to input features exhibiting strong correlations. The weights are calculated as shown in Eqs. (7) and (8).

$$u_{t} = \tanh (w_{i} h_{t} + b),$$

(7)

$$a_{t} = \text{softmax}(u_{t}^{T} ,u_{w} ),$$

(8)

where ${w}_{i}$ is the weight matrix, $b$ is the bias, ${h}_{t}$ is the output vector of the hidden layer of the GRU, ${u}_{t}$ is the activation vector, and ${a}_{t}$ is the attention weight. The final attention output vector ${A}_{n}$ is obtained from Eq. (9).

$$A_{n} = \sum\limits_{t = 1}^{n} {a_{t} u_{t} } .$$

(9)
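Equations (7)–(9) can be sketched as follows. Two points are assumptions, since the source leaves them open: the softmax argument is read as the dot product $u_{t}^{T}u_{w}$, and $u_{w}$ is treated as a trainable context vector. The function name `attention_pool` and all shapes are illustrative.

```python
import numpy as np

def attention_pool(H, W, b, u_w):
    """Attention over the rows h_t of H, following Eqs. (7)-(9).

    H: (n, d) sequence of vectors, W: (k, d), b: (k,), u_w: (k,) context vector.
    """
    U = np.tanh(H @ W.T + b)                 # Eq. (7): one activation vector u_t per row
    scores = U @ u_w                         # u_t^T u_w (assumed reading of Eq. (8))
    scores = scores - scores.max()           # stabilise the exponentials
    a = np.exp(scores) / np.exp(scores).sum()  # Eq. (8): softmax over time steps
    A = a @ U                                # Eq. (9): weighted sum of the u_t
    return A, a

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 10))
W = 0.1 * rng.standard_normal((8, 10))
b = np.zeros(8)
u_w = rng.standard_normal(8)
A, a = attention_pool(H, W, b, u_w)
```

The weights `a` are non-negative and sum to one, so `A` is a convex combination of the activation vectors. When all rows of `H` are identical the weights are uniform and `A` equals the common activation vector.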

Tenfold cross validation (CV)

Tenfold CV is a robust performance evaluation technique in which every sample is used for both training and testing. The principle of this method is illustrated in Fig. 6. Compared with the holdout method, tenfold CV makes fuller use of the data and reduces errors arising from an uneven data distribution in a single split.

In the tenfold CV process, the data is randomly partitioned into 10 equal-sized subsets. The candidate model undergoes training using nine of these subsets and is then tested on the remaining subset. Predictions from the test subset are recorded in vectors, and this procedure is repeated ten times, with a different subset serving as the test data in each iteration. Following these repetitions, the model’s predictions for the entire dataset are consolidated in vectors. These vectors are combined, and specific metrics, chosen based on the problem’s nature, are employed to evaluate the performance of the candidate model. This segmentation method contributes to more reliable results, particularly in the context of classification networks.
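The partitioning procedure can be sketched with the standard library alone. The generator name `tenfold_indices` and the fixed seed are illustrative choices; when the sample count is not a multiple of ten, the folds differ in size by at most one sample.

```python
import random

def tenfold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train, test) index lists, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random partition
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

# Collect the held-out indices across folds: every sample is tested exactly once.
tested = []
for train, test in tenfold_indices(95):
    tested.extend(test)
print(sorted(tested) == list(range(95)))  # True
```

Predictions made on each `test` list can be concatenated in the same way before computing the evaluation metrics on the full dataset. For imbalanced seizure data, a stratified split that preserves class proportions in each fold is a common refinement, though the source does not say whether one was used.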

Evaluation indicators

To assess the performance of the proposed seizure detection and prediction method, commonly used evaluation metrics in classification are utilized to measure the effectiveness of the approach from various perspectives. These criteria include sensitivity, specificity, accuracy, Receiver Operating Characteristic (ROC) curve, and Area Under the Curve (AUC). The expressions for sensitivity, specificity, and accuracy are provided in Eqs. (10)–(12). The AUC represents the area under the ROC curve.

$$Sensitivity = \frac{TP}{{TP + FN}},$$

(10)

$$Specificity = \frac{TN}{{TN + FP}},$$

(11)

$$Accuracy = \frac{TP + TN}{{TP + FN + FP + TN}},$$

(12)

where TP and TN are the numbers of samples correctly predicted as positive and negative, respectively. FP is the number of negative samples incorrectly predicted as positive, and FN is the number of positive samples incorrectly predicted as negative. The AUC value ranges between 0 and 1. An AUC of 1 signifies perfect separation between positive and negative samples by the classifier. Conversely, an AUC of 0.5 suggests that the classifier’s performance is equivalent to random guessing.
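A minimal sketch of Eqs. (10)–(12) and of the AUC is given below. The AUC is computed from its rank interpretation, that is, the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function names and the example counts are illustrative and are not results from this study.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, and accuracy as in Eqs. (10)-(12)."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }

def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(confusion_metrics(tp=8, tn=9, fp=1, fn=2))
# {'sensitivity': 0.8, 'specificity': 0.9, 'accuracy': 0.85}
print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a sketch; a sort-based computation is preferable for large test sets.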
