ZleepAnlystNet: a novel deep learning model for automatic sleep stage scoring based on single-channel raw EEG data using separating training



This study aimed to develop a deep learning model for automated sleep stage scoring based on a single-channel raw EEG signal. The approach separated the training process into two main phases: first, the extraction of representative features, and second, the sequential learning of sleep stage classification. Additionally, interrater reliability was investigated across three distinct scenarios: scoring of the dataset by experienced sleep technicians, scoring predicted by the developed model, and scoring carried out by three other sleep technicians. The developed model, named ZleepAnlystNet, consisted of three sets of CNN models, each comprising 5 class-specific CNN models, for representative feature extraction, and one BiLSTM model for sequence classification. The input of each CNN model was the EEG signal, while the input of the BiLSTM model was the output of all trained CNN models, comprising 75 features per epoch. Model performance was then evaluated and compared with existing models using the same dataset.

Dataset

The Sleep-EDF dataset71,72, an open-access public dataset, was utilized in the model training and evaluation processes. This dataset has been provided in 2 versions: Sleep-EDF-13 and Sleep-EDF-18. The former, released in 2013, is composed of 61 PSG recordings, while the latter, released in 2018, is composed of 197 PSG recordings. It is divided into 2 datasets separately recorded from 2 studies: sleep cassette (SC) and sleep telemetry (ST). The SC dataset was a study of age effects on sleep in healthy participants, while the ST dataset was a study of temazepam effects on sleep. Therefore, the SC dataset was selected to train and evaluate the performance of the developed model, ZleepAnlystNet. The number of files in the SC dataset depends on the version: Sleep-EDF-13 comprises 39 PSG recordings of 20 participants, while Sleep-EDF-18 has 153 PSG recordings of 78 participants. The PSG recordings of the SC dataset are composed of two bipolar EEG channels, namely Fpz-Cz and Pz-Oz with a sampling rate of 100 Hz, along with one horizontal EOG channel with a sampling rate of 100 Hz and one preprocessed submental EMG channel with a sampling rate of 1 Hz. The recorded PSG files were scored according to the R&K classification10 by well-trained sleep technicians into wake, REM, stage 1, stage 2, stage 3, and stage 4, abbreviated W, R, 1, 2, 3, and 4, respectively. There are two additional labels, M and ‘?’, denoting movement and unscored epochs. However, the AASM classification11 was implemented in this study, so stages 3 and 4 were merged into NREM 3 (N3). M and ‘?’ epochs were removed prior to epoch-wise shuffling when the feature extraction models were trained, and were changed to W when the sequence classification model was trained, due to the requirement of BiLSTM sequence learning. Only the five sleep stages W, N1, N2, N3, and REM were included in the model training and evaluation processes.
The total number of epochs in each sleep stage of both Sleep-EDF-13 and Sleep-EDF-18 is shown in Table 1. In this study, all records used in model training and evaluation were trimmed to 30 min before and after the in-bed duration, because there were long periods of wake (W) at the start and end of each record in the case of 24-h recordings.

ZleepAnlystNet was designed to carry out sleep stage classification using a single-channel raw EEG signal as input. Therefore, the Fpz-Cz and Pz-Oz channels were separately passed through the model, yielding separate models for training and evaluation.

ZleepAnlystNet

The developed model, named ZleepAnlystNet, consists of 2 main parts: a feature extraction part, organized as a set of CNN models that pick up the representative features of each sleep stage, and a classification part, occupied by a BiLSTM model that generates the sleep stage of the targeted epoch.

CNN models for feature extraction

For representative feature extraction, three sets of CNN models, named P-models, C-models, and N-models, were established to extract time-invariant features from the input data, which was the single-channel EEG signal of the Sleep-EDF dataset. P-models refers to the set of ‘Previous’ models, while C- and N-models refer to the sets of ‘Current’ and ‘Next’ models, respectively. Each set contains five class-specific models, named the W, N1, N2, N3, and REM models (Fig. 1, middle column). In total, 15 distinct CNN models are involved, which share the same architecture as shown in Fig. 1 (right column) but use differently calculated loss function weights to relieve the class imbalance problem. Normally, N2 is the dominant stage during sleep, with relatively few N1 epochs. Therefore, the loss function weights of each sleep stage were calculated dissimilarly. The designed architecture consisted of two sets of layers, represented by CNN1 and CNN2. Each set of layers included a convolution layer, a rectified linear unit (ReLU) activation function layer, a batch normalization layer, and a max-pooling layer. The input data was a 1 × 3000 image, i.e., a 30-s epoch of the single-channel raw EEG, which was passed through the CNN models.

The first set of layers, CNN1, was designed as follows: the input was first fed to a convolution layer composed of 10 filters with a filter size of 1 × 55, after which a ReLU activation function layer was applied to the convolved data. The output then underwent batch normalization to maintain stable training. Lastly, a max-pooling layer with a pool size of 1 × 16 was applied to reduce the spatial dimension. The output was then forwarded to the second set of layers, CNN2, which was established in the same fashion as CNN1 but differed in the number of filters and the filter size of the convolution layer, which were 5 and 1 × 25, respectively. After the output had passed the second max-pooling layer, it was flattened and subsequently passed through two fully connected (dense) layers of 10 and 5 neural nodes, respectively, separated by a ReLU activation function layer. Finally, a softmax layer was applied to generate the probability of each sleep stage as a feature vector of 5 features, which was prepared for sequence classification in the BiLSTM model.
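As a sanity check on the layer dimensions above, the following Python sketch traces the signal length through CNN1 and CNN2. The paper does not state the convolution padding or the pooling stride, so ‘valid’ convolutions and non-overlapping pooling are assumed here; the resulting flattened size is illustrative, not confirmed by the source.

```python
def conv_out(length, kernel):
    # 'valid' convolution with stride 1 (assumption: no padding stated)
    return length - kernel + 1

def pool_out(length, pool):
    # non-overlapping max pooling (assumption: stride equals pool size)
    return length // pool

length = 3000                    # one 30-s epoch at 100 Hz
length = conv_out(length, 55)    # CNN1 convolution, 10 filters of 1 x 55
length = pool_out(length, 16)    # CNN1 max-pooling, pool size 1 x 16
length = conv_out(length, 25)    # CNN2 convolution, 5 filters of 1 x 25
length = pool_out(length, 16)    # CNN2 max-pooling, pool size 1 x 16
flattened = length * 5           # flatten the 5 feature maps of CNN2
```

Under these assumptions, a 1 × 3000 epoch shrinks to 10 samples per feature map, giving a 50-element flattened vector entering the dense layers.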

All 15 CNN models had the same architecture but were trained with different data manipulations and different loss function weights for each sleep stage class. Three data manipulation sets, called set A, set B, and set C, were used (Fig. 2). Set A was referred to as the ‘Current’ set: it combined the signal from the current epoch with the target, which was the sleep stage score of the current epoch. Set B was labeled the ‘Previous’ data because it combined the signal from the previous epoch with the target, which remained the sleep stage score of the current epoch. Lastly, set C was called the ‘Next’ set because the input was the signal from the subsequent epoch while the target was still the sleep stage score of the current epoch. The three data manipulation sets A, B, and C were generated prior to epoch-wise shuffling into training and test datasets and were utilized to train the corresponding C-, P-, and N-models.
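The three manipulations can be sketched as a single pairing function. `make_pairs` is a hypothetical helper name; the duplication of boundary epochs mirrors the shifting described later for evaluation, since the source does not specify how sequence boundaries are handled during training.

```python
def make_pairs(signals, labels, mode):
    """Pair each epoch's label with the current, previous, or next signal."""
    pairs = []
    last = len(signals) - 1
    for i, label in enumerate(labels):
        if mode == "current":        # set A: signal from the current epoch
            src = i
        elif mode == "previous":     # set B: signal from the previous epoch
            src = max(i - 1, 0)      # first epoch duplicated (assumption)
        elif mode == "next":         # set C: signal from the next epoch
            src = min(i + 1, last)   # last epoch duplicated (assumption)
        else:
            raise ValueError(mode)
        pairs.append((signals[src], label))
    return pairs
```

For example, with epochs `['e0', 'e1', 'e2']` scored `['W', 'N1', 'N2']`, the ‘previous’ mode pairs `e0` with both the W and N1 targets, shifting every signal one epoch back while the labels stay in place.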

For each data manipulation, five class-specific models were trained, each with specific weights assigned to particular sleep stages. These can be designated \({Net}_{c}\) where \(c\in \left\{W, N1,N2,N3,REM\right\}\), i.e. \({Net}_{W},{Net}_{N1},{Net}_{N2},{Net}_{N3},{Net}_{REM}\). \({Net}_{c}\) represents the class-specific model for class \(c\), which was specifically trained for class \(c\) by adjusting the weight of the loss function particular to class \(c\). The weight adjustment was conducted to emphasize the importance of the particular sleep stage relative to the others. The loss function used in model training was a cross-entropy loss modified as in Eq. (1):

$$loss= -\frac{1}{N}\sum_{n=1}^{N}\sum_{i=1}^{K}{w}_{i}{Y}_{ni}ln{\widehat{Y}}_{ni}$$

(1)

where \(N\) is the total number of epochs; \({Y}_{ni}\) is the probability that epoch number \(n\) has sleep stage class \(i\), whose value is either 0 or 1 according to the sleep stage score of the input; \({\widehat{Y}}_{ni}\) is the probability that the model predicted class \(i\) for epoch number \(n\), generated by the softmax layer during model training, whose value ranges from 0 to 1; \(K\) is 5 due to the 5-class classification; and \({w}_{i}\) is the weight used in the loss function calculation for class \(i\). The loss function weight for a specific class \(i\), \({w}_{i}\), for \({Net}_{c}\) can be calculated as in Eqs. (2) and (3).

$${p}_{i}=\left\{\begin{array}{c}\frac{{N}_{c}}{N} ;i=c\\ \frac{N-{N}_{c}}{N} ;i\ne c\end{array}\right.$$

(2)

$${w}_{i}=\frac{0.5}{{p}_{i}}$$

(3)

where \({p}_{i}\) is the proportion for class \(i\), with \(i\in \left\{W,N1,N2,N3,REM\right\}\), \(N\) is the total number of epochs, and \({N}_{c}\) is the total number of epochs of the particular sleep class \(c\).
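Eqs. (1)-(3) can be sketched in Python as follows. `class_weights` and `weighted_cross_entropy` are illustrative names, with stage counts supplied as a plain dictionary; the loss keeps only the true-class term of Eq. (1), since \({Y}_{ni}\) is one-hot.

```python
import math

def class_weights(counts, c):
    """Eqs. (2)-(3): counts maps stage -> epoch count; c is Net_c's class."""
    n = sum(counts.values())
    weights = {}
    for i in counts:
        # p_i = N_c / N for the target class, (N - N_c) / N otherwise
        p = counts[c] / n if i == c else (n - counts[c]) / n
        weights[i] = 0.5 / p
    return weights

def weighted_cross_entropy(y_true, y_prob, weights):
    """Eq. (1): y_true is a list of stage labels; y_prob a list of dicts
    mapping stage -> predicted probability. Only the true-class term
    survives because Y_ni is one-hot."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        total += weights[label] * math.log(probs[label])
    return -total / len(y_true)
```

For instance, if N2 holds 60% of epochs, \({Net}_{N2}\) assigns N2 a weight of 0.5/0.6 and every other class 0.5/0.4 = 1.25, so errors on the minority classes are penalized more heavily.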

BiLSTM model for sequence classification

After the feature extraction process, which involved feeding the EEG signal into the 15 CNN models, a sequence of outputs was generated. Each epoch of the EEG signal yielded 75 features produced by these CNN models; 5 of these features were derived from the softmax layer of each individual CNN model (Fig. 1, right column, bottom part). The sequence of 75 features per epoch from the entire EEG signal was utilized to train the BiLSTM model.
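The assembly of the 75-feature input can be sketched as follows. `build_bilstm_input` and the model-name keys are hypothetical, and the fixed ordering of the 15 models is an assumption, since the source does not specify the concatenation order.

```python
def build_bilstm_input(model_outputs, model_order):
    """Concatenate each model's 5-value softmax vector per epoch.

    model_outputs: dict mapping model name -> list of per-epoch vectors.
    model_order: fixed ordering of model names (e.g. P-W, P-N1, ..., N-REM).
    """
    n_epochs = len(next(iter(model_outputs.values())))
    sequence = []
    for t in range(n_epochs):
        features = []
        for name in model_order:
            features.extend(model_outputs[name][t])
        sequence.append(features)  # 5 features x len(model_order) per epoch
    return sequence
```

With all 15 CNN models in `model_order`, each epoch's vector has 5 × 15 = 75 entries, matching the BiLSTM input described above.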

The sequence classification model was designed to consist of a BiLSTM layer with 128 hidden units to capture the temporal dependencies of the input data. Following the BiLSTM layer, a fully connected layer with 5 neurons mapped the BiLSTM output to a lower-dimensional representation corresponding to the five sleep stages: W, N1, N2, N3, and REM. Finally, a softmax layer was placed to calculate the probability of each sleep stage, where the highest probability determined the predicted sleep stage. The BiLSTM architecture is shown in Fig. 1 (left column).

Training process

In this study, a separating training approach was introduced to train the proposed model. Consequently, a total of 15 CNN models and a BiLSTM model were trained separately. Each CNN model received a single-channel raw EEG signal as input, sampled at 100 Hz (Fig. 2B). The input for the BiLSTM model comprised the outputs of the 15 CNN models, i.e., 75 features per epoch. These features were organized in alignment with the original sequence of epochs; for example, the first set of 75 features represented the output of the first EEG signal epoch processed by the 15 CNN models, and this pattern continued for subsequent epochs (Fig. 2C). However, certain epochs in the dataset were excluded during training of the CNN models due to their original scoring. These excluded epochs were reintroduced into the sequence during training of the BiLSTM model, with appropriate modification of their labels. Nonetheless, these epochs were excluded during model evaluation.
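The epoch handling described above can be sketched as a relabeling step with an evaluation mask. `prepare_sequence_labels` is a hypothetical helper; it relabels M and ‘?’ epochs as W to keep the sequence intact for the BiLSTM while marking them for exclusion at evaluation time.

```python
def prepare_sequence_labels(labels):
    """Relabel M and '?' epochs as W and build an evaluation mask."""
    relabeled, keep = [], []
    for stage in labels:
        if stage in ('M', '?'):
            relabeled.append('W')    # completes the sequence for the BiLSTM
            keep.append(False)       # dropped again during evaluation
        else:
            relabeled.append(stage)
            keep.append(True)
    return relabeled, keep
```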

Initially, each CNN model was trained using MATLAB R2022a on an Intel Core i9-10900 CPU @ 2.80 GHz with an Nvidia GeForce RTX 2070 Super GPU. A batch size of 64 PSG epochs was used for training. The validation-stop criterion discontinued training when 10 consecutive validations showed no improvement; validation occurred every 150 iterations, constituting one round. The maximum number of training epochs was capped at 1000. The Adam optimizer was employed with an initial learning rate of 0.001. The training data were epoch-wise shuffled and divided into a training set (70%) and a validation set (30%) of all included epochs. Both M- and ‘?’-scored epochs were excluded during this training process.

Finally, the BiLSTM model was trained to recognize the sleep stage sequence and assign a score to each epoch. The input of the BiLSTM model was the output obtained by feeding the EEG signal, in its original sequence, into the 15 CNN models with their respective data manipulations. The training process was implemented in MATLAB R2022a on the Intel Core i9-10900 CPU @ 2.80 GHz with the Nvidia GeForce RTX 2070 Super GPU. The batch size was 4 sequences, with each sequence representing one record. The validation-stop criterion was triggered when 10 consecutive validations failed to show improvement; validation occurred every 10 iterations, constituting one round. The maximum number of training epochs was set at 1000. The Adam optimizer was applied with an initial learning rate of 0.01. The training data were divided into 90% for the training set and 10% for the validation set. Both M and ‘?’ epochs were changed to W to complete the sequence of the recording but were excluded during the model evaluation process.

Model evaluation

To evaluate model performance, the 15 CNN models and the BiLSTM model were interconnected as shown in Fig. 1 (left column). The input consisted of a single-channel raw EEG signal. For the set of ‘Previous’ models, the signal underwent a rightward shift, duplicating the first epoch and deleting the last epoch (Fig. 2A, set B). The original, unchanged signal was input to the set of ‘Current’ models. Similarly, for the set of ‘Next’ models, the signal underwent a leftward shift, duplicating the last epoch and deleting the first epoch (Fig. 2A, set C). The model’s output was a sequence of sleep stages corresponding to the original sequence of the EEG signal; this classification scheme is sequence-to-sequence, or many-to-many, classification.
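The two shifting operations can be sketched directly from the description above; `shift_right` and `shift_left` are hypothetical helper names operating on a list of epochs.

```python
def shift_right(epochs):
    """Input for the 'Previous' models: duplicate the first epoch, drop the last."""
    return [epochs[0]] + epochs[:-1]

def shift_left(epochs):
    """Input for the 'Next' models: duplicate the last epoch, drop the first."""
    return epochs[1:] + [epochs[-1]]
```

The ‘Current’ models simply receive the unshifted list, so all three sets see sequences of identical length aligned to the same target epochs.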

Model performance was assessed by subject-wise cross-validation, specifically the leave-one-subject-out strategy, which ensured that the training and test data were independent of each other. For Sleep-EDF-13, 20-fold cross-validation was implemented, with 19 subjects in the training set and 1 subject in the test set. This process was repeated for 20 rounds, with each subject serving as the test set once. For Sleep-EDF-18, 10-fold cross-validation was applied. During training of the BiLSTM, both M- and ‘?’-scored epochs were incorporated. However, during the evaluation of model performance, these epochs were excluded, and only the original scores of the five sleep stages were considered.
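A minimal sketch of the leave-one-subject-out splitting, where the hypothetical `subject_ids` list holds one subject label per recording, so all recordings of the held-out subject land in the test fold together:

```python
def loso_folds(subject_ids):
    """Yield (train_indices, test_indices) with one subject held out per fold."""
    subjects = sorted(set(subject_ids))
    for held_out in subjects:
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield train, test
```

With 20 subjects this yields exactly the 20 folds described above; grouping by subject rather than by recording is what keeps the two nights of one participant from leaking between training and test sets.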

For a comprehensive evaluation of the developed model’s feasibility, both per-class and overall parameters were utilized. The per-class metric was the F1-score, the harmonic mean of precision and recall. Precision, also called positive predictive value, was the ratio of true positives to all elements predicted as positive. Recall, also referred to as true positive rate or sensitivity, was the ratio of true positives to all relevant elements, i.e., the model’s ability to correctly identify true positives. The overall metrics were accuracy (ACC), which quantified the proportion of correct predictions out of the total number of predictions; macro F1-score (MF1), the average of the per-class F1-scores; Cohen’s kappa coefficient (κ), the consistency between the model prediction and the sleep stages scored in the dataset; average precision, the average of the per-class precision; and average recall, the average of the per-class recall.
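The listed metrics can be sketched from first principles as follows; `evaluate` is a hypothetical helper operating on predicted and reference stage sequences, with Cohen’s kappa computed from observed agreement versus the chance agreement implied by the two marginal distributions.

```python
def evaluate(y_true, y_pred, classes):
    """Return accuracy, macro F1, Cohen's kappa, and per-class F1-scores."""
    n = len(y_true)
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / n
    f1s = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s[c] = (2 * precision * recall / (precision + recall)
                  if precision + recall else 0.0)
    mf1 = sum(f1s.values()) / len(classes)
    # chance agreement: product of the class proportions in each rater
    pe = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in classes)
    kappa = (acc - pe) / (1 - pe) if pe < 1 else 1.0
    return acc, mf1, kappa, f1s
```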

Model analysis

The developed model was designed with a separating training approach and three data manipulations, set A, set B, and set C (Fig. 2A), to provide time-invariant features for the classification task. To investigate the effectiveness of the training approach and the effect of data manipulation on feature extraction, two separate scenarios were used as mock-ups: scenario A, comparing separating training with end-to-end training, and scenario B, comparing several combinations of data manipulation sets.

Efficacy of separating training and end-to-end training

In scenario A, the benefits of the two training approaches were compared. In the case of separating training, a single CNN model, whose architecture was the same as that shown in Fig. 1 (right column), was trained separately from the BiLSTM. Subsequently, the BiLSTM model with the same architecture as in Fig. 1 (left column) was trained. Only data manipulation set A was utilized in this scenario. The hyperparameters of both the CNN and BiLSTM models used during training were the same as reported in the ‘Training process’ sub-section; however, the sleep-stage-weighted loss function was not applied in this case. Therefore, five features in total were used as the representative features to train the BiLSTM model. In the case of end-to-end training, the CNN and BiLSTM models were interconnected, maintaining the same architectures as in separating training. The hyperparameters and training process were the same as the CNN training explained in the ‘Training process’ sub-section. The Fpz-Cz channel of Sleep-EDF-13 was the input of this mock-up simulation.

Efficacy of sets of data manipulation in CNN training

In scenario B, the effectiveness of the data manipulations used for training the CNN models was assessed. To recap, the developed model consisted of 15 distinct CNN models allocated by 2 factors. The first factor was class specificity: the models were trained by adjusting the weighted loss function and were divided into 5 sets. The second factor was data manipulation, divided into 3 sets (Fig. 2A). Our proposed model thus consisted of 15 CNN models connected to a BiLSTM model, where the 15 CNN models formed three sets of data-manipulated models called P-models, C-models, and N-models, abbreviations for the sets of ‘Previous’, ‘Current’, and ‘Next’ models, respectively. Each set of data-manipulated models contained five class-specific models trained with the loss function weights adjusted to the corresponding sleep stage classes. Each data manipulation set was used to train its respective set of data-manipulated models. Details are provided in the ‘CNN models for feature extraction’ sub-section.

Testing conditions were established based on seven combinations: using only set A, set B, or set C, and using the combinations set A + set B, set A + set C, set B + set C, and set A + set B + set C. Each condition used the particular sets of data to train the corresponding models. For example, in the condition set A + set B, data set A was used to train the set of ‘Current’ models, and data set B was used to train the set of ‘Previous’ models. All three sets of ‘Previous’, ‘Current’, and ‘Next’ models were composed of five class-specific models with weighted loss functions. All conditions employed the separating training approach; the training process is explained in the ‘Training process’ sub-section. Performance was assessed using the Fpz-Cz channel of Sleep-EDF-13.

Manual scoring by sleep technician

All PSG files in Sleep-EDF-13 were separately scored by three well-trained sleep technicians. A total of 39 files in the SC study were scored once again by each sleep technician, blinded to the sleep stage scoring in the dataset, according to the AASM scoring manual11.

Cross-dataset validation

Another dataset, the Sleep Heart Health Study (SHHS), was utilized for cross-dataset validation. The SHHS was a multi-center cohort study implemented by the National Heart, Lung, and Blood Institute to determine the cardiovascular and other consequences of sleep-disordered breathing73,74. This dataset is composed of 2 subsets, SHHS1 and SHHS2, consisting of 6,441 and 3,295 recordings, respectively. The former, SHHS1, was collected between 1995 and 1998; the latter, SHHS2, between 2001 and 2003. The PSG recordings are composed of 2 EEG channels, C3-A2 and C4-A1, with a sampling rate of 125 Hz; right and left EOGs at a sampling rate of 50 Hz; a submental EMG sampled at 125 Hz; and other standard PSG signals. The scoring of this dataset was conducted under the R&K standard10; therefore, the same merging of stages 3 and 4 was applied to obtain 5 sleep stages for the classification task. The dataset used in the cross-dataset validation process comprised 200 recordings randomly sampled evenly from SHHS1 and SHHS2, 100 recordings from each.

All models trained on the Fpz-Cz and Pz-Oz channels of both the Sleep-EDF-13 and Sleep-EDF-18 datasets were included. The Sleep-EDF-13 dataset contributed 40 models, 20 for each channel, while the Sleep-EDF-18 dataset contributed 20 models, 10 for each channel. These models were employed for sleep stage classification of the 200 randomized records from the SHHS dataset to simulate an unseen-data situation. Model evaluation used both C3-A2 and C4-A1 as input channels. Overall and per-class evaluation parameters were then used to investigate model performance. Following evaluation across all folds, the fold yielding the highest performance in each condition was selected, resulting in four models: two from the Sleep-EDF-13 dataset and two from the Sleep-EDF-18 dataset. Subsequently, all 6,441 records of the SHHS1 dataset were fed into these four models to evaluate performance. This cross-dataset validation aimed to investigate the feasibility of generalizing the separating training approach, which is the focus of this study; therefore, a fully systematic evaluation was not included.

Moreover, the developed model underwent additional training using the 200 randomized records from the SHHS dataset, with both C3-A2 and C4-A1 employed as input channels for both model training and evaluation. This process resulted in two new models. Subsequently, five-fold cross-validation was employed to evaluate the performance of these newly trained models. The total number of epochs in all 200 recordings from the SHHS dataset is shown in Supplementary Fig. S1C.

Statistical analysis

Descriptive statistics were used to express the model performance described in the ‘Model evaluation’ and ‘Cross-dataset validation’ sub-sections, which is comparable to existing sleep stage scoring models in the literature. Cohen’s kappa statistic was used to interpret the interrater reliability between the scoring in the dataset and the model prediction. Fleiss’ kappa statistic was used to interpret the interrater reliability in conditions with more than 2 raters: (a) among the 3 sleep technicians, (b) between the scoring in the dataset and the 3 sleep technicians, and (c) between the model prediction and the 3 sleep technicians. Kappa coefficients were calculated with 95% confidence intervals. The degree of agreement was considered None (κ = 0 − 0.20), Minimal (κ = 0.21 − 0.39), Weak (κ = 0.40 − 0.59), Moderate (κ = 0.60 − 0.79), Strong (κ = 0.80 − 0.90), and Almost Perfect (κ above 0.90) according to McHugh67. A two-tailed z-test was conducted to evaluate the coincidence of reliability. A p-value of less than 0.05 was considered significant.
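McHugh’s bands can be expressed as a simple lookup; `agreement_level` is an illustrative helper, with boundary handling following the ranges quoted above.

```python
def agreement_level(kappa):
    """Map a kappa coefficient to McHugh's agreement category."""
    if kappa <= 0.20:
        return "None"
    if kappa <= 0.39:
        return "Minimal"
    if kappa <= 0.59:
        return "Weak"
    if kappa <= 0.79:
        return "Moderate"
    if kappa <= 0.90:
        return "Strong"
    return "Almost Perfect"
```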


