Investigating the Impact of the Stationarity Hypothesis on Heart Failure Detection using Deep Convolutional Scattering Networks and Machine Learning



Bruna and Mallat44 proposed a new feature extraction method, the Wavelet Scattering Transform (WST). This technique uses complex wavelets to strike a good trade-off between the ability to detect the different patterns in a signal and the ability to produce stable time-frequency features. This makes the WST particularly suitable for signal analysis, because it captures important details while remaining robust to noise and small changes in the data.

The WST stands out as one of the most effective techniques for extracting features from non-stationary signals in both the time and frequency domains. Its benefits include invariance to time shifts and rotations, robustness to noise and distortions, and dimensionality reduction, all of which are critical for modern signal processing. Thanks to these characteristics, the WST has proved to be a very effective tool for signal processing.

The Wavelet Scattering Network (WSN) is an equivalent deep convolutional network formed by a cascade of wavelet convolutions, non-linearities, and low-pass filtering. This structure enables the derivation of low-variance features with minimal configuration from real-valued time series and images for use in ML and deep learning applications. The challenge with deep CNNs is that they often behave like black boxes: despite their success in classification tasks, the reasons behind their effectiveness are still not entirely clear. The scientific community therefore designed a white-box counterpart of the deep CNN, in which it is possible to interpret exactly what happens inside. The architectures of both the deep CNN and the WSN are illustrated in Fig. 5, enabling a comparison between them.

Fig. 5 Feature extraction comparison between WSN and CNN.

When preparing to use the WSN, the first decision is which wavelet to use. As we are dealing with ECG signals, the wavelet should be selected specifically for this purpose. After evaluating several options, such as the Mexican hat, Morlet, and Haar wavelets, we chose the complex Gabor wavelet. This decision is based on the close resemblance of the real and imaginary parts of the Gabor wavelet to the QRS complex found in ECG signals, which makes the Gabor wavelet especially sensitive to the structure of the ECG signal and therefore able to extract more detailed and relevant information from it. Both the real and imaginary parts of the Gabor wavelet are depicted in Fig. 6.

Fig. 6 Gabor wavelet filters with coarsest scale (lowest frequency).

The following expression presents the mathematical representation of the complex wavelet employed.

$$\psi \left( t \right) = \frac{1}{\sqrt{2\pi \sigma^{2}}}\,e^{\frac{-t^{2}}{2\sigma^{2}}}\,e^{i\omega t}$$

(1)

In the given expression, \(t\) represents the time, and \(\sigma\) represents the standard deviation of the Gaussian function. \(\omega\) is defined as \(2\pi f\), where \(f\) represents the center frequency of \(\psi\), and \(i\) represents the imaginary unit. The envelope of the complex wavelet is characterized as a low-pass filter denoted as \(\Phi\).

$$\Phi \left(t\right)=\left|\psi \left(t\right)\right|=\frac{1}{\sqrt{2\pi \sigma^{2}}}\,e^{\frac{-t^{2}}{2\sigma^{2}}}$$

(2)
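For illustration, a minimal numpy sketch of Eqs. (1) and (2) is given below. The value σ = 0.1 s and the 5 Hz center frequency are placeholder choices for visualization only, not the parameters of the WSN filter banks described later.

import numpy as np

def gabor_wavelet(t, sigma, f):
    """Complex Gabor wavelet of Eq. (1): Gaussian envelope modulated by e^{i*omega*t}."""
    omega = 2 * np.pi * f
    gauss = np.exp(-t**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    return gauss * np.exp(1j * omega * t)

def lowpass_envelope(t, sigma):
    """Low-pass filter of Eq. (2): the modulus (Gaussian envelope) of the Gabor wavelet."""
    return np.abs(gabor_wavelet(t, sigma, f=0.0))

# Example: sample both on a 1-second grid at fs = 128 Hz (placeholder sigma and f).
fs = 128
t = np.arange(-0.5, 0.5, 1 / fs)
psi = gabor_wavelet(t, sigma=0.1, f=5.0)   # real/imaginary parts resemble Fig. 6
phi = lowpass_envelope(t, sigma=0.1)       # Gaussian low-pass kernel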

In our study, the signal \(x\left(t\right)\) is an ECG segment of 2048 samples acquired at a sampling frequency of 128 Hz. As mentioned earlier, the first step of the WSN convolves this signal with the low-pass filter wavelet \(\Phi\); the bandwidth of this filter determines the outcome of the step. The result of the zeroth-order wavelet scattering network is a vector of 4 time windows. The low-pass filter acts as a moving-average filter, smoothing out high-frequency components. After low-pass filtering, a critical down-sampling restricts the bandwidth below the cut-off frequency \({f}_{c}\); the down-sampling factor \(D\) and the number of time windows can be calculated as follows:

$$D = \left\lfloor {\frac{{f_{s} }}{{2f_{c} }}} \right\rfloor$$

(3)

$$NO.\,Time\,windows = \left\lfloor {\frac{{Signal\,length}}{D}} \right\rfloor$$

(4)
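As a quick check of Eqs. (3) and (4), the following arithmetic sketch reproduces the values discussed next (fs = 128 Hz, fc = 0.125 Hz, signal length 2048 samples); it is not part of the WSN implementation itself.

import math

fs = 128          # sampling frequency (Hz)
fc = 0.125        # cut-off frequency of the low-pass wavelet (Hz)
signal_length = 2048

D = math.floor(fs / (2 * fc))                 # Eq. (3): down-sampling factor
n_windows = math.floor(signal_length / D)     # Eq. (4): number of time windows

print(D, n_windows)                           # 512, 4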

In our case, we use the low-pass filter wavelet depicted in Fig. 7, with an estimated cut-off frequency of \({f}_{c}=0.125\) Hz. For a sampling rate of 128 Hz, the previous formula gives a down-sampling factor of approximately 512. As a result, we reduce the data significantly by keeping only samples spaced by this factor, which yields \(NO. Time\,windows=4\).

Fig. 7 Power spectrum of the low pass wavelet filter.

These initial coefficients are represented in a vector \({S}_{0}\) of size 1 × 4. In the WSN’s zeroth order, we are primarily analyzing the slower variations in the signal. At this stage, we achieve good time resolution but have limited frequency resolution.

$${S}_{0}=x\left(t\right)*\Phi$$

(5)
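A schematic numpy version of Eq. (5) is sketched below: the ECG segment is convolved with a low-pass kernel and critically down-sampled by the factor D computed above. The Gaussian kernel and the random signal here are placeholders standing in for the actual scattering low-pass filter and a real ECG segment.

import numpy as np

def zeroth_order(x, phi, D):
    """S0 of Eq. (5): low-pass filtering followed by critical down-sampling."""
    smoothed = np.convolve(x, phi, mode="same")   # moving-average-like smoothing
    return smoothed[::D]                          # keep one sample every D samples

# Toy example: random stand-in for a 2048-sample ECG segment.
rng = np.random.default_rng(0)
x = rng.standard_normal(2048)
t = np.arange(-1024, 1024) / 128.0
phi = np.exp(-t**2 / (2 * 1.0**2))
phi /= phi.sum()                                  # normalized Gaussian kernel (placeholder)
S0 = zeroth_order(x, phi, D=512)                  # shape (4,): the 4 time windows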

An invariance scale duration \(T\) must be set in order to use a WSN. This parameter determines the maximum time span over which the scattering features remain invariant to translation. For example, setting the invariance scale \(T\) to 1 s preserves the scattering features under any shift within that 1-second period. To ensure effective analysis, the invariance scale must be shorter than the signal duration.

$$T < \frac{Signal\,length}{f_{s}}$$

(6)

where \(f_{s}\) represents the sampling frequency of the signal.

In our study, to balance computational efficiency with the requirement for shift invariance, we tested different scales from 8 to 16 s. The best results came with an invariance scale of \(T=16\) seconds, so we selected this as our final setting.

To design a WSN, we need a bank of wavelet filters covering different frequency ranges. These filters are ordered so that the signal is decomposed into various frequency bands, bounded above by the Nyquist frequency. In our case, the sampling frequency of the signal is 128 Hz. For the first filter bank, the wavelet with the highest frequency band is centered at its maximum power frequency, which is:

$${f}_{n}\approx \frac{{f}_{s}}{2}$$

(7)

where \({f}_{s}\) represents the sampling frequency of the signal, and \({f}_{n}\) represents the central frequency of the wavelet’s highest frequency band.

In a WSN, the scale \({\lambda }_{i,j}\) of the \(j\)-th wavelet in the filter bank \({\Delta }_{\text{i}}\) is expressed as a function of the quality factor \({Q}_{i}\), which determines the number of wavelets per octave. This scale can be expressed as follows:

$${\lambda }_{i,j}={2}^{\frac{-j}{{Q}_{i}}},\quad {\lambda }_{i,j}\in {\Delta }_{i},\; j=\{1,2,\dots ,{N}_{i}\}$$

(8)

$$N_{i} = \left\lfloor Q_{i}\,\log_{2} \left( \frac{f_{s}\,T}{2\,Q_{i}} \right) + 1 \right\rfloor$$

(9)
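The sketch below evaluates Eqs. (8) and (9) with the parameters adopted in the following paragraphs (fs = 128 Hz, T = 16 s, Q1 = 8) and reproduces N1 = 57 wavelets for the first filter bank. The exact scale grid used by a given scattering toolbox may differ slightly from this direct evaluation.

import math

def n_wavelets(Q, fs, T):
    """Eq. (9): number of wavelets in a filter bank of quality factor Q."""
    return math.floor(Q * math.log2(fs * T / (2 * Q)) + 1)

def scales(Q, N):
    """Eq. (8): dyadic scales lambda_{i,j} = 2^(-j/Q) for j = 1..N."""
    return [2 ** (-j / Q) for j in range(1, N + 1)]

fs, T = 128, 16
N1 = n_wavelets(Q=8, fs=fs, T=T)     # 57 wavelets for the first filter bank
lambdas_1 = scales(Q=8, N=N1)
print(N1)                            # 57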

Next, we build the first filter bank with a quality factor \({Q}_{1}=8\), i.e., eight wavelets per octave. This choice is intentional: it yields a refined set of wavelet filters, each covering a different frequency range, with scales fine enough to capture intricate spectral details while keeping the complexity manageable. A higher value of \({Q}_{1}\) would provide even finer frequency separation, but at an increased computational cost. With \({Q}_{1}=8\), we achieve a practical division of the frequency bands, balancing the accuracy of the signal representation with processing speed. As we move towards lower frequencies, the bands are split into increasingly detailed sub-bands, giving \({N}_{1}=57\) wavelets, each covering its own frequency range. The power spectrum of these 57 high-pass wavelet filters is presented in Fig. 8, while their individual center frequencies and 3-dB bandwidths are highlighted in Fig. 9.

Fig. 8 Power spectrum of the first filter bank wavelets.

Fig. 9 Center frequencies and 3-dB bandwidths of the first filter bank wavelets.

To recover the high-frequency components of the signal, we use the first filter bank \({\Delta }_{1}\), which consists of \({N}_{1}=57\) high-pass wavelet filters at different scales (frequency bands). The modulus of each filtered signal is convolved with the low-pass filter and critically down-sampled, which results in 4 time windows. This process yields the first-order scattering coefficients, denoted \({S}_{1}\). At this stage, we analyze the high-frequency content of the signal and extract these important high-frequency coefficients. The first-order scattering stage produces 57 × 4 coefficients.

$${S}_{1}=\left\{\left|x*{\psi }_{{\lambda }_{1}}\right|*\Phi ,\;\left|x*{\psi }_{{\lambda }_{2}}\right|*\Phi ,\;\dots ,\;\left|x*{\psi }_{{\lambda }_{{N}_{1}}}\right|*\Phi \right\}_{\left\{{\lambda }_{1},\dots ,{\lambda }_{{N}_{1}}\right\}\in {\Delta }_{1}}$$

(10)
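A simplified numpy rendering of Eq. (10) is shown below: each wavelet of a (hypothetical) first filter bank is applied to the signal, the modulus is taken, and the result is low-pass filtered and critically down-sampled. Practical scattering implementations perform these convolutions in the frequency domain with per-scale down-sampling; this loop only illustrates the structure of S1.

import numpy as np

def first_order(x, filter_bank_1, phi, D):
    """S1 of Eq. (10): |x * psi_lambda| smoothed by phi and critically down-sampled."""
    rows = []
    for psi in filter_bank_1:                              # one wavelet per scale lambda
        u1 = np.abs(np.convolve(x, psi, mode="same"))      # modulus of the wavelet transform
        s1 = np.convolve(u1, phi, mode="same")[::D]        # low-pass filtering + down-sampling
        rows.append(s1)
    return np.stack(rows)                                  # shape (N1, 4) in our setting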

However, the first-order stage ends with a convolution with the low-pass filter \(\Phi\), which discards some high-frequency components. We therefore constructed a second filter bank of high-pass wavelets \({\Delta }_{2}\) with a quality factor \({Q}_{2}=1\) in order to recover these lost high frequencies. This choice is intentional: the specific frequency ranges are already covered by the first filter bank, and a higher value of \({Q}_{2}\) would lead to redundant scales and increased computational cost. The power spectrum of the second set of \({N}_{2}=9\) high-pass wavelet filters is illustrated in Fig. 10, while their center frequencies and 3-dB bandwidths are depicted in Fig. 11.

Fig. 10 Power spectrum of the second filter bank wavelets.

Fig. 11 Center frequencies and 3-dB bandwidths of the second filter bank wavelets.

To retrieve these high-frequency components, we apply the second order of the WSN and obtain the second-order scattering coefficients, denoted \({S}_{2}\). In the second order, the scattering coefficients are expected to span 57 × 9 combinations, where 57 is the number of high-pass wavelet filters in the first filter bank and 9 is the number of high-pass wavelets in the second filter bank, which may result in 513 scattering paths in the second order of the scattering network.

$${S}_{2}=\left\{\left|\left|x*{\psi }_{{\lambda }_{i}}\right|*{\psi }_{{\lambda }_{j}}\right|*\Phi \right\}_{{\lambda }_{i}\in {\Delta }_{1},\;{\lambda }_{j}\in {\Delta }_{2}},\quad i=\left\{1,2,3,\dots ,{N}_{1}\right\},\; j=\left\{1,2,3,\dots ,{N}_{2}\right\}$$

(11)
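Continuing the sketch above, Eq. (11) cascades a second modulus-wavelet stage before the final smoothing. The nested loop below enumerates every (lambda_i, lambda_j) pair of two hypothetical filter banks; the path pruning discussed next would skip most of these combinations.

import numpy as np

def second_order(x, filter_bank_1, filter_bank_2, phi, D):
    """S2 of Eq. (11): ||x * psi_i| * psi_j| smoothed by phi and critically down-sampled."""
    rows = []
    for psi_i in filter_bank_1:
        u1 = np.abs(np.convolve(x, psi_i, mode="same"))         # first modulus stage
        for psi_j in filter_bank_2:
            u2 = np.abs(np.convolve(u1, psi_j, mode="same"))     # second modulus stage
            rows.append(np.convolve(u2, phi, mode="same")[::D])  # smoothing + down-sampling
    return np.stack(rows)   # up to N1 * N2 = 513 paths before pruning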

This can lead to high computational costs for feature extraction. Consequently, the path-optimization setting of the WSN was enabled. With this setting, the scattering paths are computed selectively whenever there is a substantial overlap between the bandwidths of parent and child nodes in the scattering network. For quality factors of 1 or 1/2, "substantial overlap" means that the child node's center frequency minus its 3-dB bandwidth is less than the parent's 3-dB bandwidth; in that case the scattering path is computed. For quality factors greater than 1, the path is computed when the child's center frequency minus its 3-dB bandwidth overlaps with the 3-dB bandwidth of the parent. This optimization results in 142 scattering paths in the second order of the scattering network. Finally, the second-order scattering coefficients \({S}_{2}\) are obtained, with a size of 142 × 4.

According to previous research, almost 99% of the scattering coefficients' energy is concentrated within the first two layers, and this energy drops off quickly in higher layers44. The final scattering coefficients, which we refer to as the feature matrix of a single ECG signal, are denoted \(S\) and have a size of 200 × 4. The scattering network is illustrated in Fig. 12.

Fig. 12 Architecture of the scattering network.

After applying the WSN to an ECG signal of 2048 samples, we obtain a feature matrix of size 200 × 4, i.e., 800 values. This result emphasizes the effectiveness of the WSN as a powerful signal-processing technique for dimensionality reduction: compared with the original signal, the WSN yields a 60.94% reduction in this example. This is an important characteristic because it enables the use of different ML algorithms, which generally perform better with lower-dimensional data. We then reshape the tensor into an adequate format to prepare the data for the classifiers.
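The reduction quoted above can be verified directly: 200 × 4 = 800 retained values against 2048 original samples. The snippet also shows one plausible reshaping of the 200 × 4 feature matrix into four 200-dimensional feature vectors per segment (one per time window); the exact reshaping convention used for the classifiers is an assumption of this sketch.

import numpy as np

original_len = 2048
S = np.zeros((200, 4))                # feature matrix of one ECG segment (placeholder values)

reduction = 1 - S.size / original_len # 1 - 800/2048
print(f"{reduction:.2%}")             # 60.94%

# One segment -> 4 classifier samples: each time window becomes a 200-dim feature vector.
X_segment = S.T                       # shape (4, 200)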

The sizes of the feature matrices used for training and testing under the inter-patient and NO inter-patient paradigms are presented in Table 4.

Table 4 Feature sizes under the inter-patient and NO inter-patient paradigms.

Comparison of the classification results of inter-patient and NO inter-patient paradigms

Computing and analyzing a confusion matrix is the traditional method for evaluating a model's performance. This matrix counts False Positives (FP), instances classified into an incorrect category; False Negatives (FN), instances classified as normal beats when they are actually not; True Positives (TP), cases classified into the correct disease category; and True Negatives (TN), cases correctly classified as normal heart beats. We selected accuracy, precision, sensitivity (recall), specificity, and F1-score as performance metrics to assess the models; these metrics are derived from the confusion matrix.

$$Accuracy= \frac{TP+TN}{TP+FP+TN+FN}$$

(12)

$$Precision= \frac{TP}{TP+FP}$$

(13)

$$Sensitivity=\frac{TP}{TP+FN}$$

(14)

$$Specificity=\frac{TN}{TN+FP}$$

(15)

$$F1\,score=\frac{2\cdot SEN\cdot PRE}{SEN+PRE}$$

(16)
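The metrics of Eqs. (12)-(16) follow directly from the confusion-matrix counts; a small Python helper computing them from TP, TN, FP, and FN is sketched below, with arbitrary example counts.

def classification_metrics(tp, tn, fp, fn):
    """Eqs. (12)-(16): accuracy, precision, sensitivity, specificity and F1-score."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)          # also called recall
    specificity = tn / (tn + fp)
    f1 = 2 * sensitivity * precision / (sensitivity + precision)
    return accuracy, precision, sensitivity, specificity, f1

# Example with arbitrary counts:
print(classification_metrics(tp=90, tn=85, fp=10, fn=15))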

Different ML models were used to classify the ECG signals into ARR, CHF, and NSR. The ML models were fed the scattering coefficients obtained by applying the WSN. Since we have 4 time windows for each ECG segment, the data are effectively oversampled by a factor of 4: the ML models generate 4 predictions per ECG segment, one for each of the 4 time windows produced by the WSN. The classification accuracy using 5-fold cross-validation for the inter-patient split and the NO inter-patient split is outlined in Tables 5 and 6, respectively.

Table 5 Validation accuracy for the data of inter-patient split.
Table 6 Validation accuracy for the data of NO inter-patient split.
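A minimal scikit-learn sketch of this cross-validation protocol is given below, assuming the scattering features have already been reshaped so that each time window is one row of X carrying its segment's label; the data shapes and the KNN hyperparameters shown are placeholders, not the tuned values of this study.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data: 4 time windows per segment, 200 scattering features each,
# labels in {0: ARR, 1: CHF, 2: NSR} repeated 4 times per segment.
rng = np.random.default_rng(0)
X = rng.standard_normal((1200, 200))
y = rng.integers(0, 3, size=1200)

knn = KNeighborsClassifier()                  # default n_neighbors=5 (placeholder)
scores = cross_val_score(knn, X, y, cv=5)     # 5-fold cross-validation accuracy
print(scores.mean())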

These tables report the performance of the different ML models tested: DT, LD, Quadratic Discriminant (QD), Naïve Bayes (NB), Linear SVM, Quadratic SVM, Cubic SVM, KNN, Ensemble Bagged Trees (EBT), and Ensemble Subspace KNN (EKNN). As we can observe, the results of the two splits are very close. This is because the validation folds do not follow the inter-patient paradigm, meaning no inter-patient data separation was maintained within the validation sets. Interestingly, in both schemes the validation accuracy of the KNN classifier outperforms the other ML models. Since KNN achieved the best validation accuracy, we further tested it on the testing data of both the inter-patient and NO inter-patient paradigms to explore the difference between these two paradigms. The testing accuracy using KNN under the inter-patient and NO inter-patient paradigms is illustrated in Fig. 13. The investigation conducted in this study confirms the observations of De Chazal et al.30 and Luz et al.31: under the inter-patient paradigm, the testing accuracy was about 79.40%, whereas under the NO inter-patient paradigm it was significantly higher, reaching a remarkable 99.30%.

Fig. 13 Accuracy comparison between inter-patient and NO inter-patient split.

Based on the results given in Table 5, we decided to proceed with the LD model, as it strikes a balance between prediction speed and validation accuracy. The LD classifier aims to maximize the separation between two or more groups by finding the optimal values of the vector \(\upsigma\), which contains the weights used to calculate the discriminant scores.

$$LD=\sum_{i=1}^{N}{\upsigma }_{i}{X}_{i}$$

(17)
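Eq. (17) is the linear discriminant score; in practice the weight vector is fitted by a library rather than by hand. A hedged scikit-learn sketch using LinearDiscriminantAnalysis on placeholder scattering features is shown below; its fitted coefficients play the role of the vector sigma above.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Placeholder scattering features and labels (ARR / CHF / NSR encoded as 0 / 1 / 2).
rng = np.random.default_rng(1)
X = rng.standard_normal((1200, 200))
y = rng.integers(0, 3, size=1200)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
# lda.coef_ holds the fitted weight vectors corresponding to sigma in Eq. (17).
print(lda.coef_.shape)        # (n_classes, n_features)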


