Enhancing cardiac disease detection via a fusion of machine learning and medical imaging



This section covers feature selection, handling of missing data, detection and treatment of outliers, data normalization and discretization, and data visualization to aid feature recognition. Section 3.3 describes the modeling procedures, interpretation, and assessment of the results.

Data identification phase

This phase of the research involves not only acquiring essential domain knowledge about the heart and cardiovascular disorders, but also evaluating the dataset from both medical and technical viewpoints to ensure its suitability for developing a predictive model for heart disease.

Based on the extensive investigations and prior research conducted on the Cleveland Clinic heart-patient dataset, this dataset is an appropriate choice for developing a model to predict heart disorders.

Furthermore, because the data used in this research consist of real patient records labeled as sick or healthy by medical professionals and diagnostic tests, a predictive model that achieves satisfactory accuracy on them can be considered acceptable from both a medical and a technical standpoint, and can be employed to forecast heart disease.

Data Preparation phase

Table 2 compares the features extracted from clinical data (Cleveland UCI) and cardiac ultrasound images (EchoNet-Dynamic). Clinical attributes such as blood pressure and cholesterol are combined with image features such as cardiac region area and mean pixel intensity, which are extracted using FCN and U-Net models and shown in Figs. 2 and 6.

Table 2 Comparison of features based on images in the suggested approach to heart disease diagnosis.

Univariate outlier data

In univariate outlier analysis, outliers are detected for each variable independently using a box plot, without regard to other variables. A box plot is a graphical summary of the variation in a single variable, built from five statistics: the minimum value, the first quartile (Q1), the median (Q2), the third quartile (Q3), and the maximum value.

The graph in Fig. 5 illustrates the distance between the first and third quartiles, as well as the median (second quartile) of the box plot. The IQR is computed according to Eq. (4), while Eqs. (5) and (6) establish the upper and lower boundaries of the graph.

$$IQR=Q3-Q1$$

(4)

$$Lower\ limit:\ Q1-\left(1.5\times IQR\right)$$

(5)

$$Upper\ limit:\ Q3+\left(1.5\times IQR\right)$$

(6)
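As a concrete illustration, the fences of Eqs. (5) and (6) can be computed directly from the quartiles; the sketch below uses NumPy's percentile estimates on an illustrative cholesterol-like sample (the values are made up for the example).

```python
import numpy as np

def iqr_fences(values):
    """Return (lower, upper) box-plot fences per Eqs. (5) and (6)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1                # Eq. (4): interquartile range
    lower = q1 - 1.5 * iqr       # Eq. (5): lower fence
    upper = q3 + 1.5 * iqr       # Eq. (6): upper fence
    return lower, upper

# Example: a cholesterol-like feature with one extreme value
chol = np.array([180.0, 200, 210, 220, 240, 250, 260, 600])
low, high = iqr_fences(chol)             # low = 140.0, high = 320.0
outliers = chol[(chol < low) | (chol > high)]   # array([600.])
```

Any value outside the fences is flagged, exactly as the box-plot rule in the text prescribes.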

Fig. 5

Distance between the first and third quartiles.

Data that exceeds the upper limit or falls below the lower limit is classified as outlier data49. Because certain attributes are medical in nature, the strategy for handling outliers is class-dependent: if an outlying value lies within a medically plausible range and belongs to the patient class, it remains unaltered, since it serves as a risk factor for heart disease and contributes to a more precise model; if the individual falls within the healthy category, the outlier is adjusted to the nearest normal limit. This study first identifies and addresses univariate outliers for each feature, followed by the identification and management of multivariate outliers as outlined in Sect. 4.2.2.

Fig. 2 illustrates the input and output of the preprocessing and segmentation procedure in the proposed method, displaying a raw and a segmented cardiac ultrasound image. The raw image (left) is the starting point for noise removal and normalization, while the segmented image (right) results from applying FCN and U-Net models to isolate the important cardiac regions. The figure highlights the role of preprocessing in enhancing image quality and extracting precise features for the deep learning models, and it serves as the basis for the proposed multimodal architecture's fusion of clinical and imaging data.
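The class-dependent rule above can be sketched as a small NumPy routine; the label encoding (0 = healthy, 1 = sick) is an illustrative assumption, not the paper's exact implementation.

```python
import numpy as np

def handle_univariate_outliers(values, labels, lower, upper):
    """Clip outlying values to the nearest box-plot fence, but only for
    healthy subjects (label 0); outliers of sick patients (label 1) are
    kept unchanged as genuine risk factors, as described above."""
    values = np.asarray(values, dtype=float).copy()
    labels = np.asarray(labels)
    is_outlier = (values < lower) | (values > upper)
    fix = is_outlier & (labels == 0)   # adjust healthy subjects only
    values[fix] = np.clip(values[fix], lower, upper)
    return values

# Two healthy outliers get clipped to the fences; the sick one is kept
cleaned = handle_univariate_outliers(
    np.array([100.0, 700.0, 650.0]),   # feature values
    np.array([0, 0, 1]),               # 0 = healthy, 1 = sick (assumed)
    lower=140.0, upper=320.0)          # fences from Eqs. (5)-(6)
```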

Fig. 6 depicts three visual phases in segmenting a cardiac ultrasound image. The first image shows the raw ultrasound data, including the initial features and noisy heart structures. The second shows the result of preprocessing techniques that enhance image quality and highlight important cardiac regions, such as scaling pixel intensities to the interval [0, 1] and denoising with a Gaussian filter. The third image displays the final segmentation output of deep learning models such as FCN and U-Net, in which important parts of the heart, including the left ventricle, are precisely isolated from the background and ready for the extraction of significant features.
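A minimal sketch of the two preprocessing steps named above, Gaussian denoising followed by min-max scaling of pixel intensities to [0, 1]; the filter's sigma is an assumed value, not taken from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def preprocess_ultrasound(frame, sigma=1.0):
    """Denoise a single ultrasound frame with a Gaussian filter, then
    rescale its intensities to the interval [0, 1]."""
    smoothed = gaussian_filter(np.asarray(frame, dtype=float), sigma=sigma)
    lo, hi = smoothed.min(), smoothed.max()
    if hi == lo:                       # constant frame: nothing to scale
        return np.zeros_like(smoothed)
    return (smoothed - lo) / (hi - lo)

# Example on a synthetic noisy 64x64 frame
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(64, 64))
clean = preprocess_ultrasound(frame)   # values now span exactly [0, 1]
```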

Fig. 6

A cardiac ultrasound image’s segmentation steps include the raw data, the preprocessed image, and the segmented image.

Figure 7 illustrates the box plot of the cholesterol variable. Because this feature has a medical threshold and the outliers lie in the upper range, these data points are deemed a risk factor for sick individuals and remain unchanged. Conversely, the four outlier data points belonging to healthy individuals are adjusted to their nearest permissible values, namely the upper and lower limits.

Fig. 7

Boxplot of cholesterol using univariate outlier data.

Fig. 8

Proposed architecture of a hybrid system for heart disease diagnosis.

Figure 8 depicts the architecture of a hybrid system for diagnosing heart disease, incorporating clinical data, medical imaging, and sensor data. Data is gathered from various sources in the initial layer, including clinical records (such as medical history, blood tests, and demographic details), medical images (such as echocardiography, cardiac MRI, CT scans, and chest X-rays), and sensor data (such as heart rate, blood pressure, and oxygen saturation levels). In the second layer, pertinent features are derived from these data sources; medical images are scrutinized using image processing methodologies and convolutional neural networks (CNNs) to extract essential features, whereas clinical and sensor data are evaluated through signal analysis and quantitative feature extraction. The collected characteristics are subsequently integrated into a cohesive dataset for additional processing.

In the subsequent layer, the data undergoes preprocessing procedures including feature selection, normalization, management of missing data, and data cleansing to enhance quality and consistency. Subsequent to preprocessing, the enhanced features are input into machine learning models and deep neural networks to forecast the probability of heart disease in patients. These models, such as SVM, Random Forest, XGBoost, and deep neural networks, concurrently evaluate both imaging and clinical data to improve diagnostic precision. Ultimately, the system’s output may encompass disease predictions, emergency alarms, and treatment recommendations for both physicians and patients, therefore enhancing the accuracy, speed, and non-invasive characteristics of heart disease diagnosis.
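The fusion step described above, combining CNN-derived image features with clinical and sensor features into one dataset, can be illustrated by simple concatenation; the dimensions below (512 image features, 13 clinical attributes) are assumptions for the example, not the system's fixed sizes.

```python
import numpy as np

def fuse_features(image_feats, clinical_feats):
    """Concatenate per-sample image features (e.g. CNN embeddings) with
    clinical/sensor features into one fused design matrix."""
    image_feats = np.atleast_2d(image_feats)
    clinical_feats = np.atleast_2d(clinical_feats)
    if image_feats.shape[0] != clinical_feats.shape[0]:
        raise ValueError("sample counts must match")
    return np.hstack([image_feats, clinical_feats])

# e.g. 5 patients, 512-dim image embeddings + 13 clinical attributes
img = np.zeros((5, 512))
clin = np.zeros((5, 13))
fused = fuse_features(img, clin)   # shape (5, 525)
```

The fused matrix is what the downstream SVM, Random Forest, XGBoost, or deep models would consume.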

Data on multivariate outliers

To detect multivariate outliers in the dataset, the dependent variable is the data class, while the independent variables are a collection of numerical properties. Subsequently, the Mahalanobis distance of the data is computed in relation to the class, and outliers are found and eliminated from the dataset using boxplots and histograms. The Mahalanobis distance, utilizing the data covariance matrix, quantifies the distance of each observation in multidimensional space from the mean center of all observations. Consequently, it may serve as an appropriate metric for identifying multivariate outliers. Figure 9 illustrates a box plot depicting the Mahalanobis distance of the data in relation to the class.

As stated in Sect. 4.2.1, subsequent to analyzing the outliers for each feature, the total multivariate outliers were assessed. A total of 8 data points (8 sick or healthy persons) were detected as multivariate outliers throughout the processing of stages 1, 2, and 3, and subsequently removed. Figure 9 illustrates the extent of data dispersion according to the Mahalanobis interval pertaining to the class44. The Mahalanobis interval for each dataset is computed utilizing Eq. (7).

$$M_{i}^{2}=\left(X_{i}-\mu\right)^{T}C^{-1}\left(X_{i}-\mu\right)$$

(7)

\(M_{i}^{2}\) represents the squared Mahalanobis distance of sample \(i\), \(X_{i}\) denotes the vector of variables for the \(i\)-th sample, and \(\mu\) signifies the vector of mean values of the independent variables. Here, \(C\) is the covariance matrix of the training data (independent variables), and \(T\) denotes the transpose of the expression in parentheses. The Mahalanobis distance resembles the Euclidean distance, but it is adjusted by the covariance matrix. As Figure 10 illustrates, data points with a Mahalanobis distance exceeding 54.5% in relation to the class are classified as outliers and excluded from the dataset.
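Eq. (7) can be sketched directly in NumPy; here the distance is taken from the overall feature means, whereas the paper computes it in relation to the class, which corresponds to applying the same function to each class subset.

```python
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row of X from the feature
    means, per Eq. (7): M_i^2 = (X_i - mu)^T C^{-1} (X_i - mu)."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)                 # mean vector of the variables
    C = np.cov(X, rowvar=False)         # covariance matrix of the data
    C_inv = np.linalg.inv(C)
    diff = X - mu
    # row-wise quadratic form diff_i^T C^{-1} diff_i
    return np.einsum('ij,jk,ik->i', diff, C_inv, diff)
```

With the sample covariance (ddof = 1, NumPy's default), the squared distances of n samples over p features sum to (n − 1)·p, a quick sanity check on the implementation.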

Fig. 9

Mahalanobis distance box plot of data for multivariate outlier detection.

Fig. 10

Data dispersion according to Mahalanobis distance in relation to data class.

Figure 10 shows how the dataset's multivariate outliers are identified using the Mahalanobis distance relative to the class variable (healthy or diseased). The scatter plot displays each sample of the Cleveland Clinic heart-patient dataset against its estimated Mahalanobis distance, a metric that measures how far a data point lies from the distribution mean, corrected by the covariance matrix. The x-axis may be a sample index or feature dimension, while the y-axis shows Mahalanobis distance values, with a threshold line (about 54.5%) separating typical data points from outliers. Points above this threshold deviate significantly from their class distribution and are treated as multivariate outliers.

Numerical data normalization

Data normalization is a data-scaling technique used in machine learning, applicable to numerical features for several purposes. In health research, the challenges of data access, security, and individual privacy can be partially mitigated through data normalization. Normalization is crucial to the modeling process and model efficacy, and it significantly influences the learning speed of the model. It is used chiefly when feature ranges differ, to limit the adverse effect of disparate numerical ranges on model performance. This study employed the MIN-MAX normalization technique to scale the numerical data: each feature is normalized to a range between \(New_{max}\) and \(New_{min}\), specifically 1 and 0, in accordance with Eq. (8).

$$X_{i}=New_{min}+\left(New_{max}-New_{min}\right)\times\frac{X_{i}-X_{min}}{X_{max}-X_{min}}$$

(8)

where \(X_{min}\) is the minimum value of the data before normalization, \(X_{max}\) is the maximum value before normalization, \(New_{min}\) is the minimum value after normalization, and \(New_{max}\) is the maximum value after normalization. \(X_{i}\) is thereby mapped to a value in the range between \(New_{min}\) and \(New_{max}\).
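Eq. (8) translates directly into code; a minimal sketch with the default target range [0, 1] and made-up age values:

```python
import numpy as np

def min_max_normalize(x, new_min=0.0, new_max=1.0):
    """Rescale x into [new_min, new_max] per Eq. (8)."""
    x = np.asarray(x, dtype=float)
    return new_min + (new_max - new_min) * (x - x.min()) / (x.max() - x.min())

ages = np.array([29.0, 45.0, 61.0, 77.0])
print(min_max_normalize(ages))   # [0.         0.33333333 0.66666667 1.        ]
```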

Numerical data discretization

Owing to the characteristics of certain modeling techniques employed in this study, the numerical features must be discretized into suitable intervals. Because certain features have a defined medical domain, discretization follows the flowchart shown in Fig. 11. The outcomes of the discretization are presented in Table 1.

The flowchart in Fig. 11 depicts the detailed procedure for evaluating and diagnosing cardiac disease using a combination of EchoNet and Cleveland UCI data. First, ResNet-50 extracts 512-dimensional feature vectors from the EchoNet images, while clinical features are taken from the Cleveland UCI data. The preprocessing phase applies denoising, normalization, and EchoNet-specific segmentation before the ResNet-50 features are extracted. These image features are concatenated with the UCI data and then passed through feature selection. If a feature's intervals can be medically verified, discretization based on the medical domain is performed by examining the association with the class feature; otherwise, sampling-based clustering is used when discretization optimization is feasible, and equal-interval discretization otherwise. Finally, AdaBoost/SVM algorithms classify the two categories of present/absent cardiac disease, and performance measures including accuracy and AUC are reported.
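Where the flowchart falls back to equal-interval discretization, the binning step can be sketched as follows; the blood-pressure values and bin count are illustrative assumptions, not the paper's chosen cut-points.

```python
import numpy as np

def equal_width_discretize(values, n_bins=4):
    """Assign each value to one of n_bins equal-width intervals (0..n_bins-1)."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # digitize against the interior edges so labels fall in 0..n_bins-1
    return np.digitize(values, edges[1:-1])

bp = np.array([94.0, 110, 126, 150, 200])   # e.g. resting blood pressure
bins = equal_width_discretize(bp, n_bins=4)  # array([0, 0, 1, 2, 3])
```

Medically verified features would instead be cut at domain thresholds (e.g. clinical ranges) rather than at equal-width edges.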

Fig. 11

Flowchart showing the discretization of numerical data for the diagnosis of heart disease.


