A Robust Machine Learning Framework for Stroke Type Identification with Missing Data under Constrained Configurations

Machine Learning


This study aims to develop a robust machine learning model to distinguish ischemic (IS) and hemorrhagic (HS) strokes using only clinical attributes, without relying on neuroimaging data. We addressed key challenges including class imbalance (the non-uniform distribution of IS and HS cases), missing data, and target leakage within the stroke dataset. The models were trained and tested on a retrospective dataset containing 2190 samples and 78 attributes. Table 1 shows the number and percentage of missing values for each attribute. Seven attributes were excluded from further analysis due to excessively high percentages of missing data.

Our framework initially achieved extremely high performance using the remaining 72 attributes, with a weighted accuracy of 97.3% and an accuracy of 96.5%. However, this exceptional performance raised doubts, as it seemed unrealistically good. After thorough investigation, we found that even a single attribute could achieve comparable performance, as shown in Figure 2a. Specifically, the "National Institutes of Health Stroke Scale (NIHSS) at Enrollment/First Visit" attribute alone was sufficient to accurately distinguish between IS and HS, while the other attributes had minimal impact.

Figure 2

(a) Weighted accuracy shows no variation across TOP-N attribute selections, indicating target leakage from the total NIHSS score at admission attribute. (b) Changes in weighted accuracy, sensitivity, and specificity with TOP-N attribute selection after removing the two attributes causing target leakage. (c) The same after removing all attributes with potential target leakage (p-value ≤ 0.05).

These investigations identified the underlying cause of the target leakage in the dataset: NIHSS score data were collected only from IS patients and not from HS cases. As a result, our model inferred that cases with missing NIHSS values are HS and cases with non-missing values are IS, incorrectly resulting in high performance.

To determine the attributes responsible for target leakage, the IS and HS distributions of the missing and non-missing indices of each attribute were analyzed statistically using a t-test, with the null hypothesis (H0) that the distributions of IS and HS across the missing and non-missing indices are approximately the same. Table 1 shows the differences in distribution between the missing and non-missing indices. Among the 72 attributes remaining after the exclusion of attributes with high missingness, we found that the p-values for total NIHSS at admission/first visit and serum homocysteine were very small (≤10^-50). Therefore, we removed them from further analysis. However, in hypothesis testing, a p-value ≤ 0.05 is conventionally considered sufficient to reject the null hypothesis (H0); therefore, an additional analysis was performed after removing all attributes with p-value ≤ 0.05.
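The missingness-based leakage test described above can be sketched as follows. The data, attribute names, and missingness rates here are hypothetical toy values, not the study's data: "leaky" is recorded mostly for IS cases, so its missingness pattern encodes the label, while "clean" is missing at random.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)  # 1 = IS, 0 = HS (toy labels)

# Leaky attribute: missing for ~90% of HS but only ~5% of IS cases.
leaky = rng.normal(10.0, 3.0, size=500)
leaky[rng.random(500) < np.where(y == 1, 0.05, 0.90)] = np.nan
# Clean attribute: ~10% missing, independent of the label.
clean = rng.normal(120.0, 15.0, size=500)
clean[rng.random(500) < 0.10] = np.nan

def missingness_leak_pvalue(x, y):
    """Welch t-test comparing labels at missing vs non-missing indices.

    H0: the label distribution is the same whether the attribute is
    missing or not; a tiny p-value flags a potential target leak.
    """
    miss = np.isnan(x)
    return stats.ttest_ind(y[miss], y[~miss], equal_var=False).pvalue

print(f"leaky p-value: {missingness_leak_pvalue(leaky, y):.2e}")
print(f"clean p-value: {missingness_leak_pvalue(clean, y):.2e}")
```

On this toy data the leaky attribute yields a vanishingly small p-value, while the randomly missing attribute does not.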

Experiments 1-5 cover the performance achieved after fixing the target leakage caused by the two attributes, the results achieved via attribute selection, and the results after deleting all attributes with p-value ≤ 0.05; existing clinical scores and baseline models were also evaluated for comparative analysis.

Table 2 Comparison of the framework's results with existing clinical scores and baseline models on retrospective data.

After correcting for the target leakage, Experiments 4 and 5 achieved a weighted accuracy of 80.84%, an accuracy of 80.82%, a sensitivity of 80.80%, a specificity of 80.88%, and an F1 score of 85.51%. Meanwhile, Experiment 3 achieved a weighted accuracy of 80.38%, an accuracy of 80.87%, a sensitivity of 81.6%, a specificity of 79.15%, and an F1 score of 85.65%. The weighted accuracy across EXP-1 to EXP-5 ranged from 79.93% to 80.84% [mean: 80.41%, IQR: 80.08-80.84%, SD: 0.42%]. The ranges of accuracy, sensitivity, and specificity were 80.18-81% [mean: 80.70%, IQR: 80.64-80.87%, SD: 0.32%], 80.47-81.6% [mean: 81.05%, IQR: 80.8-81.47%, SD: 0.46%], and 78.69-80.88% [mean: 79.78%, IQR: 79.15-80.88%, SD: 1.03%], respectively. The low standard deviations indicate minimal performance differences across the five experimental configurations.
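For reference, the metrics reported above can all be computed from a binary confusion matrix, as sketched below. Treating weighted accuracy as balanced accuracy (the mean of sensitivity and specificity) is an assumption, since the text does not spell out its formula.

```python
import numpy as np

def stroke_metrics(y_true, y_pred):
    """Metrics from binary labels (1 = IS, 0 = HS).

    Weighted accuracy is assumed here to mean balanced accuracy,
    i.e. the mean of sensitivity (IS recall) and specificity (HS recall).
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "weighted_acc": (sens + spec) / 2,
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sens,
        "specificity": spec,
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

m = stroke_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
print(m)
```

On the toy labels above, sensitivity, specificity, weighted accuracy, and F1 all come out to 2/3.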

TOP-N attribute selection analysis reveals the trend in weighted accuracy as the number of attributes decreases from TOP-68 to the single TOP-1 attribute following the target-leakage correction. These trends are visualized in Figures 2b,c. According to the findings in Table 2, the best performance was achieved using the top 34 attributes after removing the two key attributes responsible for target leakage, giving a weighted accuracy of 82.42%, an accuracy of 82.33%, a sensitivity of 82.19%, a specificity of 82.65%, and an F1 score of 86.68%.
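A minimal sketch of such a TOP-N sweep, using a synthetic dataset and a random forest importance ranking as stand-ins for the study's framework and attribute ranking:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 20 attributes, 5 informative.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Rank attributes by importance (best first).
ranker = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
order = np.argsort(ranker.feature_importances_)[::-1]

# Re-evaluate the model on progressively smaller TOP-N subsets.
scores = {}
for n in (20, 10, 5, 1):
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    scores[n] = cross_val_score(clf, X[:, order[:n]], y, cv=5).mean()

for n in sorted(scores, reverse=True):
    print(f"TOP-{n:2d}: CV accuracy {scores[n]:.3f}")
```

Plotting these scores against N produces curves of the kind shown in Figures 2b,c.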

The minimum number of attributes required for meaningful performance was 19; reducing the attribute count below 19 degraded the performance metrics significantly. As shown in Figure 2c and Table 2, removing all attributes with p-value ≤ 0.05 reduced the weighted accuracy to 80.09%.

All performance metrics presented here are based on a decision threshold of 0.5 applied to the predicted probability when assigning a class label. This threshold was intentionally chosen as a standardized, unbiased benchmark for comparing model performance. Analysis of the metric-versus-threshold plots (see Supplementary Figures 2 and 3) revealed that at this threshold, all models converge to a weighted accuracy of about 0.80 despite their different architectures. This point of convergence represents a practical, unbiased trade-off between sensitivity and specificity across models. Furthermore, the best-performing models maintain a near-optimal F1 score at this threshold. Therefore, a threshold of 0.5 allows for a fair and interpretable comparison across models without introducing bias from model-specific threshold tuning.
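The threshold analysis can be illustrated with a toy sweep; the probability scores below are synthetic, not model outputs from the study:

```python
import numpy as np

def balanced_accuracy_at(y_true, proba, threshold):
    """Balanced accuracy when class 1 is predicted for proba >= threshold."""
    pred = (proba >= threshold).astype(int)
    sens = np.mean(pred[y_true == 1] == 1)
    spec = np.mean(pred[y_true == 0] == 0)
    return (sens + spec) / 2

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 400)
# Synthetic probabilities: informative but noisy scores per class.
proba = np.clip(0.35 * y + 0.3 + 0.25 * rng.random(400), 0.0, 1.0)

# Sweep the decision threshold, as in the metric-versus-threshold plots.
for t in (0.3, 0.5, 0.7):
    print(f"threshold {t:.1f}: weighted accuracy "
          f"{balanced_accuracy_at(y, proba, t):.3f}")
```

Sweeping the threshold in this way makes the sensitivity/specificity trade-off at 0.5 explicit.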

To assess threshold-independent classifier performance, AUROC and AUPRC were analyzed (see Supplementary Figure 4). Ensemble boosting models, especially CatBoost and XGBoost, emerge as top performers with AUROC values of about 0.90, demonstrating excellent ability to distinguish between classes. More importantly, on this imbalanced dataset, Gradient Boosting, CatBoost, XGBoost, and Balanced Random Forest achieved strong AUPRC scores above 0.93.
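A brief sketch of computing these threshold-independent metrics on a synthetic imbalanced problem; using average precision as the AUPRC estimate is an assumption about the exact computation used:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic imbalanced problem: ~20% positive class, noisy but
# informative scores.
rng = np.random.default_rng(2)
y = (rng.random(1000) < 0.2).astype(int)
scores = y * 1.0 + rng.normal(0.0, 0.6, 1000)

auroc = roc_auc_score(y, scores)
auprc = average_precision_score(y, scores)  # average precision ~ AUPRC
print(f"AUROC {auroc:.3f}, AUPRC {auprc:.3f}")
```

Unlike accuracy at a fixed cutoff, both metrics summarize performance over all possible thresholds, which is why they complement the 0.5-threshold comparison above.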

The DL model TabNet recorded a weighted accuracy of 72.44% at the 0.5 decision threshold, with AUPRC scores of 0.81 and 0.89.

Interpretable models

Among the interpretable models, logistic regression (formally defined in the supplementary material) showed superior performance compared to the decision tree and SVM variants, but it is important to note that all interpretable ML models performed worse than the best performance achieved in Experiments 1-5. Logistic regression achieved a weighted accuracy of 77.07%, while the linear- and RBF-kernel SVMs achieved weighted accuracies of 76.63% and 75.34%, respectively. The decision tree models (CART and ID3) yielded similar weighted accuracies of 67.25% and 66.65%. See Table 3 for more information.

Table 3 Results of the interpretable models adopted for stroke classification on retrospective data.

Feature importance analysis

Figure 3 shows the top 20 attributes that most strongly affect the model output, identified via SHAP values. Of these, blood pressure at admission (both diastolic and systolic) emerges as the most influential in determining stroke type. Other major contributors include serum electrolytes (sodium and potassium), Glasgow Coma Scale (GCS) at admission, platelet count, hemoglobin level, atrial fibrillation, diabetes, and stroke onset time.
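As background on how SHAP attributions behave, the linear-model special case admits a closed form, phi_j = w_j (x_j - E[x_j]), assuming independent features. The sketch below illustrates only that special case; it is not the (likely tree-based) explainer used for Figure 3, and the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))          # background dataset
w, b = np.array([2.0, -1.0, 0.5, 0.0]), 0.3   # linear model f(x) = w.x + b

def linear_shap(x, X_background, w):
    """Exact SHAP values for a linear model with independent features."""
    return w * (x - X_background.mean(axis=0))

x = X[0]
phi = linear_shap(x, X, w)
base = w @ X.mean(axis=0) + b          # expected model output

# Local accuracy: contributions sum back to the prediction.
assert np.isclose(base + phi.sum(), w @ x + b)
print("per-attribute contributions:", np.round(phi, 3))
```

The local-accuracy property checked at the end (base value plus contributions equals the prediction) is what makes SHAP summaries like Figure 3 additive and comparable across attributes.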

Figure 3

Top attributes driving model output based on SHAP values. (a) TOP-20 attributes from the 69 attributes remaining after removing the two attributes causing target leakage. (b) TOP-20 attributes from the 49 remaining after removing all attributes with p-value ≤ 0.05.

To cross-validate the influential attributes identified via SHAP analysis, we compiled a list of key attributes based on the logistic regression coefficient values shown in Figure 4. Logistic regression identified previous stroke details, blood pressure at admission (diastolic and systolic), oral anticoagulants, RHD (rheumatic heart disease), and GCS at admission, among others, as top attributes.
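A minimal sketch of ranking attributes by logistic regression coefficients on standardized inputs; the data and attribute names below are hypothetical, and standardizing first is a choice that makes coefficient magnitudes comparable across attributes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical attribute names.
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=0)
names = [f"attr_{i}" for i in range(6)]

# Standardize so coefficient magnitudes are comparable, then fit.
Xs = StandardScaler().fit_transform(X)
lr = LogisticRegression(max_iter=1000).fit(Xs, y)

# Rank attributes by absolute coefficient value (largest first).
ranking = sorted(zip(names, lr.coef_[0]), key=lambda t: abs(t[1]),
                 reverse=True)
for name, coef in ranking:
    print(f"{name}: {coef:+.3f}")
```

The sign of each coefficient additionally indicates which class the attribute pushes toward, which SHAP summary plots convey through color.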

Figure 4

Top attributes based on logistic regression coefficient values. (a) TOP-20 attributes from the 69 attributes remaining after removing the two attributes causing target leakage. (b) TOP-20 attributes from the 49 remaining after removing all attributes with p-value ≤ 0.05.

Attributes identified as important in both the SHAP analysis and logistic regression include blood pressure at admission (diastolic and systolic), GCS at admission, atrial fibrillation, diabetes, and stroke onset time. Meanwhile, serum electrolytes (sodium and potassium), platelet count, hemoglobin, previous stroke details, oral anticoagulants, and RHD appeared significant in only one of the two analyses.

Additionally, we plot the learned decision tree (height = 3) in Figure 5 to visualize the attribute used at each node in the top three levels of the tree. These nodes show the importance of each attribute in differentiating IS and HS samples at that point in the tree. The tree confirms that the top attributes from the SHAP and logistic regression analyses appear at prominent nodes, and it also reveals the importance of serum electrolytes (absent from the logistic regression ranking) and previous stroke history (absent from the SHAP analysis) for stroke type classification.
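A depth-3 tree of this kind can be reproduced in outline as follows, using synthetic data and CART as implemented by sklearn's DecisionTreeClassifier (the attribute names are hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

# CART tree capped at height 3, as in Figure 5.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Text rendering of the splits; the attributes chosen near the root are
# the ones the tree finds most discriminative.
print(export_text(tree, feature_names=[f"attr_{i}" for i in range(10)]))
```

Capping the depth keeps the tree readable: only the most discriminative attributes survive into the top three levels.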

Figure 5

Visualization of the decision tree (CART) of height = 3. (a) From the 69 attributes remaining after removing the two attributes causing target leakage. (b) From the 49 attributes remaining after deleting all attributes with p-value ≤ 0.05. (Dotted leaf nodes indicate that there are more nodes in the tree.)

Prospective Data Analysis

Using the same configuration that provided the best performance in EXP-4, our framework achieves a weighted accuracy of 70.90% on the prospective dataset, with a corresponding sensitivity of 81.66% and specificity of 60.15%. This represents a 9.94% reduction in weighted accuracy compared to the retrospective analysis. Supplementary Figure 1 provides a visual comparison of performance between the two datasets.

To understand the causes of this performance difference, we conducted a leave-one-year-out analysis instead of the standard 10-fold cross-validation on the retrospective data. Models were trained on samples from all but one year and evaluated on the samples from the excluded year, which served as the test set. Supplementary Table 1 shows that weighted accuracy varied by year: 73.47% (2014), 78.49% (2015), 84.88% (2016), 78.07% (2017), and 80.71% (2018). However, the performance for 2019 (69.81%) and 2020 (71.61%) closely aligns with the performance achieved on the prospective data. This temporal variation suggests a gradual shift in the data distribution over time, affecting the performance of models trained on data from 5-10 years earlier.
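The leave-one-year-out procedure can be sketched as below; the data, model, and year assignments are synthetic placeholders for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

# Synthetic stand-in data with a random year label per sample.
X, y = make_classification(n_samples=700, n_features=10, n_informative=4,
                           random_state=0)
rng = np.random.default_rng(0)
years = rng.choice(np.arange(2014, 2021), size=700)

# Train on all years but one; evaluate on the held-out year.
results = {}
for held_out in np.unique(years):
    train, test = years != held_out, years == held_out
    clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    results[int(held_out)] = balanced_accuracy_score(
        y[test], clf.predict(X[test]))

for yr, acc in sorted(results.items()):
    print(f"{yr}: weighted accuracy {acc:.3f}")
```

Grouping the split by year, rather than shuffling samples as 10-fold cross-validation does, is what exposes temporal distribution shift.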

Performance on complete data samples (no missing values)

Of the total 2190 samples, we identified 1892 samples with no missing data in the 19 most important attributes determined by SHAP. On these, the framework achieved a weighted accuracy of 79.06% with 10-fold cross-validation and 71.45% on the prospective dataset. These results are comparable to the performance of the framework using the most important attributes after MICE imputation of the remaining 298 samples.
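Assuming the imputation step refers to chained-equations (MICE-style) imputation, a minimal sketch using sklearn's IterativeImputer on synthetic data:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data with ~10% values missing at random.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
X[rng.random(X.shape) < 0.10] = np.nan

# Chained-equations imputation: each attribute with missing values is
# modeled iteratively as a function of the others.
X_imputed = IterativeImputer(random_state=0).fit_transform(X)
assert not np.isnan(X_imputed).any()
print("remaining NaNs:", int(np.isnan(X_imputed).sum()))
```

Because each attribute is regressed on the others, this approach exploits between-attribute correlations rather than filling in a single column statistic.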


