COVID-19 risk stratification among older adults: a machine learning approach to identify personal and health-related risk factors

Given the lack of predefined labels for COVID-19 risk stratification in our dataset, we used K-modes clustering to identify latent subgroups based on categorical variables describing individuals’ knowledge, perception, and health-related issues. K-modes was selected because it is well suited to clustering categorical data: it minimizes dissimilarity through simple matching. In the K-modes algorithm, the clustering cost, also referred to as inertia, quantifies the overall dissimilarity within a clustering solution. Specifically, it is defined as the sum of dissimilarities between each data point and the mode (centroid) of the cluster to which it is assigned. The dissimilarity is calculated using a simple matching measure, which counts the number of mismatched categorical attributes between a data point and the cluster mode. Lower clustering cost values indicate tighter, more cohesive clusters, reflecting higher internal similarity among clustered instances. This metric serves as an important criterion for evaluating and comparing the compactness of clustering configurations, particularly in categorical datasets.
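The cost definition above can be illustrated with a short sketch. The records, modes, and cluster labels below are hypothetical toy values chosen for illustration, not data from the study.

```python
def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def clustering_cost(records, modes, labels):
    """Sum of dissimilarities between each record and its assigned cluster mode."""
    return sum(matching_dissimilarity(r, modes[k]) for r, k in zip(records, labels))


# Hypothetical toy records: (age band, sex, hypertension)
records = [
    ("60-64", "M", "no"),
    ("60-64", "M", "yes"),
    ("70-74", "F", "yes"),
    ("65-69", "F", "yes"),
]
modes = [("60-64", "M", "no"), ("70-74", "F", "yes")]  # one mode per cluster
labels = [0, 0, 1, 1]                                   # cluster of each record

cost = clustering_cost(records, modes, labels)
print(cost)
```

Each mismatched attribute adds one unit to the cost, so a solution whose records sit close to their modes yields a smaller total.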

Based on the clustering results (Table 1), the configuration that used 3 clusters with the Cao initialization method and 5 initializations was selected. The cluster head assignment process involved three key steps: (1) initial cluster heads were selected using an initialization method (in our case, the Cao method); (2) each record was assigned to the cluster head it most closely resembled, measured by the Hamming distance (the count of differing categorical attributes); and (3) cluster heads were updated iteratively by recalculating the mode of each feature within the cluster until the assignments stabilized. This configuration achieved a silhouette score of 0.1584, a Dunn index of 0.9983, and a clustering cost (inertia) of 13,309. These metrics indicated a reasonable trade-off between cluster separation and compactness.
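Steps (2) and (3) can be sketched in a few lines of Python. This is a simplified illustration, not the implementation used in the study: the initial modes are passed in by hand rather than produced by the Cao method, and the toy records are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of categorical attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most frequent category in each column of a non-empty list of rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(records, initial_modes, max_iter=100):
    modes = list(initial_modes)  # step 1 is assumed done by the caller
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each record to the nearest mode by Hamming distance.
        new_labels = [
            min(range(len(modes)), key=lambda k: hamming(r, modes[k]))
            for r in records
        ]
        if new_labels == labels:  # assignments stabilized
            break
        labels = new_labels
        # Step 3: recompute each mode from its current members.
        for k in range(len(modes)):
            members = [r for r, lab in zip(records, labels) if lab == k]
            if members:
                modes[k] = column_modes(members)
    return labels, modes


# Hypothetical toy records: (age band, sex, hypertension)
records = [
    ("60-64", "M", "no"),
    ("60-64", "M", "no"),
    ("60-64", "F", "no"),
    ("70-74", "F", "yes"),
    ("70-74", "F", "yes"),
    ("65-69", "F", "yes"),
]
labels, modes = k_modes(records, initial_modes=[records[0], records[3]])
print(labels, modes)
```

Running the loop from several different initializations and keeping the lowest-cost solution corresponds to the "5 initializations" setting described above.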

Table 1 Clustering configuration performance metrics

Cluster centroids and descriptive statistics

The three clusters exhibited distinct profiles of knowledge, perception, and health-related issues related to COVID-19 risk; these are presented in Table 2.

Table 2 Cluster centroids and descriptive statistics

Cluster 0 consisted predominantly of younger males aged 60–64, all living in urban areas, and generally in good health, without conditions such as hypertension, diabetes, cancer, or cardiovascular or pulmonary disease. These individuals had not undergone PCR testing and had no household members diagnosed with COVID-19, suggesting limited exposure. They scored high on knowledge of COVID-19 treatments and vaccines, although they maintained a moderate perception of the risk, severity, and vulnerability to the disease, reflecting a balanced but cautious outlook.

Cluster 1, on the other hand, was slightly older, with individuals primarily aged 70–74 and predominantly female, also residing in urban areas. This group had some underlying health-related issues, notably hypertension, but was otherwise in relatively good health. Like Cluster 0, they had not undergone PCR testing and had no household members with COVID-19. Their knowledge score was moderate, and they exhibited a slightly higher perception of COVID-19 risk, severity, and vulnerability, likely influenced by their age and health condition.

Cluster 2 fell between the other two in age, with individuals primarily aged 65–69, mostly female, and residing in urban areas. While they shared a similar health profile with Cluster 1, they differed in having a household member diagnosed with COVID-19. This direct exposure likely contributed to their higher knowledge score and the highest perceived levels of COVID-19 risk, severity, and vulnerability among the clusters. This group’s high level of knowledge and risk perception was likely shaped by their closer proximity to the disease.

The correlation heatmap visually represents the pairwise relationships between features, highlighting the strength and direction of correlations to facilitate a deeper understanding of their interdependencies (Fig. 1).

Supervised classification

For all algorithms, the dataset was divided into training and testing subsets using a 70:30 ratio. This split was performed before any preprocessing to prevent data leakage and to keep the training and testing phases clearly separate. Additionally, to provide a more robust assessment of each model’s generalizability, we employed a 10-fold cross-validation (CV) approach across all models.
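A minimal sketch of the splitting logic follows. The sample size, random seed, and helper names are illustrative assumptions, and the sketch assumes the folds are drawn from the training portion, which the text does not state explicitly.

```python
import random


def train_test_split_indices(n, test_fraction=0.30, seed=42):
    """Shuffle row indices once and hold out a fixed fraction for testing."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def k_fold_indices(indices, k=10):
    """Yield (train, validation) index lists; each index is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, val


# Hypothetical dataset of 1000 rows.
train_idx, test_idx = train_test_split_indices(1000)
folds = list(k_fold_indices(train_idx, k=10))
print(len(train_idx), len(test_idx), len(folds))
```

Any encoder or scaler would then be fitted on the training indices only and applied unchanged to the held-out rows, which is what keeps information from the test set out of training.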

In handling categorical features, different models employed varied preprocessing techniques. CatBoost transformed categorical features into string format to leverage its built-in support for categorical data. XGBoost, GLM, and Decision Tree models all used label encoding to convert categorical features into numerical representations. For H2O DNN, categorical features were initially encoded as strings and subsequently converted into H2O frames. In contrast, L2 SVM required categorical features to be one-hot encoded, which created a binary column for each category to facilitate its processing.
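The difference between label encoding and one-hot encoding can be shown with a small sketch. The helper functions and the age-band values are hypothetical and written in plain Python for illustration; the study's own code may rely on library encoders instead.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {cat: code for code, cat in enumerate(sorted(set(values)))}


def label_encode(values, mapping):
    """Replace each category with its integer code."""
    return [mapping[v] for v in values]


def one_hot_encode(values, mapping):
    """One binary column per category; unseen categories encode as all zeros."""
    width = len(mapping)
    rows = []
    for v in values:
        row = [0] * width
        if v in mapping:
            row[mapping[v]] = 1
        rows.append(row)
    return rows


# Hypothetical categorical feature from the training subset.
age_band = ["60-64", "70-74", "65-69", "60-64"]
mapping = fit_label_encoder(age_band)

label_codes = label_encode(age_band, mapping)
one_hot = one_hot_encode(age_band, mapping)
print(label_codes)
print(one_hot)
```

Label encoding yields a single integer column, which suits tree-based learners, whereas one-hot encoding avoids implying an order between categories, which matters for a linear model such as an SVM.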

Hyperparameter tuning

Hyperparameter tuning was performed for various models using GridSearchCV to optimize critical parameters. The CatBoost model achieved a peak accuracy of 92.65% with a tree depth of 4 and 200 iterations, while XGBoost reached 94.56% peak accuracy with specific configurations including a colsample_bytree of 0.8 and a maximum depth of 3. The GLM model recorded an accuracy of 84.54% with an L2 penalty, and the Decision Tree improved performance with a maximum depth of 15. The L2 SVM model attained the highest accuracy of 97.24% using a linear kernel, and the Random Forest model achieved 84.93% accuracy with 200 estimators.
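The exhaustive search that a grid search performs can be sketched as follows. The grid values and the scoring function are hypothetical stand-ins: in practice the score would be a cross-validated accuracy obtained by fitting the model, not the toy formula used here.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and keep the best-scoring one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical stand-in for cross-validated accuracy (peaks at depth 4, 200 iterations)."""
    return -abs(params["depth"] - 4) - abs(params["iterations"] - 200) / 100


param_grid = {"depth": [2, 4, 6], "iterations": [100, 200]}
best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```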

Evaluation metrics

The classification models’ performance was evaluated by comparing predicted values to the original labels using the confusion matrix, accuracy, recall, precision, and F1-score (Table 3). The L2 SVM model outperformed the others, with high accuracy and F1-scores across all classes, while CatBoost and XGBoost performed well, especially for class 0, but struggled with recall for class 2. The Random Forest, GLM, and Decision Tree models exhibited varied performance across the classes.
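The per-class metrics named above can be derived from a confusion matrix as in the following sketch. The true and predicted labels are hypothetical toy values, not results from the study.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def accuracy(cm):
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def per_class_metrics(cm):
    """Precision, recall, and F1 for each class, treated one-vs-rest."""
    n = len(cm)
    out = {}
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, truly another class
        fn = sum(cm[c]) - tp                       # truly c, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        out[c] = {"precision": precision, "recall": recall, "f1": f1}
    return out


# Hypothetical labels for three risk classes (0, 1, 2).
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 0, 1]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
metrics = per_class_metrics(cm)
print(cm, accuracy(cm), metrics)
```

In this toy example class 2 has perfect precision but a recall of only 0.5, the same kind of pattern described above for models that struggle with recall on one class.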

Table 3 Performance metrics for classification models using split data

The ROC curves for Classes 0, 1, and 2 revealed the performance differences among classification models in distinguishing these classes (Fig. 2). Class 0 demonstrated exceptional performance from the L2 SVM, CatBoost, XGBoost, and Random Forest, each achieving an AUC of 0.99. In Class 1, L2 SVM and CatBoost excelled with a perfect AUC of 1.00, while Random Forest and Decision Tree showed diminished performance. For Class 2, all models faced challenges, particularly CatBoost and XGBoost with AUCs of 0.99, necessitating further refinement and alternative strategies to enhance classification accuracy.
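Per-class AUC values in a three-class problem are commonly obtained one-vs-rest; the text does not say how they were computed here, so the following is only an illustrative sketch. It uses the rank interpretation of AUC, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, and the predicted probabilities are hypothetical.

```python
def binary_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, class_scores, target):
    """Treat `target` as the positive class and all other classes as negative."""
    labels = [1 if y == target else 0 for y in y_true]
    scores = [row[target] for row in class_scores]
    return binary_auc(labels, scores)


# Hypothetical true classes and predicted class probabilities.
y_true = [0, 0, 1, 1, 2, 2]
class_scores = [
    [0.80, 0.10, 0.10],
    [0.60, 0.30, 0.10],
    [0.20, 0.70, 0.10],
    [0.30, 0.40, 0.30],
    [0.10, 0.20, 0.70],
    [0.40, 0.35, 0.25],
]

auc_by_class = {c: one_vs_rest_auc(y_true, class_scores, c) for c in (0, 1, 2)}
print(auc_by_class)
```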

As Table 4 shows, the performance of classification models was also evaluated using 10-fold CV, and L2 SVM showed the highest accuracy (0.9624).

Table 4 Performance metrics for classification models using 10-fold CV

Figure 3 compares performance of the models (Accuracy, F1-Score, Precision, Recall) across 3 classes (0–2) using 10-fold CV.

SHAP

The SHAP values offer valuable insights into the contribution of individual features to the model’s predictions for each class. The analysis of mean absolute SHAP values for Classes 0, 1, and 2 reveals distinct patterns of feature importance, underscoring the specific attributes that significantly impact the model’s classification decisions.
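The mean absolute SHAP value of a feature is the average magnitude of its per-sample SHAP values for a given class. The sketch below shows only that aggregation step; the SHAP matrix is a hypothetical toy example, and in practice the values would come from an explainer applied to the fitted model.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by the mean absolute SHAP value across samples."""
    n = len(shap_rows)
    means = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical SHAP values for one class: rows are samples, columns are features.
feature_names = ["k_drug", "risk_cov_vulner", "HTN"]
shap_rows = [
    [1.0, -0.5, 0.2],
    [-1.5, 0.5, -0.1],
    [0.5, -1.0, 0.0],
]

ranking = mean_abs_shap(shap_rows, feature_names)
print(ranking)
```

Because the absolute value is taken before averaging, the ranking reflects how strongly a feature moves predictions, not the direction in which it moves them.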

For Class 0, the analysis indicated several influential features (Fig. 4). The top feature, k_drug, had a mean absolute SHAP value of 1.2403, suggesting that the presence or significance of certain drugs strongly influenced the model’s classification into Class 0. Another critical feature was risk_cov_vulner (0.7766), which signified that vulnerability risk was crucial for predicting Class 0. Additionally, risk_cov_dis (0.5936) played a significant role, with higher values indicating a stronger association with Class 0. Moderately influential features included k_GI_only (0.5239) and knowledge_score (0.5284), both of which helped distinguish Class 0. Other notable features for this class included knowledge of incubation period (k_period) (0.4514), HTN (1.1476), and gender (1.3349). Overall, Class 0 was heavily influenced by features related to drug use and vulnerability risk, suggesting that these factors were critical for classification.

The analysis for Class 1 revealed different insights (Fig. 5). The most influential feature was knowledge of incubation period (k_period), with a mean absolute SHAP value of 1.6309, underscoring its critical role in determining Class 1. Another key feature was knowledge of GI symptoms (k_GI_only) (1.4399), indicating significant relevance in the classification. HTN (0.9175) also appeared as an important predictor for this class. Moderately influential features included risk_cov_severity (0.6806) and risk_cov_vulner (0.6635). Knowledge_score (0.5302) remained an important factor for Class 1, further emphasizing the importance of health knowledge in this classification. Other notable features included gender (0.9520) and fam_hygien_cov (0.1262). The results showed that Class 1 was predominantly influenced by knowledge-related features such as k_period and k_GI_only, along with health conditions like hypertension.

For Class 2, several features stood out in significance (Fig. 6). The feature fam_hygien_cov had the highest SHAP value, 3.2488, indicating a strong association with Class 2. Risk_cov_dis (1.0851) played a crucial role in this classification, while familmem_covid (0.6340) was notable because it captures the presence of COVID-19 in family members. Moderately influential features for Class 2 included knowledge_score (0.8038), risk_cov_severity (0.9545), and HTN (0.5044). Other notable features for this class included gender (0.4454) and urban (0.0689). The results for Class 2 indicated that hygiene and family health considerations were particularly influential, reflecting the complexities in predicting this class compared to the others.

The obtained patterns

The results showed that family hygiene practices significantly impacted COVID-19 risk, but high hygiene standards alone did not fully prevent transmission. Household transmission may occur through close contact and shared spaces, and even excellent hygiene cannot completely eliminate the risk when an infected person is present. Households with multiple positive cases were still classified as high risk, demonstrating that transmission within close-knit environments could surpass even stringent hygiene measures. Regular PCR testing helped lower infection rates by identifying positive cases early, but false negatives and limited accessibility reduced its effectiveness, so testing must be combined with other strategies and improved access.

Misinformation about vaccines negatively affected vaccination rates, particularly in hesitant communities, highlighting the need for targeted public health messaging. Vaccine hesitancy fueled by misinformation is well-documented globally, emphasizing the importance of culturally sensitive and targeted communication.

Key factors shaping risk perception included household COVID-19 history, adherence to hygiene practices, perceived severity of the illness, and the presence of pre-existing health conditions such as hypertension and diabetes, which enhanced vulnerability. The study found that pre-existing health conditions amplified individual risk, even in households maintaining high hygiene standards. This aligns with other evidence that shows health-related issues like hypertension and diabetes increase susceptibility and severity, so risk assessments must incorporate health status, not just behavior. Household dynamics played a significant role in COVID-19 spread, with faster transmission in close-knit environments. Urban residents tended to have better knowledge of COVID-19 and lower risk classifications than rural residents; rural counterparts faced greater challenges in accessing accurate information and healthcare resources, highlighting the need for tailored outreach in rural areas.

Gastrointestinal (GI) symptoms, age, and health-related issues were strong predictors of higher risk classifications. Risk perception was a vital element in behavior modification; individuals with better COVID-19 knowledge made safer decisions yet may have underestimated their risk. Public health strategies must align perceived vulnerability with actual risks. Notably, individuals with high perceived severity of the disease were often classified as moderate risk, suggesting that severity perceptions may have overshadowed factors like hygiene and knowledge in influencing risk classifications. This reflects common psychological phenomena such as optimism bias or denial, where people’s risk perceptions do not always align with reality, underscoring the need to address cognitive biases in public health messaging.

Unexpected patterns emerged in the analysis, particularly among rural males, who, despite low knowledge of treatments, showed favorable views of vaccines and were classified as lower risk, especially when family exposure was minimal. This may be due to cultural factors or trust in vaccines as preventive tools. Interestingly, low COVID-19 knowledge did not always correlate with risky behavior, as some individuals with high perceptions of disease severity and strong hygiene practices may still have fallen into moderate risk categories. However, contradictory risk perceptions were concerning. Individuals who perceived low personal risk yet acknowledged the severity of the disease, as well as those with high COVID-19 knowledge and significant household exposure, might still be classified as high risk. This complex interplay between knowledge and vulnerability often led individuals to underestimate their personal risk, potentially increasing the likelihood of negative health outcomes. These contradictions call attention to the limits of simplistic risk models and the importance of multifactorial approaches that consider behavioral, social, and psychological factors.

Households maintaining high hygiene standards might still be classified as high risk when multiple COVID-19 cases were present, revealing that transmission could overwhelm preventive measures. Conversely, individuals exhibiting GI symptoms, a known predictor of more severe disease outcomes, may have been classified as low risk if they had controlled hypertension and good family hygiene practices. This suggests that some pre-existing conditions may mitigate the impact of other risk factors, and risk models must consider interactions between factors rather than just additive effects. Additionally, low COVID-19 knowledge did not necessarily indicate higher risk, as seen in individuals who, despite limited knowledge, practiced strong hygiene measures and maintained lower risk profiles. This shows that behavior can sometimes override knowledge gaps, highlighting the importance of practical interventions.
