A data-driven machine learning framework to predict side effects of AstraZeneca and sinopharm COVID-19 vaccines

This study represents the inaugural extensive investigation regarding COVID-19 vaccinations among populations in northwestern Iran. The primary objective of this research was to identify potential factors contributing to adverse events following the administration of AstraZeneca and Sinopharm COVID-19 vaccines using ML tools. A comprehensive evaluation of various side effects was conducted, with the most frequently reported symptoms including body pain, headache, fever, confusion, shivering, cough, shortness of breath, muscle spasms, nausea, vomiting, diarrhea, pain at the injection site, redness at the injection site, muscle spasms at the injection site, warmth at the injection site, armpit pain, and localized itching. The documented side effects after administering the first and second doses were classified into three categories: local, systemic, and total symptoms. The total category encompasses both local and systemic symptoms. While these side effects are generally not life-threatening, some vaccinated individual’s data did report experiencing severe reactions. A notable statistical correlation was identified between the local and systemic side effects of both doses and factors such as age, gender, education, BMI, and marital status.

The findings indicated that, in comparison to individuals aged ≥ 60 years, every other age group demonstrated a higher propensity for experiencing vaccine-related side effects. Specifically, all age groups, except those aged ≥ 60, were at an elevated risk of exhibiting such side effects. Notably, the younger demographic (ages 18–34) showed an increased likelihood of encountering vaccine-related adverse effects. The results align with previous research, which indicated that advancing age correlates with a decrease in both the frequency and intensity of side effects^34,35. Elnaem et al. noted that individuals in younger age brackets exhibited a greater likelihood of experiencing side effects compared to those aged ≥ 60 years³⁶. Additionally, another investigation assessed the side effects between two distinct age categories, ≤ 50 years versus > 50 years, after the first dose of the Oxford–AstraZeneca COVID-19 vaccine was given. It was noted that adults aged ≤ 50 years exhibited a greater frequency and severity of reactions than those > 50 years of age³⁷. This phenomenon is probably attributable to a more vigorous immune response in younger people, resulting in an increased occurrence of side effects. According to our research, females reported a more increased incidence of adverse effects than their male counterparts. Consistent with the findings of this study, a cross-sectional investigation conducted in Malaysia among the general population also found that females experienced a greater number of adverse events than males³⁶. In a different study that was conducted in Germany by Hoffman et al. the researchers aimed to assess the occurrence of local and systemic unfavorable effects following COVID-19 vaccinations. The findings revealed that women were more prone to experiencing both types of side effects³⁸. This contrasts with a study carried out in Saudi Arabia, which reported post-vaccination side effects within the general population, revealing that males underwent a more increased incidence of negative effects compared to females. The study identified sex and age as the primary factors influencing the likelihood of experiencing side effects related to the COVID-19 vaccine^39,40.

A greater likelihood of experiencing certain side effects from the COVID-19 vaccine was observed in individuals classified as overweight (including both overweight and obese) when compared to those who were not overweight (underweight and normal). However, it is noteworthy that Isabel Iguacel et al. indicated that individuals with a non-overweight status exhibited a heightened risk of developing fever ≥ 38°, vomiting, diarrhea, and chills in comparison to their overweight counterparts. Nevertheless, after adjusting for variables such as sex and age, the majority of the significant associations between vaccine side effects and weight status lost their significance⁴¹.

Electronic health data present a significant opportunity for enhancing healthcare services. The management of these extensive and intricate datasets necessitates specialized computational methods capable of effectively processing them. Machine learning techniques have extensive applications within the healthcare sector and are instrumental in recognizing patterns within large datasets. Advances in machine learning, along with improved model interpretability and robust methods for calculating and visualizing the influence of input variables on model outputs, can facilitate the translation of scientific knowledge into practical applications^42,43,44. The utilization of ML is advantageous in the context of our multidimensional datasets, as it excels in managing numerous input variables. During the pandemic, ML and deep learning (DL) provided an efficient method for swift COVID-19 screening and the identification of potential high-risk subjects, ultimately enhancing care amenities and mitigating severe symptoms⁴⁵.

The research surrounding COVID-19 significantly leveraged ML and artificial intelligence (AI). In the present study, we also concentrated on a comparative analysis of various classification methods. The research employs ML techniques, SVM, LR, RF, XGB, ANN, KNN, DT, GB for the purpose of comparative analysis and predictions. The usefulness of the developed models in forecasting local and systemic side effects, as well as their severity after receiving the first and second doses is evaluated using several metrics, including AUC, sensitivity, and specificity. Among the commonly employed machine learning methods in the fight against COVID-19, Gutierrez et al. utilized gradient-boosting decision trees to assess the likelihood of hospitalization within 30 days after a diagnosis of SARS-CoV-2 infection, using Shapley values to analyze the importance of different variables⁴⁶. The researchers utilized the XGBoost methodology to develop a gradient-boosting model and assessed its efficacy against four empirical risk stratification factors, which were influenced by age and the existence of comorbid conditions. Through the use of routinely gathered health administrative data, a specific risk stratification model was developed and validated. The authors suggested that risk stratification obtained from consistently compiled health information could improve the management of COVID-19 on a population scale^5,46,47.

In our analysis, SVM, XGBoost, and Random Forest frequently achieved higher predictive performance compared to other algorithms, particularly KNN. The relatively strong performance of SVM can be attributed to several well-established properties: its ability to maximize the decision margin, its robustness in moderate-dimensional and moderately balanced datasets such as ours, and the flexibility of kernel functions to capture non-linear relationships. These characteristics allow SVM to generalize effectively in clinical prediction tasks. By contrast, ensemble tree-based methods like RF and XGBoost mitigate the limitations of single decision trees by aggregating multiple weak learners, thereby providing robustness against noise and the capacity to model complex, non-linear associations. Collectively, these findings suggest that while several algorithms performed well, SVM maintained a consistent advantage, although differences were not always statistically significant when tested with DeLong’s method. For the second dose, the forecasting capability of all models enhanced, attributed to the incorporation of initial dose side effects as input characteristics. In our analysis, the KNN algorithm consistently and significantly performed the lowest, likely due to its sensitivity to data dispersion, feature scaling, and dimensionality problems.

The latest research carried out by Canas et al. ML was applied to distinguish between adverse events from vaccinations and early infections of COVID-19. It was reported that, despite some variations in the prevalence and distribution of symptoms between individuals who tested positive and those who tested negative, these differences were insufficiently robust to effectively distinguish between the groups, even when utilizing ML techniques⁴⁸. Ahamad et al. in a momentous study effort sought to identify potential common factors contributing to post-vaccination side effects to facilitate their prediction. This research analyzed patient medical records along with information about post-vaccination effects and outcomes. A range of statistical techniques was utilized to assess the data, which was later enhanced by various machine learning classification algorithms. The results revealed that specific characteristics were significantly associated with negative patient reactions in many cases. Key factors identified included previous infections, hospitalization, and re-infection with SARS-CoV-2. Furthermore, existing health conditions like the patient’s age, gender, history of allergies, use of concurrent medications, type-2 diabetes, hypertension, and heart disease have been identified as the key factors linked to poor outcomes and prolonged hospital stays. By employing classification techniques, this research clarifies the characteristics of individuals who might require increased observation and support to reduce negative outcomes¹². In a recent study, a research group investigated the side effects of the COVID-19 vaccine using machine learning algorithms. In this study, various machine learning models were used to predict and classify side effects. According to the reported results, models such as XGBoost, Random forest, and SVM were able to predict local and systemic side effects with high accuracy. The characteristics examined in their study were age, type of vaccine, and time of injection⁴⁹.

Shahid et al. also examined the role of ML in addressing the virus to date, focusing on screening, prediction, and the development of vaccines. They provided a comprehensive outline of the ML algorithms and models applicable to this endeavor in the fight against the pandemic⁵⁰. Additionally, GB was employed to model the influence of temperature and humidity on the rate of COVID-19 transmission in India⁵¹. Kaliappan et al. evaluated the performance of different non-linear regression techniques, such as KNN, SVR, GB, RF, and XGBoost, in predicting the COVID-19 reproduction rate, taking into account the impact of feature selection algorithms and hyperparameter tuning. A total of sixteen factors, such as the number of cases and deaths per million, were examined to forecast the COVID-19 reproduction rate, focusing on critical parameters like testing, mortality, active cases, positivity rate, stringency index, and population density. The performance of the algorithms with and without feature selection was comparable; however, a significant improvement was noted with hyperparameter tuning⁴⁷.

In a study, a total of 2,213 participants received various vaccines, including Sinopharm, AstraZeneca, and Pfizer-BioNTech. The analysis, which considered vaccine type, demographic information, and adverse events, revealed that the RF, XGBoost, and MLP models achieved notable accuracies of 0.80, 0.79, and 0.70, respectively, along with Cohen’s kappa values of 0.71, 0.70, and 0.56. ML techniques can also be employed to forecast the severity of side effects based on the collected data; cases anticipated to be severe may necessitate increased medical intervention or hospitalization. The effective integration of ML tools to assess the severity of post-vaccination side effects represents a promising strategy for enhancing patient health predictions through extensive medical databases, ultimately contributing to improved healthcare services⁷. Additionally, another study utilized machine learning and ensemble learning methods, including SVM, random forest, decision tree, light gradient boosting machine, XGBoost, extra tree classifier, gradient boosting, AdaBoost, logistic regression, and K-nearest neighbors for comparative analysis and predictions. The most significant factors linked to adverse events were identified as the patient’s age, gender, allergies, and medical history. The study focused on the number of deaths, hospitalizations, and SARS-CoV-2-positive symptoms as target variables. The random forest classification model demonstrated superior accuracy in performance metrics compared to other machine learning models. Furthermore, the research indicated that individuals at higher risk for adverse events post-vaccination tend to be older, have allergies, or possess a prior medical history. The study primarily investigated the effects of various mRNA-based COVID-19 vaccines across different demographic groups, emphasizing those most impacted in the current vaccination landscape⁵².

In a separate investigation focusing on health-related attributes and individual characteristics, a ML methodology was developed to forecast possible negative effects following COVID-19 vaccination. Utilizing a substantial cohort of individuals who received COVID-19 vaccines, a novel and tailored approach was created to accurately foretell the likelihood of the most prevalent unfavorable outcomes. This method can act as a resource for guiding COVID-19 vaccine choice and producing personalized information sheets to alleviate worries regarding adverse effects⁵³.

In our analysis, in order to improve the efficiency of ML models, we used the feature selection technique. Feature selection was performed using Random Forest feature importance, an embedded method that ranks predictors based on their contribution to model performance. However, the results of the evaluation indicated that the performance of the models after feature selection was not significantly improved. This may be due to the relatively moderate number of predictors in our dataset, which limited the potential benefit of removing variables. Moreover, several of the applied algorithms, particularly ensemble methods such as Random Forest and Gradient Boosting, inherently manage irrelevant or less informative predictors through mechanisms like split-based variable selection and regularization. Therefore, while feature selection enhanced the interpretability of the models by identifying the most influential variables, its impact on predictive accuracy was minimal.

The strengths of our research included an analysis of vaccine side effects from three distinct perspectives: local, systemic, and overall, with a separate discussion on the efficacy of each machine learning tool in predicting these side effects. Additionally, we investigated side effects associated with the two vaccines most frequently administered in Iran. Future research could improve this approach by incorporating a broader range of input and output data, enhancing both expediency and precision. Moreover, this methodology has the potential to be applied to other vaccines and medications, rather than being limited to COVID-19 vaccines. The use of SHAP analysis to improve transparency is another important strength of this study. By applying SHAP across different modeling scenarios, we were able to identify the most influential predictors and visualize how they contributed to the predicted risk of vaccine side effects. This approach not only augments traditional performance metrics like AUC and Brier score but also improves the clinical interpretability of our findings, aligning with contemporary guidelines for explainable artificial intelligence in healthcare. Future studies could extend this approach by comparing SHAP with other explainability tools such as LIME.

A limitation of our study is that we employed only classical supervised machine learning algorithms, as the dataset was relatively small and tabular in structure. Future studies with high-dimensional and larger multicenter datasets could incorporate advanced deep learning approaches to further improve predictive performance and capture complex nonlinear relationships.

We acknowledge that the dataset contained a higher proportion of female participants, which may influence the observed associations, as females have been shown to report more frequent vaccine-related side effects. This imbalance could introduce bias and limit generalizability. Future studies should aim to recruit more balanced cohorts across sex, age, and other demographic characteristics to strengthen external validity. Moreover, our current dataset did not include booster-related outcomes; however, integrating longitudinal follow-up data would allow predictive modeling of delayed or persistent adverse effects. This represents a key direction for future work.

External validation was not performed in this study due to limited access to comparable datasets. However, we plan to validate our models prospectively in future cohorts as additional vaccination and follow-up data become available. Such validation will be critical to assess generalizability across populations and vaccine types.

Source link