Employing supervised machine learning algorithms for classification and prediction of anemia among youth girls in Ethiopia



Design, data source, setting, and periods

This study used data from the 2016 Ethiopian Demographic and Health Survey (EDHS), a nationally representative cross-sectional survey. Ethiopia lies between latitudes 3° and 14°N and longitudes 33° and 48°E in the Horn of Africa and is structured into nine regional states (Tigray, Afar, Amhara, Benishangul-Gumuz, Gambela, Harari, Oromia, Somali, and Southern Nations, Nationalities, and Peoples’ Region) and two city administrations (Addis Ababa and Dire Dawa)20. With a population of more than 120 million, Ethiopia is the second-most populous country in Africa after Nigeria. The EDHS is part of the international Demographic and Health Surveys (DHS) program led by the United States Agency for International Development in collaboration with other organizations and host countries. The recorded data were accessed on request at www.measuredhs.com with the assistance of ICF International. The survey took place from January 18 to June 27, 2016, and used a multistage stratified sampling technique on 645 enumeration areas covering the entire nation. It included a nationally representative sample of 15,683 women aged 15–49 years21. In this study, we included a weighted sample of 5,642 youth girls aged 15–24 years as our final sample and analyzed 19 features.

Population of the study

All youth girls aged 15–24 years in Ethiopia were the source population for this study, whereas all youth girls aged 15–24 years in the selected enumeration areas (EAs) whose hemoglobin level was recorded were the study population.

Sampling procedures

The EDHS sample was selected using a stratified two-stage cluster sampling procedure. In the first stage, 645 EAs (202 in urban areas) were selected with probability proportional to size: in each stratum, a predetermined number of EAs was selected independently with probability proportional to the EA measure of size. A household listing was then carried out in every selected EA. In the second stage, after the household listing was complete, a fixed number of households was selected in each EA by equal-probability systematic sampling21. The detailed sampling procedure for each specific survey is available in the EDHS reports on the Measure DHS website (www.dhsprogram.com).

For this study, youths without a hemoglobin test result (not tested) and respondents older than 24 years were excluded, giving a final weighted analytic sample of 5,642 youth girls.

Study variables and measurements

Outcome variable

We used the individual women’s data file of the 2016 EDHS to extract the anemia status of youth girls. Anemia is defined as a hemoglobin level below 12 g/dl for non-pregnant and below 11 g/dl for pregnant youth girls. It was further categorized into mild, moderate, and severe anemia, with hemoglobin ranges of 10–11.9 g/dl, 7–9.9 g/dl, and less than 7 g/dl, respectively14,21. For the current study, we treated anemia as a binary outcome, coded 0 for non-anemic and 1 for anemic (merging mild, moderate, and severe anemia).
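The outcome definition can be written as a small function. The sketch below is only an illustration of the thresholds stated in the text; the function names and the choice to pass hemoglobin in g/dl and pregnancy as a boolean are assumptions, not the variable layout of the EDHS file.

```python
def anemia_severity(hb_g_dl: float, pregnant: bool) -> str:
    """Classify a hemoglobin value (g/dl) using the cutoffs given in the text."""
    # Anemia threshold: 11 g/dl for pregnant, 12 g/dl for non-pregnant respondents.
    cutoff = 11.0 if pregnant else 12.0
    if hb_g_dl >= cutoff:
        return "non-anemic"
    if hb_g_dl < 7.0:
        return "severe"
    if hb_g_dl < 10.0:
        return "moderate"
    return "mild"  # from 10 g/dl up to (but not including) the cutoff


def anemia_binary(hb_g_dl: float, pregnant: bool) -> int:
    """Binary outcome: 0 = non-anemic, 1 = anemic (mild, moderate, severe merged)."""
    return 0 if anemia_severity(hb_g_dl, pregnant) == "non-anemic" else 1


print(anemia_severity(10.4, pregnant=False), anemia_binary(10.4, pregnant=False))
```

Keeping the severity step separate from the binary step makes it easy to check that merging the three anemic categories gives exactly the 0/1 outcome.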

Independent variables

Age group: Current age of the woman, re-coded into two categories with values of “0” for 15–19 and “1” for 20–24 years.
Religion: Re-coded into four categories with values of “0” for Muslim, “1” for Orthodox, “2” for Protestant, and “3” for other religious groups (combining Catholic, traditional, and other religions, because few youth girls fall in these categories).
Wealth index: The dataset contained a wealth index created using principal components analysis and coded as “poorest”, “poorer”, “middle”, “richer”, and “richest”. For this study, we re-coded it into three categories: “poor” (the poorest and poorer categories), “middle”, and “rich” (the richer and richest categories).
Occupation: Re-coded into two categories with values of “0” for not working and “1” for working.
Media exposure: A composite variable obtained by combining whether a respondent reads a newspaper or magazine, listens to the radio, or watches television, with a value of “0” if she was not exposed to any of the three media and “1” if she had access or exposure to at least one of them.
Educational status: The highest educational level a woman attained, re-coded into three groups with values of “0” for no education, “1” for primary education, and “2” for secondary and above (combining the secondary and higher education categories).
Source of drinking water: Re-coded according to the DHS guide into two categories, “unimproved” and “improved” source21,22.
Family size: Re-coded into two categories, 1–4 and 5 or more.
Body mass index: Re-coded into three categories, with a value of 0 for underweight (< 18.5 kg/m2) and further values for normal weight and overweight or obese (≥ 25 kg/m2)23.
Altitude: The altitude of the cluster was categorized as high or low using 2500 m as the reference.
Type of place of residence: Recorded as rural or urban in the dataset and used without change.
Region: Coded into 11 categories in the dataset and retained without change.
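A few of these recodes are sketched below in plain Python. The function names, the lower-cased labels, and the boolean media arguments are hypothetical; the actual EDHS variable names and value labels are not given in the text and are not assumed here.

```python
# Collapse the five EDHS wealth quintiles into the three study categories.
WEALTH_RECODE = {
    "poorest": "poor",
    "poorer": "poor",
    "middle": "middle",
    "richer": "rich",
    "richest": "rich",
}


def recode_wealth(label: str) -> str:
    return WEALTH_RECODE[label.strip().lower()]


def recode_age_group(age: int) -> int:
    """0 for 15-19 years, 1 for 20-24 years; other ages are outside the study."""
    if 15 <= age <= 19:
        return 0
    if 20 <= age <= 24:
        return 1
    raise ValueError(f"age {age} is outside the 15-24 study range")


def media_exposure(reads_print: bool, listens_radio: bool, watches_tv: bool) -> int:
    """1 if exposed to at least one of the three media, otherwise 0."""
    return int(reads_print or listens_radio or watches_tv)


def recode_family_size(members: int) -> str:
    return "1-4" if members <= 4 else ">=5"


print(recode_wealth("Poorer"), recode_age_group(22), media_exposure(False, True, False))
```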

Data preprocessing and analytic strategies

Preparing raw data for analysis through data pre-processing is essential before building a prediction model in order to improve the model’s predictive performance. Data pre-processing involves techniques such as data cleaning, feature engineering, dimensionality reduction, and data splitting24. The specific workflow for this study is outlined in Fig. 1.

Figure 1. Workflow for this study.

Data cleaning

The initial step in data pre-processing is data cleaning, which involves identifying and removing outliers, handling missing values, and addressing imbalanced categories in the outcome variable. We explored several methods for managing missing data, including deletion, simple imputation, model-based imputation, and the use of domain-specific knowledge. Considering the nature of the missingness, the amount of data, the assumptions involved, and the machine learning algorithms used, we opted to handle missing values in our dataset using K-nearest neighbor (KNN) imputation. KNN imputation retains all data, handles outliers, does not assume missingness mechanisms, works for numerical and categorical features, adapts to new data, and minimizes bias while encompassing a wide range of values25. To identify outliers, we used visualization techniques such as scatter plots, box plots, and histograms, which enabled us to detect data points that deviated markedly from the overall pattern. Additionally, we assessed multicollinearity by examining the correlation matrix, treating a correlation above 0.8 between a pair of variables as indicative of high correlation.
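The two checks described above can be sketched with scikit-learn's `KNNImputer` and a NumPy correlation matrix. The data here are synthetic numeric columns generated only for illustration, and `n_neighbors=5` is an assumed setting; the text does not report the value of K that was used.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Synthetic data: column 3 is built to be strongly correlated with column 0.
X = rng.normal(size=(200, 4))
X[:, 3] = 0.95 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Knock out about 5% of the entries at random to mimic missing values.
mask = rng.random(X.shape) < 0.05
X_missing = X.copy()
X_missing[mask] = np.nan

# Each missing entry is filled with the mean of that feature over the
# 5 nearest rows (distance is computed on the observed features only).
imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X_missing)

# Flag feature pairs whose absolute correlation exceeds the 0.8 threshold.
corr = np.corrcoef(X_imputed, rowvar=False)
n_features = corr.shape[0]
high_pairs = [
    (i, j)
    for i in range(n_features)
    for j in range(i + 1, n_features)
    if abs(corr[i, j]) > 0.8
]
print("highly correlated pairs:", high_pairs)
```

For coded categorical features, averaging neighbor values can produce non-integer codes, so a rounding or mode-based step would be needed in practice; that step is not shown.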

Data balancing

Another data cleaning task was handling imbalanced data. Class imbalance is a significant challenge in data mining and machine learning because it can decrease classification accuracy, particularly for instances belonging to the minority class45. ML models trained on imbalanced data are typically biased toward the majority class and fail to predict rare (minority-class) cases26. Researchers have developed various mechanisms to address this issue. In this study, we employed four balancing methods27: under-sampling, over-sampling, adaptive synthetic sampling (ADASYN), and the synthetic minority oversampling technique (SMOTE), with the aim of addressing the imbalance in our dataset and enhancing the performance of our predictive model. We first trained the chosen machine learning algorithms on the unbalanced data and then on data balanced with each of the four methods. We then assessed the models by comparing their accuracy and AUC. In instances where one algorithm showed higher accuracy but lower AUC than another, we considered the AUC value for unbalanced data and the accuracy value for balanced data. Accuracy is a suitable metric for balanced classes, while AUC is valuable for imbalanced datasets or when the relative cost of false positives and false negatives is unknown. It is advisable to consider both accuracy and AUC, along with other relevant metrics, to understand a model’s performance comprehensively and to make informed comparisons between machine learning algorithms. Taking these factors into account, we selected the balancing technique that demonstrated superior performance for the final prediction.
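The two simplest of the four balancing methods, random over-sampling and random under-sampling, can be sketched in plain Python as follows. The helper names and the 80/20 toy labels are made up for illustration. SMOTE and ADASYN, which synthesize new minority points by interpolating between neighbors, are available in the separate imbalanced-learn package and are not reimplemented here.

```python
import random
from collections import Counter


def _group_by_class(rows, labels):
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest one."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        out_rows.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy imbalanced training set: 20 positives, 80 negatives.
rows = [[i] for i in range(100)]
labels = [1 if i < 20 else 0 for i in range(100)]

_, over_labels = random_oversample(rows, labels)
_, under_labels = random_undersample(rows, labels)
print(Counter(over_labels), Counter(under_labels))
```

Resampling of this kind is normally applied to the training portion only, so that the test set keeps the original class distribution.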

Feature engineering

Feature engineering involves transforming raw data into features that are more suitable for predictive models. In this study, one-hot encoding was used to convert categorical variables into binary indicator columns, and label encoding was used to assign a unique number to each category of a variable. Additionally, dimensionality reduction was conducted to decrease the number of input variables, with the aim of creating a simpler and more effective model for making predictions on new data28.
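The difference between the two encodings is easiest to see on a small example. The functions below are a minimal plain-Python sketch with hypothetical names; they assign codes in alphabetical order of the category labels, which is an arbitrary choice.

```python
def label_encode(values):
    """Map each category to one integer code (alphabetical order)."""
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[value] for value in values], mapping


def one_hot_encode(values):
    """Map each category to its own 0/1 indicator column."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return rows, categories


wealth = ["poor", "rich", "middle", "poor"]

codes, mapping = label_encode(wealth)
indicators, columns = one_hot_encode(wealth)

print(mapping, codes)
print(columns, indicators)
```

Label encoding yields one column and implies an ordering of the codes, whereas one-hot encoding yields one column per category and implies none, which matters for distance-based and linear models.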

There are two approaches to dimension reduction: feature selection and feature extraction. Feature selection involves choosing the most relevant independent variables, those with the greatest impact on predicting the target variable. It is the appropriate method for our dataset, whereas feature extraction is typically used for datasets involving image processing28. There are various well-known methods for feature selection, and it is crucial to consider predictive performance carefully when selecting a method for an ML model. We therefore explored several feature selection methods, including Lasso, PCA, wrapper methods (forward selection, backward elimination, and recursive feature elimination), correlation-based feature selection, and the chi-square test, and compared their performance using evaluation metrics29. Through this analysis, we found Boruta to be the most effective feature selection method and used it to pinpoint the most important features for our predictive model. Boruta is a wrapper-based technique that uses the random forest classifier and is known for its unbiased and consistent performance, making it highly effective in selecting key variables30,31. Combining Boruta with the random forest classifier offers several benefits, including enhanced feature selection, robustness against noise and irrelevant features, reduced bias in feature importance, and improved interpretability. This combination refines the feature selection process, resulting in better model performance and reduced overfitting. However, there are challenges and limitations associated with its use. 
To address these issues, we employed various techniques such as L1 or L2 regularization, cross-validation, maintaining an independent test set, parallel processing, analyzing the stability of feature importance across multiple runs or subsets, recursive feature elimination, balancing false positives and false negatives, and conducting principal component analysis32.
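The core idea behind Boruta is to compare each real feature with "shadow" copies whose values have been shuffled, so that they carry no signal. The sketch below shows only that comparison, on synthetic data; it is a simplified illustration and not the full Boruta procedure, which adds a statistical test on the hit counts and removes rejected features iteratively. The function name, the number of iterations, and the forest size are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def shadow_feature_hits(X, y, n_iter=10, seed=0):
    """Count how often each real feature beats the best shuffled (shadow) feature."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    for i in range(n_iter):
        # Shadow features: each column permuted independently, breaking any link to y.
        shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(n_features)])
        forest = RandomForestClassifier(n_estimators=100, random_state=seed + i)
        forest.fit(np.hstack([X, shadows]), y)
        importances = forest.feature_importances_
        real, shadow = importances[:n_features], importances[n_features:]
        # A "hit" means the real feature is more important than every shadow.
        hits += real > shadow.max()
    return hits


# Synthetic data: only feature 0 determines the label; features 1 and 2 are noise.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)

hits = shadow_feature_hits(X, y)
print("hits per feature out of 10 iterations:", hits)
```

A feature that wins in nearly every iteration would be a candidate for confirmation, while a feature that rarely beats the shadows would be a candidate for rejection.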

Data splitting: To train the model and validate it on data it had never seen before, we used a simple 80/20 split in which 80% of the samples (4,514 respondents) were used for training and the remaining 20% (1,128 respondents) for testing the model. In addition, tenfold cross-validation was used for model training because it does not waste much data, which is a big advantage when the number of samples is small33.
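The split sizes can be reproduced with a short index-based sketch. The function names and the random seed are hypothetical, and the example ignores stratification and survey weights, which the text does not describe for this step.

```python
import random


def train_test_indices(n_samples, train_fraction=0.8, seed=0):
    """Shuffle row indices and cut them into a training and a test part."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(train_fraction * n_samples)
    return indices[:n_train], indices[n_train:]


def k_fold_indices(indices, k=10):
    """Yield (training, validation) index lists; each fold is held out once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


train_idx, test_idx = train_test_indices(5642)
print(len(train_idx), len(test_idx))

for fold_number, (training, validation) in enumerate(k_fold_indices(train_idx), start=1):
    print(fold_number, len(training), len(validation))
```

The cross-validation folds are drawn from the training indices only, so the held-out 20% is never seen during model fitting or tuning.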

Model selection and development

After splitting the data into training and testing sets, we chose appropriate models for training. Since the target variable was categorical, the task involved classification, and we needed to select suitable classifiers for prediction. The problem is one of binary classification, as anemia status was divided into two mutually exclusive categories, non-anemic and anemic. To assess the capabilities of ML algorithms in predicting anemia status, we employed eight state-of-the-art algorithms. These algorithms were chosen based on previous research that applied machine learning techniques to classification tasks on EDHS data17,34,35,36. Moreover, the selection of these algorithms depended on their scalability, interpretability, number of features, computational efficiency, data characteristics, type of problem, robustness to noise and outliers, accuracy, bias-variance trade-off, and domain expertise. In this study, we used the scikit-learn package (version 1.3.2) in Python, run within Jupyter Notebook, to implement the ML algorithms. The eight algorithms are described as follows:

(A) Decision tree (DT)

A DT is a non-parametric technique that classifies a dataset based on the problem’s predictive structure. Decision trees are highly interpretable, capture nonlinear relationships efficiently, handle both categorical and numerical features, are relatively robust to outliers and noisy data, can handle missing values through surrogate splits or imputation, and scale efficiently to large datasets37. Because of these advantages, we employed the DT algorithm to predict anemia status among youth girls in Ethiopia. However, DTs also have limitations: they can be prone to overfitting, struggle to capture certain complex relationships that require more sophisticated algorithms, and are sensitive to small changes in the data, which can lead to different tree structures.

(B) Random forest (RF)

RF is a supervised ML algorithm that can be used for classification, regression, and dimension reduction. It is a versatile algorithm suited to large amounts of data and robust to noise. Random forest uses an error-minimizing technique to select the variables on which to split. Random forests are preferred when improved predictive performance, reduced bias and variance, robustness to noise and outliers, feature importance, and the handling of high-dimensional data are important considerations38,39. However, RF has some limitations. It can behave as a black-box model that is harder to interpret than an individual DT, because the ensemble nature of random forests makes the decision-making process difficult to trace. Additionally, RF may not perform well on datasets with strong linear relationships.

(C) Extreme gradient boosting (XGBoost)

XGBoost is a DT-based ensemble machine learning algorithm built on a gradient boosting framework. Boosting combines weak classifiers to produce a powerful combined classifier. It can be applied to both classification and prediction problems. XGBoost is preferred because it is robust to noisy data and outliers, handles high-dimensional datasets, controls model complexity to prevent overfitting, handles missing values in the data, saves computational resources, and provides a wide range of hyperparameters40. However, XGBoost may have higher computational and memory requirements, and it tends to be less interpretable than the other algorithms.

(D) Light gradient boosting machine (LightGBM)

LightGBM is a gradient boosting framework that combines multiple learners, usually DTs, to create a strong predictive model while reducing memory usage. It is generally faster and more memory-efficient than XGBoost, making it suitable for large datasets41. LightGBM is preferred when efficiency, scalability, handling of high-dimensional data and categorical features, advanced boosting and regularization techniques, feature importance, handling of imbalanced datasets, and flexibility are important considerations.

(E) Support vector machine (SVM)

SVMs are a set of supervised learning methods used for classification, regression, and outlier detection. They are preferred when high-dimensional feature spaces, robustness to outliers, nonlinearity, margin maximization, memory efficiency, and small to medium-sized datasets are important considerations42. However, SVMs may have limitations in terms of scalability to large datasets and computational efficiency, especially when using non-linear kernels. In addition, SVMs may not perform well when the dataset is imbalanced or when the classes overlap and are not well separated.

(F) Logistic regression (LR)

LR is a supervised ML algorithm used to solve classification issues. It is a parametric method that assumes a Bernoulli distribution of the target variable and the independence of the observations42.

(G) K-nearest Neighbor (KNN)

KNN is a non-parametric, robust, and adaptable supervised ML algorithm used primarily for classification problems. It stores all existing cases and classifies new ones by computing a distance-based similarity score and taking the majority vote of the nearest neighbors. KNN is preferred when nonlinear relationships, interpretability, robustness to outliers, handling of imbalanced datasets, the absence of an explicit training step, flexibility, and datasets with varying densities are important considerations43. However, KNN has limitations. It can be computationally expensive, especially with large datasets or high-dimensional feature spaces. In addition, KNN is sensitive to the choice of distance metric, and the optimal value of K needs to be determined through experimentation or cross-validation.

(H) Gaussian Naïve Bayes (GNB)

NB is a family of ML algorithms based on Bayes’ theorem that makes two basic assumptions: first, that every pair of features is independent of each other, and second, that each feature contributes equally to the prediction of the outcome. GNB is preferred when efficiency, simplicity, handling of continuous features, small training sets, text classification, and the feature independence assumption are important considerations44. However, GNB may not perform well when these two assumptions are severely violated. It may struggle with datasets in which the features have strong dependencies or the decision boundary is complex.

Model training and evaluation

After dividing the data into training and testing sets, we selected appropriate models for training, focusing on classifiers suitable for the categorical target variable. Because the task was binary classification of anemia status, we used eight machine learning algorithms: logistic regression, random forest, K-nearest neighbor, support vector machine, Gaussian Naïve Bayes, extreme gradient boosting, decision tree, and light gradient boosting classifiers. These choices were based on previous research using machine learning techniques on EDHS data.
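A comparison of this kind can be sketched with scikit-learn as below. The data are synthetic (generated with `make_classification`), all hyperparameters are library defaults or arbitrary assumptions, and only the six classifiers that ship with scikit-learn are shown. XGBoost and LightGBM come from separate packages (`xgboost.XGBClassifier`, `lightgbm.LGBMClassifier`) that follow the same estimator interface and are omitted to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced binary problem standing in for the survey data.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, weights=[0.7, 0.3], random_state=0
)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "K-nearest neighbor": KNeighborsClassifier(n_neighbors=5),
    "Support vector machine": SVC(),
    "Gaussian Naive Bayes": GaussianNB(),
}

# Mean AUC over tenfold cross-validation for each classifier.
cv_auc = {
    name: cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()
    for name, model in models.items()
}

for name, auc in sorted(cv_auc.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: mean AUC = {auc:.3f}")
```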

Following model selection, we trained the selected classifiers with both balanced and unbalanced data. The best predictive model was then chosen and trained with balanced training data for the final prediction on unseen test data. To evaluate the performance of the final model, we used a confusion matrix and the receiver operating characteristic (ROC) curve, with metrics such as accuracy, sensitivity (recall), specificity, F1 score, and area under the curve (AUC). The AUC was considered the main performance metric, as it provides an overall assessment of the model’s performance across classification thresholds. The confusion matrix provided the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), from which these metrics were derived26.
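The relationship between the confusion-matrix counts and the reported metrics can be sketched in plain Python. The function names and the eight toy labels and scores are invented for illustration, and the code assumes that both classes are present and that at least one positive is predicted, so no division by zero occurs.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)   # recall: share of true positives found
    specificity = tn / (tn + fp)   # share of true negatives found
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy labels and predicted probabilities; 0.5 is an assumed decision threshold.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Accuracy, sensitivity, specificity, and F1 depend on the chosen threshold, whereas the AUC is computed from the scores alone and therefore summarizes performance over all thresholds.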

Ultimately, the choice of evaluation metrics should be driven by the requirements of the specific context, the trade-offs between metrics, the benchmarks and standards of the field, model interpretability, problem type, data characteristics, and the goals of the task at hand. For instance, accuracy is suitable when the distribution of classes is balanced and the costs of misclassifying instances are equal. Sensitivity, on the other hand, is especially valuable when the classes are imbalanced, when there is a high cost associated with missing positive instances, and in applications where it is crucial to detect or mitigate risks early45. Additionally, the ROC curve is beneficial in imbalanced class scenarios for selecting appropriate thresholds and for comparing different models46. It is therefore crucial to carefully evaluate and select the metrics that best align with the problem type, data characteristics, and objectives in order to assess the model’s performance effectively47,48,49.

In addition to the standard metrics, tenfold cross-validation was employed to further evaluate the model’s performance50. Tenfold cross-validation divides the data into ten subsets and trains and evaluates the model ten times, each time using a different combination of nine subsets for training and the remaining subset for evaluation51. We also carried out a comprehensive examination of hyperparameters with the aim of optimizing the model’s performance. Grid search, random search, and Bayesian optimization were systematically employed to discover the most effective hyperparameter configurations. The choice among these methods depends on factors such as the size of the search space, the available computational resources, and the desired balance between exploration and exploitation. Grid search is simple and exhaustive but can be computationally expensive. Random search is less intensive but may require more iterations. Bayesian optimization is efficient and effective for complex search spaces but may require additional setup and computational resources. The suitability of each tuning method also depends on the specific machine learning algorithm being used and the characteristics of the dataset. Experimenting with and evaluating different methods on the validation set is recommended to identify the most effective approach for hyperparameter tuning52. We therefore tried all three techniques and selected the best one based on the resulting performance metrics. Additionally, to enhance the precision and reliability of the model used in this study, calibration was conducted. Calibrating the model improved its ability to predict the outcome accurately.
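Grid search with tenfold cross-validation, followed by probability calibration, can be sketched with scikit-learn as follows. Everything specific here is an assumption made for illustration: the synthetic data, the random forest as the tuned model, the small parameter grid, and the sigmoid calibration method. The text does not report which grids or calibration method were used.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary data standing in for the survey features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Exhaustive search over a small grid, scored by AUC with tenfold cross-validation.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=10, scoring="roc_auc"
)
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)
print("best cross-validated AUC:", round(search.best_score_, 3))

# Calibrate the tuned model so predicted probabilities better match observed rates.
calibrated = CalibratedClassifierCV(search.best_estimator_, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
probabilities = calibrated.predict_proba(X_test)[:, 1]
print("first calibrated probabilities:", probabilities[:5].round(3))
```

scikit-learn's `RandomizedSearchCV` has the same interface and samples a fixed number of configurations instead of trying every one. Bayesian optimization is not part of scikit-learn and would require an additional package.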

Model interpretability

Researchers have highlighted the potential of integrating SHAP (SHapley Additive exPlanations) values and association rule mining to accomplish various goals53. When the aim is to uncover concealed patterns and connections within the data, association rule mining proves to be a more suitable method. On the other hand, when the objective is to comprehend how different features influence the model’s predictions, SHAP analysis emerges as a more appropriate choice53,54. To gain a thorough understanding of the data and analyze the factors that influence the prediction of anemia, we employed a range of techniques. Firstly, we calculated the average SHAP values to assess the overall impact of each feature on the model’s predictions. This allowed us to gain insights into the relative importance of different variables. SHAP analysis is a widely used method in machine learning for interpreting model predictions and understanding feature importance. It assigns a numerical value, called a SHAP value, to each feature, indicating its contribution to predictions. By calculating SHAP values, practitioners can gain insights into how features influence predictions. Positive values indicate positive contribution, negative values indicate the opposite, and the magnitude represents the strength of influence. SHAP analysis enhances transparency and interpretability, providing a global view of feature importance and explaining individual predictions55,56,57.

Following that, we utilized a waterfall plot to visually represent the cumulative effects of these variables, highlighting their contributions to the overall prediction58.
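The meaning of a SHAP value can be illustrated by computing exact Shapley values for a tiny hand-written model. This brute-force sketch enumerates every feature coalition, which is feasible only for a handful of features; it is not how SHAP libraries compute values for real models, and the toy risk function, instance, and baseline are invented for the example.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values: features outside a coalition take their baseline value."""
    n = len(instance)

    def value(coalition):
        mixed = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = set(subset)
                total += weight * (value(without_i | {i}) - value(without_i))
        contributions.append(total)
    return contributions


def toy_risk(x):
    """Hypothetical model: two main effects and one interaction between features 0 and 2."""
    return 0.2 + 0.3 * x[0] - 0.1 * x[1] + 0.2 * x[0] * x[2]


instance = [1, 1, 1]
baseline = [0, 0, 0]
phi = shapley_values(toy_risk, instance, baseline)
print([round(v, 3) for v in phi])
print(round(toy_risk(baseline) + sum(phi), 3), round(toy_risk(instance), 3))
```

The contributions add up to the difference between the prediction for the instance and the prediction for the baseline, which is the quantity a waterfall plot displays step by step. The interaction term is shared equally between the two features involved.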

Association rule mining

For this research, we employed association rule analysis using the Apriori algorithm in R to identify the particular predictor categories linked to anemia. The purpose of this analysis was to uncover connections between categorical attributes and anemia among youth girls in Ethiopia, as machine learning algorithms do not inherently reveal which categories have stronger associations with anemia. By investigating frequently occurring patterns and detecting dependencies among attributes, we aimed to understand the relationships between attributes and the level of confidence they hold in predicting anemia. To achieve this, we used If/Then statements to uncover these associations59. An If/Then association rule is a pair of attributes (X, Y) expressed as X -> Y, where X is the antecedent and Y is the consequent. The rule signifies that if X happens, then Y also tends to happen. The relationship between X and Y can be categorized based on the lift value. A lift of 1 indicates an uncorrelated rule, meaning that X and Y occur together as independent random events with no special significance. A lift below 1 indicates a negative correlation rule, where the occurrence of X reduces the occurrence of Y. A lift above 1 indicates a positive correlation rule, where the occurrence of X promotes the occurrence of Y60.
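Support, confidence, and lift for a single rule X -> Y can be computed directly from counts, as in the plain-Python sketch below. The ten records are made-up placeholders using generic item names, not survey data, and the function covers only the rule metrics; the frequent-itemset search that Apriori performs is not implemented.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(records)
    n_x = sum(1 for record in records if antecedent <= record)
    n_y = sum(1 for record in records if consequent <= record)
    n_xy = sum(1 for record in records if (antecedent | consequent) <= record)
    support = n_xy / n             # share of records containing both X and Y
    confidence = n_xy / n_x        # share of X records that also contain Y
    lift = confidence / (n_y / n)  # confidence relative to the base rate of Y
    return support, confidence, lift


# Ten placeholder records, each a set of items.
records = (
    [{"X", "Y"}] * 3   # X and Y together
    + [{"X"}] * 1      # X without Y
    + [{"Y"}] * 1      # Y without X
    + [{"Z"}] * 5      # neither
)

support, confidence, lift = rule_metrics(records, {"X"}, {"Y"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.3f}")
```

In this toy example Y occurs in 40% of all records but in 75% of the records containing X, so the lift is above 1 and the rule would be read as a positive association.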

Ethical considerations and consent to participate

The Central Statistical Agency (CSA) received ethical clearance for the 2016 EDHS from the Ethiopian Health and Nutrition Research Institute Review Board and the National Research Ethics Review Committee at the Ministry of Science and Technology. Moreover, the CSA confirmed that the survey was performed in accordance with the Declaration of Helsinki and obtained written informed consent from the respondents. We obtained approval from the DHS Program to access and use the data for this study.


