Study design, setting, and data sources
This study used a cross-sectional design based on nationally representative, population-based surveys conducted in sub-Saharan Africa between 2021 and 2024. Data were extracted from the Demographic and Health Surveys (DHS) Program, which is funded by the United States Agency for International Development (USAID) and provides financial and technical support for standardized demographic and health data collection worldwide. The analysis used the most recent available DHS datasets from eight countries: Burkina Faso, Côte d’Ivoire, Ghana, Kenya, Tanzania, the Democratic Republic of the Congo, Lesotho, and Senegal. DHS uses multistage stratified sampling to ensure national and regional representativeness, with large samples designed to capture demographic, behavioral, and health-related indicators. The surveys included standardized modules on HIV knowledge, awareness and perception of pre-exposure prophylaxis (PrEP), sociodemographic characteristics, and behavioral risk factors. For comparability, the country datasets were harmonized and pooled into a single dataset, which increased statistical power and allowed cross-national analysis of factors associated with PrEP awareness and perception among women of reproductive age.
Source population
The source population consisted of women of reproductive age (15–49 years) residing in sub-Saharan Africa who were eligible to participate in the Demographic and Health Surveys (DHS) conducted from 2021 to 2024. DHS surveys use standardized sampling and data collection procedures to collect nationally representative data and ensure comparability across countries. These surveys represent the general population of women of reproductive age in each participating country and serve as the basis for deriving the study population.
Study population
The study population included women aged 15 to 49 years from eight sub-Saharan African countries (Burkina Faso, Côte d’Ivoire, Ghana, Kenya, Tanzania, the Democratic Republic of the Congo, Lesotho, and Senegal) who provided data on awareness and perception of HIV pre-exposure prophylaxis (PrEP) in the most recent DHS round. The final pooled dataset consisted of a weighted sample of 123,132 women. Eligible participants were those who completed the PrEP awareness and perception module and had no missing key demographic information. Women who reported being HIV-positive at the time of the survey were excluded, so the analyses focused on awareness and perception of PrEP among HIV-negative women (Table 1).
Determination of sample size and sampling procedure
Demographic and Health Surveys (DHS) are conducted approximately every five years in many low- and middle-income countries. They use standardized, pre-tested questionnaires and consistent methods for sampling, data collection, and coding, which enables cross-country comparisons and multi-country analyses. For each country included in this study, the survey used the most recent national census as the sampling frame, and the sample was stratified by urban and rural areas within administrative regions. DHS applies a two-stage stratified cluster sampling design. In the first stage, clusters called enumeration areas (EAs) are selected from the census list within each stratum, with probability proportional to size. In the second stage, all households in each selected EA are listed and a fixed number is selected by systematic sampling (e.g., every nth household), so that households within an EA have equal selection probabilities. This process produces a nationally representative sample of women aged 15–49.
This analysis pooled data from eight countries, resulting in a weighted sample of 123,132 women who answered questions about awareness and perception of HIV pre-exposure prophylaxis (PrEP).
Outcome variable
The main outcome of interest in this study was women’s awareness and perception of HIV pre-exposure prophylaxis (PrEP), a preventive measure against HIV infection. Participants were asked whether they had heard of PrEP and, if so, what their perceptions were about its use. Response options included: "I have never heard of it"; "I have heard of it"; "I have heard of it and I approve of it being taken daily"; "I have heard of it but I do not approve of its daily use"; and "I have heard of it but I am not sure if I would approve of it." For analysis, women who were aware of PrEP and approved of its use were coded as “yes = 1,” whereas women who had not heard of PrEP, did not approve, or were unsure were coded as “no = 0.” We recognize that combining awareness and approval into a single binary variable may misclassify women who are aware but hesitant. However, this approach reflects the study objective of identifying women who are both aware of PrEP and likely to accept its use. Sensitivity analyses were performed to assess potential misclassification, and the results were not materially affected. There were no missing or unknown values for this outcome variable.
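The two-stage selection described above can be illustrated with a minimal sketch. It assumes systematic probability-proportional-to-size (PPS) selection of EAs within one stratum and a fixed take of 25 households per EA; the EA sizes, the take, and the function names are hypothetical and are not taken from the DHS sampling documents.

```python
import numpy as np


def select_eas_pps(ea_sizes, n_select, rng):
    """Stage 1: systematic PPS selection of EA indices within one stratum."""
    cumulative = np.cumsum(np.asarray(ea_sizes, dtype=float))
    step = cumulative[-1] / n_select
    start = rng.uniform(0, step)
    points = start + step * np.arange(n_select)
    # Each point falls in the cumulative-size interval of exactly one EA.
    # An EA larger than `step` could be hit more than once.
    return np.searchsorted(cumulative, points, side="right")


def select_households(n_households, n_fixed, rng):
    """Stage 2: systematic selection of a fixed number of listed households."""
    interval = n_households / n_fixed
    start = rng.uniform(0, interval)
    return np.floor(start + interval * np.arange(n_fixed)).astype(int)


rng = np.random.default_rng(0)
ea_sizes = rng.integers(80, 300, size=200)   # hypothetical household counts per EA
chosen_eas = select_eas_pps(ea_sizes, n_select=20, rng=rng)
households = {
    int(ea): select_households(int(ea_sizes[ea]), 25, rng) for ea in chosen_eas
}
```

Because larger EAs are more likely to be chosen in the first stage while the household take is fixed in the second, overall selection probabilities differ across households, which is one reason weighted estimates are reported.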
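The binary coding rule can be sketched as follows. The response labels and variable names are hypothetical placeholders rather than actual DHS codes, and the awareness-only recode is just one possible alternative coding for a sensitivity check, not necessarily the one used here.

```python
import pandas as pd

# Hypothetical response labels; the real survey variable uses its own codes.
APPROVES = "heard_approves"
NEVER_HEARD = "never_heard"

responses = pd.Series(
    ["never_heard", "heard_approves", "heard_disapproves",
     "heard_unsure", "heard_approves"],
    name="prep_response",
)

# Main outcome: 1 only if the woman has heard of PrEP and approves of its use.
outcome = (responses == APPROVES).astype(int).rename("prep_aware_approve")

# One possible alternative coding: awareness alone, regardless of approval.
aware_only = (responses != NEVER_HEARD).astype(int).rename("prep_aware")
```

Comparing model results under the two codings shows how much the conclusions depend on treating aware-but-hesitant women as "no".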
Predictor and feature selection
Predictor variables included sociodemographic characteristics (age, marital status, education level, employment status, household wealth quintile), behavioral factors (number of sexual partners, condom use, history of sexually transmitted infections), health service utilization (recent HIV testing, antenatal care attendance), and contextual variables (urban/rural residence, media exposure, national HIV prevalence). To reduce multicollinearity and improve model efficiency, feature selection was performed using recursive feature elimination (RFE) and correlation analysis to retain only the most informative predictors for model training [26,27,28].
Data preprocessing
Data cleaning included handling missing values using multiple imputation with chained equations and encoding categorical predictors with one-hot encoding. Continuous variables were normalized with min-max scaling. The dataset was divided into training (70%) and testing (30%) subsets, stratified by the outcome variable so that both subsets preserved the outcome class proportions [28].
Correlation matrix heatmap
A correlation matrix heatmap was generated to visualize the relationships between the predictor variables included in the model. The heatmap displays both strong and weak correlations, making it easier to identify potentially overlapping or complementary variables. Insights from these correlation patterns informed the subsequent feature selection and model optimization steps, so that only the most informative predictors were retained for model training (Figure 1).
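A minimal sketch of the encoding, scaling, and stratified split is given below using scikit-learn. The data are synthetic, the column names are hypothetical, and the imputation step is omitted; fitting the transformers on the training subset only is a design choice of this sketch, not a detail stated above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in for the pooled dataset (hypothetical columns).
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(15, 50, size=n),
    "education": rng.choice(["none", "primary", "secondary", "higher"], size=n),
    "residence": rng.choice(["urban", "rural"], size=n),
    "prep_outcome": rng.binomial(1, 0.2, size=n),
})
X = df.drop(columns="prep_outcome")
y = df["prep_outcome"]

# 70/30 split, stratified so both subsets keep the outcome class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

preprocess = ColumnTransformer(
    [
        ("num", MinMaxScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["education", "residence"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
X_train_t = preprocess.fit_transform(X_train)  # learn min/max and categories
X_test_t = preprocess.transform(X_test)        # reuse them on the test subset
```

Fitting on the training subset alone keeps information from the test subset out of the scaling parameters.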

Correlation matrix heatmap showing pairwise associations between sociodemographic, behavioral, and contextual predictors used in machine learning models
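As a rough illustration of how such a matrix can be screened, the sketch below computes pairwise Pearson correlations with pandas and lists pairs above a cutoff. The synthetic variables, the choice of Pearson correlation, and the 0.7 cutoff are assumptions made for the example; the resulting matrix could be passed to any plotting library to draw the heatmap.

```python
import numpy as np
import pandas as pd

# Synthetic predictors (hypothetical): number of children is built to track age.
rng = np.random.default_rng(0)
n = 500
age = rng.integers(15, 50, size=n)
df = pd.DataFrame({
    "age": age,
    "n_children": np.clip((age - 15) // 6 + rng.integers(-1, 2, size=n), 0, None),
    "media_exposure": rng.integers(0, 2, size=n),
})

corr = df.corr()  # pairwise Pearson correlations

# Keep each pair once by masking everything except the upper triangle.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Pairs whose absolute correlation exceeds the illustrative cutoff.
high = pairs[pairs.abs() > 0.7]
```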
Feature ranking using recursive feature elimination (RFE)
In this study, feature selection techniques were applied during predictive model development to remove irrelevant or redundant variables and to improve efficiency and interpretability. Feature selection systematically reduces the number of features so that only the most informative predictors are kept. We used recursive feature elimination (RFE), which iteratively fits a model, ranks features by the importance scores derived from that model, and removes the least important ones until only the most relevant variables remain. This approach can improve model performance, reduce overfitting by filtering out noise, and simplify model interpretation. Using RFE, the most influential predictors selected for model building were maternal age, education, place of residence, marital status, household wealth index, employment status, media exposure, ANC follow-up, place of birth, number of medical examinations, total number of children, children under 5 years of age, contraceptive use, ever heard of sexually transmitted infections, ever been tested for HIV, age at first birth, sexual partner’s employment status, age at first cohabitation, and history of abortion. These selected predictors were used to train the predictive models, as shown in Figure 2.

Ranking of the most important features for predicting women’s awareness and perception of HIV pre-exposure prophylaxis (PrEP) using recursive feature elimination (RFE)
Machine learning models
Five supervised machine learning classifiers (K-Nearest Neighbors (KNN), XGBoost, CatBoost, LightGBM, and Gradient Boosting) were trained to predict PrEP awareness and positive perceptions. Hyperparameters were optimized using grid search with 5-fold cross-validation, with accuracy and F1 score as the main criteria. Model performance on the test set was evaluated using accuracy, precision, recall, F1 score, and area under the receiver operating characteristic curve (ROC AUC).
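The elimination loop can be sketched with scikit-learn's `RFE`. The base estimator (a gradient boosting classifier), the synthetic data, and the target of five retained features are assumptions for illustration; the text above does not state which estimator or stopping point was used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE

# Synthetic data with hypothetical feature names x0..x11.
X, y = make_classification(
    n_samples=500, n_features=12, n_informative=4, n_redundant=2, random_state=0
)
feature_names = [f"x{i}" for i in range(X.shape[1])]

# Refit the estimator repeatedly, dropping the least important feature each round.
selector = RFE(
    estimator=GradientBoostingClassifier(n_estimators=50, random_state=0),
    n_features_to_select=5,
    step=1,
)
selector.fit(X, y)

selected = [name for name, keep in zip(feature_names, selector.support_) if keep]
ranking = dict(zip(feature_names, selector.ranking_))  # 1 = retained
```

In `ranking`, retained features share rank 1 and larger ranks mark features eliminated earlier, which is the kind of ordering a feature-ranking figure displays.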
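The tuning and evaluation procedure can be sketched for one of the five classifiers. The example below uses scikit-learn's gradient boosting on synthetic data; the parameter grid and the decision to refit on F1 are illustrative assumptions, not the settings used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, moderately imbalanced data standing in for the pooled dataset.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, weights=[0.7, 0.3],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Hypothetical grid; 5-fold cross-validation scored on accuracy and F1.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    scoring={"accuracy": "accuracy", "f1": "f1"},
    refit="f1",  # the final model is refit using the best mean F1
    cv=5,
)
search.fit(X_train, y_train)

# Held-out evaluation with the five reported metrics.
y_pred = search.predict(X_test)
y_prob = search.predict_proba(X_test)[:, 1]
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_prob),
}
```

The same pattern applies to the other classifiers by swapping the estimator and its grid.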
Interpreting the model
To increase interpretability, SHapley Additive exPlanations (SHAP) values were calculated to identify the most influential predictors in each model. Feature importance rankings were also derived from each algorithm, and SHAP summary plots were used to visualize both the direction and magnitude of feature effects. This provided practical insights into the drivers of PrEP awareness [28].
Statistical analysis
Descriptive statistics were used to summarize participant characteristics and the prevalence of PrEP awareness. Bivariate analyses (chi-square test for categorical variables and t-test for continuous variables) were conducted to investigate the association between predictors and PrEP awareness. Before model training, multicollinearity between predictor variables was checked using the variance inflation factor (VIF). To ensure model stability, highly collinear variables (VIF > 10) were excluded. Missing values were treated according to the nature of the variable: modal imputation was applied to categorical variables, and multiple imputation was used for continuous variables to preserve statistical power and reduce bias. Overall, the proportion of missing data was minimal across the included variables.
Multiple supervised machine learning algorithms were applied for predictive modeling, including K-nearest neighbors (KNN), XGBoost, CatBoost, LightGBM, and gradient boosting. Model performance was evaluated using accuracy, precision, recall, F1 score, and ROC AUC. Feature selection was performed using recursive feature elimination (RFE), and model interpretability was assessed using SHapley Additive exPlanations (SHAP) values. All analyses were implemented in Python (v3.8+) using libraries such as scikit-learn, XGBoost, CatBoost, LightGBM, SHAP, and pandas.
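To make concrete what a SHAP value represents, the sketch below computes exact Shapley values by brute-force enumeration for a toy model. It is not the SHAP library's algorithm, which relies on faster model-specific or sampling-based estimators; the logistic toy model, its weights, and the single all-zero baseline are hypothetical.

```python
from itertools import combinations
from math import factorial

import numpy as np


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance against a single baseline."""
    n = len(x)

    def value(subset):
        # Features in `subset` take the instance's values; the rest stay at baseline.
        z = baseline.copy()
        idx = np.array(subset, dtype=int)
        z[idx] = x[idx]
        return predict(z)

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


# Hypothetical toy model: logistic function of three features, the third unused.
weights = np.array([0.8, -0.5, 0.0])


def predict(z):
    return 1.0 / (1.0 + np.exp(-(weights @ z)))


x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
phi = shapley_values(predict, x, baseline)
```

The contributions in `phi` add up to the difference between the prediction for `x` and the prediction for the baseline, and a feature the model ignores receives a contribution of zero. The sign of each entry gives the direction of the effect, which is what a summary plot visualizes across many instances.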
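The VIF screen can be sketched as follows, computing each predictor's VIF as 1 / (1 - R²) from a regression of that predictor on all the others. The synthetic variables and the helper name `vif_table` are hypothetical; only the cutoff of 10 comes from the text.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def vif_table(X):
    """Variance inflation factor for every column of a numeric DataFrame."""
    vifs = {}
    for col in X.columns:
        others = X.drop(columns=col)
        # R^2 from regressing this predictor on all remaining predictors.
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        vifs[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")


# Synthetic predictors: `age_copy` is a near-duplicate of `age`.
rng = np.random.default_rng(1)
n = 400
age = rng.normal(30, 8, size=n)
X = pd.DataFrame({
    "age": age,
    "age_at_first_birth": rng.normal(20, 3, size=n),
    "age_copy": age + rng.normal(0, 0.5, size=n),
})

vif = vif_table(X)
keep = vif[vif <= 10].index.tolist()
```

Dropping every variable above the cutoff at once, as here, removes both members of a collinear pair; a common alternative is to remove the variable with the largest VIF and recompute until all values fall below the cutoff.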
