A supervised machine learning approach with feature selection for sex-specific biomarker prediction

Clinical factors for the 1199 participants in the cohort

The descriptive statistics for the cohort are summarised in Table 1. For continuous variables, the mean is reported with the standard deviation in parentheses; for non-continuous variables, the percentage is reported, providing an indication of the variability within each subgroup. The cohort exhibited a balanced sex distribution, encompassed diverse racial backgrounds, and was homogeneous in age group representation. Notably, smoking was more prevalent in the male subgroup than in the female subgroup.

Table 1 Clinical factors and demographic statistics and counts for the N = 1199 participants in the NHANES cohort

Patterns in data variance

The data were analysed to explore the variance in the distribution of each biomarker for the male and female subgroups. Figure 1 shows the density distributions of the biomarkers, whereas Table 2 summarises the statistical tests used to identify significant differences in means and variances between the male and female groups for each feature.

**Fig. 1: Distribution analysis of the NHANES data used in model training.**

Table 2 Summary of comparative analysis of various health biomarkers between male and female subgroups with respect to variance and mean values

Density distributions of various training features across sex groups (male and female) for a dataset. Each subplot shows the distribution of a specific feature, such as Waist Circumference, BMI, Albuminuria, UrAlbCr, Uric Acid, Blood Glucose, HDL, Triglycerides, Age, and Systolic Blood Pressure, with density curves overlaid for males (in blue) and females (in red). The shaded areas represent the density distribution, and vertical dashed lines indicate the mean value for each sex.

The results revealed significant differences in both variance and mean values for certain biomarkers. Specifically, systolic blood pressure (mmHg), HDL cholesterol (mmol/L), and Urinary Albumin-to-Creatinine Ratio (UrAlbCr, mg/g) demonstrated notable differences between the sexes. For systolic blood pressure and HDL cholesterol, Levene’s test indicated significant variance differences, suggesting a broader range of values in one sex, which may reflect greater variability or differing influences on these biomarkers within that sex. Similarly, UrAlbCr showed significant differences in both variance and mean values, highlighting pronounced differences in biomarker expression between males and females.

In contrast, for waist circumference (cm), albuminuria (mg/L), uric acid (mmol/L), blood glucose (mmol/L), and triglycerides (mmol/L), no significant differences in variance were observed between the sexes. However, the mean values for these biomarkers differed significantly, as evidenced by the unpaired-sample t-test and the Mann-Whitney U test.
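The variance and mean comparisons described above can be sketched as follows. This is a minimal illustration on synthetic data, not the study's code: the function name `compare_groups`, the significance level, and the choice to fall back to Welch's t-test when Levene's test rejects equal variances are all assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for one biomarker in each sex (NOT NHANES data).
male = rng.normal(loc=125.0, scale=12.0, size=300)
female = rng.normal(loc=120.0, scale=18.0, size=300)


def compare_groups(a, b, alpha=0.05):
    """Return p-values for the variance and mean/location comparisons."""
    # Levene's test: are the two variances equal?
    levene_p = stats.levene(a, b, center="median").pvalue
    # Assumption: use Welch's t-test when equal variances are rejected.
    equal_var = levene_p >= alpha
    ttest_p = stats.ttest_ind(a, b, equal_var=equal_var).pvalue
    # Mann-Whitney U: rank-based comparison that does not assume normality.
    mwu_p = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue
    return {"levene_p": levene_p, "ttest_p": ttest_p, "mannwhitney_p": mwu_p}


result = compare_groups(male, female)
print(result)
```

Reporting the parametric and rank-based tests side by side, as Table 2 does, guards against conclusions that depend on the normality assumption alone.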

Figure 2 shows the Spearman correlations in the male, female and combined data groups. This initial pairwise Spearman correlation analysis was conducted to explore the relationships among biomarkers and to support the feature selection process in this study. As shown in Fig. 2, most pairwise associations were weak, suggesting that co-dependencies among the biomarkers are complex and more likely to emerge in multivariate, nonlinear, or higher-dimensional analyses. A notable exception was the strong positive correlation between BMI and waist circumference (r = 0.9) in both male and female subgroups. In contrast, most other biomarkers exhibited weak positive or negative monotonic relationships. This suggests that while Spearman correlation captures pairwise monotonic associations, it may not fully account for the complex multivariate interactions essential for robust predictions.

The results also showed that the Spearman correlations for male and female subgroups were notably similar to the combined group, though some biomarker relationships differed when comparing the three. While the correlation between BMI and waist circumference remained consistently strong across all groups (r = 0.9), other biomarkers revealed sex-specific differences. For example, albuminuria had a stronger negative correlation with BMI in females (r = −0.37) compared to males (r = −0.23), with the combined group (r = −0.3) falling between the two, suggesting a more pronounced inverse relationship in females.

Additionally, age and systolic blood pressure were more strongly correlated in females (r = 0.47) than in males (r = 0.29), while the combined group showed a moderate positive correlation (r = 0.37), indicating that age-related increases in blood pressure are more pronounced in females. Similarly, HDL and triglycerides exhibited a slightly stronger inverse relationship in males (r = −0.4) than in females (r = −0.36), with the combined group mirroring the male value (r = −0.4).

Collectively, these results highlight that while combined data provide an overall view, they can obscure subtle but important sex-specific differences in biomarker interactions. These findings suggest that separate analyses for males and females may offer more precise insights, particularly when developing predictive models for healthcare interventions.
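A per-group Spearman analysis of this kind can be sketched with pandas. The data frame below is synthetic and the column names (`sex`, `BMI`, `Waist`, `HDL`) are hypothetical; the sketch only shows how one correlation matrix per group could be produced for comparison.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 400
sex = rng.choice(["female", "male"], size=n)
bmi = rng.normal(28.0, 5.0, size=n)
# Waist is constructed to track BMI closely; HDL is unrelated noise here.
waist = 40.0 + 2.0 * bmi + rng.normal(0.0, 4.0, size=n)
hdl = rng.normal(1.4, 0.3, size=n)
df = pd.DataFrame({"sex": sex, "BMI": bmi, "Waist": waist, "HDL": hdl})

features = ["BMI", "Waist", "HDL"]
# One rank-correlation matrix for the combined data and one per sex.
corr = {"combined": df[features].corr(method="spearman")}
for label, group in df.groupby("sex"):
    corr[label] = group[features].corr(method="spearman")

for label, matrix in corr.items():
    print(label, round(matrix.loc["BMI", "Waist"], 2))
```

Comparing the same cell across the three matrices is how a sex-specific difference, such as the age and systolic blood pressure pair reported above, would become visible.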

Spearman’s rank correlation coefficients were computed for 10 feature pairs separately for male, female and combined groups. Results were visualised with positive correlations highlighted in red and negative relationships in blue.

Biomarker prediction optimisation including feature selection (RFECV)

Despite the weak pairwise correlations, predictive value could still be captured when the biomarkers were analysed within a multivariate framework^16,17,18,19. To this end, Recursive Feature Elimination with Cross-Validation (RFECV) was applied to the full set of 14 biomarkers to identify the most relevant features for training each model. This approach enhanced prediction accuracy, minimised error, and optimised the number of features required for an ideal validation score.

Certain demographic factors, such as race and marital status, were excluded from the feature set due to their association with model inaccuracies. The remaining 12 features were retained for further analysis.
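The RFECV step can be illustrated with scikit-learn. This is a sketch under stated assumptions rather than the study's pipeline: the estimator (`LinearRegression`), the 5-fold split, the MAE-based scoring, and the synthetic feature names are all placeholders.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
names = ["BMI", "Age", "HDL", "Triglycerides", "UricAcid", "Noise"]
X = rng.normal(size=(300, len(names)))
# Only the first two columns drive the synthetic target.
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=300)

selector = RFECV(
    estimator=LinearRegression(),  # features are ranked by |coefficient|
    step=1,                        # drop one feature per elimination round
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_absolute_error",
    min_features_to_select=1,
)
selector.fit(X, y)

selected = [name for name, keep in zip(names, selector.support_) if keep]
print("optimal number of features:", selector.n_features_)
print("selected:", selected)
```

Running the same procedure separately on the female, male, and combined data is what would yield a different feature set per group, as in Table 3. Because a linear estimator ranks features by coefficient magnitude, the features should be on comparable scales for the ranking to be meaningful.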

Table 3 shows the feature sets for predicting various biomarker targets within a multivariate framework, applied to female-specific, male-specific, and combined datasets. For instance, the biomarker target “Albuminuria” was best predicted in females using body mass index (BMI), high-density lipoprotein (HDL), waist circumference, triglycerides, uric acid, and urine albumin-to-creatinine ratio (UrAlbCr). In contrast, the predictors for males included age, waist circumference, triglycerides, BMI, HDL, smoker status, and UrAlbCr as input variables. This pattern of varying optimal markers across datasets was observed across all listed biomarkers. Interestingly, biomarker targets in the combined datasets required more features, hinting at more complex relationships among the data patterns.

Table 3 Features selected for biomarker optimisation

The findings in Table 2 further revealed sex-specific variations in biomarkers, which guided the feature selection process outlined in Table 3 and illustrated in Fig. 3. This process demonstrated that sex-based variability substantially impacted feature selection by altering coefficient importance (β_i), potentially explaining the observed differences in selected features presented in Table 3^17,18,19. Optimal feature selection, recorded in Table 3, was determined by analysing cross-validation scores relative to variable importance, as demonstrated in Fig. 3. Specifically, Fig. 3A identified the optimal number of features for the model, while Fig. 3B highlighted the most influential features (Variable Importance) and the corresponding (β_i) values. This methodology^19 ensured the selection of only the most significant features and the development of the most effective model, addressing sex-specific influences and reducing the risk of overfitting, as further detailed in Fig. 4.

**Fig. 4: Demonstrates the impact of variable importance (βi) and variable importance for optimising the machine learning model to predict triglyceride levels in males (as an example).**

Figure 4A presents a side-by-side comparison of variable importance plots for females, males, and combined groups. These plots illustrate the standardised and weighted βi values of key features, ranked in order of their contribution to optimising each biomarker model (Model Target).

(A) Variable importance (βi) of predictors for model targets, grouped by sex and combined data. Each of the four panels (females, males, combined data, and combined data with sex as an input feature) presents horizontal stacked bar charts illustrating the importance of each predictor. Asterisks (*) highlight the most significant predictor for each model target in each group. (B) Frequency count of each biomarker used as a feature in the optimisation process across male, female, and combined subgroups.

In Fig. 4A, for the female subgroup, waist circumference was the most significant contributor to the BMI model target, age to systolic blood pressure and blood glucose, HDL to triglycerides, triglycerides to HDL, and systolic blood pressure to UrAlbCr. Notably, BMI was the top contributor to three biomarkers: albuminuria, uric acid, and waist circumference. The male subgroup showed a different pattern: waist circumference was the top contributor to both the BMI and triglyceride models, age to blood glucose, albuminuria, and UrAlbCr, BMI to uric acid and waist circumference, UrAlbCr to systolic blood pressure, and triglycerides to HDL. In the combined data, the most significant contributor for each model target generally matched that of either the male or the female subgroup, and in some cases all 4 data sets shared the same variable. UrAlbCr was the only model target for which the top contributor differed across all 4 data sets.

In Fig. 4B, age was the most frequently selected feature for both male and female groups in the ML models. Blood glucose was selected more frequently in females than in males. Feature selection also showed distinct sex-specific trends in frequency: in males, BMI, triglycerides and UrAlbCr had higher counts than in females. In females, waist circumference and albuminuria clustered together, while BMI, triglycerides, blood glucose, and HDL also tended to cluster. In the combined groups, the ML models required almost all the features as inputs to make accurate predictions of the biomarkers, in contrast to the sex-stratified groups.

Table 4 shows the best ML model and the corresponding metrics for predicting each biomarker in the four groups. A comparison of all 19 models evaluated against each other for all 9 biomarkers in the female and male subgroups can be found in Supplementary Table 1. The ML model that performed best on the data was chosen, irrespective of sex, for the remaining processes.
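Selecting a best model per target from a pool of candidates can be sketched as below. The four regressors are among those named in Table 4, but the cross-validation setup, the MAE selection criterion, and the synthetic data are assumptions; the study's actual comparison covered 19 models and is not reproduced here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import BayesianRidge, HuberRegressor, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=200)

candidates = {
    "Bayesian Ridge": BayesianRidge(),
    "Huber Regressor": HuberRegressor(),
    "Ridge": Ridge(alpha=1.0),
    "Gradient Boosting Regressor": GradientBoostingRegressor(random_state=0),
}

# Mean cross-validated MAE per candidate (scikit-learn reports it negated).
cv_mae = {
    name: -cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_absolute_error"
    ).mean()
    for name, model in candidates.items()
}
best = min(cv_mae, key=cv_mae.get)
print(best, round(cv_mae[best], 3))
```

Repeating this loop for each biomarker target and each data group would produce one winning model per cell, which is the shape of Table 4.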

Table 4 Best performing models for the various biomarkers in both male and female subgroups as well as combined data and combined data with sex as an input feature

In Table 4, the best-predicted biomarkers were waist circumference, BMI, systolic blood pressure, blood glucose and albuminuria. The overall trend in these metrics was that the male and female subgroups had lower error values than the combined groups, indicating better performance, as evidenced by the following:

Waist circumference

The male subgroup had lower MAE and MSE and a higher R² than the female subgroup, indicating better prediction accuracy for males. Both combined models fell between the subgroups, hinting at an averaging out of the sex-specific differences. The combined model with sex as an input variable had lower error values than the combined model without sex, indicating a more accurate model.

Female: Model = Bayesian Ridge, MAE = 4.95 cm, MSE = 40.55 cm², RMSE = 6.37 cm, R² = 0.8

Male: Model = Huber Regressor, MAE = 3.90 cm, MSE = 25.39 cm², RMSE = 5.04 cm, R² = 0.86

Combined: Model = Bayesian Ridge, MAE = 4.16 cm, MSE = 32.15 cm², RMSE = 5.67 cm, R² = 0.83

Combined + sex: Model = Bayesian Ridge, MAE = 3.98 cm, MSE = 28.10 cm², RMSE = 5.30 cm, R² = 0.85

BMI

The results in this group followed the same pattern as seen with the waist circumference for all 4 data sets.

Female: Model = Gradient Boosting Regressor, MAE = 2.05 kg/m², MSE = 7.54 (kg/m²)², RMSE = 2.75 kg/m², R² = 0.77

Male: Model = Ridge, MAE = 1.43 kg/m², MSE = 3.15 (kg/m²)², RMSE = 1.77 kg/m², R² = 0.86

Combined: Model = Gradient Boosting Regressor, MAE = 1.76 kg/m², MSE = 5.47 (kg/m²)², RMSE = 2.34 kg/m², R² = 0.81

Combined + sex: Model = Gradient Boosting Regressor, MAE = 1.62 kg/m², MSE = 4.53 (kg/m²)², RMSE = 2.13 kg/m², R² = 0.85

Blood glucose

Although glucose could be predicted, there is room for improvement across all 4 groups. The models explained only a small portion of the variability in the target variable (R² between 0.14 and 0.21), and their predictions showed a noticeable average deviation from the actual values, with both combined models performing slightly better. For this biomarker, sex as an input feature appeared to have no influence on the validation metrics, as the values were identical for both combined data sets.

Female: Model = Linear Regression, MAE = 0.47 mmol/L, MSE = 0.36 (mmol/L)², RMSE = 0.60 mmol/L, R² = 0.17

Male: Model = Huber Regressor, MAE = 0.40 mmol/L, MSE = 0.28 (mmol/L)², RMSE = 0.53 mmol/L, R² = 0.14

Combined: Model = Huber Regressor, MAE = 0.43 mmol/L, MSE = 0.30 (mmol/L)², RMSE = 0.55 mmol/L, R² = 0.21

Combined + sex: Model = Bayesian Ridge, MAE = 0.43 mmol/L, MSE = 0.30 (mmol/L)², RMSE = 0.55 mmol/L, R² = 0.21

Systolic blood pressure

For this biomarker the female subgroup’s prediction metrics were slightly better when compared with the male subgroup. Both combined models showed a slightly higher error which could be due to the inherent biological sex-differences in blood pressure patterns. These results indicated that the model was well-suited to predict values within the first, second, and third quartile ranges but could face challenges in accurately predicting values outside of these ranges.

Female: Model = Huber Regressor, MAE = 10.37 mmHg, MSE = 175.94 mmHg², RMSE = 13.26 mmHg, R² = 0.24

Male: Model = Huber Regressor, MAE = 10.38 mmHg, MSE = 176.90 mmHg², RMSE = 13.30 mmHg, R² = 0.08

Combined: Model = Random Forest Regressor, MAE = 10.77 mmHg, MSE = 198.82 mmHg², RMSE = 14.10 mmHg, R² = 0.13

Combined + sex: Model = Random Forest Regressor, MAE = 10.56 mmHg, MSE = 194.87 mmHg², RMSE = 13.96 mmHg, R² = 0.15

Albuminuria

For this biomarker, the female subgroup model performed slightly better than the male and combined models in predicting albuminuria. These findings suggested that the models predicted the target variable with small absolute errors, although the modest R² values (0.14 to 0.20) indicate that only a limited share of the variance was explained. Interestingly, Ridge was selected as the best model for the combined data, whereas Bayesian Ridge was selected for the male, female and combined-with-sex groups.

Female: Model = Bayesian Ridge, MAE = 0.19 mg/L, MSE = 0.06 (mg/L)², RMSE = 0.24 mg/L, R² = 0.20

Male: Model = Bayesian Ridge, MAE = 0.21 mg/L, MSE = 0.07 (mg/L)², RMSE = 0.26 mg/L, R² = 0.14

Combined: Model = Ridge, MAE = 0.21 mg/L, MSE = 0.07 (mg/L)², RMSE = 0.27 mg/L, R² = 0.15

Combined + sex: Model = Bayesian Ridge, MAE = 0.21 mg/L, MSE = 0.07 (mg/L)², RMSE = 0.27 mg/L, R² = 0.19
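For readers who want to reproduce the four metrics listed for each biomarker, the standard definitions can be written in a few lines. The function name `regression_metrics` and the toy values are illustrative only; the final line checks that one reported pair (female waist circumference, MSE = 40.55 and RMSE = 6.37) is internally consistent.

```python
import math


def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE and R² from paired actual and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R² = 1 - (residual sum of squares) / (total sum of squares)
    r2 = 1.0 - (mse * n) / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "R2": r2}


actual = [90.0, 100.0, 110.0, 120.0]
predicted = [92.0, 97.0, 111.0, 124.0]
print(regression_metrics(actual, predicted))

# RMSE is the square root of MSE, so the reported pair should agree.
print(round(math.sqrt(40.55), 2))  # 6.37
```

MAE and RMSE carry the units of the biomarker, MSE carries squared units, and R² is unitless, which is why low absolute errors and a low R² can coexist for a target with little spread.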

The contour plot in Fig. 5 provided a visual representation of the relationship between two key performance metrics from Table 4, Root Mean Squared Logarithmic Error (RMSLE) and Mean Absolute Percentage Error (MAPE), for the biomarker targets in all 4 data groups. Both metrics serve as indicators of model accuracy, where lower values signify better performance. RMSLE captures the logarithmic differences between predicted and actual values, so it reflects relative rather than absolute deviations. MAPE reflects the percentage error between predicted and actual values, offering a scale-independent measure of accuracy. The clustering of data points in the lower regions of both axes indicated that the majority of models exhibited low error rates on both metrics. The proximity of most data points to the lower left corner of the contour plot highlighted the overall robustness of the models, indicating that both male and female models performed comparably well with minimal prediction errors. This trend, coupled with the regression coefficients, suggested that the models maintained reliable performance across different error metrics.
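The two metrics plotted in the contour plot can be computed as follows. The paper does not state the exact formulas used, so this sketch assumes the common log(1 + x) form of RMSLE and expresses MAPE as a fraction; the function names and toy values are illustrative.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; all values must be > -1."""
    squared = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared) / len(squared))


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction; y_true must be non-zero."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


actual = [100.0, 100.0]
predicted = [110.0, 90.0]
print("MAPE:", mape(actual, predicted))
print("RMSLE:", rmsle(actual, predicted))

# For the same absolute miss, RMSLE penalises under-prediction slightly more.
print(rmsle([100.0], [90.0]) > rmsle([100.0], [110.0]))
```

Because both metrics measure error relative to the size of the actual value, they can be compared across biomarkers with very different units, which is likely why the two track each other closely.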

Contour plot of RMSLE versus MAPE, demonstrating the performance of all the biomarker models developed in the study. The data points predominantly cluster in the lower left regions of both RMSLE and MAPE axes, indicating minimal prediction errors across all 4 model targets. This concentration suggested consistent accuracy and low variance in model performance (Refer to Table 6).

Figure 5 showed a significant positive correlation between MAPE and RMSLE for the male (r = 0.91), female (r = 0.92), combined without sex as an input variable (r = 0.92), and combined with sex as an input variable (r = 0.93) models. The high correlation suggested that these error metrics were closely aligned, such that a reduction in one tended to be accompanied by a reduction in the other.

From this research we concluded that models such as waist circumference = β0 + β1⋅BMI + β2⋅Sex + β3⋅(BMI⋅Sex) could capture differences between sexes within a unified framework, as waist circumference is arguably conserved as a function of BMI and sex. However, these types of models assume that the relationship between predictors and outcomes is fundamentally similar for both sexes.
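To make the interaction model concrete, the sketch below fits waist circumference = β0 + β1⋅BMI + β2⋅Sex + β3⋅(BMI⋅Sex) by ordinary least squares on synthetic data. The coding of Sex as 0/1, the true coefficient values, and the noise level are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
bmi = rng.normal(28.0, 5.0, size=n)
sex = rng.integers(0, 2, size=n)  # hypothetical coding: 0 = female, 1 = male
# Synthetic truth: beta0 = 40, beta1 = 2.0, beta2 = 5.0, beta3 = 0.3
waist = (
    40.0 + 2.0 * bmi + 5.0 * sex + 0.3 * bmi * sex
    + rng.normal(0.0, 3.0, size=n)
)

# Design matrix columns: intercept, BMI, Sex, BMI x Sex interaction.
design = np.column_stack([np.ones(n), bmi, sex, bmi * sex])
beta, *_ = np.linalg.lstsq(design, waist, rcond=None)
b0, b1, b2, b3 = beta

print("BMI slope when Sex = 0:", round(b1, 2))
print("BMI slope when Sex = 1:", round(b1 + b3, 2))
```

The interaction term lets the BMI slope differ by sex (β1 versus β1 + β3), but both sexes still share one predictor set and one functional form, which is the limitation the text describes.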

This research assumed that disparate differences existed and therefore built separate models for males and females, acknowledging that physiological and biomarker differences may require distinct model structures. For instance, Table 4 showed that the optimal model architecture for predicting various model targets in males may not align with the features or interactions identified in females. In support of this, the tailored feature selection (Fig. 5A, B) highlighted the influence of variable importance (βi) and the effect of weighting on the optimisation of models (Table 3) when developing distinct models for each sex. This approach allowed feature selection to be tailored precisely to each group’s specific biological or physiological characteristics. Such customisation was not achievable with a general interaction model, where the same predictors were applied uniformly to both sexes. This assumption was also tested, and supported, by evaluating each sex-specific model on data from the same sex and from the opposite sex. The models generally performed better when tested within the same sex. There were instances where a female-trained model generalised well to male data and vice versa; however, this was not the common trend. This suggests that while some features may be transferable across sexes, others are inherently sex-specific, influencing overall predictive accuracy. These results are shown in the supplementary material (see Supplementary Figs. 1 and 2, along with Tables 2 and 3).
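The same-sex versus opposite-sex comparison can be sketched as a small train/test grid. Everything here is hypothetical: the two synthetic groups are given deliberately different linear relationships, a simple least-squares line stands in for the study's models, and MAE is used as the comparison metric.

```python
import numpy as np

rng = np.random.default_rng(5)


def make_group(slope, intercept, n=300):
    """Synthetic predictor/target pair with a group-specific relationship."""
    x = rng.normal(28.0, 5.0, size=n)
    y = intercept + slope * x + rng.normal(0.0, 3.0, size=n)
    return x, y


def fit(x, y):
    """Least-squares line; returns (intercept, slope)."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def mae(coef, x, y):
    return float(np.mean(np.abs(coef[0] + coef[1] * x - y)))


groups = {"female": make_group(2.0, 40.0), "male": make_group(2.6, 30.0)}
# First 200 rows train, last 100 rows test, within each group.
models = {g: fit(x[:200], y[:200]) for g, (x, y) in groups.items()}

for trained_on, coef in models.items():
    for tested_on, (x, y) in groups.items():
        score = mae(coef, x[200:], y[200:])
        print(f"{trained_on} -> {tested_on}: MAE = {score:.2f}")
```

When the underlying relationship differs between groups, the off-diagonal cells of this grid show larger errors; when it does not, a model transfers well, which matches the mixed picture reported in the supplementary results.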

Figure 5 showed that the predictive power of the separate models for males and females remained even in the absence of model standardisation. This method, which avoided standardising models or features, allowed for the optimisation of each model target based on MAPE and RMSLE performance metrics specific to each subgroup. The contour plot analysis supported this conclusion, indicating that accuracy was preserved, with the majority of models falling within the 0.8 to 1 accuracy range and corresponding MAPE and RMSLE values between 0.05 and 0.1.

Validation test of actual results vs predictive results

After creating and refining the prediction models, a validation test was performed on the hold-out set (test data) for all four data groups for each of the biomarkers. To evaluate predictive ability on the test set, the results were grouped according to whether the predicted value fell within a 5% or a 10% error of the actual value. Heatmaps showing these results are displayed in Figs. 6–8 for the female and male subgroups, and the combined data. Higher values (shown in red) indicated effective predictions by the optimised ML model, while blue indicated ineffective predictions.
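The 5% and 10% grouping amounts to counting how many predictions fall within a relative tolerance of the actual value. A minimal sketch follows; the function name `within_error_share`, the inclusive boundary, and the toy values are assumptions, since the paper does not state how boundary cases were handled.

```python
def within_error_share(actual, predicted, tolerance):
    """Fraction of predictions whose relative error is at most `tolerance`."""
    hits = sum(
        abs(p - a) <= tolerance * abs(a) for a, p in zip(actual, predicted)
    )
    return hits / len(actual)


actual = [100.0, 80.0, 120.0, 95.0]
predicted = [103.0, 87.0, 119.0, 110.0]  # relative errors: 3%, 8.75%, 0.8%, 15.8%

for tol in (0.05, 0.10):
    share = within_error_share(actual, predicted, tol)
    print(f"within {tol:.0%}: {share:.0%}")
```

Applying this to each biomarker's hold-out predictions gives one percentage per tolerance band, which is the quantity shown in each heatmap cell.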

**Fig. 6: Validation test for females.**

As seen in Fig. 6, the biomarkers for which the majority of individuals fell within the 10% error category were albuminuria, waist circumference, BMI, blood glucose, and systolic blood pressure. Albuminuria had the highest share of individuals (93%), followed by waist circumference (86%) and BMI (76%), with blood glucose and systolic blood pressure the lowest of the five, both at 64%. From a validation metrics point of view these were also the top performers, with the exception of systolic blood pressure, which had a very high MSE value. The biomarker with the fewest individuals in the “within 10% error” category was UrAlbCr with 12%, followed by triglycerides (26%), HDL (37%), and uric acid (39%).

Despite a 10% error being acceptable for prediction purposes, in a clinical setting such an error could result in an individual being incorrectly placed in an abnormal diagnostic range. For this reason, an error of 5% was also evaluated: albuminuria had 64% of individuals falling within this range, followed by waist circumference (58%). UrAlbCr (8%) and triglycerides (14%) were the two lowest biomarkers. Uric acid and HDL had the same percentage of individuals (19%) in this category.

Heatmap showing the number of individuals where the predicted value falls either within or outside of a 5% and 10% error of the actual value in the male subgroup (from the n = 119 test group).

Upon examining Fig. 7, the male subgroup had a higher percentage of individuals in the “within 10% error” category than the female subgroup overall: waist circumference (96%), albuminuria (92%), BMI (91%), blood glucose (74%), and systolic blood pressure (68%), followed by HDL (43%), uric acid (39%), UrAlbCr (16%), and triglycerides (14%).

Within the 5% margin of error for the male subgroup, waist circumference exhibited the greatest number of individuals, followed by albuminuria. A similar pattern was noted in the female subgroup with these markers occurring in reverse order. BMI also had more than 50% of the individuals falling within this category which was different from the female group. Again, the lowest two biomarkers for this category were triglycerides and UrAlbCr, which was reversed in the female subgroup.

Heatmap showing the number of individuals where the predicted value falls either within or outside of a 5% and 10% error of the actual value in the combined data group with sex as an input feature (from the n = 240 test group).

In the validation results for the combined data (Fig. 8), most of the top predictions fell within the 10% error group. Waist circumference was the highest (76%), followed by albuminuria (75%), BMI (68%), and blood glucose (59%), with systolic blood pressure (48%), HDL (34%), uric acid (28%), triglycerides (19%) and UrAlbCr (8%) falling below the 50% mark. Waist circumference was the only biomarker with more than 50% of predictions within the 5% error range.

**Fig. 8: Validation test for combined – sex group (where sex has been removed as an input feature).**

The validation results (Fig. 9) for the combined data containing the sex input feature were very similar to those for the combined data set without it. In order of highest to lowest percentage within the 10% error range, the results were: waist circumference (78%), albuminuria (74%), BMI (70%), blood glucose (58%), blood pressure (50%), uric acid (36%), HDL (35%), and triglycerides and UrAlbCr at 20% each. The within 5% error results were very similar between the combined data and the combined data with the sex input feature, with the exception of UrAlbCr at 3% and 8% respectively.

**Fig. 9: Validation test for combined + sex group (where sex is included as a feature).**

When comparing the validation tests collectively (Fig. 10), waist circumference had the highest share of predictions within the 10% error for all data groups except the female subgroup, where albuminuria was highest. BMI was third for all the data groups, followed by blood glucose and then blood pressure. The final 4 biomarkers appeared in a different order of prediction capability in each of the 4 groups. Overall, the male subgroup models predicted best, followed by the female subgroup, then the combined data with sex, with the combined data without sex predicting worst. The divergence in the data, e.g. between male and female subgroups for waist circumference, BMI and blood glucose, indicated that these models were able to discriminate between data patterns more effectively (also seen in the clustering found in Fig. 5). The validation test results were consistent with the validation metrics in Table 4 and Figs. 4 and 5.