The main objective of this study was to investigate the feasibility of classifying obese women according to their VAT weight, using an MLR model and machine learning-based classification techniques that combine routine blood test results with basic clinical information. Given the established association between VAT accumulation and metabolic and clinical outcomes,18,48,49 both MLR and classification are clinically relevant for subject risk stratification.
Analysis variables
All but one of the DXAdata variables were estimated from central body regions, which is in line with previously reported studies in different populations.50,51,52 The variables included in BLDdata were blood chemicals whose concentrations change with obesity and its comorbidities. Notably, the use of these blood biochemical concentrations for the classification of overweight and obesity, in both men and women, had already been proposed in previous studies,53,54 which supports their use in the context of this study.
Although the Wilcoxon signed-rank test showed that most variables differed significantly between the three groups, only a few variables survived the Bonferroni correction for multiple testing, and most of those were comparisons between classes 0 and 2 (see Table 3). In other words, a single variable obtained from DXA or blood samples may not be sufficient to robustly classify people living with obesity, possibly because of physiological interrelationships between variables. Machine learning-based classification models, however, can capture these complex relationships more effectively than other statistical models, as confirmed by previous studies, including Ferenci et al.51 and Mitu et al.55 Their application to this type of analysis is therefore justified.
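To illustrate why few comparisons survive the correction, the following minimal Python sketch applies a Bonferroni threshold to a small set of p-values. The p-values, the significance level of 0.05, and the function name bonferroni are assumptions made for illustration only; they are not taken from Table 3, and the number of tests corrected for in the study may differ.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the per-test threshold and a flag per test (True = still significant)."""
    m = len(p_values)
    threshold = alpha / m  # each test must clear alpha divided by the number of tests
    return threshold, [p < threshold for p in p_values]


# Hypothetical p-values for one variable across the three pairwise class
# comparisons (0 vs 1, 1 vs 2, 0 vs 2). Illustrative numbers, not study data.
p_values = [0.030, 0.020, 0.004]
threshold, significant = bonferroni(p_values)

print(f"per-test threshold: {threshold:.4f}")
print("significant after correction:", significant)
```

In this toy case all three comparisons are below 0.05 before correction, but only the class 0 versus class 2 comparison remains below the corrected threshold.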
Regression model
In this study, we chose MLR because (1) it can model the complex relationships that exist across DXA-based and blood chemistry variables, (2) it can determine the influence of individual variables on estimated VAT weight values, and (3) it can control for confounding variables. Indeed, by testing the influence of multiple variables simultaneously, MLR increases prediction accuracy, strengthens model robustness, and accommodates nonlinear relationships.34
However, given the large estimation errors and the \(\hbox{R}^{2}\) values obtained, none of the MLR models produced estimates of VAT weight accurate enough to be reliably used in the medical decision-making process. Specifically, the estimated VAT weight could differ from the actual VAT weight by up to a factor of two, as shown for patient 12 in Supplementary Table 2. This discrepancy could be attributed to the limited sample size (N = 149), which compromises the precision of the MLR models and reduces the reliability of VAT prediction and risk stratification within the study population.
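The two quantities discussed here, the coefficient of determination and the worst-case ratio between estimated and actual VAT weight, can be computed as in the minimal Python sketch below. The VAT values and the helper names r_squared and max_ratio are hypothetical and chosen only to show the calculation; they are not the values in Supplementary Table 2.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def max_ratio(actual, predicted):
    """Largest multiplicative discrepancy between an estimate and its true value."""
    return max(max(p / a, a / p) for a, p in zip(actual, predicted))


# Hypothetical VAT weights (kg) and model estimates, for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 2.0, 3.0, 3.0]

print(f"R^2 = {r_squared(actual, predicted):.2f}")
print(f"worst-case ratio = {max_ratio(actual, predicted):.1f}")
```

In this toy example the first estimate is twice the true value even though half of the estimates are exact, which shows why a summary statistic alone does not guarantee that individual estimates are usable for clinical decisions.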
Choosing the number of classes
As a preliminary study, four different threshold settings were investigated to define the most appropriate number of classes. The thresholds were based on (1) the mean, (2) tertiles, (3) quartiles, and (4) quintiles, in order to obtain classes of equal size and thus avoid potential bias due to imbalanced data. The statistical power of the resulting groups was tested using the “pwr” package in R.56 The resulting power was 23% for the mean-based split, 22% for tertiles, 18% for quartiles, and 13% for quintiles. Thus, splitting the population more finely reduces the statistical power of the results.
However, the two-class approach increases intra-class heterogeneity and makes it harder to classify patients correctly because of a significant overlap in VAT weight distributions (see Supplementary Figure 1a), which may reduce the clinical relevance of the results.
In contrast, the three-class approach allows better discrimination between the two main classes (class 0, including subjects with low VAT weight and low risk, and class 2, including subjects with high VAT weight and high risk) by significantly reducing the overlap (see Supplementary Figure 1b). It also adds an intermediate class that can contribute to a more detailed classification (e.g., low, medium, and high risk levels) with only a marginal loss of statistical power (22% versus 23%).
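The tertile-based thresholding discussed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the VAT weights are invented, the function name tertile_classes is hypothetical, and the quantile method used by the standard library (statistics.quantiles with its default settings) may not match the one used in the study.

```python
import statistics


def tertile_classes(vat_weights):
    """Split subjects into three classes using the two tertile cut points."""
    low_cut, high_cut = statistics.quantiles(vat_weights, n=3)
    labels = []
    for w in vat_weights:
        if w <= low_cut:
            labels.append(0)  # low VAT weight
        elif w <= high_cut:
            labels.append(1)  # intermediate VAT weight
        else:
            labels.append(2)  # high VAT weight
    return (low_cut, high_cut), labels


# Hypothetical VAT weights (kg), for illustration only.
vat = [0.8, 1.1, 1.3, 1.6, 1.9, 2.2, 2.6, 3.0, 3.5]
cuts, labels = tertile_classes(vat)

print("cut points:", cuts)
print("class sizes:", [labels.count(k) for k in range(3)])
```

Because the cut points are quantiles of the cohort itself, the three classes have equal size by construction, which is the property used here to avoid imbalanced data. It also means the thresholds are cohort-specific.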
Classification models
In general, LR works well on a wide variety of datasets and performs better than decision trees (and their derivatives, Random Forests and XGBoost) and k-nearest neighbors.57 This is also evident from the results in Table 5. It can be explained by the fact that LR is a classification technique in which the target variable (the class in this study) is assumed to be categorical (i.e., class 0, class 1, class 2). Specifically, LR is best suited to binomial problems, which is how the one-vs-rest analysis employed in this study frames each class.34,42 The performance of k-nearest neighbors generally worsens with high-dimensional data, especially in the presence of outliers, which can adversely affect the computation of the distance function, as was the case in this study.42 SVM, on the other hand, has shown performance comparable to LR in various studies on medical datasets.58,59 It is particularly useful in scenarios where the number of variables is much greater than the number of samples, but LR is the model most familiar to clinicians because of the relatively simple relationship between its inputs and outputs.60
We found striking similarities between ALLdata and DXAdata in the LR hyperparameter (HP) configurations, as well as in the confusion matrices and the ROC and PR graphs. This is probably because DXA-based variables are more strongly correlated with VAT weight than blood chemistry variables, which reduces the contribution of the latter to the final classification (see Figure 3a). Nevertheless, the LR model derived from BLDdata showed test accuracy and classification performance comparable to those of ALLdata and DXAdata (see Figure 2), at least for classes 0 and 2. In any case, class 1 seems to be the most difficult to classify correctly. This could be a result of the VAT distribution within class 1, where most values lie near the border with class 0 or class 2, as can be seen in Figure 1.
Interpretability and performance evaluation of classification models
SHAP was employed to gain a deeper understanding of the underlying mechanism of LR classification, to analyze the impact of each variable on the classification, and to rank the variables from most to least relevant. Furthermore, comparing the length of each colored bar with that of the other bars indicates the importance of a particular variable for a particular class. Visual inspection of the SHAP diagrams reveals that both LR classification models derived from ALLdata (Figure 3a) and DXAdata (Figure 3b) are biased towards classes 0 and 2, as the blue and green bars outweigh the orange bars across both datasets. In contrast, the classes in BLDdata (Figure 3c) appear more balanced across variables, although the non-blood variables (e.g., weight, age, hips) still lean towards classes 0 and 2.
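For a linear model such as LR, SHAP values have a simple closed form when the input variables are treated as independent: the contribution of a variable to the log-odds of a class is its coefficient multiplied by the deviation of the variable from its mean. The Python sketch below illustrates this relationship. The coefficients, means, and patient values are hypothetical, and the SHAP configuration actually used in this study is not described in this section, so this should be read as an illustration of the principle rather than a reproduction of Figure 3.

```python
def linear_shap(weights, x, feature_means):
    """SHAP values of a linear model on the log-odds scale, assuming independent features."""
    return [w * (xi - mu) for w, xi, mu in zip(weights, x, feature_means)]


def margin(weights, bias, x):
    """Log-odds output of the linear model for input x."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


# Hypothetical one-vs-rest coefficients for a single class and three variables.
weights = [0.5, -1.0, 2.0]
bias = 0.1
feature_means = [1.0, 2.0, 3.0]  # hypothetical cohort means
patient = [3.0, 1.0, 3.5]        # hypothetical patient

phi = linear_shap(weights, patient, feature_means)
print("per-variable contributions:", phi)

# Additivity: contributions sum to the patient's output minus the average output.
difference = margin(weights, bias, patient) - margin(weights, bias, feature_means)
print("sum of contributions:", sum(phi), "output difference:", difference)
```

Averaging the absolute contributions over all patients, separately for each class, gives per-class bar lengths of the kind compared in SHAP summary diagrams.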
One-vs-rest analysis allows the extension of any binary classifier to multi-class problems and has been widely adopted for multi-class classification.61,62,63 It evaluates the ability of the selected LR model to correctly distinguish patients of a certain class from all others. Comparing the AUC of the ROC and PR curves, we see similar results between ALLdata and DXAdata, whereas classification with BLDdata is more accurate on average (i.e., has a higher AUC) (see Figure 4). Interestingly, the LR model obtained from BLDdata performed better in class 1 classification than the same models obtained from the other two datasets.
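The per-class evaluation described here can be sketched in Python as follows: each class in turn is treated as the positive class and all others as negative, and the ROC AUC is computed from the predicted probability of that class. The labels and probabilities are invented for illustration, the function names are hypothetical, and only the ROC AUC is shown (the PR AUC is omitted for brevity).

```python
def roc_auc(scores, is_positive):
    """ROC AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(prob_rows, labels, n_classes=3):
    """Compute one ROC AUC per class, treating that class as positive and the rest as negative."""
    return [
        roc_auc([row[k] for row in prob_rows], [y == k for y in labels])
        for k in range(n_classes)
    ]


# Hypothetical true classes and predicted class probabilities for six patients.
labels = [0, 0, 1, 1, 2, 2]
probs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],
    [0.5, 0.3, 0.2],  # a class 1 patient scored like a class 0 patient
    [0.1, 0.3, 0.6],
    [0.2, 0.2, 0.6],
]

aucs = one_vs_rest_auc(probs, labels)
print("one-vs-rest ROC AUC per class:", aucs)
```

In this toy example the intermediate class obtains the lowest AUC because one of its patients receives probabilities close to those of a neighboring class, which mirrors the difficulty with class 1 discussed in this section.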
In ALLdata, blood chemicals seem to be less relevant than DXA-based variables (see Figure 3a). However, despite a slight decrease in the AUC for class 2, both the ROC and PR graphs indicate better overall results for the BLDdata-derived LR model, especially for class 1 patients. These patients were almost perfectly classified owing to the contribution of blood chemicals such as AST, COLT, usCRP, LDL, and FPG to the model (see Figure 3c), which confirms the important role of these variables in the BLDdata classification task.
Clinical significance
This study demonstrated the utility of machine learning algorithms for building risk stratification models based on variables directly related to VAT weight (e.g., DXAdata) as well as on variables derived from non-imaging techniques, such as blood chemistry concentrations that may be altered by excess VAT. Indeed, thanks to machine learning, medical professionals can take advantage of subtle blood changes that may go unnoticed at first glance and stratify obese women according to their VAT weight using only common laboratory indicators and routine clinical information. In clinical practice, these non-imaging-based models can be used for early-stage risk stratification in primary health care centers, allowing timely interventions to prevent severe clinical outcomes.
Furthermore, our findings may represent an advance in the assessment of obesity-related complications, especially cardiovascular disease, which is clearly associated with VAT. This has important implications for women. Data from the Framingham Heart Study show that obesity is associated with a higher incidence of cardiovascular disease in women (64%) than in men (46%).64 Obesity, along with other factors, significantly influences cardiovascular disease morbidity and mortality in women, making it an important focus of health interventions.65,66
Limitations
First, it is important to note that usCRP may not always be included in routine blood sample testing, especially in primary health care centers. Second, the number of subjects included was relatively small, which may have prevented us from finding significant changes in some of the parameters proposed in the original study and limited the total number of variables that could be included in the final model. Third, the study population consisted of people living with severe obesity, so it was not possible to test the performance of the model on subjects with less severe obesity. Fourth, the thresholds used to define the three proposed categories may not be suitable for other cohorts, limiting the extrapolation of the results to other populations. Fifth, only women were included, to avoid confounding effects due to gender. Finally, the small size of the study population does not allow the application of more advanced techniques, such as the deep learning-based analyses proposed in similar studies by Agrawal et al.67 and Klarqvist et al.68