Study Population
This study used a dataset of 303,372 individuals who underwent health examinations at the Health Care Center of Xiangya No. 3 Hospital, Central South University, from 2017 to 2022. This retrospective dataset, designated D1, was collected under a protocol approved by the hospital's ethics committee and in compliance with the Declaration of Helsinki. All participants provided informed consent. Data were obtained from the hospital's electronic medical record system and used to train and validate the MetS prediction models.
To assess the generalizability of the model, we used an external validation set, D2, sourced from a cross-sectional study of the workforce in the Balearic Islands, Spain, conducted between 2012 and 2016. The dataset is available at https://datadryad.org/stash/dataset/doi:10.5061/dryad.cb54 [17] and includes 60,799 participants aged 20-70 from a diverse range of sectors, including government and health services. The survey invited 69,581 employees, of whom 60,799 (10.2% of the active population) participated; 57.3% were male and 42.7% were female. The participant flow chart is shown in Figure 1.

Research Methods Overview
Data collection procedure
In the D1 dataset, trained medical staff conducted health interviews to collect demographic and behavioral information, including age, gender, medical history, and medication use. Fasting blood samples were analyzed on a Cobas 8000 analyzer to determine fasting blood glucose (FBG), triglycerides (TG), and high-density lipoprotein cholesterol (HDL-C). Blood pressure was measured with an OMRON device following a standardized protocol. Anthropometric measurements were taken according to ISAK standards.
The recorded health information included ID, age, gender, smoking status, body fat percentage (BF), a body shape index (ABSI), BMI, WC, waist-to-height ratio (WHtR), non-insulin-dependent diabetes mellitus (NIM), systolic (SBP) and diastolic (DBP) blood pressure, cholesterol, HDL-C, glucose (GLU), and TG levels.
Data cleaning and processing
D1 Data Cleaning
Data cleaning started from the initial 303,372 participants. Inclusion criteria were adults aged 18-75 years with available data on WC, hip circumference, blood pressure, heart rate, and age. Exclusion criteria were missing data; severe chronic diseases such as cancer, end-stage renal disease, or severe heart disease; and use of antihypertensive or lipid-lowering medication, which was excluded to reduce bias in MetS diagnosis. This resulted in a final sample of 268,942 participants, who were classified according to the IDF criteria [18] (Figure 2): 49,107 were identified as having MetS and 219,835 as not having MetS.

Figure 2. International Diabetes Federation (IDF) Diagnostic Criteria for Metabolic Syndrome
D2 data processing
The D2 dataset, originally diagnosed using the NCEP-ATP III criteria, was re-evaluated using the IDF criteria so that diagnoses were consistent across datasets. We applied the European WC thresholds (>94 cm for men and >80 cm for women). Height was back-calculated from WC and WHtR (height = WC / WHtR) to compute the body roundness index (BRI). After IDF re-diagnosis, 5,515 participants were classified as having MetS and 55,284 as non-MetS controls.
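The height recovery and BRI calculation described above can be sketched as follows. This is a minimal illustration, not the study's code: the function names are hypothetical, and the BRI expression is the commonly published formula (an ellipse model of the body), which the text itself does not spell out.

```python
import math


def height_from_whtr(wc_cm: float, whtr: float) -> float:
    """Recover height (cm) from waist circumference (cm) and WHtR."""
    if whtr <= 0:
        raise ValueError("WHtR must be positive")
    return wc_cm / whtr


def body_roundness_index(wc_cm: float, height_cm: float) -> float:
    """BRI from the commonly published formula (assumed, not given in the text).

    The body is modelled as an ellipse whose semi-minor axis is the waist
    radius WC / (2*pi) and whose semi-major axis is half the height.
    """
    waist_radius = wc_cm / (2 * math.pi)
    half_height = 0.5 * height_cm
    ratio_sq = (waist_radius / half_height) ** 2
    if ratio_sq >= 1:
        raise ValueError("waist radius must be smaller than half the height")
    eccentricity = math.sqrt(1 - ratio_sq)
    return 364.2 - 365.5 * eccentricity


# Illustrative values only (not taken from either dataset).
wc = 94.0
whtr = 0.55
height = height_from_whtr(wc, whtr)
bri = body_roundness_index(wc, height)
print(f"height = {height:.1f} cm, BRI = {bri:.2f}")
```

Under this formula the ratio of waist radius to half-height equals WHtR / pi, so BRI is a monotone function of WHtR alone; recovering height is still needed when height is used as a separate model feature.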
Selecting and extracting features
For early MetS detection, we prioritized non-invasive indicators as predictive features to reduce diagnostic complexity and cost. We began with correlation and significance analyses of all candidate features. Using Pearson correlation coefficients, we evaluated the association between continuous variables and MetS and found that BRI had the strongest correlation (r = 0.582, p < 0.01), indicating its strong predictive potential (Table 1). WC was also significantly correlated and was included. Height was added because of its role in the calculation of BRI. Age and gender were included because of substantial evidence for their role in metabolic health and MetS development, and because they are easy to obtain. Specifically, the prevalence of MetS increases with age [19, 20], and numerous studies have confirmed that the incidence of MetS differs between genders [21]. Through both empirical analysis and statistical testing, we confirmed that the selected features were statistically significant and biologically plausible predictors of MetS. In total, five variables (gender, age, BRI, height, and WC) were selected, with BRI considered the most valuable because it had the strongest correlation with MetS.
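As a sketch of the correlation step, the Pearson coefficient between a continuous feature and a binary MetS label (numerically the point-biserial correlation) can be computed as below. The function name and the toy values are assumptions made for illustration; they are not the study's data and will not reproduce r = 0.582.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Made-up BRI values and MetS labels (1 = MetS, 0 = no MetS).
bri_values = [2.8, 3.1, 3.6, 4.2, 4.9, 5.5, 6.0, 6.4]
mets_labels = [0, 0, 0, 0, 1, 0, 1, 1]

r = pearson_r(bri_values, mets_labels)
print(f"r = {r:.3f}")
```

The same function can be applied to each candidate feature in turn to rank features by the strength of their linear association with the outcome.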
Model comparison: Machine learning and baseline models
To investigate the benefits of machine learning, a simple baseline model was first constructed from the diagnostic criteria outlined in Figure 2 (International Diabetes Federation (IDF) Criteria for MetS Diagnosis). This baseline model classified individuals as positive for MetS based on the predefined IDF criteria for central obesity and associated risk factors. After developing the baseline model, we evaluated the performance of 10 machine learning algorithms. The main objective was to assess whether these models could improve MetS prediction compared with the baseline model built on predefined diagnostic criteria. The machine learning models in this comparison were logistic regression (LR), decision tree (DT), random forest (RF), extreme gradient boosting (XGBoost), support vector machine (SVM), gradient boosting decision tree (GBDT), light gradient boosting machine (LightGBM), k-nearest neighbors (KNN), naive Bayes (NB), and artificial neural network (ANN). Data preprocessing and standardization were applied so that all models were trained on comparable data, with continuous features standardized using the StandardScaler method.
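A rule-based classifier of the kind described for the baseline could look like the sketch below. It is an illustration under stated assumptions: the `Exam` record and function name are hypothetical, the WC cutoffs are the European thresholds quoted in this paper, and the risk-factor cutoffs are the widely cited IDF values in mmol/L and mmHg, since the content of Figure 2 is not reproduced in the text. Treatment-based criteria are omitted.

```python
from dataclasses import dataclass


@dataclass
class Exam:
    """One examination record (hypothetical field names, SI units)."""
    sex: str          # "M" or "F"
    wc_cm: float      # waist circumference
    tg_mmol: float    # triglycerides
    hdl_mmol: float   # HDL cholesterol
    sbp: float        # systolic blood pressure, mmHg
    dbp: float        # diastolic blood pressure, mmHg
    fbg_mmol: float   # fasting blood glucose


# European WC thresholds as stated in the text (strict inequality).
WC_CUTOFF_CM = {"M": 94.0, "F": 80.0}


def idf_mets(exam: Exam, wc_cutoff=WC_CUTOFF_CM) -> bool:
    """Central obesity plus at least two of four risk factors."""
    if exam.wc_cm <= wc_cutoff[exam.sex]:
        return False  # central obesity is a prerequisite
    risk_factors = [
        exam.tg_mmol >= 1.7,                                  # raised TG
        exam.hdl_mmol < (1.03 if exam.sex == "M" else 1.29),  # reduced HDL-C
        exam.sbp >= 130 or exam.dbp >= 85,                    # raised BP
        exam.fbg_mmol >= 5.6,                                 # raised FBG
    ]
    return sum(risk_factors) >= 2


example = Exam(sex="M", wc_cm=100.0, tg_mmol=2.0, hdl_mmol=0.9,
               sbp=120.0, dbp=80.0, fbg_mmol=5.0)
print(idf_mets(example))
```

Passing `wc_cutoff` as a parameter keeps the rule reusable for populations with different ethnicity-specific waist thresholds.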
Model construction
Based on the feature selection process, five features were used to predict the occurrence of MetS: sex, age, WC, height, and BRI. We applied 10 machine learning algorithms: XGBoost, SVM, DT, RF, GBDT, LightGBM, LR, KNN, NB, and ANN. Data were standardized consistently using StandardScaler. We used 10-fold cross-validation to ensure the robustness and accuracy of the models: the data were split into 10 parts, each part was used once for validation while the model was trained on the remaining nine, and this train-validate cycle was therefore run 10 times. Grid search was used to optimize the hyperparameters of each machine learning model. Detailed parameter configurations and tuning instructions are provided in the supplemental online content. Class imbalance was handled with the scale_pos_weight parameter of the classifier. In addition, ablation analyses were performed to explore the relative contribution of central obesity indices (particularly BRI) to model performance. Because BRI is a composite anthropometric measure derived from WC and height and acts as a proxy for central obesity, all models were retrained after removing BRI and other features directly related to central obesity. The goal was to assess whether these indicators contributed independently to MetS discrimination beyond the remaining features.
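The combination of standardization, 10-fold cross-validation, and grid search described above might be wired together in scikit-learn roughly as follows. This is a sketch under assumptions, not the study's configuration: the data are synthetic, only logistic regression is shown, the grid over `C` is arbitrary, the folds are stratified (the text does not say whether they were), and `class_weight="balanced"` stands in for `scale_pos_weight`, which is a boosting-library parameter rather than a scikit-learn one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the five features (sex, age, WC, height, BRI),
# with roughly 18% positives to mimic an imbalanced MetS outcome.
X, y = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=1,
    weights=[0.82, 0.18], random_state=0,
)

# Scaling lives inside the pipeline so it is re-fitted on the training
# part of every fold and never sees that fold's validation data.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}  # arbitrary example grid
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="roc_auc")
search.fit(X, y)

best_c = search.best_params_["clf__C"]
best_auc = search.best_score_
print(f"best C = {best_c}, mean 10-fold AUC = {best_auc:.3f}")
```

The same pipeline pattern applies to the other classifiers by swapping the `clf` step and its parameter grid. An ablation run amounts to repeating the search on `X` with the columns of interest removed.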
Model Evaluation and Threshold Analysis
To assess model performance on the validation set, individuals were first classified as positive for MetS when their predicted probability reached a threshold of 0.5. Model performance was then measured using accuracy, precision, recall, F1 score, AUC-ROC, and Brier score [22, 23, 24]. A lower Brier score indicates better calibration.
Additionally, we analyzed model behavior by testing performance at a range of thresholds (0.1-0.9). This allowed us to assess the trade-off between recall, precision, accuracy, specificity, and F1 score. Adjusting the threshold makes it possible to prioritize either recall (minimizing false negatives) or precision (minimizing false positives), both of which are essential for MetS detection [25, 26].
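The threshold analysis can be illustrated with a small, dependency-free sketch. The function names and the toy labels and probabilities are assumptions for illustration only; the metric definitions are the standard ones.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Return (tp, fp, tn, fn) when predicting positive if prob >= threshold."""
    tp = fp = tn = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def metrics_at(y_true, y_prob, threshold):
    """Threshold-dependent metrics for one cutoff."""
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "threshold": threshold,
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy labels and predicted probabilities (not study data).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.05, 0.15, 0.25, 0.35, 0.55, 0.65, 0.45, 0.62, 0.82, 0.95]

thresholds = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1 ... 0.9
sweep = [metrics_at(y_true, y_prob, t) for t in thresholds]
for row in sweep:
    print({key: round(value, 3) for key, value in row.items()})
print("Brier score:", round(brier_score(y_true, y_prob), 3))
```

Lowering the threshold can only add positive predictions, so recall never decreases as the threshold falls, while precision and specificity typically move the other way. The Brier score is computed from the probabilities directly and does not depend on the threshold.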
The models were trained on the D1 dataset and externally validated on the Balearic Islands D2 dataset to assess generalizability across different populations. The same thresholds were applied to test the robustness of the models in a real-world context. The optimal model was selected based on this threshold-based performance assessment to ensure diagnostic utility for clinical use. See the supplemental online content for detailed calculations of the metrics at each threshold.
Detailed analysis of model accuracy in external validation sets: a thorough investigation of true negative and false positive samples
After evaluating the models, the relatively low accuracy on the external validation set prompted a closer investigation into the causes of misclassification, focusing on true negatives (TN) and false positives (FP). To identify factors contributing to these misclassifications, we selected the model with the lowest diagnostic rate for further analysis. TN samples (individuals without MetS correctly classified as non-MetS) and FP samples (individuals without MetS incorrectly classified as having MetS) were extracted from the D2 external validation set. These samples were analyzed with respect to the five major MetS diagnostic criteria: WC, blood pressure, HDL-C, TG, and FBG levels. The goal was to identify data characteristics that lead to misclassification, provide insights for optimizing the models, and increase clinical applicability for early MetS diagnosis and prevention.
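A minimal sketch of this FP versus TN comparison is shown below. The record layout, field names, and values are invented for illustration, and comparing group medians is only one possible summary; the text does not specify how the groups were compared.

```python
from statistics import median

# Toy external-validation records (made-up values, hypothetical field names).
records = [
    {"y_true": 0, "y_pred": 1, "wc_cm": 102.0, "sbp": 132.0, "tg_mmol": 1.9},
    {"y_true": 0, "y_pred": 1, "wc_cm": 98.0, "sbp": 126.0, "tg_mmol": 1.5},
    {"y_true": 0, "y_pred": 1, "wc_cm": 105.0, "sbp": 138.0, "tg_mmol": 2.1},
    {"y_true": 0, "y_pred": 0, "wc_cm": 82.0, "sbp": 118.0, "tg_mmol": 1.0},
    {"y_true": 0, "y_pred": 0, "wc_cm": 88.0, "sbp": 121.0, "tg_mmol": 1.2},
    {"y_true": 0, "y_pred": 0, "wc_cm": 79.0, "sbp": 115.0, "tg_mmol": 0.9},
    {"y_true": 1, "y_pred": 1, "wc_cm": 108.0, "sbp": 145.0, "tg_mmol": 2.6},
]


def split_negatives(rows):
    """Split truly negative records into false positives and true negatives."""
    fp = [r for r in rows if r["y_true"] == 0 and r["y_pred"] == 1]
    tn = [r for r in rows if r["y_true"] == 0 and r["y_pred"] == 0]
    return fp, tn


def group_medians(group, fields):
    """Median of each listed field within one group of records."""
    return {field: median(r[field] for r in group) for field in fields}


fields = ["wc_cm", "sbp", "tg_mmol"]
fp_rows, tn_rows = split_negatives(records)
fp_medians = group_medians(fp_rows, fields)
tn_medians = group_medians(tn_rows, fields)

for field in fields:
    print(f"{field}: FP median = {fp_medians[field]}, TN median = {tn_medians[field]}")
```

In this toy data the FP group sits closer to the diagnostic cutoffs than the TN group, which is the kind of pattern such an analysis is meant to surface.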
Software and Statistical Methods
All statistical analyses were performed using Python 3.8. Continuous variables with skewed distributions were reported as median and interquartile range (M (Q1, Q3)); normally distributed variables were presented as mean ± standard deviation. Categorical variables were expressed as counts and percentages. Separate logistic regressions were performed to compare BRI and BMI in terms of their association with MetS risk. For group comparisons, the Mann-Whitney U test was used for continuous variables with skewed distributions and the chi-square test for categorical variables. Data preprocessing and feature selection were performed using the pandas library, and data were standardized using scikit-learn's StandardScaler to ensure consistent feature scales. To address class imbalance, we used the classifier's scale_pos_weight parameter to increase the weight of the minority class, improving model performance and reducing false negatives. Statistical significance was set at p < 0.05, and all tests were two-tailed. The TRIPOD guidelines were followed to ensure transparency and rigor in model development and reporting [27, 28]. The completed TRIPOD-AI checklist is provided as supplementary material (Supplementary File 2).
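As a worked example of the class-imbalance setting, a common heuristic for `scale_pos_weight` in gradient boosting libraries is the ratio of negative to positive samples. The sketch below applies that heuristic to the D1 class counts reported earlier; the value actually used in the study is not stated, so the result is an assumption rather than the authors' setting.

```python
# D1 class counts as reported in the data cleaning section.
n_pos = 49_107    # participants with MetS
n_neg = 219_835   # participants without MetS

# Heuristic (assumed, not confirmed by the text): negatives / positives.
scale_pos_weight = n_neg / n_pos
prevalence = n_pos / (n_pos + n_neg)

print(f"prevalence = {prevalence:.3f}")
print(f"scale_pos_weight = {scale_pos_weight:.2f}")
```

With this weighting, each positive sample contributes to the loss about as much as four to five negative samples, which pushes the classifier toward higher recall at some cost in precision.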
