Machine learning screening model to identify risk of high-frequency hearing loss in the general population

Subject characteristics and prevalence of HFHI

This study included 3371 community participants, consisting of 1730 men (51.3%) and 1641 women (48.7%), ranging from 18 to 98 years of age, with a mean age of 50.39 ± 15.23 years. Of these, 57.3% (1930/3371) were diagnosed with HFHI. Compared with those without HFHI, participants with HFHI were generally older, more likely to be male, and had lower educational levels and incomes. They were also more likely to have been diagnosed with high blood pressure, diabetes, otitis media, and chronic heart disease. Univariate analysis results for behavioral factors, environmental exposures, symptoms and medical conditions, routine blood indicators, and liver function indicators are summarized in Additional file 3. As a result, a total of 58 candidate indicators (p < 0.05) were identified for the subsequent model-building process.

Performance evaluation of HFHI screening model

In this study, we used seven ML algorithms to build HFHI classification models on the training set and evaluated these models on the validation set using AUC values, accuracy, precision, recall, specificity, and F-score. The model with the best discriminatory ability during the validation stage was selected as the final model to differentiate HFHI patients from other community residents. First, we compared the performance of each model using the commonly used receiver operating characteristic (ROC) curves and AUC values. The validated ROC curves and validated AUC values for all models are shown in Figure 2 and Table 2. Among the employed algorithms, the LASSO algorithm achieved the best validated AUC of 0.868 (95% confidence interval (CI): 0.847 to 0.889), showing the highest overall discriminatory ability in the validation cohort compared with the other models. KNN, Boosting, and XGBoost also achieved relatively good overall discriminatory ability, with validated AUC values of 0.866 (95% CI: 0.845 to 0.887), 0.858 (95% CI: 0.837 to 0.880), and 0.854 (95% CI: 0.833 to 0.876), respectively.
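As an illustration of the AUC comparison above, the following is a minimal sketch of how an AUC and an approximate 95% CI could be computed from predicted scores. The source does not state how its CIs were obtained; the Hanley-McNeil approximation used here is an assumption, and the function names and toy data are hypothetical.

```python
from math import sqrt

def auc_from_scores(labels, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def hanley_mcneil_ci(auc, n_pos, n_neg, z=1.96):
    """Approximate CI for an AUC (Hanley-McNeil standard error).
    Assumption: the paper may have used a different CI method."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    se = sqrt(var)
    return max(0.0, auc - z * se), min(1.0, auc + z * se)

# Toy data (hypothetical, not from the study): 1 = HFHI, 0 = no HFHI.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = auc_from_scores(labels, scores)
ci_low, ci_high = hanley_mcneil_ci(auc, n_pos=3, n_neg=3)
```

With a validation set of realistic size, the pairwise loop would normally be replaced by a rank-based computation, but the definition is the same.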

Table 2 Comparison of classification accuracy, precision, recall, specificity and F-score between different machine learning approaches

These ML models were further evaluated and compared with respect to other performance-related characteristics, namely accuracy, precision, recall, specificity, and F-score. The results are summarized in Table 2. In detail, the highest accuracy was achieved by the RF model (80.57%). In terms of precision, all seven models performed relatively well, each achieving a precision of over 80%; the best among them was the NB model (93.52%). In terms of recall, the RF model achieved the highest value among the seven models (81.74%). The two best models in terms of specificity were the NB model (96.62%) and the LASSO model (81.01%). Finally, we compared the F-scores of these models. Six of the seven models achieved values above 80%; ranked from highest to lowest, these were the RF model (83.21%), SVM model (82.94%), KNN model (81.97%), LASSO model (81.78%), Boosting model (80.52%), and XGBoost model (80.38%).
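For reference, the five measures compared above can all be derived from a binary confusion matrix. The sketch below assumes the F-score is the balanced F1 score (the source does not specify the weighting); the counts are hypothetical and are not taken from Table 2.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts.
    Assumes every denominator is non-zero."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)      # share of predicted HFHI that is true HFHI
    recall = tp / (tp + fn)         # sensitivity: share of true HFHI detected
    specificity = tn / (tn + fp)    # share of non-HFHI correctly ruled out
    f_score = 2 * precision * recall / (precision + recall)  # F1 (assumed)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_score": f_score,
    }

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=80, fp=20, tn=70, fn=30)
```

A model such as NB can combine very high precision and specificity with a lower recall, which is why the measures are reported side by side rather than as a single number.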

After performing these comprehensive comparisons, we found that the overall performance of the SVM, RF, KNN, and LASSO regression models was relatively better than that of the other models, but the AUC values of the SVM model (0.805), the RF model (0.803), and the KNN model (0.866) were lower than that of the LASSO regression model. Furthermore, we used 5-fold and 10-fold internal cross-validation to evaluate the performance of the different algorithms, and compared the AUC values of the cross-validated models with those of the original models. As with the original models, LASSO and KNN consistently performed better than the other algorithms. Specifically, in 5-fold cross-validation, the average AUC values for both the LASSO-based and KNN-based models were 0.857. When comparing the 95% CIs, the KNN model showed a slightly narrower range (0.844–0.869) than the LASSO model (0.845–0.870). In 10-fold cross-validation, KNN achieved slightly better performance than LASSO in terms of average AUC (Additional file 4). However, from the perspective of model interpretation and application, LASSO-based models have unique strengths. First, regarding model application, LASSO allows the selection of influential variables, which simplifies the model and promotes its application value. Since KNN lacks a variable filtering step, all variables must be included for subsequent application. Second, from a model interpretation perspective, the important variables selected by LASSO can be used to design individualized intervention programs, whereas KNN-based models lack such a feature, because all the features involved contribute to the model without a clear distinction of their importance, which limits interpretability and customization. As a result, we ultimately adopted the LASSO regression algorithm to construct the final HFHI screening model.
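The k-fold procedure described above can be sketched as follows. This is a generic illustration rather than the authors' pipeline: the shuffling seed, the absence of stratification, and the placeholder scorer are all assumptions, and a real scorer would fit a model on the training indices and return its AUC on the held-out fold.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds whose sizes
    differ by at most one."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n, k, fit_and_score, seed=0):
    """Hold out each fold once, train on the remaining folds, and
    return the mean score together with the per-fold scores."""
    folds = k_fold_indices(n, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(fit_and_score(train_idx, test_idx))
    return sum(scores) / k, scores

# Placeholder scorer (hypothetical): returns the held-out fraction
# instead of a model's AUC, so the example runs without a model.
def toy_scorer(train_idx, test_idx):
    return len(test_idx) / (len(train_idx) + len(test_idx))

folds = k_fold_indices(103, 5)
mean_score, fold_scores = cross_validate(103, 5, toy_scorer)
```

Averaging the per-fold AUCs in this way yields the kind of mean AUC reported for the 5-fold and 10-fold comparisons.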

By employing LASSO regression, our screening model finally selected 34 variables as indicators of HFHI risk, reaching a fitted AUC of 0.866 (95% CI: 0.852-0.881) and a predicted AUC of 0.868 (95% CI: 0.847-0.889) in the validation cohort. The two AUC values are relatively high and differ little, indicating that the derived model achieved robust performance. The 34 HFHI risk indicators that remained in the final LASSO-based model included 5 demographic indicators, 7 disease-related characteristics, 5 behavioral factors, 2 environmental exposures, 2 auditory-cognitive factors, and 13 blood test indicators. Among these, a history of coronary heart disease, otitis media, and self-reported hearing loss, as well as several routine blood indicators (e.g., RDW, PDW, LY%) and liver function indicators (e.g., TG, IBIL, AST, LDL), were identified as the most important indicators (Additional file 5).

HFHI Screening Nomogram and Model Interpretation Case Study

Based on the 34 HFHI risk indicators identified by LASSO regression, we further developed a nomogram to transform the LASSO regression model into an accessible screening tool that can be easily used by primary care physicians and community members. In the nomogram (Figure 3), the points an individual achieves for each indicator on the corresponding scale are summed to give the individual's total points, and a vertical line drawn from the total points scale gives the individual's final HFHI risk score. In our dataset, all individuals were ranked from low to high risk based on their risk score and classified into three risk categories: high risk (score 0.75 to 1.00), medium risk (score 0.45 to 0.75), and low risk (score 0 to 0.45). Classification thresholds for each risk category were determined based on the positive predictive value (PPV), sensitivity, and specificity of each risk category. Overall, the low-risk group consisted of 1314 individuals, of whom 22.91% (301/1314) had HFHI. In the medium-risk group of 822 individuals, 60.83% (500/822) were diagnosed with HFHI. In the high-risk group, 91.42% (1129/1235) were confirmed to have HFHI. According to the nomogram, the demographic characteristics associated with a high risk of HFHI were older age, male sex, lower education, and lower income. Regarding medical history, those with a history of tinnitus, hypertension, diabetes, coronary heart disease, and otitis media were at higher risk of HFHI. Lifestyle characteristics such as smoking history, drinking history, heavy use of electronic products, high levels of living pressure, and exposure to noise in the work environment were risk factors for HFHI. In addition, 13 blood test indicators were identified by the model. Among them, the indicators that had the greatest impact on HFHI were RDW, NE, and TG.
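The risk stratification above can be expressed as a small sketch. The cut-offs 0.45 and 0.75 are taken from the text, but which category a score exactly on a cut-off belongs to is not stated; assigning it to the higher category is an assumption, and the function name is hypothetical. The counts are the ones reported in the paragraph above.

```python
def risk_category(score):
    """Map a nomogram risk score in [0, 1] to a risk category.
    Assumption: a score exactly on a cut-off falls in the higher category."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("risk score must lie in [0, 1]")
    if score >= 0.75:
        return "high"
    if score >= 0.45:
        return "medium"
    return "low"

# Reported (HFHI cases, group size) for each risk group.
reported = {"low": (301, 1314), "medium": (500, 822), "high": (1129, 1235)}

# Percentage of each group with confirmed HFHI.
hfhi_rate = {g: round(100 * cases / n, 2) for g, (cases, n) in reported.items()}

total_cases = sum(cases for cases, _ in reported.values())
total_people = sum(n for _, n in reported.values())
```

The three groups add up to the full cohort (1930 HFHI cases among 3371 participants), and the proportion with HFHI in the high-risk group corresponds to the PPV of the 0.75 cut-off.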

Figure 3 also uses an individual from the community population as an example of applying the constructed HFHI screening model in a community setting, demonstrating the model's ability to identify potentially high-risk individuals who are typically overlooked. In general, older residents are at higher risk for HFHI, as age is one of the most important indicators of HI. However, one 42-year-old man in the validation set, although relatively young, had a predicted HFHI risk of 0.846 and was therefore classified in the high-risk group for HFHI. This middle-aged man was also confirmed to have HFHI by his hearing test results (47.50 dB in the poorer-hearing ear). Based on the patient record, his 34 relevant features are marked with red triangles in Figure 3. In terms of demographics and disease status, this man had self-reported HI, hypertension, and diabetes. He also had several behavioral risk factors, including smoking, alcohol consumption, excessive volume on electronic devices, and routine noise exposure at work. Regarding biomarker risk factors, he had an abnormally high level of IBIL. We hope that applying this screening model will make it possible to refer such potentially high-risk individuals for subsequent confirmatory diagnosis and to identify risk factors for subsequent individualized interventions.

Differences in characteristics in risk stratification

Differences in resident lifestyles depending on risk stratification

To further explore the distribution of lifestyle variables across the three risk categories identified by the model, we calculated the proportion of individuals with a particular lifestyle behavior within each 0.1-point interval of the risk score and plotted loess curves across the spectrum of risk scores. As shown in Figure 4, the proportions of individuals who used electronic devices at a volume above 40%, smoked, drank alcohol, or self-reported workplace noise exposure at least once per week increased significantly with increasing risk score, whereas the proportions of individuals who consumed 500 g or more of fruits and vegetables daily or exercised at least once a month decreased significantly. In detail, 72.83% (957/1314) of the low-risk group exercised at least once a month, while in the high-risk group this proportion decreased to 46.07% (569/1235). A total of 28.10% (347/1235) of the high-risk group were alcohol drinkers, which was 3.13 times the proportion in the low-risk group (8.98%, 118/1314). Similarly, 39.51% (488/1235) of individuals in the high-risk group were smokers, which was 2.93 times the proportion in the low-risk group (13.47%, 177/1314).
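The binning step and the between-group ratios above can be sketched as follows. The loess smoothing itself is not reproduced here; only the per-interval proportions are computed. Function names and the toy scores are hypothetical, while the drinking and smoking counts are those reported in the text.

```python
def binned_proportions(scores, flags, width=0.1):
    """Proportion of individuals with a behavior (flag == 1) within each
    risk-score interval of the given width; None for empty intervals."""
    n_bins = round(1 / width)
    totals = [0] * n_bins
    hits = [0] * n_bins
    for s, f in zip(scores, flags):
        b = min(int(s / width), n_bins - 1)  # a score of 1.0 goes in the last bin
        totals[b] += 1
        hits[b] += f
    return [h / t if t else None for h, t in zip(hits, totals)]

def proportion_ratio(a_hits, a_total, b_hits, b_total):
    """Ratio of the proportion in group A to the proportion in group B."""
    return (a_hits / a_total) / (b_hits / b_total)

# Toy risk scores and behavior flags (hypothetical).
toy_scores = [0.05, 0.12, 0.18, 0.55, 0.58, 0.91, 0.95, 0.99]
toy_flags = [0, 0, 1, 1, 0, 1, 1, 0]
props = binned_proportions(toy_scores, toy_flags)

# Reported counts: high-risk group versus low-risk group.
drinking_ratio = proportion_ratio(347, 1235, 118, 1314)
smoking_ratio = proportion_ratio(488, 1235, 177, 1314)
```

Rounded to two decimals, the two ratios reproduce the reported values of 3.13 and 2.93.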

Differences in resident blood index by risk stratification

We also investigated the distribution of the identified blood indicator variables across the three risk categories (Figure 5). As the risk score increased, the proportion of people with high levels of LDL, LY%, RDW, TC, and EO% also increased, whereas the proportion of people with high levels of HDL decreased slightly. Specifically, 29.15% (360/1235) of the high-risk group had high levels of TC, which was 3.25 times the proportion in the low-risk group (8.98%, 118/1314). In the high-risk group, 36.03% (445/1235) of individuals showed high LDL levels, compared with 23.06% (303/1314) in the low-risk group. Conversely, the proportion of individuals with high levels of HDL decreased from 6.16% (81/1314) in the low-risk group to 3.16% (39/1235) in the high-risk group.