Study population
The study population consisted of patients diagnosed with Schistosoma japonicum infection in Yueyang City, Hunan Province, China. The city has historically been an endemic area for schistosomiasis because it lies near Dongting Lake, on the middle and lower reaches of the Yangtze River, where the intermediate host, Oncomelania hupensis, breeds in large numbers.
Schistosoma japonicum infection was diagnosed according to the definition of Zhou et al.26. The diagnostic criteria include a history of living in a schistosomiasis-endemic area, contact with infected water, schistosomiasis-specific serology, color ultrasound, and microscopic examination of excreta (feces, urine). Infection was considered present if schistosome eggs were identified in the stool or urine or if schistosome serology was positive.
Liver fibrosis was determined by ultrasound according to the World Health Organization diagnostic criteria for Schistosoma japonicum infection27,28. An experienced ultrasound specialist divided the patients into two groups according to the ultrasound results: a fibrosis group and a no-fibrosis group (smooth and uniform liver echotexture without mesh-like changes). This diagnosis was reconfirmed by another experienced schistosomiasis specialist.
Data collection
A retrospective medical record review was conducted from June 2019 to June 2022 at Xiangyue Hospital, Yueyang City, Hunan Province, China. All patients underwent blood tests and ultrasound examinations upon admission. All variables were extracted from the hospital's electronic medical record system. The data include patient demographic characteristics, routine blood indicators, and other variables. Data entry was performed by full-time research physicians or medical students.
This study was approved by the Ethics Committee of the Third Xiangya Hospital of Central South University (number: 21149), and the experiments were conducted in accordance with the Code of Ethics of the World Medical Association (Declaration of Helsinki). All methods were performed in accordance with relevant guidelines and regulations. Due to the retrospective nature of the study, the Ethics Committee of the Third Xiangya Hospital of Central South University waived the need for informed consent. The privacy of all participants is fully protected.
Missing data were filled in using the K-nearest neighbor (KNN) method. The principle is to identify the k samples in the dataset that are closest to the incomplete sample according to a distance measure, and to use these k samples to estimate the value of the missing data point. The percentage of missing data points is shown in Supplementary Table 5. The LassoCV method was used to screen key variables.
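The KNN filling of missing values described in this section can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: it assumes scikit-learn's KNNImputer (available from version 0.22), and the toy matrix, the value k = 2, and the uniform weighting are illustrative assumptions, since the text does not report these settings.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy feature matrix: rows are patients, columns are hypothetical lab values.
# np.nan marks a missing measurement.
X = np.array([
    [1.0, 2.0, 3.0],
    [1.1, np.nan, 3.1],   # second value is missing
    [0.9, 2.2, 2.9],
    [8.0, 9.0, 10.0],     # a distant sample that should not be used as a neighbor
])

# k = 2 and uniform weights are assumptions for illustration only.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_filled = imputer.fit_transform(X)

# The missing entry is replaced by the mean of that column over the
# two nearest rows (rows 0 and 2), measured on the observed columns.
print(X_filled[1, 1])
```

In practice the features would usually be put on a comparable scale before imputation, because the distance measure is otherwise dominated by variables with large numeric ranges.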
Feature selection
Patients were divided into liver fibrosis and non-liver fibrosis groups according to the color Doppler ultrasound results. Patients with hepatitis B virus infection (hepatitis B surface antigen seropositive), hepatitis C virus infection (HCV antibody seropositive), human immunodeficiency virus infection (HIV antibody seropositive), alcoholic or non-alcoholic fatty liver disease (ultrasound scan and alcohol intake >30 g daily), decompensated liver disease or liver cancer (ultrasonography and liver function tests), or organ transplantation (self-reported) were excluded. Key variables were selected by the LassoCV method for subsequent modeling.
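A minimal sketch of variable screening with LassoCV is given here, under stated assumptions. The synthetic data, the hypothetical feature names, the standardization step, and the 5-fold setting are all illustrative and are not taken from the study. The only behavior taken from the text is that variables whose coefficient is 0 are dropped.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n_samples = 200

# Hypothetical feature names; the study's variables are not listed here.
feature_names = ["f0", "f1", "f2", "f3", "f4", "f5"]
X = rng.normal(size=(n_samples, len(feature_names)))

# Synthetic binary outcome driven only by the first two features.
signal = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)
y = (signal > 0).astype(float)

# Standardize so the L1 penalty treats all features on the same scale.
X_std = StandardScaler().fit_transform(X)

# LassoCV chooses the penalty strength by cross-validation (5 folds assumed).
lasso = LassoCV(cv=5, random_state=0).fit(X_std, y)

# Keep only the variables with a non-zero coefficient.
selected = [name for name, coef in zip(feature_names, lasso.coef_) if coef != 0.0]
print(selected)
```

The retained names would then be used to subset the feature matrix before fitting the classifiers.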
Research design
First, the classification task was performed using six machine learning algorithms: “XGB Classifier”, “Logistic Regression”, “LightGBM Classifier”, “Random Forest Classifier”, “Support Vector Classifier”, and “K-Nearest Neighbor Classifier”. Five-fold cross-validation was used for validation. Each model was evaluated using the AUC, clinical decision curve, accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and F1 score. The ROC curves and forest plot show the ROC results of each model for predicting “liver fibrosis.”
After selecting the best algorithm through the multi-algorithm comparison, we re-modeled using that algorithm. Unlike in the multi-model comparison, 15% of the total samples were randomly selected as the test set for this step, and the remaining samples were used as the training set with 5-fold cross-validation.
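The threshold-based metrics listed above all derive from the confusion matrix. The following sketch shows one way to compute them in plain Python; the function name binary_metrics and the example labels are hypothetical, and label 1 is assumed to mean "liver fibrosis".

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(y_true, y_pred):
    """Compute confusion-matrix metrics for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    sensitivity = _ratio(tp, tp + fn)   # true positive rate (recall)
    ppv = _ratio(tp, tp + fp)           # positive predictive value (precision)
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": _ratio(tn, tn + fp),
        "ppv": ppv,
        "npv": _ratio(tn, tn + fn),
        "f1": _ratio(2 * ppv * sensitivity, ppv + sensitivity),
    }


# Made-up labels: 4 patients with fibrosis, 6 without.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

The AUC is not included because it depends on predicted probabilities rather than on a single thresholded prediction.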
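The hold-out plus cross-validation scheme for the final model might look like the sketch below. Several parts are assumptions: the data are synthetic, a random forest stands in for the best-performing algorithm (which this section does not name), and the split is stratified by outcome, whereas the text only says the test set was chosen at random.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for the study data.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Hold out 15% of all samples as the test set (stratification is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0
)

# Placeholder for whichever algorithm performed best in the comparison.
model = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validation on the training portion only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")

# Refit on the full training set and evaluate once on the held-out test set.
model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(cv_auc.mean(), test_auc)
```

Keeping the test set out of the cross-validation loop means the final AUC is computed on samples the model never saw during training or tuning.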
Interpreting the model
The SHAP package in Python interprets the output of a machine learning model by treating every feature as a “contributor” to the prediction. For each sample, the model generates a predicted value, and SHAP attributes that value to the individual features. Its main advantage is that it reflects the influence of each feature in each sample and shows whether that influence is positive or negative. In this study, we used the SHAP package to interpret the model. A SHAP value plot was used to show the contribution of each variable in the model. A variable importance plot was used to show the importance ranking of each variable. We used force plots for two example samples to illustrate how each variable affects the predicted result.
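To make the idea of features as “contributors” concrete, the sketch below computes exact Shapley values by brute force for a toy model. It does not use the shap package or its API, which relies on much faster model-specific algorithms and a background dataset. The function shapley_values, the toy model, and the all-zero baseline are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of each feature of x relative to a baseline.

    Features outside a coalition are set to their baseline value.
    Cost grows as 2**n, so this is only practical for a few features.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = [x[j] if (j in subset or j == i) else baseline[j] for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_model(f):
    # Hypothetical model with two linear terms and one interaction.
    return 2.0 * f[0] - 1.0 * f[1] + 0.5 * f[0] * f[2]


x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)  # one signed contribution per feature
```

The contributions sum to the difference between the prediction for the sample and the prediction for the baseline, which is the additivity property that a force plot visualizes: positive values push the prediction up and negative values push it down.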
Statistical methods
Python version 3.7 was used in this study. The statsmodels 0.11.1 package in Python was used to test whether each variable differed between the two groups. The analytical method was selected depending on the sample distribution, homogeneity of variance, and sample size. Chi-square tests were used for categorical variables. Student's t test or the Mann-Whitney U test was used for quantitative variables.
LassoCV was used to screen for key variables, and factors with coefficients of 0 were automatically excluded (sklearn 0.22.1 package in Python). Lasso obtains a more refined model by constructing a penalty function that shrinks the regression coefficients, that is, it forces the sum of the absolute values of the coefficients to be less than a fixed value, which sets some regression coefficients exactly to zero. It therefore retains the advantage of subset selection and is a biased estimation method suited to data with multicollinearity. In both the multi-model comparison and the best-model modeling, Python's xgboost 1.2.1 package was used for XGBoost modeling, the lightgbm 3.2.1 package was used for LightGBM modeling, and the sklearn 0.22.1 package was used to build the other models. The shap 0.39.0 package in Python was used to demonstrate the interpretability of the model.
Ethical standards
Ethics approval was obtained from the Ethics Committee of the Third Xiangya Hospital of Central South University.