Machine learning models based on routine blood and biochemical test data for the diagnosis of neurological diseases

Data Collection and Processing

All raw data were obtained from neurology inpatients and from healthy individuals who underwent physical examinations at the First Affiliated Hospital of Xiamen University between 2018 and 2023. The data were extracted from the Hospital Information System, and each individual's diagnostic information was integrated with their routine blood and biochemical test data. For patients, routine blood and biochemical test data from the first test after admission were used as features for model construction; for healthy individuals, data from the first annual physical examination were used. Because too many missing values could reduce prediction accuracy, we removed features with a missing-value ratio greater than 50%, leaving 22 routine blood features and 30 biochemical features (Supplementary Tables 1 and 2). Diagnoses for all patients were assigned according to the International Statistical Classification of Diseases and Related Health Problems, Tenth Revision (ICD-10). To ensure a sufficient sample size for each neurological disease, we removed diseases with fewer than 100 samples. All samples with missing values were then deleted to ensure data reliability. Finally, we constructed the models using 25,794 healthy individuals and 7,518 patients with neurological diseases (Fig. 1; Table 1). These data were randomly divided into a training set (70%) and a validation set (30%).
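The filtering and splitting steps described above can be sketched with pandas. This is a minimal illustration, not the authors' pipeline: the function name `filter_and_split`, the column name `diagnosis`, the label `healthy`, and the random seed are all assumptions introduced for the example.

```python
import numpy as np
import pandas as pd


def filter_and_split(df, feature_cols, label_col="diagnosis", healthy_label="healthy",
                     max_missing=0.5, min_samples=100, train_frac=0.7, seed=0):
    """Apply the described filters, then return (train, validation) frames.

    Assumes `df` has a unique index and one row per individual.
    """
    # 1. Drop features whose missing-value ratio exceeds the threshold (50%).
    missing_ratio = df[feature_cols].isna().mean()
    kept = [c for c in feature_cols if missing_ratio[c] <= max_missing]

    # 2. Drop diseases with fewer than `min_samples` records; keep healthy rows.
    counts = df[label_col].value_counts()
    frequent = counts[counts >= min_samples].index
    df = df[(df[label_col] == healthy_label) | df[label_col].isin(frequent)]

    # 3. Drop every remaining sample that has a missing value in a kept feature.
    df = df.dropna(subset=kept)

    # 4. Random 70/30 split into training and validation sets.
    train = df.sample(frac=train_frac, random_state=seed)
    valid = df.drop(train.index)
    cols = kept + [label_col]
    return train[cols], valid[cols]
```

The order of the steps follows the text: feature screening first, then removal of rare diseases, then deletion of incomplete samples, and only then the random split.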

Table 1 Disease data distribution.

Machine Learning Methods

Logistic regression (LR) is a generalized linear model that is widely used in data mining, automated disease diagnosis, economic forecasting, and other fields. It estimates the probability of an event from a set of independent variables; because the output is a probability, it lies between 0 and 1. Random forest (RF) is an ensemble classifier composed of many decision trees that can handle classification and regression problems and can also be used for dimensionality reduction. It is robust to outliers and noise and generally provides better predictive performance than a single decision tree. A support vector machine (SVM) is a generalized linear classifier that performs binary classification through supervised learning; its decision boundary is the maximum-margin hyperplane fitted to the training samples. Extreme Gradient Boosting (XGBoost) is an engineering implementation of the gradient boosting decision tree (GBDT) algorithm. It is efficient, flexible, and lightweight, and is widely used in data mining, recommender systems, and other fields. Deep neural networks (DNNs) are a deep learning framework consisting of neural networks with at least one hidden layer. Like shallow neural networks, DNNs can model complex nonlinear systems, but their additional layers provide a higher level of abstraction, which increases the model's capacity. We built models with LR, RF, SVM, XGBoost, and DNN to compare the performance of different machine learning methods^{20,21,22,23,24}. LR, RF, and SVM were implemented with scikit-learn (version 1.3.0), XGBoost with the xgboost package (version 2.0.2), and the DNN with TensorFlow in Python (version 2.0.2).

All features were standardized before model training. We addressed class imbalance by adjusting the class_weight parameter of the LR, RF, SVM, and DNN algorithms during training. For LR, class_weight increases the loss weight of minority-class samples in the cross-entropy loss, so that the model pays more attention to these samples. In RF, class_weight adjusts sample weights during tree construction, balancing the calculation of information gain or Gini impurity. For SVM, class_weight assigns class-specific penalty weights in the optimization objective, amplifying the influence of minority-class support vectors. In DNN, class_weight assigns weights to samples of different classes, making minority-class samples count more in the loss calculation. For XGBoost, we handled class imbalance by adjusting the scale_pos_weight parameter. This parameter rescales the gradient and Hessian of the objective function for positive samples, thereby changing the relative influence of positive and negative samples on model optimization.
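As a rough illustration of standardization combined with class weighting, the sketch below fits a class-weighted logistic regression in scikit-learn on synthetic data. The data, the use of "balanced" weights, and the negatives-to-positives ratio for scale_pos_weight are assumptions made for the example; the text does not state which weight values were actually used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)
# Synthetic, imbalanced stand-in data: 900 negatives, 100 positives, 5 features.
X = np.vstack([rng.normal(0.0, 1.0, (900, 5)), rng.normal(1.0, 1.0, (100, 5))])
y = np.array([0] * 900 + [1] * 100)

# "balanced" weights are n_samples / (n_classes * class_count),
# so the minority class receives the larger weight.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
class_weight = {0: float(weights[0]), 1: float(weights[1])}

# Standardize the features, then fit a class-weighted logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(class_weight=class_weight))
model.fit(X, y)

# A common heuristic for XGBoost's scale_pos_weight: negatives / positives.
scale_pos_weight = float((y == 0).sum() / (y == 1).sum())
```

The same kind of per-class dictionary can be passed as `class_weight` to scikit-learn's RF and SVM estimators and to the `fit` method of a Keras model.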

For all five machine learning algorithms, grid search with cross-validation (CV) was combined with manual fine-tuning to identify optimal parameters and reduce the risk of overfitting. The hyperparameter optimization process aimed to balance computational cost against search coverage. On the one hand, representative and reasonable parameter grids were designed based on the characteristics of each algorithm, and CV was used to ensure that the results were robust and reliable. On the other hand, parallel computing was adopted to improve optimization efficiency. Evaluating each hyperparameter combination requires repeated training and validation, particularly during CV, so computational demand escalates rapidly. To accelerate this process, multi-core processors were used to evaluate multiple hyperparameter combinations simultaneously, which significantly reduced the overall optimization time. Resources were allocated carefully to maintain efficiency without overloading the system or degrading performance. This approach reduced computational cost while still allowing comprehensive optimization. Taking LR and SVM as examples, the hyperparameters were optimized as follows. For LR, we tuned the penalty, choosing between the commonly used L1 and L2 regularization types. The parameter C took five values (0.5, 1, 2, 3, and 4), covering a typical range from strong regularization (0.5) to weak regularization (4). Solvers were selected from five mainstream optimization algorithms (e.g., liblinear, saga, lbfgs).
Initial experiments revealed that some solvers were incompatible with specific parameter combinations (e.g., L1 regularization), so the set of solvers was narrowed in subsequent steps to reduce the computational load. For SVM, the following hyperparameters were optimized. C was searched between 0.1 and 100, with the values 0.1, 1, 10, and 100, from strong to weak regularization. The kernel was restricted to linear and RBF to avoid computationally expensive polynomial kernels. The gamma parameter was searched in the range 0.001 to 1, together with the scale option, to examine the effect of the kernel function at different scales. This kernel selection strategy avoided expensive computations on high-dimensional data, and adjusting gamma and the range of C balanced model expressiveness against efficiency. Similar balanced strategies were applied to hyperparameter optimization for all algorithms. On the one hand, we ensured search coverage by including the important parameters that could affect model performance. On the other hand, we controlled computational cost by reasonably limiting the search space, eliminating invalid combinations, and adopting appropriate CV settings. We used 5-fold CV, a robust evaluation method that divides the dataset into five mutually exclusive subsets. In each iteration, one subset serves as the validation set and the remaining four are used for training. This process is repeated five times so that each subset is used exactly once as the validation set. The model is trained in each iteration and its validation performance is recorded; the average over the five iterations is taken as the final performance metric. This approach maximizes data utilization, effectively assesses the generalizability of the model, and reduces the effect of randomness from a single split. Furthermore, we split the dataset chronologically to verify the generalizability of the model.
Data from 2018 to 2021 were split into training and validation sets at a 7:3 ratio, and data from 2022 to 2023 were used only as the test set.
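A grid search of this kind for LR might look like the following sketch, written against scikit-learn's `GridSearchCV`. The synthetic data, the `roc_auc` scoring choice, `max_iter=5000`, and `n_jobs=2` are assumptions for illustration only; the grid uses the C values given in the text and splits the solvers into sub-grids so that incompatible solver and penalty pairs are never evaluated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Synthetic stand-in data: 300 samples, 6 features, binary labels.
X = rng.normal(size=(300, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

C_VALUES = [0.5, 1, 2, 3, 4]
# Two sub-grids keep only solver/penalty pairs that scikit-learn accepts:
# lbfgs supports L2 only, while liblinear and saga support both L1 and L2.
param_grid = [
    {"penalty": ["l2"], "solver": ["lbfgs"], "C": C_VALUES},
    {"penalty": ["l1", "l2"], "solver": ["liblinear", "saga"], "C": C_VALUES},
]

search = GridSearchCV(
    LogisticRegression(max_iter=5000, class_weight="balanced"),
    param_grid,
    scoring="roc_auc",  # assumed selection metric
    cv=5,               # 5-fold CV (stratified by default for classifiers)
    n_jobs=2,           # evaluate candidates in parallel on two cores
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Restricting the grid up front is one way to realize the trade-off described above: the search still covers every valid combination, while no compute is spent on fits that would fail.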

After the grid search described above, the parameter adjustments for each machine learning method were as follows. For LR, we optimized the C, penalty, and solver parameters; for example, using L2 regularization with smaller C values reduced overfitting. For RF, the min_samples_leaf and n_estimators parameters were fine-tuned. For SVM, we adjusted the C, kernel, and gamma parameters. For XGBoost, we optimized colsample_bytree, gamma, learning_rate, max_depth, n_estimators, and subsample; to avoid overfitting, we reduced max_depth, reduced colsample_bytree, and increased n_estimators. For DNN, we adjusted the activation function, the number of layers, and the number of neurons per layer, and applied L2 regularization to the dense layers to reduce the risk of overfitting. The DNN used in this study has a four-layer structure. The first layer is the input layer, with the number of neurons matching the number of input features. The second layer is a hidden layer with 64 neurons and the ReLU activation function. The third layer is also a hidden layer with 64 neurons and the ReLU activation function. The fourth layer is the output layer, containing a single neuron with a sigmoid activation function for binary classification.
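The four-layer architecture can be illustrated with a plain NumPy forward pass. This is a sketch rather than the TensorFlow model used in the study: the helper names `init_params`, `forward`, and `l2_penalty`, the He-style initialization, and the penalty strength `lam` are hypothetical, and the input width of 52 assumes that all 22 routine blood and 30 biochemical features are fed to the network.

```python
import numpy as np

N_FEATURES = 52  # assumption: 22 routine blood + 30 biochemical features


def init_params(n_in, hidden=(64, 64), seed=0):
    """Create (weights, bias) pairs for input -> 64 -> 64 -> 1."""
    rng = np.random.default_rng(seed)
    sizes = [n_in, *hidden, 1]
    return [
        (rng.normal(0.0, np.sqrt(2.0 / a), (a, b)), np.zeros(b))
        for a, b in zip(sizes[:-1], sizes[1:])
    ]


def forward(params, X):
    """Two ReLU hidden layers followed by a single sigmoid output unit."""
    h = X
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)          # ReLU activation
    W, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(h @ W + b)))   # sigmoid probability


def l2_penalty(params, lam=1e-3):
    """L2 regularization term on the dense-layer weights (biases excluded)."""
    return lam * sum(float((W ** 2).sum()) for W, _ in params)


params = init_params(N_FEATURES)
X_demo = np.random.default_rng(1).normal(size=(4, N_FEATURES))
probs = forward(params, X_demo)                  # shape (4, 1)
n_weights = sum(W.size + b.size for W, b in params)
```

During training, the L2 term would be added to the binary cross-entropy loss, which penalizes large weights and is the mechanism by which it limits overfitting.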

Model Performance Evaluation

The models were trained on the training set and then validated on the validation set. Sensitivity (SN), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), F1 score, Matthews correlation coefficient (MCC), and accuracy (ACC) were used to assess model performance. The equations are shown below^{25,26,27}:

$$\text{SN} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

$$\text{SP} = \frac{\text{TN}}{\text{TN} + \text{FP}}$$

$$\text{PPV} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$

$$\text{NPV} = \frac{\text{TN}}{\text{TN} + \text{FN}}$$

$$\text{ACC} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FN} + \text{TN} + \text{FP}}$$

$$\text{F1 score} = \frac{2\,\text{TP}}{2\,\text{TP} + \text{FN} + \text{FP}}$$

$$\text{MCC} = \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP} + \text{FP})(\text{TP} + \text{FN})(\text{TN} + \text{FP})(\text{TN} + \text{FN})}}$$

TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. In addition, the area under the receiver operating characteristic (ROC) curve (AUC) was used to comprehensively evaluate model performance and to select the best algorithm (Supplementary Fig. 1).
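The metrics defined in this section can be computed directly from the four confusion-matrix counts. The sketch below uses only the Python standard library; the function name `binary_metrics` and the example counts are illustrative and are not results from the study.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Return SN, SP, PPV, NPV, ACC, F1, and MCC from confusion-matrix counts.

    Assumes every denominator except the MCC one is non-zero.
    """
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SN": tp / (tp + fn),
        "SP": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "ACC": (tp + tn) / (tp + fn + tn + fp),
        "F1": 2 * tp / (2 * tp + fn + fp),
        # MCC is conventionally set to 0 when its denominator is 0.
        "MCC": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }


# Illustrative counts only.
m = binary_metrics(tp=80, tn=90, fp=10, fn=20)
```

Unlike accuracy, MCC uses all four counts in both numerator and denominator, which makes it informative when the classes are imbalanced, as they are in this dataset.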

Model Interpretation

Because the black-box nature of machine learning models makes it difficult to explain the contribution of each feature, this study used the SHapley Additive exPlanations (SHAP) algorithm. SHAP assigns a SHAP value to each feature, which quantifies the impact of that feature on the model's prediction^{28}. The SHAP value for each feature was calculated with the SHAP Python package (version 0.44.0).
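To make the idea behind SHAP values concrete, the sketch below computes exact Shapley values by enumerating all feature coalitions, replacing absent features with a single baseline value. This is a brute-force illustration in pure Python, not the API of the SHAP package (which relies on faster, model-specific algorithms); the function `shapley_values`, the toy linear model, and the baseline are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x` relative to `baseline`.

    Features outside a coalition are set to their baseline value.
    The cost grows as 2**n, so this is only feasible for a few features.
    """
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


COEFS = [2.0, -1.0, 0.5]


def predict(z):
    """Toy linear model standing in for a trained classifier's score."""
    return sum(w * v for w, v in zip(COEFS, z)) + 0.1


x = [1.0, 3.0, -2.0]
baseline = [0.0, 1.0, 0.0]
phi = shapley_values(predict, x, baseline)
```

For a linear model each value reduces to the coefficient times the feature's deviation from the baseline, and the values always sum to the difference between the prediction at `x` and the prediction at the baseline.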
