How machine learning of real-world clinical data improves endoscopic adverse event records

Ethical approval was obtained from Ethics Committee II of the Mannheim School of Medicine at Heidelberg University (approval number 2021-694). Under the applicable regulatory framework, the processing of pseudonymized data in retrospective studies does not require individual patient consent.

Labeling Process

Labels for the supervised machine learning algorithms were obtained from written reports such as endoscopy reports, discharge notes, and readmission notes. Labels reviewed and assigned by experts are called “manually generated labels,” and labels generated by a large language model (LLM) are called “LLM-generated labels.” Manually generated labels were considered ground truth and were assumed to be accurate. Specifically, ground truth for the adverse events bleeding and perforation was established using endoscopy reports and discharge notes, whereas adverse events associated with readmission were identified from readmission notes, read together with the earlier reports to determine whether the readmission was related to the previous endoscopic mucosal resection (EMR). To obtain reliable performance metrics, testing was performed only on cases for which manual labels were available.

For readmissions, all 213 cases were manually labeled, providing a complete set of high-quality ground truth data. For the adverse events bleeding and perforation, however, only 500 of the 2490 cases were manually labeled, and these were used as the test data set. The remaining cases, used as training data, were labeled using the large language model Llama-2 70b in a version fine-tuned for German (Llama-2 70b “Sauerkraut”), as described previously¹⁵. The referenced paper provides implementation details. A simple prompt for extracting adverse events was found to work well. Ground truth was established using the definitions provided in the supplementary material.

It is important to note that labeling with a large language model can introduce slight inaccuracies, because the model-generated labels may be noisy. See Supplementary Figure 9 for an analysis of the quality of these LLM-generated labels.

Data Preprocessing and Feature Engineering

Data processing was performed using pandas²³. The data were delivered as multiple Excel files, which were combined using pandas. Unstructured data such as written text were removed from the dataset. One-hot encoding was used for categorical data. For materials used during endoscopy, the quantity of each material was also encoded. For example, if three clips of type “Hemostatic Clip 235 cm” were used, this was encoded as 3 in the “Hemostatic Clip 235” category.
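The encoding described above can be sketched with pandas as follows. This is a minimal illustration, not the authors' pipeline: the table layout, the column names (`case_id`, `sedation`, `material`) and all values are hypothetical.

```python
import pandas as pd

# Hypothetical toy records; column names and values are illustrative only.
cases = pd.DataFrame({
    "case_id": [1, 2, 3],
    "sedation": ["propofol", "midazolam", "propofol"],
})

# One row per material use; the same clip type appears three times for case 1.
materials = pd.DataFrame({
    "case_id": [1, 1, 1, 2],
    "material": ["Hemostatic Clip 235", "Hemostatic Clip 235",
                 "Hemostatic Clip 235", "Snare"],
})

# One-hot encode the categorical column.
one_hot = pd.get_dummies(cases, columns=["sedation"], dtype=int)

# Count how often each material was used per case.
counts = pd.crosstab(materials["case_id"], materials["material"])

# Join both feature blocks; cases without any material get a count of 0.
features = (
    one_hot.set_index("case_id")
    .join(counts)
    .fillna(0)
    .astype(int)
)
print(features)
```

Each material becomes one column whose value is the number of units used, so the three clips of case 1 appear as the value 3 rather than as a binary flag.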

Median imputation was applied where necessary: if a value was missing for a particular case, it was replaced by the median of that feature. If no readmission occurred, the feature “time to readmission” was set to 1000 days. In 7 cases, patients received two endoscopic interventions, both including EMR, during a single hospital stay. In these cases, the DRG codes, OPS codes, and materials used were combined. The date of the first intervention was used to calculate the feature “time from procedure to readmission,” whereas the maximum value was used for the feature “procedure time.” In addition, the occurrence of multiple interventions during a single hospital stay was captured by the feature “number of procedures.”
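A minimal pandas sketch of these rules is shown below. The table layout, the column names (`stay_id`, `procedure_minutes`, `days_to_readmission`, and so on) and the values are assumptions made for illustration; only the rules themselves (first date, maximum duration, procedure count, 1000-day placeholder, median imputation) come from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per intervention; names are illustrative.
interventions = pd.DataFrame({
    "stay_id": [10, 10, 11, 12],
    "procedure_date": pd.to_datetime(
        ["2021-03-01", "2021-03-03", "2021-04-10", "2021-05-05"]),
    "procedure_minutes": [30.0, 45.0, np.nan, 20.0],
    "readmission_date": pd.to_datetime(
        ["2021-03-20", "2021-03-20", None, "2021-05-15"]),
})

# Combine several interventions within one hospital stay into one case:
# first date, longest procedure time, and the number of procedures.
cases = interventions.groupby("stay_id").agg(
    first_procedure=("procedure_date", "min"),
    procedure_minutes=("procedure_minutes", "max"),
    number_of_procedures=("procedure_date", "count"),
    readmission_date=("readmission_date", "first"),
)

# Days from the first intervention to readmission; 1000 if no readmission.
days = (cases["readmission_date"] - cases["first_procedure"]).dt.days
cases["days_to_readmission"] = days.fillna(1000)

# Median imputation for the remaining missing numeric values.
cases["procedure_minutes"] = cases["procedure_minutes"].fillna(
    cases["procedure_minutes"].median())
print(cases)
```

Using a large placeholder such as 1000 days keeps the feature numeric while clearly separating cases without readmission from those readmitted early.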

Machine Learning Algorithms

A random forest classifier implemented in scikit-learn²⁴ was trained for classification. Random forest classifiers are expected to be comparatively robust to overfitting, which is a concern given the small and potentially noisy dataset. Other machine learning algorithms were also tested but showed no significant performance improvement. In particular, two gradient-boosted decision tree algorithms (LightGBM²⁵ and CatBoost²⁶) and one deep learning algorithm optimized for tabular data (TabNet²⁷) were applied to the imputed data to enable a performance comparison.

Feature selection was carried out using backward feature elimination, an iterative method that starts with all available features and systematically removes the least important ones. At each step, the features that contribute least to the model, as determined by the selection criterion (here, impurity-based feature importance), are excluded. This process continues until the desired number of features is reached. Prior to backward feature selection, a total of 4547 features were available for perforation and bleeding, and 493 features were available for readmission.
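The elimination loop can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `backward_elimination`, the `step` parameter, the small forest size and the synthetic data are all chosen here for demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def backward_elimination(X, y, n_keep, step=1, seed=0):
    """Drop the least important features until n_keep remain."""
    kept = np.arange(X.shape[1])  # indices of features still in play
    while len(kept) > n_keep:
        forest = RandomForestClassifier(n_estimators=50, random_state=seed)
        forest.fit(X[:, kept], y)
        # Impurity-based importances, sorted from least to most important.
        order = np.argsort(forest.feature_importances_)
        n_drop = min(step, len(kept) - n_keep)
        # Remove the n_drop least important features and refit.
        kept = np.sort(kept[order[n_drop:]])
    return kept


# Synthetic stand-in data (assumption): 12 features, 3 of them informative.
X, y = make_classification(n_samples=200, n_features=12, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
selected = backward_elimination(X, y, n_keep=4, step=2)
print(selected)
```

scikit-learn's `RFE` class implements the same idea; the explicit loop is used here only to make each elimination step visible.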

Hyperparameter tuning and class imbalance

Hyperparameter tuning is the process of maximizing model performance by selecting the best combination of model hyperparameters. This is usually achieved by dividing the training data into smaller subsets, such as a training subset and a validation subset. The training subset is used to fit the model, and the validation subset is used to evaluate the performance of each hyperparameter choice. The space of possible hyperparameter combinations was then searched systematically using grid search to identify the best configuration (other search algorithms, such as Bayesian optimization, are potential alternatives).

For the adverse events perforation and bleeding, the number of features was treated as a hyperparameter, ranging from 50 to 500 features in increments of 50. For the adverse event readmission, the number of features was set to 100. Hyperparameter tuning of the internal random forest parameters had no significant, or even a negative, impact on performance. As a result, all results were obtained with one consistent setting: the number of estimators of the random forest (n_estimators) was set to 1000, and all other parameters were left at their scikit-learn default values. To address class imbalance, balanced class weights, synthetic data augmentation using SMOTE, and balanced random forests were tested, but none of them improved performance.
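A grid search over the number of features might look like the sketch below. Several details are assumptions made for illustration: the function name `tune_n_features`, the 70/30 split, the use of average precision as the validation score, and the shortcut of ranking features once by importance instead of re-running the elimination for every grid value.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split


def tune_n_features(X, y, grid, n_estimators=1000, seed=0):
    """Return (best_k, scores) from a grid search over the feature count."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    # Simplification: rank features once on the training subset.
    ranker = RandomForestClassifier(n_estimators=n_estimators,
                                    random_state=seed).fit(X_tr, y_tr)
    ranking = np.argsort(ranker.feature_importances_)[::-1]
    scores = {}
    for k in grid:
        cols = ranking[:k]  # the k most important features
        model = RandomForestClassifier(n_estimators=n_estimators,
                                       random_state=seed)
        model.fit(X_tr[:, cols], y_tr)
        proba = model.predict_proba(X_val[:, cols])[:, 1]
        scores[k] = average_precision_score(y_val, proba)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# The grid described in the text: 50, 100, ..., 500 features.
paper_grid = list(range(50, 501, 50))

# Toy run on synthetic data with a smaller grid and fewer trees for speed.
X, y = make_classification(n_samples=300, n_features=40, n_informative=5,
                           weights=[0.85], random_state=0)
best_k, scores = tune_n_features(X, y, grid=[5, 10, 20, 40], n_estimators=50)
print(best_k, scores)
```

The default of `n_estimators=1000` mirrors the setting named in the text; the toy call lowers it only to keep the example fast.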

Feature importance and stability

The most important features were identified using SHAP¹⁹. To assess the stability of the algorithm, random subsampling²⁸ was performed 100 times. In each iteration, the data were randomly divided into a training set and a test set, and the machine learning algorithm was trained on the training set and evaluated on the test set. For the adverse event readmission, all 213 cases had manual labels, allowing random subsampling to be applied directly to the complete dataset. For the adverse events perforation and bleeding, however, manual labels were available for only 500 cases (defined as ground truth), so random subsampling could not be applied directly. In these cases, the stability of the algorithm was assessed using the LLM-generated labels across the entire dataset, which should be regarded as an approximation.
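The repeated random subsampling can be sketched as below. The function name `subsampling_stability`, the stratified 70/30 split, the synthetic data and the use of average precision as the AUC-PR estimate are assumptions for illustration; the text specifies only that the split was random and repeated 100 times.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit


def subsampling_stability(X, y, n_repeats=100, n_estimators=1000, seed=0):
    """Repeat random train/test splits and summarize the test metrics."""
    # Stratification (an assumption) keeps rare events in every test set.
    splitter = StratifiedShuffleSplit(n_splits=n_repeats, test_size=0.3,
                                      random_state=seed)
    auc_pr, auc_roc = [], []
    for train_idx, test_idx in splitter.split(X, y):
        model = RandomForestClassifier(n_estimators=n_estimators,
                                       random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        proba = model.predict_proba(X[test_idx])[:, 1]
        auc_pr.append(average_precision_score(y[test_idx], proba))
        auc_roc.append(roc_auc_score(y[test_idx], proba))
    # Mean as the point estimate, standard deviation as the stability measure.
    return {
        "auc_pr": (float(np.mean(auc_pr)), float(np.std(auc_pr))),
        "auc_roc": (float(np.mean(auc_roc)), float(np.std(auc_roc))),
        "n": len(auc_pr),
    }


# Toy run on synthetic data with fewer repeats and trees to keep it fast.
X, y = make_classification(n_samples=200, n_features=20, n_informative=4,
                           weights=[0.85], random_state=0)
result = subsampling_stability(X, y, n_repeats=10, n_estimators=25)
print(result)
```

A small standard deviation across repeats indicates that the reported performance does not depend strongly on which cases happen to fall into the test set.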

Evaluation Metrics

The area under the receiver operating characteristic curve (AUC-ROC) and the area under the precision-recall curve (AUC-PR) were used as target metrics. In particular, we focused on the precision-recall curve because of the imbalance in the dataset. Given the disproportionate distribution of adverse events and the need to identify them accurately while minimizing false positives, AUC-PR is considered the most relevant performance metric²⁹. In contrast, AUC-ROC may provide overly optimistic estimates when applied to highly imbalanced data sets¹⁷.

AUC-PR is evaluated against its baseline, which is defined by the performance of a dummy classifier (i.e., one that makes random predictions). For such a dummy classifier, the AUC-PR corresponds to the incidence of adverse events in the test data set. For example, if 20 adverse events occur in 100 cases, the AUC-PR of the dummy classifier is 20/100 = 0.2. In contrast, the AUC-ROC of the dummy classifier is always 0.5.
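The baseline values in this example can be checked numerically. The sketch below uses a constant score for every case as a stand-in for an uninformative classifier (an assumption; a classifier with random scores matches these values only in expectation) and scikit-learn's average precision as the AUC-PR estimate.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# 20 adverse events among 100 cases, as in the example above.
y_true = np.array([1] * 20 + [0] * 80)

# An uninformative classifier: the same score for every case.
constant_scores = np.full(y_true.shape, 0.5)

prevalence = y_true.mean()
baseline_auc_pr = average_precision_score(y_true, constant_scores)
baseline_auc_roc = roc_auc_score(y_true, constant_scores)

print(prevalence, baseline_auc_pr, baseline_auc_roc)
```

Because the AUC-PR baseline moves with the event rate, an AUC-PR value is only interpretable alongside the incidence in the test set, whereas the AUC-ROC baseline stays fixed.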

This baseline serves as a meaningful reference point for assessing how effectively the model detects rare adverse events. The error of the area-under-the-curve metrics (defined as the standard deviation) serves as an estimate of the stability of the machine learning algorithm.
