Figure 1 shows the complete workflow for this study. First, the de-identified data were preprocessed, including missing-value handling and data encoding to convert categorical variables to numbers. A feature selection approach was then used to extract the most important features from the dataset. The aggregated dataset was split into 70% for training, 20% for testing, and the remaining 10% for model validation. Four models were trained, tested, and validated using the gradient-boosted tree ensemble machine learning technique, followed by a voting classifier. A confusion matrix was used to assess the effectiveness of each model. Finally, we compared the models to determine the optimal one.

Fig. 1. Flow diagram of the supervised machine learning modeling approach, as described.
Dataset Description
This retrospective study utilized de-identified electronic health record (EHR) data from three community cancer care centers in Appalachia: (1) St. Elizabeth Healthcare, Edgewood, Kentucky; (2) Pikeville Medical Center, Pikeville, Kentucky; and (3) Thompson Cancer Survival Center (TCSC), Knoxville, Tennessee. Combined, the dataset included 25 attributes from 7,718 adults aged 18 and older across the three cancer care centers. Only patients aged 18 or older diagnosed with malignant colon and/or rectal cell carcinoma between 2000 and 2017 were included in this study; patients under the age of 18 were excluded. All datasets include demographic, clinical, and SDOH patient features. Data from St. Elizabeth Healthcare and Pikeville Medical Center were extracted from local EHR systems. TCSC data collected between 2000 and 2009 were captured by research staff from paper records and transcribed into .CSV files; TCSC data for 2010–2017 were extracted from the local EHR system. All data were de-identified by the individual healthcare sites prior to delivery, shared via secure data transfer, and stored in password-protected, cloud-based storage. The use of de-identified retrospective data is classified as non-human-subjects research and does not require IRB approval, and the need for informed consent was deemed unnecessary in accordance with the HIPAA Privacy Rule. To ensure ethical conduct, this study was conducted under approval by the Western-Copernicus Group Institutional Review Board (WCG IRB®), protocol number 20223670, consistent with 45 CFR 46.102. All methods were implemented according to relevant guidelines and regulations.
CRC 5-year survival and death indicators
The EHR data included a “months of survival” variable reporting the number of months a patient survived after the initial diagnosis of colon and/or rectal cancer. The gold-standard 5-year cancer survival indicator was applied in this study: ML models were developed to classify patients as surviving fewer than 60 months or 60 or more months (5 years) after CRC diagnosis28. The ML prediction model is a binary classifier, with the “negative” class representing patients with fewer than 60 months of survival after CRC diagnosis and the “positive” class representing patients with 60 or more months of survival in the EHR.
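The binary outcome described above can be derived directly from the "months of survival" variable with a 60-month threshold; a minimal sketch (the function name and example values are illustrative, not from the study):

```python
# Derive the binary 5-year survival label from months of survival.
# 1 = positive class (60+ months), 0 = negative class (< 60 months).
def five_year_label(survival_months: float) -> int:
    return int(survival_months >= 60)

labels = [five_year_label(m) for m in (12, 59, 60, 84)]
print(labels)  # [0, 0, 1, 1]
```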
Data preprocessing and feature selection
The dataset was preprocessed before being fed into the ML models to address missing data, data entry errors, outliers, variable labeling, and encoding. Data preprocessing involved converting all categorical variables into numerical values, merging the three datasets from the individual healthcare sites into one aggregated dataset, and data cleaning to address typographic errors and missing data. Listwise deletion was used to handle missing data.
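A minimal sketch of the two preprocessing steps above (listwise deletion and categorical-to-numeric encoding) using pandas; the column names here are illustrative placeholders, not the study's actual EHR field names:

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Listwise deletion: drop any record with a missing value.
    df = df.dropna()
    # Encode each categorical (string) variable as numeric codes.
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype("category").cat.codes
    return df

raw = pd.DataFrame({
    "gender": ["F", "M", None, "F"],
    "insurance_status": ["private", "medicare", "private", None],
    "survival_months": [72, 48, 60, 55],
})
clean = preprocess(raw)
print(clean.shape)  # rows with any missing value are removed
```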
Feature selection was employed to eliminate redundancy, identify the most important EHR data variables, and improve the predictive accuracy of the ML model. The extreme gradient boosting (XGBoost) ML algorithm, complemented by SHAP plots, was used to determine the most important features in this study. The dataset had 25 features. Raw data collected from EHRs included a variety of patient information, including demographic and clinical information related to the CRC diagnosis, as well as a selection of SDOH variables. The main target variable for this study was months of survival. Demographic features were age at diagnosis, gender, race, ethnicity, place of residence, county, and three-digit ZIP code. SDOH features available for analysis included marital status, geographical classification (rural/non-rural), employment status, and insurance status. All available features and corresponding model inclusions are shown in Table 1.
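A hedged sketch of ranking features by importance with a gradient-boosted tree model, as described above. The study uses XGBoost with SHAP plots; to keep this example dependency-light, scikit-learn's `GradientBoostingClassifier` and its built-in impurity importances stand in, and the data are synthetic:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Outcome depends strongly on feature 0, weakly on feature 1, not on feature 2.
y = (2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)
# Rank features from most to least important.
ranking = np.argsort(model.feature_importances_)[::-1]
print(ranking)  # most informative feature first
```

SHAP would additionally show the direction and per-patient magnitude of each feature's contribution, which impurity importances do not.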
Marital status data were missing in the TCSC data and were handled by imputation. Research suggests that a two-stage method (a combination of imputation and classifiers)29,30,31 provides an optimal configuration for a particular dataset. IterativeImputer32 is a Python-based multivariate statistical method for handling missing data by imputation, particularly effective when there is a fairly strong correlation between features. Linear regression with 300 estimators and a standard tolerance of ~2.5% was used; convergence required fewer than 20 iterations. There were no significant differences in model accuracy between iterative imputation and arbitrary (fixed-value) imputation of missing data. A weak correlation between features with known and missing values (e.g., TCSC did not collect patient marital status) is a potential reason for this observation. A comparison table showing model accuracy by imputation method is included as supplementary material.
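A minimal sketch of multivariate imputation with scikit-learn's `IterativeImputer`, the tool named above. The data are synthetic, with two strongly correlated features so the imputer can recover the missing entries; the estimator and tolerance settings of the actual study may differ:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
# Feature 2 is almost a linear function of feature 1 (strong correlation).
X = np.hstack([base, 2 * base + rng.normal(scale=0.1, size=(100, 1))])
X_missing = X.copy()
X_missing[::10, 1] = np.nan  # knock out every 10th value of feature 2

imputer = IterativeImputer(max_iter=20, random_state=0)
X_filled = imputer.fit_transform(X_missing)

# The strong correlation lets the imputer recover the missing entries closely.
err = np.abs(X_filled[::10, 1] - X[::10, 1]).mean()
print(round(err, 3))
```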
Training and validation
The dataset was randomly divided into training, test, and validation partitions: approximately 70% for training and 20% for testing, with the remaining 10% reserved for model validation. The ML prediction model was trained using the input features described above. In this study, we used XGBoost with hyperparameter optimization. XGBoost is a gradient-boosted tree ensemble ML method that combines estimates from simpler, weaker models to predict the selected target. The main principle of gradient boosting is to minimize the errors of previous models and iteratively improve prediction performance; specifically, a gradient boosting classifier is used for classification tasks33,34. Recent studies have demonstrated that gradient-boosted tree algorithms provide high accuracy for both acute and chronic prediction tasks, and that XGBoost outperforms other ML models on tabular data35. One benefit of XGBoost is that it implicitly handles a certain level of missing data by accounting for it during the training process36. Before training the ML prediction model, hyperparameter optimization was performed on the training dataset via a 5-fold cross-validation grid search, evaluating the area under the receiver operating characteristic (AUROC) curve for each combination of hyperparameters in the grid. The optimized parameters were the maximum tree depth, regularization parameters, minimum sum of instance weights, fraction of observations subsampled at each step, fraction of features used to construct each tree, number of nodes, and tree depth levels.
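The 70:20:10 split and the 5-fold AUROC grid search above can be sketched with scikit-learn utilities. The data are synthetic and `GradientBoostingClassifier` stands in for XGBoost to keep the example dependency-light; the parameter grid is illustrative, not the study's actual grid:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# First carve off 70% for training, then split the remaining 30% into
# 20% test and 10% validation of the original data.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=1/3, random_state=0)

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"max_depth": [2, 3], "n_estimators": [50, 100]},
    scoring="roc_auc",  # AUROC, as in the study
    cv=5,               # 5-fold cross-validation
).fit(X_train, y_train)
print(grid.best_params_)
```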
Performance Metrics for ML Models
AUROC was used as the key metric to assess and compare the overall performance of the ML prediction models37. Other metrics used to assess model performance in predicting 5-year survival after diagnosis were calculated from true positive (TP), false positive (FP), true negative (TN), and false negative (FN) values as follows38,39. Sensitivity indicates the ability of the model to accurately identify patients who survive CRC beyond 5 years.
$$Sensitivity\;(true\;positive\;rate,\;TPR)=\frac{TP}{TP+FN}\times 100$$
Specificity assesses the ability of the model to accurately identify patients who do not survive.
$$Specificity\;(true\;negative\;rate,\;TNR)=\frac{TN}{TN+FP}\times 100$$
Precision, or positive predictive value, measures the proportion of predicted positive cases (CRC survivors) that the model predicted correctly.
$$Positive\;Predictive\;Value\;(precision,\;PPV)=\frac{TP}{TP+FP}\times 100$$
Conversely, negative predictive value measures the proportion of predicted negative cases (CRC deaths) that the model predicted correctly.
$$Negative\;Predictive\;Value\;(NPV)=\frac{TN}{TN+FN}\times 100$$
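The four formulas above computed directly from confusion-matrix counts; the counts here are illustrative, not the study's actual results:

```python
# Performance metrics from confusion-matrix counts, each as a percentage.
def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    return {
        "sensitivity": tp / (tp + fn) * 100,  # true positive rate
        "specificity": tn / (tn + fp) * 100,  # true negative rate
        "ppv": tp / (tp + fp) * 100,          # precision
        "npv": tn / (tn + fn) * 100,
    }

m = metrics(tp=80, fp=20, tn=70, fn=30)
print(m)  # ppv = 80.0, npv = 70.0
```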
Additionally, 95% confidence intervals (CIs) were calculated for each metric as explained in the statistics section below. SHapley Additive exPlanations (SHAP) analysis was performed to assess the importance of each feature to the model output40. SHAP analysis ranks features by their importance for model predictions, from top to bottom.
The effect of SDOH features on ML model performance
We used a merged dataset consisting of EHR data from St. Elizabeth Healthcare, Pikeville Medical Center, and TCSC (Table 2). Because cause-of-death data were not available in one of the hospitals' EHRs, the ML prediction model classified patients as surviving fewer than 5 years or 5 years or more, regardless of whether death was CRC-related or due to other causes. To test the hypothesis, we developed four ML models with the following feature inputs: Model 1: demographic and clinical features. Model 2: demographic and SDOH features. Model 3: demographic, clinical, and SDOH features. Model 4: clinical and SDOH features (Table 1). The dataset was randomly split into training, holdout test, and validation datasets using a 70:20:10 split, and the holdout test dataset was not exposed to the model during training. To assess the fairness of the algorithm, XGBoost was compared with two other methods: logistic regression and k-nearest neighbors. XGBoost provided better accuracy than both logistic regression and k-nearest neighbors (k = 5). A comparison table showing the accuracy of the predictive models for each of the four data models with each technique is included as supplementary material.
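A hedged sketch of the model comparison described above: the same train/test data scored with a gradient-boosted tree model (standing in for XGBoost), logistic regression, and k-nearest neighbors with k = 5. The data are synthetic, so the scores do not reflect the study's results:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "gbt": GradientBoostingClassifier(random_state=0),
    "logreg": LogisticRegression(),
    "knn": KNeighborsClassifier(n_neighbors=5),  # k = 5, as in the study
}
# Fit each model and record its holdout accuracy.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
print(scores)
```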
Statistical analysis
The confidence interval (CI) for AUROC was calculated using a bootstrap method in which patients in the holdout test dataset were randomly sampled with replacement, AUROC was calculated from those patients' data, and the process was repeated for 1,000 iterations41. From these bootstrap AUROC values, the middle 95% range was taken as the 95% CI of AUROC. Because the sample size of the holdout test dataset was sufficiently large, CIs for the other metrics were calculated using normal approximations42. Demographic and clinical features have traditionally been used to predict death/survival, so Model 1 (demographic + clinical features) was considered the control ML model. Differences in metrics between Models 2, 3, and 4 and Model 1 were determined using a two-tailed t-test at a 95% significance level (Fig. 2).
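The bootstrap procedure above can be sketched as follows: resample the holdout predictions with replacement 1,000 times, recompute AUROC each time, and take the middle 95% of the bootstrap distribution as the CI. Synthetic scores and labels stand in for the real holdout set:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=300)
# Scores correlated with the labels, so AUROC is well above chance.
y_score = y_true + rng.normal(scale=0.8, size=300)

boot = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), size=len(y_true))  # resample with replacement
    if len(np.unique(y_true[idx])) < 2:
        continue  # AUROC needs both classes in the resample
    boot.append(roc_auc_score(y_true[idx], y_score[idx]))

# Middle 95% of the bootstrap distribution = 95% CI.
lo, hi = np.percentile(boot, [2.5, 97.5])
print(round(lo, 3), round(hi, 3))
```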

Fig. 2. The effect of SDOH features on AUROC in predicting CRC survival in Appalachia. Machine learning performance was highest for Model 3 (0.790), and all models performed better than baseline. Model 1 – demographic + clinical features; Model 2 – demographic + SDOH features; Model 3 – demographic + clinical + SDOH features; Model 4 – clinical + SDOH features.
