A deep learning model for predicting survival of hepatocellular carcinoma patients based on the Surveillance, Epidemiology, and End Results (SEER) database analysis

Data Description

In this study, 35,444 HCC patients were screened from the SEER database between 2010 and 2015, and 2197 patients met the inclusion criteria. Table 1 shows the main baseline clinical characteristics of the patients (eTable 1 in the Supplement). Of the 2197 participants, 70% (n = 1548) were aged 66 years or younger, 23% (n = 505) were aged 66 to 77 years, and 6.6% (n = 144) were aged 77 years or older. Male participants accounted for 78% (n = 1915) and female participants for 22% (n = 550). In terms of race, the majority of participants were White (66%, n = 1455), followed by Asian or Pacific Islander (22%, n = 478), Black (10%, n = 228), and Native American/Alaska Native (1.6%, n = 36). Regarding marital status, 60% (n = 1319) were married and the remaining 40% (n = 878) had another marital status. In terms of histology, most participants (98%, n = 2154) had type 8170. By tumor grade, 50% (n = 1104) of the patients had grade II differentiation, 18% (n = 402) had grade III, 1.0% (n = 22) had grade IV, and 30% (n = 669) had grade I. Regarding tumor staging, 48% (n = 1054) of the participants had stage I disease, 29% (n = 642) had stage II, 16% (n = 344) had stage III, and 7.1% (n = 157) had stage IV. According to the TNM classification, 49% (n = 1079) were T1, 31% (n = 677) were T2, 96% (n = 2114) were N0, and 95% (n = 2090) were M0. AFP was positive/high in 66% (n = 1444) of participants, and 70% (n = 1532) had a high level of liver fibrosis. A single tumor was present in 92% (n = 2012), and the remaining 8.4% (n = 185) had multiple tumors. In terms of surgery, 32% (n = 704) underwent lobectomy, 14% (n = 311) underwent local tumor destruction, 20% (n = 429) underwent wedge or segmental resection, and 34% (n = 753) did not undergo surgery. Finally, 2.1% (n = 46) received radiotherapy, 38% (n = 855) received chemotherapy, and 62% (n = 1352) did not receive chemotherapy. The median overall survival (OS) of participants was 45 ± 34 months, with 1327 (60%) still alive at the end of follow-up.

Table 1. Univariate and multivariate Cox regression analyses of main characteristics.

Feature Selection

Univariate Cox regression analysis identified several factors that were significantly associated with survival in patients with hepatocellular carcinoma (p < 0.05): age, race, marital status, histological type, tumor grade, tumor stage, T stage, N stage, M stage, alpha-fetoprotein level, tumor size, type of surgery, and chemotherapy status. In multivariate Cox regression analysis, however, only age, marital status, histological type, tumor grade, tumor stage, and tumor size remained independent factors affecting patient survival (p < 0.05) (Table 1). Collinearity analysis also showed strong collinearity between the overall tumor stage (STAGE) and the individual T, N, and M stages (Figure 1), which is expected because the overall stage is determined directly from the TNM assessment. Such collinearity means that these variables must be handled with care during modeling to avoid overfitting and poor predictive performance. Although the multivariate analysis did not identify some variables as independent predictors, we still incorporated them into our deep learning model for two reasons. First, these variables may capture subtle interactions and nonlinear relationships that are not evident in traditional regression models but can be learned by more flexible techniques such as deep learning. Second, a broader set of variables may improve the generalization and robustness of the model across diverse clinical scenarios and better explain the variation across patient subgroups and treatment conditions.
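The collinearity check described above can be illustrated with a minimal sketch. It assumes ordinally encoded variables and uses the Pearson correlation with a cutoff of 0.8; the measure, the cutoff, the column names, and the toy values are all assumptions for illustration and are not taken from the study data or from Figure 1.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def collinear_pairs(columns, threshold=0.8):
    """Return the variable pairs whose absolute correlation reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]

# Toy, made-up values: overall stage tracks T stage closely, age does not.
toy = {
    "stage": [1, 2, 2, 3, 4, 4],
    "t_stage": [1, 2, 2, 3, 3, 4],
    "age": [70, 52, 66, 58, 61, 55],
}
flagged = collinear_pairs(toy)
```

In this toy example only the stage and T stage pair is flagged, which mirrors the reasoning for dropping the overall stage when T, N, and M are already included.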
Based on this analysis, we ultimately selected 12 key factors (age, race, marital status, histological type, tumor grade, T stage, N stage, M stage, alpha-fetoprotein, tumor size, type of surgery, and chemotherapy) to include in the construction of our predictive model. We divided the dataset into two subsets: a training set containing 1537 samples and a test set containing 660 samples (Table 2). By training and testing models on these data, we aim to develop models that can accurately predict survival rates in patients with HCC, aid in clinical decision-making, and improve patient outcomes.
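The split sizes above correspond to roughly 70% training and 30% test data. The following sketch shows one way such a split could be produced; the random shuffle and the seed are assumptions, since the text does not state how patients were assigned to the two subsets.

```python
import random

def split_indices(n_samples, n_train, seed=0):
    """Shuffle patient indices and split them into training and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary, assumed value
    return indices[:n_train], indices[n_train:]

# 2197 eligible patients, 1537 of them assigned to training (sizes from the text).
train_idx, test_idx = split_indices(2197, 1537)
```

A stratified split (for example by event status) would be a reasonable alternative when the outcome is imbalanced.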

Table 2. Main characteristic distributions of data in the training and test sets.

Hyperparameter optimization and model comparison results

First, we performed 5-fold cross-validation on the training set with a random search repeated 1000 times. From all of these runs, we selected the parameter set with the highest mean concordance index (C-index) as the optimal parameters. Figure 2 shows the loss curves of the two deep learning models (NMTLR and DeepSurv) during training.
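The search procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the hyperparameter names and candidate values are hypothetical (the text does not list them), and `toy_evaluate` is a placeholder for fitting a survival model on the training folds and returning its C-index on the validation fold.

```python
import random

def make_folds(n_samples, k, rng):
    """Shuffle sample indices and deal them into k roughly equal folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def random_search(param_space, evaluate, n_samples, n_iter=1000, k=5, seed=0):
    """Return the sampled parameter set with the highest mean validation score."""
    rng = random.Random(seed)
    folds = make_folds(n_samples, k, rng)
    # Precompute (train, validation) index pairs so every candidate sees the same folds.
    splits = []
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train_idx, folds[i]))
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        scores = [evaluate(params, tr, va) for tr, va in splits]
        mean_score = sum(scores) / k
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

def toy_evaluate(params, train_idx, val_idx):
    # Placeholder for "fit on train_idx, return the C-index on val_idx".
    # It only rewards dropout values close to 0.2 so that the example runs.
    return 0.7 - abs(params["dropout"] - 0.2)

# Hypothetical search space; not the one used in the study.
space = {"learning_rate": [1e-4, 1e-3, 1e-2], "dropout": [0.0, 0.2, 0.5]}
best_params, best_score = random_search(space, toy_evaluate, n_samples=1537)
```

Reusing the same folds for every candidate keeps the comparison between parameter sets fair, because differences in the mean score then come from the parameters rather than from the partitioning.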

Table 3 shows the performance of each model on the test set, comparing the machine learning models with the standard Cox proportional hazards (CoxPH) model. The log-rank test was used to compare the concordance index (C-index) between the models. The three machine learning models (DeepSurv, NMTLR, and random survival forest [RSF]) showed significantly better discrimination than the standard CoxPH model (p < 0.01), as shown in Table 4. Specifically, the C-index was 0.7317 for DeepSurv, 0.7353 for NMTLR, and 0.7336 for RSF, compared with only 0.6837 for the standard CoxPH model. Among the three machine learning models, NMTLR had the highest C-index, indicating the best discrimination. The integrated Brier scores (IBS) of the four models were 0.1598 (NMTLR), 0.1632 (DeepSurv), 0.1648 (RSF), and 0.1789 (CoxPH) (Fig. 3). The NMTLR model had the lowest IBS, indicating the lowest overall prediction error. Furthermore, there was no significant difference between the C-indexes obtained on the training and test sets, suggesting that the NMTLR model generalizes well to complex real-world data and is not overfitted.
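For readers unfamiliar with the C-index, the sketch below shows how Harrell's concordance index can be computed from follow-up times, event indicators, and predicted risk scores. It is a plain illustration of the metric on made-up toy values, not the implementation used in the study, and it ignores ties in survival time for simplicity.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index: share of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when patient i had an observed event and
    patient j was followed for longer than patient i.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored patients cannot anchor a comparable pair
        for j in range(n):
            if times[j] > times[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # higher risk for the earlier event
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy values: follow-up in months, 1 = death observed, 0 = censored.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
perfect = concordance_index(times, events, [0.9, 0.7, 0.4, 0.1])
reversed_order = concordance_index(times, events, [0.1, 0.4, 0.7, 0.9])
uninformative = concordance_index(times, events, [0.5, 0.5, 0.5, 0.5])
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to uninformative predictions, and values such as the 0.68 to 0.74 reported here lie between the two.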

Table 3. Performance of the four survival models.

Table 4. Comparative analysis of the discrimination ability (C-index) of CoxPH and machine learning models (DeepSurv, NMTLR, RSF).

The calibration plot (Figure 4) showed that the NMTLR model had the closest agreement between predicted and observed 1-, 3-, and 5-year overall survival, followed by the DeepSurv, RSF, and CoxPH models. The same pattern was reflected in the AUC values, where the NMTLR and DeepSurv models had higher AUC values than the RSF and CoxPH models for predicting 1-, 3-, and 5-year survival. Specifically, the 1-year AUC values were 0.803 for NMTLR and 0.794 for DeepSurv, compared with 0.786 for RSF and 0.766 for CoxPH. The 3-year AUC values were 0.808 for NMTLR and 0.809 for DeepSurv, compared with 0.797 for RSF and 0.772 for CoxPH. The 5-year AUC value was 0.819 for both DeepSurv and NMTLR, compared with 0.812 for RSF and 0.772 for CoxPH. These results indicate that the deep learning models (DeepSurv and NMTLR) predict the survival prognosis of patients with HCC more accurately than the RSF and traditional CoxPH models. Overall, the NMTLR model showed the best performance across multiple evaluation criteria.

Model feature importance

To assess feature importance in the models, the values of one feature at a time are replaced with random data and the resulting reduction in the concordance index (C-index) is measured. A larger reduction indicates that the feature is more important for maintaining the model's predictive accuracy. Figure 5 shows the feature importance heatmaps for the DeepSurv, NMTLR, and RSF models.

In the NMTLR model, replacing age, race, marital status, histological type, tumor grade, T stage, N stage, alpha-fetoprotein, tumor size, surgery type, or chemotherapy with random data reduced the concordance index by more than 0.1% on average. In the DeepSurv model, replacing age, race, marital status, histological type, T stage, N stage, alpha-fetoprotein, tumor size, or surgery type with random data similarly reduced the concordance index on average. In the RSF model, age, race, tumor grade, T stage, M stage, tumor size, and surgery type had a notable impact on model accuracy, with the C-index decreasing by more than 0.1% on average when each was replaced with random data.
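The importance measure used in this section can be sketched as a permutation test. The version below shuffles one feature across patients, which is one common way of replacing its values with random data and is an assumption here; `toy_predict`, the feature names, and the toy values are hypothetical stand-ins for a fitted model and real patient records.

```python
import random

def c_index(times, events, risks):
    """Harrell's C-index over pairs anchored on an observed event."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if not events[i]:
            continue
        for j in range(len(times)):
            if times[j] > times[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

def permutation_importance(predict_risk, rows, times, events, feature,
                           n_repeats=10, seed=0):
    """Mean drop in C-index after shuffling one feature across patients."""
    rng = random.Random(seed)
    baseline = c_index(times, events, predict_risk(rows))
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[feature] for row in rows]
        rng.shuffle(shuffled)
        permuted = [{**row, feature: value} for row, value in zip(rows, shuffled)]
        drops.append(baseline - c_index(times, events, predict_risk(permuted)))
    return sum(drops) / n_repeats

def toy_predict(rows):
    # Hypothetical "model": risk grows with tumor size and ignores everything else.
    return [row["tumor_size"] for row in rows]

sizes = [10, 20, 30, 40, 50, 60, 70, 80]
rows = [{"tumor_size": s, "noise": i % 3} for i, s in enumerate(sizes)]
times = [80, 70, 60, 50, 40, 30, 20, 10]  # larger tumors, shorter survival
events = [1] * len(rows)

size_drop = permutation_importance(toy_predict, rows, times, events, "tumor_size")
noise_drop = permutation_importance(toy_predict, rows, times, events, "noise")
```

A feature that the model does not use, such as the toy noise column, produces no drop, while shuffling an informative feature lowers the C-index.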

Risk stratification ability of the NMTLR model

In the training cohort, the NMTLR model was employed to predict the risk probabilities of patients. The optimal thresholds of these probabilities were determined using X-tile software. Based on these cutoff points, patients were classified into low-risk (< 178.8), medium-risk (178.8–248.4), and high-risk (> 248.4) categories. As shown in Figure 6A, the survival curves differed significantly between the groups (p < 0.001). As shown in Figure 6B, similar results were replicated in the external validation cohort, highlighting the robust risk classification capability of the NMTLR model.
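Applying the reported cutoffs to a model output amounts to a simple threshold rule, sketched below. The cutoff values come from the text; the function name, the group labels, and the example scores are illustrative assumptions.

```python
def risk_group(score, low_cutoff=178.8, high_cutoff=248.4):
    """Assign a risk group from a model output using the reported cutoffs."""
    if score < low_cutoff:
        return "low"
    if score <= high_cutoff:
        return "medium"  # 178.8 to 248.4 inclusive
    return "high"

# Made-up scores for three hypothetical patients.
groups = [risk_group(score) for score in (150.0, 200.0, 260.0)]
```

The resulting groups can then be compared with Kaplan-Meier curves and a log-rank test, as in Figure 6.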

Deploying the model

The web application developed in this study is primarily for research or informational purposes and is publicly available at http://120.55.167.119:8501/. A visualization of the functionality and output of this application is shown in Figure 7 and Figure 1 in the Supplementary Material.