Machine learning-based construction of a programmed cell death-related model reveals prognosis and immune infiltration in pancreatic adenocarcinoma patients

Machine Learning


Screening and identification of PCD related prognostic candidate genes in PAAD

To minimize batch effects across cohorts, batch effect correction was performed for the GSE183795, GSE62452, and GSE28735 cohorts, which were subsequently integrated into a combined GSE-merged cohort (Supplementary Fig. 1 A, B). Differential expression analysis identified 1,393 Differentially expressed genes (DEGs) in the TCGA-PAAD cohort and 900 DEGs in the GSE-merged cohort, resulting in a total of 2,129 unique DEGs. Among them, the TCGA-PAAD cohort contains 551 upregulated genes and 842 downregulated genes; whereas the GSE-merged cohort contains 583 upregulated genes and 317 downregulated genes, respectively (Fig. 1A, B). Through Venn plot analysis, 334 PCD-related DEGs shared between the TCGA-PAAD and GSE-merged cohorts were identified (Fig. 1C, Supplementary Table 1). Based on these genes, univariate Cox regression analysis was performed, resulting in the identification of 96 genes in the TCGA-PAAD cohort, 108 genes in the GSE-merged cohort and 73 genes in the ICGC cohort that were both differentially expressed and significantly associated with OS (P < 0.05). The 17 overlapping DEGs associated with OS across the three cohorts were considered potential prognostic candidate genes for PAAD (Supplementary Fig. 1 C, Supplementary Table 2). The 17 candidate genes’ univariate Cox analysis in the GSE-merged, TCGA-PAAD and ICGC cohorts were shown in Fig. 1D, Supplementary Fig. 1D and E, respectively. Subsequently, protein-protein interaction (PPI) network analysis was conducted to explore the interactions among the 17 prognostic candidate genes. Among them, FN1 exhibited the highest degree of connectivity with other genes (Fig. 1E). Heatmap revealed that CLU was downregulated in tumor samples from the GSE-merged cohort, while the remaining 16 genes were upregulated (Fig. 1G). In addition, copy number variation (CNV) analysis demonstrated frequent genomic alterations among these genes. Specifically, ITGA3 showed the most prominent CNV gain, whereas LAMC2, PLAU, SLC2A1, and CDH3 exhibited significant CNV loss (Fig. 1F).

Fig. 1
figure 1

Identification of prognostic candidate genes associated with 21 types of programmed cell death (PCD). (A, B) The volcano plots of differentially expressed genes (DEGs) in the TCGA-PAAD (A) and GSE-merged (B) cohorts. (C) Venn plot showing the overlapping between DEGs from the TCGA-PAAD and GSE-merged cohorts and the PCD-related genes. (D) Univariate Cox regression analysis of 17 prognostic candidate genes in the GSE-merged cohort. (E) Protein-protein interaction (PPI) network of 17 prognostic candidate genes. (F) Copy number variation (CNV) analysis of the 17 candidate genes. (G) Heatmap of expression levels of the 17 candidate genes in tumor and normal tissues from the GSE-merged cohort. (H-I) Gene Ontology (GO) (H) and Kyoto Encyclopedia of Genes and Genomes (KEGG) (I) pathway enrichment analysis of the 17 candidate genes.

Gene ontology (GO) and Kyoto encyclopedia of genes and genomes (KEGG) pathway enrichment analysis

GO and KEGG enrichment analyses of the PCD-related prognostic candidate genes revealed their involvement in multiple cellular signaling pathways, metabolic processes, and cancer-related pathways. GO enrichment indicated that these genes were primarily associated with biological processes (BP) such as regulation of transforming growth factor beta production, cellular components (CC) such as the basement membrane, and molecular functions (MF) such as extracellular matrix structural constituent (Fig. 1H). KEGG pathway analysis further highlighted significant enrichment in pathways including ECM-receptor interaction, PI3K-Akt signaling pathway, Central carbon metabolism in cancer, and HIF-1 signaling pathway (Fig. 1I).

Construction of a PCD related prognostic model using machine learning

We employed an integrated machine learning-based approach to develop a prognostic model using the expression profiles of 17 prognostic candidate genes. A total of 117 predictive models were fitted under a Leave-one-out cross-validation (LOOCV) framework, and the Harrell’s concordance index (C-index) was calculated for each model. Among them, the StepCox[both] + Ridge combination achieved the highest average C-index of 0.64 (Fig. 2A). Prior studies using RNA-based signatures have reported C-index ranging from 0.61-0.7121,22,23, suggesting that our model performs within an expected and acceptable range. Using this optimal algorithm, six PCDRGs, ITGA3, CDCP1, IL1RAP, CLU, PBK, and PLAU, were selected (Supplementary Table 3). A Circos plot was generated to visualize the chromosomal locations of these PCDRGs (Fig. 2B).

Fig. 2
figure 2

Integrated machine learning-based development and validation of a predictive model associated with 21 types of PCD. (A) Harrell’s concordance index (C-index) values of 117 algorithmic combinations were calculated to identify the optimal predictive model and algorithm. (B) Circos plot showing the chromosomal locations of the PCD-related genes (PCDRGs). (C) Gene Set Variation Analysis (GSVA) analysis revealed differences in enriched pathways between high- and low-risk groups in the GSE-merged cohort. (D) Boxplots showing expression differences of PCDRGs between normal and tumor tissues in the GSE-merged cohort. (E) Pearson correlation analysis among the PCDRGs. (F-K) Kaplan-Meier survival analyses of the six PCDRGs in the TCGA-PAAD cohort. (L-M) Waterfall plot showing the somatic mutation landscape of low-risk (L) and high-risk samples (M) in the TCGA-PAAD cohort. **: P < 0.01; ***: P < 0.001.

Subsequently, individual risk scores were calculated for each patient using the established risk score formula, and patients were stratified into high- and low-risk groups based on the median risk score. To explore the biological processes significantly enriched in each risk group, Gene Set Variation Analysis (GSVA) was performed. Results indicated that in the GSE-merged cohort, notable pathway enrichment differences were observed between the high- and low-risk groups. High risk samples were mainly enriched in cancer-related signaling pathways, including DNA repair, Wnt/β-catenin signaling, MYC targets V1, KRAS signaling dn, and Hedgehog signaling, while low risk samples were enriched in TNF-α Signaling via NF-κB, Angiogenesis, KRAS signaling up and Apoptosis, suggesting distinct underlying biological mechanisms between the two groups (Fig. 2C). In the TCGA-PAAD cohort, high risk samples showed enrichment in the Reactive oxygen species pathway, Oxidative phosphorylation and Inflammatory response, whereas low risk samples were primarily enriched in the p53 pathway and Wnt/β-catenin signaling (Supplementary Fig. 2 A).

Further analysis revealed significant differences in the expression levels of PCDRGs between normal and tumor tissues in the GSE-merged cohort. CLU was significantly downregulated in tumor samples, whereas the other five genes, ITGA3, CDCP1, IL1RAP, PBK, and PLAU, were significantly upregulated (P < 0.001) (Fig. 2D). Pearson correlation analysis showed a negative correlation between CLU and the other five genes, while ITGA3, CDCP1, IL1RAP, PBK, and PLAU exhibited positive correlations with each other (Fig. 2E). Kaplan-Meier survival analysis of the six PCDRGs in the TCGA-PAAD cohort revealed that patients with low CLU expression had poorer prognosis, while high expression of the remaining five genes was associated with worse prognosis (Fig. 2F-K). These results are consistent with their differential expression patterns in tumor samples. In addition, we analyzed the mutational landscape of patients in different risk groups. The high-risk group exhibited a markedly higher mutation frequency (95.12%) compared to the low-risk group (64.86%) (Fig. 2L, M). Notably, mutations in TGFBR2, ATM, and CHD6 were more frequent in the low-risk group, whereas RIMS2, TPO, and TNXB mutations were more prevalent in the high-risk group.

Evaluation of the clinical relevance of the PCDRGs prediction model

We validated the predictive performance of the risk model across five independent cohorts, TCGA-PAAD, GSE183795, GSE28735, GSE62452, and ICGC, and further explored the relationship between the risk score and clinical characteristics. As shown in Fig. 3A, patients in the high-risk group exhibited shorter OS time and poorer prognoses compared to those in the low-risk group across all cohorts. This finding was further supported by Kaplan-Meier survival analysis, which consistently demonstrated significantly shorter OS time in the high-risk group across all five cohorts (Fig. 3B). Principal component analysis (PCA) also revealed distinct distributions between high- and low-risk patients in each cohort (Fig. 3C).

Fig. 3
figure 3

Validation of the prognostic model and association between risk score and clinical characteristics. (A) Survival status of high- and low-risk patients across TCGA-PAAD, GSE183795, GSE28735, GSE62452, and ICGC cohorts. (B) Kaplan-Meier survival curves comparing high- and low-risk groups in the five cohorts. (C) PCA illustrating the distribution of high- and low-risk patients across the five cohorts. (D) Violin plots showing the distribution of risk scores between alive and dead patients in the five cohorts. (E-H) Violin plots comparing risk scores across different clinical subgroups in the TCGA-PAAD cohort: T (E), N (F), M (G), and Stage (H).

Violin plots showed that in the TCGA-PAAD, GSE183795, and ICGC cohorts, risk scores were significantly higher in dead patients compared to those who were alive (P < 0.05). Although no statistically significant differences were observed in the remaining two cohorts, dead patients still exhibited higher risk scores (Fig. 3D). Additionally, in the TCGA-PAAD cohort, the risk score was significantly associated with clinical staging. Patients with T3 + T4 tumors had significantly higher risk scores than those with T1 + T2 tumors, and patients at Stage II had higher scores compared to those at Stage I (P < 0.05). However, no significant differences in risk scores were observed across N and M stages (Fig. 3E-H).

Unsupervised consensus clustering analysis of prognostic PCDRGs

Unsupervised consensus clustering was performed to stratify PAAD patients and identify subgroups with distinct molecular characteristics. The optimal number of clusters was determined to be k = 2, and thus, patients in the TCGA-PAAD cohort were divided into two clusters (Fig. 4A, B). Subsequent Kaplan-Meier survival analysis revealed that patients in cluster C2 had significantly worse survival outcomes compared to those in cluster C1 (P = 0.01) (Fig. 4C). Moreover, a heatmap integrating clinical features and PCDRGs expression levels demonstrated marked differences in gene expression between the two clusters. Most patients in cluster C1 were characterized by early-stage disease and lower risk scores, whereas those in cluster C2 were predominantly in advanced stages and had higher risk scores (Fig. 4D).

Fig. 4
figure 4

Unsupervised clustering and construction of a nomogram model in the TCGA-PAAD cohort. (A) Unsupervised consensus clustering divided patients into two clusters. (B) The optimal number of clusters was determined to be k = 2. (C) Kaplan-Meier survival analysis comparing patients in clusters C1 and C2. (D) Heatmap showing the distribution of PCDRGs expression levels and clinicopathological characteristics across clusters. (E) Univariate Cox regression analysis of risk score and clinicopathological characteristics in the TCGA-PAAD cohort. (F) Multivariate Cox regression analysis of the same variables. (G) Nomogram constructed to predict 1-, 3-, and 5-year overall survival (OS) rate in PAAD patients. (H) Calibration curves evaluating the predictive accuracy of the nomogram. (I) Kaplan-Meier survival curves for patients stratified by nomogram scores. (J) Receiver operating characteristic (ROC) curves assessing the predictive performance of the nomogram for clinicopathological features. (K) ROC curves evaluating the predictive accuracy of the nomogram for 1-, 3-, and 5-year OS.

Establishment and assessment of the nomogram survival model

Univariate and multivariate Cox regression analyses were performed to evaluate the associations between OS and the risk score, along with clinicopathological characteristics in the TCGA-PAAD cohort (Fig. 4E, F). The results indicated that grade, T stage, N stage, tumor position, and risk score were all independent prognostic factors. Based on these variables, a nomogram was constructed to predict the 1-, 3-, and 5-year OS probabilities of PAAD patients (Fig. 4G). Calibration curves demonstrated good concordance between the predicted and observed survival rates at 1-, 3-, and 5-years (Fig. 4H). Kaplan-Meier survival analysis revealed that patients in the high nomogram score group had significantly worse prognosis (Fig. 4I). Receiver operating characteristic (ROC) curve analysis demonstrated the robust predictive performance of the model. The area under the ROC curve (AUC) for the risk score was 0.727, followed by grade (AUC = 0.656), stage (AUC = 0.679), T stage (AUC = 0.633), N stage (AUC = 0.752), state (AUC = 0.992), and cluster (AUC = 0.650). The AUCs for predicting 1-, 3-, and 5-year OS rate were 0.753, 0.776, and 0.727, respectively (Fig. 4J, K).

Immune landscape and its correlation with PCDRGs in PAAD patients

CIBERSORT analysis revealed that high-risk patients exhibited higher proportions of Macrophages M0 and Macrophages M2, whereas low-risk patients had higher proportions of T cells CD4 naïve, B cells naïve, and T cells CD8 (Fig. 5A). A heatmap of correlation analysis showed that PLAU was positively correlated with Macrophages M0, Macrophages M2 and Neutrophils, but negatively correlated with B cells naïve, Monocytes, NK cells activated, T cells CD4 memory resting and T cells CD8. In contrast, CLU was positively correlated with B cells naïve, Monocytes, and T cells CD8, but negatively correlated with Macrophages M0 (Fig. 5B). Significant associations were observed between the expression levels of PCDRGs and the abundance of 22 immune cell types. Specifically, CLU was negatively correlated with Macrophages M0, and PLAU was negatively correlated with T cells CD4 memory resting cells in both high- and low-risk groups (Fig. 5C-J). Furthermore, we analyzed the correlations between PCDRGs and tumor microenvironment (TME)-related gene sets and identified the top 21 TME-related gene sets with the strongest correlations. The results showed that PCDRGs were significantly positively correlated with multiple TME-related gene sets (Supplementary Fig. 2B). ESTIMATE algorithm analysis revealed that low-risk PAAD patients exhibited higher ImmuneScore and ESTIMATEScore, but lower tumor immune dysfunction and exclusion (TIDE) scores compared to the high-risk group (Fig. 5K, L). Moreover, immune checkpoint related genes, including BTN2A1, CD40, CEACAM1, PDCD1LG2, and TNFSF9, were highly expressed in the high-risk group (Fig. 5M).

Fig. 5
figure 5

Immune microenvironment analysis based on the risk score. (A) Boxplots showing differences in the relative proportions of 22 immune cell subtypes between high- and low-risk groups. (B) Heatmap illustrating correlations between PCDRGs and 22 immune cell populations. (C-J) Scatter plots displaying the correlations between PCDRGs expression levels and immune cell abundance. (K) Violin plots comparing StromalScore, ImmuneScore, and ESTIMATEScore between high- and low-risk groups. (L) Differences in TIDE scores between the two risk groups. (M) Boxplots showing the expression differences of 21 immune checkpoint-related genes between high- and low-risk groups. *: P < 0.05; **: P < 0.01; ***: P < 0.001.

Impact of PCDRGs on targeted therapy response in PAAD patients

To investigate the relationship between our prognostic model and drug sensitivity, we analyzed the half-maximal inhibitory concentration (IC50) values of several chemotherapeutic agents in PAAD samples using the Genomics of Drug Sensitivity in Cancer (GDSC) database. Boxplots illustrated the differences in IC50 values between the high- and low-risk groups for each drug (Fig. 6A). Interestingly, the IC50 values of Cisplatin, Niraparib, Olaparib and Oxaliplatin were significantly lower in the low-risk group (Fig. 6B-E), whereas the IC50 values of Afatinib, Sapitinib, SCH772984, and ULK1_4989 were lower in the high-risk group (Fig. 6F-I).

Fig. 6
figure 6

Predictive value of PCD-based risk model for chemotherapeutic drug sensitivity. (A) Boxplots showing differences in drug sensitivity between high- and low-risk groups across various chemotherapeutic agents. (B-I) Comparison of IC50 values for individual drugs between the high- and low-risk groups.

Validation of the identified PCDRGs

To validate the expression of the identified six PCDRGs in PAAD, we performed quantitative real-time PCR (qRT-PCR) on PAAD tissues and peritumoral PAAD tissues. The results showed that the mRNA expression levels of ITGA3, CDCP1, IL1RAP, CLU, PBK, and PLAU were significantly upregulated in PAAD tissues compared to peritumoral PAAD tissues (Supplementary Fig. 3A-F). Among them, the expression patterns of ITGA3, CDCP1, IL1RAP, PBK, and PLAU were consistent with the bioinformatics results from the GSE-merged cohort. These findings further confirmed the differential expression characteristics of these PCDRGs in PAAD and suggested their potential roles in tumor development and progression.



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *