Prognostic significance of migrasomes in neuroblastoma through machine learning and multi-omics

Machine Learning


Unsupervised clustering and clinical relevance of migrasome-associated genes

To comprehend the clinical relevance and distribution of migrasome-related gene expression in neuroblastoma, we performed an unsupervised clustering analysis, the results of which are vividly portrayed in Fig. 1. The consensus cumulative distribution function (CDF) in Fig. 1A displays a clear distinction among different potential clusters, suggesting a robust clustering scheme. The delta area curve in Fig. 1B accentuates a noticeable decline, indicating the optimal number of clusters, which in this case appears to be three. The consensus matrix heatmap (Fig. 1C) visually distinguishes between Cluster1, Cluster2, and Cluster3, highlighting robust intra-cluster consensus and inter-cluster differentiation. Figure 1D illustrates a t-SNE plot demonstrating the spatial distribution of the three clusters post-classification. The clear separation between clusters emphasizes the success of the clustering and the inherent differences in migrasome-related gene expressions among the clusters. Survival probability comparisons among the three clusters (Fig. 1E) show significant differences. Specifically, Cluster1 and Cluster3 display a relatively better prognosis than Cluster2. The survival curves begin to deviate significantly after 2000 days, and this difference remains pronounced throughout the observation period. In conclusion, cluster 1, comprised of 228 samples, exhibits better overall survival rates and a lower likelihood of MYCN amplification. Cluster 2, containing 115 samples, is characterized by a higher incidence of MYCN amplification and poorer survival outcomes. Cluster 3 includes 155 samples, with clinical characteristics that are a mix of cluster 1 and cluster 2, generally showing moderate survival outcomes.

Figure 1
figure 1

Unsupervised Clustering and Genetic Analysis of Migrasome-associated Genes in Neuroblastoma. A Consensus cumulative distribution function (CDF) graph demonstrating the distinction among potential clusters. B Delta area curve illustrating the determination of the optimal number of clusters, with three identified as the ideal. C Heatmap of the consensus matrix, presenting the differentiation between Cluster1, Cluster2, and Cluster3 based on migrasome-associated gene expression. D t-SNE plot highlighting the spatial distribution of the three identified clusters. E Kaplan–Meier survival curves for Cluster1, Cluster2, and Cluster3, depicting survival probability over time. F Heatmap displaying the expression levels of migrasome-associated genes such as PIGK, TSPAN4, NDST1, and ITGB1 across the samples. G Alluvial diagram representing the correlation between MYCN amplification status and the three clusters. H Bar chart illustrating the distribution percentages of MYCN amplified (Amp) and non-amplified (Non-Amp) samples across the clusters. I Alluvial diagram showcasing the correlation between INSS stage distribution and the three clusters. J Bar graph presenting the distribution of INSS stages across Cluster1, Cluster2, and Cluster3. K Scatter plot demonstrating the mRNA stemness index (mRNAsi) values for each of the three clusters. L Venn diagram portraying the overlap and unique differentially expressed genes between the clusters.

Cluster correlation with genetic and clinical markers

Figure 1F showcases a heatmap of the expression of specific migrasome-related genes in each sample. Notable genes such as PIGK, TSPAN4, NDST1, and ITGB1 manifest marked expression variations among the clusters, potentially driving the differences in survival patterns. The distribution of MYCN amplification status among the clusters is represented in Fig. 1G and H. Clusters exhibit varying proportions of MYCN amplified (Amp) and non-amplified (Non-Amp) samples. Cluster2, in particular, demonstrates a heightened percentage of MYCN-amplified cases, potentially correlating with its poorer prognosis. The INSS stage distribution among clusters is delineated in Fig. 1I and J. Each cluster encapsulates a range of disease stages. Significantly, Cluster2 is devoid of stage 4S cases, indicating potential clinical and molecular distinctions within this group. Figure 1K delves into the mRNA stemness index (mRNAsi) across the clusters. It’s evident that Cluster2 exhibits the highest stemness index, implying a pronounced stem-like quality in this group when compared to the other clusters. The Venn diagram in Fig. 1L provides a comprehensive overlap of differentially expressed genes among the clusters, hinting at unique as well as shared migrasome-related genetic signatures that could underpin the distinctions observed among the clusters.

Machine learning analysis illuminates prognostic genes in neuroblastoma

Drawn from insights of Fig. 1, a refined machine learning strategy was deployed, narrowing focus on the 283 intersecting genes. Out of these, 50 were discerned to have significant correlation with neuroblastoma prognosis and were thus employed as inputs for machine learning. To identify the best-performing model, we evaluated ten distinct machine learning algorithms and 101 algorithmic combinations using the GSE62564 dataset as the training set. We employed Harrell’s concordance index (C-index) as the primary metric for evaluating the predictive power of each model, given its robustness in survival analysis. The models were validated on three independent datasets: GSE181559, target, and fwr144. The mean C-index scores for each model were calculated across these validation datasets to select the best model. The CoxBoost + Enet (α = 0.1) model emerged as the best-performing model with the highest mean C-index of 0.664 across the GSE181559, target, and fwr144 datasets. In the GSE181559 dataset, it achieved a C-index of 0.699, while in the target dataset, it reached 0.611, and in the fwr144 dataset, it demonstrated a C-index of 0.683. These results consistently indicated superior performance compared to other models such as the Random Survival Forest (RSF) and GBM, which showed mean C-indices of 0.652 and 0.625, respectively (Supplementary Fig. 1). This model provided the most reliable predictions for neuroblastoma prognosis.

Figure 2A and B dive into the intricacies of the Coxboost + Enet model with α = 0.1. Figure 2A portrays the trajectory of regression coefficients. As the penalty (log lambda) intensifies, numerous gene coefficients approach zero, signifying their reduced significance. However, a certain subset retains non-zero coefficients, underscoring their pivotal relevance. Figure 2B further exemplifies the number of genes selected at varied penalty values, underscoring the model’s stability and resilience to overfitting. Figure 2C hones in on the genes exhibiting significant coefficients, implying their indispensable role in neuroblastoma prognosis. Preeminent among these are genes such as NACC2, LAMB2, and CDH5. The magnitudes and orientations of these coefficients signify the genes’ positive or negative prognostic implications. The MigScore, is calculated by multiplying the expression level of each gene in the dataset by its corresponding coefficient, as depicted in Fig. 2C. This gives rise to the MigScore for every individual sample. Figure 2D conducts a nuanced survival analysis, categorizing the GSE62564 dataset into high and low MigScore groups. A stark disparity in survival trajectories is evident. Samples with elevated MigScores face notably diminished survival probabilities compared to their lower-scoring counterparts. This delineation is further bolstered by a significant hazard ratio of 12.95 and a compelling p-value (< 0.001). The three external validation cohorts displayed survival curves with trends consistent with the training set (Fig. 2E–G).

Figure 2
figure 2

Machine Learning Evaluation of Prognostic Genes in Neuroblastoma. A Illustration of regression coefficients for the Coxboost + Enet model (α = 0.1). B Demonstration of the number of genes selected at different penalty values for the Coxboost + Enet model, highlighting the model’s stability and resistance to overfitting. C Highlight of genes with significant coefficients indicating their vital role in neuroblastoma prognosis. The orientation and magnitude of these coefficients hint at the positive or negative prognostic implications of the respective genes. The MigScore is determined by multiplying each gene’s expression level in the dataset by its respective coefficient. DG Survival analysis of the GSE62564, fwr144, GSE18159, and target dataset, segregating it into high and low MigScore groups. The graph depicts distinct survival trajectories for each group.

Single-cell validation of the prognostic model established from bulk-RNA-seq data

To ensure the robustness and reliability of our MigScore model derived from bulk RNA-seq data, we validated it using two independent single-cell RNA-seq datasets: nb-dong (GSE137804) and nb-jansky. This cross-validation aimed to confirm that the prognostic signatures identified in bulk samples are consistent and reproducible at the single-cell level.

The single-cell RNA-seq datasets were preprocessed and annotated according to established protocols. The cell composition of the nb-dong (GSE137804) and nb-jansky datasets were shown in Supplementary Fig. 2. The MigScore for each tumor cell was calculated using the Seurat package’s AddModuleScore function. In the nb-dong dataset, we applied the MigScore calculation to the identified tumor cells. The results showed a clear distinction in MigScores between high-risk and low-risk tumor cells. Tumor cells with higher MigScores were associated with poorer prognostic indicators, consistent with our findings from the bulk RNA-seq data. The distribution of MigScores was significantly higher in high-risk tumor cells, indicating a robust correlation between MigScore and tumor aggressiveness at the single-cell level (Fig. 3A, B).

Figure 3
figure 3

Single-Cell Validation of Prognostic Model Derived from Bulk-RNA-Seq Data. A MigScore visualization for tumor cells in the nb-dong dataset. Cells are grouped based on their risk categories, with the gradient illustrating the MigScore magnitude. B Violin plot comparing the MigScore distribution between high-risk and low-risk tumor cells in the nb-dong dataset. C Visualization of MigScores for tumor cells in the nb-jansky dataset, with cells color-coded based on risk categories, demonstrating the gradient of MigScores. D Violin plot contrasting MigScores between high-risk and low-risk tumor cells in the nb-jansky dataset.

For the nb-jansky dataset, we similarly calculated the MigScores for individual tumor cells. The UMAP visualization revealed distinct clusters of high-risk and low-risk tumor cells of MigScores(Fig. 3C, D). High-risk tumor cells exhibited elevated MigScores, reinforcing the model’s predictive power. This validation further demonstrated that the MigScore is a reliable metric for assessing tumor cell aggressiveness and potential prognosis in neuroblastoma.

Clinical relevance of MigScore in the GSE62564 dataset

To discern the correlation between the devised MigScore and pertinent clinical parameters, and to further understand its prognostic potential, we delved into the GSE62564 dataset. Supplementary Fig. 3A depicts the spread of MigScore across all samples, further categorized into high and low score groups. The accompanying heatmap conveys the overlap of MigScore with a plethora of clinical attributes. Significant interplay is evident between MigScore and various clusters, MYCN amplification status, INSS stage, age, overall survival (OS), disease progression, and COG risk classification. This suggests that MigScore is not an isolated metric but is intertwined with pivotal clinical attributes. Figure 4A represents the distribution of MigScore amongst three distinct clusters. A discernible difference in MigScores between the clusters is evident. MYCN amplification, a crucial genetic marker in neuroblastoma, is closely tied with MigScore. As depicted in Fig. 4B, samples with MYCN amplification (MYCN-Amp) register significantly different MigScores compared to those without amplification (MYCN-Nonamp).

Figure 4
figure 4

Clinical Relevance and Prognostic Potential of MigScore in the GSE62564 Dataset. A Violin plots representing the distribution of MigScore among three distinct clusters. B Violin plots contrasting MigScores of samples with MYCN amplification (MYCN-Amp) against those without amplification (MYCN-Nonamp). C Violin plots detailing MigScore distribution across different INSS stages, ranging from st1 to st4s. D Time-dependent Receiver Operating Characteristic (ROC) curves comparing the performance of MigScore against other clinical parameters: INSS stage, MYCN status, and age over varying time durations. E Decision curve analysis at different timepoints (1-Year, 4-Year, and 8-Year), highlighting the net benefit of utilizing MigScore in prognosis compared to other clinical markers.

Delving deeper into clinical stages, Fig. 4C unveils the MigScore distribution across various INSS stages (from st1 to st4s). With progression in stages, the variability in scores is palpable, suggesting a potential link between disease stage and MigScore.

Prognostic potential of MigScore

The time-dependent ROC curves were constructed using the “timeROC” package in R, which calculates the ROC curve for each time point by considering the censored survival data27. This method allows for the assessment of the predictive accuracy of the MigScore and other clinical parameters at various time intervals, providing a dynamic view of its prognostic value over time. The time-dependent ROC curves exhibit the performance of MigScore compared with other clinical parameters like INSS stage, MYCN status, and age over a temporal spectrum. MigScore, with its performance trajectory, underscores its potential as a dependable prognostic marker, often outshining other clinical markers in certain timeframes (Fig. 4D).

The Decision Curve Analysis (DCA) was performed using the “rmda” package in R28. The y-axis represents the net benefit, which is calculated by considering the true positive rate and false positive rate, adjusted for the harm of unnecessary treatments. The x-axis represents the risk thresholds, which are the probabilities at which a patient would opt for a treatment. Figure 4E presents decision curve analysis for different timepoints (1-Year, 4-Year, and 8-Year). The graph elucidates the net benefit of using MigScore for prognosis over other clinical markers. In all timepoints, MigScore consistently delivers promising results, advocating its applicability in clinical prognosis.

Biological function exploration of MigScore groups using GSEA

To elucidate the potential biological roles and pathways associated with varying MigScores, we utilized the Gene Set Enrichment Analysis (GSEA) method. The comparison between high and low MigScore groups unveiled several significantly enriched pathways. From the chromosomal standpoint, the high MigScore group exhibited associations with chr13p12, chr19p12, and chr2q11, among others. Moreover, the high MigScore group correlated with specific cancer and disease phenotypes such as the bladder cancer and the invasive ductal breast cancer, suggesting its potential involvement in these disease processes. Several microRNAs, including MIR365_3 and MIR3943_3, and gene targets such as foxr2 displayed enrichment in the high MigScore group, emphasizing the importance of post-transcriptional regulation and specific transcription factors in the MigScore context. Furthermore, distinct modules such as MODULE_44 and MODULE_33 appeared more prevalent in the high MigScore samples. Biological processes and cellular component associations also emerged, with notable enrichments in adaptive immune response, and ribonucleoprotein complex biogenesis (Fig. 5A).

Figure 5
figure 5

Biological Implications and Network of Pathways Associated with MigScore Variation. A Gene Set Enrichment Analysis (GSEA) of high vs low MigScore samples. The color scale on the right indicates different categories, while the number of genes in each category is represented by the length of the bars. B Network visualization of selected enriched pathways based on GSEA results. Each circle (node) represents a specific biological pathway, and the connections (edges) between them suggest potential interactions or shared functions. The size of the circle corresponds to the number of genes in that pathway, while the color gradient represents the Normalized Enrichment Score (NES), with deeper colors indicating higher enrichment.

Network visualization of selected enriched pathways

To better understand the interconnectedness and crosstalk of the enriched pathways, we employed the aPEAR package (Advanced Pathway Enrichment Analysis Representation) for the autonomous visualization of pathway enrichment networks29. The aPEAR package is designed to address the redundancy and complexity often encountered in pathway enrichment analysis results by aggregating similar pathways and visualizing their interactions as a network of interconnected clusters. The high MigScore group exhibited higher pathway activity like ribosome biogenesis, sister chromatid segregation, and inner mitochondrial membrane protein complex, and lower pathway activity like exogenous antigen presentation and leukocyte cell adhesion. The size of the pathway nodes indicates the number of genes involved, and the NES (Normalized Enrichment Score) depicts the extent of enrichment in the given pathway (Fig. 5B).

Exploration of immune infiltration and predictive capability of immune response between MigScore groups

To shed light on the differential immune infiltration and predictive capability of immune response between the MigScore groups, a combination of analytical methods and evaluations were employed. The heatmap in Fig. 6A displays the deconstruction of immune infiltration across the two MigScore groups utilizing three distinguished algorithms: ESTIMATE, Cibersort, and MCPCounter. It is evident from the heatmap that there’s a distinct cellular composition of immune cells between the high and low MigScore groups. The variations in the infiltration levels of immune cell types such as T cells, B cells, NK cells, monocytes, fibroblasts, among others, underscore a differential immune milieu associated with the MigScore levels. The violin plots presented in Panels B and C elucidate the Immune Prognostic Score (IPS) and Tumor Immune Dysfunction and Exclusion (TIDE) metrics for the respective MigScore groups (Fig. 6B and C). Interestingly, the low MigScore group showcased a higher IPS and a decreased TIDE score, suggesting a more conducive immune response environment in comparison to the high MigScore group. The bar graph outlines the predicted immune response across MigScore groups(Fig. 6D). It can be observed that the low MigScore group predominantly displayed a responsive (R) immune behavior, contrasting the high group which leaned more towards a non-responsive (NR) trend. The stacked bar chart offers insights into the overall predicted immune response rates between the groups. A pronounced difference is evident with the low MigScore group indicating a superior 65% responsive rate as opposed to the 36% responsiveness seen in the high group (Fig. 6E).

Figure 6
figure 6

Immune Infiltration Analysis and Predictive Immune Response Assessment Between MigScore Groups. A Heatmap representing immune infiltration across high and low MigScore groups using algorithms: ESTIMATE, Cibersort, and MCPCounter. Color intensities indicate the relative abundance of immune cell types, with legend denoting scale. B Violin plot illustrating Tumor Immune Dysfunction and Exclusion (TIDE) metric distributions for high and low MigScore groups. C Violin plot showing the Immune Prognostic Score (IPS) distributions for the two MigScore groups. D Bar graph representing predicted immune response, categorized as responsive (R) and non-responsive (NR), across different MigScore levels. E Stacked bar chart showcasing the percentages of predicted immune response (R and NR) for high and low MigScore groups. F Kaplan–Meier survival curves for patients in high and low MigScore groups after receiving immunotherapy, based on the GSE91061 dataset. The X-axis denotes time (in days), and the Y-axis represents survival probability. G Bar chart comparing the immunotherapy response rate for patients in high and low MigScore groups, using the GSE91061 dataset. H Kaplan–Meier survival curves for patients in the high and low MigScore groups following immunotherapy, based on the PRJEB23709 dataset. The X-axis represents time (in days), and the Y-axis denotes survival probability. I Bar chart illustrating the immunotherapy response rate for patients in the high and low MigScore groups, based on the PRJEB23709 dataset.

External validation of immunotherapy outcomes and associated survival metrics

Figure 6F features the Kaplan–Meier survival curves for the two MigScore groups post immunotherapy intervention of the GSE91061 dataset. The survival probabilities for the low MigScore group were markedly enhanced in comparison to the high group. Furthermore, the response rate distribution in Fig. 6G suggests a heightened 31% responsiveness in the low MigScore group, contrasting the 16% in the high group.

Figure 6H post immunotherapy draw parallels to the GSE91061 dataset observations of the PRJEB23709 dataset, with the low MigScore group reflecting better survival outcomes. Figure 6I further reinforces this narrative, revealing an impressive 80% response rate for the low MigScore group in contrast to the 46% seen in the high group.

Exploration of potential therapeutic agents between MigScore groups using the CMAP database

The analysis uses the CMAP database to uncover potential therapeutic compounds beneficial for various MigScore groups. Figure 7A, B and C depict the scoring distributions of therapeutic agents across the GSE62564 Dataset, Target Dataset, and fwr144 Dataset, respectively.

Figure 7
figure 7

Therapeutic agent analysis across various datasets. AD Score distribution of therapeutic agents in the GSE62564,Target, fwr144 and GSE181559 Datasets. Compounds are ranked based on their sensitivity score, with -1 indicating maximum sensitivity. E Aggregated therapeutic score distribution for various compounds. Scores are the cumulative summation from datasets A to D. F HDAC1-TF activity plotted against MigScore in the GSE62564 dataset. The relationship between HDAC1-TF activity and MigScore is showcased. G UMAP visualization of HDAC1 activity differentiation between high-risk and low-risk tumor cells in the nb-jansky dataset. H Violin plot illustrating the variation in HDAC1-TF activity between high-risk and low-risk tumor cells.

A score closer to -1 indicates higher sensitivity, suggesting that the compound is more likely to be effective as a therapeutic agent. This scoring is based on the connectivity map (CMAP) methodology, where negative connectivity scores imply that the gene expression signature induced by the compound opposes the disease signature26. Therefore, a score near -1 reflects a strong inverse correlation, meaning the compound potentially counteracts the disease state effectively. Remarkably, MS-275 consistently emerges as the compound nearing the -1 mark across all three datasets, thereby signifying its heightened therapeutic sensitivity. Other compounds like XS109870 and phenoxybenzamine also show scores that suggest potential therapeutic efficacy, but MS-275 remains the most promising among them.

Figure 7D, focused on the GSE181559 Dataset, highlights the scores of MS-275, butein, and STOCK1N-35874. Among these, MS-275 continues its trend, thereby accentuating its potential therapeutic significance. Figure 7E showcases the aggregated therapeutic score distribution for compounds, where scores are the summation from Fig. 7A–D. Here, the lower the score, the higher the compound’s cumulative sensitivity across datasets. MS-275, consistent with previous observations, retains its prominence with one of the lowest aggregated scores, emphasizing its potential therapeutic utility.

Given the prominence of MS-275 as an HDAC inhibitor, we further delve into the activity of HDAC1, casting light on its role and implications. Figure 7F elucidates the association between HDAC1-TF activity and MigScore in the GSE62564 dataset. It becomes clear that the high MigScore group displays augmented HDAC1-TF activity relative to the low MigScore group.

To further explore the HDAC1 activity on single-cell dataset, we utilized the nb-jansky dataset. Figure 7G offers a UMAP visualization differentiating HDAC1 activity between high-risk and low-risk tumor cells. The clusters distinctly indicate that high-risk tumor cells have pronounced HDAC1 activity compared to the low-risk tumor cells. Figure 7H delves deeper into the HDAC1-TF activity variances between high-risk and low-risk tumor cells. It is evident that high-risk tumor cells demonstrate considerably elevated HDAC1-TF activity compared to the low-risk counterparts.



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *