Interpretable machine learning-guided single-cell mapping deciphers multi-lineage pancreatic dysregulation in type 2 diabetes | Cardiovascular Diabetology

Machine Learning


Development and validation of PanSubPred for accurate pancreatic cell subtype identification

To systematically identify novel and robust cell-type-specific marker genes across seven major pancreatic cell populations, including both endocrine and exocrine lineages, we developed PanSubPred, an integrated computational framework tailored for pancreatic cell subtype annotation. Details on the development and validation of PanSubPred are provided in the Supplementary Text S1 (Fig. 1). Using cell labels predicted by PanSubPred, we successfully distinguished cell subpopulations with distinct clustering patterns (Fig. 1B). Among the 64 marker genes, most were exocrine-related (Fig. 1C and D). Notably, 26 genes were previously reported markers, while 38 are novel candidates (Table S6).

Fig. 1
figure 1

Construction and characterization of PanSubPred. A Summary of the PanSubPred method. B Clustering of gene expression profiles of pancreatic islet cells, using cell labels predicted by PanSubPred. C The heatmap of 64 genes in seven pancreatic cell types reveals the cell-type-specific genes (left). The normalized expression level for CELA3A, GC and COL6A2 genes for each cell projected onto UMAP coordinates (right). D The dot plot of 64 genes in different pancreatic cell subpopulations reveals the cell-type-specific genes. E Pathway enrichment analysis of 64 pancreas marker genes. Pathways were ordered by statistical significance (p-value). The Black dashed line represents the statistical significance threshold (p-value < 0.05)

PSC-Stat decodes stellate cell activation dynamics across the T2D-PDAC pathological spectrum

PSCs are central to pancreatic fibro-inflammatory remodeling, where their transition from qPSCs to aPSCs drives disease progression in both diabetes and pancreatic cancer [22, 53]. Precise discrimination of these functionally distinct phenotypes is critical for identifying therapeutic targets against pathological stromal reprogramming. Among the 64 pancreatic cell marker genes identified in PanSubPred, we found that 14 of the 17 stellate-cell-specific genes, including both known markers and newly identified candidates, were differentially expressed between qPSCs and aPSCs (Fig. 2A). This distinct expression pattern suggested that these genes could serve as potential biomarkers for distinguishing between the two PSC phenotypes. To test this hypothesis, we developed PSC-Stat, a dedicated computational framework for PSC phenotype classification (Fig. 2B). During model construction, we trained and compared three machine learning models (SVM, RF, and XGBoost), with the XGBoost-based model achieving the best performance (AUC = 0.956) in 5-fold CV (Fig. 2C). To further optimize the model, we applied three feature selection algorithms, MIC, ANOVA, and Gini, to evaluate the 17 stellate-cell-specific genes. The Gini method identified the top 15 genes as the optimal feature set based on their importance scores (Fig. 2D and E). Notably, MMP14 ranked as the top-priority gene in feature selection, followed by TIMP1 and CRLF1, suggesting their pivotal roles in pancreatic stellate cell activation dynamics (Fig. 2E). This refined gene panel not only improved model performance, but also highlighted molecular features that may play a role in the transition from qPSCs to aPSCs, offering biologically meaningful clues into the activation process.

The optimized PSC-Stat achieved outstanding performance, with an AUC of 0.963 in 5-fold CV and 0.978 on the independent test set (Fig. 2F), demonstrating its robustness and reliability. Notably, it accurately distinguished qPSCs/aPSCs in the T2D status with an AUC of 0.992 (Fig. 2F). To further assess the generalizability of PSC-stat, we validated it on three external independent cohorts: Fang (healthy and T2D individuals) [32], Kang (healthy donors) [16], and Steele (PDAC and adjacent tissues) [33] (Fig. S5A–C). Since these datasets lacked explicit PSC activation labels, we projected the scRNA-seq data onto a publicly available Azimuth integrated human pancreas reference to infer stellate cell phenotypes and obtain corresponding prediction scores (Fig. 2G, S5D and E) [16]. Cells with high-confidence (prediction score > 0.9) were retained for model evaluation. PSC-stat achieved consistent performance across all cohorts (Fig. 2F: AUC: Fang = 0.886, Kang = 0.835, Steele = 0.881), underscoring its robust predictive accuracy across diverse disease contexts. Together, these results validate PSC-stat as a powerful tool for precise classification of PSC state, providing a valuable resource for studying stellate cell dynamics in pancreatic physiology and pathology. PSC-Stat is freely available at http://lin-group.cn/server/PSC-Stat/index.html.

To investigate the pathological relationship between T2D and PDAC, we analyzed stellate cell activation dynamics across disease states. Strikingly, the aPSC/qPSC ratio exhibited a stepwise increase from healthy controls (ND) to T2D and further to PDAC (Fig. 2H-I, S5F, aPSC/qPSC ratio: ND: 1.44 ± 1.02, T2D: 4.72 ± 4.01, PDAC: 18.67 ± 18.70). This cascade activation pattern aligns with clinical observations: epidemiological studies show that new-onset diabetes (NOD) patients have a 5- to 8-fold increased risk of being diagnosed with PDAC within 1 to 3 years [54, 55], and approximately 50% of PDAC cases present with diabetic phenotypes at diagnosis [56]. Our findings provide single-cell evidence that progressive stellate cell activation may serve as a pathological nexus linking T2D to PDAC development. These results establish a cellular dynamics framework for understanding the T2D-PDAC continuum and suggest that monitoring aPSC ratios could serve as a novel biomarker for early PDAC detection in diabetic populations.

Fig. 2
figure 2

Construction and evaluation of PSC-Stat. A The violin plot of 17 stellate-cell-specific genes showed that 14 genes of them were differentially expressed between qPSCs and aPSCs. The statistical significance was as such: ns: p-value > 0.05; *: p-value < 0.05; **: p-value < 0.01; ***: p-value < 0.001; ****: p-value < 0.0001, two-sided Wilcoxon rank-sum test. B Summary of the PSC-Stat framework. C The performance comparison of three classifier algorithms-based models with all 17 genes on 5-fold CV of training set. D The IFS curve of the 17 genes from analysis of Gini, ANOVA, and MIC based on the XGBoost algorithm with 5-fold CV. E The importance scores of the top 15 genes based on gini impurity. F The ROC curve of PSC-Stat for different datasets. G Fang scRNA‑seq data projected on the Azimuth reference (top) and the corresponding cell type annotation prediction score displayed on dimensional reduction plot (bottom). H aPSC/qPSC ratio difference between ND (n = 9) and T2D (n = 4) individuals. two-sided Student’s t-test. I aPSC/qPSC ratio difference among ND (n = 9), T2D (n = 4) and PDAC (n = 14) individuals. two-sided Student’s t-test

Rewired intercellular communication in T2D highlights ductal-centric interaction hubs and a 15-gene diagnostic signature

Recent studies have highlighted that coordinated intercellular communication is essential for maintaining pancreatic function and glucose homeostasis [57]. To investigate how diabetes perturbs this coordination, we systematically compared cell-cell interaction networks between ND and T2D samples using CellCall [46], a computational framework that integrates ligand-receptor (L-R) interactions and downstream TF activities through KEGG pathway-based L-R-TF axes. This method infers potential communication events by calculating the L2-norm of L-R co-expression (normalized via softmax function) combined with TF activity scores derived from regulon-target gene enrichment analysis.

In normal pancreatic tissue, intercellular communication networks primarily centered on exocrine cells, with stellate and ductal cells acting as central signal integrators that coordinate signals from both endocrine and exocrine populations (Fig. 3A). In contrast, T2D patients exhibited significant remodeling of these networks: global interaction frequency (i.e., the number of cell-cell interactions) declined (e.g., complete loss of stellate (sender)-stellate (receiver), delta-alpha, and gamma-stellate interactions etc.), and signal strength (i.e., the communication score calculated by CellCall for each cell-cell interaction) between most cell pairs decreased, particularly the communication targeting stellate cells. This shifted the communication hub toward ductal cells, forming a new pattern where ductal cells become the primary signal recipients (Fig. 3A). To further validate this finding, we conducted comparative analyses using two additional cell-cell communication inference tools, CellChat and CellphoneDB. CellChat analysis confirmed that, compared to the ND group, both the number and the overall strength of inferred interactions were reduced in T2D (Fig. S6A). Examining interaction numbers and strengths consistently showed that in the ND group, the majority of cell-cell interactions were organized around stellate and ductal cells serving as core signal integrators (Fig. S6B, D). In contrast, in T2D samples, the communication network shifted to a ductal cell-centered pattern (Fig. S6C, E). We further used CellChat’s comparative pipeline to visualize the differences in interaction numbers and strengths between ND and T2D using circle diagrams. The results once again supported the observation that in T2D, interactions directed toward ductal cells were notably enhanced, particularly signals from beta cells to ductal cells and from delta to ductal cells (Fig. S6F and G). Additionally, we applied CellphoneDB as a second independent method to identify statistically significant ligand–receptor interactions in both ND and T2D groups, considering interactions with p-values < 0.05 as significant. Consistent with the previous analyses, we found that the number of significant interactions in T2D was lower than in ND. Moreover, in the ND group, significant interactions were predominantly concentrated around stellate and ductal cells, whereas in T2D they were mainly distributed in stellate and ductal cells, with ductal cells emerging as a key hub (Fig. S6H-I).

Interestingly, despite the global attenuation of intercellular signaling, beta-ductal and delta-ductal communication scores were enhanced in T2D (Fig. 3A), potentially driven by ligand-receptor axes such as FGF7FGFR2/3, EFNB3EPHB2/4/6, and EFNA5EPHA2 (Fig. 3B and C, Fig. S7A–C). Further analysis of these ligand-receptor-TF axes in T2D ductal cells identified six key TFs: ETS1, JUN, MYC, NFKB1, NFKBIA, and ABL1 (Fig. 3B, Fig. S7D–F). Additionally, CellCall pathway analysis revealed T2D-specific activation of insulin signaling and axon guidance pathways, predominantly enriched in beta-ductal and delta-ductal interactions (Fig. S7G-H), suggesting their roles in metabolic stress adaptation and neural remodeling during diabetes progression.

To test whether these transcriptional and pathway alterations could serve as potential T2D-specific signatures, we developed a machine learning framework integrating TF-associated genes and pathway components. From the 1,472 target genes (TGs) of the six key TFs in ductal cells (Table S8A) and 304 genes from T2D-enriched pathways (PGs) (Table S8B), we identified 1,729 candidate genes, including 47 shared genes (Fig. 3D). We then extracted their expression profiles from two independent ductal cell datasets, covering both ND and T2D samples. Due to the inherent sparsity in single-cell data, many genes are not detected in most cells; therefore, we excluded genes not expressed in more than 70% of cells. After filtering, 78 and 1,451 genes remained in the two datasets, with 77 genes shared (Fig. 3E). The downstream analyses focused on this robust set of 77 genes. The larger dataset was used as a discovery cohort, randomly split into training (80%) and testing (20%) sets (Table S9). We evaluated multiple algorithms and found that LR-based model achieved the best performance (Fig. 3F). Using MIC, ANOVA, Gini, and GBDT for feature selection combined with LR, we selected the inflection point with top 15 GBDT-ranked genes as the optimal feature set (Fig. 3G-H). Parameter tuning based on these 15 genes yielded a model with good discrimination in both the training and internal test sets (Fig. 3I). External validation on the Segerstolpe dataset further confirmed its robustness (AUC = 0.846; Fig. 3I). These results indicate that integrating communication-related transcriptional hubs and pathway perturbations is a promising strategy for identifying T2D biomarkers mechanistically linked to disease progression. The consistent high AUC values across datasets support the notion that altered intercellular communication is a core feature of T2D and offers important insights into its pathogenesis.

Among the 15 identified genes, the core gene RPL3 (Fig. 3H, ranked first in feature importance) is a downstream target of the transcription factor MYC. In addition, RPS4X, RPL13, RPL36, and ACTG1, part of this 15-gene panel, are also regulated by MYC. Studies have shown that MYC, a known recovery gene associated with apoptosis, also acts as a key regulator of cell proliferation, and its altered expression levels are linked to the development of T2D [58]. Furthermore, four of the 15 genes are downstream targets of ABL1, and another four are regulated by NFKB1. Notably, seven out of the 15 genes, including ACTB (ranked third), S100A6 (ranked fourth), and KRT19 (ranked fifth), are downstream targets of the transcription factor ETS1 (Fig. 3H, Table S8). This suggests that ETS1 may serve as a central hub in regulating the transcriptional network of pancreatic ductal cells, further supporting our hypothesis that communication-regulated transcription factors and their target genes can provide new molecular evidence and potential intervention targets for T2D progression. Collectively, these genes, orchestrated by TFs such as MYC, ETS1, and NFKB1, form a ductal cell-centered communication network, offering a new set of potential molecular biomarkers for diabetes.

Fig. 3
figure 3

Analysis of cell-cell communication among pancreatic cells in ND and T2D. A Circle plot of intercellular communication patterns among seven pancreatic cell subsets in ND (left) and T2D (right), respectively. The outermost ring colors represent the seven distinct cell types. The inner ring colors indicate whether the cell is acting as a sender (ligand-expressing, in red) or a receiver (receptor-expressing, in blue). Each arc represents an inferred communication from a ligand-expressing cell to a receptor-expressing cell, representing the inferred signal directionality. The color of the arc corresponds to the sender cell type, and the color gradient from light to dark reflects the strength of the communication score between the interacting cell pairs. B Sankey plot showing the ligand–receptor–transcription factor (L–R–TF) signaling cascade from beta cells (sender) to ductal cells (receiver) under T2D conditions, in which ligands are expressed in beta cells and both receptors and TFs are in ductal cells. The three columns from left to right represent ligands (sender), receptors (receiver), and TFs, respectively. And the coloring of the genes is only intended to improve readability. Each complete stream indicates a distinct signaling pathway involving a L–R–TF signaling. The color of the stream on the left and right sides are consistent with ligand and receptor respectively, allowing the flow of information through the pathway to be visually tracked. C The communication network from other cells to ductal cells in T2D patients. Select ligand-receptor axes with a communication score greater than 0.5 for visualization. D Venn diagram showing the overlap of target genes associated with 6 key TFs in ductal cells (TGs) and two disease-specific pathways (PGs). E Venn diagram showing the shared genes in ductal cells between the Fang et al. (discovery) and Segerstolpe et al. (validation) datasets. F The performance comparison of four classifier algorithms-based models with the shared 77 genes on training set. G The IFS curve of gene selection using LR and four feature selection methods on training set. The black dotted line represents the top 15 genes selected by GBDT. H The importance scores of the top 15 genes selected by GBDT. I The ROC curve of the model with the top 15 genes on training, internal testing and external validating Segerstolpe dataset

A representative mature beta cell subset associated with insulin secretion and mitochondrial metabolism is reduced in T2D

Beta cell heterogeneity has been well-documented in both humans and mice, with distinct subpopulations identified based on functional, morphological, molecular, and maturation characteristics [11,12,13]. However, the relationship between beta-cell heterogeneity and beta-cell failure in T2D, as well as the specific roles of individual beta subpopulations in diabetes pathogenesis, remains poorly understood. To address this, we integrated single-cell transcriptomic data from Xin, Baron, and Segerstolpe with ND and T2D individuals. After quality control to remove low-quality cells, we constructed a joint embedding space consisting of 18 distinct clusters, irrespective of data source, donor origin, and disease status (Fig. S8A). Upon annotating the major pancreatic cell subpopulations (Fig. S8B–D), we extracted beta cells and performed Louvain clustering, identifying three distinct clusters that were present in both ND and T2D (Fig. 4A-B).

We conducted molecular profiling of the three distinct beta cell clusters (Fig. S8D-E), identifying three distinct beta cell subpopulations: cluster 0 representing mature beta cells, cluster 1 corresponding to immature or dedifferentiated beta cells, and cluster 2 characterized by elevated expression of endoplasmic reticulum (ER) stress-related genes. The detailed molecular signatures of each cluster are discussed in Supplementary Text S2 (Fig. 4C-E).

To obtain a global landscape of beta cell heterogeneity, we compared the proportion differences of beta cells between ND and T2D individuals. We identified a significant reduction in mature beta cells (cluster 0) and concomitant expansion of immature beta cells (cluster 1) in T2D pancreases, accompanied by an increased trend of ER stress-associated beta cell subpopulations, although this did not reach statistical significance (Fig. 4F). These observations suggest that the progression of diabetes may be associated with a gradual loss of beta cell function and the accumulation of immature beta cells, reflecting the decline in insulin secretion capacity and changes in beta cell responsiveness. Consistent with our findings, Hrovatin et al. [59] constructed a comprehensive single-cell atlas of mouse islets spanning developmental and diabetic states, and revealed that beta cells in T2D genetic models exhibit downregulation of genes associated with maturity and insulin secretion (e.g., Ucn3, G6pc2), alongside upregulation of ER stress activity and enrichment of immature-markers. This shift in gene expression programs aligns well with the heterogeneity we observed in the distribution of beta cell subpopulations in human T2D. Further analysis revealed that in T2D, subpopulations representing mature beta cells (cluster 0) and less mature cells (cluster 1) exhibited more differentially expressed genes compared to ND (Fig. 4G), suggesting that diabetes progression may amplify the molecular differences between these subpopulations, increasing transcriptional output and noise. Such heightened transcriptional output resembles the dysregulated gene expression observed in aging beta cells reported by Shrestha et al. [60], potentially explaining the recurrent identification of ER stress-linked beta cell subtypes. Regarding the potential mechanisms of beta cell loss, we hypothesize that cells initially activate compensatory responses, such as ER-associated degradation (ERAD), to alleviate ER stress by degrading misfolded proteins. If stress is resolved, cells restore homeostasis and resume normal insulin secretion. However, persistent ER stress that cannot be mitigated may eventually trigger apoptotic pathways, leading to beta cell loss (Fig. 4H).

Fig. 4
figure 4

scRNA-seq analysis reveals changes in beta cell heterogeneity promoted by T2D. A Unsupervised clustering of single-cell transcriptome visualized with UMAP analysis. Data represent pancreatic islet cells from ND (n = 6,284 cells pooled from 19 donors) or T2D (n = 3,083 cells pooled from 9 donors). B Projection of beta cells (n = 2,793) from both ND and T2D using UMAP analysis. C Normalized expression values for key beta cell maturation, immaturity, insulin secretion, endoplasmic reticulum (ER) stress and ER-associated degradation (ERAD) compared among different beta cell clusters. D Cell scores for selected gene sets, including beta cell maturity, immaturity, insulin secretion, beta cell development, beta cell proliferation and response to ER stress. The central error bars in the violin plot indicate the mean and standard deviation (mean ± sd) of the data distribution. E Enrichment of GO terms ordered by statistical significance in cluster 0 beta enriched genes. Dashed line represents p-value < 0.05. F Frequency of beta cluster cells in ND and T2D donors, with cluster 0 (ND n = 19, T2D n = 9), cluster 1 (ND n = 7, T2D n = 7), and cluster 2 (ND n = 13, T2D n = 5). Data shown are the mean ± s.e.m. *p value < 0.05, two-sided Wilcoxon rank-sum test. G Significant differential enriched genes in each beta cluster of ND and T2D individuals. H Summarized map of beta cell heterogeneity and loss. Tp: transcriptional. I RNA-velocity analysis of ND and T2D INS/GCG, alpha-/beta-cells. Black streamline arrows represent predicted direction of cell state change and trajectories. Larger red arrows represent overall velocity for each area of the UMAP

Molecular and functional heterogeneity of non-beta pancreatic cells in T2D

Beyond beta cells, significant molecular heterogeneity exists across pancreatic cell populations. Sub-clustering of acinar cells identified two functionally distinct subsets (Fig. 5A): cluster 0 exhibited high expression of digestive enzyme-related genes (e.g., CELA2A, CELA3A, CELA2B), while cluster 1 was enriched for immune and inflammatory markers, such as CD74, CCL2, CXCL17 (Fig. 5B), consistent with prior reports by Segerstolpe et al. [19]. Notably, beyond immune and inflammatory signatures, subset 1 acinar cells also displayed distinct metabolic dysregulation, characterized by oxidative stress, xenobiotic metabolism, and hormonal modulation (Fig. 5D). Functional characterization identified cluster 0 as the dominant mature acinar population (70.3% in healthy controls), primarily responsible for normal function of acinar cell. In contrast, cluster 1 adopted a multifunctional state involving inflammatory modulation, metabolic stress adaptation. Critically, the proportional expansion of cluster 1 in T2D individuals suggests its potential role in exacerbating pancreatic dysfunction through aberrant metabolic-immune crosstalk (Fig. 5C).

Sub-clustering analysis of ductal cells also identified two distinct subpopulations (Fig. 5E). Cluster 0 predominated in ND pancreases (70.5%), whereas subset 1 was markedly expanded in T2D individuals (Fig. 5F: 70.4%). Transcriptomic profiling revealed ductal 0 cell was enriched for microtubule-associated genes (including TUBA1C, TUBA1B, TUBB, TUBB4B etc.), involved in cytoskeletal organization and mitotic cell cycle regulation, while cluster 1 was enriched for genes associated with bile, gastric, pancreatic, and salivary secretion pathways (Fig. 5G and H). Consistent with our findings, Baron et al. previously reported ductal cell heterogeneity, identifying two major subpopulations characterized by high expression of MUC1 and CFTR, respectively [21]. In our study, both MUC1 and CFTR were significantly enriched in cluster 1 (Fig. 5G).

We conducted a subpopulation analysis of pancreatic alpha cells and identified three distinct alpha cell subtypes (Fig. 5I). The predominant cluster 0 (79.4% of alpha cells) exhibited marked enrichment of alpha cell identity genes (Fig. 5J: ARX, LOXL4, FEV), aligning with the electrophysiologically active mature alpha cell subtype reported by Camunas-Soler et al. [61]. In stark contrast, the minor cluster 2 (0.5%) showed pronounced cell cycle gene activation, including CDK1, MKI67, CENPF etc. Consistent with findings from Segerstolpe et al. [19], this rare alpha-cell subset may represent a highly proliferative population contributing to pancreatic tissue self-renewal. Additionally, cluster 1 (20.1%) demonstrated unique enrichment of mitochondrial-encoded genes, suggesting metabolic reprogramming may underlie its functional specialization (Fig. 5J). Functional enrichment analysis confirmed that cluster 0 was associated with biological functions related to mature alpha-cell activity, including hormone secretion/transport/response and signal release. Meanwhile, cluster 2 was predominantly enriched in pathways related to cell division, proliferation, and cell cycle regulation (Fig. 5K). Similar to the heterogeneity observed in beta-cell maturation states, we speculate that alpha cell subpopulations may maintain pancreatic homeostasis through a mature-proliferative duality, suggesting a conserved mechanism for pancreatic endocrine homeostasis.

Fig. 5
figure 5

Functional and molecular heterogeneity in non-beta pancreatic cells. A Projection of two acinar subpopulations from ND and T2D using UMAP analysis. B The differentially expressed genes between acinar clusters 0 and 1 indicate enrichment patterns, with upregulated genes representing those enriched in cluster 0 and downregulated genes corresponding to those enriched in cluster 1. Significance threshold: adjusted p-value < 0.05 and|logFC| >1. C Bar plot showing the proportion of each acinar 0/1 subpopulation in ND and T2D. Each condition is color-coded as indicated. D Enrichment of GO terms and associated genes significantly enriched in acinar clusters 0 and 1. E Projection of two ductal subpopulations from ND and T2D using UMAP analysis. F Bar plot showing the proportion of each ductal 0/1 subpopulation in ND and T2D. Each condition is color-coded as indicated. G Genes with significantly differential expression between subclusters of ductal subsets, with upregulated genes representing those enriched in cluster 0 and downregulated genes corresponding to those enriched in cluster 1. Significance threshold: adjusted p-value < 0.05 and|logFC| >1. H Pathway enrichment analysis of genes significantly enriched in ductal cell clusters 0 and 1. Dashed line represents p-value < 0.05. I Projection of three alpha subpopulations from ND and T2D using UMAP analysis. J Normalized expression values for key alpha cell identity/function, mitochondrial-associated, and cell proliferation-associated genes compared among different alpha cell clusters. K Comparison of GO enrichment analysis for genes enriched in alpha cell clusters 0 and 2. Dashed line represents p-value < 0.05



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *