Identifying common differential genes
PCA analysis was performed on the expression matrices of GSE33570 (Fig. 2a) and GSE29315 (Fig. 2d), and a clear two-sided distribution of samples was observed in both disease and control groups. A total of 1572 distinct genes were detected as differentially expressed in the analysis of the GSE35570 dataset. These DEGs were categorized into 824 upregulated and 748 downregulated genes (Fig. 2b). Similarly, 423 DEGs were observed in the GSE29315 dataset, including 271 upregulated and 152 downregulated DEGs (Fig. 2e). We then display the GEGs in the two datasets in a heatmap of both datasets (Fig. 2c, f). Furthermore, we used a Venn diagram to identify overlapping genes with the same directional trend, and found that 64 genes were upregulated (Fig. 2g) and 37 genes were downregulated (Fig. 2h).
Functional analysis of common differential genes
To gain a better understanding of the underlying biological functions associated with the 101 DEGs, GO and KEGG enrichment assessments were performed using the “clusterProfiler” software package in R. GO analysis revealed that these shared genes were mainly enriched in leukocyte-mediated immunity, myeloid leukocyte activation, and antigen processing and presentation (Figure 2j). Furthermore, the DEGs showed significant enrichment across the top five KEGG pathways, including tuberculosis, phagosome, viral myocarditis, inflammatory bowel disease, and Th1 and Th2 cell differentiation (Figure 2i). Clearly, the functions of the differentially expressed genes are closely related to the body's immune function. The core genes mainly serve the purpose of activating immune cells.
Building a PPI network and testing modules
To perform the PPI analysis, we used the STRING online tool and visualized the results using Cytoscape software (Supplementary Figure S1a). The PPI network displayed 68 nodes and 498 edges. The DC value of each node was calculated, with a median value of 11. Based on this, we identified 17 hub genes in the PPI network: TYROBP, ITGB2, STAT1, HLA-DRA, C1QB, MMP9, FCER1G, IL10RA, LCP2, LY86, CD53, CD14, CD163, HCK, MNDA, HLA-DPA1, and ALOX5AP. We then used the MCODE plugin to identify 6 modules containing a total of 29 common DEGs (Supplementary Figure S1b, c).These DEGs are LCP2, TYROBP, CD53, LY86, ITGB2, FCER1G, MNDA, C1QB, HCK, IL10RA, HLA-DRA, ALOX5AP, MT1G, MT1F, MT1E, MT1X, ISG15, IFIT3, PSMB9, GBP2, CD14, CD163, VSIG4, CAV1, TIMP1, S100A4, SDC2, FGFR2, and STAT1. The most significant module consisted of 12 genes (LCP2, TYROBP, CD53, LY86, ITGB2, FCER1G, MNDA, C1QB, HCK, IL10RA, HLA-DRA, ALOX5AP) and was further analyzed using the ClueGO plugin of Cytoscape software. Our investigation revealed that these genes mainly functioned to activate neutrophils to participate in immune responses and activate innate immunity (Supplementary Fig. S1d).
Diagnostic value of shared hub genes
In this study, we analyzed a total of 26 genes from six modules extracted from MCODE. To determine the importance of each gene, we used the RF algorithm in two datasets, namely GSE35570 (Fig. 3a) and GSE29315 (Fig. 3b). By comparing the gene importance rankings in both datasets, we identified the top eight genes that consistently ranked highly. To visualize this overlap, we constructed a Venn diagram (Fig. 3c). This revealed three genes (CD53, FCER1G, and TYROBP) that were shared between the two datasets. Notably, these three genes overlap with the hub genes identified through PPI analysis based on DC values, as well as genes found in the most significant modules. These three genes showed promising diagnostic potential for HT and PTC. To evaluate the diagnostic value of the common hub genes, we calculated the cutoff values, sensitivity, specificity, AUC, and 95% CI for each gene in the four datasets (Table 1). In the GSE35570 dataset (Fig. 3d), the AUC values were as follows: CD53 (AUC 0.71, 95% CI 0.61–0.82), FCER1G (AUC 0.81, 95% CI 0.73–0.89), and TYROBP (AUC 0.79, 95% CI 0.71–0.88). In the GSE29315 dataset (Fig. 3e), the AUC values were as follows: CD53 (AUC 1.00, 95% CI 1.00–1.00), FCER1G (AUC 1.00, 95% CI 1.00–1.00), and TYROBP (AUC 1.00, 95% CI 1.00–1.00). In the TCGA dataset (Fig. 3f), we validated the diagnostic value of the common hub genes for PTC. The AUC values were as follows: CD53 (AUC 0.71, 95% CI 0.61–0.82), FCER1G (AUC 0.74, 95% CI 0.64–0.89), TYROBP (AUC 0.80, 95% CI 0.70–0.89). To further evaluate the diagnostic value of the common hub genes for PTC in HT, we calculated the AUC and 95% CI for each gene using GSE1398198. In the GSE138198 dataset (Fig. 3g), the AUC values were as follows: CD53 (AUC 0.83, 95% CI 0.57–1.00), FCER1G (AUC 0.92, 95% CI 0.72–1.00), TYROBP (AUC 1.00, 95% CI 1.00–1.00). We also analyzed the difference box plots between the two groups of the four datasets (Supplementary Figure S2). Analysis using box plots revealed notable differences in gene expression between the HT and control groups in GSE29315. This difference explains the observed AUC values of 1 for all three hub genes in GSE29315.
Diagnostic model construction and feature importance analysis
Using the GSE35570 dataset, we developed three diagnostic models specifically for PTC that incorporated these significant genes identified by our analysis. The ANN model (Fig. 4a) had four hidden units, a penalty of 0.0108, and was trained for 537 epochs. The ANN model achieved an AUC of 0.94 (95% CI 0.91–0.98) on the training set, and on the test set, the AUC was 0.94 (95% CI 0.83–1.00) (Fig. 4b). The XGBoost model had 8 mtry, 6 min_n, 3 max_depth, 0.001 learn_rate, 0.07 loss_reduction, and 0.97 sample_size. The XGBoost model achieved an AUC of 0.84 (95% CI 0.75–0.93) on the training set, whereas the AUC was 0.62 (95% CI 0.42–0.83) on the test set (Supplementary Figure S3a). The DT model had a cost_complexity of 0.0003, tree_depth of 5, and min_n of 6. The DT model achieved an AUC of 0.93 (95% CI 0.90–0.97) on the training set, whereas the AUC was 0.83 (95% CI 0.65–1.00) on the test set (Supplementary Figure S3b). Supplementary Table S1 shows the prediction performance of the three machine learning models. The results showed that the ANN model outperformed the other models, and we selected the ANN model for further analysis. The TCGA dataset was used as an external validation dataset to evaluate the diagnostic performance of the ANN model for PTC, and an AUC value of 0.77 (95% CI 0.66–0.87) was obtained (Fig. 4c). The GSE138198 dataset was used to evaluate the diagnostic effectiveness of the ANN model for PTC in HT. In the GSE138198 dataset (Fig. 4d), the ANN model showed a perfect AUC of 1.00 (95% CI 1.00–1.00). To help clinicians better understand the contribution of variables, the SHAP algorithm was used to interpret the ANN prediction results. Figure 4e, f, and g show how the imputed importance of features changes as their values change. Our findings show that CD53 had the greatest impact on the output of the ANN model. It was initially positively correlated with the risk of PTC, but changed to a negative correlation after a turning point of approximately 6 years. TYROBP and FCER1G showed positive correlations with the occurrence of PTC.
Verification of common DEGs
We analyzed the protein expression of hub genes based on the HPA database (Supplementary Fig. S4). CD53 was highly expressed in both tumor and normal tissues, whereas FCER1G and TYROBP showed higher expression in tumor compared to normal tissues. Furthermore, IF staining was performed to measure the expression of CD53, FCER1G, and TYROBP in clinical samples including 10 HT-associated PTC tissues and 6 NAT. By performing IF analysis (Fig. 5), semi-quantitative results were obtained showing that the fluorescent signal intensity of CD53, FCER1G, and TYROBP was significantly elevated in the HT-associated PTC group compared to the HT-associated PTC group (P< 0.05).
Analysis and validation of immune infiltration
Considering the important role of immune and inflammatory responses in the development of HT and PTC, we used the CIBERSORT algorithm to analyze the differences in immune cell infiltration patterns between PTC, HT, and normal samples. Utilizing the GSE35570 dataset, we identified 12 immune subgroups that showed significant variation between PTC and normal samples (Supplementary Fig. S5a). Furthermore, analysis of the GSE29315 dataset revealed five immune subgroups that were significantly different between HT and normal samples (Supplementary Fig. S5b). Of these, four common immune subpopulations were found to be significantly higher in both PTC and HT samples compared with normal samples. These subpopulations included T cell CD8, T cell CD4 memory resting, macrophage M1, and mast cell resting. Furthermore, we performed Spearman correlation analysis between hub genes and immune cells (Supplementary Fig. S5c, d). The results suggested that immune responses may potentially contribute to the involvement of hub genes in PTC and HT progression. IF staining was used to identify immune cell infiltration in 5 PTC and 5 NAT cases of HT tissues (Fig. 6). The expression levels of CD4+ T cell marker Cd4, CD8+ T cell marker Cd8, and macrophage marker Cd86 were found to be significantly higher in PTC in the HT group compared with the NAT group. The results of IF staining verified the accuracy of the immune infiltration analysis results to some extent.
Potential treatments for HT and PTC
Based on the three core genes screened by the RF algorithm, we searched for related potential drugs in the DGIdb database. As a result, only FCER1G had related drugs, but no related drugs were found for CD53 and TYROBP. FCER1G was predicted to have two potential drugs: benzylpenicilloylpolylysine and aspirin. Of these, benzylpenicilloylpolylysine had the highest score of 29.49, while aspirin had a score of only 1.26. We hypothesize that benzylpenicilloylpolylysine and aspirin may be effective in treating HT and PTC and may prevent HT carcinogenesis.