Features and functional annotation of miRNAs with variants targeting RNA modifications
In this study, we collected 504 miRNAs with variants targeting RNA modifications (RMvar-related miRNAs) from the RMVar database (Supplementary Table 1). Among the eight modifications mentioned previously, m6A accounted for the highest proportion of these miRNAs (86.3%), followed by m5C (4.2%) (Fig. 2A). These findings are consistent with a previous study reporting that m6A is the most common RNA modification in eukaryotes [25]. As shown in Fig. 2B, most of the variants targeting RNA modifications were single nucleotide variants, followed by a small number of deletions targeting the m6A, m5C, and m7G modifications. The KEGG enrichment analysis indicated that these miRNAs were associated with multiple pathways, including the MAPK signaling pathway, miRNAs in cancer, viral carcinogenesis, transcriptional misregulation in cancer, hepatitis B, and signaling pathways regulating pluripotency of stem cells (Fig. 2C, Supplementary Table 2). GO enrichment analysis demonstrated that these miRNAs were associated with cell–cell signaling, positive regulation of NIK/NF-kappaB signaling, the endoplasmic reticulum unfolded protein response, histone acetyltransferase activity, and generation of precursor metabolites and energy (Fig. 2D, Supplementary Table 3). These results imply that the RMvar-related miRNAs have broad biological functions in cell regulation and are closely related to tumorigenesis.
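The statistical idea behind an enrichment call such as "MAPK signaling pathway is over-represented" can be sketched as follows. The source does not name the enrichment tool or the test it uses; this minimal Python sketch assumes the common one-sided hypergeometric test, and the function name and its arguments are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(universe: int, in_term: int, selected: int, overlap: int) -> float:
    """One-sided over-representation p-value, P(X >= overlap).

    universe: number of background genes
    in_term:  background genes annotated to the pathway or GO term
    selected: genes in the query list (e.g. predicted miRNA targets; an assumption)
    overlap:  query genes that are annotated to the term
    """
    if max(in_term, selected) > universe:
        raise ValueError("in_term and selected cannot exceed universe")
    upper = min(in_term, selected)
    if not 0 <= overlap <= upper:
        raise ValueError("overlap must lie between 0 and min(in_term, selected)")
    # Sum the hypergeometric probability mass from the observed overlap upward.
    tail = sum(
        comb(in_term, k) * comb(universe - in_term, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(universe, selected)
```

In practice the resulting p-values would still be adjusted for multiple testing before applying a cutoff such as the adjusted p < 0.05 used in Fig. 2C and D.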

Figure 2. Features and functional annotation of miRNAs with variants targeting RNA modifications. (A) Pie chart showing the proportions of the RNA modifications targeted by miRNA variants. (B) Variant types among the miRNA variants. DEL: deletion, INS: insertion, SNV: single nucleotide variant. (C) KEGG [26,27,28] enrichment analysis of miRNAs with variants targeting RNA modifications (adjusted p < 0.05). The font size of the items represents the degree of enrichment. (D) GO enrichment analysis of miRNAs with variants targeting RNA modifications (adjusted p < 0.05). The font size of the items represents the degree of enrichment.
Screening of representative serum miRNAs and the construction of the cancer prediction model
To investigate the clinical significance of the RMvar-related miRNAs, we first screened them for representative miRNAs. Figure 3A shows that 298 of the 504 RMvar-related miRNAs were differentially expressed in serum between patients with and without cancer, as screened using the “limma” package (|log2 fold change| > 1.5, adjusted p < 0.001) on the serum chip data. Notably, most of these miRNAs were highly expressed in cancers. LASSO analysis and RFE were then used to screen the 504 RMvar-related miRNAs for those with independent expression features in the training cohort.

Figure 3. Screening of representative serum miRNAs from 504 RMvar-related miRNAs for constructing the cancer prediction model. (A) Volcano plot depicting the differentially expressed serum RMvar-related miRNAs in patients with and without cancer (|log2 fold change| > 1.5, adjusted p < 0.001). (B, C) Least absolute shrinkage and selection operator (LASSO) regression performed to screen the candidate RMvar-related miRNAs based on the minimum criteria. (D) RFE algorithms in the tumor cohort. (E) Venn diagram depicting the 109 miRNAs screened using RFE (left circle) and the 310 miRNAs with independent expression features screened using LASSO analysis (right circle), as well as the 79 miRNAs in the intersection. (F) Pie chart showing the proportions of the RNA modifications targeted by miRNA variants in the 79 representative serum miRNAs. (G) Principal component analysis (PCA) of 79 representative serum miRNAs on serum chip data from patients with and without cancer. (H) Unsupervised clustering analysis of 79 representative serum miRNAs on serum chip data from patients with and without cancer.
LASSO analysis and RFE screened out 310 and 109 RMvar-related miRNAs, respectively (Fig. 3B–D, Supplementary Tables 4 and 5), and the 79 RMvar-related miRNAs in the intersection of these two sets served as representative miRNAs for constructing the cancer prediction model (Fig. 3E, Supplementary Table 6). As shown in Fig. 3F, five types of RNA modifications (m6A, m5C, m1A, m7G, and A-to-I) were targeted by the variants of the 79 representative miRNAs. The m6A modification accounted for the largest proportion (79.7%), consistent with the results above. The principal component analysis (PCA) based on the 79 representative miRNAs showed a clear separation between patients with and without cancer, supporting the suitability of the 79 miRNAs for constructing the cancer prediction model (Fig. 3G). Moreover, the unsupervised clustering analysis based on the serum chip data demonstrated that these 79 representative miRNAs were differentially expressed in patients with and without cancer, consistent with the PCA results (Fig. 3H).
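The two filtering ideas used in this screening step, a fold-change and adjusted p-value cutoff followed by an intersection of two feature sets, can be sketched as follows. This is an illustration only: it does not reproduce limma's moderated statistics, it assumes Benjamini-Hochberg adjustment (the source does not state the adjustment method), and the function names and miRNA labels are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_differential(names, log2_fc, pvalues, fc_cut=1.5, p_cut=0.001):
    """Keep features with |log2 fold change| > fc_cut and adjusted p < p_cut."""
    adjusted = benjamini_hochberg(pvalues)
    return {
        name
        for name, fc, q in zip(names, log2_fc, adjusted)
        if abs(fc) > fc_cut and q < p_cut
    }


# Hypothetical feature sets standing in for the LASSO and RFE selections.
lasso_selected = {"miR-a", "miR-b", "miR-c"}
rfe_selected = {"miR-b", "miR-c", "miR-d"}
representative = lasso_selected & rfe_selected  # features chosen by both methods
```

Taking the intersection keeps only features that both selection methods agree on, which trades some recall for a more conservative candidate list.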
Based on the 79 candidate RMvar-related miRNAs, we used machine learning algorithms to construct a diagnostic signature, the RMvar-related miRNA signature, for cancer detection. First, serum samples with miRNA expression data from patients with and without cancer (profiled on the GPL21263 platform; cancer = 8,187, non-cancer control = 13,846) were randomly split at a ratio of 1:1 into a training cohort and a test cohort. Nine common machine learning algorithms, namely MARS, RF, NNET, avNNET, SVM with the radial basis function kernel, SGBT, XGBoost, NB, and KNN, were then used independently to construct diagnostic signatures based on the training cohort. The hyperparameters for each diagnostic signature were selected according to the best ROC curve (Supplementary Fig. 1). Five machine learning-related indices, including residual, cumulative gains, lift chart, precision–recall curve, and ROC, were calculated to compare the predictive power of the nine diagnostic signatures. As shown in Fig. 4A–F, the diagnostic signatures constructed using SGBT and SVM exhibited the best performance in these indices among the nine diagnostic signatures. The AUC, sensitivity, and specificity of the diagnostic signatures constructed using SGBT and SVM were also the highest among the nine diagnostic signatures (Fig. 4G, H). Therefore, we constructed the RMvar-related miRNA signature by combining the SGBT and SVM algorithms using the “caretStack” function of the “caretEnsemble” package, with 100 iterations of bootstrap sampling. The relative influences of the SGBT and SVM algorithms in constructing the RMvar-related miRNA signature were 63.38% and 36.62%, respectively (Fig. 4I).
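The general idea of stacking, in which a meta-learner is trained on the predictions of base models, can be sketched as follows. This is not the authors' pipeline: the source does not state which meta-learner was used, so the sketch assumes a plain logistic regression fitted by gradient descent, omits the bootstrap resampling, and uses made-up base-model probabilities. All names are hypothetical, and the fitted weights are not the relative influences reported in Fig. 4I.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_stacker(base_probs, labels, lr=0.5, epochs=2000):
    """Fit a logistic meta-learner on base-model probabilities.

    base_probs: one row per sample, one predicted probability per base model
    labels:     1 for cancer, 0 for non-cancer control
    """
    n, k = len(labels), len(base_probs[0])
    weights, bias = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for row, y in zip(base_probs, labels):
            err = sigmoid(bias + sum(w * x for w, x in zip(weights, row))) - y
            grad_b += err
            for j, x in enumerate(row):
                grad_w[j] += err * x
        # Batch gradient-descent update on the mean log-loss gradient.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def stacked_probability(weights, bias, row):
    """Combine one sample's base-model probabilities into a single score."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))


# Made-up probabilities from two base models (columns) for six samples.
toy_probs = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.6], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
toy_labels = [1, 1, 1, 0, 0, 0]
weights, bias = fit_stacker(toy_probs, toy_labels)
```

In a real workflow the base-model probabilities fed to the meta-learner should come from held-out or resampled predictions, otherwise the stacker is trained on overfitted inputs.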

Figure 4. Construction of the RMvar-related miRNA signature using machine learning algorithms. (A–F) Performance of signatures constructed using nine machine learning algorithms on related indices, including residual, cumulative gains, lift chart, precision–recall curve, and ROC. (G, H) AUC, specificity, and sensitivity of miRNA signatures constructed using nine machine learning algorithms in distinguishing cancer and non-cancer control samples. (I) Influence of the machine learning algorithms (SGBT and SVM) in constructing the RMvar-related miRNA signature. (J, K) Differences in the output strength of the RMvar-related miRNA signature between cancer and non-cancer control samples, and among different cancer types. (L) ROC curve showing the diagnostic performance of the RMvar-related miRNA signature in distinguishing cancer from non-cancer controls in the training cohort. The AUC, specificity, sensitivity, and accuracy were also calculated. (M, N) The diagnostic performance of the RMvar-related miRNA signature as validated in the test cohort and combined cohort using the ROC curve. (O, P) The diagnostic performance of the RMvar-related miRNA signature as validated in the external validation cohorts using the ROC curve. (Q, R) The miRNA signature was reconstructed and validated in the combined cohort from the GPL18941 platform. (S) AUC of each of the 79 representative miRNAs in distinguishing cancer and non-cancer control samples. (T) The difference in net benefit between the RMvar-related miRNA signature and the individual representative miRNAs in the decision curve analysis (DCA) within a wide range of decision threshold probabilities.
The output strengths of these signatures in the cancer groups were significantly lower than those in the non-cancer controls (Fig. 4J). We next investigated the difference in RMvar-related miRNA signature values among cancer types. As shown in Fig. 4K, in the cancer group, patients with lung cancer had the highest median signature value (1.00) and patients with breast cancer had the lowest (0.64); a significant difference in signature values was observed between the non-cancer controls and each cancer type (p < 0.001). The RMvar-related miRNA signature showed high diagnostic power in distinguishing cancer samples from non-cancer controls in the training cohort (Fig. 4L). We then applied the signature to the test cohort. As in the training cohort, the signature showed high diagnostic performance, with an AUC of 0.996 (95% CI 0.995–0.997), a specificity of 97.7%, a sensitivity of 96.0%, and a diagnostic accuracy of 96.7% (Fig. 4M). We also examined the RMvar-related miRNA signature in the combined training and test cohort, where the AUC, specificity, sensitivity, and accuracy again demonstrated a satisfactory diagnostic value (Fig. 4N). To examine the predictive power of the miRNA signature on independent data, we applied it to two external validation datasets, GSE211692 and GSE73002. In GSE211692 (from the same platform as the training cohort), the miRNA signature showed excellent performance in distinguishing cancer samples from non-cancer cases, with an AUC of 0.998 (95% CI 0.998–0.998), specificity of 93.1%, sensitivity of 99.3%, and diagnostic accuracy of 96.2% (Fig. 4O). In the other external validation dataset (GSE73002), from the GPL18941 platform, the miRNA signature showed slightly lower diagnostic performance, with an AUC of 0.948, specificity of 97.2%, sensitivity of 76.0%, and accuracy of 86.6% (Fig. 4P).
All cancer cases in the GSE73002 miRNA expression data were breast cancer, the cancer type whose signature values overlapped most with those of non-cancer samples (Fig. 4K). Moreover, differences in detection technology, detection schemes, and data processing methods across platforms can affect the presentation of data. These factors may have affected the diagnostic power of the RMvar-related miRNA signature in GSE73002. To address this, we collected three additional datasets from platform GPL18941, namely GSE59856, GSE85679, and GSE124158, covering liver cancer, pancreatic cancer, cholangiocarcinoma, and malignant bone and soft tissue tumor, and combined them with GSE73002 to generate new external validation data (named the GPL18941 cohort). We then randomly split this cohort at a ratio of 3:1 into training and test cohorts and constructed a diagnostic signature in the training cohort based on the 79 representative miRNAs and the same signature construction method. The resulting miRNA signature performed well in distinguishing cancer samples from non-cancer controls within both the training cohort (Fig. 4Q) and the test cohort (Fig. 4R) of GPL18941, with the AUC, specificity, and sensitivity higher than 0.989, 98.2%, and 96.9%, respectively. These results highlight that our optimized machine-learning workflow was effective in constructing a cancer detection tool.
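For readers who want to recompute the reported metrics from signature values, the definitions of AUC, sensitivity, specificity, and accuracy can be sketched as follows. This is a minimal illustration with hypothetical function names; the classification cutoff of 0.5 is an assumption, since the source does not state the cutoff applied to the signature value.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a cancer sample outscores a control.

    Ties count as half a win (Mann-Whitney formulation). The pairwise loop
    is quadratic, which is fine for a sketch but slow for large cohorts.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def confusion_metrics(scores, labels, cutoff=0.5):
    """Sensitivity, specificity, and accuracy when score >= cutoff means cancer.

    Assumes both classes are present, as roc_auc does.
    """
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(labels),
    }
```

Because AUC is threshold-free while sensitivity and specificity depend on the cutoff, a dataset such as GSE73002 can show a high AUC together with a markedly lower sensitivity at a fixed cutoff.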
The diagnostic power of each representative miRNA was calculated for the combined training and test cohorts. Among the individual miRNAs, hsa-miR-320a had the highest diagnostic value, with an AUC of 0.8503, specificity of 80.63%, sensitivity of 79.04%, and accuracy of 79.63%, which was significantly lower than that of the RMvar-related miRNA signature (Fig. 4S). In the decision curve analyses, the RMvar-related miRNA signature demonstrated superior net benefit within a wide range of decision-making threshold probabilities compared with all the representative miRNAs (Fig. 4T).
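The net benefit plotted in a decision curve analysis follows the standard definition NB = TP/n − (FP/n) × pt/(1 − pt), where pt is the decision threshold probability. A minimal sketch, with hypothetical function names and assuming the model output is a probability-like score compared directly with pt:

```python
def net_benefit(scores, labels, threshold):
    """Net benefit of calling score >= threshold positive at threshold probability pt."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Reference strategy that classifies every sample as positive."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)
```

Evaluating `net_benefit` over a grid of thresholds for each model, alongside the treat-all and treat-none (net benefit 0) references, yields curves of the kind shown in a DCA plot.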
Diagnostic performance of RMvar-related miRNA signature within different conditions and cancer types
Our analyses revealed the potent diagnostic value of the RMvar-related miRNA signature in cancer detection. We next investigated its diagnostic power under different conditions and in specific cancer types. First, we tested the diagnostic performance of the RMvar-related miRNA signature by patient sex. Our results showed no significant difference in the output strength of the signature between female (median signature value = 0.99486) and male patients (median signature value = 0.99473) (Fig. 5A), and the signature achieved a high AUC in both groups, in line with the results above (Fig. 5B, C). The correlation analysis showed no significant correlation between patient age and the output strength of the signature (cor = −0.12, Fig. 5D). Next, we investigated the ability of the RMvar-related miRNA signature to detect individual cancer types by combining each cancer type individually with non-cancer control samples. The signature demonstrated superior discrimination ability (Fig. 5E, blue polyline, Supplementary Table 7). Although its performance was slightly lower in distinguishing each cancer type from the mixed samples of all cancers and non-cancer controls, it exhibited a remarkably high sensitivity (Fig. 5E, yellow polyline, Supplementary Table 8). The sensitivity of the RMvar-related miRNA signature exceeded 87.6% for every cancer type except breast cancer (64.8%), and the rate of missed diagnosis was low.

Figure 5. Diagnostic performance of the RMvar-related miRNA signature within different conditions and cancer types. (A) Differences in the output strength of the RMvar-related miRNA signature between samples from female and male patients. (B, C) ROC curves showing the diagnostic performance of the RMvar-related miRNA signature in male and female groups. (D) Correlation analysis between age and output strength of the RMvar-related miRNA signature in patients with cancer. (E) Radar chart summarizing the AUC of the RMvar-related miRNA signature for each cancer type. The blue polyline represents the AUC value for distinguishing each cancer type from non-cancer controls. The yellow polyline represents the AUC value for distinguishing each cancer type from all mixed cancer and non-cancer samples. (F–K) ROC curves showing the diagnostic performance of the RMvar-related miRNA signature in cohorts corresponding to different conditions. (L) ROC curve showing the diagnostic performance of AFP in distinguishing patients with HCC from patients with chronic hepatitis/liver cirrhosis. (M) Density of RMvar-related miRNA signature output strength in HCC samples and hepatitis/liver cirrhosis cases.
Next, we investigated the influence of tumor stage and benign diseases on the diagnostic power of the RMvar-related miRNA signature. In both advanced and early stages, cancer samples were distinguished accurately from non-cancer controls using the RMvar-related miRNA signature, with AUC values of 0.995 and 0.997, respectively (Fig. 5F, G). To assess the influence of benign diseases on the diagnostic power of the signature, we constructed one cohort that excluded benign diseases and another that included only cancers and benign diseases. As shown in Fig. 5H and I, the RMvar-related miRNA signature demonstrated a high diagnostic power in distinguishing cancer samples from non-cancer controls (excluding benign diseases) and from benign diseases, with AUC values of 0.998 and 0.961, respectively, indicating the potent ability of this signature to distinguish cancers from benign diseases. To confirm this result, we applied the signature to two cohorts involving malignant bone and soft tissue tumors or HCC and their relevant benign diseases. The RMvar-related miRNA signature exhibited a high diagnostic power in distinguishing malignant from benign bone and soft tissue tumors, with an AUC of 0.893 (95% CI 0.871–0.916), specificity of 71.4%, sensitivity of 87.8%, and diagnostic accuracy of 79.6% (Fig. 5J). The signature demonstrated superior performance in distinguishing HCC from hepatitis and liver cirrhosis, with an AUC of 1.000 (95% CI 0.999–1.000), specificity of 99.3%, sensitivity of 99.4%, and diagnostic accuracy of 99.3% (Fig. 5K). These values were superior to those of the traditional biomarker alpha-fetoprotein (AFP; AUC = 0.684, specificity = 77.5%, and sensitivity = 50.0%, with a cutpoint of 25 µg/L) (Fig. 5L). The output strength of the RMvar-related miRNA signature in patients with HCC rarely overlapped with the value range of patients with chronic hepatitis/liver cirrhosis (Fig. 5M).
Our results indicate that the RMvar-related miRNA signature can accurately distinguish cancers, regardless of stage, without a significant interference from related benign diseases.
