Cancer detection via one-shot learning: integrating gene expression and genomic mutation analysis

In this section, we provide details about the experiments carried out to assess the effectiveness of the proposed approach. We start with a description of the data used for experimentation, including the acquisition and preprocessing steps employed to prepare the data for the analysis. We then describe the experiments carried out to assess the effectiveness of the proposed SNN S. Finally, we describe the insights and the cancer-related knowledge gained by explaining the behavior of the trained S model using the SHAP-based technique described in Sect. 2.

Data acquisition and preprocessing

The TCGA program has extensively characterized over 20,000 primary cancer and matched samples across 33 different cancer types. For each analyzed patient (sample), a comprehensive set of genomic, omics, and clinical data is available in separate files. To build the dataset used for our experiments, after downloading the whole dataset, we retained only the samples for which both genomic mutations and gene expression profiles were available. With the term genomic mutations we generically refer to: (i) single nucleotide polymorphisms (SNPs), insertions (INS), and deletions (DEL) occurring on the genes of a given subject; for each patient, this information was summarized by counting the frequency of each type of mutation occurring in each gene; (ii) copy number aberration (CNA) values, which were normalized so that each gene has a copy number value in {0, 0.25, 0.5, 0.75, 1}. The final dataset included 8788 patients across 24 distinct cancer types, with data available for 23665 genes per patient. See Table 2 for a detailed overview of the 24 selected cancer types and the corresponding number of patients for each.
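As a minimal sketch of the two summarization steps above, the following Python snippet counts mutations per gene and type for one patient and rescales copy-number levels to the unit interval. The record format, the function names, the placeholder gene labels, and the assumption that raw copy-number levels are the five integers from -2 to 2 are all illustrative choices, not details taken from our pipeline.

```python
from collections import Counter

MUTATION_TYPES = ("SNP", "INS", "DEL")


def count_mutations(records):
    """Count mutations per (gene, type) for a single patient.

    `records` is an iterable of (gene, mutation_type) tuples
    (a hypothetical input format used only for this sketch).
    """
    counts = Counter()
    for gene, mut_type in records:
        if mut_type in MUTATION_TYPES:
            counts[(gene, mut_type)] += 1
    return counts


def rescale_cna(value, low=-2, high=2):
    """Min-max rescale a discrete copy-number level to [0, 1].

    Assumes five integer levels from `low` to `high` (an assumption).
    """
    return (value - low) / (high - low)


# Toy patient with placeholder gene labels
patient = [("GENE_A", "SNP"), ("GENE_A", "SNP"), ("GENE_B", "INS")]
counts = count_mutations(patient)
cna_levels = [rescale_cna(v) for v in (-2, -1, 0, 1, 2)]
print(counts[("GENE_A", "SNP")], cna_levels)
```

Under the stated assumption, the five rescaled levels coincide with the set {0, 0.25, 0.5, 0.75, 1} mentioned above.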

Table 2 Selected cancer types and number of patient samples

The dataset described in Table 2 presents a classic example of the “curse of dimensionality”, characterized by high-dimensional data with a relatively low number of samples. To address this, we reduced the number of genes considered by filtering on gene expression levels, following an approach similar to the one proposed in [24]. First, we applied min-max normalization to the gene expression across all the genes in our dataset. Then, for each gene, we calculated the standard deviation of its expression across all cancer types, and we retained only those genes with a standard deviation above a specified threshold. As we can see in Fig. 2, approximately 90% of the genes have a standard deviation less than 0.0005; we therefore selected only the genes with a standard deviation above this value, for a total of 1349 selected genes. We then considered two different datasets for our experiments: (i) GE (Gene expression), obtained by considering only the gene expression of each of the 1349 genes (1349 features), and (ii) GEM (Gene expression + Mutations), obtained by also considering, for each of the 1349 genes, four additional values describing INS, DEL, SNP, and CNAs, for an initial total of $1349 \times 5=6745$ features. To reduce the sparsity of this last dataset, we removed all the columns related to genomic mutations that had a value of 0 across all samples, eliminating 1026 columns; the final number of features in GEM is therefore $6745-1026=5719$. Table 3 summarizes the number of features considered for each sample in the two datasets.
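The filtering steps above can be sketched as follows. This is one possible reading, assuming a global min-max normalization over the whole expression matrix and a population standard deviation computed per gene across samples; the function names and the toy matrix are hypothetical, and the toy threshold is much larger than the 0.0005 used on the real data.

```python
from statistics import pstdev


def global_min_max(matrix):
    """Rescale every entry with the global min and max of the matrix."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on constant data
    return [[(v - lo) / span for v in row] for row in matrix]


def select_genes(matrix, gene_names, threshold):
    """Keep genes whose normalized expression std exceeds `threshold`."""
    columns = zip(*global_min_max(matrix))
    return [g for g, col in zip(gene_names, columns) if pstdev(col) > threshold]


def nonzero_columns(matrix):
    """Indices of columns holding at least one non-zero value."""
    return [j for j, col in enumerate(zip(*matrix)) if any(v != 0 for v in col)]


# Toy expression matrix: rows are samples, columns are genes
expr = [[0.0, 5.0, 10.0],
        [0.0, 5.1, 0.0],
        [0.0, 4.9, 5.0]]
kept = select_genes(expr, ["g1", "g2", "g3"], threshold=0.05)

# Feature-count bookkeeping reported in the text
n_genes = 1349
gem_before = n_genes * 5          # expression + INS + DEL + SNP + CNA
gem_after = gem_before - 1026     # all-zero mutation columns removed
print(kept, gem_before, gem_after)
```

In the toy matrix only the third gene varies enough to pass the threshold, while the first column is all zeros and would also be dropped by `nonzero_columns`.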

Table 3 Gene expression (GE) and gene expression + mutations (GEM) datasets, produced for the experiments, each one with a total of 8,788 samples (patients)

Experiments

This work builds upon and extends CancerSiamese [24], addressing its limitations in leveraging SNNs to identify type-agnostic marker genes and effectively utilize their combined information to define the similarity between samples from distinct TMEs.

Specifically, we have organized the experiments into three main parts:

E1 As a baseline, the SNN model of CancerSiamese, denoted with C, is trained on the GE dataset to detect the 24 types of cancer detailed in Table 2.
E2 C is trained on the GEM dataset to detect the 24 types of cancer in Table 2.
E3 The proposed SNN described in Fig. 1, and denoted with S, is trained on the GEM dataset to detect the 24 types of cancer described in Table 2.

For each experiment, both SNNs, S and C, were trained using the Keras DL platform with the TensorFlow backend. As explained in Sect. 2, transfer learning was exploited for the training of both S and C, i.e., the initial weights of the 1D-CNN feature extractors were set to those of the 1D-CNN for classification of cancer types pretrained on the same training set. The weights of the remaining layers (i.e., Fully Connected and sigmoid) were initialized by Xavier initialization, as also proposed by [24]. Each SNN was optimized with a binary cross-entropy loss and trained for 20,000 iterations, where each iteration uses a batch of 128 pairs with an equal number of matched and mismatched pairs, all chosen randomly from the corresponding training dataset. The parameters of the networks were optimized with the Adam optimizer, and all the hyperparameters were tuned manually. The core utility of an SNN in a one-shot learning setting lies in its ability to generalize to unseen classes after training. Therefore, a more appropriate evaluation framework is based on N-way one-shot tasks. Each task involves:

A query sample from a given class $C_q$
A support set $S = \{x_1, x_2, \dots , x_N\}$, containing one sample from $C_q$ and $N - 1$ samples from distinct, unrelated classes
The model computes similarity scores between the query and each support element:

$$ \hat{y}_j = \text {sim}(x_q, x_j), \quad \forall j \in \{1, \dots , N\} $$
The predicted class is associated with the support having the maximum similarity.

The prediction is correct if the support sample with the highest similarity score is from the same class as the query. N-way accuracy is computed over k repetitions:

$$ \text{Accuracy}_{\text{1-shot}} = \frac{1}{k} \sum_{i=1}^{k} \mathbb{1} \left[ \arg\max_j \hat{y}^{(i)}_j = j^{(i)}_{\text{true}} \right] $$

where $\mathbb{1}$ is the indicator function, which equals 1 if its argument is true, and 0 otherwise.
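The N-way one-shot protocol defined above can be sketched in a few lines of Python. The similarity function here is a toy stand-in for the trained SNN, and the function name, the data layout (a dictionary from class label to samples), and the scalar toy samples are hypothetical choices made only for illustration.

```python
import random


def n_way_accuracy(similarity, samples_by_class, n_way, trials, rng):
    """Fraction of N-way one-shot tasks in which the best-scoring
    support sample belongs to the query's class."""
    classes = list(samples_by_class)
    correct = 0
    for _ in range(trials):
        # Draw N distinct classes; the first one is the query's class
        chosen = rng.sample(classes, n_way)
        query_class = chosen[0]
        # Query and matching support sample are two different items
        query, match = rng.sample(samples_by_class[query_class], 2)
        support = [(query_class, match)] + [
            (c, rng.choice(samples_by_class[c])) for c in chosen[1:]
        ]
        rng.shuffle(support)
        scores = [similarity(query, x) for _, x in support]
        best = max(range(n_way), key=scores.__getitem__)
        correct += support[best][0] == query_class
    return correct / trials


# Toy data: well-separated scalar "samples" and a distance-based similarity
toy = {"a": [0.0, 0.1], "b": [5.0, 5.1], "c": [10.0, 10.1]}
accuracy = n_way_accuracy(lambda q, x: -abs(q - x), toy,
                          n_way=3, trials=50, rng=random.Random(0))
print(accuracy)
```

With a real model, `similarity` would wrap the SNN forward pass on a (query, support) pair, and `n_way` would be set to the number of cancer classes.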

In our specific case, for each experiment, the trained SNN networks were tested for N-way predictions with $N=24$, which is the number of cancer classes considered. For an N-way prediction, each SNN compared a query sample with a support set of N samples, each from a different cancer type (N-way one-shot learning). The query sample cancer type was predicted as the support sample type whose corresponding pair received the highest probability from the SNN among the N pairs. The prediction was counted as correct if the predicted type was the same as the true type of the query sample. For each experiment, we tested the SNN on 20000 randomly selected query samples and the corresponding support set from the test dataset, where each support set contained N randomly selected samples, each from a different cancer type but one of them coming from the same cancer type as the query sample. The TCGA datasets used to construct the primary dataset in this study can be accessed via the cBioPortal for Cancer Genomics platform.^{Footnote 3} The datasets produced and used for the experiments,^{Footnote 4} and the developed source code are available online.^{Footnote 5}

As for the experiment E1, when we use the baseline SNN C on the GE dataset, we obtained a higher accuracy (91%) than the highest one obtained in [24] (89.67%) for the same task on the same type of data. Such a discrepancy can be explained by having slightly different data. Indeed, our dataset included a greater number of cancer types (24 compared to 19 considered in [24]). Furthermore, we evaluated the accuracy of C using a 24-way classification (as opposed to the 10-way approach used in [24]). Increasing the value of N in the N-way classification inherently raises the likelihood of misclassification, as it expands the number of support samples representing the classes with which the query sample is compared. Therefore, achieving superior results with a higher N value not only highlights the robustness of our approach but also underscores the enhanced capability of our model in handling more complex classification tasks.

As for experiment E2, when we use the baseline SNN C on the GEM dataset, that is, on a more complex input obtained by adding genomic mutations to the gene expression, the architecture of C proves not powerful enough to handle gene expression and gene mutations simultaneously, achieving an accuracy of 22.4%. This was expected, since considering the mutations adds a high number of features while leaving the number of samples unchanged. To overcome this limitation, we ran the experiment again using the deeper architecture of S described in Fig. 1 (experiment E3). With this setting, and using 24-way one-shot tasks to evaluate the model’s performance, we achieved an accuracy of 85.3%, considerably higher than that of C on the same dataset. Table 4 summarizes the results obtained through the different experiments.

Table 4 Results achieved by the three different experiment settings

Statistical comparison of SNNs

As shown in Table 4, when we applied the baseline architecture proposed by Mostavi et al. to our extended dataset (which includes both gene expression and mutational data), we observed a marked performance drop, from 91% accuracy with gene expression data alone to 22% when mutation data was added. This underscores the need for architectural modifications to effectively integrate multiple data modalities. To fairly compare our proposed architecture (detailed in Sect. 2.1) with the baseline, we conducted seven independent runs using different train/validation/test splits and only gene expression data, as in the original Mostavi et al. setting. To assess statistical significance, we applied the Wilcoxon signed-rank test [45] to the validation accuracies of the proposed and baseline models, as this non-parametric test is well-suited for small paired samples and does not assume normality. The test yielded a p-value of 0.007, indicating that the performance improvement of our method over the baseline is statistically significant. These results support the robustness and effectiveness of the proposed deeper architecture in handling the integrated data and achieving superior performance.
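The paired comparison above can be illustrated with a small, self-contained sketch of an exact one-sided Wilcoxon signed-rank test. The accuracy values below are hypothetical placeholders (expressed in tenths of a percent so that the arithmetic stays integer), not the values measured in the seven runs; the function name is introduced here for illustration, and the sketch assumes no zero differences and no tied absolute differences.

```python
from itertools import product


def wilcoxon_signed_rank_exact(x, y):
    """One-sided exact test that paired values in x tend to exceed y.

    Assumes no zero differences and no tied absolute differences.
    Returns (W_plus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y)]
    # Rank the absolute differences from smallest (1) to largest (n)
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under H0 every sign pattern is equally likely: enumerate all 2^n
    total = 0
    extreme = 0
    for signs in product((0, 1), repeat=len(diffs)):
        total += 1
        w = sum(r for r, s in zip(ranks, signs) if s)
        extreme += w >= w_plus
    return w_plus, extreme / total


# Hypothetical validation accuracies over seven paired runs
proposed = [930, 921, 942, 913, 954, 926, 937]
baseline = [901, 902, 903, 904, 905, 906, 907]
w_plus, p_value = wilcoxon_signed_rank_exact(proposed, baseline)
print(w_plus, p_value)
```

In this toy setting one model wins all seven pairs, so the exact one-sided p-value is $1/2^7 \approx 0.0078$, which is the smallest value attainable with seven pairs.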

As we can see in Fig. 3, in both cases, over 100 test runs, the proposed SNN S achieves better results (as also highlighted in Table 4).

Cancer insights from SHAP-based explainability

After training the proposed S on the GEM dataset (experiment E3), we applied the novel SHAP-based technique described in Sect. 2.2 to assign an importance score to each feature for each cancer type described in Table 2. As explained above, in standard machine learning models, SHAP values are signed quantities that measure how much a feature pushes the model’s output away from a baseline prediction. A positive SHAP value means the feature increases the prediction (e.g., toward a particular class), while a negative SHAP value means it pulls the prediction down. This allows directional interpretations at the single-sample level. However, our model works differently: it takes pairs of samples as input and outputs a similarity score. Each feature appears twice in a pair (once for each sample) and may influence the similarity score in different directions. For instance, a gene’s high expression in one sample might increase similarity, while its low expression in the other might decrease it. If we were to simply average the two signed SHAP values for each feature across the pair, these effects could cancel out, even if the feature was important in both cases. To avoid this, we use the average of the absolute SHAP values for each feature across the two samples. This gives us a magnitude-based score that reflects the overall importance of a feature for that sample pair, without being misled by opposite directions of influence. For example, suppose that for a given feature (e.g., expression of gene G1) we obtain SHAP values of −0.6 for sample A and 0.6 for sample B. Averaging the raw values would yield 0, falsely suggesting that the feature had no impact. In contrast, averaging the absolute values gives us 0.6, correctly indicating that G1 had a strong overall influence on the model’s decision for this pair.
This adaptation sacrifices directional information (i.e., whether the feature pushed the score up or down), but in return, it captures how much a feature consistently matters across both samples. By aggregating these values across many such pairs for a specific cancer type, we can rank genes by their global influence on similarity predictions for that cancer, highlighting features that help distinguish that cancer type from others. As a concrete example, Fig. 11 shows that the gene KLK3 had the highest average contribution to the similarity score when comparing sample pairs involving prostate cancer. Specifically, the average SHAP value for KLK3 across those pairs is approximately 0.1. It’s important to note that this value represents the magnitude of the contribution, not the signed effect, because it is computed as the average of the absolute SHAP values for each sample in the pair. This means KLK3 consistently had a strong impact on the model’s similarity assessments, regardless of whether it increased or decreased the score in individual samples.
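The difference between signed and magnitude-based averaging can be made concrete with a minimal sketch. The function names are hypothetical; the first feature reproduces the G1 example above (SHAP values of −0.6 and 0.6), while the second feature uses arbitrary illustrative values.

```python
def signed_average(shap_a, shap_b):
    """Feature-wise mean of the signed SHAP values of the two paired samples."""
    return [(a + b) / 2 for a, b in zip(shap_a, shap_b)]


def pair_importance(shap_a, shap_b):
    """Feature-wise mean of the absolute SHAP values of the two paired samples."""
    return [(abs(a) + abs(b)) / 2 for a, b in zip(shap_a, shap_b)]


def class_importance(pairs):
    """Mean pair importance per feature over many (shap_a, shap_b) pairs."""
    per_pair = [pair_importance(a, b) for a, b in pairs]
    return [sum(col) / len(col) for col in zip(*per_pair)]


# Feature 0 mirrors the G1 example; feature 1 is an arbitrary illustration
shap_a = [-0.6, 0.1]
shap_b = [0.6, 0.3]
signed = signed_average(shap_a, shap_b)
magnitude = pair_importance(shap_a, shap_b)
print(signed[0], magnitude[0])
```

For the first feature the signed average cancels to 0 while the magnitude-based score is 0.6; `class_importance` then averages such scores over all pairs collected for one cancer type.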

We recall that our goal was to identify the features that were most discriminative for each cancer type, i.e., the features that enabled the model to distinguish a specific cancer type from others. Since SHAP values are additive, for each cancer type we “aggregated” the feature importances by gene, summing the contributions from its expression, SNPs, INS, DEL, and CNAs. Then, we ranked the genes based on their overall importance. As an example, in Fig. 4 we show the contribution of each aggregated feature for the top 20 most important genes in Hepatobiliary cancer. The detailed analysis is available online,^{Footnote 6} where we show the importance of the top 50 genes for each of the 24 cancer types described in Table 2, highlighting the contribution of each one of the aggregated features.
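The gene-level aggregation described above amounts to summing the importances of all features that belong to the same gene and sorting the totals. The sketch below assumes a hypothetical feature naming scheme of the form `<GENE>_<KIND>` and uses placeholder gene labels and made-up importance values.

```python
from collections import defaultdict


def aggregate_by_gene(feature_importance):
    """Sum importances of all features belonging to the same gene.

    Feature names are assumed to look like "<GENE>_<KIND>", where KIND is
    one of EXPR, SNP, INS, DEL, CNA (a hypothetical convention).
    Returns (gene, total) pairs sorted by decreasing total importance.
    """
    totals = defaultdict(float)
    for name, value in feature_importance.items():
        gene, _, _kind = name.rpartition("_")
        totals[gene] += value
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Made-up importances for two placeholder genes
importance = {
    "G1_EXPR": 0.10, "G1_SNP": 0.01,
    "G2_EXPR": 0.02, "G2_INS": 0.05,
}
ranked = aggregate_by_gene(importance)
print(ranked)
```

Splitting on the last underscore keeps gene symbols that contain hyphens or underscores intact.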

After identifying the top 100 genes for each cancer type, we counted, for each of these genes (referred to as relevant genes), the number of cancer types in which it appears in the top 100, from 1 up to 24, which is the maximum number of cancer types analyzed. As a result, we identified a total of 414 relevant genes across all cancer types. As we can see in Fig. 5, 168 genes appear in a single cancer type (first bar from the left), while 36 genes are relevant across all the 24 cancer types (first bar from the right). Interestingly, this analysis allows us to organize the relevant genes into relevance classes.

First, we identified the set of genes which have the highest level of cross-cancer “sensitivity”, reflecting their universal relevance within the dataset. This set, named shared genes, consists of the following 36 genes relevant across all the 24 cancer types: A2M, ACTBL2, BMS1P20, C3, CEACAM5, CHGA, CLU, COL6A2, COL6A4P1, EFEMP1, ENO1, FN1, FTHL17, FTMT, GNAS-AS1, IGF2, KLK3, KRT19, KRT5, KRT8, LCP1, MALAT1, MUC1, PABPC1, PLTP, PTMA, RGS5, RPL7A, SDC1, SFN, SFTPB, SPARC, TG, TUBA1A, XBP1, and XIST.

Then, we identified the genes with greater “specificity”, defined as those genes which are “not relevant” to at least one cancer type. Referred to as specific genes, these are the 378 genes out of the 414 relevant ones that are not among the 36 shared genes.

Finally, to further refine our analysis and explore levels of relevance that lie between the specificity of the specific genes and the cross-cancer sensitivity of the shared genes, we identified an intermediate category of genes, which we will refer to as cross-relevant genes. These are genes relevant across at least 20 out of the 24 cancer types analyzed. This threshold was chosen to capture a subset of genes that exhibit significant cross-cancer relevance while not necessarily meeting the stringent criteria of being relevant to all cancer types. This expanded subset consists of the following 48 genes: GNAS-AS1, XBP1, TG, KLK3, KRT19, FTHL17, KRT5, ACTBL2, CHGA, CEACAM5, KRT8, RPL7A, FN1, ITGA3, MUC1, SFTPB, SPARC, C3, COL6A4P1, PTMA, SFN, ENO1, RGS5, PLTP, CLU, XIST, S100A6, FTMT, MALAT1, LCP1, EFEMP1, HLA-A, IGF2, PABPC1, FBLN1, SDC1, ALB, P4HB, TUBA1A, PRSS16, COL1A1, A2M, TTN, BMS1P20, CRYAB, COL6A2, SPOCK2, and ITM2B.
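The three relevance classes defined above follow directly from counting, for each gene, the number of cancer types whose top list contains it. The sketch below uses three toy cancer types and a cross-relevance threshold of 2 in place of the 24 types and the threshold of 20 used in our analysis; the function name and the toy gene labels are hypothetical.

```python
from collections import Counter


def relevance_classes(top_genes_by_cancer, cross_threshold):
    """Split relevant genes into shared, specific, and cross-relevant sets."""
    n_types = len(top_genes_by_cancer)
    # Number of cancer types in which each gene appears in the top list
    counts = Counter(
        g for genes in top_genes_by_cancer.values() for g in set(genes)
    )
    shared = {g for g, c in counts.items() if c == n_types}
    specific = set(counts) - shared
    cross_relevant = {g for g, c in counts.items() if c >= cross_threshold}
    return counts, shared, specific, cross_relevant


# Toy top lists for three placeholder cancer types
top = {
    "A": ["g1", "g2", "g3"],
    "B": ["g1", "g2", "g4"],
    "C": ["g1", "g5", "g3"],
}
counts, shared, specific, cross_relevant = relevance_classes(top, 2)
print(sorted(shared), sorted(specific), sorted(cross_relevant))
```

By construction the shared genes are a subset of the cross-relevant genes, and shared and specific genes partition the relevant ones, consistent with $414 - 36 = 378$.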

We carried out three different analyses: (i) enrichment analysis of the 48 cross-relevant genes, (ii) analysis of the 36 shared genes, and (iii) analysis of the 378 specific genes.

Enrichment analysis

The selection of 48 cross-relevant genes, achieved by setting the cutoff threshold to at least 20 cancer types out of 24, enhances the robustness of the enrichment analysis. This approach also allows for the inclusion of genes that might otherwise be excluded due to variability in SHAP value computations, ensuring a broader and more reliable gene set for analysis. On these genes, we performed an enrichment analysis using the Database for Annotation, Visualization and Integrated Discovery (DAVID), an integrated biological knowledge base with analytic tools aimed at extracting biological meaning from large gene/protein lists [46]. We found that the cross-relevant genes are enriched for components related to the TME, such as proteins and elements found in the extracellular matrix (ECM). This is consistent with what was reported by [24]. As discussed in Sect. 1, it is known in the literature that different cancer types are characterized by different TMEs, which can also determine the vulnerability of the cancer to different therapies [47]. TMEs are defined as the normal tissues surrounding the tumor that are shaped by it and that can influence several aspects of the disease, such as its spreading. This means that different TMEs are usually made of the same types of components (such as blood vessels and extracellular matrix components), which however perform different functions across cancers. Details about this analysis can be found in Table 5. Additional materials produced during our analysis can be found online.^{Footnote 7}

Table 5 Annotation clusters with enrichment scores and corresponding statistics

Key findings from the enrichment analysis, summarized in Table 5, reveal that these genes are prominently associated with the ECM, including components such as collagen and glycoproteins, and processes such as ECM organization. Genes such as COL1A1, COL6A2, and FN1 show the structural and functional roles of the ECM in tumor progression, including cell adhesion, migration, and interaction with immune cells. Additionally, the analysis identifies significant involvement of these genes in signaling and regulatory pathways, such as those mediated by integrins and other ECM-receptor interactions. Proteins encoded by genes like ITGA3 and SPARC are critical for communication between tumor cells and the surrounding stroma, influencing tumor growth, invasion, and immune evasion. The presence of secreted proteins, such as those encoded by MUC1 and CEACAM5, further supports the role of these genes in modulating the TME and facilitating metastasis. From an immunological perspective, genes like HLA-A and RGS5 reflect the dynamic interactions between tumors and immune cells within the TME. These interactions are crucial for understanding how tumors evade immune responses and how targeted immunotherapies can be developed. Furthermore, the inclusion of genes linked to cytoskeletal dynamics, such as ACTBL2 and TUBA1A, underscores the importance of intracellular structural components in maintaining cellular integrity and facilitating tumor cell motility.

Shared genes analysis

Since tumors are characterized by TMEs composed of similar components, such as healthy tissues “co-opted” by the tumor, yet exhibit distinct functional behaviors, including immune responsiveness and biological roles, across cancer types, this implies the existence of a shared set of TME-related genes that are “universally” important across cancers while capturing variations in their biological mechanisms. Given this, it is reasonable to assume that the proposed SNN, by focusing on these genes, can learn to distinguish between TMEs, allowing it to differentiate the cancer types effectively.

In our case, we hypothesize that the set of 36 shared genes described above can be used for this purpose. This can be seen in detail in Fig. 11, where we can first observe that the same shared gene, although being among the most important genes for all the different cancer types, can assume highly diverse importance values. Furthermore, the importance of these genes is mainly given by their expression. This is not surprising, since, as described in Sect. 3.1, gene expression was the primary criterion used to select the subset of genes used to create the datasets for the experiments. However, as we can see in Fig. 12, the distribution of importances of the TMB among such shared genes varies among the different cancer types, contributing to the overall result. This leads us to believe that different mutational patterns may contribute to the variety of TME functions. Indeed, several shared genes identified as important from the SHAP values of our trained model relate well to known results in the literature. This is an additional capability with respect to CancerSiamese, which highlighted some genes that could potentially serve as general biomarkers, meaning they are important for the network overall. Instead, our approach allows us to identify genes that are more or less significant for specific cancer types, such as those listed below. For example, as we can see in Fig. 11, KLK3 seems to be extremely important for Prostate Cancer. This is coherent with the literature, which reports that this gene is a known prostate cancer biomarker [48, 49]. A similar pattern can be observed for Thyroid Cancer, with TG being the most important gene; this gene is also a well-established thyroid cancer biomarker [50]. Similarly, KRT5 has been reported to be highly expressed in Head and Neck Cancer [51], and CEACAM5 has been identified as a biomarker for Colorectal Cancer [52], among others.
As a further example, genes from the Collagen VI family (COL6A) encode a major extracellular matrix protein, which is involved in tumorigenesis and tumor progression, and they are differently expressed across different cancers [53]. In Fig. 12 we can observe that the presence of CNAs on the COL6A4P1 gene seems particularly important in the context of Bladder Cancer, and it would be worthy of further investigation. This observation seems in agreement with the recent finding that Bladder Cancer is, in general, affected by a high number of CNAs [54]. The broad impact of CNAs on multiple genes in bladder cancer is evident in our results, as can be seen, for example, in Fig. 6. Another noticeable example is the high importance of GNAS-AS1 in Adrenocortical Cancer. This gene was reported to be up-regulated compared to a healthy control group [55].

To assess the role of shared genes in a cohesive manner, we analyzed their function through the STRING platform [56], a database of known and predicted protein-protein interactions that supports functional enrichment analysis of protein-protein interaction networks. As a result, 18 out of the 36 shared genes were clustered together and classified as tumor markers. Figure 7 shows the connections among the members of the clusters, where a higher number of connections suggests a higher reliability of the data, since it corresponds to a higher number of findings in the literature.

The literature supporting our findings suggests that even among the identified shared genes, those lacking established biological roles or evidence in current studies may hold significant potential for further investigation. This highlights the broader utility of our method, not only as a diagnostic tool but also as a powerful framework for prioritizing and selecting promising targets for in-depth cancer research.

Specific genes analysis

Among the specific genes described above, several genes were chosen mainly due to their mutational patterns. It is known that cancer development tends to generate a high number of mutations which, however, do not actually give any evolutionary advantage [47]. This makes it difficult to identify signature mutations, i.e., mutations which are specific to a cancer type and that may have an actual functional role in its development. However, the proposed SHAP-based analysis enables us to identify some important genes chosen for their mutational patterns which are supported in the literature, suggesting that our method may be a valuable tool for directing the attention of biologists to a promising subset of genes when looking for impactful mutations.

For instance, for Leukemia, the importance of the NPM1 gene was mainly determined by the frequency of insertion mutations (see Fig. 8), and it is known in the literature that insertions in the NPM1 gene are present in 30-50% of patients affected by acute myeloid leukemia [57]. When we looked into our data, we found that 41 out of the 132 patients affected by Leukemia presented one insertion on NPM1, while no patient affected by another cancer type had any insertion on this gene.

Similarly, in Endometrial cancer (EC), we can observe that the DST gene was chosen mainly due to the impact of its SNP mutation frequency (see Fig. 9). In [58], the authors carried out a preliminary study on a cohort of 23 patients affected by uterine sarcomas, carcinosarcomas, endometrial stromal sarcoma (ESS), adenosarcoma, and leiomyoma. One of their goals was to identify signature genes. Among other results, they classified the SNP mutation (16429C>T) on the DST gene as a signature for the ESS cancer (i.e. our Endometrial cancer). However, in their cohort only 7 patients were affected by ESS, and only 2 of them showed the aforementioned mutation. Our model seems to confirm the importance of SNPs on the DST gene for the ESS cancer type on a much bigger cohort, since our dataset included 563 ESS patients. Overall, 10 among the top 30 predicted genes for EC find confirmation in the literature [59,60,61,62,63,64,65,66,67,68]. Specifically, the TUBB4B gene is a known prognostic marker in EC [59], and SERPINC1 is reported as an important blood-based protein biomarker candidate for EC detection [60]. Similarly, RPL7A expression was associated with Uterine Corpus Endometrial Carcinoma [61]. Genes such as SPOCK2, XBP1, SPARC and SDC1 are connected to EC growth, progression and migration [62,63,64,65]. Findings about MALAT1 suggested that the rs664589 C>G mutation significantly increased the risk of EC [66]. Interestingly, mutations are responsible for a considerable portion of the overall importance assigned to MALAT1 by our model for EC, although multiple types of mutations are involved, as can be seen in Fig. 9. A final valuable finding is that CTNNB1, although not among the 50 most important genes for EC shown in Fig. 9, is among the top 100 most important genes for EC according to our model and appears to be a frequently mutated gene in EC.
Interestingly, mutations on the CTNNB1 gene in the context of the EC are reported to lead to alterations in the Wnt/$\beta $-catenin signaling pathway, which is involved in the carcinogenesis and progression [69]. According to our model, most of the importance of CTNNB1 is determined by its mutations, coherently with the aforementioned literature. Further details for all 24 cancer types can be observed in the additional material available online,^{Footnote 8} where for each one of the 24 cancer types we report the list of the top 50 genes according to their importance.

Another interesting case is Non-Small Cell Lung cancer (NSCLC). As shown in Fig. 10, several genes were marked as important due to the impact of their mutational patterns, in particular TFRC, CTNNB1, TTN, SDC2, PLEC, AP2M1, COL4A1 and AEBP1. Some of these genes find confirmation in the literature. For example, it is known that oncogenic mutations have been found on CTNNB1 in liver cancer, uterine cancers, and in a small subset of NSCLCs, although the clinical significance of CTNNB1 mutations in NSCLCs and Lung Adenocarcinomas (LUADs) is still unclear [70]. Similarly, it is reported that TTN is a commonly mutated gene in diverse types of tumor, including LUAD and lung squamous cell carcinoma [71]. In [72], the authors suggest that SDC2 likely plays an important role in the invasive properties of LUAD cells.

One-shot evaluation for Healthy vs. Cancerous detection

As a final experiment, we evaluated the capability of the proposed SNN S, trained on the GEM dataset, to distinguish between healthy and cancerous samples. This analysis aimed to investigate whether S could be applied as a reliable “binary” classifier to ascertain a patient’s health status. To this end, we constructed a dataset of 305 healthy individuals using data from the TCGA project, accessed through the cBioPortal for Cancer Genomics platform, ensuring that each individual was characterized by the same 5719 features, i.e., Gene Expressions + Mutations (see GEM dataset in Table 3). This enabled direct application of the trained S without modification.

The capability of S has been evaluated in two different scenarios. In the first one, given a query healthy sample p, we assume the availability of support healthy samples that can be included in the support set $\mathcal{S}$ to be compared with p. In the second one, instead, we assume that, given a query healthy sample p, the support set $\mathcal{S}$ to be compared with p consists only of samples belonging to cancer classes. In the following, we provide details about both experiments.

Available healthy support samples

In this scenario, we assume the availability of support healthy samples.

The trained S model was evaluated on 25-way predictions, where a query sample (among the 305 healthy individuals described above or the 8788 cancerous samples reported in Table 3) was compared against a support set of 25 samples: 24 representing the 24 distinct cancer types considered, and one derived from the set of 305 healthy individuals (one-shot). The model predicted the type of the query sample based on the paired support sample that received the highest similarity score from S among the 25 pairs. The prediction was counted as correct if the predicted type was the same as the true type of the query sample. We tested S on 600 randomly selected query samples (300 among the 305 healthy individuals and 300 among the 8788 cancerous samples) and the corresponding support sets. The accuracy of the 25-way prediction is calculated as the fraction of correct predictions among the 600 25-way predictions. The results demonstrated the remarkable ability of S to differentiate between healthy individuals and those affected by cancer, achieving an accuracy of 95.7%.

Not available healthy support samples

In this scenario, we assume that the support set to be compared with a query sample consists only of samples belonging to cancer classes.

According to these premises, we need to formalize the similarity criterion for assessing whether a healthy patient $p_i$ belongs to a given cancer class $C_l$. Let $mean_l$ represent the average similarity score for $C_l$, obtained by using S to compute the similarity scores for all pairs of samples within this class and then calculating their mean. Let $r_l$ be a randomly selected representative sample from class $C_l$. Then, the similarity score between $p_i$ and $r_l$ can be calculated using the trained SNN S as follows:

$$d(p_i, r_l) = \texttt {L2}\left( {\texttt {S\_q}}(\texttt {fv}(p_i)), {\texttt {S\_s}}(\texttt {fv}(r_l))\right) $$

where $\texttt {L2}$ denotes the Euclidean distance between the embeddings generated by the query subnetwork $\texttt {S\_q}$ and the support subnetwork $\texttt {S\_s}$ (see Sect. 2.1), and fv$(p_i)$ (resp. fv$(r_l)$) is the feature vector of $p_i$ (resp. $r_l$). Then, $p_i$ is similar to $r_l$ if:

$$ mean_l - \epsilon \le d(p_i, r_l) \le mean_l + \epsilon \qquad (3) $$

where the parameter $\epsilon $ defines the similarity tolerance margin, acting as a control for the interval $[mean_l - \epsilon , mean_l + \epsilon ]$. It specifies the acceptable range of deviation from $mean_l$. A smaller $\epsilon $ implies a stricter similarity criterion, increasing specificity, while a larger $\epsilon $ broadens the acceptance range, allowing for greater variability in the similarity without compromising classification robustness. The patient $p_i$ is classified as likely belonging to $C_l$ if Eq. 3 holds. Otherwise, the patient $p_i$ is classified as not belonging to $C_l$, implying that $p_i$ is considered healthy with respect to $C_l$. This formulation provides a robust method for assessing the health status of $p_i$ by comparing it against multiple representatives of each cancer class, leveraging the learned similarity function of S for a reliable classification decision.
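The decision rule of Eq. 3 can be sketched as a simple interval check applied class by class. In this sketch the distances $d(p_i, r_l)$ and the class means $mean_l$ are made-up numbers standing in for the outputs of the trained S, the class labels are placeholders, and the function names are hypothetical.

```python
def in_band(distance, class_mean, eps):
    """Eq. 3: the distance lies within eps of the class mean."""
    return class_mean - eps <= distance <= class_mean + eps


def matching_classes(distances, class_means, eps=0.03):
    """Cancer classes whose tolerance band contains the query distance.

    An empty result means the query is considered healthy with respect
    to every cancer class.
    """
    return [c for c, d in distances.items() if in_band(d, class_means[c], eps)]


# Made-up class means and query-to-representative distances
class_means = {"C1": 0.20, "C2": 0.50}
cancer_like = matching_classes({"C1": 0.22, "C2": 0.90}, class_means)
healthy_like = matching_classes({"C1": 0.40, "C2": 0.80}, class_means)
print(cancer_like, healthy_like)
```

The first query falls inside the band of class C1 and is flagged as likely belonging to it, while the second falls outside every band and is treated as healthy.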

Through extensive experimentation with varying values of $\epsilon $ on the 305 healthy individuals described above, we determined that setting $\epsilon = 0.03$ yielded the best results. Under these parameters, the system accurately identified 93.1% of the healthy patients as non-cancerous. To further validate the robustness of this approach, we conducted a similar experiment on 305 randomly selected cancerous patients from the GEM dataset. Using the same classification criterion, S accurately identified 99.3% of the cancerous patients as belonging to a cancer class, thus affirming its effectiveness in recognizing malignancy. These results underscore the stability and robustness of S, demonstrating its ability to perform binary classification, healthy vs. cancerous, without the need for retraining, and on healthy patients never seen during training.
