In a study recently published in Biology Methods and Protocols, researchers developed binary and multiclass machine learning models to distinguish between cancerous and non-cancerous tissue samples.

Background
Cancer is a major global health problem, with risk shaped by age, environmental toxins, and lifestyle choices. Early detection is essential for effective treatment and survival. The complex nature of cancer and its interactions with the tissue microenvironment and the immune system make developing interventions challenging.
Metastatic malignancies account for most cancer deaths, often because of delayed diagnosis. Early detection and diagnosis, coupled with modern medicine, have a significant impact on cancer survival and treatment. Computational approaches can aid early detection, diagnosis, and screening by analyzing complex tumor methylation patterns.
About the Research
In this study, the researchers used machine learning and microarray-based methylation analysis to classify 13 types of cancer and their associated normal tissues.
The researchers obtained methylome microarray data from The Cancer Genome Atlas (TCGA) GDC data portal and examined 13 human cancer types with at least 15 non-cancer samples, and also analyzed data from independent studies to evaluate their models.
During data preprocessing, the researchers removed potentially noisy probes and probes with more than 5.0% missing values, and retained probes that mapped to autosomes and sex chromosomes. For the multiclass data, features were created by intersecting the features of the cancer types and the non-cancer class, with the non-cancer class drawn from non-cancer samples collected across all tissue types.
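The missing-value filter described above can be sketched in a few lines. This is a minimal illustration, not the study's pipeline: the function name, the dictionary layout, and the probe IDs are hypothetical, and only the 5% threshold comes from the article.

```python
import math


def filter_probes(beta_matrix, max_missing_fraction=0.05):
    """Return IDs of probes whose share of missing values is within the limit.

    beta_matrix maps a probe ID to a list of per-sample values, where a
    missing measurement is represented by None or NaN (an assumed layout).
    """
    kept = []
    for probe_id, values in beta_matrix.items():
        if not values:
            continue  # a probe with no measurements cannot be assessed
        missing = sum(
            1 for v in values
            if v is None or (isinstance(v, float) and math.isnan(v))
        )
        # Probes with MORE than the threshold fraction missing are dropped.
        if missing / len(values) <= max_missing_fraction:
            kept.append(probe_id)
    return kept


# Hypothetical example with 20 samples per probe.
example = {
    "cg_a": [0.5] * 20,                   # 0% missing, kept
    "cg_b": [0.5] * 19 + [None],          # 5% missing, kept
    "cg_c": [0.5] * 18 + [None, None],    # 10% missing, removed
}
kept_probes = filter_probes(example)
```

A probe sitting exactly at the 5% boundary is kept here, because the article describes removing probes with more than 5.0% missing values.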
While preprocessing the dataset, the researchers used the unmethylated and methylated counts in the TCGA data to derive a beta value for each probe. They used binary and multiclass machine learning models to distinguish between cancerous and normal tissue. Each binary model evaluated a single tissue type and distinguished between cancerous and non-cancerous samples, while the multiclass models used all 13 cancer types together with the non-cancerous data.
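A beta value expresses methylation as the methylated share of the total signal, so it lies between 0 (unmethylated) and 1 (fully methylated). The sketch below uses the conventional microarray definition with a small stabilizing offset. The offset of 100 is a commonly used default and is an assumption here, since the article does not state which value the study used.

```python
def beta_value(methylated, unmethylated, offset=100.0):
    """Compute a methylation beta value from methylated and unmethylated signal.

    beta = M / (M + U + offset). The offset keeps the ratio stable when both
    signals are low; its value here is an assumed default.
    """
    if methylated < 0 or unmethylated < 0:
        raise ValueError("signal values must be non-negative")
    total = methylated + unmethylated + offset
    if total <= 0:
        raise ValueError("total signal plus offset must be positive")
    return methylated / total


# Hypothetical signal values for three probes.
mostly_methylated = beta_value(900, 0)      # close to fully methylated
unmethylated_probe = beta_value(0, 900)     # no methylated signal
half_methylated = beta_value(500, 500, offset=0)
```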
The input data were split into training and testing datasets, with the testing dataset accounting for 25% of the samples. The researchers used two baseline classification methods: logistic regression and support vector machine (SVM).
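A 75/25 split like the one described can be sketched with the standard library alone. The function name and the fixed seed are illustrative assumptions, and the article does not say whether this particular split was stratified, so the sketch uses a plain shuffled split.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.25, seed=0):
    """Shuffle sample indices and hold out a fraction of them for testing.

    Returns (train_indices, test_indices). The seed makes the split
    reproducible; its value is arbitrary.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


# Hypothetical dataset of 100 samples.
train_idx, test_idx = train_test_split_indices(100)
```

In practice a library routine such as scikit-learn's `train_test_split` is normally used for this step, and the resulting training indices would feed the logistic regression and SVM baselines.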
The researchers developed an XGBoost model, an ensemble of gradient-boosted decision trees, with 450 estimators, a maximum depth of 10, and a learning rate of 0.2. They also built EMethylNET, a multiclass feed-forward neural network whose input consists of the features with importance values greater than zero (3,388 features).
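The configuration and the feature-selection step can be sketched as follows. The dictionary uses the parameter names of xgboost's scikit-learn wrapper (`XGBClassifier`), with the values reported in the article. The importance scores and probe IDs are hypothetical, and the sketch assumes the importances come from the trained XGBoost model.

```python
# Hyperparameters reported in the article, keyed by the parameter names
# accepted by xgboost.XGBClassifier (the library is not imported here).
XGB_PARAMS = {
    "n_estimators": 450,   # number of boosted trees
    "max_depth": 10,       # maximum depth of each tree
    "learning_rate": 0.2,  # shrinkage applied to each tree's contribution
}


def select_informative_features(importances):
    """Keep the names of features whose importance is strictly above zero."""
    return [name for name, score in importances.items() if score > 0]


# Hypothetical importance scores for three probes.
example_importances = {"cg_a": 0.12, "cg_b": 0.0, "cg_c": 0.03}
selected = select_informative_features(example_importances)
```

Applied to the full model, a filter of this kind would yield the 3,388 input features the article reports for EMethylNET.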
The researchers combined the molecular mechanisms of cancer pathway with the pathways in cancer (human) pathway, drawn from the Ingenuity Pathway Analysis (IPA) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases, to create a pan-cancer methylome model. In this model, genes linked to multiclass methylation signatures are shown as blue nodes, or as purple nodes if they are listed as cancer genes in OncoKB or the COSMIC Cancer Gene Census.
The researchers compared the long non-coding ribonucleic acids (lncRNAs) they identified with known cancer lncRNAs, using two cancer lncRNA databases, Lnc2Cancer 3.0 and CRlncRNA, as well as the Cancer LncRNA Census (CLC). Following gene normalization, they split the data into stratified training and test sets and estimated hazards in the test set using three Cox proportional hazards regression models.
Results
The model classified 13 types of cancerous and noncancerous tissues based on their deoxyribonucleic acid (DNA) methylome with 98% accuracy. Methylation-associated genomic sites identified by the model classifier were linked to cancer-related pathways, networks, and genes, providing insights into the epigenomic regulatory pathways of carcinogenesis.
Multiclass classification approaches performed better than binary classification of DNA methylation in individual tumor and normal tissues. Multiclass logistic regression models achieved a mean Matthews correlation coefficient (MCC) score of 0.96, although their effectiveness varied across cancer types.
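The MCC summarizes agreement between predicted and true labels on a scale from -1 to 1, where 1 is perfect prediction and 0 is no better than chance. The sketch below implements the standard multiclass generalization of the metric. It is an illustration with hypothetical labels, not the study's evaluation code.

```python
import math
from collections import Counter


def matthews_corrcoef(y_true, y_pred):
    """Multiclass Matthews correlation coefficient.

    MCC = (c*s - sum_k p_k*t_k) / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    where s is the number of samples, c the number of correct predictions,
    t_k how often class k truly occurs and p_k how often it is predicted.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)
    c = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    cross = sum(true_counts[k] * pred_counts[k] for k in labels)
    denom = math.sqrt(
        (s * s - sum(v * v for v in pred_counts.values()))
        * (s * s - sum(v * v for v in true_counts.values()))
    )
    # By convention the score is 0 when the denominator vanishes.
    return (c * s - cross) / denom if denom else 0.0


# Hypothetical labels: a perfect multiclass prediction and a poor binary one.
perfect = matthews_corrcoef(
    ["BRCA", "LUAD", "normal", "BRCA"],
    ["BRCA", "LUAD", "normal", "BRCA"],
)
poor = matthews_corrcoef([1, 1, 1, 0], [1, 0, 1, 1])
```

Unlike plain accuracy, the MCC takes class frequencies into account, which makes it informative when classes are imbalanced, as with cancer types that have differing sample counts.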
The experiment evaluated 13 genes, four of which overlapped with the multiclass genes. The research team noted an enrichment of pathways related to cancer hallmarks, including cancer pathways, metabolic pathways, and signaling pathways. In several of the cancer-related pathways, the multiclass genes fell into categories covering specific cancer types, cell death and survival, the tissue microenvironment, signaling, metabolism, and the immune system.
The study showed that EMethylNET, a multiclass deep neural network fed with features from an XGBoost model, was able to detect cancer. However, there were two outliers in the model's performance: an independent colon adenocarcinoma (COAD) dataset and an independent head and neck squamous cell carcinoma (HNSC) dataset. On test set data, EMethylNET performed as well as or better than related cancer classification studies.
The study showed that the XGBoost model can classify different cancer types based on DNA methylation data. The researchers also created an EMethylNET model that can generalize to most independent datasets.
Genetic mapping has revealed genes with functional signatures and pathways related to carcinogenesis. This technology can identify hundreds of cancer types and may be extended to DNA methylation datasets from cell-free DNA for early diagnosis by liquid biopsy. A practical application of this technology would be to screen for specific cancers of unknown cause, which is not possible with current machine learning models.
Journal reference:
Izzy Newsham, Marcin Sendera, Sri Ganesh Jammula, and Shamith A. Samarajiwa. Early detection and diagnosis of cancer with interpretable machine learning to uncover cancer-specific DNA methylation patterns. Biology Methods and Protocols, Volume 9, Issue 1, 2024, bpae028. doi: https://doi.org/10.1093/biomethods/bpae028
