A machine learning model using DNA methylation patterns may help identify the origin of cancers of unknown primary
SAN DIEGO – A machine learning model that analyzes CpG-based DNA methylation accurately predicted the tissue of origin across many different cancer types and could help guide care for patients with cancer of unknown primary (CUP), according to a study presented at the American Association for Cancer Research (AACR) Annual Meeting 2026, held April 17-22.
CUP is a metastatic cancer for which the primary site cannot be identified. These cancers are often associated with a poor prognosis because treatment decisions must be made without knowing where the cancer originated, and patients are typically treated with broad, nonspecific chemotherapy regimens rather than treatments targeted to a specific cancer type, according to presenter Dr. Marco A. De Velasco, a faculty member in the Department of Genome Biology at Kindai University in Japan.
“Only 15% to 20% of patients with CUP have features that allow doctors to treat them with site-specific therapies, which are associated with better outcomes,” De Velasco explained. “Most patients, 80% to 85%, receive more general chemotherapy, which is often less effective. Patients who receive site-specific therapy can survive up to 24 months, whereas those who receive standard therapy survive 6 to 9 months.”
Researchers have explored whether using molecular profiling to identify a cancer's tissue of origin can improve treatment decisions. These approaches analyze patterns in tumor biology, such as gene activity and chemical modifications to DNA, that vary by cancer type and can persist even after the cancer has spread. De Velasco said some methods have shown promise, but clinical trials have not yet demonstrated a clear survival benefit.
In this study, De Velasco and colleagues, including co-principal investigator Dr. Kazuko Sakai and principal investigator Kazuto Nishio, MD, developed a new approach that focuses on CpG DNA methylation, a chemical modification that occurs at sites in DNA where a cytosine base is followed by a guanine base. De Velasco pointed out that CpG methylation acts like a molecular “fingerprint” for different tissues in the body. By analyzing these patterns in tumor samples, the researchers developed a computational model that can distinguish between 21 different types of cancer.
“Instead of relying on large, complex datasets, we aimed to identify a smaller, more actionable set of markers that hold strong predictive power,” De Velasco said. “The long-term goal is to create tools that can help doctors identify the likely source tissue and determine more effective treatments.”
The model was developed using methylation data from approximately 7,500 patients with 21 different cancer types, drawn from The Cancer Genome Atlas (TCGA) program and other public datasets. The data were divided into training and testing cohorts.
The researchers applied machine learning to the training cohort to identify CpG methylation sites within tumor DNA and to build methylation profiles associated with different tumor types.
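The release does not say which algorithm or marker-selection method the investigators used. As a rough illustration only, the Python sketch below shows one generic way to pick a small set of discriminative CpG sites from a training cohort and then classify held-out samples. Everything in it is an assumption made for illustration: the synthetic methylation (beta) values, the between-class variance filter, the nearest-centroid classifier, and all function names.

```python
import random
from statistics import mean, pvariance

def make_synthetic(n_per_class=30, n_sites=200, n_informative=10, n_classes=3, seed=0):
    """Synthetic beta values in [0, 1]; only the first n_informative sites differ by class."""
    rng = random.Random(seed)
    X, y = [], []
    for c in range(n_classes):
        for _ in range(n_per_class):
            row = []
            for j in range(n_sites):
                # Informative sites get a class-specific mean (0.2, 0.5 or 0.8); others sit at 0.5.
                mu = 0.2 + 0.3 * ((c + j) % 3) if j < n_informative else 0.5
                row.append(min(1.0, max(0.0, rng.gauss(mu, 0.1))))
            X.append(row)
            y.append(c)
    return X, y

def split(X, y, train_frac=0.7, seed=1):
    """Shuffle the samples and divide them into training and testing cohorts."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(train_frac * len(idx))
    tr, te = idx[:cut], idx[cut:]
    return [X[i] for i in tr], [y[i] for i in tr], [X[i] for i in te], [y[i] for i in te]

def class_means(X, y, sites):
    """Mean beta value per class at each of the given sites."""
    return {
        c: [mean(row[j] for row, lab in zip(X, y) if lab == c) for j in sites]
        for c in sorted(set(y))
    }

def select_sites(X, y, k):
    """Keep the k sites whose class means vary most across classes (a simple filter)."""
    all_sites = range(len(X[0]))
    means = class_means(X, y, all_sites)
    scores = [pvariance([means[c][j] for c in means]) for j in all_sites]
    return sorted(sorted(all_sites, key=lambda j: scores[j], reverse=True)[:k])

def predict(row, centroids, sites):
    """Assign the class whose centroid is nearest (squared Euclidean distance)."""
    def dist(c):
        return sum((row[j] - m) ** 2 for j, m in zip(sites, centroids[c]))
    return min(centroids, key=dist)

X, y = make_synthetic()
X_tr, y_tr, X_te, y_te = split(X, y)
sites = select_sites(X_tr, y_tr, k=10)        # selection uses training data only
centroids = class_means(X_tr, y_tr, sites)    # one methylation profile per class
hits = sum(predict(r, centroids, sites) == t for r, t in zip(X_te, y_te))
accuracy = hits / len(y_te)
print(f"selected sites: {sites}")
print(f"held-out accuracy: {accuracy:.2f}")
```

Selecting markers on the training cohort alone, as done here, keeps the held-out accuracy from being inflated by information leaking from the test samples.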
The model correctly identified the cancer type in approximately 95% of cases in the study cohort and maintained high performance, with approximately 87% accuracy, when applied to an independent validation cohort of 31 cases representing 17 different cancer types from the investigators’ institution.
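Because the validation cohort is small, the 87% figure carries considerable statistical uncertainty. The short Python sketch below computes a 95% Wilson score interval, assuming that 27 of the 31 cases were classified correctly. That count is inferred from the rounded percentage and is not stated in the release, and the function name is hypothetical.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives about 95% coverage)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

# Assumed count: 27 of 31 correct, which rounds to the reported 87%.
low, high = wilson_interval(27, 31)
print(f"accuracy = {27 / 31:.1%}, 95% interval = {low:.1%} to {high:.1%}")
```

Under that assumption the interval runs from roughly 71% to 95%, which illustrates why evaluation in larger cohorts matters.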
“One of the most important findings from our study was that we were able to accurately predict the origins of different types of cancer using a very small subset of DNA markers, about 1,000 CpG regions selected from hundreds of thousands across the genome,” De Velasco said. “This is important because it shows that complex molecular data can be simplified while maintaining strong predictive performance.”
For patients with CUP, he added, the model could help doctors move away from a trial-and-error treatment approach and instead select treatments tailored to where the cancer is thought to have originated.
“Our findings suggest that a DNA-based approach can help determine where cancer started, even when the original tumor is not visible. By using a much smaller and more focused set of markers, this approach could make this type of test more practical and accessible in the future,” said De Velasco.
“Overall, we believe this study is part of a broader effort to use molecular information to better understand cancer, with the aim of supporting more informed and personalized care in the future. However, this study is still in the research phase. We next need to assess how well this approach performs in prospective analyses of patients with true cancers of unknown primary origin,” De Velasco added.
One important limitation of this study is that the model was developed using cancers of known origin rather than true CUP cases. This means the model needs to be tested in patients with CUP to understand how well it performs in clinical practice. Another limitation is that not all tumors are easily accessible for genetic testing, especially in advanced-stage settings. According to De Velasco, an important next step is to adapt and evaluate the model using blood-based biopsies that analyze circulating tumor DNA, rather than relying on DNA from tissue samples.
This research was funded by the Japan Society for the Promotion of Science. De Velasco reports no conflicts of interest.
This news release was issued by the American Association for Cancer Research on April 20, 2026.
