Cells read their internal DNA and make useful products, such as proteins, through a process called gene expression. Scientists and health organizations report that gene expression datasets often contain too few patient samples and too many genes per sample, creating a major barrier to reducing cancer globally. This imbalance makes it difficult to find and prioritize the changes in gene expression that distinguish cancer cells from healthy cells. Scientists call this challenge the curse of dimensionality.
Machine learning techniques can model existing patterns within these large datasets and classify samples as cancerous or non-cancerous, but this poses another barrier. Clinicians often cannot see how machine learning models reach their conclusions, so they are hesitant to trust the results. Researchers call this the black box problem. To address it, they aim to develop methods that explain how machine learning models make decisions.
A research team based at multiple institutions in Africa focused on explaining breast cancer model predictions. They downloaded publicly available gene expression data from a global database called The Cancer Genome Atlas. The data included approximately 20,000 genes across 1,208 breast cancer samples. Their goal was to identify a small number of genes out of the 20,000 that could predict whether a tissue sample was cancerous.
First, the researchers narrowed their data down to 3,602 genes that were differentially expressed between breast cancer cells and healthy cells. From there, they used an algorithm to test many gene combinations and select the smallest group of genes that consistently produced good results. It’s like running thousands of small races with different runners to figure out which runners consistently finish at the front, even though everyone eventually reaches the finish line.
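To make the idea of searching for a small, reliable gene set concrete, here is a minimal Python sketch. The article does not name the selection algorithm the researchers used, so this example swaps in greedy forward selection, a simpler stand-in. The data are synthetic, and the function names (`nearest_centroid_accuracy`, `forward_select`), the nearest-centroid classifier, and the `min_gain` threshold are all assumptions made for illustration.

```python
import random

def nearest_centroid_accuracy(train, test, genes):
    """Fit a nearest-centroid classifier using only `genes`; return test accuracy."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in train if t == label]
        centroids[label] = [sum(x[g] for x in rows) / len(rows) for g in genes]
    correct = 0
    for x, t in test:
        # Squared distance from this sample to each class's mean profile.
        dists = {label: sum((x[g] - c) ** 2 for g, c in zip(genes, cen))
                 for label, cen in centroids.items()}
        correct += min(dists, key=dists.get) == t
    return correct / len(test)

def forward_select(train, test, n_genes, min_gain=0.01):
    """Greedily add the gene that most improves accuracy; stop when gains are small."""
    selected, best = [], 0.0
    while len(selected) < n_genes:
        candidates = [g for g in range(n_genes) if g not in selected]
        scores = {g: nearest_centroid_accuracy(train, test, selected + [g])
                  for g in candidates}
        g_best = max(scores, key=scores.get)
        if scores[g_best] - best < min_gain:
            break  # no remaining gene adds enough accuracy to justify a larger set
        selected.append(g_best)
        best = scores[g_best]
    return selected, best

# Synthetic data (hypothetical): 6 "genes", only gene 2 differs between classes.
rng = random.Random(0)
samples = []
for i in range(300):
    label = i % 2  # 0 = healthy, 1 = cancer
    x = [rng.gauss(0.0, 1.0) for _ in range(6)]
    x[2] += 3.0 * label
    samples.append((x, label))
train, test = samples[:200], samples[200:]

selected, score = forward_select(train, test, n_genes=6)
print("selected genes:", selected, "held-out accuracy:", round(score, 3))
```

The stopping rule is what keeps the gene set small: a gene is added only if it raises held-out accuracy by at least `min_gain`. Real pipelines typically use more robust search strategies and cross-validation rather than a single train/test split.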
They then used various machine learning techniques to train and tune multiple models on the expression data of the genes selected by the algorithm. They reported that all models performed well, accurately predicting cancer status at least 98% of the time. Next, they asked two questions: “Which genes make the models work?” and “How do these genes affect predictions?”
They employed four different statistical interpretation methods, known as feature importance techniques, to identify the genes that contribute most to model performance. The first method showed how each model’s predictions change with a gene’s level of expression. The second showed how multiple genes interact to drive model decisions. The third quantified the overall influence of each gene on the model’s decisions, thereby providing a ranked importance. The last assessed how accurately a single gene could predict breast cancer on its own.
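As an illustration of what a feature importance technique does, the sketch below implements permutation importance in plain Python: shuffle one gene's values across samples and measure how much the model's accuracy drops. This is one common technique in this family; the article does not say it was among the four the researchers used. The nearest-centroid classifier, the synthetic two-gene dataset, and all function names are assumptions for illustration. For brevity, importance is measured on the same data used to fit the model, whereas held-out data would normally be used.

```python
import random

def fit(X, y):
    """Nearest-centroid classifier: store the mean expression profile per class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        model[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return model

def predict(model, x):
    """Return the class whose mean profile is closest to sample x."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda label: dist(model[label]))

def accuracy(model, X, y):
    return sum(predict(model, x) == t for x, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each gene's column is shuffled across samples."""
    rng = random.Random(seed)
    base = accuracy(model, X, y)
    drops = []
    for j in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            col = [x[j] for x in X]
            rng.shuffle(col)  # break the link between gene j and the labels
            X_perm = [x[:j] + [v] + x[j + 1:] for x, v in zip(X, col)]
            total += base - accuracy(model, X_perm, y)
        drops.append(total / n_repeats)
    return drops

# Synthetic data (hypothetical): gene 0 separates the classes, gene 1 is noise.
rng = random.Random(1)
X, y = [], []
for i in range(200):
    label = i % 2  # 0 = healthy, 1 = cancer
    X.append([rng.gauss(3.0 * label, 1.0), rng.gauss(0.0, 1.0)])
    y.append(label)

model = fit(X, y)
importances = permutation_importance(model, X, y)
print("importance per gene:", [round(v, 3) for v in importances])
```

A gene the model relies on shows a large accuracy drop when shuffled, while an uninformative gene shows a drop near zero. Ranking genes by this drop gives an ordering comparable in spirit to the ranked importance described above.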
The researchers identified seven genes that consistently appeared across all trained models and feature importance measures. They confirmed that all of these genes have biological functions that can influence cancer growth, such as repairing damaged tissue, controlling the movement of substances in and out of cells, and controlling how cells defend themselves.
The researchers noted that while different models tend to agree on the most important genes, the exact rankings and impact scores may vary. They explained that, for biological data, models often see different slices of the same reality, so combining the perspectives of multiple machine learning models yields better results than relying on a single one.
The researchers highlighted several limitations. The gene selection algorithm took nearly 6 hours on a powerful laptop, which was longer than expected and suggests it may not scale efficiently to larger datasets. They also acknowledged that the algorithm may have omitted some important genes during selection. In addition, despite its large size, their dataset did not fully capture the diversity of breast cancer worldwide, so their models may not perform as well on all samples. The researchers concluded that combining machine learning models with transparent and explainable techniques is the future of cancer prediction, enabling clinical confidence in machine learning recommendations.
