SAN FRANCISCO, Calif.—May 31, 2023—Researchers at Gladstone Institutes, the Broad Institute of MIT and Harvard, and the Dana-Farber Cancer Institute have turned to artificial intelligence (AI) to understand how large networks of interconnected human genes control the function of cells, and how disruptions of those networks lead to disease.
A large language model, also called a foundation model, is an AI system that learns basic knowledge from large amounts of general data and applies that knowledge to accomplish new tasks (a process called transfer learning). These systems have recently received mainstream attention with the release of ChatGPT, a chatbot built on OpenAI’s model.
In the new study, published in Nature, Gladstone Assistant Investigator Christina Theodoris, MD, PhD, developed a foundation model for understanding how genes interact. The new model, called Geneformer, learns from large amounts of data on gene interactions from a wide range of human tissues and transfers this knowledge to predict how things might go wrong in disease.
Theodoris and her team used Geneformer to reveal how heart cells malfunction in heart disease. However, this method can also address many other cell types and diseases.
“Geneformer has broad applications across many areas of biology, including finding potential drug targets for disease,” said Theodoris, who is also an assistant professor of pediatrics at the University of California, San Francisco. “This approach will greatly advance our ability to design network-correcting therapies in diseases where progress is hampered by limited data.”
Theodoris designed Geneformer as a postdoctoral researcher working with X. Shirley Liu, PhD, former director of the Center for Functional Cancer Epigenetics at the Dana-Farber Cancer Institute, and Patrick Ellinor, MD, PhD, director of the Broad Institute’s Cardiovascular Disease Initiative. Both are authors of the new study.
A Network View
Activation of many genes initiates a cascade of molecular activity that triggers other genes to increase or decrease their activity. Some of these genes then either affect other genes or loop back and put the brakes on the first gene. So when scientists sketch out connections between dozens of related genes, the resulting network map often looks like a tangled web.
If mapping just a handful of genes this way is messy, trying to understand the connections among all 20,000 genes in the human genome is a formidable challenge. But such a large-scale network map would give researchers insight into how entire gene networks are altered by disease, and how to reverse those changes.
“If a drug targets a peripheral gene in the network, it may have a subtle effect on cell function or just manage disease symptoms,” Theodoris said. “But by restoring normal levels of genes that play a central role in the network, we can treat the underlying disease process and have a greater impact.”
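As a rough illustration of the difference between central and peripheral genes, the sketch below counts how many other genes each gene can influence downstream in a small made-up network. The gene names, the edges, and the use of downstream reach as a stand-in for how central a gene is are all assumptions for illustration; this is not how Geneformer itself analyzes networks.

```python
from collections import deque

# Hypothetical regulatory edges: each gene maps to the genes it influences.
network = {
    "GENE_A": ["GENE_B", "GENE_C"],
    "GENE_B": ["GENE_D"],
    "GENE_C": ["GENE_D", "GENE_A"],  # GENE_C loops back onto GENE_A
    "GENE_D": ["GENE_E"],
    "GENE_E": [],                    # peripheral: influences nothing further
}


def downstream(network, start):
    """Return every gene reachable from `start` by following edges (breadth-first)."""
    seen = set()
    queue = deque([start])
    while queue:
        gene = queue.popleft()
        for target in network.get(gene, []):
            # Skip genes already visited, and do not count the start gene itself.
            if target not in seen and target != start:
                seen.add(target)
                queue.append(target)
    return seen


# Rank genes by how many others they can affect, most influential first.
reach = {gene: len(downstream(network, gene)) for gene in network}
for gene, count in sorted(reach.items(), key=lambda item: -item[1]):
    print(gene, count)
```

In this toy network, changing GENE_A can ripple through to all four other genes, while changing GENE_E affects nothing else, which mirrors the contrast Theodoris draws between central and peripheral drug targets.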
Artificial Intelligence “Transfer Learning”
To map genetic networks, researchers typically rely on huge datasets containing many similar cells. They use a subset of AI systems called machine learning platforms to unravel patterns in data. For example, a machine learning algorithm can be trained on a large number of samples from patients with and without heart disease to learn genetic network patterns that distinguish between diseased and healthy samples.
However, standard machine learning models in biology are trained to accomplish only a single task. For the model to perform another task, it must be retrained from scratch on new data. So if the researchers in the first example wanted to distinguish diseased kidney, lung, or brain cells from their healthy counterparts, they would have to start over and train new algorithms with data from those tissues.
The problem is that for some diseases there is not enough existing data to train these machine learning models.
In the new study, Theodoris, Ellinor, and their colleagues tackled this problem by using a machine learning technique called “transfer learning” to train Geneformer as a foundation model whose core knowledge can be transferred to new tasks.
First, they “pretrained” Geneformer by feeding it data on the activity levels of genes in approximately 30 million cells from a wide range of human tissues, giving it a fundamental understanding of how genes interact.
To demonstrate that the transfer learning approach is working, the scientists fine-tuned Geneformer to make predictions about the relationships between genes, or whether reduced levels of certain genes would cause disease. Geneformer was able to make these predictions with much higher accuracy than other approaches due to the underlying knowledge gained during the pre-training process.
Furthermore, Geneformer was able to make accurate predictions even when very few examples of relevant data were presented.
“This means Geneformer can be applied to predictions in diseases where research progress has been slowed by a lack of access to large enough datasets, such as rare diseases and diseases that affect tissues that are difficult to sample in the clinic,” said Theodoris.
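The two-stage pattern described above (learn general knowledge once from a large corpus, then adapt with only a few labeled examples) can be sketched in miniature. The code below is a deliberately simple stand-in, not Geneformer’s architecture or training procedure: the “pretraining” step only learns each gene’s typical activity level, and the “fine-tuning” step is a nearest-centroid classifier. The gene names, the numbers, and all function names are hypothetical.

```python
from statistics import mean


def pretrain(corpus):
    """Stage 1: learn each gene's typical activity from a large unlabeled corpus."""
    genes = sorted(corpus[0])
    return {gene: mean(cell[gene] for cell in corpus) for gene in genes}


def embed(cell, typical):
    """Rescale a cell's gene activity by the typical levels learned in stage 1."""
    return [cell[gene] / typical[gene] for gene in sorted(typical)]


def fine_tune(labeled_cells, typical):
    """Stage 2: build one centroid per label from a handful of labeled cells."""
    grouped = {}
    for cell, label in labeled_cells:
        grouped.setdefault(label, []).append(embed(cell, typical))
    return {
        label: [mean(column) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(cell, typical, centroids):
    """Assign the label whose centroid is closest to the rescaled cell."""
    vector = embed(cell, typical)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Hypothetical data. Three cells stand in for a corpus of millions.
corpus = [
    {"HOUSEKEEPING": 800, "MARKER": 1},
    {"HOUSEKEEPING": 1200, "MARKER": 3},
    {"HOUSEKEEPING": 1000, "MARKER": 2},
]
# Only one labeled example per class is available for the new task.
few_labeled = [
    ({"HOUSEKEEPING": 900, "MARKER": 1}, "healthy"),
    ({"HOUSEKEEPING": 1100, "MARKER": 3}, "diseased"),
]

typical = pretrain(corpus)
centroids = fine_tune(few_labeled, typical)
new_cell = {"HOUSEKEEPING": 1090, "MARKER": 1.1}
print(predict(new_cell, typical, centroids))  # prints: healthy
```

Without the rescaling learned in stage 1, the same classifier would label the new cell “diseased”, because the raw values of the highly active gene swamp the informative one. That is the sense in which knowledge from the large corpus carries over to a task with very little labeled data.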
Lessons for Heart Disease
Theodoris’ team then set out to use transfer learning to advance discoveries in heart disease. They first asked Geneformer to predict which genes would adversely affect the development of cardiomyocytes, the muscle cells of the heart.
Of the top genes identified by this model, many were already associated with heart disease.
“The fact that the model predicted genes that were already known to be very important for heart disease gave us additional confidence that the model could make accurate predictions,” says Theodoris.
However, other potentially important genes identified by Geneformer, such as the gene TEAD4, have so far not been associated with heart disease. And when the researchers removed TEAD4 from heart muscle cells in the lab, the cells could no longer beat as hard as healthy cells.
Geneformer had thus used transfer learning to reach a new conclusion: although it was given no information on cells lacking TEAD4, it correctly predicted the critical role that TEAD4 plays in cardiomyocyte function.
Finally, the research group asked Geneformer to predict which genes should be targeted to make diseased cardiomyocytes resemble healthy cells at the gene network level. When the researchers tested two of the proposed targets in cells affected by cardiomyopathy (a disease of the heart muscle), using CRISPR gene-editing technology to remove the predicted genes, they found that the beating ability of the diseased cardiomyocytes was indeed restored.
“In the process of learning what a normal gene network looks like and what a diseased gene network looks like, Geneformer was able to figure out which features can be targeted to switch between the healthy and diseased states,” says Theodoris. “The transfer learning approach allowed us to overcome the challenge of limited patient data and efficiently identify potential proteins to target with drugs in diseased cells.”
“The advantage of using Geneformer is that we can predict which genes will help cells switch between healthy and diseased states,” says Ellinor. “In our lab at the Broad Institute, we were able to validate these predictions in cardiomyocytes.”
The researchers plan to expand the number and types of cells analyzed by Geneformer to continue to enhance its ability to analyze gene networks. They also open-sourced the model for other scientists to use.
“The standard approach would be to retrain a model from scratch for each new application,” says Theodoris. “What’s really interesting about our approach is that Geneformer’s fundamental knowledge of gene networks can now be transferred to answer many biological questions. I look forward to seeing what other people do with it.”
