Machine learning now available to all biologists

By Lindsey Brownell

(Boston) - Thanks to the declining cost of sequencing technology and the growing amount of available computing power, the amount of data generated by scientists today is staggering. But parsing all that data to reveal useful information is like looking for a molecular needle in a haystack. Machine learning (ML) and other artificial intelligence (AI) tools can dramatically speed up data analysis, but most ML tools are difficult for non-ML experts to access and use. Recently, automated machine learning (AutoML) methods have been developed that can automate the design and deployment of ML tools, but they are often very complex and require a facility with ML that few scientists outside the AI field have.

A group of scientists from the Wyss Institute for Biologically Inspired Engineering at Harvard University and from MIT has now filled this unmet need by building a new, comprehensive AutoML platform designed for biologists with little or no ML experience. Their platform, called BioAutoMATED, can use nucleic acid, peptide, or glycan sequences as input data, and its performance is on par with other AutoML platforms while requiring minimal user input. The platform is described in a new paper published in Cell Systems and is available for download on GitHub.

“Our tool is for people who don’t have the ability to build their own custom ML models, who find themselves asking questions like, ‘I have this amazing dataset, will ML even work on it? How do I get it into an ML model? The complexity of ML is what is stopping me from going further with this dataset, so how do I overcome that?’” said Jackie Valeri, a graduate student in the lab of Wyss Core Faculty member Jim Collins, Ph.D. “We wanted to make it easier for biologists and experts in other fields to harness the power of ML and AutoML to answer fundamental questions and reveal meaningful biology.”

AutoML for everyone

Like many great ideas, the seed that became BioAutoMATED was planted over lunch, not in the lab. Valeri and co-first author Luis Soenksen, Ph.D., were eating together at one of the Wyss Institute’s dining tables when they realized that, despite the Institute’s reputation as a world-class destination for biological research, only a handful of the top experts working there had the ability to build and train ML models that could greatly benefit their work.

“We decided that something needed to be done about this, because we wanted the Wyss to be at the forefront of the AI biotechnology revolution, and we also wanted the development of these tools to be driven by biologists, for biologists,” said Soenksen, who is a postdoctoral fellow at the Wyss Institute and a serial entrepreneur in science and technology. “Now everyone agrees that AI is the future, but when we had this idea four years ago, it wasn’t so obvious, especially in biological research. It started out as a tool we wanted to build to serve ourselves and our colleagues at the Wyss, but now we know it can do a lot more.”

Various AutoML systems have already been developed to simplify the process of generating ML models from datasets, but they usually have drawbacks. Among them is the fact that each AutoML tool is designed to look at only one type of model (such as neural networks) when searching for the best solution. This limits the resulting model to a narrow range of possibilities, when in practice a completely different type of model may be a better fit. Another problem is that most AutoML tools are not specifically designed to take biological sequences as input data. Several tools have been developed that use language models to analyze biological sequences, but these lack automation capabilities and are difficult to use.

To build a robust all-in-one AutoML for biology, the team modified three existing AutoML tools that each use a different approach to model generation. AutoKeras searches for optimal neural networks. DeepSwarm searches for convolutional neural networks using swarm-based algorithms. TPOT searches non-neural networks using various methods such as genetic programming and self-learning. BioAutoMATED then produces standardized output results for all three tools so users can easily compare them and determine which type yields the most useful insights from their data.
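The value of standardized output is that models from different search tools can be ranked with one metric on the same held-out data. The sketch below illustrates that idea only. It is not BioAutoMATED code: the tool labels, the held-out values, and the predictions are all hypothetical toy numbers, and R^2 is simply one common choice of regression metric.

```python
# Illustrative sketch only: labels and numbers are hypothetical toy values,
# not BioAutoMATED output or results from the paper.

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def compare_models(y_true, predictions_by_tool):
    """Score every tool with the same metric and return them best-first."""
    scores = {
        tool: r_squared(y_true, preds)
        for tool, preds in predictions_by_tool.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical held-out measurements and per-tool predictions.
held_out = [0.1, 0.4, 0.5, 0.9]
toy_predictions = {
    "tool_a": [0.2, 0.3, 0.6, 0.8],
    "tool_b": [0.1, 0.4, 0.6, 0.9],
    "tool_c": [0.5, 0.5, 0.5, 0.5],
}

if __name__ == "__main__":
    for tool, score in compare_models(held_out, toy_predictions):
        print(f"{tool}: R^2 = {score:.3f}")
```

Because every tool is scored by the same function on the same data, the ranking does not depend on which search strategy produced the model.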

The team built BioAutoMATED to take as input DNA, RNA, amino acid, and glycan (sugar molecules found on the surfaces of cells) sequences of any length, type, or biological function. BioAutoMATED automatically preprocesses the input data and then generates models that can predict biological function from sequence information alone.
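The article does not describe the preprocessing itself. As a rough illustration of what turning sequences into model input can involve, the sketch below uses one-hot encoding with zero-padding, a common approach for nucleotide data. The function names are hypothetical, and this should not be read as BioAutoMATED's actual pipeline.

```python
# Illustrative sketch only: a generic one-hot encoder for nucleotide
# sequences, not BioAutoMATED's actual preprocessing code.
NUCLEOTIDES = "ACGT"


def one_hot_encode(sequence, length=None):
    """Return one row per position; each row has a single 1 for A, C, G, or T.

    RNA input is handled by mapping U to T. Positions past the end of the
    sequence (padding) and unknown letters such as N are all-zero rows.
    """
    sequence = sequence.upper().replace("U", "T")
    length = len(sequence) if length is None else length
    index = {letter: i for i, letter in enumerate(NUCLEOTIDES)}
    rows = []
    for pos in range(length):
        row = [0] * len(NUCLEOTIDES)
        if pos < len(sequence) and sequence[pos] in index:
            row[index[sequence[pos]]] = 1
        rows.append(row)
    return rows


def encode_batch(sequences):
    """Pad every sequence to the longest one so the batch is rectangular."""
    longest = max(len(s) for s in sequences)
    return [one_hot_encode(s, length=longest) for s in sequences]
```

Padding to a common length is what lets sequences of different lengths share one fixed-size input; peptides or glycans would need a different alphabet.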

The platform also has a number of features that help users decide whether additional data should be collected to improve the quality of the output, learn which sequence features the models paid the most attention to (and which may therefore be of greater biological interest), and design new sequences for future experiments.

Nucleotides, peptides, and glycans, oh my!

To test the new framework, the team first used it to explore how changing the sequence of a stretch of RNA called the ribosome binding site (RBS) affects the efficiency with which ribosomes bind the RNA and translate it into protein in Escherichia coli. They fed the sequence data into BioAutoMATED, which identified a model generated by the DeepSwarm algorithm that could accurately predict translation efficiency. This model performed as well as models created by a professional ML expert, but took only 26.5 minutes to generate and required only 10 lines of input code from the user (other models can require more than 750). They also used BioAutoMATED to identify which regions of the sequence appeared to be most important in determining translation efficiency, and to design new sequences that could be tested experimentally.
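One generic way to ask which regions of a sequence matter most to a model is in-silico mutagenesis: substitute each position in turn and record how far the prediction moves. The sketch below shows that technique under loud assumptions. The scoring function is a made-up stand-in for a trained model (it counts matches to the Shine-Dalgarno consensus AGGAGG), and nothing here is taken from BioAutoMATED's own interpretation code.

```python
# Illustrative sketch only: `toy_score` is a hypothetical stand-in for a
# trained model, and this is not BioAutoMATED's interpretation code.
NUCLEOTIDES = "ACGU"
MOTIF = "AGGAGG"  # Shine-Dalgarno consensus, used only to build a toy scorer


def toy_score(sequence):
    """Highest count of position-wise matches to MOTIF over all windows."""
    best = 0
    for start in range(len(sequence) - len(MOTIF) + 1):
        window = sequence[start:start + len(MOTIF)]
        best = max(best, sum(a == b for a, b in zip(window, MOTIF)))
    return best


def position_importance(sequence, predict):
    """Largest absolute prediction change from any single substitution,
    computed separately for each position in the sequence."""
    baseline = predict(sequence)
    importance = []
    for pos, original in enumerate(sequence):
        changes = [
            abs(predict(sequence[:pos] + base + sequence[pos + 1:]) - baseline)
            for base in NUCLEOTIDES
            if base != original
        ]
        importance.append(max(changes))
    return importance
```

With this toy scorer, positions inside the motif receive nonzero importance and flanking positions receive zero. A real analysis would pass a trained model's prediction function in place of `toy_score`.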

The team then fed peptide and glycan sequence data into BioAutoMATED and used the results to answer specific questions about those sequences. The system generated highly accurate information about which amino acids within a peptide sequence are most important in determining an antibody’s ability to bind to the drug ranibizumab (Lucentis), and also classified different types of glycans into immunogenic and non-immunogenic groups based on their sequences. The team also used it to optimize the sequences of RNA-based toehold switches and to inform the design of new toehold switches for experimental testing, with minimal input coding from the user.

“Ultimately, we were able to show that BioAutoMATED helps people 1) recognize patterns in biological data, 2) ask better questions about that data, and 3) answer those questions quickly, all within a single framework and without having to become ML experts themselves,” said Katie Collins, now a graduate student at the University of Cambridge, who worked on the project while an undergraduate at MIT.

Any model predicted using BioAutoMATED, as with any other ML tool, should be experimentally validated in the lab whenever possible. But the team hopes that it can be further integrated into the ever-growing set of AutoML tools, and that one day its functionality may extend beyond biological sequences to any sequence-like object, such as fingerprints.

“Machine learning and artificial intelligence tools have been around for a long time, but it is only with the recent development of user-friendly interfaces that they have exploded in popularity, as in the case of ChatGPT,” said Jim Collins, who is also the Termeer Professor of Medical Engineering and Science at the Massachusetts Institute of Technology. “We hope that BioAutoMATED will enable the next generation of biologists to discover the basis of life more quickly and easily.”

“Making these platforms accessible to non-experts is critical to unlocking the full potential of ML technologies to solve long-standing problems in biology and beyond. This advance by the Collins team is a major step forward in making AI a key collaborator for biologists and bioengineers,” said Wyss Founding Director Don Ingber, M.D., who is also the Judah Folkman Professor of Vascular Biology at Harvard Medical School and Boston Children’s Hospital, and the Hansjörg Wyss Professor of Bioinspired Engineering at the Harvard John A. Paulson School of Engineering and Applied Sciences (SEAS).

Other authors of the paper include George Cai of the Wyss Institute and Harvard Medical School; former Wyss Institute members Pradeep Ramesh, Rani Powers, Nicolaas Angenent-Mari, and Diogo Camacho; and Felix Wong and Timothy Lu from MIT.

This work was supported by the Defense Threat Reduction Agency (grant HDTRA-12210032), the DARPA SD2 Program, the Paul G. Allen Frontiers Group, the Wyss Institute for Biologically Inspired Engineering, an MIT-Takeda Fellowship, CONACyT grant 342369/408970, and an MIT-TATA Center Fellowship (2748460).
