The amount of data generated by scientists today is staggering, thanks to the falling cost of sequencing technology and the growing amount of available computing power. But parsing all that data to uncover useful information is like searching for a molecular needle in a haystack. Machine learning (ML) and other artificial intelligence (AI) tools can dramatically speed up data analysis, but most ML tools are difficult for non-ML experts to access and use. Recently, automated machine learning (AutoML) methods have been developed that can automate the design and deployment of ML tools, but they are often very complex and require a facility with ML that few scientists outside the AI field have.
A group of scientists at the Wyss Institute for Biologically Inspired Engineering at Harvard University and at MIT has now filled that unmet need by building a new, comprehensive AutoML platform designed for biologists with little or no ML experience. Their platform, called BioAutoMATED, can use nucleic acid, peptide, or glycan sequences as input data, and its performance is on par with other AutoML platforms while requiring minimal user input. The platform is described in a new paper published in Cell Systems and is available for download from GitHub.
“Our tool is for people who don’t have the ability to build their own custom ML models, and who find themselves asking, ‘I have this amazing dataset, will ML even work on it? How do I get it into an ML model? The complexity of ML is what is stopping me from going further with this dataset, so how do I overcome that?’” said co-first author Jackie Valeri, a graduate student in the lab of Wyss Core Faculty member Jim Collins, Ph.D. “We wanted to make it easier for biologists and experts in other fields to harness the power of ML and AutoML to answer fundamental questions and uncover meaningful biology.”
AutoML for everyone
Like many great ideas, the seed that became BioAutoMATED was planted over lunch, not in the lab. Valeri and co-first author Luis Soenksen, Ph.D., were eating together at one of the Wyss Institute’s dining tables when they realized that, despite the Wyss’s reputation as a world-class destination for biological research, only a handful of the top experts working there were able to build and train the ML models that could greatly benefit their work.
“We decided that something needed to be done about this, because we wanted the Wyss to be at the forefront of the AI biotechnology revolution, and we also wanted the development of these tools to be driven by biologists, for biologists,” said Soenksen, who is a postdoctoral fellow at the Wyss Institute and a serial entrepreneur in science and technology. “Now everyone agrees that AI is the future, but when we had this idea four years ago, it wasn’t so obvious, especially in biological research. It started out as a tool we wanted to build to serve ourselves and our colleagues at the Wyss, but now we know it can do a lot more.”
Various AutoML systems have already been developed to simplify the process of generating ML models from datasets, but they usually have drawbacks. One is that each AutoML tool is designed to look at only one type of model (such as neural networks) when searching for the best solution. This limits the resulting model to a narrow range of possibilities, when in practice an entirely different type of model may be the better choice. Another problem is that most AutoML tools are not specifically designed to take biological sequences as input data. Several tools have been developed that use language models to analyze biological sequences, but these lack automation capabilities and are difficult to use.
To build a robust all-in-one AutoML for biology, the team modified three existing AutoML tools that each take a different approach to model generation. AutoKeras searches for optimal neural networks. DeepSwarm searches for convolutional neural networks using swarm-based algorithms. TPOT searches non-neural-network models using a variety of methods, including genetic programming and self-learning. BioAutoMATED then produces standardized output for all three tools so that users can easily compare them and determine which type of model yields the most useful insights from their data.
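To make the idea of standardized, comparable output concrete, here is a minimal sketch in Python. It is not BioAutoMATED's actual output format: the `SearchResult` record, its field names, and the `rank_results` helper are hypothetical, and the numbers in the demo are placeholders rather than published results.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SearchResult:
    """Hypothetical common record for one AutoML search run."""
    tool: str              # label of the search tool, e.g. "DeepSwarm"
    model_type: str        # kind of model the tool settled on
    score: float           # held-out performance, higher is better
    search_minutes: float  # wall-clock time spent searching


def rank_results(results: List[SearchResult]) -> List[SearchResult]:
    """Sort by score (best first); break ties by shorter search time."""
    return sorted(results, key=lambda r: (-r.score, r.search_minutes))


if __name__ == "__main__":
    # Placeholder values for illustration only, not real benchmark numbers.
    demo = [
        SearchResult("AutoKeras", "neural network", 0.50, 30.0),
        SearchResult("DeepSwarm", "convolutional neural network", 0.60, 20.0),
        SearchResult("TPOT", "non-neural pipeline", 0.40, 10.0),
    ]
    for result in rank_results(demo):
        print(f"{result.tool:10s} {result.model_type:30s} {result.score:.2f}")
```

Because every tool reports the same fields, choosing among very different model families reduces to a single sort.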
The team built BioAutoMATED to take as input DNA, RNA, amino acid, and glycan (sugar molecules found on the surface of cells) sequences of any length, type, or biological function. BioAutoMATED automatically preprocesses the input data and then generates models that can predict biological function from sequence information alone.
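As an illustration of what sequence preprocessing can involve, the sketch below pads sequences to a common length and one-hot encodes them. This is a generic technique, not necessarily the encoding BioAutoMATED applies; the `ALPHABETS` table and the `one_hot_encode` function are hypothetical names. It covers only single-letter alphabets, so glycans, whose building blocks are multi-character tokens, would need a token-based vocabulary instead.

```python
from typing import Dict, List

# Assumed single-letter alphabets; a real platform may use different ones.
ALPHABETS: Dict[str, str] = {
    "dna": "ACGT",
    "rna": "ACGU",
    "protein": "ACDEFGHIKLMNPQRSTVWY",
}


def one_hot_encode(sequences: List[str], alphabet: str) -> List[List[List[int]]]:
    """Pad sequences to the longest length and one-hot encode each position.

    Returns a nested list shaped [n_sequences][max_len][len(alphabet)].
    Padding positions are encoded as all-zero rows.
    """
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    max_len = max((len(seq) for seq in sequences), default=0)
    encoded = []
    for seq in sequences:
        rows = []
        for pos in range(max_len):
            row = [0] * len(alphabet)
            if pos < len(seq):
                symbol = seq[pos].upper()
                if symbol not in index:
                    raise ValueError(f"unexpected symbol {symbol!r} in {seq!r}")
                row[index[symbol]] = 1
            rows.append(row)
        encoded.append(rows)
    return encoded
```

Calling `one_hot_encode(["ACG", "U"], ALPHABETS["rna"])` yields two matrices of three rows each, with the shorter sequence padded by zero rows.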
The platform also has a number of features that help users decide whether they need to collect additional data to improve the quality of the output, learn which features of a sequence the models paid the most attention to (and which are therefore potentially more biologically interesting), and design new sequences for future experiments.
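One generic way to ask which positions a model cares about is in silico mutagenesis: substitute each position in turn and measure how much the prediction moves. The sketch below shows that idea only; it is not claimed to be the interpretation method BioAutoMATED uses. The `position_importance` function and the `toy_predict` stand-in model are hypothetical.

```python
from typing import Callable, List


def position_importance(
    seq: str,
    predict: Callable[[str], float],
    alphabet: str = "ACGU",
) -> List[float]:
    """Mean absolute change in prediction over all single substitutions
    at each position. Larger values suggest a more influential position."""
    baseline = predict(seq)
    scores = []
    for i, original in enumerate(seq):
        deltas = [
            abs(predict(seq[:i] + alt + seq[i + 1:]) - baseline)
            for alt in alphabet
            if alt != original
        ]
        scores.append(sum(deltas) / len(deltas) if deltas else 0.0)
    return scores


def toy_predict(seq: str) -> float:
    """Stand-in for a trained model: counts G's in the first four positions."""
    return float(seq[:4].count("G"))


if __name__ == "__main__":
    print(position_importance("GGAUCC", toy_predict))
```

With the toy model, only the first four positions receive non-zero scores, which is the pattern a real model would show for a region that drives its predictions.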
Nucleotides, peptides, and glycans, oh my!
To test the new framework, the team first used it to explore how changes in the sequence of a stretch of RNA called the ribosome binding site (RBS) affect the efficiency with which a ribosome binds to the RNA and translates it into protein in Escherichia coli bacteria. They fed their sequence data into BioAutoMATED, which identified a model generated by the DeepSwarm algorithm that could accurately predict translation efficiency. This model performed as well as models created by a professional ML expert, but took only 26.5 minutes to generate and required only 10 lines of input code from the user (other models can require 750 or more). They also used BioAutoMATED to identify which regions of the sequence appear to be most important in determining translation efficiency, and to design new sequences that could be tested experimentally.
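Designing new sequences with a trained predictor can be pictured as a search over mutations. The sketch below uses a simple greedy hill climb: in each round it tries every single substitution and keeps the best one if it improves the predicted score. This is an illustrative stand-in, not the design procedure described in the paper; `greedy_design` and `toy_predict` are hypothetical.

```python
from typing import Callable, Tuple


def greedy_design(
    seed: str,
    predict: Callable[[str], float],
    alphabet: str = "ACGU",
    max_rounds: int = 10,
) -> Tuple[str, float]:
    """Greedy single-substitution hill climbing on a predicted score."""
    best_seq, best_score = seed, predict(seed)
    for _ in range(max_rounds):
        # Enumerate every sequence one substitution away from the current best.
        candidates = [
            best_seq[:i] + alt + best_seq[i + 1:]
            for i in range(len(best_seq))
            for alt in alphabet
            if alt != best_seq[i]
        ]
        if not candidates:
            break
        top = max(candidates, key=predict)
        top_score = predict(top)
        if top_score <= best_score:
            break  # local optimum: no single substitution helps
        best_seq, best_score = top, top_score
    return best_seq, best_score


def toy_predict(seq: str) -> float:
    """Stand-in for a trained model: fraction of positions that are G."""
    return seq.count("G") / len(seq) if seq else 0.0


if __name__ == "__main__":
    print(greedy_design("AAUC", toy_predict))
```

Any sequence proposed this way is only as good as the model behind it, so the candidates still need to be tested experimentally.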
They then fed peptide and glycan sequence data into BioAutoMATED and used the results to answer specific questions about those sequences. The system generated highly accurate information about which amino acids within a peptide sequence are most important in determining the ability of an antibody to bind to the drug ranibizumab (Lucentis), and also classified different types of glycans into immunogenic and non-immunogenic groups based on their sequences. The team also used it to optimize the sequences of RNA-based toehold switches, informing the design of new toehold switches for experimental testing with minimal input coding from the user.
“Ultimately, we were able to show that BioAutoMATED helps people 1) recognize patterns in biological data, 2) ask better questions about that data, and 3) answer those questions quickly, all within a single framework and without having to become ML experts themselves,” said Katie Collins, a graduate student at the University of Cambridge who worked on the project as an undergraduate at MIT.
Models predicted using BioAutoMATED, as with any other ML tool, should be experimentally validated in the lab whenever possible. But the team hopes that it will be further integrated into the ever-growing set of AutoML tools, and that one day its functionality may extend beyond biological sequences to any sequence-like object, such as fingerprints.
“Machine learning and artificial intelligence tools have been around for a long time, but it is only with the recent development of user-friendly interfaces that they have exploded in popularity, as in the case of ChatGPT,” said Jim Collins, who is also the Termeer Professor of Medical Engineering and Science at the Massachusetts Institute of Technology. “We hope that BioAutoMATED will enable the next generation of biologists to discover the underpinnings of life more quickly and easily.”
“Making these platforms accessible to non-experts is critical to unlocking the full potential of ML technologies to solve long-standing problems in biology and beyond. This advance by the Collins team is a major step forward in making AI a key collaborator for biologists and bioengineers,” said Wyss Founding Director Don Ingber, M.D., Ph.D., who is also the Judah Folkman Professor of Vascular Biology at Harvard Medical School and Boston Children’s Hospital, and the Hansjörg Wyss Professor of Bioinspired Engineering at the Harvard John A. Paulson School of Engineering and Applied Sciences (SEAS).
Other authors on the paper include George Cai of the Wyss Institute and Harvard Medical School; former Wyss Institute members Pradeep Ramesh, Rani Powers, Nicolaas Angenent-Mari, and Diogo Camacho; and Felix Wong and Timothy Lu from MIT.
This work was supported by the Defense Threat Reduction Agency (grant HDTRA-12210032), the DARPA SD2 program, the Paul G. Allen Frontiers Group, the Wyss Institute for Biologically Inspired Engineering, an MIT-Takeda Fellowship, CONACyT grant 342369/408970, and an MIT-TATA Center Fellowship (2748460).