Researchers at MIT are using artificial intelligence to design new proteins that surpass those found in nature.
They have developed a machine learning algorithm that can generate proteins with specific structural features. These proteins could be used to make materials with particular mechanical properties, such as stiffness and elasticity. Such biologically inspired materials could potentially replace materials made from petroleum or ceramics, but with a much smaller carbon footprint.
The researchers from MIT, the MIT-IBM Watson AI Lab, and Tufts University employed a generative model, the same type of machine learning model architecture used in AI systems such as DALL-E 2. But instead of using it to generate images from natural language prompts, as DALL-E 2 does, they adapted the model architecture so it could predict the amino acid sequences of proteins that achieve specific structural goals.
In a paper to be published in Chem, the researchers show how these models can generate realistic yet novel proteins. The models, which learn the biochemical relationships that control how proteins form, can produce new proteins that could enable unique applications, says senior author Markus Buehler.
For example, this tool could be used to develop protein-inspired food coatings that keep produce fresh longer while remaining safe for people to eat. A portfolio will be available soon, he added.
“When you think about designing proteins that nature has yet to discover, it is such a huge design space that it cannot be laid out with just a pencil and paper,” says Buehler, who is also a member of the MIT-IBM Watson AI Lab.
Joining Buehler on the paper are lead author Bo Ni, a postdoc in Buehler’s Laboratory for Atomistic and Molecular Mechanics, and David Kaplan, the Stern Family Professor of Engineering and professor of bioengineering at Tufts University.
Adapting new tools to the task
Proteins are formed by chains of amino acids that fold in 3D patterns. The sequence of amino acids determines the mechanical properties of proteins. Scientists have identified thousands of proteins created by evolution, but estimate that vast numbers of amino acid sequences remain undiscovered.
To streamline protein discovery, researchers recently developed deep learning models that can predict a protein’s 3D structure for a given set of amino acid sequences. However, the inverse problem, predicting an amino acid sequence whose structure meets design goals, has proven even more difficult.
A recent advance in machine learning, attention-based diffusion models, allowed Buehler and his colleagues to tackle this thorny problem.
Attention-based models can learn very long-range relationships. This is key to developing proteins because a single mutation in a long amino acid sequence can make or break the entire design, Buehler says. A diffusion model learns to generate new data through a process that adds noise to the training data and then learns to recover the data by removing the noise. These models are often more effective than other models at producing high-quality, realistic data, and their output can be conditioned to meet a set of target objectives that match a design demand.
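The noising and denoising idea described above can be illustrated with a minimal sketch. It assumes a standard Gaussian diffusion formulation with a linear noise schedule, which the article does not specify; the function names and schedule values are illustrative, and the "denoiser" here is simply handed the true noise rather than a trained network's prediction.

```python
import math
import random


def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction after t steps of a linear noise schedule."""
    product = 1.0
    for step in range(t):
        beta = beta_start + (beta_end - beta_start) * step / (num_steps - 1)
        product *= 1.0 - beta
    return product


def add_noise(x0, noise, a_bar):
    """Forward process: blend clean data with Gaussian noise."""
    return [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
            for x, e in zip(x0, noise)]


def remove_noise(xt, predicted_noise, a_bar):
    """Invert the blend, given an estimate of the noise that was added."""
    return [(x - math.sqrt(1.0 - a_bar) * e) / math.sqrt(a_bar)
            for x, e in zip(xt, predicted_noise)]


rng = random.Random(0)
x0 = [1.0, -0.5, 0.25, 0.0]                # toy "clean" data vector
noise = [rng.gauss(0.0, 1.0) for _ in x0]  # noise injected by the forward process
a_bar = alpha_bar(t=500, num_steps=1000)

xt = add_noise(x0, noise, a_bar)
# A trained network would predict the noise; passing the true noise shows
# that a perfect prediction recovers the original data.
recovered = remove_noise(xt, noise, a_bar)
```

In a real diffusion model the noise estimate comes from a neural network, and conditioning information (here, the structural design goals) is supplied to that network alongside the noisy input.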
The researchers used this architecture to build two machine learning models that can predict a variety of new amino acid sequences that form proteins meeting structural design goals.
“In the biomedical industry, we may not want a completely unknown protein because then we do not know its properties. But with these models you can generate a spectrum and adjust certain knobs to control it,” says Buehler.
Common folding patterns of amino acids, known as secondary structures, produce different mechanical properties. For example, proteins with alpha-helical structures yield stretchy materials, while proteins with beta-sheet structures yield rigid materials. Combining alpha helices and beta sheets can create materials that are both stretchy and strong, like silk.
The researchers developed two models, one that operates on the overall structural properties of the protein and one that operates at the amino acid level. Both models work by combining these secondary structures to generate proteins. For the model that operates on overall structural properties, the user enters the desired percentages of different structures (e.g., 40% alpha helices and 60% beta sheets). The model then generates sequences that satisfy those targets. For the second model, the scientist also specifies the order of the secondary structures along the chain, which allows for more fine-grained control.
The models are connected to an algorithm that predicts protein folding, which the researchers use to determine a generated protein’s 3D structure. They then calculate its resulting properties and check them against the design specifications.
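The two kinds of design input described above can be sketched as plain data. This is only an illustration: the article does not say how the models encode their inputs, so the single-letter labels (H for alpha helix, E for beta strand, following the common DSSP convention), the function names, and the tolerance are all assumptions.

```python
from collections import Counter


def structure_fractions(labels):
    """Fraction of residues assigned to each secondary-structure label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def meets_global_target(labels, target, tolerance=0.05):
    """True if every targeted fraction is within `tolerance` of the request."""
    fractions = structure_fractions(labels)
    return all(abs(fractions.get(label, 0.0) - wanted) <= tolerance
               for label, wanted in target.items())


# Input in the style of the first model: overall percentages only.
global_target = {"H": 0.4, "E": 0.6}

# Input in the style of the second model: one label per residue, so the
# order of helices and strands along the chain is fixed as well.
residue_target = "HHHH" + "EEEEEE"

fractions = structure_fractions(residue_target)
satisfied = meets_global_target(residue_target, global_target)
```

A per-residue specification always implies a set of overall percentages, but not the other way around, which is why the second input offers finer control.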
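The generate, fold, and check sequence described above can be sketched as a simple filtering loop. Everything here is hypothetical: the callables stand in for the generative model, the folding predictor, and the property check, and the toy versions below do not model real protein physics.

```python
def design_loop(generate, fold, meets_spec, num_candidates):
    """Keep generated sequences whose predicted structure passes the check."""
    accepted = []
    for _ in range(num_candidates):
        sequence = generate()          # stand-in for the generative model
        structure = fold(sequence)     # stand-in for the folding predictor
        if meets_spec(structure):      # compare against the design goals
            accepted.append(sequence)
    return accepted


# Toy stand-ins so the sketch runs end to end.
candidates = iter(["AAAA", "AVAV", "VVVV"])


def toy_generate():
    return next(candidates)


def toy_fold(sequence):
    # Pretend alanine (A) forms helix (H) and valine (V) forms strand (E).
    return "".join("H" if residue == "A" else "E" for residue in sequence)


def toy_meets_spec(structure):
    # Hypothetical design goal: at least half of the residues are helical.
    return structure.count("H") / len(structure) >= 0.5


accepted = design_loop(toy_generate, toy_fold, toy_meets_spec, num_candidates=3)
```

Passing the three stages in as functions keeps the loop independent of any particular generator or folding tool.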
Realistic yet novel design
The researchers tested their models by comparing the new proteins to known proteins with similar structural properties. Many had some overlap with existing amino acid sequences, in most cases around 50 to 60 percent, but some sequences were entirely new. The level of similarity suggests that many of the generated proteins can be synthesized, Buehler adds.
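One simple way to express overlap between two sequences is percent identity over aligned positions. The sketch below assumes the sequences are already aligned to the same length; it is not necessarily the measure the researchers used, and the two sequences are made up for illustration.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions at which two aligned sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


# Hypothetical sequences in one-letter amino acid code.
designed = "MKTAYIAKQR"
known = "MKTLYIGKQA"

identity = percent_identity(designed, known)  # 7 of 10 positions match
```

In practice, sequences of different lengths would first be aligned with a dedicated tool before a comparison like this is meaningful.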
To ensure that the predicted proteins were reasonable, the researchers tried to trick the model by entering physically impossible design goals. They were impressed that instead of producing an improbable protein, the model produced the closest synthesizable solution.
“Learning algorithms can find hidden relationships in nature, which gives us confidence that whatever we get from the model is very likely to be realistic,” says Ni.
Next, the researchers plan to experimentally validate some of the new protein designs by making them in a lab. They also want to continue extending and refining the models so they can develop amino acid sequences that meet more criteria.
“For the applications we care about, such as sustainability, medicine, food, health and material design, we need to go beyond what nature has done. It is a new design tool that can be used to help solve some of the most pressing social problems we face,” says Buehler.
This research was supported in part by the MIT-IBM Watson AI Lab, the U.S. Department of Agriculture, the U.S. Department of Energy, the Army Research Office, the National Institutes of Health, and the Office of Naval Research.