Learning the language of molecules to predict their properties

Discovering new materials and drugs typically requires a manual trial-and-error process that can take decades and cost millions of dollars. To streamline this process, scientists often use machine learning to predict the properties of molecules to narrow down the molecules that need to be synthesized and tested in the lab.

Researchers at MIT and the MIT-IBM Watson AI Lab have developed a new, unified framework that can simultaneously predict molecular properties and generate new molecules much more efficiently than popular deep-learning approaches.

To teach a machine learning model to predict a molecule’s biological or mechanical properties, researchers must show it millions of labeled molecular structures, a process known as training. Because molecular discovery is expensive and hand-labeling millions of structures is difficult, large training datasets are often hard to obtain, which limits the effectiveness of machine learning approaches.

In contrast, the system created by MIT researchers can effectively predict molecular properties using only a small amount of data. Their system has a fundamental understanding of the rules that determine how the building blocks combine to produce effective molecules. These rules capture similarities between molecular structures and help the system generate new molecules and predict their properties in a data-efficient manner.

The method outperformed other machine learning approaches on both small and large datasets, and it was able to accurately predict molecular properties and generate viable molecules even when given a dataset with fewer than 100 samples.

“Our goal with this project is to use data-driven methods to accelerate the discovery of new molecules, so that models can be trained to make predictions without all these costly experiments,” said lead author Minhao Guo, a graduate student in electrical engineering and computer science (EECS).

Guo’s co-authors include MIT-IBM Watson AI Lab research staff members Veronika Thost, Payel Das, and Jie Chen; recent MIT graduates Samuel Song ’23 and Aditya Balachandran ’23; and senior author Wojciech Matusik, a professor of electrical engineering and computer science, a member of the MIT-IBM Watson AI Lab, and head of the Computational Design and Fabrication Group within the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). The research will be presented at the International Conference on Machine Learning.

Learning the language of molecules

Achieving the best results with a machine learning model requires a training dataset containing millions of molecules with properties similar to those the scientist wants to discover. In practice, these domain-specific datasets are usually very small. As such, researchers use a model pre-trained on a large dataset of common molecules and apply it to a much smaller, targeted dataset. However, these models do not capture much domain-specific knowledge and tend to perform poorly.

The MIT team took a different approach. They created a machine learning system that automatically learns the “language” of molecules (known as a molecular grammar) using only a small, domain-specific dataset. The system uses this grammar to construct viable molecules and predict their properties.

In language theory, words, sentences, or paragraphs are generated from a set of grammar rules. A molecular grammar can be thought of in the same way: it is a set of production rules that specify how atoms and substructures are combined to produce molecules or polymers.

A single molecular grammar can represent a huge number of molecules, just like a linguistic grammar can generate a large number of sentences using the same rules. Molecules with similar structures use the same grammatical production rules, and the system learns to understand these similarities.
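As a rough illustration of how a handful of production rules can cover many structures, here is a minimal sketch of a toy grammar. The rule names (`Mol`, `Chain`, `Unit`), the three-atom alphabet, and the `expand` helper are all assumptions made for this example; real molecular grammars operate on molecular graphs rather than strings, and this is not the researchers' grammar.

```python
# Hypothetical toy grammar: each nonterminal maps to its alternative expansions.
# Symbols that do not appear as keys ("C", "O", "N") are terminals (atoms).
RULES = {
    "Mol": [["Chain"]],
    "Chain": [["Unit"], ["Unit", "Chain"]],
    "Unit": [["C"], ["O"], ["N"]],
}


def expand(symbols, max_steps):
    """Yield every string derivable from `symbols` in at most `max_steps` rule applications."""
    for i, symbol in enumerate(symbols):
        if symbol in RULES:  # leftmost nonterminal
            if max_steps == 0:
                return  # budget exhausted before reaching a finished string
            for rhs in RULES[symbol]:
                yield from expand(symbols[:i] + rhs + symbols[i + 1:], max_steps - 1)
            return
    yield "".join(symbols)  # no nonterminals left: a finished "molecule"


molecules = list(expand(["Mol"], max_steps=5))
print(len(molecules), molecules[:4])
```

With a budget of five rule applications, the sketch enumerates only the one- and two-atom strings. Raising the budget grows the set quickly while the rule set stays the same size.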

Structurally similar molecules often have similar properties, so the system uses basic knowledge of molecular similarity to more efficiently predict the properties of new molecules.

“Once we have this grammar as a representation of all the different molecules, we can use it to drive the process of property prediction,” says Guo.
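One way to picture how shared production rules could drive property prediction is a similarity-weighted average over labeled molecules. The sketch below is a stand-in technique, not the researchers' model: the rule identifiers (`r1`, `r2`, ...), the property values, and the `jaccard` and `predict` helpers are all made up for illustration.

```python
def jaccard(rules_a, rules_b):
    """Similarity of two molecules, measured by the production rules they share."""
    a, b = set(rules_a), set(rules_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


def predict(query_rules, labeled):
    """Similarity-weighted average of the known property values."""
    weighted = [(jaccard(query_rules, rules), value) for rules, value in labeled]
    total = sum(weight for weight, _ in weighted)
    if total == 0:  # nothing similar: fall back to the plain mean
        return sum(value for _, value in weighted) / len(weighted)
    return sum(weight * value for weight, value in weighted) / total


# Made-up labeled examples: (rules used to build the molecule, property value).
labeled = [
    (["r1", "r2", "r3"], 10.0),
    (["r1", "r2", "r4"], 12.0),
    (["r5", "r6"], 50.0),
]
prediction = predict(["r1", "r2", "r3", "r4"], labeled)
print(prediction)
```

The query shares most of its rules with the first two examples and none with the third, so the prediction is pulled toward their values and ignores the unrelated molecule.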

The system learns the production rules of a molecular grammar using reinforcement learning, a trial-and-error process in which the model is rewarded for behavior that moves it closer to achieving a goal.
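The trial-and-error idea can be sketched with a very small bandit-style learner: greedy action selection with optimistic initial estimates, one of the simplest forms of reinforcement learning. The candidate rules, their reward values, and the `learn` function are hypothetical. The article does not describe the reward or the algorithm the researchers actually use.

```python
# Hypothetical rewards: how useful each candidate rule turns out to be.
# In a real system the reward would come from evaluating the learned grammar.
REWARDS = {"ruleA": 0.2, "ruleB": 0.9, "ruleC": 0.5}


def learn(steps, initial_estimate=1.0):
    """Greedy trial and error with optimistic initial value estimates."""
    estimates = {action: initial_estimate for action in REWARDS}
    counts = {action: 0 for action in REWARDS}
    for _ in range(steps):
        action = max(estimates, key=estimates.get)  # try what currently looks best
        reward = REWARDS[action]                    # observe the outcome
        counts[action] += 1
        # Move the estimate toward the observed reward (running average).
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


estimates, counts = learn(steps=10)
best = max(estimates, key=estimates.get)
print(best, counts)
```

Because every estimate starts above any achievable reward, each candidate is tried once before the learner settles on the one that pays best.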

However, because there could be billions of ways to combine atoms and substructures, learning the grammar’s production rules would be too computationally expensive for anything but the smallest datasets.

To get around this, the researchers split the molecular grammar into two parts. The first part, called a metagrammar, is a general, widely applicable grammar that they design manually and give to the system at the outset. The system then only needs to learn a much smaller, molecule-specific grammar from the domain dataset. This hierarchical approach speeds up the learning process.
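The two-part split can be pictured as a fixed set of generic rules combined with a small set of rules derived from the data. In the sketch below, `META_GRAMMAR`, `learn_specific_rules`, and the fragment-chain dataset are assumptions for illustration. Collecting the fragments that occur in the data stands in for the learning step and is not the researchers' procedure.

```python
# Hand-designed, dataset-independent rules (the "metagrammar" role in this sketch).
META_GRAMMAR = {"Mol": [["Frag"], ["Frag", "-", "Mol"]]}


def learn_specific_rules(dataset):
    """Derive dataset-specific rules; here simply the fragments that occur in the data."""
    fragments = sorted({frag for mol in dataset for frag in mol.split("-")})
    return {"Frag": [[frag] for frag in fragments]}


# Made-up domain dataset of fragment chains, for illustration only.
dataset = ["CH3-CH2-OH", "CH3-OH"]
grammar = {**META_GRAMMAR, **learn_specific_rules(dataset)}
print(grammar["Frag"])
```

Only the `Frag` rules depend on the dataset, so moving to a new domain means relearning that small part while the generic rules are reused unchanged.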

Big result, small dataset

In experiments, the researchers’ new system simultaneously generated viable molecules and polymers and predicted their properties more accurately than several common machine learning approaches, even when the domain-specific datasets contained only a few hundred samples. Several of the other methods also required a costly pre-training step, which the new system avoids.

The technique was particularly useful for predicting physical properties of polymers, such as the glass transition temperature, the temperature at which a polymer changes from a hard, glassy state to a softer, rubbery one. Because the experiments require very high temperatures and pressures, obtaining this information manually is often very costly.

To push their approach even further, the researchers cut one training set by more than half, to just 94 samples. Their model still achieved results comparable to those of methods trained on the entire dataset.

“This grammar-based representation is very powerful, and because the grammar itself is a very general representation, it can be extended to many different types of graph-structured data. We’re trying to identify other uses,” Guo said.

In the future, the researchers also aim to extend the current molecular grammar to include the 3D geometry of molecules and polymers, which is key to understanding interactions between polymer chains. They are also developing an interface that would show users the learned grammar production rules and solicit feedback to correct rules that may be wrong, increasing the accuracy of the system.

This research is funded, in part, by the MIT-IBM Watson AI Lab and its member company Evonik. Paper: “Hierarchical Grammar-Induced Geometry for Data-Efficient Molecular Property Prediction”
