Prediction of molecular properties using language learning

The painstaking trial-and-error process commonly used to discover new materials and drugs can take decades and cost millions of dollars. To streamline it, scientists frequently use machine learning to predict chemical properties and narrow down which molecules to synthesize and test in the lab. Researchers at MIT and the MIT-IBM Watson AI Lab have created a unified framework that both predicts molecular properties and generates new molecules far more efficiently than standard deep learning techniques.

To learn to predict a molecule's biological or mechanical properties, a machine learning model must be shown millions of labeled molecular structures. However, large training datasets of this kind are hard to come by, which often limits how well machine learning techniques work.

In contrast, the MIT researchers’ system can accurately predict molecular properties from very little data. It has an underlying understanding of the rules that govern how building blocks combine to form valid molecules. These rules capture the similarities between molecular structures, which lets the system generate new molecules and predict their properties efficiently. The approach outperformed previous machine learning methods on both small and large datasets.

Minghao Guo, the lead author and a graduate student in Electrical Engineering and Computer Science (EECS), said: “Our goal with this project is to use data-driven techniques to accelerate the discovery of new molecules so that models can be trained to make predictions without all these costly experiments.”

The machine learning system developed by the MIT team automatically learns a molecular “language”, known as a molecular grammar, using only a small domain-specific dataset. It then uses this grammar to construct viable molecules and predict their properties. The system learns the production rules of the grammar through reinforcement learning, a trial-and-error process in which the model is rewarded for behavior that brings it closer to a goal.
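To make the idea of production rules concrete, here is a minimal sketch of how a grammar can generate molecule-like strings. It is an illustration only, not the researchers' method: the rule set `RULES`, the symbol names, and the `generate` function are all hypothetical, the rules are written by hand rather than learned, rules are picked uniformly at random rather than by reinforcement learning, and the grammar works on SMILES-like text rather than on the richer structures a real molecular grammar would need.

```python
import random

# Hypothetical, hand-written production rules (NOT rules learned by the MIT system).
# Keys are nonterminal symbols; every other string is a SMILES-like fragment.
# Branches are only attached to carbon so that valences stay plausible in this toy.
RULES = {
    "MOL": [["CHAIN"]],
    "CHAIN": [
        ["ATOM"],                            # end the chain with one atom
        ["ATOM", "CHAIN"],                   # extend the chain by one atom
        ["C", "(", "CHAIN", ")", "CHAIN"],   # branch off a carbon atom
    ],
    "ATOM": [["C"], ["N"], ["O"]],
}


def generate(symbol="MOL", rng=None, depth=0, max_depth=6):
    """Expand `symbol` into a string by recursively applying production rules."""
    if rng is None:
        rng = random.Random()
    if symbol not in RULES:
        return symbol  # terminal fragment: emit it unchanged
    options = RULES[symbol]
    if depth >= max_depth:
        # Past the depth limit, always take the shortest rule so recursion ends.
        options = [min(options, key=len)]
    expansion = rng.choice(options)
    return "".join(generate(s, rng, depth + 1, max_depth) for s in expansion)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same samples
    for _ in range(5):
        print(generate(rng=rng))
```

Every string this toy produces can be traced back to a sequence of rule applications, which is what makes a grammar a compact description of a whole family of structures. In the system described above, the rules themselves are what is learned from the small dataset.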

For best results, the training dataset for a machine learning model should contain millions of molecules with properties similar to those the researchers hope to discover. In practice, such domain-specific datasets are usually very small. To work around this, researchers typically pretrain a model on a large dataset covering a wide range of compounds and then apply it to a much smaller, targeted dataset. However, these models carry little domain-specific knowledge, so their performance often leaves room for improvement.

In tests, the new system simultaneously generated viable molecules and polymers and predicted their properties more accurately than several well-known machine learning methods, even when the domain-specific datasets contained only a small number of samples. Some of the other systems also required a costly pretraining stage, which the new method avoids.

This method has proven particularly effective at predicting the physical properties of polymers, such as the glass transition temperature and the temperature at which a material changes from solid to liquid. Obtaining this information manually is often costly, as the experiments must be performed at very high temperatures and pressures. To push the approach further, the researchers cut one training set by more than half, to just 94 samples, and their model still produced results on par with approaches trained on the full dataset.

Guo said: “Once we have this grammar as a representation of all the different molecules, we can use it to facilitate the process of property prediction.”

He added: “The representations based on this grammar are very powerful. Also, the grammar itself is a very general representation, so it can be extended to many different kinds of graph-form data. We are trying to identify other uses.”

The researchers hope to extend the molecular grammar to incorporate the 3D geometry of molecules and polymers, which is essential for understanding how polymer chains interact. They are also working on an interface that shows users the learned grammar production rules and asks for feedback to correct rules that may be wrong, improving the accuracy of the system.

The MIT-IBM Watson AI Lab and its member company Evonik funded this research.
