Neural scaling cannot accurately predict the bond dissociation energy of the H₂ molecule

Machine Learning


In pursuit of increasingly accurate molecular simulations, the machine learning community is building ever larger foundation models in the hope that scale alone will unlock transferable predictive power. Siwoo Lee of Princeton University's chemistry and bioengineering department, Ajuboen of the computer science department, and colleagues tested this assumption by systematically scaling both model capacity and training data using quantum chemical calculations. Their research focuses on predicting the bond dissociation energy of H₂, the simplest molecule, and it reveals surprising limitations. Models trained only on stable molecular structures fail to capture even the basic shape of the energy curve, resulting in poor performance. Importantly, even the largest models trained on a wide range of datasets struggle to reproduce the simple repulsive energy curve expected from the interaction of two protons. This suggests that increasing scale alone does not guarantee reliable chemical modeling, and that a deeper grounding in the underlying laws of physics is essential.

Advances in machine learning for molecules and materials

Machine learning research is advancing rapidly in chemistry and materials science, providing new tools for understanding and predicting the behavior of molecules and materials. This progress includes methods such as quantum machine learning, which combines the principles of quantum chemistry with machine learning algorithms. Researchers have developed machine learning force fields, which replace traditional methods for calculating interatomic forces with models trained on quantum mechanical data. Neural networks, particularly convolutional and graph neural networks, are central to these efforts because they are effective at analyzing the spatial and structural data specific to molecular systems.

More recently, state-space models such as Mamba have been investigated as a way to capture long-range dependencies within molecular structures. Importantly, these networks are often designed to respect the symmetries present in molecules and materials, which helps keep predictions accurate and physically meaningful. Architectures such as SchNet and OrbNet apply these principles, with OrbNet using atomic-orbital features adapted to quantum interactions and symmetry, while SpookyNet incorporates electronic degrees of freedom for greater accuracy. Delta-machine learning offers a way to build models on top of quantum chemical calculations by learning the correction between a cheaper and a more accurate level of theory.
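To make the symmetry point concrete, here is a minimal sketch, assuming NumPy, of why distance-based inputs are a common starting point: interatomic distances do not change when a molecule is rotated or translated, so a model built on them cannot give different energies for the same structure in different orientations. The helper names `pairwise_distances` and `random_rotation` are hypothetical and do not come from any of the architectures mentioned above.

```python
import numpy as np

def pairwise_distances(coords):
    """Return the sorted interatomic distances for an (n_atoms, 3) array."""
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    upper = np.triu_indices(len(coords), k=1)  # each atom pair once
    return np.sort(dists[upper])

def random_rotation(rng):
    """Draw a 3x3 proper rotation matrix from a QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:  # turn a reflection into a rotation
        q[:, 0] = -q[:, 0]
    return q

rng = np.random.default_rng(0)
coords = rng.normal(size=(5, 3))            # arbitrary toy "molecule"
rotation = random_rotation(rng)
shift = np.array([1.0, -2.0, 0.5])
moved = coords @ rotation.T + shift         # rotate, then translate

# The distance descriptor is unchanged by the rigid motion.
same = np.allclose(pairwise_distances(coords), pairwise_distances(moved))
```

Symmetry-aware networks go further than this (for example by handling directional features equivariantly), but the invariance check is the same idea.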

Transfer learning and foundation models such as UMA and MatterSim are also prominent, allowing researchers to carry knowledge from one dataset over to another and build widely applicable models. This work relies on large datasets of molecular structures and properties, including databases such as GDB and QM9 and benchmarks such as the Rowan benchmarks, to train and evaluate these algorithms. These developments are expected to support materials discovery, atomistic simulation, and the prediction of material properties, and ultimately to accelerate the development of new technologies.

Scaling of neural networks for molecular properties

The researchers investigated whether increasing neural network capacity and training set size improves the ability to model chemical properties, focusing in particular on the bond dissociation energy of the hydrogen molecule. They used a dataset of quantum chemical calculations to systematically scale both the number of training samples and the complexity of the neural network model. Performance was assessed on a held-out test set, allowing the researchers to determine whether larger models and more data led to better predictions. To rigorously test the models' grasp of basic chemical principles, the team focused on the bond dissociation energy curve of H₂, the simplest possible molecule.
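A scaling study of this kind is usually summarized by fitting a power law to error versus training set size. The sketch below assumes NumPy and uses synthetic numbers invented purely for illustration; they are not results from the study, and `fit_power_law` is a hypothetical helper.

```python
import numpy as np

def fit_power_law(n_train, error):
    """Fit error ~ a * N**(-b) by linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(n_train), np.log(error), 1)
    return np.exp(intercept), -slope  # prefactor a, scaling exponent b

# Synthetic training set sizes (illustrative only).
n_train = np.array([1e3, 1e4, 1e5, 1e6, 1e7])

# Synthetic held-out error that improves with more data.
holdout_error = 2.0 * n_train ** -0.3
a_holdout, b_holdout = fit_power_law(n_train, holdout_error)

# Synthetic error on a separate probe that does not improve at all.
probe_error = np.full_like(n_train, 0.5)
a_probe, b_probe = fit_power_law(n_train, probe_error)
```

A clearly positive exponent on the held-out set together with an exponent near zero on a separate probe, such as a dissociation curve, is the kind of pattern that would indicate scaling helps in-distribution accuracy without improving the probe.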

The models were trained on datasets containing varying amounts of stable molecular structures and evaluated on their ability to predict how the energy changes as a bond stretches and breaks. Recognizing that a model may work well on stable structures yet struggle with bond dissociation, the researchers augmented the training data with non-equilibrium structures, including stretched and distorted geometries. This addition was meant to expose the models to a wider range of molecular structures and to assess their ability to extrapolate beyond the training data. An important part of the methodology was training models on datasets of over 101 million structures, covering both stable and dissociating diatomic molecules.
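As a rough illustration of what a dissociation scan looks like as data, the sketch below builds H₂ geometries over a range of bond lengths and flags which ones sit near equilibrium. It assumes NumPy, an approximate equilibrium bond length of 0.74 Å, and an arbitrary scan range and tolerance; the function names are hypothetical, and this is not the study's data pipeline.

```python
import numpy as np

R_EQ = 0.74  # approximate H2 equilibrium bond length in angstrom (assumed)

def h2_geometry(bond_length):
    """Place two H atoms on the z-axis, centered at the origin."""
    half = bond_length / 2.0
    return np.array([[0.0, 0.0, -half], [0.0, 0.0, half]])

def dissociation_scan(r_min=0.4, r_max=5.0, n_points=47):
    """Return bond lengths and the matching (n_points, 2, 3) coordinates."""
    lengths = np.linspace(r_min, r_max, n_points)
    geometries = np.stack([h2_geometry(r) for r in lengths])
    return lengths, geometries

def is_near_equilibrium(lengths, tolerance=0.1):
    """Boolean mask for geometries within `tolerance` of R_EQ."""
    return np.abs(lengths - R_EQ) <= tolerance

lengths, geometries = dissociation_scan()
near_eq = is_near_equilibrium(lengths)
# Training only on geometries[near_eq] mimics an equilibrium-only dataset;
# geometries[~near_eq] are the compressed and stretched structures.
```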

This large-scale training was intended to determine whether more data could overcome the models' difficulty in describing bond dissociation accurately. Importantly, the team also evaluated the models on the trivial case of two bare protons, a system governed solely by Coulomb's law. This served as a fundamental test of whether the models had learned the physics underlying electronic structure theory rather than simply memorizing patterns in the training data. The researchers then compared the models' predictions with the analytically known energy curve derived from Coulomb's law, which provides a clear benchmark for assessing their grasp of basic physical principles.
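The two-proton benchmark is simple enough to write down directly. In atomic units, the repulsion energy of two bare protons separated by a distance R is E = 1/R (hartree, with R in bohr). The sketch below assumes NumPy; `predict` stands in for any model's energy function and is a hypothetical placeholder, not an interface from the study.

```python
import numpy as np

ANGSTROM_PER_BOHR = 0.529177210903  # CODATA value of the Bohr radius

def proton_proton_energy(r_angstrom):
    """Coulomb repulsion of two bare protons in hartree: E = 1/R, R in bohr."""
    r_bohr = np.asarray(r_angstrom, dtype=float) / ANGSTROM_PER_BOHR
    return 1.0 / r_bohr

def max_abs_error(predict, r_angstrom):
    """Largest deviation of a predicted curve from the analytic one."""
    reference = proton_proton_energy(r_angstrom)
    return float(np.max(np.abs(predict(r_angstrom) - reference)))

separations = np.linspace(0.3, 5.0, 48)  # angstrom, arbitrary scan range

# A toy stand-in "model" that predicts a constant energy everywhere.
def constant_model(r_angstrom):
    return np.zeros_like(np.asarray(r_angstrom, dtype=float))

error_of_constant = max_abs_error(constant_model, separations)
```

Because the reference is exact, any deviation reported by `max_abs_error` is attributable to the model rather than to noise in the benchmark.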

Scaling cannot capture H₂ bond dissociation

This study shows that simply increasing the size of neural networks and training datasets does not necessarily improve the ability to model bond dissociation energies in quantum chemical systems, in particular the hydrogen molecule. The experiments show that even the largest foundation models, trained on datasets of over 101 million structures, consistently fail to reproduce the H₂ bond dissociation curve accurately, pointing to a fundamental limit on their ability to learn the essential physics. Regardless of the amount of training data or model capacity, the models showed no discernible improvement in predicting the H₂ bond dissociation energy, highlighting the gap between scaling and achieving physically meaningful results. The researchers carefully tested models trained on diverse datasets of equilibrium-geometry molecules and observed that increasing the number of training samples does improve performance on standard held-out test sets.

However, this improvement did not translate into accurate prediction of the H₂ bond dissociation curve, even when the models were trained on compressed and stretched geometries. Even more surprisingly, the largest model failed to reproduce the simple repulsive energy curve of two bare protons, a system governed by Coulomb's law alone. This failure indicates that, despite the scale of the training data and the number of model parameters, the models do not learn the underlying physical principles that govern electronic structure. Further analysis showed that the models' inability to predict the H₂ bond dissociation energy accurately is not merely a matter of insufficient training data.

Even when the training set included non-equilibrium structures, the models showed only modest improvements. The inability of these large foundation models to capture the basic physics of the simplest diatomic molecule suggests that scaling alone is not sufficient for building reliable quantum chemical models. These findings challenge the machine learning community's prevailing paradigm of scaling as the main pathway to better generalization, and they raise important questions about the role of physical principles in designing accurate and reliable models of quantum chemical systems.

Large-scale models fail to generalize quantum chemistry

This study shows that simply increasing the size of neural network models and the amount of training data does not guarantee accurate quantum chemical predictions, even with chemically diverse datasets. The team investigated the ability of these large-scale models to predict the bond dissociation energy of the hydrogen molecule, a fundamental test case in chemistry. The results show that models trained only on stable molecular structures cannot reproduce even the basic shape of the energy curve, indicating a lack of generalizability. Including data from stretched and compressed geometries improves the predictions, but this improvement stems from exposure to those specific configurations rather than from learning the underlying physical principles.
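The difference between fitting seen configurations and learning the underlying curve can be illustrated with a deliberately simple toy. The sketch below assumes NumPy, uses a Morse potential with approximate H₂-like parameters as a stand-in reference curve, and uses a polynomial as a stand-in "model". None of this reproduces the study's networks or data; it only shows how a flexible fit can be accurate near equilibrium and badly wrong once the bond is stretched.

```python
import numpy as np

# Approximate, illustrative Morse parameters for an H2-like bond (assumed).
D_E = 4.75   # well depth in eV
A = 1.94     # width parameter in 1/angstrom
R_E = 0.74   # equilibrium bond length in angstrom

def morse(r):
    """Morse potential in eV, shifted so the dissociation limit is zero."""
    x = np.asarray(r, dtype=float) - R_E
    return D_E * (1.0 - np.exp(-A * x)) ** 2 - D_E

# "Training data": only geometries close to equilibrium.
r_train = np.linspace(0.6, 0.9, 31)
coeffs = np.polyfit(r_train - R_E, morse(r_train), deg=4)

def surrogate(r):
    """Polynomial surrogate fitted to the near-equilibrium points."""
    return np.polyval(coeffs, np.asarray(r, dtype=float) - R_E)

# Error inside the training range versus error at a stretched bond.
interp_error = float(np.max(np.abs(surrogate(r_train) - morse(r_train))))
extrap_error = float(abs(surrogate(3.0) - morse(3.0)))
```

Adding stretched points to `r_train` would reduce `extrap_error` at those points, but only because the fit has now seen them, which mirrors the distinction drawn above.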

In particular, even the largest foundation models, trained on an extensive collection of quantum chemical calculations, show significant deficiencies when predicting the behavior of simple diatomic molecules outside the equilibrium bonding region. These models mispredict the interaction between two protons and fail to represent Coulomb's law accurately, even though it could be built in as an inductive bias. This suggests that current large-scale models act primarily as data-driven interpolators and struggle to achieve true physical generalization. The authors note that alternative machine learning approaches, such as modifying semi-empirical quantum chemistry methods and selecting minimal training data, could provide promising tools for future research. Overall, this work highlights the need for new strategies for fast and accurate property prediction for new molecules and materials that move beyond the limits of scaling alone.


