Physically embedded machine learning force fields revolutionize simulations of organic systems

A breakthrough in molecular dynamics simulation has been announced by Jian Jiang and his team at the Institute of Chemistry, Chinese Academy of Sciences. Their work addresses long-standing challenges that machine learning force fields (MLFFs) face when applied to organic molecular systems. Specifically, the scientists tackled two key issues: the collapse of molecular structures during long simulations and the inaccurate prediction of macroscopic properties such as density and viscosity. By integrating physical principles directly into the development and refinement of MLFFs, the team devised two physics embedding techniques that significantly improve simulation stability and prediction accuracy even with limited training data.

Molecular dynamics simulations are essential tools for exploring the behavior of chemical, biological, and material systems at the atomic scale. Traditional approaches include high-precision ab initio molecular dynamics, which, although accurate, requires extensive computational resources and remains unsuitable for large-scale or long-time simulations. Conversely, classical empirical force fields offer computational efficiency but often compromise accuracy, especially for complex organic molecules characterized by diverse interaction types. MLFFs have emerged as a promising compromise: data-driven models that reproduce potential energy surfaces with greater accuracy while maintaining efficiency. Nevertheless, MLFFs struggle to represent strong intramolecular covalent bonds and weak intermolecular van der Waals forces accurately at the same time, which results in unstable simulations and incorrect macroscopic predictions.

The greatest difficulty lies in the dual nature of organic systems. Intramolecular interactions, dominated by strong covalent bonds, require the model to capture high-energy states, while intermolecular forces must be characterized with sufficient fidelity to reproduce bulk properties. Traditional MLFF training, which is driven primarily by quantum mechanical reference data, often ignores extreme bond distortions, and the resulting models produce structural defects such as nonphysical bond breaking during long simulations. Moreover, even when a model's microscopic predictions (energies, forces, radial distribution functions) appear accurate, its macroscopic thermodynamic properties can deviate significantly from experimental values because the intermolecular interactions are poorly modeled.

To overcome these obstacles, Jiang’s team proposed a two-stage physical embedding framework. The first stage is a physics-based adaptive bond length sampling method. This technique integrates empirical force field knowledge, specifically topology files that classify atoms and bonds and list bond force constants, into the data sampling pipeline. Unlike traditional uniform bond extension approaches, the adaptive method deliberately targets high-energy bond regions that are commonly under-sampled yet prone to structural collapse, and it assigns sampling probabilities based on physical bond stiffness. This targeted sampling ensures comprehensive coverage of the important configuration space without inducing non-physical artifacts such as self-consistent field (SCF) non-convergence or anomalous forces.
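The idea of letting bond stiffness set the sampling window can be illustrated with a minimal sketch. It is not the team's published procedure: the harmonic form E = ½·k·(r − r0)², the energy cap `e_max`, the function names, and the example bond parameters are all assumptions made for illustration.

```python
import math
import random


def stretch_limit(k, e_max):
    """Largest |r - r0| whose harmonic energy 0.5 * k * dr**2 stays at or below e_max."""
    return math.sqrt(2.0 * e_max / k)


def sample_bond_length(r0, k, e_max, rng):
    """Draw one bond length inside the stiffness-dependent window.

    The energy is drawn uniformly in [0, e_max], so the probability density of
    the displacement grows linearly with |dr|. Compared with drawing the length
    uniformly, this places more samples near the high-energy ends of the window.
    """
    energy = rng.uniform(0.0, e_max)
    dr = math.sqrt(2.0 * energy / k)
    sign = rng.choice((-1.0, 1.0))  # compress or stretch with equal probability
    return r0 + sign * dr


# Hypothetical bond types: (equilibrium length, force constant).
# Placeholder numbers, not taken from any real topology file.
BOND_TYPES = {
    "stiff": (1.10, 600.0),
    "soft": (1.50, 300.0),
}

if __name__ == "__main__":
    rng = random.Random(0)
    e_max = 30.0  # assumed energy cap, in the same energy units as k * length**2
    for name, (r0, k) in BOND_TYPES.items():
        window = stretch_limit(k, e_max)
        lengths = [sample_bond_length(r0, k, e_max, rng) for _ in range(5)]
        print(name, round(window, 3), [round(x, 3) for x in lengths])
```

Under these assumptions a stiffer bond receives a narrower stretch window for the same energy cap, which is one simple way to avoid the extreme geometries that can cause SCF convergence problems.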

Empirical validation was performed on three representative organic systems: a fluorinated engineering fluid, an alanine tripeptide, and an acetaminophen molecule. Only 50 single-molecule samples per system were used for model training and validation. The original MACE MLFF models showed collapse probabilities of 59%, 22%, and 77%, respectively. After adaptive bond length sampling was integrated, the refined MLFFs completed 100 independent 100 ps high-temperature molecular dynamics runs without structural damage. These results demonstrate a significant improvement in long-term stability and robustness, achieved with minimal data augmentation.
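A collapse probability of this kind can be estimated by counting the runs in which any bond stretches beyond a threshold. The sketch below is a hypothetical detector, not the criterion used in the study: the data layout, the function names, and the cutoff of 1.5 times the equilibrium length are assumptions.

```python
import math


def bond_is_broken(pos_a, pos_b, r0, factor=1.5):
    """Flag a bond whose current length exceeds factor * r0 (assumed criterion)."""
    return math.dist(pos_a, pos_b) > factor * r0


def run_collapsed(trajectory, bonds, factor=1.5):
    """Return True if any bond is broken in any frame.

    trajectory: list of frames, each frame a list of (x, y, z) positions.
    bonds: list of (i, j, r0) tuples with atom indices and equilibrium length.
    """
    return any(
        bond_is_broken(frame[i], frame[j], r0, factor)
        for frame in trajectory
        for i, j, r0 in bonds
    )


def collapse_probability(trajectories, bonds, factor=1.5):
    """Fraction of independent runs that show at least one broken bond."""
    collapsed = sum(run_collapsed(t, bonds, factor) for t in trajectories)
    return collapsed / len(trajectories)
```

With 100 independent runs per system, a value of 0.59 from such a function would correspond to the 59% figure quoted for the first system, provided the same collapse criterion is used.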

The second innovation addresses the persistent challenge of accurately predicting macroscopic properties governed by intermolecular forces. Jiang’s group introduced a top-down correction mechanism based on embedding physical equations. This approach takes the DFT-Corrected Screening Overlap (DFT-CSO) dispersion equation and embeds it in the MLFF architecture as a tunable correction term. Tuning the damping parameter changes the strength of the dispersion interaction, so the intermolecular potential can be adjusted until the simulation reproduces the experimental density. This compensates not only for the MLFF fitting error but also for the systematic biases inherent in the underlying quantum mechanical reference method.
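The tuning loop described here can be sketched in a few lines. The example is a simplification under stated assumptions: the pair energy uses a generic rational damping form as a stand-in rather than the actual DFT-CSO equation, and `predict_density` is a hypothetical callable that in practice would wrap a full molecular dynamics run and return the simulated density.

```python
def damped_dispersion(r, c6, damping, s6=1.0):
    """Pairwise dispersion energy with a generic rational damping (stand-in form).

    A larger damping value weakens the attraction at short range.
    """
    return -s6 * c6 / (r**6 + damping**6)


def tune_damping(predict_density, target, lo, hi, tol=1e-4, max_iter=60):
    """Bisection on the damping parameter until predict_density matches target.

    Assumes predict_density is monotonic in the damping parameter and that the
    target is bracketed by the values at lo and hi.
    """
    f_lo = predict_density(lo) - target
    f_hi = predict_density(hi) - target
    if f_lo * f_hi > 0:
        raise ValueError("target density is not bracketed by [lo, hi]")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = predict_density(mid) - target
        if abs(f_mid) < tol:
            return mid
        if f_lo * f_mid <= 0:
            hi = mid  # root lies in the lower half
        else:
            lo, f_lo = mid, f_mid  # root lies in the upper half
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Toy surrogate in place of an MD run: density falls linearly with damping.
    # The coefficients are placeholders, not fitted to any real system.
    def toy_density(damping):
        return 1.2 - 0.05 * damping

    print(tune_damping(toy_density, target=1.0, lo=1.0, hi=8.0))
```

Because each evaluation of `predict_density` would be a simulation in a real workflow, a bracketing method that needs few evaluations is a natural choice for this sketch; the study itself may use a different optimizer.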

Applying this correction to battery electrolyte solvents, specifically mixtures of ethylene carbonate (EC) and ethyl methyl carbonate (EMC) as well as pure EMC, significantly improved prediction accuracy. The density prediction error was reduced by 78% and 88% for the two model variants, with deviations as low as 0.006 and 0.012 g/cm³, respectively. The viscosity accuracy likewise improved by 38% and 77%, showing that the technique also improves the prediction of dynamic properties at negligible additional computational cost. Parameter optimization requires only a few hours, which makes the approach practical for rapid model refinement and deployment.

Interpretability studies have reinforced the physical significance of embedded modifications. Volume scan analysis showed changes in the interaction potential minimum after correction, validating the ability of this method to directionally enhance or attenuate intermolecular forces. In particular, the change in the root mean square error of the atomic forces is less than 0.8 meV/Å, well within the model-specific fitting uncertainties, confirming that the macroscopic improvement results from subtle force adjustments. The radial distribution function remained essentially unchanged, indicating that the intramolecular structural representation was preserved and the correction selectively improved the fidelity of intermolecular interactions.
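The force comparison can be made concrete with a short sketch. It assumes forces are stored as lists of (fx, fy, fz) tuples in eV/Å and that each model is compared against the same reference forces; the function names and data layout are illustrative, not taken from the study.

```python
import math


def force_rmse(forces_a, forces_b):
    """Root mean square error over all Cartesian force components."""
    squared_sum = 0.0
    count = 0
    for fa, fb in zip(forces_a, forces_b):
        for ca, cb in zip(fa, fb):
            squared_sum += (ca - cb) ** 2
            count += 1
    return math.sqrt(squared_sum / count)


def rmse_change_mev(reference, base_forces, corrected_forces):
    """Change in force RMSE (corrected minus base) in meV/Å, for inputs in eV/Å."""
    base = force_rmse(base_forces, reference)
    corrected = force_rmse(corrected_forces, reference)
    return 1000.0 * (corrected - base)
```

A result with magnitude below 0.8 from `rmse_change_mev` would correspond to the bound quoted above.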

This dual-strategy framework consists of physics-aware adaptive data sampling and physically grounded model post-processing, providing a scalable and low-cost solution to the long-standing bottleneck in MLFF construction. It bridges the gap between data-driven modeling and fundamental physical chemistry, producing MLFFs that are at the same time accurate, interpretable, and transferable between diverse organic molecular systems. This method not only improves simulation accuracy but also enables high-fidelity studies of engineering fluids, peptides, pharmaceutical compounds, and organic solvents under a practical computational budget.

This effort highlights the significant benefits of incorporating domain knowledge directly into machine learning pipelines, in contrast to traditional approaches that typically increase dataset size or model complexity to achieve performance gains. Such embedding not only reduces data dependence but also supports rapid model adaptation and fine-tuning to new chemical environments. Future research directions include extending the correction framework with additional tunable physical parameters and broadening its scope to capture dynamic properties such as viscosity more efficiently, which is expected to further improve predictive power and applicability.

This seminal work was published as an open access article in CCS Chemistry, the flagship journal of the Chinese Chemical Society. This work was supported by the National Natural Science Foundation of China and the Strategic Priority Research Program of the Chinese Academy of Sciences. By combining advanced computational modeling with deep physical insight, Jian Jiang and colleagues have demonstrated a promising path toward more reliable and generalizable molecular simulations with broad implications for chemistry, materials science, and drug discovery.

Article title: Physical embedding of machine learning force fields for organic systems

News publication date: March 6, 2026

Web references:
https://www.chinesechemsoc.org/journal/ccschem
http://dx.doi.org/10.31635/ccschem.026.202506780

References: Jiang J. et al. CCS Chem., 2025, 7(3): 716-730.

Keywords: Machine learning, force fields, molecular dynamics, physical embedding, organic systems, adaptive bond length sampling, intermolecular interactions, DFT-CSO correction, macroscopic property prediction, simulation stability, quantum chemistry, computational modeling
