Researchers advance machine learning potentials with 120 million atomic configurations for materials discovery

Building reliable machine learning models of material behavior is currently hindered by the inconsistent and fragmented nature of existing atomistic simulation data. Ali Ramlaoui, Martin Siron, Inel Djafar and colleagues at Entalpic tackle this challenge by introducing LeMat-Traj, a unified, curated dataset of more than 120 million atomic configurations. Built from large materials repositories, the collection standardizes data formats and ensures high quality across commonly used computational methods, lowering the barrier to developing accurate, transferable machine learning interatomic potentials. LeMat-Traj includes both stable and high-energy states; models trained on it perform better and make smaller errors when predicting material behavior, which paves the way for accelerated materials discovery. The team also offers LeMaterial-Fetcher, an open source library designed to support the continued expansion and maintenance of large materials datasets, so that they remain useful to the wider research community over the long term.

This large resource aggregates data from prominent repositories including the Materials Project, Alexandria and OQMD, significantly lowering the barrier to training accurate and transferable machine learning potentials. LeMat-Traj standardizes data representations and harmonizes the results of calculations performed with widely used density functional theory (DFT) functionals, ensuring consistency across sources. The team also developed LeMaterial-Fetcher, a modular, extensible open source library that automates retrieving, transforming, verifying and harmonizing data from a variety of sources, creating a reproducible framework for materials science datasets.
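As a rough illustration of the retrieval, transformation, checking and harmonization steps described above, here is a minimal Python sketch. It is not LeMaterial-Fetcher's actual API: the record fields, the functional aliases, the force cap and every function name below are hypothetical, chosen only to show the idea of mapping differently formatted source records onto one schema.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical unified schema; the real dataset's fields may differ.
@dataclass(frozen=True)
class Configuration:
    source: str
    functional: str
    energy_ev: float
    max_force_ev_per_a: float


# Hypothetical aliases: sources may label the same DFT functional differently.
FUNCTIONAL_ALIASES = {
    "pbe": "PBE",
    "gga-pbe": "PBE",
    "pbesol": "PBEsol",
    "scan": "SCAN",
    "r2scan": "r2SCAN",
}


def transform(raw: dict, source: str) -> Optional[Configuration]:
    """Map one raw record onto the unified schema, or return None if unusable."""
    functional = FUNCTIONAL_ALIASES.get(str(raw.get("functional", "")).lower())
    energy = raw.get("energy")
    forces = raw.get("forces")
    if functional is None or energy is None or not forces:
        return None
    # Largest per-atom force norm in the configuration.
    max_force = max((fx * fx + fy * fy + fz * fz) ** 0.5 for fx, fy, fz in forces)
    return Configuration(source, functional, float(energy), max_force)


def validate(cfg: Configuration, force_cap: float = 50.0) -> bool:
    """Reject configurations with implausibly large forces (assumed threshold)."""
    return cfg.max_force_ev_per_a <= force_cap


def harmonize(raw_by_source: dict) -> list:
    """Run transform and validate over every source and collect the survivors."""
    unified = []
    for source, records in raw_by_source.items():
        for raw in records:
            cfg = transform(raw, source)
            if cfg is not None and validate(cfg):
                unified.append(cfg)
    return unified


# Invented toy records standing in for two differently formatted sources.
raw = {
    "source_a": [
        {"functional": "GGA-PBE", "energy": -10.5, "forces": [[0.0, 0.0, 0.1]]},
    ],
    "source_b": [
        {"functional": "pbe", "energy": -3.2, "forces": [[3.0, 4.0, 0.0]]},
        {"functional": "unknown", "energy": -1.0, "forces": [[0.0, 0.0, 0.0]]},
        {"functional": "PBE", "energy": 5.0, "forces": [[100.0, 0.0, 0.0]]},
    ],
}
unified = harmonize(raw)
```

In this toy run, the record with an unrecognized functional and the record exceeding the force cap are dropped, and the two remaining records end up with the same functional label regardless of how their source spelled it.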

The experiments show that fine-tuning the MACE model on LeMat-Traj reduces force prediction errors on relaxation tasks by more than 36% and improves performance on the Matbench Discovery stability benchmark by 10%. LeMat-Traj covers near-equilibrium, low-force states, a previously underrepresented but important regime for accurate geometry optimization. The analysis shows that LeMat-Traj samples the potential energy surface comprehensively along relaxation paths, capturing both high-energy structures and states close to equilibrium, which makes it a valuable resource for advancing interatomic potential development, multi-fidelity learning, and self-supervised learning techniques. The researchers present LeMat-Traj as a large collection of DFT calculations designed to improve the accuracy and generalizability of machine learning potentials in materials science.
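To make the 36% figure concrete, the sketch below shows one common way such a number can be computed: the mean absolute error (MAE) over force components, followed by the relative reduction between a baseline and a fine-tuned model. The article does not specify the exact metric, so component-wise MAE is an assumption here, and the toy force values are invented purely for illustration.

```python
def force_mae(predicted, reference):
    """Mean absolute error over all Cartesian force components (eV/Å)."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same number of atoms")
    total, count = 0.0, 0
    for f_pred, f_ref in zip(predicted, reference):
        for p, r in zip(f_pred, f_ref):
            total += abs(p - r)
            count += 1
    if count == 0:
        raise ValueError("no force components to compare")
    return total / count


def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error


# Invented two-atom example: reference DFT forces and two sets of model predictions.
reference = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
baseline = [[0.5, 0.0, 0.0], [1.5, 0.0, 0.0]]
finetuned = [[0.25, 0.0, 0.0], [1.25, 0.0, 0.0]]

mae_baseline = force_mae(baseline, reference)
mae_finetuned = force_mae(finetuned, reference)
reduction = relative_reduction(mae_baseline, mae_finetuned)
```

With these toy numbers the fine-tuned predictions halve the error, so `reduction` is 0.5; a reported reduction of "more than 36%" would correspond to a value above 0.36 under this definition.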

The dataset combines MPtrj (Materials Project), OQMD, and Alexandria data to provide a comprehensive and diverse resource for training machine learning potentials. Rigorous filtering and processing of the data from these sources ensures quality and consistency. The study shows that different data generation strategies, such as molecular dynamics, active learning and geometry optimization, capture distinct but complementary regions of the potential energy surface. A single data source is therefore often insufficient for building truly general-purpose potentials.

Models trained with LeMat-Traj achieve strong performance in predicting energies, forces and stresses, especially when it is combined with high-force datasets. Principal component analysis shows that LeMat-Traj and MPtrj cover similar regions of the potential energy surface, with LeMat-Traj offering finer resolution thanks to its larger size. Models consistently perform best on in-distribution test data, which underscores the importance of diverse training data.

👉 Details
🗞 LeMat-Traj: A Scalable and Unified Dataset of Materials Trajectories for Atomistic Modeling
🧠 arXiv: https://arxiv.org/abs/2508.20875


