Although scientists increasingly recognize the important role that chemical bonds play in determining material properties, systematically integrating bonding information into machine learning workflows remains a major challenge. Aakash Ashok Naik, Nidal Dhamrait, and Katharina Moeltzen, working in collaboration with the Department of Materials Chemistry at the Federal Institute for Materials Research and Testing (BAM) in Berlin and the Institute of Condensed Matter Theory and Optics at Friedrich Schiller University Jena, addressed this gap by expanding a previously established solid-state chemical bonding database to cover approximately 13,000 materials. The expanded database, with contributions from Christina Ertural, Philipp Benner, Gian-Marco Rignanese (UCLouvain, Institute of Condensed Matter and Nanosciences (IMCN)), and Janine George, makes it possible to derive new chemical bonding descriptors and to rigorously evaluate their impact on the performance of machine learning models. Their work demonstrates that incorporating these descriptors significantly improves the prediction of elastic, vibrational, and thermodynamic properties and, importantly, enables the discovery of interpretable relationships between bonding properties and key material behaviors.
However, chemical bonds themselves are poorly represented in most machine learning models of materials. This study aimed to develop and implement bonding descriptors to improve the predictive ability of machine learning for thermal conductivity. The researchers used a combination of density functional theory calculations, symbolic regression, and machine learning techniques to quantify bonding properties. Specifically, they calculated phonon lifetimes and Grüneisen parameters for a dataset of 117 materials and used symbolic regression to identify significant relationships between these parameters and material properties. The resulting bonding descriptors were incorporated into a machine learning model for predicting thermal conductivity, and they significantly improved prediction accuracy compared with models based only on compositional and structural features. The database is used to derive a new set of quantum chemical bonding descriptors, and a systematic evaluation with statistical significance tests assesses how their inclusion affects the performance of machine learning models that would otherwise rely solely on features derived from structure and composition. Machine learning algorithms are widely used for data-driven materials discovery in both forward and inverse design approaches. In forward design, where the goal is to predict material properties from structure or composition, the performance of a machine learning algorithm depends on how well the material is represented by a set of features, or descriptors. Numerous studies have demonstrated the utility of such descriptors for building machine learning models that screen materials for applications such as catalysis, ferroelectrics, and thermoelectrics. The concept of chemical bonding, although not a quantum mechanical observable, has proven useful in rationalizing the behavior of both organic and inorganic materials.
Several theoretical frameworks have been developed to characterize the bonds in solid materials, including wave-function-based population analysis, real-space electron density analysis, and energy partitioning methods. These frameworks regularly inform the understanding and tuning of various material properties, and the quantities obtained from such bonding analyses serve as valuable descriptors for data-driven materials discovery. To date, descriptors derived from readily available geometric information have often been used to approximate bonding within materials. The current study was motivated by the lack of a large-scale comparison of the predictive power of geometric and quantum chemical bonding descriptors in machine learning of the properties of solid-state materials. Recent developments in quantum chemical bonding analysis workflows have enabled high-throughput computation of bonding descriptors derived from ab initio calculations. The database used in this research provides the integrated crystal orbital overlap population (ICOOP), which measures the number of electrons involved in a bond; the integrated crystal orbital Hamilton population (ICOHP), which quantifies the strength of covalent bonds; and the integrated crystal orbital bond index (ICOBI), which indicates bond order. Mulliken and Löwdin atomic charges, the projected density of states (PDOS), and Madelung energies are also available. Using this large database, the researchers evaluated the predictive value of descriptors derived from these bonding metrics for data-driven materials discovery. Because these bonding metrics have not been comprehensively evaluated in data-driven materials science, the study focused on statistical descriptors derived from COHP, ICOHP, and atomic charges, extracted using LobsterPy, a Python package that generates summaries of bonding properties and converts the data into machine-learning-compatible formats. These descriptors quantify atomic interactions and are therefore closely related to vibrational properties, which are governed by interatomic force constants.
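To make the idea of a "statistical descriptor" concrete, the following minimal sketch reduces a list of per-bond ICOHP values to a few summary statistics. It does not use LobsterPy and is not the authors' implementation; the function name `summarize_icohp`, the feature names, and the example values are all hypothetical and chosen only for illustration.

```python
import statistics


def summarize_icohp(icohps):
    """Reduce per-bond ICOHP values (in eV) to simple summary statistics.

    Hypothetical helper for illustration; real workflows would take these
    values from a bonding analysis rather than from a hand-written list.
    """
    if not icohps:
        raise ValueError("at least one bond is required")
    return {
        "icohp_sum": sum(icohps),
        "icohp_mean": statistics.fmean(icohps),
        # More negative ICOHP means a stronger covalent interaction,
        # so the minimum corresponds to the strongest bond.
        "icohp_min": min(icohps),
        "icohp_max": max(icohps),
        "icohp_std": statistics.pstdev(icohps),
    }


# Made-up ICOHP values for four bonds in one structure (illustrative only).
example_icohps = [-4.0, -4.0, -2.0, -2.0]
features = summarize_icohp(example_icohps)
print(features)
```

Each structure then contributes one fixed-length feature vector regardless of how many bonds it contains, which is what allows such values to sit alongside compositional and structural features in a tabular model.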
Accordingly, the target material properties considered include the maximum bond-projected force constant, the last peak of the phonon density of states (DOS), thermodynamic data (heat capacity, vibrational entropy, Helmholtz free energy, internal energy, etc.), mean square thermal displacement, elastic data (bulk and shear modulus), and lattice thermal conductivity. The rationale for choosing these targets is that the bond-projected force constant measures bond stiffness, the last peak in the phonon DOS indicates the strongest bonding, and the thermodynamic properties, bulk and shear moduli, mean square displacement, and lattice thermal conductivity are generally correlated with chemical bonding. The bonding descriptors are orders of magnitude cheaper to compute than the standard density functional theory (DFT) simulations needed to obtain target properties such as phonons, elastic moduli, and thermal conductivity. The assessment addressed three key questions: (a) Are such quantum chemical descriptors relevant for predicting these material properties? (b) Can such bonding descriptors be replaced by descriptors derived from compositional and structural data? (c) Do quantum chemical bonding descriptors contain complementary information that improves prediction accuracy beyond simple compositional or structural descriptors? The study began by testing the relevance of quantum chemical descriptors for learning these properties, then analyzed the correlation between bonding descriptors and structural or compositional descriptors to assess whether the former provide complementary information. The impact of including these descriptors on the predictive performance of machine learning models, specifically Random Forest and MODNet, was then evaluated, and significance tests were performed on the trained models to determine whether the observed improvements were statistically significant.
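The article does not say which significance test the authors used. As one generic possibility, the sketch below applies an exact paired sign-flip (permutation) test to per-fold errors from two models evaluated on the same cross-validation folds. The function name `paired_sign_flip_pvalue` and the error values are hypothetical and are not results from the study.

```python
from itertools import product


def paired_sign_flip_pvalue(errors_a, errors_b):
    """Exact two-sided sign-flip test on paired per-fold error differences.

    Under the null hypothesis that both models perform equally, the sign of
    each paired difference is arbitrary, so every sign pattern is enumerated.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both models need errors for the same folds")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    observed = abs(sum(diffs))
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point round-off.
        if stat >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# Made-up mean absolute errors per fold (illustrative only).
mae_baseline = [0.50, 0.52, 0.48, 0.51, 0.49]      # structure + composition
mae_with_bonding = [0.45, 0.46, 0.44, 0.47, 0.43]  # plus bonding descriptors
p_value = paired_sign_flip_pvalue(mae_baseline, mae_with_bonding)
print(f"p = {p_value:.4f}")
```

With five folds there are only 32 sign patterns, so the smallest achievable two-sided p-value is 2/32 = 0.0625 even when one model wins on every fold. Repeated cross-validation, or a different test, would be needed to resolve smaller p-values.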
Descriptor importances were extracted from the trained models using explainable artificial intelligence (XAI) techniques, specifically SHapley Additive exPlanations (SHAP) and permutation feature importance (PFI), to identify the most influential descriptors. Because including bonding descriptors improved prediction performance, the researchers applied the symbolic regression method SISSO (sure independence screening and sparsifying operator) to investigate whether simple, intuitive expressions could be found that relate these descriptors to the target properties. The descriptors were evaluated with multiple methods across a variety of target properties, and the discussion focuses on a few representative examples that best illustrate the main conclusions of the study. Full methodological details are provided in Section 3 of the paper, and the complete set of results for all methods and targets is available on the repository’s GitHub page. An initial descriptor selection was performed to avoid overfitting. The highly ranked descriptors included bond strength descriptors (bwdf total, Icohp total), the effective interaction number (EIN ICOHP), geometry-based local environment descriptors, and element-based properties such as atomic weight and covalent radius. High rankings of the bonding descriptors were observed only for the maximum bond-projected force constant (max pfc), the last phonon DOS peak (last ph peak), the average total/Peierls lattice thermal conductivity (log klat 300/log kp 300), the bulk/shear modulus (log k vrh/log g vrh), and the mean square displacement (log msd). The statistical bonding descriptors were ranked relatively low for thermodynamic properties such as Helmholtz free energy (H 25, H 305, H 705), vibrational entropy (S 25, S 305, S 705), internal energy (U 25, U 305, U 705), and heat capacity (Cv 25, Cv 305, Cv 705), where the numbers appear as subscripts in the original notation. Statistical evaluation revealed that the performance of machine learning models improves significantly when these descriptors are incorporated together with traditional structure- and composition-derived features.
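Permutation feature importance, one of the two XAI techniques named above, can be sketched in a few lines: shuffle one feature column at a time and record how much the model's error grows. The toy model, data, and function names below are hypothetical stand-ins, not the Random Forest or MODNet models used in the study.

```python
import random


def mean_abs_error(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_feature_importance(predict, X, y, n_repeats=10, seed=0):
    """Average increase in MAE when each feature column is shuffled.

    `predict` maps one feature row (a list) to a prediction; the model is
    treated as a black box and is never retrained.
    """
    rng = random.Random(seed)
    baseline = mean_abs_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = mean_abs_error(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy data: the target depends only on feature 0; feature 1 is irrelevant.
X = [[float(i), float((i * 3) % 8)] for i in range(8)]
y = [2.0 * row[0] for row in X]


def toy_model(row):
    """Stand-in for a trained model that has learned y = 2 * x0."""
    return 2.0 * row[0]


importances = permutation_feature_importance(toy_model, X, y)
print(importances)
```

Shuffling the irrelevant feature leaves the predictions unchanged, so its importance is zero, while shuffling the informative feature raises the error. The same logic underlies ranking bonding descriptors against structural and compositional ones.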
Specifically, the study demonstrated that models predicting material properties become more accurate when bonding information is included. It focused on descriptors extracted from crystal orbital Hamilton populations (COHP), integrated COHP (ICOHP), and atomic charges, using the LobsterPy package for automated data processing. Analysis of the maximum bond-projected force constant, a measure of bond stiffness, showed improved prediction accuracy when bonding descriptors were used. The last peak of the phonon density of states, a benchmark property used to evaluate machine learning models in Matbench, was also predicted with higher accuracy. This peak indicates the strongest bonding within the material, and its prediction benefited from including bonding information in the model. Evaluation of thermodynamic properties such as heat capacity, vibrational entropy, Helmholtz free energy, and internal energy also showed positive correlations with the new descriptors. Elastic data, especially the bulk and shear moduli, as well as the mean square thermal displacement, were predicted with higher accuracy. Lattice thermal conductivity, a property closely tied to atomic interactions, likewise benefited from the inclusion of quantum chemical bonding descriptors in the machine learning model. These results demonstrate the value of incorporating bonding information into data-driven materials discovery workflows. The continuing pursuit of materials discovery requires increasingly sophisticated predictive tools. For too long, machine learning in this field has described materials simply by what they are made of, overlooking the important question of how their atoms are connected. This work is a step toward correcting that imbalance and demonstrates the power of explicitly incorporating chemical bonding information into materials modeling.
The creation of an extended chemical bonding database and the development of associated descriptors provide a path beyond composition-based predictions. While the improvements in model accuracy across a variety of properties, from elasticity to thermal conductivity, are notable, the real potential of the approach lies in its ability to reveal the underlying physical relationships. Using symbolic regression to derive intuitive expressions for these properties points to a future in which machine learning not only predicts that materials will behave a certain way, but also helps explain why. However, this is not a complete solution. The database is rich but limited in scope, and the descriptors themselves are based on specific bonding models. It is important to extend this research to a broader range of materials and bonding environments. Moreover, the true test will be applying these descriptors to genuinely new materials, beyond the limits of existing knowledge. The next phase will focus on integration with more advanced machine learning architectures and on quantifying uncertainty, in recognition of the inherent limitations of predictive models. Ultimately, this approach promises to move materials science closer to being a truly predictive discipline rather than a purely empirical one.
