Hierarchical transfer learning: an agile and equitable strategy for machine-learning interatomic models

ML-IAMs are characterized by two main features: model architecture and descriptor. For the majority of ML-IAMs, these features are structured such that parameters for interactions between atoms of various types are inseparable; introduction of an additional atom type requires generating a new model with all parameters updated^{2,3,5,11,13,15,22}. As a consequence, ML-IAMs are generally fit a-la-carte for a given target system, atomic composition, and set of conditions. Specifically, use of an atom-centered descriptor or use of complex architecture (e.g., neural network, graph based, etc) generally precludes generating models that are explicitly compositionally-extensible. In the section below, we will show that the cluster-based descriptor and parametrically linear form used by the ChIMES ML-IAM overcomes this limitation, enabling a unique opportunity for developing chemically extensible models through a hierarchical transfer learning strategy.

Model overview

In this section, we provide a brief overview of the ChIMES ML-IAM, emphasizing features salient to the presently described transfer learning strategy. For a more detailed discussion of the model and its underlying form, we direct the reader to refs. ⁶ and ³³.

ChIMES describes system energy through an explicit many-body cluster expansion, i.e:

$$E=\mathop{\sum }\limits_{i}^{{n}_{{\rm{a}}}}{E}_{i}+\mathop{\sum\sum }\limits_{i > j}^{{{n}_{{\rm{a}}}}}{E}_{ij}+\mathop{\sum\sum\sum }\limits_{i > j > k}^{{{n}_{{\rm{a}}}}}{E}_{ijk}+\cdots \,,$$

(1)

where E is the total ChIMES energy for a system of n_a atoms, E_i is the energy for a single atom, i, E_ij, and E_ijk are the energy for a cluster of two or three atoms (i.e., ij or ijk), respectively. This expansion can extend to arbitrary bodiedness, though all models produced to date contain a maximum of 4-body terms^{6,7,26,27,33,34,35,36,37,38}. Interactions between pairs of atoms are described through Chebyshev polynomial series that take as input interatomic pair distances, i.e., for a pair of two atoms ij:

$${E}_{ij}\propto \mathop{\sum }\limits_{\alpha }^{{{\mathcal{O}}}_{2{\rm{B}}}}{c}_{\alpha }^{{e}_{i}{e}_{j}}{T}_{\alpha }\left({s}_{ij}^{{e}_{i}{e}_{j}}\right),$$

(2)

where T_α is a Chebyshev polynomial of order α, ${s}_{ij}^{{e}_{i}{e}_{j}}$ is a transformed pair distance between atoms ij of element type e_ie_j, ${{\mathcal{O}}}_{2{\rm{B}}}$ is the user-defined two-body order for the polynomial series, and ${c}_{\alpha }^{{e}_{i}{e}_{j}}$ are the Chebyshev polynomial coefficients that comprise the fitting parameters of the model. Note that we use the “ ∝ ” symbol to indicate that smoothing and penalty functions have been excluded from these equations for simplicity. Many-body interactions are treated as the product of interactions for constituent atom pairs. For example, a three-body interaction is given by:

$${E}_{ijk}\propto \sum _{\beta }\mathop{\sum }\limits_{\gamma }^{{{\mathcal{O}}}_{3{\rm{B}}}}\mathop{\sum }\limits_{\delta }\!{\scriptstyle{\prime}\atop }{c}_{\beta \gamma \delta }^{{e}_{i}{e}_{j},{e}_{i}{e}_{k},{e}_{j}{e}_{k}}{T}_{\beta }\left({s}_{ij}^{{e}_{i}{e}_{j}}\right){T}_{\gamma }\left({s}_{ik}^{{e}_{i}{e}_{k}}\right){T}_{\delta }\left({s}_{jk}^{{e}_{j}{e}_{k}}\right),$$

(3)

Where the “’” indicates that the sum only considers terms for which at least two of β, γ, and δ are non-zero, ensuring a true three-body interaction. Previous work has shown that this functional form is well suited for describing variety of systems, including inorganic materials, covalent materials, condensed phase reacting systems, and molecular systems and materials^{23,26,33,34,36,38}.

Models are fit by force-, energy-, and/or stress matching to gas- and/or condensed-phase atomic configurations labeled by a ground-truth method such as DFT; this is achieved by minimizing an objective function of the form:

$${F}_{{\rm{obj}}}\propto \sqrt{\mathop{\sum }\limits_{i=1}^{{n}_{{\rm{f}}}}\left({w}_{{{\rm{E}}}_{i}}^{2}\Delta {E}_{i}^{2}+\mathop{\sum }\limits_{j=1}^{{n}_{{\rm{a}}}}\mathop{\sum }\limits_{k=1}^{3}{w}_{{{\rm{F}}}_{ijk}}^{2}\Delta {F}_{ijk}^{2}+\mathop{\sum }\limits_{j=1}^{9}{w}_{{\sigma }_{ij}}^{2}\Delta {\sigma }_{ij}^{2}\right)},$$

(4)

where ΔX = X^DFT − X^ChIMES{c} and X is a force, energy or stress predicted by the superscripted method. F_obj and {c} are the weighted root-mean-squared error and model coefficients, respectively. The number of frames and atoms are given by n_f and n_a, respectively. F_ijk indicates the k^th Cartesian component of the force on atom j in configuration i while σ_ij and E_i indicates the j component of the stress tensor and the energy for configuration i, respectively. Weights for each force, energy, and stress are given by w.

Since ChIMES is entirely linear in its fitted parameters {c}, the model optimization problem can be recast as the following over-determined matrix equation:

$${\boldsymbol{w}}{\boldsymbol{M}}{\boldsymbol{c}}={\boldsymbol{w}}{{\boldsymbol{X}}}_{{\rm{DFT}}},$$

(5)

where X_DFT is the vector of ${F}_{ijk}^{{\rm{DFT}}}$, ${\sigma }_{ij}^{{\rm{DFT}}}$, and E^DFT values, w is a diagonal matrix of weights to be applied to the elements of X_DFT and rows of M, and the elements of design matrix M are given by:

$${M}_{ab}=\frac{\partial {X}_{a,{\rm{ChIMES}}\{c\}}}{\partial {c}_{b}}.$$

(6)

In the above, a represents a combined index over force and energy components, and b is the index over permutationally invariant model parameters. This allows model parameters to be rapidly generated by application of advanced linear solvers^38,39,40,41 and makes the model well suited for iterative and/or active-learning training strategies. For additional details, the reader is directed to refs. ⁶ and ²⁶.

Parameter hierarchy and transfer learning overview

ChIMES models view system configurations as a collection of fully connected graphs between atoms in n-body clusters, where atoms form nodes and the transformed distances between those atoms form the edges. We refer to these cluster graphs as the ChIMES descriptor. The parametric linearity characteristic to ChIMES models coupled with use of a cluster-centered many-body descriptor gives rise to an inherently hierarchical parameter structure that can be leveraged for transfer learning. Specifically, the atom cluster energy terms given in Eq. (1) can be further decomposed based on constituent atom types. For example, the two-body energy contributions for a system comprised entirely of C and N can be written as:

$$\mathop{\sum\sum }\limits_{i > j}^{{{n}_{{\rm{a}}}}}{E}_{ij}=\mathop{\sum\sum }\limits_{i > j}^{{{n}_{{\rm{C}}}}}{E}_{ij}^{{\rm{CC}}}+\mathop{\sum\sum }\limits_{i > j}^{{{n}_{{\rm{N}}}}}{E}_{ij}^{{\rm{NN}}}+\mathop{\sum }\limits_{i}^{{n}_{{\rm{C}}}}\mathop{\sum }\limits_{j}^{{n}_{{\rm{N}}}}{E}_{ij}^{{\rm{CN}}},$$

(7)

where n_C and n_N are the number of C and N atoms in the system, respectively i.e., n_C + n_N = n_a and ${E}_{ij}^{{e}_{i}{e}_{j}}$ is the two-body energy for a set of atoms ij of element types e_ie_j. Similarly, for a three-body interaction:

$$\begin{array}{l}\mathop{\sum\sum\sum}\limits_{i > j > k}^{{{n}_{{\rm{a}}}}}{E}_{ijk}=\mathop{\sum\sum\sum}\limits_{i > j > k}^{{{n}_{{\rm{C}}}}}{E}_{ijk}^{{\rm{CCC}}}\\ +\mathop{\sum\sum\sum }\limits_{i > j > k}^{{{n}_{{\rm{N}}}}}{E}_{ijk}^{{\rm{NNN}}}+\mathop{\sum\sum}\limits_{i > j}^{{{n}_{{\rm{C}}}}}\mathop{\sum }\nolimits_{k}^{{n}_{{\rm{N}}}}{E}_{ijk}^{{\rm{CCN}}}\\ +\mathop{\sum }\limits_{i}^{{n}_{{\rm{C}}}}\mathop{\sum\sum}\limits_{j > k}^{{{n}_{{\rm{N}}}}}{E}_{ijk}^{{\rm{CNN}}}.\end{array}$$

(8)

This logic can be extended for construction of higher-body interactions. Notably, pure component terms (and thus, parameters) for C and N interactions are confined to the first two terms of Eqs. (7) and (8), and are non-interacting. Two-atom-type cross-interactions are contained in the remaining terms. This point is illustrated in Fig. 1. For a C, H, O, and N ChIMES model, all terms for interactions between only C atoms and between only N atoms are contained in the orange C and N blocks, respectively, while all cross-interaction terms are contained within the yellow CN block. Critically, this means parameters for a given block in the same column can be fit in parallel, completely independently of one another. For example, a ChIMES model containing up to four body interactions for a C/N system is comprised of two “building blocks” containing {C, CC, CCC, CCCC} and {N, NN, NNN, NNNN} parameters, and a CN cross-term building block containing {CN, CCN, CNN, CCCN, CCNN, CNNN}. We note that, throughout this text, ‘C/N’ refers to systems or datasets containing both carbon and nitrogen atoms (e.g., C/N systems), as well as the corresponding models that fully describe them, including pure-C, pure-N, and C–N cross interactions. In contrast, ‘CN’ is used specifically to denote parameters associated with ‘C–N` cross-interactions.

**Fig. 1: Schematic of a ChIMES parameter hierarchy for a model describing C, H, O, and N containing systems.**

Following the precedent set by previous ChIMES model development endeavors, a ChIMES-CN model would traditionally be generated by fitting all of these parameters at once. However, the unique model structure also allows a “hierarchical transfer learning” approach wherein C- and N- building blocks are fit independently to pure-C and pure-N training data, respectively. CN-block parameters can then be fit to two-element system training data by replacing the definition of ΔX used in Eq. (4) with: $\Delta X={X}^{{{\rm{DFT}}}^{{\prime} }}-{X}^{{\rm{ChIMES}}\{c\}}$, where ${X}^{{{\rm{DFT}}}^{{\prime} }}={X}^{{\rm{DFT}}}-{X}^{{\rm{ChIMES}}-{\rm{C}}}-{X}^{{\rm{ChIMES}}-N}$ and ChIMES–C and ChIMES–N indicate X computed using the C and N block parameters, respectively. This same logic can be extended to trinary and quaternary systems e.g., the resulting CN parameter block along with the previously fit C and N parameter blocks could be used in development of, e.g. CHN, CON, and CHON models.

Prototypical system overview and training strategy

Like other ML-IAMs, these building blocks have historically been fit all at once, yielding ChIMES models for which applications are confined to specific set of atom types and limited to certain compositional ranges^{27,28,36,38,43}. Here, we explore efficacy of the hierarchical transfer learning strategy described above to develop a C/N model valid from near-ambient conditions to extreme conditions of ~10,000 K and 200 GPa that is suitable for describing any range of compositions from 0 to 100% N, by building upon previously generated ChIMES models for C (i.e., the 2024-Large model²⁶) and N³³. We note that this testbed C/N system is also interesting within the context of synthesis of N-doped graphictic materials for applications including catalysis, energy storage, and sensing^{44,45,46,47,48,49}. For example, shock-compression of C/N-rich precursor materials has been shown capable of producing nitrogen-containing graphitic nanoonions on sub μs timescales, which holds incredible promise as a high-throughput strategy for tailored synthesis of exotic and technologically relevant carbon nanomaterials^50,51. However, governing phenomena and associated kinetics remain poorly understood due to the extreme associated T and P and far-from equilibrium events that occur. Hence the present pure C/N systems can serve as a reductionist model for understanding this process.

Training data were generated through a combination of Kohn–Sham DFT molecular dynamics (MD) simulations and single point calculations, details of which can be found in Section “Methods”. The C/N binary phase diagram is unknown under our conditions of interst; thus, to generate training data, simulations were launched DFT-MD for three different C/N compositions, at a variety of temperatures and densities spanning 300 K/1 g cm³ to 9000 K/4 g cm³ as shown in the plot in Fig. 2. Systems at densities below 2.9 g cm³ were initialized with a graphitic structure, while higher density initial configurations were initialized with a diamond-like structure; N-atoms were then introduced by random substitution. Simulations were run for at least 5 ps; 20 training configurations were taken from each of the 10 simulations to build the initial 298 configuration training set. As is shown in Fig. 2, resulting configurations span graphitic, compressed gas, and high-density liquid, containing both small molecules and larger, polymeric structures. This training set was supplemented with 98 configurations for 3 solid C/N materials, mp-1985, mp-571653, and mp-563, found in the materials project database⁵², comprising cell optimization trajectories under 0, 5, 10, 20 and 40 GPa. Note that the former two structures have been observed experimentally^53,54,55.

**Fig. 2: Overview of the training data used for model development in this work.**

Models were generated using version the 2.0.0 of the ChIMES-LSQ package⁸. Fitting was automated via version 2.0.0 of the ChIMES Active Learning Driver, ALDriver^6,17 an open source python workflow tool for automated ChIMES model generation via iterative fitting. ChIMES simulations were conducted using ChIMES_MD, which is a part of the ChIMES-LSQ package.The Standard iterative learning strategy coupled with the newly implemented hierarchical learning capability was used for the present work. Briefly, in the iterative learning strategy, an initial (It-1) model is trained to the DFT-MD-generated training set. The model is then deployed in parallel simulations at a selection of the training compositions, temperatures, and densities. Generally, early models are not adequately informed by the available training data and can give rise to unstable simulations that frequently sample poorly informed regions of the model (e.g., near the inner cutoff) and do not conserve the appropriate quantity. Our active learning strategy³³ attempts to select up to 20 such configurations from each simulation, as well as 20 configurations from otherwise stable portions of the simulation. These ChIMES-generated configurations are then assigned labels (forces, energies, and stresses) via single-point DFT calculation, and then added to the training set, from which the next iteration model is generated; this is repeated for a user-specified number of iterative learning cycles. A total of 10 cycles were used in the present work. A weighting factor of w = n_I/I is applied to each training point, where n_I is the total number of requested iterative learning cycles and I is the current cycle, counting from 1. This has the effect giving DFT-MD generated configurations highest priority weights, which prevents the unphysical configurations generated by early ChIMES models (e.g., that may have extremely small interatomic distances) from driving the fit away from relevant physicochemical space. We note that the need for this weighting strategy arises from our desire to generate maximally efficient models, which means that at our target level of model complexity we may not be able to simultaneously describe near and extremely far from optimal structures equally well. Instead, by applying this decaying weighting scheme, we maintain importance of “ground truth” DFT configurations and ensures models converge with subsequent iterations while still adding the benefit of “rare event sampling” and longer-time scales accessible to ChIMES-based MD.

As in previous work³³, initial weights were set to w_F = 1.0 kcal mol⁻¹ Å, w_E = 0.3 kcal mol⁻¹, and w_σ = 100.0 kcal mol⁻¹ Å⁻³. These weights account for differences in the relative abundance and magnitude of the force, energy, and stress tensor training data. Models contained 1- through 4-body interactions, with corresponding polynomial orders of ${{\mathcal{O}}}_{{\rm{2b}}}=25$, ${{\mathcal{O}}}_{{\rm{3b}}}=10$, and ${{\mathcal{O}}}_{{\rm{4b}}}=4$. Remaining hyperparameters for the fits were selected using previously described ChIMES heuristic approaches^33,36 and are given in Table 1. All ChIMES simulations were run using either the ChIMES_MD code available in the CHIMES_LSQ repository⁸, or via LAMMPS⁵⁶ through version 2.0.0 of the ChIMES_Calculator Library⁵⁷. Simulations used a 0.2 fs time step and were run for up to 100 ps, however all analysis was performed on only the first 5 ps to enable consistent comparison with DFT. Additional details on use of these tools is provided in Supplementary Section I.

Table 1 Model hyperparameters including the inner cutoff (r_cut,in), 2-, 3-, and 4-body outer cutoffs (r_cut,out,2b, r_cut,out,3b, and r_cut,out,4b, repectively), and Morse variable for distance transformation (λ)

To assess efficacy of the proposed hierarchical transfer learning capability, three models were generated, henceforth referred to as “Standard”, “Hierarch,” and “Partial” for models fit using the standard a-la-carte approach, the new hierarchical transfer learning strategy, or a partial hierarchical strategy where two parameter blocks are learned simultaneously, respectively. We begin by comparing model performance relative to DFT, for C/N systems, and then extend our study to pure C and pure N.

Performance for C/N systems

Though one might intuitively expect the Standard and Partial models to outperform the Hierarchical model for the C/N system, we find that in general, all models perform equally well for this system. Thus, in this subsection, we will only present data from the Standard model when it shows significant deviations from the Hierarchical model.

We begin with discussion of numerical metrics. The complete Hierarch C/N model contains a total of 3026 parameters. Of those parameters, the 442 for C and 462 for N were fixed, taken from earlier work, and only the remaining 2122 corresponding to CN cross-interactions were fit. Figure 3 provides force, energy, and stress parity plots for fitting iterations 1, 2, 4, 8, and 10 of 10. Note that each plot only shows new training data introduced at that iteration. Overall, we find excellent agreement with DFT. Distributions of stress and energy remain relatively unchanged between iterations but there is a clear increase in the spread of forces at It-2, arising from configurations generated by the poorly-constrained It-1 ChIMES model. By It-4, the DFT generated range and distribution of forces (i.e., in It-1) are recovered by the ChIMES model; by It-10, models yield simulations for which the relevant quantity is conserved (see Supplementary Information Fig. 1).

**Fig. 3: Force, stress, and energy parity plots for successive active learning iterations of the Hierarch model.**

Moving on to physical property metrics, we find that pressure predictions at each training state point are within error of the DFT-predicted value (see Supplementary Information Fig. 2), while C/N crystal cold compression curves exhibits absolute percent differences ranging from 0.1 to 1.7% (see Supplementary Information Table II). Diffusion coefficients are also in good agreement with DFT (See Supplementary Fig. 3), though ChIMES models underpredict these values for the two graphitic structures comprising cases 1 and 2. This disagreement is due to use of r_cut,out ≤ 5, which precludes recovery of the low-lying dispersion forces that modulate inter-sheet interactions, but greatly reduces the model’s computational expense. Ongoing work is exploring overcoming this limitation by explicitly including D2 corrections in the ChIMES model²⁶. Radial pair distribution functions (RDFs) and vibrational power spectra for simulations using the Hierarch model are provided in Fig. 4. In general, we find excellent agreement with DFT, despite the diversity of structure, chemistry, and bonding across the investigated state points.

**Fig. 4: RDFs (left) and corresponding vibrational power spectra (right) predicted by DFT (thick solid lines) and the Hierarchically learned ChIMES model (thin dashed lines).**

To further quantify chemical evolution in each system, we determine mole fractions and corresponding lifetimes for atomic species and small molecules (C, N, N₂, N₃, and C₂N₂) observed in each of the 8 reactive simulations (i.e., with T > 300 K). A comprehensive overview of data is given in Supplementary Information Figs. 4, 5. As shown in Fig. 5, we once again find excellent agreement with DFT, consistent with the accuracy typical for an a-la-carte model^28,33,38,42. Species lifetimes are also in good agreement with DFT. Notably, the ChIMES simulations indicate a large spread in lifetimes for case-5 (1500 K, 1 g cm⁻³, 50 %N). This is due to coupling between low [N], low density, and short timescales, which makes ensuing chemistry sensitive to simulation initialization (e.g., structure and initial velocities).

Fig. 5: Parity plot comparing molfractions (top) and corresponding lifetimes (bottom) for small molecules and atomic species found in each case simulation, predicted by DFT and the Hierarchically learned ChIMES model.

Performance for pure C and N systems

In the previous section, performance of models trained on C/N data was evaluated for predicting CN data. In this section, we ask how well these models perform when predicting properties of pure C and pure N systems. While reading this section, recall the following:

The Standard model attempts to learn pure-C and pure-N, and C-N cross interaction parameters simultaneously from C/N training data.
The Partial model uses C parameters that were trained on C data, and attempts to learn pure-N and C-N cross interaction parameters from C/N training data.
The Hierarch model uses C and N parameters that were trained on just C and just N data, respectively, and attempts to learn only C-N cross interaction parameters from C/N training data.

Hence, the Hierarch model will represent best possible performance for both pure-C and pure-N and the Partial will yield the exact same performance as the Hierarch model for pure-C, since it uses the exact same C parameters. We note that results for the Standard and Partial fit models are taken from a single independent simulation, and that results are only presented when they are found to vary significantly between model development strategies.

Beginning with analysis of performance for Pure C systems, comparing the Standard and Hierarch models, we find that predicted pressures, RDFs, and vibrational power spectra are of similar accuracy (see Supplementary Information Table III and Supplementary Fig. 6). Particularly notable exceptions to this are shown in Fig. 6, where for low density, high temperature carbyne-like state points, the Standard model yields poor recovery of the corresponding RDFs. This result is unsurprising, since the system structures are dissimilar from anything in the C/N training set (see Fig. 2). Additionally, we find that diffusion coefficients for C at 6000 K and 2.5 g cm⁻³ and 7000 K and 2 g cm⁻³ are significantly over predicted relative to DFT (see Fig. 7). As shown in Table 2, we find the Standard model performs notably worse when predicting diamond and graphite lattice parameters, which is surprising since the C/N training data contains both graphite-like configurations and high density liquid. In particular, our 2017 ChIMES-Carbon model⁷ was trained to only a single liquid carbon state point, yet produced a significantly improved diamond lattice parameter (i.e., a = 3.565 Å).

**Fig. 6: Radial pair distribution functions and corresponding vibrational power spectra for the pure carbon system at selected temperatures and pressures.**

**Fig. 7: Diffusion coefficients for liquid carbon at selected temperatures and pressures.**

Table 2 Carbon lattice parameters (Å) for carbon predicted by DFT, the 2024 ChIMES carbon model (used by the Hierarchically and Partially-Hierarchically learned models), and the traditionally fit (Standard) models

Whereas in the paragraph above focused on pure C systems performance of the Hierarch and Partial models are expected to be identical (i.e., since they use the same underlying C-block parameters), performance for pure N systems will vary. Hence here, we compare performance of the Standard, Hierarch, and Partial models against DFT. In general, we find that all models yield good recovery of the DFT equation of state (Supplementary Table IV), RDFs, vibrational power spectra (Fig. 8), diffusion coefficients, and mole fractions and lifetimes for species formed (Fig. 9), with the Hierarch model yielding slightly better results, just as was seen for the pure C in the previous section. Notable exceptions to this performance include predicted pressure at 8000 K, 4.5 g cm⁻³, where DFT and the Hierarch model are in good agreement with P = 204.67 and 202.4 GPa respectively, versus the Standard and Partial, which over predict this value by approximately 60 GPa (i.e., with P = 257.20 and 259.4, respectively). The 300 K diffusion coefficient is also significantly underestimated by the Standard and Partial models. Small deficiencies are also observed in RDF and vibrational power spectra for the Standard and Partial Hierarch models. Namely, the 300 K, 1 g cm⁻³ RDF peak at r ≈ 3.5 Å is sharper shifted to larger r relative to DFT, and the and 6000 K, 2.5 g cm⁻³ power spectrum, which is missing the N₂ vibration peak near 2250 cm⁻¹, and exhibits non-zero vibrations between 1000 and 1500 cm⁻¹ unseen in the DFT data.