The procedure used to generate an ensemble of models, characterized by \({\pi }_{{\mathcal{H}}}^{* }\), a probability density over model parameters, is described in details in “Methods”. The ability of the ensemble \({\pi }_{{\mathcal{H}}}^{* }\) to characterize the uncertainty on the predictions of the MLE model is assessed using a comprehensive validation suite of properties that are often of interest in practical applications of MLIAPs, including perfect crystal properties, defects, and energy barriers. Note that none of the validation properties were explicitly included in the training data, which was generated without any input from domain experts according to the procedure described in ref. 13, and therefore can be considered as an assessment of the UQ procedure on genuinely unseen test data.
Pointwise properties
In keeping with the traditional ML literature, the most common approach to characterizing the performance of MLIAPs is through point-wise error metrics measured on a testing set that is nominally independent of the training set. Predicting the distribution of errors incurred by the MLIAP is therefore a natural objective. Figure 1a) reports the distribution of the ratio of actual point-wise test errors to the difference between the MLE and UQ ensemble models. The results demonstrate that the overall error distributions from the MLE is extremely well captured by the resampled POPS ensemble \({\pi }_{{\mathcal{H}}}^{* }\). This shows that the deviation between the predictions of the MLE and those of individual samples from the model ensemble provides a representative statistical estimate of the actual difference between the MLE and (often unknown) exact reference value. The ensemble also provides excellent bounds on predictions: maximal and minimal predictions of an ensemble of 500 models sampled from \({\pi }_{{\mathcal{H}}}^{* }\) fails to bound the actual reference energies and forces in only 2.2% and 3.0% of the cases, respectively. Furthermore, the bounds provided by the model ensemble capture very specific features of individual predictions. E.g., in addition to capturing the generic increase in error with increasing energy or forces that results from the reweighting scheme used to train the MLIAP, “outlier” points with unusually large errors compared to their neighboring peers are very accurately captured (c.f. the outlier points in Fig. 1).

Top row (a, b): UQ on energies; Bottom row (c, d): UQ on forces. Left column (a, c): distribution of test errors for the MLE against the reference data (black) and from \({\pi }_{{\mathcal{H}}}^{* }\)ansatz against the MLE (green). MAE mean absolute error relative to the minimum loss solution, EV envelope violation, fraction of points lying outside of the max/min bound. Right column (b, d): parity plot of actual vs MLE predicted energies. Shaded areas show the min/max range of predictions over all members of \({\pi }_{{\mathcal{H}}}^{* }\). In this case, 10,000 models were sampled.
This correlation between actual errors and predicted uncertainties can be seen in Fig. 2, where a estimate of pointwise error was obtained as half of the min/max range of predictions from models sampled from \({\pi }_{{\mathcal{H}}}^{* }\). As the results clearly show, the actual errors are directly correlated with the confidence interval predicted by the POPS approach.

a energies; b forces. The predicted error metric is here taken as half of the range of the minimum/maximum predicted values. Means and medians are taken with respect to bins of actual MAE errors.
These results clearly show that the UQ ensemble not only captures average error behavior, but closely resolves high uncertainty regions that result from particularly detrimental combinations of test point and intrinsic model limitations. The ability to confidently bound predictions is also a powerful feature that can be used to easily propagate worst-case scenarios to more complex quantities, as will be shown in the following.
It is important to note that, in the misspecified setting, the trained model will be a function of the specific distribution from which the training data was sampled even in the infinite data limit, in contrast to the well-specified case where the correct model can be recovered from different training distributions62. This dependence will in general, lead to errors being larger in sparse regions than in dense regions63, a phenomenon called error amplification. This is broadly consistent with the behavior observed in Figs. 1 and 11 below, where sparser regions generally tend to show higher errors (although sparsity should ideally be assessed in feature space, not in the output space). Crucially, when this effect persists in the infinite data limit, it is better described as a form of data-distribution-dependent misspecification error than as a form of epistemic error.
Perfect crystal properties
A second key class of properties of direct interest to applications is the quantification of the stability of different crystal structures. Figure 3 and Tables 1–3 demonstrate that the UQ ensemble accurately captures the actual errors in formation energy, equilibrium volume, and bulk modulus over 13 different crystal structures that vary broadly in topology and unit cell sizes. These results are obtained by using atomistic configurations and simulation cells that were individually optimized under corresponding MLIAPs, in contrast to evaluating point-wise energies on the reference structures relaxed with DFT.

a Equilibrium formation energy for different crystal structures relative to the BCC phase; b ratio of equilibrium volumes to the BCC phase; c bulk moduli. MLE predictions are shown in blue, DFT reference values in red. Individual ensemble predictions and bounds are shown in grey.
The results clearly show that the UQ ensemble accurately captures the uncertainty inherent to different phases, providing tightly distributed predictions where the actual errors are low and more diverse predictions where the actual errors are large, in addition to accurately bounding the actual predictions in all cases.
Tables 1 to 3 also show that the standard deviation of the UQ ensemble predictions provide a statistically representative indication of the magnitude of the actual errors, as the mean ratio of the MLE error to the standard deviation of the ensemble is close to 1, except for the formation volume where the ensemble overestimates the errors by about a factor of 2 on average.
In all cases, the extreme values predicted by the ensemble bound the actual reference result, providing strong guarantees.
Furthermore, in addition to information regarding the absolute accuracy of the predictions, it is often desirable to establish whether the MLIAPs can be expected to predict the relative ordering of certain properties across different phases, e.g., of the formation energy, which determines the most thermodynamically stable phase at low temperature. Figure 4a and b demonstrate that the distribution of Spearman rank correlation coefficients between MLE and members of the UQ ensemble (blue histograms) provides representative estimates of the actual correlation between MLE and the reference data (black vertical line): while most potentials agree with the MLE with regards to the ordering of the formation energies, the relative ordering of the equilibrium volumes shows a much broader distribution. In both cases, the Spearman correlation coefficient between MLE and the reference is contained within a one standard deviation interval around the ensemble to MLE mean. This is a very desirable feature, as it enables the end-user to establish confidence in the accuracy of ranked comparisons without access to reference data.

The panels report histograms of rank correlations between MLE and ensemble models; Black line: rank correlation between DFT and MLE. The shaded area corresponds to a one standard deviation interval around the mean of the UQ ensemble. a Equilibrium formation energies of different phases; b Equilibrium volumes of different phases; c SIA formation energies for different structures; d surface formation energies along different orientation.
Transformation pathways between crystal structures are also relevant to the analysis of phase transitions. A range of such continuous transformation paths is reported in Fig. 5. The MLE MLIAP closely reproduces reference DFT results for the four paths that were considered. In all cases, the distribution of predictions from the UQ ensemble is tightly concentrated, except for the orthorhombic bcc → bct → bcc path where the predictions in the bct region are somewhat broader. In all cases, the UQ ensemble bounds the reference DFT values while providing a representative quantification of the actual error incurred by the MLE MLIAP.

MLE predictions are shown in blue, DFT reference values in red. Individual ensemble predictions and bounds are shown in grey. Panel a trigonal transformation path; b tetragonal transformation path; c orthorhombic transformation path; d hexagonal transformation path.
Phonons
Another key indicator of the thermodynamics and dynamics of crystal structures is provided by phonon dispersion relations, which are often prized as they can be correlated with scattering or spectroscopic experiments, as well as quantify the contribution of vibration-entropic effects to the thermodynamic stability of different crystal structures. Note that phonon properties derive from the diagonalization of energy Hessians or dynamical matrices and are therefore determined by second-order derivatives of the energy, which were not explicitly present in the training set.
Therefore, POPS ensembles were not explicitly constructed to match elements of the Hessian. Comparison of DFT and MLE results (c.f., Fig. 6) shows that the MLIAP performs well at low frequencies, but significantly overestimates the vibrational density of states at high frequencies (c.f., right panel), potentially reflecting the absence of Hessian training data. Correspondingly, the range of spectra predicted by the ensemble is also very broad, suggesting low confidence in the predictions. The UQ ensemble however, still correctly bounds the reference spectrum across the whole range of wavevectors.

a phonon dispersion in the BCC phase; b vibrational density of states. MLE predictions are shown in blue and DFT reference values in red. Individual ensemble predictions and bounds are shown in grey.
Defect energetics
Finally, due to their critical role in determining the mechanical properties of engineering materials, the energetics of defects are often key quantities used to train and validate potentials. We considered two classes of defects: self-interstitial atoms (SIAs) — which are particularly important to understand the behavior of materials under irradiation — and free surfaces. In both cases, formation energies were obtained self-consistently using the energy-minimizing structures predicted by each potential. The results are presented in Fig. 7 and Table 4. The energy scale for SIA formation is accurately captured by the MLE model, and the ensemble results bound the actual formation energies. The standard deviation of the UQ ensemble provides an excellent statistical representation of the actual error incurred by the MLE. In this case, the formation energies for 110 and OS variants are underestimated by the MLE, leading to a different predicted ordering of the relative defect stabilities. As shown in Fig. 4, the distribution of Spearman rank correlation coefficients between MLE and members of the UQ ensemble is also very broad, consistent with the observed ranking disagreement between MLE and reference values; the rank correlation coefficient between the reference and the MLE is found within one standard deviation of the mean of the correlation coefficients between MLE and ensemble.

MLE predictions are shown by blue bars, DFT reference values by red bars, and ensemble predictions by blue lines.
Surfaces are another class of important planar defects that, e.g., control the shape of nanoparticles. Figure 8 and Table 5 demonstrates that the MLE MLIAP in provides an adequate representation of the energies of different facets. In this case, the standard deviation of the ensemble prediction conservatively overestimates the actual errors by about a factor of 4, once again providing worst-case bounds that always include the actual reference value.

MLE predictions are shown by blue bars, DFT reference values by red bars, and ensemble predictions by blue lines.
Figure 4 also shows that the ordering of surface energies is robustly captured by the MLE MLIAP, which is corroborated by the narrow distributions of rank correlation coefficients between MLE and members of the UQ ensemble. In these cases also, the distribution of Spearman coefficients is consistent with the very high correlation between the MLE and the reference.
Energy barriers
In addition to thermodynamics, an assessment of uncertainty of properties related to defect kinetics is often extremely desirable, especially since kinetic properties can be exponentially sensitive to transition barrier energetics. This makes it extremely important to avoid overly pessimistic UQ, since it can translate into exponentially large differences in predicted characteristic timescales. Furthermore, saddle points are computationally expensive to harvest in large numbers using reference quantum methods, which makes them potentially drastically underrepresented in most training sets for MLIAPs. Figure 9 reports on the performance of the UQ ensemble for the minimum energy pathway of a first neighbor vacancy hop in BCC W. The results show that the MLE overestimates the reference results by a significant margin (about 0.5 eV), but that the UQ ensemble offers a quantitatively appropriate estimation of the error on the energy barrier. Note that the minimum energy pathways were individually reconverged for each MLIAP, and not simply reevaluated along the reference minimum energy pathway.

MLE predictions are shown in blue, DFT reference values in red, and ensemble predictions in grey.
Fast UQ propagation via implicit expansions
Many material properties, such as defect energetics and energy barriers, are calculated via local energy minimization. In principle, propagation of parameter uncertainty to these properties requires brute force repetition of simulations, which quickly becomes unfeasible as system size or system count increases. In this section, we apply a recently introduced approach to assess the predictions from the UQ ensemble by employing the implicit differentiation of atomic minima33. The implicit derivative emerges by noting that a stationary atomic configuration X* is an implicit function of the potential parameters Θ. As shown in ref. 33, the implicit derivative of atomic configurations, \({\nabla }_{{\mathbf{\Theta }}}{{\bf{X}}}_{{\mathbf{\Theta }}}^{* }\), can be computed efficiently for linear-in-descriptor potentials. This enables the calculation of the change in stationary configurations, \(\Delta {{\bf{X}}}_{{\mathbf{\Theta }}}^{* }\), under relatively small potential perturbations, ΔΘ, without re-minimization of the system for each potential sample. This method is advantageous in scenarios where performing molecular statics calculations is expensive due to the large system size or a large number of ensemble potentials.
Here, we apply the implicit approach to two UQ estimation cases presented above: (1) equilibrium volumes of BCC and HCP W phases and (2) minimum energy pathways for a first-neighbor vacancy hop in BCC W. For both cases, implicit derivative of the equilibrium volumes \({V}_{{\mathbf{\Theta }}}^{* }\), \({\nabla }_{{\mathbf{\Theta }}}{V}_{{\mathbf{\Theta }}}^{* }\) is sufficient for the predictions. More details of the implicit expansion method and various forms of truncation/approximation are given in ref. 33.
For UQ of the equilibrium volumes, we first compute the implicit derivatives \({\nabla }_{{\mathbf{\Theta }}}{V}_{{\mathbf{\Theta }}}^{* }\) at BCC and HCP minima with the MLE potential. Then, for each potential sample from the UQ ensemble, we predict the BCC and HCP volume change \(\Delta {V}_{{\mathbf{\Theta }}}^{* }=\Delta {\mathbf{\Theta }}\cdot {\nabla }_{{\mathbf{\Theta }}}{V}_{{\mathbf{\Theta }}}^{* }\). Left panel of Fig. 10 shows the predicted BCC and HCP volume ratios vs their true values obtained with minimization for potentials from the UQ ensemble. For UQ of the minimum energy pathways, we perform the full calculation with the MLE potential, and identify the initial and saddle-point configurations. We then compute the implicit volume derivative at the initial configuration. We predict the energy change at initial and saddle-point configurations using the Taylor expansion for atomic energy with implicit derivative (see ref. 33 for more details). Figure 10, right panel, shows the implicit derivative predictions of the energy barriers compared to the full pathway calculations.

a ratio of equilibrium volumes of HCP and BCC W phases; b minimum energy pathway barriers for a first-neighbor vacancy hop in BCC W.
For both cases, the implicit derivative technique provides the predictions within less then 4% of error for both cases. Since the goal of the POPS approach is to provide the worst-case bounds for the quantities of interest, combination of the UQ ensemble potentials with the implicit derivative predictions provides the ultimate efficient scheme for the UQ of the molecular statics properties.
Application to universal MLIAPs
Recent message-passing neural network (MPNN) models64,65 have shown impressive approximation ability to predict atomic energies and forces of diverse multi-specie configurations across the periodic table60,61. There is thus significant interest in assessing the accuracy of these ‘universal’ MLIAPs (UMLIAPs), to determine both uncertainty in predictions and select optimal training configurations for fine-tuning, where additional training data is used to adjust a small subset of model parameters.
The per-atom energy prediction of UMLIAPs \({E}_{MPNN}^{i}\) is produced by a readout function64,65, which typically receives scalar-valued messages from the MPNN, namely scalar-valued node features. In the framework of this paper, we can therefore treat the scalar-valued input to the readout layer as per-atom descriptors Di, as in the MACE MPNN model64. To motivate forthcoming studies of misspecification-aware UQ and fine-tuning for UMLIAPs, we applied the POPS UQ scheme to a linear ‘corrector’ for the MACE-MPA-0 foundation model60, trained on the mptraj61 and sAlex66 datasets. Specifically, we consider a simple linear model in addition to the MACE-MPA-0 prediction, giving a loss function
$$L({\boldsymbol{\Theta }})=\frac{1}{2}\sum _{i}\parallel {E}_{{\rm{DFT}}}^{i}-{E}_{{\rm{MACE}}-{\rm{MPA}}-0}^{i}-{\boldsymbol{\Theta }}\cdot {{\bf{D}}}_{i}{\parallel }^{2},$$
(1)
where \({{\bf{D}}}_{i}\in {{\mathbb{R}}}^{256}\) is the MACE per-atom descriptor vector. We emphasize that the goal of the linear corrector formalism is not to improve MACE-MPA-0 model but to capture the misspecification uncertainty, namely the model-form errors, which are manifest as uncertain model parameters. In principle, every parameter of the model has a misspecification uncertainty. However, for UQ purposes, we are free to consider a subset of parameters fixed, in effect restricting the function space for propagation and thus typically overestimating the uncertainties on the remaining variable parameters. For MACE-MPA-0, we consider all parameters fixed apart from the linear component of the readout layer, a procedure which allows us to use POPS to determine parameter uncertainties, giving the same benefits for conservative error prediction and robust propagation of model-form errors to simulation results as demonstrated above for the SNAP form of MLIAPs. Similar results could be acheived using e.g., any additive linear model or linearizing around the minimum loss solution. Study of the misspecification parameter covariance also gives an architecture-sensitive error measure that can be used to guide fine-tuning, a detailed study of which will be presented elsewhere.
We applied the POPS scheme to obtain a posterior distribution \({\pi }_{{\mathcal{H}}}^{* }({\boldsymbol{\Theta }})\) trained over energies from the mptraj dataset, including all 89 elements of mptraj with a 50:50 train:test split. As shown in Fig. 11, the ability of POPS to accurately predict test error distributions and bound worst-case errors seen for linear MLIAPs is maintained in application to linear correctors for UMLIAPs. We observe excellent coverage of the test error distribution over at least ± 4 standard deviations, with a small envelope violation of 1%. These preliminary results show that the general misspecification-aware framework introduced here can be applied to recent universal MLIAPs; future work will develop this approach both for uncertainty propagation and active learning workflows for UMLIAP fine-tuning.

a distribution of test errors for the MLE against the reference data (black) and from \({\pi }_{{\mathcal{H}}}^{* }\)ansatz against the MLE (green); b parity plot of MLE predicted vs actual energies for a subset of points. MAE: mean absolute error relative to the minimum loss solution. EV: envelope violation, fraction of points lying outside of max/min bound. Shaded areas show min/max range of predictions over all members of \({\pi }_{{\mathcal{H}}}^{* }\).
Comparison with standard uncertainty ensembles
As discussed in the introduction, a popular approach to gauge MLIAP uncertainty is an ensemble bagging predictor49, where an ensemble of models is produced by subsampling the training set, as opposed to the POPS ensemble and hypercube approaches presented in the present paper. As previously discussed31,50, there is no principled limit in how the number of ensemble members should be chosen, and the training cost in general scales with the number of ensemble members. Furthermore, whilst predictions in e.g., energy errors from ensemble methods can in principle be improved with e.g., an ad hoc rescaling factor, there is no theoretical basis for expecting the resultant error bounds to be robust, i.e., that the ground truth lies within these bounds. These theoretical limitations persist when propagating uncertainty and bounding quantities of interest. To give an example of the issues with bagging ensemble methods, we apply a simple 5-member ensemble for the POPS linear corrector discussed in the previous section, in addition to a model with a simple rescaling factor applied to energy predictions such that the mean training and predicted variance agree. As can be seen in Fig. 12, the unscaled ensemble significantly underestimates the test errors and has an extremely high envelope violation (EV) of 80%. Whilst applying a rescaling factor improves the predicted error histogram, as can be expected, the EV rate remains as high as 14%, which in practice means the predicted bounds are not reliable for property prediction. As a further demonstration, Fig. 13 shows the predicted vs actual error correlations as in Fig. 1, where it can be seen that even when compared to the rescaled ensemble, POPS shows a far superior error correlation. We emphasize that these behaviors remain when increasing the number of ensemble members, whilst the training cost is much higher than the POPS approach for NE≥2. These simple tests demonstrate how POPS provides an efficient and robust means to both predict test error distributions and, uniquely, bound test errors, which as we have shown above, allow for bounding of propagated materials properties.

The ensemble results (orange) and directly compared with the POPS linear corrector in Fig. 11. We also consider a model with an ad hoc rescaling chosen to match the predicted and training MAE (green). a and b are as in Fig. 11, with only the rescaled model shown on the right panel.

a using the POPS or b using the unscaled and rescaled ensemble linear correctors. As in Fig. 2, we take the predicted error to be the half the min/max range across the ensemble, focusing on the error range up to 0.5eV/atom for which sufficient statistics are available (at least five data points in each interval). As can be seen, even when compared to the rescaled ensemble, the POPS predicted errors show a much stronger correlation with the actual observed errors, in addition to the robust max/min bounds evidenced above.
