Accuracy of the machine-learning method for parameter estimation under the IM model
After training the three ML methods for each demographic parameter of the standard IM model on the training data set, we evaluated their performances on the test data set. We compared the predictions of these three ML methods using different metrics (Table 1). In the following, N_current refers to results for population 1. Since the model is symmetric and both population sizes have the same prior, the results obtained for N_current_1 and N_current_2 are similar (see Table S4 for population 2).
Root mean square error (RMSE) and mean absolute error (MAE) are commonly used measures, and the choice between them depends on the specificities of the issue to be tackled (Chai and Draxler 2014). The squaring operation in RMSE penalizes large errors more heavily, which is particularly appropriate when avoiding outlier prediction is critical. For three of the four demographic parameters, Split_time, Migration_rate and N_current, MLP performed better than RF and XGB for these two metrics. Conversely, for the N_ancestral parameter, the XGB model performed better according to the RMSE metric, while the MLP model performed better according to the MAE metric.
The MAE metric presents the advantage of directly reporting the average magnitude of the errors. Split_time was the demographic parameter for which the MLP model outperformed the two other models by the largest margin: the MAE of RF and XGB were 522 and 467 generations respectively, i.e., 36 and 21% higher than the MAE of MLP, which was only of 385 generations. For the migration rate, which ranged between zero and 10−3 per generation in the simulations, MLP achieved an average MAE of around 7.5 × 10−5 per generation, compared with 8.0 × 10−5 and 8.8 × 10−5, respectively, for the XGB and RF models, i.e., a higher MAE by 6 and 17%, respectively. For ancestral and current effective population sizes, the lowest average errors were also achieved by the MLP models. For ancestral effective population size, MLP outperformed only marginally the other methods, with a mean error for MLP of 127 individuals compared with 129 and 130 individuals, respectively, for the RF and XGB models. The difference was more substantial for the current effective population size: for this parameter, the MAE of RF and XGB were higher by 14 and 4%, respectively, than the MAE of 375 of the MLP.
The NMAE represents a normalized version of the mean absolute error (MAE), in which the error for a given parameter is divided by the difference between the minimum and maximum bounds of the prior of this parameter. This allowed us to have a comparison between the estimates of all parameters, whatever their range. N_ancestral had the lowest NMAE (0.0127). It was thus the parameter that could be estimated with the highest accuracy, followed by N_current (NMAE of 0.0375). Migration_rate and Split_time parameters, with respective NMAE of 0.0752 and 0.0771, showed the lowest accuracy.
Observing predicted versus observed values allowed us to obtain more detailed information about the predictions for each parameter (Fig. 2; Fig. S3). RF and XGB methods showed both a downward bias for high values of splitting time and migration rate, in contrast to MLP (Fig. 2). We also observed significant upward biased predictions for the RF model for low migration rates, leading to a large overall error for this method on this demographic parameter.

Predicted values versus observed values for the four isolation-with-migration parameters: Split_time (in generations), Migration_rate (per generation), N_ancestral, and N_current, estimated with three different ML methods: RF (Random Forest), XGB (XGBoost), and MLP (Multilayer perceptron).
Comparison with ABC methods
The performance of the ABC algorithms depended on both the tolerance thresholds and the choice of using either the median or the mean of the posterior as estimator (Table S3). Whatever the tolerance threshold chosen, the three ML methods all outperformed these ABC algorithms on the same simulation set (Fig. S4). The ABC rejection algorithm always showed lower performances than the ABC loclinear and ABC neuralnet methods. Compared with the three ML methods, the performances of ABC algorithms in predicting current and ancestral effective population sizes were substantially lower. The difference between the three ML methods and the ABC algorithms was less pronounced for the migration rate parameter, but still in favor of the ML methods.
Accuracy of the machine-learning method for parameter estimation under the SC model
We then trained the same hyper-parametrized models on the SC scenarios. In the following, N_current and Growth_rate refer to results for population 1. As the model is symmetrical and the priors are the same for both populations, the results are similar to those for population 2.
For the parameters already present in the standard IM model, MLP was again generally the best-performing method (Table 2). This was particularly noteworthy for Split_time, for which RF and XGB showed both a downward bias for high values of that parameter (Fig. S5). N_ancestral remained the most accurately predicted parameter with the lowest NMAE (4.19 × 10−2 for MLP). The model with the lowest RMSE for Growth_rate prediction was XGB (6.06 × 10−4), while MLP showed the lowest MAE (4.51 × 10−4). Finally, for the migration duration, MLP achieved the best performances in terms of both RMSE and MAE.
In the following sections, we focus on the standard isolation with migration model (IM).
Assessing accuracy discrepancies across demographic scenarios
We focus in this section on MLP, as it showed the lowest MAE. We investigated if there was a heterogeneity in prediction quality according to the demographic scenarios. In order to address this question, we used regression to identify to what extent the different demographic parameters could explain prediction errors (Fig. S6). We then plotted these prediction errors as a function of the two most explanatory parameters (Fig. 3).

Standardized errors \(\left({ {\hat{y}} }_{i}-{y}_{i}\right)/\left({y}_{\max }-{y}_{\min }\right)\) where y is the target demographic parameter: Split_time (in generations) (A), Migration_rate (per generation) (B), N_ancestral (C), and N_current (D), as a function of the two variables most predictive of these errors under the MLP method. Cold colors correspond to areas of low error and therefore better model prediction, while warm colors correspond to areas of higher error.
Regarding the prediction of split time, we observed that the best performance was achieved for smaller ancestral effective population sizes (Fig. 3A). As this effective population size increased, the prediction error increased. This effect was more pronounced for older split times. When predicting the migration rate, MLP performed less efficiently for recent split times. As the split time increased, the prediction quality was higher for a lower migration rate (Fig. 3B).
As for the ancestral effective population size, the values taken by the current effective population sizes had the largest impact on the quality of the prediction. The lowest accuracy was observed for the ancestral size when both current populations showed low effective sizes. When one of the two effective population sizes was small, prediction quality remained lower than average, regardless of the effective size of the second population (Fig. 3C).
Split time was the most decisive parameter in the quality of the prediction for current effective population size parameters. As this time increased, the prediction became increasingly accurate, whereas demographic scenarios with recent split time and high current effective population sizes led to the highest errors.
Different patterns of summary statistic usage across ML models
We estimated the importance levels of various summary statistics classes for each ML model and each demographic parameter in order to determine to which extent they differ in that aspect. We computed the importance of each variable, using the PFI method for the three ML methods, which computes the extent to which the quality of a prediction of a given model decreases when a single variable is randomly shuffled.
We observed first that RF and XGB allocated quite similar importance levels to the various classes of summary statistics (Fig. 4), while the MLP method generally showed different behaviors. For Split_time, the SFS class stood out for the two ensemble methods, with importance levels of 0.71 and 0.66 respectively. The JSFS class was second in terms of importance (importance levels of 0.10 and 0.16 for RF and XGB, respectively), but the gap was high with the SFS class. Conversely, for MLP, the JSFS class was the most important, with an importance of 0.63, ahead of the SFS and AFIBS classes. The differences among methods were much less pronounced for the Migration_rate parameter, where the AFIBS, SFS, and foremost JSFS classes were used predominantly by all methods. Nevertheless, the XGB and RF models also gave significant importance to Fst, unlike the MLP model.

Importance levels of the different classes of summary statistics as a percentage of the total for each demographic parameter and each method (RF, XGB, MLP).
The differences were much more striking for the effective population sizes. For the ancestral effective population size, RF and XGB focused almost exclusively on the IBS class, with importance levels of 0.99 and 0.96, respectively. Conversely, MLP used the different classes much more evenly, with five classes showing noticeable levels of importance. The discrepancy among methods was a bit lower but still striking for the current effective population sizes. All methods used several classes of summary statistics, but the importance levels were strongly unbalanced toward AFIBS for RF and XGB, with values 0.66 and 0.65 respectively, while MLP used again the different classes more evenly.
This similarity in the variables used by XGB and RF was confirmed by observing the individual importance levels of each summary statistics (see Fig. S7). In the case of split time, the variables related to SFS were of the utmost importance for the predictions made by these two methods. In a similar manner, these methods assigned the greatest importance to Fst, followed by statistics from the JSFS class for predicting the migration rate. The most important variables for predicting the ancestral effective population size belonged all to the IBS class, while for the current effective population size these variables belonged to different classes: AFIBS, JSFS and winHET. The most important summary statistics chosen by MLP differed strikingly from those chosen by RF and XGB. They also exhibited markedly lower importance values. Indeed, the maximum importance attributed to a single feature by MLP was 2.6% for the Migration_rate parameter. For this method, importance was clearly spread over a much larger number of variables than for the two other methods (Fig. S8).
Towards improved explainability
The ML methods do not directly provide the impact of each individual statistic on the parameter estimation, due to their inherent complexity, nonlinear relationships, or the fact that they are a combination of a set of predictors. This makes the interpretation of the individual impact of each summary statistics more complex. While some methods such as the PFI method used above are effective in revealing features that are important for prediction, they reach their limit when it comes to quantifying the impact of features while taking into account how they interact with each other. Interpretation techniques may then prove necessary to provide insights into the role of summary statistics in complex models. We therefore used SHAP (Lundberg and Lee 2017), a method based on Shapley values that measures the average marginal contribution of each feature when combined with all other to make a prediction for a given parameter. In addition to identifying the most important summary statistics for each parameter, this method also allows investigating finely how these statistics affect the model outputs, quantifying both the direction and magnitude of their contributions to the prediction of each parameter. We illustrate the contributions of this method in the IM model for the Split_time predictions obtained with the XGB model, which combined high accuracy with an ability to bring out statistics with substantial importance (Fig. S7).
Focusing first on the summary statistics with the greatest impact, we observed that the frequency of tripletons in the overall population (SFS_3_all-mu) showed the highest mean of absolute SHAP values (Fig. 5A). This statistic was also the one that showed the highest importance (Fig. S7). The distribution of SHAP values for the most important variables allowed us to understand the impact of individual summary statistics on predictions (Fig. 5B). We observe that SFS_3_all-mu had the greatest upward and downward impact on the prediction of Split_time. While the highest values of this variable almost always had an upward impact, we see that intermediate values had a substantial impact, particularly downward, when combined with other statistics (Fig. 5C). We also observed that for the variable SFS_19_p1-mu, which represents the frequency of SNPs with the ancestral allele found in a single haplotype, higher values have a greater impact on the parameter compared to lower values, which have a more limited impact (Fig. 5B).

A Top features with the highest means of absolute SHAP values for the XGB model on the Split_time parameter. B Feature effects indicated by the Shapley value on the x-axis for the top 5 features with their values represented by color. Each point corresponds to one instance. C Shapley values as a function of the normalized SFS_3_all-mu (gray points), distribution of the normalized SFS_3_all-mu (gray bars).
