We conducted case studies on four historical African conflicts: (1) the Mali conflict, with a simulation period of 300 days from 29th February 2012; (2) the Burundi conflict, with a simulation period of 396 days from 1st May 2015; (3) the South Sudan conflict, with a simulation period of 604 days from 15th December 2013; and (4) the Central African Republic (CAR) conflict, with a simulation period of 820 days from 1st December 2013. While these conflicts started in different contexts, they share common driving factors, such as violence, political instability, and civil war, which have led to large-scale displacement. The chosen time periods capture the critical phases of each conflict, when major forced displacement crises occurred, and ensure the availability of reliable data for model validation. The conflict instances are denoted as Mali 2012, Burundi 2015, South Sudan 2013, and CAR 2013, respectively.
Daily conflict/no conflict prediction using the RF model
Our initial analysis reveals a significant imbalance in our dataset, with over 95% of the data points representing periods of peace (value 0) across all four scenarios. This imbalance poses a challenge for our forecasting task, particularly because we need to predict conflict for various time horizons, ranging from one year to several years into the future. It is particularly problematic because our training data is heavily skewed towards peace (value 0), while the test data is less skewed. This discrepancy leads to the model underpredicting conflicts (value 1) across different time horizons.
To address this challenge, we employ a downsampling technique on the training dataset, where we randomly omit a portion of the peaceful events. The downsampling factor was determined through an optimisation process, aiming to strike a balance between accuracy and recall. Accuracy tends to be exceptionally high if we exclusively predict peaceful events, considering their preponderance in the dataset. Conversely, recall provides insight into how effectively we predicted conflict-related events. Our objective is to strike the optimal balance aligned with the requirements of the migration model. Supplementary Fig. 1 displays the metric’s performance under various downsampling values. In this graph, the downsampling factor ‘k’ corresponds to the reduction factor applied to the training dataset featuring zero values. Our analysis indicates that the most favourable compromise is achieved by downsizing the peaceful observations by a factor of 20, corresponding to \(k=20\).
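The downsampling step can be sketched as follows. This is a minimal illustration, assuming that a factor k means keeping roughly one in k peaceful rows while retaining every conflict row; the function name `downsample_peace`, the toy data, and the seed are hypothetical and not taken from the study's code.

```python
import random

def downsample_peace(rows, k, seed=0):
    """Keep every conflict row (label 1) and about 1/k of the peace rows (label 0).

    `rows` is a list of (features, label) pairs. Interpreting the factor k as
    "keep one in k peaceful rows" is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    conflict = [r for r in rows if r[1] == 1]
    peace = [r for r in rows if r[1] == 0]
    n_keep = max(1, len(peace) // k) if peace else 0
    kept_peace = rng.sample(peace, n_keep)  # random subset without replacement
    out = conflict + kept_peace
    rng.shuffle(out)  # avoid any ordering by label
    return out

# Toy training set: 970 peace rows and 30 conflict rows (features are placeholders).
train = [((i,), 0) for i in range(970)] + [((i,), 1) for i in range(30)]
balanced = downsample_peace(train, k=20)
n_conflict = sum(label for _, label in balanced)
print(len(balanced), n_conflict)
```

Only the training set is downsampled in this sketch; the test set keeps its original class proportions, so the reported metrics still reflect the real imbalance.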
We evaluate the performance of the RF Classifier by comparing it against two baseline models, both based on a Bernoulli distribution. The first model uses a Bernoulli distribution with \(p=0.5\), simulating a coin toss. The second model uses a Bernoulli distribution with p equal to the proportion of 1s in the actual data, reflecting the true distribution of observed values.
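The two baselines can be illustrated with a short sketch. The helper `bernoulli_baseline`, the toy label vector, and the seeds are hypothetical choices for illustration only.

```python
import random

def bernoulli_baseline(n, p, seed=0):
    """Draw n independent 0/1 predictions, each equal to 1 with probability p."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]

# Toy ground truth: 95 peaceful days and 5 conflict days.
actual = [0] * 95 + [1] * 5

# Baseline 1: coin toss (p = 0.5).
coin_toss = bernoulli_baseline(len(actual), 0.5, seed=1)

# Baseline 2: p equal to the observed proportion of 1s in the actual data.
p_obs = sum(actual) / len(actual)
matched = bernoulli_baseline(len(actual), p_obs, seed=2)
```

Because the second baseline draws 1s at the observed rate, it predicts mostly zeros on an imbalanced dataset, which is why it can reach high accuracy with low recall.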
Figure 3 shows the recall, accuracy, and ROC-AUC score for three conflict forecasting models (RF model, Bernoulli distribution, and random guess). The ROC-AUC score measures how well a model distinguishes conflict from no-conflict cases: it is the area under the curve of the true positive rate plotted against the false positive rate. A score of 1 means perfect predictions, while a score of 0.5 or lower means the model is no better than random guessing.
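For readers who want to reproduce these three metrics on their own predictions, a minimal sketch is given here. It uses the pairwise-ranking definition of ROC-AUC, which equals the area under the ROC curve; the toy labels, scores, and the 0.5 decision threshold are assumptions for illustration.

```python
def recall(y_true, y_pred):
    """Share of actual conflict days (1) that were predicted as conflict."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

def accuracy(y_true, y_pred):
    """Share of all days whose label was predicted correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def roc_auc(y_true, scores):
    """Probability that a random conflict day scores higher than a random
    peace day (ties count as 0.5). Needs at least one day of each class."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: four days, with predicted conflict probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5
print(recall(y_true, y_pred), accuracy(y_true, y_pred), roc_auc(y_true, scores))
```

The pairwise form is quadratic in the number of observations, so a library routine would be preferable for large test sets.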

Comparison of recall vs accuracy across models: the circle shape represents the RF model, the diamond shape a Bernoulli distribution, and the square shape a random guess. The plain shape represents the mean value and the transparent markers represent the four simulations for each model. The ROC-AUC score is illustrated by a gradient red color, where a deeper red shade indicates a higher score. The dotted line is the product of accuracy and recall when equal to 0.3 and 0.6.
As can be seen, the RF model consistently outperforms other models, achieving the highest scores in both the product of recall and accuracy, as well as in the ROC-AUC metric (see Supplementary Table 2 for more detailed results). In contrast, the Random and Bernoulli models exhibit low standard deviations in accuracy and recall across all study cases, with their observations clustered around specific points: (0.5, 0.5) for the random guess and (0, 1) for the Bernoulli model. These patterns can be attributed to the inherent characteristics of each model. The Bernoulli model, predicting mostly zeros, achieves high accuracy in unbalanced datasets but struggles to correctly identify conflict events (low recall). The random guess, by definition, predicts an equal proportion of ‘1’s and ‘0’s, resulting in scores of approximately 0.5 for both recall and accuracy. The RF model, however, demonstrates more varied performance across cases. It achieves excellent scores for Burundi (with a product of recall and accuracy around 0.6), average performance for Mali and the Central African Republic (CAR), and a low recall score for South Sudan—though still twice as high as the Bernoulli model (0.053 compared to 0.026). The RF model’s ROC-AUC scores further highlight its superiority, particularly in the case of Burundi where it achieves a score of 0.78. For CAR, the RF model shows a slight improvement (0.65), while for Mali and South Sudan, its performance is comparable to that of the Random and Bernoulli models. This analysis underscores the RF model’s potential for conflict prediction, especially in certain contexts, while also highlighting the variability in its performance across different cases.
The confusion matrices for the forecasting classification are displayed in Fig. 4. In all four cases, the model overpredicts conflict events, i.e., false positives are relatively frequent. However, the majority of observations are predicted as peaceful, especially in South Sudan 2013, with 93% true negatives. Burundi has a somewhat higher share of false positives, at 24% of the observations, compared to 16% and 13% for Mali and the Central African Republic. This can be explained by the slight difference in the distribution of peaceful events, as Burundi has a slightly lower percentage of peaceful observations (around 95%, against a mean of 97%). It is also the case with the highest share of true positives, at almost 4% of the observations. On the other hand, the RF model has a high share of false negatives for South Sudan 2013, at around 2.5%, with only 0.15% true positive observations.
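The percentages discussed here are shares of all observations rather than per-class rates. A minimal sketch of how such a percentage confusion matrix can be computed is given below; the toy labels are invented for illustration and do not correspond to any of the four cases.

```python
def confusion_percentages(y_true, y_pred):
    """Return TN, FP, FN and TP as percentages of all observations."""
    counts = {"TN": 0, "FP": 0, "FN": 0, "TP": 0}
    for t, p in zip(y_true, y_pred):
        if t == 0 and p == 0:
            counts["TN"] += 1
        elif t == 0 and p == 1:
            counts["FP"] += 1
        elif t == 1 and p == 0:
            counts["FN"] += 1
        else:
            counts["TP"] += 1
    n = len(y_true)
    return {name: 100.0 * c / n for name, c in counts.items()}

# Toy example: 95 peaceful and 5 conflict observations.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 75 + [1] * 20 + [1] * 4 + [0] * 1
cm = confusion_percentages(y_true, y_pred)
print(cm)
```

With this normalisation the four cells sum to 100, so a 24% false positive cell means 24% of all observations, not 24% of the peaceful ones.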

Confusion matrices in percentage for the four studied cases. Top-left: Mali 2012, top-right: Central African Republic 2013, bottom-left: Burundi 2015 and bottom-right: South Sudan 2013; predicted in x-axis and actuals in y-axis.
The favourable outcomes observed in Burundi may be attributed to the relatively short prediction horizon (approximately 1 year) and the limited number of locations (7). Conversely, South Sudan presents a more challenging scenario with 25 localities to forecast and an extended prediction period of almost 2 years. Additionally, in the 5 years preceding the prediction period, 5 localities (constituting 20% of the total) accounted for approximately 50% of the conflict events. This concentration could lead our model to centralize predictions around these areas and potentially overlook the broader distribution of conflict events. Similar patterns emerge in the Central African Republic, where our model accurately forecasts high-conflict localities but may miss the dispersion in others. Lastly, in Mali, the scarcity of conflict events (approximately 1% of the overall test set) prompts the RF model to overpredict.
We also evaluated eXtreme Gradient Boosting (XGBoost) as an alternative to the RF model for daily conflict/no-conflict prediction over a future time period. As shown in Supplementary Table 3, XGBoost performs worse than the RF model in terms of recall, accuracy, and ROC-AUC. Additional analysis of the impact of the forecast horizon on the classification metrics is presented in Supplementary Fig. 2. As expected, the scores decrease with time in the Burundi and CAR examples. In Mali and South Sudan, the F1-score fluctuates more and more as the horizon lengthens, showing increasing uncertainty in the predictions.
First conflict onset prediction using the RF model
We use the RF model to predict the time until the first conflict onset at the locality level, measured in days over a future time period. Once a conflict is predicted to begin in a specific region, we assume that it will persist for the remainder of the simulation period. This assumption simplifies the model while still capturing the often protracted nature of local conflicts.
Similarly to the daily conflict/no conflict prediction, we encounter a challenge with non-events, which in regression results in a higher value for the upcoming onset in terms of the number of days. To tackle this issue, we also implement a downsampling technique for the outputs that extend beyond 120 days in the training set. To evaluate the performance of the RF model, we use the same benchmark models, Bernoulli distribution and random guess, as in the daily conflict/no conflict prediction. We simulate when each model makes its first ‘conflict’ prediction. The day this happens is recorded as the predicted onset of conflict.
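The onset extraction for the benchmark models, together with the persistence assumption stated earlier, can be sketched as follows. The zero-based day index and the helper names are assumptions for illustration.

```python
def first_onset_day(daily_predictions):
    """Return the index of the first day predicted as conflict (1), or None."""
    for day, flag in enumerate(daily_predictions):
        if flag == 1:
            return day
    return None

def persist_after_onset(onset_day, n_days):
    """Build a daily 0/1 series in which conflict persists from onset onwards."""
    return [
        1 if onset_day is not None and day >= onset_day else 0
        for day in range(n_days)
    ]

# Toy benchmark output for one locality over five days.
predictions = [0, 0, 1, 0, 1]
onset = first_onset_day(predictions)
progression = persist_after_onset(onset, len(predictions))
print(onset, progression)
```

Note that the later "no conflict" prediction on day 3 is overwritten: once the onset is reached, the locality stays in conflict for the remainder of the period.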
To evaluate the forecast results, the log ratio of the Mean Squared Error (MSE) of each benchmark model over that of the RF model is plotted in Fig. 5. A positive log ratio implies a higher MSE, and hence worse performance, for the benchmark model compared to our model. As can be seen, our model has significantly better results than the two benchmark models. We find that 66% of the localities (represented by grey points) have a positive log ratio for the Bernoulli model (with p the proportion of ones in the data), and 75% for the random model. The mean values of the log ratios are 0.51 and 0.47, respectively. The statistical evaluation supports this superiority, with a p-value of 0.003 for the Bernoulli model and 0.02 for the random model, derived from a t-test on the log ratio, testing its deviation from zero.
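The comparison can be illustrated with a small sketch that computes per-locality log ratios and the one-sample t statistic against zero. The natural logarithm, the toy MSE values, and the helper names are assumptions; the p-value itself would require a Student t distribution with n - 1 degrees of freedom, which is left to a statistics library.

```python
import math
import statistics

def mse(actual, predicted):
    """Mean squared error between two equally long sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def log_mse_ratio(mse_benchmark, mse_rf):
    """Positive when the benchmark has the higher MSE (natural log assumed)."""
    return math.log(mse_benchmark / mse_rf)

def one_sample_t(values, mu0=0.0):
    """t statistic for testing whether the mean of `values` differs from mu0."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return (mean - mu0) / (sd / math.sqrt(n))

# Hypothetical per-locality MSE values for a benchmark and for the RF model.
bench_mse = [4.0, 9.0, 2.0, 1.0]
rf_mse = [2.0, 3.0, 1.0, 2.0]
ratios = [log_mse_ratio(b, r) for b, r in zip(bench_mse, rf_mse)]
share_positive = sum(1 for x in ratios if x > 0) / len(ratios)
t_stat = one_sample_t(ratios)
print(share_positive, t_stat)
```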

Boxplot comparing the log ratios of two models, Bernoulli and random guess, with the RF model. Each point represents an individual data sample. The red triangles denote the mean value of MSE log ratio for each model. The p-values indicate the statistical significance of the difference between the log ratios for each model from 0. A positive log ratio implies a higher MSE, and worse performance for the benchmark model compared to our model.
The XGBoost model was also tested for the first conflict onset prediction. Supplementary Fig. 3 plots the log ratio of the MSE of the benchmark models over that of XGBoost. As can be seen, the log ratios were not significantly different from zero. One potential explanation is the tendency of XGBoost to overfit, especially with the noisy fatalities covariates. Unlike XGBoost, RF averages many trees that are grown independently of one another, which reduces the risk of overfitting.
Conflict-driven population displacement simulation
To evaluate the added value of the two types of conflict progressions generated by the RF model in predicting the displacement of people, we compared our method with the Flee model presented in ref. 10. We constructed three simulation instances, each corresponding to a different conflict progression input: (1) Flee (recorded conflict), where conflict progression is based on the ground truth (i.e., ACLED conflict data) as used in ref. 10; (2) Flee (predicted conflict [RF-daily]), where conflict progression is generated by daily conflict/no-conflict prediction using the RF model; and (3) Flee (predicted conflict [RF-onset]), where conflict progression is generated by first conflict onset prediction using the RF model.
The accuracy of three simulation instances was assessed by the Average Relative Difference (ARD) metric, which is calculated as follows:
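$$\begin{aligned} E(t) = \frac{\sum_{x \in S} \left| n_{sim,x,t} - n_{data,x,t} \right|}{N_{data,all,t}} \end{aligned}$$
(1)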
where \(n_{sim,x,t}\) denotes the number of refugees predicted by a simulation in each camp x of the set of all camps S at time t, \(n_{data,x,t}\) denotes the observational data from UNHCR for each camp x of the set of all camps S at time t, and \(N_{data,all,t}\) is the aggregate of the observational data from UNHCR over all camps at time t. The ARD is a linear error measure, which ensures that every mismatch in the estimated arrival of a person contributes equally to the error score. As a result, both an overprediction and an underprediction of arrivals by \(100\%\) result in an ARD score of 1.0. An ARD value of 0.0 indicates that a simulation is completely in line with the validation data (0% error). ARD values can exceed 1.0. This occurs when the Flee model overpredicts actual arrivals by more than \(100\%\), which frequently happens in the very early stages of an armed conflict. For ease of communication and consistency with previous literature, we will indicate a simulation to have an error of \(50\%\) when the ARD is 1.0 and an error of \(0\%\) when the ARD is 0.0.
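As a worked illustration, the sketch below computes the ARD for a single time step, assuming it is the sum over camps of the absolute difference between simulated and observed populations, divided by the total observed population across all camps. The camp names and counts are hypothetical.

```python
def average_relative_difference(n_sim, n_data):
    """ARD at one time step.

    n_sim and n_data map each camp name to the simulated and observed
    population; the denominator is the total observed over all camps.
    """
    total_observed = sum(n_data.values())
    if total_observed == 0:
        raise ValueError("ARD is undefined when no arrivals are observed")
    mismatch = sum(abs(n_sim[camp] - n_data[camp]) for camp in n_data)
    return mismatch / total_observed

# Hypothetical observed data for two camps at one time step.
observed = {"camp_a": 100, "camp_b": 300}

# 50 people placed in the wrong camp: counted once as missing, once as extra.
misplaced = {"camp_a": 150, "camp_b": 250}
print(average_relative_difference(misplaced, observed))

# Overpredicting or underpredicting every camp by 100% both give 1.0.
doubled = {camp: 2 * n for camp, n in observed.items()}
empty = {camp: 0 for camp in observed}
print(average_relative_difference(doubled, observed))
print(average_relative_difference(empty, observed))
```

The first case shows why a misplaced person contributes twice to the numerator, which is consistent with reading an ARD of 1.0 as an error of 50%.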
To configure simulations using the Flee code, we modified the default assumptions, resulting in two different rulesets: ruleset 1.0, which follows the assumptions proposed in ref. 47, and ruleset 2.0, which provides a more realistic version by incorporating additional movement rules to simulate more complex behaviors. See the supplementary materials for a description of the two rulesets.
Similar to many other simulation codes, Flee is non-deterministic, resulting in variations in results with each execution. To reduce the impact caused by aleatoric uncertainty, we execute 100 replicas of each individual simulation. To reduce the execution time for all conflict instances, we apply the FabFlee automation tool and a pilot job mechanism, i.e., QCG-PilotJOB (ref. 49), to efficiently run ensemble forecasts on the ARCHER2 supercomputer.
Table 1 presents the ARD results of all simulation instances in four conflict scenarios. To interpret these results, we need to acknowledge that a conflict forecast that exactly corresponds to the recorded ACLED data would result in a near-perfect match in ARD scores, as only aleatoric uncertainty of the probabilistic Flee algorithm, which is typically around 0.5%, would introduce noise in the results. Therefore, when this ARD difference is systematically very small, we can choose to produce forecasts reliably using the Flee model without having to rely on ACLED data (which covers only historical events).
As shown in Table 1, the ARD values range between 0.25 and 0.6 in all cases, which means that the simulation is at least 70% correct relative to the UNHCR data in all cases. Among the three simulation instances, both Flee (predicted conflict [RF-daily]) and Flee (predicted conflict [RF-onset]) obtain lower ARD values (i.e., better validation scores) than Flee (recorded conflict) in all conflict scenarios except for CAR 2013. Flee (predicted conflict [RF-daily]) outperforms Flee (predicted conflict [RF-onset]) in Mali 2012 and South Sudan 2013, while it performs worse in the remaining cases. The change from ruleset 1.0 to the more realistic ruleset 2.0 results in lower ARD values in most scenarios, which indicates that more realistic movement assumptions have a positive impact on the simulation of conflict-driven population displacement. Moreover, the three simulation instances show relatively low standard deviations of ARD, indicating the robustness of the predictions. In summary, simulations incorporating predicted conflict progressions achieve comparable accuracy to those using recorded conflicts. That is, the coupled model is effective in modeling conflict dynamics even without using ACLED data, providing an alternative for forecasting future conflicts.

The number of camp populations predicted by three simulation instances in a single simulation run and observed data from UNHCR (left), and ARD values for simulation results compared to observed data from UNHCR (right) under ruleset 1.0 for the four conflict scenarios. Blue line: Flee (recorded conflict), yellow line: Flee (predicted conflict [RF-daily]), green line: Flee (predicted conflict [RF-onset]), red line: observational data from UNHCR.
In Fig. 6, we present the daily number of arrivals in camps predicted by three simulation instances in a single simulation run and observed data from UNHCR (Fig. 6a, c, e, g), and ARD values for simulation results compared to observational data from UNHCR (Fig. 6b, d, f, h) under ruleset 1.0 for the four conflict scenarios. For Mali 2012 in Fig. 6a, both Flee (predicted conflict [RF-daily]) and Flee (predicted conflict [RF-onset]) overestimate the number of arrivals in camps after the initial days of the simulation, while Flee (recorded conflict) underestimates during the majority of the simulation period. The ARD of Flee (predicted conflict [RF-daily]) fluctuates more significantly compared to Flee (predicted conflict [RF-onset]), particularly between days 50 and 150, reaching 0.9, as shown in Fig. 6b. Flee (predicted conflict [RF-onset]) performs better than Flee (recorded conflict), with lower or similar ARD values for most of the simulation period.
In the Burundi 2015 situation (Fig. 6c), Flee (predicted conflict [RF-onset]) and Flee (recorded conflict) significantly underestimate the number of arrivals in camps during the early and middle simulation period and overestimate the numbers at the later stage of simulation, while Flee (predicted conflict [RF-daily]) significantly underestimates the number of arrivals in camps during the whole simulation period. As can be seen in Fig. 6d, Flee (predicted conflict [RF-onset]) performs better than Flee (recorded conflict), with lower ARD during most of the days in the simulation, while Flee (predicted conflict [RF-daily]) performs worse than Flee (recorded conflict) during the middle and later simulation period (after around day 180).
For South Sudan 2013 (Fig. 6e), Flee (predicted conflict [RF-onset]) predicts a lower number of arrivals in camps than the observational data from UNHCR at the early stage of simulation and overestimates the numbers afterwards, while Flee (predicted conflict [RF-daily]) predicts lower numbers for most of the simulation period. Flee (recorded conflict) predicts a much lower number of arrivals in camps than the UNHCR data at the beginning of the simulation, after which the gap between the predicted numbers and the UNHCR data narrows. Fig. 6f shows that Flee (predicted conflict [RF-daily]) achieves lower ARD values than Flee (recorded conflict) during most of the simulation period, with ARD values below 0.5 after around day 250. Although Flee (predicted conflict [RF-onset]) obtains higher ARD values (0.8 to 1) in the initial stage, its ARD decreases significantly afterwards and stays below 0.2 from the middle stage of the simulation. Flee (recorded conflict) performs the worst, with ARD values above 0.5 for the first half of the simulation.
In the CAR 2013 situation, as shown in Fig. 6g and h, Flee (predicted conflict [RF-onset]) predicts similar results to Flee (recorded conflict), but with slightly higher ARD values. Flee (predicted conflict [RF-daily]) predicts a lower number of arrivals in camps than the UNHCR data during the whole simulation period, with higher ARD values than the other two models for most of the simulation period. Among the three simulation instances, Flee (recorded conflict) achieves the lowest ARD values for most of the simulation period. Flee (predicted conflict [RF-daily]) and Flee (predicted conflict [RF-onset]) achieve ARD values lower than 0.5 for about two-thirds of the simulation period, but Flee (predicted conflict [RF-onset]) achieves much lower ARD values than Flee (predicted conflict [RF-daily]). The main reason for the lower accuracy of the coupled model on this conflict instance could be that conflict predictions become less accurate over longer forecast horizons.
