Skilful global seasonal predictions from a machine learning weather model trained on reanalysis data

Skilful data-driven seasonal forecasts

Over the 23-year assessment period the pattern of seasonal skill (1-3 month lead) demonstrated by ACE2 closely resembles that of the dynamical model for mean sea level pressure (MSLP, Fig. 1a, b). This is remarkable considering ACE2 was designed for stable climate simulations, with no deliberate attempt to capture seasonal predictability. While much of the tropical skill is due to the persistence of slowly evolving processes such as ENSO from the initialisation of the tropical oceans^31,32, ACE2 also exhibits skill across the tropical land and the extratropics, including the North Atlantic and North Pacific. Interestingly, ACE2 also exhibits reduced skill over Eurasia, as seen in the physics-based model GloSea. In most regions the ACE2 correlation is weaker than that for GloSea. For example, the area-average correlation across the northern hemisphere extratropics (20°N to 90°N) is 0.39 in ACE2 and 0.44 in GloSea, while over the tropics (20°S to 20°N) the scores are 0.79 and 0.82, respectively. In comparison, a persistence forecast using October monthly mean conditions scores 0.17 across the northern hemisphere and 0.52 across the tropics. Subsampling predictions across years indicates no evidence that these results are biased by predictions based on initial conditions seen during the training of ACE2 (Supplementary Figs. 2 and 3).

**Fig. 1: Skilful seasonal (DJF) predictions from the ACE2 machine learning and GloSea dynamical models with a lead time of 1-3 months.**

For temperature (Fig. 1c, d) we continue to see large regions of skill from ACE2, including South America, Africa, Australia and parts of North America. As seen for MSLP, GloSea outperforms ACE2 across many parts of the world, with the area-weighted mean correlation across the northern hemisphere extratropics at 0.41 in ACE2 and 0.45 in GloSea, and 0.68 and 0.77 respectively across the tropics. The skill for both systems is lower for precipitation; however, the ACE2 skill pattern (Fig. 1e) once again closely resembles that of GloSea (Fig. 1f), particularly across the tropics, the Caribbean and east Asia.

These results demonstrate that the ACE2 model can skilfully predict seasonal variability across many parts of the world with a lead time of 1-3 months.
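The area-average scores quoted above can be illustrated with a minimal sketch. It assumes a regular latitude-longitude grid on which each grid cell is weighted by the cosine of its latitude, and it assumes the score is the weighted mean of a grid-point correlation map; the function name `area_weighted_mean` and the small example map are hypothetical and are not taken from the study.

```python
import math

def area_weighted_mean(field, lats, lat_min, lat_max):
    """Cosine-latitude weighted mean of field[lat][lon] within [lat_min, lat_max]."""
    weighted_sum = 0.0
    weight_total = 0.0
    for row, lat in zip(field, lats):
        if lat_min <= lat <= lat_max:
            # Grid-cell area on a regular grid shrinks towards the poles.
            w = math.cos(math.radians(lat))
            weighted_sum += w * sum(row)
            weight_total += w * len(row)
    if weight_total == 0.0:
        raise ValueError("no grid rows fall inside the requested latitude band")
    return weighted_sum / weight_total

# Hypothetical correlation map on three latitude rows (0, 30 and 60 degrees north).
lats = [0.0, 30.0, 60.0]
corr_map = [[0.8, 0.8], [0.5, 0.5], [0.2, 0.2]]

# Northern hemisphere extratropics band used in the text (20N to 90N).
extratropics = area_weighted_mean(corr_map, lats, 20.0, 90.0)
```

With this weighting the 30°N row counts for more than the 60°N row, so the band mean sits above the unweighted average of the two rows.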

Predictability of the North Atlantic Oscillation

The NAO is the primary mode of seasonal variability across the North Atlantic³³ and is a key focus for extratropical seasonal prediction^34,35,36. ACE2 can predict the DJF-mean NAO³⁷ with a correlation score of r = 0.47 (Fig. 2a), at a lead time of 1–3 months. This is statistically significant at the 95% level (p = 0.023) and is highly competitive with a range of dynamical models. For example, over a shorter 19-year analysis period (1993–2011) ACE2 exhibits higher NAO skill (r = 0.42) than 4 operational ensemble prediction systems³⁶.
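As a rough check on the significance quoted above, the sketch below converts a correlation and a sample size into a two-sided p-value using the Fisher z-transformation. This is an approximation chosen for illustration; the test actually used in the study is not stated in this section, and the winters are assumed to be independent. The function name `correlation_p_value` is hypothetical.

```python
import math

def correlation_p_value(r, n):
    """Two-sided p-value for a Pearson correlation r from n pairs.

    Uses the Fisher z approximation: atanh(r) * sqrt(n - 3) is treated as a
    standard normal variable under the null hypothesis of zero correlation.
    """
    if n <= 3 or not -1.0 < r < 1.0:
        raise ValueError("need n > 3 and -1 < r < 1")
    z = math.atanh(r) * math.sqrt(n - 3)
    # erfc(|z| / sqrt(2)) equals twice the upper-tail normal probability.
    return math.erfc(abs(z) / math.sqrt(2.0))

# r = 0.47 over the 23 winters of the assessment period.
p_full = correlation_p_value(0.47, 23)
```

Under these assumptions the result is a p-value of roughly 0.02, consistent with significance at the 95% level.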

**Fig. 2: Skilful predictions of the DJF-mean North Atlantic Oscillation (NAO).**

It is important to note that only the 9 winters between 2002 and 2010 are fully independent of the ACE2 training period¹³. Over this shorter period the NAO correlation remains high (r = 0.6), although with reduced significance due to the smaller sample size (p = 0.07). Skill is also high across an extended 1981–2022 period (r = 0.52) and a subsampling analysis suggests that these NAO results are not biased by predictions from years within the ACE2 training period (Supplementary Figs. 1 and 3).

Interestingly, ACE2 gives a poor prediction of the extreme winter in 2009/2010 (see Section “The extreme winter of 2009/2010” below). Nevertheless, despite the long autoregressive forecasts, the lack of a well resolved stratosphere, and the use of non-interacting, persisted SSTs, the ACE2 model skilfully predicts the NAO. This is surprising as both stratospheric variability and interactive ocean processes underpin dynamical model skill^38,39.

We also find that the ACE2 and GloSea NAO predictions are not strongly correlated (r = 0.34, p = 0.11) and so there may be additional value in combining them. Indeed, an ensemble mean constructed from both models results in an NAO correlation score of r = 0.65 (p < 0.01), matching that estimated by GloSea with an extended ensemble size of 127 members. Furthermore, after removing the climatological mean, the ACE2 and GloSea NAO predictions appear to be drawn from the same underlying distribution (two-sample KS-test, 95% confidence). This indicates that ACE2 could also be utilised to enhance dynamical model ensembles.

In addition to skilful seasonal predictions, the ACE2 ensemble closely matches the dynamical model in terms of NAO variability. Following initialisation, we find that the ACE2 ensemble mean error and ensemble spread increase in line with GloSea (Fig. 2, Equations (1) and (2)). Furthermore, the DJF-mean total standard deviation across all years and members is 4.3 hPa in ERA5, 3.6 hPa in ACE2 and 3.8 hPa in GloSea. For the ensemble mean variability the standard deviation is 1.11 hPa in ACE2 and 1.21 hPa in GloSea. The lagged-ensemble methodology used here therefore enables sufficient ensemble member spread to develop, but other methods for ensemble generation are key topics for future research.

In line with dynamical models^34,40,41, ACE2 NAO skill also increases strongly with ensemble size (solid line, Fig. 2c). This is encouraging as it is much cheaper and quicker, in computational terms, to increase the ensemble size of data-driven models compared to dynamical models. However, it can also be seen that when the ACE2 ensemble mean is used to predict one of its own individual members (so-called ‘perfect model’ skill), the skill is markedly lower (r = 0.25, dashed lines in Fig. 2c) than the ACE2 skill in predicting the observed NAO (thick solid lines, Fig. 2c). The ratio of predictable components (Equation (3)) provides a measure of observed and modelled predictability and variance. For ACE2 this quantity is found to be 1.6, only slightly less than the 1.8 for GloSea, but still greater than 1 (90% confidence). This indicates that for ACE2, the ensemble mean variance is small compared to the total ensemble variance given its skill in predicting the observed NAO⁴².
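To make the two skill measures in this paragraph concrete, the sketch below computes a ratio of predictable components and a 'perfect model' skill from a small ensemble. Equation (3) is not reproduced in this section, so the sketch assumes the commonly used form: the correlation between the observations and the ensemble mean, divided by the square root of the ratio of ensemble-mean variance to mean member variance. The perfect-model score is assumed to be the average correlation between each member and the mean of the remaining members. All function names and data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def variance(x):
    """Population variance across years."""
    mx = sum(x) / len(x)
    return sum((a - mx) ** 2 for a in x) / len(x)

def mean_over_members(members):
    """Average members[m][year] over m for each year."""
    return [sum(vals) / len(vals) for vals in zip(*members)]

def ratio_of_predictable_components(obs, members):
    """Assumed form: r(obs, ensemble mean) / sqrt(signal variance / total variance)."""
    ens_mean = mean_over_members(members)
    signal_var = variance(ens_mean)
    total_var = sum(variance(m) for m in members) / len(members)
    return pearson(obs, ens_mean) / math.sqrt(signal_var / total_var)

def perfect_model_skill(members):
    """Mean correlation of each member with the mean of the other members."""
    scores = []
    for i, member in enumerate(members):
        others = members[:i] + members[i + 1:]
        scores.append(pearson(member, mean_over_members(others)))
    return sum(scores) / len(scores)

# Hypothetical example: a shared signal plus equal and opposite noise.
signal = [1.0, -1.0, 2.0, -2.0]
noise = [1.0, 1.0, -1.0, -1.0]
members = [[s + e for s, e in zip(signal, noise)],
           [s - e for s, e in zip(signal, noise)]]
rpc_value = ratio_of_predictable_components(signal, members)
pm_skill = perfect_model_skill(members)
```

In this toy case the ensemble mean matches the "observations" exactly, yet each member is a weaker predictor of the other, so the ratio exceeds 1 while the perfect-model correlation stays well below 1. This is the same qualitative behaviour as described for ACE2 and GloSea.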

Therefore, despite having been trained only on reanalysis data, the ACE2 predictions also exhibit a signal-to-noise error which resembles that found in dynamical models^34,40,42,43,44. This is somewhat surprising as it may suggest that the signal-to-noise error is not restricted to a physical model error and instead occurs due to some other damping effect on the predictable signal. For example, weak eddy forcing and feedback are one hypothesised cause of the error⁴⁵; however, these characteristics are not weak within the reanalysis used to train ACE2. Further investigation of ACE2 characteristics is needed, but we note that machine learning predictions can also exhibit damping and smoothing of the kinetic energy spectrum^11,46, potentially leading to similar errors in forecast anomaly amplitude. It is possible that the same qualitative behaviour occurs for different reasons in the ACE2 and GloSea models, but further research is needed to understand if this is the case.

ENSO as a driver of seasonal skill

ENSO is the primary mode of interannual climate variability and is a key driver of seasonal skill across many parts of the world^47,48. In this section we investigate whether ACE2 is correctly capturing ENSO teleconnections.

Composite differences between El Niño and La Niña years (Fig. 3) reveal that ACE2 exhibits very similar teleconnection patterns to those seen in ERA5 and GloSea for both MSLP and surface temperature. In particular, we find El Niño deepens the Aleutian low and influences the North Atlantic jet, extending eastward from the Caribbean. This suggests that ACE2 is capturing the influence of ENSO on the subtropical jet, an important mechanism underpinning the global influence of ENSO^47,49. In terms of the surface temperature response, ACE2 also exhibits very similar ENSO teleconnections to ERA5 and GloSea, particularly over North America, South America, southern Africa and Australia. These composites indicate that ACE2 is correctly capturing the regional interannual variability associated with ENSO across many parts of the world despite being trained only on the 6-hourly evolution of the atmosphere.
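A composite difference of the kind described above can be sketched as the mean field over El Niño winters minus the mean field over La Niña winters. The sketch assumes each winter has already been classified; the classification criterion is not given in this section, so the labels, function name and example values below are hypothetical.

```python
def composite_difference(fields, enso_labels):
    """El Nino mean minus La Nina mean at every grid point.

    fields[year] is a flat list of grid-point values for one winter and
    enso_labels[year] is one of "nino", "nina" or "neutral".
    """
    nino = [f for f, label in zip(fields, enso_labels) if label == "nino"]
    nina = [f for f, label in zip(fields, enso_labels) if label == "nina"]
    if not nino or not nina:
        raise ValueError("need at least one El Nino and one La Nina winter")

    def mean_field(group):
        # Average each grid point across the winters in the group.
        return [sum(vals) / len(vals) for vals in zip(*group)]

    return [a - b for a, b in zip(mean_field(nino), mean_field(nina))]

# Hypothetical DJF anomalies at two grid points for four winters.
fields = [[1.0, -2.0], [3.0, 0.0], [-1.0, 2.0], [0.0, 0.0]]
labels = ["nino", "nino", "nina", "neutral"]
diff = composite_difference(fields, labels)
```

Neutral winters are excluded from both composites, so only the contrast between the two ENSO phases contributes to the result.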

**Fig. 3: Influence of ENSO on DJF surface conditions.**

The extreme winter of 2009/2010

As a final part of our assessment we focus on predictions for the extreme northern hemisphere winter of 2009/2010, which is part of the independent dataset withheld during the training of ACE2. This winter was characterised by a record negative NAO, well beyond the anomalies seen in other years. It was also subject to a minor and a major sudden stratospheric warming (SSW), a strong El Niño and an easterly Quasi Biennial Oscillation (QBO)⁵⁰. The winter mean MSLP anomaly (Fig. 4a) exhibits a very zonal negative NAO which is well captured by GloSea (Fig. 4c). However, the ACE2 ensemble mean prediction does not appear to capture this signal, with only slightly above average pressure across the Arctic (Fig. 4b). This is surprising given the strong tropical forcing and potentially indicates a limitation of ACE2 in predicting extreme, out-of-sample conditions. Exploring this further, we find that both ERA5 and GloSea exhibit a weakened stratospheric polar vortex (Fig. 4d, f), while ACE2 exhibits near-normal vortex strength (Fig. 4e).

In terms of SSWs, the winter comprised a minor warming in December 2009 and a major warming in January 2010, reflecting the increased SSW probability due to the El Niño and easterly QBO^50,51,52,53. GloSea appears to capture this increase, with 81% of members (51 out of 63) experiencing easterly zonal winds at 10 hPa and 60°N within the winter. This is significantly higher than GloSea’s climatological probability of 62% (two-proportion Z-test, 95% confidence level). In comparison, only 39% of ACE2 members (25 out of 64) exhibit easterly stratospheric winds in the uppermost model layer (above 50 hPa), which is not significantly different from the climatological rate of 40%. This indicates that the ACE2 model is not correctly capturing the disruption to the stratospheric polar vortex during winter 2009/2010.
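The significance statements above can be illustrated with a pooled two-proportion Z-test. The member counts for winter 2009/2010 (51 of 63 and 25 of 64) are taken from the text, but the climatological sample sizes are not given here. The sketch therefore assumes, purely for illustration, that the climatology pools the same number of members over 23 winters; the resulting counts (898 of 1449 and 589 of 1472) are hypothetical, as is the function name.

```python
import math

def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Pooled two-proportion Z-test; returns (z, two-sided p-value)."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    # Under the null hypothesis both samples share one underlying proportion.
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_a - p_b) / std_err
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided

# GloSea: 51 of 63 members in 2009/2010 versus a hypothetical climatological
# sample of 898 of 1449 member-winters (about 62%).
z_glosea, p_glosea = two_proportion_z_test(51, 63, 898, 1449)

# ACE2: 25 of 64 members versus a hypothetical 589 of 1472 (about 40%).
z_ace2, p_ace2 = two_proportion_z_test(25, 64, 589, 1472)
```

Under these assumed sample sizes the GloSea difference exceeds the 1.96 threshold for the 95% level while the ACE2 difference does not, in line with the conclusions stated in the text.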

Furthermore, the SSW probability within ACE2 is relatively consistent across El Niño (45%) and La Niña (36%) years, neither of which is significantly different from neutral years (41%, one-tailed two-proportion Z-test, 95% confidence level). GloSea and ERA5, however, exhibit significant differences between active and neutral ENSO years, with a higher chance of an SSW during El Niño^54,55,56,57. This suggests that while ACE2 can exhibit sub-seasonal stratospheric variability¹³, it is not fully capturing the ENSO teleconnection to the stratosphere despite realistic tropospheric teleconnections.
