Task-oriented machine learning surrogates for tipping points of agent-based models

Given high dimensional spatio-temporal trajectory data acquired through ABM simulations, the main steps of the framework are summarized as follows (see also Fig. 4, for a schematic):

a.

Discover low-dimensional latent spaces, on which the emergent dynamics can be described at the mesoscopic or the macroscopic scale.
b.

Identify, via machine-learning, black-box mesoscopic IPDEs, ODEs, or (after further dimensional reduction), macroscopic mean-field SDEs.
c.

Locate tipping points by exploiting numerical bifurcation analysis of the different surrogate models.
d.

Use the identified (NN-based) surrogate mean-field SDEs to perform rare-event analysis (uncertainty quantification) for the catastrophic transitions. This is done here in two ways: (i) performing repeated brute-force simulations around the tipping points, (ii) for this effectively 1D problem, using explicit statistical mechanical (Feynman-Kac) formulas for escape time distributions.

**Fig. 4: Schematic of the machine learning-based approach for the multiscale modeling and analysis of tipping points.**

In what follows, we present the elements of the methodology. For further details about the methodology and implementation, see in the Supplementary information (SI).

Discovering low-dimensional latent spaces

The computational modeling of complex systems featuring a multitude of interacting agents poses a significant challenge due to the enormous number of potential states that such systems can have. Thus, a fundamental step, for the development of ROMs that are capable of effectively capturing the collective behavior of ensembles of agents is the discovery of an embedded, in the high-dimensional space, low-dimensional manifold and an appropriate set of variables that can usefully parametrize it.

Let’s denote, by ${{{{{{{{\bf{X}}}}}}}}}_{k}\in {{\mathbb{R}}}^{D}$, k = 1, 2, … the high-dimensional state of the ABM at time t. The goal is then to project/map the high-dimensional data onto lower-dimensional latent manifolds ${{{{{{{\mathcal{M}}}}}}}}\subset {{\mathbb{R}}}^{D}$, that can be defined by a set of coarse-scale variables. The hypothesis of the existence of this manifold is related to the existence of useful ROMs and vice versa.

Here, to discover such a set of coarse-grained coordinates for the latent space, we used DMAPs^26,28,54 (see section D1 of the Supplementary Information (SI) for a brief description of the DMAPs algorithm).

For both ABMs, we have some a priori physical insight for the mesoscopic description. For the financial ABM, one can for example use the probability density function (pdf) \(\rho (X)dx={\mathbb{P}}(X
(10)

where F⁽ⁱ⁾, i = 1, 2, …m are m nonlinear integro-differential operators; u(x, t) = [u⁽¹⁾(x, t), …, u^(m)(x, t)] is the vector containing the spatio-temporal fields, ${{{{{{{{\mathcal{D}}}}}}}}}^{{{{{{{{\boldsymbol{\nu }}}}}}}}}{{{{{{{\bf{u}}}}}}}}({{{{{{{\bf{x}}}}}}}},t)$ is the generic multi-index ν-th order spatial derivative at time t:

$$\begin{array}{ll}&{{{{{{{{\mathcal{D}}}}}}}}}^{{{{{{{{\boldsymbol{\nu }}}}}}}}}{{{{{{{\bf{u}}}}}}}}({{{{{{{\bf{x}}}}}}}},\, t):=\left\{\frac{{\partial }^{| {{{{{{{\boldsymbol{\nu }}}}}}}}| }{{{{{{{\bf{u}}}}}}}}({{{{{{{\bf{x}}}}}}}},t)}{\partial {x}_{1}^{{\nu }_{1}}\cdots \partial {x}_{d}^{{\nu }_{d}}},{\nu }_{1},\ldots,{\nu }_{d}\ge 0\right\},\\ &{{{{{{{\rm{where}}}}}}}}\,| {{{{{{{\boldsymbol{\nu }}}}}}}}|={\nu }_{1}+{\nu }_{2}+\cdots+{\nu }_{d},\,\end{array}$$

(11)

${I}_{1}^{(i)},{I}_{2}^{(i)},\ldots \,$ are a collection of integral features on subdomains ${{{\Omega }}}_{1}^{(i)},{{{\Omega }}}_{2}^{(i)},\cdots \subseteq {{\Omega }}$:

$${I}_{j}^{(i)}({{{{{{{\bf{u}}}}}}}})={\int}_{{{{\Omega }}}_{j}^{(i)}}{K}_{j}^{(i)}({{{{{{{\bf{x}}}}}}}},\, {{{{{{{\bf{u}}}}}}}}({{{{{{{\bf{x}}}}}}}},\, t))d{{\Omega }},\,j=1,2,\ldots ;$$

(12)

${K}_{j}^{(i)}:{{\mathbb{R}}}^{d}\times {{\mathbb{R}}}^{m}\mapsto {{\mathbb{R}}}^{d}$ are nonlinear maps and ${{{{{{{\boldsymbol{\varepsilon }}}}}}}}\in {{\mathbb{R}}}^{p}$ denotes the (bifurcation) parameters of the system. The right-hand-side of the i-th IPDE depends on say, a number of γ⁽ⁱ⁾ variables and on bifurcation parameters from the set of features:

$${{{{{{{{\mathcal{S}}}}}}}}}^{(i)}= \left\{{{{{{{{\bf{x}}}}}}}},\, {{{{{{{\bf{u}}}}}}}}({{{{{{{\bf{x}}}}}}}},\, t),{{{{{{{\mathcal{D}}}}}}}}{{{{{{{\bf{u}}}}}}}}({{{{{{{\bf{x}}}}}}}},\, t),{{{{{{{{\mathcal{D}}}}}}}}}^{{{{{{{{\bf{2}}}}}}}}}{{{{{{{\bf{u}}}}}}}}({{{{{{{\bf{x}}}}}}}},\, t),\ldots,\right.\\ \left.{{{{\mathcal {D}}}}}^{{{{{{{{\boldsymbol{\nu }}}}}}}}}{{{{{{{\bf{u}}}}}}}}({{{{{{{\bf{x}}}}}}}},\, t),{I}_{1}^{(i)},{I}_{2}^{(i)},\ldots,{{{{{{{\boldsymbol{\varepsilon }}}}}}}}\right\}.$$

(13)

At each spatial point x_q, q = 1, 2, …, M and time instant t_s, s = 1, 2, …, N, a single sample point (an observation) in the set ${{{{{{{{\mathcal{S}}}}}}}}}^{(i)}$ for the i-th IPDE can be described by a vector ${Z}_{j}^{(i)}\equiv {Z}_{(q,s)}^{(i)}\in {{\mathbb{R}}}^{{\gamma }^{(i)}}$, with j = q + (s − 1)M. Here, we assume that such mesoscopic IPDEs in principle exist, but they are not available in closed-form. Henceforth, we aim to learn the macroscopic laws by employing a Feedforward Neural Network (FNN), in which the effective input layer is constructed by a finite stencil (sliding over the computational domain), mimicking convolutional operations where the applied “filter” involves values of our field variable(s) u⁽ⁱ⁾ on the stencil, and returns features ${Z}_{j}^{(i)}\in {{{{{{{{\mathcal{S}}}}}}}}}^{(i)}$ of these variables at the stencil center-point, i.e., spatial derivatives as well as (local or global) integrals (see Fig. 5a for a schematic).

**Fig. 5: Schematic of the neural networks used for constructive machine learning assisted surrogates.**

The FNN is fed with the sample points ${Z}_{j}^{(i)}\in {{\mathbb{R}}}^{{\gamma }^{(i)}}$ and its output is an approximation of the time derivative u_t. In sections D2 and D3 of the Supplementary Information (SI) we describe two different approaches (gradient descent and random projections) to train such a FNN. For alternative operator-learning approximation methods, see e.g., ref. ⁷⁴ and ref. ⁵¹.

Remark on learning mean-field ODEs

The proposed framework can be also applied for the simpler task of learning a system of m ODEs in terms of the m variables u = (u⁽¹⁾, u⁽²⁾, …, u^(m)), thus learning i = 1, …, m functions:

$$\frac{d{u}^{(i)}
(14)

where ${{{{{{{\bf{p}}}}}}}}\in {{\mathbb{R}}}^{k}$ denotes the parameter vector.

Feature selection with automatic relevance determination

Training the FNN with the “full” set of inputs ${{{{{{{{\mathcal{S}}}}}}}}}^{(i)}\subseteq {{\mathbb{R}}}^{{\gamma }^{(i)}}$, described in Eq. (13), consisting of all local mean field values as well as all their coarse-scale spatial derivatives (up to some order ν) is prohibitive due to the curse of dimensionality. Therefore, one important task for the training of the FNN is to extract a few “relevant”/dominant variable combinations. Towards this aim, we used Automatic Relevance Determination (ARD) in the framework of Gaussian processes regression (GPR)²⁹. The approach assumes that the collection of all observations ${{{{{{{{\bf{Z}}}}}}}}}^{(i)}=({Z}_{1}^{(i)},{Z}_{2}^{(i)},\ldots,{Z}_{MN}^{(i)})$, of the features ${z}_{l}\in {{{{{{{{\mathcal{S}}}}}}}}}^{(i)}$, are a set of random variables whose finite collections have a multivariate Gaussian distribution with an unknown mean (usually set to zero) and an unknown covariance matrix K. This covariance matrix is commonly formulated by a Euclidean distance-based kernel function k in the input space, whose hyperparameters are optimized based on the training data. Here, we employ a radial basis kernel function (RBF), which is the default kernel function in Gaussian process regression, with ARD:

$$K_{jh}^{(i)}=k(Z^{(i)}_j,Z^{(i)}_h,{{{{{\boldsymbol{\theta}}}}}}^{(i)})=\theta_0^{(i)} \exp\left( -\frac{1}{2}\sum\limits_{l=1}^{\gamma^{(i)}} \frac{z_{l,j}-z_{l,h}}{\theta_l^{(i)}}\right);$$

(15)

${{{{{{{{\boldsymbol{\theta }}}}}}}}}^{(i)}=[{\theta }_{0}^{(i)},{\theta }_{1}^{(i)},\ldots,{\theta }_{{\gamma }^{(i)}}^{(i)}]$ are a (γ⁽ⁱ⁾ + 1)-dimensional vector of hyper-parameters. The optimal hyperparameter set ${\tilde{{{{{{{{\boldsymbol{\theta }}}}}}}}}}^{(i)}$ can be obtained by minimizing a negative log marginal likelihood over the training data set (Z⁽ⁱ⁾, Y⁽ⁱ⁾), with inputs the observation Z⁽ⁱ⁾ of the set ${{{{{{{{\mathcal{S}}}}}}}}}^{(i)}$ and corresponding desired output given by the observation Y⁽ⁱ⁾ of the time derivative ${u}_{t}^{(i)}$:

$${\tilde{{{{{\boldsymbol{\theta}}}}}}}^{(i)}={{\arg}\min}_{{{{{{\boldsymbol{\theta}}}}}}^{(i)}} -\log{p({{{{{\mathbf{Y}}}}}}^{(i)}|{{{{{\mathbf{Z}}}}}}^{(i)},{{{{{\boldsymbol{\theta}}}}}}^{(i)})}.$$

(16)

As can be seen in Eq. (15), a large value of θ_l nullifies the difference between target function values along the l-th dimension, allowing us to designate the corresponding z_l feature as “insignificant”. Practically, in order to build a reduced input data domain, we define the normalized effective relevance weights ${W}_{r}^{(i)}(\cdot )$ of each feature input ${z}_{l}\in {{{{{{{{\mathcal{S}}}}}}}}}^{(i)}$, by taking:

$${\bar{W}}_{r}^{(i)}({z}_{l})=exp(-{\tilde{\theta }}_{l}^{(i)}),\,{W}_{r}^{(i)}({z}_{l})=\frac{{\bar{W}}_{r}^{(i)}({z}_{l})}{{\sum }_{l}{\bar{W}}_{r}^{(i)}({z}_{l})}.$$

(17)

Thus, we define a small tolerance tol in order to disregard the components such that ${W}_{r}^{(i)}({z}_{l}) \, < \, tol$. The remaining selected features (${W}_{r}^{(i)}({z}_{l})\ge tol$) can still successfully (for all practical purposes) parametrize the approximation of the right-hand-side of the underlying IPDE.

Macroscopic mean-field SDEs via neural networks

Here, we present our approach for the construction of embedded surrogate models in the form of mean-field SDEs. Under the assumption that we are close (in phase- and parameter space) to a previously located tipping point, we can reasonably assume that the effective dimensionality of the dynamics can be reduced to the corresponding normal form. We already have some qualitative insight on the type of the tipping point, based for example on the numerical bifurcation calculations that located it (e.g., for the financial ABM from the analytical FP IPDE, from our surrogate IPDE, or from the EF analysis^52,56), while for the epidemic ABM from the EF analysis in ref. ⁵⁹. For both these two particular problem, we have found that the tipping point corresponds to a saddle-node bifurcation.

Given the nature of the bifurcation (and the single variable corresponding normal form) we identify a one-dimensional SDE, driven by a Wiener process, from data. We note that learning higher-order such SDEs, or SDEs based on the more general Lévy process and the Ornstein-Uhlenbeck process⁷⁵, is straightforward.

For a diffusion process with drift, say X_t = {x_t, t > 0}, the drift, μ(x_t) and diffusivity σ²(x_t) coefficients over an infinitesimally small-time interval dt, are given by:

$$\begin{array}{ll} &\mu(x_t)={\lim}_{\delta t \rightarrow 0} \frac{1}{\delta t}{\mathbb{E}}(\delta x_t|X_t=x_t),\\ &\sigma^2(x_t)={\lim}_{\delta t \rightarrow 0} \frac{1}{\delta t}{\mathbb{E}}(\delta x_t^2|X_t=x_t),\end{array}$$

(18)

where, δx_t = x_t+δt − x_t.

The 1D SDE driven by a Wiener process W_t reads:

$$d{x}_{t}=\mu ({x}_{t};\varepsilon )dt+\sigma ({x}_{t};\varepsilon )d{W}_{t}.$$

(19)

Here, for simplicity, we assume that the one-dimensional parameter ε, enters into the dynamics, via the drift and diffusivity coefficients. Note that the parameter ε can be either the parameter g introduced in Eq. (3) or the parameter λ for the epidemic ABM (representing the probability that a susceptible individual may get infected). Our goal is to identify the functional form of the drift μ(x, ε) and the diffusivity σ(x, ε) given noisy data close to the tipping point via machine learning. For the training, the data might be collected from either long-time trajectories or short bursts initialized at scattered snapshots, as in the EF framework. These trajectories form our data set of input-output pairs of discrete-time maps. A data point in the collected data set can be written as $({x}_{0}^{(k)},{h}^{(k)},{x}_{1}^{(k)}{\varepsilon }^{(k)})$, where ${x}_{0}^{(k)}$ and ${x}_{1}^{(k)}$ measures two consecutive states at ${t}_{0}^{(k)}$ and ${t}_{1}^{(k)}$ with (small enough) time step ${h}^{(k)}={t}_{1}^{(k)}-{t}_{0}^{(k)}$ and ε^(k) is the parameter value for this pair. Based on the above formulation, going to ${x}_{1}^{(k)}$ from ${x}_{0}^{(k)}$ by:

$${x}_{1}^{(k)}={x}_{0}^{(k)}+\int\nolimits_{{t}_{0}^{(k)}}^{{t}_{1}^{(k)}}\mu ({x}_{t};{\varepsilon }^{(k)})dt+\int\nolimits_{{t}_{0}^{(k)}}^{{t}_{1}^{(k)}}\sigma ({x}_{t};{\varepsilon }^{(k)})d{W}_{t}.$$

(20)

Here, for the numerical integration of the above equation to get stochastic realizations of ${x}_{1}^{(k)}$, we assume that the Euler-Maruyama numerical scheme can be used, reading:

$${x}_{1}^{(k)} \, \approx \, {x}_{0}^{(k)}+{h}^{(k)}\mu ({x}_{0}^{(k)};{\varepsilon }^{(k)})+\sigma ({x}_{0}^{(k)};{\varepsilon }^{(k)}){{\Delta }}{W}^{(k)},$$

(21)

where ${{\Delta }}{W}^{(k)}={W}_{{t}_{1}^{(k)}}-{W}_{{t}_{0}^{(k)}}\in {\mathbb{R}}$ is a one-dimensional random variable, normally distributed with expected value zero and variance h^(k).

Considering the point ${x}_{1}^{(k)}$ as a realization of a random variable X₁, conditioned on ${x}_{0}^{(k)}$ and h^(k), drawn by a Gaussian distribution of the form:

$$\begin{array}{ll}&{p}_{{X}_{1}}({x}_{1}^{(k)})={\mathbb{P}}\left({X}_{1}={x}_{1}^{(k)}| {X}_{0}={x}_{0}^{(k)},{h}^{(k)}\right) \sim \\ & \sim {{{{{{{\mathcal{N}}}}}}}}\left({x}_{0}^{(k)}+{h}^{(k)}\mu ({x}_{0}^{(k)};{\varepsilon }^{(k)}),{h}^{(k)}\sigma {({x}_{0}^{(k)};{\varepsilon }^{(k)})}^{2}\right),\end{array}$$

(22)

we approximate the drift $\mu ({x}_{0}^{(k)};{\varepsilon }^{(k)})$ and diffusivity $\sigma ({x}_{0}^{(k)};{\varepsilon }^{(k)})$ functions by simultaneously training two neural networks, denoted as μ_θ and σ_θ, respectively. This training process involves minimizing the loss function:

$$\begin{array}{ll} &{{{{{\mathcal{L}}}}}}(\theta|x_0^{(k)},x_1^{(k)},h^{(k)}) :=\\ &=\sum\limits_{k} \frac{(x_1^{(k)} – x_0^{(k)} – h^{(k)}\mu_{\theta}(x_0^{(k)};\varepsilon^{(k)}))^2}{h^{(k)}\sigma_{\theta}(x_0^{(k)};\varepsilon^{(k)})^2}+\\ &+{\log}|h^{(k)}\sigma_{\theta}(x_0^{(k)};\varepsilon^{(k)})^2|. \end{array}$$

(23)

which is derived in order to maximize the log-likelihood of the data and where θ denotes the trainable parameters (e.g., weights and biases of the neural networks μ_θ and σ_θ). A schematic of the Neural Network, based on Euler-Maruyama, is shown in Fig. 5b.

Locating tipping points via our surrogate models

In order to locate the tipping point, based on either the mesoscopic IPDE or the embedded mean-field 1D SDE model, we construct the corresponding bifurcation diagram in its neighborhood, using pseudo-arc-length continuation as implemented in numerical bifurcation packages. For the identified SDE, we used its deterministic part, i.e., the drift term, to perform continuation. The required Jacobian of the activation functions of the neural network is computed by symbolic differentiation. Note that this is just a validation step (we already know the location and nature of the tipping point).

Rare-event analysis/UQ of catastrophic shifts

Given a sample space Ω, an index set of times T = {0, 1, 2, … } and a state space S, the first passage time, also known as mean exit time or mean escape time, of a stochastic process x_t: Ω × T ↦ Son a measurable subset A ⊆ S is a random variable which can be defined as

$$\tau (\omega ):=\inf \{t\in T| {x}_{t}(\omega )\in S\setminus A\},$$

(24)

where ω is a sample out of the space Ω. One can define the mean escape time from A, which works as the expectation of τ(ω):

$$\langle \tau \rangle :={\mathbb{E}}[\tau (\omega )].$$

(25)

For a n-dimensional stochastic process, as it is the ABM under study, S is typically set to be ${{\mathbb{R}}}^{n}$, and A is usually a bounded subset of ${{\mathbb{R}}}^{n}$. In the case of our local 1D SDE model, the subset A reduces to an open interval (a, b), with the initial condition of this stochastic process x₀ also chosen in this interval.

We discuss two ways for quantifying the uncertainty of the occurrence of those rare-events. The first, presented in the main text, involves direct computational “cheap” temporal simulations of the 1D SDE, where one gets an empirical probability distribution; the second, presented only in section F4 of the Supplementary Information (SI), is a closed-form expression, based on statistical mechanics⁷⁶, for the mean escape time (assuming an exponential distribution of escape times).

Computation of escape times based on temporal simulations of the 1D SDE

We perform numerical integration of multiple stochastic trajectories of the SDE to obtain an estimation of the desired mean escape time. The algorithm for estimating the escape times of a one-dimensional stochastic process with initial condition x₀ on the interval (a, b) can be described as follows:

1.

Given a fixed time step h > 0, we perform i + 1 numerical integration steps of the SDE until a stopping condition, i.e., x_t exits, for the first time t_i+1, the interval at a or b.
2.

Record the time τ = t_i, which corresponds to a realization of the escape time.
3.

Repeat the above steps 1–2 for K iterations, and each time collect the observed escape time.
4.

Compute the statistical mean value of the collected escape times that corresponds to an approximation of the mean escape time.

In our case, the initial condition x₀ was set equal to the stable steady state, and the termination condition a or b at which we consider the dynamics escaped/exploded.