A variational framework for residual-based adaptivity in neural PDE solvers and operator learning

Machine Learning


Variational residual-based attention methods

The potential function \(\Phi :{{\mathbb{R}}}_{+}\to {{\mathbb{R}}}_{+}\) is a crucial design choice for vRBA. It determines the form of the sampling distribution and, consequently, the corresponding primal problem. We restrict Φ to be a non-negative, convex, increasing, superlinear function; we show two examples and give detailed calculations in the coming sections. The training process starts by sampling N i.i.d. random variables \({\{{X}_{i}\}}_{i=1}^{N}\) uniformly from the spatial domain Ω and calculating the corresponding residuals r(Xi). The general method then involves three steps per iteration:

  1. updating the tilted distribution q, which is generally proportional to the derivative (or sub-differential) \({\Phi }^{{\prime} }\);

  2. updating the model parameters θ via a line search method (first- or second-order);

  3. updating the temperature parameter ϵ using an annealing scheme, which can depend on Φ.

We elaborate on each step below, providing the implementation specifics for the examples shown in the Results section.
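Before detailing each step, the three-step scheme can be summarized in a minimal Python sketch. The residual function, the potential derivative `phi_prime`, and the parameter-update routine `grad_step` are illustrative placeholders for the problem-specific components, and the annealing rule shown is the logarithmic schedule discussed later for the exponential potential:

```python
import math

def vrba_step(params, k, eps, residual_fn, phi_prime, grad_step, points):
    """One vRBA iteration: tilt update, parameter update, annealing."""
    # Residuals at the (uniformly sampled) collocation points.
    r = [abs(residual_fn(params, x)) for x in points]
    # Step 1: tilted distribution q proportional to Phi'(r / eps).
    q = [phi_prime(ri / eps) for ri in r]
    z = sum(q)
    q = [qi / z for qi in q]
    # Step 2: parameter update driven by the q-weighted residuals
    # (stands in for a first- or second-order line search).
    params = grad_step(params, points, q)
    # Step 3: anneal the temperature (logarithmic schedule as an example).
    eps = max(r) / math.log(2 + k)
    return params, q, eps
```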

Update the tilted distribution

While the generalized Gibbs variational principle does not typically yield a closed-form update rule for the distribution q, for a fixed potential Φ one can obtain the optimal tilt by choosing an appropriate annealing schedule (discussed in the coming subsection). Under this assumption, the optimal sampling distribution for the next iteration, qk+1, is given by

$${q}^{k+1}(x)\propto {\Phi }^{{\prime} }\left(\frac{r(x;{\theta }^{k})}{{\epsilon }^{k}}\right),$$

(12)

which we evaluate only at the collocation points \({\{{X}_{i}\}}_{i=1}^{N}\). The above form holds for all admissible potential functions; only the normalization differs. When \(\Phi (r)={e}^{r}\), we can immediately deduce that

$${q}^{k+1}(x)=\frac{\exp \left(\frac{r(x;{\theta }^{k})}{{\epsilon }^{k}}\right)}{{\int }_{\Omega }\exp \left(\frac{r(y;{\theta }^{k})}{{\epsilon }^{k}}\right)dy}.$$

(13)

For other choices of Φ, normalization will be enforced via the choice of ϵk (the annealing parameter) instead, i.e., the proportionality constant in (12) is one.
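Evaluated on the collocation points, the exponential-potential update (13) is a softmax over the scaled residuals. A small sketch follows; the max-subtraction is a standard numerical-stability device, not part of the formulation itself:

```python
import math

def softmax_tilt(residuals, eps):
    """Discrete form of equation (13): q_i proportional to exp(r_i / eps)."""
    scaled = [r / eps for r in residuals]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    w = [math.exp(s - m) for s in scaled]
    z = sum(w)
    return [wi / z for wi in w]
```

Smaller ϵ sharpens the distribution around the largest residuals, which is exactly the role the annealing schedule plays.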

In both settings, the collocation points {Xi} are randomly sampled, so the empirical target distribution qk+1 can be prone to spurious fluctuations. To promote stability, especially when using fast-growing potentials such as \(\Phi (r)={e}^{r}\), we smooth the target distribution over time using an exponential moving average (EMA). Furthermore, we introduce an additional smoothing mechanism by interpolating between the adaptive distribution qk+1 and the base uniform distribution p. The combined update rule for the importance weights reads

$${\lambda }_{i}^{k+1}=\gamma {\lambda }_{i}^{k}+{\eta }^{* }(\phi {q}^{k+1}({X}_{i},{\theta }^{k})+(1-\phi ){p}_{u}({X}_{i}))$$

(14)

where γ ∈ [0, 1) is a memory term and η* is a learning rate. The parameter ϕ ∈ [0, 1] controls the degree of adaptivity; ϕ = 1 corresponds to the fully adaptive case, from which the original method is recovered, while smaller values of ϕ increase stability by biasing the distribution towards uniformity. For stability reasons, which become particularly important for second-order methods, we have found it beneficial to normalize the learning rate as \({\eta }^{* }=\eta /{\max }_{\Omega }{q}^{k+1}\). The resulting vector λk+1 can be interpreted as the smoothed, unnormalized distribution that guides the optimization. While it incorporates information from the optimal distribution qk+1, it is not itself a probability mass function, as it does not necessarily sum to one.
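A sketch of the combined update (14) in Python; here the base distribution is taken as uniform, `p_uniform = 1/N`, and the learning rate is normalized by the maximum of the target, following the choice described above (all names are illustrative):

```python
def update_weights(lmbda, q_new, eta, gamma, phi):
    """Equation (14): EMA of the tilted target mixed with the uniform base."""
    n = len(q_new)
    p_uniform = 1.0 / n
    eta_star = eta / max(q_new)  # learning-rate normalization for stability
    return [gamma * l + eta_star * (phi * q + (1.0 - phi) * p_uniform)
            for l, q in zip(lmbda, q_new)]
```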

Update the model parameters

Once the smoothed importance weights \(\{{\lambda }_{i}^{k+1}\}\) have been computed, they can be used to formulate the loss function in two primary ways: by guiding a resampling process or by directly weighting the residuals.

Adaptive Sampling

In this approach, the weights are first normalized to recover a probability mass function (p.m.f.) over the discrete domain Ω:

$${\overline{q}}_{i}^{k+1}=\frac{{\lambda }_{i}^{k+1}}{{\sum }_{j}{\lambda }_{j}^{k+1}}$$

(15)

This new distribution \(\overline{q}\) is then used to resample a new set of training points \({\{{X}_{i}\}}_{i=1}^{N}\) from the full collocation set Ω. By focusing the sampling on high-importance regions, the loss can be computed as a standard, unweighted mean-squared error on this new, more challenging set of points

$${\mathcal{L}}({\theta }^{k+1})=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}r{({X}_{i},{\theta }^{k+1})}^{2}.$$

(16)

Notably, this framework can recover methods similar to residual-based adaptive sampling13 by setting the potential to \(\Phi (x)={x}^{2}+1\) and the EMA parameters to η* = 1 and γ = 0.
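The resampling step in equations (15)-(16) amounts to normalizing the weights and drawing from the resulting p.m.f. A sketch using only the standard library; `random.choices` draws with replacement, which matches i.i.d. sampling:

```python
import random

def resample(points, lmbda, n, rng=random):
    """Equation (15): normalize weights to a p.m.f., then draw n training points."""
    total = sum(lmbda)
    pmf = [l / total for l in lmbda]
    return rng.choices(points, weights=pmf, k=n)
```

The unweighted mean-squared error (16) is then evaluated on the resampled batch.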

Importance Weighting

Alternatively, the weights can be used directly in an importance weighting scheme. In this case, the training points \({\{{X}_{i}\}}_{i=1}^{N}\) are sampled uniformly from Ω. The weights \(\{{\lambda }_{i}^{k+1}\}\) are then applied directly to the residuals within the loss calculation, creating a weighted objective that reads

$${\mathcal{L}}({\theta }^{k+1})=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{[{\lambda }_{i}^{k+1}r({X}_{i},{\theta }^{k+1})]}^{2}.$$

(17)

One might worry that squaring the weights and residuals departs from the procedure outlined previously. However, by an application of Jensen’s inequality, squaring the residuals (and weights, when applicable) corresponds to solving a strictly stronger problem, with the added benefit of differentiability.

This framework is also general enough to recover other popular methods. For example, with the potential \(\Phi (x)={x}^{2}+1\) and specific choices of EMA parameters, it is possible to recover the traditional residual-based adaptive weights15 and their variations42,45.

Once the objective function \({\mathcal{L}}({\theta }^{k+1})\) is calculated, the model parameters are optimized using a line search algorithm of the form

$${\theta }^{k+1}={\theta }^{k}+{\alpha }^{k}{p}^{k}$$

(18)

where αk is the step size, and \({p}^{k}=-{H}_{k}{\nabla }_{\theta }{\mathcal{L}}({\theta }^{k})\) is the update direction which depends on the gradient of the objective function and some symmetric matrix Hk9.

Update the regularization parameter

As alluded to before, the choice of the annealing schedule depends on the choice of potential Φ. There are two cases to be discussed, and the latter has two further subcases.

Case I: Exponential Potential (\(\Phi (x)={e}^{x}\))

For this choice, there are no requirements on the choice of ϵ thanks to the duality between entropy and free energy. The particular choice we implemented reads

$${\epsilon }^{k+1}=\frac{c\,{\max }_{\Omega }| r(x;{\theta }^{k})| }{\log (2+k)}.$$

(19)

This schedule has several advantages. Using the maximum residual, \({\max }_{\Omega }| r| \), in the numerator provides a dynamic, problem-dependent characteristic scale for the temperature ϵ. This helps to normalize the magnitude of the residuals relative to the magnitude of the solution itself. The logarithmic term in the denominator ensures a slow and stable decay, such that ϵk → 0 as k → ∞, which gradually sharpens the distribution’s focus on the largest residuals. In the context of simulated annealing, logarithmic decay is sufficiently slow to guarantee global convergence [ref. 46, Theorem 1].

Case II: general potentials

For this case, the optimality of the distribution q holds under the normalization condition

$${\int }_{\Omega }{\Phi }^{{\prime} }\left(\frac{r(x;\theta )}{\epsilon }\right)p(dx)=1.$$

(20)

For this case, the constraint is satisfied by choosing ϵk to be the normalizing constant. There are two further cases.

1. There are cases where ϵ can be computed analytically, for example, for the polynomials \(\Phi (r)={r}^{p}+c\) with \(p\in (1,\infty )\) and \(c\in {\mathbb{R}}\). In this case, we can see that

$${\int }_{\Omega }{\Phi }^{{\prime} }\left(\frac{r(x;\theta )}{\epsilon }\right)p(dx)=\frac{p}{{\epsilon }^{p-1}}{\int }_{\Omega }| r(x;\theta ){| }^{p-1}p(dx)=1,$$

(21)

so

$$\epsilon ={\left(p{\int }_{\Omega }| r(x;\theta ){| }^{p-1}p(dx)\right)}^{1/(p-1)}.$$

(22)

2. In many other cases, e.g., \(\Phi (r)=\cosh (r)\), \(\Phi (r)={e}^{{r}^{2}}\), or \(\Phi (r)=(1+r)\log (1+r)-r\), analytic computation might be challenging. We instead calculate ϵk dynamically at each training step using a Newton-Raphson solver. To ensure stability and speed, we derive robust initial guesses based on the asymptotic behavior of each potential: \({\epsilon }_{0}\approx {r}_{\max }/\ln (2N)\) for \(\cosh (r)\), \({\epsilon }_{0}\approx {r}_{\max }/\sqrt{\ln N}\) for \({e}^{{r}^{2}}\), and \({\epsilon }_{0}\approx \overline{r}/(e-1)\) for the logarithmic potential. Due to the strict convexity of these functions and the accuracy of the initialization, the solver typically converges to machine precision in a few iterations (see Algorithm 1). This dynamic adjustment induces minimal computational overhead relative to the full backpropagation step: for all the analyzed examples, including vRBA adds less than 10% computational overhead relative to the vanilla models.
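For the polynomial case, the closed-form choice of ϵ can be checked numerically: plugging it back into the empirical version of the normalization condition (20), with the integral replaced by a mean over a residual batch, returns one exactly. A short sketch:

```python
def eps_polynomial(residuals, p):
    """Closed-form epsilon for Phi(r) = r^p + c on a discrete batch."""
    mean_pow = sum(abs(r) ** (p - 1) for r in residuals) / len(residuals)
    return (p * mean_pow) ** (1.0 / (p - 1))

def constraint(residuals, p, eps):
    """Empirical normalization condition (20); here Phi'(u) = p * u^(p-1)."""
    return sum(p * (abs(r) / eps) ** (p - 1) for r in residuals) / len(residuals)
```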

Algorithm 1

Newton-Raphson Solver for ϵk

Require: Residual batch r; Potential Φ; Max iterations M = 20; Tolerance δ = 10−8.

Ensure: Optimal scaling parameter ϵ.

 1: Compute statistics: \({r}_{\max }\Leftarrow \max (r)\), \(\overline{r}\Leftarrow {\rm{mean}}(r)\), \(N\Leftarrow {\rm{length}}(r)\).

 2: Initialization (Asymptotic Approximation):

 3: if \(\Phi (r)=\cosh (r)\) then

 4:   \(\epsilon \Leftarrow {r}_{\max }/\ln (2N)\)

 5: else if \(\Phi (r)={e}^{{r}^{2}}\) then

 6:   \(\epsilon \Leftarrow {r}_{\max }/\sqrt{\ln (N)}\)

 7: else if \(\Phi (r)=(1+r)\log (1+r)-r\) then

 8:   \(\epsilon \Leftarrow \overline{r}/(e-1)\)

 9: end if

10: \(\epsilon \Leftarrow \max (\epsilon ,\delta )\)

11: Newton-Raphson Loop:

12: for j = 1…M do

13:  \(u\Leftarrow r/\epsilon \)

14:  Compute constraint residual: \(F(\epsilon )\Leftarrow \frac{1}{N}\sum {\Phi }^{{\prime} }(u)-1\)

15:  Compute gradient: \({F}^{{\prime} }(\epsilon )\Leftarrow \frac{1}{N}\sum {\Phi }^{{\prime\prime} }(u)\cdot (-u/\epsilon )\)

16:  Update: \(\epsilon \Leftarrow \epsilon -F(\epsilon )/({F}^{{\prime} }(\epsilon )-\delta )\)

17:  Clamp: \(\epsilon \Leftarrow \max (\epsilon ,\delta )\)

18: end for

19: return ϵ
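A Python rendering of Algorithm 1 for the \(\cosh \) potential may look as follows (only the cosh branch is shown; here \({\Phi }^{{\prime} }(u)=\sinh u\) and \({\Phi }^{{\prime\prime} }(u)=\cosh u\)):

```python
import math

def solve_eps_cosh(residuals, max_iter=20, tol=1e-8):
    """Newton-Raphson solver for the normalization constraint, Phi(r) = cosh(r)."""
    n = len(residuals)
    eps = max(max(residuals) / math.log(2 * n), tol)  # asymptotic initial guess
    for _ in range(max_iter):
        u = [r / eps for r in residuals]
        f = sum(math.sinh(ui) for ui in u) / n - 1.0           # constraint residual
        df = sum(math.cosh(ui) * (-ui / eps) for ui in u) / n  # derivative in eps
        eps = max(eps - f / (df - tol), tol)                   # guarded Newton step
    return eps
```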

Physics-informed neural networks (PINNs)

In the Physics-Informed Neural Network (PINN) framework, the goal is to approximate the solution \(\bar{u}(x)\) of a PDE or ODE using a parameterized representation model, u(x; θ). The training objective is to minimize a loss function composed of several residual terms that enforce the problem’s physical and data constraints.

The primary component is the PDE residual, which measures how well the model satisfies the governing equations. It is defined using a differential operator \({\mathcal{F}}\) as:

$$r(x):=| {\mathcal{F}}[u(\cdot ;\theta )]| .$$

(23)

Additionally, the model must match any available data, which may include boundary conditions, initial conditions, or sparse observations from a known function \(\bar{u}(x)\). This is enforced through a data-fit residual, defined as the pointwise error:

$$r(x):=| \bar{u}(x)-u(x;\theta )| .$$

(24)

The total loss function, \({\mathcal{L}}\), is a weighted sum of the individual loss terms computed from these residuals. For a problem with governing equations (E), boundary/initial conditions (B), and observational data (D), the total loss is:

$${\mathcal{L}}={m}_{E}{{\mathcal{L}}}_{E}+{m}_{B}{{\mathcal{L}}}_{B}+{m}_{D}{{\mathcal{L}}}_{D},$$

(25)

where mE, mB, and mD are global weights that balance the contribution of each term. The individual loss functions (\({{\mathcal{L}}}_{E}\), \({{\mathcal{L}}}_{B}\), \({{\mathcal{L}}}_{D}\)) are each computed as described in equation (16) for the adaptive sampling approach or as in equation (17) for the importance weighting. To ensure the update directions induced by the different loss components are balanced, we employ the self-scaling mechanism presented in42.

Global weights

Notice that for first-order optimizers such as Adam, the update direction for PINNs (i.e., for the loss in equation (25)) is given by:

$${p}^{k}=-{m}_{E}{\nabla }_{\theta }{{\mathcal{L}}}_{E}({\theta }^{k})-{m}_{B}{\nabla }_{\theta }{{\mathcal{L}}}_{B}({\theta }^{k})-{m}_{D}{\nabla }_{\theta }{{\mathcal{L}}}_{D}({\theta }^{k}),$$

(26)

where \({\nabla }_{\theta }{{\mathcal{L}}}_{E}\), \({\nabla }_{\theta }{{\mathcal{L}}}_{B}\), and \({\nabla }_{\theta }{{\mathcal{L}}}_{D}\) are the loss gradients which can be represented as high-dimensional vectors defining directions to minimize their respective loss terms. Notice that if the gradient magnitudes are imbalanced, one direction will dominate, which may lead to poor convergence. To address this challenge, we propose modifying the magnitude of the individual directions by scaling their respective global weights. In particular, we fix mE and update the remaining global weights using the rule:

$${m}_{B}^{k}=\alpha {m}_{B}^{k-1}+(1-\alpha )\frac{||{\nabla }_{\theta }{{\mathcal{L}}}_{E}||}{||{\nabla }_{\theta }{{\mathcal{L}}}_{B}||},$$

(27)

$${m}_{D}^{k}=\alpha {m}_{D}^{k-1}+(1-\alpha )\frac{||{\nabla }_{\theta }{{\mathcal{L}}}_{E}||}{||{\nabla }_{\theta }{{\mathcal{L}}}_{D}||},$$

(28)

where α ∈ [0, 1] is a stabilization parameter47. This formulation computes an iteration-wise running average of the ratio between gradient norms, enabling normalized scaling which, on average, yields a balanced update direction \({\widehat{p}}^{k}\):

$${\widehat{p}}^{k}\approx -||{\nabla }_{\theta }{{\mathcal{L}}}_{E}||\,\left[{m}_{E}\frac{{\nabla }_{\theta }{{\mathcal{L}}}_{E}({\theta }^{k})}{||{\nabla }_{\theta }{{\mathcal{L}}}_{E}||}+\frac{{\nabla }_{\theta }{{\mathcal{L}}}_{B}({\theta }^{k})}{||{\nabla }_{\theta }{{\mathcal{L}}}_{B}||}+\frac{{\nabla }_{\theta }{{\mathcal{L}}}_{D}({\theta }^{k})}{||{\nabla }_{\theta }{{\mathcal{L}}}_{D}||}\right].$$

(29)

Under this approach, all loss components have balanced magnitudes, allowing each optimization step to minimize all terms effectively.
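A sketch of the gradient-norm balancing in equations (27)-(28); the gradients are toy vectors here, standing in for the backpropagated gradients of each loss term:

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def balance_weights(m_b, m_d, grad_e, grad_b, grad_d, alpha=0.9):
    """Equations (27)-(28): EMA of the ratios of gradient norms."""
    ne = l2_norm(grad_e)
    m_b = alpha * m_b + (1.0 - alpha) * ne / l2_norm(grad_b)
    m_d = alpha * m_d + (1.0 - alpha) * ne / l2_norm(grad_d)
    return m_b, m_d
```

For stationary gradients, the weights converge geometrically to the exact norm ratios, so the three scaled directions end up with comparable magnitudes.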

A detailed description of the proposed method is presented in Algorithm 2. For the second-order experiments, we follow the general methodology of9, which uses the SSBroyden optimizer after 5000 Adam pre-training iterations. The crucial modification in our work is that the sampling distribution is generated by our vRBA framework rather than by the standard RAD formulation used in the reference.

Algorithm 2

vRBA for PINNs

Require: Representation model \({\mathcal{M}}\); Training points XB, XD, XE; Optimizer parameters lr; vRBA parameters \(\eta ,{\lambda }_{ma{x}_{0}},{\lambda }_{cap},{\alpha }_{g},{m}_{E},{\gamma }_{g}\); Iterations per stage Nstage; Total iterations Ntrain; Boolean flags adaptive_weights, adaptive_distribution.

Ensure: Optimized network parameters θ.

 1: Initialize network parameters θ0.

 2: Initialize weights \({\lambda }_{\alpha ,i}^{0}\Leftarrow 0.1{\lambda }_{max0}\) for each loss component α ∈ {B, D, E}.

 3: Initialize sampling p.m.f. \({\overline{q}}_{\alpha }\) to be uniform for each α.

 4: \(k\Leftarrow 0\).

 5: while k < Ntrain do

 6: \({\lambda }_{max}\Leftarrow \min ({\lambda }_{max0}+k/{N}_{stage},{\lambda }_{cap})\)

 7:  \({\gamma }^{k}\Leftarrow 1-\eta /{\lambda }_{max}\)

 8:  for α ∈ {B, D, E} do

 9:   if adaptive_distribution then

10:    Update sampling p.m.f: \({\overline{q}}_{\alpha }^{k}\Leftarrow {{\boldsymbol{\lambda }}}_{\alpha }^{k}/\sum ({{\boldsymbol{\lambda }}}_{\alpha }^{k})\)

11:   end if

12:   Sample batch \({X}_{\alpha }^{k} \sim {\overline{q}}_{\alpha }^{k}\) from Xα.

13:   Compute predictions: \({u}_{\alpha ,i}\Leftarrow {\mathcal{M}}({\theta }^{k},{x}_{\alpha ,i}^{k})\) for each \({x}_{\alpha ,i}^{k}\in {X}_{\alpha }^{k}\).

14:   Compute residuals \({r}_{\alpha ,i}^{k}\) using equations (24) or (23).

15:   Update tilted distribution \({q}_{\alpha ,i}^{k}\) using equation (12).

16:   Apply EMA: \({\lambda }_{\alpha ,i}^{k+1}\Leftarrow {\gamma }^{k}{\lambda }_{\alpha ,i}^{k}+{\eta }^{* }{q}_{\alpha ,i}^{k}\).

17:   if adaptive_weights then

18:    Compute loss term: \({{\mathcal{L}}}_{\alpha }^{k}\Leftarrow \frac{1}{| {X}_{\alpha }^{k}| }{\sum }_{i}{({\lambda }_{\alpha ,i}^{k+1}{r}_{\alpha ,i}^{k})}^{2}\).

19:   else

20:    Compute loss term: \({{\mathcal{L}}}_{\alpha }^{k}\Leftarrow \frac{1}{| {X}_{\alpha }^{k}| }{\sum }_{i}{({r}_{\alpha ,i}^{k})}^{2}\).

21:   end if

22:   Compute gradient \({\nabla }_{\theta }{{\mathcal{L}}}_{\alpha }^{k}\).

23:    Update average gradient: \(\parallel {\nabla }_{\theta }{\overline{{\mathcal{L}}}}_{\alpha }^{k}\parallel \Leftarrow {\gamma }_{g}\parallel {\nabla }_{\theta }{\overline{{\mathcal{L}}}}_{\alpha }^{k-1}\parallel +(1-{\gamma }_{g})\parallel {\nabla }_{\theta }{{\mathcal{L}}}_{\alpha }^{k}\parallel\).

24:   end for

25:   Update global weight: \({m}_{D}^{k+1}\Leftarrow {\alpha }_{g}{m}_{D}^{k}+(1-{\alpha }_{g}){m}_{E}\frac{| | {\nabla }_{\theta }{{\overline{{\mathcal{L}}}}_{E}}^{k}| | }{| | {\nabla }_{\theta }{{\overline{{\mathcal{L}}}}_{D}}^{k}| | }\).

26:    Define total update direction: \({p}^{k}\Leftarrow -{m}_{E}{\nabla }_{\theta }{{\mathcal{L}}}_{E}^{k}-{m}_{B}{\nabla }_{\theta }{{\mathcal{L}}}_{B}^{k}-{m}_{D}^{k+1}{\nabla }_{\theta }{{\mathcal{L}}}_{D}^{k}\).

27:   Update parameters: \({\theta }^{k+1}\Leftarrow {\theta }^{k}+lr\,{p}^{k}\).

28:   \(k\Leftarrow k+1\).

29: end while

Benchmarks

Allen-Cahn

The Allen-Cahn equation is a widely recognized benchmark in PINNs due to its challenging characteristics. The 1D Allen-Cahn PDE is defined as:

$$\frac{\partial u}{\partial t}=k\frac{{\partial }^{2}u}{\partial {x}^{2}}-5u({u}^{2}-1),$$

(30)

where k = 10−4. The problem is further defined by the following initial condition and periodic boundary conditions in x:

$$u(0,x)={x}^{2}\cos (\pi x),\,\forall x\in [-1,1],\qquad u(t,-1)=u(t,1),\quad {u}_{x}(t,-1)={u}_{x}(t,1),\,\forall t\ge 0,$$

(31)

Burgers Equation

The Burgers’ equation is defined as

$${u}_{t}+u{u}_{x}=\nu {u}_{xx},$$

(32)

where u represents the velocity field and ν is the kinematic viscosity. In this study we consider two separate cases, \(\nu =\frac{1}{100\pi }\) and \(\nu =\frac{1}{1000}\). The initial and boundary conditions are described as follows

$$u(0,x)=-\sin (\pi x),\,\forall x\in \Omega ,$$

(33)

$$u(t,-1)=u(t,1)=0,\,\forall t\ge 0,$$

(34)

defined over the spatial domain Ω = (−1, 1) and time interval (0, 1), with homogeneous Dirichlet boundary conditions in x.

Korteweg-De Vries (KdV)

The Korteweg-De Vries (KdV) equation is a canonical model for shallow water waves and serves as a rigorous benchmark for PINNs due to the presence of third-order spatial derivatives and nonlinear soliton interactions. The PDE is defined as:

$$\frac{\partial u}{\partial t}+6u\frac{\partial u}{\partial x}+\frac{{\partial }^{3}u}{\partial {x}^{3}}=0,$$

(35)

defined over the spatio-temporal domain \(x\in [0,20]\) and \(t\in [0,5]\). The problem is closed by the following initial and boundary conditions:

$$u(0,x)={g}_{0}(x),\,\forall x\in [0,20],\qquad u(t,0)={g}_{1}(t),\quad u(t,20)={g}_{2}(t),\quad {u}_{x}(t,20)={g}_{3}(t),\,\forall t\in [0,5],$$
(36)

where the boundary functions g0, g1, g2, and g3 are derived by evaluating the analytical solution at the domain boundaries. The exact solution, describing the interaction of two solitons, is given by:

$$u(x,t)=\frac{2({c}_{1}-{c}_{2})\left[{c}_{1}{\cosh }^{2}\left(\frac{\sqrt{{c}_{2}}{\zeta }_{2}}{2}\right)+{c}_{2}{\sinh }^{2}\left(\frac{\sqrt{{c}_{1}}{\zeta }_{1}}{2}\right)\right]}{{\left[\left(\sqrt{{c}_{1}}-\sqrt{{c}_{2}}\right)\cosh \left(\frac{\sqrt{{c}_{1}}{\zeta }_{1}+\sqrt{{c}_{2}}{\zeta }_{2}}{2}\right)+\left(\sqrt{{c}_{1}}+\sqrt{{c}_{2}}\right)\cosh \left(\frac{\sqrt{{c}_{1}}{\zeta }_{1}-\sqrt{{c}_{2}}{\zeta }_{2}}{2}\right)\right]}^{2}},$$

(37)

where \({\zeta }_{i}=x-{c}_{i}t-{x}_{i}\) for i = 1, 2. The specific parameters for this benchmark are set to c1 = 6.0, c2 = 2.0, x1 = −2.0, and x2 = 2.0.
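For reference, the exact solution (37) can be evaluated directly; a sketch in Python, with the parameters defaulting to the values above:

```python
import math

def kdv_two_soliton(x, t, c1=6.0, c2=2.0, x1=-2.0, x2=2.0):
    """Two-soliton solution of the KdV equation, equation (37)."""
    z1 = x - c1 * t - x1  # zeta_1
    z2 = x - c2 * t - x2  # zeta_2
    s1, s2 = math.sqrt(c1), math.sqrt(c2)
    num = 2.0 * (c1 - c2) * (c1 * math.cosh(s2 * z2 / 2.0) ** 2
                             + c2 * math.sinh(s1 * z1 / 2.0) ** 2)
    den = ((s1 - s2) * math.cosh((s1 * z1 + s2 * z2) / 2.0)
           + (s1 + s2) * math.cosh((s1 * z1 - s2 * z2) / 2.0)) ** 2
    return num / den
```

Each soliton travels with speed ci, so the faster, taller wave overtakes the slower one inside the time window, which is what makes the interaction region hard for a solver to resolve.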

Operator learning

Let \({\mathcal{X}}\) be a space of functions over a domain \({\Omega }_{X}\subset {{\mathbb{R}}}^{{d}_{x}}\), and \({\mathcal{Y}}\) be a space of functions over \({\Omega }_{Y}\subset {{\mathbb{R}}}^{{d}_{y}}\). The operator of interest is

$${\mathcal{G}}:{\mathcal{X}}\ni v\,\mapsto {\mathcal{G}}[v]\in {\mathcal{Y}}.$$

The goal is to learn a parametric model Gθ that approximates \({\mathcal{G}}\). The residual \(R:{\mathcal{X}}\times {\Omega }_{Y}\to {{\mathbb{R}}}_{+}\) for this task is defined as the difference between the operator’s prediction and the true solution and reads

$$R(v,x;\theta )=| {G}_{\theta }(v)(x)-{\mathcal{G}}[v](x)| ,$$

(38)

where \(v\in {\mathcal{X}}\) is an input function and \({\mathcal{G}}[v]\) is the corresponding true output function evaluated at a point x ΩY. The training data consists of Nfunc input-output function pairs, \({\{{v}_{j},{\mathcal{G}}[{v}_{j}]\}}_{j=1}^{{N}_{func}}\), where each output function \({\mathcal{G}}[{v}_{j}]\) is evaluated at N discrete points \({\{{x}_{i}\}}_{i=1}^{N}\). The standard loss is an average over both the function instances and the spatial points:

$${\mathcal{L}}(\theta )=\frac{1}{{N}_{{\text{func}}}}\mathop{\sum }\limits_{j=1}^{{N}_{{\text{func}}}}\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{[R({v}_{j},{x}_{i};\theta )]}^{2}$$

(39)

A single importance sampling or weighting scheme is ill-suited for this problem due to the two distinct levels of discretization (in function space and spatial domains). To address this, we propose a mixed strategy: importance weighting is used for the spatial points within each function, while adaptive sampling is used for the functions themselves. This is motivated by the fact that many neural operators have a fixed spatial discretization, making weighting a natural fit, while the function space offers more flexibility for sampling.

The loss function for a batch of bu functions is updated as follows

$${\mathcal{L}}(\theta )=\frac{1}{{b}_{u}}\mathop{\sum }\limits_{j=1}^{{b}_{u}}\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{[{\Lambda }_{i,j}R({v}_{j},{x}_{i};\theta )]}^{2},$$

(40)

where the functions \({\{{v}_{j}\}}_{j=1}^{{b}_{u}}\) are sampled from the full set of training functions. The term Λ is a matrix of importance weights, where Λi,j corresponds to point xi for function vj. These weights are constructed from a target p.m.f. matrix Qk, whose form depends on the choice of potential. For instance, when \(\Phi (x)={e}^{x}\), \({Q}^{k}\in {{\mathbb{R}}}^{N\times {N}_{{\text{func}}}}\) is defined as follows

$${Q}_{i,j}^{k+1}({\theta }^{k})=\frac{\exp \left(\frac{R({v}_{j},{x}_{i};{\theta }^{k})}{{\epsilon }^{k}}\right)}{{\sum }_{\ell =1}^{N}\exp \left(\frac{R({v}_{j},{x}_{\ell };{\theta }^{k})}{{\epsilon }^{k}}\right)}.$$

(41)

Note that each column of the matrix Q (for a fixed function j) is a p.m.f. over the spatial points, focusing attention on high-residual regions for that specific function. The weights are then smoothed over time with an EMA

$${\Lambda }_{i,j}^{k+1}=\gamma {\Lambda }_{i,j}^{k}+{\eta }^{* }{Q}_{i,j}^{k+1}({\theta }^{k}).$$

(42)

As in the previous case, we can set the learning rate for stability, for example, by normalizing it per function as \({\eta }^{* }=\eta /{\max }_{i}{Q}_{i,j}\), which is consistent with our two-level discretization. This EMA formulation has the useful property of keeping the weights bounded. As described in15, the update rule ensures that the weights are constrained to the interval \({\Lambda }_{i,j}\in (0,\frac{{\eta }^{* }}{1-\gamma })\), which aids in stabilizing the training process.
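The boundedness property is easy to confirm in a sketch: since each column of Q is a p.m.f., its entries lie in [0, 1], and the EMA iterates (42) then stay strictly below η*/(1 − γ):

```python
def ema_iterate(q_seq, eta_star, gamma, lam0=0.0):
    """Run the EMA update (42) on a scalar weight for a sequence of q-values."""
    lam = lam0
    for q in q_seq:
        lam = gamma * lam + eta_star * q
    return lam
```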

A key advantage of this framework is that, if η ≠ 1 − γ, we can leverage the imbalance in the learned spatial weights Λi,j to construct a sampling distribution over the functions themselves. The intuition is that functions with higher overall residuals naturally accumulate larger Λ values over time. We therefore propose the following approach to create a function-level sampling distribution. First, we compute an aggregated importance score sj for each function by summing its spatial weights

$${s}_{j}=\mathop{\sum }\limits_{i=1}^{N}{\Lambda }_{i,j}.$$

(43)

These scores are then normalized to create a p.m.f. over the function space:

$${\overline{q}}_{j}=\frac{{s}_{j}}{{\sum }_{\ell =1}^{{N}_{func}}{s}_{\ell }}.$$

(44)

This distribution \(\overline{q}\) can then be used to sample the most informative functions vj for the next training batch. A detailed description of the proposed method is given in Algorithm 3.
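The aggregation in equations (43)-(44) reduces the weight matrix to a function-level p.m.f.; a sketch, with Λ stored as one list of spatial weights per function (an illustrative layout):

```python
def function_pmf(weights_per_function):
    """Equations (43)-(44): sum spatial weights per function, then normalize."""
    scores = [sum(col) for col in weights_per_function]  # s_j = sum_i Lambda_ij
    total = sum(scores)
    return [s / total for s in scores]                   # q_bar_j
```

The resulting p.m.f. can be fed directly to a weighted sampler (e.g., `random.choices`) to draw the next batch of functions.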

Algorithm 3

vRBA for Operator Learning

Require: Neural Operator Gθ; Training data \({\{{v}_{j},{u}_{j}\}}_{j=1}^{{N}_{func}}\); Optimizer parameters lr; vRBA parameters \(\eta ,{\lambda }_{ma{x}_{0}},{\lambda }_{cap},\gamma\); Batch size bu; Update frequency Nupdate; Total iterations Ntrain.

Ensure: Optimized network parameters θ.

 1: Initialize network parameters θ0.

 2: Initialize weights \({\Lambda }_{i,j}^{0}\Leftarrow 0.1{\lambda }_{max0}\) for all i, j.

 3: Initialize function sampling p.m.f. \({\overline{q}}_{j}^{0}\Leftarrow 1/{N}_{func}\) for all j.

 4: for k 0 to Ntrain − 1 do

 5: \({\lambda }_{max}\Leftarrow \min ({\lambda }_{cap},{\lambda }_{max0}+k/{N}_{stage})\)

 6:  \({\gamma }^{k}\Leftarrow 1-\eta /{\lambda }_{max}\)

 7:  Sample a batch of bu function indices \({{\mathcal{J}}}_{k} \sim {\overline{q}}^{k}\).

 8:   Compute the batch residuals: \({R}_{i,j}^{k}\Leftarrow {G}_{{\theta }^{k}}({v}_{j})({x}_{i})-{u}_{j}({x}_{i})\) for \(i\in \{1..N\},j\in {{\mathcal{J}}}_{k}\).

 9:  Update target distribution \({Q}_{i,j}^{k+1}\) using \(| {R}_{i,j}^{k}|\) (via Eq. (12)).

10:  Update weights via EMA: \({\Lambda }_{i,j}^{k+1}\Leftarrow {\gamma }^{k}{\Lambda }_{i,j}^{k}+{\eta }^{* }{Q}_{i,j}^{k+1}\) for \(j\in {{\mathcal{J}}}_{k}\).

11:    Compute weighted loss for the batch: \({{\mathcal{L}}}^{k}\Leftarrow \frac{1}{{b}_{u}N}{\sum }_{j\in {{\mathcal{J}}}_{k}}{\sum }_{i=1}^{N}{[{\Lambda }_{i,j}^{k+1}{R}_{i,j}^{k}]}^{2}\).

12:  Compute gradient of the loss: \({g}^{k}\Leftarrow {\nabla }_{\theta }{{\mathcal{L}}}^{k}{| }_{\theta ={\theta }^{k}}\).

13:  Update parameters: \({\theta }^{k+1}\Leftarrow {\theta }^{k}-lr\,{g}^{k}\).

14:  if \(k\,{\rm{mod}}\,{N}_{update}=0\) then

15:   Aggregate scores: \({s}_{j}^{k+1}\Leftarrow {\sum }_{i=1}^{N}{\Lambda }_{i,j}^{k+1}\) for j = 1, …, Nfunc.

16:   Normalize to form new p.m.f.: \({\overline{q}}_{j}^{k+1}\Leftarrow {s}_{j}^{k+1}/{\sum }_{\ell =1}^{{N}_{func}}{s}_{\ell }^{k+1}\).

17:  end if

18: end for

DeepONet

DeepONet consists of two networks – a trunk network and a branch network. The trunk network encodes spatial coordinates and learns a basis in the target function space, while the branch network maps the input function, evaluated at a fixed set of sensors, to coefficients that project onto this learned basis. The resulting dot product yields the output function at each spatial location. This design is rooted in the operator approximation theorem and enables expressive and efficient modeling of nonlinear operators. DeepONet and its variants are widely applied in mechanics, high-speed flows36, materials science, and multi-phase flows48.

SVD-DeepONet

To address the challenges of modeling discontinuous solutions such as shocks, a two-step training strategy49 is often employed to enhance the standard DeepONet architecture. In this approach, the trunk network is trained first to extract a basis, which is then orthonormalized using QR factorization or Singular Value Decomposition (SVD). While QR factorization ensures orthonormality, SVD is frequently preferred because it provides a unique solution and generates a hierarchical set of orthonormal basis functions that allow for physical interpretation of the flow features. Once this optimized basis is established, the branch network is trained in the second stage to map input parameters to the corresponding coefficients. This modification significantly improves the network’s accuracy, efficiency, and robustness, particularly when solving Riemann problems with extreme pressure ratios36.

FNO

FNOs learn solution operators by leveraging spectral convolutions in the Fourier domain4. The input function is first lifted to a high-dimensional latent space through pointwise linear transformations. A Fourier transform is applied to these lifted features, enabling convolution operations to be performed as multiplications in frequency space. High-frequency modes are typically truncated to enforce smoothness, reduce overfitting, and improve training dynamics. The result is then transformed back to physical space via the inverse Fourier transform and projected to the target dimension. The global receptive field of FNOs makes them particularly effective for modeling long-range dependencies in solutions to PDEs, as demonstrated in applications such as weather forecasting, porous media flows, and turbulence.

TC-UNet

Unlike FNOs, TC-UNet27 operates entirely in physical space using local convolutions. The architecture is based on a UNet, a hierarchical fully convolutional neural network that captures multiscale features through successive downsampling and upsampling. TC-UNet uses time conditioning via feature-wise linear modulation (FiLM)50, applied at each level of the hierarchy. This allows the model to adaptively modulate intermediate features based on the time coordinate input, enabling accurate modeling of spatiotemporal dynamics. TC-UNet or UNet-based architectures are particularly well-suited for problems characterized by sharp gradients36 or fine-scale structures51 and are, in general, more robust to spectral bias52 compared to other neural operator architectures.

Benchmarks

Bubble growth dynamics

We study the dynamics of a single gas bubble in an incompressible liquid governed by the Rayleigh-Plesset (R-P) equation48, a nonlinear ordinary differential equation describing the evolution of the bubble radius R(t) under a time-varying pressure field P(t). Under isothermal assumptions and negligible temperature variations, the simplified linearized R-P equation reads

$$-\frac{\Delta p(t)}{{\rho }_{L}}={R}_{0}\ddot{r}+\frac{4{\nu }_{L}}{{R}_{0}}\dot{r}+\left(\frac{3{P}_{G0}}{{\rho }_{L}{R}_{0}}-\frac{2\gamma }{{\rho }_{L}{R}_{0}^{2}}\right)r,$$

(45)

where r(t) = R(t) − R0 is the deviation from the initial bubble radius R0, ρL is the liquid density, νL is the kinematic viscosity, γ is the surface tension, and PG0 is the initial gas pressure inside the bubble.

We generate a dataset by numerically solving equation (45) for 1000 independent realizations of the forcing function Δp(t), which is constructed as the product of a Gaussian random field and a smooth ramp function, following the procedure in48.

Sod shock tube with high-pressure ratios

We consider the one-dimensional compressible Euler equations in conservation form,

$$\frac{\partial U}{\partial t}+\frac{\partial F(U)}{\partial x}=0,$$

(46)

where the vector of conservative variables U and the flux vector F(U) are defined as

$$U=\left(\begin{array}{c}\rho \\ \rho u\\ \rho E\end{array}\right),\,F(U)=\left(\begin{array}{l}\rho u\\ \rho {u}^{2}+p\\ u(\rho E+p)\end{array}\right).$$

Here, ρ denotes the fluid density, u the velocity, and p the pressure. The total energy E is related to the pressure by the equation of state for an ideal gas, \(\rho E=\frac{p}{\gamma -1}+\frac{1}{2}\rho {u}^{2}\), with the specific heat ratio set to γ = 1.4. The system is subject to discontinuous initial conditions consisting of two constant states, UL and UR, separated by a diaphragm at x = xc.

In this study, we specifically focus on the High-Pressure Ratio (HPR) regime, characterized by extreme pressure jumps (up to a ratio of \({10}^{10}\)) across the discontinuity1. The dataset is generated by varying the initial left-state pressure pL while keeping the right-state parameters fixed, utilizing an exact Riemann solver to provide the ground-truth solutions at a final time tf.

While the system involves three primitive variables, we restrict the neural operator to map the initial pressure parameter pL exclusively to the density field ρ(x, tf). The density profile is particularly challenging and representative, as it uniquely exhibits all three fundamental wave structures inherent to the Riemann problem: the rarefaction wave, the contact discontinuity, and the shock wave.

Navier-Stokes Equations - Kolmogorov flow

We consider the two-dimensional unsteady Navier-Stokes equations in vorticity formulation, modeling an incompressible, viscous fluid on the periodic domain (x, y) ∈ (0, 2π)^2. The system is driven by a Kolmogorov-type external forcing, as previously studied in53, and is governed by:

$$\left\{\begin{array}{l}{\partial }_{t}\omega +{\boldsymbol{u}}\cdot \nabla \omega =\nu \Delta \omega +f(x,y),\\ \nabla \cdot {\boldsymbol{u}}=0,\\ \omega (x,y,0)={\omega }_{0}(x,y),\end{array}\right.$$

(47)

with viscosity ν = 10−3, and the source term defined as

$$f(x,y)=\chi (\sin (2\pi (x+y))+\cos (2\pi (x+y))),$$

(48)

where χ = 0.1. The Laplacian Δ acts in two spatial dimensions, ω denotes the vorticity, and u is the velocity.

Initial conditions ω0(x, y) are sampled from a Gaussian random field with zero mean and covariance operator \({\mathcal{N}}(0,{7}^{3/2}{(-\Delta +49I)}^{-5/2})\). To generate the data, we employ a Fourier-based pseudo-spectral solver introduced in4. The simulation output consists of 1000 spatiotemporal vorticity realizations, each on a 512 × 512 spatial grid, subsequently downsampled to 128 × 128 for downstream learning tasks.
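The covariance operator above is diagonal in the Fourier basis, so sampling reduces to filtering white noise by the square root of its symbol, \(7^{3/4}(|k|^2+49)^{-5/4}\). A sketch of such a sampler is shown below; overall amplitude conventions vary between implementations, so this is not the reference solver's exact sampler.

```python
import numpy as np

def sample_grf(n=128, seed=0):
    """Sample omega_0 ~ N(0, 7^{3/2} (-Delta + 49 I)^{-5/2}) on the torus.

    Spectral method: multiply the FFT of white noise by the square root of
    the covariance symbol and transform back. Amplitude normalization is
    an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    k = np.fft.fftfreq(n, d=1.0 / n)          # integer wavenumbers
    kx, ky = np.meshgrid(k, k, indexing="ij")
    symbol = 7.0**0.75 * (kx**2 + ky**2 + 49.0) ** (-1.25)
    w_hat = np.fft.fft2(rng.standard_normal((n, n)))
    return np.fft.ifft2(symbol * w_hat).real

omega0 = sample_grf()
```

The +49 shift keeps the symbol finite at k = 0, and the exponent −5/2 on the operator (hence −5/4 on its square root) controls the smoothness of the sampled vorticity fields.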

We partition the dataset into training, validation, and testing subsets in an 80:10:10 ratio. A neural operator model \({\mathcal{G}}\) is trained to predict the evolution of the vorticity field by learning the mapping from the initial condition at t = 0 to the solution over the interval t ∈ (0, 50].

For the 2D Navier-Stokes problem, we train a Fourier Neural Operator (FNO) to learn the mapping from an initial vorticity field ω0(x, y) to the full spatiotemporal solution ω(x, y, t).

Wave Equation

We investigate the propagation of acoustic waves governed by the linear wave equation in heterogeneous media. In 2D, the governing equation is given by:

$$\left\{\begin{array}{lc}{\partial }_{t}^{2}u({\boldsymbol{x}},t)={c}^{2}({\boldsymbol{x}})\Delta u({\boldsymbol{x}},t), & {\boldsymbol{x}}\in {[0,\pi ]}^{2},\,t\in [0,2],\\ u({\boldsymbol{x}},0)={u}_{0}({\boldsymbol{x}}), & \begin{array}{cl}\,{\partial }_{t}u({\boldsymbol{x}},0)=0, & {\boldsymbol{x}}\in {[0,\pi ]}^{2},\end{array}\end{array}\right.$$

(49)

where u(x, t) represents the acoustic pressure at spatial location x = (x, y), c(x) is the spatially varying wave speed, and Δ denotes the Laplacian operator. We assume fully reflective (homogeneous Dirichlet) boundary conditions throughout the domain.

For the spatially varying wave speed, we set \(c(x,y)=1+\sin (x)\sin (y)\). The initial pressure profile u0(x) is modeled as a localized Gaussian source centered at a point xc, i.e.,

$${u}_{0}({\boldsymbol{x}})=\exp \left(-\frac{\parallel {\boldsymbol{x}}-{{\boldsymbol{x}}}_{c}{\parallel }^{2}}{10}\right),$$

with xc sampled randomly on the spatial grid. We solve this system numerically using a second-order finite difference method on a 64 × 64 spatial grid and generate 1000 simulations corresponding to different realizations of u0. The dataset is partitioned into training, validation, and test sets in the ratio 80:10:10. We train a neural operator \({\mathcal{G}}\) to learn the mapping u(x, 0) ↦ u(x, t) for all t ∈ (0, 2].
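A single time step of the second-order finite-difference scheme for this problem can be sketched as follows; the leapfrog update and the CFL factor are standard choices and not necessarily the authors' exact implementation.

```python
import numpy as np

def wave_step(u_prev, u_curr, c2, dt, dx):
    """One leapfrog step of u_tt = c(x)^2 * Laplacian(u), Dirichlet BCs.

    Centered second-order differences in both space and time.
    """
    lap = np.zeros_like(u_curr)
    lap[1:-1, 1:-1] = (
        u_curr[2:, 1:-1] + u_curr[:-2, 1:-1]
        + u_curr[1:-1, 2:] + u_curr[1:-1, :-2]
        - 4.0 * u_curr[1:-1, 1:-1]
    ) / dx**2
    u_next = 2.0 * u_curr - u_prev + dt**2 * c2 * lap
    # Homogeneous Dirichlet (fully reflective) boundaries.
    u_next[0, :] = u_next[-1, :] = u_next[:, 0] = u_next[:, -1] = 0.0
    return u_next

n = 64
x = np.linspace(0.0, np.pi, n)
X, Y = np.meshgrid(x, x, indexing="ij")
c2 = (1.0 + np.sin(X) * np.sin(Y)) ** 2          # c(x, y) = 1 + sin(x)sin(y)
xc = (np.pi / 2, np.pi / 2)                      # illustrative source center
u0 = np.exp(-((X - xc[0]) ** 2 + (Y - xc[1]) ** 2) / 10.0)
dx = x[1] - x[0]
dt = 0.25 * dx / 2.0                             # CFL-safe: max wave speed is 2
u_prev, u_curr = u0.copy(), u0.copy()            # encodes du/dt(x, 0) = 0
for _ in range(10):
    u_prev, u_curr = u_curr, wave_step(u_prev, u_curr, c2, dt, dx)
```

Initializing both time levels with u0 enforces the zero initial velocity condition to second order.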

In our final operator learning example, we train a Time-Conditioned U-Net (TC-UNet) to learn the solution operator for the 2D wave equation, mapping an initial pressure profile u0(x) to the full wave propagation over time u(x, t).

vRBA hyperparameter selection

This section details the selection of the hyperparameters introduced by the vRBA framework. The present formulation does not require extensive problem-specific tuning; in fact, all hyperparameters discussed below were held constant across every benchmark presented in this paper. Unless otherwise stated, we use a consistent set of vRBA hyperparameters: an Exponential Moving Average (EMA) memory of γ = 0.999, an EMA learning rate of η* = 0.01, and a stabilization parameter of ϕ = 1.0.

Annealing schedule

For the exponential potential, the convex duality between entropy and free energy permits any choice of annealing schedule. We note the parallel between vRBA and simulated annealing, which is provably convergent under a sufficiently slow temperature decay, e.g., [ref. 46, Theorem 1]. Inspired by these theoretical results, our annealing schedule (see Eq. (19)) follows a logarithmic decay in the number of iterations, scaled by a universal constant, which we set to one (i.e., c = 1) in all our examples.
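Assuming a simulated-annealing-style schedule of the form ϵ_k = c / log(k + k0), a logarithmic decay can be sketched as below; the offset k0 is an illustrative choice that keeps the schedule finite at k = 0, since Eq. (19) is not reproduced here.

```python
import math

def epsilon(k, c=1.0, k0=2.0):
    """Logarithmic annealing schedule, epsilon_k = c / log(k + k0).

    c = 1 matches the universal constant used in the paper; k0 is an
    assumed offset, not taken from the paper.
    """
    return c / math.log(k + k0)

temps = [epsilon(k) for k in range(0, 1000, 100)]
```

As in classical simulated-annealing guarantees, the temperature decays slowly enough that the tilted distribution sharpens only gradually over training.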

For the general potential-dependent case, there are two regimes. For potentials such as \(\cosh (r)\), \({e}^{{r}^{2}}\), and \((1+r)\log (1+r)-r\), where the appropriate ϵ is not available in closed form, the annealing parameter is determined dynamically at each iteration by solving the optimality condition via Newton’s method. For the polynomial potentials (\({r}^{p}\), with p > 1), by contrast, the normalized tilt is invariant to ϵ, so the schedule remains effectively constant throughout training. Consequently, these parameters are governed by theoretical or numerical optimality conditions rather than manual hyperparameter selection.
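The ϵ-invariance for polynomial potentials is easy to verify numerically: with Φ(r) = r^p, the tilt q ∝ Φ′(r/ϵ) = p(r/ϵ)^{p−1} is proportional to r^{p−1}, and the ϵ-dependent prefactor cancels under normalization. The residual values below are arbitrary illustrative numbers.

```python
import numpy as np

def tilt(residuals, eps, p=2.0):
    """Normalized tilted distribution q ∝ Φ'(r/ε) for Φ(r) = r^p."""
    q = p * (residuals / eps) ** (p - 1.0)
    return q / q.sum()  # normalization cancels the ε-dependent prefactor

r = np.array([0.1, 0.5, 2.0, 3.0])   # arbitrary residual magnitudes
q_hot = tilt(r, eps=10.0)
q_cold = tilt(r, eps=0.01)
```

The two temperatures, three orders of magnitude apart, yield identical sampling distributions, which is why no annealing schedule is needed in this regime.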

EMA learning rate

Here, the values of λ are bounded in the interval (0, η*/(1 − γ)), implying a maximum importance score of \({\lambda }_{\max }={\eta }^{* }/(1-\gamma )\). These parameters were originally introduced in the RBA framework15, and we use the exact values established in that work. We note that the use of Exponential Moving Averages (EMA) to stabilize stochastic estimates is standard practice in machine learning, most notably in the Adam optimizer43, where it is essential for convergence stability. Subsequent analyses, such as the sensitivity study performed for standard RBA (Φ(r) = r2) in KKANs42, have investigated the effect of \({\lambda }_{\max }\) on convergence. That study demonstrated that a low bound (\({\lambda }_{\max }\approx 1\)) yields suboptimal results due to insufficient attention, overly aggressive bounds (\({\lambda }_{\max } > 20\)) can slightly degrade performance, and there exists a broad, robust plateau of optimal performance for \({\lambda }_{\max }\in [5,20]\). The configuration used in this paper targets \({\lambda }_{\max }\approx 10\), which lies in this optimal regime. Thus, η* is not a new free parameter requiring tuning but a fixed value inherited from previous empirical evidence.
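The bound follows directly from the structure of the EMA update. Assuming the normalized-residual form λ ← γλ + η*·|r|/max|r| used in the original RBA work, the normalized term is at most 1, so the scores can never exceed the fixed point η*/(1 − γ):

```python
import numpy as np

def ema_update(lam, residuals, gamma=0.999, eta=0.01):
    """One RBA-style EMA update of the importance scores.

    The normalized residual is at most 1, so lam is bounded above by the
    fixed point eta / (1 - gamma).
    """
    return gamma * lam + eta * np.abs(residuals) / np.abs(residuals).max()

rng = np.random.default_rng(0)
lam = np.zeros(100)
for _ in range(20000):
    lam = ema_update(lam, rng.standard_normal(100))

lam_max_bound = 0.01 / (1.0 - 0.999)   # ≈ 10, the target in the text
```

With the paper's values (γ = 0.999, η* = 0.01), the bound evaluates to λmax ≈ 10, matching the optimal regime identified in the sensitivity analysis.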

EMA memory parameter

Similar to the learning rate, we inherit the memory parameter from the studies that introduced the RBA framework15. To further validate the robustness of the proposed method, we performed a sensitivity analysis in the Supplementary Information. Our results indicate that the framework is quite robust to this value. For operator learning, γ cannot be zero, since we need to accumulate the λ scores to sample the function space (see Eq. (43)); however, our results indicate that any γ ∈ [0.4, 0.999] outperforms the baseline. For PINNs with second-order optimizers, on the other hand, it is even possible to train without any memory and still significantly outperform the baselines. Nevertheless, the values used in this study, inherited from previous work, yield the best results in our benchmarks.

Stabilization parameter

The stabilization parameter ϕ governs the convex combination of the adaptive distribution q and a uniform prior pu. We primarily used pure adaptivity, setting ϕ = 1.0 for all PIML examples, while adopting a slight regularization for the operator-learning benchmarks. A sensitivity analysis provided in the SI confirms that the method is robust to this choice, as vRBA consistently outperforms the baseline even in the absence of regularization. Notably, the inclusion of the uniform prior yields performance gains specifically for potentials targeting the L∞ norm, such as the exponential variants, by mitigating their aggressive nature; for variance-minimizing potentials, the sensitivity to ϕ is negligible.
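The convex combination itself is a one-liner; the three-point distribution below is purely illustrative.

```python
import numpy as np

def stabilized(q, phi=1.0):
    """Convex combination of the adaptive distribution q and a uniform prior.

    phi = 1 recovers pure adaptivity; phi < 1 floors every probability at
    (1 - phi)/N, preventing any point from being starved of samples.
    """
    pu = np.full_like(q, 1.0 / len(q))
    return phi * q + (1.0 - phi) * pu

q = np.array([0.7, 0.2, 0.1])   # illustrative adaptive distribution
q_reg = stabilized(q, phi=0.8)
```

Because both q and pu are probability distributions, the result remains normalized for any ϕ ∈ [0, 1].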

SSBroyden optimizer

This section details our custom JAX implementation of the Self-Scaled Broyden (SSBroyden) optimizer, which was used for all second-order optimization experiments. The original method, proposed by Urbán et al.9, relies on modified SciPy routines that are CPU-bound and not directly portable to a JAX-native, GPU-accelerated workflow.

Our implementation preserves the core SSBroyden update logic, which dynamically computes the scaling and updating parameters. However, the line-search portion of the algorithm required a complete rewrite. Because SciPy’s advanced line-search routines are absent from JAX’s SciPy module, we developed a custom three-stage fallback line-search mechanism to promote robust convergence. This procedure cascades through progressively stricter Wolfe curvature conditions, starting with a standard parameter (c2 = 0.9) and tightening it (c2 = 0.8, then c2 = 0.5) only upon failure. This adaptation was essential for ensuring the optimizer could consistently make progress on the challenging loss landscapes of the problems studied.
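The fallback structure can be sketched as below. This is a simplified stand-in: each stage here uses a crude step-halving search against the strong Wolfe conditions, whereas the actual implementation wraps a full zoom-style line search per stage.

```python
import numpy as np

def wolfe_ok(f, g, x, d, alpha, c1=1e-4, c2=0.9):
    """Check the strong Wolfe conditions for step size alpha along d."""
    fx, gx = f(x), g(x)
    xa = x + alpha * d
    armijo = f(xa) <= fx + c1 * alpha * (gx @ d)          # sufficient decrease
    curvature = abs(g(xa) @ d) <= c2 * abs(gx @ d)        # curvature condition
    return armijo and curvature

def cascade_search(f, g, x, d, c2_stages=(0.9, 0.8, 0.5), n_try=50):
    """Three-stage fallback: retry with a tighter curvature parameter only
    when the previous stage fails."""
    for c2 in c2_stages:
        alpha = 1.0
        for _ in range(n_try):                            # step-halving search
            if wolfe_ok(f, g, x, d, alpha, c2=c2):
                return alpha
            alpha *= 0.5
    return None  # all stages failed

# Illustrative quadratic test problem.
f = lambda x: float(x @ x)
g = lambda x: 2.0 * x
x0 = np.array([1.0, -2.0])
alpha = cascade_search(f, g, x0, d=-g(x0))
```

On this quadratic, the first stage already succeeds after one halving, so the stricter fallbacks are never triggered.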

Implementation details

Physics-Informed Neural Networks (PINNs)

We evaluate two optimization strategies for PINNs.

First-order optimization

For the Allen-Cahn equation, we adopt the network architecture and self-scaling strategy from42, utilizing a 6-layer network (H = 64) with \(\tanh\) activations and Fourier Feature embeddings. Training proceeds for 3 × 105 Adam iterations (lr = 10−3) with vRBA applied as an importance weighting scheme.

Second-order optimization

For Allen-Cahn and Burgers’ equations, we follow the methodology of9, employing a 3-layer network (H = 30) with periodic encodings. Training consists of 5000 initialization iterations (Adam) followed by 60,000 iterations using the SSBroyden optimizer. Here, vRBA is applied via importance sampling with a resampling frequency of 100 iterations.
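The periodic resampling step can be sketched as follows; the pool of candidate points, the exponential stand-in for the vRBA tilt, and the batch size are all illustrative, and in practice the scores come from the tilted distribution evaluated at the current residuals.

```python
import numpy as np

def resample(points, scores, n, phi=1.0, rng=None):
    """Redraw n collocation points with probabilities proportional to the
    importance scores, optionally blended with a uniform prior via phi."""
    rng = rng or np.random.default_rng()
    p = scores / scores.sum()
    p = phi * p + (1.0 - phi) / len(points)   # still sums to 1
    idx = rng.choice(len(points), size=n, replace=True, p=p)
    return points[idx]

rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 1.0, size=(1000, 2))   # uniform candidate pool on Omega
scores = np.exp(3.0 * pts[:, 0])               # stand-in for the residual tilt
batch = resample(pts, scores, n=256, rng=rng)
```

With a tilt that grows in x, the resampled batch concentrates where the (stand-in) residuals are large, which is the intended effect of importance sampling every 100 iterations.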

Operator learning

For operator learning tasks, we employ a hybrid vRBA strategy: importance weighting is applied to the spatial domain, while importance sampling is applied to the function space. Detailed listings of all network dimensions, learning rates, and specific coefficient settings are provided in SI.

DeepONet

For Bubble Growth Dynamics, we use a DeepONet with 4 hidden layers (H = 100, GELU activation) for both branch and trunk networks, optimized via Adam.

SVD-DeepONet

For the Sod-Shock tube problem, we utilize the SVD-DeepONet architecture from36 with adaptive activation functions54, comprising 6 hidden layers (H = 100, \(\tanh\) activation) for both branch and trunk.

FNO and TC-UNet

For the Navier-Stokes benchmarks, we adopt the exact Fourier Neural Operator (FNO) and Time-Conditioned U-Net (TC-UNet) architectures and code provided by27 to ensure direct comparability.


