A variational framework for residual-based adaptivity in neural PDE solvers and operator learning

Machine Learning


Variational residual-based attention methods

The potential function \(\Phi :{{\mathbb{R}}}_{+}\to {{\mathbb{R}}}_{+}\) is a crucial design choice for vRBA. It determines the form of the sampling distribution and, consequently, the corresponding primal problem. We restrict Φ to be a non-negative, convex, increasing, superlinear function; we show two examples and give detailed calculations in the coming sections. The training process starts by sampling N i.i.d. random variables \({\{{X}_{i}\}}_{i=1}^{N}\) uniformly from the spatial domain Ω and calculating the corresponding residuals r(Xi). The general method then involves three steps per iteration:

  1. updating the tilted distribution q, which is generally proportional to the derivative (or sub-differential) \({\Phi }^{{\prime} }\);

  2. updating the model parameters θ via a line search method (first- or second-order);

  3. updating the temperature parameter ϵ using an annealing scheme, which can depend on Φ.

We elaborate on each step below, providing the implementation specifics for the examples shown in the Results section.
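Before detailing each step, the three-step scheme can be summarized in a minimal Python sketch. The residual function, the potential derivative `phi_prime`, and the parameter-update routine `grad_step` are illustrative placeholders for the problem-specific components, and the annealing rule shown is the logarithmic schedule discussed later for the exponential potential:

```python
import math

def vrba_step(params, k, eps, residual_fn, phi_prime, grad_step, points):
    """One vRBA iteration: tilt update, parameter update, annealing."""
    # Residuals at the (uniformly sampled) collocation points.
    r = [abs(residual_fn(params, x)) for x in points]
    # Step 1: tilted distribution q proportional to Phi'(r / eps).
    q = [phi_prime(ri / eps) for ri in r]
    z = sum(q)
    q = [qi / z for qi in q]
    # Step 2: parameter update driven by the q-weighted residuals
    # (stands in for a first- or second-order line search).
    params = grad_step(params, points, q)
    # Step 3: anneal the temperature (logarithmic schedule as an example).
    eps = max(r) / math.log(2 + k)
    return params, q, eps
```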

Update the tilted distribution

While the generalized Gibbs variational principle does not typically yield a closed-form update rule for the distribution q, for a fixed potential Φ one can obtain the optimal tilt by choosing an appropriate annealing schedule (discussed in the coming subsection). Under this assumption, the optimal sampling distribution for the next iteration, qk+1, is given by

$${q}^{k+1}(x)\propto {\Phi }^{{\prime} }\left(\frac{r(x;{\theta }^{k})}{{\epsilon }^{k}}\right),$$

(12)

which we evaluate only at the collocation points \({\{{X}_{i}\}}_{i=1}^{N}\). The above form holds for all admissible potential functions; only the normalization differs. When \(\Phi (r)={e}^{r}\), we can immediately deduce that

$${q}^{k+1}(x)=\frac{\exp \left(\frac{r(x;{\theta }^{k})}{{\epsilon }^{k}}\right)}{{\int }_{\Omega }\exp \left(\frac{r(y;{\theta }^{k})}{{\epsilon }^{k}}\right)dy}.$$

(13)

For other choices of Φ, normalization will be enforced via the choice of ϵk (the annealing parameter) instead, i.e., the proportionality constant in (12) is one.
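Evaluated on the collocation points, the exponential-potential update (13) is a softmax over the scaled residuals. A small sketch follows; the max-subtraction is a standard numerical-stability device, not part of the formulation itself:

```python
import math

def softmax_tilt(residuals, eps):
    """Discrete form of equation (13): q_i proportional to exp(r_i / eps)."""
    scaled = [r / eps for r in residuals]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    w = [math.exp(s - m) for s in scaled]
    z = sum(w)
    return [wi / z for wi in w]
```

Smaller ϵ sharpens the distribution around the largest residuals, which is exactly the role the annealing schedule plays.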

In both settings, the collocation points {Xi} are randomly sampled, so the empirical target distribution qk+1 can be prone to spurious fluctuations. To promote stability, especially when using fast-growing potentials such as \(\Phi (r)={e}^{r}\), we smooth the target distribution over time using an exponential moving average (EMA). Furthermore, we introduce an additional smoothing mechanism by interpolating between the adaptive distribution qk+1 and the base uniform distribution p. The combined update rule for the importance weights reads

$${\lambda }_{i}^{k+1}=\gamma {\lambda }_{i}^{k}+{\eta }^{* }(\phi {q}^{k+1}({X}_{i},{\theta }^{k})+(1-\phi ){p}_{u}({X}_{i}))$$

(14)

where γ ∈ [0, 1) is a memory term and η* is a learning rate. The parameter ϕ ∈ [0, 1] controls the degree of adaptivity; ϕ = 1 corresponds to the fully adaptive case, from which the original method is recovered, while smaller values of ϕ increase stability by biasing the distribution towards uniformity. For stability reasons, which become particularly important for second-order methods, we have found it beneficial to normalize the learning rate as \({\eta }^{* }=\eta /{\max }_{\Omega }{q}^{k+1}\). The resulting vector λk+1 can be interpreted as the smoothed, unnormalized distribution that guides the optimization. While it incorporates information from the optimal distribution qk+1, it is not itself a probability mass function, as it does not necessarily sum to one.
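A sketch of the combined update (14) in Python; here the base distribution is taken as uniform, `p_uniform = 1/N`, and the learning rate is normalized by the maximum of the target, following the choice described above (all names are illustrative):

```python
def update_weights(lmbda, q_new, eta, gamma, phi):
    """Equation (14): EMA of the tilted target mixed with the uniform base."""
    n = len(q_new)
    p_uniform = 1.0 / n
    eta_star = eta / max(q_new)  # learning-rate normalization for stability
    return [gamma * l + eta_star * (phi * q + (1.0 - phi) * p_uniform)
            for l, q in zip(lmbda, q_new)]
```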

Update the model parameters

Once the smoothed importance weights \(\{{\lambda }_{i}^{k+1}\}\) have been computed, they can be used to formulate the loss function in two primary ways: by guiding a resampling process or by directly weighting the residuals.

Adaptive Sampling

In this approach, the weights are first normalized to recover a probability mass function (p.m.f.) over the discrete domain Ω:

$${\overline{q}}_{i}^{k+1}=\frac{{\lambda }_{i}^{k+1}}{{\sum }_{j}{\lambda }_{j}^{k+1}}$$

(15)

This new distribution \(\overline{q}\) is then used to resample a new set of training points \({\{{X}_{i}\}}_{i=1}^{N}\) from the full collocation set Ω. By focusing the sampling on high-importance regions, the loss can be computed as a standard, unweighted mean-squared error on this new, more challenging set of points

$${\mathcal{L}}({\theta }^{k+1})=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}r{({X}_{i},{\theta }^{k+1})}^{2}.$$

(16)

Notably, this framework can recover methods similar to residual-based adaptive sampling13 by setting the potential to \(\Phi (x)={x}^{2}+1\) and the EMA parameters to η* = 1 and γ = 0.
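The resampling step in equations (15)-(16) amounts to normalizing the weights and drawing from the resulting p.m.f. A sketch using only the standard library; `random.choices` draws with replacement, which matches i.i.d. sampling:

```python
import random

def resample(points, lmbda, n, rng=random):
    """Equation (15): normalize weights to a p.m.f., then draw n training points."""
    total = sum(lmbda)
    pmf = [l / total for l in lmbda]
    return rng.choices(points, weights=pmf, k=n)
```

The unweighted mean-squared error (16) is then evaluated on the resampled batch.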

Importance Weighting

Alternatively, the weights can be used directly in an importance weighting scheme. In this case, the training points \({\{{X}_{i}\}}_{i=1}^{N}\) are sampled uniformly from Ω. The weights \(\{{\lambda }_{i}^{k+1}\}\) are then applied directly to the residuals within the loss calculation, creating a weighted objective that reads

$${\mathcal{L}}({\theta }^{k+1})=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{[{\lambda }_{i}^{k+1}r({X}_{i},{\theta }^{k+1})]}^{2}.$$

(17)

One might worry that squaring the weights and residuals departs from the procedure outlined previously. However, by an application of Jensen’s inequality, squaring the residuals (and weights, when applicable) corresponds to solving a strictly stronger problem, with the added benefit of differentiability.

This framework is also general enough to recover other popular methods. For example, with the potential \(\Phi (x)={x}^{2}+1\) and specific choices of EMA parameters, it is possible to recover the traditional residual-based adaptive weights15 and their variations42,45.

Once the objective function \({\mathcal{L}}({\theta }^{k+1})\) is calculated, the model parameters are optimized using a line search algorithm of the form

$${\theta }^{k+1}={\theta }^{k}+{\alpha }^{k}{p}^{k}$$

(18)

where αk is the step size, and \({p}^{k}=-{H}_{k}{\nabla }_{\theta }{\mathcal{L}}({\theta }^{k})\) is the update direction which depends on the gradient of the objective function and some symmetric matrix Hk9.

Update the regularization parameter

As alluded to before, the choice of the annealing schedule depends on the choice of potential Φ. There are two cases to be discussed, and the latter has two further subcases.

Case I: Exponential Potential (\(\Phi (x)={e}^{x}\))

For this choice, there are no requirements on the choice of ϵ thanks to the duality between entropy and free energy. The particular choice we implemented reads

$${\epsilon }^{k+1}=\frac{c\,{\max }_{\Omega }| r(x;{\theta }^{k})| }{\log (2+k)}.$$

(19)

This schedule has several advantages. Using the maximum residual, \({\max }_{\Omega }| r| \), in the numerator provides a dynamic, problem-dependent characteristic scale for the temperature ϵ. This helps to normalize the magnitude of the residuals relative to the magnitude of the solution itself. The logarithmic term in the denominator ensures a slow and stable decay, such that ϵk → 0 as k → ∞, which gradually sharpens the distribution’s focus on the largest residuals. In the context of simulated annealing, logarithmic decay is sufficiently slow to guarantee global convergence [ref. 46, Theorem 1].

Case II: general potentials

For this case, the optimality of the distribution q holds under the normalization condition

$${\int }_{\Omega }{\Phi }^{{\prime} }\left(\frac{r(x;\theta )}{\epsilon }\right)p(dx)=1.$$

(20)

For this case, the constraint is satisfied by choosing ϵk to be the normalizing constant. There are two further cases.

1. There are cases where ϵ can be computed analytically, for example, for the polynomials \(\Phi (r)={r}^{p}+c\) with \(p\in (1,\infty )\) and \(c\in {\mathbb{R}}\). In this case, we can see that

$${\int }_{\Omega }{\Phi }^{{\prime} }\left(\frac{r(x;\theta )}{\epsilon }\right)p(dx)=\frac{p}{{\epsilon }^{p-1}}{\int }_{\Omega }| r(x;\theta ){| }^{p-1}p(dx)=1,$$

(21)

so

$$\epsilon ={\left(p{\int }_{\Omega }| r(x;\theta ){| }^{p-1}p(dx)\right)}^{1/(p-1)}.$$

(22)

2. In many other cases, e.g., \(\Phi (r)=\cosh (r)\), \(\Phi (r)={e}^{{r}^{2}}\), or \(\Phi (r)=(1+r)\log (1+r)-r\), analytic computation might be challenging. We instead calculate ϵk dynamically at each training step using a Newton-Raphson solver. To ensure stability and speed, we derive robust initial guesses based on the asymptotic behavior of each potential: \({\epsilon }_{0}\approx {r}_{\max }/\ln (2N)\) for \(\cosh (r)\), \({\epsilon }_{0}\approx {r}_{\max }/\sqrt{\ln N}\) for \({e}^{{r}^{2}}\), and \({\epsilon }_{0}\approx \overline{r}/(e-1)\) for the logarithmic potential. Due to the strict convexity of these functions and the accuracy of the initialization, the solver typically converges to machine precision in a few iterations (see Algorithm 1). This dynamic adjustment induces minimal computational overhead relative to the full backpropagation step: for all the analyzed examples, including vRBA adds less than 10% computational overhead relative to the vanilla models.
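For the polynomial case, the closed-form choice of ϵ can be checked numerically: plugging it back into the empirical version of the normalization condition (20), with the integral replaced by a mean over a residual batch, returns one exactly. A short sketch:

```python
def eps_polynomial(residuals, p):
    """Closed-form epsilon for Phi(r) = r^p + c on a discrete batch."""
    mean_pow = sum(abs(r) ** (p - 1) for r in residuals) / len(residuals)
    return (p * mean_pow) ** (1.0 / (p - 1))

def constraint(residuals, p, eps):
    """Empirical normalization condition (20); here Phi'(u) = p * u^(p-1)."""
    return sum(p * (abs(r) / eps) ** (p - 1) for r in residuals) / len(residuals)
```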

Algorithm 1

Newton-Raphson Solver for ϵk

Require: Residual batch r; Potential Φ; Max iterations M = 20; Tolerance δ = 10−8.

Ensure: Optimal scaling parameter ϵ.

 1: Compute statistics: \({r}_{\max }\Leftarrow \max (r)\), \(\overline{r}\Leftarrow {\rm{mean}}(r)\), \(N\Leftarrow {\rm{length}}(r)\).

 2: Initialization (Asymptotic Approximation):

 3: if \(\Phi (r)=\cosh (r)\) then

 4:   \(\epsilon \Leftarrow {r}_{\max }/\ln (2N)\)

 5: else if \(\Phi (r)={e}^{{r}^{2}}\) then

 6:   \(\epsilon \Leftarrow {r}_{\max }/\sqrt{\ln (N)}\)

 7: else if \(\Phi (r)=(1+r)\log (1+r)-r\) then

 8:   \(\epsilon \Leftarrow \overline{r}/(e-1)\)

 9: end if

10: \(\epsilon \Leftarrow \max (\epsilon ,\delta )\)

11: Newton-Raphson Loop:

12: for j = 1…M do

13:  \(u\Leftarrow r/\epsilon \)

14:  Compute constraint residual: \(F(\epsilon )\Leftarrow \frac{1}{N}\sum {\Phi }^{{\prime} }(u)-1\)

15:  Compute gradient: \({F}^{{\prime} }(\epsilon )\Leftarrow \frac{1}{N}\sum {\Phi }^{{\prime\prime} }(u)\cdot (-u/\epsilon )\)

16:  Update: \(\epsilon \Leftarrow \epsilon -F(\epsilon )/({F}^{{\prime} }(\epsilon )-\delta )\)

17:  Clamp: \(\epsilon \Leftarrow \max (\epsilon ,\delta )\)

18: end for

19: return ϵ
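A Python rendering of Algorithm 1 for the \(\cosh \) potential may look as follows (only the cosh branch is shown; here \({\Phi }^{{\prime} }(u)=\sinh u\) and \({\Phi }^{{\prime\prime} }(u)=\cosh u\)):

```python
import math

def solve_eps_cosh(residuals, max_iter=20, tol=1e-8):
    """Newton-Raphson solver for the normalization constraint, Phi(r) = cosh(r)."""
    n = len(residuals)
    eps = max(max(residuals) / math.log(2 * n), tol)  # asymptotic initial guess
    for _ in range(max_iter):
        u = [r / eps for r in residuals]
        f = sum(math.sinh(ui) for ui in u) / n - 1.0           # constraint residual
        df = sum(math.cosh(ui) * (-ui / eps) for ui in u) / n  # derivative in eps
        eps = max(eps - f / (df - tol), tol)                   # guarded Newton step
    return eps
```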

Physics-informed neural networks (PINNs)

In the Physics-Informed Neural Network (PINN) framework, the goal is to approximate the solution \(\bar{u}(x)\) of a PDE or ODE using a parameterized representation model, u(x; θ). The training objective is to minimize a loss function composed of several residual terms that enforce the problem’s physical and data constraints.

The primary component is the PDE residual, which measures how well the model satisfies the governing equations. It is defined using a differential operator \({\mathcal{F}}\) as:

$$r(x):=| {\mathcal{F}}[u(\cdot ;\theta )]| .$$

(23)

Additionally, the model must match any available data, which may include boundary conditions, initial conditions, or sparse observations from a known function \(\bar{u}(x)\). This is enforced through a data-fit residual, defined as the pointwise error:

$$r(x):=| \bar{u}(x)-u(x;\theta )| .$$

(24)

The total loss function, \({\mathcal{L}}\), is a weighted sum of the individual loss terms computed from these residuals. For a problem with governing equations (E), boundary/initial conditions (B), and observational data (D), the total loss is:

$${\mathcal{L}}={m}_{E}{{\mathcal{L}}}_{E}+{m}_{B}{{\mathcal{L}}}_{B}+{m}_{D}{{\mathcal{L}}}_{D},$$

(25)

where mE, mB, and mD are global weights that balance the contribution of each term. The individual loss functions (\({{\mathcal{L}}}_{E}\), \({{\mathcal{L}}}_{B}\), \({{\mathcal{L}}}_{D}\)) are each computed as described in equation (16) for the adaptive sampling approach or as in equation (17) for the importance weighting. To ensure the update directions induced by the different loss components are balanced, we employ the self-scaling mechanism presented in42.

Global weights

Notice that for first-order optimizers such as Adam, the update direction for PINNs (i.e., for the loss in equation (25)) is given by:

$${p}^{k}=-{m}_{E}{\nabla }_{\theta }{{\mathcal{L}}}_{E}({\theta }^{k})-{m}_{B}{\nabla }_{\theta }{{\mathcal{L}}}_{B}({\theta }^{k})-{m}_{D}{\nabla }_{\theta }{{\mathcal{L}}}_{D}({\theta }^{k}),$$

(26)

where \({\nabla }_{\theta }{{\mathcal{L}}}_{E}\), \({\nabla }_{\theta }{{\mathcal{L}}}_{B}\), and \({\nabla }_{\theta }{{\mathcal{L}}}_{D}\) are the loss gradients which can be represented as high-dimensional vectors defining directions to minimize their respective loss terms. Notice that if the gradient magnitudes are imbalanced, one direction will dominate, which may lead to poor convergence. To address this challenge, we propose modifying the magnitude of the individual directions by scaling their respective global weights. In particular, we fix mE and update the remaining global weights using the rule:

$${m}_{B}^{k}=\alpha {m}_{B}^{k-1}+(1-\alpha )\frac{||{\nabla }_{\theta }{{\mathcal{L}}}_{E}||}{||{\nabla }_{\theta }{{\mathcal{L}}}_{B}||},$$

(27)

$${m}_{D}^{k}=\alpha {m}_{D}^{k-1}+(1-\alpha )\frac{||{\nabla }_{\theta }{{\mathcal{L}}}_{E}||}{||{\nabla }_{\theta }{{\mathcal{L}}}_{D}||},$$

(28)

where α ∈ [0, 1] is a stabilization parameter47. This formulation computes an iteration-wise running average of the ratio between gradient norms, enabling normalized scaling which, on average, yields a balanced update direction \({\widehat{p}}^{k}\):

$${\widehat{p}}^{k}\approx -||{\nabla }_{\theta }{{\mathcal{L}}}_{E}||\,\left[{m}_{E}\frac{{\nabla }_{\theta }{{\mathcal{L}}}_{E}({\theta }^{k})}{||{\nabla }_{\theta }{{\mathcal{L}}}_{E}||}+\frac{{\nabla }_{\theta }{{\mathcal{L}}}_{B}({\theta }^{k})}{||{\nabla }_{\theta }{{\mathcal{L}}}_{B}||}+\frac{{\nabla }_{\theta }{{\mathcal{L}}}_{D}({\theta }^{k})}{||{\nabla }_{\theta }{{\mathcal{L}}}_{D}||}\right].$$

(29)

Under this approach, all loss components have balanced magnitudes, allowing each optimization step to minimize all terms effectively.
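A sketch of the gradient-norm balancing in equations (27)-(28); the gradients are toy vectors here, standing in for the backpropagated gradients of each loss term:

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def balance_weights(m_b, m_d, grad_e, grad_b, grad_d, alpha=0.9):
    """Equations (27)-(28): EMA of the ratios of gradient norms."""
    ne = l2_norm(grad_e)
    m_b = alpha * m_b + (1.0 - alpha) * ne / l2_norm(grad_b)
    m_d = alpha * m_d + (1.0 - alpha) * ne / l2_norm(grad_d)
    return m_b, m_d
```

For stationary gradients, the weights converge geometrically to the exact norm ratios, so the three scaled directions end up with comparable magnitudes.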

A detailed description of the proposed method is presented in Algorithm 2. For the second-order experiments, we follow the general methodology of9, which uses the SSBroyden optimizer after 5000 Adam pre-training iterations. The crucial modification in our work is that the sampling distribution is generated by our vRBA framework rather than by the standard RAD formulation used in the reference.

Algorithm 2

vRBA for PINNs

Require: Representation model \({\mathcal{M}}\); Training points XB, XD, XE; Optimizer parameters lr; vRBA parameters \(\eta ,{\lambda }_{ma{x}_{0}},{\lambda }_{cap},{\alpha }_{g},{m}_{E},{\gamma }_{g}\); Iterations per stage Nstage; Total iterations Ntrain; Boolean flags adaptive_weights, adaptive_distribution.

Ensure: Optimized network parameters θ.

 1: Initialize network parameters θ0.

 2: Initialize weights \({\lambda }_{\alpha ,i}^{0}\Leftarrow 0.1{\lambda }_{max0}\) for each loss component α ∈ {B, D, E}.

 3: Initialize sampling p.m.f. \({\overline{q}}_{\alpha }\) to be uniform for each α.

 4: \(k\Leftarrow 0\).

 5: while k < Ntrain do

 6: \({\lambda }_{max}\Leftarrow \min ({\lambda }_{max0}+k/{N}_{stage},{\lambda }_{cap})\)

 7:  \({\gamma }^{k}\Leftarrow 1-\eta /{\lambda }_{max}\)

 8:  for α ∈ {B, D, E} do

 9:   if adaptive_distribution then

10:    Update sampling p.m.f: \({\overline{q}}_{\alpha }^{k}\Leftarrow {{\boldsymbol{\lambda }}}_{\alpha }^{k}/\sum ({{\boldsymbol{\lambda }}}_{\alpha }^{k})\)

11:   end if

12:   Sample batch \({X}_{\alpha }^{k} \sim {\overline{q}}_{\alpha }^{k}\) from Xα.

13:   Compute predictions: \({u}_{\alpha ,i}\Leftarrow {\mathcal{M}}({\theta }^{k},{x}_{\alpha ,i}^{k})\) for each \({x}_{\alpha ,i}^{k}\in {X}_{\alpha }^{k}\).

14:   Compute residuals \({r}_{\alpha ,i}^{k}\) using equations (24) or (23).

15:   Update tilted distribution \({q}_{\alpha ,i}^{k}\) using equation (12).

16:   Apply EMA: \({\lambda }_{\alpha ,i}^{k+1}\Leftarrow {\gamma }^{k}{\lambda }_{\alpha ,i}^{k}+{\eta }^{* }{q}_{\alpha ,i}^{k}\).

17:   if adaptive_weights then

18:    Compute loss term: \({{\mathcal{L}}}_{\alpha }^{k}\Leftarrow \frac{1}{| {X}_{\alpha }^{k}| }{\sum }_{i}{({\lambda }_{\alpha ,i}^{k+1}{r}_{\alpha ,i}^{k})}^{2}\).

19:   else

20:    Compute loss term: \({{\mathcal{L}}}_{\alpha }^{k}\Leftarrow \frac{1}{| {X}_{\alpha }^{k}| }{\sum }_{i}{({r}_{\alpha ,i}^{k})}^{2}\).

21:   end if

22:   Compute gradient \({\nabla }_{\theta }{{\mathcal{L}}}_{\alpha }^{k}\).

23:    Update average gradient: \(\parallel {\nabla }_{\theta }{\overline{{\mathcal{L}}}}_{\alpha }^{k}\parallel \Leftarrow {\gamma }_{g}\parallel {\nabla }_{\theta }{\overline{{\mathcal{L}}}}_{\alpha }^{k-1}\parallel +(1-{\gamma }_{g})\parallel {\nabla }_{\theta }{{\mathcal{L}}}_{\alpha }^{k}\parallel\).

24:   end for

25:   Update global weight: \({m}_{D}^{k+1}\Leftarrow {\alpha }_{g}{m}_{D}^{k}+(1-{\alpha }_{g}){m}_{E}\frac{| | {\nabla }_{\theta }{{\overline{{\mathcal{L}}}}_{E}}^{k}| | }{| | {\nabla }_{\theta }{{\overline{{\mathcal{L}}}}_{D}}^{k}| | }\).

26:    Define total update direction: \({p}^{k}\Leftarrow -{m}_{E}{\nabla }_{\theta }{{\mathcal{L}}}_{E}^{k}-{m}_{B}{\nabla }_{\theta }{{\mathcal{L}}}_{B}^{k}-{m}_{D}^{k+1}{\nabla }_{\theta }{{\mathcal{L}}}_{D}^{k}\).

27:   Update parameters: \({\theta }^{k+1}\Leftarrow {\theta }^{k}+lr\,{p}^{k}\).

28:   \(k\Leftarrow k+1\).

29: end while

Benchmarks

Allen-Cahn

The Allen-Cahn equation is a widely recognized benchmark in PINNs due to its challenging characteristics. The 1D Allen-Cahn PDE is defined as:

$$\frac{\partial u}{\partial t}=k\frac{{\partial }^{2}u}{\partial {x}^{2}}-5u({u}^{2}-1),$$

(30)

where k = 10−4. The problem is further defined by the following initial condition and periodic boundary conditions in x:

$$u(0,x)={x}^{2}\cos (\pi x),\,\forall x\in [-1,1],\qquad u(t,-1)=u(t,1),\quad {u}_{x}(t,-1)={u}_{x}(t,1),\,\forall t\ge 0,$$

(31)

Burgers Equation

The Burgers’ equation is defined as

$${u}_{t}+u{u}_{x}=\nu {u}_{xx},$$

(32)

where u represents the velocity field and ν is the kinematic viscosity. In this study we consider two separate cases, \(\nu =\frac{1}{100\pi }\) and \(\nu =\frac{1}{1000}\). The initial and boundary conditions are described as follows

$$u(0,x)=-\sin (\pi x),\,\forall x\in \Omega ,$$

(33)

$$u(t,-1)=u(t,1)=0,\,\forall t\ge 0,$$

(34)

defined over the spatial domain Ω = (−1, 1) and time interval (0, 1), with homogeneous Dirichlet boundary conditions in x.

Korteweg-De Vries (KdV)

The Korteweg-De Vries (KdV) equation is a canonical model for shallow water waves and serves as a rigorous benchmark for PINNs due to the presence of third-order spatial derivatives and nonlinear soliton interactions. The PDE is defined as:

$$\frac{\partial u}{\partial t}+6u\frac{\partial u}{\partial x}+\frac{{\partial }^{3}u}{\partial {x}^{3}}=0,$$

(35)

defined over the spatio-temporal domain \(x\in [0,20]\) and \(t\in [0,5]\). The problem is closed by the following initial and boundary conditions:

$$u(0,x)={g}_{0}(x),\,\forall x\in [0,20],\qquad u(t,0)={g}_{1}(t),\quad u(t,20)={g}_{2}(t),\quad {u}_{x}(t,20)={g}_{3}(t),\,\forall t\in [0,5],$$
(36)

where the boundary functions g0, g1, g2, and g3 are derived by evaluating the analytical solution at the domain boundaries. The exact solution, describing the interaction of two solitons, is given by:

$$u(x,t)=\frac{2({c}_{1}-{c}_{2})\left[{c}_{1}{\cosh }^{2}\left(\frac{\sqrt{{c}_{2}}{\zeta }_{2}}{2}\right)+{c}_{2}{\sinh }^{2}\left(\frac{\sqrt{{c}_{1}}{\zeta }_{1}}{2}\right)\right]}{{\left[\left(\sqrt{{c}_{1}}-\sqrt{{c}_{2}}\right)\cosh \left(\frac{\sqrt{{c}_{1}}{\zeta }_{1}+\sqrt{{c}_{2}}{\zeta }_{2}}{2}\right)+\left(\sqrt{{c}_{1}}+\sqrt{{c}_{2}}\right)\cosh \left(\frac{\sqrt{{c}_{1}}{\zeta }_{1}-\sqrt{{c}_{2}}{\zeta }_{2}}{2}\right)\right]}^{2}},$$

(37)

where \({\zeta }_{i}=x-{c}_{i}t-{x}_{i}\) for i = 1, 2. The specific parameters for this benchmark are set to c1 = 6.0, c2 = 2.0, x1 = −2.0, and x2 = 2.0.
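For reference, the exact solution (37) can be evaluated directly; a sketch in Python, with the parameters defaulting to the values above:

```python
import math

def kdv_two_soliton(x, t, c1=6.0, c2=2.0, x1=-2.0, x2=2.0):
    """Two-soliton solution of the KdV equation, equation (37)."""
    z1 = x - c1 * t - x1  # zeta_1
    z2 = x - c2 * t - x2  # zeta_2
    s1, s2 = math.sqrt(c1), math.sqrt(c2)
    num = 2.0 * (c1 - c2) * (c1 * math.cosh(s2 * z2 / 2.0) ** 2
                             + c2 * math.sinh(s1 * z1 / 2.0) ** 2)
    den = ((s1 - s2) * math.cosh((s1 * z1 + s2 * z2) / 2.0)
           + (s1 + s2) * math.cosh((s1 * z1 - s2 * z2) / 2.0)) ** 2
    return num / den
```

Each soliton travels with speed ci, so the faster, taller wave overtakes the slower one inside the time window, which is what makes the interaction region hard for a solver to resolve.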

Operator learning

Let \({\mathcal{X}}\) be a space of functions over a domain \({\Omega }_{X}\subset {{\mathbb{R}}}^{{d}_{x}}\), and \({\mathcal{Y}}\) be a space of functions over \({\Omega }_{Y}\subset {{\mathbb{R}}}^{{d}_{y}}\). The operator of interest is

$${\mathcal{G}}:{\mathcal{X}}\ni v\,\mapsto {\mathcal{G}}[v]\in {\mathcal{Y}}.$$

The goal is to learn a parametric model Gθ that approximates \({\mathcal{G}}\). The residual \(R:{\mathcal{X}}\times {\Omega }_{Y}\to {{\mathbb{R}}}_{+}\) for this task is defined as the difference between the operator’s prediction and the true solution and reads

$$R(v,x;\theta )=| {G}_{\theta }(v)(x)-{\mathcal{G}}[v](x)| ,$$

(38)

where \(v\in {\mathcal{X}}\) is an input function and \({\mathcal{G}}[v]\) is the corresponding true output function evaluated at a point x ΩY. The training data consists of Nfunc input-output function pairs, \({\{{v}_{j},{\mathcal{G}}[{v}_{j}]\}}_{j=1}^{{N}_{func}}\), where each output function \({\mathcal{G}}[{v}_{j}]\) is evaluated at N discrete points \({\{{x}_{i}\}}_{i=1}^{N}\). The standard loss is an average over both the function instances and the spatial points:

$${\mathcal{L}}(\theta )=\frac{1}{{N}_{{\text{func}}}}\mathop{\sum }\limits_{j=1}^{{N}_{{\text{func}}}}\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{[R({v}_{j},{x}_{i};\theta )]}^{2}$$

(39)

A single importance sampling or weighting scheme is ill-suited for this problem due to the two distinct levels of discretization (in function space and spatial domains). To address this, we propose a mixed strategy: importance weighting is used for the spatial points within each function, while adaptive sampling is used for the functions themselves. This is motivated by the fact that many neural operators have a fixed spatial discretization, making weighting a natural fit, while the function space offers more flexibility for sampling.

The loss function for a batch of bu functions is updated as follows

$${\mathcal{L}}(\theta )=\frac{1}{{b}_{u}}\mathop{\sum }\limits_{j=1}^{{b}_{u}}\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{[{\Lambda }_{i,j}R({v}_{j},{x}_{i};\theta )]}^{2},$$

(40)

where the functions \({\{{v}_{j}\}}_{j=1}^{{b}_{u}}\) are sampled from the full set of training functions. The term Λ is a matrix of importance weights, where Λi,j corresponds to point xi for function vj. These weights are constructed from a target p.m.f. matrix Qk, whose form depends on the choice of potential. For instance, when \(\Phi (x)={e}^{x}\), \({Q}^{k}\in {{\mathbb{R}}}^{N\times {N}_{{\text{func}}}}\) is defined as follows

$${Q}_{i,j}^{k+1}({\theta }^{k})=\frac{\exp \left(\frac{R({v}_{j},{x}_{i};{\theta }^{k})}{{\epsilon }^{k}}\right)}{{\sum }_{\ell =1}^{N}\exp \left(\frac{R({v}_{j},{x}_{\ell };{\theta }^{k})}{{\epsilon }^{k}}\right)}.$$

(41)

Note that each column of the matrix Q (for a fixed function j) is a p.m.f. over the spatial points, focusing attention on high-residual regions for that specific function. The weights are then smoothed over time with an EMA

$${\Lambda }_{i,j}^{k+1}=\gamma {\Lambda }_{i,j}^{k}+{\eta }^{* }{Q}_{i,j}^{k+1}({\theta }^{k}).$$

(42)

As in the previous case, we can set the learning rate for stability, for example, by normalizing it per function as \({\eta }^{* }=\eta /{\max }_{i}{Q}_{i,j}\), which is consistent with our two-level discretization. This EMA formulation has the useful property of keeping the weights bounded. As described in15, the update rule ensures that the weights are constrained to the interval \({\Lambda }_{i,j}\in (0,\frac{{\eta }^{* }}{1-\gamma })\), which aids in stabilizing the training process.
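The boundedness property is easy to confirm in a sketch: since each column of Q is a p.m.f., its entries lie in [0, 1], and the EMA iterates (42) then stay strictly below η*/(1 − γ):

```python
def ema_iterate(q_seq, eta_star, gamma, lam0=0.0):
    """Run the EMA update (42) on a scalar weight for a sequence of q-values."""
    lam = lam0
    for q in q_seq:
        lam = gamma * lam + eta_star * q
    return lam
```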

A key advantage of this framework is that, if η ≠ 1 − γ, we can leverage the imbalance in the learned spatial weights Λi,j to construct a sampling distribution over the functions themselves. The intuition is that functions with higher overall residuals naturally accumulate larger Λ values over time. We therefore propose the following approach to create a function-level sampling distribution. First, we compute an aggregated importance score sj for each function by summing its spatial weights

$${s}_{j}=\mathop{\sum }\limits_{i=1}^{N}{\Lambda }_{i,j}.$$

(43)

These scores are then normalized to create a p.m.f. over the function space:

$${\overline{q}}_{j}=\frac{{s}_{j}}{{\sum }_{\ell =1}^{{N}_{func}}{s}_{\ell }}.$$

(44)

This distribution \(\overline{q}\) can then be used to sample the most informative functions vj for the next training batch. A detailed description of the proposed method is given in Algorithm 3.
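The aggregation in equations (43)-(44) reduces the weight matrix to a function-level p.m.f.; a sketch, with Λ stored as one list of spatial weights per function (an illustrative layout):

```python
def function_pmf(weights_per_function):
    """Equations (43)-(44): sum spatial weights per function, then normalize."""
    scores = [sum(col) for col in weights_per_function]  # s_j = sum_i Lambda_ij
    total = sum(scores)
    return [s / total for s in scores]                   # q_bar_j
```

The resulting p.m.f. can be fed directly to a weighted sampler (e.g., `random.choices`) to draw the next batch of functions.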

Algorithm 3

vRBA for Operator Learning

Require: Neural Operator Gθ; Training data \({\{{v}_{j},{u}_{j}\}}_{j=1}^{{N}_{func}}\); Optimizer parameters lr; vRBA parameters \(\eta ,{\lambda }_{ma{x}_{0}},{\lambda }_{cap},\gamma\); Batch size bu; Update frequency Nupdate; Total iterations Ntrain.

Ensure: Optimized network parameters θ.

 1: Initialize network parameters θ0.

 2: Initialize weights \({\Lambda }_{i,j}^{0}\Leftarrow 0.1{\lambda }_{max0}\) for all i, j.

 3: Initialize function sampling p.m.f. \({\overline{q}}_{j}^{0}\Leftarrow 1/{N}_{func}\) for all j.

 4: for k 0 to Ntrain − 1 do

 5: \({\lambda }_{max}\Leftarrow \min ({\lambda }_{cap},{\lambda }_{max0}+k/{N}_{stage})\)

 6:  \({\gamma }^{k}\Leftarrow 1-\eta /{\lambda }_{max}\)

 7:  Sample a batch of bu function indices \({{\mathcal{J}}}_{k} \sim {\overline{q}}^{k}\).

 8:   Compute the batch residuals: \({R}_{i,j}^{k}\Leftarrow {G}_{{\theta }^{k}}({v}_{j})({x}_{i})-{u}_{j}({x}_{i})\) for \(i\in \{1..N\},j\in {{\mathcal{J}}}_{k}\).

 9:  Update target distribution \({Q}_{i,j}^{k+1}\) using \(| {R}_{i,j}^{k}|\) (via Eq. (12)).

10:  Update weights via EMA: \({\Lambda }_{i,j}^{k+1}\Leftarrow {\gamma }^{k}{\Lambda }_{i,j}^{k}+{\eta }^{* }{Q}_{i,j}^{k+1}\) for \(j\in {{\mathcal{J}}}_{k}\).

11:    Compute weighted loss for the batch: \({{\mathcal{L}}}^{k}\Leftarrow \frac{1}{{b}_{u}N}{\sum }_{j\in {{\mathcal{J}}}_{k}}{\sum }_{i=1}^{N}{[{\Lambda }_{i,j}^{k+1}{R}_{i,j}^{k}]}^{2}\).

12:  Compute gradient of the loss: \({g}^{k}\Leftarrow {\nabla }_{\theta }{{\mathcal{L}}}^{k}{| }_{\theta ={\theta }^{k}}\).

13:  Update parameters: \({\theta }^{k+1}\Leftarrow {\theta }^{k}-lr\,{g}^{k}\).

14:  if \(k\,{\rm{mod}}\,{N}_{update}=0\) then

15:   Aggregate scores: \({s}_{j}^{k+1}\Leftarrow {\sum }_{i=1}^{N}{\Lambda }_{i,j}^{k+1}\) for j = 1, …, Nfunc.

16:   Normalize to form new p.m.f.: \({\overline{q}}_{j}^{k+1}\Leftarrow {s}_{j}^{k+1}/{\sum }_{\ell =1}^{{N}_{func}}{s}_{\ell }^{k+1}\).

17:  end if

18: end for

DeepONet

DeepONet consists of two networks – a trunk network and a branch network. The trunk network encodes spatial coordinates and learns a basis in the target function space, while the branch network maps the input function, evaluated at a fixed set of sensors, to coefficients that project onto this learned basis. The resulting dot product yields the output function at each spatial location. This design is rooted in the operator approximation theorem and enables expressive and efficient modeling of nonlinear operators. DeepONet and its variants are widely applied in mechanics, high-speed flows36, materials science, and multi-phase flows48.

SVD-DeepONet

To address the challenges of modeling discontinuous solutions such as shocks, a two-step training strategy49 is often employed to enhance the standard DeepONet architecture. In this approach, the trunk network is trained first to extract a basis, which is then orthonormalized using QR factorization or Singular Value Decomposition (SVD). While QR factorization ensures orthonormality, SVD is frequently preferred because it provides a unique solution and generates a hierarchical set of orthonormal basis functions that allow for physical interpretation of the flow features. Once this optimized basis is established, the branch network is trained in the second stage to map input parameters to the corresponding coefficients. This modification significantly improves the network’s accuracy, efficiency, and robustness, particularly when solving Riemann problems with extreme pressure ratios36.

FNO

FNOs learn solution operators by leveraging spectral convolutions in the Fourier domain4. The input function is first lifted to a high-dimensional latent space through pointwise linear transformations. A Fourier transform is applied to these lifted features, enabling convolution operations to be performed as multiplications in frequency space. High-frequency modes are typically truncated to enforce smoothness, reduce overfitting, and improve training dynamics. The result is then transformed back to physical space via the inverse Fourier transform and projected to the target dimension. The global receptive field of FNOs makes them particularly effective for modeling long-range dependencies in solutions to PDEs, as demonstrated in applications such as weather forecasting, porous media flows, and turbulence.

TC-UNet

Unlike FNOs, TC-UNet27 operates entirely in physical space using local convolutions. The architecture is based on a UNet, a hierarchical fully convolutional neural network that captures multiscale features through successive downsampling and upsampling. TC-UNet uses time conditioning via feature-wise linear modulation (FiLM)50, applied at each level of the hierarchy. This allows the model to adaptively modulate intermediate features based on the time coordinate input, enabling accurate modeling of spatiotemporal dynamics. TC-UNet or UNet-based architectures are particularly well-suited for problems characterized by sharp gradients36 or fine-scale structures51 and are, in general, more robust to spectral bias52 compared to other neural operator architectures.

Benchmarks

Bubble growth dynamics

We study the dynamics of a single gas bubble in an incompressible liquid governed by the Rayleigh-Plesset (R-P) equation48, a nonlinear ordinary differential equation describing the evolution of the bubble radius R(t) under a time-varying pressure field P(t). Under isothermal assumptions and negligible temperature variations, the simplified linearized R-P equation reads

$$-\frac{\Delta p(t)}{{\rho }_{L}}={R}_{0}\ddot{r}+\frac{4{\nu }_{L}}{{R}_{0}}\dot{r}+\left(\frac{3{P}_{G0}}{{\rho }_{L}{R}_{0}}-\frac{2\gamma }{{\rho }_{L}{R}_{0}^{2}}\right)r,$$

(45)

where r(t) = R(t) − R0 is the deviation from the initial bubble radius R0, ρL is the liquid density, νL is the kinematic viscosity, γ is the surface tension, and PG0 is the initial gas pressure inside the bubble.

We generate a dataset by numerically solving equation (45) for 1000 independent realizations of the forcing function Δp(t), which is constructed as the product of a Gaussian random field and a smooth ramp function, following the procedure in48.

Sod shock tube with high-pressure ratios

We consider the one-dimensional compressible Euler equations in conservation form,

$$\frac{\partial U}{\partial t}+\frac{\partial F(U)}{\partial x}=0,$$

(46)

where the vector of conservative variables U and the flux vector F(U) are defined as

$$U=\left(\begin{array}{c}\rho \\ \rho u\\ \rho E\end{array}\right),\,F(U)=\left(\begin{array}{l}\rho u\\ \rho {u}^{2}+p\\ u(\rho E+p)\end{array}\right).$$

Here, ρ denotes the fluid density, u the velocity, and p the pressure. The total energy E is related to the pressure by the equation of state for an ideal gas, \(\rho E=\frac{p}{\gamma -1}+\frac{1}{2}\rho {u}^{2}\), with the specific heat ratio set to γ = 1.4. The system is subject to discontinuous initial conditions consisting of two constant states, UL and UR, separated by a diaphragm at x = xc.

In this study, we specifically focus on the High-Pressure Ratio (HPR) regime, characterized by extreme pressure jumps (up to a ratio of \({10}^{10}\)) across the discontinuity1. The dataset is generated by varying the initial left-state pressure pL while keeping the right-state parameters fixed, utilizing an exact Riemann solver to provide the ground-truth solutions at a final time tf.

While the system involves three primitive variables, we restrict the neural operator to map the initial pressure parameter pL exclusively to the density field ρ(x, tf). The density profile is particularly challenging and representative, as it uniquely exhibits all three fundamental wave structures inherent to the Riemann problem: the rarefaction wave, the contact discontinuity, and the shock wave.

Navier-Stokes Equations - Kolmogorov flow

We consider the two-dimensional unsteady Navier-Stokes equations in vorticity formulation, modeling an incompressible, viscous fluid on the periodic domain (x, y) ∈ (0, 2π)^2. The system is driven by a Kolmogorov-type external forcing, as previously studied in53, and is governed by:

$$\left\{\begin{array}{l}{\partial }_{t}\omega +{\boldsymbol{u}}\cdot \nabla \omega =\nu \Delta \omega +f(x,y),\\ \nabla \cdot {\boldsymbol{u}}=0,\\ \omega (x,y,0)={\omega }_{0}(x,y),\end{array}\right.$$

(47)

with viscosity ν = 10−3, and the source term defined as

$$f(x,y)=\chi (\sin (2\pi (x+y))+\cos (2\pi (x+y))),$$

(48)

where χ = 0.1. The Laplacian Δ acts in two spatial dimensions, ω denotes the vorticity, and u is the velocity.

Initial conditions ω0(x, y) are sampled from a Gaussian random field with zero mean and covariance operator \({\mathcal{N}}(0,{7}^{3/2}{(-\Delta +49I)}^{-5/2})\). To generate the data, we employ a Fourier-based pseudo-spectral solver introduced in4. The simulation output consists of 1000 spatiotemporal vorticity realizations, each on a 512 × 512 spatial grid, subsequently downsampled to 128 × 128 for downstream learning tasks.
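The covariance operator above is diagonal in the Fourier basis, so sampling reduces to filtering white noise by the square root of its symbol, \(7^{3/4}(|k|^2+49)^{-5/4}\). A sketch of such a sampler is shown below; overall amplitude conventions vary between implementations, so this is not the reference solver's exact sampler.

```python
import numpy as np

def sample_grf(n=128, seed=0):
    """Sample omega_0 ~ N(0, 7^{3/2} (-Delta + 49 I)^{-5/2}) on the torus.

    Spectral method: multiply the FFT of white noise by the square root of
    the covariance symbol and transform back. Amplitude normalization is
    an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    k = np.fft.fftfreq(n, d=1.0 / n)          # integer wavenumbers
    kx, ky = np.meshgrid(k, k, indexing="ij")
    symbol = 7.0**0.75 * (kx**2 + ky**2 + 49.0) ** (-1.25)
    w_hat = np.fft.fft2(rng.standard_normal((n, n)))
    return np.fft.ifft2(symbol * w_hat).real

omega0 = sample_grf()
```

The +49 shift keeps the symbol finite at k = 0, and the exponent −5/2 on the operator (hence −5/4 on its square root) controls the smoothness of the sampled vorticity fields.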

We partition the dataset into training, validation, and testing subsets in an 80:10:10 ratio. A neural operator model \({\mathcal{G}}\) is trained to predict the evolution of the vorticity field by learning the mapping from the initial condition at t = 0 to the solution over the interval t ∈ (0, 50].

For the 2D Navier-Stokes problem, we train a Fourier Neural Operator (FNO) to learn the mapping from an initial vorticity field ω0(x, y) to the full spatiotemporal solution ω(x, y, t).

Wave Equation

We investigate the propagation of acoustic waves governed by the linear wave equation in heterogeneous media. In 2D, the governing equation is given by:

$$\left\{\begin{array}{lc}{\partial }_{t}^{2}u({\boldsymbol{x}},t)={c}^{2}({\boldsymbol{x}})\Delta u({\boldsymbol{x}},t), & {\boldsymbol{x}}\in {[0,\pi ]}^{2},\,t\in [0,2],\\ u({\boldsymbol{x}},0)={u}_{0}({\boldsymbol{x}}), & \begin{array}{cl}\,{\partial }_{t}u({\boldsymbol{x}},0)=0, & {\boldsymbol{x}}\in {[0,\pi ]}^{2},\end{array}\end{array}\right.$$

(49)

where u(x, t) represents the acoustic pressure at spatial location x = (x, y), c(x) is the spatially varying wave speed, and Δ denotes the Laplacian operator. We assume fully reflective (homogeneous Dirichlet) boundary conditions throughout the domain.

For the spatially varying wave speed, we set \(c(x,y)=1+\sin (x)\sin (y)\). The initial pressure profile u0(x) is modeled as a localized Gaussian source centered at a point xc, i.e.,

$${u}_{0}({\boldsymbol{x}})=\exp \left(-\frac{\parallel {\boldsymbol{x}}-{{\boldsymbol{x}}}_{c}{\parallel }^{2}}{10}\right),$$

with xc sampled randomly on the spatial grid. We solve this system numerically using a second-order finite difference method on a 64 × 64 spatial grid and generate 1000 simulations corresponding to different realizations of u0. The dataset is partitioned into training, validation, and test sets in the ratio 80:10:10. We train a neural operator \({\mathcal{G}}\) to learn the mapping u(x, 0) ↦ u(x, t) for all t ∈ (0, 2].
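A single time step of the second-order finite-difference scheme for this problem can be sketched as follows; the leapfrog update and the CFL factor are standard choices and not necessarily the authors' exact implementation.

```python
import numpy as np

def wave_step(u_prev, u_curr, c2, dt, dx):
    """One leapfrog step of u_tt = c(x)^2 * Laplacian(u), Dirichlet BCs.

    Centered second-order differences in both space and time.
    """
    lap = np.zeros_like(u_curr)
    lap[1:-1, 1:-1] = (
        u_curr[2:, 1:-1] + u_curr[:-2, 1:-1]
        + u_curr[1:-1, 2:] + u_curr[1:-1, :-2]
        - 4.0 * u_curr[1:-1, 1:-1]
    ) / dx**2
    u_next = 2.0 * u_curr - u_prev + dt**2 * c2 * lap
    # Homogeneous Dirichlet (fully reflective) boundaries.
    u_next[0, :] = u_next[-1, :] = u_next[:, 0] = u_next[:, -1] = 0.0
    return u_next

n = 64
x = np.linspace(0.0, np.pi, n)
X, Y = np.meshgrid(x, x, indexing="ij")
c2 = (1.0 + np.sin(X) * np.sin(Y)) ** 2          # c(x, y) = 1 + sin(x)sin(y)
xc = (np.pi / 2, np.pi / 2)                      # illustrative source center
u0 = np.exp(-((X - xc[0]) ** 2 + (Y - xc[1]) ** 2) / 10.0)
dx = x[1] - x[0]
dt = 0.25 * dx / 2.0                             # CFL-safe: max wave speed is 2
u_prev, u_curr = u0.copy(), u0.copy()            # encodes du/dt(x, 0) = 0
for _ in range(10):
    u_prev, u_curr = u_curr, wave_step(u_prev, u_curr, c2, dt, dx)
```

Initializing both time levels with u0 enforces the zero initial velocity condition to second order.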

In our final operator learning example, we train a Time-Conditioned U-Net (TC-UNet) to learn the solution operator for the 2D wave equation, mapping an initial pressure profile u0(x) to the full wave propagation over time u(x, t).

vRBA hyperparameter selection

This section details the selection of the hyperparameters introduced by the vRBA framework. The present formulation does not require extensive problem-specific tuning; in fact, all hyperparameters discussed below were held constant across every benchmark presented in this paper. Unless otherwise stated, we use a consistent set of vRBA hyperparameters: an Exponential Moving Average (EMA) memory of γ = 0.999, an EMA learning rate of η* = 0.01, and a stabilization parameter of ϕ = 1.0.

Annealing schedule

For the exponential potential, the convex duality between entropy and free energy permits any choice of annealing schedule. We note the parallel between vRBA and simulated annealing, which is provably convergent under a sufficiently slow temperature decay, e.g., [ref. 46, Theorem 1]. Inspired by these theoretical results, our annealing schedule (see Eq. (19)) follows a logarithmic decay in the number of iterations, scaled by a universal constant, which we set to one (i.e., c = 1) in all our examples.
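Assuming a simulated-annealing-style schedule of the form ϵ_k = c / log(k + k0), a logarithmic decay can be sketched as below; the offset k0 is an illustrative choice that keeps the schedule finite at k = 0, since Eq. (19) is not reproduced here.

```python
import math

def epsilon(k, c=1.0, k0=2.0):
    """Logarithmic annealing schedule, epsilon_k = c / log(k + k0).

    c = 1 matches the universal constant used in the paper; k0 is an
    assumed offset, not taken from the paper.
    """
    return c / math.log(k + k0)

temps = [epsilon(k) for k in range(0, 1000, 100)]
```

As in classical simulated-annealing guarantees, the temperature decays slowly enough that the tilted distribution sharpens only gradually over training.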

For the general potential-dependent case, there are two regimes. For potentials such as \(\cosh (r)\), \({e}^{{r}^{2}}\), and \((1+r)\log (1+r)-r\), where the appropriate ϵ is not available in closed form, the annealing parameter is determined dynamically at each iteration by solving the optimality condition via Newton’s method. For the polynomial potentials (\({r}^{p}\), with p > 1), by contrast, the normalized tilt is invariant to ϵ, so the schedule remains effectively constant throughout training. Consequently, these parameters are governed by theoretical or numerical optimality conditions rather than manual hyperparameter selection.
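The ϵ-invariance for polynomial potentials is easy to verify numerically: with Φ(r) = r^p, the tilt q ∝ Φ′(r/ϵ) = p(r/ϵ)^{p−1} is proportional to r^{p−1}, and the ϵ-dependent prefactor cancels under normalization. The residual values below are arbitrary illustrative numbers.

```python
import numpy as np

def tilt(residuals, eps, p=2.0):
    """Normalized tilted distribution q ∝ Φ'(r/ε) for Φ(r) = r^p."""
    q = p * (residuals / eps) ** (p - 1.0)
    return q / q.sum()  # normalization cancels the ε-dependent prefactor

r = np.array([0.1, 0.5, 2.0, 3.0])   # arbitrary residual magnitudes
q_hot = tilt(r, eps=10.0)
q_cold = tilt(r, eps=0.01)
```

The two temperatures, three orders of magnitude apart, yield identical sampling distributions, which is why no annealing schedule is needed in this regime.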

EMA learning rate

Here, the values of λ are bounded in the interval (0, η*/(1 − γ)), implying a maximum importance score of \({\lambda }_{\max }={\eta }^{* }/(1-\gamma )\). These parameters were originally introduced in the RBA framework15, and we use the exact values established in that work. We note that the use of Exponential Moving Averages (EMA) to stabilize stochastic estimates is standard practice in machine learning, most notably in the Adam optimizer43, where it is essential for convergence stability. Subsequent analyses, such as the sensitivity study performed for standard RBA (Φ(r) = r2) in KKANs42, have investigated the effect of \({\lambda }_{\max }\) on convergence. That study demonstrated that a low bound (\({\lambda }_{\max }\approx 1\)) yields suboptimal results due to insufficient attention, overly aggressive bounds (\({\lambda }_{\max } > 20\)) can slightly degrade performance, and there exists a broad, robust plateau of optimal performance for \({\lambda }_{\max }\in [5,20]\). The configuration used in this paper targets \({\lambda }_{\max }\approx 10\), which lies in this optimal regime. Thus, η* is not a new free parameter requiring tuning but a fixed value inherited from previous empirical evidence.
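The bound follows directly from the structure of the EMA update. Assuming the normalized-residual form λ ← γλ + η*·|r|/max|r| used in the original RBA work, the normalized term is at most 1, so the scores can never exceed the fixed point η*/(1 − γ):

```python
import numpy as np

def ema_update(lam, residuals, gamma=0.999, eta=0.01):
    """One RBA-style EMA update of the importance scores.

    The normalized residual is at most 1, so lam is bounded above by the
    fixed point eta / (1 - gamma).
    """
    return gamma * lam + eta * np.abs(residuals) / np.abs(residuals).max()

rng = np.random.default_rng(0)
lam = np.zeros(100)
for _ in range(20000):
    lam = ema_update(lam, rng.standard_normal(100))

lam_max_bound = 0.01 / (1.0 - 0.999)   # ≈ 10, the target in the text
```

With the paper's values (γ = 0.999, η* = 0.01), the bound evaluates to λmax ≈ 10, matching the optimal regime identified in the sensitivity analysis.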

EMA memory parameter

Similar to the learning rate, we inherit the memory parameter from the studies that introduced the RBA framework15. To further validate the robustness of the proposed method, we performed a sensitivity analysis in the Supplementary Information. Our results indicate that the framework is quite robust to this value. For operator learning, γ cannot be zero, since we need to accumulate the λ scores to sample the function space (see Eq. (43)); however, our results indicate that any γ ∈ [0.4, 0.999] outperforms the baseline. For PINNs with second-order optimizers, on the other hand, it is even possible to train without any memory and still significantly outperform the baselines. Nevertheless, the values used in this study, inherited from previous work, yield the best results in our benchmarks.

Stabilization parameter

The stabilization parameter ϕ governs the convex combination of the adaptive distribution q and a uniform prior pu. We primarily used pure adaptivity, setting ϕ = 1.0 for all PIML examples, while adopting a slight regularization for the operator-learning benchmarks. A sensitivity analysis provided in the SI confirms that the method is robust to this choice, as vRBA consistently outperforms the baseline even in the absence of regularization. Notably, the inclusion of the uniform prior yields performance gains specifically for potentials targeting the L∞ norm, such as the exponential variants, by mitigating their aggressive nature; for variance-minimizing potentials, the sensitivity to ϕ is negligible.
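The convex combination itself is a one-liner; the three-point distribution below is purely illustrative.

```python
import numpy as np

def stabilized(q, phi=1.0):
    """Convex combination of the adaptive distribution q and a uniform prior.

    phi = 1 recovers pure adaptivity; phi < 1 floors every probability at
    (1 - phi)/N, preventing any point from being starved of samples.
    """
    pu = np.full_like(q, 1.0 / len(q))
    return phi * q + (1.0 - phi) * pu

q = np.array([0.7, 0.2, 0.1])   # illustrative adaptive distribution
q_reg = stabilized(q, phi=0.8)
```

Because both q and pu are probability distributions, the result remains normalized for any ϕ ∈ [0, 1].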

SSBroyden optimizer

This section details our custom JAX implementation of the Self-Scaled Broyden (SSBroyden) optimizer, which was used for all second-order optimization experiments. The original method, proposed by Urbán et al.9, relies on modified SciPy routines that are CPU-bound and not directly portable to a JAX-native, GPU-accelerated workflow.

Our implementation preserves the core SSBroyden update logic, which dynamically computes the scaling and updating parameters. However, the line-search portion of the algorithm required a complete rewrite. Because SciPy’s advanced line-search routines are absent from JAX’s SciPy module, we developed a custom three-stage fallback line-search mechanism to promote robust convergence. This procedure cascades through progressively stricter Wolfe curvature conditions, starting with a standard parameter (c2 = 0.9) and tightening it (c2 = 0.8, then c2 = 0.5) only upon failure. This adaptation was essential for ensuring the optimizer could consistently make progress on the challenging loss landscapes of the problems studied.
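The fallback structure can be sketched as below. This is a simplified stand-in: each stage here uses a crude step-halving search against the strong Wolfe conditions, whereas the actual implementation wraps a full zoom-style line search per stage.

```python
import numpy as np

def wolfe_ok(f, g, x, d, alpha, c1=1e-4, c2=0.9):
    """Check the strong Wolfe conditions for step size alpha along d."""
    fx, gx = f(x), g(x)
    xa = x + alpha * d
    armijo = f(xa) <= fx + c1 * alpha * (gx @ d)          # sufficient decrease
    curvature = abs(g(xa) @ d) <= c2 * abs(gx @ d)        # curvature condition
    return armijo and curvature

def cascade_search(f, g, x, d, c2_stages=(0.9, 0.8, 0.5), n_try=50):
    """Three-stage fallback: retry with a tighter curvature parameter only
    when the previous stage fails."""
    for c2 in c2_stages:
        alpha = 1.0
        for _ in range(n_try):                            # step-halving search
            if wolfe_ok(f, g, x, d, alpha, c2=c2):
                return alpha
            alpha *= 0.5
    return None  # all stages failed

# Illustrative quadratic test problem.
f = lambda x: float(x @ x)
g = lambda x: 2.0 * x
x0 = np.array([1.0, -2.0])
alpha = cascade_search(f, g, x0, d=-g(x0))
```

On this quadratic, the first stage already succeeds after one halving, so the stricter fallbacks are never triggered.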

Implementation details

Physics-Informed Neural Networks (PINNs)

We evaluate two optimization strategies for PINNs.

First-order optimization

For the Allen-Cahn equation, we adopt the network architecture and self-scaling strategy from42, utilizing a 6-layer network (H = 64) with \(\tanh\) activations and Fourier Feature embeddings. Training proceeds for 3 × 105 Adam iterations (lr = 10−3) with vRBA applied as an importance weighting scheme.

Second-order optimization

For Allen-Cahn and Burgers’ equations, we follow the methodology of9, employing a 3-layer network (H = 30) with periodic encodings. Training consists of 5000 initialization iterations (Adam) followed by 60,000 iterations using the SSBroyden optimizer. Here, vRBA is applied via importance sampling with a resampling frequency of 100 iterations.
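The periodic resampling step can be sketched as follows; the pool of candidate points, the exponential stand-in for the vRBA tilt, and the batch size are all illustrative, and in practice the scores come from the tilted distribution evaluated at the current residuals.

```python
import numpy as np

def resample(points, scores, n, phi=1.0, rng=None):
    """Redraw n collocation points with probabilities proportional to the
    importance scores, optionally blended with a uniform prior via phi."""
    rng = rng or np.random.default_rng()
    p = scores / scores.sum()
    p = phi * p + (1.0 - phi) / len(points)   # still sums to 1
    idx = rng.choice(len(points), size=n, replace=True, p=p)
    return points[idx]

rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 1.0, size=(1000, 2))   # uniform candidate pool on Omega
scores = np.exp(3.0 * pts[:, 0])               # stand-in for the residual tilt
batch = resample(pts, scores, n=256, rng=rng)
```

With a tilt that grows in x, the resampled batch concentrates where the (stand-in) residuals are large, which is the intended effect of importance sampling every 100 iterations.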

Operator learning

For operator learning tasks, we employ a hybrid vRBA strategy: importance weighting is applied to the spatial domain, while importance sampling is applied to the function space. Detailed listings of all network dimensions, learning rates, and specific coefficient settings are provided in SI.

DeepONet

For Bubble Growth Dynamics, we use a DeepONet with 4 hidden layers (H = 100, GELU activation) for both branch and trunk networks, optimized via Adam.

SVD-DeepONet

For the Sod-Shock tube problem, we utilize the SVD-DeepONet architecture from36 with adaptive activation functions54, comprising 6 hidden layers (H = 100, \(\tanh\) activation) for both branch and trunk.

FNO and TC-UNet

For the Navier-Stokes benchmarks, we adopt the exact Fourier Neural Operator (FNO) and Time-Conditioned U-Net (TC-UNet) architectures and code provided by27 to ensure direct comparability.


