Benchmarking scientific machine-learning approaches for flow prediction around complex geometries

Geometric representation

We evaluate two geometric representations: the signed distance field (SDF) and the binary mask. The SDF is a scalar field that gives, for each point in the prediction domain, the shortest distance to the object’s boundary. It takes negative values inside the object, positive values outside the object, and zero on the boundary. In contrast, the binary mask represents geometry as a binary field, with 0 inside the object and 1 outside, offering a simpler structure that carries no information about distance to the boundary. An example of the SDF and binary mask for three sample geometries from our dataset is shown in Fig. 2. We aim to assess whether a continuous representation of distance from the object boundary adds value over a simple binary mask in capturing fluid behavior around objects.
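The two representations can be illustrated with a minimal sketch for a single circular object, where the SDF has a closed form. The circle, the unit-square domain, the grid size, and the function names are illustrative assumptions, not the geometries or preprocessing used for the dataset.

```python
import math

def circle_sdf(x, y, cx, cy, r):
    """Signed distance to a circle: negative inside, zero on the boundary, positive outside."""
    return math.hypot(x - cx, y - cy) - r

def make_fields(n, cx=0.5, cy=0.5, r=0.25):
    """Sample the SDF and the binary mask on an n x n grid over the unit square."""
    coords = [i / (n - 1) for i in range(n)]
    # Rows follow y, columns follow x.
    sdf = [[circle_sdf(x, y, cx, cy, r) for x in coords] for y in coords]
    # Mask convention from the text: 0 inside the object, 1 outside.
    # Points exactly on the boundary are counted as outside here (an assumption).
    mask = [[0 if d < 0 else 1 for d in row] for row in sdf]
    return sdf, mask

sdf, mask = make_fields(n=5)
```

The mask can always be recovered from the SDF by thresholding at zero, whereas the reverse requires a distance computation; this is the sense in which the SDF is the richer input.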

**Fig. 2: Comparison of geometry representations for different geometries.**

To quantitatively evaluate the impact of geometric representation on the performance of the SciML model, we utilize a unified scoring scale. This scale presents error metrics in an understandable range between 0 and 100. A score of 0 indicates the least favorable outcome, where the models predict zero across all fields, and a score of 100 corresponds to model predictions that align with the high precision of computational fluid dynamics (CFD) simulations.
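The exact score definition is not reproduced in this section. Elsewhere in the paper, a gain of 20 points is equated with an order-of-magnitude reduction in MSE, and an MSE of about 10⁻⁴ is paired with a score of 50. The sketch below encodes only those two quoted figures as a log-linear map and clips the result to the 0 to 100 range. The functional form, the clipping, and the function name are assumptions made for illustration; the function is not the paper's scoring formula.

```python
import math

def score_from_mse(mse, anchor_mse=1e-4, anchor_score=50.0, points_per_decade=20.0):
    """Hypothetical log-linear score: +20 points for each tenfold drop in MSE.

    Anchored at (MSE = 1e-4, score = 50), then clipped to [0, 100].
    Requires mse > 0.
    """
    decades_below_anchor = math.log10(anchor_mse) - math.log10(mse)
    raw = anchor_score + points_per_decade * decades_below_anchor
    return max(0.0, min(100.0, raw))

example_scores = {mse: score_from_mse(mse) for mse in (1e-3, 1e-4, 1e-5)}
```

Under these assumptions, a model that lowers its MSE from 10⁻⁴ to 10⁻⁵ moves from a score of about 50 to about 70.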

We assess the effect of geometry representation on SciML prediction error. The dataset, consisting of 3000 samples, is randomly divided into an 80-20 train/test split. The test dataset, containing 600 samples, is held constant across all experiments to ensure consistent evaluation of model performance. As shown in Fig. 3, scOT-T, Poseidon-T, and CNO achieve higher scores with the mask representation, while the other neural operators tend to perform better with the SDF. Additionally, Tables 1 and 2 show that the scOT and Poseidon models outperform the other neural operators by roughly an order of magnitude. Sample field predictions and error comparisons between SDF and mask representations for the velocity in the y-direction of a random test sample are displayed in Fig. S.1. The error plot indicates that the mask representation produces lower error for the Poseidon-T and CNO models, whereas the SDF yields lower error for the Geometric-DeepONet model. This difference suggests that the scOT, Poseidon, and CNO models benefit from the sharpness of the binary mask, while the other neural operators perform better when using the continuous boundary information provided by the SDF. This observation is not intuitive, as the SDF is a richer field that encodes how close the object boundary is, whereas the binary mask provides only in/out information.
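The split described above can be sketched as a seeded shuffle of sample indices. The seed and the function name are assumptions; fixing the seed is one simple way to keep the same 600 test samples across experiments.

```python
import random

def random_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices once with a fixed seed and cut off a held-out test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = random_split(3000)
```

With 3000 samples this yields 2400 training indices and 600 test indices, and calling the function again with the same seed returns the same test set.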

Table 1 The score of SciML models trained on the full dataset using the signed distance field at two different difficulty levels (random and extrapolatory)

Table 2 The score of SciML models trained on the full dataset using the binary mask at two different difficulty levels (random and extrapolatory)

Key takeaways

Impact on Model Performance: Vision transformer-based models such as scOT-T and Poseidon-T, along with the CNO model, exhibit improved accuracy when using binary mask representations, while the other neural operators perform better using the SDF.
Performance Comparison: The scOT and Poseidon models outperform the other neural operators, scoring about 20 points higher (corresponding to an order of magnitude lower MSE).

Data sufficiency

Evaluating the impact of training dataset size is critical for understanding the practical feasibility of deploying SciML models, especially in scenarios where generating large datasets is computationally expensive or time-intensive. To investigate the role of training dataset size in the performance of SciML models, we conduct a series of experiments using subsets of the FlowBench dataset. The baseline experiment uses the full training dataset of 2400 samples. Four additional experiments are performed with subsets of 1200, 800, 400, and 240 samples, while the same fixed test dataset of 600 samples is used for all experiments to calculate the error metrics. This ensures that any performance changes are due solely to variations in the training dataset size rather than to differences in the test data. These experiments address key questions regarding data sufficiency: How much data is required to achieve reasonable performance? Are there general trends in data requirements across the SciML models? Do certain models demonstrate greater data efficiency than others, and does the choice of geometry representation (e.g., SDF vs. mask) influence these trends?
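One way to build the reduced training sets is to shuffle the training indices once and take prefixes, so that each smaller subset is contained in every larger one. Whether the subsets in these experiments are nested is not stated; the nesting, the seed, and the function name below are assumptions.

```python
import random

def nested_subsets(train_indices, sizes, seed=0):
    """Return one subset per requested size; smaller subsets are prefixes of larger ones."""
    shuffled = list(train_indices)
    random.Random(seed).shuffle(shuffled)
    return {size: shuffled[:size] for size in sizes}

train_idx = list(range(2400))  # placeholder indices for the 2400 training samples
subsets = nested_subsets(train_idx, sizes=[2400, 1200, 800, 400, 240])
```

Nesting removes one source of variance when comparing sizes, because a larger run sees everything a smaller run saw plus additional samples.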

In Fig. 4, model performance varies substantially with the geometry representation, with notable differences between the SDF and the binary mask. Across all models (Poseidon-T, scOT-T, scOT-B, CNO, FNO, and Geometric-DeepONet), larger sample sizes consistently lead to progressively higher scores. However, with the mask representation, neural operators reach an asymptotic limit at around 800 samples, beyond which additional data has minimal impact on the score. In contrast, the scOT and Poseidon models continue to improve up to 1200 samples, demonstrating their capacity to use larger datasets effectively. This trend underscores the influence of a smooth geometry representation, such as the SDF, in enhancing model learning and highlights the ability of the scOT and Poseidon models to leverage larger sample sizes due to their larger architectures.

**Fig. 4: Comparison of score values vs. sample size for different scientific machine learning models using SDF and binary mask representations.**

This pattern reveals that when trained with the simpler binary mask, neural operators reach their data utility limit at approximately 800 samples. In contrast, with the SDF representation, these models continue to improve with more data. Poseidon-T, in particular, highlights the benefits of pretraining in data-limited scenarios. When trained on fewer than 800 samples, Poseidon significantly outperforms scOT, achieving an MSE that is an order of magnitude lower, as shown in Fig. 4. This performance advantage is especially notable compared to other neural operators, as Poseidon-T and scOT-T achieve MSE values around 10⁻⁴ (score = 50) in data-sparse scenarios, emphasizing their efficiency.

The training dataset often does not fully represent the target distribution for which the model is designed. To address this, we assess the model’s ability to extrapolate and make out-of-distribution predictions (“extrapolatory”) by employing a test dataset that includes field solutions for lid-driven cavity flows with Reynolds numbers in either the top or bottom 10% of the range, while restricting the training dataset to Reynolds numbers from the middle 80%. The scOT and Poseidon models consistently outperform the other neural operators across both geometry representations, and their performance in the extrapolatory split remains stable regardless of dataset size. This observation suggests that in out-of-distribution scenarios, the ability to extrapolate relies more on the inherent robustness of the model architectures than on the volume of training data. Detailed results for all models at smaller dataset sizes are provided in the Data Sufficiency section in the Supplementary (Tables S.1–S.8). Additionally, sample field predictions and error comparisons for models trained on 240 versus 800 samples, specifically for the y-velocity of an example sample, are shown in Fig. S.2 in the Supplementary.

Key takeaways

Impact of Sample Size on Performance: Neural operators benefit from increased data sizes when using the SDF representation, showing continuous improvement, whereas their performance saturates around 800 samples with the binary mask.
Performance in Data-Limited Scenarios: The pretrained Poseidon-T model demonstrates superior accuracy in data-limited scenarios, reaching an MSE around 10⁻⁴ (score = 50) with fewer than 800 samples.

Extrapolation capabilities

The training dataset often fails to fully represent the target distribution for which the model is intended or may not encompass its entire range. We design two train-test splitting strategies to evaluate the model’s ability to make out-of-distribution predictions. For the out-of-distribution experiment, the test dataset comprises field solutions for lid-driven cavity flows with Reynolds numbers in the top or bottom 10% of the range, while the training dataset is restricted to Reynolds numbers from the middle 80%. In contrast, the baseline experiment employs a random train-test split, ensuring that both datasets contain samples spanning the entire distribution of Reynolds numbers. The distributions of Reynolds numbers for both splitting strategies are shown in Fig. 5.
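The extrapolatory split can be sketched by ranking samples by Reynolds number and holding out both tails. Interpreting "top or bottom 10%" as a rank-based cut over samples is an assumption, as are the function name and the example Reynolds numbers.

```python
def extrapolatory_split(reynolds, tail_fraction=0.1):
    """Test on the lowest and highest `tail_fraction` of samples by Reynolds number.

    Training uses the remaining middle portion (80% for the default value).
    """
    order = sorted(range(len(reynolds)), key=lambda i: reynolds[i])
    k = round(len(order) * tail_fraction)
    test = order[:k] + order[len(order) - k:]
    train = order[k:len(order) - k]
    return train, test

reynolds = [10.0 * (i + 1) for i in range(20)]  # hypothetical values: 10, 20, ..., 200
train_idx, test_idx = extrapolatory_split(reynolds)
```

By construction, every training Reynolds number lies strictly between the low and high test tails, so the test set probes extrapolation in both directions.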

**Fig. 5: Histogram of Reynolds numbers for train and test splits.**

We train models for each geometric representation using both random and extrapolatory datasets. As shown in Fig. 6, there are notable differences among the models on the extrapolatory split, as well as substantial performance gaps between each model’s random and extrapolatory splits. Specifically, models trained on the random split show stronger performance, with Poseidon and scOT achieving nearly an order of magnitude lower error than other models. For neural operators such as FNO, DeepONet, Geometric-DeepONet, and WNO, we observe that using the SDF as a geometric representation provides marginal but consistent improvements on the extrapolatory split. This can be attributed to the SDF’s ability to encode richer geometric information, including the precise location and structure of objects within the domain, compared to the binary mask’s simpler representation. These results highlight the ongoing challenge of accurately extrapolating to out-of-distribution complex fluid dynamics simulations. To further illustrate this, we provide sample field predictions and error comparisons between models trained on the random and extrapolatory splits, focusing on the y-velocity of an example sample, as shown in Fig. S.3 in the Supplementary. While our results indicate that transformer-based models such as Poseidon and scOT achieve lower errors for in-distribution flow conditions, their performance in extrapolatory regimes remains suboptimal. This suggests that state-of-the-art architectures struggle to generalize to unseen geometries and Reynolds numbers, highlighting a critical gap in current SciML modeling abilities.

**Fig. 6: Comparison of score values for different models using random and extrapolatory test/train splits.**

Key takeaways

Extrapolation Challenges: Testing on extreme Reynolds numbers (top and bottom 10%) reveals that all models substantially underperform when generalizing to out-of-distribution cases, with only minimal differences in performance observed between them. This indicates a challenge in accurately predicting flow behavior in extrapolatory regimes, regardless of the model architecture or geometric representation.

Performance metrics

We evaluate model performance using three metrics: global accuracy (M1), boundary layer accuracy (M2), and physical consistency (M3). Global accuracy (M1) measures overall prediction accuracy across the domain, excluding the geometry. Boundary layer accuracy (M2) focuses on errors within the boundary layer (SDF between 0 and 0.2), highlighting precision near surfaces. Physical (PDE) consistency (M3) assesses adherence to governing laws by evaluating momentum residuals, ensuring the physical plausibility of predictions. Notably, our results consistently show that the boundary layer MSE (M2) is lower than the global MSE (M1) across all SciML models, because velocity values are close to zero near the geometry, which results in a lower absolute MSE within the boundary layer. This observation is consistent across all results in Table 1, Table 2, and Tables S.1–S.8. Performance in the boundary layer is critical for downstream tasks such as calculating the coefficients of lift and drag using SciML predictions.
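M1 and M2 can both be read as an MSE restricted to a region selected by the SDF. The sketch below works on flattened fields and assumes that "excluding the geometry" means SDF ≥ 0 and that the boundary layer is the closed interval 0 ≤ SDF ≤ 0.2; the function names and sample values are illustrative.

```python
def masked_mse(pred, true, sdf, lo, hi):
    """MSE over points whose SDF value lies in [lo, hi]; assumes at least one such point."""
    errors = [(p - t) ** 2 for p, t, d in zip(pred, true, sdf) if lo <= d <= hi]
    return sum(errors) / len(errors)

def m1_global(pred, true, sdf):
    """Global accuracy: every point outside the object (SDF >= 0)."""
    return masked_mse(pred, true, sdf, 0.0, float("inf"))

def m2_boundary(pred, true, sdf, width=0.2):
    """Boundary layer accuracy: points with 0 <= SDF <= width."""
    return masked_mse(pred, true, sdf, 0.0, width)

# Five illustrative points: one inside the object, two in the boundary layer, two far away.
sdf = [-0.1, 0.05, 0.15, 0.5, 1.0]
true = [0.0, 0.1, 0.3, 1.0, 1.0]
pred = [0.5, 0.1, 0.2, 0.8, 1.1]
m1 = m1_global(pred, true, sdf)
m2 = m2_boundary(pred, true, sdf)
```

The point inside the object has a large error but contributes to neither metric, since both exclude the geometry.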

The M3 metric, defined as the L₂ norm of the momentum residuals \(\sqrt{r_{x}^{2}+r_{y}^{2}}\), evaluates the models’ ability to satisfy the underlying physical laws (PDEs) in fluid dynamics simulations. Analysis of the M3 metric reveals that DeepONet consistently achieves the lowest M3 error across all dataset configurations, as shown in Table 1, Table 2, and Tables S.1 to S.8. Vision transformer-based foundation models leverage their image-focused architecture for efficient feature extraction. However, they may struggle to capture fine-scale details and global dependencies in high-resolution scenarios, which can lead to less smooth outputs. In contrast, neural operators such as DeepONet are designed to learn mappings between function spaces, allowing them to produce smoother, continuous predictions. Although the neural operator architecture is not tailored exclusively to physics problems, the smoothness of the learned mappings suits problems governed by partial differential equations. Consequently, DeepONet exhibits lower errors on the M3 physical consistency metric (momentum residuals), which is further explained in the Residual Calculation section in the Supplementary.
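The residual computation itself is described in the Supplementary and is not reproduced here. As a rough sketch, assume the steady, non-dimensional incompressible momentum equations, so that \(r_x = u\,\partial_x u + v\,\partial_y u + \partial_x p - \frac{1}{Re}\nabla^2 u\) and \(r_y = u\,\partial_x v + v\,\partial_y v + \partial_y p - \frac{1}{Re}\nabla^2 v\), with derivatives approximated by second-order central differences on a uniform grid. The equation form, the discretization, and all names below are assumptions.

```python
import math

def momentum_residual_norm(u, v, p, h, re):
    """Pointwise sqrt(r_x^2 + r_y^2) on the interior nodes of a uniform grid.

    Fields are indexed as f[j][i], with i along x and j along y; h is the grid spacing.
    """
    ny, nx = len(u), len(u[0])

    def ddx(f, j, i):
        return (f[j][i + 1] - f[j][i - 1]) / (2 * h)

    def ddy(f, j, i):
        return (f[j + 1][i] - f[j - 1][i]) / (2 * h)

    def lap(f, j, i):
        return (f[j][i + 1] + f[j][i - 1] + f[j + 1][i] + f[j - 1][i] - 4 * f[j][i]) / h**2

    out = []
    for j in range(1, ny - 1):
        row = []
        for i in range(1, nx - 1):
            rx = u[j][i] * ddx(u, j, i) + v[j][i] * ddy(u, j, i) + ddx(p, j, i) - lap(u, j, i) / re
            ry = u[j][i] * ddx(v, j, i) + v[j][i] * ddy(v, j, i) + ddy(p, j, i) - lap(v, j, i) / re
            row.append(math.hypot(rx, ry))
        out.append(row)
    return out

# Sanity check on an exact solution: stagnation-point flow u = x, v = -y, p = -(x^2 + y^2) / 2.
h = 0.25
xs = [i * h for i in range(5)]
u = [[x for x in xs] for y in xs]
v = [[-y for x in xs] for y in xs]
p = [[-(x * x + y * y) / 2 for x in xs] for y in xs]
residual = momentum_residual_norm(u, v, p, h, re=100.0)
```

Because central differences are exact for linear and quadratic fields, the residual of this test flow vanishes up to round-off, which makes it a convenient check before applying the function to model predictions.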

As illustrated in Fig. S.4, which compares residuals in the x and y directions for mask and SDF representations in Poseidon-T, CNO, and DeepONet, there is no significant difference in residual values between the SDF and mask representations. However, DeepONet outperforms the other models in the M3 metric, achieving the lowest residuals for the sample shown. In particular, the element-wise momentum residual of DeepONet is non-zero only near the top boundary and the surface of the geometry, where the velocity and pressure gradients are the highest. Conversely, Poseidon-T and, to a lesser extent, CNO exhibit relatively high residuals throughout the domain. This finding suggests that, although it does not have the lowest MSE, DeepONet achieves a more accurate approximation of the solution gradient (see Fig. S.5) and thus satisfies the underlying PDE more robustly. Additionally, residual values are generally lower in the extrapolatory split than in the random split, likely because the extrapolatory split contains a higher proportion of samples with low Reynolds numbers, which makes the PDE constraints easier for SciML models to satisfy.

Key takeaways

Boundary vs. Global Accuracy (M1, M2): Boundary layer MSE (M2) is consistently lower than global MSE (M1), as the reduced velocity near the geometry leads to naturally lower absolute errors. This indicates that models effectively learn the near-zero velocity conditions around the geometry.
PDE Consistency (M3): DeepONet achieves the lowest M3 error by leveraging its basis-function-based architecture to model continuous fields. This enables DeepONet to capture smoother solutions and thus better satisfy the governing equations than vision transformer-based models, which excel in feature extraction.

Computational performance and parameter analysis

The numerical efficiency of the different SciML models, including model size, training time, and inference time, is presented in Table 3. The scOT and Poseidon models exhibit similar values across these metrics due to their shared architecture. Among all models, FNO is the fastest to train, completing training in just 2.29 hours, owing to its relatively simple architecture and smaller number of parameters compared to the large models. DeepONet and Geometric-DeepONet are the fastest at inference, with DeepONet requiring less than a second per sample due to its small model size. On the other hand, the larger models, such as WNO and the base and large versions of scOT and Poseidon, take significantly longer to train due to their substantial number of parameters.

Table 3 Performance metrics of SciML models

Ideally, one would evaluate these models at the same parameter count. However, training and evaluating models with equal parameter counts across such diverse architectures, ranging from neural operators such as FNO, CNO, and DeepONet to vision transformers such as scOT and Poseidon, is very challenging in practice. These architectures have fundamentally distinct structural paradigms and inductive biases; for example, FNO learns global mappings in the Fourier domain, while scOT and Poseidon use attention mechanisms that naturally require higher parameter counts for query-key-value projections. Forcing equal parameter counts would require non-standard configurations that could impair each model’s ability to learn relevant features; recent theoretical work⁴³ also suggests that architectural choices operate on different parameter scales. Thus, comparing well-tuned instances of each architecture, each operating on its design-optimal parameter scale, provides a more realistic assessment of their strengths and weaknesses. This approach, consistent with standard benchmarking practice⁴⁴, highlights the trade-offs between computational cost and predictive performance in SciML applications.

Figure 7 shows that for the global accuracy metric M1, neural operators such as DeepONet, FNO, and CNO, which have relatively small model sizes (under 15 million parameters), perform poorly compared to the much larger vision transformer-based models such as scOT and Poseidon. In contrast, the residual consistency metric M3 shows the opposite pattern: smaller models, particularly DeepONet and Geometric-DeepONet, outperform their larger counterparts. These models yield lower residual errors, suggesting a better ability to capture the underlying physical operator. This result underscores the efficiency of compact neural operators and indicates that larger models (scOT and Poseidon) need more data to learn the underlying physics.

**Fig. 7: Effect of model size on performance.**

Key takeaways

Computational efficiency vs. accuracy: Larger foundation models achieve higher global accuracy (M1) but require significantly more training time. In contrast, smaller neural operator models are faster to train and deploy, and they outperform larger models in terms of physical consistency (M3).
Compact models generalize better physically: Smaller and more interpretable architectures appear to be better able to learn physically consistent operators from limited data.
