Original Info
- ArXiv ID: 2512.20941
Abstract
Data-driven surrogate models are increasingly adopted to accelerate vehicle design. However, open-source multi-fidelity datasets and empirical guidelines linking dataset size to model performance remain limited. This study investigates the relationship between training data size and prediction accuracy for a graph neural network (GNN) based surrogate model for aerodynamic field prediction. We release an open-source, multi-fidelity aerodynamic dataset for double-delta wings, comprising 2448 flow snapshots across 272 geometries evaluated at angles of attack from 11° to 19° at Ma = 0.3 using both vortex lattice method (VLM) and Reynolds-Averaged Navier-Stokes (RANS) solvers. The geometries are generated using a nested Saltelli sampling scheme to support future dataset expansion and variance-based sensitivity analysis. Using this dataset, we conduct a preliminary empirical scaling study of the MF-VortexNet surrogate by constructing six training datasets with sizes ranging from 40 to 1280 snapshots and training models with 0.1 to 2.4 million parameters under a fixed training budget. We find that the test error decreases with data size with a power-law exponent of -0.6122, indicating efficient data utilization. Based on this scaling law, we estimate that the optimal sampling density is approximately eight samples per dimension in a d-dimensional design space. The results also suggest improved data utilization efficiency for larger surrogate models, implying a potential trade-off between dataset generation cost and model training budget.

Full Content
This work addresses these two interrelated gaps. First, we present an open-source, multi-fidelity aerodynamic dataset for double-delta wings across 272 unique geometries and multiple AOAs ranging from 11° to 19° at low speed (Ma = 0.3). The DOE employs a nested Saltelli sampling scheme to generate geometric configurations, enabling nested geometric variations and supporting future variance-based sensitivity analysis and uncertainty quantification. Aerodynamic coefficients, surface solutions, and volume solutions are provided in open formats, along with the CFD configuration files, to maximize downstream usability of the dataset. Second, we perform an empirical scaling study of the MF-VortexNet model using nested subsets of the dataset. By selecting between 8 and 256 geometric configurations for training set construction, we quantify a power-law decay of prediction error with increasing sampling density in the design space.
The remainder of this paper is organized as follows. Section II outlines the relevant technical framework of the different analysis components, including the geometry generation environment, the low- and high-fidelity solvers, and the MF-VortexNet surrogate model. Section III presents a detailed description of the double-delta wing aerodynamic dataset, including the design space, DOE, simulation configurations, and simulation results. Section IV presents the setup and results of the scaling study. Finally, Section V discusses the implications of the current study and outlines future work.
As this study involves modeling external flow over wings with multiple solvers at different fidelities, and utilizing a machine-learning model, MF-VortexNet, to learn the mapping from a pressure-geometry-flow-condition state to a high-fidelity pressure field, this section provides a brief discussion of the modules used throughout the study.
The SUAVE conceptual design environment [29] is used to generate vehicles from geometry parameters, drive low-fidelity simulations, and create surface-mesh tessellations using its built-in OpenVSP [30] interface. Depending on the subsequent applications, the model representations of a given vehicle configuration may have different fidelities.
At the lowest fidelity, an instance of the Vehicle class in SUAVE is created, taking in the corresponding design variable vector $\mathbf{x} \in \mathbb{R}^d$, where $d$ is the dimension of the design space, and a prescribed process that constructs a vehicle from $\mathbf{x}$. Details of the computer-program structure, including the Vehicle class definition, can be found in Lukaczyk et al. [29]. When $\mathbf{x}$ is changed, different instances are created to represent different vehicle configurations.
When different analyses are conducted on the vehicle instance, different types of instance representation are needed. When the VLM (see Section II.B) is utilized, a lattice grid of the vortex distribution is created based on the Vehicle instance by incorporating the encoded geometric parameters. Additional configuration settings, such as the number of panels or the distribution functions used for the lattice discretization, may change the vehicle-geometry representation for a given instance. The lattice grid and vortex distribution are used by the VLM solver to perform aerodynamic analyses.
Likewise, for CFD simulations, a representation of the OML of the vehicle is required by the solver. Under such analyses, we use SUAVE’s OpenVSP [30] interface to translate the SUAVE instance to an OML boundary representation. Subsequently, OpenVSP’s ComputeCFDMesh function is used to generate a tessellated surface mesh. This mesh is then imported into Pointwise, together with a far-field boundary defined with a diameter of 40 mean aerodynamic chords (MACs), to define the fluid domain. The fluid domain is then further discretized to generate the volume mesh for CFD analyses.
The VLM solver solves the potential-flow problem for the perturbation velocity potential, subject to a no-penetration boundary condition on the lifting surface. The lifting surface is discretized into a quadrilateral panel lattice. Horseshoe vortices are defined along each panel's quarter-chord line and side edges, with semi-infinite trailing legs extending downstream. The no-penetration boundary condition is enforced at collocation points located at the center of each panel's three-quarter-chord line. This yields a linear system for the strength of each panel's horseshoe vortex. The resulting vortex strengths are then used to compute induced velocities via the Biot-Savart law [31], and the pressure perturbation due to the induced velocity is computed using Bernoulli's theorem [31]. The specific VLM used in this work ships with SUAVE, is based on Miranda et al.'s VORLAX implementation, and is suitable for compressibility-corrected, inviscid, attached flows [32].
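To make the solution procedure concrete, the sketch below implements a minimal, self-contained vortex lattice for a flat rectangular plate. It is illustrative only, not the VORLAX-based implementation shipped with SUAVE: the lattice size, plate geometry, and the long-finite-segment approximation of the semi-infinite trailing legs are all simplifying assumptions.

```python
import numpy as np

def vseg(p, a, b):
    """Velocity at point p induced by a unit-strength straight vortex
    segment from a to b (Biot-Savart law for a finite filament)."""
    r1, r2 = p - a, p - b
    cr = np.cross(r1, r2)
    d2 = cr @ cr
    if d2 < 1e-12:                       # point lies on the filament axis
        return np.zeros(3)
    r0 = b - a
    return cr / (4 * np.pi * d2) * (r0 @ r1 / np.linalg.norm(r1)
                                    - r0 @ r2 / np.linalg.norm(r2))

def horseshoe(p, left, right, x_wake=1e4):
    """Unit horseshoe vortex: bound leg from left to right, trailing legs
    approximated by long finite segments extending downstream (+x)."""
    far_l = left + np.array([x_wake, 0.0, 0.0])
    far_r = right + np.array([x_wake, 0.0, 0.0])
    return vseg(p, far_l, left) + vseg(p, left, right) + vseg(p, right, far_r)

# Flat rectangular plate: chord c, span b, nc x ns lattice, small AOA.
nc, ns, c, b, alpha, V = 4, 10, 1.0, 4.0, np.deg2rad(5.0), 1.0
dx, dy = c / nc, b / ns
n_hat = np.array([0.0, 0.0, 1.0])        # flat-plate panel normal

bl, br, cp = [], [], []
for i in range(nc):
    for j in range(ns):
        bl.append([(i + 0.25) * dx, j * dy, 0.0])          # quarter-chord, left
        br.append([(i + 0.25) * dx, (j + 1) * dy, 0.0])    # quarter-chord, right
        cp.append([(i + 0.75) * dx, (j + 0.5) * dy, 0.0])  # 3/4-chord collocation
bl, br, cp = map(np.asarray, (bl, br, cp))

# No-penetration at collocation points: (V_inf + v_induced) . n = 0
A = np.array([[horseshoe(p, l, r) @ n_hat for l, r in zip(bl, br)] for p in cp])
gamma = np.linalg.solve(A, -np.full(len(cp), V * np.sin(alpha)))

# Kutta-Joukowski: L = rho * V * sum(Gamma_i * dy); CL with rho = 1, S = b * c
CL = 2.0 * gamma.sum() * dy / (V * b * c)
print(f"CL = {CL:.3f}  (2*pi*alpha = {2 * np.pi * alpha:.3f} for reference)")
```

The computed lift slope falls below the thin-airfoil value, as expected for a finite aspect-ratio plate.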
Due to the compressibility correction, VLM predictions for vehicles at high speed and low AOA are relatively accurate. The computational cost, at a chordwise-by-spanwise lattice of 30 × 32, is negligible. In addition, the solver's convergence is robust across a wide design space, making it ideal for rapid design-space exploration studies. However, because nonlinear flow effects such as vortex lift and separation are neglected, the prediction accuracy deteriorates for vehicles at low speed under high AOAs. As shown by Shen and Alonso in their previous study comparing the QOIs computed from VLM and RANS CFD, VLM results for relevant delta-wing configurations are relatively accurate up to an AOA of 10° [23].
The SU2 RANS CFD solver is used to compute aerodynamics [33]. The solver solves the RANS equations, in which nonlinear flow effects are accounted for. For turbulence modeling, the Spalart-Allmaras turbulence model with rotation corrections (SA-R) [34] is used. The SA-R turbulence model adds a one-equation eddy-viscosity closure to the RANS equations, such that the eddy viscosity is adjusted across different flow regions based on the dominance of rotational and strain effects.
For the present application, the selected solver and turbulence model achieve a balance between the desired accuracy and cost. The chosen solver captures viscous effects and vortex systems for flow that is attached or mildly separated, which constitutes the majority of the flow conditions considered in the current study. Several previous works have demonstrated the suitability of the SA-R RANS solver for external-flow modeling with strong vortex presence [34,35]. For delta wings and double-delta wings, several studies have highlighted the capability of the SA model family to predict vortex strength and breakdown [36], to model aerodynamics at higher AOA with accurate capture of vortex-vortex interactions [35], and to capture vortex-lift-induced effects on the surface pressure distribution over wings [28,36-38].
The chosen SA-R RANS formulation performs well for attached and mildly separated flows, but its accuracy degrades in regimes dominated by massive separation or unsteady fluid dynamics. For double-delta wings, these effects become significant near the critical AOAs, at approximately 18° for some geometries, where vortex-sheet oscillation or separated flow is observed. As a result, higher-fidelity approaches such as detached-eddy simulation would be required to achieve reliable predictions. However, the computational cost of these methods is prohibitive for the broad design-space sampling required in the present study. Consequently, they are not included in the current dataset but represent a potential avenue for future multi-fidelity dataset construction.
A surrogate model is deployed to map low-fidelity panel-wise pressure coefficients solved by the VLM solver to the corresponding pressure field of the CFD solver on the same lifting-surface lattice. The surrogate model is MF-VortexNet [10], a lightweight, physics-informed GNN. To contextualize field data, the VLM lattice is represented by a graph, where nodes aggregate local geometric descriptors, flow conditions, and low-fidelity pressure-coefficient features, while edges encode spatial distances among panels. Within this framework, MF-VortexNet employs a U-Net-like encoder-decoder architecture built from multi-head graph-attention convolutional blocks [39] to construct latent representations that capture pressure loading induced by nonlinear vortical structures and other complex flow phenomena. These representations are then decoded into corrected nodal pressure fields that approximate CFD solutions while maintaining robust prediction accuracy across a design space typical of the conceptual-design phase.
The surrogate model is embedded into a multi-fidelity aerodynamic workflow, as described by Shen et al., to construct paired low- and high-fidelity fields for offline training [10]. In online assessment, the baseline fields computed by the VLM solver are corrected by MF-VortexNet, yielding "quasi-CFD" pressure distributions. Derived force and moment coefficients can then be computed from the augmented pressure loading at a computational cost comparable to that of the underlying VLM solver. Because the model operates on full fields rather than on a small set of scalar QOIs, arbitrary integrated metrics can be obtained by post-processing the corrected pressure distribution, which improves both the flexibility and the interpretability of early-stage design studies and provides a means to couple with solvers in other disciplines. When applied to low-speed, high-AOA double-delta wing configurations, the MF-VortexNet-augmented workflow demonstrates substantial design-fidelity improvement in capturing moment and stability performance relative to relying on VLM alone, while maintaining generalization across unseen geometries and freestream conditions within a prescribed design space [23,24].
This section details the double-delta wing aerodynamic dataset. We first define the parametric double-delta wing geometry and its six-dimensional design space. Next, we describe the DOE, which utilizes a nested Saltelli sampling scheme to ensure scalable sampling. Finally, we present our numerical simulation framework and results, including the automated meshing pipeline and solver configurations for generating the paired VLM and CFD data.
Six design variables are used to define the double-delta wing shape. Figure 1 shows the half-wing about the central-body symmetry plane. The geometry has six design parameters, including B, the wing span (in inches); BW2, the fractional location of the inboard-to-outboard break-chord relative to the semi-span; SW1 ($\Lambda_{in}$), the inboard leading-edge sweep angle; SW2 ($\Lambda_{out}$), the outboard leading-edge sweep angle; SR2 ($\Lambda_{te}$), the outboard trailing-edge sweep angle; and DROOP ($\delta$), the camber-deflection angle applied to the baseline airfoil. The baseline airfoil is constructed from the root-airfoil geometry of the RW23 wing presented by Wiersema et al. [40] and is applied uniformly across the span.
To allow a richer representation of camber variation within the design space, $\delta$ acts by deflecting the baseline airfoil's leading-edge camber about the 30% chord location. The airfoil morphing process begins with cubic-spline fitting of the baseline airfoil's upper and lower surfaces from their discrete coordinates. The camber and thickness distributions are then extracted on a shared chordwise grid. The droop morphing is subsequently applied to the baseline camber, introducing $C^1$ and $C^2$ parametric discontinuities [41] at the pivot point and producing two separate Bézier curves for the leading and trailing cambers, respectively. To ensure smoothness, the Bézier control-point parameters for the leading camber line are solved using a sequential least-squares programming (SLSQP) solver in SciPy to minimize the $C^1$ and $C^2$ continuity residuals between the leading- and trailing-camber curves at the pivot point. The modified camber is then combined with the original thickness distribution to reconstruct the upper and lower surfaces, with an additional round of smoothing applied to both surfaces' curves.
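The continuity-matching step can be sketched as follows. This is a minimal illustration, assuming cubic Bézier camber segments, invented control-point values, a hypothetical droop sign convention, and finite-difference derivatives; the actual morphing pipeline also re-fits the thickness distribution and applies additional smoothing.

```python
import numpy as np
from math import comb
from scipy.optimize import minimize

def bezier(P, t):
    """Evaluate a Bezier curve with control points P (n x 2) at parameters t."""
    n = len(P) - 1
    t = np.atleast_1d(t)[:, None]
    return sum(comb(n, k) * (1 - t)**(n - k) * t**k * P[k] for k in range(n + 1))

def slope_curv(P, t, h=1e-4):
    """dy/dx and d2y/dx2 of a parametric Bezier at t, via finite differences."""
    (x0, y0), (x1, y1), (x2, y2) = bezier(P, [t - h, t, t + h])
    xt, yt = (x2 - x0) / (2 * h), (y2 - y0) / (2 * h)
    xtt, ytt = (x2 - 2 * x1 + x0) / h**2, (y2 - 2 * y1 + y0) / h**2
    return yt / xt, (ytt * xt - yt * xtt) / xt**3

pivot = np.array([0.30, 0.020])   # droop pivot at 30% chord (camber value invented)
delta = np.deg2rad(5.0)           # illustrative droop angle

# Fixed trailing-camber Bezier from the pivot to the trailing edge (illustrative).
P_te = np.array([[0.30, 0.020], [0.55, 0.022], [0.80, 0.012], [1.00, 0.000]])
s_te, c_te = slope_curv(P_te, 0.0)

# Rotate the leading-edge point about the pivot to apply the droop deflection.
rot = np.array([[np.cos(delta), np.sin(delta)], [-np.sin(delta), np.cos(delta)]])
le = rot @ (np.array([0.0, 0.0]) - pivot) + pivot

def residual(y_free):
    """Squared C1/C2 mismatch at the pivot for free interior control heights."""
    P_le = np.array([le, [0.10, y_free[0]], [0.20, y_free[1]], list(pivot)])
    s, c = slope_curv(P_le, 1.0)
    return (s - s_te)**2 + (c - c_te)**2

res = minimize(residual, x0=[0.010, 0.015], method="SLSQP")
print(res.x, residual(res.x))   # optimized control heights, remaining residual
```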
The spanwise twist distribution is defined as follows: the inboard section, from the root to the break-chord location, maintains a constant twist of 1.5° leading-edge up, while the outboard section applies a linear washout from 1.5° to -1.5° toward the tip.
The design space is defined as

$$\mathbf{x} = \left[\,B,\; BW2,\; SW1,\; SW2,\; SR2,\; DROOP\,\right]^{\top}, \qquad x_i^{\min} \le x_i \le x_i^{\max}, \quad i = 1, \dots, 6,$$

where $x_i^{\min}$ and $x_i^{\max}$ are the bounds. The design-variable bounds are selected by first referencing the design spaces of recent commercial supersonic technology programs, including Supersonic Configurations at Low Speeds (SCALOS) [42] and the High-Speed Civil Transport (HSCT) [43]. From these studies, the nominal inboard and outboard leading-edge sweeps ($\Lambda_{in}$ and $\Lambda_{out}$) are 68° and 45°, respectively, and the aspect ratio of the wing planform is taken to be 2.5. Starting from these nominal values, we consider the design space of the SCALOS program [42] and adjust the design-variable bounds so that the resulting design space does not produce non-physical designs, specifically design-variable combinations that lead to a negative tip chord under our DOE. This constraint reduces the design space represented in this dataset, resulting in a $\Lambda_{in}$ upper bound smaller than those explored in the SCALOS program. However, it ensures that the resulting samples are balanced, as no non-physical designs need to be removed from the samples, in contrast to our previous experiments [24]. This design space is presented in Table 2.
The DOE involves the selection of combinations of design variables to generate geometry samples. We have three major goals for the DOE. Firstly, we aim to achieve efficient and homogeneous design-space filling, such that the sampling scheme does not introduce sampling bias that leads to clustered design samples within the design space, which could bias the performance of subsequently trained surrogate models. Secondly, the sampling scheme should produce an extendable dataset, such that when the dataset needs to be expanded, new sampling points can be added without requiring existing experiments to be re-run. This property is particularly important when the experiment is costly, such as the CFD simulations in this study. Finally, we want the sampling scheme to support future downstream assessments of the surrogate models trained from the dataset, including applications involving sensitivity analyses or uncertainty quantification.
Popular sampling methods are considered, including factorial sampling, random sampling, Latin hypercube sampling (LHS), and quasi-random sequences. Quasi-random sequence sampling, particularly the Sobol sequence, best meets the previously mentioned DOE goals. A Sobol sequence constructs a sequence of spatial quadrature points that enables fast convergence of $d$-dimensional integrals; that is, it provides a sequence $x_i$ of length $N$ such that

$$\frac{1}{N} \sum_{i=1}^{N} f(x_i)$$

approximates the integration

$$\int_{[0,1]^d} f(x)\, dx.$$
Solving for the Sobol sequence directly is costly, and modern software packages use fast Sobol sequence algorithms in which subsequent Sobol numbers are generated by XOR-ing the previous Sobol number with a predefined table of direction numbers [44]. For the current DOE, we adopt SALib [45], which contains a Sobol sequence generation routine by Joe and Kuo [44]. For a required $N$ samples in $d$-dimensional space, the Sobol sequence generator generates a base matrix $M$ of size $(N + \text{skipvalue}) \times 2d$. The skipvalue is applied to discard the first several rows of the Sobol sequence matrix, as repetitions may occur [46], thereby improving the uniformity of the generated samples. For a predefined skipvalue and problem dimension, the generated Sobol sequence is deterministic. Throughout the experiments, we set the skipvalue to 16.
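As a brief illustration of the quadrature property, the sketch below compares a Sobol estimate of a known six-dimensional integral with a pseudo-random Monte Carlo estimate, using SciPy's qmc module rather than SALib; the test integrand is an arbitrary choice with unit integral.

```python
import numpy as np
from scipy.stats import qmc

d = 6
f = lambda x: np.prod(0.5 * np.pi * np.sin(np.pi * x), axis=1)  # integral = 1

sobol = qmc.Sobol(d=d, scramble=False).random_base2(m=10)  # 2^10 = 1024 points
rand = np.random.default_rng(0).random((1024, d))

print("Sobol estimate :", f(sobol).mean())   # converges quickly toward 1
print("random estimate:", f(rand).mean())    # noisier at the same sample count
```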
From the base matrix, we follow Saltelli's sampling scheme [47] using SALib's saltelli.sample routine with calc_second_order=False and the skip value set to 16. The process first constructs two sampling matrices $A$ and $B$ from the left and right halves of $M$, such that

$$A = M_{:,\,1:d}, \qquad B = M_{:,\,d+1:2d}.$$
Then, $d$ cross-sampling matrices $AB^{(i)}$ are assembled by swapping the $i$-th column of $B$ into a copy of the $A$ matrix. Mathematically, this is equivalent to

$$AB^{(i)}_{:,\,j} = \begin{cases} B_{:,\,j}, & j = i, \\ A_{:,\,j}, & j \ne i. \end{cases}$$
The final design-sample matrix $X$ in the unit hypercube is constructed by concatenating all $A$, $B$, and $AB^{(i)}$ matrices vertically, such that

$$X = \begin{bmatrix} A \\ B \\ AB^{(1)} \\ \vdots \\ AB^{(d)} \end{bmatrix}.$$
The final design samples $S$ can be obtained from $X$ by column-wise scaling using the upper and lower bounds of the corresponding design variables. The adopted sampling scheme meets our DOE goals. Firstly, the Saltelli sampling scheme is deterministic and reproducible. By generating samples from a Sobol sequence, the method achieves uniform space-filling properties that minimize clustering and sampling bias, leading to balanced datasets ideal for machine-learning model training. Secondly, because the Sobol sequence is nested, doubling the dataset size yields a refined design-space coverage without modifying the existing design points, ensuring that dataset scaling does not require rerunning existing experiments. Moreover, the Saltelli framework naturally supports variance-based sensitivity analysis and uncertainty quantification, thereby enriching the downstream applications of the dataset.
The Saltelli sequence samples $S$ are used exclusively for model training. Because Saltelli sampling enforces a homogeneous distribution, randomly partitioning $S$ to form a test set would disrupt this structure and introduce bias into the remaining training samples. To avoid this issue, we generate a separate holdout test set, independent of $S$, containing 16 geometries. The holdout test set is generated using LHS sampling via SALib's latin.sample function [45]. The resulting unit-hypercube samples are subsequently scaled using the same procedure employed in the Saltelli sampling process described previously.
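A minimal sketch of this sampling setup using SALib is shown below, assuming the SALib 1.4+ API; the bounds are placeholders, as the actual values are listed in Table 2. With calc_second_order=False, saltelli.sample returns $N(d+2)$ rows, so $N = 32$ yields the 256 Level 6 geometries.

```python
from SALib.sample import latin, saltelli

problem = {
    "num_vars": 6,
    "names": ["B", "BW2", "SW1", "SW2", "SR2", "DROOP"],
    # placeholder bounds for illustration; see Table 2 for the actual values
    "bounds": [[100, 160], [0.3, 0.5], [55, 70], [35, 50], [-10, 10], [0, 10]],
}

# Nested Saltelli training samples: N * (d + 2) = 32 * 8 = 256 geometries.
X_train = saltelli.sample(problem, 32, calc_second_order=False, skip_values=16)

# Independent LHS holdout set of 16 geometries; SALib scales both sample
# sets to the design-variable bounds automatically.
X_test = latin.sample(problem, 16)
```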
As a result of the sampling scheme described, one can generate geometry configurations at different levels of density in a nested structure. We took six values of $N$, doubling each time, to generate six levels of datasets, shown in Table 3. Each geometry in the dataset is evaluated at nine AOAs, ranging from 11° to 19° in increments of 1°. Together, these configurations constitute the training geometry set of the dataset. Due to the nested structure of the sampling scheme, the set of geometries in the Level 1 dataset is a strict subset of those in the Level 2 dataset, and this nested structure is preserved across all higher dataset levels. The largest dataset is designated as the Level 6 set. All designs in the Level 6 set are shown in top view in Figure 2(a), with the corresponding side view at the root in Figure 2(b). In both figures, the blue edges denote the configurations included in the training geometry set, and the red edges indicate the configurations belonging to the holdout test set.
From Figure 2(a), it is evident that the training set spans a wide design space, ranging from low-aspect-ratio trapezoidal wings (with the largest tip-chord lengths) to delta wings (with the smallest tip-chord lengths). The sampling density is relatively uniform across the design space. Similarly, the holdout test set exhibits a consistent sampling density, as indicated by the even distribution of red edges across the design space. Uniform design-space coverage for $\delta$ can likewise be observed in Figure 2(b). To further illustrate the sampling density, Figure 3 presents the pair plots of the design variables for samples in the Level 6 set and the holdout set. The off-diagonal panels show scatter plots of paired design variables, while the diagonal panels show histograms of the sampling density with respect to each design variable. Samples belonging to the training geometry set are plotted in blue, while samples corresponding to the holdout test set are shown in red. As indicated by the pair plots, no specific clustering is observed across any design-variable pairs, and within each design variable, a uniform sampling density is achieved across the range of the design-variable bounds. The holdout test samples likewise span the design space evenly. Because the sampling scheme for the training set is nested, the sampling-density distribution across training-set levels is consistent, and hence the sampling-density pair plots for the sparser sampling levels are not shown.
Vehicle geometries are generated from the design variable samples using SUAVE’s write_vsp_mesh interface to produce the corresponding OpenVSP geometry model, as discussed in Section II.A. The CFD mesh generation involves two stages: surface meshes are first produced using OpenVSP’s ComputeCFDMesh function, and volume meshes are then created in Pointwise using its T-Rex extrusion function. The T-Rex procedure extrudes the surface mesh into the freestream to form an anisotropic boundary layer mesh with controlled growth and resolution suitable for RANS simulations.
Several parameters control the mesh properties of OpenVSP's ComputeCFDMesh surface meshes. These parameters include source definitions, the maximum mesh-edge length (enumerator CFD_MAX_EDGE_LEN), the maximum edge-growth ratio (CFD_GROWTH_RATIO), and 3D rigorous smoothing and size-field limiting using nearest-neighbor matching of adjacent mesh elements (CFD_LIMIT_GROWTH_FLAG). For the source definition, we add a trailing-edge source (WLineSource) with an influence radius of 0.1 MAC and a mesh-edge-length target of 0.003 MAC, in addition to the default sources at the leading edge and wing tips. The added trailing-edge refinement improved the volume-mesh quality near the trailing edge, and the refinement size is determined based on engineering judgment. The maximum edge-growth ratio is set to 1.2, and 3D rigorous smoothing is enabled. We control the surface-mesh density by adjusting the maximum mesh-edge length. Six levels of mesh refinement are defined, with maximum edge lengths of 0.1, 0.05, 0.0375, 0.025, 0.02, and 0.01 MAC. We select a representative geometry from the dataset to determine a mesh configuration that is applied consistently across the full dataset. These surface meshes are shown in Figure 4. As expected, increasing the mesh-refinement level increases the mesh density. The corresponding surface cell counts are 1.01 × 10^5, 1.05 × 10^5, 1.08 × 10^5, 1.18 × 10^5, 1.25 × 10^5, and 1.82 × 10^5 from Level 1 to Level 6, respectively.

The volume mesh is created from the exported surface mesh using a modified implementation of the VSP2CFD code [48]. The primary parameters that control the volume-mesh properties are the maximum number of anisotropic T-Rex boundary-layer layers (TRexMaximumLayers), the first-layer thickness (TRexConditionSetSpacing), and the boundary decay rate of the unstructured mesh around wall surfaces (BoundaryDecay). For all mesh levels, the boundary decay rate is set to 0.85. The first-layer thickness values for the corresponding surface meshes from Level 1 to Level 6 are 6 × 10^-5, 2 × 10^-5, 9 × 10^-6, 7 × 10^-6, 6 × 10^-6, and 6 × 10^-6 m, respectively. The maximum number of anisotropic layers is set to 25, 35, 40, 50, 45, and 50 layers from Level 1 to Level 6. As a result, the total volume cell counts are 5.32, 6.13, 6.57, 6.96, 7.54, and 10.72 million cells, respectively. The volume-mesh sizing study focuses primarily on identifying a proper boundary-layer mesh setup that allows the SA-R turbulence model [34] to capture the boundary-layer flow correctly, thereby yielding accurate aerodynamic and flow predictions. Table 4 summarizes the mesh configurations used in this study.
A mesh convergence study is conducted by running CFD simulations with identical freestream conditions and solver configurations across all mesh levels. We use the representative geometry at Mach 0.3 and AOA 13° and monitor the convergence of $C_D$ and $C_L$, as well as the resulting area-averaged $\bar{y}^+$. We define the area-averaged $y^+$ as

$$\bar{y}^+ = \frac{\sum_i y_i^+ A_i}{\sum_i A_i},$$

where $y_i^+$ and $A_i$ are the $y^+$ value and area of wall cell $i$; the resulting values for each mesh level are reported in Table 4. For the SA-R turbulence model without wall functions, we target $\bar{y}^+ \approx 1$, which is achieved at mesh level 4 and beyond. The convergence behavior of $C_D$ and $C_L$ is shown in Figure 5. The Level 4 mesh with 6.96 × 10^6 volume cells achieves a difference within 1% of the finest grid used in this study while maintaining a manageable computational cost for large dataset generation. Therefore, this mesh configuration is selected and used for meshing all subsequent configurations.
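The area-averaged $y^+$ is straightforward to compute from per-wall-cell quantities; in the short sketch below, the array names are placeholders for whatever the post-processing pipeline extracts from the surface solution.

```python
import numpy as np

def area_averaged_yplus(yplus, area):
    """Area-weighted wall y+ average: sum(y_i+ * A_i) / sum(A_i)."""
    yplus, area = np.asarray(yplus), np.asarray(area)
    return (yplus * area).sum() / area.sum()

# example: three wall cells with unequal areas
print(area_averaged_yplus([0.8, 1.1, 1.4], [2.0, 1.0, 0.5]))
```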
The fluid domain consists of a freestream far-field with a diameter of 40 MAC (630 m), centered at the wing apex, and the wing surface. The far-field boundary is assigned a far-field boundary condition, while the wing surface is assigned an adiabatic, no-slip wall condition. The freestream is specified as standard air, with a gas constant of 287.06 J/(kg·K), a static temperature of 285 K, and a static pressure of 7.18 × 10^4 Pa. The air has a constant heat-capacity ratio of 1.4. The Sutherland viscosity model is used, with a reference viscosity of 1.715 × 10^-5 N·s/m² [49], a constant laminar Prandtl number of 0.72, and a turbulent Prandtl number of 0.9. The reference Reynolds number is based on the average MAC of 16 m, yielding $Re_{ref} = 8.04 \times 10^7$. The aerodynamic coefficients are computed using a specific moment origin on the body and a reference area defined according to the sizing rules in SUAVE. These configuration-specific reference values are included in the dataset repository.
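The quoted reference Reynolds number follows directly from the stated freestream conditions. The short check below assumes the standard Sutherland reference temperature of 273.15 K and Sutherland constant of 110.4 K, which are not stated explicitly in the text.

```python
import numpy as np

# Freestream state from the dataset configuration
R, gam = 287.06, 1.4                         # gas constant [J/(kg K)], gamma
T, p, mach, mac = 285.0, 7.18e4, 0.3, 16.0   # [K], [Pa], [-], [m]

rho = p / (R * T)                    # ideal-gas density, ~0.878 kg/m^3
a = np.sqrt(gam * R * T)             # speed of sound, ~338 m/s
V = mach * a                         # freestream speed, ~102 m/s

# Sutherland's law; T_ref and S are the standard values for air (assumed here)
mu_ref, T_ref, S = 1.715e-5, 273.15, 110.4
mu = mu_ref * (T / T_ref)**1.5 * (T_ref + S) / (T + S)

print(f"Re = {rho * V * mac / mu:.2e}")   # ~8.04e+07, matching Re_ref in the text
```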
Spatial discretization employs the Jameson-Schmidt-Turkel (JST) scheme [50] for the flow equations with Green-Gauss gradient reconstruction, using a JST sensor tuned with coefficients (0.5, 0.02). The turbulence equation is discretized using a scalar upwind scheme without MUSCL reconstruction. Temporal advancement uses implicit Euler for both flow and turbulence, with an adaptive Courant-Friedrichs-Lewy (CFL) number. The linear systems are solved by FGMRES with ILU preconditioning and a residual tolerance of 10^-7. The convergence criterion is set to 10^-6 for the continuity equation, with an additional criterion for the drag coefficient $C_D$ requiring changes below 10^-5 over 100 Cauchy iterations. These convergence criteria are applied consistently throughout all runs in the dataset.
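For orientation, these numerical settings correspond to SU2 configuration options of roughly the following form. This is an illustrative excerpt assuming SU2 7.x option names, not a verbatim copy; the authoritative per-run configuration files ship with the dataset, including the version-dependent option selecting the SA rotation correction.

```
% Illustrative excerpt; see the dataset's SU2 config files for exact settings
KIND_TURB_MODEL= SA
CONV_NUM_METHOD_FLOW= JST
JST_SENSOR_COEFF= ( 0.5, 0.02 )
NUM_METHOD_GRAD= GREEN_GAUSS
CONV_NUM_METHOD_TURB= SCALAR_UPWIND
MUSCL_TURB= NO
TIME_DISCRE_FLOW= EULER_IMPLICIT
TIME_DISCRE_TURB= EULER_IMPLICIT
CFL_ADAPT= YES
LINEAR_SOLVER= FGMRES
LINEAR_SOLVER_PREC= ILU
LINEAR_SOLVER_ERROR= 1E-7
CONV_FIELD= ( RMS_DENSITY, DRAG )
CONV_RESIDUAL_MINVAL= -6
CONV_CAUCHY_ELEMS= 100
CONV_CAUCHY_EPS= 1E-5
```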
CFD runs are conducted using the Stanford Research Computing Center (SRCC) facility on a 32-core 2.4 GHz partition. The total computational cost of the dataset is 310,835 CPU-hours for 2,448 snapshots. To illustrate the variety of flow conditions represented in the dataset, Figure 6 shows three representative configurations (call signs as subtitles) at an AOA of 15°. These geometries are selected from the holdout test set and span planform types ranging from trapezoidal wings and cropped delta wings to double-delta wings. The wing surfaces are colored by $C_p$, and Q-criterion iso-surfaces at a value of 100 are added to illustrate the vortex sheets. The variation in geometry leads to drastically different surface pressure distributions and primary vortex structures, enriching the geometric and flow features available for surrogate modeling of flow prediction for the double-delta wing family. Among the 2,448 CFD simulations, 57 runs failed to converge to the specified residual target using the numerical scheme described above. These runs correspond primarily to highly cambered geometries at AOAs above 16°. The convergence failure arises from the unsteady nature of critically separated flow, for which the selected physics models and numerical scheme cannot recover a steady solution at the residual target. We believe it is more valuable to provide consistently converged solutions, even at a lower fidelity, than to remove these snapshots. As a result, we lowered the order of the convective scheme and adopted AUSM [51] for these runs. The configuration files contained in the dataset annotate which convective scheme is used for each run. Future extensions of the dataset may consider using higher-fidelity methods, such as detached-eddy simulation, for these runs.
The dataset is hosted on our Huggingface repository at https://huggingface.co/datasets/yirens/double-delta-aero under the GNU Lesser General Public License v2.1. The repository includes all CFD results, including the SU2 configuration files, surface and volume solutions in vtk format, convergence history files, and integrated aerodynamic loading summary files. The VLM solutions are provided in json format. All files are organized and indexed by vehicle call sign for ease of access.

Empirical neural scaling laws have been studied extensively in prior work [25,52-54]. Together, these studies show that model performance under a given computational budget $C$, when measured by an appropriate loss metric, follows a power-law relationship with training dataset size and model size:

$$L(N, D) = c_1 N^{-\alpha} + c_2 D^{-\beta} + c_0, \tag{7}$$
where $L$ is the test loss, $N$ is the model size, $D$ is the dataset size, and $c_0$ is the irreducible error. The coefficients $c_1$, $c_2$, $\alpha$, $\beta$, and $c_0$ are fitting parameters. This formulation is applicable to GNNs, as indicated by a previous study [26]. We define $L$ as the mean squared error (MSE) evaluated on the holdout test set. Recall that MF-VortexNet learns a mapping $\mathcal{F}$ that takes a low-fidelity pressure-stream-condition-geometry state, represented as a graph $G$, and produces a predicted pressure distribution over the lattice, $C_p^{\mathrm{pred}}$, that approximates the ground-truth pressure distribution $C_p^{\mathrm{gt}}$ obtained from high-fidelity simulations. The test loss is then defined as

$$L = \frac{1}{|\mathcal{T}|} \sum_{G \in \mathcal{T}} \frac{1}{n_G} \left\lVert C_p^{\mathrm{pred}}(G) - C_p^{\mathrm{gt}}(G) \right\rVert_2^2, \tag{8}$$

where $\mathcal{T}$ is the holdout test set and $n_G$ is the number of nodes in graph $G$.
The remainder of this section outlines the procedure used to identify the scaling-law coefficients based on the test error $L$, with the results interpreted as preliminary.
The empirical scaling experiment aims to estimate the exponent coefficients in Equation 7, enabling us to quantify how prediction accuracy scales with training data size and to determine the data requirements necessary to achieve a target accuracy. The experiment comprises three components: the training data, the model, and the training configurations.
We define the training data scaling under investigation as the increase in the number of snapshots resulting from higher sampling density over the defined geometric design space. As summarized in Table 3, using Saltelli sampling, the geometric configurations in the training set (described in Section III) are subdivided into six levels. From Level 1 to Level 6, the training set includes a total of 8, 16, 32, 64, 128, and 256 geometric configurations, respectively. For each level, flow snapshots at five AOAs are selected: 11°, 13°, 15°, 17°, and 19°. Together, our training data scaling involves $D$ values of 40, 80, 160, 320, 640, and 1280 graphs, respectively. Each graph provides a multi-fidelity representation of a geometric configuration and its associated flow state.
The decision to include only five AOAs per geometry, rather than all available AOAs in the dataset, is motivated by the need to balance dataset size and model capacity under a tight computational budget. Because the primary objective of the data scaling study is to determine the geometric sampling density required to train MF-VortexNet, we prioritize scaling the number of geometries rather than the number of flow snapshots per geometry. As discussed by Hoffmann et al., dataset size, model size, and computational budget should be scaled proportionally [54]. Subsampling the available AOAs therefore enables training with smaller models at lower computational cost, while preserving the qualitative empirical scaling trends.
An additional consideration concerns how to define the dataset size $D$ for a collection of graphs. Prior studies across different domains have adopted different definitions, including the volume of digital information measured in terabytes [55], the number of individual graphs [56], and the total number of edges [26]. Consequently, there is currently no standardized definition of $D$ in graph-based settings, and its interpretation remains domain-specific. In this work, we define $D$ as the total number of graphs in the dataset. This definition is aligned with the downstream design application, where designers seek to understand how densely the design space needs to be sampled and how many high-fidelity simulations are required to train a surrogate model with sufficient accuracy. Such a definition directly links to the number of simulation runs required and is thus more convenient for budget planning.
Besides the training set, we constructed the test set using the sixteen holdout test geometries, as described in Section III. To maintain consistency with the training set in flow-snapshot selection, we include snapshots at AOAs of 11°, 13°, 15°, 17°, and 19°. In total, the test set comprises 80 graphs. The test set is relatively small. However, considering the even geometric distribution shown in Figure 3, we believe the test set contains adequate variability representative of the training set and is suitable for use in test loss evaluation.
Prior work has shown that performance scaling depends critically on the relationship between dataset size and model capacity. In the under-parameterized or over-parameterized regimes, corresponding to disproportionately small or large ratios of $N/D$, the test error no longer exhibits a power-law decay with respect to $D$, a phenomenon referred to as model scaling collapse [26,54]. To avoid inconclusive results derived from a single fixed-capacity model, we repeat the scaling experiment using four models with sizes ranging from 0.10 million to 2.41 million trainable parameters.
The MF-VortexNet architecture has three hyper-parameters that are most relevant to model size [10]: the width of the nodal encoding for hidden layers (hidden_feature), the number of heads for multi-head attention (heads), and the total depth of the message-passing blocks (hops). The hops parameter governs the extent of nodal information propagation, and for the given 30 × 32 lattice, we determined that a hops value of 10 is adequate for information to propagate across all nodes in a graph. This hyper-parameter is therefore held constant across models. To change the model size, we alter heads and hidden_feature; increasing either widens the model and increases its size. The resulting models, along with their numbers of trainable parameters ($N$), are presented in Table 5. Once the model size is defined, additional hyper-parameters related to regularization and training must be specified [10]. Assigning a single fixed set of hyper-parameters across all models is not feasible, as models of different sizes require different training dynamics and regularization strategies to achieve convergence and target performance. Instead, we promote cross-model comparability by adopting a consistent hyper-parameter selection procedure that yields "equally optimal" configurations for each model.
For each model size, we employ Optuna [57] to optimize key training and regularization hyper-parameters, including the learning rate and weight decay of the Adam optimizer, as well as the dropout ratio and skip-connection bandwidth (Alpha). Hyper-parameter optimization is performed using a subset of the Level 6 training set, constructed by randomly sampling 2048 snapshots, thereby ensuring that no information from the holdout test set is leaked during model selection. This dataset is further randomly split into an 80%-20% training-validation partition. For each model, the training configuration that achieves the lowest validation MSE is adopted as the training hyper-parameters.
To ensure fair comparison across model capacities, an identical computational budget, as discussed in the following section, is allocated to all models. We conduct a maximum of 35 training runs during hyper-parameter optimization for each model, with the first ten trials used as “warm start” evaluations for Optuna. All other Optuna optimization settings are left at their default values.
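A sketch of this hyper-parameter search using Optuna follows; the search ranges and the model/training helper functions (build_mf_vortexnet, train_and_validate) are hypothetical stand-ins, and only the 35-trial budget and the ten warm-start trials mirror the setup described above.

```python
import optuna

def objective(trial):
    # search ranges are illustrative, not the paper's actual ranges
    lr      = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    wd      = trial.suggest_float("weight_decay", 1e-6, 1e-3, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    alpha   = trial.suggest_float("alpha", 0.1, 0.9)  # skip-connection bandwidth
    model = build_mf_vortexnet(dropout=dropout, alpha=alpha)   # hypothetical
    return train_and_validate(model, lr=lr, weight_decay=wd)   # returns val MSE

study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(n_startup_trials=10),  # 10 warm-start trials
)
study.optimize(objective, n_trials=35)
print(study.best_params)
```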
The MSE values of the optimized models are reported in Table 6. Despite differences in model sizes and training configurations, comparable test MSE values are achieved. The remaining hyper-parameters are reported to ensure reproducibility. These hyper-parameters are held fixed for each model across all data scaling runs.
To fit the neural scaling law coefficients in Equation 7, we conduct a controlled data-scaling experiment for each model listed in Table 5. The training-dataset size $D$ is systematically varied, as discussed in Section IV.A.1, while keeping the training procedure fixed. This procedure is designed to isolate the effect of data scaling from variations in model capacity or computational cost.
When performing the $D$ scaling using datasets at six levels, as discussed in Section IV.A.1, and across four model sizes $N$, careful design of the training scheme is needed so that the fitted $\beta$ represents the training-efficiency gain resulting from increased geometric sampling density, rather than being a consequence of increased computation. As an experimental design choice, we fix the maximum computational budget $C^*$ available for all training runs across all $D$ and $N$. $C^*$ is defined as the maximum number of weight-update steps available for each training fold, and is set to 2000. Four folds are used to train the model; in each fold, a 75%-25% training-to-validation data partition is used. Adopting k-fold cross-validation reduces the risk of model overfitting and is a critical component of weight scheduling for physics-informed loss integration [10]. In addition to fixing $C^*$, an early-stopping patience of 50 epochs is used across training runs to prevent model overfitting. The early-stopping mechanism terminates the training of the current fold when no improvement in validation loss is observed over the preceding 50 epochs. After examining the validation-loss histories across epochs for all runs, we did not observe typical overfitting behavior, such as validation loss increasing toward the end of training, except for the Large model trained on the Level 1 dataset, which, due to the limited amount of validation data, is more prone to overfitting; this supports the appropriateness of the chosen early-stopping patience.
Setting a fixed maximum number of weight-update iterations introduces an additional challenge for dataset-size scaling. In standard mini-batch gradient descent, each batch corresponds to a single weight-update step; for a training dataset of size $D_t$ and batch size $B$, a total of $D_t/B$ weight updates are performed per epoch. Since $B$ is constrained by hardware memory and is fixed at 128, the total number of weight updates scales with $D_t$, introducing an inconsistency when isolating the effect of dataset size on model performance. Moreover, prior work suggests that the optimal computational budget $C$ does not necessarily scale linearly with $D$ [54], and the scope of the present study is insufficient to identify the optimal scaling relationship between $C$ and $D$. During the experiment, we also found that the model hyper-parameters, as discussed in Section IV.A.2, are sensitive to $C$, such that designing a $C$ allocation scheme across model sizes and data sizes to calibrate performance becomes difficult. An imprudent $C$ schedule may result in biased power-law observations.
To address this issue and isolate error scaling with respect to $D$ alone, we fix the number of weight-update steps across dataset sizes by adopting a stochastic subsampling scheme. Specifically, at each epoch, we randomly draw $B$ samples from the training set and perform a full-batch gradient update. Because of the stochastic nature of this procedure, and given a sufficiently large number of epochs, the model is exposed to the entire training dataset in expectation, while the total computational budget $C$ remains constant across different dataset sizes. As a result, for each fixed model size, the observed scaling behavior can be attributed solely to changes in sample distribution introduced by $D$ scaling. Since the same flow conditions are selected at each level of the geometry set, this scaling behavior directly reflects gains in prediction accuracy arising from increased sampling density in the design space. The resulting $L$ versus $D$ scaling results are conceptually similar to the iso-FLOP experiments reported by Hoffmann et al. [54].
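Conceptually, the fixed-budget subsampling scheme replaces epoch-based mini-batching with a fixed number of randomly drawn batches, as in the PyTorch-style sketch below; the model, loss, and graph container are placeholders, and early stopping and k-fold bookkeeping are omitted.

```python
import torch

C_STAR = 2000   # fixed budget: weight-update steps per training fold
B = 128         # batch size

def train_fold(model, graphs, loss_fn, opt):
    """One fold of fixed-budget training: each step draws B graphs at random
    and performs a single gradient update, so the number of weight updates
    is independent of the dataset size D = len(graphs)."""
    for step in range(C_STAR):
        idx = torch.randint(len(graphs), (min(B, len(graphs)),))
        opt.zero_grad()
        # graphs[i].y stands in for the high-fidelity target field of graph i
        loss = sum(loss_fn(model(graphs[i]), graphs[i].y) for i in idx) / len(idx)
        loss.backward()
        opt.step()
```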
The remaining training setup follows Shen et al. [10]. For each model trained at a particular model size $N$ and dataset size $D$, we repeat the training six times under randomized seeds for initial model weights and dataset partitioning. Trial-level MSEs are aggregated to report per-level mean, standard deviation, minimum, and maximum values, improving the robustness of the scaling-exponent estimation. Least-squares fitting using SciPy's curve_fit function is used to fit the aggregated mean MSEs to the scaling-law coefficients in Equation 7. All model training is executed on a 40 GB NVIDIA A100 GPU.
The mean test MSE, evaluated on the holdout test set and averaged across all trials at each $D$, is plotted in Figure 7. Different data series represent scaling results from different models, from large to small. Overall, the test MSE decreases as $D$ increases, although the rate of convergence varies across models. From the learning curves, three distinct learning regions can be identified. We highlight these regions by light red, white, and light green bands along the $D$ axis. Such zoning is consistent with the learning-curve behavior described by Hestness et al. and Rosenfeld et al. [52,53].
When the dataset is small (light red region), the learning curves exhibit behavior characteristic of the "small data region" [52], in which model performance is limited by insufficient training data. Consequently, the model performs a best-guess prediction. For the models investigated in this study, this region corresponds to $D \le 80$. As $D$ increases beyond the "small data region", a "power-law region" [52] emerges, in which increasing $D$ helps the model improve its prediction quality for generalization tasks, such as predictions on the holdout test set. Under the current model and training configuration, this region spans $80 \le D \le 320$, corresponding to geometry sets of 16 to 64 configurations.
For larger data sizes, at $D \ge 320$, an "asymptotic error region" is observed. We intentionally distinguish the terminology of this region from the "irreducible error region" described by Hestness et al. [52], as the error floor observed in Figure 7 should not be interpreted as an inherent limitation of the dataset or as insufficient model expressivity. Instead, it arises from training under a fixed computational budget $C^*$, as discussed in Section IV.A.3. To achieve a fixed $C^*$ across different values of $D$, subsampling is used. Under the training setup with a batch size of 128, the training data for $D \ge 170$ cannot fit into a single full batch, and thus the training data, viewed as an entire set, are underutilized in each weight-update step. Although all graphs in the training set are likely to be used during training, each graph in larger datasets is less exposed to the optimizer, and hence the resulting models are less optimally trained. This under-training effect explains why the test MSE for models trained at $D = 1280$ tends to be higher than for those trained at $D = 640$. These trends are consistent with the iso-FLOP curves reported in previous literature [54], and this region should not be used for power-law fitting, as the model training is bottlenecked by $C^*$.

Among these regions, the power-law region provides information on data-size scaling. Figure 8 presents the test MSE as a function of $D$, together with the mean value, variability indicated by boxes spanning plus and minus one standard deviation, and the minimum and maximum values shown by whiskers in a box-plot style. The mean values are identical to those presented in Figure 7. It is observed that the larger models tend to have smaller coefficients of variation in the power-law region, indicating more consistent performance across repeated trials.
Within the power-law region, Equation 7 can be further reduced to

$$L'(D) = c_2 D^{-\beta}, \tag{9}$$
where the constant contributions from model size and the irreducible error are absorbed into the adjusted test loss $L'$, and we focus on quantifying the scaling exponent $\beta$. To do so, we use SciPy's curve_fit function to perform a non-linear least-squares fit of the data. The mean MSE is treated as the dependent variable and $D$ as the independent variable, with the corresponding standard deviations used to represent the relative uncertainty of the MSE measurements. The resulting power-law fits are shown as dashed lines in Figure 8(a)-(d).
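The fitting step amounts to a weighted non-linear least-squares problem; the sketch below uses illustrative MSE values (not the paper's measurements) chosen near the reported power law, to show the mechanics of the fit and the inversion of Equation 9 for a target loss.

```python
import numpy as np
from scipy.optimize import curve_fit

# illustrative means/stds within the power-law region (not the measured values)
D   = np.array([80.0, 160.0, 320.0])
mse = np.array([0.122, 0.080, 0.052])
std = np.array([0.012, 0.007, 0.004])

power_law = lambda D, c2, beta: c2 * D**(-beta)
(c2, beta), _ = curve_fit(power_law, D, mse, sigma=std, p0=(1.0, 0.5))

# Invert L' = c2 * D^-beta for the dataset size that reaches a target MSE.
mse_target = 0.06
D_star = (c2 / mse_target)**(1.0 / beta)
print(f"beta = {beta:.3f}, D* = {D_star:.0f}")
```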
From the power-law fitting, we estimate the scaling exponent $\beta$ to be 0.5484, 0.6220, 0.6654, and 0.6128 for the Mini, Small, Medium, and Large models, respectively. The corresponding coefficients $c_2$ are 1.2617, 1.7354, 2.4044, and 1.7778, and the corresponding R-squared values are 0.9772, 0.9652, 0.9985, and 0.9965, indicating high-quality fits for $\beta$ estimation. These fitted power laws can then be used to answer the question of how much data is required to achieve a target prediction accuracy under the proposed training procedure and computational budget. Assuming that the minimum MSEs obtained from hyper-parameter optimization (Table 6) are reasonable estimates of each model's best performance under a substantially larger dataset and can serve as the loss target, we compute the optimal dataset size $D^*$ at the computational budget $C^*$ by inverting Equation 9. The resulting values of $D^*$ are 261, 245, 317, and 261 for the Mini, Small, Medium, and Large models, respectively. These values correspond to geometry sets of 52, 49, 63, and 52 geometries, all of which are slightly smaller than the Level 4 dataset listed in Table 3.
The size of the geometry set can be further mapped to the corresponding sampling density in the unit hypercube. To this end, we first fit a power law relating the average nearest-neighbor distance $\ell$, defined as the average Euclidean distance from each design sample to its nearest neighbor, to the number of geometry samples $G$ contained in a dataset, using the data in Table 3. This fit yields $\ell = 0.62\, G^{-0.15}$, from which the estimated $D^*$ values correspond to $\ell$ in the range of 0.333 to 0.346. The corresponding one-dimensional optimal sampling density can then be estimated for space-filling sampling algorithms by accounting for dimensional scaling. Based on this analysis, selecting approximately eight sampling points per dimension is sufficient to achieve optimal surrogate-model performance under the allocated computational budget $C^*$.
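The nearest-neighbor statistic can be computed as follows; here uniform random points stand in for the actual Saltelli samples, so the fitted constants will only roughly approximate the reported $\ell = 0.62\, G^{-0.15}$.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.spatial import cKDTree

def avg_nn_distance(X):
    """Average Euclidean distance from each sample to its nearest neighbor."""
    d, _ = cKDTree(X).query(X, k=2)   # k=2: the first hit is the point itself
    return d[:, 1].mean()

rng = np.random.default_rng(0)
G_levels = np.array([8, 16, 32, 64, 128, 256])          # geometries per level
ell = [avg_nn_distance(rng.random((G, 6))) for G in G_levels]

(a, b), _ = curve_fit(lambda G, a, b: a * G**(-b), G_levels, ell)
print(f"ell ~ {a:.2f} * G^(-{b:.2f});  ell(52) ~ {a * 52**(-b):.2f}")
```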
The data requirement for training MF-VortexNet is considerably lower than those of other machine-learning surrogate models capable of field prediction [15,16,18,21,22]. Such models typically require hundreds of geometries to achieve geometric generalizability, along with thousands of flow-simulation snapshots to achieve field-prediction accuracy. We believe this data-scaling efficiency is rooted in MF-VortexNet's design for operating effectively in small-data use cases. By incorporating physics-informed loss functions, scheduled training strategies, and relatively compact model architectures, prior studies [10,24] have demonstrated that strong interpolation capability within the design space can be achieved using only 15 to 27 geometries with 189 to 600 flow snapshots. However, because the model remains relatively small compared to large-capacity architectures [18], its representational capacity is inherently limited. As a result, this model-size and training-scheme constraint may prevent the surrogate from fully exploiting the benefits of substantially larger datasets.

Finally, we assess how the fitted scaling exponent varies with model size. Figure 9 shows the scaling relationship. We observe an approximately linear correlation between $\beta$ and $N$ on a log-log scale when the model size is not excessively large. The Large model, however, deviates from this trend. The exact cause of this deviation requires further investigation, but one plausible explanation is that the model capacity exceeds what can be effectively optimized under the available computational budget. As shown in Table 6, the Large model employs a learning rate an order of magnitude smaller than those of the smaller models and achieves a higher test MSE than the Medium model, which contradicts typical expectations from model-size scaling [25,54]. We therefore interpret the selected hyper-parameters, although optimized using the same systematic procedure as for the other models, as compromises between training stability and performance under a fixed computational budget. It is likely that the Large model's capacity exceeds what the optimizer can exploit to drive convergence beyond local minima under $C^*$, and as a result the model's representational capability is optimization-limited. Future work will extend the present data-scaling study to model training with higher computational budgets to better assess scaling behavior for larger models.
If the computational-budget limitation is indeed the cause of the lower fitted $\beta$ for the Large model, we would expect that increasing the computational budget $C^*$ would increase the estimated magnitude of $\beta$ for the Large model, conditioned on proper adjustment of the training hyper-parameters. Focusing on the overall trend in Figure 9, we conjecture that larger models tend to exhibit larger magnitudes of $\beta$, indicating faster convergence and more efficient utilization of training data, until the allocated computational budget becomes the limiting factor. Such behavior, in which larger models are more efficient to train under a limited computational budget [53], is supported by other studies in the language-modeling domain [58,59], and is attributed to the greater capacity of larger models to learn task-adaptive representations that extract more features from each data sample. This conjecture suggests that even for domain-specific surrogate models such as MF-VortexNet, there are benefits to constructing larger and wider models: increased model capacity not only enhances field representation and generalization across flow and geometric configurations, but also improves data-utilization efficiency during training. As a consequence, larger models may require fewer training samples to achieve a target accuracy, potentially reducing the cost of dataset construction.
However, larger models also demand higher computational budgets for training. In practice, particularly during the conceptual design phase, computational resources are often pre-allocated. The optimal allocation of computational budget between dataset construction and model training when using domain-specific surrogate models therefore depends on the relative costs of generating CFD datasets and training surrogate models, and should be determined by jointly considering these costs together with the training-efficiency gains offered by model-size scaling.
This study addresses two gaps in data-driven surrogate modeling for aerodynamic design: the limited presence of open-source, multi-fidelity datasets for flow simulations with complex physics over relatively simple geometries, and the lack of quantification of the empirical scaling laws for field-predicting GNN-based surrogate models.
A primary contribution of this work is the release of an open-source, multi-fidelity aerodynamic dataset for a parametric family of double-delta wings. To support data-driven modeling of nonlinear, vortex-dominated flows, we generated 272 unique geometries and evaluated each configuration at Ma = 0.3 and angles of attack ranging from 11° to 19° using both VLM and RANS CFD solvers. The design of experiments (DOE) is carefully constructed to enable future expansion of the dataset and to support variance-based sensitivity analyses. This dataset is publicly available and intended to serve as a benchmark for multi-fidelity surrogate modeling research.
Using this dataset, we conduct a preliminary empirical scaling study of the MF-VortexNet surrogate model. The scaling analysis is performed on six datasets, containing between 8 and 256 double-delta wing designs distributed in a hierarchical manner through sequential refinement of the design space's sampling density. By selecting snapshots at specific free-stream conditions, we assembled six training datasets with sizes $D$ ranging from 40 to 1280 graphs. Four models, with sizes ranging from 0.1 to 2.4 million trainable parameters, are trained under a fixed computational budget of at most 8000 weight-update steps, subject to early-stopping criteria. Collectively, we estimate that the test performance $L'$, evaluated using the test MSE on the holdout test set, scales with $D$ as $L' \propto D^{-0.6122}$. The observed scaling dependency is strong, indicating efficient utilization of the training data. Leveraging this scaling exponent, and benchmarking against the minimal test MSE achieved using a larger training dataset of size 2048, we estimate that the optimal geometric sampling distance for MF-VortexNet under the current computational budget is approximately 0.34 in a six-dimensional design space, which corresponds to a sampling density of eight samples along each dimension using space-filling sampling techniques. The relatively sparse sampling requirement supports the design intent of MF-VortexNet as an efficient multi-fidelity surrogate model for domain-specific applications.
By comparing the fitted data-size scaling exponents across differently sized models trained under a fixed computational budget, we observe a log-log linear correlation between model size and the data-size scaling exponent, although the largest model trained in the current study does not follow this trend. We speculate that this anomaly is caused by an insufficient computational budget, resulting in a set of suboptimal hyper-parameters for training the model. We conjecture that, when sufficient computational resources are allocated, the magnitude of the data-scaling exponent increases with model size, thereby allowing more efficient utilization of the dataset by larger models. Following this conjecture, larger surrogate models could be trained with less training data, yielding lower cost for offline dataset generation at the expense of increased model-training computation. Consequently, there exists a trade-off between the computational budget allocated to dataset generation and that allocated to model training. Further study is required to systematically vary the computational budget for surrogate-model training and to evaluate additional model sizes, in order to obtain more statistically robust estimates of the power-law exponents and to better characterize this trade-off.
In the future, several extensions of the existing work can be made. Firstly, downstream applications, such as sensitivity assessments of surrogate-model performance with respect to geometric design variables, can be performed leveraging the DOE used for the dataset. In addition, as approximately 2.3% of the snapshots in the dataset would require a higher-fidelity CFD solver to achieve convergence at the original residual target, incorporating higher-fidelity solutions into the dataset may further enrich its fidelity levels. The preliminary assessment of the data-size scaling relationship conducted in this study indicates that the scaling-law analysis is incomplete without systematically varying the computational budget along with data- and model-size scaling. In future work, we aim to jointly scale the computational budget, model size, and dataset size to identify optimal resource-allocation strategies.