SimulCost: A Cost-Aware Benchmark and Toolkit for Automating Physics Simulations with LLMs

Yadi Cao*1, Sicheng Lai*†2, Jiahe Huang*1, Yang Zhang*†3, Zach Lawrence*1, Rohan Bhakta*1, Izzy F. Thomas*1, Mingyun Cao*†4, Chung-Hao Tsai1, Zihao Zhou1, Yidong Zhao5, Hao Liu6, Alessandro Marinoni1, Alexey Arefiev1, Rose Yu1

Abstract

Evaluating LLM agents for scientific tasks has focused on token costs while ignoring tool-use costs like simulation time and experimental resources. As a result, metrics like pass@k become impractical under realistic budget constraints. To address this gap, we introduce SimulCost, the first benchmark targeting cost-sensitive parameter tuning in physics simulations. SimulCost compares LLM tuning of cost-sensitive parameters against traditional scanning approaches in both accuracy and computational cost, spanning 2,916 single-round (initial guess) and 1,900 multi-round (adjustment by trial-and-error) tasks across 12 simulators from fluid dynamics, solid mechanics, and plasma physics. Each simulator's cost is analytically defined and platform-independent. Frontier LLMs achieve 46–64% success rates in single-round mode, dropping to 35–54% under high accuracy requirements, rendering their initial guesses unreliable, especially for high-accuracy tasks. Multi-round mode improves rates to 71–80%, but LLMs are 1.5–2.5× slower than traditional scanning, making them uneconomical choices. We also investigate parameter group correlations for knowledge-transfer potential, and the impact of in-context examples and reasoning effort, providing practical implications for deployment and fine-tuning. We open-source SimulCost as a static benchmark and extensible toolkit to facilitate research on improving cost-aware agentic designs for physics simulations, and on expanding new simulation environments. Code and data are available at https://github.com/Rose-STL-Lab/SimulCost-Bench.

*Equal contribution. †Work partially done at UCSD.
1 University of California San Diego  2 The Chinese University of Hong Kong, Shenzhen  3 Peking University  4 University of California, Los Angeles  5 ETH Zurich  6 California Institute of Technology.
Correspondence to: Yadi Cao <yadicao95@gmail.com>, Rose Yu <roseyu@ucsd.edu>.

Preprint. March 24, 2026.

1. Introduction

Large Language Models (LLMs) show promise for scientific workflows through code generation, complex reasoning, and especially tool calling (Tang et al., 2023; Huang et al., 2024; Patil et al., 2023). By offloading computations to domain-specific tools, LLM agents reduce hallucinations through grounded outputs. This capability has accelerated workflows in scientific domains including chemistry (Bran et al., 2024), computational fluid dynamics (Pandey et al., 2025; Yue et al., 2025; Chen et al., 2024a), and data science (Chen et al., 2024b; Majumder et al., 2024).

To systematically evaluate these capabilities, recent benchmarks have begun investigating LLMs for assisting science, including scientific question answering and tool usage across domains (Zhang et al., 2025; Lai et al., 2023; Xu et al., 2024; Elrefaie et al., 2025). However, existing evaluations focus on task correctness and token costs while overlooking tool costs. Metrics like pass@k with large k implicitly treat tool usage as free, which becomes impractical in realistic scientific workflows where simulations or experiments consume significant time or materials. In physics simulations, numerical parameter choices directly impact both solution quality and cost (Arisaka & Li, 2024; Kumar & Nair, 2024; Thakur & Nadarajah, 2025). Increasing spatial or temporal resolution improves accuracy, but with cost scaling quadratically or cubically.
Domain experts develop intuition for these trade-offs through experience, balancing cost against quality via informed guesses and iterative refinement. Yet without mechanisms to evaluate such cost-awareness in LLMs under current pass@k and success-rate-only metrics, we risk developing agents that achieve correctness only after numerous unaffordable trials.

To address this gap, we present SimulCost, the first benchmark designed to evaluate LLMs' cost-awareness in tuning physics simulators. Unlike existing benchmarks that only measure success rate, SimulCost additionally evaluates computational efficiency. We quantify cost by counting core-operation FLOPs for each simulation instance, which correlates with wall-clock time on serial machines.

Figure 1. Overview of SimulCost. Our benchmark evaluates LLM agents on cost-sensitive parameter tuning across 12 physics simulators spanning fluid dynamics, solid mechanics, and plasma physics. Given a simulation task, tuning mode, and accuracy requirement, the LLM proposes tunable parameters in either single-round (initial guess) or multi-round (trial-and-error) mode. The challenge is balancing accuracy and cost: overly coarse parameters fail accuracy requirements, while overly fine parameters waste computation.
Our benchmark spans 12 simulators across fluid dynamics, heat transfer, solid mechanics, and plasma physics. Our main contributions: (1) the first benchmark measuring both success rate and computational cost for LLM-automated physics simulations; (2) an extensible toolkit with 12 simulators featuring platform-independent cost tracking; (3) evaluation of state-of-the-art LLMs compared against brute-force scanning and Bayesian optimization; (4) ablation studies providing practical guidance on potential improvements such as knowledge transfer, in-context learning, and reasoning effort.

Through systematic evaluation, we report several key findings. (1) Frontier LLMs achieve 46–64% success rates in single-round mode, dropping to 35–54% under high accuracy requirements. This indicates that LLMs' initial guesses are unreliable and only useful for quick previews when users lack parameter intuition. (2) Multi-round mode improves success rates to 71–80%, making it necessary for high-accuracy tasks. However, LLM trial-and-error is 1.5–2.5× slower than brute-force scanning, suggesting practitioners should let LLMs invoke scanning algorithms rather than relying on internal reasoning alone. (3) Common parameters like spatial resolution are easier to tune than solver-specific ones like convergence tolerances. However, there is little correlation between individual tunable parameters even within the same type, indicating that fine-tuning on cheaper simulators is unlikely to transfer to expensive ones. (4) In-context learning improves single-round success by 15–25% but anchors models to demonstrated parameter regimes, degrading multi-round performance. This indicates naive retrieval-augmented generation may not be a complete solution.

We hope SimulCost will serve as a pioneering benchmark for evaluating cost-awareness in LLM-based scientific agents.
Respecting tool cost is crucial for realistic workflows where each experiment requires significant computational resources or physical materials. We will open-source SimulCost upon publication, including the static benchmark and extensible toolkit, enabling the community both to explore improvements to the cost-awareness of LLM-based scientific agents and to create new simulation environments.

2. Related Work

LLM Benchmarks for Science: Knowledge-Based vs. Tool-Use. LLM benchmarks for science fall into two categories: scientific knowledge evaluation and tool-using agents. The first category includes QA benchmarks such as SciBench (Wang et al., 2024), SciEx (Dinh et al., 2024), and APBench (Wu et al., 2025b), which evaluate reasoning across mathematics, physics, chemistry, and astrodynamics. These benchmarks are limited to problems with closed-form, hand-calculatable solutions. Realistic scientific workflows require complex external tools like numerical simulations, motivating the second category: LLMs as scientific agents (Ren et al., 2025). Domain-specific systems like ChemCrow (Bran et al., 2024) integrate 18 chemistry tools for synthesis and drug discovery, while OpenFOAMGPT (Pandey et al., 2025), Foam-Agent (Yue et al., 2025), and CFDLLMBench (Somasekharan et al., 2025) target computational fluid dynamics through workflow automation and knowledge evaluation. ScienceAgentBench (Chen et al., 2024b) and DiscoveryBench (Majumder et al., 2024) target data science workflows. This work focuses on the latter category, LLMs using verified simulation tools, and addresses the missing cost dimension in existing benchmarks.

Table 1. Comparison of benchmarks for LLM agents in scientific and engineering domains. Tool Cost: whether it tracks computational costs of external tools (separate from LLM inference). Platform Indep.: whether tool cost measurement is platform-independent (not based on wall time). Toolkit: whether it provides an extensible simulation environment. –: not applicable (no tool cost). *Ambiguous: costs are reported but mixed with other components or based on wall time. †Claims computational cost but reports only LLM token costs in USD. ‡11/12 solvers; EPOCH uses wall time (see Section 5).

| Benchmark | Domain | Tool Cost | Platform Indep. | Toolkit |
|---|---|---|---|---|
| TaskBench (Shen et al., 2024) | General Tools | ✗ | – | ✗ |
| CATP-LLM (Wu et al., 2025a) | General Tools | ✓ | ✗ | ✗ |
| ScienceAgentBench (Chen et al., 2024b) | Data Science | ✗ | – | ✗ |
| DiscoveryBench (Majumder et al., 2024) | Data Science | ✗ | – | ✗ |
| Auto-Bench (Chen et al., 2025) | Causal Discovery | ✓ | ✓ | ✗ |
| MLE-Bench (Chan et al., 2024) | ML Engineering | ✓* | ✗ | ✗ |
| MLGym (Nathani et al., 2025) | ML Research | ✓* | ✗ | ✗ |
| Agent-Lab (Schmidgall et al., 2025) | ML Research | ✓* | ✗ | ✗ |
| BioML-bench (Science Machine, 2025) | Biomedical ML | ✓* | ✗ | ✗ |
| ChemCrow (Bran et al., 2024) | Chemistry | ✗ | – | ✗ |
| AstaBench (Bragg et al., 2025) | Scientific Research | ✗† | – | ✗ |
| CFDLLMBench (Somasekharan et al., 2025) | CFD | ✗ | – | ✓ |
| SimulCost (Ours) | Physics Simulations | ✓ | ✓‡ | ✓ |

The Missing Tool Cost Consideration. Despite growing interest in scientific agents, most benchmarks neglect the cost of tool execution, a critical factor in real scientific workflows where tool costs (such as simulation time) dominate over LLM API costs. For example, ScienceAgentBench (Chen et al., 2024b), DiscoveryBench (Majumder et al., 2024), and most agentic systems (Nathani et al., 2025; Kapoor et al., 2024) track only LLM token costs. AstaBench (Bragg et al., 2025) claims "computational cost" but reports only token costs in USD. Works such as CodePDE (Li et al., 2025) and SciCode (Tian et al., 2024) combine code generation with pass@k metrics.
While valuable for assessing code synthesis capabilities, these benchmarks address a different challenge than properly using existing tools, and do not consider execution cost as a metric.

Several benchmarks address this gap by recording wall time, e.g., Agent-Lab (Schmidgall et al., 2025), MLE-Bench (Chan et al., 2024), MLGym (Nathani et al., 2025), and BioML-bench (Science Machine, 2025). However, these measurements conflate LLM inference with tool execution time and vary across hardware, making cross-study comparisons unreliable. Auto-Bench (Chen et al., 2025) achieves platform independence by counting interventions as the tool cost, but this assumes uniform cost per intervention and remains in the data science domain. CATP-LLM (Wu et al., 2025a) models cost-aware tool planning via function-as-a-service, but is limited to generic tools with relatively fixed costs per function, which breaks down for simulations where parameters like grid resolution alter the cost of the same function by orders of magnitude. MetaOpenFOAM (Chen et al., 2024a) is the only prior work targeting physics simulations with cost awareness. However, it is a workflow automation framework rather than a comprehensive benchmark, and records wall time that conflates LLM inference with solver execution.

SimulCost bridges these gaps by defining cost based on computational complexity analysis of each solver, providing platform-independent measurements with guaranteed reproducibility. We focus on parameter tuning of physics simulations where the parameters of interest fundamentally influence both simulation accuracy and computational cost. Table 1 summarizes our distinctions.

Traditional Parameter Tuning Methods and Benchmarks. The tasks in SimulCost fall within the parameter tuning category of scientific software usage.
Traditional approaches range from grid search (scan) and constrained optimization to Bayesian methods (Goodfellow et al., 2016; Snoek et al., 2012), including tree-structured Parzen estimators (Bergstra et al., 2011), SMAC with random forest surrogates (Hutter et al., 2011), and multi-fidelity approaches like Hyperband (Li et al., 2018) and BOHB (Falkner et al., 2018). Related benchmarks like Design-Bench (Trabucco et al., 2022) and Inverse-Bench (Zheng et al., 2025) target case-by-case parameter optimization but consider cost implicitly as optimization steps, which only holds for tools with uniform costs. HPOBench (Eggensperger et al., 2021) and YAHPO (Pfisterer et al., 2022) explicitly track evaluation costs but focus on ML hyperparameters. We adopt brute-force scanning as our reference and include a standard GP-based Bayesian optimization baseline (Section 3.5). A comprehensive comparison against the full HPO toolkit is an important direction for future work.

3. The SimulCost Toolkit

3.1. Overview

We present SimulCost, the first comprehensive benchmark for evaluating LLM agents' success rate and cost-efficiency in physics simulation parameter tuning (see Figure 1 for a schematic overview). Our 12 simulators span fluid dynamics, solid mechanics, and plasma physics. We evaluate both single-round inference, assessing physics and numerical intuition, and multi-round inference, testing adaptive parameter optimization through solver interaction (Section 3.3). Beyond static evaluation, we open-source the complete solver library as an extensible Toolkit for benchmark extension (Section 3.2).
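The platform-independent, FLOP-based cost tracking introduced above can be made concrete with a minimal sketch. The explicit 1D heat-equation stepper below and its `OPS_PER_NODE` constant are hypothetical stand-ins for illustration only; the benchmark's actual per-solver cost formulas are given in Appendix A.2.

```python
import numpy as np

def heat1d_explicit(nx, t_end=0.1, alpha=1.0, cfl=0.25):
    """Explicit FTCS solver for u_t = alpha * u_xx on [0, 1], u = 0 at the ends.

    Alongside the solution it returns an analytic FLOP count: each step
    touches every interior node with a fixed number of arithmetic
    operations, so cost ~ OPS_PER_NODE * nx * n_steps, independent of
    the machine it runs on.
    """
    OPS_PER_NODE = 6                      # illustrative stencil op count
    dx = 1.0 / (nx - 1)
    dt = cfl * dx**2 / alpha              # stability-limited explicit timestep
    n_steps = int(np.ceil(t_end / dt))
    x = np.linspace(0.0, 1.0, nx)
    u = np.sin(np.pi * x)                 # initial condition
    for _ in range(n_steps):
        u[1:-1] += alpha * dt / dx**2 * (u[2:] - 2.0 * u[1:-1] + u[:-2])
    return u, OPS_PER_NODE * (nx - 2) * n_steps

# Halving dx quadruples n_steps (dt ~ dx^2), so cost grows roughly 8x:
u_coarse, cost_coarse = heat1d_explicit(nx=51)
u_fine, cost_fine = heat1d_explicit(nx=101)
```

Doubling the resolution multiplies the FLOP count by roughly eight here, which is exactly the accuracy-versus-cost tension the benchmark's tasks probe.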
Each solver presents LLMs with parameter tuning tasks: given a physics-based simulation scenario, the LLM must select parameter values that satisfy accuracy requirements while minimizing computational cost. In reality, simulators usually require tuning multiple parameters for a successful execution. However, domain experts usually isolate potential issues upon seeing unsuccessful simulation runs and tune the corresponding parameters one by one. We adopt this pattern and constrain the LLM to tune individual parameters while fixing others to reasonable (but non-optimal) values chosen by rule-of-thumb practices (e.g., CFL = 0.25 for explicit time-stepping). This isolation avoids the complexity of multi-parameter optimization and enables a meaningful grid search (scan) baseline.

For each solver, we define the tool cost using computational complexity analysis by counting dominant-operation FLOPs, which strongly correlates with wall time on a single-core processor; we defer each solver's cost formulas to Appendix A.2. We also include EPOCH (Arber et al., 2015), a mature and widely used particle-in-cell (PIC) plasma simulation code, as a representative real-world production solver. Since EPOCH is a compiled binary with particle dynamics that resist closed-form FLOP estimation, we use wall-clock time on a fixed hardware configuration as its cost metric (Section A.2).

We evaluate LLM performance at three accuracy levels (low, medium, high) and create task variations by combining different initial/boundary conditions, environmental variables, non-target parameter settings, and the accuracy-level requirement (Appendix A.1). In all, SimulCost comprises 2,916 single-round tasks and 1,900 multi-round tasks across 12 solvers (Table 2).

3.2. Dataset Curation

Our curation pipeline has four phases (detailed in Appendix A.1): (1) Solver development: developing new or adapting existing solvers; (2) Reference solution search: finding reference parameters for each task using brute-force search (Algorithms 1 and 2); (3) Variation design: scaling up task diversity by combining accuracy thresholds (low/medium/high), initial/boundary conditions, environments, and non-target parameter settings; and (4) Filtering: since experts define the accuracy thresholds consistently for all combinations, some may fail to find a solution that meets the threshold within the maximum scan iterations. We filter these infeasible tasks post hoc to avoid unnecessarily high computational time for later LLM evaluations.

Expert contributions. Our expert team includes 11 experts in fields related to physics, mechanical engineering, and nuclear engineering: 2 professors, 3 postdocs, 2 doctoral students, and 5 undergraduates. Experts provided physics background knowledge, developed or adapted solvers, designed parameter tuning tasks, scaled up scenario variations, and established accuracy thresholds. They documented specifications in a consistent markdown format, portions of which convert to LLM prompts (Appendix B.3). All experts received coding and documentation training before curation. We followed institutional ethical standards and compensated all participants.

Data sources. All solvers are sourced from textbooks, online repositories, and research papers (per-solver details in Appendix A.2). Original scripts typically contain bugs or lack cost-related metadata, requiring expert debugging and adaptation for quality assurance, cost calculation, and consistent API design.

Quality validation. Solvers are cross-validated by a separate expert when domain experts submit pull requests for developed solvers.
We also verify that the optimal parameters found by the scan algorithms have diverse distributions, preventing exploitation of biased reference solutions.

Task groups. To analyze performance patterns across parameter types, we categorize tunable parameters into four groups based on their role in simulation. Spatial: grid resolution parameters such as nx and dx that control discretization density, appearing in all solvers. Temporal: timestep parameters such as dt and cfl that control the time integration step. Tolerance: convergence tolerance parameters such as tol and error_threshold that determine when iterative solvers terminate early. Misc: physics-specific parameters, such as limiter coefficients in reconstruction and relaxation factors in iteration (Patankar, 1980), that vary case by case. This categorization enables two analyses: (1) whether LLMs' pre-training knowledge favors common parameter types (Spatial, Temporal) over solver-specific ones (Misc), and (2) whether within-group task correlations exceed between-group correlations, which would suggest knowledge-transfer potential across solvers sharing parameter types. The parameter distribution across groups is: Spatial: 8 parameters, Temporal: 3, Tolerance: 5, Misc: 7. Table 15 in the Appendix lists the complete parameter-to-group mapping.

Toolkit. We also open-source the complete solver library and API (simulcost-tools) as an extensible Toolkit for benchmark extension. Unlike static datasets, the Toolkit offers: (1) source code for all 12 solvers with standardized wrapper APIs, and (2) a configuration-consistent interface (based on Hydra (Yadan, 2019)) for defining simulation scenarios that is easy to replicate and extend.
Users can modify existing scenarios by adjusting parameter ranges or accuracy thresholds, create new task variations from existing solvers, or integrate entirely new solvers following the established cost-tracking workflow.

3.3. Inference Modes

Expert parameter tuning in physics-based simulations requires two distinct capabilities: (1) physical and numerical intuition for initial parameter estimation, and (2) iterative adjustment through interactions with solvers. We design two inference modes to evaluate these skills, measuring both success rate and cost efficiency.

Single-Round Inference requires single-attempt parameter selection using the LLM's prior knowledge to find the optimal parameter balancing accuracy and cost. Since the true optimum in continuous parameter space is unknown, we set the reference as the near-optimal solution with near-minimal cost that meets the accuracy threshold, found by the scan algorithm (Algorithms 1 and 2). Since a discrete scan cannot guarantee a "true optimum" in continuous parameter space, LLMs can potentially match or even exceed the reference efficiency of 1.0, indicating expert-level intuition.

Multi-Round Inference allows up to 10 trials. Human experts usually only have patience for 5–10 trials before resorting to systematic sweeps, and our scan baseline uses approximately 20 grid points, so 10 trials gives LLMs half the budget of exhaustive search. Each trial's simulation feedback includes: (a) convergence status, (b) RMSE between the current and a finer resolution to judge convergence, and (c) accumulated cost, appended to the conversation list. LLMs can terminate early if they deem a solution satisfactory. The reference cost is the accumulated cost from the scan algorithm (Algorithm 1).
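The multi-round protocol can be sketched as a driver loop. Here `propose` stands in for the LLM call and `solve` for a simulator wrapper; both interfaces and the toy example are hypothetical, meant only to show the feedback loop (error estimate against a finer run, accumulated cost, early termination) within the 10-trial budget.

```python
def run_multi_round(solve, propose, tolerance, max_trials=10):
    """Drive a trial-and-error tuning session (sketch).

    solve(value) -> (qoi, cost): a scalar quantity of interest and the
        simulation cost of one run at parameter `value`.
    propose(history) -> (value, stop): the agent's next parameter and an
        early-termination flag.
    Per-trial feedback, the error estimate against a 2x finer run and the
    accumulated cost, is appended to the history the agent sees,
    mirroring the conversation-list feedback of multi-round mode.
    """
    history, total_cost = [], 0.0
    for _ in range(max_trials):
        value, stop = propose(history)
        qoi, cost = solve(value)
        qoi_fine, cost_fine = solve(2 * value)   # finer-run convergence check
        total_cost += cost + cost_fine
        err = abs(qoi - qoi_fine)
        history.append({"value": value, "error": err, "total_cost": total_cost})
        if stop or err < tolerance:
            break
    return history

# Toy usage: the "solver" converges as 1/value with cost linear in value;
# the "agent" doubles the parameter until the error estimate is small enough.
toy_solve = lambda v: (1.0 + 1.0 / v, float(v))
toy_propose = lambda h: (h[-1]["value"] * 2 if h else 8, False)
log = run_multi_round(toy_solve, toy_propose, tolerance=0.01)
```

The toy agent refines 8 → 16 → 32 → 64 and stops once the finer-run error estimate drops below the tolerance, with every trial (including the finer verification runs) charged to the accumulated cost.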
An efficiency near 1.0 here merely indicates the LLM matched brute-force scan, a low bar since scan can be invoked even without an LLM; LLMs, with their reasoning ability, are ideally expected to achieve efficiency higher than 1.0.

Multi-round mode applies to 11 of 23 parameters with monotonic cost-accuracy relationships, comprising 1,900 tasks. The remaining 12 parameters are evaluated only in single-round mode. These excluded parameters tend to have non-monotonic cost-accuracy relationships, making them arguably the most challenging cases. However, non-monotonic parameters lack an intuitive progressive refinement path, so no simple multi-round baseline exists.

3.4. Evaluation Metrics

Our metrics cover both the success rate and the cost efficiency of LLM-proposed parameters. Both are computed per task instance, enabling meaningful within-solver comparisons without requiring cross-domain cost normalization.

Success Rate. Success S_i ∈ {0, 1} for task i indicates whether the task-dependent distance between the LLM-proposed simulation output and the reference output (found by Algorithms 1 or 2) falls within the accuracy threshold. See Table 2 for per-solver metric definitions and thresholds.

Efficiency. Efficiency measures how well the LLM's cost compares to the reference cost:

    E_i = (C_i^bf / C_i^{sr,mr}) × S_i,    (1)

where C_i^bf is the brute-force reference cost, C_i^{sr,mr} is the LLM's cost, and the superscript "sr" or "mr" denotes single-round or multi-round mode, respectively. For single-round, C_i^{bf,sr} is the near minimum-cost solution; for multi-round, C_i^{bf,mr} is the accumulated scan cost.

For aggregated results, we report the arithmetic mean for success rate and the geometric mean for efficiency over successful samples only (E_i > 0). The geometric mean is standard practice for aggregating performance ratios and speedups (Henning, 2006; Mattson et al., 2020): it is scale-invariant, handles values spanning multiple orders of magnitude without being dominated by outliers, and treats multiplicative relationships symmetrically. Failed tasks (E_i = 0) are excluded from efficiency aggregation.

3.5. Bayesian Optimization Baseline

For multi-round mode, we compare LLMs against Bayesian Optimization with a Gaussian Process surrogate (BO-GP), a classical black-box optimization approach. Single-round BO would reduce to random sampling since the GP requires prior observations. Implementation details are in Section C.6.

Figure 2. Overall performance across models, accuracy levels, and inference modes. (a) Success rate; (b) efficiency. SR = Single-Round, MR = Multi-Round; High/Med/Low = accuracy-level requirements.

3.6. Ablation Study Setup

Following standard practice for ablation studies, we evaluate on a core subset rather than the full benchmark. Specifically, we use 5 datasets (Euler 1D, Heat 1D, MPM 2D, NS Transient 2D, HM Nonlinear) spanning the fluid dynamics, solid mechanics, and plasma physics domains.

In-Context Learning (ICL) Variants. In practice, simulation teams accumulate experiment logs that could serve as retrieval-augmented context, a realistic RAG setting where past runs inform new parameter choices. We design three ICL variants to isolate specific information contributions: (1) Idealized ICL: accuracy-matched examples with full cost information, representing a well-curated experiment log; (2) Cost-Ignorant ICL: the same accuracy-matched examples but without cost data, testing whether cost awareness is essential; and (3) Mixed-Accuracy ICL: examples from all accuracy levels without matching, simulating a realistic log where historical runs target varying accuracy requirements. This ablation is evaluated with Claude-3.7-Sonnet only.
Based on our results, this model is a mid-tier performer, ranked 2nd of 5 models in single-round success rate, with no significant gaps to the top or bottom models.

Reasoning Effort Variants. A natural hypothesis is that deliberate reasoning about tunable parameters can narrow the search space and improve initial guesses or explorations. GPT-5 exposes a "reasoning effort" parameter controlling the depth of chain-of-thought reasoning before answering. We compare the Minimal and High settings against the default (Medium) to test whether extended reasoning improves parameter tuning in both single-round and multi-round settings.

4. Experiments

4.1. Experimental Setup

We evaluate state-of-the-art LLMs including GPT-5-2025-08-07, Claude-3.7-Sonnet-2025-02-19, GPT-OSS-120B, Llama-3-70B-Instruct, and Qwen3-32B across 12 datasets spanning fluid dynamics, solid mechanics, and plasma physics. Our evaluation encompasses both single-round and multi-round inference modes at three accuracy levels (low, medium, high). Temperature is set to 0.0 for all models except GPT-5, which uses its default setting. Additionally, we evaluate GPT-5's reasoning effort parameter (Minimal, Medium, High).

4.2. Main Results

Figure 2 presents overall performance across models and accuracy levels. In single-round mode, success rates vary substantially across models (46–64%), with GPT-5 leading at 63.8%. Even this best rate falls short of practical reliability, requiring manual intervention every 2–3 simulations. Multi-round inference substantially improves all models, with GPT-5 and GPT-OSS-120B both reaching ~80% and even Llama-3-70B reaching 71%.

Regarding performance across different accuracy levels, single-round success drops sharply as accuracy requirements tighten (65% → 41%), while multi-round remains stable (75% → 72%).

Efficiency interpretation differs by mode: in single-round, efficiency > 1.0 means the LLM beat the optimal solution's cost.
In multi-round, the reference is the cumulative brute-force cost, so efficiency ≈ 1.0 merely matches naive search. In single-round, all models achieve 0.16–0.50 efficiency, using 2–6× optimal compute. In multi-round, only GPT-OSS-120B approaches 1.0 (at low accuracy), while other models cluster around 0.4–0.7, taking 1.5–2.5× brute-force cost.

Figure 3. Key findings. (a) Forest plot of multi-round improvement by model and accuracy; all differences significant at p < 0.05 except Qwen3-32B at low accuracy. (b)–(c) Task group summary: aggregated success rate and efficiency by parameter type; single-round (solid) vs. multi-round (hollow). (d) Task correlations: within-group vs. between-group shows no significant difference. (e)–(f) In-context learning (ICL) variants improve single-round success but degrade multi-round; cost information in examples preserves efficiency.

4.3. Single-Round vs. Multi-Round Comparison

Figure 3(a) quantifies the improvement from multi-round inference using McNemar's test on paired samples. We observe that the success-rate gain is most pronounced at high accuracy levels, with a mean improvement of +28.9% across models (p < 0.001), precisely where single-round struggles most. Lower accuracy levels also benefit, but to a lesser extent. This asymmetry has a clear explanation: higher accuracy requirements narrow the acceptable parameter range, making "lucky guesses" increasingly unlikely. Multi-round mode compensates through trial-and-error, making it necessary for high-accuracy tasks. However, as noted above, LLM trial-and-error is slower than brute-force scanning, so practitioners should let LLMs invoke scanning algorithms in this setting. See detailed statistics in Table 8 and robustness analysis in Appendix C.3.

4.4. Task Group Analysis

Using the parameter categorization from Section 3.2, we analyze performance across the Spatial, Temporal, Tolerance, and Misc parameter groups (Figures 6a–6b). Success rate varies substantially across parameter groups in single-round mode (29–70%). Spatial and Tolerance parameters cluster at the high end, likely benefiting from predictable cost-accuracy trade-offs in pre-training data, while Misc parameters show the widest variation due to their solver-specific nature. Multi-round mode compresses this variation (all groups: 61–79%), with Misc parameters gaining most (+23%). This pattern suggests that iterative exploration compensates for missing prior knowledge, benefiting uncommon parameters the most.

Efficiency patterns reveal a paradox: in single-round, Misc parameters (the hardest to solve) achieve near-optimal efficiency when successful, while Spatial parameters lag far behind. This suggests LLMs can find runnable values for common parameters like grid resolution but lack cost intuition, tending toward "safer" values that work but waste compute. In multi-round, Misc parameters exceed 1.0 efficiency at low accuracy and slightly outperform exhaustive search, while the other groups remain below.

We analyzed within- vs. between-group task correlations to assess knowledge-transfer potential. Neither single-round nor multi-round showed significant within-group correlation advantages (p = 0.35 and p = 0.97, respectively), implying task difficulty is parameter-specific rather than type-driven. This limits the viability of fine-tuning on cheap simulators to improve performance on expensive ones. Full plots and correlation matrices appear in Appendices C.8–C.9.
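The aggregate numbers reported in Section 4 follow the Section 3.4 recipe: arithmetic-mean success rate and geometric-mean efficiency over successful tasks only. A minimal sketch with illustrative inputs (the toy costs below are not benchmark data):

```python
import math

def aggregate(ref_costs, llm_costs, successes):
    """Aggregate per-task results as in Section 3.4 (sketch).

    Success rate is an arithmetic mean over all tasks; efficiency
    E_i = C_ref / C_llm is aggregated with a geometric mean over
    successful tasks only, since it is a ratio that can span orders
    of magnitude.
    """
    success_rate = sum(successes) / len(successes)
    ratios = [c_ref / c_llm
              for c_ref, c_llm, s in zip(ref_costs, llm_costs, successes) if s]
    geo_eff = (math.exp(sum(math.log(r) for r in ratios) / len(ratios))
               if ratios else 0.0)
    return success_rate, geo_eff

# Three illustrative tasks; the failed third task is excluded from
# the efficiency mean: geometric mean of [2.0, 0.25] = sqrt(0.5).
rate, eff = aggregate(ref_costs=[100.0, 100.0, 100.0],
                      llm_costs=[50.0, 400.0, 100.0],
                      successes=[1, 1, 0])
```

The geometric mean keeps the aggregate symmetric under ratios: one 2× speedup and one 4× slowdown average to sqrt(0.5) ≈ 0.71 rather than being dominated by either extreme.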
In-Context Learning

Given that solver-specific parameters such as the Misc group lack reliable priors, a natural question is: can we leverage historical examples to improve performance? Following the ablation setup in Section 3.6, we evaluate three ICL variants against Base (no examples) on the 5-dataset ablation subset using Claude-3.7-Sonnet: Idealized ICL (accuracy-matched examples with cost), Cost-Ignorant ICL (accuracy-matched examples without cost), and Mixed-Accuracy ICL (diverse examples without matching). The latter two variants simulate realistic scenarios where experiment logs may be partially lost or imperfectly matched to new requirements.

Figures 3(d)–(e) reveal a key trade-off: ICL improves single-round success but degrades multi-round performance. This pattern holds across all variants, suggesting that examples help narrow initial guess ranges but limit exploration, steering models toward the demonstrated parameter regimes.

Comparing variants in single-round mode: Mixed-Accuracy ICL achieves the largest gains in both success (+24%) and efficiency (1.5×), likely because diverse examples span a wider parameter range. Cost-Ignorant ICL shows comparable success gains but minimal efficiency improvement, suggesting that cost information in examples is critical for efficiency.

Across task groups, Tolerance parameters benefit most from ICL with the largest gains, while Spatial parameters show modest improvement. This mirrors the single-round vs. multi-round pattern and suggests that parameters less common in pre-training benefit most from additional demonstrations. Detailed breakdowns appear in Appendix C.4.

4.6. Bayesian Optimization Baseline

We compare LLM performance against Bayesian optimization with a Gaussian process surrogate (BO-GP) across all 12 solvers in multi-round mode. BO achieves aggregate success rates comparable to LLMs but shows higher inter-solver variance.
Efficiency-wise, LLMs have a clear advantage, especially at low accuracy requirements (2.03 vs. 1.02). We found that BO's exploration strategy tends to choose extreme values early to maximize information gain. However, this triggers early stopping if it selects the "finer" side of the extreme bounds, resulting in high cumulative cost. In contrast, LLMs leverage physics intuition from pre-training to make more informed initial guesses, which is especially helpful at low accuracy requirements where thresholds are more lenient. Full results appear in Tables 13 and 14. Implementation details are in Appendix C.6.

4.7. Additional Results

We summarize two additional findings, with detailed analyses in the Appendix. Reasoning Effort: GPT-5's reasoning-effort parameter shows no significant overall impact, as increased reasoning does not lead to better parameter selection (Section C.5). Failure Modes: We identify five recurring failure patterns (false positives, blind exploration, instruction misunderstanding, prior bias, and conservative strategy) to guide future improvements (Section B.6).

5. Limitations

Our benchmark restricts LLMs to text-based solver interactions without access to auxiliary tools. Real-world agents might leverage log parsing or custom timeout logic, but this constraint ensures a fair comparison: many LLMs do not yet support full agentic mode reliably, and the choice of framework introduces confounding variables. In practice, our expert curation filters out unstable solver configurations, and we enforce hard timeouts for potentially long-running cases, so failure recovery is rarely needed.

Additionally, we isolate single-parameter tuning rather than joint multi-parameter optimization. Baseline tractability drives this choice: grid-search complexity grows as O(n^k), where n is the number of choices per parameter and k is the number of parameters.
With n = 20 grid points, single-parameter search requires 20 evaluations, while 3-parameter joint search requires 20³ = 8,000, making exhaustive reference solutions prohibitively expensive. Bayesian optimization could reduce this cost but does not handle integer parameters natively and introduces its own tuning decisions (surrogate model, acquisition function). As the first benchmark that considers tool cost in this domain, we opt for a simple, assumption-free reference and include only one typical Bayesian approach as a baseline.

6. Conclusions and Future Directions

We introduced SimulCost, the first benchmark for evaluating the cost-aware parameter-tuning capabilities of LLMs in physics simulations. Our evaluation spans 12 physics simulators and 5 state-of-the-art LLMs. We also open-source SimulCost upon publication, including the static benchmark and extensible toolkit, enabling the community both to explore improvements to the cost-awareness of LLM-based scientific agents and to create new simulation environments.

Practical recommendations. Based on our findings: (1) LLMs' initial guesses are generally unreliable (46–64% success) and only useful for quick previews when users lack parameter intuition, accuracy requirements are low, and optimal cost efficiency is not needed. (2) For high-accuracy tasks, or for finding a guaranteed cost-efficient configuration, multi-round mode becomes necessary (71–80% success), but practitioners should let LLMs invoke scanning algorithms rather than relying on internal reasoning alone, as the latter is 1.5–2.5× slower. (3) Fine-tuning on cheaper simulators is unlikely to transfer to expensive ones due to the lack of cross-parameter correlation, even within the same parameter type.
(4) ICL improves single-round performance but anchors models to demonstrated regimes, degrading multi-round exploration; hence, naive RAG may not be a complete solution. (5) Using examples from a diverse range of accuracy requirements helps exploration, and including cost information helps optimize cost efficiency.

Future directions. Several extensions would strengthen cost-aware scientific agents: (1) Tool-augmented tuning: equipping LLMs with timeout-based early stopping, callable search algorithms, and multi-modal feedback such as field visualizations to enable richer decision-making. (2) Human-in-the-loop evaluation: user studies measuring how LLM-suggested parameters accelerate expert workflows, rather than tuning autonomously, would validate real-world utility. (3) Cost-aware post-training: developing fine-tuning strategies that explicitly optimize for both accuracy and computational efficiency. (4) Multi-parameter optimization: enabling LLMs to jointly tune interdependent parameters with adaptive sampling approaches. (5) Parallel scaling: extending cost tracking to parallel infrastructure with analytical complexity models that account for overhead and delays, or with careful, reproducible wall-time measurements.

Author Contributions

Y. Cao led the project and major solver development, and contributed to conceptual design. S. Lai led LLM evaluation engineering. Y. Zhang led Bayesian optimization and active learning evaluation engineering. J. Huang, Z. Lawrence, R. Bhakta, I. F. Thomas, M. Cao, and Y. Zhao contributed to solver development. C.-H. Tsai contributed to LLM evaluation engineering. Z. Zhou contributed to conceptual design, LLM experimental design, and open-source support. H. Liu contributed to conceptual design. A. Marinoni and A. Arefiev contributed to conceptual design and solver development. R. Yu contributed to conceptual design and computing-resource management.
All authors contributed to discussion and manuscript revision.

References

Arber, T. D., Bennett, K., Brady, C. S., Lawrence-Douglas, A., Ramsay, M. G., Sircombe, N. J., Gillies, P., Evans, R. G., Schmitz, H., Bell, A. R., and Ridgers, C. P. Contemporary particle-in-cell approach to laser-plasma modelling. Plasma Physics and Controlled Fusion, 57(11):113001, 2015.

Arisaka, S. and Li, Q. Accelerating legacy numerical solvers by non-intrusive gradient-based meta-solving, 2024. URL https://arxiv.org/abs/2405.02952.

Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, volume 24, 2011.

Boyd, J. P. Chebyshev and Fourier Spectral Methods. Courier Corporation, 2001.

Bragg, J., D'Arcy, M., Balepur, N., Bareket, D., Dalvi, B., Feldman, S., Downey, D., Hope, T., Singh, A., et al. AstaBench: Rigorous benchmarking of AI agents with a scientific research suite, 2025. URL https://arxiv.org/abs/2510.21652.

Bran, A. M., Cox, S., Schilter, O., Baldassari, C., White, A. D., and Schwaller, P. Augmenting large language models with chemistry tools. Nature Machine Intelligence, 6:525–535, 2024.

Cao, Y., Chen, Y., Li, M., Yang, Y., Zhang, X., Aanjaneya, M., and Jiang, C. An efficient B-spline Lagrangian/Eulerian method for compressible flow, shock waves, and fracturing solids. ACM Transactions on Graphics (TOG), 41(5):1–13, 2022.

Cao, Y., Zhao, Y., Li, M., Yang, Y., Choo, J., Terzopoulos, D., and Jiang, C. Unstructured moving least squares material point methods: a stable kernel approach with continuous gradient reconstruction on general unstructured tessellations. Computational Mechanics, 75(2):655–678, 2025.

Chan, J. S., Chowdhury, N., Jaffe, O., Aung, J., Sherburn, D., Mays, E., Starace, G., Liu, K., Maksin, L., Patwardhan, T., et al. MLE-bench: Evaluating machine learning agents on machine learning engineering.
arXiv preprint arXiv:2410.07095, 2024.

Chen, T., Anumasa, S., Lin, B., Shah, V., Goyal, A., and Liu, D. Auto-Bench: An automated benchmark for scientific discovery in LLMs, 2025. URL https://arxiv.org/abs/2502.15224.

Chen, Y., Zhu, X., Zhou, H., and Ren, Z. MetaOpenFOAM: an LLM-based multi-agent framework for CFD, 2024a. URL https://arxiv.org/abs/2407.21320.

Chen, Z., Chen, S., Ning, Y., Zhang, Q., Wang, B., Yu, B., Li, Y., Liao, Z., Wei, C., Lu, Z., Dey, V., Xue, M., Baker, F. N., Burns, B., Adu-Ampratwum, D., Huang, X., Ning, X., Gao, S., Su, Y., and Sun, H. ScienceAgentBench: Toward rigorous assessment of language agents for data-driven scientific discovery. arXiv preprint arXiv:2410.05080, 2024b.

Dinh, T. A., Mullov, C., Bärmann, L., Li, Z., Liu, D., Reiß, S., Lee, J., Lerzer, N., Gao, J., Peller-Konrad, F., Röddiger, T., Waibel, A., Asfour, T., Beigl, M., Stiefelhagen, R., Dachsbacher, C., Böhm, K., and Niehues, J. SciEx: Benchmarking large language models on scientific exams with human expert grading and automatic grading. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 11592–11610. Association for Computational Linguistics, 2024.

Eggensperger, K., Müller, P., Mallik, N., Feurer, M., Sass, R., Klein, A., Awad, N., Lindauer, M., and Hutter, F. HPOBench: A collection of reproducible multi-fidelity benchmark problems for HPO. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.

Elrefaie, M., Qian, J., Wu, R., Chen, Q., Dai, A., and Ahmed, F. AI agents in engineering design: A multi-agent framework for aesthetic and aerodynamic car design, 2025.

Falkner, S., Klein, A., and Hutter, F. BOHB: Robust and efficient hyperparameter optimization at scale.
In International Conference on Machine Learning, pp. 1437–1446. PMLR, 2018.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning, volume 1. MIT Press, 2016.

Hasegawa, A. and Mima, K. Pseudo-three-dimensional turbulence in magnetized nonuniform plasma. The Physics of Fluids, 21(1):87–92, 1978.

Henning, J. L. SPEC CPU2006 benchmark descriptions. ACM SIGARCH Computer Architecture News, 34(4):1–17, 2006.

Huang, Y., Shi, J., Li, Y., Fan, C., Wu, S., Zhang, Q., Liu, Y., Zhou, P., Wan, Y., Gong, N. Z., and Sun, L. MetaTool benchmark for large language models: Deciding whether to use tools and which to use. In The Twelfth International Conference on Learning Representations, 2024.

Hughes, T. J. The Finite Element Method: Linear Static and Dynamic Finite Element Analysis. Courier Corporation, 2003.

Hutter, F., Hoos, H. H., and Leyton-Brown, K. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pp. 507–523. Springer, 2011.

Jones, D. R., Schonlau, M., and Welch, W. J. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998. doi: 10.1023/A:1008306431147.

Kapoor, S., Stroebl, B., Siegel, Z. S., Nadgir, N., and Narayanan, A. AI agents that matter. arXiv preprint arXiv:2407.01502, 2024.

Kumar, G. and Nair, A. G. Dominant balance-based adaptive mesh refinement for incompressible fluid flows, 2024.

Lai, Y., Li, C., Wang, Y., Zhang, T., Zhong, R., Zettlemoyer, L., Yih, S. W., Fried, D., Wang, S., and Yu, T. DS-1000: A natural and reliable benchmark for data science code generation. In Proceedings of the 40th International Conference on Machine Learning, pp. 18319–18345, 2023.

Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A.
Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18:1–52, 2018.

Li, M., Ferguson, Z., Schneider, T., Langlois, T. R., Zorin, D., Panozzo, D., Jiang, C., and Kaufman, D. M. Incremental potential contact: intersection- and inversion-free, large-deformation dynamics. ACM Transactions on Graphics, 39(4):49, 2020.

Li, S., Marwah, T., Shen, J., Sun, W., Risteski, A., Yang, Y., and Talwalkar, A. CodePDE: An inference framework for LLM-driven PDE solver generation, 2025. URL https://arxiv.org/abs/2505.08783.

Majumder, B. P., Surana, H., Agarwal, D., Mishra, B. D., Meena, A., Prakhar, A., Vora, T., Khot, T., Sabharwal, A., and Clark, P. DiscoveryBench: Towards data-driven discovery with large language models, 2024. URL https://arxiv.org/abs/2407.01725.

Mattson, P., Cheng, C., Coleman, C., Diamos, G., Micikevicius, P., Patterson, D., Tang, H., Wei, G., Bailis, P., Bittorf, V., Brooks, D., Chen, D., Dutta, D., Gupta, U., Hazelwood, K., Hock, A., Huang, X., Ike, A., Jia, B., Kang, D., Kanter, D., Kumar, N., Liao, J., Ma, G., Narayanan, D., Oguntebi, T., Pekhimenko, G., Pentecost, L., Reddi, V. J., Robie, T., St. John, T., Tabber, T., Wu, C., Xu, L., Yamazaki, M., Young, C., and Zaharia, M. MLPerf: An industry standard benchmark suite for machine learning performance. IEEE Micro, 40(2):8–16, 2020.

Mockus, J., Tiesis, V., and Zilinskas, A. The Application of Bayesian Methods for Seeking the Extremum, volume 2, pp. 117–129. 2014. ISBN 0-444-85171-2.

Nathani, D., Madaan, L., Roberts, N., Bashlykov, N., Menon, A., Moens, V., Budhiraja, A., Magka, D., Vorotilov, V., Chaurasia, G., Hupkes, D., Cabral, R. S., Shavrina, T., Foerster, J., Bachrach, Y., Wang, W. Y., and Raileanu, R. MLGym: A new framework and benchmark for advancing AI research agents, 2025. URL https://arxiv.org/abs/2502.14499.
Nogueira, F. Bayesian Optimization: Open source constrained global optimization tool for Python, 2014–. URL https://github.com/bayesian-optimization/BayesianOptimization.

Pandey, S., Xu, R., Wang, W., and Chu, X. OpenFOAMGPT: a RAG-augmented LLM agent for OpenFOAM-based computational fluid dynamics. Physics of Fluids, 37(3):035120, 2025.

Patankar, S. V. Numerical Heat Transfer and Fluid Flow. Hemisphere Publishing Corporation, 1980.

Patil, S. G., Zhang, T., Wang, X., and Gonzalez, J. E. Gorilla: Large language model connected with massive APIs, 2023.

Pfisterer, F., Schneider, L., Moosbauer, J., Binder, M., and Bischl, B. YAHPO Gym - an efficient multi-objective multi-fidelity benchmark for hyperparameter optimization. In First Conference on Automated Machine Learning (Main Track), 2022.

Ren, S., Jian, P., Ren, Z., Leng, C., Xie, C., and Zhang, J. Towards scientific intelligence: A survey of LLM-based scientific agents, 2025. URL https://arxiv.org/abs/2503.24047.

Schmidgall, S., Su, Y., Wang, Z., Sun, X., Wu, J., Yu, X., Liu, J., Moor, M., Liu, Z., and Barsoum, E. Agent Laboratory: Using LLM agents as research assistants, 2025. URL https://arxiv.org/abs/2501.04227.

Science Machine. BioML-bench: Evaluation of AI agents for end-to-end biomedical ML. bioRxiv, 2025. doi: 10.1101/2025.09.01.673319.

Shen, Y., Song, K., Tan, X., Zhang, W., Ren, K., Yuan, S., Lu, W., Li, D., and Zhuang, Y. TaskBench: Benchmarking large language models for task automation. In Advances in Neural Information Processing Systems, 2024.

Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 25, 2012.

Somasekharan, N., Yue, L., Cao, Y., Li, W., Emami, P., Bhargav, P.
S., Acharya, A., Xie, X., and Pan, S. CFDLLMBench: A benchmark suite for evaluating large language models in computational fluid dynamics. arXiv preprint arXiv:2509.20374, 2025.

Tang, Q., Deng, Z., Lin, H., Han, X., Liang, Q., Cao, B., and Sun, L. ToolAlpaca: Generalized tool learning for language models with 3000 simulated cases, 2023. URL https://arxiv.org/abs/2306.05301.

Thakur, P. and Nadarajah, S. A stabilized march approach to adjoint-based sensitivity analysis of chaotic flows, 2025.

Tian, M., Gao, L., Zhang, S. D., Chen, X., Zheng, K., Chen, J., Zhang, D., Lu, Y., Zhao, H., Wang, P., Pei, J., and Zhang, Y. SciCode: A research coding benchmark curated by scientists. arXiv preprint arXiv:2407.13168, 2024.

Trabucco, B., Geng, X., Kumar, A., and Levine, S. Design-Bench: Benchmarks for data-driven offline model-based optimization. In International Conference on Machine Learning, pp. 21658–21676. PMLR, 2022.

Wakatani, M. and Hasegawa, A. A collisional drift wave description of plasma edge turbulence. The Physics of Fluids, 27(3):611–618, 1984.

Wang, X., Hu, Z., Lu, P., Zhu, Y., Zhang, J., Subramaniam, S., Loomba, A. R., Zhang, S., Sun, Y., and Wang, W. SciBench: Evaluating college-level scientific problem-solving abilities of large language models. In Proceedings of the 41st International Conference on Machine Learning, pp. 50622–50649, 2024.

Wu, D., Wang, J., Meng, Y., Zhang, Y., Sun, L., and Wang, Z. CATP-LLM: Empowering large language models for cost-aware tool planning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025a.

Wu, D., Zhang, R., Zucchelli, E., et al. APBench and benchmarking large language model performance in fundamental astrodynamics problems for space engineering. Scientific Reports, 15:7944, 2025b. doi: 10.1038/s41598-025-91150-5.

Xu, L., Mohaddes, D., and Wang, Y.
LLM agent for fire dynamics simulations, 2024. URL https://arxiv.org/abs/2412.17146.

Yabe, T., Xiao, F., and Utsumi, T. The constrained interpolation profile method for multiphase analysis. Journal of Computational Physics, 169(2):556–593, 2001.

Yadan, O. Hydra - a framework for elegantly configuring complex applications. https://github.com/facebookresearch/hydra, 2019.

Yue, L., Somasekharan, N., Cao, Y., and Pan, S. Foam-Agent: Towards automated intelligent CFD workflows. arXiv preprint arXiv:2505.04997, 2025.

Zhang, D., Zhoubian, S., Cai, M., Li, F., Yang, L., Wang, W., Dong, T., Hu, Z., Tang, J., and Yue, Y. DataSciBench: An LLM agent benchmark for data science, 2025. URL https://arxiv.org/abs/2502.13897.

Zheng, H., Chu, W., Zhang, B., Wu, Z., Wang, A., Feng, B. T., Zou, C., Sun, Y., Kovachki, N., Ross, Z. E., et al. InverseBench: Benchmarking plug-and-play diffusion priors for inverse problems in physical sciences. arXiv preprint arXiv:2503.11043, 2025.

A. Dataset Details

A.1. Data generation pipeline

Overall Module Design. We designed standardized APIs for the different components of this toolkit, namely: (a) the solver module, which encapsulates the timestep of a simulator, tracks meta-information related to computational cost, and dumps the physical fields at intermediate timesteps; (b) the runner module, which supports issuing simulation runs via shell commands with support for different scenario variations; (c) the wrapper module, which provides the API interface for brute-force search and LLM function calls, including simulation-result query and comparison functions for convergence checks; (d) the brute-force search module, which implements the search algorithm to find reference solutions.
After code development and search are conducted, (e) we also have a standardized markdown template to document the physics background, numerical solution details, tunable parameters, three different levels of accuracy requirements, and scenario variations. Portions of this document are converted into LLM prompts.

Each solver's preparation and tunable-task generation follow a standardized pipeline:

1. Solver Development. An expert starts developing or adapting the solver for given physics simulation scenarios, either developing from scratch or adapting source code from textbooks, online repositories, and research papers. Development is achieved either by deriving from a base template solver class and fulfilling the abstract functionalities mentioned in the overall design, or by compiling the source code and wrapping the run script into the runner module. We require experts to create transparent cost-calculation rules following complexity analysis, and therefore to add additional meta-information to track cost calculations. All intermediate physical fields are cached to disk to enable reuse without incurring repeated runs. Experts are also required to provide an API for comparing two simulated physical fields to check their distance, where the metric formulation and threshold values are case-dependent.

2. Optimal Solution Search and Reference Setup. For each task, we find optimal solutions that meet the accuracy requirements at the lowest cost. We use two search strategies depending on parameter characteristics. For monotonic parameters (where cost and accuracy have a monotonic relationship, e.g., spatial resolution, time step size), we use progressive refinement (Algorithm 1): start with a coarse value and progressively refine with fixed ratios (e.g., halve the time step size, double the spatial resolution) until the distance between adjacent runs is within the accuracy threshold.
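The progressive-refinement loop for monotonic parameters can be sketched as follows. This is a minimal illustration, not the toolkit's actual API; `run_simulation`, `distance`, and `refine` are hypothetical stand-ins for the solver, the field-comparison function, and the fixed refinement ratio:

```python
def progressive_refinement(run_simulation, distance, theta0, refine, tol, max_levels=12):
    """Refine a monotonic parameter until adjacent runs agree within `tol`.

    run_simulation(theta) -> (field, cost); distance(a, b) -> scalar error;
    refine(theta) -> next (finer) value, e.g. lambda dt: dt / 2.
    All names are illustrative placeholders.
    """
    theta = theta0
    field, total_cost = run_simulation(theta)
    for _ in range(max_levels):
        theta_next = refine(theta)
        field_next, cost = run_simulation(theta_next)
        total_cost += cost
        if distance(field, field_next) <= tol:
            # Of the two agreeing runs, the coarser one is the cheapest
            # value that meets the accuracy threshold.
            return theta, total_cost
        theta, field = theta_next, field_next
    raise RuntimeError("no converged parameter found within max_levels")
```

For multi-round tasks, the accumulated `total_cost` over all refinement runs serves as the reference cost, matching the description above.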
For non-monotonic parameters (e.g., the relaxation factor in SOR), we use exhaustive grid search (Algorithm 2): discretize the search space within a feasible range using a uniform grid, run simulations for all choices, and pick the converged run with the lowest cost. For single-round tasks, the reference cost is the optimal cost found by the search. For multi-round tasks (monotonic parameters only), the reference cost is the accumulated cost incurred during progressive refinement.

3. Variation Design. We scale up the diversity of simulation scenarios via multiple types of variations: a) different engineering benchmarks (e.g., different initial and boundary conditions, different physical properties); b) given a tunable parameter as the search target, the other non-target tunable parameters form a combinatorial space (e.g., in a finite-volume method, if the target parameter is the spatial resolution, the non-target parameters can include the spatial discretization order, the slope limiter, etc.); c) different accuracy thresholds (low, medium, high) to reflect varying real-world design tolerances. The combination of these variations creates a large number of diverse tasks from a limited number of classical benchmarks.

4. Filtering. We conduct the optimal-solution search on all variations and keep only those for which brute-force search succeeded in finding the optimal solution. This ensures all selected tasks have a reference cost for comparison. During the pull-request phase of each solver, we ask a separate expert to conduct quality checks. In addition, statistical analysis of the optimal parameters is conducted to ensure diverse optimal-parameter distributions and to prevent models from memorizing "golden values."

A.2. Physics Backgrounds

Heat Transfer 1D (Patankar, 1980).
This solver addresses the 1D heat conduction equation:

∂T/∂t = α ∂²T/∂x²

using explicit finite-difference methods, with natural-convection boundary conditions at x = 0 and adiabatic conditions at x = L. The tunable parameters include the spatial resolution (n_space) and the CFL number (cfl) that determines the simulation time step by:

Δt = cfl × (Δx)² / (2α),

where α is the thermal diffusivity.

Algorithm 1 Brute-Force Grid Search with Progressive Refinement
1: input: domain Ω, base grid size n₀, max level L, threshold τ_{s,k}, tolerance ε_s
2: output: best feasible θ̂
3: total cost C ← 0
4: for ℓ = 0, ..., L do
5:   n_ℓ ← n₀ · 2^ℓ; construct grid G_ℓ
6:   for all θ ∈ G_ℓ not yet evaluated do
7:     solve subproblem at θ with cost T_solve(θ)
8:     evaluate E_s(θ; σ), c_s(θ; σ) with cost T_eval(θ)
9:     C ← C + T_solve(θ) + T_eval(θ)
10:  end for
11:  F_ℓ ← {θ ∈ G_ℓ : E_s(θ; σ) ≤ τ_{s,k}}
12:  if F_ℓ ≠ ∅ then
13:    θ̂_ℓ ← argmin_{θ ∈ F_ℓ} c_s(θ; σ)
14:    refine G_{ℓ+1} around θ̂_ℓ
15:  end if
16:  if Δ_ℓ ≤ ε_s then
17:    break
18:  end if
19: end for
20: return θ̂, total cost C

Algorithm 2 Exhaustive Grid Search for Non-Monotonic Parameters
1: input: domain Ω, grid size n, threshold τ_{s,k}
2: output: best feasible θ̂, total cost C
3: total cost C ← 0
4: construct uniform grid G over Ω with n points
5: for all θ ∈ G do
6:   solve subproblem at θ with cost T_solve(θ)
7:   evaluate E_s(θ; σ), c_s(θ; σ) with cost T_eval(θ)
8:   C ← C + T_solve(θ) + T_eval(θ)
9: end for
10: F ← {θ ∈ G : E_s(θ; σ) ≤ τ_{s,k}}
11: if F ≠ ∅ then
12:   θ̂ ← argmin_{θ ∈ F} c_s(θ; σ)
13: else
14:   mark task as unsolved
15: end if
16: return θ̂, total cost C
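To make concrete how cfl and the grid spacing jointly set the time step for the explicit 1D heat equation, here is a minimal interior update (an illustrative sketch, not the benchmark solver's code; boundary conditions are omitted):

```python
import numpy as np

def heat_step(T, alpha, dx, cfl):
    """One explicit finite-difference step of dT/dt = alpha * d2T/dx2.

    dt = cfl * dx**2 / (2 * alpha); with this definition the scheme is
    stable for cfl <= 1. Interior update only; boundaries omitted.
    Illustrative sketch, not the benchmark's implementation.
    """
    dt = cfl * dx**2 / (2.0 * alpha)
    T_new = T.copy()
    T_new[1:-1] = T[1:-1] + alpha * dt / dx**2 * (T[2:] - 2.0 * T[1:-1] + T[:-2])
    return T_new, dt
```

A smaller cfl, or a larger n_space (hence smaller Δx), shrinks Δt and increases the number of steps n_t, which is what drives the analytic cost described next.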
The computational cost follows the relationship C = n_space × n_t, where n_t is the number of time steps accumulated in the solver. The convergence metric is the RMSE of the heat flux at the convection boundary at the final time step. This simulation has 25 different profiles with varying initial uniform temperatures and physical properties, generating 295 tasks in total.

Steady-State Heat Transfer 2D (Patankar, 1980). This solver tackles the 2D steady-state heat conduction equation with Dirichlet boundary conditions:

∇²T = ∂²T/∂x² + ∂²T/∂y² = 0

using the Jacobi method with successive over-relaxation (SOR). The tunable parameters include the grid size (dx), the relaxation factor (relax), the termination residual threshold (error_threshold), and the uniform initial temperature guess (T_init). The computational cost scales as C = (1/dx)² × n_iter, where n_iter is the number of iterations until termination. Convergence is evaluated through the RMSE of the temperature field. The dataset comprises 8 different wall-temperature combinations, generating a total of 662 tasks.

Burgers 1D (Patankar, 1980). This solver addresses the 1D Burgers equation:

∂u/∂t + u ∂u/∂x = ν ∂²u/∂x²

using a second-order Roe method for accurate shock and rarefaction-wave capturing. The tunable parameters mirror those of the Euler 1D solver: the CFL number (cfl) that determines the simulation time step by

Δt = cfl × Δx / |u|_max,

where |u|_max is the maximum velocity magnitude; the spatial resolution (n_space); the limiter parameter (beta) for the generalized minmod flux limiter; and the blending parameter (k) between the 0th- and 1st-order interpolation schemes. The computational cost follows the relationship C = n_space × n_t, where n_t is the number of time steps accumulated in the solver.
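Across these solvers the analytic cost reduces to a simple accumulator of the form C = (work per step) × (number of steps). A minimal, platform-independent tracker of this kind might look like the following (illustrative names, not the toolkit's API):

```python
class CostTracker:
    """Accumulates analytic cost of the form C = work_per_step * n_steps.

    Counts operations instead of wall-clock time, so the resulting cost is
    platform-independent. Illustrative sketch, not the benchmark's code.
    """

    def __init__(self, work_per_step):
        self.work_per_step = work_per_step  # e.g. n_space for a 1D solver
        self.n_steps = 0

    def tick(self, steps=1):
        """Record `steps` completed solver steps (or iterations)."""
        self.n_steps += steps

    @property
    def cost(self):
        return self.work_per_step * self.n_steps
```

For example, a 1D run with n_space = 200 that takes 1,500 time steps reports C = 200 × 1,500 = 300,000, a number comparable across machines.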
The convergence metrics cover both the RMSE of the solution fields and physical conservation principles: mass conservation, the energy non-increasing property, and total variation (TV) non-increase for stability. The dataset includes 5 different initial conditions (sinusoidal wave, Sod tube, rarefaction, blast, and double shock wave) representing various wave-interaction scenarios, generating a total of 339 tasks.

Euler 1D (Patankar, 1980).¹ This solver implements the 1D Euler equations for compressible flow:

∂U/∂t + ∂F(U)/∂x = 0

using the MUSCL-Roe method with a superbee limiter for high-resolution shock capturing. The tunable parameters include the CFL number (cfl) that determines the simulation time step by

Δt = cfl × Δx / |λ|_max,

where |λ|_max is the maximum eigenvalue of the flux Jacobian; the spatial resolution (n_space); the limiter parameter (beta) for the generalized minmod flux limiter; and the blending parameter (k) between the 0th- and 1st-order interpolation schemes. The computational cost follows the relationship C = n_space × n_t, where n_t is the number of time steps accumulated in the solver. Convergence is evaluated through multiple criteria: RMSE of the solution fields, positivity preservation of density and pressure, and shock-consistency validation. The dataset encompasses 3 classical benchmark profiles (Sod shock tube, Lax problem, and Mach 3), generating a total of 197 tasks.

¹ Implementation based on https://www.psvolpiani.com/courses

Transient Navier-Stokes 2D (Yabe et al., 2001).
² This solver implements the 2D transient incompressible Navier-Stokes equations:

∂u/∂x + ∂v/∂y = 0
∂u/∂t + u ∂u/∂x + v ∂u/∂y = −∂p/∂x + (1/Re)(∂²u/∂x² + ∂²u/∂y²)
∂v/∂t + u ∂v/∂x + v ∂v/∂y = −∂p/∂y + (1/Re)(∂²v/∂x² + ∂²v/∂y²)

where u, v are the velocity components, p is the pressure, and Re is the Reynolds number. The tunable parameters include the spatial resolution (resolution) that determines the computational grid size, the CFL number (cfl) controlling time-step stability through Δt = cfl × Δx, the relaxation factor (relaxation_factor) for pressure-correction convergence, and the residual threshold (residual_threshold) for pressure-solver convergence. The computational cost follows the relationship C = 2 × resolution² × (num_steps + total_pressure_iterations), where the factor of 2 accounts for the fixed-aspect-ratio domain configuration with x_resolution = 2 × resolution. Convergence is evaluated through normalized velocity-RMSE criteria, with the temporal evolution tracked throughout the simulation. The dataset encompasses 18 benchmark profiles across 6 different boundary conditions (simple circular obstacles, complex geometries, random obstacle fields, dual inlet/outlet configurations, dense obstacle arrays, and dragon-shaped obstacles) tested at three Reynolds numbers (Re = 1000, 3000, 6000), generating a total of 244 tasks across different accuracy levels and geometric complexities.

Diffusion Reaction 1D (Patankar, 1980). This solver implements the 1D diffusion-reaction equation:

∂u/∂t = ∂²u/∂x² + f(u)

where the reaction term f(u) can be Fisher-KPP (f(u) = u(1 − u)), Allee effect (f(u) = u(1 − u)(u − a)), or Allen-Cahn cubic (f(u) = u(1 − u²)).
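The three reaction terms are simple scalar functions; for concreteness they can be written as callables (illustrative code, not the solver's API; the Allee threshold default a = 0.3 is an arbitrary example value):

```python
def fisher_kpp(u):
    """Fisher-KPP logistic growth: f(u) = u * (1 - u)."""
    return u * (1.0 - u)

def allee(u, a=0.3):
    """Allee effect with threshold a: f(u) = u * (1 - u) * (u - a).
    The default a = 0.3 is an arbitrary illustrative value."""
    return u * (1.0 - u) * (u - a)

def allen_cahn(u):
    """Allen-Cahn cubic term: f(u) = u * (1 - u**2)."""
    return u * (1.0 - u**2)
```

Each variant plugs into the same diffusion-reaction time stepper, which is what lets one solver generate tasks for all three reaction types.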
The tunable parameters include the CFL number (cfl), which controls time-step stability through $\Delta t = \mathrm{cfl} \times (\Delta x)^2$; the spatial resolution (n_space), determining the grid spacing via $\Delta x = L / n_{\mathrm{space}}$; the Newton solver tolerance (tol) for convergence control; the minimum step size (min_step) for line-search robustness; and the initial step guess (initial_step_guess) for optimization aggressiveness. The computational cost follows the relationship $C = 3 \times \mathrm{total\_newton\_iters} \times n_{\mathrm{space}} + \mathrm{total\_line\_search\_iters} \times n_{\mathrm{space}}$, where the factor of 3 accounts for the residual calculation, Jacobian assembly, and linear system solve in each Newton iteration. Convergence is evaluated through Newton residual norms, physical-bounds preservation, and wave-propagation characteristics. The dataset encompasses 3 classical reaction types (Fisher-KPP traveling waves, Allee-effect threshold dynamics, and Allen-Cahn interface dynamics), generating a total of 90 tasks across different accuracy levels and numerical parameter combinations.

EPOCH (Arber et al., 2015).³ This solver uses the EPOCH Particle-in-Cell (PIC) code to simulate 1D laser-plasma interactions, specifically the ionization of a carbon target irradiated by an ultra-intense laser pulse. The simulation solves Maxwell's equations,
$$\nabla \times \vec{E} = -\frac{\partial \vec{B}}{\partial t}, \qquad \nabla \times \vec{B} = \mu_0\left(\vec{j} + \epsilon_0 \frac{\partial \vec{E}}{\partial t}\right),$$
coupled with relativistic particle dynamics via the Lorentz force. The tunable parameters include the grid resolution (nx); the CFL multiplier (dt_multiplier), controlling the time-step size through $\Delta t = \mathrm{dt\_multiplier} \times \Delta x / c$; the number of particles per cell (npart); the field solver order (field_order ∈ {2, 4, 6}); and the particle weighting order (particle_order ∈ {2, 3, 5}). The computational cost is measured by wall-clock time, as EPOCH is an external compiled code where internal complexity analysis is not applicable. Convergence is evaluated through the RMSE of the electric field profiles.
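The EPOCH time-step rule above is a simple function of the grid spacing; a minimal sketch (the helper name is illustrative; EPOCH computes this internally from its input deck):

```python
# Speed of light in vacuum (m/s)
C_LIGHT = 299_792_458.0

def epoch_time_step(dt_multiplier, dx):
    """dt = dt_multiplier * dx / c: a fraction of the light-crossing
    time of one cell, the 1D PIC stability limit."""
    return dt_multiplier * dx / C_LIGHT
```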
The dataset encompasses 3 laser-target configurations with varying laser amplitudes and target densities, generating a total of 273 tasks across different accuracy levels and parameter combinations.

² Implementation based on https://github.com/takah29/2d-fluid-simulator
³ Implementation: https://github.com/epochpic/epoch

Euler 2D (Cao et al., 2022). This solver addresses the 2D compressible Euler equations for inviscid gas dynamics:
$$\frac{\partial U}{\partial t} + \nabla \cdot F(U) = 0$$
where $U = (\rho, \rho u, \rho v, \rho E)^T$ are the conservative variables and F represents the flux tensor, with ρ denoting density, u, v the velocity components, and E the specific total energy. The solver employs an advection-projection fractional-step method with third-order WENO reconstruction, a local Lax-Friedrichs Riemann solver, and third-order TVD Runge-Kutta time integration. The tunable parameters include the spatial resolution (n_grid_x); the CFL number (cfl), controlling the adaptive time-step size through $\Delta t = \mathrm{cfl} \cdot 2\Delta x / (|u| + \sqrt{u^2 + 4c^2})$, where c is the local sound speed; and the conjugate gradient tolerance (cg_tolerance) for the pressure-projection step. The computational cost follows the relationship $C = N_{\mathrm{cells}} \times (N_{\mathrm{steps}} + N_{\mathrm{CG\ iters}})$, accounting for both advection steps and iterative pressure solves. Convergence is evaluated through normalized RMSE averaged over the density and pressure fields. The dataset comprises 9 benchmark profiles including true 2D problems (central explosion, supersonic stair flow) and pseudo-1D shock-tube problems (Sod, Lax, high Mach, strong shock, interacting blast, symmetric rarefaction), generating a total of 232 tasks across different accuracy levels and parameter combinations.

FEM 2D (Hughes, 2003; Li et al., 2020).
⁴ This solver implements the 2D Finite Element Method for solid mechanics problems using an implicit Newton solver with corotational elasticity. The solver addresses the momentum conservation equation:
$$\rho \frac{\partial^2 u}{\partial t^2} = \nabla \cdot \sigma + \rho g$$
where u is the displacement field, ρ is density, σ is the Cauchy stress tensor, and g is gravitational acceleration. The stress is computed using corotational elasticity with first Piola-Kirchhoff stress $P = 2\mu(F - R) + \lambda(J - 1)F^{-T}$, where F is the deformation gradient, R is the rotation matrix from polar decomposition, μ and λ are Lamé parameters derived from Young's modulus E and Poisson's ratio ν, and $J = \det(F)$ is the volume ratio. The spatial discretization uses triangle mesh elements with the virtual-work principle for force computation, while temporal integration employs an implicit backward Euler scheme with a Newton-Raphson solver and line search for robustness. The tunable parameters include the mesh resolution (dx), controlling the element size in the x-direction, and the CFL number (cfl), determining the time-step size. The computational cost scales quadratically with mesh refinement due to the increased element count and degrees of freedom. Convergence is evaluated through energy-conservation metrics tracking the total energy variation $\mathrm{var} = \sigma(E_{\mathrm{tot}}) / \mathrm{mean}(|E_{\mathrm{tot}}|)$, where $E_{\mathrm{tot}} = E_{\mathrm{kinetic}} + E_{\mathrm{elastic\ potential}}$, and through L2 relative differences between simulations with adjacent parameter values. The dataset encompasses 5 benchmarks including cantilever beam bending, vibrating bar dynamics, and twisting column simulations, generating a total of 103 tasks across different accuracy levels and parameter combinations.

Hasegawa-Mima (Linear) (Hasegawa & Mima, 1978; Wakatani & Hasegawa, 1984).
This solver implements the linearized Hasegawa-Mima equation for drift-wave dynamics in magnetized plasmas:
$$\frac{\partial q}{\partial t} + v_* \frac{\partial \phi}{\partial y} = 0, \qquad q = \nabla^2 \phi - \phi$$
where φ is the electrostatic potential, q is the generalized vorticity, and $v_*$ is the diamagnetic drift velocity in a periodic 2D domain. The solver employs fourth-order Runge-Kutta (RK4) time integration with a Conjugate Gradient (CG) method for solving the Helmholtz equation $(\nabla^2 - I)\phi = q$ at each RK4 stage. The tunable parameters include the spatial resolution (N), determining the grid spacing via $\Delta x = \Delta y = L/N$ with $L = 2\pi \times 10$; the time-step size (dt), controlling temporal discretization over the fixed total simulation time T = 10,000; and the CG solver tolerance (cg_atol) for iterative convergence control. The computational cost follows the relationship $C = N_{\mathrm{CG}} \times N^2 + N_{\mathrm{matvec}} \times N^2$, where $N_{\mathrm{CG}}$ is the total number of CG iterations across all time steps and $N_{\mathrm{matvec}}$ counts sparse matrix-vector operations. Convergence is evaluated through the L2 RMSE between numerical solutions and analytical solutions obtained via 2D FFT spectral methods. The dataset encompasses 4 different initial conditions (monopole, dipole, sin_x_gauss_y, and gauss_x_sin_y), generating a total of 306 tasks across different accuracy levels and parameter combinations.

⁴ Implementation based on https://github.com/squarefk/FastIPC

Table 2. Summary of solvers in the SimulCost benchmark. Single/Multi columns show task counts per mode. Single-only parameters are used exclusively in single-round evaluation; Both parameters support both modes. Low/Med/High columns show accuracy thresholds. *Thresholds vary by task type. See Appendix A.2 for physics background and citations.

| Solver          | Single | Multi | Single-only Params                      | Both Params                  | Low     | Med     | High    |
| Burgers 1D      | 339    | 255   | beta, k                                 | cfl, n_space                 | 8.0e-2  | 4.0e-2  | 1.0e-2  |
| Diff-React 1D   | 90     | 90    | –                                       | cfl, n_space, tol            | vary*   | vary*   | vary*   |
| EPOCH           | 273    | 145   | dt_multiplier, field_order, particle_order | npart, nx                 | 0.36    | 0.33    | 0.30    |
| Euler 1D        | 197    | 148   | beta, k                                 | cfl, n_space                 | 8.0e-2  | 2.0e-2  | 1.0e-2  |
| Euler 2D        | 232    | 232   | –                                       | cfl, cg_tolerance, n_grid_x  | 8.0e-2  | 4.0e-2  | 8.0e-3  |
| FEM 2D          | 103    | 103   | –                                       | cfl, dx                      | vary*   | vary*   | vary*   |
| HM Linear       | 306    | 0     | N, cg_atol, dt                          | –                            | 1.0e-2  | 1.0e-3  | 5.0e-4  |
| HM Nonlinear    | 110    | 0     | N, dt                                   | –                            | 1.0e-2  | 1.0e-3  | 1.0e-4  |
| Heat 1D         | 295    | 295   | –                                       | cfl, n_space                 | 1.0e-2  | 1.0e-3  | 1.0e-4  |
| Heat Steady 2D  | 662    | 470   | relax, t_init                           | dx, error_threshold          | 5.0e-3  | 5.0e-4  | 3.0e-4  |
| MPM             | 65     | 65    | –                                       | cfl, n_part, nx              | vary*   | vary*   | vary*   |
| NS Transient 2D | 244    | 97    | relaxation_factor, residual_threshold   | cfl, resolution              | 0.60    | 0.30    | 0.15    |
| Total           | 2916   | 1900  |                                         |                              |         |         |         |

Hasegawa-Mima (Nonlinear) (Hasegawa & Mima, 1978; Boyd, 2001). This solver addresses the nonlinear Hasegawa-Mima equation for drift-wave turbulence in magnetized plasmas:
$$\frac{\partial q}{\partial t} + \{\phi, q\} + v_* \frac{\partial \phi}{\partial y} = 0, \qquad q = \nabla^2 \phi - \phi$$
where $\{\phi, q\} = \frac{\partial \phi}{\partial x}\frac{\partial q}{\partial y} - \frac{\partial \phi}{\partial y}\frac{\partial q}{\partial x}$ is the Poisson bracket representing nonlinear advection. The solver employs a pseudo-spectral method with 2D FFT for spatial derivatives, fourth-order Runge-Kutta (RK4) time integration, and 2/3-rule dealiasing to prevent aliasing errors from the quadratic nonlinearity. The tunable parameters include the spatial resolution (N), determining the grid spacing via $\Delta x = \Delta y = L/N$ with $L = 2\pi \times 10$, and the time-step size (dt), controlling temporal discretization over the fixed simulation duration of 10,000 time units. The computational cost follows the relationship $C = N_{\mathrm{FFT}} \times N^2 \times \log_2(N^2)$, where $N_{\mathrm{FFT}}$ represents the total number of FFT operations (forward, inverse, and temporal integrations) performed during the simulation.
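The Poisson bracket with 2/3-rule dealiasing can be sketched with NumPy FFTs. This is a simplified illustration, not the toolkit's implementation: here the dealiasing mask is applied to the spectral derivatives, while the exact placement of the mask (e.g., on the product) may differ in practice:

```python
import numpy as np

def poisson_bracket(phi, q, L=2 * np.pi * 10):
    """{phi, q} = phi_x q_y - phi_y q_x via FFT derivatives with a 2/3-rule mask."""
    N = phi.shape[0]
    k = 2 * np.pi * np.fft.fftfreq(N, d=L / N)   # angular wavenumbers
    kx, ky = k[:, None], k[None, :]
    kmax = np.abs(k).max()
    # 2/3-rule dealiasing: zero out the highest third of the spectrum
    mask = (np.abs(kx) < 2 * kmax / 3) & (np.abs(ky) < 2 * kmax / 3)

    def deriv(f, kk):
        # spectral derivative with the dealiasing mask applied
        return np.real(np.fft.ifft2(1j * kk * mask * np.fft.fft2(f)))

    return deriv(phi, kx) * deriv(q, ky) - deriv(phi, ky) * deriv(q, kx)
```

A quick sanity check: the bracket is antisymmetric, so {f, f} = 0 for any field f.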
Convergence is evaluated through the L2 RMSE, comparing solutions at different resolutions using linear interpolation, as no analytical solution is available in the nonlinear regime. The dataset comprises 5 initial condition profiles (monopole, dipole, sinusoidal, sin_x_gauss_y, and gauss_x_sin_y), generating a total of 110 tasks across different accuracy levels and parameter combinations.

Material Point Method (Cao et al., 2025). This solver implements solid mechanics using the explicit Material Point Method (MPM) on unstructured meshes. The MPM solves the momentum conservation equation:
$$\frac{\partial v}{\partial t} + v \cdot \nabla v = \frac{1}{\rho} \nabla \cdot \sigma + g$$
where v is the velocity field, ρ is density, σ is the stress tensor computed using corotational elasticity, and g is gravitational acceleration. The method combines Eulerian grid-based momentum solving with Lagrangian particle-based material tracking through particle-to-grid transfer, grid momentum update, grid-to-particle transfer, and particle advection. The tunable parameters include the background grid resolution (nx), determining the spatial discretization through $\Delta x = L/\mathrm{nx}$; the particle density per grid cell (n_part), controlling material representation; and the CFL number (cfl) for temporal stability through $\Delta t = \mathrm{cfl} / (v_{\max,\mathrm{init}} / \Delta x)$, where $v_{\max,\mathrm{init}}$ is the initial maximum velocity. The computational cost follows the relationship $C = \sum_{\mathrm{iteration}} (n_{\mathrm{particles}} + n_{\mathrm{neighbor\ communications}})$, accounting for both particle operations and particle-particle interactions within the support radius. Convergence is evaluated through energy-conservation criteria, with hybrid absolute and relative thresholds depending on the mean energy magnitude, as well as positivity preservation of the kinetic and potential energies.
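The MPM time-step rule and the energy-drift statistic used for convergence checking can be sketched as follows (function names are illustrative, not from the toolkit):

```python
import numpy as np

def mpm_time_step(cfl, dx, v_max_init):
    """dt = cfl / (v_max_init / dx), i.e. cfl * dx / v_max_init."""
    return cfl * dx / v_max_init

def energy_variation(e_tot):
    """Relative drift of total energy over a trajectory:
    std(E_tot) / mean(|E_tot|); zero for perfect conservation."""
    e = np.asarray(e_tot, dtype=float)
    return e.std() / np.abs(e).mean()
```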
The dataset encompasses 3 benchmark profiles (cantilever beam bending under gravity, vibrating bar with elastic wave propagation, and disk collision with impact dynamics), generating a total of 65 tasks across different accuracy levels and parameter combinations.

A.3. Solver Summary
Table 2 summarizes all 12 solvers in the benchmark, including task counts for single-round and multi-round evaluation modes, distance metrics used for convergence checking, and precision thresholds at three accuracy levels.

B. Experimental Details

B.1. LLM Configuration and Hyperparameters
• Temperature: 0.0 (except GPT-5, which uses the default)
• Max tokens: 2048
• Max function calls: 10 (multi-round), 1 (single-round)

B.2. Computational Environment

Table 3. Computational environment specifications.
| Component | Specification                                  |
| CPU       | Intel Core i9-14900KF (24-core, 2.4–6.0 GHz)   |
| Memory    | 64 GB DDR5 (2 × 32 GB, 5200 MHz)               |
| GPU       | NVIDIA GeForce RTX 4090 (24 GB GDDR6X)         |

Note on reproducibility: The computational environment is not critical for replicating most results, as our cost metrics are defined analytically based on computational complexity (FLOP counts). The exception is EPOCH, which uses wall-clock time as the cost metric; for this dataset, we use the above machine consistently to ensure reliable timing measurements.

B.3. Selected Prompt Examples

Euler 1D - Single-round - n_space Prompt Example

System Prompt
Your task is to find the optimal parameter, solving the 1D Euler equations for compressible inviscid flow, using a 2nd-order MUSCL scheme with Roe flux and generalized superbee limiter. This serves as a simplified model for compressible fluid dynamics. You should try to minimize the total cost incurred by function calls, but your primary goal is to successfully meet the convergence criteria. You should always use the tool call function to finish the problem.
Workflow: n_space (number of grid cells) determines the spatial discretization resolution: $\Delta x = L / n_{\mathrm{space}}$, where L is the domain length. You may **only** change `n_space`. The value of k is **-1.0**, beta is **1.0**, cfl is **0.25**. **You must not change them!** You have only one opportunity to choose an optimal value for n_space. No trial-and-error or multi-round optimization is permitted. Your goal is to select a value that provides adequate spatial resolution while keeping computational cost reasonable.
Step 1: Make your best **one-shot** guess for n_space.
Step 2: Call the Convergence Test Function and check if converged.
Step 3: Output the final answer with no further tool calls.

User Prompt
QID: 1
Problem: Euler 1D Equations with 2nd Order MUSCL-Roe Method
This simulation solves the 1D Euler equations for compressible inviscid flow, using a 2nd-order MUSCL scheme with Roe flux and generalized superbee limiter:
Conservative form:
$$\frac{\partial U}{\partial t} + \frac{\partial F(U)}{\partial x} = 0$$
Where the conservative variables and flux are:
$$U = \begin{pmatrix} \rho \\ \rho u \\ \rho E \end{pmatrix}, \qquad F = \begin{pmatrix} \rho u \\ \rho u^2 + p \\ u(\rho E + p) \end{pmatrix}$$
Primitive variables:
• ρ = density
• u = velocity
• p = pressure
• E = specific total energy
Equation of state: $p = (\gamma - 1)\rho\left(E - \frac{u^2}{2}\right)$, where γ is the ratio of specific heats.
Spatial Discretization: The spatial discretization uses MUSCL reconstruction with blending parameter k:
$$U^L_{j+\frac{1}{2}} = U_j + \frac{1+k}{4}\,\psi(r_j)\,(U_{j+1} - U_j)$$
$$U^R_{j+\frac{1}{2}} = U_{j+1} - \frac{1+k}{4}\,\psi(r_{j+1})\,(U_{j+2} - U_{j+1})$$
where k is a blending coefficient between the central (k = 1) and upwind (k = -1) schemes, and ψ(r) is the slope limiter function.
Slope Limiting: The slope limiter uses a generalized superbee limiter:
$$\psi(r) = \max\left[0, \max\left[\min(\beta r, 1), \min(r, \beta)\right]\right]$$
where β is the limiter parameter controlling dissipation.
The slope ratio r at interface j is defined as:
$$r_j = \frac{U_{j+1} - U_j}{U_{j+2} - U_{j+1}}$$
This ratio indicates the local non-smoothness, which is the input to the slope limiter to achieve the TVD condition.
Flux Computation: The interface flux is computed using the Roe approximate Riemann solver:
$$F_{j+\frac{1}{2}} = \frac{1}{2}\left[F(U^L) + F(U^R)\right] - \frac{1}{2}|A|\,(U^R - U^L)$$
where |A| is the Roe matrix with Roe-averaged quantities.
Initial condition cases:
• sod: Left: ρ = 1.0, u = 0.0, p = 1.0; Right: ρ = 0.125, u = 0.0, p = 0.1
• lax: Left: ρ = 0.445, u = 0.6977, p = 3.528; Right: ρ = 0.5, u = 0.0, p = 0.571
• mach_3: Left: ρ = 3.857, u = 0.92, p = 10.333; Right: ρ = 1.0, u = 3.55, p = 1.0
Parameter Information:
• cfl: Courant-Friedrichs-Lewy number, $CFL = \frac{(|u| + c)\Delta t}{\Delta x}$, where $c = \sqrt{\gamma p / \rho}$ is the speed of sound
• beta: Limiter parameter for generalized superbee
• k: Blending parameter between central and upwind fluxes
• n_space: Number of grid cells for spatial discretization; determines spatial resolution: $\Delta x = L / n_{\mathrm{space}}$
Physical Parameters:
• Domain length: 1.0
• Gamma (ratio of specific heats): 1.4
• Case: sod
Convergence Check:
• Errors between the simulation based on your solution and the simulation based on the self-refined solution are computed to assess convergence.
• Convergence is confirmed if the following validation criteria are satisfied.
Validation Criteria:
• Current Problem Precision Level: HIGH
• Required RMSE Tolerance: ≤ 0.01
• Relative RMSE must meet this tolerance compared to the self-refined solution
• Positivity preservation: pressure and density must remain positive at all times
• Shock speed consistency: pressure gradients should not exceed physical bounds
Available functions:
Function Name: euler_1d_check_converge_n_space
Description: Conduct a 1D Euler PDE simulation and evaluate its spatial convergence by doubling n_space. It returns the following results:
• RMSE: float
• is_converged: boolean
• accumulated_cost: integer
• The cost of the solver simulating the environment: integer
• The cost of the solver verifying convergence (this will not be included in your accumulated cost): integer
• metrics1: object
• metrics2: object
Parameters:
• cfl (float): CFL number
• beta (float): Limiter parameter for generalized superbee
• k (float): Blending parameter for MUSCL reconstruction
• n_space (integer): Current number of grid cells for spatial discretization
Required parameters: cfl, beta, k, n_space

Euler 1D - Multi-round - n_space Prompt Example

System Prompt
Your task is to find the optimal parameter, solving the 1D Euler equations for compressible inviscid flow, using a 2nd-order MUSCL scheme with Roe flux and generalized superbee limiter. This serves as a simplified model for compressible fluid dynamics. You should try to minimize the total cost incurred by function calls, but your primary goal is to successfully meet the convergence criteria. You should always use the tool call function to finish the problem. And the maximum number of your function calls is 10.
Workflow: n_space (number of grid cells) determines the spatial discretization resolution: $\Delta x = L / n_{\mathrm{space}}$, where L is the domain length. You may only change `n_space`. The value of k is -1.0, beta is 1.0, cfl is 0.25. You must not change them!
Step 1: Estimate an initial, fairly coarse choice of n_space, as you will gradually refine the solution and check convergence.
Step 2: Call the Convergence Test Function; check if converged.
Step 3: Refine n_space based on the feedback from the simulation.
Step 4: You have at most 10 total opportunities to refine your resolution.
Step 5: If you think the experiment can be stopped, you must respond with the final response format and make no further function calls. If you reach the 10th refinement, you must still perform a convergence check immediately after that refinement; then, regardless of whether it is converged or not, respond with the final response format and make no further function calls.

User Prompt: [Identical to the single-round setting.]

ICL Prompt Example - Euler 1D - High Accuracy - Beta
The following shows how in-context learning examples are prepended to the task prompt. This example uses accuracy-matched ICL with cost information (our central ICL approach).

ICL Prompt Example
Reference Example 1: In a previous simulation (high accuracy), the result given parameters: n_space=4096, cfl=0.25, beta=1.0, k=-1.0, was converged=true, cost=29425664.
Reference Example 2: In a previous simulation (high accuracy), the result given parameters: n_space=200, cfl=0.25, beta=1.5, k=-1.0, was converged=false, cost=71000.
Please note that your problem may have different settings (e.g., case, accuracy requirements, etc.), so identical parameters may produce entirely different results.
[Task prompt follows as shown in previous examples...]

For Cost-Ignorant ICL, the cost=... field is omitted from the examples. For Mixed-Accuracy ICL, examples from all three accuracy levels (low, medium, high) are provided regardless of the target accuracy level.

B.4. Evaluation Protocols
Success Criteria: Domain-specific convergence (RMSE thresholds, physical constraints)
Precision Levels: Three tiers with RMSE tolerances from 0.08 (low) to 0.0001 (high)

B.5. GPT-5 Reasoning Effort

Euler 1D - Single-round - High Accuracy - cfl

GPT-5 Output (Reasoning Effort: Minimal)
Tool Name: euler_1d_check_converge_cfl
Tool Arguments:
• cfl: 0.4
• beta: 1.0
• k: -1.0
• n_space: 256
Tool Reason: Select a single CFL that balances stability (TVD MUSCL with Roe and superbee limiter, k=-1.0 upwind-biased, beta=1.0) and accuracy for HIGH precision tolerance (RMSE ≤ 0.01) on the Sod problem at n_space=256. For second-order explicit schemes with Roe flux and a strong limiter, a conservative but not overly small choice is CFL=0.4, which typically ensures stability and sufficient temporal accuracy without excessive cost.

GPT-5 Output (Reasoning Effort: High)
Tool Name: euler_1d_check_converge_cfl
Tool Arguments:
• cfl: 0.3
• beta: 1.0
• k: -1.0
• n_space: 256
Tool Reason: Choose CFL=0.3 to ensure robust stability and high-precision convergence for Sod with Roe + MUSCL, where k=-1 collapses to first-order upwind and beta=1 reduces to minmod. This value balances accuracy (tight RMSE ≤ 0.01 when halving CFL) with reasonable computational cost at n_space=256.

B.6. Failure Mode Examples
Domain experts manually analyzed model reasoning traces and simulation logs, identifying five recurring failure modes:
1. False Positive (Multi-Round): The model treats convergence as the sole success criterion, ignoring other diagnostic signals (e.g., suspiciously low/high cost, physical anomalies). This premature stopping prevents finding more efficient solutions.
2. Blind Exploration (Multi-Round): Despite receiving simulation feedback, the model fails to extract actionable patterns, making random parameter adjustments without strategic direction.
Without systematic feedback integration, additional iterations merely waste computation.
3. Instruction Misunderstanding (Multi-Round): The model achieves convergence early but continues searching for the "optimal solution," unnecessarily increasing cumulative cost despite explicit instructions that all attempts accrue cost.
4. Prior Bias (Single-Round): The model assigns fixed parameter values regardless of varying problem conditions (accuracy level, simulation domain, boundary conditions), suggesting memorization of "canonical values" from the training data. Example: In the Euler 1D beta task (402 questions), LLMs chose beta=1.5 in 99.3% of cases (399/402), despite the optimal value being 1.0.
5. Conservative Strategy (Both Modes): The model selects excessively fine resolutions or small time steps far beyond the convergence requirements, inflating cost unnecessarily. This accounts for the paradoxical pattern where low-accuracy tasks (wider success bands) show lower average efficiency than high-accuracy tasks: models over-solve easy problems.
These failure modes reveal that current LLMs lack: (1) cost-awareness calibration, i.e., understanding when "good enough" convergence justifies stopping, and (2) adaptive prior weighting, i.e., dynamically adjusting training-data-derived defaults based on task-specific feedback. The following examples illustrate each failure mode:

FALSE POSITIVE

False Positive - Heat 2D - Multi-round - error_threshold
Summary
The LLM mistakenly interprets immediate convergence (RMSE = 0.0 in the first attempt) as successful completion of the experiment.
The model fails to recognize that such immediate convergence indicates the simulation parameters are too loose (num_iter in the solver is only 1), requiring further investigation and parameter refinement rather than stopping.

Detailed Experiment Log
Attempt 1
• Tool Name: heat_2d_check_converge_error_threshold
• Tool Arguments: dx: 0.01, relax: 0.2, t_init: 0.0, error_threshold: 0.1
• Tool Reason: To initiate the 2D heat transfer PDE simulation and evaluate its error-threshold convergence by dividing the error threshold by ten. This function conducts both PDE solving and convergence testing in one call.
• Results: RMSE: 0.0; is_converged: true; accumulated cost: 10,201; iterations to simulate: 1; iterations to verify convergence: 1; cost of simulating the environment: 10,201; cost of verifying convergence: 10,201; metrics1: temperature_valid: true; metrics2: temperature_valid: true
Attempt 2
• Status: should_stop: True
• Tool Reason: The simulation has converged, and the convergence criteria have been met. There is no need to refine the error threshold further.

BLIND EXPLORATION

Blind Exploration - Heat 2D - Multi-round - dx
Summary
The LLM exhibited a "Blind Exploration" failure mode, demonstrating ineffective strategy adjustment based on simulation feedback. The model persistently explored values in the 0.023-0.025 range through random micro-adjustments instead of systematic refinement, while the reference (brute-force) dx was 0.01. Key failure patterns:
• Inefficient parameter-space exploration: After finding that dx=0.025 gave RMSE=0.000307 (just above the 0.0003 threshold), the model made random small adjustments (0.024, 0.023, 0.0245, 0.0248) without a clear strategy.
• Ignoring cost implications: The model focused solely on RMSE reduction while ignoring exponentially increasing computational costs (from 53,724 to 57,009,546).
• False convergence claim: The final attempt claimed RMSE=0.000307 "meets the 0.0003 tolerance" when it actually exceeds it.
• Lack of a systematic approach: No binary search or gradient-based optimization; just random perturbations around a local region.
The model should have recognized that a smaller dx would provide much better accuracy at reasonable computational cost, rather than getting trapped in inefficient fine-tuning.

Detailed Experiment Log
All attempts call heat_2d_check_converge_dx with relax: 0.6, t_init: 0.5, error_threshold: 1e-08 fixed; only dx varies.

Attempt | dx     | RMSE     | Converged | Accum. cost | Sim. iters | Verif. iters | Sim. cost  | Verif. cost
1       | 0.1    | 0.00494  | false     | 53,724      | 444        | 1,602        | 53,724     | 706,482
2       | 0.05   | 0.001285 | false     | 760,206     | 1,602      | 5,676        | 706,482    | 9,541,356
3       | 0.03   | 0.000856 | false     | 5,386,518   | 4,002      | 13,998       | 4,626,312  | 62,837,022
4       | 0.025  | 0.000307 | false     | 14,927,874  | 5,676      | 19,741       | 9,541,356  | 129,520,701
5       | 0.024  | 0.000359 | false     | 25,398,978  | 5,936      | 21,079       | 10,471,104 | 148,733,424
6       | 0.023  | 0.000548 | false     | 37,926,834  | 6,471      | 22,454       | 12,527,856 | 169,954,326
7       | 0.0245 | 0.000308 | false     | 47,468,190  | 5,676      | 20,183       | 9,541,356  | 135,710,492
8       | 0.0248 | 0.000307 | false     | 57,009,546  | 5,676      | 19,741       | 9,541,356  | 129,520,701

Stated reasons per attempt:
1. Starting the convergence test with an initial coarse grid spacing of 0.1 to establish a baseline error measurement.
2. The initial dx=0.1 produced RMSE=0.00494, which exceeds the 0.0003 tolerance. Halving dx to 0.05 to reduce discretization error and check convergence.
3. Reducing dx to 0.03 to further lower RMSE from 0.001285 and approach the 0.0003 tolerance requirement.
4. Decreasing dx to 0.025 to push RMSE below the 0.0003 tolerance after the previous 0.03 attempt showed 0.000856 error.
5. The dx=0.025 trial achieved RMSE=0.000307, just above the 0.0003 threshold. Testing dx=0.024 to verify whether the convergence criteria can be met with minimal additional refinement.
6. dx=0.024 produced a higher RMSE (0.000359) than dx=0.025. Retrying dx=0.023 to find an optimal resolution between 0.025 and 0.024 that meets the 0.0003 tolerance.
7. dx=0.023 increased RMSE to 0.000548. Retrying dx=0.0245, between 0.025 and 0.024, to capture a potential minimum-error point near the previous best result.
8. dx=0.0245 achieved RMSE=0.000308, just above tolerance. Testing dx=0.0248 to verify whether a slight increase brings the error below 0.0003 while minimizing computational cost.
Attempt 9 (stop decision, no tool call): After 9 refinements, dx=0.0248 produced RMSE=0.000307, which the model claimed "meets the 0.0003 tolerance." The model chose to stop with should_stop=true, claiming false convergence.
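The analysis above faults the absence of any systematic search such as bisection. As an illustration only (not part of the benchmark toolkit), a hedged sketch of a bisection-style refinement over dx, assuming RMSE decreases monotonically as the grid is refined; `run_rmse` is a hypothetical stand-in for one simulator call:

```python
def bisect_dx(run_rmse, dx_coarse, dx_fine, tol, max_iters=10):
    """Find (approximately) the coarsest dx whose RMSE meets tol.
    Assumes run_rmse is monotone: smaller dx gives smaller RMSE.
    dx_coarse > dx_fine; run_rmse(dx) is one simulator evaluation."""
    if run_rmse(dx_coarse) <= tol:
        return dx_coarse                  # coarse grid already suffices
    best = dx_fine                        # fallback: the finest grid considered
    for _ in range(max_iters):
        mid = 0.5 * (dx_coarse + dx_fine)
        if run_rmse(mid) <= tol:
            best, dx_fine = mid, mid      # passes: try an even coarser grid
        else:
            dx_coarse = mid               # fails: must refine further
    return best
```

Under the monotonicity assumption, each trial halves the search interval, so the tolerance boundary is bracketed in logarithmically many simulator calls instead of the random perturbations seen in the log.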
29 SimulCost: A Cost-A ware Benchmark and T oolkit for A utomating Physics Simulations with LLMs I N S T R U C T I O N M I S U N D E R S TA N D I N G Instruction Misunderstanding - NS T ransient 2D - Multi-round - resolution Summary LLMs exhibited a critical instruction misunderstanding failure mode. Despite explicit instructions stating that each simulation results in an increase in cumulativ e cost, the model obtained promising conv er gence results in early attempts but persisted in searching for an ”optimal solution, ” leading to unnecessary additional simulations and an explosi v e increase in cost. The model successfully achieved con v ergence at grid resolution 20 (first attempt) with accumulated cost of 2,692,800, but continued refining through resolutions 40, 80, 160, 320, and finally 640, where con v ergence failed with accumulated cost reaching 40,553,179,200 - an increase of over 1500x from the initial successful solution. This represents a fundamental misunderstanding of the cost-benefit trade-of f and f ailure to recognize when to stop despite achieving satisf actory results early in the process. Detailed Experiment Log Attempt 1 • T ool Name: ns transient 2d check converge resolution • T ool Arguments: – resolution : 20 – cfl : 0.05 – relaxation factor : 1.3 – residual threshold : 0.01 • T ool Reason: Starting with a coarse grid resolution of 20 to initiate the simulation and check for con v ergence. • Results: – r efined parameter : ”resolution” – curr ent value : 20 – r efined value : 40 – norm RMSE : 0.34589 – is con verged : true – accumulated cost : 2,692,800 – curr ent num steps : 411 – r efine num steps : 810 Attempt 2 • T ool Name: ns transient 2d check converge resolution • T ool Arguments: – resolution : 40 – cfl : 0.05 – relaxation factor : 1.3 – residual threshold : 0.01 • T ool Reason: The simulation con v erged at a grid resolution of 20. Now , refining the grid resolution to 40 to check for further con v ergence. 
• Results:
  – refined parameter: "resolution"
  – current value: 40
  – refined value: 80
  – norm RMSE: 0.205553
  – is converged: true
  – accumulated cost: 21,774,400
  – current num steps: 810
  – refine num steps: 1,610

Attempt 3
• Tool Name: ns transient 2d check converge resolution
• Tool Arguments:
  – resolution: 80
  – cfl: 0.05
  – relaxation factor: 1.3
  – residual threshold: 0.01
• Tool Reason: The simulation converged at a grid resolution of 40. Now refining the grid resolution to 80 to check for further convergence.
• Results:
  – refined parameter: "resolution"
  – current value: 80
  – refined value: 160
  – norm RMSE: 0.137035
  – is converged: true
  – accumulated cost: 114,241,600
  – current num steps: 1,610
  – refine num steps: 3,210

Attempt 4
• Tool Name: ns transient 2d check converge resolution
• Tool Arguments:
  – resolution: 160
  – cfl: 0.05
  – relaxation factor: 1.3
  – residual threshold: 0.01
• Tool Reason: The simulation converged at a grid resolution of 80. Now refining the grid resolution to 160 to check for further convergence.
• Results:
  – refined parameter: "resolution"
  – current value: 160
  – refined value: 320
  – norm RMSE: 0.194845
  – is converged: true
  – accumulated cost: 703,400,000
  – current num steps: 3,210
  – refine num steps: 6,410

Attempt 5
• Tool Name: ns transient 2d check converge resolution
• Tool Arguments:
  – resolution: 320
  – cfl: 0.05
  – relaxation factor: 1.3
  – residual threshold: 0.01
• Tool Reason: The simulation converged at a grid resolution of 160. Now refining the grid resolution to 320 to check for further convergence.
• Results:
  – refined parameter: "resolution"
  – current value: 320
  – refined value: 640
  – norm RMSE: 0.262424
  – is converged: true
  – accumulated cost: 5,087,553,600
  – current num steps: 6,410
  – refine num steps: 12,810

Attempt 6
• Tool Name: ns transient 2d check converge resolution
• Tool Arguments:
  – resolution: 640
  – cfl: 0.05
  – relaxation factor: 1.3
  – residual threshold: 0.01
• Tool Reason: The simulation converged at a grid resolution of 320. Now refining the grid resolution to 640 to check for further convergence.
• Results:
  – refined parameter: "resolution"
  – current value: 640
  – refined value: 1280
  – norm RMSE: Infinity
  – is converged: false
  – accumulated cost: 40,553,179,200
  – current num steps: 12,810
  – refine num steps: 4,038

PRIOR BIAS

Prior Bias - Burgers 1D - Single-round - beta

Summary
Beta Bias: all LLMs default to 1.5 as the most appropriate initial value for beta, showing excessive rigidity and lacking the analytical flexibility to adjust beta according to the specific problem.

Detailed Experiment Log

1. Nova-Premier
• Tool Name: burgers 1d check converge beta
• Tool Arguments:
  – cfl: 0.25
  – beta: 1.5
  – k: -1.0
  – n space: 100
• Tool Reason: We need to start by selecting beta and n space parameters. Since beta controls dissipation and must be >= 1, choosing beta=1.5 provides moderate limiting. For n space, starting with 100 gives a baseline resolution for the sinusoidal case. This combination allows us to evaluate initial convergence behavior while minimizing computational cost.
• Results:
  – RMSE: 0.051053
  – is converged: true
  – accumulated cost: 29,100

2.
GPT-5
• Tool Name: burgers 1d check converge beta
• Tool Arguments:
  – cfl: 0.25
  – beta: 1.5
  – k: -1.0
  – n space: 100
• Tool Reason: Initial exploration of the beta parameter with moderate grid resolution to understand convergence behavior for the 1D Burgers equation with sinusoidal initial condition. Starting with a moderate beta value of 1.5, which provides a balanced approach between preserving sharp gradients and maintaining numerical stability.
• Results:
  – RMSE: 0.051053
  – is converged: true
  – accumulated cost: 29,100

3. Qwen-32B
• Tool Name: burgers 1d check converge beta
• Tool Arguments:
  – cfl: 0.25
  – beta: 1.5
  – k: -1.0
  – n space: 100
• Tool Reason: To solve the 1D Burgers equation and check convergence, we need to choose appropriate values for beta and n space. For beta, we'll use 1.5, which is a moderate value for the generalized superbee limiter, providing a balance between accuracy and stability. For n space, we'll start with 100 grid cells, which should provide a reasonable initial resolution. We'll use the given fixed values for cfl (0.25) and k (-1.0).
• Results:
  – RMSE: 0.051053
  – is converged: true
  – accumulated cost: 29,100

4. Claude-3.7-Sonnet
• Tool Name: burgers 1d check converge beta
• Tool Arguments:
  – cfl: 0.25
  – beta: 1.5
  – k: 0.0
  – n space: 256
• Tool Reason: I need to select appropriate values for beta and n space to solve the 1D Burgers equation. For beta, I'll choose 1.5, which is a common value for the generalized superbee limiter that balances between accuracy and stability. For n space, I'll start with 256 grid cells, which should provide sufficient resolution for this problem while keeping computational cost reasonable. The parameter k is fixed at 0.0 as specified, and cfl is fixed at 0.25.
• Results:
  – RMSE: 0.031673
  – is converged: true
  – accumulated cost: 190,464

5.
Llama-3-70b-Instruct
• Tool Name: burgers 1d check converge beta
• Tool Arguments:
  – cfl: 0.25
  – beta: 1.5
  – k: -1.0
  – n space: 1000
• Tool Reason: Choosing beta=1.5 and n space=1000 to achieve a good balance between accuracy and computational cost.
• Results:
  – RMSE: 0.017798
  – is converged: true
  – accumulated cost: 2,879,000

6. Mistral-Large
• Tool Name: burgers 1d check converge beta
• Tool Arguments:
  – cfl: 0.25
  – beta: 1.5
  – k: -1.0
  – n space: 100
• Tool Reason: To conduct a 1D Burgers PDE simulation and evaluate its spatial convergence for the given beta task by doubling n space.
• Results:
  – RMSE: 0.051053
  – is converged: true
  – accumulated cost: 29,100

CONSERVATIVE STRATEGY

Conservative Strategy - Epoch 1D - npart

Summary
The model's conservative strategy failed by choosing npart = 1000, which is 100 times larger than the reference value and resulted in a computational cost 200 times higher than necessary. Despite achieving convergence, this choice led to excessive computational resource consumption. The model's reasoning incorrectly emphasized "statistical accuracy" without adequately considering the significant computational overhead. This exemplifies over-engineering: the model selected an unnecessarily expensive parameter configuration that far exceeded what was required for convergence, failing to balance accuracy needs with computational efficiency.
Detailed Experiment Log

Attempt 1
• Tool Name: epoch 1d check converge npart
• Tool Arguments:
  – nx: 3200
  – dt multiplier: 0.95
  – npart: 1000
  – field order: 4
  – particle order: 3
• Tool Reason: I chose npart = 1000 as it provides a good balance between statistical accuracy and computational cost. This value is high enough to ensure adequate particle statistics but not so high that it becomes computationally expensive.
• Results:
  – L2 error: 0.136838
  – is converged: true
  – accumulated cost: 2105.210849768
  – cost of the solver simulating the environment: 2105.210849768
  – cost of the solver verifying convergence (not included in the accumulated cost): 1971.728457438

C. Supplementary Results

C.1. Detailed Overall Results

Table 4. Overall results on the full dataset. Abbreviations: S = Success Rate (%), E = Efficiency; Low/Med/High = accuracy levels. Measurements are reported for the single-round inference mode.

Model / Acc. level       Low (S / E)    Med (S / E)    High (S / E)   Avg (S / E)
GPT-5                    71.4 / 0.36    66.0 / 0.26    53.9 / 0.21    63.8 / 0.27
Claude-3.7-Sonnet        66.8 / 0.37    54.7 / 0.35    44.9 / 0.36    55.4 / 0.36
Llama-3-70B-Instruct     57.3 / 0.16    42.0 / 0.16    37.9 / 0.24    45.7 / 0.18
Qwen3-32B                61.6 / 0.46    42.8 / 0.39    34.6 / 0.33    46.3 / 0.39
GPT-OSS-120B             68.4 / 0.50    51.8 / 0.41    39.6 / 0.45    53.3 / 0.45

Table 5. Overall results on the full dataset. Abbreviations: S = Success Rate (%), E = Efficiency; Low/Med/High = accuracy levels. Measurements are reported for multi-round tunable parameters only.

Model / Acc. level       Low (S / E)    Med (S / E)    High (S / E)   Avg (S / E)
GPT-5                    84.9 / 0.60    79.9 / 0.51    76.2 / 0.56    80.3 / 0.55
Claude-3.7-Sonnet        69.7 / 0.41    68.2 / 0.38    66.8 / 0.49    68.2 / 0.42
Llama-3-70B-Instruct     70.1 / 0.40    72.8 / 0.39    69.2 / 0.55    70.7 / 0.44
Qwen3-32B                71.0 / 0.66    67.7 / 0.49    64.2 / 0.65    67.7 / 0.60
GPT-OSS-120B             77.6 / 1.03    80.5 / 0.71    81.0 / 0.86    79.7 / 0.86

Table 6.
Comparison between single-round and multi-round inference modes: Success Rate. Abbreviations: Low/Med/High = accuracy levels; SR/MR = single-round and multi-round modes.

Model / Acc. level       Low (SR / MR)   Med (SR / MR)   High (SR / MR)   Avg (SR / MR)
GPT-5                    71.4 / 84.9     66.0 / 79.9     53.9 / 76.2      63.8 / 80.3
Claude-3.7-Sonnet        66.8 / 69.7     54.7 / 68.2     44.9 / 66.8      55.4 / 68.2
Llama-3-70B-Instruct     57.3 / 70.1     42.0 / 72.8     37.9 / 69.2      45.7 / 70.7
Qwen3-32B                61.6 / 71.0     42.8 / 67.7     34.6 / 64.2      46.3 / 67.7
GPT-OSS-120B             68.4 / 77.6     51.8 / 80.5     39.6 / 81.0      53.3 / 79.7

C.2. McNemar's Test Results

Table 8 provides the detailed McNemar's test statistics for the single-round vs. multi-round comparison, aggregated across all models.

Table 7. Comparison between single-round and multi-round inference modes: Efficiency. Abbreviations: Low/Med/High = accuracy levels; SR/MR = single-round and multi-round modes.

Model / Acc. level       Low (SR / MR)   Med (SR / MR)   High (SR / MR)   Avg (SR / MR)
GPT-5                    0.36 / 0.60     0.26 / 0.51     0.21 / 0.56      0.27 / 0.55
Claude-3.7-Sonnet        0.37 / 0.41     0.35 / 0.38     0.36 / 0.49      0.36 / 0.42
Llama-3-70B-Instruct     0.16 / 0.40     0.16 / 0.39     0.24 / 0.55      0.18 / 0.44
Qwen3-32B                0.46 / 0.66     0.39 / 0.49     0.33 / 0.65      0.39 / 0.60
GPT-OSS-120B             0.50 / 1.03     0.41 / 0.71     0.45 / 0.86      0.45 / 0.86

Table 8. Multi-round improvement over single-round (McNemar's test). Single/Multi = success rate (%); Δ = improvement. Top: per-model averages across accuracy levels. Bottom: overall by accuracy level.

Model / Accuracy         Single    Multi     Δ         p-value
GPT-5                    64.9%     87.1%     +22.3%    < 0.001
Claude-3.7-Sonnet        59.0%     67.6%     +8.6%     0.610
Llama-3-70B-Instruct     52.2%     79.2%     +27.1%    < 0.001
Qwen3-32B                47.6%     71.2%     +23.6%    < 0.001
GPT-OSS-120B             56.2%     84.0%     +27.8%    < 0.001
Overall (High)           49.0%     77.2%     +28.2%    < 0.001
Overall (Medium)         55.9%     79.9%     +24.0%    < 0.001
Overall (Low)            67.2%     78.3%     +11.1%    < 0.001

C.3.
Statistical Robustness Analysis

We address two potential concerns about the McNemar's test results: (1) within-dataset correlation violating the independence assumption, and (2) multiple-comparison inflation across 15 model-accuracy combinations.

Mixed-Effects Analysis. McNemar's test assumes independence between paired observations. Since our task instances share structure (same dataset, related physics), we verified robustness using mixed-effects logistic regression with dataset as a random effect:

    success ∼ mode + (1 | dataset)

where mode is a binary indicator (0 = single-round, 1 = multi-round) and the random intercept (1 | dataset) accounts for potential within-dataset correlation. Table 9 shows that the results are unchanged: all models show significant improvement from multi-round inference (p < 0.001), and the dataset variance is near zero for most models. The negligible random-effect variance indicates that task-level outcomes are approximately independent across datasets, validating the McNemar approach.

Table 9. Mixed-effects logistic regression results. Dataset variance near zero indicates tasks within the same dataset are not highly correlated, supporting the independence assumption in McNemar's test.

Model                Effect    SE       p-value    Dataset Var.
All                  +20.4%    0.005    < 0.001    0.000
GPT-5                +22.0%    0.011    < 0.001    0.000
Claude-3.7-Sonnet    +7.3%     0.013    < 0.001    0.027
Llama-3-70B          +25.6%    0.012    < 0.001    0.000
Qwen3-32B            +22.2%    0.012    < 0.001    0.046
GPT-OSS-120B         +25.4%    0.011    < 0.001    0.039

Multiple Comparison Correction. With 15 comparisons (5 models × 3 accuracy levels), we applied Benjamini-Hochberg FDR correction at α = 0.05. Table 10 shows that all 14 significant comparisons remain significant after correction. The single non-significant comparison (Claude-3.7-Sonnet at low accuracy) also remains non-significant. This confirms that the uncorrected p-values reported in the main text are valid.
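The Benjamini-Hochberg step used here is short enough to sketch in pure Python. The function name and the example p-values below are illustrative, not the paper's raw data; only the α = 0.05 level and the 15-comparison setup are taken from the text:

```python
# Sketch of the Benjamini-Hochberg FDR procedure (step-up variant).
def benjamini_hochberg(pvals, alpha=0.05):
    """Return booleans: True where the null is rejected at FDR level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k_max = rank              # largest rank passing the BH threshold
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= k_max     # reject every hypothesis up to that rank
    return reject

# 15 comparisons as in Table 10: fourteen tiny p-values and one at 0.610.
flags = benjamini_hochberg([0.0005] * 14 + [0.610])
print(flags)  # the first fourteen are True, the last is False
```

With one large p-value among many tiny ones, the step-up rule rejects the fourteen small ones and keeps the outlier non-significant, matching the qualitative outcome reported in Table 10.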
Table 10. McNemar's test with Benjamini-Hochberg FDR correction. With 15 comparisons (5 models × 3 accuracy levels), we apply FDR correction at α = 0.05. All comparisons remain significant after correction except Claude-3.7-Sonnet at low accuracy.

Model                Accuracy    Δ SR       p (uncorr.)   p (FDR)    Sig.
GPT-5                Low         +21.1%     < 0.001       < 0.001    Yes
GPT-5                Medium      +20.6%     < 0.001       < 0.001    Yes
GPT-5                High        +25.2%     < 0.001       < 0.001    Yes
Claude-3.7-Sonnet    Low         −0.9%      0.610         0.610      No
Claude-3.7-Sonnet    Medium      +11.0%     < 0.001       < 0.001    Yes
Claude-3.7-Sonnet    High        +15.7%     < 0.001       < 0.001    Yes
Llama-3-70B          Low         +13.9%     < 0.001       < 0.001    Yes
Llama-3-70B          Medium      +34.7%     < 0.001       < 0.001    Yes
Llama-3-70B          High        +32.6%     < 0.001       < 0.001    Yes
Qwen3-32B            Low         +13.0%     < 0.001       < 0.001    Yes
Qwen3-32B            Medium      +27.9%     < 0.001       < 0.001    Yes
Qwen3-32B            High        +29.9%     < 0.001       < 0.001    Yes
GPT-OSS-120B         Low         +10.9%     < 0.001       < 0.001    Yes
GPT-OSS-120B         Medium      +31.5%     < 0.001       < 0.001    Yes
GPT-OSS-120B         High        +41.0%     < 0.001       < 0.001    Yes

Table 11. ICL performance breakdown by dataset and inference mode. Values show success rate (%) averaged across accuracy levels. Base = no ICL; ICL = accuracy-matched examples with cost; No-Cost = ICL without cost info; Mixed = examples from all accuracy levels.

                  Single-Round                       Multi-Round
Dataset           Base    ICL     No-Cost  Mixed     Base    ICL     No-Cost  Mixed
Euler 1D          38.1    50.3    67.0     66.4      50.8    39.8    39.7     39.8
Heat 1D           69.9    75.0    85.9     86.6      87.7    81.0    85.0     72.8
MPM 2D            34.3    49.7    63.0     71.3      35.2    36.4    35.4     28.7
NS Trans. 2D      38.3    68.3    73.4     71.2      78.5    72.7    70.3     69.7

C.4. Per-Dataset ICL Results

The success-efficiency trade-off from ICL (Section 4) manifests consistently across all five evaluated datasets. Table 11 shows success rates by dataset and inference mode. Figure 4 shows the detailed ICL ablation results with a per-accuracy breakdown, complementing the summary in Figures 3e–3f.

C.5.
Reasoning Effort Ablation Results

Table 12 shows the GPT-5 reasoning-effort comparison by accuracy level and inference mode. The key finding is that reasoning effort shows no significant overall impact on performance: increased reasoning does not translate to better parameter selection when aggregated across accuracy levels and modes.

Table 12. GPT-5 reasoning-effort ablation by accuracy level and inference mode. S = Success Rate (%), E = Efficiency. Bold indicates the best per column.

                Single-Round                                   Multi-Round
Effort level    Low (S / E)    Med (S / E)    High (S / E)     Low (S / E)    Med (S / E)    High (S / E)
Minimal         74.2 / 0.39    60.5 / 0.29    48.0 / 0.19      81.7 / 0.66    84.5 / 0.50    84.8 / 0.70
Medium          68.1 / 0.43    62.1 / 0.28    50.0 / 0.15      86.5 / 0.71    81.8 / 0.52    75.1 / 0.64
High            70.4 / 0.44    60.8 / 0.29    58.6 / 0.19      87.9 / 0.62    78.3 / 0.51    76.0 / 0.56

Figure 5 shows the detailed reasoning-effort ablation results by accuracy level for both inference modes. Minimal effort achieves competitive or better efficiency across most conditions.

Figure 4. Detailed ICL ablation results. (a) Success rate by ICL variant and accuracy level: success rate improves with ICL in single-round mode but degrades in multi-round. (b) Efficiency by ICL variant and accuracy level: efficiency drops when cost information is omitted from examples (No-Cost variant).

Figure 5. Reasoning-effort ablation by accuracy level. Left: single-round mode. Right: multi-round mode. No consistent pattern emerges; increased reasoning effort does not reliably improve parameter selection.

Why Reasoning Effort Does Not Help. The example traces in Section B.5 illustrate why increased reasoning does not improve performance. Both Minimal and High effort select similar CFL values (0.4 vs. 0.3) and provide comparable justifications; neither derives the parameter from first principles.
The bottleneck is not reasoning depth but reasoning grounding: models reason about stability conditions but ultimately select values based on memorized defaults rather than task-specific derivation. Minimal effort achieves competitive efficiency because additional reasoning tokens do not bridge this grounding gap.

C.6. Bayesian Optimization Baseline

We compare LLM performance against Bayesian Optimization with a Gaussian Process surrogate (BO-GP), a classical black-box optimization approach that builds a surrogate model of the objective function and uses acquisition functions to guide parameter exploration.

Implementation Details. Our BO-GP baseline uses scikit-learn's Gaussian Process implementation with a Matern kernel (ν = 2.5), an Upper Confidence Bound (UCB) acquisition function (κ = 2.576), and automatic length-scale optimization. We run up to 10 BO iterations, matching the LLM multi-round budget. At each iteration, a Gaussian Process (GP) is fitted on previously observed samples (falling back to random samples for the initial round), and a UCB acquisition function is constructed from the GP. To propose a parameter from the acquisition values, we take an explore-and-exploit approach: we select the best of 10,000 randomly sampled new parameters (random-best), compare it with the solution proposed by L-BFGS-B on the acquisition function (smart-best), and choose the better of the two. Additional GP hyperparameters include α = 1e-6 for numerical stability, y-normalization enabled, and 5 random restarts for optimizer initialization; these hyperparameters all follow a standard implementation (Nogueira, 2014–). The convergence criterion matches the LLM experiments (task-specific distance threshold). Algorithm 3 provides the pseudocode.

Why Multi-Round Only.
BO is inherently an iterative algorithm that requires observed data to train its surrogate model. A "single-round" BO evaluation would be equivalent to random sampling, as the GP cannot make informed predictions without prior observations. We therefore compare BO against LLMs only in multi-round mode.

Results Summary. Table 13 shows per-solver success rates comparing BO against the LLM average (across GPT-5, Claude-3.7-Sonnet, Llama-3-70B, Qwen3-32B, and GPT-OSS-120B). BO achieves 76%/71%/66% success rates at Low/Med/High accuracy levels, comparable to the LLM average (77%/76%/74%), but with larger solver-to-solver variation (25%–100%, vs. more consistent LLM performance). BO excels on smooth, monotonic cost-accuracy relationships (Burgers 1D: 100% vs. 78%; EPOCH 1D: 98% vs. 87%) but struggles on discrete or multi-modal parameter spaces (HM Linear: 33% vs. 81%; MPM 2D: 25% vs. 34%).

Table 13. Bayesian Optimization vs. LLM performance by solver (multi-round mode, success rate %). Low/Med/High = accuracy levels. LLM Avg is the mean across all five models.

                  BO                      LLM Avg
Solver            Low    Med    High      Low    Med    High
Burgers 1D        100    99     87        78     77     74
Diff-React 1D     76     48     50        81     74     75
EPOCH 1D          98     94     98        87     79     75
Euler 1D          74     65     51        60     67     58
Euler 2D          93     98     100       91     90     89
FEM 2D            64     53     9         73     71     65
HM Linear         33     47     37        81     84     88
HM Nonlinear      100    97     94        93     91     88
Heat 1D           100    93     93        96     92     87
Heat 2D           52     35     60        66     78     74
MPM 2D            25     33     32        34     27     27
NS 2D             100    89     80        80     84     85
Overall           76     71     66        77     76     74

BO Exploration Behavior. BO's UCB acquisition function tends to select extreme parameter values in early iterations to maximize information gain, and other common acquisition functions exhibit similar early-iteration behavior (Mockus et al., 2014; Jones et al., 1998). While this guarantees theoretically competitive performance, the strategy is detrimental when cumulative costs accumulate, tasks terminate upon reaching accuracy thresholds, and iteration budgets are limited.
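The random-best UCB proposal step, and why it is drawn toward unexplored extremes, can be sketched in a few lines. To keep the example self-contained we substitute a toy distance-based surrogate for the GP posterior; the surrogate form, length scale, objective data, and function names are illustrative assumptions, with only the UCB form a(x) = μ(x) + κσ(x) and κ = 2.576 taken from the text:

```python
import math
import random

KAPPA = 2.576    # UCB exploration weight, as in the text
LENGTH = 0.2     # toy length scale (assumption)

def surrogate(x, data):
    """Toy posterior: weighted mean of observations, with uncertainty
    shrinking near observed points (a stand-in for the GP, not the real GP)."""
    weights = [math.exp(-((x - xi) ** 2) / (2 * LENGTH ** 2)) for xi, _ in data]
    total = sum(weights)
    mu = sum(w * yi for w, (_, yi) in zip(weights, data)) / total if total > 1e-12 else 0.0
    sigma = 1.0 - max(weights, default=0.0)   # far from all data -> sigma near 1
    return mu, sigma

def ucb(x, data):
    mu, sigma = surrogate(x, data)
    return mu + KAPPA * sigma

def propose(data, bounds=(0.0, 1.0), n_candidates=10_000, seed=0):
    """Random-best proposal: best UCB value over a large random candidate pool."""
    rng = random.Random(seed)
    lo, hi = bounds
    return max((rng.uniform(lo, hi) for _ in range(n_candidates)),
               key=lambda x: ucb(x, data))

# With observations clustered near x = 0.2, the proposal lands far away,
# where the uncertainty bonus dominates the acquisition value.
data = [(0.18, 0.5), (0.2, 0.6), (0.22, 0.55)]
x_next = propose(data)
```

Even with this crude surrogate, the proposal jumps to the far edge of the domain, which is the early-iteration behavior the paragraph above describes: informative for the surrogate, but expensive when every evaluation adds to a cumulative simulation cost.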
In contrast, LLMs leverage physics intuition from pre-training to make more conservative initial guesses near typical working parameter ranges. This physics-informed exploration particularly benefits low-accuracy tasks and early exploration phases, explaining why LLMs achieve higher efficiency despite BO's sophisticated search strategy.

Table 14. Bayesian Optimization vs. LLM efficiency by solver (multi-round mode). Values show cost efficiency (higher is better; 1.0 = brute-force search cost).

                  BO                        LLM Avg
Solver            Low     Med     High      Low     Med     High
Burgers 1D        0.36    0.41    0.67      0.78    0.70    0.85
Diff-React 1D     0.30    0.97    0.64      1.20    1.18    1.00
EPOCH 1D          0.64    0.57    0.58      0.95    0.99    1.04
Euler 1D          0.99    1.08    1.18      1.21    0.78    0.79
Euler 2D          0.60    0.62    0.57      0.42    0.38    0.49
FEM 2D            0.30    0.20    0.48      0.46    0.51    0.73
HM Linear         0.50    0.33    0.33      0.28    0.64    0.81
HM Nonlinear      0.46    0.49    0.89      0.23    0.13    0.26
Heat 1D           0.80    0.93    0.93      0.58    0.61    0.67
Heat 2D           6.13    0.37    1.02      16.81   1.62    0.86
MPM 2D            0.15    0.01    0.02      0.35    0.56    0.83
NS 2D             1.01    0.80    1.01      1.12    0.54    0.52
Overall           1.02    0.57    0.70      2.03    0.72    0.74

Algorithm 3 Bayesian Optimization with Gaussian Process
 1: Input: objective f, bounds B, init points m, T = 10, κ = 2.576
 2: Initialize GP with Matern kernel (ν = 2.5), α = 1e-6, y-normalization enabled, 5 optimizer restarts; D ← ∅
 3: for i = 1 to m do
 4:   Sample x uniformly from B; observe y = f(x); D ← D ∪ {(x, y)}
 5: end for
 6: for t = 1 to T do
 7:   Fit GP on D; obtain μ(x), σ(x) for any x
 8:   Compute UCB: a(x) = μ(x) + κσ(x)
 9:   Sample 10,000 new x uniformly from B
10:   x_rand ← argmax of a(x) over the 10,000 new samples (random-best)
11:   x_smart ← argmax_x a(x) via L-BFGS-B with random restarts (smart-best)
12:   x* ← argmax of a(x) over {x_rand, x_smart}
13:   Observe y* = f(x*); D ← D ∪ {(x*, y*)}
14:   if convergence criterion satisfied (task-specific distance threshold) then
15:     break
16:   end if
17: end for
18: Output: best (x, y) in D

C.7. Task Group Classification

Table 15 presents the parameter-to-group mapping used throughout our analysis, categorizing parameters by their role in numerical simulation.

Table 15. Task group classification. Parameters are grouped by their role in numerical simulation, enabling analysis of LLM performance across parameter types.

Group       Parameters                                                        Description
Spatial     n space, dx, nx, n grid x, ...                                    Grid/mesh resolution parameters affecting spatial discretization accuracy. Most common across all solvers.
Temporal    cfl, dt, dt multiplier                                            Timestep and CFL parameters controlling temporal stability and accuracy.
Tolerance   error threshold, cg tolerance, cg atol, residual threshold, ...   Convergence tolerances for iterative solvers to control solution accuracy.
Misc        beta, k, t init, field order, ...                                 Physics-specific parameters (limiter coefficients, relaxation factors) varying by solver.

C.8. Task Group Performance Plots

Figures 3b–3c (in the main paper) provide an aggregated summary of performance by task group across both inference modes. Figure 6 shows the detailed performance breakdown by task group, referenced from Section 4.

Figure 6. Performance by task group: (a) success rate, (b) efficiency. Spatial and Tolerance parameters show consistently strong performance, while Misc parameters lag, especially at higher accuracy levels.

C.9. Task Correlation Analysis

We investigate whether tasks within the same parameter group (Spatial, Temporal, Tolerance, Misc) exhibit correlated success patterns across model-accuracy configurations. If within-group correlations were strong, fine-tuning on one solver's tasks might transfer to other solvers with similar parameter types.

Our analysis computes Pearson correlations between task success rates across all (model, accuracy level) configurations (note: each correlation is computed from 5 models × 3 accuracy levels = 15 data points). When averaging correlations and performing statistical tests, we apply Fisher's z-transformation (z = arctanh r) to account for the bounded, non-normal distribution of correlation coefficients, then transform back to the correlation scale for reporting. We compare within-group correlations (tasks sharing the same parameter type) to between-group correlations (tasks from different parameter types).

Figures 7a and 7b show the average correlation between task groups for single-round and multi-round modes, respectively. The diagonal (within-group) values are not notably higher than the off-diagonal (between-group) values, confirming that parameter type does not strongly predict correlated difficulty.

Figures 8a and 8b show the full pairwise correlation matrices for all 22 tunable parameters. Tasks are grouped by parameter type (Spatial, Temporal, Tolerance, Misc), with black lines separating groups. No clear block-diagonal structure emerges in either mode, further supporting the conclusion that task difficulty is solver-specific rather than parameter-type-driven.

C.10.
Summary Tables

Tables 16–18 provide precise numerical values corresponding to the main-paper figures.

C.11. Detailed Simulator-wise Results

Tables 19–26 present detailed per-solver results grouped by physics domain.

Figure 7. Average Pearson correlation between task groups for (a) single-round and (b) multi-round modes. Diagonal = within-group; off-diagonal = between-group.

Table 16. Overall model performance (corresponds to Figure 2).

                         Success Rate (%)             Efficiency
Model                    Single   Multi    Δ          Single   Multi
GPT-5                    63.8     80.3     +16.6      0.27     0.55
Claude-3.7-Sonnet        55.4     68.2     +12.8      0.36     0.42
Llama-3-70B-Instruct     45.7     70.7     +25.0      0.18     0.44
Qwen3-32B                46.3     67.7     +21.3      0.39     0.60
GPT-OSS-120B             53.3     79.7     +26.4      0.45     0.86

Figure 8. Full pairwise task correlation matrix for (a) single-round and (b) multi-round modes. Tasks are grouped by parameter type (Spatial, Temporal, Tolerance, Misc); black lines separate groups.

Table 17. Performance by task group (corresponds to Figure 6).

                 Success Rate (%)             Efficiency
Task Group       Single   Multi    Δ          Single   Multi
Spatial          59.6     78.5     +18.9      0.22     0.63
Temporal         55.1     72.1     +17.0      0.40     0.46
Tolerance        59.4     76.5     +17.2      0.32     0.46
Misc             43.5     67.6     +24.1      0.68     0.90

Table 18. ICL variant comparison using Claude-3.7-Sonnet (corresponds to Figure 3).

                       Success Rate (%)             Efficiency
Variant                Single   Multi    Δ          Single   Multi
Base                   46.7     64.1     +17.3      0.328    0.367
Idealized ICL          63.8     53.2     −10.6      0.445    0.377
Cost-Ignorant ICL      73.4     53.1     −20.3      0.364    0.470
Mixed-Accuracy ICL     75.4     46.3     −29.1      0.572    0.435

Table 19. Single-round results for Fluid Dynamics solvers. S = Success Rate (%), E = Efficiency.
Solver         Model      Low (S / E)    Med (S / E)    High (S / E)   Avg (S / E)
Burgers 1D     GPT-5      53.9 / 1.07    41.7 / 0.71    23.8 / 0.40    39.8 / 0.67
               Claude     48.9 / 1.46    25.0 / 0.94    18.9 / 0.78    30.9 / 1.02
               Llama      62.2 / 0.55    56.7 / 0.26    21.7 / 0.85    46.9 / 0.50
               Qwen       33.3 / 1.26    25.0 / 0.68    15.6 / 0.95    24.6 / 0.93
               GPT-OSS    48.3 / 1.31    22.8 / 0.71    21.7 / 0.82    30.9 / 0.91
Euler 1D       GPT-5      81.5 / 1.07    54.7 / 0.52    34.7 / 0.82    57.0 / 0.77
               Claude     83.3 / 1.33    23.5 / 0.71    7.4 / 0.70     38.1 / 0.87
               Llama      58.3 / 2.18    23.5 / 0.38    15.2 / 0.84    32.3 / 0.88
               Qwen       72.2 / 1.61    16.9 / 0.60    10.6 / 0.80    33.3 / 0.92
               GPT-OSS    74.1 / 1.59    24.5 / 0.84    7.4 / 0.65     35.3 / 0.95
Euler 2D       GPT-5      91.7 / 0.08    93.1 / 0.11    73.7 / 0.16    86.2 / 0.11
               Claude     92.6 / 0.12    89.8 / 0.21    74.7 / 0.36    85.7 / 0.20
               Llama      91.7 / 0.08    85.6 / 0.13    86.9 / 0.33    88.1 / 0.15
               Qwen       92.6 / 0.15    85.6 / 0.13    84.8 / 0.31    87.7 / 0.18
               GPT-OSS    92.6 / 0.18    85.6 / 0.20    62.7 / 0.65    80.3 / 0.28
NS Trans. 2D   GPT-5      40.3 / 0.89    34.0 / 0.46    22.1 / 0.36    32.1 / 0.53
               Claude     42.4 / 0.61    41.4 / 0.43    31.1 / 0.92    38.3 / 0.62
               Llama      25.7 / 0.59    21.5 / 0.18    25.0 / 0.68    24.1 / 0.42
               Qwen       43.8 / 0.76    34.3 / 0.61    32.9 / 0.30    37.0 / 0.51
               GPT-OSS    51.4 / 1.07    48.7 / 0.71    22.8 / 0.77    41.0 / 0.84

Table 20. Multi-round results for Fluid Dynamics solvers. S = Success Rate (%), E = Efficiency.
Solver         Model      Low (S / E)    Med (S / E)    High (S / E)   Avg (S / E)
Burgers 1D     GPT-5      85.0 / 0.83    86.1 / 0.75    72.2 / 0.98    81.1 / 0.85
               Claude     77.8 / 0.72    65.0 / 0.54    62.2 / 0.65    68.3 / 0.63
               Llama      79.4 / 0.53    80.0 / 0.53    57.5 / 0.68    72.3 / 0.57
               Qwen       66.7 / 0.84    63.9 / 0.91    59.2 / 1.10    63.2 / 0.94
               GPT-OSS    71.7 / 0.70    71.7 / 0.49    83.3 / 0.67    75.6 / 0.62
Euler 1D       GPT-5      73.1 / 1.77    61.1 / 0.87    41.8 / 0.90    58.7 / 1.11
               Claude     47.2 / 2.16    55.7 / 0.56    49.4 / 0.65    50.8 / 0.93
               Llama      53.7 / 0.47    79.9 / 0.70    72.1 / 0.90    68.6 / 0.67
               Qwen       58.3 / 3.06    60.0 / 0.54    36.9 / 0.71    51.7 / 1.06
               GPT-OSS    43.5 / 3.75    74.0 / 0.60    64.6 / 0.71    60.7 / 1.17
Euler 2D       GPT-5      92.6 / 0.20    89.8 / 0.22    94.9 / 0.41    92.5 / 0.26
               Claude     65.7 / 0.06    68.1 / 0.13    70.7 / 0.20    68.2 / 0.12
               Llama      99.1 / 0.58    99.1 / 0.54    99.0 / 0.79    99.0 / 0.63
               Qwen       96.3 / 0.40    99.1 / 0.42    98.0 / 0.70    97.8 / 0.49
               GPT-OSS    89.8 / 0.56    86.1 / 0.52    98.0 / 0.63    91.3 / 0.57
NS Trans. 2D   GPT-5      84.7 / 1.74    94.4 / 0.69    91.7 / 0.47    90.3 / 0.83
               Claude     76.4 / 1.38    75.7 / 0.58    83.3 / 0.47    78.5 / 0.72
               Llama      77.8 / 1.01    83.9 / 0.56    88.9 / 0.52    83.5 / 0.66
               Qwen       72.2 / 1.72    86.5 / 0.52    88.9 / 0.57    82.6 / 0.80
               GPT-OSS    83.3 / 1.24    83.9 / 0.38    80.6 / 0.64    82.6 / 0.67

Table 21. Single-round results for Heat & Diffusion solvers. S = Success Rate (%), E = Efficiency.
Solver          Model      Low (S / E)     Med (S / E)    High (S / E)   Avg (S / E)
Heat 1D         GPT-5      88.0 / 0.38     80.0 / 0.36    55.4 / 0.05    74.5 / 0.19
                Claude     96.0 / 0.45     68.0 / 0.99    45.6 / 0.53    69.9 / 0.62
                Llama      100.0 / 0.01    92.0 / 0.05    82.4 / 0.23    91.5 / 0.05
                Qwen       98.0 / 0.47     68.0 / 0.78    35.9 / 0.29    67.3 / 0.47
                GPT-OSS    100.0 / 0.71    65.3 / 0.56    33.0 / 0.94    66.1 / 0.72
Heat 2D         GPT-5      65.2 / 1.96     60.3 / 1.08    53.9 / 0.80    59.8 / 1.19
                Claude     49.4 / 4.28     37.8 / 1.37    33.3 / 0.67    40.2 / 1.58
                Llama      49.4 / 3.03     13.5 / 0.80    12.3 / 0.24    25.1 / 0.84
                Qwen       34.6 / 19.23    12.5 / 0.97    10.4 / 1.01    19.2 / 2.66
                GPT-OSS    48.8 / 2.74     24.1 / 1.14    14.3 / 0.76    29.1 / 1.33
Diff-React 1D   GPT-5      70.4 / 0.60     82.5 / 0.88    70.0 / 0.47    74.3 / 0.63
                Claude     64.2 / 0.65     87.3 / 0.60    65.8 / 0.73    72.4 / 0.66
                Llama      66.7 / 0.23     50.5 / 0.82    32.5 / 0.64    49.9 / 0.49
                Qwen       69.1 / 0.51     50.5 / 0.84    35.0 / 0.59    51.5 / 0.63
                GPT-OSS    66.7 / 0.54     48.9 / 0.80    43.6 / 0.74    53.1 / 0.68

Table 22. Multi-round results for Heat & Diffusion solvers. S = Success Rate (%), E = Efficiency.

Solver          Model      Low (S / E)     Med (S / E)    High (S / E)   Avg (S / E)
Heat 1D         GPT-5      96.0 / 0.88     92.0 / 0.71    88.1 / 0.86    92.0 / 0.81
                Claude     92.0 / 0.82     88.0 / 1.01    83.2 / 0.88    87.7 / 0.90
                Llama      92.0 / 0.66     93.3 / 0.51    86.8 / 0.53    90.7 / 0.56
                Qwen       92.0 / 0.85     88.7 / 0.73    84.6 / 0.81    88.4 / 0.80
                GPT-OSS    92.0 / 0.97     93.3 / 0.83    92.3 / 0.85    92.5 / 0.88
Heat 2D         GPT-5      94.2 / 3.68     96.2 / 1.21    91.0 / 0.93    93.8 / 1.61
                Claude     44.6 / 6.07     63.5 / 1.38    72.0 / 1.04    60.0 / 2.06
                Llama      83.7 / 0.76     98.6 / 1.52    85.9 / 0.79    89.4 / 0.97
                Qwen       45.8 / 30.54    60.8 / 1.54    44.5 / 0.72    50.4 / 3.24
                GPT-OSS    62.9 / 6.01     81.9 / 1.74    77.8 / 0.78    74.2 / 2.01
Diff-React 1D   GPT-5      87.7 / 0.63     74.0 / 0.42    75.8 / 0.75    79.2 / 0.58
                Claude     77.8 / 0.42     67.6 / 0.36    71.7 / 0.97    72.4 / 0.52
                Llama      70.4 / 0.43     61.9 / 0.90    91.7 / 0.49    74.6 / 0.57
                Qwen       74.1 / 0.55     55.2 / 0.98    78.3 / 0.44    69.2 / 0.62
                GPT-OSS    66.7 / 0.83     70.8 / 0.41    74.2 / 0.50    70.5 / 0.55

Table 23. Single-round results for Solid Mechanics solvers. S = Success Rate (%), E = Efficiency.
Solver Model Low Med High A vg S E S E S E S E FEM 2D GPT -5 73.0 0.13 80.0 0.14 39.7 0.47 64.3 0.20 Claude 73.0 0.19 58.4 0.14 52.6 0.31 61.3 0.20 Llama 65.5 0.10 63.9 0.26 67.9 0.22 65.8 0.18 Qwen 70.0 0.15 63.9 0.22 51.3 0.21 61.7 0.19 GPT -OSS 42.0 0.23 51.9 0.39 49.6 0.20 47.8 0.26 MPM 2D GPT -5 38.9 0.35 47.2 0.17 39.4 0.15 41.8 0.21 Claude 33.3 0.20 41.7 0.44 27.8 0.55 34.3 0.36 Llama 16.7 0.00 5.6 0.00 11.1 0.00 11.1 0.00 Qwen 38.9 0.45 33.3 0.50 27.8 0.57 33.3 0.50 GPT -OSS 27.8 0.11 33.3 0.16 33.3 0.32 31.5 0.18 T able 24. Multi-round results for Solid Mechanics solvers. S: Success Rate (%), E: Efficienc y . Solver Model Low Med High A vg S E S E S E S E FEM 2D GPT -5 90.0 0.44 91.5 0.60 100.0 0.72 93.8 0.58 Claude 57.5 0.43 63.2 0.32 53.8 0.38 58.2 0.38 Llama 46.0 0.10 48.5 0.21 37.2 0.38 43.9 0.20 Qwen 75.5 0.45 66.9 0.61 60.3 0.68 67.5 0.57 GPT -OSS 85.0 0.40 89.2 0.72 86.8 0.95 87.0 0.65 MPM 2D GPT -5 55.6 0.36 36.1 1.29 27.8 0.93 39.8 0.76 Claude 38.9 0.13 33.3 0.18 33.3 0.36 35.2 0.21 Llama 33.3 0.48 25.0 0.05 16.7 1.00 25.0 0.28 Qwen 36.1 1.02 33.3 0.62 33.3 1.00 34.3 0.86 GPT -OSS 50.0 0.37 33.3 0.58 33.3 0.87 38.9 0.57 46 SimulCost: A Cost-A ware Benchmark and T oolkit for A utomating Physics Simulations with LLMs T able 25. Single-round results for Plasma Physics solv ers. S: Success Rate (%), E: Efficiency . 
Solver Model Low Med High A vg S E S E S E S E EPOCH 1D GPT -5 82.2 0.05 73.9 0.02 80.4 0.01 78.9 0.03 Claude 77.8 0.28 74.2 0.29 70.8 0.29 74.3 0.29 Llama 51.1 0.15 42.9 0.13 61.6 0.10 51.9 0.13 GPT -OSS 93.3 0.40 84.1 0.56 79.2 0.55 85.5 0.50 HM Linear GPT -5 97.6 0.35 91.1 0.49 85.2 0.65 91.3 0.48 Claude 81.3 0.06 61.9 0.22 64.9 0.22 69.4 0.14 Llama 77.2 0.04 54.2 0.59 36.7 0.70 56.0 0.25 Qwen 81.3 0.05 56.2 0.80 45.0 0.65 60.8 0.30 GPT -OSS 88.4 0.14 62.3 0.28 44.7 0.27 65.1 0.22 HM Nonlinear GPT -5 96.0 0.05 90.0 0.04 82.5 0.06 89.5 0.05 Claude 83.7 0.01 77.7 0.01 68.3 0.02 76.6 0.01 Llama 54.3 0.01 32.0 0.01 36.1 0.10 40.8 0.02 Qwen 88.9 0.02 79.3 0.01 73.0 0.02 80.4 0.02 GPT -OSS 100.0 0.04 98.0 0.03 82.5 0.04 93.5 0.04 T able 26. Multi-round results for Plasma Physics solv ers. S: Success Rate (%), E: Efficiency . Solver Model Low Med High A vg S E S E S E S E EPOCH 1D GPT -5 80.7 0.25 73.9 0.22 78.3 0.14 77.7 0.20 Claude 94.1 0.80 81.4 1.03 67.6 0.94 81.0 0.92 Llama 75.6 0.39 70.4 0.51 67.7 0.56 71.2 0.48 GPT -OSS 97.8 2.84 94.8 3.30 92.8 4.71 95.1 3.53 HM Linear GPT -5 98.8 0.53 97.8 0.81 100.0 0.86 98.9 0.72 Claude 59.9 0.06 72.1 0.17 85.6 0.21 72.5 0.13 Llama 66.7 0.12 65.3 0.35 64.4 0.41 65.5 0.26 Qwen 84.9 0.09 68.3 0.14 63.0 0.86 72.1 0.22 GPT -OSS 100.0 0.42 98.7 0.62 100.0 0.58 99.6 0.53 HM Nonlinear GPT -5 100.0 0.19 98.0 0.11 91.7 0.36 96.6 0.20 Claude 100.0 0.02 96.0 0.01 91.7 0.05 95.9 0.02 Llama 68.6 0.11 76.6 0.03 76.6 0.09 73.9 0.07 Qwen 92.9 0.04 81.7 0.07 83.7 0.15 86.1 0.08 GPT -OSS 100.0 0.19 98.0 0.16 94.4 0.28 97.5 0.20 47
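A note on reading these tables: the Avg S column is consistent with the unweighted mean of the Low/Med/High success rates (e.g. Heat 1D, GPT-5, single-round: (88.0 + 80.0 + 55.4)/3 ≈ 74.5, matching Table 21). The Avg E column does not follow the same simple mean, so its aggregation is presumably weighted and is not reproduced here. A minimal sketch of the success-rate aggregation, under that assumption:

```python
# Sanity-check for the Avg S column: assumed to be the unweighted mean of the
# per-difficulty success rates. Avg E uses a different (unspecified)
# aggregation and is deliberately not modeled here.
def avg_success(low: float, med: float, high: float) -> float:
    """Unweighted mean of Low/Med/High success rates, rounded to one decimal."""
    return round((low + med + high) / 3, 1)

# Heat 1D, GPT-5, single-round (Table 21): 88.0 / 80.0 / 55.4 -> Avg S 74.5
assert avg_success(88.0, 80.0, 55.4) == 74.5
# Heat 1D, Qwen, single-round (Table 21): 98.0 / 68.0 / 35.9 -> Avg S 67.3
assert avg_success(98.0, 68.0, 35.9) == 67.3
```

A few printed Avg S values differ from this mean by 0.1 (e.g. NS Trans. 2D, Qwen), which is consistent with the table rounding each column from unrounded intermediates.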