The rapid growth of cloud computing in the Electronic Design Automation (EDA) industry has created a critical need for resource and job-lifetime prediction to achieve optimal scheduling. Traditional machine learning methods often struggle with the complexity and heterogeneity of EDA workloads, requiring extensive feature engineering and domain expertise. We propose a novel framework that fine-tunes Large Language Models (LLMs) to address this challenge through text-to-text regression. We introduce a scientific-notation output format and prefix filling to constrain the LLM, significantly improving output-format reliability. Moreover, we find that full-attention fine-tuning and inference improve the prediction accuracy of sliding-window-attention LLMs. We demonstrate the effectiveness of the proposed framework on real-world cloud datasets, setting a new baseline for performance prediction in the EDA domain.
The semiconductor industry relies heavily on Electronic Design Automation (EDA) tools to design and verify complex ICs. As chip designs grow in complexity, the computational demands of EDA workloads have skyrocketed, leading to a massive migration of these tasks to cloud computing platforms (Bavikadi et al., 2022; Stok, 2014). While the cloud offers scalability and flexibility, efficiently managing resources to control costs without compromising performance remains a fundamental challenge (Liu et al., 2023). Accurate prediction of a job's compute resource requirements (e.g., CPU, memory, and disk) and its execution time, or lifetime, is crucial for efficient workload prioritization, real-time resource provisioning, and long-term infrastructure planning.
Traditional approaches to this prediction problem often rely on statistical methods or machine learning models such as Directed Acyclic Graph (DAG)-based models (Huang, 2021) or graph convolutional networks (Kipf, 2016; Zhu et al., 2024). However, these methods require structured, tabular data, forcing engineers to perform extensive and often brittle feature engineering. EDA job configurations are inherently complex and semi-structured, comprising tool settings, design parameters, technology-node details, and script configurations. Flattening this rich information into a fixed-length vector is challenging and often leads to a loss of critical contextual information, limiting predictive performance.
The recent success of Large Language Models (LLMs) in diverse domains has opened up new possibilities for tackling complex regression tasks through a text-to-text formulation (Song and Bahri, 2025; Song et al., 2024). However, this opportunity has not yet been explored for EDA cloud job prediction. For the first time, by representing the entire EDA job configuration as a single string, we directly train an LLM to “read” the configuration and “write” the predicted resource and lifetime values. This approach uses the LLM to encode the semi-structured job representation and learn to extract predictive signals from the inherent structure, relations, and dependencies in the data. In this paper, we present a framework for fine-tuning LLMs for EDA job prediction. We demonstrate that this first text-to-text regression approach is not only feasible but highly effective:
• We provide the first validation of training LLMs on semi-structured EDA data for predicting the resource consumption and lifetime of EDA cloud jobs, establishing a new modeling paradigm for this problem space.
• We propose two key techniques to enhance performance: representing numerical outputs in scientific notation to handle large dynamic ranges, and using constrained decoding to guide the model’s output, improving both accuracy and robustness. Moreover, we demonstrate that full-attention fine-tuning can further improve the generation accuracy of a sliding-window pre-trained LLM.
• We empirically validate our framework on real-world EDA datasets, demonstrating significant improvements over various manual and heuristic baselines.
EDA workflows consist of a series of computational jobs, such as logic synthesis, place-and-route, timing analysis, and physical verification. The performance of these jobs is highly dependent on a multitude of factors, including but not limited to:
Design Characteristics: The size and complexity of the circuit design (e.g., number of logic gates, memory blocks).
Tool Configuration: The specific EDA tool, its version, and the multitude of settings and flags used for a particular run.
Technology Node: The target semiconductor manufacturing process (e.g., 7nm, 5nm), which dictates physical design rules.
Execution Environment: The underlying cloud infrastructure, including VM types and storage solutions.
The interplay between these factors creates a high-dimensional and complex feature space. A minor change in a synthesis script could lead to a drastically different netlist and cause a tenfold increase in the runtime of the subsequent place-and-route stage. This sensitivity makes prediction a complicated regression task.
Let $\mathcal{X}$ denote the space of heterogeneous EDA job configurations, where a specific job instance $x \in \mathcal{X}$ encapsulates parameters such as dependency graphs, command-line arguments, and hardware constraints. Our objective is to predict a set of performance metrics $Y \in \mathbb{R}^{m}$, which includes peak memory usage, disk I/O, CPU utilization, and wall-clock execution time.
Traditional approaches typically frame this as a regression problem, requiring a feature extraction function $\phi: \mathcal{X} \to \mathbb{R}^{d}$ to map the complex configuration $x$ into a fixed-size feature vector. A regression model $f_\theta$ is then learned such that $f_\theta(\phi(x)) \approx Y$. The primary limitation of this paradigm lies in the design of $\phi(\cdot)$, which often struggles to capture the nuanced semantics of textual and hierarchical data within $x$. In contrast, we formulate this task as a sequence-to-sequence generation problem. We define a serialization function $\tau(\cdot)$ that transforms the structured job configuration $x$ into a sequence of input tokens $\mathbf{x}_{\text{job}} = \tau(x)$. Similarly, the target metrics $Y$ are serialized into a target sequence $\mathbf{y} = (y_1, y_2, \ldots, y_T)$. The goal is to learn the parameters $\theta$ of a Large Language Model (LLM) that approximates the conditional probability distribution: $P_\theta(\mathbf{y} \mid \mathbf{x}_{\text{job}}) = \prod_{t=1}^{T} P_\theta(y_t \mid \mathbf{x}_{\text{job}}, y_{<t})$.
Standard approaches for adapting LLMs to regression tasks often involve appending a dense linear layer on top of the final hidden state, minimizing the Mean Squared Error (MSE) on the continuous target metrics $Y$. However, this method exhibits poor generalization when the target domain is expansive, as is typical for EDA resource metrics like RAM and disk that span several orders of magnitude (e.g., from $0$ to $10^{5}$). The limited capacity of the linear head, combined with the difficulty of backpropagating stable gradients across such a wide continuous range, introduces significant training instability. To circumvent this, we treat the prediction task as a sequence-to-sequence problem: by tokenizing the numerical values, we effectively convert the continuous regression into a highly structured, discrete token-level classification problem. This approach allows us to leverage the intrinsic sequence-modeling capabilities of the Transformer architecture and the stability of the cross-entropy loss ($\mathcal{L}_{CE}$). We outline the key components of our framework, including input representation, our proposed techniques for numerical stability, and the fine-tuning process, in Fig. 1.
To leverage the power of LLMs, we must first represent the EDA job configuration as a coherent string. We serialize the job information in order of descending importance: source file name (no source code), tags (action type, application type, etc.), initial scheduling priority, build configuration, execution specification (launching command, EDA tools), dependencies, caching policy, expected state of execution results (success or fail), and replication.
These configurations are often available in formats such as CSV, Protocol Buffers, YAML, or proprietary script files. We process them into a unified key-value format as a JSON file; JSON preserves the hierarchical structure and semantics of the configuration. An example of a serialized input string is shown in the blue box of Fig. 1(a). The output metrics are formatted into JSON as well. A naive approach would be to simply convert the numbers to their string representations. However, EDA metrics span a vast range of values (e.g., memory from megabytes to terabytes). To handle this, we introduce a scientific notation representation, detailed in Section 3.2.
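To make the serialization concrete, below is a minimal sketch in Python; the field names and the example job dictionary are hypothetical placeholders rather than our exact production schema.

```python
import json

def serialize_job(job: dict, field_order: list) -> str:
    """Serialize a job configuration into a key-value JSON string,
    keeping fields in descending order of importance."""
    ordered = {k: job[k] for k in field_order if k in job}
    # Any fields not listed explicitly are appended at the end.
    ordered.update({k: v for k, v in job.items() if k not in ordered})
    return json.dumps(ordered, ensure_ascii=False)

# Hypothetical example job; real configurations are far richer.
job = {
    "source_file": "top_block_dv.tcl",
    "tags": {"action_type": "simulate", "application_type": "DV"},
    "priority": 3,
    "build_config": "opt",
    "exec_spec": {"command": "run_sim", "eda_tool": "<tool name>"},
}
field_order = ["source_file", "tags", "priority", "build_config", "exec_spec"]
print(serialize_job(job, field_order))
```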
Remark: The job information is often too long for efficient LLM fine-tuning or inference. In our industrial dataset, for instance, half of the examples exceed 4k tokens, and 10% of them are longer than 18k tokens. To balance performance and computational efficiency, we set the maximum sequence length to 2048 and keep tokens from left to right, truncating the excess tail. Because the job information is ordered by descending importance, this clipping preserves the most critical information. We present an ablation study in Section 4.5.3.
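A minimal sketch of this clipping step; the tokenizer checkpoint is illustrative, and the 2048 limit matches the setting above.

```python
from transformers import AutoTokenizer

MAX_LEN = 2048
# Illustrative checkpoint; any tokenizer with single-digit number tokens works similarly.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

def clip_job_text(job_text: str) -> str:
    """Keep the left-most MAX_LEN tokens (the most important fields come first)
    and drop the tail of overly long job descriptions."""
    ids = tokenizer(job_text, add_special_tokens=False)["input_ids"]
    return tokenizer.decode(ids[:MAX_LEN])
```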
The standard tokenization of floating-point numbers in LLMs is often inefficient and can struggle with precision across different orders of magnitude. To address this, we represent all numerical target values in a standardized scientific notation format, in line with recent work (Akhauri et al., 2025; Song and Bahri, 2025; Song et al., 2024). For the example 12300, i.e., 1.23e+04, we use dedicated tokens for the sign, mantissa, and exponent (e.g., <1>, <.>, <2>, <3>, an exponent marker, <+>, <0>, <4>); a minimal encoding sketch is given after the list below. This approach has several advantages:
Compactness and Consistency: It provides a fixed-format, compact representation for numbers of any scale, which is easier for the model to learn.
Normalization-Free: It obviates the need for per-task normalization of target values, simplifying the training pipeline when dealing with multiple prediction targets (e.g., memory and CPU) with different scales.
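A minimal encoding sketch, assuming a two-decimal mantissa and a signed two-digit exponent; the exact precision and the spelling of the exponent-marker token in our tokenizer may differ.

```python
def to_scientific(value: float, mantissa_digits: int = 2) -> str:
    """Encode a number as sign, mantissa, exponent marker, and signed exponent,
    e.g. 12300 -> '+1.23e+04'; each character maps to a single vocabulary token."""
    return f"{value:+.{mantissa_digits}e}"

def from_scientific(text: str) -> float:
    """Invert the encoding back to a float for metric computation."""
    return float(text)

assert to_scientific(12300) == "+1.23e+04"
assert from_scientific("+1.23e+04") == 12300.0
```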
We implement the proposed formulation using a decoder-only Transformer architecture, as shown in Fig. 1. Let the input context be denoted by $C = [\mathbf{p}_{\text{sys}}; \mathbf{x}_{\text{job}}]$. We apply full attention across the entire sequence. The model is optimized end-to-end by minimizing the autoregressive cross-entropy loss $\mathcal{L}_{CE}$ specifically on the tokens corresponding to the resource metrics:
$$\mathcal{L}_{CE} = -\sum_{t=1}^{T} \log P_\theta\!\left(y_t \mid C, \, y_{<t}\right),$$
where $y_t$ denotes the $t$-th token of the ground-truth metrics $\mathbf{y}_{\text{gt}}$ and $y_{<t}$ the preceding metric tokens. This masking strategy ensures that the model learns the causal correlation between the EDA workload parameters defined in $C$ and the resulting resource consumption.
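A minimal sketch of this loss masking in PyTorch, assuming the usual convention of labeling non-supervised (context) positions with -100 so they are ignored by the cross-entropy; tensor names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label are ignored by cross_entropy

def build_inputs_and_labels(context_ids: torch.Tensor, metric_ids: torch.Tensor):
    """Concatenate context and metric tokens; supervise only the metric tokens."""
    input_ids = torch.cat([context_ids, metric_ids])
    labels = torch.cat([torch.full_like(context_ids, IGNORE_INDEX), metric_ids])
    return input_ids, labels

def metric_ce_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy computed only where labels != IGNORE_INDEX.
    logits: (seq_len, vocab_size); labels: (seq_len,)."""
    shift_logits = logits[:-1]   # position t predicts token t+1
    shift_labels = labels[1:]
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=IGNORE_INDEX)
```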
On large-scale EDA cloud infrastructure, the volume of workload processing is immense, often reaching millions of jobs daily. Consequently, even a trivial failure rate in result generation can cascade into significant operational disruptions or data pipeline failures. Standard autoregressive decoding, as illustrated in Fig. 1(c), lacks structural guarantees; it is prone to “hallucinations” such as malformed JSON syntax, the fabrication of non-existent keys, or the generation of irrelevant conversational text. To eliminate these stochastic instabilities and ensure strictly parseable outputs, we implement a Constrained Decoding framework that enforces structural rigidity. As depicted in Fig. 1(d), our decoding strategy partitions the generation process into two distinct phases: Deterministic Prefix Bypass and Constrained Numeric Sampling.
To maintain schema consistency, we treat the JSON keys and structural delimiters as deterministic templates. Instead of allowing the model to predict these static tokens, we employ a generation prefix filling technique. During inference, the specific key strings (e.g., {“life_time (s)”: “) are automatically appended to the input context window, as illustrated in Fig. 1(d).
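A minimal sketch of the prefix-filling step; the key name follows the example above, and the surrounding JSON scaffolding is illustrative.

```python
def build_prefilled_prompt(job_text: str, key: str) -> str:
    """Append the deterministic JSON prefix for a metric key so the model only
    has to generate the numeric value (the keys are never sampled)."""
    return f'{job_text}\n{{"{key}": "'

prompt = build_prefilled_prompt("<serialized job configuration>", "life_time (s)")
# The model continues from this prompt and emits only the scientific-notation
# value, e.g. +1.23e+04, followed by the closing delimiters.
```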
Once the deterministic prefix is supplied, the model must generate the specific numerical values. To prevent the generation of non-numerical text, we dynamically restrict the model’s vocabulary space $\mathcal{V}$. In the logical flow of Fig. 1(d), we apply a masking function $\mathcal{M}$ to the output logits at each step $t$ where a numerical value is expected. We define a valid token subset $\mathcal{V}_{\text{num}} \subset \mathcal{V}$, which consists strictly of the digit tokens <0> to <9>. During the generation of value fields, the probability of any token $w_i \notin \mathcal{V}_{\text{num}}$ is set to zero. Similarly, we sample symbol tokens only from <+> and <->. This ensures that the output is always a valid number in scientific notation, effectively reducing the parsing error rate to zero.
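A minimal sketch of the numeric logit mask as a Hugging Face LogitsProcessor; it assumes the tokenizer maps each digit and sign character to a single token (as in Gemma-3 and Qwen-3), and the set of allowed characters is illustrative.

```python
import torch
from transformers import LogitsProcessor

class NumericOnlyProcessor(LogitsProcessor):
    """At value-generation steps, suppress every token outside the
    scientific-notation alphabet (digits, '.', 'e', '+', '-')."""

    def __init__(self, tokenizer, allowed_chars="0123456789.e+-"):
        # Assumes each allowed character exists as a single token in the vocabulary.
        self.allowed_ids = torch.tensor(
            [tokenizer.convert_tokens_to_ids(c) for c in allowed_chars]
        )

    def __call__(self, input_ids, scores):
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed_ids] = 0.0   # keep allowed tokens, forbid the rest
        return scores + mask

# Typical usage: pass LogitsProcessorList([NumericOnlyProcessor(tokenizer)])
# to model.generate(...) whenever a numeric value is being decoded.
```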
We describe the experimental setup, including the baselines, datasets, and implementation details. We then present our main results and conduct ablation studies to analyze the efficacy of our methodology.
To evaluate the effectiveness of our fine-tuned Large Language Model (LLM), we establish two baselines for comparison:
User-Requested Resources: represent the resource allocations initially requested by users for their EDA jobs, providing a measure of the “as-is” resource consumption without any optimization.
Heuristic-based Resource Optimizations: an existing heuristic-based system that suggests resource allocations. It operates by searching past runs of similar jobs within the most recent time window and using historical peak resource usage as its output.
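For concreteness, a minimal sketch of such a heuristic with hypothetical field names; the production system's notion of job similarity and its time window are richer than shown.

```python
from datetime import datetime, timedelta
from typing import Optional

def heuristic_peak_ram(job_key: str, history: list, window_days: int = 14,
                       now: Optional[datetime] = None) -> Optional[float]:
    """Return the peak RAM observed for similar past jobs within the recent
    time window, or None if no similar run exists."""
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=window_days)
    similar = [run["peak_ram_gb"] for run in history
               if run["job_key"] == job_key and run["finished_at"] >= cutoff]
    return max(similar) if similar else None
```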
The main evaluation is performed on two datasets, each comprising design verification (DV) jobs for a specific chip project of billion-scale combinatorial complexity.
We use pre-trained LLMs as our base models for supervised fine-tuning. We consider the Gemma-3 (Team et al., 2025) and Qwen-3 (Yang et al., 2025) series, not only because they show state-of-the-art performance, but also because their vocabularies contain single-digit tokens from <0> to <9> rather than multi-digit tokens like <23> or <456>. This property makes the tokenization of scientific notation deterministic, ensuring the correctness of the proposed constrained decoding method in Section 3.4.
We use Low-Rank Adaptation (LoRA) (Hu et al., 2022) for efficient fine-tuning, with rank 8, alpha 16, and dropout 0.05. We train for 2 epochs with a maximum sequence length of 2048, a learning rate of $2 \times 10^{-5}$, and a batch size of 64. All experiments are conducted on a 4 × 8 H100 platform. To eliminate randomness during evaluation, we set the LLM's temperature to 0, resulting in greedy generation. The evaluation metrics in our experiments are mean absolute error (MAE), Pearson correlation ($r$), and Spearman correlation ($r_s$).
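A minimal sketch of the fine-tuning setup using the peft library; the checkpoint name and target modules are illustrative, while the LoRA hyperparameters match those listed above.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model_name = "Qwen/Qwen3-8B"  # illustrative; any Gemma-3 / Qwen-3 checkpoint applies
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=8,                                  # LoRA rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative choice of adapted layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Training loop (not shown): 2 epochs, max sequence length 2048,
# learning rate 2e-5, batch size 64; evaluation decodes greedily (temperature 0).
```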
As shown in Table 1, the fine-tuned LLMs significantly outperform the traditional baselines across both datasets. Fig. 3 illustrates Gemma-3-12B predictions on dataset 1, demonstrating generalization across the target value range. Furthermore, we observe a clear scaling law within the model family: increasing the parameter count from 270M to 12B for Gemma-3 yields an improvement in performance, characterized by lower MAE and higher Pearson and Spearman correlation coefficients. The larger models (Gemma-3-12B and Qwen-3-8B) achieve the highest fidelity, confirming that sufficient prior knowledge is essential for our task.
A critical requirement for fine-tuned models is the ability to generalize to new, unseen data. We assessed the Gemma-3-4B model, which was fine-tuned on Dataset 2. We then tested its performance on data from subsequent, non-overlapping time periods, partitioning the future data into 5-day windows (up to 26 days).
The results are detailed in Table 2. While the model's performance degrades somewhat, especially on life_time and cpu_max prediction, it remains competitive and acceptable. Moreover, the model demonstrates strong resistance to temporal drift across the full 26-day period. This overall stability suggests the model has learned fundamental, generalizable patterns of resource consumption from Dataset 2, rather than overfitting to the specific job distribution of the training period.
In addition to enforcing the correct generation format, as illustrated in Fig. 1, we demonstrate another benefit of the proposed constrained decoding method: temporal efficiency during LLM inference. We compare the total wall-clock time required to generate responses using standard decoding versus our constrained decoding approach. The experiments are conducted on dataset 1 across the Gemma-3-1B, Gemma-3-4B, and Gemma-3-12B models.
As illustrated in Fig. 4, our method significantly reduces the computational cost of generation. Specifically, we observe a latency reduction of more than 30% for all three models. This speedup is attributed to the constraint mechanism bypassing the LLM's forward passes over deterministic tokens, thereby reducing the total number of tokens generated per example compared to the unconstrained baseline.
We employ standard Supervised Fine-Tuning (SFT), which minimizes cross-entropy (CE) loss rather than a generation error metric like MAE. To understand the relationship between these objectives, we track both the test CE loss and the test MAE of model checkpoints during fine-tuning on Dataset 2. We evaluate intermediate checkpoints of the Qwen-3-8B model at steps 500, 1000, 1500, 2000, and the final step on test data; the results are shown in Table 3 from the first row to the last, respectively. We observe a strong correlation: as the CE loss decreases with more training, the MAE for all resource metrics also consistently decreases. The best performance is achieved at the final training stage, where both the CE loss and all MAE values are at their minimum. This suggests that minimizing CE loss is an effective proxy for improving final prediction accuracy on this task.
To evaluate the impact of context visibility, we conducted an ablation study using the default Sliding Window Attention (SWA) configurations for Gemma-3 models (a 512-token sliding window for 270M and 1B, and a 1024-token window for 4B and 12B). We fine-tuned them on dataset 1 and compared their Pearson and Spearman correlations to those of the full-attention fine-tuning in Table 1.
As illustrated in Fig. 5, full attention yields a performance gap of +0.078 on Dataset 1 (0.767 vs. 0.689) and +0.100 on Dataset 2 (0.834 vs. 0.734). While the sliding-window approach remains competitive at larger scales (e.g., 12B), the full-attention mechanism ensures robust performance across both datasets: it improves the average $r$ from 0.695 to 0.734 on dataset 1 and from 0.752 to 0.761 on dataset 2. This enhancement shows that global context visibility is critical for minimizing regression errors in the EDA job task.
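A hedged sketch of how the two attention settings might be toggled via the model configuration; the attribute names below are assumptions about recent Hugging Face Gemma-3 text configs and should be verified against the installed transformers version.

```python
from transformers import AutoConfig

# Gated checkpoint used for illustration only; requires Hugging Face access.
config = AutoConfig.from_pretrained("google/gemma-3-1b-it")

# Sliding-window variant: keep the pretrained default (e.g., a 512-token window).

# Full-attention variant (assumed knobs, guarded because names may differ):
if hasattr(config, "sliding_window"):
    config.sliding_window = None  # disable the local attention window
if hasattr(config, "layer_types"):
    # force every layer to use global (full) attention
    config.layer_types = ["full_attention"] * config.num_hidden_layers
```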
We analyze the maximum sequence length to balance information retention against computational cost. Gemma-3-4B is trained on dataset 1 with different maximum sequence lengths: 512, 1024, 2048, 4096, and 8192. Fig. 6 shows the normalized MAE across all regression metrics. The results demonstrate that increasing the sequence length improves regression accuracy by incorporating more relevant job information. However, lengths beyond 2048, such as 4096 or 8192, provide only marginal benefits. Since a maximum sequence length of 2048 also minimizes RAM and disk prediction error, we adopt it as the optimal setting.
We introduced a novel framework for predicting EDA cloud job resource usage and lifetime by leveraging LLMs. By formulating the multi-output regression problem as a sequence-to-sequence generation task, we minimize the need for brittle manual feature engineering and separate regression-model training, and enable LLMs to learn directly from rich, semi-structured job configuration data. Our primary contributions are the first application of sequence modeling to this critical industrial setting and the introduction of two key techniques, (1) scientific notation for outputs and (2) constrained decoding, that collectively improve the accuracy, robustness, and numerical stability of the predictions. We also demonstrated the benefit of full-attention fine-tuning over a sliding-window LLM. The effectiveness of our framework was showcased on challenging real-world EDA datasets, establishing a new and powerful baseline for resource and lifetime prediction.