LeJOT-AutoML: LLM-Driven Feature Engineering for Job Execution Time Prediction in Databricks Cost Optimization



Lizhi Ma†‡, Yi-Xiang Hu†, Yihui Ren‡, Feng Wu∗†, Xiang-Yang Li†
†University of Science and Technology of China, Hefei, China
‡Lenovo, Beijing, China
{malizhi,yixianghu}@mail.ustc.edu.cn, renyh10@lenovo.com, {wufeng02,xiangyangli}@ustc.edu.cn

Abstract: Databricks job orchestration systems (e.g., LeJOT) reduce cloud costs by selecting low-priced compute configurations while meeting latency and dependency constraints. Accurate execution-time prediction under heterogeneous instance types and non-stationary runtime conditions is therefore critical. Existing pipelines rely on static, manually engineered features that under-capture runtime effects (e.g., partition pruning, data skew, and shuffle amplification), and predictive signals are scattered across logs, metadata, and job scripts, which lengthens update cycles and increases engineering overhead. We present LeJOT-AutoML, an agent-driven AutoML framework that embeds large language model agents throughout the ML lifecycle. LeJOT-AutoML combines retrieval-augmented generation over a domain knowledge base with a Model Context Protocol toolchain (log parsers, metadata queries, and a read-only SQL sandbox) to analyze job artifacts, synthesize and validate feature-extraction code via safety gates, and train/select predictors. This design materializes runtime-derived features that are difficult to obtain through static analysis alone. On enterprise Databricks workloads, LeJOT-AutoML generates over 200 features and reduces the feature-engineering and evaluation loop from weeks to 20–30 minutes, while maintaining competitive prediction accuracy. Integrated into the LeJOT pipeline, it enables automated continuous model updates and achieves 19.01% cost savings in our deployment setting through improved orchestration.
Index Terms: automated machine learning, large language model, multi-agent systems, job orchestration, cost optimization.

∗ Corresponding author.

I. INTRODUCTION

Large language models (LLMs) [1] and agentic reasoning frameworks (e.g., ReAct [2]) have advanced automated code generation and document understanding [3]–[5]. However, enterprise machine-learning (ML) workflows for regression and forecasting still rely on manual feature engineering and brittle glue code, which slows adaptation to workload drift and platform evolution [6]. This challenge is pronounced in Databricks job orchestration, where a small prediction error can translate into repeated mis-provisioning across thousands of daily runs.

LeJOT [7] is a Databricks-based framework that minimizes execution cost under dependency and latency constraints by selecting resource configurations using an execution-time predictor [8], [9]. In production, the predictor must generalize across heterogeneous instance types, changing software stacks, and non-stationary data characteristics. Four obstacles arise in this setting. First, high-impact performance signals only emerge at runtime [10]: scan volume after partition pruning, skew-induced stragglers, shuffle amplification, and executor scheduling effects. Second, these signals are fragmented across multiple sources (logs, metadata, job scripts, and configuration histories), which complicates the end-to-end feature pipeline. Third, manual feature engineering demands domain expertise in Spark SQL and platform internals, and the resulting features often lag behind evolving workloads. Finally, slow retraining and validation cycles yield stale predictors under drift [11], which degrades orchestration quality and cost efficiency.

To address these limitations, we propose LeJOT-AutoML, an agent-driven ML pipeline that turns the conventional lifecycle [12], [13] into a dynamic, self-improving system [14], [15].
A Feature Analyzer Agent (FAA) retrieves domain knowledge via retrieval-augmented generation (RAG) [16] and proposes candidate feature templates that map to observable artifacts. A Feature Extraction Agent (FExA) then invokes a Model Context Protocol (MCP) toolchain [17], [18] (log parsers, metadata queries, and a read-only SQL sandbox) to materialize both static and runtime-derived features with execution-time validation. A feedback-enabled Feature Evaluation Agent (FEvA) evaluates feature quality and model performance, and iteratively refines the pipeline to accelerate adaptation under drift [11], [19]–[21]. Two safety gates, a code-completion checker and a data-leakage checker, filter invalid extractors and prevent label leakage.

Our contributions are summarized as follows:

• LLM-powered AutoML pipeline for enterprise job runtime prediction. We embed LLM agents across analysis, tool invocation, feature extraction, validation, training, and model selection to enable rapid retraining and inference-time feature materialization.

• Agent–tool collaborative feature extraction via MCP. By combining LLM planning with tool-based execution and verification, LeJOT-AutoML efficiently extracts dynamic features that are inaccessible to purely static analysis.

• Iterative evaluation loop with safety gates. We introduce a feedback-driven evaluation agent and safety checks (code-completion and data-leakage detection) to improve reliability and drive iterative refinement until predefined criteria [19]–[21] are met.

II. BACKGROUND AND MOTIVATION

LeJOT [7] performs cost-aware orchestration on Databricks by recommending resource configurations that minimize execution cost while satisfying dependency and latency constraints.
The orchestration relies on two coupled components: (i) execution-time estimation under candidate resource allocations and (ii) an optimizer that selects the lowest-cost configuration that meets predicted-time constraints. Prediction errors therefore propagate directly into orchestration: underestimation violates latency Service Level Objectives (SLOs), while overestimation drives over-provisioning and recurring cost waste.

A central obstacle is that the effective processed data volume of SQL workloads is largely determined at runtime. Although data volume is a strong predictor of execution time, it does not align with static metadata (e.g., table row counts). Actual scan and shuffle volumes depend on query logic, data distributions, and optimizer behavior, which invalidates many static proxies. Static feature engineering thus fails in the following scenarios:

1) Partition pruning: Queries over partitioned tables scan a subset of data; using the total table size overestimates scan volume by orders of magnitude.

2) Join skew: Skewed key distributions concentrate work on a subset of tasks, producing stragglers and violating linear scaling assumptions.

3) Aggregation-induced shuffle: Operators like GROUP BY trigger intermediate shuffles; input table size underestimates network and compute costs under shuffle amplification.

These cases show that "hidden" runtime features, such as selectivity, skew severity, shuffle degree, and stage-level variance, are critical for accuracy yet hard to capture with fixed, hand-written extraction logic. Moreover, the relevant evidence is distributed: the SQL text encodes logical operators, the execution plan and logs encode physical behavior, and the metadata store encodes schema and partition structure. A practical pipeline must bridge these sources and update quickly when workloads drift [11].
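To make the partition-pruning scenario concrete, the following minimal sketch (with hypothetical partition metadata; none of these names come from the paper) shows how a static table-size proxy diverges from a pruning-aware estimate, and how the ratio of the two yields a selectivity-style runtime feature:

```python
# Hypothetical per-partition sizes (GB) for a date-partitioned table.
partition_sizes = {"2024-01-01": 120.0, "2024-01-02": 115.0,
                   "2024-01-03": 118.0, "2024-01-04": 122.0}

def static_scan_estimate(sizes):
    """Static proxy: assume the whole table is scanned."""
    return sum(sizes.values())

def pruned_scan_estimate(sizes, selected_partitions):
    """Runtime-aware estimate: only partitions surviving pruning are scanned."""
    return sum(sizes[p] for p in selected_partitions if p in sizes)

# A query filtering on a single day touches one partition out of four.
static_gb = static_scan_estimate(partition_sizes)                  # 475.0
pruned_gb = pruned_scan_estimate(partition_sizes, ["2024-01-03"])  # 118.0
selectivity = pruned_gb / static_gb  # a "hidden" runtime-derived feature
```

Here the static proxy overestimates the scan by roughly 4x; on production tables with thousands of partitions the gap can reach orders of magnitude, which is precisely the signal a static pipeline never sees.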
LeJOT-AutoML addresses this gap by using LLM agents to synthesize and revise feature extractors grounded in tool-executed evidence (log parsing, metadata queries, and sandboxed SQL analysis), then validating them with safety gates before training.

III. LEJOT-AUTOML FRAMEWORK

A. Overview

Figure 1 shows the LeJOT-AutoML architecture and its integration into the LeJOT pipeline. The framework operates in two phases: (i) an automated training phase that generates and validates artifacts, and (ii) an online inference phase for real-time prediction.

Training Phase. LeJOT-AutoML forms a closed-loop AutoML pipeline consisting of five core components: the Feature Analyzer Agent (FAA), Feature Extraction Agent (FExA), a baseline model, the Feature Evaluation Agent (FEvA), and a model selector. Two safety gates (a code-completion checker and a data-leakage checker) ensure that generated feature-extraction code is executable and does not leak label information. The training workflow starts with FAA, which ingests three heterogeneous sources: execution logs (e.g., shuffle volumes and stage/task runtimes), a domain knowledge base (e.g., Spark SQL practices), and a metadata store (e.g., schema, partitions, and data statistics). Guided by RAG [22], FAA proposes a structured feature list. FExA then generates extraction programs and materializes feature values through an MCP toolchain (e.g., metadata queries and a read-only SQL sandbox). After passing safety checks, the resulting feature matrix is used to train baseline and candidate models. FEvA evaluates feature quality (coverage, skewness), feature utility (importance and redundancy), and model-level metrics, then emits actionable feedback for refinement. Finally, the model selector chooses the best-performing model (e.g., XGBoost [4], LightGBM [23]) and hyperparameters for deployment.

Inference Phase.
For a new job, FAA reuses learned feature templates to determine the required feature set and the corresponding extraction plan. FExA extracts features in parallel from scripts/metadata and via lightweight sandbox analysis. The standardized feature vector is fed into the deployed predictor to estimate execution time, which is then consumed by LeJOT's orchestration algorithm to select a cost-minimizing configuration. Prediction residuals and feature health signals are logged and fed back into the training loop for continuous updates.

B. System interfaces and design goals

LeJOT-AutoML sits between raw job artifacts and the downstream orchestration algorithm. Its inputs consist of (i) static artifacts (job code, configuration, cluster specification, metadata), (ii) runtime traces (logs and metrics), and (iii) an evolving knowledge base that stores domain rules and "feature experience" derived from previous runs. The outputs consist of a versioned feature specification and a deployed predictor with a reproducible extraction bundle.

We follow four design goals. (1) Low-latency inference: extraction plans prioritize features with bounded runtime cost, and the toolchain executes reads under a strict sandbox policy. (2) Safety and governance: every generated extractor passes syntactic completeness checks and leakage screening before execution. (3) Continuous adaptation: model retraining is triggered by drift signals or a periodic schedule, reducing staleness in dynamic workloads. (4) Traceability: each feature records provenance (source, transformation, and collection method), which supports debugging and compliance.

C. Mathematical Formulation

Let $D_t = \{d_i\}_{i=1}^{n_t}$ denote the dataset at time $t$, where each instance $d_i = (d_i^s, d_i^u)$ contains structured and unstructured information, and let $Y_t = \{y_i\}_{i=1}^{n_t}$ be the observed execution times.
Given a knowledge base $\mathcal{K}$ and an MCP toolset $\mathcal{T}$, the agent performs feature analysis $\phi_{\text{analyze}}$ and feature extraction $\phi_{\text{extract}}$. For each instance, the agent retrieves the domain context via RAG:

$R_i = \mathrm{RAG}(Q(d_i), \mathcal{K})$,  (1)

and determines a job-specific feature set:

$F_i = \phi_{\text{analyze}}(d_i, R_i)$.  (2)

Feature values are materialized via tool invocation:

$x_i = \{\phi_{\text{extract}}(f, d_i, \mathcal{T}) \mid f \in F_i\}$.  (3)

Stacking all instances yields the feature matrix $X_t = [x_1, x_2, \ldots, x_{n_t}]^\top$. A predictor $M(\cdot\,;\theta)$ is trained by minimizing a loss $\mathcal{L}$:

$\theta_t^* = \arg\min_\theta \mathcal{L}(X_t, Y_t; \theta)$,  (4)

and the deployed model $M_t(\cdot\,;\theta_t^*)$ predicts $\hat{y} = M_t(x_{\text{new}})$ for a new job $d_{\text{new}}$.

[Fig. 1. Overview of LeJOT-AutoML. (a) Left: the end-to-end LeJOT-AutoML system, comprising the knowledge base, metadata store, and execution logs feeding the FAA/FExA/FEvA agents, the two safety gates, and the model selector. (b) Right: how LeJOT-AutoML integrates into the LeJOT pipeline.]

Cost-aware feature selection. Online extraction introduces a latency budget that constrains the usable feature set. Let $c(f, d)$ denote the extraction cost of feature $f$ on job artifact $d$, and let $B$ be a per-request budget. FAA therefore targets feature sets that jointly improve prediction accuracy and satisfy runtime constraints:

$F_{\text{new}} = \arg\min_{F \subseteq \mathcal{F}} \; \mathbb{E}[\ell(M(x_F), y)] + \lambda \sum_{f \in F} c(f, d_{\text{new}})$  s.t.  $\sum_{f \in F} c(f, d_{\text{new}}) \le B$,  (5)

where $\mathcal{F}$ is the global feature universe, $x_F$ denotes the subvector restricted to $F$, and $\lambda$ controls the accuracy–latency tradeoff.

Safety constraints. Each generated extractor program $p_f$ is executed only if it passes two gates: a syntactic completeness predicate $g_{\text{cc}}(p_f) = 1$ and a leakage predicate $g_{\text{dl}}(p_f) = 1$. These gates impose hard constraints on feasible feature sets:

$\forall f \in F:\; g_{\text{cc}}(p_f) = 1 \wedge g_{\text{dl}}(p_f) = 1$.  (6)

Continuous updates. When new data $D_{\text{new}}$ arrives, the system updates $D_{t+1} = D_t \cup D_{\text{new}}$, $Y_{t+1} = Y_t \cup Y_{\text{new}}$, and retrains:

$\theta_{t+1}^* = \arg\min_\theta \mathcal{L}(X_{t+1}, Y_{t+1}; \theta)$.  (7)

This cycle is triggered periodically (e.g., daily) or by drift signals to maintain reliable predictions under workload evolution.

D. Implementation Details

We summarize implementation choices that make LeJOT-AutoML practical in an enterprise setting, focusing on (i) the MCP tool interface, (ii) execution and caching policies, and (iii) feedback and drift handling.

a) MCP tool interface and outputs: Each MCP tool exposes a restricted, typed interface and returns structured outputs (e.g., JSON-like records) that can be deterministically transformed into features. We group tools into three categories: (i) metadata tools (schema, partition layout, table statistics, cluster configuration), (ii) log/trace tools (stage/task timing, shuffle read/write, spill metrics, failure reasons), and (iii) sandbox tools that execute read-only SQL queries or lightweight plan inspection under strict policies.
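As an illustration only (the record fields and helper names below are our assumptions, not the deployed interface), a typed tool output and a pre-materialization schema check might look like:

```python
# Sketch of a typed tool-output record and its validation (assumed
# field names; the production MCP interface is not shown in the paper).
from dataclasses import dataclass

@dataclass
class FeatureRecord:
    name: str          # feature identifier
    dtype: str         # "numerical" | "categorical" | "text"
    value: object      # materialized value returned by a tool call
    default: object    # fallback when extraction fails
    provenance: str    # e.g., "log", "metadata", "sandbox_sql"

def validate_record(rec: FeatureRecord) -> bool:
    """Reject outputs whose type tag and value disagree before materialization."""
    if rec.dtype == "numerical":
        return isinstance(rec.value, (int, float))
    # categorical and text features both carry raw strings here
    return isinstance(rec.value, str)

ok = validate_record(FeatureRecord("shuffle_read_gb", "numerical", 37.4, 0.0, "log"))
bad = validate_record(FeatureRecord("join_type", "categorical", 3, "unknown", "metadata"))
```

Deterministic, typed outputs of this kind are what allow tool results to be transformed into features without further LLM interpretation.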
For robustness, each extractor emits a typed feature schema (name, type, default value, provenance), and the Feature Extraction Agent validates tool outputs against the schema before materialization.

b) Execution policy (safety, determinism, parallelism): All generated extractors run in a sandboxed environment with an allowlist of libraries and tool calls. The code-completion checker blocks extractors with missing imports, undefined variables, or unresolved tool outputs. The data-leakage checker enforces availability: a feature must be computable from information available before a scheduling decision (e.g., job scripts and historical traces), and cannot rely on post-run artifacts. For efficiency, independent tool calls are executed in parallel when dependencies permit. We also bound inference latency by capping per-job sandbox queries and enforcing timeouts.

c) Caching and versioning: To avoid repeated reads for recurring jobs, we cache intermediate tool outputs and materialized feature values. Cache keys are tuples of (job signature, data snapshot identifier, feature version, tool version), enabling safe reuse and incremental retraining by re-materializing only affected features. Each deployed model is packaged with a versioned feature specification and extractor bundle to ensure online inference replays training-time transformations.

d) Feedback signals and drift-triggered updates: LeJOT-AutoML logs prediction residuals, feature health signals (missingness, outliers, schema mismatches), and extraction latency per tool modality. These signals drive periodic refresh (e.g., daily retraining) and drift-triggered refresh when residuals or feature distributions exceed thresholds. During retraining, FEvA summarizes failure modes (unstable features, high-cardinality categoricals, redundancy) and feeds concise guidance to FAA/FExA to refine feature specifications and extraction plans.
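The caching scheme in Section III-D(c) can be sketched as follows (helper names are ours; a minimal sketch, not the production cache):

```python
# Sketch of cache keying over the four-tuple (job signature, data
# snapshot, feature version, tool version): bumping any component
# yields a new key and safely invalidates stale values.
import hashlib

def cache_key(job_signature: str, snapshot_id: str,
              feature_version: str, tool_version: str) -> str:
    raw = "|".join([job_signature, snapshot_id, feature_version, tool_version])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

cache = {}

def materialize(key: str, compute):
    """Recompute a feature value only when the full four-tuple changes."""
    if key not in cache:
        cache[key] = compute()
    return cache[key]

k1 = cache_key("jobA", "snap-42", "fv3", "tv1")
v1 = materialize(k1, lambda: 118.0)  # computed once
v2 = materialize(k1, lambda: -1.0)   # served from cache, compute skipped
k2 = cache_key("jobA", "snap-42", "fv4", "tv1")  # feature version bumped
```

Keying on all four components is what makes incremental retraining safe: only features whose version (or underlying data/tool) changed are re-materialized.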
IV. DESIGN OF CORE MODULE FUNCTIONS

A. Feature Analyzer Agent (FAA)

The FAA determines the candidate feature space and therefore the accuracy ceiling of the predictor. It takes as input: (i) the task objective (execution-time prediction for cost-aware orchestration), (ii) supplementary artifacts (job scripts, configuration, historical logs, cluster and table metadata), (iii) constraints (collection cost, privacy policy, and access scope), and (iv) an output schema. Using this information, FAA performs two functions.

(1) Context recovery via RAG. FAA formulates queries over $\mathcal{K}$ to retrieve Spark SQL and platform knowledge that clarifies which runtime behaviors dominate performance. Retrieved context is then grounded against job artifacts to avoid generic suggestions.

(2) Feature specification synthesis. FAA outputs a list of feature specifications rather than raw names. Each specification records: name, type (numerical/categorical/text-derived), source (log/metadata/code), extraction plan (tool calls and transformations), and expected cost and refresh frequency. This schema supports traceability, caching, and downstream parallel extraction.

B. Feature Extraction Agent (FExA)

The FExA materializes the proposed features via an agent–tool collaborative architecture. The LLM performs planning and program synthesis, while the MCP toolchain executes and verifies retrieval steps through a constrained interface [17]. FExA follows a four-stage extraction pipeline:

• Static extraction: parses job scripts and metadata to obtain invariant features (e.g., operator counts, join patterns, table statistics, partition layouts).

• Runtime materialization: invokes log parsers and the read-only SQL sandbox to collect runtime-derived signals (e.g., stage imbalance, shuffle amplification, pruning effectiveness).
• Normalization and encoding: maps heterogeneous outputs into a unified feature vector through scaling, one-hot encoding, and text vectorization when needed.

• Data-quality checks: flags missing values, outliers, and schema mismatches, then emits repair actions (fallback features, default values, or re-execution with tightened queries).

To meet inference latency goals, FExA executes independent extractors in parallel and uses caching keyed by (job signature, feature version, and data snapshot), which reduces repeated reads across similar recurring jobs.

C. Feature Evaluation Agent (FEvA)

The FEvA performs a multi-level assessment to decide which features enter the deployed model. It aggregates three families of signals:

Feature health. For each feature, FEvA measures coverage (missing rate), stability (variance under similar jobs), and distribution shifts across time windows, identifying brittle or non-stationary features.

Feature utility. FEvA estimates importance and redundancy using model-based attribution (gain/SHAP-style summaries from tree models) and correlation screening, which reduces collinearity and overfitting risks.

End-to-end impact. FEvA evaluates the marginal impact of candidate features through ablations on baseline models and reports deltas in MAE/RMSE and $R^2$. The output is a structured feedback packet that instructs FAA/FExA to refine extraction plans, drop unstable features, or propose additional domain-grounded interactions.

TABLE I: Feature diversity comparison

                      AutoML                                    Manual
Number of features    200+                                      40+
Feature types         log profiling; historical time series;    log profiling
                      driver-node history; node-configuration
                      history

TABLE II: Top-5 feature importance for AutoML vs. Manual

Feature name                               AutoML (%)   Manual (%)
duration_seconds_lag_1                     28.8         –
vcpu_lag_1                                 24.6         –
duration_seconds_shifted_avg_last_3_runs   8.0          –
DBU_lag_1                                  7.8          –
Memory_lag_1                               3.6          –
vcpu_ratio                                 –            16.2
vmemory_ratio                              –            13.4
total_memory_changed                       –            8.0
total_cpu_changed                          –            7.9
worker_flexibility_ratio                   –            4.1

D. Safety gates and model selection

LeJOT-AutoML executes generated code under strict safety gates. The code-completion checker verifies that each extractor program is syntactically complete, imports only approved libraries, and returns values conforming to the feature schema. The data-leakage checker enforces temporal and semantic isolation between features and labels, rejecting extractors that directly access the target (execution time) or indirectly derive it from post-run artifacts.

After FEvA produces evaluation summaries, the model selector searches a bounded candidate set of algorithms and hyperparameters, trains candidates on the versioned feature matrix, and selects the final configuration for deployment. The selected model is packaged together with its feature specification and extractor bundle, ensuring that online inference replays the same transformations used during training.

V. EXPERIMENTS

We evaluate LeJOT-AutoML on enterprise Databricks workloads by comparing AutoML (LLM+MCP automated feature engineering) with Manual feature engineering. The pipeline follows an analysis–act–validate loop: the LLM parses unstructured job artifacts (scripts and logs), the MCP toolchain materializes runtime-derived signals, and safety checks validate extracted features. We use 5-fold cross-validation to estimate generalization and conduct ablation studies to quantify the contributions of different feature sources.
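The safety checks applied in this loop (the two gates of Section IV-D) can be approximated as follows; this is our simplification, using AST parsing for completeness and a forbidden-field screen for leakage, not the production checkers:

```python
# Simplified sketch of the two safety gates.
import ast

# The prediction target must not be readable by an extractor.
# Lagged historical durations (e.g., a _lag_1 feature) remain allowed.
FORBIDDEN_FIELDS = {"execution_time"}

def code_completion_check(src: str) -> bool:
    """Gate 1: the extractor must at least parse as complete Python."""
    try:
        ast.parse(src)
        return True
    except SyntaxError:
        return False

def data_leakage_check(src: str) -> bool:
    """Gate 2: reject extractors that reference label-bearing fields."""
    tree = ast.parse(src)
    names = {n.attr for n in ast.walk(tree) if isinstance(n, ast.Attribute)}
    names |= {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    return not (names & FORBIDDEN_FIELDS)

ok_src = "def f(job):\n    return job.shuffle_read_gb / max(job.input_gb, 1.0)\n"
leaky_src = "def f(job):\n    return job.execution_time * 0.9\n"
incomplete_src = "def f(job):\n    return job.shuffle_read_gb /"
```

The real data-leakage gate also reasons semantically about post-run artifacts (hence its large runtime share in Table V); the syntactic screen above only illustrates the hard-constraint role of Eq. (6).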
We report feature diversity, prediction metrics (MAE, MAPE, $R^2$), per-module runtime, and end-to-end cost savings in LeJOT. Experiments were conducted on a single machine with an Intel Core Ultra 7 165U CPU, 32 GB RAM, and Windows 11 Professional. We used Qwen-235B [24] for agent reasoning and feature/code synthesis, and trained the execution-time predictor using XGBoost [4].

TABLE III: Comparison of AutoML and Manual feature extraction approaches

Metric   AutoML                     Manual
R^2      0.81                       0.91
MAPE     20.13%                     19.49%
MAE      123.29                     78.94
Time     20–30 min (3 iterations)   1 month

TABLE IV: Inference results for a representative job under two feature engineering methods

Compute machine          AutoML (s)   Manual (s)
Standard F4s             167          206
Standard F16s            154          82
Standard E16 v4          162          78
Standard F16 v4 Photon   147          52

A. Manual vs. AutoML

AutoML synthesizes more than 200 features spanning log profiling, time-series statistics, and driver-node history, whereas manual engineering yields around 40 features, largely derived from node-configuration history (Table I). The two pipelines also surface markedly different top-ranked features (Table II), indicating that AutoML emphasizes temporal and workload-dependent signals beyond the static resource ratios that dominate many hand-crafted designs.

A key source of this difference lies in the availability of control-plane context. Manual features leverage historical cluster sizing decisions, instance family transitions, and configuration–price mappings. In our current deployment, the MCP toolset provides configuration snapshots but does not expose configuration-change trajectories or pricing context with comparable fidelity. Consequently, FAA focuses on runtime-derived evidence grounded in logs and plan inspection, and it under-represents certain resource-history signals that manual engineering explicitly encodes.
Extending the tool interface with configuration-change logs and instance-specification/price knowledge is a clear path to improving resource awareness and narrowing this gap.

From an engineering-effort perspective, AutoML completes three end-to-end iterations within 20–30 minutes, whereas manual feature design typically requires about one month (Table III). Manual features achieve higher predictive accuracy ($R^2$ = 0.91 vs. 0.81), but AutoML delivers competitive performance at a small fraction of the development cost. A representative case study (Table IV) further highlights the trade-off: when instance types are upgraded and Photon is enabled, the manual model's predictions shift substantially, while the AutoML model exhibits a smaller response. This pattern suggests that AutoML currently does not fully capture the direct effect of resource upgrades, consistent with its limited access to configuration-history and pricing signals.

TABLE V: Execution time (in seconds) for each agent node across five experimental runs

Agent                     Run 1    Run 2    Run 3    Run 4    Run 5
Feature Analyzer          260.60   233.27   263.29   223.47   298.12
Feature Extractor         107.80   92.71    105.25   115.79   137.87
Code Completion Checker   <0.01    <0.01    <0.01    <0.01    <0.01
Data Leakage Checker      199.89   172.21   209.45   210.34   325.60
Evaluation Agent          33.39    38.85    43.62    49.66    38.80
Model Selector            27.12    20.87    34.56    24.93    32.43

TABLE VI: Metrics of the baseline XGBoost model over three iterations of evaluation

Metric     First Iteration   Second Iteration   Third Iteration
MAE        247.95            172.09             145.64
MAPE (%)   36.28             25.18              21.31
R^2        0.61              0.80               0.81

TABLE VII: Cost saving rate comparison between solutions

Solution    Throughput (k/s)   Initial Cost ($k)   Final Cost ($k)   Cost Saving Rate
AutoML      0.08               52.6                42.6              19.01%
Manual ML   0.12               52.6                37.9              27.94%
B. Additional results

Module-level runtime (Table V) shows that FAA and the data-leakage checker dominate end-to-end latency, reflecting the cost of LLM reasoning and semantic verification. Over three FEvA iterations, MAE decreases from 247.95 to 145.64 and $R^2$ improves from 0.61 to 0.81 (Table VI), validating the effectiveness of the feedback loop. Integrated into LeJOT, AutoML achieves 19.01% cost savings (Table VII), demonstrating practical value even with a modest accuracy gap.

VI. CONCLUSION AND DISCUSSION

We presented LeJOT-AutoML, an LLM-driven framework for automated feature engineering in Databricks job execution-time prediction. By integrating LLM agents with an MCP toolchain, the system expands the feature space to include hard-to-observe runtime signals and compresses the feature-engineering cycle from months to minutes. Although manual feature engineering still delivers better generalization across hardware configurations ($R^2$ = 0.91 vs. 0.81), LeJOT-AutoML provides a scalable, low-maintenance alternative that enables continuous learning and achieves 19.01% cost savings in LeJOT. Future work will focus on improving resource awareness by incorporating richer configuration- and runtime-level indicators of execution and data-movement behavior.

ACKNOWLEDGMENTS

The research is partially supported by Innovation Program for Quantum Science and Technology 2021ZD0302900 and China National Natural Science Foundation with Nos. 62132018 and 62231015, "Pioneer" and "Leading Goose" R&D Program of Zhejiang, 2023C01029, and 2023C01143, Anhui Provincial Natural Science Foundation under Grant 2208085MF172, and the USTC Kunpeng-Ascend Scientific and Educational Innovation Excellence Center.

REFERENCES

[1] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford et al., "GPT-4o system card," arXiv preprint arXiv:2410.21276, 2024.
[2] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, "ReAct: Synergizing reasoning and acting in language models," arXiv preprint arXiv:2210.03629, 2022.
[3] B. Wang, X. Du, Y. Bai et al., "A survey on large language models as agents," arXiv preprint arXiv:2309.07864, 2023.
[4] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2016.
[5] J. Li, Q. Liu, K. Zhang, and M. Wang, "Agent-based automated machine learning for enterprise applications," IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 8, pp. 4567–4580, 2023.
[6] M. Feurer, K. Eggensperger, S. Falkner, and F. Hutter, "Auto-sklearn 2.0: Hands-free AutoML via meta-learning," Journal of Machine Learning Research, vol. 23, no. 261, pp. 1–61, 2022.
[7] L. Ma, Y.-X. Hu, Y. Wang, Y. Zhao, Y. Ren, J.-X. Liao, F. Wu, and X.-Y. Li, "LeJOT: An intelligent job cost orchestration solution for Databricks platform," in 2025 11th International Conference on Big Data Computing and Communications (BigCom), 2025.
[8] A. Kipf, T. Kipf, B. Radke et al., "Learned cardinalities: Estimating correlated joins with deep learning," in Conference on Innovative Data Systems Research (CIDR), 2019.
[9] J. Park, N. Polyzotis, S. Roy et al., "Learning-based query cardinality estimation with deep neural networks (NARU)," in Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 2020.
[10] S. Kim, J. Park, M. Lee, and Y. Choi, "Dynamic feature extraction for real-time machine learning inference," IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 12, pp. 3456–3468, 2023.
[11] S. Venkataraman, Z. Yang, D. Liu et al., "Ernest: Efficient performance prediction for large-scale advanced analytics," in 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI), pp. 363–378, 2016.
[12] E. LeDell and S. Poirier, "H2O AutoML: Scalable automatic machine learning," arXiv preprint arXiv:2004.11731, 2020.
[13] Y. Peng, H. Raghavan, S. Venkataraman et al., "SLAQ: Quality-driven scheduling for distributed machine learning training," in Proceedings of the ACM Symposium on Cloud Computing (SoCC), 2018.
[14] X. He, K. Zhao, and X. Chu, "Automated machine learning: Methods, systems, challenges," Automated Machine Learning: Methods, Systems, Challenges, pp. 3–19, 2019.
[15] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter, "Auto-sklearn: Efficient and robust automated machine learning," Automated Machine Learning: Methods, Systems, Challenges, pp. 113–134, 2019.
[16] M. Liu, S. Chen, P. Wang, and L. Zhang, "RAG-enhanced feature engineering for machine learning pipelines," Proceedings of the ACM on Management of Data, vol. 2, no. 1, pp. 1–25, 2024.
[17] Anthropic, "Model Context Protocol (MCP) specification," https://modelcontextprotocol.io, 2024, accessed 2025-10-09.
[18] T. Schick, J. Dwivedi-Yu, R. Raileanu et al., "Toolformer: Language models can teach themselves to use tools," arXiv preprint arXiv:2302.04761, 2023.
[19] L. Zhang, M. Wang, H. Chen, and W. Liu, "Job execution time prediction in distributed computing systems: A survey," ACM Computing Surveys, vol. 56, no. 3, pp. 1–35, 2023.
[20] K. Wang, M. M. H. Khan, N. Nguyen, and S. Gokhale, "Spark performance prediction using machine learning," Cluster Computing, vol. 24, no. 3, pp. 1921–1935, 2021.
[21] J. Li, M. Bäckström, B. van Stein, A. Biedenkapp, F. Hutter, and M. Lindauer, "Large language models for automated data science: Introducing CAAFE for context-aware automated feature engineering," arXiv preprint arXiv:2309.03428, 2023.
[22] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., "Retrieval-augmented generation for knowledge-intensive NLP tasks," Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020.
[23] G. Ke, Q. Meng, T. Finley et al., "LightGBM: A highly efficient gradient boosting decision tree," in Advances in Neural Information Processing Systems (NeurIPS), 2017.
[24] A. Yang et al., "Qwen3 technical report," arXiv preprint arXiv:2505.09388, 2025.
