LeJOT: An Intelligent Job Cost Orchestration Solution for Databricks Platform

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

With the rapid advancements in big data technologies, the Databricks platform has become a cornerstone for enterprises and research institutions, offering high computational efficiency and a robust ecosystem. However, managing the escalating operational costs associated with job execution remains a critical challenge. Existing solutions rely on static configurations or reactive adjustments, which fail to adapt to the dynamic nature of workloads. To address this, we introduce LeJOT, an intelligent job cost orchestration framework that leverages machine learning for execution time prediction and a solver-based optimization model for real-time resource allocation. Unlike conventional scheduling techniques, LeJOT proactively predicts workload demands, dynamically allocates computing resources, and minimizes costs while ensuring performance requirements are met. Experimental results on real-world Databricks workloads demonstrate that LeJOT achieves an average 20% reduction in cloud computing costs within a minute-level scheduling timeframe, outperforming traditional static allocation strategies. Our approach provides a scalable and adaptive solution for cost-efficient job scheduling in Data Lakehouse environments.


💡 Research Summary

The paper introduces LeJOT, an intelligent job‑cost orchestration framework designed specifically for the Databricks data‑lakehouse platform. The authors identify that existing cost‑control practices on Databricks rely on static resource configurations or reactive scaling, which are ill‑suited to the highly dynamic and heterogeneous workloads typical of modern analytics and machine‑learning pipelines. To overcome these limitations, LeJOT combines two complementary components: (1) a lightweight machine‑learning (ML) model that predicts the execution time of each workflow under a variety of resource configurations, and (2) a mixed‑integer linear programming (MILP) optimizer that, given the predicted runtimes and user‑specified temporal constraints (earliest start, latest finish, and precedence relationships), selects the resource allocation that minimizes total cloud cost while respecting all constraints.
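As a rough illustration of this predict-then-optimize flow, the sketch below picks, for each workflow, the cheapest configuration whose predicted runtime meets a deadline. All names are assumptions, and the greedy selection is a simplified stand-in for the paper's MILP solver, which additionally handles precedence and tiered pricing:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    workflow: str
    device: str
    config: str
    predicted_minutes: float  # runtime estimate from the ML model
    cost_per_minute: float

def orchestrate(candidates, deadline_minutes):
    """Greedy stand-in for the MILP step: per workflow, keep the
    cheapest (device, config) whose predicted runtime fits the deadline."""
    best = {}
    for c in candidates:
        if c.predicted_minutes > deadline_minutes:
            continue  # violates the latest-finish constraint
        cost = c.predicted_minutes * c.cost_per_minute
        if c.workflow not in best or cost < best[c.workflow][1]:
            best[c.workflow] = (c, cost)
    return {w: c for w, (c, _) in best.items()}
```

With a tight deadline the selection shifts to the faster, more expensive configuration, which is exactly the cost/performance trade-off the optimizer navigates.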

ML Prediction Engine
The prediction engine extracts a rich set of features from historical Databricks logs and job metadata: CPU core count, memory size, degree of parallelism, number of subtasks, number of tables accessed, code length and content, data volume on disk, and task type (I/O-intensive vs. compute-intensive). After an extensive model-selection phase covering linear regression, Lasso, Elastic Net, and other regressors, Ridge Regression is chosen for its balance of regularization and interpretability. The resulting model achieves a mean absolute error below 5% on held-out data and can infer execution times for any (device type, configuration) pair in milliseconds, making it suitable for real-time scheduling loops.
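For illustration, ridge regression has a closed-form solution that keeps both training and inference cheap. The sketch below fits it on synthetic data; the feature names and numbers are assumptions for demonstration, not values from the paper:

```python
import numpy as np

# Hypothetical per-job feature vector (names assumed, mirroring the paper's list):
# [cpu_cores, memory_gb, parallelism, n_subtasks, n_tables, code_len, data_gb, is_io]
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 8))            # synthetic historical jobs
true_w = np.array([3.0, 2.0, -1.5, 0.5, 0.2, 0.1, 4.0, 1.0])
y = X @ true_w + rng.normal(0, 0.1, 200)        # synthetic runtimes (minutes)

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X'X + lam*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w = ridge_fit(X, y)

# Inference for one candidate (device type, configuration) is a dot product,
# which is why per-pair prediction takes only milliseconds.
candidate = np.array([[0.5, 0.5, 0.25, 0.1, 0.2, 0.3, 0.8, 1.0]])
predicted_runtime = candidate @ w
```

The regularization strength `lam` trades a small amount of bias for stability on correlated features, matching the interpretability rationale given above.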

Cost‑Minimization Formulation
The optimization problem is formally defined over a set of workflows (W), device types (D), and configuration sets (K_d). Binary decision variables (x_{i,d,k}) indicate whether workflow (i) runs on device type (d) with configuration (k), and (t_{i,d,k}) denotes the runtime predicted by the ML model for that choice. The objective function captures a tiered pricing scheme: a base hourly rate (c_{0d}) applies up to a pre-purchased usage threshold (A_d); beyond that, an incremental rate (c_{1d}) is charged. Writing (U_d = \sum_{i \in W} \sum_{k \in K_d} t_{i,d,k}\, x_{i,d,k}) for the total usage on device type (d), the total cost is therefore

\[
C = \sum_{d \in D} \Big( c_{0d}\,\min(U_d, A_d) + c_{1d}\,\max(U_d - A_d,\ 0) \Big).
\]
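To make the tiered objective concrete, the brute-force sketch below evaluates it over a toy instance. All prices, thresholds, and runtimes are hypothetical, and exhaustive enumeration stands in for the paper's MILP solver, which scales to real workloads:

```python
from itertools import product

# Tiny illustrative instance (all numbers hypothetical).
# options[i]: candidate (device, predicted hours) pairs for workflow i.
options = [
    [("d1", 2.0), ("d2", 1.0)],   # workflow 0
    [("d1", 3.0), ("d2", 1.5)],   # workflow 1
]
c0 = {"d1": 1.0, "d2": 3.0}       # base hourly rate up to the allowance A_d
c1 = {"d1": 2.0, "d2": 6.0}       # incremental rate beyond A_d
A  = {"d1": 4.0, "d2": 1.0}       # pre-purchased usage threshold per device

def tiered_cost(assignment):
    """Tiered objective: c0 on usage up to A_d, c1 on the overage."""
    usage = {}
    for device, hours in assignment:
        usage[device] = usage.get(device, 0.0) + hours
    return sum(c0[d] * min(u, A[d]) + c1[d] * max(u - A[d], 0.0)
               for d, u in usage.items())

# Enumerate every joint choice of x_{i,d,k} and keep the cheapest.
best = min(product(*options), key=tiered_cost)
```

Running d1 for both workflows (5 hours) pays the overage rate on only 1 hour, which already undercuts splitting work onto the pricier d2 tier.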

