Learned Query Optimizer in Alibaba MaxCompute: Challenges, Analysis, and Solutions

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Existing learned query optimizers remain ill-suited to modern distributed, multi-tenant data warehouses due to idealized modeling assumptions and design choices. Using Alibaba’s MaxCompute as a representative system, we surface four fundamental, system-agnostic challenges for any deployable learned query optimizer: 1) highly dynamic execution environments that induce large variance in plan costs; 2) potential absence of input statistics needed for cost estimation; 3) infeasibility of conventional model refinement; and 4) uncertain benefits across different workloads. These challenges expose a deep mismatch between theoretical advances and production realities and demand a principled, deployment-first redesign of learned optimizers. To bridge this gap, we present LOAM, a one-stop learned query optimization framework for MaxCompute. Its design principles and techniques generalize and are readily adaptable to similar systems. Architecturally, LOAM introduces a statistics-free plan encoding that leverages operator semantics and historical executions to infer details about data distributions and explicitly encodes the execution environments of training queries to learn their impacts on plan costs. For online queries whose environments are unknown at prediction time, LOAM provides a theoretical bound on the achievable performance and a practical strategy to smooth the environmental impacts on cost estimations. For system operation, LOAM integrates domain adaptation techniques into training to generalize effectively to online query plans without requiring conventional refinement. Additionally, LOAM includes a lightweight project selector to prioritize high-benefit deployment projects. LOAM has achieved up to 30% CPU cost savings over MaxCompute’s native query optimizer on production workloads, which translates to substantial real-world resource savings.


💡 Research Summary

The paper investigates why existing learned query optimizers, which have shown promise in research settings, fail to deliver in modern distributed, multi‑tenant data warehouses such as Alibaba’s MaxCompute. By closely examining MaxCompute’s architecture and operational realities, the authors identify four system‑agnostic challenges: (C1) highly dynamic execution environments that cause large variance in plan costs; (C2) the frequent absence or staleness of input statistics needed for accurate cost estimation; (C3) the infeasibility of conventional model refinement because executing additional candidate plans is prohibitively expensive and risky; and (C4) the heterogeneity of over 100,000 user projects, making universal deployment inefficient and necessitating automatic project selection.

To bridge this gap, the authors propose LOAM (Learned Optimizer for Alibaba MaxCompute), a one‑stop framework that directly addresses each challenge with a set of principled design choices. First, LOAM’s cost model is “environment‑aware”: during training it records concrete runtime environment features (e.g., allocated CPU cores, cluster load, memory pressure) alongside each plan‑cost pair, allowing the model to learn how environment fluctuations affect cost. Because online queries lack this information at optimization time, the authors derive a theoretical upper bound on achievable performance and adopt a practical strategy of predicting costs under a representative average environment, smoothing out the unknown variance.
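The environment-aware idea above can be sketched in a few lines: train on plan features concatenated with the observed runtime environment, then, since the environment is unknown for an online query, predict under a representative average environment. This is a minimal illustrative sketch; the function names, feature layout, and the trivial linear "model" are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

# Each training sample pairs plan features with the runtime environment
# (e.g. allocated cores, cluster load) recorded during that execution.
def make_training_sample(plan_feats, env_feats, observed_cost):
    return np.concatenate([plan_feats, env_feats]), observed_cost

# At prediction time the environment is unknown, so we substitute a
# representative average environment computed over historical runs,
# smoothing out the unknown variance.
def predict_with_avg_env(model, plan_feats, historical_envs):
    avg_env = np.mean(historical_envs, axis=0)
    return model(np.concatenate([plan_feats, avg_env]))

# Toy example with a stand-in linear "model" that just sums features.
model = lambda x: float(x.sum())
envs = np.array([[2.0, 0.5], [4.0, 0.7], [6.0, 0.9]])  # cores, load
plan = np.array([1.0, 3.0])
print(predict_with_avg_env(model, plan, envs))  # 1 + 3 + 4.0 + 0.7 = 8.7
```

In a real deployment the averaging could be per-cluster or per-time-window rather than global, but the mechanism is the same: one forward pass under a surrogate environment.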

Second, LOAM eliminates the reliance on fresh statistics by introducing a “statistics‑free plan encoding”. Instead of requiring histograms or NDV values, the encoder extracts operator‑level semantics (operator type, join key cardinality, filter predicates) and leverages historical execution logs to infer coarse distributional cues. This enables cost prediction even when statistics are missing or outdated.
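A statistics-free operator encoding of this flavor might look like the following sketch: operator semantics become a one-hot feature, and historical execution logs supply a coarse distributional cue in place of histograms or NDVs. The operator taxonomy, signature format, and "mean observed output rows" cue are illustrative assumptions, not the paper's actual feature set.

```python
# Illustrative operator vocabulary; a production encoder would cover
# the system's full operator set.
OP_TYPES = ["scan", "filter", "join", "agg"]

def encode_operator(op_type, signature, history):
    # One-hot encoding of the operator's semantics.
    onehot = [1.0 if t == op_type else 0.0 for t in OP_TYPES]
    # Coarse distributional cue mined from past executions: the mean
    # output row count observed for this operator signature
    # (0.0 when the signature has never been seen before).
    runs = history.get(signature, [])
    avg_rows = sum(runs) / len(runs) if runs else 0.0
    return onehot + [avg_rows]

# Historical execution log keyed by a hypothetical operator signature.
history = {"join(t1.k=t2.k)": [1200.0, 800.0]}
vec = encode_operator("join", "join(t1.k=t2.k)", history)
print(vec)  # [0.0, 0.0, 1.0, 0.0, 1000.0]
```

The key property is that the encoding degrades gracefully: when statistics are missing or stale, the historical cue (or its absence) stands in, so cost prediction never blocks on a statistics-collection job.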

Third, to avoid costly online re‑training, LOAM performs “preemptive generalization”. It trains offline on the massive historical query repository that MaxCompute already maintains. Since the distribution of training plans can differ markedly from the candidate plans generated for live queries, LOAM incorporates domain adaptation techniques—specifically, adversarial feature alignment—to learn domain‑invariant intermediate representations. Consequently, the model generalizes well to unseen online plans without executing additional candidates, eliminating the prohibitive cost and risk associated with conventional refinement.
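Adversarial feature alignment is typically implemented with a gradient reversal layer (GRL), as in DANN-style domain adaptation: the forward pass is the identity, but the backward pass negates the gradient flowing from a domain classifier, pushing the feature extractor toward domain-invariant representations. The minimal sketch below shows only the GRL mechanics; it is a conceptual illustration under that DANN-style assumption, not LOAM's training code.

```python
import numpy as np

class GradientReversal:
    """Forward: identity. Backward: negate (and scale) the incoming
    gradient, so minimizing the domain classifier's loss makes the
    upstream feature extractor *maximize* it, i.e. learn features the
    domain classifier cannot separate."""

    def __init__(self, lam=1.0):
        self.lam = lam  # trade-off weight for the adversarial term

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_from_domain_head):
        return -self.lam * grad_from_domain_head  # reversed gradient

grl = GradientReversal(lam=0.5)
feats = np.array([1.0, -2.0])
assert np.array_equal(grl.forward(feats), feats)  # identity on forward
print(grl.backward(np.array([0.2, -0.4])))        # reversed: [-0.1, 0.2]
```

Trained this way on historical (executed) plans versus candidate (optimizer-generated) plans, the shared representation carries no signal about which domain a plan came from, which is what lets the cost head transfer to unseen online plans without executing extra candidates.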

Fourth, LOAM adds an “automatic project selector”. A two‑stage pipeline first filters out projects unsuitable for learning (e.g., those lacking any statistics or exhibiting extreme cost volatility) using rule‑based heuristics. The remaining projects are then ranked by a learned ranker that predicts potential CPU‑cost savings. This mechanism focuses deployment effort on the small subset of projects that can reap substantial benefits.
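The two-stage pipeline reduces to: filter by rules, then rank by predicted saving. The sketch below uses hypothetical project fields (`has_stats`, `cost_volatility`) and a stand-in saving predictor; the actual rules and ranker features in LOAM are not specified here.

```python
def select_projects(projects, predict_saving, top_k=2):
    # Stage 1: rule-based filtering of projects unsuitable for learning,
    # e.g. no statistics at all, or extreme cost volatility.
    eligible = [p for p in projects
                if p["has_stats"] and p["cost_volatility"] < 0.5]
    # Stage 2: rank survivors by predicted CPU-cost saving, keep top-k.
    return sorted(eligible, key=predict_saving, reverse=True)[:top_k]

projects = [
    {"name": "a", "has_stats": True,  "cost_volatility": 0.1, "size": 9},
    {"name": "b", "has_stats": False, "cost_volatility": 0.1, "size": 8},
    {"name": "c", "has_stats": True,  "cost_volatility": 0.9, "size": 7},
    {"name": "d", "has_stats": True,  "cost_volatility": 0.2, "size": 5},
]
# Stand-in predictor: pretend larger projects save more.
picked = select_projects(projects, predict_saving=lambda p: p["size"])
print([p["name"] for p in picked])  # ['a', 'd']
```

The cheap rule stage keeps the learned ranker's candidate set small, which matters when the fleet spans over 100,000 projects.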

The implementation integrates LOAM with MaxCompute’s native optimizer in a steering fashion: the native optimizer explores a diverse set of candidate plans using its existing transformation rules; LOAM predicts the cost of each candidate using its environment‑aware, statistics‑free model; and the plan with the lowest predicted cost is selected for execution.
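The steering integration described above is, at its core, an argmin over candidate plans under the learned cost model. The sketch below uses placeholder plan names and a dictionary of predicted costs as assumptions; in the real system the candidates come from the native optimizer's transformation rules and the costs from the environment-aware model.

```python
def steer(candidate_plans, learned_cost):
    """Pick the candidate with the lowest predicted cost."""
    return min(candidate_plans, key=learned_cost)

# Hypothetical candidates from the native optimizer and their
# model-predicted costs (illustrative numbers).
candidates = ["hash_join_plan", "merge_join_plan", "broadcast_plan"]
predicted = {"hash_join_plan": 12.0,
             "merge_join_plan": 9.5,
             "broadcast_plan": 30.0}
best = steer(candidates, learned_cost=predicted.get)
print(best)  # merge_join_plan
```

A nice property of steering is its safety profile: the learned model only chooses among plans the native optimizer already considers valid, so a bad prediction degrades to a suboptimal-but-correct plan rather than an invalid one.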

Empirical evaluation on real production workloads—queries processing tens of billions of rows, featuring complex multi‑way joins and aggregations—demonstrates that LOAM achieves up to 30% reduction in CPU cost compared with the native optimizer, while also reducing cost variance. Approximately 4% of MaxCompute projects are identified as high‑impact, each capable of achieving ≥10% cost savings; this translates to thousands of machines’ worth of resource savings at Alibaba scale. The project‑selection component attains >95% precision in identifying beneficial projects, and the domain‑adapted model reduces prediction error by 15% relative to a baseline without adaptation.

In summary, LOAM provides a deployment‑first redesign of learned query optimization that reconciles the theoretical advances of machine‑learning‑based cost models with the practical constraints of large‑scale, multi‑tenant cloud data warehouses. The four design principles—environment‑aware modeling, statistics‑free encoding, preemptive domain adaptation, and automated project selection—are broadly applicable to other cloud data platforms such as Azure Synapse, Google BigQuery, and Snowflake, paving the way for AI‑enhanced query optimization in production environments. Future work includes richer environment modeling, real‑time statistics inference, and multi‑objective optimization (latency, energy, monetary cost).

