The Stretto Execution Engine for LLM-Augmented Data Systems
LLM-augmented data systems enable semantic querying over structured and unstructured data, but executing queries with LLM-powered operators introduces a fundamental runtime-accuracy trade-off. In this paper, we present Stretto, a new execution engine that provides end-to-end query guarantees while efficiently navigating this trade-off in a holistic manner. To do so, Stretto formulates query planning as a constrained optimization problem and uses a gradient-based optimizer to jointly select operator implementations and allocate error budgets across pipelines. Moreover, to enable fine-grained execution choices, Stretto introduces a novel use of KV-caching that realizes a spectrum of physical operators, transforming a sparse design space into a dense continuum of runtime-accuracy trade-offs. Experiments show that Stretto outperforms state-of-the-art systems while consistently meeting quality guarantees.
💡 Research Summary
The paper introduces Stretto, a novel execution engine designed for large‑language‑model‑augmented (LLM‑augmented) data systems. Existing LLM‑augmented databases such as Lotus, Palimpzest, and Abacus either provide only local accuracy guarantees for each semantic operator or rely on coarse‑grained model choices (small vs. large). Consequently, they cannot efficiently allocate error budgets across a query pipeline, leading to sub‑optimal performance and lack of global quality guarantees.
Stretto addresses these shortcomings with two key innovations. First, it formulates query planning as a constrained optimization problem that simultaneously minimizes total execution cost while satisfying user‑specified end‑to‑end precision and recall constraints. The optimizer treats model selection, KV‑cache compression level, and continuous parameters (e.g., similarity thresholds) as variables in a continuous relaxation of the plan space. Using gradient‑based methods (e.g., Adam), it iteratively adjusts these variables, dynamically reallocating error budgets from “easy” operators (which can meet targets with cheap settings) to “hard” operators that require more expensive models or lower compression. This global view prevents over‑provisioning and ensures that the final physical plan meets the overall quality target.
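The budget-reallocation idea can be sketched with a toy projected-gradient loop. This is not the paper's optimizer: it uses plain gradient descent rather than Adam, and a made-up per-operator cost model `cost_i = c_i / eps_i` (cheaper settings tolerate more error). The projection step keeps the per-operator error budgets summing to the end-to-end target, so "hard" operators (large cost coefficient) automatically draw budget away from "easy" ones:

```python
def allocate_budgets(costs, total_err=0.1, lr=1e-6, steps=50000):
    """Hypothetical sketch: split an end-to-end error budget across operators.

    costs[i] is a made-up cost coefficient; operator i's cost is costs[i]/eps[i],
    so expensive ("hard") operators benefit most from extra error budget.
    """
    n = len(costs)
    eps = [total_err / n] * n  # start from a uniform split
    for _ in range(steps):
        # d(cost_i)/d(eps_i) = -c_i / eps_i^2
        grads = [-c / e ** 2 for c, e in zip(costs, eps)]
        mean_g = sum(grads) / n
        # projected gradient step: subtracting the mean gradient keeps
        # sum(eps) == total_err, i.e. the end-to-end constraint holds
        eps = [max(e - lr * (g - mean_g), 1e-5) for e, g in zip(eps, grads)]
    return eps

# The "hard" operator (cost coefficient 10.0) should end up with a larger
# slice of the error budget than the "easy" one (1.0).
budgets = allocate_budgets([1.0, 10.0])
```

Under this toy cost model the optimum is `eps_i` proportional to `sqrt(c_i)`, which the loop recovers; the paper's optimizer additionally handles discrete choices (model, compression level) via a continuous relaxation, which this sketch omits.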
Second, Stretto enriches the physical design space by exposing a family of KV‑cache‑enabled operators. During an offline preprocessing phase, the system materializes KV‑caches for all multimodal data items using a set of LLMs (e.g., Llama‑8B, Llama‑70B). It then compresses these caches using Expected Attention Press, producing multiple compression ratios (e.g., 0.5, 0.7, 0.9). At query time, operators can reuse the pre‑computed caches, skipping the costly forward pass of the transformer. Higher compression reduces memory footprint and enables larger batch sizes, yielding faster execution at the cost of a modest accuracy drop. By combining model size, cache compression, and optional non‑LLM alternatives (embedding‑based filters), Stretto creates a dense ladder of cost‑accuracy trade‑offs rather than the sparse step function of prior systems.
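The resulting "dense ladder" can be pictured by enumerating every (model, compression) configuration and keeping only the Pareto-optimal ones in cost and accuracy. The configuration names below mirror the paper's examples, but every cost and accuracy number is invented for illustration:

```python
CONFIGS = [
    # (name, relative cost, expected accuracy) -- hypothetical values only
    ("llama-8b/comp-0.9", 1.0, 0.78),
    ("llama-8b/comp-0.7", 1.4, 0.82),
    ("llama-8b/comp-0.5", 1.9, 0.84),
    ("llama-70b/comp-0.9", 4.0, 0.80),  # dominated: costlier yet less accurate
    ("llama-70b/comp-0.7", 5.5, 0.90),
    ("llama-70b/comp-0.5", 7.5, 0.93),
]

def pareto_ladder(configs):
    """Sort configs by cost and drop any that are dominated, i.e. some
    cheaper config already achieves at least the same accuracy."""
    ladder, best_acc = [], 0.0
    for name, cost, acc in sorted(configs, key=lambda c: c[1]):
        if acc > best_acc:  # strictly improves accuracy over cheaper options
            ladder.append((name, cost, acc))
            best_acc = acc
    return ladder

ladder = pareto_ladder(CONFIGS)
```

With several models and several compression ratios per model, the surviving ladder offers many closely spaced cost-accuracy operating points, which is what gives the optimizer a near-continuous space to work with.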
Stretto also supports multi‑stage operator cascades. A logical semantic operator can be implemented as a sequence of physical operators ordered from cheap to expensive. Each stage returns one of three outcomes: accept, reject, or unsure. Only “unsure” tuples are forwarded to the next, more accurate stage. This cascade mechanism dramatically reduces the number of high‑cost LLM invocations, especially when early stages can prune large portions of the data.
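A minimal sketch of this accept/reject/unsure control flow is shown below. The stages, thresholds, and scoring functions are placeholders, not the paper's implementations; the point is only that each stage decides most tuples and forwards the "unsure" remainder to the next, more expensive stage:

```python
def run_cascade(tuples, stages):
    """Each stage is (score_fn, low, high): score < low rejects the tuple,
    score > high accepts it, and anything in between is forwarded as unsure."""
    accepted, pending = [], list(tuples)
    for score_fn, low, high in stages:
        unsure = []
        for t in pending:
            s = score_fn(t)
            if s > high:
                accepted.append(t)       # confident accept, stop here
            elif s >= low:
                unsure.append(t)         # forward to the next stage
            # s < low: confident reject, tuple is dropped
        pending = unsure
    return accepted  # any tuple still unsure after the last stage is dropped

# Toy stages: a cheap embedding-style score prunes clear cases, then a
# decisive "expensive" check (no unsure band) resolves the rest.
cheap = (lambda t: t / 10.0, 0.2, 0.8)
costly = (lambda t: 1.0 if t >= 6 else 0.0, 0.5, 0.5)
result = run_cascade(range(10), [cheap, costly])
```

In this toy run the cheap stage already settles three of the ten tuples, so the expensive check runs on only seven, which is the effect the paper reports at scale (over 60 % fewer tuples reaching the most expensive LLM).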
The authors evaluate Stretto on a broad benchmark suite (SemBench and several domain‑specific datasets such as medical reports and financial filings) covering filters, maps, joins, and aggregations. Compared against state‑of‑the‑art baselines, Stretto achieves an average 42 % reduction in query latency while consistently meeting global precision (≥ 0.7) and recall (≥ 0.9) targets. KV‑cache reuse cuts memory usage by over 30 % and allows larger batch processing, improving GPU utilization by roughly 1.5×. In cascade configurations, the proportion of tuples processed by the most expensive LLM drops by more than 60 %.
In summary, Stretto contributes: (1) a holistic, gradient‑based optimizer that enforces end‑to‑end quality constraints without exhaustive enumeration; (2) a novel KV‑cache‑enabled operator layer that densifies the cost‑accuracy design space; and (3) a systematic empirical validation showing substantial speed‑ups and reliable quality guarantees. The work demonstrates that careful integration of LLM inference mechanics (KV‑caching, compression) with global query optimization can make LLM‑augmented databases practical for real‑world, latency‑sensitive applications.