LLM-Driven Intent-Based Privacy-Aware Orchestration Across the Cloud-Edge Continuum
💡 Research Summary
The paper addresses the growing need for privacy‑aware orchestration across the cloud‑edge continuum, where sensitive data must remain within regulated jurisdictions while meeting latency requirements. Traditional orchestration stacks such as Kubernetes for compute and ONOS for software‑defined networking expose rich policy knobs, but translating high‑level privacy goals into low‑level configuration rules still demands expert knowledge. To bridge this gap, the authors propose an intent‑driven framework that leverages a large language model (LLM), specifically GPT‑4o, as a natural‑language interface for privacy intents.
The system operates in four stages. First, a user submits a privacy intent in plain English (e.g., “All personal data must stay within the European Union and must not traverse untrusted switches”). Second, a carefully crafted prompt guides GPT‑4o to parse the intent, extracting core constraints such as required geographic region, trust level, and routing restrictions. Third, the extracted constraints are mapped to Kubernetes node‑selector specifications (e.g., region=EU, trusted=yes) and to ONOS flow‑rule templates that enforce the desired network paths (e.g., avoid switches lacking a trusted label). Fourth, the generated policies are applied to the Kubernetes control plane and the ONOS controller, causing pods to be scheduled only on compliant nodes and traffic to be steered along compliant routes.
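The constraint-to-policy mapping in stages two and three can be sketched in a few lines of Python. This is an illustrative reconstruction, not the paper's implementation: the `compile_intent` function, the constraint keys, and the flow-rule fields are assumptions chosen to mirror the examples above (`region=EU`, `trusted=yes`); a real deployment would emit full Kubernetes manifests and ONOS REST payloads.

```python
import json

def compile_intent(constraints: dict) -> dict:
    """Map privacy constraints extracted by the LLM to a Kubernetes
    nodeSelector and a simplified ONOS-style flow-rule template.
    All field names here are illustrative, not from the paper."""
    node_selector = {}
    if "region" in constraints:
        node_selector["region"] = constraints["region"]
    if constraints.get("trusted"):
        node_selector["trusted"] = "yes"

    pod_spec = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "sensitive-workload"},
        "spec": {"nodeSelector": node_selector},
    }
    # Simplified flow-rule intent: only forward over switches
    # carrying the trusted label (hypothetical treatment name).
    flow_rule = {
        "priority": 40000,
        "isPermanent": True,
        "criteria": [{"type": "METADATA", "metadata": "trusted"}],
        "treatment": "forward-on-trusted-path",
    }
    return {"pod": pod_spec, "flow_rule": flow_rule}

# Constraints as the LLM might extract them from the sample intent.
policies = compile_intent({"region": "EU", "trusted": True})
print(json.dumps(policies["pod"]["spec"]["nodeSelector"]))
# → {"region": "EU", "trusted": "yes"}
```

The key design point is that the LLM only produces the small constraint dictionary; the deterministic compiler owns the final manifest shape, which limits the blast radius of nondeterministic LLM output.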
Formally, the authors define label functions λ_N for compute nodes and λ_V for network vertices, a placement function σ mapping pods to nodes, and a routing set ρ describing allowed paths. An intent I is compiled into a set of compute constraints Φ_C and network constraints Φ_N. A configuration C = ⟨σ, ρ⟩ satisfies I under the current label state if (i) every placement constraint in Φ_C holds for σ given λ_N, and (ii) every routing constraint in Φ_N holds for the paths induced by ρ in the network graph G given λ_V. This formalism enables precise verification of policy compliance and supports dynamic re‑configuration when workloads change.
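The satisfaction check above is directly executable. The sketch below encodes label functions as dictionaries, σ as a pod→node map, and ρ as a list of vertex paths; `satisfies` returns whether C = ⟨σ, ρ⟩ meets both Φ_C and Φ_N. All identifiers and the label-equality form of the constraints are assumptions for illustration.

```python
def satisfies(intent, placement, routes, node_labels, vertex_labels):
    """Check C = <sigma, rho> against an intent's compute constraints
    (Phi_C) and network constraints (Phi_N). Labels, placement, and
    routes are plain dicts/lists; an illustrative sketch only."""
    # Phi_C: each pod's assigned node must carry the required labels (lambda_N).
    for pod, required in intent["compute"].items():
        labels = node_labels[placement[pod]]
        if any(labels.get(k) != v for k, v in required.items()):
            return False
    # Phi_N: every vertex on every allowed path must satisfy the
    # routing constraint's required labels (lambda_V).
    for required in intent["network"]:
        for path in routes:
            for vertex in path:
                if any(vertex_labels[vertex].get(k) != v for k, v in required.items()):
                    return False
    return True

# Toy label state and a compiled EU-locality / trusted-path intent.
node_labels = {"edge-1": {"region": "EU", "trusted": "yes"},
               "cloud-us": {"region": "US", "trusted": "yes"}}
vertex_labels = {"s1": {"trusted": "yes"}, "s2": {"trusted": "no"}}
intent = {"compute": {"db-pod": {"region": "EU"}},
          "network": [{"trusted": "yes"}]}

print(satisfies(intent, {"db-pod": "edge-1"}, [["s1"]], node_labels, vertex_labels))    # True
print(satisfies(intent, {"db-pod": "cloud-us"}, [["s1"]], node_labels, vertex_labels))  # False
```

Because the check is a pure function of the current label state, it can be re-run after any re-placement or route update, which is what makes dynamic reconfiguration verifiable.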
To evaluate the approach, the authors curated a benchmark of 90 diverse privacy intents covering data locality, infrastructure avoidance, trust‑zone constraints, and provider restrictions. They built an automated validator that integrates a hybrid Kubernetes‑ONOS test‑bed, deploys the LLM‑generated policies, monitors the resulting runtime state, and produces pass/fail compliance reports without human intervention. Across all trials, the system produced correct, enforcement‑ready policies in 86 of 90 cases (95.6% accuracy). The average latency from intent submission to policy deployment was approximately 21 seconds, fast enough for practical use, and service disruption during re‑placement or path updates was measured at less than 50 ms.
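The validator's aggregation step can be sketched as a small harness: run each benchmark case through a compliance check against the observed runtime state, then roll the results up into a pass/fail report. The case structure, `validate_benchmark`, and the label-comparison check are hypothetical stand-ins for the paper's test-bed integration.

```python
def validate_benchmark(cases, check_compliance):
    """Aggregate per-intent compliance checks into an automated
    pass/fail report (illustrative sketch of the validator loop)."""
    results = []
    for case in cases:
        ok = check_compliance(case["expected"], case["observed"])
        results.append({"intent": case["intent"], "pass": ok})
    passed = sum(r["pass"] for r in results)
    return {
        "results": results,
        "passed": passed,
        "total": len(results),
        "accuracy": passed / len(results) if results else 0.0,
    }

# Toy cases: labels the intent requires vs. labels observed at runtime.
cases = [
    {"intent": "EU locality", "expected": {"region": "EU"},
     "observed": {"region": "EU"}},
    {"intent": "trusted path", "expected": {"trusted": "yes"},
     "observed": {"trusted": "no"}},
]
same_labels = lambda exp, obs: all(obs.get(k) == v for k, v in exp.items())
report = validate_benchmark(cases, same_labels)
print(report["passed"], "/", report["total"])  # 1 / 2
```

In the paper's setting, `check_compliance` would query the live Kubernetes and ONOS state rather than compare static dictionaries, but the report shape is the same.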
The paper also discusses security assumptions: the orchestration core (Kubernetes control plane and SDN controller) is trusted, infrastructure labels are immutable and provisioned by secure mechanisms (e.g., attestation, RBAC), and the threat model excludes a fully compromised control plane. Potential attacks include observation of non‑compliant traffic, intent ambiguity exploitation, and configuration mistakes. The authors acknowledge that LLM outputs can be nondeterministic, requiring prompt engineering and post‑generation validation to avoid policy drift. Moreover, the current framework focuses on label‑based placement and routing; it does not incorporate end‑to‑end encryption, key management, or dynamic data‑flow confidentiality mechanisms.
Future work outlined includes extending the approach to multi‑cloud and multi‑vendor environments, standardizing label schemas across compute and network domains, developing conflict‑resolution algorithms for overlapping intents, and integrating formal verification techniques (e.g., SMT solving) to prove that generated policies satisfy the original intent. The authors also propose exploring tighter coupling between LLM‑driven intent compilation and runtime monitoring to enable closed‑loop adaptation.
In summary, this research demonstrates that large language models can serve as effective translators from natural‑language privacy requirements to concrete, cross‑layer orchestration policies. By automating the generation of Kubernetes and SDN configurations, the framework lowers the expertise barrier for privacy‑compliant deployment across heterogeneous edge‑cloud infrastructures, achieving high accuracy, low latency, and minimal performance overhead.