Dora: QoE-Aware Hybrid Parallelism for Distributed Edge AI
With the proliferation of edge AI applications, satisfying user quality-of-experience (QoE) requirements, such as model inference latency, has become a first-class objective, as these models operate in resource-constrained settings and directly interact with users. Yet modern AI models routinely exceed the resource capacity of individual devices, necessitating distributed execution across heterogeneous devices over variable and contention-prone networks. Existing planners for hybrid (e.g., data and pipeline) parallelism largely optimize for throughput or device utilization, overlooking QoE, which leads to severe resource inefficiency (e.g., unnecessary energy drain) or QoE violations under runtime dynamics. We present Dora, a framework for QoE-aware hybrid parallelism in distributed edge AI training and inference. Dora jointly optimizes heterogeneous computation, contention-prone networks, and multi-dimensional QoE objectives via three key mechanisms: (i) a heterogeneity-aware model partitioner that determines and assigns model partitions across devices, forming a compact set of QoE-compliant plans; (ii) a contention-aware network scheduler that further refines these candidate plans by maximizing compute-communication overlap; and (iii) a runtime adapter that adaptively composes multiple plans to maximize global efficiency while respecting overall QoE targets. Across representative edge deployments, including smart homes, traffic analytics, and small edge clusters, Dora achieves 1.1–6.3× faster execution and, alternatively, reduces energy consumption by 21–82%, all while maintaining QoE under runtime dynamics.
💡 Research Summary
The paper addresses a pressing challenge in modern edge AI: delivering high‑quality user experiences (QoE) such as low inference latency while operating on resource‑constrained, heterogeneous devices connected through variable, contention‑prone networks. Existing hybrid parallelism planners—those that combine data parallelism and pipeline parallelism—primarily optimize for throughput or device utilization. They ignore multi‑dimensional QoE constraints, which leads to either unnecessary energy consumption or QoE violations when runtime conditions change.
Dora is introduced as a comprehensive framework that makes QoE a first‑class objective in the planning and execution of distributed edge AI workloads. Its design rests on three tightly coupled mechanisms.
- Heterogeneity‑Aware Model Partitioner – The system first profiles each device’s compute capability, memory capacity, and power characteristics, and represents the AI model as a directed computation graph. Using a multi‑objective optimization formulation, the partitioner searches for a compact set of model partitions that satisfy predefined QoE constraints (latency, energy budget, accuracy loss). Because the underlying partitioning problem is NP‑hard, Dora employs a hybrid heuristic that blends evolutionary search with local refinement, ensuring that the resulting candidate plans are both feasible and diverse.
- Contention‑Aware Network Scheduler – Recognizing that edge networks (Wi‑Fi, BLE, cellular, etc.) are highly dynamic, Dora continuously monitors bandwidth, round‑trip time, and packet loss. It then refines each candidate plan by scheduling communications to maximize compute‑communication overlap. The scheduler permits asynchronous data transfers between pipeline stages, applies on‑the‑fly compression or sparsification for large tensors, and opportunistically caches intermediate results to reduce retransmission. The objective function explicitly rewards higher overlap, which directly translates into lower end‑to‑end latency under contention.
- Runtime Adapter – During execution, workload characteristics (e.g., request burstiness, battery level) and network conditions may deviate from the planning assumptions. The runtime adapter detects such deviations and dynamically composes or switches among the pre‑generated plans. A state‑transition model quantifies the cost of moving from one plan to another, ensuring that plan changes do not themselves cause QoE violations. Continuous QoE monitoring enforces service‑level agreements (SLAs) and triggers re‑optimization when necessary.
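The compute‑communication overlap that the scheduler rewards can be made concrete with a small cost model. The sketch below is illustrative and not taken from the paper: it assumes per‑stage compute and activation‑transfer times come from profiling, and it idealizes overlap as transfers running fully concurrently with compute. The function name `pipeline_period` and the numbers are invented for this sketch.

```python
def pipeline_period(compute_ms, transfer_ms, overlap):
    """Idealized steady-state time per batch for a linear pipeline.

    compute_ms[i]  -- compute time of stage i on its assigned device
    transfer_ms[i] -- time to send stage i's activations to stage i+1
                      (0 for the final stage)

    With overlap, transfers are assumed to be fully hidden behind compute,
    so the pipeline is paced by the slowest single activity; without
    overlap, each device serializes its compute and its communication.
    """
    if overlap:
        return max(max(compute_ms), max(transfer_ms))
    return max(c + t for c, t in zip(compute_ms, transfer_ms))


# Three heterogeneous stages; the middle device is the compute bottleneck.
compute_ms = [200, 300, 100]
transfer_ms = [50, 150, 0]

print(pipeline_period(compute_ms, transfer_ms, overlap=True))   # 300
print(pipeline_period(compute_ms, transfer_ms, overlap=False))  # 450
```

Under this toy model, overlapping communication with compute cuts the per‑batch period from 450 ms to 300 ms, which is the kind of gain the scheduler's objective function is designed to capture.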
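The runtime adapter's switching rule can likewise be sketched. The code below is a hypothetical illustration under simplified assumptions: each candidate plan carries profiled latency and energy estimates, QoE is reduced to a single latency SLO, and the state‑transition cost is a flat latency penalty. The names `Plan` and `pick_plan` are invented for this sketch and do not appear in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    latency_s: float  # predicted end-to-end inference latency
    energy_j: float   # predicted energy per request


def pick_plan(plans, current, latency_slo_s, switch_penalty_s):
    """Pick the lowest-energy SLO-compliant plan, charging a transition cost.

    A candidate other than the current plan is adopted only if its latency
    plus the one-off switch penalty still meets the SLO, so a plan change
    cannot itself cause a QoE violation. If no plan meets the SLO, degrade
    gracefully to the fastest plan available.
    """
    feasible = [p for p in plans if p.latency_s <= latency_slo_s]
    if not feasible:
        return min(plans, key=lambda p: p.latency_s)
    best = min(feasible, key=lambda p: p.energy_j)
    if best == current:
        return current
    if best.latency_s + switch_penalty_s <= latency_slo_s:
        return best
    return current  # saving energy is not worth risking the SLO


fast = Plan("data-parallel", latency_s=0.10, energy_j=9.0)
green = Plan("pipeline", latency_s=0.20, energy_j=3.0)

# Cheap switch: adopt the greener plan. Costly switch: stay put.
print(pick_plan([fast, green], fast, 0.25, 0.04).name)  # pipeline
print(pick_plan([fast, green], fast, 0.25, 0.10).name)  # data-parallel
```

The design choice mirrored here is the one the summary attributes to Dora: energy is minimized only within the set of QoE‑compliant plans, and the transition cost is checked against the SLO before any switch.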
The authors evaluate Dora on three representative edge deployments: (a) a smart‑home scenario with multiple Raspberry Pi devices and a low‑power MCU running a speech‑recognition model, (b) a traffic‑analytics use case using three NVIDIA Jetson edge GPUs connected via a 5G backhaul, and (c) a small industrial setting with six ARM Cortex‑A53 MCUs over Wi‑Fi handling anomaly‑detection inference. For each scenario, Dora is compared against three baselines: pure data parallelism, pure pipeline parallelism, and a recent hybrid planner that optimizes only for throughput.
Results show that Dora consistently outperforms the baselines. Across all workloads, Dora reduces inference latency by a factor of 1.1 to 6.3 (average 3.2×) and cuts energy consumption by 21 % to 82 % (average 54 %). Importantly, even when network contention is artificially increased by 30 %, Dora maintains zero QoE violations, whereas the baselines experience up to 45 % SLA breaches. The paper also reports detailed ablation studies that isolate the contribution of each mechanism: the partitioner alone yields up to 2.1× latency improvement, the scheduler adds another 1.4×, and the runtime adapter provides the final robustness under dynamics.
Beyond performance numbers, the work makes several conceptual contributions. First, it demonstrates that QoE can be formalized as a set of hard constraints and soft objectives that are tractable for edge‑scale optimization. Second, it shows that jointly optimizing heterogeneous compute, contention‑aware networking, and adaptive runtime composition is feasible without incurring prohibitive planning overhead (average planning time < 2 seconds on a modest edge server). Third, it provides a reusable pipeline—profiling → candidate generation → refinement → adaptation—that can be extended to other QoE dimensions such as privacy or security.
The authors acknowledge limitations. Dora relies on a priori QoE models; integrating user‑feedback‑driven learning to adjust QoE weights in real time would further enhance adaptability. Moreover, scalability to large clusters (tens to hundreds of devices) remains an open question, as the combinatorial explosion of partition candidates could stress the heuristic search. Future work may explore hierarchical planning or federated learning to distribute the planning burden.
In conclusion, Dora represents a significant step toward QoE‑centric edge AI orchestration. By unifying model partitioning, network scheduling, and runtime adaptation, it delivers faster, more energy‑efficient inference while guaranteeing user‑perceived performance even under volatile edge conditions. This framework is poised to become a foundational building block for latency‑sensitive, interactive edge services such as augmented reality, autonomous robotics, and smart‑city analytics.