Collaborative Processing for Multi-Tenant Inference on Memory-Constrained Edge TPUs
IoT applications increasingly rely on on-device AI accelerators to ensure high performance, especially in limited-connectivity and safety-critical scenarios. However, the limited on-chip memory of these accelerators forces inference runtimes to swap model segments between host and accelerator memory, substantially inflating latency. While collaborative processing, which partitions a model's computation between CPU and accelerator resources, can reduce accelerator memory pressure and latency, naive partitioning may worsen end-to-end latency by either shifting excessive computation to the CPU or failing to sufficiently curb swapping, a problem that is further amplified in multi-tenant and dynamic environments. To address these issues, we present SwapLess, a system for adaptive, multi-tenant TPU-CPU collaborative inference on memory-constrained Edge TPUs. SwapLess utilizes an analytic queueing model that captures partition-dependent CPU/TPU service times as well as inter- and intra-model swapping overheads across different workload mixes and request rates. Using this model, SwapLess continuously adjusts both the partition point and CPU core allocation online to minimize end-to-end response time with low decision overhead. An implementation on Edge TPU-equipped platforms demonstrates that SwapLess reduces mean latency by up to 63.8% for single-tenant workloads and up to 77.4% for multi-tenant workloads relative to the default Edge TPU compiler.
💡 Research Summary
The paper addresses the severe latency penalties caused by memory‑swapping on Edge TPUs, which have only 8 MB of on‑chip SRAM. When a model exceeds this capacity, the TPU runtime must repeatedly swap model segments between host memory and the accelerator, and in multi‑tenant scenarios the swapping overhead can dominate inference time. The authors propose SwapLess, a system that jointly optimizes where to split each DNN between the TPU and the host CPU and how many CPU cores to allocate to each offloaded suffix.
SwapLess operates in two phases. In an offline phase it enumerates all feasible partition points for each model, compiles the TPU prefix and CPU suffix, and profiles their execution times, memory footprints, and swapping costs. In the online phase it uses these measurements in an analytic queueing model: the TPU is modeled as an M/G/1/FCFS queue, while each CPU suffix is modeled as an M/M/k queue with k equal to the number of allocated cores. The model incorporates (i) the TPU’s utilization and service‑time variance, (ii) the probability that a request will miss its weights in TPU memory (α), and (iii) deterministic inter‑model loading latency derived from weight size and the measured host‑to‑TPU bandwidth. Expected waiting times are computed via the Pollaczek‑Khinchine formula for the TPU and standard M/M/k results for the CPU.
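The two waiting-time building blocks above are standard closed-form results. A minimal sketch of both, assuming Poisson arrivals and the notation from the summary (arrival rate λ, service moments E[S] and E[S²], k cores at rate μ each); the function names are illustrative, not from the paper:

```python
import math

def mg1_wait(lam, es, es2):
    """Mean queueing delay of an M/G/1/FCFS server via the
    Pollaczek-Khinchine formula: W_q = lam * E[S^2] / (2 * (1 - rho)).
    lam: arrival rate; es: mean service time E[S]; es2: second moment E[S^2].
    In SwapLess's model, swap-miss loading time would be folded into the
    service-time moments (e.g., S' = S + alpha * load_time)."""
    rho = lam * es  # server utilization
    assert rho < 1.0, "queue must be stable (rho < 1)"
    return lam * es2 / (2.0 * (1.0 - rho))

def mmk_wait(lam, mu, k):
    """Mean queueing delay of an M/M/k server pool (Erlang C):
    W_q = C(k, a) / (k*mu - lam), with offered load a = lam/mu."""
    rho = lam / (k * mu)
    assert rho < 1.0, "queue must be stable (rho < 1)"
    a = lam / mu
    # Erlang C: probability an arriving request has to wait
    head = sum(a**n / math.factorial(n) for n in range(k))
    tail = a**k / (math.factorial(k) * (1.0 - rho))
    erlang_c = tail / (head + tail)
    return erlang_c / (k * mu - lam)
```

As a sanity check, both formulas agree in the M/M/1 special case, where exponential service with mean 1/μ gives E[S²] = 2/μ² and W_q = ρ/(μ − λ).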
Using these latency estimates, SwapLess runs a greedy hill‑climbing algorithm that iteratively adjusts a single model’s partition point or its core allocation if the change reduces the overall average response time. The search space is limited to the pre‑computed partition points and core counts, keeping the decision overhead sub‑millisecond.
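The search described above can be sketched as a generic single-move hill climb over the pre-computed configuration space. This is an illustrative reconstruction, not the paper's implementation: `predict_latency` stands in for the queueing-model evaluation, and the per-model total-core budget constraint is omitted for brevity:

```python
def greedy_search(models, partition_points, max_cores, predict_latency):
    """Single-move hill climbing over (partition point, core count) per model.
    models: list of model names
    partition_points: dict model -> ordered list of feasible split indices
                      (pre-computed in the offline phase)
    predict_latency: callable(config) -> predicted mean response time,
                     where config maps model -> (split, cores)
    Returns a locally optimal config and its predicted latency."""
    # start each model at its first partition point with one core
    config = {m: (partition_points[m][0], 1) for m in models}
    best = predict_latency(config)
    while True:
        best_move = None
        for m in models:
            split, cores = config[m]
            pts = partition_points[m]
            i = pts.index(split)
            # neighbors: shift the split by one step, or add/remove one core
            moves = []
            if i > 0:
                moves.append((pts[i - 1], cores))
            if i < len(pts) - 1:
                moves.append((pts[i + 1], cores))
            if cores > 1:
                moves.append((split, cores - 1))
            if cores < max_cores:
                moves.append((split, cores + 1))
            for cand in moves:
                trial = {**config, m: cand}
                t = predict_latency(trial)
                if t < best:
                    best, best_move = t, (m, cand)
        if best_move is None:
            break  # no single move improves: local optimum
        m, cand = best_move
        config[m] = cand
    return config, best
```

Because each step only re-evaluates a handful of neighboring configurations against the analytic model, the per-decision cost stays small, which is consistent with the sub-millisecond overhead the summary reports.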
Experimental evaluation on Edge‑TPU‑equipped platforms (Coral Dev Board, Raspberry Pi 4) with a suite of CNNs (ResNet‑50V2, Inception‑V4, DenseNet‑201, MobileNet‑V2, SqueezeNet, EfficientNet, GPUNet) shows substantial gains. For single‑tenant workloads, SwapLess cuts mean latency by up to 63.8% compared with the default Edge TPU compiler. In multi‑tenant workloads with mixed request patterns (e.g., 50:50 and 90:10 mixes), the system reduces mean latency by up to 77.4%. The adaptive allocator reacts quickly to changes in request rates and model arrivals, and its overhead remains below 1 ms, confirming suitability for real‑time edge services.
In summary, SwapLess demonstrates that a principled combination of analytical queueing theory and lightweight online optimization can effectively mitigate memory‑swapping bottlenecks on memory‑constrained accelerators, enabling high‑performance, multi‑tenant inference on Edge TPUs.