No DNN Left Behind: Improving Inference in the Cloud with Multi-Tenancy
With the rise of machine learning, inference on deep neural networks (DNNs) has become a core building block on the critical path for many cloud applications. Applications today rely on isolated ad-hoc deployments that force users to compromise on consistent latency, elasticity, or cost-efficiency, depending on workload characteristics. We propose to elevate DNN inference to be a first-class cloud primitive provided by a shared multi-tenant system, akin to cloud storage and cloud databases. A shared system enables cost-efficient operation with consistent performance across the full spectrum of workloads. We argue that DNN inference is an ideal candidate for a multi-tenant system because of its narrow and well-defined interface and predictable resource requirements.
💡 Research Summary
The paper addresses a pressing mismatch between the growing demand for deep neural network (DNN) inference in cloud‑based applications and the current deployment paradigm that relies on isolated virtual machines (VMs) or containers. While VMs provide strong isolation, they are ill‑suited for inference workloads that require millisecond‑scale latency, rapid elasticity, and cost proportional to actual compute. The authors identify four practical problems with the status quo: (1) over‑provisioning leaves expensive resources, especially GPUs/TPUs, sitting idle; (2) auto‑scaling decisions are made at coarse‑grained time intervals (minutes), causing latency spikes during sudden traffic bursts; (3) cold‑start latency is incurred whenever previously idle VMs are re‑instantiated; and (4) pricing models charge for total VM uptime rather than per‑inference work, penalizing users for idle time.
To overcome these limitations, the paper proposes elevating DNN inference to a first‑class cloud primitive delivered via a shared, multi‑tenant system—analogous to cloud storage or databases. In this architecture, a logically centralized controller manages model registration, replication, and request routing, while a fleet of long‑lived worker processes host many tenants’ models concurrently. High‑demand models are replicated across multiple workers to provide fault tolerance and elasticity; low‑traffic models incur no cold‑start penalty because workers remain alive even when idle.
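The controller-and-workers architecture described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the class and method names (`Controller`, `register`, `route`) and the least-loaded replica-selection policy are assumptions chosen to make the replication and routing idea concrete.

```python
from collections import defaultdict

class Controller:
    """Toy sketch of a logically centralized controller: it tracks which
    long-lived workers host which tenants' models and routes each request
    to the least-loaded replica (an assumed policy, for illustration)."""

    def __init__(self):
        self.replicas = defaultdict(list)  # model_id -> [worker_id, ...]
        self.load = defaultdict(int)       # worker_id -> in-flight requests

    def register(self, model_id, worker_ids):
        # High-demand models get several replicas for elasticity and fault
        # tolerance; a low-traffic model may live on a single worker, which
        # stays alive even when idle, so there is no cold-start penalty.
        self.replicas[model_id].extend(worker_ids)

    def route(self, model_id):
        workers = self.replicas[model_id]
        if not workers:
            raise KeyError(f"model {model_id!r} is not registered")
        # Send the request to the replica with the fewest in-flight requests.
        worker = min(workers, key=lambda w: self.load[w])
        self.load[worker] += 1
        return worker

    def complete(self, worker_id):
        # Called when a worker finishes a request.
        self.load[worker_id] -= 1
```

A real system would additionally handle worker failures, replica placement, and admission control, but the core routing loop is this simple because the unit of work (one inference request) is small and uniform.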
A key design decision is to restrict the runtime to a fixed, internal execution engine that consumes models expressed in the Open Neural Network Exchange (ONNX) format. This eliminates the need to run arbitrary user code, simplifying security and enabling aggressive system‑wide optimizations. Although this limits custom layer support, the authors argue that the vast majority of production DNNs use standard layers, and new layers can be added to the system once they become mainstream.
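Because the runtime accepts only declarative ONNX graphs rather than arbitrary code, admission can reduce to a one-time structural check. The sketch below is hypothetical: the operator whitelist is an illustrative subset of standard ONNX op types, and `validate_graph` stands in for a check the system might run over the node types parsed from a submitted model.

```python
# Illustrative subset of standard ONNX operator types that a fixed,
# internal execution engine might support (an assumption, not the
# paper's actual whitelist).
SUPPORTED_OPS = {
    "Conv", "Relu", "MaxPool", "Gemm", "BatchNormalization",
    "Add", "Softmax", "Flatten",
}

def validate_graph(op_types):
    """Reject models using operators the shared runtime cannot execute.

    `op_types` stands in for the list of node op types parsed from an
    ONNX graph at registration time; rejected models never reach a worker,
    so no user-provided code ever runs there.
    """
    unsupported = sorted(set(op_types) - SUPPORTED_OPS)
    if unsupported:
        raise ValueError(f"unsupported operators: {unsupported}")
    return True
```

When a new layer type becomes mainstream, the operator set (and the engine behind it) is extended once, system-wide, rather than per tenant.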
Performance isolation, traditionally the hardest challenge in multi‑tenant environments, is mitigated by the inherent predictability of DNN inference. Each model’s FLOP count and memory footprint can be determined a priori, and inference lacks dynamic control flow. The system therefore employs a “predict‑measure‑feedback” loop: at request admission, the controller estimates latency based on model size and input dimensions; during execution, actual latency is measured; and the scheduler adjusts fair‑queueing and resource quotas in real time. Experiments with TVM‑compiled models show that 99th‑percentile latencies stay within 15% of the mean across diverse workloads, demonstrating tight latency bounds.
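The predict-measure-feedback loop can be illustrated with a minimal estimator. This is a sketch under stated assumptions, not the paper's scheduler: it predicts latency from a model's static FLOP count and an assumed hardware throughput, then corrects that throughput with an exponential moving average of measured latencies (the `alpha` blending factor and initial 50 GFLOP/s figure are made up for the example).

```python
class LatencyEstimator:
    """Predict-measure-feedback sketch for one model on one worker.

    Prediction uses the model's FLOP count, known a priori since DNN
    inference has no dynamic control flow; feedback nudges the assumed
    effective throughput toward what measurements actually imply.
    """

    def __init__(self, flops, init_gflops_per_s=50.0, alpha=0.2):
        self.flops = flops                      # static per-inference FLOPs
        self.gflops_per_s = init_gflops_per_s   # assumed starting throughput
        self.alpha = alpha                      # feedback blending factor

    def predict_ms(self):
        # Predicted latency = work / current throughput estimate.
        return self.flops / (self.gflops_per_s * 1e9) * 1e3

    def observe_ms(self, measured_ms):
        # Back out the throughput implied by the measurement and blend it
        # into the running estimate (exponential moving average).
        observed = self.flops / (measured_ms * 1e-3) / 1e9
        self.gflops_per_s = ((1 - self.alpha) * self.gflops_per_s
                             + self.alpha * observed)
```

A scheduler could use `predict_ms()` at admission to decide queueing and quotas, then call `observe_ms()` after each request so predictions track the worker's real behavior.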
Security is addressed by isolating model parameters and input/output buffers within the worker process, and by validating all client requests at the controller level. Since no user‑provided binaries run on the workers, the attack surface is dramatically reduced compared to container‑based approaches.
The proposed pricing model charges per inference request (or per compute unit) rather than per VM hour, aligning cost with actual usage and eliminating wasteful charges for idle accelerators. This model also encourages higher overall system utilization, as multiple tenants can share expensive GPUs/TPUs without competing for dedicated instances.
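The economics are easy to make concrete. The numbers below are invented for illustration (a hypothetical $3/hour GPU VM versus a hypothetical $0.0001 per request); the point is only that per-inference billing scales with traffic while VM-hour billing does not.

```python
def vm_hour_cost(hours, hourly_rate):
    """Status quo: pay for total VM uptime, whether or not it serves traffic."""
    return hours * hourly_rate

def per_inference_cost(num_requests, price_per_request):
    """Proposed model: pay only for inference work actually performed."""
    return num_requests * price_per_request

# Made-up example: a dedicated GPU VM kept up for a day versus 100k
# requests billed individually on a shared, multi-tenant accelerator.
dedicated = vm_hour_cost(24, 3.00)             # $72, regardless of traffic
shared = per_inference_cost(100_000, 0.0001)   # about $10, scales with use
```

A low-traffic tenant pays close to nothing instead of a full idle-VM bill, while the provider recoups the accelerator's cost by packing many such tenants onto it.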
In summary, the paper presents a compelling case for re‑architecting cloud DNN inference as a shared, multi‑tenant service. By leveraging the deterministic nature of neural network computation, it achieves low, predictable latency, rapid elasticity, strong performance isolation, and a usage‑based pricing scheme. The design bridges the gap between the needs of latency‑sensitive online applications and the economics of cloud resource provisioning, positioning inference as a core cloud primitive on par with storage and databases.