AIvailable: A Software-Defined Architecture for LLM-as-a-Service on Heterogeneous and Legacy GPUs


📝 Abstract

The rise of Large Language Models (LLMs) has increased the need for scalable, high-performance inference systems, yet most existing frameworks assume homogeneous, resource-rich hardware, an assumption that is often unrealistic in academic or otherwise resource-constrained settings. We introduce AIvailable, a low-cost, highly available LLM-as-a-Service (LLMaaS) platform that uses a software-defined approach to run LLMs across heterogeneous and legacy GPU nodes, including NVIDIA and AMD devices, with a focus on fully utilizing each node’s VRAM. AIvailable performs fully GPU-accelerated inference without CPU fallbacks, and features a unified client interface that allows seamless interaction with all deployed LLMs through a single logical unit. The architecture comprises four main components: the Client Interface for user access, the Service Frontend for secure request routing and load balancing, the SDAI Controller for orchestration, deployment, and monitoring, and the Service Backend of heterogeneous GPU nodes executing workloads. By abstracting GPU-specific details and providing dynamic, VRAM-aware allocation and reallocation of models, AIvailable ensures efficient use of resources and resilience against failures and workload fluctuations. Targeting academic labs, private companies, and other constrained organizations, it supports diverse open LLMs, helping democratize generative AI through the repurposing of legacy GPUs.


📄 Content

AIVAILABLE: A SOFTWARE-DEFINED ARCHITECTURE FOR LLM-AS-A-SERVICE ON HETEROGENEOUS AND LEGACY GPUS

A PREPRINT

Pedro Antunes, Ana Rita Ortigoso, Gabriel Vieira, Daniel Fuentes, Luís Frazão, Nuno Costa, and António Pereira
Computer Science and Communication Research Centre, Polytechnic University of Leiria, Portugal
(pedro.m.antunes, ana.l.ortigoso, gabriel.m.vieira, daniel.fuentes, luis.frazao, nuno.costa, apereira)@ipleiria.pt

November 18, 2025
Keywords: LLM · Heterogeneous · High Availability · Low-Cost · LLMaaS · SDAI

1 Introduction

LLMs are increasingly being integrated into various aspects of daily life and professional work [1]. However, not all individuals or organizations possess the resources to deploy high-end LLM solutions, which often require access to powerful GPUs. In practice, many institutions, such as schools, universities, and small-to-medium enterprises (SMEs), rely on mid-range hardware like older NVIDIA RTX or AMD series GPUs and, in some cases, even legacy devices such as the NVIDIA GTX series [2]. Beyond resource constraints, some organizations also prefer to host LLMs locally due to concerns over data privacy, compliance requirements, or the need for tight integration with existing development workflows. This further reinforces the demand for low-cost, locally deployable solutions that can operate effectively on heterogeneous and resource-limited infrastructures.

ORCIDs: Ana Rita Ortigoso 0009-0001-7529-5857 · Gabriel Vieira 0009-0000-2300-8441 · Daniel Fuentes 0000-0001-9726-1087 · Luís Frazão 0000-0003-2571-7940 · Nuno Costa 0000-0002-2353-369X · António Pereira 0000-0001-5062-1241

arXiv:2511.11621v1 [cs.DC] 6 Nov 2025

To address this gap, we introduce AIvailable, a low-cost, highly available LLM-as-a-Service (LLMaaS) platform designed specifically for SMEs and institutions with limited GPU capacity. AIvailable enables effective LLM deployment without the need for high-end infrastructure, lowering barriers to entry while maintaining accessibility and performance. This paper introduces an architecture designed to maximize the utilization of available hardware resources, with a particular focus on leveraging the full VRAM capacity of each computational node. The approach enables users to deploy and operate LLMs in a manner that fully exploits the capabilities of their selected hardware, regardless of heterogeneity across nodes.
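The VRAM-maximizing deployment described above can be approximated by a greedy best-fit assignment of models to GPU nodes. This is a minimal sketch, not the paper's actual algorithm: the node names, model names, VRAM footprints, and the best-fit policy are all illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class GPUNode:
    """A heterogeneous backend node described only by its VRAM budget."""
    name: str
    vram_total_mb: int
    vram_used_mb: int = 0
    models: list = field(default_factory=list)

    @property
    def vram_free_mb(self) -> int:
        return self.vram_total_mb - self.vram_used_mb

def place_models(models: dict[str, int], nodes: list[GPUNode]) -> dict[str, str]:
    """Greedily assign each model (name -> VRAM footprint in MB) to the
    node whose free VRAM it fills most tightly (best-fit)."""
    placement: dict[str, str] = {}
    # Placing the largest models first reduces fragmentation.
    for model, need in sorted(models.items(), key=lambda kv: -kv[1]):
        candidates = [n for n in nodes if n.vram_free_mb >= need]
        if not candidates:
            raise RuntimeError(f"no node can host {model} ({need} MB)")
        node = min(candidates, key=lambda n: n.vram_free_mb - need)
        node.vram_used_mb += need
        node.models.append(model)
        placement[model] = node.name
    return placement

# Hypothetical fleet: one legacy NVIDIA card and one mid-range AMD card.
nodes = [GPUNode("gtx1080", 8192), GPUNode("rx6800", 16384)]
placement = place_models({"llama3-8b-q4": 6000, "phi3-mini": 3000}, nodes)
```

Best-fit keeps large contiguous VRAM blocks free for future (re)allocations, which matters when the controller must migrate models after a node failure; a real controller would also account for KV-cache growth rather than a fixed footprint.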
In addition, the architecture provides a unified client interface through which users can seamlessly communicate with all LLM instances they have deployed, across all chosen nodes, without the need to manage separate endpoints or configurations. In doing so, AIvailable is committed to democratising the use of LLMs for everyone.

2 Related Work

The deployment of LLMs on heterogeneous and resource-constrained infrastructures has become a growing area of interest, particularly as organizations seek alternatives to high-end datacenter solutions. Several approaches have been proposed to address challenges related to availability, efficiency, and accessibility. On this matter, [3] introduces a distributed serving architecture designed to provide low-latency inference across GPU clusters while maintaining resilience to node failures. Although effective in high-performance settings, this approach assumes access to datacenter-grade accelerators (e.g., A100 or H100 GPUs), which makes it less suitable for SMEs or educational institutions relying on legacy hardware. Complementary to this, [4] propose adaptive scheduling policies for LLM serving, aiming to maximize GPU memory…
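The single-logical-endpoint idea above can be sketched as a routing table kept by the frontend, mapping each deployed model to the backend node that hosts it. Every name, URL, port, and the payload shape here are hypothetical; the paper does not specify its wire protocol.

```python
# Hypothetical registry maintained by the SDAI-style controller:
# model name -> base URL of the backend node currently hosting it.
MODEL_REGISTRY: dict[str, str] = {
    "llama3-8b": "http://node-a:8000",
    "mistral-7b": "http://node-b:8000",
}

def route_request(model: str, prompt: str) -> tuple[str, dict]:
    """Resolve which backend hosts `model` and build the request the
    frontend would forward; clients only ever see one logical endpoint."""
    try:
        backend = MODEL_REGISTRY[model]
    except KeyError:
        raise ValueError(f"model {model!r} is not deployed") from None
    payload = {"model": model, "prompt": prompt}
    return f"{backend}/api/generate", payload

url, payload = route_request("llama3-8b", "Hello!")
```

When the controller reallocates a model to a different node, only the registry entry changes; clients keep using the same logical name, which is what makes failover transparent to them.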
