High availability using virtualization


High availability has always been one of the main problems for a data center. Until now, it has been achieved through host-per-host redundancy, a method that is highly expensive in terms of hardware and human costs. Virtualization offers a new approach to the problem: it makes it possible to build a redundancy system for all the services running in a data center. This approach distributes the running virtual machines across the servers that are up, exploiting the features of the virtualization layer: starting, stopping, and moving virtual machines between physical hosts. The system (3RC) is based on a finite state machine with hysteresis, which makes it possible to restart each virtual machine on any physical host or to reinstall it from scratch. A complete infrastructure has been developed to install the operating system and middleware in a few minutes. To virtualize the main servers of a data center, a new procedure has been developed to migrate physical hosts to virtual ones. The whole SNS-PISA Grid data center is currently running in a virtual environment under the high availability system. As an extension of the 3RC architecture, several storage solutions, from NAS to SAN, have been tested to store and centralize all the virtual disks, ensuring data safety and access from everywhere. By exploiting virtualization and the ability to automatically reinstall a host, we provide a sort of host on demand, where action is taken on a virtual machine only when a disaster occurs.

💡 Research Summary

The paper presents a novel high‑availability (HA) solution for data‑center environments that leverages virtualization rather than traditional host‑by‑host redundancy. The proposed system, called 3RC (Three‑Round‑Control), treats the virtualization layer as the primary recovery mechanism, using its native capabilities to start, stop, and migrate virtual machines (VMs) across physical hosts. At its core, 3RC is built around a finite‑state machine (FSM) with hysteresis. The FSM explicitly models VM lifecycle states—running, stopped, recovery attempt, reinstall, and completed—and defines deterministic transitions based on health checks and timeout thresholds. Hysteresis prevents endless restart loops by introducing a waiting period or an alternative recovery path after a configurable number of failed attempts, thereby improving overall system stability.
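The state machine described above can be illustrated with a small sketch. This is not the authors' implementation: the class name `VmFsm`, the thresholds `fail_threshold` and `max_restarts`, and the exact transition rules are assumptions chosen only to show how hysteresis can keep a single failed health check from triggering recovery, and how repeated failed restarts can escalate to a reinstall.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    RUNNING = "running"
    STOPPED = "stopped"
    RECOVERY = "recovery attempt"
    REINSTALL = "reinstall"
    COMPLETED = "completed"


@dataclass
class VmFsm:
    """Hypothetical per-VM state machine; thresholds are illustrative."""
    fail_threshold: int = 3   # consecutive failed checks before leaving RUNNING
    max_restarts: int = 2     # restart attempts before escalating to REINSTALL
    state: State = State.RUNNING
    failed_checks: int = 0
    restarts: int = 0

    def step(self, healthy: bool) -> State:
        """Advance one tick given the latest health-check result."""
        if self.state is State.RUNNING:
            # Hysteresis: one good check resets the failure counter.
            self.failed_checks = 0 if healthy else self.failed_checks + 1
            if self.failed_checks >= self.fail_threshold:
                self.state = State.STOPPED
        elif self.state is State.STOPPED:
            # The health result is ignored here; only the restart budget matters.
            if self.restarts < self.max_restarts:
                self.restarts += 1
                self.state = State.RECOVERY
            else:
                self.state = State.REINSTALL
        elif self.state is State.RECOVERY:
            self.state = State.COMPLETED if healthy else State.STOPPED
        elif self.state is State.REINSTALL:
            # A failed reinstall stays in REINSTALL until a check succeeds.
            if healthy:
                self.state = State.COMPLETED
        elif self.state is State.COMPLETED:
            # Recovery finished: clear counters and resume normal monitoring.
            self.failed_checks = 0
            self.restarts = 0
            self.state = State.RUNNING
        return self.state


fsm = VmFsm()
trace = [fsm.step(h).name for h in (False, False, False, False, True, True)]
print(trace)
```

In this sketch, three consecutive failed checks move the VM out of `RUNNING`, a successful restart returns it there through `COMPLETED`, and a VM that fails two restarts in a row is sent to `REINSTALL` instead of being restarted indefinitely.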

A distinguishing feature of 3RC is its “automatic reinstall” capability. When a VM cannot be recovered by a simple restart, the system can provision a fresh operating system and middleware stack in a matter of minutes. This is achieved through a pre‑built image repository combined with scripted, network‑boot (PXE) installations such as Kickstart for Linux or unattended setup for Windows. The process is fully automated, eliminating manual intervention and ensuring a consistent configuration across all recovered instances.

Storage considerations are addressed by evaluating both Network‑Attached Storage (NAS) and Storage Area Network (SAN) solutions for centralizing virtual disk images. NAS offers cost‑effective, highly available file‑level access, while SAN provides block‑level performance and scalability for I/O‑intensive workloads. 3RC abstracts the underlying storage through a unified interface, allowing administrators to switch or combine storage back‑ends without disrupting the HA workflow, thus guaranteeing data safety and accessibility from any host.
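One way to picture such a unified interface is a small abstract class that hides whether a virtual disk is a file on a NAS mount or a block device on a SAN. The sketch below is purely illustrative: the class names, the path layouts, and the `start-vm` command are hypothetical placeholders, not taken from the paper or from any real hypervisor CLI.

```python
from abc import ABC, abstractmethod


class DiskStore(ABC):
    """Hypothetical back-end-neutral view of where a VM's disk lives."""

    @abstractmethod
    def disk_path(self, vm_name: str) -> str:
        """Return the path a hypervisor host would open for this VM."""


class NasStore(DiskStore):
    """File-level back-end: one image file per VM under a shared mount."""

    def __init__(self, mount_point: str) -> None:
        self.mount_point = mount_point.rstrip("/")

    def disk_path(self, vm_name: str) -> str:
        return f"{self.mount_point}/{vm_name}.img"


class SanStore(DiskStore):
    """Block-level back-end: one logical volume per VM (assumed layout)."""

    def __init__(self, volume_group: str) -> None:
        self.volume_group = volume_group

    def disk_path(self, vm_name: str) -> str:
        return f"/dev/{self.volume_group}/{vm_name}"


def start_command(vm_name: str, store: DiskStore) -> list[str]:
    # "start-vm" is a placeholder, not a real tool; the point is that the
    # caller never needs to know which kind of storage backs the disk.
    return ["start-vm", "--name", vm_name, "--disk", store.disk_path(vm_name)]


print(start_command("ce01", NasStore("/mnt/vmdisks/")))
print(start_command("ce01", SanStore("vg_vm")))
```

Because the recovery logic only calls `disk_path`, swapping a NAS for a SAN (or mixing both) would change which `DiskStore` object is passed in, not the restart workflow itself.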

The paper also details a systematic physical‑to‑virtual migration procedure. Existing critical services running on bare metal are captured as disk images, transformed to a virtual format, and redeployed with automatically generated CPU, memory, and network configurations. Network bridging and security group policies are recreated to preserve the original topology, minimizing service interruption during the migration.

The solution has been deployed in the SNS‑PISA grid data center in Italy, where hundreds of physical servers have been virtualized under 3RC. Real‑world measurements show an overall service availability exceeding 99.99% and a mean time to recover (MTTR) of 2–3 minutes after a failure, which is roughly five times faster than conventional redundancy schemes. In case of storage failures, the system seamlessly re‑hosts virtual disks on alternative NAS/SAN resources, preventing data loss and maintaining continuous operation.
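To put the quoted figures in perspective, the short calculation below converts an availability of 99.99% into a yearly downtime budget and checks how many recoveries of 2 to 3 minutes would fit inside it. It is a back-of-the-envelope sketch that assumes a 365-day year and that all downtime comes from recovery time; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability: float) -> float:
    """Yearly downtime allowed by a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def recoveries_within_budget(availability: float, mttr_minutes: float) -> float:
    """How many outages lasting mttr_minutes fit in the yearly budget."""
    return downtime_budget_minutes(availability) / mttr_minutes


budget = downtime_budget_minutes(0.9999)
print(f"downtime budget: {budget:.2f} minutes per year")
for mttr in (2.0, 3.0):
    count = recoveries_within_budget(0.9999, mttr)
    print(f"MTTR {mttr:.0f} min: about {count:.1f} recoveries per year")
```

Under these assumptions, 99.99% availability allows about 52.6 minutes of downtime a year, which corresponds to roughly 17 to 26 recoveries at the quoted MTTR.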

In conclusion, the authors demonstrate that virtualization‑based HA can replace costly host‑by‑host redundancy while delivering faster recovery, lower operational overhead, and greater flexibility. 3RC’s combination of FSM‑driven logic, hysteresis, rapid automated reinstall, storage abstraction, and automated physical‑to‑virtual migration creates an “on‑demand host” that is acted on only when a disaster occurs. Suggested future work includes integrating container‑based micro‑services, coupling with cloud‑native orchestration platforms, and applying AI‑driven failure prediction to make the HA framework more responsive.
