Self-Organising management of Grid environments

This paper presents basic concepts, architectural principles and algorithms for efficient resource and security management in cluster computing environments and the Grid. The work presented in this paper is funded by BTExacT and the EPSRC project SO-GRM (GR/S21939).


Research Summary

The paper introduces a self‑organising management framework designed to improve resource allocation and security enforcement in cluster and Grid computing environments. Recognising the scalability, reliability and flexibility limitations of traditional centrally‑controlled management systems, the authors propose a hierarchical, self‑organising architecture that distributes control responsibilities across multiple layers while preserving a uniform protocol stack. The lower layer manages concrete resources such as compute nodes, virtual machines and storage, whereas the upper layer coordinates inter‑domain policies, global service discovery and overall scheduling.

At the core of the framework are two novel algorithms. The first is a distributed token‑based scheduling mechanism. Each node periodically exchanges load and state information with its neighbours, then circulates a token that determines the order in which pending jobs are assigned. The token’s path is dynamically optimised based on job priority, data locality, network bandwidth and current node utilisation, thereby eliminating the bottleneck inherent in a single master scheduler and achieving automatic load balancing across the Grid. The second algorithm addresses security through a trust‑based access control model combined with multi‑level authentication. Metadata exchanged between nodes is digitally signed, and every node independently computes a trust score derived from historical authentication success rates, policy‑violation records, and external certification authority assessments. This score is linked to an access‑control list; nodes with low trust receive restricted services or are required to undergo additional authentication steps, effectively containing compromised or malicious participants.
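The two mechanisms described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the summary gives no formulas, so the scoring function, the weights, the thresholds and every name here (`Neighbour`, `next_token_holder`, `trust_score`, `access_level`) are assumptions chosen only to make the ideas concrete.

```python
from dataclasses import dataclass


@dataclass
class Neighbour:
    name: str
    utilisation: float      # 0.0 (idle) to 1.0 (saturated), as reported by the neighbour
    bandwidth_mbps: float   # link bandwidth from the current token holder
    holds_job_data: bool    # True if the next job's input data is already on this node


def next_token_holder(neighbours, job_priority, max_bandwidth_mbps=1000.0):
    """Choose the neighbour that receives the token next.

    Assumed scoring: spare capacity weighted by job priority, plus a
    bonus for data locality and for a fast link. Highest score wins.
    """
    def score(n):
        headroom = 1.0 - n.utilisation
        locality = 1.0 if n.holds_job_data else 0.0
        bandwidth = min(n.bandwidth_mbps / max_bandwidth_mbps, 1.0)
        return job_priority * headroom + locality + bandwidth

    return max(neighbours, key=score)


def trust_score(auth_success_rate, violations, ca_assessment,
                weights=(0.5, 0.3, 0.2)):
    """Combine the three inputs named in the text into a score in [0, 1].

    auth_success_rate and ca_assessment are assumed to lie in [0, 1];
    violations is a non-negative count of recorded policy violations.
    The weights are illustrative, not taken from the paper.
    """
    violation_term = 1.0 / (1.0 + violations)  # more violations, lower term
    w_auth, w_viol, w_ca = weights
    return (w_auth * auth_success_rate
            + w_viol * violation_term
            + w_ca * ca_assessment)


def access_level(score, full=0.8, restricted=0.5):
    """Map a trust score to an access decision (thresholds are assumed)."""
    if score >= full:
        return "full"
    if score >= restricted:
        return "restricted"
    return "reauthenticate"


if __name__ == "__main__":
    peers = [
        Neighbour("busy", utilisation=0.9, bandwidth_mbps=100.0, holds_job_data=False),
        Neighbour("idle", utilisation=0.2, bandwidth_mbps=100.0, holds_job_data=False),
    ]
    print(next_token_holder(peers, job_priority=1.0).name)
    print(access_level(trust_score(0.4, violations=4, ca_assessment=0.2)))
```

Because each node evaluates these functions locally on information received from its neighbours, no single master has to see the whole Grid, which is the property the text attributes to the token scheme.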

The authors implemented the framework on a real‑world testbed; the work was funded by BTExacT and the EPSRC project SO‑GRM (GR/S21939). Simulations and live experiments show significant performance gains: average throughput improves by more than 30 % and mean response time drops to under 60 % of that observed with conventional central schedulers. Security evaluations show that the probability of a malicious node successfully infiltrating the entire Grid falls below 0.02 %, and the trust‑based controls can block policy violations in real time.

Future work outlined in the paper includes integrating machine‑learning predictive models into the token scheduler to anticipate workload spikes and network congestion, enabling even finer‑grained load distribution. Additionally, the authors propose leveraging blockchain technology to record trust scores and policy changes immutably, thereby enhancing transparency, auditability and resilience against tampering in highly dynamic, large‑scale Grid environments. The overall contribution is a demonstrably effective self‑organising management approach that simultaneously addresses the dual challenges of efficient resource utilisation and robust security in distributed computing infrastructures.