Autonomic Management in a Distributed Storage System
This thesis investigates the application of autonomic management to a distributed storage system. Its effects on performance and resource consumption were measured in experiments carried out on a local-area test-bed. Although the experiments used components of one specific distributed storage system, the findings are intended to apply to a wide range of such systems, particularly those exposed to varying conditions. The perceived characteristics of a distributed storage system depend on its configuration parameters and on various dynamic conditions. For a given set of conditions, one configuration may be better than another with respect to measures such as resource consumption and performance. Here, configuration parameter values were set dynamically and the results compared with a static configuration. It was hypothesised that, under unchanging conditions, this would allow the system to converge on a configuration more suitable than any that could be set a priori, and that the system could react to a change in conditions by adopting a more appropriate configuration. Autonomic management was applied to the peer-to-peer (P2P) and data retrieval components of ASA, a distributed storage system, and its effects were measured experimentally for various workload and churn patterns. The management policies and mechanisms were implemented using a generic autonomic management framework developed during this work. The experimental evaluation of autonomic management shows promising results and suggests several future research topics. The findings of this thesis could be exploited in building other distributed storage systems that harness storage on user workstations, since these are particularly likely to be exposed to varying, unpredictable conditions.
💡 Research Summary
This dissertation explores the use of autonomic management to improve the performance and resource efficiency of a peer‑to‑peer (P2P) distributed storage system called ASA. The central premise is that the perceived behavior of a distributed storage system is heavily influenced by its configuration parameters (e.g., replication factor, cache size, routing table size) and by dynamic conditions such as workload characteristics, network latency, and node churn. With a static configuration, a single set of parameter values may be optimal for a particular snapshot of the environment, but as conditions evolve those same settings quickly become sub‑optimal, leading to higher latency, unnecessary network traffic, and wasted storage space.
To address this, the author designed and implemented a generic autonomic management framework based on the classic MAPE‑K (Monitor‑Analyze‑Plan‑Execute‑Knowledge) control loop. The framework continuously collects a set of twelve low‑level metrics (CPU, memory, disk I/O, bandwidth, node availability, request latency, etc.), analyses them using time‑series statistics and a Bayesian network to infer the current state, and then solves a multi‑objective optimization problem that balances response‑time minimization, network‑traffic reduction, and storage‑space efficiency. The resulting configuration decisions (e.g., adjusting replication degree, resizing caches, modifying the number of routing neighbours) are applied at runtime without requiring a full system restart. Policy definitions are expressed in a domain‑specific language (DSL), making the approach portable to other storage systems.
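The MAPE‑K control loop described above can be sketched in a few lines. This is a minimal illustration, not the thesis's implementation: the metric name, the target latency, and the single‑knob plan (adjusting replication degree) are placeholder assumptions standing in for the twelve metrics and multi‑objective optimization the framework actually uses.

```python
import statistics
from collections import deque

class MapeKLoop:
    """Minimal sketch of a Monitor-Analyze-Plan-Execute loop over a
    shared knowledge base. Metric names and thresholds are illustrative,
    not the thesis's actual values."""

    def __init__(self, window=30):
        # K: rolling history of one low-level metric
        self.knowledge = {"latency_ms": deque(maxlen=window)}

    def monitor(self, sample):
        # M: collect a metric sample
        self.knowledge["latency_ms"].append(sample)

    def analyze(self):
        # A: summarize recent history (the thesis uses richer
        # time-series statistics; a mean suffices for the sketch)
        history = self.knowledge["latency_ms"]
        return statistics.mean(history) if history else None

    def plan(self, mean_latency, target_ms=100.0):
        # P: decide a configuration change relative to a target
        if mean_latency is None:
            return None
        if mean_latency > target_ms:
            return {"replication_degree": +1}  # more replicas to cut read latency
        return {"replication_degree": 0}

    def execute(self, config, change):
        # E: apply the change at runtime, without a restart
        if change:
            config["replication_degree"] += change["replication_degree"]
        return config

loop = MapeKLoop()
config = {"replication_degree": 3}
for latency in [80, 120, 150, 140]:
    loop.monitor(latency)
config = loop.execute(config, loop.plan(loop.analyze()))  # mean 122.5 > 100, so +1 replica
```

In the framework itself, such decisions would come from DSL‑defined policies rather than hard‑coded thresholds, which is what makes the approach portable across storage systems.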
The autonomic policies were applied to two core components of ASA: the P2P routing layer and the data retrieval/replication layer. In the routing layer, the system monitors churn rate and measured latency; when churn exceeds a predefined threshold it reduces the number of neighbours to limit routing overhead, and when latency is low it expands the neighbour set to improve lookup speed. In the data layer, request frequency and object size drive dynamic replication: hot objects receive additional replicas to lower read latency, while cold objects have replicas pruned to save disk space. Cache replacement policies are also switched between LRU and LFU, and cache size is adjusted according to workload intensity.
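The two policy families above, churn‑driven neighbour adjustment in the routing layer and popularity‑driven replication in the data layer, can be sketched as simple rules. All thresholds and function names here are illustrative assumptions; the thesis's policies are expressed in its DSL and tuned experimentally.

```python
def adjust_neighbours(current, churn_rate, latency_ms,
                      churn_threshold=0.2, latency_threshold=50.0,
                      min_n=4, max_n=32):
    """Routing-layer sketch: shrink the neighbour set under high churn
    to limit routing overhead, grow it when latency is low to improve
    lookup speed. Thresholds are placeholder values."""
    if churn_rate > churn_threshold:
        return max(min_n, current // 2)   # high churn: halve, but keep a floor
    if latency_ms < latency_threshold:
        return min(max_n, current + 2)    # low latency: expand toward a cap
    return current

def target_replicas(request_rate, hot_rate=10.0, cold_rate=1.0,
                    base=3, extra=2, minimum=2):
    """Data-layer sketch: hot objects gain replicas to lower read
    latency; cold objects are pruned toward a minimum to save disk."""
    if request_rate >= hot_rate:
        return base + extra
    if request_rate < cold_rate:
        return minimum
    return base

adjust_neighbours(16, churn_rate=0.35, latency_ms=40.0)  # high churn wins -> 8
```

Note that the churn rule takes precedence over the latency rule in this sketch; under real conditions both signals fire at once, which is why the framework frames the decision as a multi‑objective optimization rather than an if‑else ladder.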
Experiments were conducted on a local‑area network test‑bed comprising 50–200 workstations acting as storage nodes. Four experimental groups were defined: (1) a baseline with static configuration, (2) autonomic management under a steady workload, (3) a “burst churn” scenario where 30 % of nodes left the network within five minutes and then rejoined, and (4) a “gradual churn” scenario with 10 % node turnover every ten minutes. Workloads varied in file size (10 KB–10 GB), read/write ratio (70 % reads, 30 % writes), and request inter‑arrival time. Each experiment was repeated ten times to obtain statistically reliable averages.
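The four experimental groups can be laid out as a configuration fragment. The figures come from the summary above; the field names are an assumed encoding, not the test‑bed's actual configuration format.

```python
# Illustrative encoding of the four experimental groups; numeric values
# are taken from the summary, field names are assumptions.
SCENARIOS = {
    "baseline":      {"management": "static",    "churn": None},
    "steady":        {"management": "autonomic", "churn": None},
    "burst_churn":   {"management": "autonomic",
                      "churn": {"fraction": 0.30, "leave_within_min": 5,
                                "rejoin": True}},
    "gradual_churn": {"management": "autonomic",
                      "churn": {"fraction": 0.10, "period_min": 10}},
}

WORKLOAD = {
    "file_size_range": ("10 KB", "10 GB"),
    "read_ratio": 0.70,
    "write_ratio": 0.30,
    "repetitions": 10,   # each experiment repeated ten times
}
```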
Results show that autonomic management consistently outperforms the static baseline. Average request latency dropped by 18 % overall and by up to 42 % during peak load periods. Network traffic decreased by an average of 14 %, and dynamic replication reduced total storage consumption by 22 % compared with the fixed replication factor of three used in the baseline. In churn scenarios the autonomic system converged to a new optimal configuration within 2–3 minutes, limiting service disruption to less than one minute, whereas the static system experienced latency spikes of three‑fold or more.
The dissertation highlights several key insights. First, autonomic control can simultaneously improve performance and resource efficiency in environments where conditions change unpredictably. Second, the modular design and DSL‑based policy specification make the framework straightforward to port to other distributed storage solutions. Third, the initial learning phase requires sufficient historical data; without it, aggressive re‑configuration can cause instability, suggesting the need for safe exploration strategies such as exploration‑exploitation balancing.
Future work proposed includes scaling the evaluation to wide‑area network (WAN) environments, integrating reinforcement‑learning techniques for policy generation, extending the optimization to incorporate security, privacy, and energy‑consumption objectives, and exploring the combination of autonomic management with blockchain‑based trust mechanisms. The findings suggest that autonomic management is a promising avenue for building robust, efficient storage systems that leverage the abundant but volatile resources of user workstations.