Technical support for Life Sciences communities on a production grid infrastructure
Production operation of large distributed computing infrastructures (DCI) still requires a lot of human intervention to reach acceptable quality of service. This may be achievable for scientific communities with solid IT support, but it remains a show-stopper for others. Some application execution environments are used to hide runtime technical issues from end users. But they mostly aim at fault-tolerance rather than incident resolution, and their operation still requires substantial manpower. A longer-term support activity is thus needed to ensure sustained quality of service for Virtual Organisations (VO). This paper describes how the biomed VO has addressed this challenge by setting up a technical support team. Its organisation, tooling, daily tasks, and procedures are described. Results are shown in terms of resource usage by end users, number of reported incidents, and developed software tools. Based on our experience, we suggest ways to measure the impact of the technical support, as well as perspectives to decrease its human cost and make it more community-specific.
💡 Research Summary
Large-scale distributed computing infrastructures (DCIs) still require significant human intervention to achieve acceptable quality of service. This can be managed for scientific communities with robust IT support, but it remains a major obstacle for others. Some application execution environments are designed to hide runtime technical issues from end users; however, these primarily aim at fault-tolerance rather than incident resolution and still necessitate substantial manpower.
To ensure sustained quality of service for Virtual Organizations (VOs), long-term support activities are necessary. This paper details how the biomed VO has addressed this challenge by establishing a dedicated technical support team. The organization, tools, daily tasks, and procedures used by this team are described in detail. Results are presented through metrics such as resource usage by end users, the volume of reported incidents, and the development of software tools.
The paper highlights that effective long-term support requires not only addressing immediate issues but also implementing preventive measures to maintain system stability. The technical support team relies on dedicated tools and procedures to improve infrastructure reliability and user experience. The effect of these activities is reported in terms of resource usage by end users, the number of reported incidents, and the software tools developed.
Moreover, the paper emphasizes the need for community-specific customization in technical support operations. This ensures that the diverse requirements of scientists using the infrastructure are met, allowing them to focus on their research without being hindered by technical issues.
Finally, drawing on their experience, the authors suggest ways to measure the impact of technical support and outline perspectives for reducing its human cost and making it more community-specific. These recommendations aim at more efficient infrastructure management and a better user experience.