Inter-organizational fault management: Functional and organizational core aspects of management architectures

Patricia Marcu, Wolfgang Hommel
Leibniz Supercomputing Centre, Boltzmannstr. 1, 85748 Garching, Germany
{marcu, hommel}@lrz.de
ABSTRACT
Outsourcing, successful and sometimes painful, has become one of the hottest topics in IT service management discussions over the past decade. IT services are outsourced to external service providers in order to reduce the effort and overhead of delivering these services within one's own organization. More recently, IT service providers themselves have also started either to outsource service parts or to deliver those services in non-hierarchical cooperation with other providers. Splitting a service into several service parts is a non-trivial task, as the parts have to be implemented, operated, and maintained by different providers. One key aspect of such inter-organizational cooperation is fault management, because it is crucial to locate and solve problems that reduce the quality of service quickly and reliably. In this article we present the results of a thorough use-case-based requirements analysis for an architecture for inter-organizational fault management (ioFMA). Furthermore, a concept of the organizational and functional models of the ioFMA is given.
KEYWORDS
Inter-organizational Fault Management; IT-Service Delivery Diversity; Management Architecture
1. INTRODUCTION
Providing IT services in an inter-organizational manner is a complex and often error-prone task. Managing IT services is often characterized by applying the classic FCAPS partitioning: fault, configuration, accounting, performance, and security management. In this article, we focus on the technical functionality as well as the organizational aspects of fault management in the context of inter-organizationally operated IT services.
Our work is primarily motivated by the interaction of the following three challenges:
The outsourcing problem: One characteristic of the last decade is that many organizations have outsourced their IT services to external parties, either entirely (e.g., email, file storage, and web servers) or just partially. Consequently, many processes and workflows have been transferred to and restructured by these external service providers. Outsourcing is performed in order to reduce the organization's IT costs, but also to facilitate good technical support. Related to these goals, ITIL v3 (see [1]) describes the migration from the value-chain model (also known as the hierarchical service delivery model) to the value-network model, which contains horizontal (non-hierarchical) relationships between the involved providers. Within this scope, different sourcing strategies are defined.
The problem of heterogeneity and autonomy in multi-domain environments: From an organizational point of view, IT service providers collaborate with each other in very diverse ways. This makes it difficult to specify a single, universal methodology for effective and efficient inter-organizational fault management (ioFM). The common denominator of the organizational models found in practice is their heterogeneity and autonomy, especially concerning the deployed IT systems and management tools. We therefore have to face the challenge of specifying fault management concepts that can be deployed in cross-organizational or multi-domain environments and that deal with these characteristics.
The service delivery diversity: Regarding the service delivery process as a productive process, large differences between real organizations can be observed concerning process control, communication, and many other IT service management (ITSM) aspects. To address this, very useful reference processes exist.
Figure 1. Propagation of faults in inter-organizational environments
Reference processes for fault management have been described, for example, in [1] and [2] for hierarchical service delivery, and in [3] for heterarchical (i.e., non-hierarchical) service delivery. Based on these reference processes, other related work, and real-world scenarios, we have extracted the requirements for an ioFM architecture as presented in this paper.
The problems stated above are those characteristics of inter-organizational IT environments that are most relevant for ioFM. A simple example is given in Figure 1: A provider delivers its services to customers A, B, and C in different ways. It has outsourced three of the services (labeled Service 1, 2, and 3, respectively) to other service providers. Up to this point we deal with a vertical service chain, which represents the classical type of hierarchical service delivery. Services 1 and 2 are each delivered by only one service provider (Providers 1 and 2 are the subcontractors of the service provider). In contrast to these two services, Service 3 is provided by multiple cooperating service providers (Providers 3, 4, and 5). Each of these three providers is required to deliver its part of the service, but none of them has a superior role; instead, they are on a par with each other. These service providers coexist on the same "service layer" regarding the service functionality. They deliver "service parts" (as discussed in [4]; Service Part 1, 2, and 3, respectively), which together lead to the delivery of a single horizontal service. These service parts are concatenated within the same service layer, so the horizontal service chain represents a heterarchical service delivery. Each real-world organization usually aligns itself with its own requirements, workflows, and processes. It also uses different IT infrastructures, systems, and tools.
As a consequence, each organization we deal with needs to be analyzed first, and there is typically a lack of tool interoperability whenever multiple service providers are about to be coupled in order to jointly provide an IT service. In this context, management tool support is of utmost importance, because the complexity of the IT infrastructure as well as of each service increases with the number of involved providers. Taking into account the challenges stated above, the scenario described here is clearly a heterogeneous one.
The following issue is important here: A fault, e.g., within Provider 4's domain, will, independent of its root cause, make the whole of Service 3 fail because of this issue within Service Part 2. This fault will be propagated to the service provider, and thus the customers will face a quality-degraded or unavailable service. The fault can have more or fewer consequences depending on the service customization for each individual customer. In such inter-organizational scenarios it is nevertheless very difficult to precisely locate such a fault, to correlate it with other unsolved faults, and to track and steer the progress of its handling and correction. Several approaches and best practices for fault management already exist for a single IT service provider's infrastructure, but regarding ioFM there is a lack of both research and best practices. Our work faces the additional practical challenge that IT service providers from different countries are involved, which in turn increases the technical complexity as well as the organizational and legal constraints, resulting in even more complex delivery processes. Regarding outsourcing as well as multi-domain IT service delivery from a process-oriented point of view, a well-defined and proper ioFM is needed on the system layer. In order to meet this demand, our work focuses on an ioFM Architecture (ioFMA).
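The propagation effect described for Figure 1 can be illustrated with a minimal Python sketch. It is not part of the ioFMA itself: the class Service, the functions failed_services and affected_customers, and in particular the mapping of customers A, B, and C to services are hypothetical, since the scenario does not state which customer uses which service. The sketch assumes that a service fails as soon as any one of its providers is faulty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    providers: tuple  # every listed provider must deliver its part


# Topology as described for Figure 1.
SERVICES = [
    Service("Service 1", ("Provider 1",)),
    Service("Service 2", ("Provider 2",)),
    Service("Service 3", ("Provider 3", "Provider 4", "Provider 5")),
]

# Hypothetical customization: which customer uses which service.
SUBSCRIPTIONS = {
    "Customer A": ["Service 1", "Service 3"],
    "Customer B": ["Service 2"],
    "Customer C": ["Service 3"],
}


def failed_services(services, faulty_provider):
    """Return the names of all services that depend on the faulty provider."""
    return [s.name for s in services if faulty_provider in s.providers]


def affected_customers(subscriptions, failed):
    """Return the customers that use at least one failed service."""
    failed_set = set(failed)
    return sorted(
        customer
        for customer, used in subscriptions.items()
        if failed_set.intersection(used)
    )


failed = failed_services(SERVICES, "Provider 4")
print(failed)
print(affected_customers(SUBSCRIPTIONS, failed))
```

With these assumed subscriptions, a fault at Provider 4 takes down Service 3 and therefore reaches Customers A and C, while Services 1 and 2 remain unaffected.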
This article presents our methodology and the results of our ioFMA requirements analysis. It is structured as follows: In Section 2 we sum up the related work that has influenced our methodology and ioFMA design. In Section 3, we present details about our design rationale and the MDA-based approach that has been taken. Section 4 outlines the inter-organizational scenarios we have analyzed. Section 5 specifies the roles and actors relevant to ioFM, on which the organizational model is based; building on these, we then present the identified use cases and the derived requirements. Section 6 gives an overview of the functional model of the ioFMA. A summary and an outlook on our future work conclude this paper in Section 7.
2. RELATED WORK
2.1. Management architectures and their submodels
In Hegering et al. [5] the building blocks of management architectures (MA) are described. The primary goal of each management architecture is to establish an integrated management approach by providing a valid system management framework instead of using several management tools independently of each other. The MA is composed of four complementary submodels: the information model (IM), the organizational model (OM), the communication model (CM), and the functional model (FM). The IM represents the description and modeling of the managed objects (the management-relevant information to be exchanged). The OM describes the roles as well as the responsibilities and specifies the communication patterns within the MA. The CM specifies the communication procedures for the exchange of management information. The FM splits the management task into several components and provides dedicated management functionalities: fault management, configuration management, accounting management, performance management, and security management (also known as FCAPS).
The MA concept along with its submodels is very valuable for this work, because it is the basis for holistic integrated network management. Thus our work will be aligned to the four submodels of such an MA. They have to be extended to take inter-organizational conditions into account, which have not yet been considered by previous MA variations. The functional area of fault management (FM) will also be taken into account and refined into additional ioFM functionalities that are tailored to inter-organizational environments.
2.2. IT Service Management
ITSM frameworks such as ITIL v3 [1], ISO/IEC 20000 [6], and eTOM [2] have been established to design management processes that follow the continual improvement strategy of Deming's plan-do-check-act life cycle. These ITSM frameworks have been used primarily for process definition in hierarchical service delivery scenarios. For non-hierarchical service delivery, a new concept has been developed in [3]. These approaches give guidelines for the inter-organizational service delivery processes as a whole. Nevertheless, on the (technical) system layer no underlying concept for inter-organizational service delivery has been defined yet. Our work focuses on refining the given reference processes and designing an integrated system-level MA.
2.3. Service Composition
As we take into account services delivered in an inter-organizational environment, the concept of service composition is a key enabler for our research. In [7], Dreo distinguishes between two types of supply chains: vertical and horizontal. The vertical supply chain denotes the well-known hierarchical service delivery. The horizontal supply chain addresses the issue of peering. Despite the partially overlapping scope of these results and our work, the non-hierarchical service delivery taken into account by our research does not only cover peering.
The underlying necessity has also been postulated by Hedlund [8], whose work uses the term heterarchy for the non-hierarchical organizational forms, which we also address.
In their work [9] on service composition applied to network management, Vianna et al. show that service composition can indeed be realized by using traditional management technologies. The application of technologies created to support service composition will bring important advantages to the network management discipline. However, they consider only services based on a hierarchical chain of compositions.
Klie et al. analyze automatic web service composition as a possibility to further automate network management in [10]. They compare several web service composition technologies in order to describe an approach using a composition engine for network management. This automatic web service composition can be used to simplify complex network management tasks. It also enables automatic composition to cover large parts of several network management tasks; this approach is valuable as a guideline for the implementation of the ioFMA.
2.4. Fault Management related Tasks
In [11] a framework for problem determination is proposed. It is based on the monitoring of event streams that are generated by the different components of an IT service. A generic representation of a problem through spatial-temporal patterns is given. Additionally, efficient algorithms are described that support the building blocks of a hierarchical heuristic for detecting generic patterns. Even though some of these concepts are distantly related to our approach, their work is based solely on hierarchical service structures. The automation of incident management is also proposed in [12]. In our former work [13] we specified a methodology for handling faults in non-hierarchical service delivery environments, which we called Service Provider Coalitions.
This approach's goal was the correlation of fault reports generated by different incident ticketing systems in multi-enterprise environments. We now propose to realize the fault management on a higher level of abstraction.
3. DESIGN RATIONALE
This section describes the methodology used in designing the architecture, several of the design decisions taken, and the consequences for the ioFMA.
3.1. Model Driven Architecture
Our design of the management architecture follows the Model Driven Architecture (MDA) [14] approach. Its iteratively refining character is outlined in Figure 2. MDA contains three models:
1) The computation independent model (CIM) provides a general view of the system as well as of the environment in which this system will be deployed.
2) The platform independent model (PIM) provides a view of the system independently of the platform that it will be deployed on. Consequently, this model is still generic and can be applied to several platforms of a similar type.
3) The platform specific model (PSM) takes the specification from the PIM and describes its application to a specific platform.
Figure 2. Our methodology, which is based on the MDA approach (MDA models CIM, PIM, and PSM alongside the process, architecture, and system views)
As a result, the three models build upon each other and descend from a higher level of abstraction (CIM) to a lower one (PSM). The design of our ioFMA is done in analogy to MDA. In our design process, the requirements elicitation and its model design correspond to MDA's CIM view. The description of the scenarios (one hierarchical and one heterarchical scenario) and their generalization are part of the requirements analysis.
From the resulting general scenario we derive use cases and several implicit requirements on the ioFMA. A three-tier procedure for the model design is used:
1) The process view corresponds to the CIM and contains reference processes regarding incident management (for hierarchy we used [1] and for heterarchy we used [3]).
2) The architecture view corresponds to the PIM and contains the ioFMA as well as its submodels, which correspond to the described processes in the upper layer.
3) The system view, which represents the PSM in our approach, contains the implementation of the overlaid architecture on any specific platform on the system layer.
Furthermore, the design methodology of our ioFMA is split into two parts: the requirements analysis and the model design.
3.2. Methodology of requirements analysis
In order to elicit ioFMA requirements, we have analyzed two real-world scenarios: the IntegraTUM scenario as a representative example of hierarchical inter-organizational service delivery, and the GÉANT scenario representing heterarchical service delivery in inter-organizational environments. Based on these practical scenarios, we derived a more abstract generic scenario and its use cases. The textual description of the use cases has been written with a focus on management architectures (cf. Section 2) and their submodels. Functional and non-functional requirements have then been derived from these use cases.
3.3. Methodology of model design
Based on the requirements and on the reference process for incident management (cf. Section 2), the submodels of our ioFMA are specified in the following order:
1) The functional model, which has to highlight the most important functionalities concerning fault management, comes first.
2) The organizational model follows and reveals the roles and responsibilities in inter-organizational environments that are required in order to conduct efficient ioFM.
3) The communication model then delivers the required measures and procedures for the exchange of information.
4) The information model finally specifies the data format for the ioFM information exchange and processing.
In the next step, the ioFMA will be transformed to a PIM; then it will be instantiated for hierarchical, heterarchical, and mixed forms of service delivery. All of them will be mapped onto PSMs. In the next section, we present details about the first step in this methodology, i.e., the requirements analysis.
4. SCENARIOS FOR INTER-ORGANIZATIONAL FAULT MANAGEMENT
In order to design and implement an ioFMA, we have chosen the following two scenarios, one for each inter-organizational service delivery model: hierarchy (IntegraTUM) and heterarchy (GÉANT).
4.1. IntegraTUM
In the IntegraTUM project [15], which has been funded by the German Research Foundation (DFG) and initiated by the Technische Universität München (TUM), several university IT services that were previously operated by the various TUM institutions themselves (e.g., library, administration, and faculties) have been reorganized and recentralized at the Leibniz Supercomputing Centre (LRZ). TUM's staff and students are automatically granted access to all relevant services, such as the university web portal, learning management system, and computer labs, based on an identity management process that is coupled with the student enrolment process and the human resources (HR) management software. Thus, TUM is LRZ's customer and the scenario fulfills the criteria of the hierarchical inter-organizational service delivery model as outlined above. A fault management process has been established between the two organizations in this hierarchy and is described in detail in [16].
4.2.
GÉANT
The End-to-End (E2E) Link service in the GÉANT2 multi-national network [17] is an example of services delivered by a heterarchical service provider organization. Co-funded by the European Commission as well as Europe's national research and education networks (NRENs), and managed by DANTE, the GÉANT network connects 34 countries via 30 NRENs. On the technical layer, multiple 10 Gbps wavelengths are used to set up dedicated E2E Links. One representative customer is the Large Hadron Collider (LHC) project at CERN in Switzerland. It is expected that its recently started experiments will produce 15 petabytes of scientific data each year. In order to meet the bandwidth and quality of service requirements of large-scale research projects, dedicated optical E2E Links must be set up. These links span multiple countries and allow the unrestricted utilization of the physically possible bandwidth. E2E Links connect organizations located in different countries and cross the networks of different providers. When providing the E2E Link services, each provider (member of the service provider coalition) has to collaborate with the other providers with respect to setup, maintenance, and management. Major challenges in the realization of these services are the heterogeneity of the technical implementations, the software tools used, various people-related issues, and many more. In [3], Hamm introduced a reference incident management process for E2E Links.
5. USE CASES AND REQUIREMENTS ELICITATION
Both of the scenarios outlined above provide plenty of use cases for the elicitation of ioFMA requirements, although fault management obviously is only one of many aspects that need to be addressed in such complex service provider constellations.
One of the characteristics common to both scenarios is that the service providers involved in the delivery process communicate and cooperate with each other in a kind of "provider network". To better address such specifics, we first define the roles for ioFM in the next section. They have been generalized based on the roles and responsibilities we found in the real-world scenarios.
5.1. Defining roles for inter-organizational fault management
One of the most important roles in ioFM is the user. This is the role that typically initiates the fault management process by means of fault notifications that are stored in trouble ticket systems (TTS). In inter-organizational environments this role can be assigned to a service provider that is using a certain service as a user, e.g., due to outsourcing.
Service Provider (SP) is the role that is responsible for the delivery of a service and for the fulfillment of the Service Level Agreements (SLAs) agreed with its users. These SPs are also essential to the ioFM, as they constitute the provider network and deliver IT services in a cooperative manner.
Within the different service provider domains, there is always a role that is responsible for the local fault management. We call this role the Domain Fault Manager (DFM). The DFM does not only communicate within its domain, but also with the DFMs of other domains. On the local level a Domain Fault Operator (DFO) is also required in order to isolate, correct, and log a fault within its own domain. Even though both are intra-organizational roles, the DFO has a purely operational role, whereas the DFM primarily has coordinating responsibilities.
In ioFM, the so-called Global Fault Coordination Manager (GFCM) has the overall coordination role: It addresses all the domains that are involved in the service delivery process.
The GFCM's main tasks include monitoring confirmed and potential faults, forwarding fault-related information between the different domains, and facilitating inter-domain communication. In the hierarchical case the role of the GFCM is identical to that of the DFM for obvious reasons. However, in a heterarchy, the role of GFCM will be assigned temporarily to each of the domains in an on-demand manner.
Last but not least, the Domain Monitoring System (DMS) is responsible for system and component monitoring within a domain. This role announces fault notifications or alarms about malfunctions of the system.
Using these roles, the use cases are described in the following section. The important roles defined here are the basis for the organizational model of the ioFMA.
5.2. Identifying use cases
Above, we described and analyzed the two real-world scenarios in order to elicit the use cases needed for the requirements analysis. From them, we have identified the following classes of use cases: fault localization, fault resolution progress management, monitoring, reporting, and handling false positives. These also represent the core functionalities that an ioFMA should offer.
Figure 3. Use cases for fault localization and monitoring: (a) use cases for fault localization, (b) use cases for fault monitoring
5.2.1. Fault Localization
The main functionality of the ioFMA has to be the precise localization of faults. Depending on the place where the fault is localized, there can be multiple variations, as shown in Figure 3(a). The fault localization within one's own domain (L01) is initiated by the user or by the DMS. If a known fault occurs, it is localized by the DFM; otherwise, i.e., if it is an unknown fault, localization is the DFM's task with the support of the DFO. If the fault cannot be isolated within this domain, the issue will be forwarded to another domain. The fault localization in an undefined domain (L02) will therefore be initiated.
The DFM reports the fault to the GFCM, which forwards it to all DFMs involved in the service delivery. In collaboration with the DFOs, the fault will, in the best case, be found in one of the domains and reported back to the GFCM. However, if the fault cannot be isolated in this way, an escalation procedure has to be initiated. A variant of this use case is fault localization within a specific domain (L03); here, the GFCM has to forward the fault only to a certain (known) domain and not to all involved partners.
5.2.2. Fault Resolution Progress Management
A status display informs about the progress of the fault resolution or the progress of the maintenance work. The progress of the fault resolution (P01) is initiated by a DFM that wants to know the progress of the fault resolution within its own or any other involved domain. It can also be initiated by the GFCM in order to get an overview of the whole inter-organizational network with respect to the fault resolution process instances. Consequently, the DFM and/or GFCM query the DFMs regarding the progress of the fault resolution in their respective domains. The DFMs will retrieve this information from their DFOs and give feedback to the DFM or GFCM from which the query originates. For the progress of the maintenance work (P02) the same steps are run through, but with a different scope. The case in which a user wishes to be informed about the status of the fault resolution and/or maintenance is a secondary scenario within this use case, which results in a query forwarded by the DFM or GFCM.
5.2.3. Monitoring
In both the hierarchical and the heterarchical case, monitoring is a very important feature that the ioFMA should have. Continuous monitoring enables faster fault localization.
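Before turning to the individual monitoring variants, the localization use cases L01 to L03 from Section 5.2.1 can be summarized in a minimal Python sketch. It only illustrates the order of the steps under simplifying assumptions and is not the ioFMA design: the classes Domain and GFCM, the function handle_fault, and the set of locally isolatable fault identifiers are hypothetical, the cooperation of DFM and DFO inside a domain is collapsed into a single lookup, and the GFCM is assumed to skip the reporting domain when forwarding.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    name: str
    # Hypothetical test data: faults this domain's DFM/DFO can isolate locally.
    local_faults: set = field(default_factory=set)

    def localize(self, fault_id):
        """L01: the DFM (with DFO support) tries to isolate the fault locally."""
        return fault_id in self.local_faults


class GFCM:
    """Forwards a fault to other domains on behalf of the reporting DFM."""

    def __init__(self, domains):
        self.domains = {d.name: d for d in domains}

    def localize(self, fault_id, reporter, target=None):
        if target is not None:
            # L03: forward only to a certain (known) domain.
            candidates = [self.domains[target]]
        else:
            # L02: forward to all involved domains; the reporter already failed.
            candidates = [d for d in self.domains.values() if d.name != reporter]
        for domain in candidates:
            if domain.localize(fault_id):
                return domain.name
        return None  # not isolated: an escalation procedure is needed


def handle_fault(fault_id, origin, gfcm, target=None):
    """Return (use case or 'escalate', name of the domain that found the fault)."""
    if origin.localize(fault_id):
        return ("L01", origin.name)
    found = gfcm.localize(fault_id, reporter=origin.name, target=target)
    if found is None:
        return ("escalate", None)
    return ("L03" if target else "L02", found)


p3 = Domain("Provider 3", {"f-local"})
p4 = Domain("Provider 4", {"f-remote"})
p5 = Domain("Provider 5")
coordinator = GFCM([p3, p4, p5])
print(handle_fault("f-remote", p3, coordinator))
```

In this example Provider 3 cannot isolate the fault itself, so the GFCM forwards it and Provider 4 reports it back, which corresponds to use case L02.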
We distinguish between domain monitoring, overall monitoring, and service monitoring (see Figure 3(b)). The domain monitoring (M01) is responsible for the fault monitoring within a domain. It can be initiated by the user, DFM, or GFCM. They query the DFM of a certain domain about the general status of the faults within this domain. The result is retrieved from the DMS, which is always up to date concerning alarms and fault notifications. One exception that needs to be dealt with is when the user or DFM does not have the necessary access rights to fetch monitoring information about another domain. Overall monitoring (M02) is responsible for the monitoring of the whole provider network. It can be initiated by the GFCM or by any other DFM that has sufficient access rights. This results in querying the DFMs of all domains about their monitoring status. If all of the domains reply with a valid status, then overall monitoring is enabled; otherwise only partial monitoring of the provider network can be established. Since several providers, but not all of those within the provider network, are involved in the delivery of a certain service, service monitoring (M03) covers only these involved domains. This is a special case of the former one, as it monitors only a well-defined subset of the provider network.
Figure 4. Use cases for fault resolution progress management and false positives: (a) use cases for fault resolution progress management (DFM, DFO, GFCM: show status of fault processing, show status of maintenance work), (b) use cases for false positives (DFM, DFO: localize and remove false positive fault information)
5.2.4. Reporting
Reports support different processes, such as fault management.
They give an overview of current measurements, metrics, accounting data, and Quality of Service (QoS) parameters, but also of information based on historical data, e.g., in order to facilitate a trend analysis. First, the realization of statistical plots and accounting data reports (R01) is specified. This is usually initiated by the GFCM, which retrieves all this data from all DFMs in the provider network. In the best case all the domains send the requested information, so that a report and statistical plots covering all the involved domains can be produced. If some domains do not respond to the information request, incomplete statistical plots and/or accounting data will be shown. The QoS parameters (R02) are retrieved in order to check the fulfillment of the agreed SLAs and to evaluate the consequences of different faults that have occurred in the past. Based on historical information, trend analysis (R03) can be done by predicting the susceptibility of the system to specific faults with various consequences according to various statistical models. Potential future faults could therefore be resolved or bypassed before they actually occur.
Table 1. Coverage of the scenarios by the use cases (columns: IntegraTUM (hierarchy), GÉANT (heterarchy), Generalized (mixed))
L01: ✓ ✓ ✓
L02: ✓ ✓
L03: ✓ ✓
P01: ✓ ✓ ✓
P02: ✓ ✓ ✓
M01: ✓ ✓ ✓
M02: ✓ ✓
M03: ✓ ✓ ✓
R01: ✓ ✓ ✓
R02: ✓ ✓ ✓
R03: ✓ ✓ ✓
F01: ✓ ✓
F02: ✓ ✓
5.2.5. False-positives
In order to be assured that information concerning faults is valid, false positives (i.e., wrongly announced faults) have to be identified and removed. This use case is very important, as in many cases the search for non-existing faults impedes the normal functionality of an IT service. The localization of false positives (F01) is initiated by the GFCM or by one of the DFMs.
If a potentially false fault notification is given that cannot be mapped onto the behavior of the system, the GFCM or DFM queries the responsible DFM about this issue. The DFM has to consult the DFO and figure out whether this fault really is a false positive. The result will be reported back to the GFCM. The removal of false positives (F02) requires that the fault has first been reliably identified as such. Thus, the DFO identifies the non-existing fault and removes the false positive (manually or tool-supported) from the monitoring system. This action is then reported to the DFM.
5.3. Deriving requirements
Table 1 summarizes the occurrence of the use cases in the scenarios; these use cases serve as requirements for the functional model of the ioFMA. In addition, the following two requirements have to be considered:
• FM-01: In order to increase the legibility of the fault information, a visual presentation is necessary.
• FM-02: Especially regarding the use cases for fault resolution progress management and the removal of false positives, the possibility to change or remove fault data has to be given.
In order to support the realization of the use cases described above, some requirements on the submodels of the ioFMA have to be fulfilled. We identified the following requirements regarding the information model of the ioFMA:
• IM-01: A common data format for fault information is needed in order to facilitate inter-domain data exchange and communication. This should consist of a set of common attributes or properties.
• IM-02: An additional or coexisting requirement to the first one is the existence of conversion methods between the data formats of the different domains.
• IM-03: Interfaces across different domains have to be defined.
• IM-04: The ioFMA has to support all the life cycle phases of a fault resolution process (detection, isolation, repairing/recovery, and forecast/prevention).
• IM-05: The use of standard metrics also supports monitoring and reporting. An example of such a set of standard metrics is the IP Performance Metrics (IPPM) [18], e.g., One-Way Delay (OWD) [19], IP Delay Variation [20], Packet Loss [21], and others.
• IM-06: As the correlation between the metrics of different domains has to be provided, a suitable aggregation function has to be defined.

Phases of the fault life cycle: Detection, Isolation, Repairing, Forecast/Prevention

Requirement   Phases covered
for the IM
IM-01         2 of 4
IM-02         2 of 4
IM-03         3 of 4
IM-04         4 of 4
IM-05         2 of 4
IM-06         2 of 4
for the OM
OM-01         4 of 4
OM-02         4 of 4
for the FM
FM-L01        2 of 4
FM-L02        2 of 4
FM-L03        2 of 4
FM-P01        1 of 4
FM-P02        1 of 4
FM-M01        4 of 4
FM-M02        4 of 4
FM-M03        4 of 4
FM-R01        2 of 4
FM-R02        2 of 4
FM-R03        1 of 4
FM-F01        2 of 4
FM-F02        2 of 4
FM-01         3 of 4
FM-02         3 of 4
for the KM
KM-01         4 of 4
KM-02         4 of 4
KM-03         4 of 4
NFA
NF-01         2 of 4
NF-02         2 of 4
NF-03         2 of 4
NF-04         4 of 4
NF-05         2 of 4
NF-06         2 of 4
NF-07         3 of 4
NF-08         2 of 4

TABLE 2. Coverage of the phases of the fault life cycle by the requirements

Furthermore, requirements regarding the organizational model of the ioFMA must be considered:

• OM-01: The inter-organizational service delivery models have to be supported.
• OM-02: Roles and responsibilities have to be defined according to the use cases described above.

In addition, the following requirements regarding the communication model of the ioFMA must be kept in mind:

• KM-01: Communication mechanisms, such as pull or push models, have to be supported by the ioFMA.
• KM-02: Inter-domain communication is a very important requirement, as the ioFMA will be deployed in an inter-organizational environment. Different networks with heterogeneous technologies exchange different data with each other.
  The inter-domain communication is also important because, in the absence of a central unit for coordination and communication between different networks, at least a minimal set of information has to be exchanged.
• KM-03: In order to support the data exchange between different networks, a communication protocol has to be defined. The complexity of the inter-organizational environment, with its different providers, networks, and protocols, is the challenge we are facing here.

Finally, we argue that the functional requirements regarding the sub-models of the ioFMA must be complemented by the following non-functional requirements:

• NF-01: An access control mechanism has to be part of the ioFMA.
• NF-02: Protection against data loss and deliberate data alteration, i.e., data integrity, has to be provided at all times, especially in fault localization, reporting, and false-positive management.
• NF-03: The data in the ioFMA has to be kept up to date.
• NF-04: Especially fault localization, monitoring, and false-positive management require tools that scale well in order to provide the discussed functionality.
• NF-05: The functionalities named above have to be realized with adequate performance.
• NF-06: As many functionalities as possible have to be automated in order to speed up the fault resolution process.
• NF-07: A common database is needed for all the providers involved in the inter-organizational fault management process.
• NF-08: Last but not least, all processes and functionalities have to be properly documented.

As we take the whole fault resolution process into account, the requirements have to be related to all relevant life cycle phases. Table 2 shows which requirements have to be fulfilled in the different phases of the fault life cycle (detection, isolation, repairing/recovery, and forecast/prevention).

6. CORE ASPECTS OF THE FUNCTIONAL MODEL

This section addresses the functional model of the ioFMA. Its design is based on the use cases described in Section 5.2. As stated in [5], the functional model contains the functional areas which integrate all the required functionalities of a management architecture. For the ioFMA, we elicited three functional areas related to the organizational domain in which it is deployed:

• Provider management: the part of the ioFMA concerned with local “arrangements” and with integrating them with intra-organizational fault management.
• Inter-organizational management: the core part of the functional model of the ioFMA, as it contains all inter-organizational aspects.
• Customer management: placed on a more abstract level above the two former functional areas, as it is connected to both of them and is the enabler of the provider and the inter-organizational management, respectively.

6.1. Provider Management

Within the service provider domain, different management functions have to be implemented in order to support the inter-organizational fault management. These management functions rely on the described use cases. Fault localization within one’s own domain is the first management function that has to be realized in a domain as part of an ioFMA. The progress management for fault resolution as well as for maintenance work has to be performed within the service provider domain and connected to the inter-organizational management. Finding and removing false positives as well as performing data changes (under the strict control of the inter-organizational management) also have to be implemented within the domain.

6.2. Inter-organizational Management

As the core of the functional model, the inter-organizational management has to coordinate, integrate, and combine information and functions from the different involved service provider domains. The management functions that the inter-organizational management comprises are: fault localization in an unspecified domain and within a specific domain, progress management for fault resolution and for maintenance work, overall monitoring and service monitoring, creation of statistical plots and accounting data reports, representation of QoS parameters, realization of trend analysis, and detection and removal of false-positive fault reports. These are mainly the use cases defined previously. In addition, a very important management function, data change, has to be added. It has to be allowed, but only under the control of the inter-organizational management.

Figure 5. Overview of the functional model of the ioFMA

6.3. Customer Management

The customer management is the key enabler for both the provider management and the inter-organizational management. It contains all the management functions listed above, but has additional functionality. For example, from the customer’s perspective, the opening and updating of fault reports has to be supported. It serves as both a trigger and a feedback channel and is an essential core component of IT service management architectures.

7. CONCLUSIONS AND FUTURE WORK

In this article, we presented a full requirements analysis for the design of an inter-organizational fault management architecture. We also discussed the core aspects of the functional and organizational models based on the elicited use cases and requirements. The next steps in our research are to complete the architecture with a communication model and an information model.
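To make the information-model requirements more concrete, the following is a minimal sketch, not part of the ioFMA itself, of what a common fault record (IM-01), a per-domain conversion method (IM-02), and an aggregation function (IM-06) could look like. All class, field, and function names as well as the example domain layout are hypothetical. The aggregation formulas additionally assume that the domains are traversed one after another and that packet loss is independent in each domain.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Mapping, Sequence


class Phase(Enum):
    """Fault life cycle phases listed in requirement IM-04."""
    DETECTION = "detection"
    ISOLATION = "isolation"
    REPAIR = "repairing/recovery"
    PREVENTION = "forecast/prevention"


@dataclass(frozen=True)
class FaultRecord:
    """Hypothetical common fault format (IM-01); every field name is an assumption."""
    fault_id: str
    domain: str
    phase: Phase
    description: str
    false_positive: bool = False


def from_example_domain(raw: Mapping[str, str]) -> FaultRecord:
    """Hypothetical conversion method (IM-02) for one domain's local ticket layout."""
    return FaultRecord(
        fault_id=f"{raw['dom']}-{raw['ticket']}",
        domain=raw["dom"],
        phase=Phase(raw["state"]),  # raises ValueError for an unknown phase name
        description=raw["text"],
    )


def path_one_way_delay_ms(per_domain_delay_ms: Sequence[float]) -> float:
    """Aggregate delay (IM-06), assuming the domains are traversed in sequence."""
    return sum(per_domain_delay_ms)


def path_packet_loss(per_domain_loss: Sequence[float]) -> float:
    """Aggregate loss ratio (IM-06), assuming independent loss in each domain."""
    delivered = 1.0
    for loss in per_domain_loss:
        delivered *= 1.0 - loss
    return 1.0 - delivered


# Example: convert one domain-local ticket into the common format.
record = from_example_domain(
    {"dom": "domain-a", "ticket": "42", "state": "isolation", "text": "link down"}
)
```

In such a design, each domain would keep its local format and only supply a converter, so that the common record stays small; whether the real information model follows this split is left open here.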
After that, we will deliver a full model of the ioFMA on the PIM layer as well as its transformation to the system layer. Our implementation will be customized for the LHC optical private network (LHCOPN), which is operated by the European GÉANT network.

ACKNOWLEDGMENTS

The authors would like to thank their colleagues at the Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities (see http://www.lrz.de/) for helpful discussions and valuable comments about this paper. The authors wish to thank the members of the Munich Network Management Team (MNM-Team) for helpful discussions and valuable comments on previous versions of this paper. The MNM-Team, directed by Prof. Dr. Dieter Kranzlmüller and Prof. Dr. Heinz-Gerd Hegering, is a group of researchers at Ludwig-Maximilians-Universität München, Technische Universität München, the University of the Federal Armed Forces, and the Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities. See http://www.mnm-team.org/.

REFERENCES

[1] OGC, Ed., Service Operation, ser. IT Infrastructure Library v3 (ITIL v3). Norwich, UK: The Stationery Office, 2007.
[2] “enhanced Telecom Operations Map (eTOM), The Business Process Framework for the Information and Communications Services Industry,” TeleManagement Forum, GB 921 Release 5.0, Apr. 2005.
[3] M. Hamm, “IT Service Management Prozesse verketteter Dienste,” Dissertation, Ludwig-Maximilians-Universität München, Jun. 2009.
[4] M. Hamm, P. Marcu, and M. Yampolskiy, “Beyond Hierarchy: Towards a Service Model supporting new Sourcing Strategies for IT Services,” in Proceedings of the 2008 Workshop of HP Software University Association (HP-SUA), Infonomics-Consulting, Hewlett-Packard, Marrakech, Morocco, June 2008.
[5] H.-G. Hegering, S. Abeck, and B. Neumair, Integrated Management of Networked Systems - Concepts, Architectures and their Operational Application. Morgan Kaufmann Publishers, 1999.
[6] “ISO/IEC 20000-1:2005 - Information Technology - Service Management - Part 1: Specification,” International Organization for Standardization, Tech. Rep., Dec. 2005.
[7] G. Dreo Rodosek, “A Framework for IT Service Management,” Habilitation, University of Munich (LMU), Department of Computer Science, Munich, Germany, Jun. 2002.
[8] G. Hedlund, “Assumptions of hierarchy and heterarchy, with applications to the management of the multinational corporation,” in Organizational Theory and the Multinational Corporation, 2nd ed., S. Ghoshal and E. Westney, Eds., London, 2005, pp. 198–221.
[9] R. L. Vianna, E. R. Polina, C. C. Marquezan, L. Bertholdo, L. M. R. Tarouco, M. J. B. Almeida, and L. Z. Granville, “An Evaluation of Service Composition Technologies Applied to Network Management,” in 10th IFIP/IEEE International Symposium on Integrated Network Management, Munich, 2007, pp. 420–428.
[10] T. Klie, F. Gebhard, and S. Fischer, “Towards Automatic Composition of Network Management Web Services,” in Integrated Network Management, IM 2007. 10th IFIP/IEEE International Symposium on Integrated Network Management, Munich, Germany, 2007, pp. 769–772.
[11] S. Mitra, P. Dutta, S. Kalyanaraman, and P. Pradhan, “Spatio-Temporal Patterns for Problem Determination in IT Services,” pp. 49–56, Sep. 2009.
[12] R. Gupta, K. H. Prasad, and M. K. Mohania, “Information integration techniques to automate incident management,” in Proceedings of the IEEE/IFIP Network Operations and Management Symposium: Pervasive Management for Ubiquitous Networks and Services (NOMS 2008). Salvador Bahia, Brazil: IFIP/IEEE, Apr. 2008, pp. 979–982.
[13] P. Marcu, L. Shwartz, G. Grabarnik, and D. Loewenstern, “Managing Faults in the Service Delivery Process of Service Provider Coalitions,” in IEEE International Conference on Service Computing (SCC 2009), Bangalore, India, Sep. 2009.
[14] “MDA Guide,” http://www.omg.org/mda/, Jun. 2003.
[15] “IntegraTUM project, Technische Universität München,” http://portal.mytum.de/iuk/integratum/index_html.
[16] W. Hommel and S. Knittl, “Aufbau von organisationsübergreifenden Fehlermanagementprozessen im Projekt IntegraTUM,” in Informationsmanagement in Hochschulen, A. Bode and R. Borgeest, Eds. Berlin: Springer-Verlag, 2010.
[17] GÉANT, “GéANT Homepage,” http://www.geant.net/, 2010.
[18] “IP Performance Metrics Working Group.” [Online]. Available: http://tools.ietf.org/wg/ippm/
[19] G. Almes, S. Kalidindi, and M. Zekauskas, “A One-way Delay Metric for IPPM,” USA, Tech. Rep., 1999.
[20] C. Demichelis and P. Chimento, “IP Packet Delay Variation Metric for IP Performance Metrics (IPPM),” USA, Tech. Rep., 2002.
[21] G. Almes, S. Kalidindi, and M. Zekauskas, “A One-way Packet Loss Metric for IPPM,” USA, Tech. Rep., 1999.

Authors

Patricia Marcu received her diploma in Computer Science in 2006 at the LMU Munich. In 2007 she joined the MNM-Team at the Leibniz Supercomputing Centre as a research assistant and is pursuing her Ph.D. degree in Computer Science. She is currently working on the further development of the Customer Network Management (CNM) tool and on the visualization of the LHCOPN within the European GÉANT project. Her research focuses on inter-organizational fault management and IT service management.

Wolfgang Hommel has a Ph.D. in computer science from LMU Munich and heads the network services planning group at the Leibniz Supercomputing Centre. His current research focuses on IT security and privacy management in large distributed systems, including identity federations and Grids.
Emphasis is put on a holistic perspective, i.e., the problems and solutions are analyzed from the design phase through software engineering, deployment in heterogeneous infrastructures, and during the operation and change phases according to IT service management process frameworks, such as ISO/IEC 20000-1.
