Cloud Service Agreements: What to Expect and What to Negotiate
Copyright © 2019 Object Management Group
Step 7: Prepare for Service Failure Management
In a traditional data center, organizations are able to manage failures using a centralized service
management system. In the increasingly common case where an organization builds systems that use
cloud services from multiple CSPs, managing these multiple systems becomes a bigger challenge.
In an IaaS model, while CSPs are responsible for the virtualization infrastructure, the platform and
software services that are provisioned, configured and running on top of the infrastructure are the
responsibility of the CSC. Identifying the potential causes of service problems in advance is essential to
ensure service continuity. In view of the complexity of network connectivity, infrastructure, platform
and software services on which cloud-based applications depend, it is increasingly important to employ
effective operational logging and monitoring capabilities, which may be offered by the CSP or third-party
performance monitoring services. Identifying and isolating the root cause of service failures is anything
but simple, and requires a trail of data that the CSP must collect.
Operations support requires increasingly specialized capabilities. Performance monitoring dashboards
must be understood and analyzed, particularly where end-to-end functions are delivered by a
combination of services from multiple CSPs.
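To illustrate why end-to-end analysis matters when several CSPs are involved, the following is a minimal sketch, not a definitive monitoring implementation. It assumes that a business function needs every listed service to succeed and that the services fail independently, so the measured availability fractions multiply. The provider labels, service names and figures are hypothetical.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class ServiceReading:
    provider: str        # hypothetical CSP label
    service: str         # hypothetical service name
    availability: float  # measured fraction of successful requests, 0.0 to 1.0


def end_to_end_availability(readings):
    """Availability of a function that needs every listed service to succeed.

    Assumption: the services fail independently, so the fractions multiply.
    """
    return prod(r.availability for r in readings)


def weakest_link(readings):
    """Return the reading with the lowest measured availability."""
    return min(readings, key=lambda r: r.availability)


# Hypothetical readings gathered from three separate CSP dashboards.
readings = [
    ServiceReading("csp-a", "object-storage", 0.999),
    ServiceReading("csp-b", "message-queue", 0.995),
    ServiceReading("csp-c", "managed-database", 0.990),
]

overall = end_to_end_availability(readings)
worst = weakest_link(readings)
print(f"end-to-end availability: {overall:.4f}")
print(f"weakest link: {worst.provider}/{worst.service} ({worst.availability:.3f})")
```

Under these assumptions the combined figure is lower than the figure for any single service, which is why per-CSP dashboards viewed in isolation can overstate the health of the end-to-end function.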
The public CSAs reviewed discuss service commitments, credits, and the credit process in detail.
However, when it comes to service failure management capabilities or expectations, the details are
sparse. Although it is rarely mentioned, most CSPs follow IT Infrastructure Library (ITIL) or
ITIL-compatible practices for managing their cloud services. CSCs need to pay attention to three key
processes and systems used in failure management: event management, incident management and
problem management.
• Event management covers the generation of events by the cloud services and their related
components for the functions being monitored, and the subsequent distribution,
consolidation, delivery and processing of those events. The monitored functions include machine
states (up/down), the status of hypervisors, stages of service processing, performance metrics
collection, and more. Most cloud service failures are automatically handled by the event
management system; however, there are cases when automation is not sufficient. In such cases,
the event management system passes control to an incident management system by generating
a ticket.
• Incident management involves ticket generation, ticket assignment to administrators, tracking
of ticket resolution, as well as checking and updating the ticket processing status, and escalation
procedures. Given that the number of security incidents is rising, it may be advisable to set up
specific security incident response processes for suspected security breaches or threats. Such
processes complement endpoint security detection, and automated alerts are an effective
preventive measure. Several industry or regulatory bodies have mandated specific
steps and dispositions in the response process for security incidents, especially if they impact
the general public or core services and infrastructure.
• Problem management is aimed at preventing problems, in particular by analyzing recurring
incidents in order to eliminate them, and minimizing the impact of incidents that cannot be totally