Cloud Computing: Orchestrating and Automating Technical Building Blocks

Date: Nov 28, 2011. By Venkata Josyula, Malcolm Orr, and Greg Page.

Upon completing this chapter, you will be able to understand the following:

  • How the basic resources are added to the cloud
  • How services are added to the cloud
  • The creation and placement strategies used within a cloud
  • How services are managed throughout their life cycle

This chapter provides a detailed overview of how an Infrastructure as a Service (IaaS) service is orchestrated and automated.

On-Boarding Resources: Building the Cloud

Previous chapters discussed how to classify an IT service and covered a little bit about how to place that service in the cloud using a number of business dimensions, criticality, roles, and so on. So you know how to choose the application and how to define the components that make up those applications, but how do you decide where in the cloud to actually place the application or workload for optimal performance? To understand this, you must look at the differences between how a consumer and a provider look at cloud resources.

Figure 11-1 illustrates the differences between these two views. The consumer sees the cloud as a "limitless" container of compute, storage, and network resources. The provider, on the other hand, cannot provide limitless resources, as doing so is neither cost-effective nor possible. At the same time, however, the provider must be able to build out its infrastructure in a linear and consistent manner to meet demand and optimize the use of this infrastructure. This linear growth is achieved through the use of Integrated Compute Stacks (ICS), each of which is often referred to as a point of delivery (POD). A POD has been described previously; for the purposes of this section, consider a POD a collection of compute, storage, and network resources that conforms to a standard operating footprint and shares the same failure domain. In other words, if something catastrophic happens in a POD, workloads running in that POD are affected, but neighboring workloads in a different POD are not.


Figure 11-1 Cloud Resources

When the provider initially builds the infrastructure to support the cloud, it will deploy an initial number of PODs to support the demand it expects to see and also lines up with the oversubscription ratios it wants to apply. These initial PODs, plus the aggregation and core, make up the initial cloud infrastructure. This initial infrastructure now needs to be modeled in the service inventory so that tenant services can be provisioned and activated; this process is known as on-boarding.

Clearly, at the concrete level, what makes up a POD is determined by the individual provider. Most providers are looking at a POD comprised of an ICS: a pre-integrated set of compute, network, and storage equipment that operates as a single solution and is easier to buy and manage, offering Capital Expenditure (CAPEX) and Operational Expenditure (OPEX) savings. Cisco, for example, provides two examples of PODs, the Vblock [1] and the FlexPod [2], which provide a scalable, prebuilt unit of infrastructure that can be deployed in a modular manner. The main difference between the Vblock and FlexPod is the choice of storage in the solution: in a Vblock, storage is provided by EMC; in a FlexPod, storage is provided by NetApp. Despite the differences, the concept remains the same: provide an ICS that combines compute, network, and storage resources and enables incremental scaling with predictable performance, capability, and facilities impact. The rest of this chapter assumes that the provider has chosen Vblocks as its ICS. Figure 11-2 illustrates the relationship between the conceptual model and the concrete Vblock.


Figure 11-2 Physical Infrastructure Model

A Vblock can be specified in many different packages that provide different performance footprints. The generic Vblock used in this example is comprised of the following:

  • Cisco Unified Computing System (UCS) compute
  • Cisco MDS storage-area networking (SAN)
  • EMC network-attached storage (NAS) and block storage
  • VMware ESX Hypervisor
  • Cisco Nexus 7000 (not strictly part of a Vblock), used as the aggregation layer and to connect the NAS

A FlexPod offers a similar configuration but supports NetApp FAS storage arrays instead of EMC storage; as the storage is typically NAS using Fibre Channel over Ethernet (FCoE) or Network File System (NFS), the SAN is no longer required. Also, note that the Vblock definition is owned by the VCE company (a Cisco, EMC, and VMware coalition), so it is aimed at VMware hypervisor-based deployments, whereas the FlexPod can be considered more hypervisor neutral.

To deliver services on a Vblock, the Vblock first needs to be modeled in the service inventory or Configuration Management Database (CMDB). Because the relationships between the physical building blocks are relatively static (cabling doesn't tend to change on the fly), the CMDB is a suitable place to store this data. The first modeling choices are really driven by the data model supported by the repository chosen to act as the system of record for the cloud. Most CMDBs come with a predefined model based on the Distributed Management Task Force (DMTF) Common Information Model (CIM) standard, and where possible, the classes and relationships provided in this model should be reused. For example, BMC, an ITSM [3] company, provides the Atrium CMDB product, which implements the BMC_ComputerSystem class to represent all compute resources. An attribute of that class, isVirtual=true, is used to differentiate physical compute resources from virtual ones without needing an additional class. Figure 11-3 illustrates a simple infrastructure model.
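To make the modeling idea concrete, the following minimal sketch shows how a CIM-style configuration item might be represented in code. It is an illustration only: the class and attribute names loosely mirror the BMC_ComputerSystem/isVirtual convention described above, not the actual Atrium data model or API.

# Hypothetical sketch of a CIM-style configuration item (CI) model.
from dataclasses import dataclass, field

@dataclass
class ConfigurationItem:
    ci_class: str                      # e.g., "BMC_ComputerSystem"
    name: str
    attributes: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)  # (type, target CI)

    def relate(self, rel_type: str, target: "ConfigurationItem") -> None:
        """Record a typed relationship such as 'dependsOn' or 'memberOf'."""
        self.relationships.append((rel_type, target))

# A physical UCS blade and a virtual machine share one CI class;
# the isVirtual attribute differentiates them, as in the Atrium model.
blade = ConfigurationItem("BMC_ComputerSystem", "ucs-blade-01",
                          {"isVirtual": False})
vm = ConfigurationItem("BMC_ComputerSystem", "tenant-vm-01",
                       {"isVirtual": True})
vm.relate("dependsOn", blade)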


Figure 11-3 Infrastructure Logical Model

The following points provide a summary of the relationships shown in Figure 11-3:

  • A compute blade instance depends on the router for connectivity and on the SAN switch for access to block-based storage; a compute blade is also a member of an ESX cluster. The ESX cluster depends on a data store to access the storage array.
  • The NAS gateway instance depends on the router for connectivity, and the storage pool is a component of the NAS gateway.
  • The storage array instance depends on the SAN switch for access, and the storage pool is also a component of the storage array.
  • The POD acts as a container for all physical resources.

With this basic infrastructure model added to the CMDB or service inventory, you can begin to add the logical building blocks that support the service and any capabilities or constraints of the physical POD that will simplify the provisioning process. Does this seem like a lot of effort? Well, the reason for this on-boarding process is threefold:

  • The technical building blocks discussed in the previous chapter need to exist on something physical, so it's important to capture the relationship between the physical building blocks and the building blocks that make up the technical service. Provisioning actions that modify or delete a technical service need to determine which physical devices must be modified, and they consult the CMDB or service inventory for this information.
  • The physical building blocks have a limit to how many concrete building blocks they can support. For example, the Nexus 7000 can only support a set number of VLANs, so these capacity limits should be modeled and tracked. (This is discussed in more detail in the next chapter.)
  • From a service assurance perspective, the ability to understand the impact of a physical failure on a service or vice versa is critical. Figure 11-4 provides an example of the relationship between a service, the concrete building blocks that make up the service, and the physical building blocks that support the technical building blocks.

One important point to note is that we introduced the concept of a network container in the tenant model. A network container represents all the building blocks used to create the logical network, that is, the topology-related building blocks. A network topology can be complex and can potentially contain many different resources, so using a container to group these elements simplifies the provisioning process.
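As a rough illustration of why the container helps, the sketch below groups several hypothetical network building blocks under one container object so that provisioning (or deletion) becomes a single operation; all names are invented for the example.

# Minimal sketch of the network-container idea: group the topology
# building blocks (VRF, VLAN, port group) so that provisioning and
# deletion operate on one object instead of many.

class NetworkContainer:
    def __init__(self, name):
        self.name = name
        self.elements = []          # VRFs, VLANs, port groups, ...

    def add(self, element):
        self.elements.append(element)

    def provision(self):
        # One call fans out to every element in the container, in order.
        for element in self.elements:
            print(f"provisioning {element} in container {self.name}")

web_tier = NetworkContainer("tenant-a-web")
for block in ("vrf:tenant-a", "vlan:101", "portgroup:web-101"):
    web_tier.add(block)
web_tier.provision()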


Figure 11-4 Tenant Model

The following points provide a summary of the relationships shown in Figure 11-4:

  • The virtual machine (VM) contains a virtual disk (vDisk) and a virtual network interface card (vNIC). The VM depends on the ESX cluster, which in turn depends on the physical server. It is now possible to determine the impact on a service if a physical blade fails, even if the impact is simply that the service is degraded.
  • The vDisk depends on the data store in which it is stored. The vNIC depends on the port group to provide virtual connectivity, which in turn contains a VLAN that depends on both the UCS blade and Nexus 7000 to provide the Layer 2 physical connectivity. The VLAN depends on Virtual Routing and Forwarding (VRF) to provide Layer 3 physical connectivity. In the case of a physical failure of the Nexus, we can determine the impact on the service.
  • The VRF is contained in a service, which is contained in a virtual data center (VDC), which is contained in an organization. These relationships mean that it is possible to understand the impact of any physical failure at a service or organization level and, when modifying or deleting the service, what the logical and physical impacts will be.
  • The VRF, VLAN, and PortGroup are grouped in a simple network container that can be considered a collection of networking infrastructure, can be created by a network designer, and is part of the overall service.
  • The VM, vNIC, and VDisk are grouped in a simple application container that can be considered a collection of application infrastructure, can be created by an application designer, and is part of the overall service.

Hopefully, you can begin to see that there can be up to five separate provisioning tasks to create the model shown in Figure 11-4:

  1. The cloud provider infrastructure administrator provisions the infrastructure model into the CMDB or service inventory as new physical equipment is added.
  2. The cloud provider customer administrator creates the outline tenant model by creating the organizational entity and the VDC.
  3. The tenant service owner creates the service entity.
  4. The network designer creates the network container to support the application and attaches it to the service.
  5. The application designer creates the application container to support the application user's needs, connects it to the network resources created by the network designer, and attaches it to the service.

Modeling Capabilities

Modeling capabilities is an important step when on-boarding resources, as capabilities have a direct impact on how a resource can be used in the provisioning process. If you look at the Vblock definition again, you can see that it supports Layer 2 and Layer 3 network connectivity as well as NAS and SAN storage and the ESX hypervisor, so you can already see that it won't support the majority of the design patterns discussed earlier, as they require a load balancer. If you were looking for a POD to support the instantiation of a design pattern that required a load balancer, you could query all the PODs that had been on-boarded and look for a Load-Balancing=yes capability. If this capability doesn't exist in any POD in the cloud infrastructure, the cloud provider infrastructure administrator would need to create a new POD with a load balancer, or add a load balancer to an existing POD and update the capabilities supported by that POD.

Taking this concept further, if you configure the EMC storage in the Vblock to support several tiers of storage (Gold, Silver, and Bronze, for example), you could simply model these capabilities at the POD level and (as you will see in the next chapter) do an initial check when provisioning a service to find the POD that supports a particular tier of storage. Capabilities can also be used to drive behavior during the lifetime of the service. For example, if you want to allow a tenant to reboot a resource, you can add a reboot=true capability to the class in the data model that represents a virtual machine. You probably wouldn't want to add this capability to a storage array or network device, as rebooting one of those resources would affect multiple users and should only be done by the cloud operations team.

However, if a storage array supported a data protection capability such as NetApp Snapshot, a snapshot=true capability could be modeled at the storage array as well as the hypervisor level, allowing a tenant to choose whether he wants to snapshot at the hypervisor or storage level. The provider could offer these two options at different prices, depending on the resources they consume or the cost associated with automating this functionality.
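The following hedged sketch shows what such a capability query might look like against a toy inventory. The capability names (Load-Balancing, storage tiers) follow the examples in the text; the inventory structure and query function are assumptions made for illustration.

# Hedged sketch of a capability query against on-boarded PODs.
pods = [
    {"name": "pod-1", "capabilities": {"Load-Balancing": "no",
                                       "storage_tiers": ["Gold", "Silver"]}},
    {"name": "pod-2", "capabilities": {"Load-Balancing": "yes",
                                       "storage_tiers": ["Silver", "Bronze"]}},
]

def pods_with(capability, value):
    """Return PODs whose capability matches (or contains) the value."""
    matches = []
    for pod in pods:
        actual = pod["capabilities"].get(capability)
        if actual == value or (isinstance(actual, list) and value in actual):
            matches.append(pod["name"])
    return matches

print(pods_with("Load-Balancing", "yes"))   # ['pod-2']
print(pods_with("storage_tiers", "Gold"))   # ['pod-1']
# An empty result signals the CPIA to add a load balancer to a POD
# and update its modeled capabilities.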

Modeling Constraints

Modeling constraints is another important aspect of the on-boarding process. No resource is infinite: storage gets consumed and memory gets exhausted, so it is important to establish where the limits are. Within a Layer 2 domain, for example, you have only 4096 VLANs to work with; in reality, after you factor in all the infrastructure connectivity per POD, you will have significantly fewer. Adding these limits to the resource and tracking usage against them gives the provisioning processes a quick and easy way of checking capacity. Constraint modeling is no replacement for strong capacity management tools and processes, but it is a lightweight way of delivering a quick view of where the limits are for a specific resource.
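A minimal sketch of this lightweight constraint check, assuming a simple limit/used counter per resource, might look like this:

# Lightweight constraint tracking: record the limit on a resource and
# check usage before reserving. The 4096 figure is the 802.1Q VLAN ID
# space; usable IDs per POD will be lower in practice.

class ConstrainedResource:
    def __init__(self, name, limit, used=0):
        self.name, self.limit, self.used = name, limit, used

    def reserve(self, count=1):
        if self.used + count > self.limit:
            raise RuntimeError(
                f"{self.name}: constraint exceeded "
                f"({self.used}+{count} > {self.limit})")
        self.used += count

pod_vlans = ConstrainedResource("pod-1 VLANs", limit=3800, used=3795)
pod_vlans.reserve(4)        # succeeds: 3799 of 3800 in use
# pod_vlans.reserve(2)      # would raise: the quick capacity check fails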

Resource-Aware Infrastructure

Modeling capabilities and constraints in the service inventory or CMDB is needed because these repositories act as the single source of truth for the infrastructure. As discussed in previous chapters, however, these repositories are not necessarily the best places to store dynamic data. One alternative is a self-aware infrastructure: one that understands what devices exist within a POD, how those devices relate to each other, what capabilities they have, and what constraints and loads they are under. This concept is addressed further in the next chapter, but it is certainly a more scalable way of understanding the infrastructure model. The high-level relationships could still be modeled in the CMDB, but the next step is simply to relate the tenant service to a POD and let the POD worry about placement and resource management. Figure 11-5 illustrates the components of a resource-aware infrastructure.


Figure 11-5 Resource-Aware Infrastructure

The following points provide a more detailed explanation of the components shown in Figure 11-5:

  • A high-speed message bus, such as Extensible Messaging and Presence Protocol (XMPP), is used to connect clients running in the devices with a resource server responsible for a specific POD.
  • The resource server persists policy and capabilities in a set of repositories that are updated by the clients.
  • The resource server tracks dependencies and resource utilization within the POD.
  • The resource manager implements an API that allows the orchestration system to make placement decisions and reservation requests without querying the CMDB or service inventory.

The last point is one of the most critical. Offloading real-time resource management, constraint, and capability modeling from the CMDB/service inventory to the infrastructure significantly simplifies the infrastructure modeling needed going forward.
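The following sketch suggests the shape of the API such a resource server might expose to the orchestrator. The method names (placement_options, reserve) and the host data are illustrative assumptions; the text specifies only that placement and reservation requests bypass the CMDB.

# Sketch of the API a resource-aware POD's resource server might expose.
class PodResourceServer:
    def __init__(self, pod_name):
        self.pod = pod_name
        # In the real concept, clients on the devices push capability,
        # constraint, and load data over a bus such as XMPP.
        self.hosts = {"esx-01": {"free_ghz": 12.0, "free_gb": 96},
                      "esx-02": {"free_ghz": 4.0,  "free_gb": 16}}

    def placement_options(self, cpu_ghz, mem_gb):
        """Hosts in this POD that can take the workload right now."""
        return [h for h, free in self.hosts.items()
                if free["free_ghz"] >= cpu_ghz and free["free_gb"] >= mem_gb]

    def reserve(self, host, cpu_ghz, mem_gb):
        self.hosts[host]["free_ghz"] -= cpu_ghz
        self.hosts[host]["free_gb"] -= mem_gb
        return f"{self.pod}/{host}"

server = PodResourceServer("pod-1")
candidates = server.placement_options(cpu_ghz=8.0, mem_gb=32)
print(server.reserve(candidates[0], 8.0, 32))   # pod-1/esx-01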

Adding Services to the Cloud

Figure 11-6 illustrates a generic provisioning and activating process. The major management components required to support this are as follows:

  • A self-service portal that allows tenants to create, modify, and delete services and provides service assurance and usage views
  • A security manager that manages all tenant credentials
  • A service catalogue that stores all commercial and technical service definitions
  • A change manager that orchestrates all tasks, manual and automatic, and acts as an auditable system of record for all tenant requests
  • A capacity and/or policy manager responsible for managing infrastructure capacity and access policies
  • An orchestrator that orchestrates technical actions and can provide simple capacity and policy decisions
  • A CMDB/service inventory repository that stores asset and service data
  • A set of element managers that communicate directly with the concrete infrastructure elements

Figure 11-6 Generic Orchestration

The orchestration steps illustrated in Figure 11-6 are as follows:

Step 1.

The tenant connects to the portal and authenticates against the security manager, which can also provide group policy information used in Step 2.

Step 2.

A list of entitled services is retrieved from the service catalogue and displayed in the portal, along with any existing service data and the assurance and usage views.

Step 3.

The provisioning process begins with the tenant selecting the required service and ends with the technical building blocks that support the service being reserved in the service inventory or CMDB. Depending on the management components deployed, validation that the service can be fulfilled, based on the constraints and capabilities provided by the POD, is done either in a separate capacity or policy manager or by the orchestrator.

Note that the orchestrator/capacity manager is not necessarily making detailed placement decisions for the activation of the service (for example, on which blade in a vSphere/ESX cluster to place a resource); these decisions will typically be made in the element manager, which maintains detailed, real-time usage data.

Step 4.

The portal will create a change request to manage the delivery of the request or order.

Step 5.

The change is approved. This could simply be an automatic approval, or it could be passed into some form of change process.

Step 6.

The activation process begins. A change is decomposed into at least one change task that is passed to the orchestrator.

Steps 7, 8, 9.

These steps are managed by the orchestrator, which instantiates the concrete building blocks based on the service definition. The orchestrator communicates with the various element managers; for example, in the case of a Vblock, the orchestrator would communicate with VMware vCenter to create, clone, modify, or delete a virtual machine. Up until this point, the data regarding the service has been abstracted away from any specific implementation; at this point, the orchestrator extracts the relevant data and passes it to the element manager using its specific APIs.

Step 10.

A billing event is created that will be used to charge for fixed items such as adding more vRAM or another vCPU.

Step 11.

The orchestration has completed successfully, so all resources are committed in the service inventory and the change task is closed. This flow represents the "happy day" scenario in which all process steps complete successfully. A more detailed process would document rollback and compensation steps as well, but that is beyond the scope of this chapter.

Step 12.

The flow of control is passed back into the change manager, and this marks the end of the activation process. This might start another task or might close the overall change request if only one task is present.

Step 13.

A notification is passed back directly to the tenant, indicating that the request has been completed. Alternatively, this notification could be sent to the portal if the portal maintains request data.

As discussed previously, there might be several provisioning steps, so you might need to iterate through this process several times.
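To summarize the flow, here is a condensed, runnable sketch of the "happy day" path, with each stub standing in for an entire management component; a real design would add the rollback and compensation handling noted in Step 11. Component and task names are illustrative.

# Condensed sketch of the Figure 11-6 flow; each helper is a stub.
def step(component, action):
    print(f"[{component}] {action}")

def provision_service(tenant, service_name):
    step("security mgr", f"authenticate {tenant}")                    # step 1
    step("catalogue", f"retrieve entitled service '{service_name}'")  # step 2
    step("inventory", "reserve technical building blocks")            # step 3
    step("change mgr", "create and approve change request")           # steps 4-5
    for task in ("create VM", "attach vNIC", "attach vDisk"):         # step 6
        step("orchestrator", f"activate task: {task}")                # steps 7-9
    step("billing", "emit billing event")                             # step 10
    step("inventory", "commit reserved resources")                    # step 11
    step("change mgr", "close change task / request")                 # step 12
    step("portal", f"notify {tenant}: request complete")              # step 13

provision_service("tenant-a", "small VDC")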

Provisioning the Infrastructure Model

We now look at the steps needed to provision the tenant model shown in Figure 11-4. This assumes that the actual physical building blocks have already been racked, stacked, cabled, and configured in the data center:

  1. The cloud provider infrastructure administrator (CPIA) will log in to the self-service portal and be presented with a set of services that he is entitled to see, one of which is On-board a New POD. The CPIA will select this service; complete all the details required for this service, such as management IP addresses, constraints, and capabilities; and submit the request. The reservation step is skipped here because this service is creating new resources.
  2. As this is a significant change, this service will go through an approval process that will see infrastructure owners and cloud teams review and approve the request.
  3. After it is approved, as the infrastructure already exists, a single change task will be created to update the CMDB, and this will be passed to the orchestrator.
  4. The orchestrator has little to do but simply call the CMDB/Service Inventory and create the appropriate configuration items (CI) and their relationships. Optionally, the orchestrator can also update service assurance components to ensure that the new resources are being managed, but in most cases, this has already been done as part of the physical deployment process.
  5. A success notification is generated up the stack, and the request is shown as complete in the portal.

Provisioning the Organization and VDC

The same process used by the CPIA is followed by the cloud provider customer administrator (CPAD), but a few differences exist:

  • The CPAD will be entitled to a different set of services than the CPIA.
  • The approval process will now be more commercial/financial in nature, checking that all the agreed-upon terms and conditions are in place and that credit checks have been done.
  • Orchestration activities will manage interactions with the CMDB to create the organization and VDC CIs, add user accounts to the identity repository so that the tenant can log in to the portal, add VDC resource limits to the capacity/policy manager, and set up any branding required for the tenant in the portal.

Creating the Network Container

The same process is followed by the tenant network designer, but a few differences exist:

  • The network designer (ND) logs in to the portal using the credentials set up by the CPIA and is presented with a set of services oriented around creating, modifying, and deleting the network container. The network designer could be a consumer or a provider role, depending on the complexity of the network design.
  • The ND selects the virtual network building blocks he requires and submits the request. As this is a real-time system, the resources are reserved so that they are assigned (but not committed) to this request. The capacity manager will make sure that sufficient capacity exists in the infrastructure and that the organization has contracted enough capacity before reserving any resources.
  • The approval process is skipped here: if the organization has contracted enough capacity and there is enough infrastructure capacity, the change is preapproved.
  • Orchestration activities will manage interactions with the element managers responsible for automating and activating the configuration of the virtual network elements in a specific POD and generating billing events so that the tenant can be billed on what he has consumed.
  • A success notification is generated up the stack, and the request is shown as complete in the portal. The resources that were reserved are now committed in the service inventory and/or CMDB.

Creating the Application

The same process used by the network designer is followed by the tenant application designer, but a few differences exist:

  • The cloud consumer application designer (CCAD) logs on to the portal using the credentials set up by the CPIA and is presented with a set of services oriented around creating, modifying, and deleting the application container.
  • The CCAD selects the application building blocks he requires and submits the request. The network building blocks created by the network designer will also be presented in the portal to allow the application designer to specify which network he wants the application elements to connect to. As this is a real-time system, the resources are reserved.
  • Orchestration activities will manage interactions with the element managers responsible for automating and activating the configuration of the virtual machines, deploying software images, and generating billing events so that the tenant can be billed on what he has consumed.
  • A success notification is generated up the stack, and the request is shown as complete in the portal. The resources that were reserved are now committed in the service inventory and/or CMDB.

Workflow Design

The workflows covered in the preceding sections will vary. Some will be based on out-of-the-box content provided by an orchestration/automation vendor such as Cisco, and some will be completely bespoke; most will be a combination. It is important to balance flexibility and supportability: on the one hand, you don't want a standardized, fixed set of workflows that cannot be customized or changed; on the other hand, you don't want completely bespoke, unsupportable technical workflows. One potential solution is to use the concepts of moments and extension points to allow flexible workflows while introducing a level of standardization that promotes an easier support and upgrade path. Figure 11-7 illustrates these concepts.


Figure 11-7 Workflow Design

The core content is comprised of workflow moments; the moment concept is applied to points in time in the technical orchestration workflow. Some example moments are as follows:

  1. Trigger and decomposition: This moment is where the flow is triggered and standard payload items are decomposed into workflow attributes, for example, the action variable, which is currently used to determine the child workflow to trigger but might also need to be persisted in the workflow for billing updates and so on.
  2. Workflow enrichment and resource management: This moment is where data is extracted from systems using standard adapters and any resource management or ingress checking is performed.
  3. Orchestration: This is the overarching orchestration flow.
  4. Standard actions: These are the standard automation action sequences provided by the vendor.
  5. Standard notifications and updates: This step updates any inventory repositories (CMDBs) and other components provided with the solution, such as the cloud portal, change manager, and so on.

The core content can be extended to support bespoke functions using extension points. The concept here is that every process contains a call or dummy process element that can be triggered after the core task has completed, handling customized actions without requiring changes to the core workflow. An example set of extension points is as follows; a short sketch of the pattern appears after the list:

  1. Trigger and decomposition: This extension point is where custom service data received from the calling system/portal is decomposed into variables used in the rest of the workflow. This will allow designers to quickly add service options/data in the requesting system and handle this data in a standard manner without changing the core decomposition logic.
  2. OSS enrichment and resource management: This extension point is where custom service data is requested through custom WS* calls or other nonstandard methods and added to the workflow runtime data. This will allow designers to integrate with clients’ specific systems without changing the core enrichment logic.
  3. Actions: This extension point is where custom actions are performed using WS* calls or other nonstandard methods. This will allow designers to integrate with clients’ specific automation sequences without changing the core automation logic.
  4. Notifications: This extension point is where custom notifications are performed using WS* calls or other nonstandard methods. This will allow designers to integrate with clients’ specific automation systems without changing the core notification logic.
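The sketch below, promised above, illustrates the pattern: core moments run in a fixed order, and each extension point is an optional hook invoked after its moment, so bespoke logic never modifies the core workflow. The moment and hook names are illustrative assumptions.

# Sketch of the moments/extension-points pattern.
CORE_MOMENTS = ["trigger_and_decomposition", "enrichment_and_resource_mgmt",
                "orchestration", "standard_actions", "notifications"]

extension_points = {}    # moment name -> list of bespoke callables

def extend(moment):
    """Register a bespoke function at an extension point."""
    def register(func):
        extension_points.setdefault(moment, []).append(func)
        return func
    return register

def run_workflow(payload):
    for moment in CORE_MOMENTS:
        print(f"core moment: {moment}")
        # If no hooks are registered, this is the "dummy process element".
        for hook in extension_points.get(moment, []):
            hook(payload)

@extend("trigger_and_decomposition")
def decompose_custom_service_data(payload):
    payload["tier"] = payload.get("options", {}).get("tier", "Silver")
    print(f"  extension: decoded custom option tier={payload['tier']}")

run_workflow({"action": "create", "options": {"tier": "Gold"}})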

Creation and Placement Strategies

The previous sections discussed how any form of activation requires a placement decision; these decisions are typically made on the following grounds:

  • Maximizing resource usage: The provider wants to ensure that it gets the maximum workload density across its infrastructure. This means optimizing CAPEX (the amount of money spent on purchasing equipment).
  • Maximizing resilience: The provider chooses to spread workloads across multiple hosts to minimize the impact of failures; this typically means hosts are underutilized.
  • Maximizing usage: This is similar to maximizing resource usage, but it requires a point-in-time decision about which host is least loaded now and over the time span of the workload. Think of Amazon's Spot service.
  • Maximizing facilities usage: The provider wants to place workloads based on facilities usage to reduce energy consumption.
  • Maximizing application performance: Place all building blocks in the same domain or data center interconnect.

Figure 11-8 illustrates these concepts.


Figure 11-8 Generic Resource Allocation Strategies

Certain placement strategies are simpler to implement than others. For example, placing workloads based on current load can be done quite simply in a VMware environment, because the API supports it if Distributed Resource Scheduling (DRS) is enabled: the orchestrator can query vCenter to determine DRS placement options or simply create a VM in a cluster and let DRS decide placement. Maximizing facilities usage is substantially more difficult, as it requires the following:

  • A system that understands all this usage
  • An interface to query this system
  • The creation of an adapter that queries this system in the orchestrator and the logic associated with this query

A single placement strategy can be adopted, or the provider can adopt multiple strategies with prioritization. Typically, a public cloud provider will look to optimize its resource usage and reduce its energy consumption ahead of application performance, unless the consumer chooses to pay for better application performance, in which case this will affect the placement or migration of the service to a POD that supports the placement requirement. A private cloud provider might prioritize resilience and application performance over resource optimization.

Choosing which POD supports a particular placement strategy means combining dynamic data, such as vCenter DRS placement options, with the capabilities and constraints modeled when the POD was on-boarded, and making a decision based on all this data. The choice can be complicated, as the underlying physical relationships within the POD need to be understood. For example, assume that a provider has purchased two Vblocks and connected them to the same aggregation switches; each Vblock has 64 blades installed, giving four ESX clusters in total. Two clusters are modeled with a high oversubscription capability and two with a lower oversubscription capability. A tenant requests an HTTP load balancer design pattern that needs the web servers to run on Gold tier storage but on a low-cost (highly oversubscribed) ESX cluster.

If the orchestrator/capacity manager simply looks at the service inventory/CMDB, it will determine that two clusters can support low cost through the oversubscription_type=high capability modeled in the inventory. The orchestrator/capacity manager can also identify which of the two ESX clusters can support the required workloads by consulting DRS recommendations, but this simply identifies which compute building blocks will support the workload. Because this design pattern requires a Gold storage tier, the service inventory must be queried to understand which storage is attached to that ESX cluster and whether it has enough capacity to host the required number of .vmdk files. In addition, because the design pattern requires a session load balancer and neither Vblock supports a network-based session load balancer, the compute and associated storage must also be able to support an additional VM-based load balancer. If a third POD contained a load balancer but was connected to different compute and storage resources, would that POD make a better choice for the service? As more infrastructure capabilities are modeled, it is likely that the need for a true policy manager will evolve as the capabilities of the orchestrator/capacity manager are outgrown.
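As a hedged illustration of this decision logic, the following sketch filters a toy set of PODs on oversubscription, Gold-tier storage capacity, and the need for an extra VM-based load balancer where no network-based one exists. All data and field names are invented for the example.

# Toy POD-selection sketch for the scenario above.
pods = [
    {"name": "vblock-1", "oversubscription": "high", "tiers": ["Gold"],
     "network_lb": False, "free_vm_slots": 40, "gold_free_gb": 500},
    {"name": "vblock-2", "oversubscription": "low",  "tiers": ["Gold"],
     "network_lb": False, "free_vm_slots": 10, "gold_free_gb": 2000},
    {"name": "pod-3",    "oversubscription": "high", "tiers": ["Silver"],
     "network_lb": True,  "free_vm_slots": 60, "gold_free_gb": 0},
]

def place_pattern(web_vms, vmdk_gb):
    for pod in pods:
        if pod["oversubscription"] != "high":       # low-cost requirement
            continue
        if "Gold" not in pod["tiers"] or pod["gold_free_gb"] < vmdk_gb:
            continue                                # Gold-tier storage check
        needed = web_vms + (0 if pod["network_lb"] else 1)  # +1 LB VM if needed
        if pod["free_vm_slots"] >= needed:
            return pod["name"]
    return None   # no POD fits: escalate to capacity management

print(place_pattern(web_vms=4, vmdk_gb=200))   # vblock-1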

Service Life Cycle Management

Previously, we discussed how a service is designed and, in this chapter, how a service is instantiated and transitioned into an operational state. From an Information Technology Infrastructure Library (ITIL) [4] perspective, these sections could be viewed as the service design and service transition phases of the ITIL V3 model, respectively. Of course, they don't cover the entire best-practice recommendations of ITIL V3, but the essence is there. The term service life cycle management is often used to refer to the complete ITIL V3 framework, from service strategy to service operations, including continual improvement. However, this section uses it to mean the management of an operational service throughout its remaining lifetime, until the tenant chooses to decommission and/or delete the service. The reason is that, depending on the overall cloud operations model chosen by the provider and the type of cloud service offered, there will often be a division of responsibility between provider and consumer at the service operation level, whereas the provider is normally fully responsible for the strategy, design, and transition process areas. Service life cycle management from the consumer's perspective begins when the service is operational.

Making the distinction between decommission and delete is important because the tenant might simply choose to decommission the service (that is, not have the service active) rather than to delete it. Consider the example of the development and test environments. While development is taking place, the test environment might not be needed and vice versa. If those environments incur costs while active, it might be more cost-effective to decommission a service with the view that it will be commissioned again when it is required. Deleting a service obviously removes all the building blocks of the service, releases any resources, and deletes the service definition so that the service can no longer be recovered.

The service operations phase consists of a number of key process areas:

  • Incident and problem management
  • Event management
  • Request fulfillment
  • Access management
  • Operations management
  • Service desk function

ITIL V3 also expects some form of service improvement framework to be in place to support continual improvement, and this is never more important than in the operational phase. This chapter is concerned with orchestrating and automating cloud services, and this doesn't stop simply because the service is operational. The following sections discuss in more detail the impact that the cloud and, in particular, orchestration and automation have on each of the ITIL V3 areas.

Incident and Problem Management

The primary focus of the incident management process is to manage the life cycle of all incidents, an incident being defined as an unplanned outage or loss of quality to an IT/cloud service. The primary objective of incident management is to return the IT service to users as quickly as possible. If services are being deployed into a multitenant environment, a single incident might affect many different tenants or users, so this becomes a critical process. The primary objectives of problem management are to prevent incidents from happening and to minimize the impact of incidents that cannot be prevented. Both incident and problem management processes are typically managed by the service desk function, which is typically implemented in IT Service Management (ITSM) service desk software. All workflow and coordinating activities are managed in the service desk software, and this won't change when a service is hosted in a cloud, although the service desk software itself might need to support a different usage.

Event Management

The primary focus of event management is to filter and categorize events and to decide on appropriate actions. Event management is one of the main activities of service operations. Within the cloud, event handling and categorization become a major task: the event manager must receive, categorize, enrich, and correlate alarms not only from virtual machines and the infrastructure but also from provisioning and activation systems; in fact, alarms need to be processed from any operational activity that is being automated. Combine this complexity with the exponential demand and rate of change that cloud brings, and the need for operational awareness of any issues with service provisioning or operations, and you have identified one of the major operational challenges of a cloud platform. Orchestration can play a major part in linking event handling to problem and incident resolution, as illustrated in Figure 11-9.


Figure 11-9 Orchestration for Incident and Problem Management

As the number and variety of alarms grow, the need to understand the operational context of an event becomes paramount. Is this event affecting multiple tenants or cloud users, or just a single user? Does this event affect a service-level agreement (SLA) or not? Given the potential volume of events and the fact that a cloud is effectively open for business 24 hours a day, this analysis can no longer be performed by operators. The systems that perform event management can do event correlation; however, that is normally done by looking at the event data itself or by applying the event data within a domain. For example, network faults can be correlated based on the device on which they occur; that is, multiple failures can be correlated to the fact that an uplink on a device has failed or, using a network topology model, can be correlated against an upstream device failure.

Where event management systems are often weak is in looking across domains and applying operational context to a number of events. This is where orchestration can help. Orchestration can be used to process correlated and uncorrelated events within an operational context by querying other systems and relating events from different domains together. Unfortunately, this will not happen "out of the box." The experience of senior support staff is required to build the orchestration workflows; in effect, what you are doing is automating the investigation and diagnostic knowledge of engineers, or the known error database in ITIL speak. Many years ago, Cisco launched a tool called MPLS Diagnostic Expert (MDE) that took the combined knowledge of the Cisco Technical Assistance Center (TAC) and distilled it into a workflow for diagnosing IP-VPN connectivity. One of the first case studies involved rerunning through the tool a major service provider outage that had taken one full day to troubleshoot manually; the tool resolved the problem in ten minutes (with the correct determination that the culprit, a network operator, had removed a Border Gateway Protocol [BGP] connectivity statement). The workflow was only successful because the subject matter expert responsible for its content really understood the technical context. Now expand that across multiple domains, include an understanding of the real-time state of the operational environment, and you begin to see the scope of the problem. The cloud is transformational, however, and this is one of the major areas a provider needs to consider when adopting the cloud. Consumers will normally be unaware of the underlying process and will typically see only the output of the event management and incident/problem management processes. When an incident is raised directly by the customer, orchestration can still assist with investigation and resolution within the operational context, but the trigger is the incident rather than an event.
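The sketch below illustrates the idea (not MDE itself): an orchestration handler enriches an event with operational context from a toy service inventory, then derives scope and SLA impact before raising an incident. The inventory structure and severity rules are assumptions for illustration.

# Toy sketch of orchestration adding operational context to an event.
INVENTORY = {   # device -> services that depend on it
    "nexus7k-agg-1": [{"service": "tenant-a-web", "sla": "gold"},
                      {"service": "tenant-b-crm", "sla": "bronze"}],
}

def handle_event(device, alarm):
    impacted = INVENTORY.get(device, [])
    # Scope and SLA checks stand in for queries to other domain systems.
    severity = ("critical" if any(s["sla"] == "gold" for s in impacted)
                else "major" if len(impacted) > 1
                else "minor")
    print(f"incident: {alarm} on {device}, severity={severity}, "
          f"services={[s['service'] for s in impacted]}")

handle_event("nexus7k-agg-1", "uplink down")
# incident: uplink down on nexus7k-agg-1, severity=critical, ...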

Request Fulfillment

The primary focus of the request fulfillment process is to fulfill service requests, which in most cases are minor (standard) changes (for example, requests to change a password) or requests for information. Depending on the type of request, the orchestrator can forward the request to the required element manager to process or simply raise a ticket in the service desk to allow an operator to respond to a request for information. Given the self-service nature of the cloud, most standard changes will be preapproved and simply fulfilled by the appropriate element manager.

Access Management

The primary focus of the access management process is to grant authorized users the right to use a service while preventing access by non-authorized users. The access management process essentially executes policies defined in IT security management and, as such, is a critical process within cloud operations. From a Software as a Service (SaaS) perspective, this is relatively simple, as the provider is responsible for all aspects of the application: access management is focused on simply providing access to the application, and the application provides access to the data. In IaaS, access must be granted at a more granular level:

  • Access to the virtual machine
  • Out-of-band access to the virtual machine, in the case of a misconfiguration
  • Access to backups and snapshots
  • Access to infrastructure consoles such as restore consoles

All this access needs to be provided in a consistent and secure manner across multiple identity repositories. Orchestration and automation can ensure that, as users are added to the cloud, their identity and access rights are provisioned and modified in a consistent manner.
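A minimal sketch of such orchestrated access provisioning, assuming a hypothetical role-to-rights mapping across the access points listed above, might look like this:

# Sketch: one request fans the same identity out to each repository.
GRANTS_BY_ROLE = {
    "tenant-admin": {"vm-console": ["login", "reboot"],
                     "oob-console": ["login"],
                     "backup-portal": ["restore"]},
    "tenant-user":  {"vm-console": ["login"]},
}

def grant_access(user, role):
    for repository, rights in GRANTS_BY_ROLE[role].items():
        # Each repository would be a separate identity store/API call;
        # orchestration keeps the grants consistent across all of them.
        print(f"{repository}: grant {rights} to {user}")

grant_access("alice@tenant-a", "tenant-admin")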

Operations Management

The primary focus of the operations management process is to monitor and control the IT services and IT infrastructure—in short, the day-to-day routine tasks related to the operation of infrastructure components and applications. Table 11-1 defines typical operational tasks and shows where orchestration and automation have a role to play. The task of facilities management is not covered because of the size and breadth of this subject.

Table 11-1 Operational Tasks

Patching: Typically, this function is carried out by element managers, but if the portal allows the consumer to upload specific patches and apply them, orchestration will be involved to coordinate the automated deployment and installation of the patches.

Backup and restore: Typically, a backup is scheduled to occur on a regular basis, so this would be handled by the cloud scheduler application, but the initial creation, modification, and deletion of the backup job would be automated and coordinated by the orchestration system.

Antivirus management: The orchestration system would coordinate the deployment of antivirus agents (if required), but typically the scanning, detection, and remediation of viruses and worms will be handled by the antivirus applications.

Compliance checking: As with antivirus, the orchestration system would coordinate the deployment of compliance policies, but typically the scanning, detection, and reporting of compliance will be handled by the compliance applications.

Monitoring: While monitoring is a key component of Continual Service Improvement from the provider perspective, this data is normally exported to the cloud portal to allow tenants to see how their services are performing, so there is little orchestrator involvement here apart from setting up the policy defining which data should be exported.

The Cloud Service Desk

The cloud service desk is the single point of contact between the consumer and provider. As such, it is typically a view within the overall cloud portal that provides access to the support functions and standard change options. The following features are typically required for a cloud-enabled service desk:

  • Support multiple tenants
  • Support a web-based user interface and be Internet "hardened"
  • Allow content to be embedded in another portal
  • Support a single sign-on

Continual Service Improvement

A dynamic, self-service demand model means that the IT environment, and the services that run in it, need to be continually monitored, analyzed, and optimized. The Continual Service Improvement (CSI) process should implement a monitoring framework that continually measures performance against an expected baseline and optimizes the infrastructure to ensure that any deviation from the baseline is managed effectively. It is in the optimization step that orchestration and automation can play a significant part, coordinating and performing the actions that bring cloud platform performance back in line with the expected baseline.
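As a toy illustration of the monitor/analyze/optimize loop, the following sketch compares measured metrics against an expected baseline and triggers a corrective workflow when the deviation exceeds a tolerance; the metric names and thresholds are invented.

# Toy baseline-deviation check for the CSI loop.
BASELINE = {"portal_response_ms": 250, "provisioning_minutes": 15}
TOLERANCE = 0.20    # act on deviations beyond 20% of baseline

def check(metric, measured):
    expected = BASELINE[metric]
    deviation = (measured - expected) / expected
    if deviation > TOLERANCE:
        # Optimization step: orchestration brings performance back in
        # line, e.g., by rebalancing workloads or adding capacity.
        print(f"{metric}: {measured} vs baseline {expected} "
              f"-> trigger optimization workflow")
    else:
        print(f"{metric}: within tolerance")

check("portal_response_ms", 410)
check("provisioning_minutes", 14)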

Summary

To create new services, the orchestration and automation systems need to work together to provision and activate the technical building blocks that make up the overall cloud service. There are a number of steps that need to take place in any fulfillment activity:

  • An infrastructure model needs to be built that represents the physical building blocks, capabilities, and constraints making up the cloud infrastructure deployed by the cloud provider.
  • This infrastructure model is typically mastered in the service inventory or CMDB, but in the future, it is likely to be held in the infrastructure itself.
  • After an infrastructure model is in place, tenant services can be overlaid on top, first by reserving a set of resources that support the logical building blocks and then by activating the resources on the various physical building blocks.
  • The orchestration and automation tools not only play a part in the provisioning and activation process but also have a significant impact in service life cycle management.

References

  1. Vblock, at www.vce.com/vblock.
  2. FlexPod, at www.netapp.com/us/technology/flexpod.
  3. ITSM, at www.itsmf.co.uk.
  4. ITIL, at www.itil-officialsite.com.