The Operations Layer defines the operational processes and procedures necessary to deliver Information Technology (IT) as a Service. This layer leverages IT Service Management concepts that can be found in prevailing best practices such as ITIL and MOF. The main focus of the Operations Layer is to execute the business requirements defined at the Service Delivery Layer. Cloud-like service attributes cannot be achieved through technology alone and require a high level of IT Service Management maturity.
Change Management process is responsible for controlling the life cycle of all changes. The primary objective of Change Management is to eliminate or at least minimize disruption while desired changes are made to services. Change Management focuses on understanding and balancing the cost and risk of making the change versus the benefit of the change to either the business or the service. Driving predictability and minimizing human involvement are the core principles for achieving a mature Service Management process and ensuring changes can be made without impacting the perception of continuous availability.
Standard (Automated) Change
Non-Standard (Mechanized) Change
It is important to note that a record of all changes must be maintained, including Standard Changes that have been automated. The automated process for Standard Changes should include the creation and population of the change record per standard policy in order to make sure auditability.
Automating changes also enables other key principles such as:
The Service Asset and Configuration Management process is responsible for maintaining information on the assets, components, and infrastructure needed to provide a service. Critical configuration data for each component, and its relationship to other components, must be accurately captured and maintained. This configuration data should include past and current states and future-state forecasts, and be easily available to those who need it. Mature Service Asset and Configuration Management processes are necessary for achieving predictability.
A virtualized infrastructure adds complexity to the management of Configuration Items (CIs) due to the transient nature of the relationship between guests and hosts in the infrastructure. How is the relationship between CIs maintained in an environment that is potentially changing very frequently?
A service comprises software, platform, and infrastructure layers. Each layer provides a level of abstraction that is dependent on the layer beneath it. This abstraction hides the implementation and composition details of the layer. Access to the layer is provided through an interface and as long as the fabric is available, the actual physical location of a hosted VM is irrelevant. To provide Infrastructure as a Service (IaaS), the configuration and relationship of the components within the fabric must be understood, whereas the details of the configuration within the VMs hosted by the fabric are irrelevant.
The Configuration Management System (CMS) will need to be partitioned, at a minimum, into physical and logical CI layers. Two Configuration Management Databases (CMDBs) might be used; one to manage the physical CIs of the fabric (facilities, network, storage, hardware, and hypervisor) and the other to manage the logical CIs (everything else). The CMS can be further partitioned by layer, with separate management of the infrastructure, platform, and software layers. The benefits and trade-offs of each approach are summarized below.
CMS Partitioned by Layer
CMS Partitioned into Physical and Logical
Table 2: Configuration Management System Options
Partitioning logical and physical CI information allows for greater stability within the CMS, because CIs will need to be changed less frequently. This means less effort will need to be expended to accurately maintain the information. During normal operations, mapping a VM to its physical host is irrelevant. If historical records of a VM’s location are needed, (for example, for auditing or Root Cause Analysis) they can be traced through change logs.
The physical or fabric CMDB will need to include a mapping of fault domains, upgrade domains, and Live Migration domains. The relationship of these patterns to the infrastructure CIs will provide critical information to the Fabric Management System.
The Release and Deployment Management processes are responsible for making sure that approved changes to a service can be built, tested, and deployed to meet specifications with minimal disruption to the service and production environment. Where Change Management is based on the approval mechanism (determining what will be changed and why), Release and Deployment Management will determine how those changes will be implemented.
The primary focus of Release and Deployment Management is to protect the production environment. The less variation is found in the environment, the greater the level of predictability – and, therefore, the lower the risk of causing harm when new elements are introduced. The concept of homogenization of physical infrastructure is derived from this predictability principle. If the physical infrastructure is completely homogenized, there is much greater predictability in the release and deployment process.
While complete homogenization is the ideal, it may not be achievable in the real world. Homogenization is a continuum. The closer an environment gets to complete homogeneity, the more predictable it becomes and the fewer the risks. Full homogeneity means not only that identical hardware models are used, but all hardware configuration is identical as well. When complete hardware homogeneity is not feasible, strive for configuration homogeneity wherever possible.
Figure 2: Homogenization Continuum
The Scale Unit concept drives predictability in Capacity Planning and agility in the release and deployment of physical infrastructure. The hardware specifications and configurations have been pre-defined and tested, allowing for a more rapid deployment cycle than in a traditional data center. Similarly, known quantities of resources are added to the data center when the Capacity Plan is triggered. However, when the Scale Unit itself must change (for example, when a vendor retires a hardware model), a new risk is introduced to the private cloud.
There will likely be a period where both n and n-1 versions of the Scale Unit exist in the infrastructure, but steps can be taken to minimize the risk this creates. Work with hardware vendors to understand the life cycle of their products and coordinate changes from multiple vendors to minimize iterations of the Scale Unit change. Also, upgrading to the new version of the Scale Unit should take place one Fault Domain at a time wherever possible. This will make sure that if an incident occurs with the new version, it can be isolated to a single Fault Domain.
Homogenization of the physical infrastructure means consistency and predictability for the VMs regardless of which physical host they reside on. This concept can be extended beyond the production environment. The fabric can be partitioned into development, test, and pre-production environments as well. Eliminating variability between environments enables developers to more easily optimize applications for a private cloud and gives testers more confidence that the results reflect the realities of production, which in turn should greatly improve testing efficiency.
The virtualized infrastructure enables workloads to be transferred more easily between environments. All VMs should be built from a common set of component templates housed in a library, which is used across all environments. This shared library includes templates for all components approved for production, such as VM images, the gold OS image, server role templates, and platform templates. These component templates are downloaded from the shared library and become the building blocks of the development environment. From development, these components are packaged together to create a test candidate package (in the form of a virtual hard disk (VHD) that is uploaded to the library. This test candidate package can then be deployed by booting the VHD in the test environment. When testing is complete, the package can again be uploaded to the library as a release candidate package – for deployment into the pre-production environment, and ultimately into the production environment.
Since workloads are deployed by booting a VM from a VHD, the Release Management process occurs very quickly through the transfer of VHD packages to different environments. This also allows for rapid rollback should the deployment fail; the current release can be deleted and the VM can be booted off the previous VHD.
Virtualization and the use of standard VM templates allow us to rethink software updates and patch management. As there is minimal variation in the production environment and all services in production are built with a common set of component templates, patches need not be applied in production. Instead, they should be applied to the templates in the shared library. Any services in production using that template will require a new version release. The release package is then rebuilt, tested, and redeployed, as shown below.
Figure 3: The Release Process
This may seem counter-intuitive for a critical patch scenario, such as when an exploitable vulnerability is exposed. But with virtualization technologies and automated test scripts, a new version of a service can be built, tested, and deployed quite rapidly.
Variation can also be reduced through standardized, automated test scenarios. While not every test scenario can or should be automated, tests that are automated will improve predictability and facilitate more rapid test and deployment timelines. Test scenarios that are common for all applications, or the ones that might be shared by certain application patterns, are key candidates for automation. These automated test scripts may be required for all release candidates prior to deployment and would make sure further reduction in variation in the production environment.
Knowledge Management is the process of gathering, analyzing, storing, and sharing knowledge and information within an organization. The goal of Knowledge Management is to make sure that the right people have access to the information they need to maintain a private cloud. As operational knowledge expands and matures, the ability to intelligently automate operational tasks improves, providing for an increasingly dynamic environment.
An immature approach to Knowledge Management costs organizations in terms of slower, less-efficient problem solving. Every problem or new situation that arises becomes a crisis that must be solved. A few people may have the prior experience to resolve the problem quickly and calmly, but their knowledge is not shared. Immature knowledge management creates greater stress for the operations staff and usually results in user dissatisfaction with frequent and lengthy unexpected outages. Mature Knowledge Management processes are necessary for achieving a service provider’s approach to delivering infrastructure. Past knowledge and experience is documented, communicated, and readily available when needed. Operating teams are no longer crisis-driven as service-impacting events grow less frequent and are quickly resolves when they do occur.
When designing a private cloud, development of the Health Model will drive much of the information needed for Knowledge Management. The Health Model defines the ideal states for each infrastructure component and the daily, weekly, monthly, and as-needed tasks required to maintain this state. The Health Model also defines unhealthy states for each infrastructure component and actions to be taken to restore their health. This information will form the foundation of the Knowledge Management database.
Aligning the Health Model with alerts allows these alerts to contain links to the Knowledge Management database describing the specific steps to be taken in response to the alert. This will help drive predictability as a consistent, proven set of actions will be taken in response to each alert.
The final step toward achieving a private cloud is the automation of responses to each alert as defined in the Knowledge Management database. Once these responses are proven successful, they should be automated to the fullest extent possible. It is important to note, though, that automating responses to alerts does not make them invisible and forgotten. Even when alerts generate a fully automated response they must be captured in the Service Management system. If the alert indicates the need for a change, the change record should be logged. Similarly, if the alert is in response to an incident, an incident record should be created. These automated workflows must be reviewed regularly by Operations staff to make sure the automated action achieves the expected result. Finally, as the environment changes over time, or as new knowledge is gained, the Knowledge Management database must be updated along with the automated workflows that are based on that knowledge.
The goal of Incident Management is to resolve events that are impacting, or threaten to impact, services as quickly as possible with minimal disruption. The goal of Problem Management is to identify and resolve root causes of incidents that have occurred as well as identify and prevent or minimize the impact of incidents that may occur.
Pinpointing the root cause of an incident can become more challenging when workloads are abstracted from the infrastructure and their physical location changes frequently. Additionally, incident response teams may be unfamiliar with virtualization technologies (at least initially) which could also lead to delays in incident resolution. Finally, applications may have neither a robust Health Model nor expose all of the health information required for a proactive response. All of this may lead to an increase in reactive (user initiated) incidents which will likely increase the Mean-Time-to-Restore-Service (MTRS) and customer dissatisfaction.
This may seem to go against the resiliency principle, but note that virtualization alone will not achieve the desired resiliency unless accompanied by highly mature IT Service Management (ITSM) maturity and a robust automated health monitoring system.
The drive for resiliency requires a different approach to troubleshooting incidents. Extensive troubleshooting of incidents in production negatively impacts resiliency. Therefore, if an incident cannot be quickly resolved, the service can be rolled back to the previous version, as described under Release and Deployment. Further troubleshooting can be done in a test environment without impacting the production environment. Troubleshooting in the production environment may be limited to moving the service to different hosts (ruling out infrastructure as the cause) and rebooting the VMs. If these steps do not resolve the issue, the rollback scenario could be initiated.
Minimizing human involvement in incident management is critical for achieving resiliency. The troubleshooting scenarios described earlier could be automated, which will allow for identification and possible resolution of the root much more quickly than non-automated processes. But automation may mask the root cause of the incident. Careful consideration should be given to determining which troubleshooting steps should be automated and which require human analysis.
Human Analysis of Troubleshooting
If a compute resource fails, it is no longer necessary to treat the failure as an incident that must be fixed immediately. It may be more efficient and cost effective to treat the failure as part of the decay of the Resource Pool. Rather than treat a failed server as an incident that requires immediate resolution, treat it as a natural candidate for replacement on a regular maintenance schedule, or when the Resource Pool reaches a certain threshold of decay. Each organization must balance cost, efficiency, and risk as it determines an acceptable decay threshold – and choose among these courses of action:
The benefits and trade-off of each of the options are listed below:
Option 4 is the least desirable, as it does not take advantage of the resiliency and cost reduction benefits of a private cloud. A well-planned Resource Pool and Reserve Capacity strategy will account for Resource Decay.
Option 1 is the most recommended approach. A predictable maintenance schedule allows for better procurement planning and can help avoid conflicts with other maintenance activities, such as software upgrades. Again, a well-planned Resource Pool and Reserve Capacity strategy will account for Resource Decay and minimize the risk of exceeding critical thresholds before the scheduled maintenance.
Option 3 will likely be the only option for self-contained Scale Unit scenarios, as the container must be replaced as a single Scale Unit when the decay threshold is reached.
The goal of Request Fulfillment is to manage requests for service from users. Users should have a clear understanding of the process they need to initiate to request service and IT should have a consistent approach for managing these requests.
Much like any service provider, IT should clearly define the types of requests available to users in the service catalog. The service catalog should include an SLA on when the request will be completed, as well as the cost of fulfilling the request, if any.
The types of requests available and their associated costs should reflect the actual cost of completing the request and this cost should be easily understood. For example, if a user requests an additional VM, its daily cost should be noted on the request form, which should also be exposed to the organization or person responsible for paying the bill.
It is relatively easy to see the need for adding resources, but more difficult to see when a resource is no longer needed. A process for identifying and removing unused VMs should be put into place. There are a number of strategies to do this, depending on the needs of a given organization, such as:
The benefits and trade-offs of each of these approaches are detailed below:
Option 4 affords the greatest flexibility, while still working to minimize server sprawl. When a user requests a VM, they have the option of setting an expiration date with no reminder (for example, if they know they will only be using the workload for one week). They could set an expiration deadline with a reminder (for example, a reminder that the VM will expire after 90 days unless they wish to renew). Lastly, the user may request no expiration date if they expect the workload will always be needed. If the last option is chosen, it is likely that underutilized VMs will still be monitored and owners notified.
Finally, self-provisioning should be considered, if appropriate, when evaluating request fulfillment options to drive towards minimal human involvement. Self-provisioning allows great agility and user empowerment, but it can also introduce risks depending on the nature of the environment in which these VMs are introduced.
For an enterprise organization, the risk of bypassing formal build, stabilize, and deploy processes may or may not outweigh the agility benefits gained from the self-provisioning option. Without strong governance to make sure each VM has an end-of-life strategy, the fabric may become congested with VM server sprawl. The pros and cons of self-provisioning options are listed in the next diagram:
The primary decision point for determining whether to use self-provisioning is the nature of the environment. Allowing developers to self-provision into the development environment greatly facilitates agile development, and allows the enterprise to maintain release management controls as these workloads are moved out of development and into test and production environments.
A user-led community environment isolated from enterprise mission-critical applications may also be a good candidate for self-provisioning. As long as user actions are isolated and cannot impact mission critical applications, the agility and user empowerment may justify the risk of giving up control of release management. Again, it is essential that in such a scenario, expiration timers are included to prevent server sprawl.
The goal of Access Management is to make sure authorized users have access to the services they need while preventing access by unauthorized users. Access Management is the implementation of security policies defined by Information Security Management at the Service Delivery Layer.
Maintaining access for authorized users is critical for achieving the perception of continuous availability. Besides allowing access, Access Management defines users who are allowed to use, configure, or administer objects in the Management Layer. From a provider’s perspective, it answers questions like:
From a consumer’s perspective, it answers questions such as:
Access Management is implemented at several levels and can include physical barriers to systems such as requiring access smartcards at the data center, or virtual barriers such as network and Virtual Local Area Network (VLAN) separation, firewalling, and access to storage and applications.
Taking a service provider’s approach to Access Management will also make sure that resource segmentation and multi-tenancy is addressed.
Resource Pools may need to be segmented to address security concerns around confidentiality, integrity, and availability. Some tenants may not wish to share infrastructure resources to keep their environment isolated from others. Access Management of shared infrastructure requires logical access control mechanisms such as encryption, access control rights, user groupings, and permissions. Dedicated infrastructure also relies on physical access control mechanisms, where infrastructure is not physically connected, but is effectively isolated through a firewall or other mechanisms.
The goal of systems administration is to make sure that the daily, weekly, monthly, and as-needed tasks required to keep a system healthy are being performed.
Regularly performing ongoing systems administration tasks is critical for achieving predictability. As the organization matures and the Knowledge Management database becomes more robust and increasingly automated, systems administration tasks is no longer part of the job role function. It is important to keep this in mind as an organization moves to a private cloud. Staff once responsible for systems administration should refocus on automation and scripting skills – and on monitoring the fabric to identify patterns that indicate possibilities for ongoing improvement of existing automated workflows.