This document provides guidance for planning and designing Service Delivery in a private cloud. It addresses service design considerations based on the Private Cloud Principles and Concepts. This guide is intended as an aid for operations staff, architects, and consultants who are designing Service Delivery processes, procedures, and best practices for a private cloud. The reader should already be familiar with the Microsoft Operations Framework (MOF) and Information Technology Infrastructure Library (ITIL) models, as well as the Private Cloud Principles, Concepts, and Patterns described in this documentation.
Note: This document is part of a collection of documents that comprise the Reference Architecture for Private Cloud document set.
Figure 1 shows the Private Cloud Reference Model with the Service Delivery Layer highlighted, indicating the primary functions within that layer that are impacted by the Private Cloud Principles and Concepts.
Figure 1: Private Cloud Reference Model
The Service Delivery Layer is the conduit for translating business requirements into Information Technology (IT) requirements for delivering a service. The Service Delivery Layer is the primary interface with the business and seeks to answer the following questions:
Financial Management incorporates the function and processes responsible for managing a service provider’s budgeting, accounting, and charging requirements.
A private cloud allows IT to provide greater cost transparency than can be achieved in a traditional data center. In a traditional data center, the line between shared and dedicated resources often becomes blurred, making it difficult to determine the actual cost of providing a given service. By taking a service provider’s approach to offering Infrastructure as a Service (IaaS), metering of service consumption allows for a better understanding of the percentage of the overall infrastructure consumed by a given service. The two aspects of cost transparency that need to be considered are:
Service classification is covered in greater detail in the Service Catalog Management section of this document. From a Financial Management perspective, however, it is important to note that the primary driver behind service classification is workload characteristics, because these ultimately drive cost. When applications are designed to handle hardware failure within the fabric, the cost of providing that fabric is significantly lower; infrastructure cost is higher for applications that cannot handle hardware failure gracefully and therefore require live migration. The main source of the cost difference is redundant power: applications that need live migration require greater infrastructure reliability, which in turn requires redundant power, and the cost of providing that redundancy should be charged only to the applications that require it. This service classification may lead to three distinct environments within a data center, each with its own fixed cost: a traditional data center for hosting services that cannot be or should not be virtualized; a high availability environment with redundant power for applications that require live migration; and a standard environment without redundant power for applications that are stateless and do not require live migration.
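As an illustration of how this classification might be captured, the following sketch maps workload characteristics to the three environments just described. The environment names mirror the text, but the relative fixed-cost figures are purely hypothetical assumptions:

```python
from dataclasses import dataclass

# Hypothetical illustration of the three environments described above.
# The relative fixed-cost figures are assumptions, not values from this guide.
@dataclass
class Environment:
    name: str
    redundant_power: bool
    supports_live_migration: bool
    relative_fixed_cost: float   # cost index vs. the standard environment

ENVIRONMENTS = {
    "traditional": Environment("Traditional data center", True, False, 1.5),
    "high_availability": Environment("High availability (redundant power)", True, True, 1.3),
    "standard": Environment("Standard (no redundant power)", False, False, 1.0),
}

def classify(handles_hw_failure: bool, virtualizable: bool) -> Environment:
    """Place a workload in the cheapest environment that meets its needs."""
    if not virtualizable:
        return ENVIRONMENTS["traditional"]
    if handles_hw_failure:
        return ENVIRONMENTS["standard"]          # stateless; rides out fabric faults
    return ENVIRONMENTS["high_availability"]     # needs live migration, so redundant power
```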
In a Public Cloud offering, the consumer has a clearly defined, consumption-based price: they pay only for what they use. The provider assumes the risk of ensuring that the price charged covers its costs as a host and still yields a profit. When a business builds a private cloud and is, in effect, both the provider and the consumer, this model may not be the best approach. The business is not likely to be interested in making a profit from its IaaS and may wish only to cover its costs. Also, depending on the nature of the business model, the cost of IaaS may be covered wholly by the business, or there may be autonomous business units that require a clear cost separation. The following pricing options reflect these differing business models:
The benefits and trade-offs of each model are outlined below.
Regardless of the option that is used, the unit of consumption will need to be defined as a part of the Cost Model. There are many aspects of a private cloud such as power, compute, network, and storage that may be used as a unit of consumption. Power is likely to be the largest variable in the cost, but it is difficult to allocate specific power consumption to a given workload. There is a fairly close correlation between CPU utilization and power consumption and therefore CPU utilization by workload may be the most accurate way to reflect variable power consumption. The Cost Model should also reflect the cost of Reserve Capacity. A discussion on calculating Reserve Capacity can be found in the Capacity Management section of this document. Depending on the model used, this cost can either be reflected as part of the fixed price charge or in the unit of consumption.
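A minimal sketch of such a Cost Model follows, assuming CPU utilization is used as the proxy for variable power consumption as suggested above, and that Reserve Capacity cost is spread across consumed units as an uplift. The example costs and the 10 percent reserve fraction are illustrative assumptions:

```python
# A minimal cost-model sketch, assuming CPU utilization is the proxy for
# variable power consumption, as suggested above. The reserve fraction and
# the example costs are illustrative assumptions.

def workload_unit_cost(cpu_hours: float,
                       total_cpu_hours: float,
                       variable_power_cost: float,
                       fixed_infrastructure_cost: float,
                       reserve_capacity_fraction: float = 0.1) -> float:
    """Allocate fixed and variable costs to one workload for a billing period."""
    share = cpu_hours / total_cpu_hours          # workload's share of the fabric
    variable = share * variable_power_cost       # power allocated via CPU share
    fixed = share * fixed_infrastructure_cost    # share of the fixed costs
    # Reserve Capacity cannot be sold, so its cost is recovered as an
    # uplift spread across the units that are actually consumed.
    return (variable + fixed) * (1 + reserve_capacity_fraction)

# Example: a workload that used 500 of 10,000 CPU-hours in the period.
print(f"{workload_unit_cost(500, 10_000, 2_000.0, 8_000.0):.2f}")  # -> 550.00
```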
The ultimate goal of Financial Management is to encourage the desired consumer behavior. With cost transparency, consumption-based pricing, and price-differentiated service classes, consumers can better understand the cost of the services they consume and are more likely to consume only what they really need. A private cloud model not only encourages desired behavior in consumers but also provides incentives to service owners and developers. Once the cost differences across classes of service are known, service owners are more likely to build or buy applications that do not require hardware redundancy and therefore qualify for the least expensive class of service.
Demand Management involves activities that relate to understanding and influencing the customer demand for services and scaling capacity (up or down) to meet these demands. The principles of perceived infinite capacity and continuous availability are fundamental to the customer demand for Cloud-based services. Providing a resilient, predictable environment and managing capacity in a predictable way are both necessary for achieving these principles. Factors such as cost, quality, and agility are used to influence consumer demand for these services.
Additionally, the initial service portfolio assessment is critical to determining the demand for each of the environments.
Business Relationship Management is the strategic interface between the business and IT. The business defines the functionality of the services it needs and then partners with IT to provide those services. The business must also work closely with IT to ensure that current and future capacity requirements are met.
Mature Business Relationship Management is critical to achieving a service provider’s approach. IT must not remain just a cost center; instead it should become a trusted advisor to the business. With a service provider’s approach, IT is better positioned to help the business make informed decisions on how to consume the services they need while balancing cost, benefit, and risk. Business Relationship Management begins with an analysis of the existing IT service portfolio. IT can begin to advise the business on its IT portfolio strategy by understanding key business drivers, the critical function of each service for the business, and the capability of the market to meet these critical functions. Key questions that need to be answered by IT include:
A key service attribute that will become increasingly important as Cloud services become more readily available is service portability. As the market matures, services that are designed for portability will be best positioned to leverage the marketplace. It is beneficial to evaluate whether a service (or components of the service) can be moved from internal IT to a Public Cloud provider, or from one Public Cloud provider to another, to take advantage of changes in the marketplace. Services that are designed for portability will be in a position to benefit from these possibilities as they emerge. Other key questions that should be asked as part of portfolio analysis are:
Service Catalog Management involves defining and maintaining a catalog of services offered to consumers. This catalog will list the following:
The service catalog may also include specific Virtual Machine (VM) templates, such as a high compute template, designed for different workload patterns. Each template will define the VM configuration specifics, such as the amount of allocated CPU, memory, and storage.
A public service provider offers a small number of service options to the consumer. This allows greater predictability for the provider, while the consumer gives up a degree of flexibility in exchange for the cost savings that come with standardization. The same approach needs to be taken in an in-house private cloud. Because workloads have been abstracted from the physical hardware, the consumer no longer needs to be concerned with hardware specifics; they only need to choose the VM template that provides the right balance of availability, performance, security characteristics, and cost. Financial Management, Demand Management, and Business Relationship Management all provide key inputs for designing a service catalog. From these management activities, a number of key questions will need to be answered and reflected in the service catalog, including:
Below is a sample of what a service catalog may look like. The Compute Unit will be defined based on a logical division of the Random Access Memory (RAM) and CPUs of the physical hosts. The Consumption Unit will be defined based on the Cost Model described in the Financial Management portion of this document; this is the base unit of consumption that will be charged to the consumer. If the selected VM had a small configuration for CPU, memory, and storage and was housed in the standard environment, the cost for the VM would be 1 Consumption Unit. If a larger configuration for CPU, memory, or storage was selected, or if the workload required live migration and redundant power, multipliers would be applied to the cost of the VM. For example, if the VM required medium CPU (x2), large memory (x4), small storage (x1), and live migration (x6), the total cost of this VM would be 48 Consumption Units (2 x 4 x 1 x 6). Another possible factor that is not shown in this sample but may be a service catalog option is input/output operations per second (IOPS). There may be a standard IOPS and a high IOPS option with an associated Consumption Unit multiplier.
Figure 2: Sample Service Catalog
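The multipliers in the worked example compound multiplicatively. The following sketch reproduces that calculation; the multipliers used in the example (x2, x4, x1, x6) come from the text, while the remaining catalog values are assumptions for illustration:

```python
# Reproducing the Consumption Unit arithmetic from the sample catalog.
# Only the multipliers in the worked example come from the text; the
# rest are illustrative assumptions.
CPU_MULT = {"small": 1, "medium": 2, "large": 4}
MEMORY_MULT = {"small": 1, "medium": 2, "large": 4}
STORAGE_MULT = {"small": 1, "medium": 2, "large": 4}
ENVIRONMENT_MULT = {"standard": 1, "live_migration": 6}

def consumption_units(cpu: str, memory: str, storage: str, environment: str) -> int:
    """Multipliers compound, so cost = cpu x memory x storage x environment."""
    return (CPU_MULT[cpu] * MEMORY_MULT[memory]
            * STORAGE_MULT[storage] * ENVIRONMENT_MULT[environment])

# The worked example: medium CPU, large memory, small storage, live migration.
assert consumption_units("medium", "large", "small", "live_migration") == 48
# The baseline: a small VM in the standard environment costs 1 Consumption Unit.
assert consumption_units("small", "small", "small", "standard") == 1
```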
Service Life Cycle Management takes an end-to-end management view of a service. There is a natural journey that takes place from the identification of a business need through Business Relationship Management to the time when that service becomes available. There are two aspects of Life Cycle Management that need to be discussed in the context of a private cloud. Firstly, what is unique about managing the life cycle of services end-to-end in a private cloud? Secondly, how does a private cloud enable Life Cycle Management for the services it hosts?
Maturely managing the Service Life Cycle of a data center requires a service provider’s approach. A clear process needs to be defined for identifying when components need to be added to the data center and when they should be retired. A more detailed description of identifying capacity growth is provided in the Capacity Management section of this document. After the need for additional capacity has been identified, a managed approach to capacity growth is necessary for Service Life Cycle Management to provide predictability within the data center. The Scale Unit simplifies the process of adding new components to a data center: the models and quantities of hardware in a Scale Unit are known, so the procurement cycle becomes more predictable. Hardware replacement should also become predictable and ease the burden of hardware budgeting.

As part of the Service Life Cycle, hardware should be replaced regularly per the manufacturer’s recommendations. Hardware is often kept in production long after the recommended replacement date, driven by the mindset that, “if it’s not broken, don’t fix it.” There is risk in keeping hardware in production past its recommended life expectancy. Ongoing maintenance costs are likely to rise because of an increase in incidents associated with the old hardware, and when the hardware finally fails, the cost of the service outage and of managing the incident adds further to the cost of extending the hardware’s life. The cost and risk of extending the life of hardware will typically not outweigh the benefits of the increased stability and predictability that come with regular hardware replacement.
If you view a cloud from the perspective of IaaS, the workloads on a particular VM are irrelevant. Yet a cloud, by its very nature, can assist with the Life Cycle Management of those services. By providing a homogenized infrastructure, a cloud reduces environmental variation and provides more predictable and consistent development, testing, and production environments. Infrastructure homogenization combined with virtualization reduces the risks traditionally associated with the Deliver phase of a life cycle and facilitates a more rapid approach to server deployment. It is this ability to rapidly deploy services without compromising risk management that makes this a compelling option.
The cost transparency also provides incentives to service owners to think about service retirement. In a traditional data center, services may fall out of use, but often no consideration is given to how to retire an unused service. The cost of ongoing support and maintenance for an under-utilized service may be hidden by the cost model of the data center. In a private cloud, the monthly consumption cost for each VM can be provided to the business and to the consumers, thus encouraging VM owners to retire unused services and reduce cost.
The Service Level Management process is responsible for negotiating Service Level Agreements (SLAs) and ensuring these agreements are met. In a private cloud, these agreements will define the targets of cost, quality, and agility by service class and the metrics for measuring the successful achievement of each.
A high degree of Service Level Management maturity is necessary to maintain a private cloud data center, and this may require re-examining existing approaches to defining SLAs. SLAs for fabric availability will need to be defined in terms of the environment. For example, the Standard Environment, which has no redundant power, may have a Service Level Target of 99.9 percent fabric availability, while the High Availability Environment may have a Service Level Target of 99.99 percent fabric availability, reflecting its power redundancy.
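To make these targets concrete, the following sketch converts a fabric availability percentage into the allowable downtime per year, a common way to compare Service Level Targets:

```python
# Converting fabric availability targets into allowable downtime per year.
def annual_downtime_minutes(availability_pct: float) -> float:
    return (1 - availability_pct / 100) * 365 * 24 * 60

print(f"99.9%  -> {annual_downtime_minutes(99.9):.0f} min/year")   # ~526 (about 8.8 hours)
print(f"99.99% -> {annual_downtime_minutes(99.99):.0f} min/year")  # ~53
```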
Additional questions that may need to be answered include:
Availability Management defines the processes necessary to achieve the perception of continuous availability. Continuity Management defines how risk will be managed in a disaster scenario and ensures that minimum service levels are maintained.
Strategies for achieving availability will differ between the standard and high availability environments based on the requirements of the hosted workloads. In the standard environment, redundant power has been removed, putting the burden of redundancy on the application. In this environment, there is a high likelihood of an Uninterruptible Power Supply (UPS) failure at some point, which will cause a Fault Domain failure. The details of constructing a Health Model are covered in the Planning and Design Guide for Systems Management. From the continuity and availability perspective, the most critical aspect of achieving resiliency is understanding the health of the UPS. Early signs of failing UPS health should be closely monitored and should automatically trigger the relocation of VMs in the affected Fault Domain. In the event of a Fault Domain failure without warning, the management system should ideally detect the fault and restart the VMs on a functional Fault Domain.
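A minimal sketch of that remediation logic follows, assuming a monitoring system that reports UPS health per Fault Domain and a fabric management API offering live migration and restart operations. All of the names here are hypothetical, standing in for whatever management system is in place:

```python
# A sketch of the UPS-health remediation described above. The fabric API
# (ups_health, vms_in, live_migrate, restart, pick_healthy_domain) is
# hypothetical, not a specific product's interface.
from enum import Enum

class UpsHealth(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"   # early warning signs of failure
    FAILED = "failed"       # the Fault Domain has lost power

def remediate_fault_domain(domain, fabric):
    health = fabric.ups_health(domain)
    if health is UpsHealth.DEGRADED:
        # Proactive: the domain is still up, so live-migrate VMs away
        # before the UPS actually fails.
        for vm in fabric.vms_in(domain):
            fabric.live_migrate(vm, fabric.pick_healthy_domain(exclude=domain))
    elif health is UpsHealth.FAILED:
        # Reactive: the domain went down without warning; restart its
        # VMs on a functional Fault Domain.
        for vm in fabric.vms_in(domain):
            fabric.restart(vm, fabric.pick_healthy_domain(exclude=domain))
```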
The benefits and trade-offs of a homogenized infrastructure are discussed in the Release and Deployment Management section of the Private Cloud Planning Guide for Operations. The less homogeneity there is in the environment, the greater the likelihood that variation will cause an outage or extend its duration. Therefore, the extent of homogenization should be factored in when determining achievable availability targets. When implementing a private cloud, existing Business Continuity and Disaster Recovery (BC/DR) plans should be re-examined. In a disaster recovery scenario, the recovery environment may be architected to include the standard and high availability environments, though this is not a requirement. The recovery environment may not be dynamic, but rather a more traditional virtualized data center. The virtualized workloads could still be deployed and run in this environment, but there is likely to be a loss of resiliency. The most important aspect of continuity is ensuring that VMs are replicated and can be restarted in the recovery environment. Regularly testing this capability should become part of the existing BC/DR plan.
Capacity Management defines the processes necessary for achieving the perception of infinite capacity. Capacity needs to be managed to meet existing and future peak demands while controlling under-utilization. Business Relationship and Demand Management are key inputs into defining Capacity Management requirements.
The most fundamental aspect of maintaining the perception of infinite capacity is the development and execution of a mature Capacity Plan. Inputs to the plan need to include Reserve Capacity requirements, current capacity usage patterns, business input on projected growth, and an understanding of vendor procurement processes. The more accurate this information is, the more accurate the Capacity Plan will be.

The calculation for determining Reserve Capacity is detailed in the Patterns section of the Principles, Concepts, and Patterns document. The Reserve Capacity is not considered part of the available Resource Pool, as this capacity has been reserved to ensure availability during Fault Domain failure, upgrades, and Resource Decay. Therefore, the Reserve Capacity is subtracted from the Resource Pool, leaving the Total Resource Units.

Initially, the business will provide a projected rate of growth. This projection should be regularly reviewed with the business as part of Business Relationship Management. The projected growth rate is converted into the Projected Consumption Rate, a reflection of how many Available Resource Units are consumed each day. As historical data becomes available, the Projected Consumption Rate should be compared to the Actual Consumption Rate and may need to be altered based on actual consumption.

An agreement will need to be reached with the suppliers on how long it will take to procure a Scale Unit after an order has been submitted. This length of time, along with the time allotted for internal procurement processes and installation, provides the Procurement Duration. Procurement Consumption (the number of additional Available Resource Units that will be consumed during the Procurement Duration) is calculated by multiplying the Procurement Duration by the Actual Consumption Rate. Lastly, the Safety Threshold should be assigned. This is an additional number of units held in reserve to reduce risk.

In a greenfield scenario, there may be no historical data on capacity usage. Similarly, it may be difficult to get accurate growth projections from the business. In these situations, a larger Safety Threshold should be allocated: less information means greater risk, so more investment needs to be made to reduce it. Over time, as historical data becomes available and growth projections become more predictable, the risk diminishes and the size of the Safety Threshold can be lessened.

These inputs are used to calculate the Scale Unit Threshold, the minimum number of Available Resource Units allowed before the procurement process is triggered:

Scale Unit Threshold = (Procurement Duration x Actual Consumption Rate) + Safety Threshold

They also yield the Scale Time Threshold, the number of days remaining before the procurement process is triggered:

Scale Time Threshold = (Available Resource Units - Scale Unit Threshold) / Actual Consumption Rate

The number of Available Resource Units remaining is found by subtracting the Consumed Resource Units from the Total Resource Units.

For example, assume the Resource Pool, after removing Reserve Capacity, can host 500 Consumption Units (as defined in the service catalog). This gives 500 Total Resource Units. Currently 450 units have been consumed, leaving 50 Available Resource Units. The business expects a growth rate of 25 percent this year, meaning an additional 125 VMs will be needed by the end of the year. Dividing 125 VMs by 365 days gives a Consumption Rate of 0.34 (the average number of additional VMs needed per day).
Feeling fairly confident in the projections, we will allocate a Safety Threshold of 10 Units.
It was determined that the Procurement Duration is 60 days. During the 60 days of the procurement process, consumption will grow by 0.34 VMs per day (the Consumption Rate); therefore, the Procurement Consumption is 20.4 Units (60 x 0.34). Applying the formulas above, the Scale Unit Threshold is 30.4 units (20.4 + 10), meaning the procurement process should be initiated once fewer than 31 Available Resource Units remain. The Scale Time Threshold is 56.8 days (calculated with the unrounded Consumption Rate of 125/365), meaning the procurement process needs to be initiated within 56 days.
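The worked example can be reproduced in a few lines of code. Note that the text rounds the Consumption Rate to 0.34 when computing the 20.4 and 30.4 figures but uses the unrounded rate to arrive at 56.8 days; the sketch below uses the unrounded rate throughout, which is why the unit threshold comes out at 30.5:

```python
# Reproducing the capacity example with the formulas above.
total_resource_units = 500
consumed_resource_units = 450
available_units = total_resource_units - consumed_resource_units  # 50

consumption_rate = 125 / 365   # ~0.342 units/day (rounded to 0.34 in the text)
safety_threshold = 10          # units held back because projections carry risk
procurement_duration = 60      # days from order to installed capacity

procurement_consumption = procurement_duration * consumption_rate   # ~20.5 units
scale_unit_threshold = procurement_consumption + safety_threshold   # ~30.5 units
scale_time_threshold = (available_units - scale_unit_threshold) / consumption_rate

print(f"Scale Unit Threshold: {scale_unit_threshold:.1f} units")  # 30.5
print(f"Scale Time Threshold: {scale_time_threshold:.1f} days")   # 56.8
```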
Another important aspect of Capacity Management is determining the appropriate size for Scale Units. Looking specifically at the Compute Scale Unit, the Scale Unit must be at least as large as the Fault Domain in order to maintain the Reserve Capacity equation. For standard data center environments, there will likely be a single UPS per rack; therefore, the Fault Domain, and the minimum Scale Unit size, should probably be the rack as well. However, depending on the size of the data center, the Scale Unit may be multiple racks. While there is no simple formula for determining the appropriately sized Scale Unit, there are a number of factors to consider, as sketched below. If the Consumption Rate is high, the Procurement Duration is long, and the procurement process is initiated frequently, the size of the Scale Unit should be increased. On the other hand, if the Consumption Rate is low and the number of available Resource Units remains high, there is excess capacity sitting unused, and a smaller Scale Unit might be more appropriate.
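Since the passage above describes rules of thumb rather than a formula, the following sketch simply encodes those rules; the numeric thresholds (10 percent of total units, four procurements per year, 30 percent headroom) are illustrative assumptions, not values from this guide:

```python
# A hedged sketch of the Scale Unit sizing heuristic described above.
# The numeric thresholds are assumptions and would need tuning.
def suggest_scale_unit_racks(current_racks: int,
                             consumption_rate: float,     # resource units per day
                             procurement_duration: int,   # days
                             procurements_per_year: int,
                             avg_available_units: float,
                             total_units: float) -> int:
    """Return a suggested Scale Unit size, in racks (minimum one Fault Domain)."""
    # Fast growth, long procurement, or frequent procurement -> larger Scale Unit.
    if (consumption_rate * procurement_duration > 0.1 * total_units
            or procurements_per_year > 4):
        return current_racks + 1
    # Persistently high headroom -> capacity sits unused; shrink if possible.
    if avg_available_units / total_units > 0.3 and current_racks > 1:
        return current_racks - 1
    return current_racks
```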
Information Security Management ensures that the requirements for confidentiality, integrity, and availability of the organization’s assets, information, data, and services are maintained. An organization’s information security policies will drive the architecture, design, and operations of its datacenter. Considerations for resource segmentation and multi-tenancy requirements must be included.
A service provider must provide adequate security countermeasures and mitigation plans in the event of a security issue. This includes having a formalized Security Risk Management process for identifying threats and vulnerabilities. A formal threat analysis and vulnerability assessment should be performed on an ongoing basis as the organization’s risk profile changes with its environment, technology, business ventures, business processes, and procedures. Data classification schemes and information compliance requirements can help determine privacy requirements. These requirements may dictate a tenancy strategy for how Resource Pools are segmented, based on the risk of data leakage.
Availability is one of the primary goals of security. Looking at the problem of availability from a holistic point of view allows for effective deployment of countermeasures. Deploying countermeasures at each layer can add complexity and thereby lower agility. The defense-in-depth strategy is still viable in a datacenter, but it is important to understand how the countermeasures work together from end to end. If too many security measures are built into each component or layer without looking at the problem in its entirety, resiliency can be hampered. For example, an effective countermeasure is to periodically transfer workloads to newly provisioned VMs, minimizing the window in which malware or a misconfigured item can be leveraged as an exploit against the datacenter.
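A sketch of that rotation countermeasure follows, assuming a hypothetical fabric API that can provision a fresh VM from a known-good template, transfer a workload to it, and decommission the old instance; the 30-day interval is likewise an assumption:

```python
# A sketch of the workload-rotation countermeasure. The fabric API and the
# 30-day interval are hypothetical assumptions, not a specific product's API.
from datetime import datetime, timedelta, timezone

MAX_VM_AGE = timedelta(days=30)   # assumed rotation interval

def rotate_stale_vms(fabric):
    now = datetime.now(timezone.utc)
    for vm in fabric.all_vms():
        if now - vm.provisioned_at > MAX_VM_AGE:
            fresh = fabric.provision_from_template(vm.template)  # known-good image
            fabric.transfer_workload(vm, fresh)  # drain work onto the new VM
            fabric.decommission(vm)              # retire the possibly compromised VM
```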
Having a more uniform set of devices to manage makes it easier to standardize configurations and keep them up to date. As the number of device permutations is reduced, Patch Management implementation and testing become easier, timelier, and more reliable.