Infrastructure as a Service (IaaS) refers to an environment that offers dynamically scalable, virtualized computing power and storage resources as a service. This
abstracts low-level hardware, such as servers and storage devices, reducing organizations’ investment in such equipment.
This document is part of a collection of documents that comprise the
Reference Architecture for Private Cloud
document set. The Reference Architecture for Private Cloud documentation is a community collaboration project.
The purpose of this document is to define how Private Cloud Infrastructure as a Service patterns affect Systems Management design.
This guide should be used as an aid to operations, architects, and consultants who are designing processes, procedures, and best practices for a private cloud. The reader should already be familiar with the Microsoft Operations Framework (MOF) and Information
Technology Infrastructure Library (ITIL) models as well as the Private Cloud Principles, Concepts, and Patterns described in this documentation.
The Private Cloud Reference Model is shown here with the Management Layer highlighted. Within the Management Layer are the nine Systems Management Capabilities a private cloud must provide to support Infrastructure as a Service: Service Reporting, Service Management System, Service Health Monitoring, Configuration Management System, Fabric Management, Deployment and Provisioning Management, Data Protection, Network Management, and Security Management.
Figure 1: Private Cloud Reference Model
This document describes these nine Management Layer capabilities and the impact of private cloud IaaS patterns on their planning and design. The patterns are defined in the Private
Cloud Principles, Concepts, and Patterns document and are summarized here:
Service Reporting supports many domains in a private cloud, as it is the way information is delivered to IT and the business. These domains must provide the raw data to Service Reporting.
Effective management of a private cloud is dependent on its Key Performance Indicators (KPIs). These general Service-Level Management reports are typically relevant:
This list omits other typical reporting requirements (for example, Service Desk response time and Change Request response time) that are not specific to a private cloud.
Resource Pools are logical collections of servers, network switches, and storage. The Service Delivery, Operations, and Infrastructure teams need reports in order to manage the life cycle of a Resource Pool. Here is a list of typical Systems Management activities
where Resource Pool reporting is needed:
To support these requirements, the Systems Management tools must understand Resource Pools and keep track of resources in each Resource Pool.
This list omits typical reports produced in a mature Information Technology (IT) organization that are not specific to a private cloud; for example, audit reports for compliance or security.
The Systems Management tools must understand and represent Physical Fault Domains and keep track of resources that are hosted in a Fault Domain. The following reports will help in the management of Fault Domains:
The Systems Management tools must understand and represent Upgrade Domains and maintain a list of resources within an Upgrade Domain. The following reports will help in the management of Upgrade Domains:
Reserve Capacity is likely to be relatively static in each Resource Pool, which means reporting is less critical for Reserve Capacity than for some of the other private cloud patterns. A report on Current Reserve Capacity per Resource Pool typically meets this need.
The Capacity Plan is highly dependent on correct reporting. The following list of reports is a baseline and should be reviewed based on each customer’s approach to capacity planning:
While these requirements could be satisfied through manual collection and manipulation of capacity data (for example, listing capacity and consumption in Microsoft Excel®), the manual approach increases the risk of running out of capacity, violating the Private
Cloud principle of perceived unlimited capacity.
The Systems Management tools should ideally report capacity data in real time.
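As an illustration of the kind of per-pool capacity reporting described above, the following Python sketch computes Current Reserve Capacity per Resource Pool from inventory data. All field and pool names are hypothetical and not drawn from any specific Systems Management tool.

```python
# Hypothetical sketch: report free capacity per Resource Pool and flag
# pools whose consumption has eaten into the defined Reserve Capacity.

def reserve_capacity_report(pools):
    """Return, per pool, the free capacity and whether the reserve is intact."""
    report = {}
    for pool in pools:
        free = pool["total_cores"] - pool["allocated_cores"]
        report[pool["name"]] = {
            "free_cores": free,
            # Reserve is intact only while free capacity covers the reserve.
            "reserve_intact": free >= pool["reserve_cores"],
        }
    return report

pools = [
    {"name": "Pool-A", "total_cores": 512, "allocated_cores": 430, "reserve_cores": 64},
    {"name": "Pool-B", "total_cores": 512, "allocated_cores": 460, "reserve_cores": 64},
]
report = reserve_capacity_report(pools)
```

In this example, Pool-B has only 52 free cores against a 64-core reserve, so it would be flagged for the Capacity Management process.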
The Health Dashboard information for the various consumers of the Health Model should include the following:
CIO and Service Delivery
The Service Delivery team will want to understand how they can improve the services for the business. The following reports can help in this respect:
Reporting in a private cloud may be integral to any billing or cost allocation process. Reporting requirements may define information needed from the infrastructure; for example, information on tenant-based consumption of resources.
The architect must make sure any service reporting capability is multi-tenant aware from the perspectives of both content (reports on tenant consumption) and user (reporting must honor tenant separation).
The Service Management System should be capable of integrating data generated from the entire suite of tools found in the Management Layer. It should also be capable of processing and presenting this data as needed.
At a minimum, the Service Management System should be linked to the Configuration Management System (CMS), which consists of one or more Configuration Management Databases (CMDBs), and should be able to log and track incidents, problems, and changes. It is
also preferred that the Service Management System be integrated with the Service Health Modeling system so incident tickets can be auto-generated as needed.
Compared to traditional infrastructure, a private cloud is highly automated, but this does not eliminate the need to raise a ticket against a change. Adding a machine in response to a load or moving a machine instance to another physical host are automated
activities, yet a ticket must be created. The same applies to automated incident response:
an incident ticket still needs to be logged.
The Service Reporting section above defined many Systems Management activities that need to be reported. All these activities also require automatic ticketing. Some important private-cloud-specific ticketing triggers are:
The Service Management System should understand the concept of Capacity Management and create a ticket when resources are depleted to the defined level. The ticket is used to indicate that the procurement process has been triggered.
The Service Management System should also track Resource Decay and trigger maintenance when the appropriate Resource Decay threshold is met. Crossing that threshold constitutes an incident, so the Service Management System must raise a ticket when it occurs.
If maintenance occurs on a regular schedule instead of when a threshold is met, the Service Management System should create a ticket listing all maintenance work to be undertaken at the next scheduled window. Even when maintenance is regularly scheduled, the
Service Management System still needs to raise a ticket when a resource decay threshold is met.
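The decay-threshold ticketing described above can be sketched as a simple check; this is an illustrative Python sketch under assumed names (the threshold figure, pool names, and ticket fields are hypothetical, not from any specific Service Management System).

```python
# Hypothetical sketch: raise an incident ticket once a Resource Pool's
# decay (failed or degraded hosts) reaches the defined threshold.

def check_resource_decay(pool_name, total_hosts, decayed_hosts,
                         threshold_pct, raise_ticket):
    """Call raise_ticket and return True when decay is at or above threshold."""
    decay_pct = 100.0 * decayed_hosts / total_hosts
    if decay_pct >= threshold_pct:
        raise_ticket(
            category="incident",
            summary=(f"Resource Decay in {pool_name}: "
                     f"{decayed_hosts}/{total_hosts} hosts ({decay_pct:.1f}%) "
                     f"at or above the {threshold_pct}% threshold"),
        )
        return True
    return False

tickets = []
check_resource_decay("Pool-A", total_hosts=100, decayed_hosts=6,
                     threshold_pct=5.0,
                     raise_ticket=lambda **fields: tickets.append(fields))
```

The same check can run on the scheduled-maintenance cadence, with the ticket body listing all hosts due for maintenance at the next window.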
For resiliency, a private cloud must be able to automatically detect a hardware component failure or diminished capacity. This requires an understanding of all the hardware components that work together to deliver a Service, and the relationships between
these components. Monitoring these relationships allows a private cloud to see which machines are affected by a hardware component failure; it allows the private cloud management system to determine if an automated response action is needed to prevent an outage
or restore a failed instance on another system.
From a broader perspective, the management system needs to classify a failure as an incident, as Resource Decay, a Physical Fault Domain failure, or a broad failure that requires the system to trigger the disaster recovery response.
When creating the Health Model, it is important to consider the connections between systems. This includes connections to power, network, and storage. The Architect also needs to consider data access, for example, server connections to the correct logical unit
number (LUN). Finally, the architect must understand how diminished performance may impact the system. For example, if the network is saturated, performance may be affected to the point that the management system moves workloads to new hosts. It is important
to know how to determine both healthy and failed states in a predictable manner.
Physical Fault Domains play a significant role in the Health Model. The monitoring and management systems must understand Fault Domains and the resources that are hosted in each Fault Domain.
The next diagram shows a typical case of system interconnections. In this case, power is a single point of failure, as each server is connected to only one of the two UPSs. Here each UPS represents a Physical Fault Domain, as its failure leads to server failure. Network connections and Fiber Channel connections to the SAN are also Fault Domains, but these connections are redundant.
Figure 2: Connection Diagram with Physical Fault Domains
When “UPS A” fails, it causes a loss of power to Servers 1-4. It also causes a loss of power to “Network A” and “Fiber Channel A”, but because the network and Fiber Channel are redundant, only one Fault Domain fails. The others (UPS B, Network B, and FC B)
are diminished, as they lose their redundancy.
Figure 3: UPS A Fails Causing a Physical Fault Domain Failure
The management system needs to detect the Fault Domain failure (the UPS failure in this example) and migrate or restart workloads on functioning Physical Fault Domains, in this case Fault Domain UPS B.
Figure 4: Management System Relocates Workloads around the Failure
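The relocation step shown in Figures 3 and 4 can be sketched as follows. This is an illustrative Python sketch, not the behavior of any specific management product; the placement structures and host names are hypothetical.

```python
# Hypothetical sketch: when a Physical Fault Domain fails, restart the
# VMs it hosted on spare capacity in surviving Fault Domains.

def relocate_workloads(placements, failed_domain, healthy_slots):
    """placements: {vm: (host, fault_domain)}.
    healthy_slots: spare (host, fault_domain) slots in surviving domains.
    Returns a new placement map with affected VMs moved."""
    spare = list(healthy_slots)
    new_placements = dict(placements)
    for vm, (host, domain) in placements.items():
        if domain == failed_domain:
            if not spare:
                # Reserve Capacity exists precisely to prevent this case.
                raise RuntimeError(f"insufficient Reserve Capacity for {vm}")
            new_placements[vm] = spare.pop(0)
    return new_placements
```

For example, a VM on a server powered by "UPS A" would be restarted on a server in the "UPS B" Fault Domain, drawing on the pool's Reserve Capacity.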
The monitoring system must be able to tell when resources are being taken out of operation for an upgrade, so it can suppress alerts from the resources during the upgrade process. A private cloud achieves this by defining and placing resources in Upgrade Domains.
The Private Cloud Principles, Concepts, and Patterns document provides a method for calculating the level of Reserve Capacity in each Resource Pool. If this formula is adopted,
there will always be sufficient capacity in a Resource Pool to handle all running machine instances. In theory, the health monitoring system shouldn't need to understand or trigger alerts on lowered Reserve Capacity. But architects and operations teams are
likely to be uncomfortable with this. The monitoring system needs to trigger an incident when resources are overcommitted, Resource Decay exceeds the defined figure used in the formula, or more than one Fault Domain fails simultaneously. Therefore, it is recommended
that the health monitoring system understand Reserve Capacity and issue alerts when it is nearly consumed.
The monitoring systems and Health Model aid capacity planning by recording resource consumption trends and alerting operators when capacity triggers are reached.
A private cloud depends on three types of Health Model:
From a health perspective, there are likely to be two categories of Service Classes: stateful and stateless. These two classes of service must be managed differently. For example, a stateless class of virtual machine could be restarted automatically on a
different physical server in the Resource Pool should its host fail. A stateful class may provide clustering support to avoid outages during planned maintenance.
Understanding the long-term health and utilization of Service Classes helps Service Management improve service quality. A Health Model needs to be constructed for each Service Class. This enables Systems Management tools to measure the performance, availability,
and utilization of each Service Class. The same Health Model supports the user views; however, the scope in this instance is limited to the users’ instances of that particular Service Class.
Fabric Management, detailed later in this document, uses the Service Class Health Model in conjunction with defined Resource Pools and Physical Fault Domains.
The Health Monitoring System contributes to the private cloud cost model in a number of ways:
The principle of IT departments taking a “service provider” approach to delivering a private cloud suggests that the CMS and CMDB used for the infrastructure should be separate from the CMS and CMDB used by the consumers. But practical adoption of this approach
depends on such factors as cost (manpower available to manage another CMS, plus software licensing and physical infrastructure costs) and portability (the ability to transfer capabilities to an external supplier).
The CMS will handle two broad categories of changes: those focused on the private cloud infrastructure and those focused on the private cloud service(s). Different people are likely to be involved in these two types of changes; most infrastructure changes should be invisible to consumers, to support the principle of continuous availability, while service changes are likely to be in response to consumer requests. Common infrastructure and service changes are:
Because of the high level of automation and the dynamic nature of a private cloud, its change rate is likely to be considerably higher than a traditional infrastructure. Much of the automation will depend on the CMDB; therefore the CMS as well as CMDB should
be designed with these requirements in mind.
The CMS must handle the test-and-release management process for the private cloud infrastructure. Versioning is very important to the reliability of private cloud automation; a release should encompass the private cloud infrastructure and systems management
functions such as the Fabric Management rules and automation scripts.
Some changes can be rolled across the private cloud infrastructure incrementally; for example, patches and firmware updates. Significant upgrades such as operating system version changes should be done as a version release where thorough testing of the systems
management functions (Fabric Management and automation scripts) can occur. Many standard changes will be automated and the CMS should track these changes.
The CMS needs to have Configuration Items (CIs) for patterns such as Resource Pools, Fault Domains, Upgrade Domains, Service Classes, workloads, and consumer or tenant (owner).
The Resource Pool, Fault Domain, and Upgrade Domain CIs will contain physical hosts, VMs, switches, and storage CIs.
The Upgrade Domain pattern will be used extensively by the CMS. The forward schedule of change can use Upgrade Domains as the scope of change. The definition of Upgrade Domains will vary depending on the resource and the infrastructure design. But Upgrade Domains
should observe these basic rules:
Upgrade Domains should be defined jointly by the infrastructure specialist (for example, the network manager) and the service manager to ensure an appropriate balance among size, risk, and change window.
In traditional IT, servers, network, and storage are managed separately, often on a project-by-project basis. In dynamic IT, and private cloud system management, these resources can be thought of as the fabric upon which compute instances run. They therefore
need to be managed as a unit. In the same way that virtualization abstracts physical hardware, Fabric Management abstracts a service from specific hypervisors and network switches.
Fabric Management can be thought of as an “orchestration engine” responsible for:
The term “orchestration” is fitting because Fabric Management may use other Systems Management tools and scripts to execute changes initiated by Fabric Management.
A traditional virtualized infrastructure focuses on individual VMs, whereas a private cloud focuses on groups of VMs that collectively make up a solution. Take Microsoft Exchange, for example. The private cloud groups the VMs that collectively provide Exchange
(Client Access Server, Hub, and Mailbox) into a workload. Workloads have configurable attributes, among them a scale-up attribute that defines when Fabric Management should automatically add additional VMs to the workload.
Fabric Management has two primary activities:
The VM placement process would be simple if it were based solely on available capacity. However, Fabric Management also uses Service Classes (defined in the Private Cloud Planning Guide for Service Delivery document) to determine VM placement. Service classifications define when VMs must be deployed into separate Physical Fault Domains. Combining workload definitions with service classification enables solutions to be deployed in a resilient fashion. For example, a workload could contain two Exchange CAS servers whose service classification forces deployment of each VM into a separate Physical Fault Domain.
In a traditional infrastructure, this process is handled manually. Fabric Management does it automatically in a private cloud, and ensures that separation is maintained even after the failure of a Physical Fault Domain. Existing VM management tools do not provide
this level of intelligence and are therefore unable to manage VM placement or movement in a private cloud.
To determine VM placement, Fabric Management must keep track of available resources, their quantity, and the quantity of resource allocated to Reserve Capacity. Policies should define whether Fabric Management can exceed the Reserve Capacity threshold on a Resource Pool.
Taking this holistic view of the infrastructure, Fabric Management can respond to the failure of a Physical Fault Domain or an individual server by automatically moving or restarting VMs on working Fault Domains or servers. Note that a movement will trigger
the decision-making process to determine where a VM can be placed based on the workload and service classification.
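The placement decision described above, which combines free capacity with fault-domain separation, can be sketched in Python. This is a minimal illustration under assumed data structures; real Fabric Management tools weigh many more factors (memory, storage, network affinity), and all names here are hypothetical.

```python
# Hypothetical sketch: place a workload's VMs so that, when the service
# classification requires it, no two VMs share a Physical Fault Domain.

def place_workload(vms, hosts, require_separation):
    """hosts: list of dicts with 'name', 'fault_domain', 'free_slots'.
    Returns {vm: host_name}; mutates free_slots as VMs are placed."""
    placements = {}
    used_domains = set()
    for vm in vms:
        candidates = [h for h in hosts if h["free_slots"] > 0]
        if require_separation:
            candidates = [h for h in candidates
                          if h["fault_domain"] not in used_domains]
        if not candidates:
            raise RuntimeError(f"no valid host for {vm}")
        # Simple heuristic: favor the host with the most free capacity.
        host = max(candidates, key=lambda h: h["free_slots"])
        host["free_slots"] -= 1
        used_domains.add(host["fault_domain"])
        placements[vm] = host["name"]
    return placements

hosts = [
    {"name": "host1", "fault_domain": "FD-A", "free_slots": 4},
    {"name": "host2", "fault_domain": "FD-A", "free_slots": 5},
    {"name": "host3", "fault_domain": "FD-B", "free_slots": 3},
]
placements = place_workload(["cas1", "cas2"], hosts, require_separation=True)
```

With separation required, the two CAS servers land in different Fault Domains even though a single domain has enough free capacity for both.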
To be able to move VMs around and still deliver a consistent service to consumers, Fabric Management depends upon pools of resources. Ideally, every server in a Resource Pool is identical, so any VM can be hosted on any server in the pool and perform the same
way. Fabric Management must understand the concept of Upgrade Domains, so it can move VMs off hosts when an Upgrade Domain is activated.
It should be obvious that Fabric Management is highly dependent on the Fabric Health Model. The richer the health information provided by monitoring systems, the greater the effectiveness of Fabric Management. For example, if systems monitoring detects impending
failures, Fabric Management can proactively move affected VMs to avoid an outage.
Fabric Management is also responsible for the dynamic scaling of workloads. Attributes defined in the Service Classification or Workload Configuration determine when to scale a workload up or down and by how much. Triggers can be based on load or time; for example, increase scale whenever CPU utilization exceeds a certain threshold, or at 21:00 Coordinated Universal Time (UTC) every Friday. This capability allows a private cloud to respond to business needs rapidly or on a schedule, allocating more resources to compute-intensive workloads when needed.
The architect must determine if and how to support dynamic scaling, because the choice impacts systems monitoring, service classification, and Fabric Management.
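A load-based or time-based scaling trigger of the kind described above might look like the following Python sketch. The threshold, step sizes, and schedule are hypothetical workload-configuration values, not defaults of any real product.

```python
# Hypothetical sketch: decide whether to scale a workload, using either a
# CPU-load trigger or a scheduled (weekday/hour, UTC) trigger.
from datetime import datetime, timezone

def scale_decision(cpu_pct, now, cpu_threshold=80.0,
                   scheduled=(4, 21)):  # (weekday=Friday, hour=21:00 UTC)
    """Return the number of VM instances to add (0 = no action)."""
    if cpu_pct > cpu_threshold:
        return 2  # load-driven step size, a workload-configurable attribute
    weekday, hour = scheduled
    if now.weekday() == weekday and now.hour == hour:
        return 1  # scheduled step size
    return 0
```

Fabric Management would evaluate such a function per workload on each monitoring cycle and hand any nonzero result to the provisioning tools.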
Upgrades in a traditional infrastructure are largely manual. The resources to be upgraded are identified and taken out of service before an upgrade is implemented. Fabric Management orchestrates this process by understanding what physical resources are allocated
to an Upgrade Domain.
A homogeneous fabric enables one set of scripts to be used on all resources of the same type and avoids both complex calculations and dependence on hypervisor configuration to determine how much CPU is allocated to individual VMs. Over the life of a private cloud, however, it is perhaps unrealistic to assume that the fabric will remain homogeneous. The architect must consider this when designing Fabric Management and the lower-level Systems Management automation. One solution is to partition a private cloud into Resource Pools, each of which has a homogeneous fabric.
Deployment and Provisioning Management is the toolset used to carry out the processes defined by Release and Deployment Management in the Operations Layer.
Deployment and Provisioning Management is responsible for bare metal deployment of the hypervisor and the provisioning of VMs with service-specific, instance-specific, workload-specific configuration. It is also responsible for deployment of patches and system
updates to the fabric.
Deployment can be defined as the installation and configuration of the software that makes up the fabric, and provisioning as the creation of VMs and the configuration changes to the fabric needed for those new VMs to function. As stated earlier, Fabric Management initiates and orchestrates these fabric changes; the provisioning activities themselves are relatively simple and straightforward: create, move, start, or stop a VM; create a virtual IP in a load balancer; add or remove hosts from a load-balanced farm; and add or expand a LUN. Most provisioning activities will be orchestrated by Fabric Management.
All these activities must be mechanized (scripted), otherwise a private cloud will not be able to scale or meet future agility needs.
Because the provisioning tools are called by Fabric Management, they do not need to know Resource Pools, Upgrade Domains, or Physical Fault Domains. The deployment tools must understand both Resource Pools and Physical Fault Domains to be able to identify which
Resource Pool and Physical Fault Domain the server is being deployed into. Consequently, the deployment and provisioning process must be integrated with the CMDB (though not necessarily the tools; they can simply execute instructions issued by Fabric Management).
Deployment and Provisioning Management tools must be integrated with the Health Model because any deployment or provisioning failure must be exposed to the health monitoring tools.
The provisioning process (not necessarily the tools) must understand each Service Class, as this information defines the VM configuration.
Data Protection provides the capability to back up and restore data, hosts, infrastructure, and service components. Because a private cloud treats machines as black boxes, the architect must choose between making the consumer responsible for backup and providing a Data Protection capability as part of the Service Classification.
For the purposes of this document, it is assumed that a private cloud will not provide backup of machines, as this capability is tightly coupled with workloads. This section defines Data Protection for the private cloud itself. A private cloud is highly dependent on its systems management functions; therefore, these functions should be highly available so that a failure does not cause the private cloud itself to fail. Consequently, Data Protection as defined here is designed to protect against a disaster rather than provide a snapshot of the private cloud at a point in time.
The fabric in a virtualized environment, that is, the hosts running the hypervisor or parent partition, should not need Data Protection as they don't host data and can be rebuilt automatically through bare metal deployment.
Data Protection for a private cloud focuses on fabric configuration and systems management functions, all of which need protection. The particular method for backing up these management functions will be defined by the management tools vendor; however, a good
principle is to back up the primary instance of each management server and its corresponding database. These backups should be kept offsite to enable recovery to a different facility.
Network switch configuration and storage array configuration both need protection through their respective management software, again utilizing offsite storage.
The data protection infrastructure, like any backup infrastructure, needs to be monitored to make sure it is healthy.
A private cloud depends on network switches, virtual local area networks (VLANs), and, optionally, load balancers to deliver a service. Management of these resources must be integrated with the private cloud in such a way that “business as usual” changes are mechanized, allowing Fabric Management to make automated infrastructure changes.
Each switch should be allocated to a Resource Pool along with compute resources. When defining Service Classes, the architect must decide whether separate VLANs are defined for each consumer and workload. This decision also impacts whether VLANs exist globally
in all switches, or are deployed only to the switch which hosts the consumer or the workload.
When defining Service Classes, the architect may choose to create a category of service based on a network characteristic such as high throughput. In such a case, the architect could choose a lower machine density per switch, or create one or more dedicated
Resource Pools with high-speed network interconnects. High-end network switches and network adaptors must be factored into the Service Class Cost Model.
The Upgrade Domain pattern applies to network hardware. Switches and load balancers need patching and operating system upgrades. The network should be fully redundant; therefore, network upgrades could be accomplished by placing one switch from each redundant
pair into an Upgrade Domain. For example, switch Upgrade Domain A may contain all (or a subset of) primary network switches while Upgrade Domain B contains all (or a subset of) secondary switches.
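The pairing rule described above can be sketched as follows; the switch names are hypothetical, and real designs may split large switch populations into more than two Upgrade Domains.

```python
# Hypothetical sketch: place one switch from each redundant pair into each
# of two Upgrade Domains, so upgrading a domain never takes down both
# halves of any pair.

def switch_upgrade_domains(switch_pairs):
    """switch_pairs: list of (primary, secondary) switch names."""
    return {
        "UpgradeDomain-A": [primary for primary, _ in switch_pairs],
        "UpgradeDomain-B": [secondary for _, secondary in switch_pairs],
    }

domains = switch_upgrade_domains([("core1-a", "core1-b"),
                                  ("tor1-a", "tor1-b")])
```

Upgrading Domain A then leaves every secondary switch in service, preserving connectivity throughout the change window.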
Network capacity planning is concerned with physical capacity (ports) and logical capacity (throughput, number of VLANs supported, and spanning tree support). Physical capacity is planned through Scale Units. Defining the number of servers and corresponding
network ports in each Scale Unit makes it easy to plan the physical capacity of the network. Logical capacity can also be defined as part of the Scale Unit. However, network management tools should monitor switch utilization and trigger a health alert if one or more machines saturate the switch. Further, the switches or the Network Management tools should display health information as part of the overall Health Model.
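Planning physical port capacity through Scale Units, as described above, reduces to simple arithmetic. The figures below (servers per Scale Unit, ports per server, uplinks, switch port counts) are illustrative assumptions, not recommendations.

```python
# Hypothetical sketch: derive switch port and switch counts from the
# Scale Unit definition.

def ports_needed(scale_units, servers_per_unit, ports_per_server,
                 uplinks_per_unit):
    """Total switch ports required for a given number of Scale Units."""
    server_ports = scale_units * servers_per_unit * ports_per_server
    uplink_ports = scale_units * uplinks_per_unit
    return server_ports + uplink_ports

def switches_needed(total_ports, ports_per_switch, redundant=True):
    """Switch count, doubled when every connection is redundantly paired."""
    base = -(-total_ports // ports_per_switch)  # ceiling division
    return base * 2 if redundant else base
```

For example, two Scale Units of 16 servers with 4 ports each plus 8 uplinks per unit need 144 ports, or six 48-port switches when fully redundant.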
Security Management provides the capabilities needed to make sure both the private cloud infrastructure and the services it provides are secure.
In addition to many of the standard security requirements such as authentication, access control, monitoring, and reporting, a private cloud introduces the following new requirements:
Multi-tenancy: This may demand that tenants be separated from one another. Such a requirement drives the security design; for example, are VLANs and VMs sufficient, or is physical separation needed?
A private cloud could provide services differentiated by their security design; for example, shared (multi-tenant) or dedicated infrastructure. Each of these results in a separate service classification and is likely to be implemented separately.
Compliance in a Dynamic Infrastructure: Compliance is a challenging area because it lacks prescriptive guidance. For example, one auditor may accept designs that show virtual separation between tenants or workloads, whereas another demands physical separation. The Private Cloud Systems Management design faces similar issues: separating Private Cloud Systems Management from that of the workload prevents information spillage, but an auditor may still not accept it.
Most auditors are likely to want VM movements logged so they can see on which physical host a VM was resident at any point in time. Further, they may want to see what other VMs were sharing the host.
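An append-only movement log of the kind auditors expect can be sketched in Python. The record fields and host names are hypothetical; a production system would write to durable, tamper-evident storage rather than an in-memory list.

```python
# Hypothetical sketch: record every VM movement so an auditor can
# reconstruct which physical host a VM occupied at any point in time.
from datetime import datetime, timezone

audit_log = []  # append-only movement records

def log_move(vm, src_host, dst_host, reason):
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "vm": vm, "from": src_host, "to": dst_host, "reason": reason,
    })

def host_of(vm, as_of):
    """Replay the log to find where a VM resided at a given ISO timestamp."""
    host = None
    for rec in audit_log:
        if rec["vm"] == vm and rec["time"] <= as_of:
            host = rec["to"]
    return host
```

Joining the log against itself by host and overlapping time ranges would answer the follow-on audit question of which other VMs shared a host.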
The security requirements facing the business should be considered early in the design. Security requirements, if they drive physical separation, have the potential to compromise some private cloud concepts of optimized resource usage. A private cloud is designed
to provide a limited set of standard services and security should influence but not drive the design of these services. Applications which have strict or unique security requirements are better implemented on a traditional infrastructure.