This article describes monitoring in a private cloud. Monitoring is a toolset capability that captures instrumentation from various sources and stores that data in a data store for analysis.

Instrumentation is emitted by the Management and Operations Layers of the Private Cloud Reference Architecture throughout the normal operation of each of these layers' components. Instrumentation is also emitted by the resources instantiated on the private cloud infrastructure. Throughout the lifecycle of a private cloud, there will be significant amounts of instrumentation that communicate the state of a management operation or resource, the range or value of a property, or the step or relative position of a management operation or code execution.

This article also defines manageable applications and explains the benefits of manageable applications to operators, developers, and architects. It also defines a high-level process for designing, developing, deploying, and operating manageable applications.

This article should be of use primarily to solutions architects and infrastructure architects. However, it also provides useful background information to developers and operators.

This document is part of a collection of documents that comprise the Reference Architecture for Private Cloud document set. The Reference Architecture for Private Cloud documentation is a community collaboration project. Please feel free to edit this document to improve its quality. If you would like to be recognized for your work on improving this article, please include your name and any contact information you wish to share at the bottom of this page.

This article is no longer being updated by the Microsoft team that originally published it.  It remains online for the community to update, if desired.  Current documents from Microsoft that help you plan for cloud solutions with Microsoft products are found at the TechNet Library Solutions or Cloud and Datacenter Solutions pages.

1 Understanding Manageable Applications

Hardware and software costs form only a small percentage of the total cost of ownership (TCO) for enterprise applications running on the private cloud. Over time, the costs of managing, maintaining, and supporting those applications are far more significant.

A large portion of day-to-day running costs is attributable to application failures, performance degradation, intermittent faults, and operator error. The resultant downtime can severely impact business processes throughout an organization.

Many of these problems can be mitigated by ensuring that the enterprise applications are designed to be manageable. As a minimum, a manageable application must meet the following criteria:
  • It is compatible with the target deployment environment.
  • It works well with operational fabric management tools and processes.
  • It provides visibility into the health of the application.
  • It is dynamically configurable at run time.

Manageable applications make day-to-day operations a more predictable, efficient process. However, the benefits of manageable applications are not restricted to the operations team. With many current applications, when a problem occurs, the operator attempts to diagnose it and may solve it by either modifying the configuration of the application or modifying the system at a lower level (for example, by making changes to the operating system, the hardware, or the network).

If the operator is unable to diagnose or fix an application problem, the operator may have to report it to the development team so a fix can be produced. One of the main reasons this happens is insufficient or irrelevant instrumentation. If architects and developers create manageable applications, they can reduce the number of times they are called upon to fix problems through additional development.

Similarly, when fabric management orchestration is developed, the author must instrument each management operation as the orchestration steps through it, providing sufficient context about the state of the management operation and the state of the resource that is acted upon.

This article demonstrates how to understand applications from different perspectives and describes how knowledge of the operations perspective can lead to applications that are designed to be manageable.

1.1 Application Perspective

Depending on their relationship to an application, different people in an organization will have a different perspective about an application. The different perspectives include the following:

  • User. The user perspective can be thought of as the consumer of the application. From the user perspective, an application is responsible for meeting user requirements. Requirements such as security, performance, and availability are typically defined in a service-level agreement (SLA).
  • Operator. The operator perspective can be thought of as the facilitator of the application. From the operator perspective, the application must be provided to the user, according to the requirements of the application SLA. The operator is responsible for ensuring that the requirements of the user are being met and taking appropriate action if they are not being met. Appropriate action includes troubleshooting problems, providing the user with feedback, and providing the developer with feedback that may lead to further development.
  • Developer. The developer perspective can be thought of as the creator of the application. From the developer perspective, the application must be designed and built to meet the needs defined by the user. However, when creating manageable applications, the developer perspective should also capture the needs of the operator and the tasks the operator must perform.

Each of these perspectives is held by multiple job roles, all of whom should be involved in developing and consuming a manageable application. For example, the developer perspective will typically be held by one or more architect roles, along with the application developers.

1.2 Operating Business Applications

Before developing manageable applications, it is important to understand the challenges that operations typically face when managing applications.

Operations consists of a series of interrelated tasks, including the following:

  • Monitoring applications and services
  • Starting and stopping applications and services
  • Detecting and resolving failures
  • Monitoring performance
  • Monitoring security
  • Performing change and configuration management
  • Protecting data

The operations layer is responsible for ensuring day-to-day availability of the application, yet it is often provided with applications that are difficult to manage effectively. This often results in a number of problems, including the following:

  • An inability to determine the consequences of problems when they occur
  • Insufficient run-time configurability of applications
  • Poor understanding of interdependencies between the hardware and software elements that make up a system
  • Poorly designed administration tools that do not reflect the way the IT administrator views the application
  • Changes in one part of a system creating significant impact on the overall environment. The intent of the administrator and the dependencies among the various components often cannot be determined by looking at how the resources were deployed in the environment.
  • IT administrators providing the only points of integration across different subsystems. System configuration rules often reside only in someone's head. Typically, there are no formal records of either the configuration itself or of the changes that have been made to it.
  • Social processes being responsible for achieving coordination of systems. Administrators have hallway conversations, send e-mail, or write on sticky notes to remind each other of issues, changes, and so on.

These problems reduce the operations team's efficiency in managing the application and can ultimately affect the experience of the users consuming the application.

To solve these problems, the work of the operations team needs to be considered throughout application design, development, test, and deployment. In many cases, this will be an iterative process. For example, the experience gained from the day-to-day operation of the system should guide improvements to the application design over time. With manageable applications, it is generally easier to transfer system knowledge between all phases of the IT life cycle.

1.3 Application Dependencies

From an operations perspective, applications always execute on a platform and generally communicate over a network. Applications are dependent on their own underlying system and network layers, but they may also communicate with, and be dependent on, other applications and services.

Operators collect information that corresponds to each of these layers, using the information to ensure that applications continue to run smoothly. Understanding each layer as a separate entity, and understanding the relationships between the layers, often allows the operations team to quickly isolate the source of any problem.

For example, if a computer running a SQL database that provides data to an application becomes unavailable, the functionality of the application could be affected. In this situation, the operator needs to know several things:

  • What has caused the SQL Server to become unavailable? Typically, this is exposed in the form of instrumentation at the system tier and network tier. For example, the computer running SQL Server may have shut down or a network cable may have been removed.
  • What are the consequences to the application? Typically, this is exposed in the form of instrumentation at the application tiers. For example, some functionality of the application may be lost or performance of the application may be affected.
  • What are the consequences to the business operations of the company? Typically, this can be exposed in the form of instrumentation at the application business logic tier and may depend on factors outside the application itself. For example, if a business operation that occurs once a month is affected, and the problem occurs when there are 25 days before the operation occurs again, the problem is less critical than if the operation must occur every day.
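The second question above (what are the consequences to the application?) can be answered mechanically once dependencies between entities are recorded. The following is a minimal sketch, not a description of any particular management tool. The entity names and the `DEPENDS_ON` map are hypothetical and exist only for illustration.

```python
# Hypothetical dependency map: each entity lists the entities it depends on.
DEPENDS_ON = {
    "order-entry-app": ["sql-server", "network"],
    "reporting-app": ["sql-server"],
    "sql-server": ["db-host", "network"],
    "db-host": [],
    "network": [],
}


def affected_by(failed, depends_on):
    """Return every entity that directly or indirectly depends on `failed`."""
    affected = set()
    changed = True
    while changed:  # repeat until no new dependents are discovered
        changed = False
        for entity, deps in depends_on.items():
            if entity == failed or entity in affected:
                continue
            if any(dep == failed or dep in affected for dep in deps):
                affected.add(entity)
                changed = True
    return affected


impacted = affected_by("db-host", DEPENDS_ON)
print(sorted(impacted))
```

With a map like this, an alert raised at the system tier can be presented to the operator together with the applications it puts at risk.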

Typically, developers are not concerned with the details of the lower layers. However, an architect who is designing for operations should have a greater awareness of these details, because issues at a lower level can lead to problems with the health of the application itself.

1.4 Core Characteristics for Designing Manageable Applications

If you are going to design manageable business applications, you must consider manageability as an integral part of the initial design of the application; it should not be just an afterthought. Manageability should also be refined and improved through feedback from the operations team after you get better insight about how applications behave after deployment. The process of designing manageable applications is the result of collaboration between multiple parties who must agree to a number of core principles, including the following:

  • Applications will provide comprehensive, configurable instrumentation that is relevant to the IT team. Instrumentation is a very important tool that helps you understand how an application functions and whether it is functioning as expected. Instrumentation can also form the basis for determining the resolution to problems.
  • Applications will have a health state that varies according to their ability to perform operations as expected. A healthy application is an application that is performing as expected. By setting certain parameters for an application, and measuring whether the application is functioning within those parameters, you can determine the health of an application and take corrective measures when an application is unhealthy. For more information about application health, see "Creating Effective Management Models."
  • Application development must remain independent of the underlying platform. Problems with the underlying platform can affect the health of an application (for example, a DNS issue may prevent an application from functioning as expected), and it is often necessary to capture these dependencies in tools. However, this should not prevent developers from using the proven practice of developing applications that are independent of the underlying platform.
  • Applications will be managed according to proven practices. Operations teams currently use a series of practices to manage applications. These practices are determined by experience and the capabilities and limitations of the available management tools. Manageable applications should provide an operations experience similar to the best examples of current manageable server applications.
  • Operations will use existing standard management tools to manage applications. There are many existing platform tools available for operating applications, such as Event Viewers and Performance Logs and Alerts. Creating new tooling for managing applications further increases the operations team's workload, so wherever effective existing tooling is available, it should be used.
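The health-state principle above (set parameters, then measure whether the application operates within them) can be sketched as follows. This is an illustrative example only. The metric names, the threshold values, and the three-state `Health` enumeration are assumptions, not part of the reference architecture.

```python
from enum import Enum


class Health(Enum):
    GREEN = "healthy"
    YELLOW = "degraded"
    RED = "unhealthy"


# Hypothetical parameters: (warning, critical) upper bounds for each metric.
THRESHOLDS = {
    "avg_response_ms": (500, 2000),
    "error_rate_pct": (1.0, 5.0),
}


def evaluate_health(metrics):
    """Compare measured values with the thresholds; the worst state wins."""
    state = Health.GREEN
    for name, (warn, crit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported; a real health model must decide how to treat this
        if value >= crit:
            return Health.RED
        if value >= warn:
            state = Health.YELLOW
    return state


print(evaluate_health({"avg_response_ms": 800, "error_rate_pct": 0.2}))
```

Keeping the thresholds in data rather than in code is one way to let the operations team refine them after deployment without a new release.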

This list of core principles is not comprehensive. In many cases, additional core principles will be established to cover areas such as task management and configuration management.

1.5 Operations Challenges

A number of the problems with product shipping faced by Northern Electronics stem from the existing product shipping application. The operations team for this application faces the following challenges:

  • They rely on users to detect and report faults. Sometimes, users cannot provide sufficient or accurate information; this makes diagnosis and resolution of faults difficult, costly, and time-consuming.
  • They may have to visit the computer to investigate issues. The information they receive or can extract from the event logs or performance counters may not provide the appropriate data required to resolve the fault.
  • They cannot easily detect some problems early. These problems include impending failure of a connection to a remote service caused by a failing network connection or lack of disk space on the server. They are unlikely to monitor performance counters and event logs continuously and, instead, use them solely as a source of information for diagnosing faults.
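Early detection of a condition such as low disk space does not require continuous manual monitoring. As a minimal sketch, a scheduled check can classify free space against thresholds and raise an alert before a failure occurs. The function names and the 20 percent and 10 percent thresholds below are assumptions chosen for illustration.

```python
import shutil


def classify_free_space(free_bytes, total_bytes, warn_pct=20.0, crit_pct=10.0):
    """Map the percentage of free space to a simple state string."""
    if total_bytes <= 0:
        return "critical"  # nothing usable was reported for this volume
    free_pct = 100.0 * free_bytes / total_bytes
    if free_pct <= crit_pct:
        return "critical"
    if free_pct <= warn_pct:
        return "warning"
    return "ok"


def disk_space_state(path, warn_pct=20.0, crit_pct=10.0):
    """Check the volume that contains `path` using the standard library."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (bytes)
    return classify_free_space(usage.free, usage.total, warn_pct, crit_pct)


print(disk_space_state("."))
```

In practice the result would be written to whatever instrumentation channel the operations tools already consume, rather than printed.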

1.6 Development Challenges

The solutions architect is committed to making the new product shipping solution a manageable application. However, he faces several challenges in achieving this goal:

  • The development team has no experience in developing manageable applications, and there is no budget for using external developer resources.
  • The organization is planning to modify the design of its infrastructure, and these plans are currently not finalized.
  • The organization is planning to migrate to a Private Cloud.

The solutions architect plans to use a management model for the application to help him overcome these challenges.

1.7 Providing Infrastructure as a Service

This section examined the different perspectives that interact with an application and focused more closely on the operations perspective, which must be well understood to design manageable applications. It introduced some core characteristics that should be followed when designing manageable applications.

Providing Infrastructure as a Service depends on fabric management tools and orchestration that have been designed to be highly manageable. Architects developing applications that integrate with private cloud fabric management must ensure that those applications exhibit these highly manageable characteristics, so that the Infrastructure as a Service capability remains predictable.

These characteristics extend up through the Reference Model into the Platform and Software as a Service Layers.

2 A High-Level Process for Manageable Applications

The high-level process for manageable applications defines four interconnected stages that capture the application through design, development, deployment, and operations, as shown in Figure 1.

Figure 1: High-level process for manageable applications

This section describes each stage and demonstrates how the stages are used together in manageable applications. As illustrated in Figure 1, the stages are the following:

  • Design. A management model is used to define how the application will function in operations. The management model captures, at an abstract level, the entities that make up the application, the dependencies between them, the deployment model for the application, and an abstract representation of the health and instrumentation in the application.
  • Develop. A manageable application will include extensive health and instrumentation artifacts represented in the management model. Information contained in the management model is used to help determine the specifics of the health and instrumentation implementation. Instrumentation will include event IDs, performance counters, categories, and messages. The application may also perform additional health checks, such as synthetic transactions.
  • Deploy. After the application is developed, it must be deployed. The infrastructure model (defined as part of the management model) for the application affects the specific environment that the application runs in, which in turn, affects the health and instrumentation technologies that can be used. For example, an application deployed in a low trust environment may not be able to log to a Windows Event Log.
  • Operate. After the application is deployed, it must be operated on a day-to-day basis. Typically, the operations team uses management tools to consume the health and instrumentation information provided by the application in daily operations and makes necessary changes to application configuration.
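The Develop stage mentions synthetic transactions as an additional health check. The following is a minimal sketch of the idea: run a representative operation, time it, and record whether it succeeded within a limit. The `ProbeResult` type, the function name, and the time limit are hypothetical.

```python
import time
from dataclasses import dataclass


@dataclass
class ProbeResult:
    name: str
    succeeded: bool
    elapsed_ms: float
    detail: str


def run_synthetic_transaction(name, transaction, max_ms):
    """Run `transaction` (any callable) and report success, failure, or slowness."""
    start = time.perf_counter()
    try:
        transaction()
    except Exception as exc:  # any failure marks the probe as unsuccessful
        elapsed_ms = (time.perf_counter() - start) * 1000
        return ProbeResult(name, False, elapsed_ms, f"failed: {exc}")
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms > max_ms:
        return ProbeResult(name, False, elapsed_ms, "completed but exceeded time limit")
    return ProbeResult(name, True, elapsed_ms, "ok")


def place_test_order():
    """Stand-in for a real end-to-end operation against the application."""
    return sum(range(1000))


result = run_synthetic_transaction("place-test-order", place_test_order, max_ms=5000)
print(result.succeeded, result.detail)
```

The result of each probe would normally be emitted as an event or counter so that existing management tools can consume it.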

The order of these stages is important. Adding the appropriate instrumentation to an application on an as-needed basis at the end of the development process (or, even worse, after completing testing and deployment) is unlikely to produce a manageable application. However, in many cases, feedback during the cycle leads to further development of the management model.

2.1 Roles Participating in the High-Level Process

The following four roles are primarily involved in the high-level process:

  • Solutions architect. The solutions architect is responsible for defining the application at the logical level. This involves determining how the application should be structured, how health can be determined for the application (in an abstract sense), and the instrumentation that is necessary to make that determination. To help define the various manageability requirements of an application, the solutions architect should create a management model; typically, this is created in collaboration with the infrastructure architect.
  • Developer. The developer is responsible for consuming the model created by the solutions architect and creating the application, along with appropriate health, instrumentation, and configuration artifacts, as defined in the model.
  • Infrastructure architect. The infrastructure architect is responsible for specifying the environment in which the application will run. This information may be specified in an infrastructure model, which may affect decisions made by the solutions architect (for example, the trust environment into which the application will be deployed). The infrastructure architect must also ensure that the application can be deployed in the environment; if it cannot be deployed in the environment, the infrastructure architect must ensure that the appropriate changes are made to the application or the environment.
  • Operator. The operator is responsible for the ongoing running of the application and responds to application and system alerts using a variety of operations tools. The operator may also adjust run-time configuration of the application in response to certain events.

Figure 2: High-level Process Showing Job Roles

Many additional job roles participate at some point in the life cycle of a manageable application. The following table lists these roles and the perspectives that they would hold on the application. For more information about application perspectives, see "Understanding Manageable Applications."

Role | Perspective | Description
User | User | Uses the application.
User Product Manager | User | Defines user needs and required features of the application. Works with the solutions architect and infrastructure architect to define the service-level agreement (SLA) for the application.
Helpdesk | User or Operator | Responds to user problems. Works with operations to troubleshoot application problems. Records information that will assist operations and future development.
User Education | Developer | Responsible for content in error messages, events, and Help files.
Test | Developer | Provides feedback to the developer during development cycles.


2.2 Understanding the Process

To understand how manageable applications are designed, implemented, deployed, and managed, it is important to look at the process in more detail.

2.2.1 Designing the Manageable Application

The management model forms the starting point for a manageable application. One of the great challenges of creating manageable applications is determining, at design time, the needs for the application in daily operations. By investing time in creating an effective management model early, you can dramatically increase the likelihood that your application is manageable later.

Creating a management model for the application does not prevent you from using an iterative approach when designing your application—the model should be flexible enough to be altered as changes occur in later iterations.

Typically, the infrastructure architect and the solutions architect are the main roles involved in creating a management model. The infrastructure architect provides input about the environment in which the application will be deployed, which may include factors such as network connectivity, network zones, and allowed protocols. This information is critical to the overall design, because it can affect the way instrumentation will be implemented in the application. For example, if the application is to be deployed in a low-trust environment, it is typically not possible to write events to an event log. In some cases (for example, for a shrink-wrapped application), it may not be possible to determine in advance what the deployment environment will be, so multiple trust levels may have to be supported.

Generally, the solutions architect is responsible for the specifics of the management model. The management model defines how the application is broken into manageable operational units (known as managed entities). It also contains abstract information about the application, which defines how the application is developed, deployed, and, ultimately, how it is managed. This information includes an instrumentation model, which indicates all the instrumentation points for the application, and a health model, which indicates the various health states for the application.
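To make the structure of a management model more concrete, the following sketch represents managed entities, their dependencies, an instrumentation model, and a health model as plain data classes. It is an assumption-laden illustration, not a format defined by the reference architecture. All class names, field names, and sample values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class InstrumentationPoint:
    event_id: int
    kind: str  # for example "event", "performance_counter", or "trace"
    message: str


@dataclass
class ManagedEntity:
    name: str
    depends_on: list = field(default_factory=list)
    instrumentation: list = field(default_factory=list)
    health_states: list = field(default_factory=lambda: ["green", "yellow", "red"])


@dataclass
class ManagementModel:
    application: str
    entities: dict = field(default_factory=dict)

    def add(self, entity):
        self.entities[entity.name] = entity

    def undefined_dependencies(self):
        """Names that entities depend on but that the model does not define."""
        missing = set()
        for entity in self.entities.values():
            missing.update(d for d in entity.depends_on if d not in self.entities)
        return missing


model = ManagementModel("product-shipping")
model.add(ManagedEntity("database"))
model.add(ManagedEntity(
    "shipping-service",
    depends_on=["database", "carrier-gateway"],
    instrumentation=[InstrumentationPoint(1001, "event", "Order shipped")],
))
print(model.undefined_dependencies())
```

A check such as `undefined_dependencies` is one example of validation that becomes possible at design time once the model is captured as data.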

2.2.2 Developing the Manageable Application

After an effective management model is created for the application, the application itself needs to be developed, and the information contained in the model must be incorporated. The developer is responsible for taking the abstract elements in the model and generating concrete artifacts in the code. In particular, the developer typically incorporates specific instrumentation, such as the following:

  • Event log events
  • Performance counters
  • Event traces

The developer may also need to incorporate specific health indicators, which are used to determine the health of an application, and configurability support, which is used to modify what instrumentation is used at run time.
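As a hedged illustration of instrumentation that is configurable at run time, the sketch below uses Python's standard `logging` module in place of a platform event log and a simple counter in place of a performance counter. The logger name, the event ID 1001, and the helper names are assumptions made for the example.

```python
import logging
from collections import Counter


class ListHandler(logging.Handler):
    """Collects formatted records in memory so the example needs no external sink."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


logger = logging.getLogger("shipping_app")  # hypothetical application name
logger.setLevel(logging.WARNING)  # informational events are off by default
logger.propagate = False
handler = ListHandler()
handler.setFormatter(logging.Formatter("%(levelname)s event_id=%(event_id)s %(message)s"))
logger.addHandler(handler)

counters = Counter()  # stand-in for performance counters


def ship_order(order_id):
    counters["orders_shipped"] += 1
    logger.info("order %s shipped", order_id, extra={"event_id": 1001})


def set_verbosity(level_name):
    """Run-time configuration hook an operator could call, for example 'INFO'."""
    logger.setLevel(level_name)  # setLevel accepts a level name string


ship_order("A-1")       # counted, but the event is suppressed at WARNING
set_verbosity("INFO")   # operator raises verbosity without a restart
ship_order("A-2")       # counted and recorded
print(handler.lines, counters["orders_shipped"])
```

The counter is always updated, while the volume of event output follows the level that operations selects, which is the kind of control the configurability support is meant to provide.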

2.2.3 Deploying the Manageable Application

After the application is developed, it must be deployed (for simplicity, testing is intentionally omitted from this process). For an application to be truly manageable, you should have a high degree of control over the deployment of that application. This allows you to more easily manage the process of changes to the application. Also, during deployment, specific configuration settings for the application may be chosen.

2.2.4 Operating the Manageable Application

After the application is deployed to its target environment, it must be operated. The operator is responsible for managing the application, using an administrative console and supporting tools such as Event Viewer and Performance Logs and Alerts. The operator may also use the more advanced fabric management tooling available on the private cloud platform to manage the application.

In many cases, information contained in the management model can be consumed at run time by operations. This may be as simple as the operator using a report generated from the original model to understand the workings of the application, or the information may be exported from the original management model and imported into the Private Cloud Management and Operations Layers.

2.3 Summary

This section examined a high-level process for designing, developing, deploying, and operating manageable applications. It examined the roles that participate in that process and the responsibilities that each role holds. It also examined the artifacts that are available to facilitate the process of designing manageable applications.

3 Platform Monitoring

The previous sections discussed manageable applications and the monitoring of those applications. They also described a process for developing applications with manageable characteristics.

This section expands the scope to include platform monitoring and the activities involved in creating highly manageable private cloud management orchestration, that is, orchestration that provides complete and meaningful instrumentation that can be consumed by the Private Cloud Management and Operations Layers and by operators.

3.1 Scope

Platform monitoring is the monitoring of platform instrumentation created by private cloud resources during the normal deployment and management of these resources. Platform monitoring also includes the monitoring of fabric management orchestration that is performing management operations upon the resource or its configuration properties.

Consider an example where a job has been created to stand up a service consisting of an n-tier application that includes a data layer, a business logic application layer, and a presentation layer. Assume the presentation layer consists of five appropriately sized and configured compute resources. These five compute resources will likely be provisioned by fabric management in parallel. It is also likely that this presentation tier would be provisioned after the data and business logic tiers, because these are dependencies that have been defined in the management model.
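The ordering described in this example can be derived from the dependencies in the management model. The following sketch uses the standard library `graphlib.TopologicalSorter` to group tiers into provisioning waves. The tier names are hypothetical, and the dependency of the business logic tier on the data tier is an assumption added for illustration.

```python
from graphlib import TopologicalSorter

# Each tier maps to the set of tiers that must be provisioned before it.
TIER_DEPENDENCIES = {
    "data": set(),
    "business_logic": {"data"},  # assumed for this example
    "presentation": {"data", "business_logic"},
}


def provisioning_waves(dependencies):
    """Group items into waves; every item in a wave can be provisioned in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


print(provisioning_waves(TIER_DEPENDENCIES))
```

The same function applies within a tier: the five presentation instances have no dependencies on each other, so they would fall into a single wave and could be provisioned concurrently.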

When provisioning of the presentation tier occurs, the private cloud fabric management component initiates the provisioning of these five compute instances within a private cloud shared resource pool. These five instances will be allocated, imaged, and initialized concurrently, and as this happens, fabric management emits platform instrumentation that shows:

  • Allocation of compute resources from a shared pool
  • Selection of operating system image from the library
  • Loading of an image onto the instance
  • Base configuration of the instance
  • Initialization of the instance.

When a compute resource instance initializes, it begins exposing platform instrumentation that is specific to the context or scope of the instance. At this point, the management repository contains both fabric management instrumentation and resource instance instrumentation.

Associating this platform and resource instrumentation with the provisioning job is referred to as correlation. The ability to correlate instrumentation to a fabric management job is provided by the private cloud Infrastructure as a Service platform.
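One common way to make correlation possible is to stamp every instrumentation record with the identifier of the job that caused it. The sketch below illustrates that idea only; it does not describe how any specific Infrastructure as a Service platform implements correlation. The `Record` type, the step names, and the instance names are hypothetical.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass

PROVISIONING_STEPS = [
    "allocated from shared pool",
    "operating system image selected",
    "image loaded",
    "base configuration applied",
    "initialized",
]


@dataclass(frozen=True)
class Record:
    job_id: str
    source: str  # "fabric" or the name of the resource instance
    message: str


def provision_instance(job_id, instance_name):
    """Simulate the records emitted while one instance is provisioned."""
    records = [Record(job_id, "fabric", f"{instance_name}: {step}")
               for step in PROVISIONING_STEPS]
    # After initialization the instance itself starts emitting instrumentation.
    records.append(Record(job_id, instance_name, "platform instrumentation available"))
    return records


def correlate(records):
    """Group repository records by the job that produced them."""
    by_job = defaultdict(list)
    for record in records:
        by_job[record.job_id].append(record)
    return dict(by_job)


job_id = str(uuid.uuid4())
repository = []
for n in range(1, 6):
    repository.extend(provision_instance(job_id, f"web-{n}"))
repository.append(Record("unrelated-job", "fabric", "storage pool resized"))

grouped = correlate(repository)
print(len(grouped[job_id]))
```

Because fabric records and instance records share the same job identifier, an operator can view the whole provisioning job as one unit even though its records come from different sources.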

The architect is responsible for designing appropriate instrumentation into custom-developed orchestration that is integrated into fabric management, in order to maintain a highly manageable private cloud.