This week we are publishing the first blog post of a long series about the design and implementation of Cloud Service Fundamentals on Windows Azure. This reusable code package demonstrates how to address some of the most common scenarios we have encountered in complex Windows Azure customer projects, through a number of components that can become the basis for your own solution development. The first component we are presenting is Telemetry. Telemetry is a vast topic in itself, so we decided to break it down into four main buckets.
In this first post, we consider some of the basic principles around monitoring and application health by looking at fundamental metrics, information sources, tools, and scripts. You can use these to troubleshoot a simple solution deployed on Windows Azure (a few compute node instances plus a single Windows Azure SQL Database instance). This elaborates on what you can find in the official Windows Azure documentation.
Operating solutions in a Windows Azure cloud environment combines traditional, well-known troubleshooting techniques with specific toolsets that reduce the added complexity introduced by a highly automated and abstracted platform. When the number of moving parts in a solution is reasonably small, such as a few compute nodes plus a relational database, troubleshooting and diagnostics can be performed manually or with minimal tooling. However, for large-scale systems, collecting, correlating, and analyzing performance and health data requires a considerable effort that starts during the early stages of application design and continues through the entire application lifecycle (test, deployment, and operations). This is where the CSF telemetry component can help you reduce the implementation effort.
Providing a complete experience around operational insights helps customers meet the SLAs they offer their users, reduce management costs, and make informed decisions about present and future resource consumption and deployment. This can only be achieved by considering all of the different layers involved. Processing, correlating, and consuming this information will help both operations teams (maintaining service health, analyzing resource consumption, managing support calls) and development teams (testing, troubleshooting, planning for new releases, etc.).
For large-scale systems, telemetry should be designed to scale: it must execute data acquisition and transformation activities across multiple role instances, storing data into multiple raw-data Windows Azure SQL Database repositories. To facilitate reporting and analytics, data can then be aggregated in a centralized database that serves as the main data source for both pre-defined and custom reports and dashboards. These aspects will be covered in the next posts of this series.
Define key metrics and indicators that will immediately tell you the status of your application's health. Here is an example of a basic set of metrics covering both the infrastructure side and the application side:
Looking at the resource consumption of the compute nodes hosting your application components is a first step in monitoring or troubleshooting scenarios. Focus on traditional indicators like CPU activity, memory usage, disk usage, network throughput, and latency. While these do not provide a comprehensive view of your application state, they are a solid starting point. Opening a Remote Desktop connection to individual compute node instances will be enough for simple deployments. Then use tools like PerfMon and Resource Monitor to access these performance indicators in real time while the application is running.
Event logs in compute node instances will give you additional information on the overall state of your application and the underlying platform. The System and Application logs will provide indications of issues such as application crashes or system restarts. Aggregating and correlating them with events from other sources will help during troubleshooting sessions and trend analysis.
Query response times (or, more generally, service-to-service response times) are critical in distributed applications. Interactive workloads are usually very sensitive to database response times, which most of the time directly influence the end-user experience. Defining thresholds and automating the measurement process is a good practice for understanding application behavior over time. Application instrumentation plays a big role here, and we will cover it in the next blog post. You can also use simple tools like SQL Server Management Studio (SSMS) to measure response times directly. The following example shows how to measure the impact of network latency on response times for "on-premises-to-cloud" versus "cloud-to-cloud" communications:
Figure 1 - On Premises to Azure SQL Database latency
In the previous picture, we ran a simple "SELECT 1" query multiple times from an on-premises machine against a Windows Azure SQL Database instance, with Client Statistics enabled. Because the query requires almost no server execution time, the "Time Statistics" section at the bottom shows that "Wait time on server replies" (that is, network roundtrip latency) accounts for most of the elapsed time.
Figure 2 - Azure Compute Node to Azure SQL Database latency
Repeating the same experiment with SQL Server Management Studio running on a compute node immediately shows the difference in latency.
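Outside SSMS, the same kind of measurement can be scripted. The following Python sketch is purely illustrative (the `run_query` callable and helper names are our own, not part of the CSF package): it times repeated executions of a lightweight query and reports basic roundtrip statistics. Here a `time.sleep` stand-in simulates the database call.

```python
import time
import statistics

def measure_roundtrip_latency(run_query, samples=10):
    """Time repeated executions of a lightweight query (e.g. SELECT 1).

    Because the query itself does almost no server-side work, the elapsed
    time is dominated by network roundtrip latency, mirroring the
    "Wait time on server replies" statistic in SSMS Client Statistics.
    """
    timings_ms = []
    for _ in range(samples):
        start = time.perf_counter()
        run_query()  # in real code: cursor.execute("SELECT 1")
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "min_ms": min(timings_ms),
        "avg_ms": statistics.mean(timings_ms),
        "max_ms": max(timings_ms),
    }

# Stand-in for a real database call: simulate a ~5 ms roundtrip.
stats = measure_roundtrip_latency(lambda: time.sleep(0.005), samples=5)
print(stats)
```

Comparing the resulting numbers from an on-premises machine and from a compute node reproduces the difference shown in Figures 1 and 2.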
A high rate of generated exceptions can be an indicator that something is going wrong in your application. Temporary malfunctioning conditions may be tolerated in some scenarios and classified as transient bumps. In other cases, malfunctions should trigger an alerting mechanism and receive immediate attention from operators. We recommend establishing a monitoring system for application errors and exceptions: it helps with both near-real-time troubleshooting and long-term root cause analysis.
A critical subset of application exceptions are database connection and command failures. Transient connectivity issues, such as those generated by failover or throttling events, can happen more often with Windows Azure SQL Database than with on-premises versions of SQL Server. Sometimes they can also be symptoms of a suboptimal data access layer implementation.
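Because these transient errors are expected, a well-implemented data access layer retries them with a backoff delay instead of failing immediately. The following Python sketch illustrates the general pattern only; the class and function names are hypothetical, and 40501 (service busy/throttling) and 40613 (database temporarily unavailable) are examples of transient Windows Azure SQL Database error numbers.

```python
import time

# Illustrative subset of transient Windows Azure SQL Database error numbers.
TRANSIENT_ERRORS = {40501, 40613}

class SqlError(Exception):
    """Stand-in for a driver-level SQL exception carrying an error number."""
    def __init__(self, number):
        super().__init__("SQL error %d" % number)
        self.number = number

def execute_with_retry(operation, max_retries=4, base_delay=0.1, sleep=time.sleep):
    """Run an operation, retrying transient failures with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except SqlError as err:
            if err.number not in TRANSIENT_ERRORS or attempt == max_retries:
                raise  # permanent error, or retries exhausted
            sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...

# Demo: fail twice with a throttling error, then succeed.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SqlError(40501)
    return "ok"

result = execute_with_retry(flaky, sleep=lambda s: None)
print(result, attempts["n"])  # ok 3
```

A retry policy like this masks transient bumps, but a sustained high retry rate should still surface in your monitoring, since it often points back at the data access layer or at throttling pressure.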
In future posts we will show how to automate a regular monitoring task that captures these events and correlates application exceptions with server logs. For simple deployment scenarios, manually querying server-level views like sys.event_log and sys.database_connection_stats (exposed in the master database), as shown in the next pictures, will provide an aggregation of potential connectivity issues over five-minute intervals.
Figure 3 - Event log table example (SELECT * FROM sys.event_log WHERE database_name <> 'master' AND severity > 0)
Figure 4 - Connection stats example (SELECT * FROM sys.database_connection_stats WHERE database_name <> 'master')
At the application level, look for exceptions that involve
SqlConnection/DbConnection or SqlCommand/DbCommand in your trace or application logs. This can give you more details on what happened and when the issues occurred.
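As a rough illustration of that search, a small script can scan trace logs for those types. The sketch below uses hypothetical log lines (real WAD trace formats vary) and a regular expression to tally data access errors by the failing member:

```python
import re
from collections import Counter

# Hypothetical trace-log lines for illustration only.
LOG_LINES = [
    "2013-02-10 10:01:22 ERROR SqlException in SqlCommand.ExecuteReader: error 40501",
    "2013-02-10 10:01:25 INFO  Request completed in 120 ms",
    "2013-02-10 10:02:03 ERROR Timeout in SqlConnection.Open",
    "2013-02-10 10:05:41 ERROR SqlException in SqlCommand.ExecuteNonQuery: error 40613",
]

# Match ADO.NET connection/command types followed by the member that failed.
PATTERN = re.compile(r"(SqlConnection|SqlCommand|DbConnection|DbCommand)\.(\w+)")

def count_data_access_errors(lines):
    """Tally ERROR lines that mention ADO.NET connection/command types."""
    counts = Counter()
    for line in lines:
        if "ERROR" not in line:
            continue
        match = PATTERN.search(line)
        if match:
            counts["%s.%s" % match.groups()] += 1
    return counts

errors = count_data_access_errors(LOG_LINES)
print(dict(errors))
```

Correlating the timestamps of these hits with the sys.event_log results above is often enough to confirm whether an application-side error burst matches a server-side throttling or failover event.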
Windows Azure is a shared and distributed environment, and some degree of application performance variance is expected. The application components in the Cloud Service Fundamentals (CSF) package are designed to minimize this variability, but it is important to define acceptable thresholds and measure how your application is behaving. Services like Windows Azure SQL Database can be queried to find out how many concurrent users are hitting a database instance at any given time.
Querying sys.dm_exec_requests will return the concurrent requests, together with the wait types indicating which resources the system is waiting on to complete query executions. A high rate of "SE_REPL" wait types, for example, can indicate that the replication system is affecting the system's ability to execute user transactions. Trend analysis on these performance indicators is critical during capacity planning and similar important management decisions.
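To turn such observations into a trend, you can poll the DMV on a schedule and aggregate the sampled wait types. The following sketch works on purely hypothetical sampled values and computes the share of "SE_REPL" waits, flagging when that family dominates:

```python
# Hypothetical wait-type samples collected by periodically polling
# sys.dm_exec_requests; real samples would carry more columns.
samples = [
    "SE_REPL_SLOW_SECONDARY_THROTTLE", "PAGEIOLATCH_SH", "SE_REPL_COMMIT_ACK",
    "SE_REPL_COMMIT_ACK", "CXPACKET", "SE_REPL_COMMIT_ACK",
]

def wait_family_share(waits, prefix):
    """Fraction of sampled waits belonging to one wait-type family."""
    matching = sum(1 for w in waits if w.startswith(prefix))
    return matching / len(waits)

share = wait_family_share(samples, "SE_REPL")
# Illustrative threshold: alert when one family accounts for most waits.
alert = share > 0.5
print(round(share, 2), alert)
```

Keeping these shares over time, rather than single snapshots, is what makes them usable for the capacity-planning decisions mentioned above.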
So far, we have just scratched the surface of some of the most important metrics to consider when starting a monitoring exercise for your Windows Azure solutions. In this section, we will drill down more specifically into the monitoring features and capabilities of the Windows Azure platform and its key services.
Windows Azure Diagnostics (WAD) is the built-in solution offered by the platform to collect and aggregate multiple diagnostic data sources into a single repository (a Storage Account).
In our CSF package, we rely heavily on WAD to collect diagnostic information from all the compute roles included in the solution and to consolidate it into a single Storage Account. This becomes one of the information sources for the telemetry data acquisition and transformation pipelines, which aggregate and correlate diagnostic information into a single Windows Azure SQL Database instance for further analysis.
In particular, we are using the most significant Storage Tables and Blobs that are created by the DiagnosticAgent in the Storage Account we configured in the WAD plug-in connection string. The following screenshot
shows an example of the standard naming configuration:
Figure 5 Windows Azure Diagnostics Tables and Blob containers
For troubleshooting simple scenarios, you can, of course, manually query these tables using tools like Visual Studio Server Explorer or third-party tools like Cotega or Cerebrata's Azure Management Studio to extract information that is relevant to your investigation. The following screenshot shows a diagnostic table in Visual Studio Server Explorer.
Figure 6 Windows Azure Diagnostic Event Table
You can also manually download trace logs or performance counter values over a specific time interval when an incident happens, to perform further analysis. For example, plotting this information in an Excel spreadsheet (as shown in the next picture) is an effective way to better understand your application's behavior without the need for a more complex underlying infrastructure.
Figure 7 Custom Excel sheet to analyze Windows Azure Diagnostics data points
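As a sketch of that workflow, the following Python code filters hypothetical performance counter samples (modeled loosely on rows from the WADPerformanceCountersTable; real entities carry additional columns) down to one counter and time window, emitting CSV that Excel can open directly:

```python
import csv
import io
from datetime import datetime

# Hypothetical (timestamp, counter name, value) samples for illustration.
rows = [
    ("2013-02-10T10:00:00", r"\Processor(_Total)\% Processor Time", 35.2),
    ("2013-02-10T10:01:00", r"\Processor(_Total)\% Processor Time", 78.9),
    ("2013-02-10T10:02:00", r"\Memory\Available MBytes", 512.0),
]

def export_window(rows, counter, start, end):
    """Filter one counter over a time window and emit CSV for Excel plotting."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["timestamp", "value"])
    for ts, name, value in rows:
        t = datetime.fromisoformat(ts)
        if name == counter and start <= t <= end:
            writer.writerow([ts, value])
    return out.getvalue()

csv_text = export_window(
    rows,
    r"\Processor(_Total)\% Processor Time",
    datetime(2013, 2, 10, 10, 0), datetime(2013, 2, 10, 10, 5),
)
print(csv_text)
```

The resulting two-column CSV is exactly the shape of data behind a chart like the one in Figure 7.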
Our applications themselves are the greatest sources of information for monitoring purposes. But no matter what tool or approach you use to consume this information, you must first properly instrument your application: if instrumentation is not a high-priority design task, you will not be able to achieve your manageability goals once the application goes into production. The next blog post will be entirely dedicated to application instrumentation, from creating custom performance counters to emitting simple trace messages. These can be collected through Windows Azure Diagnostics and used to measure your application metrics. More advanced techniques include generating your own diagnostic events using the EventSource class introduced in .NET 4.5 and integrating them into your application logging strategy using the Semantic Logging Application Block. Please refer to the next blog post in this series for more implementation details.
Correlating application-side events and metrics with the behavior of your database instances is required for a successful monitoring and troubleshooting strategy. When application response times increase, drill down into the database response times for individual operations, and look at the query execution plans that the database engine was using at that time. This will point you to optimization areas such as missing indexes or suboptimal operations. When your application is receiving throttling errors, you might want to check how many concurrent requests are hitting your database at a given time, and look at which wait types are impacting your workload.
The article "Windows Azure SQL Database and SQL Server -- Performance and Scalability Compared and Contrasted" describes some of the differences between traditional on-premises SQL Server and Windows Azure SQL Database. Its "Troubleshooting and Performance Analysis in SQL Database" section specifically presents some of the monitoring and analysis techniques that we adopted in the CSF telemetry package.
That article includes T-SQL queries that we have embedded into the automated data acquisition pipeline of the CSF telemetry package. Nothing prevents you from running those queries interactively, through SQL Server Management Studio or the Windows Azure SQL Database management portal, to generate independent result sets during your investigation.
By querying the Windows Azure Storage Analytics tables, you'll be able to extract meaningful information, such as the number of requests, end-to-end latency, and server processing times for operations in a given time interval. See the following table as an example:
Figure 8 $MetricsTransactionsTable in Azure Storage Analytics
Figure 9 Azure Storage Analytics Blob structure
Figure 10 Azure Storage Analytics log entry format
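As an illustration of the kind of aggregation involved, the following sketch computes request counts and request-weighted latency averages from a few hypothetical hourly rows shaped loosely like $MetricsTransactionsTable entries (time, total requests, average end-to-end latency, average server latency):

```python
# Hypothetical hourly metric rows for illustration only.
rows = [
    ("2013-02-10T10:00", 1200, 85.0, 40.0),
    ("2013-02-10T11:00", 1800, 95.0, 42.0),
    ("2013-02-10T12:00", 600, 250.0, 45.0),
]

def summarize(rows):
    """Request-weighted averages over an interval: per-hour averages must be
    weighted by request counts before being combined."""
    total = sum(r[1] for r in rows)
    e2e = sum(r[1] * r[2] for r in rows) / total
    server = sum(r[1] * r[3] for r in rows) / total
    return {"requests": total,
            "avg_e2e_ms": round(e2e, 1),
            "avg_server_ms": round(server, 1)}

summary = summarize(rows)
print(summary)
```

A large gap between end-to-end latency and server latency, as in the last hourly row above, points at network or client-side time rather than storage-service processing.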
Once you have defined your key performance indicators (KPIs) and familiarized yourself with the different diagnostic data sources mentioned so far, your troubleshooting toolset is almost complete. You just need to apply a consistent approach to analyzing this information, both during normal operations and during performance and availability incidents. Here is an approach that has worked consistently during our customer engagements:
Detecting performance issues before end users are impacted requires a scheduled mechanism that polls the different data sources, monitors key thresholds, and triggers appropriate alerts. With the proper level of instrumentation, the data sources we mentioned in Section 3 will give you the ability to drill down into what is happening. Application errors and exceptions are typically a good starting point. To detect all the components affected by a specific incident, you might need to interrogate and analyze results from different tools (e.g., Windows Azure Diagnostics logs, Windows Azure SQL Database DMVs, etc.). This is where having a centralized and correlated repository is invaluable in reducing the time-to-detection for application issues.
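The polling mechanism itself can be very simple. The sketch below is purely illustrative (metric names and threshold values are ours, not CSF's): it compares the latest snapshot of key indicators against thresholds and returns the alerts to raise.

```python
# Illustrative KPI thresholds; snapshot values would come from polling your
# data sources (WAD tables, SQL Database DMVs, application logs) on a schedule.
THRESHOLDS = {
    "cpu_percent": 80.0,
    "db_response_ms": 500.0,
    "errors_per_min": 10,
}

def evaluate(snapshot, thresholds=THRESHOLDS):
    """Compare the latest metric snapshot against thresholds and
    return the list of alerts to raise."""
    return [
        "%s=%s exceeds %s" % (name, snapshot[name], limit)
        for name, limit in thresholds.items()
        if snapshot.get(name, 0) > limit
    ]

alerts = evaluate({"cpu_percent": 92.5, "db_response_ms": 120.0,
                   "errors_per_min": 3})
print(alerts)
```

In a real deployment, the snapshot would be refreshed on a timer and the returned alerts routed to operators, but the threshold-comparison core stays this small.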
In the services world, your first priority should be to recover from an incident and have your application up and running again as soon as possible. Once you have developed a strong practice for the steps described in Sections 4.1 and 4.2, you will be well positioned to identify the right solution to a given application issue. Most of the time, contacting Microsoft Customer Support Services (CSS) is the fastest choice for issues that are out of your direct control. Even then, having the right set of diagnostic data available during your interactions with support engineers significantly speeds up problem resolution. In other cases, the data will show that the issue was generated by something wrong in your application, and you will have all the information at your disposal to bring the application back to normal functioning.
Root Cause Analysis (RCA) should always complement issue resolution and recovery. With all the data collected and correlated from the sources we mentioned, it becomes possible to understand the recurring patterns and trends that anticipated a particular incident. For example, higher web page load times can often be correlated directly with database response times, which can lead to identifying a root cause such as a query whose execution time increased due to a suboptimal plan caused by indexing issues. This kind of RCA typically leads to long-term mitigation actions, which may involve application code changes or more general improvements. This virtuous cycle will not be possible without an appropriate diagnostic toolset in place.
This is a collection of tools and scripts that we have found useful during various troubleshooting processes. Some of these have been integrated into our CSF telemetry package, while others can be used to complement (or replace) it during your investigations.
Monitoring for Azure SQL Database
Active Cloud Monitoring by MetricsHub
Stackify Retrace for Azure
Azure Watch
Azure Management Studio by Cerebrata
Visual Studio Server Explorer
Windows Azure Diagnostics
The Azure CAT team is developing a client-side tool that will help automate data collection and correlation for one-off troubleshooting scenarios. It will complement the CSF telemetry package and reuse most of the techniques presented in this blog post. Once completed, it will be posted here, so stay tuned!