Written by Ewan Fairweather
and Silvano Coriani
from the AzureCAT team.
Welcome to the fourth wiki entry on designing and implementing the Telemetry component in
Cloud Service Fundamentals (CSF) on Windows Azure! So
far, we have described basic principles around application health in Telemetry Basics and Troubleshooting, including an overview of the fundamental tools, information sources, and scripts that you can use to gain information about your deployed Windows Azure solutions. In our second entry, Telemetry - Application Instrumentation, we addressed how our applications are the greatest sources of information when it comes to monitoring, and how you must first properly instrument your application to achieve your manageability goals after the application goes into production. In our third article we described how to automate and scale a data acquisition pipeline to collect monitoring and diagnostics information across a number of different components and services within your solution, and how to consolidate this information in a queryable operational store.
The topic for this fourth article is reporting: how to get the information you need about your system to suit the different types of analytical and reporting requirements in your organization. We will show you how to quickly extract information such as database-tier resource utilization and end-to-end execution times, and how to turn this into reports and dashboards. Specifically, we will walk through the underlying implementation of the operational store, along with examples of analytical queries you can run against it. We will also cover the reporting package that we provide and how to use Excel for a deeper level of analysis. We will then show you how you can extend the provided helper functions to get further information.
The previous article in this series discussed the data pipeline, which is the CSF implementation of the collector tasks shown in the following data flow diagram. These collector tasks are used by the CSF telemetry worker role and scheduler to populate the Telemetry database on a configurable periodic basis. In this article we will describe how you can extract that information (shown on the right-hand side of the diagram) and surface it through Reporting Services, SSMS, and Excel.
Figure 1: Telemetry DB in CSF
A key first step to gaining effective insight into your telemetry data is to define the reporting scenarios and their key requirements. When defining the CSF telemetry solution, a useful technique was to first define three scenarios: operational reporting, alerting, and root-cause isolation. The "I can" approach was then used to capture the key requirements for each scenario. These were then prioritized, with the majority of the operational reporting and root-cause isolation scenarios implemented in the out-of-the-box CSF experience; the underlying data structure is there to service your alerting needs.
This technique allowed us to consider how the underlying schema would support current and potential future requirements, both when defining and later extending the telemetry database. It was an important first step, and we recommend it especially if you plan to extend your telemetry database.
The following picture highlights the specific parts of the CSF package which are related to the telemetry database and reporting solution.
Figure 2: CSF Package
The OpsStatsDB project in the CSF solution contains our implementation of the telemetry database, which acts as the centralized repository for all of our telemetry data. This solution
has been used as the repository for a customer with over 500 databases and up to a hundred worker roles. The tables closely map to the underlying data structures extracted by the various sources, as we decided to preserve the original data shape as much as
possible. As described below, the data pipeline import tasks enrich the collected data with some additional fields. This additional information simplifies the query logic required to meet the reporting requirements we defined earlier.
Figure 3: Table Columns
Throughout the tables in the OpsStatsDB you will see a timestampKey and an eventdateUTC field. The former records the import time and the latter records the actual event time. We implemented it this way because the collection tasks use a fan-out process to collect the requisite Dynamic Management View (DMV) data. When this is performed across multiple Windows Azure SQL Databases, there is a slight discrepancy, often on the order of milliseconds, in the time at which the data is actually collected from each shard; this is due to the nature of the collection process and occurs even though the collection is done in parallel. Using the same timestampKey per collection interval simplifies reporting queries (shown in the second picture below) where we want to ascertain, for example, the size of each database at a point in time. To support this cross-instance and cross-database correlation while still maintaining an accurate record of the event time, this separation is implemented consistently across the database. In the Windows Azure Diagnostics (WAD) related tables we also add the roleinstance field, and in the database-related tables we add the db_name field. These are populated during the collection data pipeline process and serve a similar purpose: to enrich the reporting scenarios.
Figure 4: db names
The picture below shows a practical example: a simple query can return the size of each of our databases for a particular collection interval. In particular, note the natural slight discrepancy between the eventdateUTC values in each row.
Figure 5: UTC
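If you want to run this kind of point-in-time query yourself rather than read it from the screenshot, a rough sketch is shown below. The table name and the size column used here are illustrative assumptions rather than the exact OpsStatsDB schema; only timestampKey, eventdateUTC, and db_name come from the description above.

-- Sketch only: the table name and the size column are assumptions, not the shipped schema.
-- Returns the recorded size of every database for a single collection interval.
SELECT timestampKey, eventdateUTC, db_name, size_mb
FROM dbo.DatabaseSize
WHERE timestampKey = 20131001140000   -- a single collection interval's key (the format is explained below)
ORDER BY db_name;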
To assist with querying requirements, we have implemented a processing layer using a set of Table-Valued Functions (TVFs) to extract curated information from the raw data. These are depicted below:
Figure 6: Table valued functions
We also provide the scalar function dbo.fnConvertToTimeKey, which simplifies the conversion of a datetime value into a timestampKey. There are additional benefits on top of the simplified correlation across instances, collection periods, and databases. The BIGINT field timestampKey is also the clustered index key in all tables within the OpsStatsDB; this field is the numerical transposition of the collection task timestamp, in the format yyyymmddhhmmss. Since this is the clustered index key, it helps optimize time-range queries, which are common in our historical reporting and trend analysis scenarios.
Figure 7: Scalar valued functions
Most of the reporting queries have the following general format, utilizing the scalar-valued function to convert a start and end time into the corresponding timestampKey values:
Figure 8: Timestamp key
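In other words, the reporting queries take datetime parameters and translate them into timestampKey bounds. A minimal sketch of the pattern is shown below; the table name is a placeholder, and only dbo.fnConvertToTimeKey and timestampKey come from the OpsStatsDB itself.

-- General pattern (sketch): convert datetime parameters into timestampKey bounds.
DECLARE @startdate datetime = '2013-10-01 00:00';
DECLARE @enddate   datetime = '2013-10-01 12:00';

SELECT *
FROM dbo.SomeTelemetryTable            -- placeholder for any OpsStatsDB table
WHERE timestampKey >= [dbo].[fnConvertToTimeKey](@startdate)
  AND timestampKey <  [dbo].[fnConvertToTimeKey](@enddate);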
The typical pattern for the TVFs that the OpsStatsDB provides is shown below. You will find at least two parameters defined: startdate and enddate. Depending on the context of the TVF, you might find other parameters, which should be self-explanatory; in the following example, DBName requires the database name to be passed. This example shows the implementation of QueryStatsOverallPlansByExecutionCount, which provides information on resource utilization per execution count at the statement level. Note the use of the scalar function fnConvertToTimeKey described earlier.
-- Add the parameters for the function here
GROUP BY dbname,query_hash,
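As a reading aid, here is a heavily simplified skeleton of this kind of TVF. The parameter list, the use of fnConvertToTimeKey, and the dbname/query_hash grouping follow the description and fragments above; the source table (dbo.QueryStats) and its column names are assumptions, and the shipped OpsStatsDB definition is more complete, so this is a sketch rather than the actual implementation.

-- Simplified sketch of the TVF pattern; dbo.QueryStats and its columns are assumed.
CREATE FUNCTION [dbo].[QueryStatsOverallPlansByExecutionCount_Sketch]
(
    -- Add the parameters for the function here
    @startdate datetime,
    @enddate   datetime,
    @DBName    nvarchar(128)
)
RETURNS TABLE
AS
RETURN
(
    SELECT
        dbname,
        query_hash,
        SUM(execution_count)       AS Execs,
        SUM(total_worker_time)     AS CPU,
        SUM(total_logical_reads)   AS Reads,
        SUM(total_logical_writes)  AS Writes
    FROM dbo.QueryStats                                   -- assumed raw DMV snapshot table
    WHERE dbname = @DBName
      AND timestampKey >= [dbo].[fnConvertToTimeKey](@startdate)
      AND timestampKey <  [dbo].[fnConvertToTimeKey](@enddate)
    GROUP BY dbname, query_hash
);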
To query the TVFs, the following approach can be used. This also includes an example of the results with identifying customer information removed.
='<ENTER DB NAME HERE>'
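Put together, an invocation typically looks something like the following; the dates are placeholders, and the parameter order is assumed to be (startdate, enddate, DBName) as described above.

-- Illustrative invocation; dates are placeholders and the parameter order is assumed.
DECLARE @DbName nvarchar(128) = '<ENTER DB NAME HERE>';

SELECT *
FROM [dbo].[QueryStatsOverallPlansByExecutionCount]('2013-10-01 00:00', '2013-10-02 00:00', @DbName);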
Figure 9: fnConvertToTimeKey
The image below shows the SQLReporting project which is included with the CSF solution. All reports use a shared data source to access the OpsStatsDB.
Figure 10: SQLReporting project
To deploy this you will need to configure the
TargetServerURL property. When prompted, specify an appropriate user name and password. Then deploy this to either a
Windows Azure SQL Reporting services instance
or an on-premises Reporting Services deployment. We utilized Windows Azure SQL Reporting during CSF testing and in subsequent large-scale customer implementations.
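If you are unsure what value to use for TargetServerURL, it typically takes one of the following forms (the server and namespace names here are placeholders): http://<your-report-server>/ReportServer for an on-premises report server, or https://<your-service-namespace>.reporting.windows.net/ReportServer for Windows Azure SQL Reporting.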
Figure 11: TargetServerURL
The main menu is found in Default.rdl and is shown below. It can be used to navigate between the different reports.
Figure 12: Main menu
The main dashboard, which can be found in Dashboard.rdl, is primarily focused on giving you an understanding of your application's performance characteristics, represented by the following information: the performance counters you configured WAD to collect, charts that depict them, and an accurate representation of your overall database resource usage. We will now look at how you can customize this report.
Figure 13: Dashboard 1
Figure 14: Dashboard 2
The image below depicts the design view of the Main Dashboard. Two input parameters are defined to represent the start and end of the period you would like to query. The Date/Time data type is used, and a default expression similar to “=DateAdd(DateInterval.Hour,-12,Now())” populates sensible defaults when the dashboard is first opened. This can be adjusted using the Date/Time selection box present at the top of each report. Each of the data sets is configured to query the OpsStatsDB, and they utilize the TVFs shown earlier to simplify the querying logic. For example, the CPU dataset (which provides the underlying data for the Processor utilization chart) contains the following T-SQL query, which utilizes the PerfLogsPerCounterAllInstances TVF. As discussed previously, the TVF uses the scalar function to simplify the conversion to the required underlying timestampKey values:
SELECT * FROM [dbo].[PerfLogsPerCounterAllInstances](@startdate,@enddate, '\Processor(_Total)\% Processor Time')
ORDER BY eventdateUTC
Figure 15: Dashboard 3
A consistent structure is implemented throughout the reports, as the image below illustrates.
Figure 16: Dashboard 4
In addition to the dashboards we have demonstrated, the reporting solution also provides a database dashboard and various specific reports designed to assist you with your reporting
requirements for monitoring, troubleshooting, and capacity planning purposes.
To help you understand your resource usage, a TVF is provided that returns overall resource usage information over time. With the Windows Azure SQL Database Premium offering now in public preview, this allows you to take a data-driven approach to determining your current resource usage, and hence whether you need to move your database to a larger instance or potentially scale out your workload.
DatabaseResourceUsageOverallOvertime can be used as follows to provide resource usage information across all databases that the CSF telemetry solution is configured to collect:
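(The date range below is a placeholder, the parameter order is assumed to match the other TVFs, and EventDateUTC is the output column shown in the results that follow.)

-- Illustrative invocation; the date range is a placeholder.
SELECT *
FROM [dbo].[DatabaseResourceUsageOverallOvertime]('2013-10-01 00:00', '2013-10-02 00:00')
ORDER BY EventDateUTC;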
This will produce output like the following, showing your overall resource consumption. For this particular customer we can see a spike in the number of CPU cores used and in read and write utilization, which corresponds with the time at which they run their daily maintenance procedure.
Figure 17: EventDateUTC
Now if we want to see this at a per-database level, we can create a new TVF to simplify it. Below is the script to create the TVF DatabaseResourceUsageOverallByDatabaseOvertime. This uses the same technique as the previous TVF, except that it groups by both EventDateUTC and dbname.
-- qp.Execs, q.Execs,
-- qp.Writes, q.Writes,
-- qp.Reads, q.Reads,
-- qp.CPUSec, q.CPUSec,
timestampkey < [dbo].[fnConvertToTimeKey]
GROUP BY query_hash,
) AS q
timestampkey < [dbo].[fnConvertToTimeKey]
GROUP BY query_hash,
) AS Aggregated
GROUP BY EventDateUTC,dbname
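For orientation, the sketch below shows the overall shape such a function can take. The grouping by EventDateUTC and dbname, the fnConvertToTimeKey range filter, and the Execs/Reads/Writes/CPUSec measures follow the description and fragments above; the source table and its columns are assumptions, and the shipped script additionally computes deltas between collection intervals (the q and qp derived tables in the fragments), which this simplified version omits.

-- Simplified sketch only: dbo.QueryStats and its columns are assumed, and the
-- interval-to-interval delta logic of the full TVF is omitted here.
CREATE FUNCTION [dbo].[DatabaseResourceUsageOverallByDatabaseOvertime_Sketch]
(
    @startdate datetime,
    @enddate   datetime
)
RETURNS TABLE
AS
RETURN
(
    SELECT
        eventdateUTC                        AS EventDateUTC,
        dbname,
        SUM(execution_count)                AS Execs,
        SUM(total_logical_reads)            AS Reads,
        SUM(total_logical_writes)           AS Writes,
        SUM(total_worker_time) / 1000000.0  AS CPUSec   -- assumes worker time in microseconds
    FROM dbo.QueryStats                                  -- assumed raw DMV snapshot table
    WHERE timestampKey >= [dbo].[fnConvertToTimeKey](@startdate)
      AND timestampKey <  [dbo].[fnConvertToTimeKey](@enddate)
    GROUP BY eventdateUTC, dbname
);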
We can then use this TVF to drill down and understand which specific databases are consuming the most resources.
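A query like the following (the date range is a placeholder) does this:

-- Drill down per database; dates are placeholders.
SELECT *
FROM [dbo].[DatabaseResourceUsageOverallByDatabaseOvertime]('2013-10-01 00:00', '2013-10-02 00:00')
ORDER BY dbname, EventDateUTC;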
This will produce output like the following; the database names have been deliberately obscured.
Figure 18: TVF
To understand temporal trends, this data can be represented using Excel and PowerPivot. If you select the PowerPivot tab in Excel and then go to Manage, the option shown below to import a data source from a database will be available.
Figure 19: Excel menu
If you configure it with the location of your OpsStatsDB, as shown below, and then select the Write Query option, you can use a query such as the one that follows to import data for understanding resource utilization.
Figure 20: OpsStatDB
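The import query itself can simply call the per-database TVF created above; something along these lines works as a starting point (the date range is a placeholder, and you can narrow the column list once you know which measures you want to chart):

-- PowerPivot import query (sketch): pull per-database resource usage over time.
SELECT *
FROM [dbo].[DatabaseResourceUsageOverallByDatabaseOvertime]('2013-10-01 00:00', '2013-10-08 00:00');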
Once the data is imported, the PivotChart option can then be used, as shown below.
Figure 21: PivotChart
A chart can then be configured to visualize the appropriate data. The configuration below will produce the chart shown in the next screenshot, which shows the CPU cores/sec utilization over time for this customer's databases.
Figure 22: PivotChart Fields
As can clearly be seen in the diagram below, there is a peak between 16:00 and 19:00 UTC, which corresponds with their peak user volumes, and another between 06:00 and 08:00, which corresponds with their maintenance processes. This data can be refreshed over time, so that you can monitor resource utilization as end-user volumes increase. A traditional capacity planning approach and performance optimization can then be combined to size your databases for the peaks and tune your workload to be more efficient.
Figure 23: Peak chart
Thank you for taking the time to read through this wiki entry, which has walked you through how to use traditional querying techniques, a Reporting Services approach, and graphical tools such as Excel PowerPivot to gain insight into your system. As we discussed, the OpsStatsDB and reporting solution have been designed to allow you to modify and extend the helper TVFs we have provided to suit your reporting needs.