Event monitors use one of the event data sources to identify a particular event that indicates an issue. After the data source that holds the required information is identified, the logic used to set the different health states must be defined. In addition to the logic that indicates whether an error condition has occurred, additional logic must be defined to determine when the state should be changed back to a healthy condition.
The different kinds of logic that can be used to detect an error condition by using events are listed in the following table. As noted in the table, some logic can only be used with Windows events.
Simple detection refers to a state change being triggered immediately after a single occurrence of the specified event. This is the most basic kind of detection and will apply to most scenarios.
Repeated event detection uses one or more occurrences of a particular event in a time window to indicate an error condition. This typically applies to conditions in an application where a single event on its own can be ignored, but multiple occurrences of that event in a particular time window indicate a potential error. There are different algorithms that can be used for this detection, depending on the logic that best identifies the specific application issue. The following are details of the different algorithms:
Trigger on timer consolidation of events uses a specified time window and is not dependent on the number of events received. A single event can trigger an error in the health state, as in simple detection. Unlike simple detection, however, which sets the health state immediately upon detection of the specified event, trigger on timer consolidation waits until the specified time window expires to set the health state of the monitor. The time window can be a rotating time duration of specified length or a specific window based on day of the week.
Trigger on timer consolidation is useful for errors that should only be detected in a certain time window. Used with a time window based on a specific time of day, this disables the monitor outside that time period. It can also have the effect of delaying the change of state for a particular time during which an event that indicates a healthy state could be received. In this case, the health state would never be changed.
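As a rough illustration of the timing described above, the following Python sketch models trigger on timer consolidation with fixed windows of equal length. It is not Operations Manager code: the function name, the integer timestamps in seconds, and the assumption that windows start at time 0 are all hypothetical.

```python
def timer_consolidation(event_times, window, horizon):
    """Return the window end times at which the monitor would report an error.

    event_times: timestamps (seconds) of matching events.
    window: length of each consolidation window (seconds).
    horizon: total observation time (seconds).
    Assumption: windows are back to back and the first one starts at 0.
    """
    changes = []
    start = 0
    while start < horizon:
        end = start + window
        # A single event anywhere in the window is enough.
        if any(start <= t < end for t in event_times):
            # The state is set only when the window closes, not on arrival.
            changes.append(end)
        start = end
    return changes


# Two events in the first window and one in the second window.
print(timer_consolidation([30, 40, 500], window=300, horizon=900))  # [300, 600]
```

The number of events in a window does not matter here; only whether at least one arrived before the window closed.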
Trigger on count consolidation of events lets a monitor require multiple occurrences of the same event in a specified time window before it changes the health state to an error. The time window can be a rotating time duration of specified length or a specific window based on day of the week.
Trigger on count consolidation resembles trigger on timer consolidation except that multiple occurrences of the event are required instead of just one. When the time window is reached, the event count is returned to zero, and the specified number of events must be detected before the time window expires again for the health state to be changed.
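The counting behavior can be sketched in Python as follows. This is a simplified model under stated assumptions, not the product's implementation: the function name is hypothetical, timestamps are integer seconds, and windows are assumed to run back to back starting at time 0.

```python
def count_consolidation(event_times, window, required, horizon):
    """Return the start times of the windows that would trigger an error.

    The count starts from zero in every window, so events from different
    windows are never added together.
    """
    triggered = []
    start = 0
    while start < horizon:
        end = start + window
        hits = sum(1 for t in event_times if start <= t < end)
        if hits >= required:
            triggered.append(start)
        start = end
    return triggered


# Three events fall inside the first 300-second window.
print(count_consolidation([10, 200, 290, 310, 320], 300, 3, 600))  # [0]
# Three events close together, but split across two windows: no trigger.
print(count_consolidation([250, 280, 310], 300, 3, 600))  # []
```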
Trigger on count, sliding consolidation of events is similar to trigger on count consolidation except that the time window is reset every time that the specified event is received. The time window only expires if the time is reached after the occurrence of the last event.
Trigger on count, sliding consolidation is useful for error conditions that are detected by a certain number of events in a particular length of time. With trigger on count consolidation, some events could be received in one time window and the remaining events in the next, with the result that the health state is never changed. With trigger on count, sliding consolidation, the time window depends on when the events occur, which prevents this condition.
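A minimal Python sketch of the sliding behavior follows. It takes the description literally: every matching event restarts the timer, and the count is discarded only when the timer runs out after the most recent event. The function name and the integer timestamps are hypothetical, and treating a gap exactly equal to the window as still inside it is an assumption.

```python
def sliding_count_consolidation(event_times, window, required):
    """Return the time at which the monitor would trigger, or None."""
    seen = 0
    last = None
    for t in sorted(event_times):
        # The window expired after the last event, so start counting again.
        if last is not None and t - last > window:
            seen = 0
        seen += 1
        last = t  # each event resets the time window
        if seen >= required:
            return t
    return None


# Events that straddle a fixed window boundary still add up here.
print(sliding_count_consolidation([250, 280, 310], 300, 3))  # 310
# Events spaced further apart than the window never accumulate.
print(sliding_count_consolidation([0, 400, 800], 300, 3))  # None
```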
To help with understanding the different algorithms used for repeated event detection, the following table shows the effect on health state for monitors based on the different kinds of consolidation. This is based on a repeated event monitor that uses the following details:
A correlated event monitor uses two separate events in a particular time period to detect a single issue. This kind of monitor supports conditions where an issue cannot be identified by a single event alone.
When the first event is detected, a timer is triggered. If the second event is received within that period, the state change is triggered. If the second event is not received in the period, the timer is reset until the first event is received again. The monitor may be configured to better tune the specific conditions that must be met in order to perform correlation. These options include the following:
The following table provides an example of a correlated event monitor by using the first and the last occurrence of the first event. The monitor uses the following details:
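Independently of any particular configuration, the basic timer logic of a correlated event monitor can be sketched in Python. This is an illustrative model only: the function name, the event identifiers, and the choice to start the timer on the first occurrence of the first event (ignoring repeats while the timer runs) are assumptions.

```python
def correlated_event(events, first_id, second_id, interval):
    """Return the time of the state change, or None if no correlation occurs.

    events: (timestamp, event_id) pairs, in any order.
    """
    timer_start = None
    for t, event_id in sorted(events):
        # The period elapsed without the second event: clear the timer.
        if timer_start is not None and t - timer_start > interval:
            timer_start = None
        if event_id == first_id and timer_start is None:
            timer_start = t
        elif event_id == second_id and timer_start is not None:
            return t  # second event arrived inside the period
    return None


print(correlated_event([(0, 100), (50, 200)], 100, 200, interval=60))   # 50
print(correlated_event([(0, 100), (120, 200)], 100, 200, interval=60))  # None
```

A second event that arrives before any first event is ignored, because no timer is running at that point.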
A correlated missing event monitor determines an error by the absence of a particular event after the occurrence of another. This resembles the missing event monitor except that instead of searching for the missing event in a particular time window, the monitor searches for the event in a particular time after another event is first detected.
For example, consider an application that performs a backup each evening and creates an event when it starts and a second event when it has completed successfully. A correlated missing event monitor could be created that searches for the event in a particular time window each evening. If both events are detected, then the monitor remains in a healthy state. When the first event is found, the timer starts. If the timer expires before the second event is detected, then the state change is triggered to indicate that the last backup did not occur successfully.
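The backup example can be modeled with a short Python sketch. Everything here is hypothetical (the function name, the event identifiers, timestamps in seconds, and the `now` parameter used as the evaluation time); it only illustrates the start-timer-then-wait logic.

```python
def backup_failed_at(events, start_id, done_id, interval, now):
    """Return the time the error state would be set, or None if still healthy.

    events: (timestamp, event_id) pairs; now: the time of evaluation.
    """
    deadline = None
    for t, event_id in sorted(events):
        if t > now:
            break  # ignore events that have not happened yet
        if deadline is not None and t > deadline:
            return deadline  # the completion event did not arrive in time
        if event_id == start_id and deadline is None:
            deadline = t + interval  # first event starts the timer
        elif event_id == done_id:
            deadline = None  # completed in time; cancel the timer
    if deadline is not None and now > deadline:
        return deadline
    return None


START, DONE = 1, 2
print(backup_failed_at([(0, START), (1800, DONE)], START, DONE, 3600, now=7200))  # None
print(backup_failed_at([(0, START)], START, DONE, 3600, now=7200))  # 3600
```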
The following table provides an example of a correlated missing event monitor by using the first and the last occurrence of the first event. The monitor uses the following details:
Instead of detecting a particular event to identify an error condition, a missing event monitor uses the absence of a particular event in a particular time window to determine an error. This supports applications that are expected to generate an informational event that indicates a successful operation or the success of a particular action.
For example, consider an application that performs a scheduled data transfer each evening and creates an event when it has completed successfully. A missing event monitor could be created that searches for the event in a particular time window each evening. If the event is detected, then the monitor remains in a healthy state. If it is not found, then the monitor enters an error state that indicates that the last transfer did not occur successfully.
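In code form, the check amounts to asking whether any success event falls inside the expected window. The following Python sketch uses hypothetical names and expresses times as seconds since midnight; it is an illustration, not the monitor's actual implementation.

```python
def missing_event_state(event_times, window_start, window_end):
    """Return the state at the end of the window.

    event_times: timestamps of the success event (seconds since midnight).
    """
    found = any(window_start <= t < window_end for t in event_times)
    return "healthy" if found else "error"


# Window from 22:00 to 23:00.
window = (22 * 3600, 23 * 3600)
print(missing_event_state([22 * 3600 + 900], *window))  # healthy
print(missing_event_state([], *window))                 # error
```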
The following table provides an example of a missing event monitor by using the following details:
The previous detection criteria describe the conditions under which a monitor changes to a warning or critical state. In addition to detecting an error state, each monitor must have logic defined to determine when the state should be returned to healthy. The different methods for resetting state are shown in the following table:
Each of these methods is discussed at length in the following sections:
With event reset, the monitor is reset when a single occurrence of a specific event is detected. The event must be the same type as the event used for detecting the error condition. For example, a Windows event monitor might specify an event with a particular event source and number to indicate an error condition. Another Windows event with the same event source but a different number might indicate that the error in the application was corrected.
Event reset can only be used if the application provides an event indicating the particular error was corrected. Many applications create an event when an error occurs but may not create a corresponding event that indicates that the error was corrected. Event reset cannot be used in this case.
With manual reset, the monitor never returns to a healthy state automatically. The user must determine whether the problem was corrected and then select the monitor in the Health Explorer and select Reset Health.
The advantage to this strategy is that a monitor can be used for issues that do not create an event that indicates a healthy state. The monitor can affect the health state of the managed object instead of creating a simple alert from a rule. The downtime will be recorded for the object in the State Change Events in the Operations Console and in any availability reports.
There are multiple implications of this strategy that should be considered. The first is the additional work required from the user because the monitor will never automatically reset. It can also result in too much downtime being recorded if the user waits a long time before performing the reset. The problem may have been corrected fairly quickly, but the healthy state will not be recorded until the user performs the reset.
Use manual reset with particular caution for monitors where a single problem could affect multiple instances of the target class. Because users cannot reset the monitor for multiple instances at once in the Operations Console, the user must open the Health Explorer for each instance to perform this action. Depending on the number of instances, this could result in significant effort for the user.
A timer reset acts the same as a manual reset except that if the user does not manually reset the monitor after a specified time, it will reset automatically. One use of this kind of reset is for issues that continuously log error events until the problem is corrected. Instead of using another event to indicate that the problem was corrected, the absence of the previously detected error event for a specified period can be used as the success criteria.
The timer reset can be used in place of a manual reset, with the advantage that the monitor resets automatically after the specified time if the user does not perform a manual reset.
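The interaction between the timer and a manual reset can be sketched in Python as follows. The names are hypothetical, timestamps are in seconds, and the sketch assumes that each new error event restarts the timer, which matches the use case of an application that keeps logging errors until the problem is fixed.

```python
def state_at(now, error_times, reset_after, manual_reset_at=None):
    """Return the monitor state at time `now` under a timer reset."""
    past = [t for t in error_times if t <= now]
    if not past:
        return "healthy"
    last_error = max(past)
    # A manual reset after the latest error returns the monitor to healthy.
    if manual_reset_at is not None and last_error < manual_reset_at <= now:
        return "healthy"
    # Otherwise the monitor resets once no error was seen for `reset_after`.
    if now - last_error >= reset_after:
        return "healthy"
    return "error"


print(state_at(100, [0, 60], reset_after=120))                      # error
print(state_at(200, [0, 60], reset_after=120))                      # healthy
print(state_at(100, [0, 60], reset_after=120, manual_reset_at=80))  # healthy
```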
Monitors in a System Center Operations Manager 2007 management pack based on performance counters collect numeric data at set intervals and compare it to one or more threshold values. This may be a simple comparison that compares each sample to a single threshold or more complex logic, depending on the requirements of the application.
Multiple kinds of calculations may be performed to determine the threshold for a performance monitor. These threshold types are listed in the following table:
Each kind of logic is described in detail in the following sections:
The simple threshold type is the most basic kind of performance threshold. A single numeric value is provided for the threshold. This threshold is compared to the measured value of the performance data.
Simple threshold supports a two state monitor. One state is set by a performance value equal to or less than the threshold. The other state is set by a performance value greater than the threshold.
The double threshold type is similar to the simple threshold type but allows for two thresholds to be specified. Each threshold is compared to the measured value of the performance data.
Double threshold supports a three state monitor. One state is set by a performance value less than the low threshold. Another state is set by a performance value that is greater than or equal to the low threshold and less than or equal to the high threshold. The third state is set by a value that is greater than the high threshold.
The following table provides an example of a double threshold monitor by using the following details:
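The three ranges can be written as a small Python function. The function name and the band labels are hypothetical; which health state each band maps to depends on how the monitor is configured, so the sketch returns a neutral label instead of a health state.

```python
def double_threshold_band(value, low, high):
    """Classify a sample against a low and a high threshold."""
    if value < low:
        return "under low threshold"
    if value <= high:
        # low <= value <= high: both boundaries belong to the middle band.
        return "between thresholds"
    return "over high threshold"


for sample in (5, 10, 50, 90, 91):
    print(sample, double_threshold_band(sample, low=10, high=90))
```

With `low=10` and `high=90`, the samples 10, 50, and 90 all fall in the middle band because both boundaries are inclusive.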
The average threshold type calculates the average of a specified number of consecutive samples and compares it to the specified threshold.
Average threshold supports a two state monitor. One state is set by an average performance value equal to or less than the threshold. The other state is set by an average performance value greater than the threshold.
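A minimal Python sketch of the averaging logic follows. It assumes a rolling average over the most recent samples; whether the real monitor uses a rolling window or separate batches is not stated here, and the function name and state labels are hypothetical.

```python
from collections import deque


def average_threshold_states(samples, threshold, num_samples):
    """Return one state per sample, or None until enough samples exist."""
    window = deque(maxlen=num_samples)  # keeps only the latest samples
    states = []
    for value in samples:
        window.append(value)
        if len(window) < num_samples:
            states.append(None)  # not enough samples to average yet
            continue
        average = sum(window) / num_samples
        states.append("over" if average > threshold else "under_or_equal")
    return states


print(average_threshold_states([10, 20, 90, 100], threshold=50, num_samples=2))
# [None, 'under_or_equal', 'over', 'over']
```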
The following table provides an example of an average threshold monitor by using the following details:
The consecutive threshold type compares the threshold value to the performance counter for several consecutive samples. This supports monitors that should not be triggered by only a single value exceeding a threshold. The threshold must be exceeded multiple consecutive times to trigger a change in state.
Consecutive threshold supports a two state monitor. One state is set by the value being either greater than or less than the threshold value for each consecutive sample. The other state is set by a single sample not matching the other criteria.
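The run-counting logic can be sketched in Python. This example checks for samples greater than the threshold; the function name and parameters are hypothetical, and a less-than comparison would follow the same pattern.

```python
def consecutive_over(samples, threshold, required):
    """Return, per sample, whether the run of breaches has reached `required`."""
    run = 0
    states = []
    for value in samples:
        # A single sample at or under the threshold breaks the run.
        run = run + 1 if value > threshold else 0
        states.append(run >= required)
    return states


print(consecutive_over([60, 40, 60, 70, 80, 30], threshold=50, required=3))
# [False, False, False, False, True, False]
```

The first sample over the threshold does not change the state on its own, and the final sample of 30 immediately returns the monitor to the other state.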
The following table provides an example of a consecutive sample monitor by using the following details:
The delta threshold type compares the threshold value to the difference between two performance values. This might be two consecutive values or two values separated by a specified number of samples.
Delta threshold supports a two state monitor. One state is set by the difference of two values being greater than the threshold value. The other state is set by the difference of two samples being equal to or less than the threshold value.
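A short Python sketch of the delta comparison follows. The function name and the `spacing` parameter are hypothetical, and the sketch assumes a signed difference (later sample minus earlier sample), so only increases can exceed a positive threshold.

```python
def delta_states(samples, threshold, spacing=1):
    """Return, for each comparable sample, whether the delta exceeds the threshold.

    spacing=1 compares consecutive samples; larger values compare samples
    that are `spacing` positions apart.
    """
    states = []
    for i in range(spacing, len(samples)):
        delta = samples[i] - samples[i - spacing]
        states.append(delta > threshold)
    return states


print(delta_states([100, 105, 130, 128], threshold=10))             # [False, True, False]
print(delta_states([100, 105, 130, 128], threshold=10, spacing=2))  # [True, True]
```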
The following table provides an example of a delta threshold monitor by using the following details:
A self-tuning threshold monitor uses a learning process to determine the typical values for a specified performance counter object and automatically sets the threshold levels based on the learned values. Avoid self-tuning threshold monitors because they may not work well in most customer environments.
Script monitors run a monitoring script regularly and evaluate the results to determine the state of the monitor. The script could perform such actions as running a synthetic transaction against an application, gathering performance data to be evaluated against a threshold, or retrieving a status of some aspect of the application. Script monitors incur more overhead than the other types of monitors and should be used only when one of those monitors does not provide the required functionality.
Script monitors can use either two states or three states. Criteria must be defined for each state using values from the property bag created by the script. The kinds of values in the property bag will vary depending on the particular script. A numeric value might be compared to a threshold value as in a performance monitor. In that case, the healthy state might be defined by the value being under the threshold value while the critical state is defined by the value being over the same threshold. A synthetic transaction might return a text result indicating whether the test was successful or not. In that case, the criteria for each state would be the string indicating that particular health.
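To illustrate how state criteria might be evaluated against a property bag, the following Python sketch models the bag as a plain dictionary. The property names `Result` and `ResponseMs`, the function name, and the mapping to three states are all hypothetical; a real script defines its own properties and the monitor defines its own criteria.

```python
def evaluate_property_bag(bag, max_response_ms):
    """Map hypothetical property bag values to a three-state health result."""
    # Text criterion: the synthetic transaction reports success or failure.
    if bag.get("Result") != "Success":
        return "critical"
    # Numeric criterion: compare a measured value to a threshold.
    if bag["ResponseMs"] > max_response_ms:
        return "warning"
    return "healthy"


print(evaluate_property_bag({"Result": "Success", "ResponseMs": 120}, 500))  # healthy
print(evaluate_property_bag({"Result": "Success", "ResponseMs": 900}, 500))  # warning
print(evaluate_property_bag({"Result": "Failed"}, 500))                      # critical
```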
Service monitors measure the running state of a Windows service. There is no configuration required other than the name of the service. This is a two state monitor that is set to a healthy state if the service is running and to a critical state if the service is not running. The monitor can be configured to check the startup type of the service. This ensures that the service is only monitored if its startup type is set to automatic.
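The decision logic can be summarized in a few lines of Python. This is only a model of the rules described above: the function name, the `"Automatic"` string, and returning `None` for a service that is not monitored are assumptions, and the sketch does not query a real Windows service.

```python
def service_monitor_state(running, startup_type, check_startup_type=True):
    """Return "healthy", "critical", or None when the service is not monitored."""
    # With the startup type check enabled, only automatic services are monitored.
    if check_startup_type and startup_type != "Automatic":
        return None
    return "healthy" if running else "critical"


print(service_monitor_state(running=True, startup_type="Automatic"))   # healthy
print(service_monitor_state(running=False, startup_type="Automatic"))  # critical
print(service_monitor_state(running=False, startup_type="Manual"))     # None
```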