Custom Logical Disk monitor incorrectly flapping between healthy and unhealthy

  • Question

  • One of the client Ops Mgr 2012 SP1 UR8 environments I support has some custom logical disk monitoring set up. There are 5 groups dynamically populated with logical drives according to their size (the 1st group holds small drives, up to the last group with very large drives). There is a 'Warning' and a 'Critical' Monitor per server OS version, and the Monitors themselves are not enabled. Overrides applied to each group enable the Monitor and apply a threshold, with a different threshold for each group.

    During some BAU tuning I could see that some of the above Monitors were appearing as top-talking alerts. Further investigation showed that the alerts were being triggered by drives that momentarily dropped below the applied threshold. I re-created the Monitors, changing them from 'Simple Threshold' to 'Consecutive Samples', and set the 'Number of Samples' to 6 at 3-minute intervals.

    What I am seeing is that alerts from the above Monitors are still appearing as top talkers. When I check the Health Explorer for the repeating alerts, I can see the disk space staying the same, below the applied threshold, but the health turns healthy and then back to unhealthy. I have confirmed that each noisy object has the expected threshold as per its dynamic group allocation, and have also confirmed that the drives are not fluctuating above and below the threshold. One thing I have noticed is that the Performance View for some drives is patchy, with lots of dotted lines between the coloured lines.

    It's almost as if the Monitor moves a Logical Disk object into an unhealthy state in the correct (and expected) manner, then somehow picks up an incorrect threshold which is below the current usage level. This moves it into a healthy state, only for the whole process to repeat. For example: drive X: on a server is very large, the group that it sits in has a threshold of 102400MB, and its current usage is roughly stable at 45500MB. Looking in Health Explorer I can see: 3:01pm green state / 45573 last sampled value / 1 sample | 3:16pm yellow state / 45573 / 6 samples | 3:34pm green state / 45572 / 1 sample | 3:49pm yellow state / 45571 / 6 samples | 4:01pm green state / 45425 / 1 sample, and so on.
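
    To make the Health Explorer pattern above easier to reason about, here is a minimal Python sketch of a toy consecutive-samples monitor. It is not SCOM code. The class, the `reset()` call, and the poll at which the reset happens are all hypothetical; the only inputs taken from this post are the threshold (102400MB), the sampled value (45573), 6 samples, and 3-minute intervals. It only shows that a monitor which loses its sample counter would produce "green with 1 sample, then yellow with 6 samples 15 minutes later" even though the value never crosses the threshold. What would cause such a reset is not known here.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveSamplesMonitor:
    """Toy model: unhealthy after N consecutive samples below the threshold."""
    threshold_mb: float
    samples_required: int = 6
    consecutive: int = 0
    state: str = "green"

    def reset(self) -> None:
        # Hypothetical event: the monitor restarts and forgets its counter.
        self.consecutive = 0
        self.state = "green"

    def feed(self, value_mb: float) -> str:
        # Count samples below the threshold; any sample at or above it clears the count.
        if value_mb < self.threshold_mb:
            self.consecutive += 1
        else:
            self.consecutive = 0
        self.state = "yellow" if self.consecutive >= self.samples_required else "green"
        return self.state


def simulate(value_mb, threshold_mb, polls, reset_at, interval_min=3):
    """Return (minutes_elapsed, state, sample_count) for every state change."""
    monitor = ConsecutiveSamplesMonitor(threshold_mb)
    changes = []
    previous = None
    for poll in range(polls):
        if poll in reset_at:
            monitor.reset()
        state = monitor.feed(value_mb)
        if state != previous:
            changes.append((poll * interval_min, state, monitor.consecutive))
            previous = state
    return changes


# Constant value, well below the threshold, with one assumed reset at poll 11.
timeline = simulate(45573, 102400, polls=20, reset_at={11})
for minutes, state, samples in timeline:
    print(f"+{minutes:>2} min  {state:<6}  samples={samples}")
```

    With these assumed inputs the state changes land at +0, +15, +33 and +48 minutes, which lines up with the 3:01pm / 3:16pm / 3:34pm / 3:49pm entries. That is consistent with a counter reset, but it does not prove one.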

    I'm scratching my head on this one and would appreciate any suggestions or assistance.

    Thanks

    BT

    Wednesday, April 8, 2015 4:42 AM

All replies

  • Hi,

    Based on your description, it seems that the performance monitor is collecting wrong performance data from the monitored agent. I suggest you first flush the health service state and cache on those monitored agents.

    You may also check the Operations Manager event log to see whether there are any errors or warnings related to this issue.

    You may also test by re-installing the SCOM agent on one of the machines and checking the result.

    Regards,

    Yan Li




    Monday, April 13, 2015 8:42 AM
  • Thanks for the reply. It is not just one server or drive this is happening on. I am seeing it on everything: once objects go into an unhealthy state they periodically go healthy and back again with no change in disk free space. To elaborate on how it is set up: a Monitor has been created for each OS version (2003, 2008 and 2012), with separate Monitors for Warning and Critical, so 6 Monitors in total. Looking at the Warning Monitors, they are created with a threshold of 5120MB for 6 samples and set to disabled. The following groups have been created and the following thresholds added:

    Group 1 (less than 60GB size): override added to enable. This group will then pick up the 5120MB threshold.

    Group 2 (60 – 250GB size): override added to enable and override added for 10240MB threshold

    Group 3 (250 – 500GB size): override added to enable and override added for 20480MB threshold

    Group 4 (500 – 1TB size): override added to enable and override added for 51200MB threshold

    Group 5 (>1TB size): override added to enable and override added for 102400MB threshold
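
    For reference, the group layout above can be written down as a small lookup. This is only a Python sketch of the intended mapping, not how SCOM evaluates group membership. The function name is hypothetical, and two details are my assumptions because the list does not state them: each lower bound is inclusive (a drive of exactly 60GB lands in Group 2), and 1TB is taken as 1024GB.

```python
GB_PER_TB = 1024  # assumption: 1TB treated as 1024GB

# (exclusive upper bound in GB, group name, Warning threshold in MB)
GROUPS = [
    (60, "Group 1", 5120),       # Monitor default, enabled by override only
    (250, "Group 2", 10240),
    (500, "Group 3", 20480),
    (GB_PER_TB, "Group 4", 51200),
    (float("inf"), "Group 5", 102400),
]


def expected_warning_threshold(size_gb: float):
    """Return (group name, Warning threshold in MB) expected for a drive size."""
    if size_gb < 0:
        raise ValueError("drive size cannot be negative")
    for upper_gb, name, threshold_mb in GROUPS:
        if size_gb < upper_gb:
            return name, threshold_mb
    raise ValueError(f"cannot classify drive size: {size_gb!r}")


for size in (40, 120, 300, 800, 2048):
    group, threshold = expected_warning_threshold(size)
    print(f"{size:>5}GB -> {group}, Warning below {threshold}MB")
```

    A lookup like this is handy for checking that a noisy object's effective threshold matches the group it should be in.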

    One drive I was looking at is in Group 2 (threshold of 10240MB). It was staying at approx. 8500MB but periodically going into a healthy state, then after 10 mins (6 polls @ 2 min intervals) going back to unhealthy. This process repeats once or twice per day.

    I am wondering if the object is somehow picking up the default threshold of the Monitor (5120MB) and then going back to its correct overridden threshold. I have set up some test groups and monitors in a lab and will review the results over the coming days.

    When the Monitors were set up as 'Simple Threshold' they worked fine but were noisy due to drives spiking downwards. It was only when I re-wrote them as 'Consecutive Samples over Threshold' Monitors that this issue started occurring.

    Thanks

    Tuesday, April 14, 2015 4:20 AM
  • 1) The Windows OS Management Pack has a built-in monitor, Windows Server XXX Logical Disk Free Space (MB) Low, that monitors free disk space using consecutive samples, so I recommend you use this monitor.

    2) Override this monitor for a group and select the group you want, for example Group 2.
    3) Set the following overrides:
      - enabled: true
      - Error threshold for Non-system Drive: 10240MB
      - Error threshold for System Drive: 10240MB
      - Warning threshold for Non-system Drive: 20480MB
      - Warning threshold for System Drive: 20480MB

    Roger

    Tuesday, April 14, 2015 6:49 AM
  • Thanks Roger. I would like to get these custom Monitors working, or at the very least find out why I am seeing this anomaly. Based on your advice, though, I have set up the above built-in Monitors on a couple of servers. I have configured them the same as the custom Monitors (6 polls at 2-minute intervals and a 10240MB threshold) but set 'Generates Alert' to False.

    It will be interesting to see if the change in Health State matches up between both Monitors.
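
    One way to do that comparison once both Monitors have some history is sketched below in Python. It assumes the state changes have been copied out of Health Explorer by hand into lists of (timestamp, state) pairs. The function name, the 5-minute tolerance, and the sample timestamps are all hypothetical and are there only to show the shape of the check.

```python
from datetime import datetime, timedelta


def unmatched_changes(custom, builtin, tolerance=timedelta(minutes=5)):
    """Return state changes in `custom` with no same-state change in
    `builtin` within `tolerance`."""
    return [
        (when, state)
        for when, state in custom
        if not any(
            state == other_state and abs(when - other_when) <= tolerance
            for other_when, other_state in builtin
        )
    ]


# Hypothetical data: the custom Monitor flaps, the built-in one does not.
custom_history = [
    (datetime(2015, 4, 15, 15, 1), "green"),
    (datetime(2015, 4, 15, 15, 16), "yellow"),
    (datetime(2015, 4, 15, 15, 34), "green"),
    (datetime(2015, 4, 15, 15, 49), "yellow"),
]
builtin_history = [
    (datetime(2015, 4, 15, 15, 2), "green"),
    (datetime(2015, 4, 15, 15, 17), "yellow"),
]

for when, state in unmatched_changes(custom_history, builtin_history):
    print(f"{when:%H:%M} {state}: custom Monitor only")
```

    If the custom Monitor shows changes that the built-in one does not, that would point at the custom Monitor or its overrides rather than at the collected data.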

    Cheers

    Wednesday, April 15, 2015 2:51 AM