Logical Disk Free Space Monitor - Slow to detect low free space

  • Question

  • We are using the built-in two-trigger (MB and %) logical disk free space monitor in SCOM 2012 R2. We have set up overrides for the MB warning and critical thresholds for both system and non-system drives, and for a group containing disks we do not want monitored. The monitor itself works fine, triggering an alert when both the MB and % free criteria are met. The problem is that it takes almost an hour for the initial alert to fire. After the initial alert, if I fill the disk further to push it from warning to critical, the alert changes within the specified interval, which we have left at 15 minutes. The alert also clears within the 15-minute interval.

    Has anyone else seen this behavior with this monitor? A disk monitor that takes an hour to fire is not going to be very useful.

    Thursday, July 17, 2014 10:06 PM

Answers

  • I wanted to see for myself whether there was anything else I might be missing, so I opened up the Windows 2008 Logical Disk Free Space monitor XML and noticed that there is a NumSamples configuration value that is set to 4. So, if the interval is 15 minutes, the disk has to exceed both threshold types for 4 consecutive intervals in order to change state and generate an alert. With the default 15-minute interval, the fourth breaching sample arrives roughly 45 to 60 minutes after the disk fills, depending on where in the polling cycle the fill happens, so expect up to about 1 hour before an alert is raised.

    Unfortunately, NumSamples is not overridable in the monitor type, which is too bad. The only way to get an alert sooner is to override the interval. For example, if you want an alert within 20 minutes, override the interval to 300 seconds (5 minutes).

    Here is the code - see for yourself:

          <UnitMonitor ID="Microsoft.Windows.Server.2008.LogicalDisk.FreeSpace" Accessibility="Public" Enabled="true" Target="Server2008!Microsoft.Windows.Server.2008.LogicalDisk" ParentMonitorID="SystemHealth!System.Health.AvailabilityState" Remotable="true" Priority="Normal" TypeID="Microsoft.Windows.Server.2008.FreeSpace.Monitortype" ConfirmDelivery="true">
            <Category>Custom</Category>
            <AlertSettings AlertMessage="Microsoft.Windows.Server.2008.LogicalDisk.FreeSpace.AlertMessage">
              <AlertOnState>Warning</AlertOnState>
              <AutoResolve>true</AutoResolve>
              <AlertPriority>Normal</AlertPriority>
              <AlertSeverity>MatchMonitorHealth</AlertSeverity>
              <AlertParameters>
                <AlertParameter1>$Target/Property[Type="Windows!Microsoft.Windows.LogicalDevice"]/DeviceID$</AlertParameter1>
                <AlertParameter2>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/PrincipalName$</AlertParameter2>
              </AlertParameters>
            </AlertSettings>
            <OperationalStates>
              <OperationalState ID="UnderWarningThresholds" MonitorTypeStateID="UnderWarningThresholds" HealthState="Success" />
              <OperationalState ID="OverWarningUnderErrorThresholds" MonitorTypeStateID="OverWarningUnderErrorThresholds" HealthState="Warning" />
              <OperationalState ID="OverErrorThresholds" MonitorTypeStateID="OverErrorThresholds" HealthState="Error" />
            </OperationalStates>
            <Configuration>
              <ComputerName>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/NetworkName$</ComputerName>
              <DiskLabel>$Target/Property[Type="Windows!Microsoft.Windows.LogicalDevice"]/DeviceID$</DiskLabel>
              <IntervalSeconds>900</IntervalSeconds>
              <SystemDriveWarningMBytesThreshold>500</SystemDriveWarningMBytesThreshold>
              <SystemDriveWarningPercentThreshold>10</SystemDriveWarningPercentThreshold>
              <SystemDriveErrorMBytesThreshold>300</SystemDriveErrorMBytesThreshold>
              <SystemDriveErrorPercentThreshold>5</SystemDriveErrorPercentThreshold>
              <NonSystemDriveWarningMBytesThreshold>2000</NonSystemDriveWarningMBytesThreshold>
              <NonSystemDriveWarningPercentThreshold>10</NonSystemDriveWarningPercentThreshold>
              <NonSystemDriveErrorMBytesThreshold>1000</NonSystemDriveErrorMBytesThreshold>
              <NonSystemDriveErrorPercentThreshold>5</NonSystemDriveErrorPercentThreshold>
              <NumSamples>4</NumSamples>
            </Configuration>
          </UnitMonitor>

    This shows two things:

    1. The monitor is working as designed: your test produced an alert in about an hour.

    2. This is a bad design at best, or a bug if you wish, as NumSamples should not be a hidden configuration value. It should be exposed as an override parameter in the console.

    This should be fixed by Microsoft.
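
    As a rough illustration of the timing described in this answer, here is a minimal Python sketch. It assumes the monitor changes state only after NumSamples consecutive breaching samples, as described above; the function name `detection_delay_seconds` is hypothetical and not part of SCOM. Under that assumption, the last required sample lands between (NumSamples - 1) and NumSamples intervals after the disk crosses the thresholds.

```python
def detection_delay_seconds(interval_seconds, num_samples):
    """Return (shortest, longest) delay in seconds between the disk crossing
    the thresholds and the last of `num_samples` consecutive breaching samples.

    Assumption: state changes only after `num_samples` consecutive breaches.
    """
    if interval_seconds <= 0 or num_samples <= 0:
        raise ValueError("interval_seconds and num_samples must be positive")
    # Best case: the disk fills just before a poll, so only the remaining
    # (num_samples - 1) intervals have to elapse.
    shortest = (num_samples - 1) * interval_seconds
    # Worst case: the disk fills just after a poll, so a full interval passes
    # before the first breaching sample is even taken.
    longest = num_samples * interval_seconds
    return shortest, longest


# Values taken from the monitor XML: IntervalSeconds=900, NumSamples=4.
default_delay = detection_delay_seconds(900, 4)
# The suggested override: IntervalSeconds=300, NumSamples still 4.
override_delay = detection_delay_seconds(300, 4)

print("default :", [s // 60 for s in default_delay], "minutes")
print("override:", [s // 60 for s in override_delay], "minutes")
```

    With the default values this gives a window of 45 to 60 minutes, and with a 300-second interval a window of 15 to 20 minutes.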


    Jonathan Almquist | SCOMskills, LLC (http://scomskills.com)




    Tuesday, July 22, 2014 5:32 PM

All replies

  • Sounds like it's working as designed. Just remember that both threshold types need to be breached before the alert is generated. Take a look at the logical disk free space calculator to know EXACTLY when an alert should be generated.
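
    To make the "both threshold types" rule concrete, here is a minimal Python sketch. It is not the monitor's actual script: the function name `free_space_state` is hypothetical, the strict less-than comparisons are an assumption, and the state names and non-system drive defaults (2000 MB / 10% warning, 1000 MB / 5% error) are borrowed from the monitor XML quoted in this thread.

```python
def free_space_state(free_mb, size_mb, warn_mb, warn_pct, err_mb, err_pct):
    """Classify a disk using the dual-threshold rule: a state is entered only
    when BOTH the MB and the % free thresholds for that state are breached."""
    pct_free = 100.0 * free_mb / size_mb
    if free_mb < err_mb and pct_free < err_pct:
        return "OverErrorThresholds"
    if free_mb < warn_mb and pct_free < warn_pct:
        return "OverWarningUnderErrorThresholds"
    return "UnderWarningThresholds"


# Non-system drive defaults from the monitor XML.
defaults = dict(warn_mb=2000, warn_pct=10, err_mb=1000, err_pct=5)

# 1 TB disk with ~49 GB free: under 10% free, but far above 2000 MB,
# so only one of the two criteria is met and the disk stays healthy.
print(free_space_state(50000, 1048576, **defaults))

# 100 GB disk with 1500 MB free: both warning criteria are met.
print(free_space_state(1500, 102400, **defaults))

# 100 GB disk with 800 MB free: both error criteria are met.
print(free_space_state(800, 102400, **defaults))
```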

    Jonathan Almquist | SCOMskills, LLC (http://scomskills.com)

    Friday, July 18, 2014 2:01 AM
  • The monitor seems to work fine. To identify the issue, you could recreate the low disk space condition and check whether the alert again takes an hour to fire. In my experience, a more likely cause is that the agent experienced some delay in receiving the updated monitor thresholds.
    Roger
    Friday, July 18, 2014 2:41 AM
  • What could be happening is the following.

    The disk space threshold for one of the triggers is met but not the other, and the first alert only fires once both are met. For example, the MB trigger could have been met at 08:00 while the % trigger only breached its threshold at 08:55. Assuming the 15-minute interval ran at 08:05, 08:20, 08:35, 08:50, and 09:05, you would see the alert at 09:05.
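
    The timeline in this example can be sketched in a few lines of Python. This is only an illustration of the scenario above: the function name `first_poll_with_both_met` is hypothetical, the calendar date is arbitrary, and the sketch deliberately ignores any requirement for multiple consecutive samples.

```python
from datetime import datetime, timedelta


def first_poll_with_both_met(mb_met, pct_met, first_poll, interval):
    """Return the first polling time at which both triggers are already met."""
    both_met = max(mb_met, pct_met)  # the later of the two breaches
    poll = first_poll
    while poll < both_met:
        poll += interval
    return poll


day = datetime(2014, 7, 18)  # arbitrary date, only the times matter
alert_poll = first_poll_with_both_met(
    mb_met=day.replace(hour=8, minute=0),    # MB trigger met at 08:00
    pct_met=day.replace(hour=8, minute=55),  # % trigger met at 08:55
    first_poll=day.replace(hour=8, minute=5),
    interval=timedelta(minutes=15),
)
print(alert_poll.strftime("%H:%M"))
```

    The polls at 08:05 through 08:50 all see only the MB trigger breached, so the first poll that can raise the alert is the one at 09:05.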

    My 2 cents worth



    Blog: http://theinfraguys.com


    Friday, July 18, 2014 6:19 AM
  • Thank you all for the responses. Jonathan, your article and disk space calculator are what we used to set up our alerting. Roger, I have been able to consistently see the 1-hour delay for the initial alert.

    In the case I am testing, the disk is 50 GB with the warning set at 10% and 2048 MB. To test, I simply filled the drive until approximately 1.5 GB was free, which certainly meets both criteria. I filled the disk at 12:52 PM and the alert came out at 1:48 PM. Interestingly, if I open the Health Explorer and do a reset/recalculate health, the disk shows as healthy even though I can see on the server that it is below both thresholds.
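
    For reference, the numbers in this test can be checked with a short Python sketch. It assumes 1 GB = 1024 MB and a strict less-than comparison, and the calendar date is arbitrary since only the clock times were given.

```python
from datetime import datetime

size_mb = 50 * 1024   # 50 GB disk
free_mb = 1.5 * 1024  # approximately 1.5 GB free after filling
warn_mb = 2048        # MB warning threshold used in the test
warn_pct = 10         # % warning threshold used in the test

pct_free = 100 * free_mb / size_mb
mb_breached = free_mb < warn_mb
pct_breached = pct_free < warn_pct

filled = datetime(2014, 7, 18, 12, 52)   # disk filled at 12:52 PM
alerted = datetime(2014, 7, 18, 13, 48)  # alert raised at 1:48 PM
elapsed_minutes = (alerted - filled).total_seconds() / 60

print(f"free: {free_mb:.0f} MB ({pct_free:.1f}%)")
print(f"MB breached: {mb_breached}, % breached: {pct_breached}")
print(f"elapsed: {elapsed_minutes:.0f} minutes")
```

    Both criteria are breached (1536 MB is below 2048 MB, and 3% is below 10%), and the alert arrived 56 minutes after the disk was filled.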

    Friday, July 18, 2014 3:18 PM
  • What version of the OS MP are you using?  Cookdown was broken in earlier versions.

    I usually alert on percentage only and raise the MB free thresholds above 1.5 TB, so that the MB condition is effectively always met and the alert triggers on % free alone. I also changed the interval: I thought it used to run hourly, so I set it to every 15 minutes. In my honest opinion, use one criterion or the other to alert on; I wouldn't use both (% and MB free). % free is what most ops teams care about. Using both might be the issue here, especially if the sample value is just shy of the tipping point.


    Regards, Blake Email: mengotto<at>hotmail.com Blog: http://discussitnow.wordpress.com/

    Monday, July 21, 2014 5:40 PM
  • We are running Version 6.0.7061.0 which I believe is the latest.

    We chose the dual monitor due to varying disk sizes but maybe we'll need to change.

    Going to open ticket with MS and will post results here.

    Tuesday, July 22, 2014 4:15 PM
  • If you really think you have the numbers correct, then it's quite possible there is a bug. One way to figure this out is to use Workflow Analyzer to see runtime data for this monitor. I would suggest picking a target machine like you did in your testing, change the frequency for a disk on that system to 2 minutes (or something like that), fill the disk like you did in your previous testing, and run the Workflow Analyzer against that agent.

    Jonathan Almquist | SCOMskills, LLC (http://scomskills.com)

    Tuesday, July 22, 2014 5:19 PM
  • Thanks Jonathan! Good find on the NumSamples. Hopefully that will be corrected in the next release of the MP.
    Monday, July 28, 2014 6:11 PM