Wednesday, February 06, 2013 9:29 PM
Having a significant problem with Total CPU Utilization Percentage Monitor flapping and could use some assistance in troubleshooting.
- OpsMgr 2012 RTM with CU3 installed
- Windows 2008 R2 SP1 (Management Servers and Database Servers)
- SQL 2008 R2 SP1 CU6 Operations Database
I have overridden the Total CPU Utilization Percentage Monitor for Windows 2003 operating systems with the following settings:
- Interval Seconds: 60
- CPU Queue Length Threshold: 2
- Timeout Seconds: 55
- CPU Percentage Utilization Threshold: 90
- Enabled: True
- Number of Samples: 5
The problem is that I get a lot of CPU monitor flapping, with alerts re-opening 1 minute after they close.
- Time: 2/6/2013 7:19:27 AM, State: BAD, QueueLength: 16, PctUsage: 91.871353149414063
- Time: 2/6/2013 7:20:27 AM, State: GOOD, QueueLength: 3, PctUsage: 92.915174865722662
- Time: 2/6/2013 7:21:26 AM, State: BAD, QueueLength: 5, PctUsage: 92.723704528808597
So, first, a queue length of 3 and a PctUsage of 92.9 should not roll back to a GOOD state.
Second, since the Number of Samples = 5, shouldn't there be a gap of at least 4 minutes before another BAD state can be reached (since it'll need to have 5 samples in a row since the last GOOD state)?
I have triple-checked the overrides. In fact there is only 1 override for Total CPU Utilization Percentage, and I listed those parameters above. Where can I go to start checking 1) why it's interpreting a queue of 3 and a Pct of 92 as GOOD, and 2) why it can flap between states in 1 minute when Number of Samples is set to 5 and the interval is 60 seconds?
Appreciate any help.
Wednesday, February 06, 2013 9:44 PM
I don't think I can answer why it comes back after 1 minute based on the info here, but Jonathan Almquist has a good write-up on exactly how the Total CPU Utilization monitor works and why the queue of 3 is considered good. (I assume this is a dual-core system in this specific example.)
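Per that write-up (and assuming a dual-core host, so two logical CPUs), the monitor only counts a sample as BAD when both the CPU percentage and the queue length *per logical CPU* exceed their thresholds. A minimal sketch of that condition logic, hedged since the actual monitor logic lives in the sealed MP:

```python
# Hedged sketch of the assumed condition logic -- not the actual monitor code.
# Assumption: a sample is BAD only when BOTH the CPU percentage exceeds its
# threshold AND the processor queue length divided by the number of logical
# CPUs exceeds the queue threshold.

def sample_is_bad(pct_usage, queue_length, num_cpus,
                  pct_threshold=90, queue_threshold=2):
    """Return True if a sample counts as BAD under the assumed logic."""
    per_cpu_queue = queue_length / num_cpus
    return pct_usage > pct_threshold and per_cpu_queue > queue_threshold

# The 7:20:27 sample from the thread, assuming a dual-core host:
print(sample_is_bad(92.915, 3, 2))   # False -> GOOD: 3 / 2 = 1.5 <= 2
# The 7:21:26 sample:
print(sample_is_bad(92.724, 5, 2))   # True -> BAD: 5 / 2 = 2.5 > 2
```

Under that assumption, queue 3 over 2 CPUs gives 1.5, which is below the queue threshold of 2, so the 7:20:27 sample evaluates GOOD despite 92.9% usage.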
Thursday, February 07, 2013 4:00 AM Moderator
Personally, I don't really care for this monitor and how it works, and it confuses a lot of people because the queue length parameter is misleading. I'd rather just run the other CPU monitor and enable the "generate alert" parameter (it only changes state by default).
Jonathan Almquist | SCOMskills, LLC (http://scomskills.com)
Thursday, February 07, 2013 4:24 AM
Dzillner, thanks for the link. That explains why the monitor is changing back to a GOOD state. Jonathan, I agree that it is confusing, but queue length is an important metric that we'd like to monitor. I may just create separate monitors. I am still confused how a single bad sample is flipping the monitor back to a BAD state so soon. I guess I should crack the sealed MP open and see if there are any clues. It seems like it is not truly looking at consecutive samples.
Thursday, February 07, 2013 7:43 AM Moderator
I don't have a system to hand to check, and I agree that we'd need to look at the XML to see what it is doing. I suspect it is using System.ConsolidatorCondition, and I think one of its options is a sliding calculation, so it doesn't necessarily reset every x samples. That could possibly account for what you are seeing.
Thursday, February 07, 2013 2:26 PM
Looks like it's using System.Performance.AveragerCondition. So that makes sense.
<ConditionDetection TypeID="SystemPerf!System.Performance.AveragerCondition" ID="CDAverageThreshold">
  <NumSamples>$Config/NumSamples$</NumSamples>
</ConditionDetection>
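If that module works the way its name suggests, the monitor's state follows a *sliding* average of the last NumSamples values rather than requiring NumSamples consecutive breaches after each reset. A minimal sketch with illustrative values (not the thread's data) showing how a rolling average near the threshold can flip state on consecutive samples:

```python
from collections import deque

# Hedged sketch: assumes System.Performance.AveragerCondition emits the
# sliding average of the last NumSamples values, and state is compared
# against the threshold on every new sample. Because the window slides
# rather than resets, one new sample can flip the state within a single
# interval -- no gap of NumSamples intervals is enforced.

def rolling_states(samples, num_samples=5, threshold=90):
    """Yield (average, state) for each sample using a sliding-window average."""
    window = deque(maxlen=num_samples)
    for value in samples:
        window.append(value)
        avg = sum(window) / len(window)
        yield avg, "BAD" if avg > threshold else "GOOD"

# Illustrative values hovering near the threshold of 90:
values = [89, 90, 91, 92, 89, 85, 95]
for avg, state in rolling_states(values):
    print(round(avg, 1), state)
```

With these values the last three samples evaluate BAD, GOOD, BAD on consecutive intervals, which would match the 1-minute flapping in the log above.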
I think I will look at creating a new monitor. I'm not sure I like how this one works, and it's definitely confusing to my service desk, which has to respond to the alerts.
Thanks everyone for the guidance!