Tuesday, February 28, 2012 9:49 PM
Over the past few months we have experienced several issues with Health Rollup whereby healthy (green) unit monitors are rolling up to Critical (Red) Entity states.
Needless to say, recalculating or resetting health state does not correct the issue, nor is it related to a particular MP or set of agents (although if pushed I would have to say that the Exchange MP and OpsLogix MPs seem to be worse hit than others).
Consequently, we are having to flush the health service cache on the RMS to correct the issue. My concern is that the lack of "dynamic" health state refreshes is devaluing the product and the many dashboards we have provisioned throughout our complex.
I would be interested to know whether anyone else in the community has experienced similar issues or is aware of anything that may be contributing to this behaviour.
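To make the symptom concrete, here is a minimal sketch of what a "worst state of any member" rollup should produce, and how an entity whose reported state disagrees with its child monitors could be flagged. It is only an illustrative model, not SCOM code: the `Health` enum, the function names and the entity names `EXCH01` and `SQL01` are hypothetical, and the sample states are made up.

```python
from enum import IntEnum


class Health(IntEnum):
    # Ordered so that a larger value means a worse state.
    UNMONITORED = 0
    HEALTHY = 1
    WARNING = 2
    CRITICAL = 3


def worst_of(child_states):
    """Roll up child states using a 'worst state of any member' policy."""
    # Unmonitored children do not contribute to the rollup in this model.
    monitored = [s for s in child_states if s is not Health.UNMONITORED]
    return max(monitored, default=Health.UNMONITORED)


def find_stale_rollups(entities):
    """Return names of entities whose reported state differs from the rollup."""
    stale = []
    for name, (reported, children) in entities.items():
        if reported != worst_of(children):
            stale.append(name)
    return stale


# Hypothetical sample data: (reported entity state, child unit monitor states).
entities = {
    "EXCH01": (Health.CRITICAL, [Health.HEALTHY, Health.HEALTHY]),
    "SQL01": (Health.WARNING, [Health.HEALTHY, Health.WARNING]),
}

stale = find_stale_rollups(entities)
print(stale)  # ['EXCH01']
```

In this model `EXCH01` is the case described above: every child is green, yet the entity is still reported as Critical.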
Tuesday, February 28, 2012 10:36 PM
This is one of the most important issues that SCOM 2012 will resolve. I assume you have installed CU5, right? It is not the final solution, but it is a good starting point.
Wednesday, February 29, 2012 8:33 AM
Yes, we are already running CU5. I am aware of the changes that will be introduced to Health Explorer in SCOM 2012, but I am kind of looking for a solution in SCOM 2007 :-)
Wednesday, February 29, 2012 9:30 AM
I have the same problem... you are not alone :) But I don't know if there is a solution, because I think it is a "bug" in the "kernel code" of SCOM 2007.
Thursday, March 01, 2012 3:07 AM
I have the same issue too.......
Thursday, March 01, 2012 7:51 AM Moderator
This is an issue with SCOM 2007 R2, and I don't think anything new has been introduced in SCOM 2012 to address it. Health Explorer certainly renders quicker on SCOM 2012, as by default it only shows unhealthy monitors, but I have seen rollup issues there as well.
"Needless to say recalculate or reset health state does not correct the issue"
Recalculate will never work in this situation - this button only works for on-demand monitors and roll ups won't be of that type.
Reset Health generally doesn't work and you have to resort to either:
- put the unhealthy monitor into maintenance mode and "hope" that when it comes out of maintenance mode it recalculates health correctly
- set an override (enable = false) to disable the monitor and when it goes unmonitored, remove the override.
Why is it happening? Not sure. The times I have seen it most prevalent are:
- environments where the RMS is overloaded or has had connectivity issues with SQL (e.g. check disk queues on SQL to see if it is struggling to keep up with data being inserted).
- environments where administrators are clearing health service state folder on the RMS frequently .. which is usually a sign of problems with the RMS.
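For anyone unsure what "clearing the health service state folder" involves, the usual sequence is: stop the health service, move the state folder aside, start the service again. Below is a dry-run sketch that only prints those steps and executes nothing. The service name `HealthService`, the default install path and the function name are assumptions on my part, not taken from this thread, so check them against your own RMS before using anything like this.

```python
import ntpath
import subprocess


def plan_cache_flush(
    # Assumed default install path for SCOM 2007; verify on your own server.
    install_dir=r"C:\Program Files\System Center Operations Manager 2007",
    # Assumed Windows service name of the health service.
    service="HealthService",
    suffix="old",
):
    """Return the commands for a cache flush as argument lists (nothing is run)."""
    state_dir = ntpath.join(install_dir, "Health Service State")
    # Rename rather than delete, so the old state can be inspected afterwards.
    backup_dir = f"{state_dir}.{suffix}"
    return [
        ["net", "stop", service],
        ["cmd", "/c", "move", state_dir, backup_dir],
        ["net", "start", service],
    ]


# Dry run: print each step as a quoted Windows command line.
for step in plan_cache_flush():
    print(subprocess.list2cmdline(step))
```

If you find yourself running something like this regularly, that is the sign of an RMS problem mentioned in the point above rather than a fix.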
How many agents do people have?
How many reporting to the RMS?
How many secondary management servers?
Are certificates being used? If so, is the RMS being used for agent communication via certificates?
Thursday, March 01, 2012 7:36 PM
Thanks Graham, reassuring to know that others experience the same issue. To directly answer your questions:
We have 1000 agents in our Management Group.
We have NO agents that report directly to the RMS.
We have 5 secondary Management Servers.
Yes certificates are being used between our gateway servers and our secondary management servers.
Our SCOM SQL environment is running on its own dedicated HP DL580 G7 with 128 GB RAM, 2 quad-core processors, and fast fibre-connected HP EVA storage.
Tempted to open a case with Microsoft Premier Support, but naturally I prefer to consult our experts in the forum in the first instance tho :-)