Over the past few months we have experienced several issues with Health Rollup whereby healthy (green) unit monitors are rolling up to Critical (Red) Entity states.
Needless to say recalculate or reset health state does not correct the issue, nor is this related to a particular MP or set of agents (although if pushed I would have to say that the Exchange MP and OpsLogix MPs seem to be worse hit than others).
As a result, we are having to flush the health service cache on the RMS to correct the issue. My concern is that the lack of "dynamic" health state refreshes is devaluing the product and the many dashboards we have provisioned throughout our complex.
I would be interested to know if anyone else in the community experiences similar issues or is aware of anything that may be contributing towards this behaviour?
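To make the symptom above concrete: under a worst-of rollup policy, an entity should never report a state worse than the worst of its unit monitors. The snippet below is only a minimal sketch of that consistency check. It assumes a worst-of policy (other policies such as best-of also exist), it does not call any SCOM API, and the `Health` ordering, the function names and the sample entity names are all hypothetical, chosen for illustration.

```python
from enum import IntEnum


class Health(IntEnum):
    # Assumed ordering for illustration: a larger value means a worse state.
    UNMONITORED = 0
    HEALTHY = 1
    WARNING = 2
    CRITICAL = 3


def worst_of(child_states):
    """Return the worst child state, or UNMONITORED when there are no children."""
    return max(child_states, default=Health.UNMONITORED)


def find_stale_rollups(entities):
    """Return (name, reported, expected) for every entity whose reported
    state is worse than the worst-of rollup of its unit monitors."""
    stale = []
    for name, (reported, children) in entities.items():
        expected = worst_of(children)
        if reported > expected:
            stale.append((name, reported, expected))
    return stale


# Hypothetical sample data: entity name -> (reported state, unit monitor states).
sample = {
    "ExchangeServer01": (Health.CRITICAL, [Health.HEALTHY, Health.HEALTHY]),
    "SqlServer01": (Health.WARNING, [Health.WARNING, Health.HEALTHY]),
}

for name, reported, expected in find_stale_rollups(sample):
    print(f"{name}: reported {reported.name}, expected {expected.name}")
```

In this sample only the first entity is flagged, because every one of its unit monitors is green while the entity itself shows Critical, which is the behaviour described in the post.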
This is one of the most important issues that SCOM 2012 will resolve. I suppose that you have installed CU5, right? It is not the final solution, but it is a good starting point.
I have the same problem... you are not alone :) But I don't know if there's a solution, because I think it is a "bug" in the "kernel code" of SCOM 2007.
This is an issue with SCOM 2007 R2 and I don't think anything new has been introduced in SCOM 2012. Certainly Health Explorer renders more quickly on SCOM 2012, as by default it only shows unhealthy monitors, but I have seen rollup issues there as well.
"Needless to say recalculate or reset health state does not correct the issue"
Recalculate will never work in this situation - this button only works for on-demand monitors, and rollups won't be of that type.
Reset Health generally doesn't work and you have to resort to either:
- put the unhealthy monitor into maintenance mode and "hope" that when it comes out of maintenance mode it recalculates health correctly
- set an override (enable = false) to disable the monitor and, once it goes unmonitored, remove the override.
Why is it happening? Not sure. The times I have seen it most prevalent are:
- environments where the RMS is overloaded or has had connectivity issues with SQL (e.g. check disk queues on SQL to see if it is struggling to keep up with data being inserted).
- environments where administrators are frequently clearing the health service state folder on the RMS, which is usually a sign of problems with the RMS.
How many agents do people have?
How many reporting to the RMS?
How many secondary management servers?
Are certificates being used? If so, is the RMS being used for agent communication via certificates?
Thanks Graham, reassuring to know that others experience the same issue. To directly answer your questions:
We have 1000 Agents in our Management Group.
We have NO agents that report directly to the RMS.
We have 5 Secondary Management Servers.
Yes certificates are being used between our gateway servers and our secondary management servers.
Our SCOM SQL Environment is running on its own dedicated HP DL580 G7 with 128 GB RAM, 2 quad-core processors, and fast fibre-connected HP EVA Storage.
Tempted to open a case with Microsoft Premier Support, but naturally I prefer to consult our experts in the forum in the first instance :-)