Server State View - warning state yet issue was resolved?

  • Question

  • Environment: SCOM 2007 R2 CU3

    Why do I still see a warning in the Server State View (Microsoft Windows DNS Server) when there are no active alerts for this issue?

    A Windows 2008 R2 DC had a DNS error (Health Monitor - 'DNS 2008 AD DS and restart the DNS service Server Service monitor' event ID 4013).  The error occurred five days ago and was resolved immediately.  I have tried Reset Health and Recalculate Health in the Health Monitor.  I can see the event in the 'DNS Server' event log, but everything logged since then is fine.  I have restarted the DNS Server service and there are no errors or warnings in the event logs.

    I'm at a loss as to why a warning would continue to show?

    Sunday, December 12, 2010 4:06 PM

Answers

  • Hi tersevenim,

    I doubt it's anything you have done. Unfortunately, this is something that can occur with OpsMgr from time to time. The cause isn't always easy to identify, but it could be related to an intermittent performance or connectivity issue at the time.

    As I understand it, OpsMgr changes the state of a monitor as follows:

    1. Monitor is Healthy
    2. Monitor runs a script, checks the registry, etc. to retrieve output that helps determine the health of the object
    3. Monitor determines what health state should be set based on the results of the check it just performed; in this case it's Warning
    4. Monitor provides the Agent with its current health state
    5. Agent checks its config to determine the health state it currently has for this Monitor, which is Healthy
    6. Agent determines it needs to change the health state of this Monitor in its config, in this case to Warning
    7. Agent updates its local config in memory and in the local health store (it's a file), so the Monitor's health is now Warning
    8. Agent advises its OpsMgr Management Server (MS) that it has changed its config and asks it to retrieve the updated config file
    9. MS retrieves the updated config file from the Agent and stores it locally
    10. MS advises the OpsMgr Root Management Server (RMS) that one of its Agents has an updated config and asks it to retrieve it
    11. RMS retrieves the updated config file from the MS and stores it locally
    12. RMS analyses the Agent's config file and compares it to the config it has in memory for this Agent; in this case it's different
    13. RMS updates the config it has in memory for the Monitor of this Agent to Warning, and also posts this data to the OpsMgr DB as a state change (this is when the Monitor's health state in Health Explorer changes)
    14. The Agent & RMS now have the same state for the Monitor in their config (both in memory and in the local health store)
    15. Monitor polls again and checks the health of the object to see if it has changed; in this case it has changed to Healthy
    16. Agent checks its config, sees the change of state for the Monitor, and updates its config
    17. The same process as detailed in steps 8-13 above should occur, but at some point in that sequence the config update stops and never reaches the RMS (possibly because of connectivity issues, with the data dropped by the Agent, MS or RMS; it's hard to say without analysing the event logs of all 3 systems)
    18. As a result, the RMS has a config for the Agent stating the health of this Monitor is Warning, but the Agent has a config stating the health of this Monitor is Healthy -> We now have a conflict in the state of the Monitor
    19. Monitor polls again to check the health of the object, and notices it hasn't changed (it's still Healthy)
    20. Agent checks its config and finds the Monitor's health state is already set to Healthy, so it doesn't need to update its config or post a config update request to its MS
    21. The Monitor waits for its next polling interval and repeats steps 19 & 20 -> This is what will occur until the configs of the RMS & Agent are back in sync; no config update will occur for this Monitor until its health changes or someone intervenes manually
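
    The sequence in steps 1-21 can be sketched as a toy simulation. This is only an illustration of the description above, not OpsMgr code: the `Agent` and `Rms` classes, the `poll` function and the `delivered` flag are hypothetical names invented for the example, and the MS hop is folded into a single "was the update delivered" switch.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    state: str = "Healthy"

    def observe(self, checked_state):
        """Return the new state if it differs from the stored one, else None."""
        if checked_state == self.state:
            return None          # steps 19-20: nothing changed, nothing is sent
        self.state = checked_state
        return checked_state     # steps 6-8: a state change is announced


@dataclass
class Rms:
    state: str = "Healthy"

    def receive(self, new_state):
        self.state = new_state   # step 13: the RMS records the state change


def poll(agent, rms, checked_state, delivered=True):
    """One polling cycle; delivered=False models an update lost in transit (step 17)."""
    change = agent.observe(checked_state)
    if change is not None and delivered:
        rms.receive(change)
    return agent.state, rms.state


agent, rms = Agent(), Rms()
history = [
    poll(agent, rms, "Warning"),                   # steps 1-14: both sides agree
    poll(agent, rms, "Healthy", delivered=False),  # steps 15-18: update is lost
    poll(agent, rms, "Healthy"),                   # steps 19-21: no change, no update
    poll(agent, rms, "Not Monitored"),             # maintenance mode forces a change
    poll(agent, rms, "Healthy"),                   # leaving maintenance re-syncs
]
for agent_state, rms_state in history:
    print(f"agent={agent_state:<13} rms={rms_state}")
```

    In this sketch the third poll leaves the agent at Healthy and the RMS at Warning indefinitely, because an unchanged state never produces an update; only a different state (here Not Monitored) gets the two sides talking again.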

    To fix this issue, we had to either force the agent to change the state of this monitor to something else (by applying maintenance mode), or have it discard its current config and ask the RMS what it believes the agent's config should be (by deleting the local health store), so they're back in sync. This is why the options from Shadowman and myself worked: they forced the agent either to change the health of the monitor (by applying maintenance mode), or to retrieve the health state of the monitor from the RMS (by deleting the local health store).

    You might be thinking the 'Recalculate Health' button in Health Explorer should do this, but in reality this isn't what it's designed to do. The Recalculate Health option simply asks the monitor to re-run its check and determine its health again right now, rather than waiting for its next polling interval (which could be 5 mins away or 20 hours away). This option only works when 'OnDemand' monitoring is configured for this specific monitor, which is done as part of the design of the monitor in the management pack.

    - If this is an MP you retrieved from the catalog on PinPoint (which is the case in this situation), then you can't change this yourself without re-writing the entire monitor and storing it in your own custom MP (which is a pain).

    - If however it is a custom MP that you have written (which is not the case here), then you'll need to read up about 'OnDemand' monitoring and configure the monitor to use it so the 'Reset Health' and 'Recalculate Health' options will work properly. Information regarding 'OnDemand' monitoring and MP authoring can be found here, among other places.
    http://www.authormps.com/dnn/

    Overall though, the only way I know to force the health state of a monitor on an agent to re-sync with the RMS is by following either of these 2 options. The Recalculate Health and Reset Health options wouldn't help in this situation, as the agent believed the health state of the monitor was Healthy (which was correct), but the RMS thought it was Warning (which was wrong). And when they're out of sync, the agent will not push an update to the RMS unless the health state of the monitor changes to something different from what the agent currently has stored in its config (either to Not Monitored when put into maintenance, or to Warning/Critical again).

    Anyway, I hope this wall of text is useful to someone. Happy monitoring!

    Cheers,
    Brian

    • Proposed as answer by Brian Hodgman Monday, December 13, 2010 12:09 AM
    • Marked as answer by tersevenim Monday, December 13, 2010 12:59 AM
    Monday, December 13, 2010 12:09 AM

All replies

  • You can try this and see whether it resolves the issue.

    Stop the HealthService, rename the folder ‘~:\Program Files\System Center Operations Manager 2007\Health Service State’ to ‘~:\Program Files\System Center Operations Manager 2007\Health Service State_OLD’ and start the HealthService again.
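
    The stop, rename, start sequence could be scripted roughly as below. This is a minimal sketch, assuming it runs on the agent itself from an elevated prompt and that `net stop` / `net start` are available; the `reset_health_store` function and its `run` parameter are hypothetical names for the example, and the folder path is passed in rather than hard-coded because the drive differs per server.

```python
import subprocess
from pathlib import Path


def reset_health_store(state_dir, run=subprocess.run):
    """Stop HealthService, rename the state folder aside, start the service again.

    state_dir is the full path to the 'Health Service State' folder.
    run is injectable so the service commands can be replaced in a dry run.
    """
    state_dir = Path(state_dir)
    backup = state_dir.with_name(state_dir.name + "_OLD")
    if backup.exists():
        raise FileExistsError(f"{backup} already exists; move it away first")

    run(["net", "stop", "HealthService"], check=True)
    try:
        state_dir.rename(backup)
    finally:
        # Start the service again even if the rename failed.
        run(["net", "start", "HealthService"], check=True)
    return backup
```

    Renaming instead of deleting keeps the old store around in case it is needed for troubleshooting; the agent rebuilds a fresh store when the service starts.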


    Certifications: MCSA 2003|MCSE 2003|MCTS(4*)| MCTIP:SA
    Sunday, December 12, 2010 4:31 PM
  • Try applying maintenance mode to the server for 5 minutes and see if that fixes the health state. This should force the monitor to recalculate its health state, and if the DNS error is resolved, the monitor should go back to Healthy.

    Reset Health and Recalculate Health won't always force a monitor to re-test its health state. From memory, I believe this is because the monitor doesn't have OnDemand monitoring configured (which might be the case here). Applying maintenance mode, however, sets the monitor to 'Not Monitored' while maintenance is on (which forces a state change), and it should then change to its true health state (Healthy, Warning or Critical) once it comes out of maintenance.

    • Proposed as answer by Brian Hodgman Sunday, December 12, 2010 9:37 PM
    Sunday, December 12, 2010 9:37 PM
  • Hi Brian and Shadowman,

    Thank you for the suggestions. 

    I actually have five servers (identical builds and such) that have this issue, so I was able to try both.  I tried Shadowman's suggestion - it took a bit, but the server reverted to a good state.  Brian, your suggestion worked as well.  Which leads me to the question of why I had to do any manual intervention at all to resolve an issue that was no longer an issue?  Am I missing some best practice for setting monitors to automatically rescan?  Shouldn't the monitor for this particular alert check whether it is healthy each time it runs, and then report back?

    It seems like the monitor went to a failed state even though the issue was resolved - that doesn't seem like a very good monitoring system?

    Sunday, December 12, 2010 11:08 PM
  • Hi,

    I would like to add my experience as well. For the DNS monitor (DNS 2008 Troubleshoot AD DS And Restart DNS Server Service Server Monitor) to turn green (third event), it looks for any of the events 4400|4522|4523|4524 in the DNS event log (after the event 4013 error was recorded there) before it changes its health status from Warning. The problem here is that, in my own experience, the DNS server won't write any of these events even though it is working fine. Because of that, the monitor never turns Healthy again. I have just recently asked Microsoft whether it's the DNS MP itself that is misconfigured (i.e. looking for the wrong events) or something else.
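
    The behaviour described here can be sketched as a small state replay. This is only an illustration of the logic as described in this post, not the management pack's actual implementation: `monitor_state` is a hypothetical function, and the IDs 2 and 4 in the usage lines are placeholders standing in for any unrelated events.

```python
WARNING_EVENT = 4013
HEALTHY_EVENTS = {4400, 4522, 4523, 4524}


def monitor_state(event_ids, initial="Healthy"):
    """Replay DNS event IDs in order and return the resulting monitor state."""
    state = initial
    for event_id in event_ids:
        if event_id == WARNING_EVENT:
            state = "Warning"
        elif event_id in HEALTHY_EVENTS:
            state = "Healthy"
        # Any other event leaves the state untouched.
    return state


# 4013 followed only by unrelated events: the monitor stays in Warning.
print(monitor_state([4013, 2, 4]))
# 4013 followed by one of the reset events: the monitor returns to Healthy.
print(monitor_state([4013, 2, 4523]))
```

    If the DNS server never logs one of the four reset events, nothing in this model moves the state off Warning, however healthy the service is.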

    Cheers,

    Richard 

    Monday, February 14, 2011 11:23 AM
  • Brian,

    I'm very new to SCOM, and after setting up my SCOM 2012 DEV environment a few of my servers began showing warnings for their health state.  I could not figure out why, and after reading this thread I tried putting them in maintenance mode for 5 minutes; they are now all showing a healthy state again.  As a newbie to SCOM, thanks for your help.

    Wednesday, May 7, 2014 12:59 PM