SCOM 1807 - Management Server Fail-over Mechanism

  • Question

  • Hi,

    I have set up SCOM with several resource pools, each including 2 MSs for load-balancing and fail-over purposes.

    I have noticed that when the RMS server goes down for any reason, the whole SCOM health status goes grey and the Windows agents go critical / grey as well. Is this normal behavior in SCOM? Note that all agents have their primary and fail-over hosts set successfully. I also confirmed through a script against the OpsDB that this is enabled for the fail-over mechanism as well.

    Is there a way to confirm whether fail-over between MSs within the same resource pool is actually working?

    Are there any further settings to set or confirm for the MS fail-over mechanism?

    Thanks in advance

    Thursday, August 8, 2019 9:33 AM

All replies

  • Hi,

    How did you create the resource pools, from the Operations Console or via PowerShell?

    If you created the resource pool from within the Operations Console, the Default Observer is enabled by default.

    If you created the resource pool from PowerShell, the Default Observer is disabled by default.

    If the Default Observer is disabled, you will lose high availability for the pool.
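
To illustrate why a disabled Default Observer costs a two-server pool its high availability, here is a minimal Python sketch of the majority rule. It assumes that a pool stays available only while more than half of its voting members are reachable, and that the Default Observer counts as one extra vote that is always reachable. The function name `pool_has_quorum` and its parameters are made up for this illustration; they are not part of SCOM.

```python
def pool_has_quorum(servers_total: int, servers_up: int, observer_enabled: bool) -> bool:
    """Return True while more than half of the pool's voters are reachable.

    Assumption: every management server in the pool has one vote, and the
    Default Observer (when enabled) adds one vote that is always reachable.
    """
    if not 0 <= servers_up <= servers_total:
        raise ValueError("servers_up must be between 0 and servers_total")
    observer_votes = 1 if observer_enabled else 0
    voters_total = servers_total + observer_votes
    voters_up = servers_up + observer_votes
    # Strict majority: exactly half of the votes is not enough.
    return 2 * voters_up > voters_total


if __name__ == "__main__":
    # Two management servers in the pool, one of them shut down.
    print(pool_has_quorum(2, 1, observer_enabled=True))
    print(pool_has_quorum(2, 1, observer_enabled=False))
```

Under these assumptions, a pool of two servers with the observer enabled keeps 2 of 3 votes when one server is lost, while the same pool without the observer is left with 1 of 2 votes, which is not a majority.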

    You'll find good information about this on Kevin's blog post here:
    Understanding SCOM Resource Pools

    The official documentation also provides good information over here:
    Resource pool design considerations

    A way to test if the failover works would be to manually trigger an alarm on a monitored computer, then check if the alarm shows up in the Operations Console, or if any notification is sent.

    Best regards,
    Leon


    Blog: https://thesystemcenterblog.com

    Thursday, August 8, 2019 9:58 AM
  • Thanks for the reply,

    With regard to the resource pools, I created them through the Console and confirmed that the observer is enabled.

    Do you have any thoughts on the behavior mentioned above when the RMS is down?

    Thursday, August 8, 2019 10:03 AM
  • The RMS used to be a single point of failure, but this is no longer the case, as all management servers now host the services previously hosted only by the RMS. Roles are distributed across all the management servers; if one management server becomes unavailable, its responsibilities are automatically redistributed.

    So the behavior you describe doesn't seem right. Are all management servers and the database healthy otherwise?


    Blog: https://thesystemcenterblog.com

    Thursday, August 8, 2019 10:05 AM
  • Hi,

    a very important note: Windows agents do not fail over within a resource pool; they fail over to any random management server unless configured otherwise (via PowerShell). Only network devices and UNIX/Linux agents fail over within a resource pool, provided that it is properly configured (see the articles Leon posted, in particular "Understanding SCOM Resource Pools" by Kevin Holman).

    Actually, there are two articles on the topic, one written by me and the other by Sameer Mhaisekar, which explain in detail how agent failover works for Windows agents and for UNIX/Linux agents:

    WINDOWS AGENTS AND FAILOVER – DEBUNKING THE MYTH!

    LINUX/UNIX AGENT FAILOVER AND RESOURCE POOLS – DEBUNKING THE MYTH! – PART 2
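
The Windows agent behavior described in this reply can be sketched as a toy model in Python. This is only an illustration of the rule as stated here (fail over to a configured server if one is set, otherwise to any other management server in the management group); it is not how the agent is implemented. The function name `failover_candidates` and the names MS3 to MS5 are hypothetical.

```python
from typing import Iterable, List, Sequence


def failover_candidates(
    primary: str,
    configured_failover: Sequence[str],
    all_servers: Sequence[str],
    servers_up: Iterable[str],
) -> List[str]:
    """Return the management servers a Windows agent could report to.

    Assumption: the agent stays on its primary while the primary is up.
    Otherwise it may use a configured failover server, or, if none is
    configured, any other management server that is still reachable.
    """
    up = set(servers_up)
    if primary in up:
        return [primary]
    if configured_failover:
        eligible = list(configured_failover)
    else:
        # No explicit failover list: every other management server qualifies.
        eligible = [s for s in all_servers if s != primary]
    return [s for s in eligible if s in up]


if __name__ == "__main__":
    servers = ["MS1", "MS2", "MS3", "MS4", "MS5"]
    up_now = {"MS2", "MS3", "MS4", "MS5"}  # MS1 is shut down
    print(failover_candidates("MS1", ["MS2"], servers, up_now))
    print(failover_candidates("MS1", [], servers, up_now))
```

Returning the full candidate list, rather than picking one server at random, keeps the sketch deterministic. Note that resource pool membership plays no part in the model, which is the point made above.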

    Hope I could be of assistance!

    Regards,


    (Please take a moment to "Vote as Helpful" and/or "Mark as Answer" where applicable. This helps the community, keeps the forums tidy, and recognizes useful contributions. Thanks!) Blog: https://blog.pohn.ch/ Twitter: @StoyanChalakov


    Thursday, August 8, 2019 10:07 AM
    Moderator
  • Is there anything interesting in the Operations Manager event log on one of the agents that goes grey?
    Thursday, August 8, 2019 10:50 AM
  • That's what I had read as well, that the RMS used to be a single point of failure; however, for some strange reason this scenario is happening.

    I have set the All Management Servers resource pool from automatic to manual, and kept the 2 servers responsible for Windows agents in it.

    Then I created another 2 resource pools: one for Unix/Linux (containing another 2 MSs), and the other responsible for Web Application Monitoring only.

    When the RMS server is unavailable and the management health goes grey, the other components such as the MSs, DBs etc. are not affected. I only notice that the agents and the health state go grey.

    Thursday, August 8, 2019 11:18 AM
  • You mentioned that some SCOM agents go into a critical state, what does the Health Explorer say for some of them?

    Check the Operations Manager event log from some of the SCOM agents to see what it complains about.


    Blog: https://thesystemcenterblog.com

    Thursday, August 8, 2019 11:49 AM
  • The agents all go grey. What I failed to add earlier is that as soon as this MS is unavailable, the All Management Servers resource pool also reports as unavailable.
    Thursday, August 8, 2019 11:53 AM
  • Let's start again from the beginning :

    How many MSs are in that environment? How many gateways? How many of each are in which resource pool?

    How are the primary and secondary MSs configured at the SCOM Windows agent level?

    Do you see any events in the event viewer telling you something is wrong with the resource pools, or with communication to any management server?
    • Edited by CyrAz Thursday, August 8, 2019 12:05 PM
    Thursday, August 8, 2019 12:04 PM
  • 5 MS in total, no gateways:

    1. 2 MS in the All Management Servers resource pool (half of the agents have one as primary and the other as secondary, and vice versa for the other 50%)

    2. 2 in the Unix/Linux resource pool

    3. 1 in the Web Application Monitoring resource pool

    In the event viewer, there are no alerts in the Operations Manager section regarding resource pools, apart from the heartbeat alerts when they go unavailable.


    • Edited by StonerK Thursday, August 8, 2019 12:15 PM
    Thursday, August 8, 2019 12:10 PM
  • Hi,

    what we need to figure out is why the second server in the All Management Servers resource pool also goes grey instead of taking over all the agents. Can you please check what events are logged on it when the other server (the RMS) becomes unavailable?

    There should definitely be something in the event logs that can explain why the resource pool goes grey.

    Regards,


    Blog: https://blog.pohn.ch/ Twitter: @StoyanChalakov

    Thursday, August 8, 2019 12:15 PM
    Moderator
  • As a clarification, the 2nd server does not go grey; it remains healthy in the Management Servers view. However, the whole SCOM health group and the agents go grey. I will check the events on the 2nd MS again.
    Thursday, August 8, 2019 12:19 PM
  • No alerts seem to be logged in the event viewer under Operations Manager apart from the "failed to heartbeat" one.
    Thursday, August 8, 2019 12:33 PM
  • It would help if you could be very specific with the terms you use and everything else you are mentioning : 

    What do you call a "SCOM health group"? 

    When you say "agents go grey", is it the Health Service (watcher) class instances? Windows Computer instances? From the "agent managed view"? Another view?

    "No alert seem to be logged in the eventvwr apart for xyz"... on which server? Can you provide us with the exact event content?

    Thanks :)

    Thursday, August 8, 2019 12:45 PM
  • Hi,

    1. SCOM Health Group - Management Group Health - Operations Manager Folder

    2. Agents Grey - From Health Service Watcher / Agent Health State / Windows Computer

    3. No alerts about resource pools are reported in the event viewer, apart from the example below:

    The entity "All Management Servers Resource Pool" is not heartbeating.

    Could the issue be caused by the scenario below?

    MS1 is restarted, so the agents start failing over to MS2. While the fail-over is still in progress, MS1 resumes operations, and since MS1 is the primary, the agents start to fail back. Could this have an effect?

    Thursday, August 8, 2019 2:09 PM
  • Hi, as an update on the matter, I shut down MS1 and left it down until SCOM alerted that MS1 was unavailable.

    The alert for the unavailable resource pool was triggered and all agents became grey. This persisted until all agents and the resource pool had failed over, after which the agents and the resource pool returned to healthy. (The event viewer shows that the agents resumed on MS2.)

    I believe this concludes the matter. Any further thoughts?

    Thanks for all the assistance.

    • Marked as answer by StonerK Tuesday, August 13, 2019 1:54 PM
    Thursday, August 8, 2019 2:46 PM
  • OK, so it basically worked from the beginning, just too slowly?

    • Edited by CyrAz Thursday, August 8, 2019 6:08 PM
    Thursday, August 8, 2019 5:45 PM
  • I have never tested the failover with one of the MSs shut down; the behavior I was encountering occurred during scheduled / planned restarts.

    Personally, I don't know the duration of the failover or whether such behavior is normal, but with 1 MS shut down I confirmed that after a few minutes all resources failed over to its counterpart.

    • Marked as answer by StonerK Tuesday, August 13, 2019 1:54 PM
    Tuesday, August 13, 2019 9:50 AM
  • Hi,

    this is absolutely fine. I must double-check this, but I think I've read that the agent needs about 60 seconds until it fails over to any MS, or to a configured MS, within the Management Group. So a couple of minutes sounds normal to me.

    Regards,


    Blog: https://blog.pohn.ch/ Twitter: @StoyanChalakov

    • Marked as answer by StonerK Tuesday, August 13, 2019 1:54 PM
    Tuesday, August 13, 2019 12:50 PM
    Moderator