Skype4B 2015 HA issue with even number of nodes and SQL AlwaysOn

  • Question

  • Hello,

    We are planning Skype for Business High Availability. We are currently testing it in our test environment and are having issues similar to those described in this thread.

    The difference is that we are using SQL AlwaysOn as a HA solution for Back End.

    In our environment we have two server rooms in one site, built for redundancy; losing one of them is the situation HA should cover. For DR we have one more site. We are planning to use 4 FEs and 2 BEs in a SQL AO Availability Group. When we disconnect 2 FEs and 1 BE one by one, everything is OK. When we disconnect 2 FEs and the active BE node simultaneously (simulating the failure of one server room), SQL fails over successfully in 5 seconds (as we expected), but the FE pool stops 5 minutes later. We expected that once the BE had failed over, the FE cluster quorum would be OK and the pool would keep working without interruption.

    Looking for any advice or explanation.

    Regards,

    Sviatoslav T.


    • Edited by apusnik Friday, April 22, 2016 1:26 PM
    Friday, April 22, 2016 11:51 AM

All replies

  • As I remember, when you have an even number of FEs, SQL is used as the tie-breaking vote for quorum, so I think that is your problem.

    Check the video below for more details: https://channel9.msdn.com/Events/Lync-Conference/Lync-Conference-2014/SERV402

    Sunday, April 24, 2016 12:37 PM
  • Thank you for your reply,

    I am aware of the BE server's role when there is an even number of FE servers. This is why we use SQL AlwaysOn as the HA solution for our BE servers. When the active BE node fails, it fails over to the second node in 5 seconds. This topology is stable when we lose servers one by one; the issue happens only when we lose them all (2 FEs + active BE) at the same time.


    • Edited by apusnik Sunday, April 24, 2016 3:40 PM
    Sunday, April 24, 2016 3:40 PM
  • Hi,

    In pools with an even number of servers, Skype for Business Server uses the primary SQL database as a witness. In a pool like this, if you shut down the primary database and switch to the mirror copy, and shut down enough Front End servers so that not enough are running according to the preceding table, the entire pool will go down.

    You may then need to bring the pool back up by running the following in the SFB Server Management Shell:

    Reset-CsPoolRegistrarState -PoolFqdn <PoolFQDN> -ResetType ServiceReset
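
The witness arithmetic described in this reply can be illustrated with a minimal sketch. It assumes a plain majority vote in which each Front End server has one vote and the SQL witness adds one vote only when the pool has an even number of Front Ends. The function name `has_quorum` and the model itself are hypothetical simplifications for illustration, not the actual quorum algorithm used by Skype for Business Server.

```python
def has_quorum(fe_total, fe_up, witness_up):
    """Return True if the running voters form a strict majority.

    Assumption (simplified model): one vote per Front End server, plus
    one witness vote when the pool has an even number of Front Ends.
    """
    witness_votes = 1 if fe_total % 2 == 0 else 0
    total_votes = fe_total + witness_votes
    votes_up = fe_up + (1 if witness_votes and witness_up else 0)
    return votes_up > total_votes // 2


if __name__ == "__main__":
    # Scenarios for the 4-FE pool discussed in this thread.
    scenarios = [
        ("all 4 FEs up, witness up", 4, 4, True),
        ("2 FEs down, witness up", 4, 2, True),
        ("2 FEs down, witness unreachable", 4, 2, False),
    ]
    for label, total, up, witness in scenarios:
        print(f"{label}: quorum={has_quorum(total, up, witness)}")
```

Under this simplified model, a 4-FE pool with 2 FEs down keeps a majority only while the witness vote is reachable, which matches the behaviour described in the reply for mirrored databases.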

    Best Regards


    Eason Huang
    TechNet Community Support

    Wednesday, April 27, 2016 3:05 AM
  • I think it is related to timing: SQL did not fail over quickly enough to take the third role, so you now have one FE + one BE down, which affects quorum.

    Try delaying the FE shutdown until SQL has failed over, and check.

    My recommendation is to add an FE.

    Wednesday, April 27, 2016 5:15 AM
  • Hi,

    It looks like you are talking about SQL Mirroring, but we are using a SQL AlwaysOn AG. There is no primary and mirror; there is a SQL listener that always responds as a Skype4B cluster voter, except for the 5 seconds of the SQL failover procedure. When we shut down 2 of 4 FEs and only afterwards initiate a SQL failover by failing the active BE node, Skype4B works without any issues. So the question is why it stops when the same servers fail together.

    Wednesday, April 27, 2016 9:31 AM
  • I was thinking about timing too, but the 5 seconds without the SQL listener does not affect the Skype4B cluster when only SQL fails while 2 of 4 FEs are already down.

    We cannot delay any server failure, because when a server room goes down, all the servers in that room go down together. :-)

    We cannot add a 5th FE because we need to survive the failure of either server room, so an equal number of servers is needed in both rooms.


    • Edited by apusnik Wednesday, April 27, 2016 9:38 AM
    Wednesday, April 27, 2016 9:36 AM