Exchange 2016 DAG fails to form quorum after forced failover

  • Question

  • Some background. We have a mixed Windows Server environment of mostly Windows Server 2012 R2 member servers and a handful of 2016 and 2008 R2 servers. We are currently migrating our 2008 R2 and 2012 R2 systems to 2016. Almost all systems are Hyper-V guest VMs (including our Exchange servers) hosted on Windows Server 2016 Hyper-V hosts. Our domain is at a functional level of 2012 R2 and is split over two sites, site BC and site DY. The sites are connected via a VPN tunnel over the internet. Each site has two domain controllers: both are 2016 in the BC site, and the DY site has one 2016 and one 2012 R2. One of the DCs in the BC site carries all the FSMO roles.

    Our current project is to replace our existing Exchange 2010 SP3 RU18 servers running on Windows Server 2008 R2. We have two in place right now: one in the DY site that hosts only a couple of mailboxes on a single mailbox database, and one in the BC site that hosts a mailbox database with about 80 mailboxes plus an archive mailbox database. We have already extended AD and built two new Exchange 2016 CU7 servers on Windows Server 2016. Again, one is in the BC site and one is in the DY site.

    Originally, the plan for the DY Exchange server was to host mailboxes for some other projects, but that never happened. As such, we decided to maintain an Exchange server in the DY site and utilize it for DR by deploying a DAG and replicating our mailboxes for the BC site to the DY site. Instead of doing this on the 2010 servers, we are going to deploy the 2016 servers in this fashion and migrate the mailboxes to them.

    On to the problem. As it stands, the Exchange 2016 servers are built with CU7, one in DY (called D-EXCHSRV1) and one in BC (called B-EXCHSRV1), and a DAG called BC-DY-DAG in DAC mode was configured. This DAG has a witness configured in BC and an alternate witness configured in DY. We moved a mailbox or two to a mailbox database in the DAG and tested it out. Test-ReplicationHealth on both servers reports no errors. Get-MailboxDatabaseCopyStatus shows all database copies as Healthy or Mounted, depending on their current owner. Moving the database from server to server with Move-ActiveMailboxDatabase works as expected. If we leave automatic activation on and shut one of the servers down, the databases mount on the other server as expected and fail back after the failback period has passed.
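    For reference, the health checks described above were run from the Exchange Management Shell roughly like this. Server names match this post; the database name DB01 is a placeholder:

    ```powershell
    # Check continuous replication health on each DAG member
    Test-ReplicationHealth -Server B-EXCHSRV1
    Test-ReplicationHealth -Server D-EXCHSRV1

    # Verify all database copies report Healthy (passive) or Mounted (active)
    Get-MailboxDatabaseCopyStatus -Server B-EXCHSRV1
    Get-MailboxDatabaseCopyStatus -Server D-EXCHSRV1

    # Switch the active copy to the DY server (DB01 is a placeholder database name)
    Move-ActiveMailboxDatabase -Identity DB01 -ActivateOnServer D-EXCHSRV1
    ```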

    Everything appears to work fine in this setup as configured. However, our plan was to disable automatic activation and only use the DY site server for DR, so we changed the DatabaseCopyAutoActivationPolicy to Blocked on both servers. To verify our solution, we killed power to the BC Exchange server and witness server. We then followed the datacenter switchover procedure at https://technet.microsoft.com/en-us/library/dd351049(v=exchg.160).aspx to verify we can bring the database copy online at the DY site in the event of a disaster. On the DY site server, we first execute Stop-DatabaseAvailabilityGroup, specifying the BC AD site and the -ConfigurationOnly parameter. We then stop the Cluster service on the DY server and execute Restore-DatabaseAvailabilityGroup, specifying the DY AD site.
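    For clarity, the steps above boil down to the following, run against the names used in this post (a sketch of the switchover sequence from the TechNet article, not the article's verbatim commands):

    ```powershell
    # Block automatic activation on both members (done before the DR test)
    Set-MailboxServer -Identity B-EXCHSRV1 -DatabaseCopyAutoActivationPolicy Blocked
    Set-MailboxServer -Identity D-EXCHSRV1 -DatabaseCopyAutoActivationPolicy Blocked

    # On the surviving DY server, after killing power in BC:
    # 1. Mark the failed BC members as stopped in AD only (the cluster is unreachable)
    Stop-DatabaseAvailabilityGroup -Identity BC-DY-DAG -ActiveDirectorySite BC -ConfigurationOnly

    # 2. Stop the local Cluster service
    Stop-Service ClusSvc

    # 3. Restore the DAG using only the surviving DY site members
    Restore-DatabaseAvailabilityGroup -Identity BC-DY-DAG -ActiveDirectorySite DY
    ```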

    At this point, running the Restore-DatabaseAvailabilityGroup cmdlet is where things go wrong. The cmdlet reports the following error:

    [2018-01-29T16:48:03] Server 'B-EXCHSRV1' was marked as stopped in database availability group 'BC-DY-DAG' but couldn't be removed from the cluster. Error: A server-side database availability group administrative operation failed. Error The operation failed. CreateCluster errors may result from incorrectly configured static addresses. Error: An error occurred while attempting a cluster operation. Error: An error occurred while attempting a cluster operation. Error: Cluster API failed: "EvictClusterNodeEx('B-EXCHSRV1.Domain.local') failed with 0x46. Error: The remote server has been paused or is in the process of being started". [Server: D-EXCHSRV1.Domain.local]

    Checking the Restore-DatabaseAvailabilityGroup log file in the C:\ExchangeSetupLogs\DagTasks\ folder shows that the local Cluster service never completely starts before the cmdlet attempts to remove the remote Exchange server from the cluster. The log shows the cluster node on the DY server in the "Joining" state, when it should be in the "Up" state. Researching this error seems to point to a timing issue, and most recommendations say to rerun the command, but that makes no difference in our case.

    I have run the Get-ClusterLog command and found the section of the log where the service is started with ForceQuorum, and I can see that the service never fully starts. The log shows that it ends up stopping with the following errors:

    00004c50.00003fac::2018/01/29-16:47:33.125 INFO  [VSAM] Node Id for FD info: 92ac05bc-1da2-8bc8-bd73-e800ddb1f70a
    00004c50.0000121c::2018/01/29-16:47:33.126 INFO  [VSAM] Node Id for FD info: 364faf35-a476-379f-9e67-bba72d7bd352
    00004c50.00003fac::2018/01/29-16:47:33.126 INFO  [VSAM] BuildNetworkTarget: remote endpoint , node id 1, bufsize 744
    00004c50.0000121c::2018/01/29-16:47:33.126 INFO  [VSAM] BuildNetworkTarget: remote endpoint \Device\CLUSBFLT\BlockTarget$, node id 2, bufsize 744
    00004c50.0000121c::2018/01/29-16:47:33.126 INFO  [VSAM] SetClusterViewWithTarget: nodeid 2, nodeset 0x2
    00004c50.00003fac::2018/01/29-16:47:33.126 INFO  [VSAM] SetClusterViewWithTarget: nodeid 1, nodeset 0x2
    00004c50.0000121c::2018/01/29-16:47:33.126 ERR   [VSAM] IOCTL_CLUSPORT_GET_UPDATE_MEMBERSHIP_STATE failed: error 87
    00004c50.0000121c::2018/01/29-16:47:33.126 INFO  [VSAM] SetClusterViewWithTarget: waiting for completion for node 2
    00004c50.00003fac::2018/01/29-16:47:33.126 ERR   [VSAM] IOCTL_CLUSPORT_GET_UPDATE_MEMBERSHIP_STATE failed: error 87
    00004c50.00003fac::2018/01/29-16:47:33.126 INFO  [VSAM] SetClusterViewWithTarget: waiting for completion for node 1
    00004c50.0000121c::2018/01/29-16:47:34.127 ERR   [VSAM] IOCTL_CLUSPORT_GET_UPDATE_MEMBERSHIP_STATE failed: error 87
    00004c50.0000121c::2018/01/29-16:47:34.127 INFO  [VSAM] SetClusterViewWithTarget: waiting for completion for node 2
    00004c50.00003fac::2018/01/29-16:47:34.127 ERR   [VSAM] IOCTL_CLUSPORT_GET_UPDATE_MEMBERSHIP_STATE failed: error 87
    00004c50.00003fac::2018/01/29-16:47:34.127 INFO  [VSAM] SetClusterViewWithTarget: waiting for completion for node 1
    00001270.00001228::2018/01/29-16:47:34.990 WARN  [RHS] Cluster service has terminated. Cluster.Service.Running.Event got signaled.
    00002d34.00001180::2018/01/29-16:47:34.990 WARN  [RHS] Cluster service has terminated. Cluster.Service.Running.Event got signaled.
    00001270.00001228::2018/01/29-16:47:34.992 INFO  [RHS] Exiting.
    00002d34.00001180::2018/01/29-16:47:35.003 INFO  [RHS] Exiting.

    I can't find much on the above, but I do know that error 87 is "The parameter is incorrect", which is not very helpful in this case. If we try to force the Cluster service to start manually with ForceQuorum, it never fully starts and gets stuck in a loop where it starts and stops constantly, logging "The parameter is incorrect" to the event log.
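    For anyone following along, forcing the Cluster service up manually on the surviving node can be attempted like this (this is the step that loops for us):

    ```powershell
    # Force the local node to form quorum despite having lost node majority
    net start clussvc /forcequorum

    # Equivalent via the FailoverClusters module
    Start-ClusterNode -Name D-EXCHSRV1 -FixQuorum
    ```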

    We have since rebuilt and reconfigured both Exchange 2016 servers in an attempt to resolve this problem and ended up facing the exact same issue. I have included a link to the logs below, as they may contain information I am missing here.

    https://1drv.ms/f/s!ApEl8Q3xIvLoiDrVH8juRS0TjQHB

    Personally, I think this may be a clustering issue, as we can't get the Cluster service to start once the other Exchange server and witness are offline. We have configured multi-site SQL servers with AlwaysOn Availability Groups and have tested forced failover and forced quorum to bring those online without issue, so I am a bit surprised this is happening. This is our first experience with a failover cluster without an administrative access point, but using the PowerShell cmdlets to check cluster, node, and resource health before the attempted failover shows everything in a good state. I'm not sure what else to look at. Any help with this would be greatly appreciated.
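    For completeness, these are the kinds of pre-failover checks we ran. Because the DAG cluster has no administrative access point, the FailoverClusters cmdlets are pointed at a node name rather than a cluster name:

    ```powershell
    # Query the cluster through a node name (no administrative access point exists)
    Get-Cluster -Name D-EXCHSRV1

    # Node membership and state (both nodes should report Up before the test)
    Get-ClusterNode -Cluster D-EXCHSRV1

    # Resource health
    Get-ClusterResource -Cluster D-EXCHSRV1

    # Dump the cluster log for analysis after a failed attempt
    Get-ClusterLog -Cluster D-EXCHSRV1 -Destination C:\Temp
    ```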

    Friday, February 2, 2018 1:10 PM

Answers

  • Sorry Manu, but it's not a networking issue. The network was verified, and the DAG works fine as configured. It is only when we test a DR scenario that the problem shows up. Thanks for your help, however.

    We have decided not to make use of the DAG, as relying on Microsoft failover clustering introduces other problems for us. So this issue is moot at this point.

    I would open a ticket with Microsoft before giving up on it :)

    Monday, February 12, 2018 6:03 PM
    Moderator

All replies

  • Hi,

    Based on my search, I found the same issue reported for Exchange 2010, and I think it should also apply to Exchange 2016. The solution is to re-run Restore-DatabaseAvailabilityGroup, after which the stopped DAG members will be successfully evicted.

    For details, see Exchange 2010: Restore-DatabaseAvailabilityGroup fails to evict nodes error 0x46.

    Hope it helps.

    Regards,

    Manu Meng


    Please remember to mark the replies as answers if they helped. If you have feedback for TechNet Subscriber Support, contact tnsf@microsoft.com.


    Monday, February 5, 2018 8:49 AM
    Moderator
  • As I mentioned above:

    Researching this error seems to point to a timing issue, and most recommendations say to rerun the command. But, that makes no difference in our case.

    We have tried this already and it does not work. Running Get-ClusterNode <servername> shows the node stuck in the Joining state; the cluster service never fully starts, so it cannot evict the other node.

    Monday, February 5, 2018 2:58 PM
  • Hi,

    Maybe it is a network issue; have you read this?

    What’s Going On With My Cluster?

    Regards,

    Manu Meng



    Friday, February 9, 2018 8:31 AM
    Moderator
  • Sorry Manu, but it's not a networking issue. The network was verified, and the DAG works fine as configured. It is only when we test a DR scenario that the problem shows up. Thanks for your help, however.

    We have decided not to make use of the DAG, as relying on Microsoft failover clustering introduces other problems for us. So this issue is moot at this point.

    Monday, February 12, 2018 5:19 PM
  • I would open a ticket with Microsoft before giving up on it :)

    Monday, February 12, 2018 6:03 PM
    Moderator
  • Hi Joe,

    did you find a solution for your problem? I have the exact same issue...

    Regards,

    N.


    nseslija

    Monday, November 19, 2018 7:23 PM
  •

    nseslija,

    Sorry I missed this. We never found a solution and ended up abandoning the DAG. Did you find a solution, and if so, what was it?

    Thanks, - Joe

    Friday, June 7, 2019 2:08 PM