We currently have a DAG with two mailbox servers in one AD site (nodes 1 and 2) and one mailbox server in the DR AD site (node 3). When the databases are running on node 1, replication to the other two nodes is fine. However, when I fail the databases over to node 2, replication to node 3 stops, although node 1 stays current. I have tried stopping and restarting replication, but the replication log count keeps climbing. The minute I fail back over to node 1, replication catches up on node 3 and everything is fine. I have a replication network with static routes on all three nodes. I can ping each of the nodes during the issue and have verified connectivity between all three nodes.
- Changed type: Simon_Wu (Microsoft contingent staff, Moderator), Monday, September 30, 2013 5:09 AM
Please run the following command and check the error message:
Get-MailboxDatabaseCopyStatus DB1\Node3 | FL
Test-ReplicationHealth can also shed some light on this.
Did you check the HighAvailability events in Event Viewer?
It turns out this was an issue involving the virtual node and the MTU size across the replication network.
Node 1 is a physical box in AD site A
Node 2 is a virtual box in AD site A
Node 3 is a virtual box in AD site B
When node 1 was primary for a database, replication worked fine to node 2 and node 3. However, when a database was failed over to node 2, node 1 kept up with replication but replication to node 3 would fail. It turns out the replication network NIC MTU was set to 9000; once I set it to 1300, replication started working as expected.
netsh int ipv4 set subinterface "Replication Network" mtu=1300
I believe best practice is to disable TCP chimney on all nodes. I tried re-enabling it and then disabling it again, with no luck. It turned out to be an MTU issue.
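For anyone who wants to check for an MTU problem like the one above before changing anything, one common approach is to ping with the Don't Fragment flag and a payload sized to the MTU. On Windows, a plain ping sends only 32 bytes by default, so it can succeed even when full-size frames are being dropped, which may be why ping looked fine during the issue. The following is a minimal Python sketch of the arithmetic, not something the original poster ran. It assumes a plain IPv4 header without options, and the host name Node3 is only a placeholder.

```python
# Sketch: work out the largest ICMP payload that fits in one unfragmented
# IPv4 packet for a given MTU, and build the matching Windows ping command.
# Assumption: 20-byte IPv4 header (no options) plus 8-byte ICMP echo header.
# The host name "Node3" is a placeholder for the remote DAG member.

IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_icmp_payload(mtu: int) -> int:
    """Return the largest ping payload (bytes) that fits in one packet of size mtu."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError(f"MTU {mtu} is too small for an ICMP echo request")
    return payload


def df_ping_command(host: str, mtu: int) -> str:
    """Build a Windows ping command with Don't Fragment (-f) and payload size (-l)."""
    return f"ping -f -l {max_icmp_payload(mtu)} {host}"


if __name__ == "__main__":
    # Jumbo frames, standard Ethernet, and the value used in the fix above.
    for mtu in (9000, 1500, 1300):
        print(df_ping_command("Node3", mtu))
```

If the command built for an MTU of 9000 fails or times out between two nodes while the one for 1300 succeeds, some hop on the replication path is probably not passing jumbo frames end to end.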