Exchange 2010 SP3 RU2 - replication stops when a particular node is active

    General discussion

  • We currently have a DAG with two mailbox servers in one AD site (nodes 1 and 2) and one mailbox server in the DR AD site (node 3).  When the databases are running on node 1, replication to the other two nodes is fine.  However, when I fail the databases over to node 2, replication to node 3 stops, although node 3 stays current with node 1.  I have tried stopping and restarting replication, and still the replication log count keeps climbing.  The minute I fail back over to node 1, replication on node 3 catches back up and everything is fine.  I have a replication network with static routes on all three nodes.  I can ping each of the nodes during the issue and have verified connectivity between all three nodes.

    Wednesday, September 25, 2013 4:27 PM

All replies

  • Hi,

    Please check the Application log and see whether there are any related error event IDs.


    If you have feedback for TechNet Subscriber Support, contact tnsfl@microsoft.com

    Simon Wu
    TechNet Community Support

    Monday, September 30, 2013 5:11 AM
  • Nothing at all; that was why I came here for help.  I have checked network connections, MTU, etc., and have found no reason for this to happen.
    Monday, September 30, 2013 2:56 PM
  • I've hit a similar issue, and disabling TCP Chimney, then re-enabling it on all nodes (followed by a restart of the Replication service on each node), cleared up the issue.
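    For reference, a sketch of how that toggle can be done on each DAG node, assuming Windows Server 2008/2008 R2 (where `netsh int tcp set global` controls Chimney Offload; `MSExchangeRepl` is the short name of the Microsoft Exchange Replication service):

    ```shell
    :: Show the current TCP global settings, including Chimney Offload State
    netsh int tcp show global

    :: Disable TCP Chimney Offload on this node
    netsh int tcp set global chimney=disabled

    :: Restart the Exchange Replication service so the change takes effect
    net stop MSExchangeRepl && net start MSExchangeRepl
    ```

    Repeat on every DAG member; the setting applies machine-wide, not per NIC.
    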
    Monday, September 30, 2013 9:50 PM
  • Hi,

    What do you mean by 'it stays current with node 1'? Can you explain this further? Have you collapsed the DAG networks?

    Do you have separate NICs for MAPI and Replication networks?


    Thursday, October 03, 2013 9:04 PM
  • Please run the following command and check the error message.

    Get-MailboxDatabaseCopyStatus DB1\Node3 | FL

    Test-ReplicationHealth can also shed some light on this.

    Did you check the HighAvailability events in Event Viewer?
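    To compare queue lengths across all copies on the problem node at a glance, a sketch (Node3 is the server name from this thread):

    ```shell
    # List every database copy on Node3 with its copy and replay queue depths;
    # a climbing CopyQueueLength with a healthy status points at log shipping, not replay
    Get-MailboxDatabaseCopyStatus -Server Node3 |
        Select-Object Name, Status, CopyQueueLength, ReplayQueueLength, LastInspectedLogTime
    ```
    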


    Thursday, October 03, 2013 9:10 PM
  • Turns out this was an issue with the MTU size across the replication network, and it only affected the virtual nodes.

    DAG 1

    Node 1 is a physical box in AD site A

    Node 2 is a virtual box in AD site A

    Node 3 is a virtual box in AD Site B

    When node 1 was primary for a database, replication worked fine to node 2 and node 3.  However, when a database was failed over to node 2, node 1 kept up with replication, but replication to node 3 would fail.  It turns out the replication network NIC MTU was set to 9000; once I set it to 1300, replication started working as expected.

    netsh int ipv4 set subinterface "Replication Network" mtu=1300
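    For reference, the largest frame that crosses the replication network without fragmentation can be probed with ping before settling on a value; the hostname below is illustrative, and 28 bytes of IP/ICMP headers must be added to the payload size to get the effective MTU:

    ```shell
    :: Probe with Don't Fragment set: 1272-byte payload + 28 bytes of headers = 1300-byte MTU.
    :: If this fragments or fails, step the -l value down until it succeeds.
    ping -f -l 1272 node3-replication

    :: Adding store=persistent makes the MTU change survive a reboot
    netsh int ipv4 set subinterface "Replication Network" mtu=1300 store=persistent
    ```
    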

    Saturday, October 05, 2013 7:17 PM
  • I believe best practice is to disable TCP Chimney on all nodes.  I tried re-enabling it and then disabling it again, with no luck.  Turns out it was an MTU issue.


    Saturday, October 05, 2013 7:19 PM
  • Yes, we have separate NICs for the MAPI, replication, and storage networks.  The issue was the MTU size across the replication network.
    Saturday, October 05, 2013 7:20 PM