DFSR problems with large files and repeated events 5014 with error 1726

  • Question

  • Hello everyone,

    before I describe the environment and the problem, please note that English isn't my native language, but I will try to explain the problem as well as possible.

    The environment

    • 8 different sites (also in AD Sites and Services) with different subnets (all Class C /24), each site has a local domain controller
    • the sites are connected over VPN (~1 Mbit/s)
    • two of the sites each have a "mainserver". These two "mainservers" are connected to each other by a VPN; each of the other 6 servers has one VPN to "mainserver A" and one to "mainserver B". Those 6 servers are not connected to each other, only to "mainserver A" and "mainserver B"
    • all servers run W2K3 R2, W2K8 or W2K8 R2
    • all servers have the latest patchlevel
    • all servers are in one domain, no trusts, no subdomains and so on
    • there's no problem with normal AD replication (passwords, users) or anything like that
    • no obvious errors from dcdiag or netdiag
    • there's only one namespace, let's call it "intranet"
    • this namespace contains only two subfolders, let's call them "folder1" and "folder2"
    • folder1 holds many smaller files (e.g. .doc, .xls, .txt) in many subfolders
    • folder2 holds several bigger files (~20 MB, *.msi) that we use for software deployment
    • all folders exist physically on each server
    • each folder within the namespace has its own replication group
    • the replication bandwidth is set to "full" (18:00-06:00) and to 128 kbit/s (06:00-18:00)
    • there is a third replication group (folder 3) with some image files (1.8 GB) that I also need for software deployment
    • the only members of this third replication group are the "mainservers" A and B
    • this third replication group is not published in the namespace
    My Problem:
    • folder1 and folder2 are correctly replicated to each of the 8 servers
    • the image files in folder (3) won't replicate
    • if I add smaller files to this folder (3), they are replicated as expected
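
For scale, here is a back-of-the-envelope estimate of how long one image should take on this link. It is only a sketch: it assumes a single image of 1.8 * 10^9 bytes, that the whole ~1 Mbit/s is usable during the "full" window, and no protocol overhead or compression. The function and constant names are made up for illustration.

```python
# Rough transfer-time estimate for one image file over the VPN.
# Assumptions (not measured): one file of 1.8e9 bytes, the full ~1 Mbit/s
# is usable at night, no protocol overhead, no compression savings.

FILE_BYTES = 1.8e9
NIGHT_BPS = 1_000_000  # "full" window, 18:00-06:00
DAY_BPS = 128_000      # throttled window, 06:00-18:00


def hours_to_send(size_bytes, bits_per_second):
    """Hours needed to push size_bytes through a link of the given rate."""
    return size_bytes * 8 / bits_per_second / 3600


def bytes_per_day():
    """Upper bound on bytes per 24 h: 12 h at each of the two rates."""
    return (NIGHT_BPS + DAY_BPS) * 12 * 3600 / 8


print(hours_to_send(FILE_BYTES, NIGHT_BPS))  # 4.0
print(hours_to_send(FILE_BYTES, DAY_BPS))    # 31.25
print(round(bytes_per_day() / 1e9, 2))       # 6.09 (GB per day)
```

Under these assumptions one image needs about 4 hours at the night rate, so a week of 12-hour night windows should be enough on paper, and bandwidth alone would not explain the missing files.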

    As described, the bigger files still had not replicated after one week, so I had a look at the event log. There I found the following entries, repeating every 5.5 minutes:

    EventID: 5014
    Error: 1726 (the remote procedure call failed)

    Exactly one second later I get an entry (EventID: 5004) saying that the connection was restored. I get these entries for every replication group and every server, but not all at the same time, because the DFS services were started at different times (e.g. after a reboot). Needless to say, I found the corresponding entries on the replication partner as well.
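
One way to read the 5.5-minute interval, offered as a hypothesis rather than a diagnosis: if each disconnect forced a transfer in progress to start over (an assumption; nothing in this thread establishes how DFSR resumes after the event), only files that fit into one window would ever complete. The sketch below computes how much data fits into 330 seconds at the two configured rates, ignoring overhead and compression. All names are illustrative.

```python
# How much data fits between two disconnects that are 5.5 minutes apart?
# Assumption: a transfer interrupted by the disconnect restarts from zero.

WINDOW_S = 5.5 * 60  # 330 seconds between the logged events


def bytes_in_window(bits_per_second, seconds=WINDOW_S):
    """Bytes that can be sent in one window at the given line rate."""
    return bits_per_second * seconds / 8


def fits_in_window(size_bytes, bits_per_second):
    """True if a file of size_bytes can finish within a single window."""
    return size_bytes <= bytes_in_window(bits_per_second)


for label, rate in (("full (~1 Mbit/s)", 1_000_000), ("128 kbit/s", 128_000)):
    print(label, bytes_in_window(rate) / 1e6, "MB per window")

print(fits_in_window(20e6, 1_000_000))   # True  (a ~20 MB .msi)
print(fits_in_window(1.8e9, 1_000_000))  # False (a 1.8 GB image)
```

At the full rate roughly 41 MB fits into one window, which would be consistent with the ~20 MB .msi files arriving and the 1.8 GB images not, but only if the restart assumption holds.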

    I have already tried or checked the following:

    • installed the latest NIC drivers on each server
    • no file type filter is set in DFSR
    • no NIC teaming, no VLANs
    • increased the size of the staging folder to 8 GB
    • disabled antivirus or excluded the folders from scanning
    • there's no backup running at the times of the entries
    • the WAN connection and the VPN are up at those times
    • reviewed http://support.microsoft.com/kb/948496/en-us and installed the patch & registry settings
    • reviewed http://blogs.technet.com/b/askds/archive/2007/10/05/top-10-common-causes-of-slow-replication-with-dfsr.aspx
    • reviewed http://support.microsoft.com/default.aspx?scid=kb%3bEN-US%3b958802 to get the latest DFS files
    • created a DFS report but couldn't find any errors

    Any ideas what's going on with my DFSR? Why do I still get these messages when smaller files replicate without any problem? Any ideas why the servers lose their connection every 5.5 minutes? If you need more information, just let me know.

    best regards

    Christoph

    Monday, November 1, 2010 6:48 PM

Answers

  • Hi Christoph,

    TCP offloading may be causing the issue.  See

    An update to turn off default SNP features is available for Windows Server 2003-based and Small Business Server 2003-based computers

    http://support.microsoft.com/kb/948496/

    Note: Windows 2008 shipped with all SNP features turned off.

    We can try the steps below:

    1. Apply the hotfix from
    http://support.microsoft.com/default.aspx?scid=kb;EN-US;948496 that disables the SNP features

    2. Add these registry values on all DFSR replication partners that are having RPC failures. You will need to reboot for these changes to take effect.

    Note: Please back up the key before modifying it.

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

    Value = DisableTaskOffload
    Type = DWORD
    Data = 1

    Value = EnableTCPChimney
    Type = DWORD
    Data = 0

    Value = EnableTCPA
    Type = DWORD
    Data = 0

    Value = EnableRSS
    Type = DWORD
    Data = 0

    Then reboot and check whether the issue still exists.
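
As a sketch, the four values above can be rendered as .reg file text so that the same settings are imported on every partner. The script only builds the text and changes nothing on the machine itself. The function name is illustrative, and how the output is saved and imported (file encoding, regedit) is left to the administrator.

```python
# Build .reg file text for the four Tcpip\Parameters values listed above.
# This only generates text; it does not touch the registry.

KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

VALUES = {
    "DisableTaskOffload": 1,
    "EnableTCPChimney": 0,
    "EnableTCPA": 0,
    "EnableRSS": 0,
}


def build_reg_file(key, values):
    """Return .reg text that sets each name to a DWORD under the given key."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{key}]"]
    for name, data in values.items():
        # DWORD data is written as eight hexadecimal digits.
        lines.append(f'"{name}"=dword:{data:08x}')
    return "\n".join(lines) + "\n"


print(build_reg_file(KEY, VALUES))
```

Back up the key before importing anything generated this way, and reboot afterwards, as noted above.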


    Shaon Shan| TechNet Subscriber Support in forum| If you have any feedback on our support, please contact tngfb@microsoft.com
    Wednesday, November 3, 2010 7:34 AM

All replies

  • Hi Shaon,

     

    thank you for your reply so far. The mentioned patch is already installed on all involved servers. I have also set all of those registry values except "DisableTaskOffload". I will try that later today and will report back whether it solves the problem.

     

    best regards

    Thursday, November 4, 2010 8:47 AM
  • Hi,

    Several days have passed, so I would like to know whether the issue still exists. If any new errors occur, please paste them in your reply.


    Shaon Shan| TechNet Subscriber Support in forum| If you have any feedback on our support, please contact tngfb@microsoft.com
    Tuesday, November 9, 2010 6:18 AM
  • I've got the same problem on a 2008 R2 DFSR server. 

    The spoke server has 2x Intel 1 Gbit on-board NICs in an active/standby teamed configuration.

    The hub server (running on VMware using the 10 Gbit vmxnet3 NIC driver) logs DFSR Event 5014 every 2-15 minutes (usually around 5 minutes).

    The Spoke server does not log the event.

    Replication is running very slowly. This is an initial sync/replication. I had to re-configure the replication folder because of a lost DFSR database on the spoke server. 

    Could it be the Teamed NIC on the spoke that is causing this error on the host?

    A note: the last time I did an initial sync on this folder it processed about 30K files an hour. This time it's only doing a few hundred an hour.

    Another note: the hub server has several replication groups and connections to several spoke servers. None of the other connections have this problem, but none of the other spoke servers are physical servers or have a teamed NIC.

    Thank you for your help.

     

    Tuesday, July 5, 2011 9:20 PM