DFSR RPC replication errors 5014 1726 with large files over VPN

  • Question

  • I have an issue very similar to the one Ryan describes in his reply to http://social.technet.microsoft.com/Forums/en-GB/winserverfiles/thread/6d49fd71-2236-4fcb-9763-bf1c03459ee8, cited below. I am starting a separate thread here because the other one mainly concerns Windows Server 2003 (R2), whereas I am (like Ryan) on Windows Server 2008 R2 x64. First of all, here is Ryan's post:

    I've got the same problem on a 2008 R2 DFSR server. 

    The Spoke server has two on-board Intel 1 Gbit NICs in an active/standby teamed configuration.

    The HUB server (running on VMware using the 10Gbit vmnet3 NIC driver) logs DFSR Event 5014 every 2-15mins (usually around 5mins).

    The Spoke server does not log the event.

    Replication is running very slowly. This is an initial sync/replication. I had to re-configure the replication folder because of a lost DFSR database on the spoke server. 

    Could it be the Teamed NIC on the spoke that is causing this error on the host?

    A note: the last time I did an initial sync on this folder it processed about 30K files an hour. This time it's only doing a few hundred an hour.

    Another note: the hub server has several replication groups and connections to several spoke servers. None of the other connections have this problem, but none of the other spoke servers are physical servers or have a teamed NIC.

    Thank you for your help.

    As stated above, my problem is very similar. I'm currently testing DFSR between three fully meshed 2008 R2 x64 servers with SP1 installed. The servers are at three different sites and connected via an asymmetric VPN based on AVM FritzBox with 16 Mbit down and 1 Mbit up. The VPN connections are stable (ping -t on NetBIOS names, DNS records and IPs works like a charm between all hosts, and copying large files, e.g. 10 1GB files via robocopy, works without any problems). DFSR also works fine for smaller files, but in one replication group that will be used for replicating larger compressed media files (e.g. AVIs, VOBs or MPEGs) the test scenario currently produces 5014 events followed by 5004 events at intervals between 5 minutes and 1 hour. The detailed error is a 1726 (RPC failed), and it blows up the DFSR debug log to several MB with errors 1753 and 1726 like this:

    20110803 11:47:31.932 3792 DOWN 7139 BandwidthThrottler::PrepareForShutdown Preparing for Shutdown. rgId:55B28E92-8670-42F4-A616-A2CA6E8CCB01 rgName:XXX connId:2C0227F0-96DB-41A7-ABDD-FC48EFC707F8 ptr:0000000005B66980
    20110803 11:47:31.932 3792 INCO 5777 InConnection::ConnectNetwork New connection connId:{2C0227F0-96DB-41A7-ABDD-FC48EFC707F8} transport:0000000000000000 unghostTransport:0000000000000000
    20110803 11:47:31.932 3792 INCO 5780 InConnection::ConnectNetwork connId:{2C0227F0-96DB-41A7-ABDD-FC48EFC707F8} fatalRemoteError:0
    20110803 11:47:31.932 3792 INCO 6837 [WARN] InConnection::ReConnectAsync Failed to connect, (attempts: 142) connId:{2C0227F0-96DB-41A7-ABDD-FC48EFC707F8} Error:
      + [Error:9027(0x2343) InConnection::ConnectNetwork inconnection.cpp:5783 3792 C A failure was reported by the remote partner.]
      + [Error:9027(0x2343) DownstreamTransport::EstablishConnection downstreamtransport.cpp:4123 3792 C A failure was reported by the remote partner.]
      + [Error:9027(0x2343) DownstreamTransport::EstablishConnection downstreamtransport.cpp:4045 3792 C A failure was reported by the remote partner.]
      + [Error:1753(0x6d9) DownstreamTransport::EstablishConnection downstreamtransport.cpp:4045 3792 W There are no more endpoints available from the endpoint mapper.]
    20110803 11:47:32.946 3264 DOWN 4186 [ERROR] DownstreamTransport::EstablishSession Failed on connId:{4583F6F7-F287-40B2-8F41-0698331E79AB} csId:{35A28383-B1FD-467E-B5B0-950A3EE53EF2} rgName:XXX Error:
      + [Error:9027(0x2343) DownstreamTransport::EstablishSession downstreamtransport.cpp:4179 3264 C A failure was reported by the remote partner.]
      + [Error:1722(0x6ba) DownstreamTransport::EstablishSession downstreamtransport.cpp:4179 3264 W The RPC server is unavailable.]
    20110803 11:47:32.946 3264 INCO 2862 InConnection::ProcessErrorStatus Reconnecting on remote error. connId:{4583F6F7-F287-40B2-8F41-0698331E79AB} state:CONNECTED Error:
      + [Error:9027(0x2343) DownstreamTransport::EstablishSession downstreamtransport.cpp:4200 3264 C A failure was reported by the remote partner.]
      + [Error:9027(0x2343) DownstreamTransport::EstablishSession downstreamtransport.cpp:4179 3264 C A failure was reported by the remote partner.]
      + [Error:1722(0x6ba) DownstreamTransport::EstablishSession downstreamtransport.cpp:4179 3264 W The RPC server is unavailable.]
    20110803 11:47:32.946 3264 INCO 4034 InConnection::ReConnect connId:{4583F6F7-F287-40B2-8F41-0698331E79AB} state:CONNECTED
    20110803 11:47:32.946 3264 INCO 4053 [WARN] InConnection::ReConnect Resetting connection. connId:{4583F6F7-F287-40B2-8F41-0698331E79AB} XXX fatal Error:[Error:1722(0x6ba) DownstreamTransport::EstablishSession downstreamtransport.cpp:4179 3264 W The RPC server is unavailable.]
    20110803 11:47:32.946 3264 INCO 7715 InConnection::ScheduleReconnect connId:{4583F6F7-F287-40B2-8F41-0698331E79AB} state:RECONNECTING
    20110803 11:47:32.946 2152 INCO 5972 InConnection::DisConnect state:RECONNECTING connId:{4583F6F7-F287-40B2-8F41-0698331E79AB}
    20110803 11:47:32.946 2152 INCO 5985 InConnection::DisConnect transport:0000000008AA52D0 unghostTransport:0000000000000000
    20110803 11:47:32.946 2152 DOWN 5558 DownstreamTransport::PrepareForShutdown ptr:0000000008AA52D0
    20110803 11:47:32.946 2152 DOWN 7139 BandwidthThrottler::PrepareForShutdown Preparing for Shutdown. rgId:4041DE95-AEA8-45BE-B539-C7DA194749DC rgName:XXX connId:4583F6F7-F287-40B2-8F41-0698331E79AB ptr:0000000008AA53E0
    20110803 11:47:32.946 2528 DOWN 2833 AsyncRpcHandler::ProcessReceive Completion. connId:{61BCED33-EDA0-45EC-AC29-6A37CE43D5E6} csId:{00000000-0000-0000-0000-000000000000} reqType:AsyncPollRequest reqState:Completed status:1726 ptr:0000000000191290
    20110803 11:47:32.946 2528 DOWN 2858 [ERROR] AsyncRpcHandler::ProcessReceive Failed on connId:{61BCED33-EDA0-45EC-AC29-6A37CE43D5E6} csId:{00000000-0000-0000-0000-000000000000} reqType:AsyncPollRequest reqState:Completed status:1726 Error:
      + [Error:9027(0x2343) AsyncRpcHandler::ReceiveAsyncPoll downstreamtransport.cpp:2199 2528 C A failure was reported by the remote partner.]
      + [Error:9027(0x2343) AsyncRpcHandler::ReceiveAsyncPoll downstreamtransport.cpp:2141 2528 C A failure was reported by the remote partner.]
      + [Error:1726(0x6be) AsyncRpcHandler::ReceiveAsyncPoll downstreamtransport.cpp:2141 2528 W The remote procedure call failed.]
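When a debug log grows to several MB, a tally of the error codes can show which failure dominates. The following is only a minimal sketch in Python, assuming the error frames keep the `[Error:<decimal>(0x<hex>)` shape seen in the excerpt above. The function name `count_error_codes` and the abbreviated sample frames (message text dropped) are illustrative and not part of any DFSR tooling.

```python
import re
from collections import Counter

# Matches the decimal error code inside a frame such as "[Error:1726(0x6be) ...".
ERROR_FRAME = re.compile(r"\[Error:(\d+)\(0x[0-9a-fA-F]+\)")


def count_error_codes(log_text: str) -> Counter:
    """Count how often each numeric error code appears in the log text."""
    return Counter(int(code) for code in ERROR_FRAME.findall(log_text))


# Abbreviated frames modeled on the excerpt above (hypothetical sample input).
sample = (
    "+ [Error:9027(0x2343) AsyncRpcHandler::ReceiveAsyncPoll downstreamtransport.cpp:2199 2528 C]\n"
    "+ [Error:9027(0x2343) AsyncRpcHandler::ReceiveAsyncPoll downstreamtransport.cpp:2141 2528 C]\n"
    "+ [Error:1726(0x6be) AsyncRpcHandler::ReceiveAsyncPoll downstreamtransport.cpp:2141 2528 W]\n"
)

for code, hits in count_error_codes(sample).most_common():
    print(code, hits)
```

Because 9027 wraps almost every frame, the less frequent codes underneath it (1722, 1726, 1753) are usually the more informative ones.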

    As far as I can see, the problem is that DFSR tries to throttle down the replication speed, but then RPC fails, e.g. due to too many open connections or something like that. My problem now is that I have already tried nearly everything I came across here in the forum or by searching Google, but I simply can't get rid of the continuous errors that prevent the replication of larger files.

    Here is a list of things I have already tried, with some more details:

    1. Reinitialize the whole replication group in order to make sure that the database is not corrupted (removing all members and readding them to force initial replication from primary member following http://support.microsoft.com/kb/961879)
    2. Switched off RDC in order to make sure that there is no compression problem with already compressed AVIs, MPEGs or VOBs
    3. Pre-seeding the replication group with robocopy following http://blogs.technet.com/b/askds/archive/2010/09/07/replacing-dfsr-member-hardware-or-os-part-2-pre-seeding.aspx. After initial replication had finished successfully, the problem still occurred on any new large file
    4. Checked for a Black Hole Router following http://support.microsoft.com/kb/159211
    5. Changed the MTU setting of the ethernet connection based on a "ping -f vpn-partner -l XXX" test to 1386 (max value for unfragmented packets) following http://support.microsoft.com/kb/900926 
    6. Tried suggestions concerning DisableTaskOffload, EnableTCPChimney, EnableTCPA and EnableRSS from http://qa.social.technet.microsoft.com/Forums/en/winserverfiles/thread/d27bd902-034e-4230-9516-0ede42308193, but most of them are not applicable to Win 2008 R2. Also tried corresponding netsh commands like netsh int ip set global taskoffload=disabled, netsh int tcp set global RSS=disabled, netsh interface tcp set global autotuninglevel=disabled and netsh int tcp set global chimney=disabled without any success.
    7. Also followed http://qa.social.technet.microsoft.com/Forums/en/winserverfiles/thread/3778427a-a594-4f1d-9c97-d8d1e6a56a83, but currently there is no additional security layer besides Windows Firewall between the hosts. 
    8. Checked the staging folder size (currently 75GB) following http://social.technet.microsoft.com/Forums/en-US/winserverfiles/thread/d540be96-99cd-43d7-b817-693d6e7e05ea/ as well as quotas (currently deactivated for the whole drive).
    9. Also installed the newest hotfixes for DFS from http://support.microsoft.com/kb/968429/, as I also had the "An unexpected network error occurred." problem when accessing larger files from Windows 7 clients. And of course ... still no success with the 5014, 1726 problem.
    So, I'm looking forward to any suggestions for solving this tricky problem.
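The MTU test in item 5 rests on fixed header arithmetic: the size given to `ping -l` is the ICMP payload, and an IPv4 packet adds a 20-byte IP header and an 8-byte ICMP header on top of it. A small sketch of that relationship in Python, assuming IPv4 without IP options. The post does not say whether 1386 was the ping payload or the resulting MTU, so both readings are shown, and the function names are illustrative.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options (assumption)
ICMP_HEADER_BYTES = 8   # ICMP echo header


def mtu_from_ping_payload(payload_bytes: int) -> int:
    """Path MTU implied by the largest unfragmented 'ping -f -l' payload."""
    return payload_bytes + IPV4_HEADER_BYTES + ICMP_HEADER_BYTES


def ping_payload_for_mtu(mtu_bytes: int) -> int:
    """Largest 'ping -f -l' payload that fits into a given MTU."""
    return mtu_bytes - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


# Reading 1: 1386 is the MTU, so the largest test payload is 1358 bytes.
print(ping_payload_for_mtu(1386))
# Reading 2: 1386 was the largest ping payload, so the MTU is 1414 bytes.
print(mtu_from_ping_payload(1386))
```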

    • Edited by Florian Ott Monday, August 15, 2011 7:42 AM Only fixed broken links
    Tuesday, August 9, 2011 3:56 PM

All replies

  • Hi,
     
    Thank you for your question.

    I am trying to involve someone familiar with this topic to further look at this issue. There might be some time delay. Appreciate your patience.
     
    Thank you for your understanding and support.


    TechNet Subscriber Support in forum |If you have any feedback on our support, please contact tnmff@microsoft.com.
    Wednesday, August 10, 2011 9:20 AM
    Moderator
  • Yes, teaming is quite possibly one of the most likely culprits, based on our previous experience. Have you tried temporarily disabling teaming as a test?

    In addition, we cannot exclude an issue with the physical network, such as the router between the hub and this specific spoke site. I understand that ping seems to be stable, but the ping command is not sufficient to test connectivity, especially for large packets.

     


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Peterson Wu Microsoft Online Community Support
    Wednesday, August 10, 2011 9:45 AM
  • Thanks to both of you for your quick answers. Meanwhile I was able to test a bit more, and it turned out that replication of large files does work as long as not too many files are being replicated at the same time. I simply put a certain number of new files of different sizes (300 MB, 500 MB and 1 GB) into the replicated folder on one of the sites and checked their replication after a while. Here is what worked and what didn't:

    Worked:

    • 1 x 300MB file
    • 1 x 500MB file
    • 1 x 1GB file
    • 3 x 1GB files on one site
    • 3 x 1GB files on two different sites (= 6 1GB files)
    • 6 x 1GB files + 2 x 500MB files

    Did not work:

    • 9 x 1GB files
    • 10 x 300MB files

    As 10 x 300MB (which did not work) is roughly the same amount of data as 3 x 1GB (which did work), the last test in particular seems very strange to me. As far as I can see, the problem has something to do with the maximum number of simultaneous outgoing connections. In my case with the 10 files, the DFSR service opened between two and three connections per file, as seen in "dfsrdiag replicationstate". It seems that each of these connections stays up for a while but is then shut down for some reason, resulting in a restart of the transfer and the 5014 event (error 1726) on the receiving member. In any case the transfer obviously does not make any progress, because "dfsrdiag backlog" for a sending/receiving member pair still shows the same number of files to replicate (in this case 10) after two days, although the network connection was transferring continuously at full bandwidth during that time. So all I got on the receiving member instead of the files were 5014 events (as described in detail above) every ten minutes. In the end I had about 20 GB of traffic over the VPN connection, but the 10 files, which together amount to only 3 GB, did not get through.
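The numbers in this test can be checked with simple arithmetic. The sketch below is a rough estimate in Python, assuming that the 1 Mbit uplink mentioned in the first post is the bottleneck, that sizes are decimal (1 GB = 1000 MB), and that protocol overhead is ignored. The function name `transfer_hours` is illustrative.

```python
def transfer_hours(size_gb: float, link_mbit_per_s: float) -> float:
    """Ideal transfer time in hours for size_gb over a link of the given speed."""
    megabits = size_gb * 1000 * 8  # GB -> MB -> Mbit (decimal units)
    seconds = megabits / link_mbit_per_s
    return seconds / 3600


# The 10 x 300MB payload (3 GB) over a 1 Mbit/s uplink:
print(round(transfer_hours(3, 1), 1))
# The roughly 20 GB of traffic actually observed:
print(round(transfer_hours(20, 1), 1))
```

Under these assumptions the 3 GB of files would need under seven hours, while 20 GB corresponds to about 44 hours of a saturated uplink. That fits the observation of roughly two days at full bandwidth and suggests the same data was sent several times over.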

    In addition, we cannot exclude an issue with the physical network, such as the router between the hub and this specific spoke site. I understand that ping seems to be stable, but the ping command is not sufficient to test connectivity, especially for large packets.

    As I can do a robocopy from network share to network share of the 10 files described above without any problems I think the physical connection can't be the problem.

    Yes, teaming is quite possibly one of the most likely culprits, based on our previous experience. Have you tried temporarily disabling teaming as a test?

    In my case I don't have teaming enabled on the network adapters. That was only Ryan, whose very similar problem I cited above. What I do have is one replication member that is a Hyper-V guest with a VLAN enabled on its NIC. But the problem also exists if I exclude that member and replicate only between the non-virtualized machines.

    So, my questions to you:

    Any other ideas based on these additional insights?

    Is there perhaps a way to manually restrict the maximum number of simultaneous connections of the DFSR service? I think that might already be enough, but I couldn't find any documentation about such a feature.

    Sunday, August 14, 2011 3:12 PM
  • Windows 2008 and later can support 16 concurrent file downloads. You can get more information from:

    http://blogs.technet.com/b/filecab/archive/2007/12/26/what-s-new-in-windows-server-2008.aspx


    We do have some registry tuning and recommendations:

     

    All registry values are REG_DWORD (and in the explanations below are always given in decimal). All registry tuning for DFSR in Win2008 and Win2008 R2 is made here:

                      HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DFSR\Parameters\Settings

    A restart of the DFSR service is required for the settings to take effect, but a reboot is not required. The list below is not complete, but instead covers the important values for performance.

    The registry values below are worth tuning. The "tested high performance value" is what testing has shown can increase performance.

     

    AsyncIoMaxBufferSizeBytes

    Default value: 2097152

    Possible values: 1048576, 2097152, 4194304, 8388608

    Tested high performance value: 8388608  

    Set on: All DFSR nodes

     

    RpcFileBufferSize

    Default value: 262144

    Possible values: 262144, 524288

    Tested high performance value: 524288

    Set on: All DFSR nodes

     

    StagingThreadCount

    Default value: 6

    (Win2008 R2 only; cannot be changed on Win2008)

    Possible values: 4-16

    Tested high performance value: 8

    Set on: All DFSR nodes. Setting to 16 may generate too much disk IO to be useful.

     

    TotalCreditsMaxCount

    Default value: 1024

    Possible values: 256-4096

    Tested high performance value: 4096

    Set on: All DFSR nodes that are generally inbound replicating (so hubs if doing data collection, branches if doing data distribution, all servers if using no specific replication flow)

     

    UpdateWorkerThreadCount

    Default value: 16

    Possible values (Win2008): 4-32

    Possible values (Win2008 R2): 4-64

    Tested high performance value: 32

    Set on: All DFSR nodes that are generally inbound replicating (so hubs if doing data collection, branches if doing data distribution, all servers if using no specific replication flow). Raising this number is only valuable when replicating in from more servers than the value. I.e. if replicating in from 32 servers, set to 32. If replicating in from 45 servers, set to 45. If replicating in from 64 servers, set to 64. There is no advantage to the value being higher than the actual number of servers, and obviously with more than 64 servers the most optimization allowed is 64.
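The sizing rule for UpdateWorkerThreadCount can be written down as a small clamp. This is only a sketch in Python of one reading of the rule: the caps (32 on Win2008, 64 on Win2008 R2) and the default of 16 come from the list above, while keeping the default when there are fewer than 16 inbound partners is an assumption, since the text only discusses raising the value. The names used here are illustrative.

```python
DEFAULT_THREADS = 16                       # default value from the list above
MAX_THREADS = {"2008": 32, "2008R2": 64}   # upper bounds from the list above


def suggested_update_worker_threads(inbound_partners: int,
                                    os_version: str = "2008R2") -> int:
    """Match the thread count to the number of inbound partners, within bounds.

    Assumption: never go below the default, because the source only
    describes raising the value.
    """
    cap = MAX_THREADS[os_version]
    return max(DEFAULT_THREADS, min(inbound_partners, cap))


for partners in (3, 32, 45, 64, 100):
    print(partners, suggested_update_worker_threads(partners))
```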

     

    When using the above registry tuning on Windows Server 2008 R2, testing revealed that initial sync replication time was approximately twice as fast as with no registry settings in place when using 32 servers replicating a "data collection" topology to a single hub over thirty-two T1 networks with 32 RG's containing unique branch office data.
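The values in this list are given in decimal, whereas a .reg file stores REG_DWORD data as eight hexadecimal digits. The following Python sketch converts the "tested high performance" values from the list above into that notation. The helper name `reg_dword` is illustrative, and the output is meant for checking values by eye, not as a ready-made import file.

```python
def reg_dword(value: int) -> str:
    """Format a decimal value the way .reg files write REG_DWORD data."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("REG_DWORD must fit into 32 bits")
    return f"dword:{value:08x}"


# "Tested high performance" values taken from the list above.
tuned = {
    "AsyncIoMaxBufferSizeBytes": 8388608,
    "RpcFileBufferSize": 524288,
    "StagingThreadCount": 8,
    "TotalCreditsMaxCount": 4096,
    "UpdateWorkerThreadCount": 32,
}

for name, value in tuned.items():
    print(f'"{name}"={reg_dword(value)}')
```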


    Regards, Peterson Wu Microsoft Online Community Support
    Monday, August 15, 2011 1:31 AM
  • I forgot to mention above: I already had followed the "Tuning replication performance Howto" from http://blogs.technet.com/b/askds/archive/2010/03/31/tuning-replication-performance-in-dfsr-especially-on-win2008-r2.aspx, so all of the values described above had already been set to the "Tested high performance value". In detail:

    Windows Registry Editor Version 5.00
    
    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\DFSR\Parameters\Settings]
    "AsyncIoMaxBufferSizeBytes"=dword:00800000
    "RpcFileBufferSize"=dword:00080000
    "StagingThreadCount"=dword:00000008
    "TotalCreditsMaxCount"=dword:00001000
    "UpdateWorkerThreadCount"=dword:00000020

    I also played around a bit with UpdateWorkerThreadCount, as this (at least to me) seemed to be the value to change in order to influence the number of concurrent (inbound / outbound) connections, but it didn't. Regardless of what I set the thread count to, I always saw the same number of connections between two servers in "dfsrdiag replicationstate". So, do you have any specific suggestions as to which registry values to change in order to throttle down the number of concurrent connections? As the machines are only in a laboratory setting and not very powerful, perhaps some of the other tunings could also be responsible for the problem, e.g. because one machine has too many incoming TCP / RPC connections?

    In this context maybe also relevant: two of the machines also serve as DNS and AD controller for their sites. Maybe that could be the problem?

    Monday, August 15, 2011 7:57 AM
  • Hi ,

     

    I found this question helpful, but I am also facing a similar problem. Did anyone find the solution? Does a long path name affect DFSR replication?

     

    Any ideas and advice are appreciated.

     

    Thanks

     

    Sunday, November 13, 2011 12:57 PM
  • in my case the situation finally stabilized - after around 2 months [SIC!]. i calculated the bandwidth vs. the amount of data, and replication should have finished after two weeks [including some buffer], yet it took that long.

    for the next server i started with prestaging: http://blogs.technet.com/b/askds/archive/2008/02/12/get-out-and-push-getting-the-most-out-of-dfsr-pre-staging.aspx

    that was the quicker method.

     


    -o((: nExoR :))o-
    Monday, November 14, 2011 9:15 AM