Replication slow or not working, with warnings about multiple objects

  • Question

  • Hi all,

    I've spent the last week trying to get to the bottom of this and am not making much progress. Here is what we have:

    2 sites (call them A and B), 4 DC/DFS servers, two in each site, one virtual and one physical. DC01 (physical) and DC02 (virtual) are in site A, DC03 (virtual) and DC04 (physical) are in site B. The sites are connected by a 100Mbps WAN link. There are no sites defined/organized in AD; this is just a different geographical location.

    The original setup was on 2003 servers; a few months ago we upgraded the domain and DFS to 2008 R2. All 4 servers were fresh installs, and the old servers were retired. All servers are patched by WSUS on a regular basis and rebooted in the middle of the night; we haven't had any issues with that. I've read about DFSR patches and hotfixes; if these are not part of WSUS updates then they have not been applied.

    We have 7 namespaces with 19 folders; each folder is in its own replication group. All replication is set to full mesh (except one folder as described below), and each folder has only one active referral target.

    About a week ago, we discovered that permissions on a couple of critical folders were not as they should be and decided to remedy that. On 3 out of 4 servers they were so messed up that I couldn't even gain access (as a full domain admin), and replacing ownership (with the domain admins group) would mean all permissions would also initially be set to that group, before we could set them to what they really need to be. Since this particular folder contains just under 2TB of data (mostly PDF files, 1MB to 4MB each), we decided to replace the permissions after hours. At the time, DC03 was the active referral target for this folder; however (for some reason that escapes me at this time), I decided to apply permissions on DC01 and let them replicate to the other servers. So this was done; it took about 90 minutes to apply the permissions, and since we didn't know how long it would take to replicate to DC03, I switched the referral to DC01, which became the only referral target for that folder. We did a quick test and everything seemed OK. The plan was to wait for the changes to replicate and then switch the referral target back to DC03.
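
    For reference, the ownership/permission replacement was done along these lines; the path and group names below are placeholders rather than our actual values:

        # Take ownership of the tree, then replace the ACLs with the intended ones.
        # D:\Data\Uploads and the groups are placeholders - substitute the real path and ACEs.
        $path = 'D:\Data\Uploads'
        icacls $path /setowner "DOMAIN\Domain Admins" /T /C
        icacls $path /grant "DOMAIN\Domain Admins:(OI)(CI)F" "DOMAIN\Dept Users:(OI)(CI)M" /T /C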

    In the morning we got calls about users not being able to access some files. After investigation, we found that files saved to DC03 the day before had not been replicated to DC01, and they were now inaccessible because they were still on DC03 but DC01 was the only referral target. XCOPY was used to manually copy the files from the day before; however, during the investigation we found a handful of files in some subfolders, going back a couple of months, that had not been replicated. This was the first time we realized replication might not be working at 100%, and we started digging deeper.

    At some point during this weekend I rebooted all 4 DCs one by one, without any positive impact. I have also changed the full mesh replication to a chain: DC01 > DC02 > (WAN link) > DC03 > DC04; the topology tested OK. I haven't noticed any improvement. The staging area for this folder is set to 128GB, following "staging area too small" events in the event log. Prior to this we had plenty of disk activity, which has since gone down to only a few MB/s and is easily handled by the server (4 CPUs, 8GB memory, 4x3TB disks in RAID5). Since I changed the staging area on Friday we've only had one error about the high watermark, on the same day. At this time the logs show occasional sharing violations for different files (a normal use pattern from what I can tell) and plenty of info events about files being changed on multiple servers. DFSRS.exe uses around 650MB of memory with low CPU usage and about 2-3 MB/s of disk traffic.

    Right now we have some folders (not all) that have backlogs to or from DC01, while the other servers are current for the most part, except for the 2TB folder we replaced permissions on. That folder currently has a backlog of 1.440 million files (presumably permission changes) DC02 > DC01, and 1.442 million DC01 > DC02. Interestingly, dfsrdiag backlog still shows a backlog between DC01 and DC03/DC04 even though they shouldn't be replicating directly according to the topology. Those backlog numbers are a bit higher than the numbers above; it's almost as if the backlog didn't go away but rather stands still. I expected any backlog from DC03 > DC01 to become DC03 > DC02 and DC02 > DC01, as per the current topology.
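
    The backlog figures above come from dfsrdiag backlog; the calls look roughly like this (the replication group and folder names are examples, not our actual names):

        # Outbound backlog from DC02 to DC01 for one replicated folder.
        dfsrdiag backlog /rgname:"RG-Uploads" /rfname:"uploads" /smem:DC02 /rmem:DC01

        # And the reverse direction.
        dfsrdiag backlog /rgname:"RG-Uploads" /rfname:"uploads" /smem:DC01 /rmem:DC02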

    While running dfsrdiag backlog commands I found some cases where the command would execute but with a warning:

    [WARNING] Found 2 <DfsrReplicatedFolderConfig> objects with same ReplicationGroupGuid=878ED61A-A737-4C88-8D16-D65CABE68175 and ReplicatedFolderName=uploads; using first object.

    I am not sure if this is related, or if the problem existed before the work we did a week ago.

    I have followed instructions to rename the .XML files to .OLD and observed that new XML files were created after the DFSR service restart. It doesn't seem to have made any difference.

    Please let me know what information I can provide to hopefully resolve this. 

    Thanks very much

    Monday, October 29, 2012 6:11 PM

Answers

  • Update: After leaving everything in place over the weekend, it seems the backlog has started going down across the board, faster for some servers and slower for others, but going down nevertheless. It's almost as if the DFSR service needed a day or two to "get ready" to replicate. Not surprising considering we have about 4 TB of files. It may take a few days (maybe a week) to get everything down to zero, but there is really no way around it apart from re-staging the files with robocopy.

    Thanks all!

    Monday, November 5, 2012 9:36 PM

All replies

  • Hi,

    Please try the following steps:

    1) On the DFSR server that has the errors in the output, run DFSRDiag POLLAD
    2) Stop the DFS Replication service
    3) Go to the drive that holds the replica_ files for the RG, such as F:\System Volume Information\DFSR\Config, and rename the replica_*.xml files to replica_*.old
    4) Go to C:\System Volume Information\DFSR\Config and rename the Volume_*.xml files to Volume_*.old
    5) Start the DFS Replication service (see the PowerShell sketch below)
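
    A rough PowerShell sketch of steps 2) to 5) above; F: is an example volume and you may need to grant yourself access to the System Volume Information folders first:

        # Run elevated on the affected member; the service name is DFSR ("DFS Replication").
        Stop-Service DFSR

        # Rename the Replica_*.xml files on the data volume (example path).
        Get-ChildItem 'F:\System Volume Information\DFSR\Config' -Filter 'Replica_*.xml' |
            Rename-Item -NewName { $_.Name -replace '\.xml$', '.old' }

        # Rename the Volume_*.xml files on the system volume.
        Get-ChildItem 'C:\System Volume Information\DFSR\Config' -Filter 'Volume_*.xml' |
            Rename-Item -NewName { $_.Name -replace '\.xml$', '.old' }

        Start-Service DFSR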

    Check the replica_ drive (e.g. F:\System Volume Information\DFSR\Config) and C:\System Volume Information\DFSR\Config for the new XML files, and check the registry at HKLM\System\CurrentControlSet\Services\DFSR\Access Checks\Replication Groups for the values pertaining to the RG, as well as HKLM\System\CurrentControlSet\Services\DFSR\Parameters\Replication Groups

    Re-run the DFSRDiag commands to verify the fix

    Wednesday, October 31, 2012 9:03 AM
    Moderator
  • Hi,

    Yesterday I executed the procedure you described (also found at http://blogs.technet.com/b/askds/archive/2007/10/05/top-10-common-causes-of-slow-replication-with-dfsr.aspx ) and it resolved the issue with folders being reported twice. Also, last night I applied the latest DFSR hotfix (as per the same article above) to all 4 DFS servers and rebooted each. I think there was a slight drop in backlog, but this morning it is still bad and seems to be dropping very slowly. We really don't have many changes, so this should be dropping faster. I've checked all the DFSR logs on all servers and haven't found anything out of the ordinary, no errors or warnings.

    Right now I'm in the process of building a PS script to get the backlog data for the entire environment so I can track the trend. I find it hard to believe there is no GUI to get the backlog; I've found some scripts and utilities, but nothing that is easy to use or gets everything at once.

    Another thing I noticed is that when I run a script that uses WMI to get backlog info, it takes a very long time to complete, and queries going to the server with a backlog take over a minute. Altogether the script takes 20 minutes or more to collect backlog data from 4 servers and 17 replication groups with one folder each. All servers have sufficient CPU, memory, network and disk resources available to them.
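
    For reference, the per-connection backlog count in the script is obtained roughly like this (the server, replication group, and folder names are placeholders):

        # Outbound backlog from one member to another for a single replicated folder.
        # DC02, DC01, 'RG-Uploads' and 'uploads' are placeholders - substitute the real names.
        $rg = 'RG-Uploads'; $rf = 'uploads'
        $sending = 'DC02'; $receiving = 'DC01'
        $filter = "ReplicationGroupName='$rg' AND ReplicatedFolderName='$rf'"

        # Version vector of the receiving member...
        $recvFolder = Get-WmiObject -ComputerName $receiving -Namespace 'root\MicrosoftDFS' -Class DfsrReplicatedFolderInfo -Filter $filter
        $vv = $recvFolder.GetVersionVector().VersionVector

        # ...compared against the sending member's outbound queue.
        $sendFolder = Get-WmiObject -ComputerName $sending -Namespace 'root\MicrosoftDFS' -Class DfsrReplicatedFolderInfo -Filter $filter
        $backlog = $sendFolder.GetOutboundBacklogFileCount($vv).BacklogFileCount

        "$sending -> $receiving : $backlog file(s) in backlog"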

    Thanks

    Wednesday, October 31, 2012 7:43 PM
  • Update: I've completed the script and ran it last night to collect current backlog data. This morning I ran it again, and all backlog counters for connections incoming to DC01 are up significantly. It seems only this server's incoming connections are not going through and are stuck. Outgoing connections, as well as all connections between the other servers, are at zero backlog.

    Any idea what to look for? The DFSR system log shows event 2104: "The DFS Replication service failed to recover from an internal database error on volume E:. Replication has been stopped for all replicated folders on this volume." This happened last night at 1:55AM, and there is an event 2002 at 2:17AM, "The DFS Replication service successfully initialized replication on volume E:.", so it seems it recovered. Not sure why it blew up to begin with. Other than that it looks pretty clean.
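
    In case it helps anyone checking the same thing, those events can be pulled from the DFS Replication event log roughly like this:

        # List recent database-recovery (2104) and initialization (2002) events.
        Get-WinEvent -LogName 'DFS Replication' -MaxEvents 500 |
            Where-Object { 2104, 2002 -contains $_.Id } |
            Format-Table TimeCreated, Id, LevelDisplayName -AutoSize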

    Any help appreciated.

    Thanks

    Thursday, November 1, 2012 4:41 PM