DFS-R Backlog Appears Stuck

  • Question

  • I inherited 2 DFS-R hosts connected via a WAN.  We'll call them Detroit (Michigan) and California.  We have several replication groups, one of which is massive: about 1.4 million files and 1.6 terabytes after deduplication.  About 2 weeks ago, all of a sudden we had a 1.4 million file backlog.  For whatever reason, that backlog jumped back up to 1.4 million at least once since then.  Fast forward to earlier this week and the backlog of files being sent from California to Detroit is ZERO.  However, the backlog of files being sent from Detroit to California is seemingly stuck at right around 80,000 and growing (as users continue to make changes).  For the life of me, I can't figure out what the holdup is.  The DFSR logs are pretty much Greek to me.  I've seen suggestions to disable membership of the backlogged node, wait for the changes to replicate in AD and get picked up by the member servers, then re-enable it to kick off an initial sync.  The problem is, I'm under the impression that if I do that I'll lose data that hasn't replicated from Detroit to California yet, since California would become the "master".  Sure, I can run a preemptive backup, but there's still a chance that someone will change something while that 18-hour backup runs.

    Does anyone have ideas about what I can do?  I'd love to narrow this down and figure out what exactly the holdup is.  I'm at my wit's end with this thing.

    Long term, the plan is to replace this current DFS solution with something else or to at least prune it down and/or break it into smaller parts.  However, I'm stuck with what I've got for the moment.

    Any help would be GREATLY appreciated.

    Thursday, January 12, 2017 6:00 AM

All replies

  • Hi,

    First, please check whether the staging folder is large enough. A recommended size is 1.5x the size of the largest file in the replication group.

    Here is a tuning guide for DFSR. Please see if it helps:

    https://blogs.technet.microsoft.com/askds/2010/03/31/tuning-replication-performance-in-dfsr-especially-on-win2008-r2/

    This article also covers many general steps for troubleshooting DFS replication issues:

    Top 10 Common Causes of Slow Replication with DFSR
    https://blogs.technet.microsoft.com/askds/2007/10/05/top-10-common-causes-of-slow-replication-with-dfsr/

    In addition, for large file counts in DFSR, I suggest checking this earlier thread:

    https://social.technet.microsoft.com/Forums/windowsserver/en-US/eb8fe2d1-6383-4521-bd05-5fec74046dae/large-files-count-in-dfs?forum=winserverfiles

     Best Regards,

    Mary


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    Thursday, January 12, 2017 7:24 AM
  • My current staging quota is right around 400GB.

    The result of: Get-ChildItem x:\documents -recurse | Sort-Object length -descending | select-object -first 32 | measure-object -property length -sum

    Is as follows:

    Count    : 32
    Average  :
    Sum      : 178769868106
    Maximum  :
    Minimum  :
    Property : Length

    If my math is right, the sum is right around 179 GB; let's round that up to 180 GB.  Even if I double that, I'm at 360 GB, which is below the 400 GB quota.

    So, I highly doubt it's my staging quota that's causing the issue.
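
    A minimal sketch of that arithmetic, for anyone who wants to repeat it with their own numbers. The byte total and the 400 GB quota come from this post; the function name Test-StagingHeadroom is hypothetical, and treating "400GB" as binary gigabytes is an assumption about how the quota was entered.

```powershell
# Hypothetical helper: compares the combined size of the largest files
# (here, the 32 largest, as measured above) against the staging quota.
function Test-StagingHeadroom {
    param(
        [Parameter(Mandatory)][long]$LargestFilesBytes,  # combined size of the largest files
        [Parameter(Mandatory)][long]$QuotaBytes          # configured staging quota
    )
    [pscustomobject]@{
        LargestFilesGB  = [math]::Round($LargestFilesBytes / 1e9, 1)  # decimal gigabytes
        LargestFilesGiB = [math]::Round($LargestFilesBytes / 1GB, 1)  # PowerShell's 1GB is 2^30 bytes
        QuotaGiB        = [math]::Round($QuotaBytes / 1GB, 1)
        FitsOnce        = $LargestFilesBytes -le $QuotaBytes
        FitsDoubled     = (2 * $LargestFilesBytes) -le $QuotaBytes
    }
}

# Figures from this post: Sum = 178769868106 bytes, quota assumed to be 400GB (binary).
$check = Test-StagingHeadroom -LargestFilesBytes 178769868106 -QuotaBytes 400GB
$check | Format-List
```

    With these inputs both FitsOnce and FitsDoubled come out true, which matches the conclusion above that the quota is unlikely to be the bottleneck.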

    I should also note that these servers are running Windows Server 2012R2.

    Thursday, January 12, 2017 6:46 PM
  • Hi,

    May I ask whether the files are all in the same replication group?

    If so, could you find one of the files and move it out of the replication group to see the result? You can put it back once replication returns to normal.

    Also, if replication of other folders works fine, you can temporarily move the affected folder out of the replication group (if there is only 1 replication group) or create a new replication group for that folder to see the result.

    Best Regards,

    Mary



    Friday, January 13, 2017 8:21 AM
  • I have about 5 replication groups, each consisting of 1 "root" folder containing many subfolders.  The issue I'm referring to is with the main replication group/folder.  It's not just 1 subfolder that's causing the issue, either, or at least I highly doubt it given the backlog size.  We're talking about 80,000 files, which is a lot of files to try to move.  I checked the backlog and removed a few specific files from the disk, but they still show in the backlog.

    Other replication groups/folders seem to be fine.  There is 1 other one that is having issues, but it's much smaller so I can rebuild it if necessary.

    Friday, January 13, 2017 2:51 PM
  • Hi,

    >There is 1 other one that is having issues, but it's much smaller so I can rebuild it if necessary.

    Could you please try rebuilding it to see whether that works?

    Best Regards,

    Mary



    Monday, January 16, 2017 8:26 AM
  • I've done the following on the one that's causing me the most issue.

    1) Disabled the namespace referral to the server that isn't receiving updates, figuring the other one is the most "up to date".  This forces everyone to edit/access files from Detroit for now.

    2) Removed the existing replication group that was stuck

    3) Allowed time for changes to propagate and the DFS Management console to show properly on both servers

    4) Created a new replication group under a new name (MAIN) and attached the existing data folders to the group (X:\documents in Detroit, R:\documents in Cali)

    5) Added some folders ($Recycle.Bin, TemporaryItems, some other MacOS specific folders and files) as filters so they don't try to replicate.

    6) Ran "wmic /namespace:\\root\microsoftdfs path dfsrreplicatedfolderinfo get replicationgroupname,replicatedfoldername,state"

    The result of step 6 for the group in question is a state of 5, meaning error.

    I thought, oops, I must have bungled something.  So I went through the process again.  Still ended in state of 5.  HOWEVER, I can't find anything in Event Viewer that shows an error.  So, I'm completely stumped.
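
    For reference, here is a small sketch that turns the numeric State column into a readable name. The code-to-name table is my understanding of the DfsrReplicatedFolderInfo WMI class (the 5 = error reading above agrees with it); the function name Convert-DfsrState is hypothetical.

```powershell
# Hypothetical helper: maps a DfsrReplicatedFolderInfo State code to a label.
function Convert-DfsrState {
    param([Parameter(Mandatory)][int]$State)

    # State values as I understand them from the WMI class documentation.
    $names = @{
        0 = 'Uninitialized'
        1 = 'Initialized'
        2 = 'Initial Sync'
        3 = 'Auto Recovery'
        4 = 'Normal'
        5 = 'In Error'
    }
    if ($names.ContainsKey($State)) { $names[$State] } else { "Unknown ($State)" }
}

Convert-DfsrState -State 5

# On a DFSR member, the same information as the wmic command in step 6:
# Get-WmiObject -Namespace 'root\microsoftdfs' -Class DfsrReplicatedFolderInfo |
#     Select-Object ReplicationGroupName, ReplicatedFolderName, State,
#         @{ Name = 'StateName'; Expression = { Convert-DfsrState -State $_.State } }
```

    A folder should normally settle in state 4 once initial replication completes.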

    Tuesday, January 17, 2017 7:27 PM
  • Scratch that last message.  I guess I was being impatient.  It's now in a state of 2 and an initial replication has begun.  Hopefully this works!
    Tuesday, January 17, 2017 7:30 PM
  • Hi,

    If you have more updates, please feel free to post them here.

    Best Regards,

    Mary



    Wednesday, January 18, 2017 1:37 AM
  • Well 3 days later I thought we were in good condition, but still stuck.  I suppose maybe it's 1 large file just going slow?  I don't know any way to tell that for sure.  The Detroit server backlog is 0.  The Cali server is at about 23000 and hasn't budged in hours.  I'm at a complete loss for words at this point.  The list of files in the backlog (the first 100 that it shows you) hasn't changed all day.
    Friday, January 20, 2017 10:02 PM
  • Can you post the Event Log errors from both replication member servers please?

    Miguel Fra
    Falcon IT Services
    https://www.falconitservices.com

     



    • Edited by Miguel Fra Saturday, January 21, 2017 4:44 AM
    Saturday, January 21, 2017 4:43 AM
  • All I'm getting for errors in the DFS Replication event log are a bunch of 5002, 5012 and 4004 errors.  However, the last error was on 1/19 at 11:30PM (ET).  Replication was still working until mid afternoon on the 20th, so communications may have broken down at some point, but it certainly reestablished itself.

    Oddly, the backlog dropped by 1 file since I last posted.  This is absolutely bizarre.  It'd be nice if the debug logs were actually readable.

    Saturday, January 21, 2017 3:45 PM
  • 5002, 5012 are caused by connectivity problems.

    Check that Windows Firewall is disabled on both replication partners
    Check that the correct IP of each computer is resolvable both in DNS (FQDN) and NetBIOS
    Check that the servers have static IP address assignments
    Check that the static IP's are not in DHCP pool or at least reserved
    Check that VPN connection is steady and no packet loss over a period of time
    Check that RPC server is running
    Check the permissions on the DFS target folders
    Make sure the servers use the domain internal DNS settings

    4004 is a DNS error

    Check that there are not duplicate entries for the same host in DNS
    Check for static DNS entries being correct
    Make sure the DNS service is running

    Event 4004 is the likely root cause: DFS relies on DNS for connections, so a DNS failure would also produce the first two errors.

    http://www.eventid.net/display.asp?eventid=4004&eventno=334&source=dns&phase=1


    Miguel Fra
    Falcon IT Services
    https://www.falconitservices.com

     


    • Edited by Miguel Fra Saturday, January 21, 2017 4:15 PM
    Saturday, January 21, 2017 4:07 PM
  • I'm fully aware of what those errors could be caused by.  As I already stated, those errors stopped a day before the replication stopped and are most likely because the router had to be rebooted - which would naturally cause a connection issue.  They are static IP'd.  They can both resolve each other.  The VPN is no better or worse than it was 2 days ago when replication was seemingly working.  RPC is running.

    These are the same exact problems I was having before rebuilding the replication group, which is WHY I rebuilt the replication group.  The only difference is that now the backlog is in the opposite direction and smaller.

    Saturday, January 21, 2017 4:14 PM
  • You might have some large files that are causing congestion, so when it's the turn of a particularly large file to be replicated, it takes a long time and the process may appear to be stuck.

    This is my only other guess, since you say these errors have stopped and there are no further errors, such as dirty shutdowns, journal wraps, or others that would cause the DFSR service to stop working.


    Miguel Fra
    Falcon IT Services
    https://www.falconitservices.com

     


    • Edited by Miguel Fra Saturday, January 21, 2017 4:34 PM
    Saturday, January 21, 2017 4:32 PM
  • I THINK I've found a way to get a complete list of files in the backlog.  I'm going to try using that to query file sizes in hope that this is the case.  I wish event viewer had more pertinent information.  There's probably something in the debug logs, but those would be easier to read if they were in binary.
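
    A rough sketch of the size check I have in mind, assuming the backlog paths are already in an array of strings. The function name Get-BacklogFileSize is made up for illustration; it only uses standard cmdlets, and it has to run on the server where those local paths exist.

```powershell
# Hypothetical helper: given backlog file paths, list the files largest first.
function Get-BacklogFileSize {
    param([Parameter(Mandatory)][string[]]$Path)

    $Path |
        Sort-Object -Unique |                                         # the backlog can repeat a path
        Where-Object { Test-Path -LiteralPath $_ -PathType Leaf } |   # skip folders and files that are gone
        ForEach-Object { Get-Item -LiteralPath $_ -Force } |
        Sort-Object -Property Length -Descending |
        Select-Object -Property FullName,
            @{ Name = 'SizeMB'; Expression = { [math]::Round($_.Length / 1MB, 2) } }
}

# Example call, with placeholder paths:
# Get-BacklogFileSize -Path 'X:\documents\a.bin', 'X:\documents\b.bin' | Select-Object -First 20
```

    If one or two huge files sit at the top of that list, that would support the "one large file going slow" theory.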
    • Proposed as answer by Mary Dong Friday, February 3, 2017 5:19 AM
    Saturday, January 21, 2017 4:53 PM
  • Hi,

    If you have new finding/updates for this issue please feel free to contact us.

    Best Regards,

    Mary



    Monday, January 23, 2017 6:31 AM
  • Hi SO Admin,

    You can get a list of the files in backlog with the following PowerShell command:

    get-dfsrbacklog -SourceComputerName SDC-ITSVFS01 -DestinationComputerName <your dfs server> | ft FullPathName



    Charlie Coverdale

    Disclaimer: This posting is provided "AS IS" with no warranties or guarantees, and confers no rights.

    Wednesday, August 15, 2018 11:09 AM
  • Once you have the list of files, try cut/pasting one of them out of its replication group, then cut/pasting it back to its original location.

    Check again whether it is still stuck in Backlog.


    Charlie Coverdale


    Saturday, August 25, 2018 12:01 AM
  • I get the same file multiple times with the 'get-dfsrbacklog -SourceComputerName srv1 -DestinationComputerName srv2' command.

    Identifier                  : {A6932761-5743-466F-B377-1F72369FC590}-v2171045
    Flags                       : 4
    Attributes                  : 8224
    GlobalVersionSequenceNumber : {A6932761-5743-466F-B377-1F72369FC590}-v2171082
    UpdateSequenceNumber        : 6493845216
    ParentId                    : {A6932761-5743-466F-B377-1F72369FC590}-v322863
    FileId                      : 562949955553409
    Volume                      : \\.\E:
    Fence                       : 3
    Clock                       : 131823534192995188
    CreateTime                  : 25/09/2018 14:43:37
    UpdateTime                  : 25/09/2018 14:50:19
    FileHash                    : 101c70b07b3e5acf 0a50d25617eae481
    FileName                    : FVA 20180925 notulen - for merge.docx
    FullPathName                : E:\Departments\ToBranchOffices\Managers\Business Plannen 2019\FVA 20180925 notulen - for merge.docx
    Index                       : 3
    ReplicatedFolderId          : d30cb00d-782a-435e-a05d-06642e2132db

    Identifier                  : {A6932761-5743-466F-B377-1F72369FC590}-v2171086
    Flags                       : 5
    Attributes                  : 8224
    GlobalVersionSequenceNumber : {A6932761-5743-466F-B377-1F72369FC590}-v2171086
    UpdateSequenceNumber        : 6493849872
    ParentId                    : {A6932761-5743-466F-B377-1F72369FC590}-v322863
    FileId                      : 844424932264065
    Volume                      : \\.\E:
    Fence                       : 3
    Clock                       : 131823536704571352
    CreateTime                  : 25/09/2018 14:43:37
    UpdateTime                  : 25/09/2018 14:54:30
    FileHash                    : f05bc7056604f569 42c7060cf3ca996a
    FileName                    : FVA 20180925 notulen - for merge.docx
    FullPathName                : E:\Departments\ToBranchOffices\Managers\Business Plannen 2019\FVA 20180925 notulen - for merge.docx
    Index                       : 4
    ReplicatedFolderId          : d30cb00d-782a-435e-a05d-06642e2132db

    The update time differs between the entries, and I assume that's the cause.

    We see this with multiple files, some of them 6, 8, or more times.

    How can we get only the unique files?


    Tuesday, September 25, 2018 1:30 PM
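  • $backlog = Get-DfsrBacklog -SourceComputerName srv1 -DestinationComputerName srv2
    
    # Keep only the path of each backlog entry
    $Result = $backlog | select FullPathName
    
    # Collapse repeated entries for the same path
    $Result = $Result | select -Property FullPathName -Unique
    
    $Result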
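
    If you also want to know how many times each file is queued, grouping is an alternative to -Unique. This is only a sketch: Get-UniqueBacklogFile is a made-up name, and it assumes each backlog record has the FullPathName and UpdateTime properties shown in the output above. The repeated records there differ in Identifier and UpdateTime, which suggests each one is a separate queued update of the same file.

```powershell
# Hypothetical helper: one row per file, with the number of queued entries
# and the most recent UpdateTime seen for that file.
function Get-UniqueBacklogFile {
    param([Parameter(Mandatory)][object[]]$Backlog)

    $Backlog |
        Group-Object -Property FullPathName |
        ForEach-Object {
            [pscustomobject]@{
                FullPathName = $_.Name
                Entries      = $_.Count
                LastUpdate   = ($_.Group |
                    Sort-Object -Property UpdateTime |
                    Select-Object -Last 1).UpdateTime
            }
        } |
        Sort-Object -Property Entries -Descending
}

# Example call on a DFSR member:
# $backlog = Get-DfsrBacklog -SourceComputerName srv1 -DestinationComputerName srv2
# Get-UniqueBacklogFile -Backlog $backlog | Format-Table -AutoSize
```

    Files with a high Entries count are the ones users keep saving while replication is behind.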



    Charlie Coverdale


    Saturday, September 29, 2018 8:35 AM
  • How have you gotten on with this?

    Charlie Coverdale

    Disclaimer: You take sole responsibility for any actions you take based on my posts.

    Friday, May 3, 2019 7:55 AM