Answered: DFS replication slows server down

  • Tuesday, 30 June 2009 12:15 Michael316

    We run DFS with replication, and several times a day the server slows down: nothing works, access to files on the partition stops responding completely, and so does the whole network (because every client has a DFS folder mapped as a drive from the server). Using Process Monitor, I found that this happens only when dfsr.exe performs ReadFile on the path "d:\$Directory" (at least 100 times per second), which is not a folder or anything else visible. What is happening there?

    PS: We have 2 folders in DFS, each with 1 TB of files (many PDF files and large JPEG pictures). The problem occurred on a Dell storage server, which was replaced with an IBM storage server because we thought the lags came from the RAID controller. But the problem now affects both servers, which are on a gigabit network and in one replication group. There is also another Dell server, replicated over a 2 Mbit Ethernet connection, which does not have these problems.

    PPS: All servers run Windows 2003 R2 SP2.
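
The rate reported in the question (dfsr.exe issuing ReadFile on $Directory at least 100 times per second) can be checked against a saved trace. The following is a minimal Python sketch, assuming the Process Monitor trace was exported as CSV with the default column headers "Time of Day", "Process Name", "Operation" and "Path". The function names `second_bucket` and `readfile_bursts` are illustrative and not part of any tool.

```python
import csv
import io
import re
from collections import Counter

# Matches the hh:mm:ss part of a "Time of Day" value and an optional AM/PM marker.
SECOND_RE = re.compile(r"^\s*(\d{1,2}:\d{2}:\d{2})\S*\s*(AM|PM)?", re.IGNORECASE)


def second_bucket(time_of_day):
    """Reduce a timestamp such as '12:15:03.1234567 PM' to '12:15:03 PM'."""
    match = SECOND_RE.match(time_of_day)
    if not match:
        return None
    clock, meridiem = match.group(1), match.group(2)
    return f"{clock} {meridiem.upper()}" if meridiem else clock


def readfile_bursts(csv_file, process="dfsr.exe", path_suffix="$Directory", threshold=100):
    """Return {second: count} for seconds with at least `threshold` matching ReadFile events.

    csv_file is an open text file (or any iterable of CSV lines) whose header row
    is assumed to use the default Process Monitor column names.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        if row.get("Process Name", "").lower() != process.lower():
            continue
        if row.get("Operation") != "ReadFile":
            continue
        # Compare case-insensitively and ignore stray whitespace around the path.
        if not row.get("Path", "").strip().lower().endswith(path_suffix.lower()):
            continue
        bucket = second_bucket(row.get("Time of Day", ""))
        if bucket is not None:
            counts[bucket] += 1
    return {second: n for second, n in counts.items() if n >= threshold}
```

To run it on an exported trace, open the file with `open(path, newline="", encoding="utf-8-sig")` and pass the file object to `readfile_bursts`. Lowering `threshold` shows quieter seconds as well, and passing `process="explorer.exe"` applies the same count to another process.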

Answers

  • Tuesday, 21 July 2009 10:58 Michael316
     Answered
    Okay, it seems that the RDC (Remote Differential Compression) option was the culprit: it was enabled on a folder (on the IBM) that received many new files every day, spread across many directories that each hold many files. After I disabled the option, performance improved immediately, with no or very few slowdowns.

All replies

  • Wednesday, 1 July 2009 06:53 Michael316

    Something new: I found that when dfsr.exe tries to perform ReadFile on $Directory, the dfsr.exe process shows no I/O reads, no I/O writes and no other I/O. Nothing happens and none of the counters change, as if it stops and waits for something. This now happens several times every hour.

    Could it be that there are just too many files to replicate? Or are some files locked (there is no event log entry when it stops)?
  • Wednesday, 1 July 2009 10:03 David Shen - MSFT (MSFT, Moderator)

    Hello Michael316,

    As we know, DFS Replication depends on the USN journal to detect changes to DFS data in the NTFS file system on all DFS member servers. You mentioned that "The problem occurred on a Dell storage server, which was replaced with an IBM storage server". This replacement made DFS aware that the data on that replica (the IBM storage server that holds the file share with the DFS data) had changed. Therefore, DFS tries to re-synchronize the data from the upstream replication partner to the new downstream replication partner through DFS Replication (dfsr.exe).

    Please wait a while to see whether the data is replicated to the new server properly. If not, could you please collect a DFS Replication diagnostic report on that server to help troubleshoot the issue?

    Steps:

    Create a diagnostic report for DFS Replication

    http://technet.microsoft.com/en-us/library/cc778105.aspx

    The report is saved in the folder C:\DFSReports on the server where you run the diagnostic report.

    Meanwhile, please also collect the Process Monitor trace logs when the issue occurs. You may send the report and trace log files to tfwst@microsoft.com

    I appreciate your time and effort.

    This posting is provided "AS IS" with no warranties, and confers no rights.
  • Wednesday, 1 July 2009 11:28 Michael316

    The problem already existed on the Dell before it was replaced by the IBM server. That was the reason we replaced the server: we thought it was a problem with the Dell RAID controller.

    The IBM was populated by replication, and the initial replication phase had finished. Then we switched the active DFS path of the shared folders from the Dell to the IBM server.

    After that, the slowdown problem and the ReadFile on "$Directory" occurred on both servers. The Dell is still running as a backup, while the main production server is now the IBM, which replicates to the Dell.

    I will send the reports then.
  • Thursday, 2 July 2009 10:09 Michael316

    Dell support has now suggested that the 11 x 750 GB 7200 rpm SATA hard disks in the RAID (the IBM also has only 7200 rpm SATA disks, dual-port as far as I know) may be too slow for 40 users, and that cache access in particular could be a bottleneck. But these problems were not there from the beginning. Very strange.
  • Friday, 3 July 2009 03:39 Ned Pyle (MSFT)
     

    $Directory is a special NTFS metadata file structure, and DFSR should not specifically care about it. Please run CHKDSK against this D: volume on this server, off hours. If CHKDSK finds errors, run it with its /R option.


    Ned Pyle [MSFT] - MS Enterprise Platforms Support - Beta Team
  • Friday, 3 July 2009 08:56 Michael316

    How long will chkdsk take to check a 6 terabyte RAID 5?

    And how can this error occur on two different servers (which are both in one replication group)?

    Or does DFSR replicate hard disk errors? ;-)
  • Monday, 6 July 2009 03:55 David Shen - MSFT (MSFT, Moderator)

    Hi Michael316,

    I think the time depends on the size of the volume you run chkdsk on. I suggest that you follow Ned's advice and run CHKDSK on that volume to check whether there are any errors.

    Thanks for your cooperation.

    This posting is provided "AS IS" with no warranties, and confers no rights.
  • Monday, 6 July 2009 06:32 Michael316

    The other problem is that we don't know whether the RAID will still be alive after chkdsk (you can read some horror stories on the web).

    But I think we have found the problem. It seems that every time a PDF is generated with Crystal Reports (through a web application), the ominous ReadFile on $Directory occurs. We are not sure whether the cause is Crystal Reports itself, the saving of the file, the sending to the client, or something else.
  • Monday, 6 July 2009 15:24 Michael316

    OK, it seems that replication is still part of the problem.
  • Tuesday, 7 July 2009 10:42 Michael316

    Could it be that there is simply too much data, too many files and too many users for this server configuration?

    But we also get slowdowns when only 3-4 people are connected (over the internet) in the middle of the night.

    Is there perhaps a problem with DFS replication when people create, open and upload (PDF) files through a web application on the file server? We have 2 Mbit upload and download for the whole company.

    EDIT: What is also odd is that the Dell and the IBM server both show the dfsr ReadFile on $Directory, but not at the same time. We also work only on the IBM server, yet both servers are affected by the slowdowns, as if an access error were being replicated. How can that be?
  • Wednesday, 8 July 2009 10:14 Michael316

    Now something new: before every "ReadFile" block, during which the system freezes, the last dfsr operation is "QueryDirectory" on "...\DfsrPrivate\Staging\ContentSet..."

    What does this mean?

    PS: The details of the ReadFile on $Directory are: Non-cached, Paging I/O, Synchronous Paging I/O

    AND: if we open a folder that contains many files directly on the server, it takes very long and we also get ReadFile on $Directory, BUT from explorer.exe.

    AND: the QueryDirectory is often, or almost always, immediately preceded by a CreateFile on the (first part of the) QueryDirectory path.

    It seems that the dfsr process cannot open or read a directory in the staging folder, or something inside it, that it created earlier, and then it waits and waits. Or the folder is there but is so full that it takes a while until the dfsr process can open it, like browsing directories with many files in Explorer.

    But why? Is it just the poor performance of the hard drives? AND another idea: the files on the server are created and stored in directories that contain 20000 to 800000 (and sometimes more) files. Could this affect replication, in that opening the folder where a file was changed or created is what makes replication wait?

    Another thing: when the replication bandwidth is higher, the slowdowns occur more often than when the bandwidth is reduced. Could it be that with higher bandwidth a file is replicated immediately, so the error also occurs immediately, while with low bandwidth replication is delayed and the error comes later?
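
Since the post above suspects directories holding 20000 to 800000 files, one way to see which folders are that large is to count the files directly inside each directory. This is a minimal Python sketch, assuming it is run on the server against the root of the replicated folder. The function name `large_directories` and its default threshold are illustrative choices, and a scan of this size generates disk load itself, so it is probably best run off hours.

```python
import os


def large_directories(root, min_files=20000):
    """Return (path, file_count) pairs for directories that directly contain
    at least `min_files` files, largest first. Subdirectories are scanned too,
    but their files are counted separately, not added to the parent."""
    results = []
    stack = [root]
    while stack:
        current = stack.pop()
        file_count = 0
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    elif entry.is_file(follow_symlinks=False):
                        file_count += 1
        except OSError:
            # Skip directories that cannot be read (permissions, races).
            continue
        if file_count >= min_files:
            results.append((current, file_count))
    results.sort(key=lambda item: item[1], reverse=True)
    return results
```

Calling `large_directories(r"D:\SomeReplicatedFolder")` (a hypothetical path) would list the directories worth a closer look, which could then be compared with the paths seen in the Process Monitor trace during a freeze.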
  • Thursday, 9 July 2009 09:58 Michael316

    I have now found that it stopped and slowed down when a JPG file was renamed. How can THIS influence the replication process???

    Help me, Obi-Wan Kenobi! You're my last hope! ;)

    EDIT: OK, it seems that the renaming was not the cause.

    Could it be that replication has a problem with folders that contain a large number of files?