Windows 2008 R2 - large file copy uses all available memory and then transfer rate decreases dramatically (20x)

    Question

  • I have a problem that was discussed in the following link but never resolved.  I'm unable to reply to that thread, so I've created a new one in the hope that someone might be able to help.

    http://social.technet.microsoft.com/forums/en-us/windowsserver2008r2general/thread/74C2C9CA-F8C1-4C37-BC8C-CD074CE0C6CD?prof=required

    I have two Windows 2008 R2 servers, and I'm trying to copy large (minimum 50GB) files back and forth between the servers.  If I copy a 50GB file from server 0 to server 1, the transfer rate stays at just below 1 gigabit/sec on a gigabit switch.  However, if I copy a 50GB file from server 1 to server 0, the copy begins at just below 1 gigabit/sec, but once the amount of data transferred is equal to the amount of available RAM on server 0, the transfer rate steadily decreases (will continue to decrease rapidly and might level off at just 50 megabit/sec).  It doesn't matter if the file is pushed or pulled.

    Server 0 is a Dell PE2950 with 24GB of RAM and 2 dual core Xeon 5110 CPUs @ 1.6GHz

    Server 1 is a Dell PE2950 with 32GB of RAM and 1 quad core Xeon E5420 CPU @ 2.5GHZ

    I have seen this happen before on Windows 2008 x64 without R2, and I've used DynCache http://www.microsoft.com/downloads/en/details.aspx?FamilyID=e24ade0a-5efe-43c8-b9c3-5d0ecb2f39af&displaylang=en to resolve it.  However, DynCache is not supported on Windows 2008 R2, and it's not supposed to be needed on R2 because the problem was supposedly fixed / solved.  Interestingly, I only have the issue on one of the two R2 servers. 

    In task manager on the problem server, as soon as I start the file transfer, I can watch the available memory begin to drop.  At the moment I have 24GB of RAM in the server, and about 16GB of that is available.  Once 16GB of the 50GB file has been transferred, the available memory gets down to 0 in task manager, and then the transfer rate tanks.  The OS was installed just a week or two ago.  It has Hyper-V and SNMP installed, as well as the latest Windows updates.  I then installed the File Services role as well, but the problem still exists.  Nothing else has been installed. 
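
A rough way to see why the slowdown would line up with the amount of available RAM: if data arrives from the network faster than the destination disk writes it out, the difference accumulates in the file cache until memory runs out. The sketch below is a simplified model only, not a description of how the Windows cache manager actually behaves. The 16 GB and the near-gigabit arrival rate come from the description above; the disk drain rate is an assumed figure for illustration, and the function name is hypothetical.

```python
def seconds_until_cache_full(available_gb, net_mbit_s, disk_mbit_s):
    """Time until unwritten data fills available RAM (simplified model)."""
    if net_mbit_s <= disk_mbit_s:
        return float("inf")  # the disk keeps up, so the cache never fills
    available_mbit = available_gb * 1024 * 8  # gigabytes -> megabits
    return available_mbit / (net_mbit_s - disk_mbit_s)


# 16 GB available and roughly 950 Mbit/s arriving are taken from the post;
# the 50 Mbit/s disk drain rate is an assumption for illustration only.
t = seconds_until_cache_full(16, 950, 50)
print(f"cache full after about {t / 60:.1f} minutes")
```

Under this model the transfer runs at network speed for a couple of minutes and then can only proceed as fast as the disk drains the cache, which matches the shape of the symptom even if the real mechanism is more complicated.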

    Clearly there is still an issue here in Windows 2008 R2, but it doesn't seem to affect all servers in all situations.  There are also clearly other people having the same problem, but to my knowledge Microsoft has yet to acknowledge or address the issue in Windows 2008 R2.  Can anyone help?

    Thanks.

     

     

     

     

     

     

    • Edited by DougZuck Thursday, October 21, 2010 9:59 PM hyperlinks added
    Thursday, October 21, 2010 9:58 PM

Answers

  • After all this time, I finally "solved" the problem I was having. It seems like in this posting there are possibly multiple issues being discussed, because it's not clear that everyone's situation is the same as mine, so keep that in mind when you read what I changed to make the problem go away in our environment.

    In summary, it all came down to the write-caching policy on the RAID controller. We are dealing with Dell servers and Dell controllers, and I have reproduced the issue on both embedded/internal RAID controllers as well as their external RAID controllers for direct-attached storage arrays.

    When the RAID5 virtual disk on the controller is set to write-through, the copy performance issue exists. When the virtual disk is switched to write-back, the problem disappears. By default we always use write-back caching, but when the RAID battery fails (or while the battery is charging), the controller automatically switches the virtual disk back to write-through until the battery is replaced or charged. You can select to "force write-back" when the battery is dead, but the consequence is possible data loss if the server were to crash or lose power during a write operation.

    Interestingly, the performance difference of write-through vs write-back caching often seems negligible. However, for certain operations, including large file copies across a network, there are clearly issues. Interestingly, I have not been able to reproduce the problem when copying files on the same server from one array to another. It only seems to exist when copying across a network where the destination drive has a write-through caching policy.

    I hope this info helps some other people with similar problems. I can't believe that after all this time and so much testing that it all came down to a single setting. It's disappointing to discover that the write-caching policy can have such a large impact for some operations and a nearly non-existent impact for others.

    • Marked as answer by DougZuck Thursday, August 18, 2011 5:54 PM
    Thursday, August 18, 2011 5:29 PM

All replies

  • Hi DougZuck,

     

    Thanks for posting here.

     

    I would suggest checking whether the issue persists after running the commands below, which disable Receive Side Scaling (RSS) and TCP receive window auto-tuning in Windows Server 2008 R2.

     

    netsh interface tcp set global rss=disabled

    netsh interface tcp set global autotuninglevel=disabled

    Reboot the server
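
If these settings need to be applied and later reverted on several servers, the commands can be scripted. The following is a minimal sketch that wraps the two netsh commands above, plus `netsh interface tcp show global` to record the state before and after. The function names are hypothetical, and nothing is executed unless `dry_run` is set to False (which requires an elevated prompt on Windows).

```python
import subprocess


def tcp_global_commands(rss="disabled", autotuning="disabled"):
    """Build the netsh command lines from the post as argument lists."""
    return [
        ["netsh", "interface", "tcp", "show", "global"],
        ["netsh", "interface", "tcp", "set", "global", f"rss={rss}"],
        ["netsh", "interface", "tcp", "set", "global",
         f"autotuninglevel={autotuning}"],
        ["netsh", "interface", "tcp", "show", "global"],
    ]


def apply(commands, dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


apply(tcp_global_commands())  # dry run: prints the four commands only
```

To put things back after testing, the same helper can be called with `rss="enabled"` and `autotuning="normal"`.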

     

    For background information, please refer to the article below:

     

    Information about the TCP Chimney Offload, Receive Side Scaling, and Network Direct Memory Access features in Windows Server 2008

    http://support.microsoft.com/kb/951037

     

    Thanks.

     

    Tiger Li

     

    TechNet Subscriber Support in forum

    If you have any feedback on our support, please contact tngfb@microsoft.com 


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread.
    Friday, October 22, 2010 3:00 AM
  • Thank you for your response, Tiger Li.  I followed your instructions and did the following, but the problem still exists.

    netsh interface tcp set global rss=disabled

    netsh interface tcp set global autotuninglevel=disabled

    Reboot the server

     


    Any other suggestions?

    Thanks.
    Friday, October 22, 2010 1:29 PM
  • Hi DougZuck,

     

    Thanks for update.

     

    Are there any errors in the event log?

    Please check which process's memory usage is increasing by running “perfmon.exe /res” to open Resource Monitor when the large file transfer begins.

    If you remove the installed roles (Hyper-V, File Services, SNMP) from the server, does the issue persist?

    Please also update the NIC and RAID controller firmware to the latest versions.

     

    Thanks.

     

    Tiger Li

     

    Monday, October 25, 2010 5:41 AM
  • No event log errors

    Resource monitor shows SYSTEM increasing in memory usage when the copy starts

    RamMap confirms that the file that's being copied is what's being cached and what's using up all the RAM

    The problem still exists when FileServer role and SNMP are removed.  I'm not able to remove Hyper-V from this server because it needs to continue hosting virtual machines.

    Firmware for NIC and RAID are already latest versions.

     

    The issue is clearly an OS issue.  The issue existed in 2008 x64 without R2, so I'm not surprised that it also exists in 2008 R2.  The problem is that at least in 2008 x64 without R2 you could use DynCache to work around it, but DynCache unfortunately cannot be used on 2008 R2.  Additionally, in 2008 R2, the problem seems to only exist on some servers.  It's unclear why this is the case.

     

    Thanks.

    Monday, October 25, 2010 4:13 PM
  • How are you performing the file copy?  I think you're seeing the effects of a buffered file copy.  Try using XCOPY with the new /J switch to do an unbuffered file copy.  I've been able to move roughly 4GB per hour between servers using that new switch, with no ill effects to the servers themselves.
    Wednesday, October 27, 2010 6:16 PM
  • Tracy - thanks for the response.  Yes, I'm able to use xcopy /j, but as you pointed out, it is EXTREMELY slow.  4GB per hour is simply not going to cut it.  I have files that are hundreds of GB.

     

    Thanks,
    Doug

    Wednesday, October 27, 2010 6:25 PM
  • Is it possible that whatever you are using to copy the files (maybe Windows Explorer itself, which is a poor choice for copying that much data) simply isn't releasing the used memory properly, and that when it slows down it's actually using the pagefile for memory? I personally would use Robocopy for a job like this; it has worked very well for me. I have not measured the exact transfer speeds, but typically a transfer is limited by its slowest component: HDD, SAN, NIC, HBA, CPU, link, switch, etc. Just a thought.
    Wednesday, October 27, 2010 6:54 PM
  • Robocopy and Windows Explorer both exhibit the same behavior.  The file gets cached, available memory drops to 0 or near 0, then the file transfer rate drops dramatically.  You can avoid the caching by using xcopy /J, which is an unbuffered copy, but it's too slow for transferring very large files.

    -Doug

    Wednesday, October 27, 2010 7:30 PM
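
For anyone who wants something between a fully buffered copy and xcopy /J, one option is a buffered copy that forces data to disk at regular intervals, so the amount of unwritten data held in memory stays bounded. This is a different technique from what xcopy /J does (it does not bypass the cache), and whether it helps with the behavior described in this thread is untested. A minimal sketch, with a hypothetical function name and arbitrary default sizes:

```python
import os


def copy_with_bounded_cache(src, dst, chunk_mb=8, flush_every_mb=256):
    """Copy src to dst, forcing data to disk every flush_every_mb megabytes."""
    chunk = chunk_mb * 1024 * 1024
    flush_every = flush_every_mb * 1024 * 1024
    copied = 0
    since_flush = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            data = fin.read(chunk)
            if not data:
                break
            fout.write(data)
            copied += len(data)
            since_flush += len(data)
            if since_flush >= flush_every:
                fout.flush()             # push Python's buffer to the OS
                os.fsync(fout.fileno())  # ask the OS to write it to disk
                since_flush = 0
        fout.flush()
        os.fsync(fout.fileno())
    return copied
```

The periodic `os.fsync` trades some throughput for a cap on dirty data; the read side of the copy is still cached normally.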
  • But since you only see the behavior on one server and not both, I would assume there is some kind of problem on that one server, and it is apparently going to be hard to identify.
    Thursday, October 28, 2010 1:11 PM
  • Yes, that's true.  However, please also note that this is a brand new installation of Windows 2008 R2.  The server was then joined to the domain.  Then Hyper-V was installed and Windows Updates were applied.  Nothing else was done (except later installed FileServices role just to see if it might fix the caching issue, but it didn't).  Additionally, I have seen numerous other postings on the web from other people seeing the same behavior in 2008 R2.  The issue seems pretty clearly an OS issue, especially since it also exists in 2008 x64 without R2.  But again, I'd like to highlight that in 2008 without R2 you could use DynCache to modify the caching behavior.  DynCache, however, is not supported on 2008 R2. 
    Thursday, October 28, 2010 2:04 PM
  • This sounds like the disk is not able to keep up with the network throughput.  A perfmon to see the caching would help in troubleshooting this.
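
One quick way to test the "disk cannot keep up" theory without perfmon is to time a sequential write on the destination volume, including the final flush to disk, and compare it with the network rate (a gigabit link delivers at most 125 MB/s before protocol overhead). The sketch below is a rough measurement, not a benchmark tool: the function name and file name are hypothetical, the test size should be large enough to be meaningful on the system under test, and a RAID controller's own cache can still distort the result.

```python
import os
import tempfile
import time


def sequential_write_mb_s(directory, total_mb=1024, chunk_mb=4):
    """Write total_mb of data to a temp file in directory and return MB/s."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)  # non-zero data, reused per write
    path = os.path.join(directory, "write_test.tmp")
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(total_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # include the time needed to reach the disk
    elapsed = time.perf_counter() - start
    os.remove(path)
    return total_mb / elapsed


if __name__ == "__main__":
    # Small size so the example finishes quickly; use a much larger one for real tests.
    rate = sequential_write_mb_s(tempfile.gettempdir(), total_mb=64)
    print(f"sequential write: {rate:.0f} MB/s")
```

If the measured rate is well below the network rate, the cache filling up during a network copy is the expected consequence rather than a separate fault.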

    If the server is not primarily a file server you can configure the following so that there is less emphasis on caching:

    HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management
    LargeSystemCache=0    (DWord decimal)

    If the server/applications are slowing down because it is running out of memory then you can configure the OS to try to keep a little more padding.  By default a 64bit OS will try to keep available memory at 64MB or higher.  64MB limit is good because that means the OS can use the extra memory not being used to cache files in memory.  Memory is much faster than disk access.  However if your system has large spikes in memory usage then setting a higher Low Memory Threshold might prevent some sluggishness during sharp high memory demands.  Below is a sample configuring it to 200MB

    HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management
    LowMemoryThreshold=200    (DWord decimal)
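
The two values above are given in decimal, while .reg files store DWORDs as eight hexadecimal digits, which is an easy place to make a mistake (200 decimal is c8 hex). As a small illustration, the sketch below renders the two settings from this post as .reg file text; the function name is hypothetical, and the output should be reviewed before importing it anywhere.

```python
KEY = (r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control"
       r"\Session Manager\Memory Management")


def reg_file(values):
    """Render DWORD values (given in decimal) as .reg file text."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{KEY}]"]
    for name, decimal_value in values.items():
        # .reg files expect DWORDs as 8 hex digits
        lines.append(f'"{name}"=dword:{decimal_value:08x}')
    return "\r\n".join(lines) + "\r\n"


print(reg_file({"LargeSystemCache": 0, "LowMemoryThreshold": 200}))
```

The LowMemoryThreshold line comes out as `dword:000000c8`, which regedit displays as 200 in decimal view.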

    Windows 2008 R2 has improved memory management algorithms in comparison to windows 2008.  Windows 2008 R2 should not need Dyncache beyond the following hotfix that further refines the new memory management algorithms.

    979149  A computer that is running Windows 7 or Windows Server 2008 R2 becomes unresponsive when you run a large application - http://support.microsoft.com/default.aspx?scid=kb;EN-US;979149

    Anything further than this would require a paid incident to fully troubleshoot what is happening to the memory on your system further.

    Hope this helps

     


    David J. This posting is provided "AS IS" with no warranties, and confers no rights.
    Wednesday, November 10, 2010 9:11 PM
    Moderator
  • Thanks for the reply!  Those are very good suggestions and I will try them all and report back soon.

     

    -Doug

    Wednesday, November 10, 2010 10:31 PM
  • Hi Doug, did you resolve your problem? I'm facing the same problem now. I tried the suggestions above, but none worked.


    LC
    Wednesday, November 17, 2010 10:05 AM
  • I spent some time today doing some more testing, but unfortunately neither the registry key nor the hotfix has fixed the problem.

    -Doug

    Wednesday, November 17, 2010 7:11 PM
  • I'm having the same issue when copying from drive to drive on a Windows Server 2008 R2 test server with the Hyper-V role. Used physical memory goes up and down during the copy, and overall system responsiveness is very bad.
    Wednesday, November 24, 2010 7:07 PM
  • How come MS does not solve this? There are a lot of posts on many sites about this issue.
    Thursday, November 25, 2010 8:10 PM
  • Hi,

    We need to analyze the SMB packets. Please collect simultaneous netmon / ethereal traces for a copy from server 0 to server 1,

    and another set of simultaneous traces for a copy from server 1 to server 0.

    For each trace, make sure that you:

    a) start the trace

    b) reproduce the problem

    c) stop the trace.

    Then upload the traces to an FTP site or another location from which I can analyze the data.

    Friday, November 26, 2010 4:32 AM
    Moderator
  • Hi

    I have a very similar problem.  In my case, it is a brand new Dell R415, Server 2008R2 and Hyper-V installed.

    Interestingly, the problem only occurs when copying files to and from iSCSI drives.  For example, one of our iSCSI devices is a Netgear ReadyNAS.  If I map an SMB share, a 10GB file copies at 11.5MB/sec (100Mbps network) as expected, with no rise in memory usage.  If I copy the same file to the same device, but this time to a mounted iSCSI path, the memory usage rises until it is maxed out.  This is on the LAN network using the onboard Broadcom NICs.
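
Since this thread mixes megabits and megabytes per second, a quick conversion helps when judging whether a figure is "as expected". The sketch below divides the link rate by 8 and compares the observed copy rate with it; the helper names are hypothetical, and the 11.5 MB/s and 100 Mbps figures are the ones quoted above.

```python
def mbit_to_mbyte_per_s(mbit_s):
    """Convert a link rate in megabits/s to megabytes/s."""
    return mbit_s / 8


def link_efficiency(observed_mbyte_s, link_mbit_s):
    """Fraction of the raw link rate that the observed copy rate reaches."""
    return observed_mbyte_s / mbit_to_mbyte_per_s(link_mbit_s)


print(mbit_to_mbyte_per_s(100))              # 12.5
print(round(link_efficiency(11.5, 100), 2))  # 0.92
```

So 11.5 MB/s on a 100 Mbps link is about 92% of the raw rate, which is why the SMB copy counts as running at full speed.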

    We have a Broadberry SAN on a separate ISCSI gigabit network with managed switch connected to a dual Intel PT NIC on the same R415 server.  Copying files to this device exhibits the exact same behaviour.

    We also have an older Dell 1900 running 2008R2 on the same network.  This too has Hyper-V and is connected to both networks, and it doesn't have the same issue with these devices.

    As we are a Gold partner, I have opened a case with MS on Monday using one of our incidents.  The performance team and networking team are both taking turns at solving this issue, but nothing as yet.

    The latest is they have done exactly what Sainath has suggested and are currently examining the netmon traces.

    In the meantime, if anyone has any suggestions, they would be very well received!

    Wednesday, December 01, 2010 9:34 PM
  • Simon - thanks for sharing.  Please do let us know what the MS team comes up with.

    -Doug
    Wednesday, December 01, 2010 11:27 PM
  • Anything heard back from MS yet? Simon?
    Wednesday, December 15, 2010 3:48 PM
  • Hello Simon,

     

    I've got exactly the same problem here:

    Dell Server R415, Server 2008 R2 and Hyper-V. What I can say at the moment:

    Same problem as you describe. But here is some additional information:
    - We additionally have an Intel Gigabit card. --> same problem on both sides
    - Also, I tried with direct cabling (without switch) --> same problem

    Tuesday, December 21, 2010 3:22 PM
  • We also see this behavior on our iSCSI SANs: all 3 IBM DS3500 SANs, with IBM x3650 M2s as hosts (some in cluster configurations, some not).

    Performance degrades so badly that I/O issues crop up with the running Hyper-V images and start crashing them (putting them in a critical or saved state).  This occurs even on hosts with 96 GB of RAM. 

    IOPS on the iSCSI SAN do start to peak and hit critical thresholds, so I'm assuming the server starts buffering and caching the copy in memory when it can't write to the disk.

    This causes hard faults in memory and kills the file copy and even the running images, as I mentioned before.

    I think it has more to do with SAN throughput, though, because if we have two shared CSVs, the images on the RAID 10 SAS array chug along happily, while the images running on SATA start dying.

    A fix or explanation would be hugely helpful, as I've been struggling with this for over a year now.

    Tuesday, December 28, 2010 11:44 PM
  • Hi Guys

    Sorry for the delay, work commitments etc.

    I worked with Microsoft for the best part of two weeks, but unfortunately they did not come up with an answer.  We had the performance team, iscsi team and network teams all working on this but no useful answer. To be honest, I now don't think there is a single answer.

    Using Jperf (iperf) on default settings, I now have a wire speed of 600mbps, not perfect, but usable.  I achieved this using a mixture of Windows TCP tweaks from StarWind's site http://www.starwindsoftware.com/forums/starwind-f5/recommended-tcp-settings-t2293.html, updating all the network card drivers, and changing and matching EVERY setting on the SAN and Hyper-V server until I got a result.  According to anyone who knows anything about iSCSI, I should be getting 950mbps wire speed.  But with the effort I have put into this, a consistent 600mbps is acceptable. 

    We have an older Dell Powerconnect GB switch.  Interestingly, jumbo frames made things worse.  Though I believe there are differences between switches as to whether they use 9014 or 9000 MTU size.

    Sorry I don't have a fix, but try the StarWind info above.  There is another thread about offloading and Hyper-V; that makes a difference too.  Changing and matching settings helps, as does updating the drivers.  The last driver gave me 30mbps extra.

    As for measuring, I like Jperf, but Iometer appears to be a professional tool for measuring wire speed and  disk reads / writes, though I've not tried it yet.
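
For readers without Jperf or Iometer to hand, the idea behind a wire-speed test is simply to push bytes over a TCP connection with no disks involved and time it. The sketch below shows that idea over the loopback interface only, so it measures the local TCP stack rather than a real link; to test between two servers the sender and receiver halves would need to run on separate machines. It is not a replacement for iperf, and all names and sizes in it are hypothetical.

```python
import socket
import threading
import time


def measure_tcp_mbit_s(total_mb=64, chunk_kb=64, host="127.0.0.1"):
    """Send total_mb over a local TCP connection; return (bytes received, Mbit/s)."""
    chunk = b"x" * (chunk_kb * 1024)
    total = total_mb * 1024 * 1024
    received = []

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def sink():
        # Receiver: accept one connection and count bytes until the sender closes.
        conn, _ = server.accept()
        count = 0
        with conn:
            while True:
                data = conn.recv(1024 * 1024)
                if not data:
                    break
                count += len(data)
        received.append(count)

    receiver = threading.Thread(target=sink)
    receiver.start()
    start = time.perf_counter()
    with socket.create_connection((host, port)) as client:
        sent = 0
        while sent < total:
            client.sendall(chunk)
            sent += len(chunk)
    receiver.join()  # wait until the receiver has seen end-of-stream
    elapsed = time.perf_counter() - start
    server.close()
    return received[0], received[0] * 8 / elapsed / 1e6


if __name__ == "__main__":
    nbytes, rate = measure_tcp_mbit_s(total_mb=16)
    print(f"received {nbytes} bytes at about {rate:.0f} Mbit/s (loopback)")
```

Because no file is read or written, a result far above the file-copy rate points at storage or caching rather than the network path.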

    Tuesday, January 25, 2011 11:50 AM
  • This is utterly ridiculous. We just noticed this problem as well, and it's quite concerning that a company that has been developing server operating systems for well over 15 years is not able to provide a stable OS that allows for large files to be copied without bringing down the server.

    To be honest, we have noticed a concerning trend. Starting with Server 2008, we have seen an increasing amount of bugs with the OS (performance, stability) that are not being fixed by Microsoft. There is clearly no interest from Microsoft's side to address these issues that are obviously and easily reproducible.

    In our case it doesn't involve SAN etc. at all. It's Windows 2008 Storage Server Standard just copying a 50Gb file that brings the server down in minutes.

    Should we downgrade to Windows NT or Windows 2000? :-)

    • Edited by WizardOz Friday, February 18, 2011 6:35 PM Added more details
    Friday, February 18, 2011 6:19 PM
  • hi guys.

     

    just thought i'd throw this into the mix. we are facing a similar problem at a client's site. we have just been virtualising a client's server onto the same hardware. we used the ShadowProtect HIR methodology and got the machine up and running inside a temporary server, then rebuilt the hardware, adding new raid 5 arrays to host virtual machines.

    the hardware is an intel s5000 based server with 20Gb of RAM. the host OS is installed on the intel embedded raid array. there are 2 raid 5 arrays on independent lsi 9240-8i controllers, which will each host a virtual machine.  we have removed the sata hard drive containing the vhd's for the new virtual machines from the temporary server, connected it to a sata port on the mobo and are copying the vhd's to the 2 raid 5 arrays. it's currently copying at a mind blowing 6Mb / sec

    i don't think this is solely a network issue. i think you guys see it manifesting itself as such and there is a deeper problem...  we only just did it yesterday so i haven't investigated too much.

    i get the same issue more or less. available memory plummets and then copy time goes up almost exponentially

    I will report back when and if i get any useful information. but for now i am going to simply wait for the copy to finish as i need to get the machine back up and running....

    cheers

    chris

    Sunday, March 06, 2011 8:05 AM
  • I am seeing the same issue w/ a twist.  See strange test results below.

     

    Problem:

    When copying files larger than 2GB from one drive to another on a server running Windows 2008 R2, the transfer rate is 20MB/s or less.

     Environment:

    Windows 2008 R2 server running on VMware ESX 4.1 connected to two iSCSI volumes [Drive S: (source), Drive D: (Destination)].  10GB network connection to SAN employing MPIO.

    Windows 2003 R2 server running on VMware ESX 4.1 connected to two iSCSI volumes [Drive S: (source), Drive D: (Destination)].   10GB network connection to SAN employing MPIO.

     

    Test Scenario:

    Scenario 1:   Windows 2K8 R2 server on ESX 4.1 has Drive S: and Drive D: connected via iSCSI over 10GB network.  Copy large (3GB) file from Drive S: to Drive D:.  The transfer rate shows as ~19 MB/s, and takes well over an hour to transfer.  Copy small (512MB) file from Drive S: to Drive D:, and this transfers right away.

    We then disconnect the drives from the W2K8 R2 server. 

    Scenario 2:  Then connect the same drives to Windows 2k3 R2 server  on ESX 4.1 connected via iSCSI over 10GB network. Copy large (3GB) file from Drive S: to Drive D:.  The transfer rate shows as 147 MB/s, and transfers in minutes.

    So, it appears the issue is the Windows 2008 R2 O.S., or some technology it is leveraging differently than Windows 2003 R2.

    A very interesting oddity I tripped across -

    When running Scenario 1, (W2K8 R2), I ran an IO Meter write test pointing to the Drive S: (the drive we are transferring the file from).   The instant I start writing to that volume, the file copy rate on the large file job going from Drive S: to Drive D: jumps from ~19MB/s to ~300MB/s, and the file copies in less than a minute.  This is reproducible time and again.  For some reason, writing to the Source Volume using IO Meter causes the transfer rate jump up, and sustain the expected rate.

     

    Any ideas or pointers are greatly appreciated.

    -Gary

     

     

    Tuesday, March 08, 2011 10:18 PM
  • We have been experiencing this issue for over a year on two different hardware platforms: one running Win 2K3 x64 Ent and the other running Win 2K3 ia64, both attached to a SAN. I'm pretty sure this issue will arise no matter the OS version. The source drive seems to be the issue in some cases, only being able to read at about 500kbs to 8mbps. The activity on the SAN is almost nonexistent, as is the activity on the destination, our virtual library.

    Our process is to run a nightly full SQL backup to a consolidated file server. Then our backup software backs up that consolidated volume to a virtual library. Both the SQL backup and the backup to virtual library have this same issue. The originating server is obviously a SQL server. We have many SQL servers that send backups here. The consolidated backup file server is also a SQL server. The virtual library is just a backup solution.

    Wednesday, March 09, 2011 5:08 PM
  • mgr34 - thanks for the input, but I really don't think the issue you're experiencing is the same as the issue being discussed in this thread.

     

    Thanks.

    Wednesday, March 09, 2011 5:24 PM
  • "mgr34 - thanks for the input, but I really don't think the issue you're experiencing is the same as the issue being discussed in this thread."


    I believe it is as the performance related to available memory is identical to what you described in the original post in the 2nd to last paragraph. It is more visible on our SQL server because of the way memory is allocated for SQL. We would usually have about 6-8gb of memory free for the OS and other services. Last night we configured SQL to free up 50gb of memory prior to the backup running and we saw the performance hold out until that 50gb was used up by the file copy.

    Do you agree that sounds like what you're dealing with?

    Wednesday, March 09, 2011 8:40 PM
  • Interesting.  That does, indeed, sound like the issue we're experiencing.  What's strange to me is that I've never seen this problem on Windows 2003.  We have hundreds of SQL databases on Windows 2003, with many being 500GB to 1.5TB in size, and I've never had this issue either doing SQL backups or large SQL mdf file copies on Windows 2003.  In Windows 2008 non-R2, this problem is readily apparent but work-around-able with DynCache.  In Win 2008 R2, this problem seems to happen with some servers but not others, and unfortunately there is no workaround. 

    -Doug
    Wednesday, March 09, 2011 8:59 PM
  • Today I was working on a Windows 2008 R2 server with 2 MPIO iSCSI connections to a volume.  As mentioned in my test above, we started a file transfer, but in this case we were copying from an iSCSI volume to a local physical volume.  The transfer rate was ~18MB/s.   We started IO Meter doing a write to the source drive as I had done in my test above, and once again, the rates jumped to 155MB/s.  In this case it is a physical server, not a VM.  So, I am really curious what IO Meter does that it "opens" up the communication.

     

    Any ideas what IO Meter is doing?

    Friday, March 18, 2011 10:15 PM
  • I've been banging my head against a wall for a week of 20 hour days.  I'm building a prototype / test server (Windows Server 2008 R2 64bit Ent) in my home lab with the hopes of running Hyper-V for a multi-server dev / test environment.  My server is a PC class machine, I7 960 (8 core) @ 3.2GHz, 12GB Kingston DDR3-1333 RAM (going to 24), on an Asus P6X58D-E mobo, built-in GB NIC (Marvell Yukon 88E8056) and I've added an Intel 1000/Pro GT GB NIC, 2 WD 1TB Caviar Black HDs mirrored on the built-in Marvell 6GB/s RAID controller.

    Before I go too far, I should remind you that the plumber's faucets always leak, the mechanic always drives a beater, and electricians sit in the dark.  You'll see what I mean shortly.

    I built the "server" last week and eventually had Hyper-V running with 4 test server guests.  The whole time I was building the server I struggled with performance problems.  I wasn't sure if it was network or disk but I found the recommendations for disabling all of the offloading and greening in the advanced NIC settings (which helped a little) and kept going.  Because this is a workstation class server, for the most part, I'm at the mercy of the "out-of-box" Microsoft drivers as most hardware vendors aren't providing 2008 R2 64bit drivers for workstation hardware but I was "lucky" and had managed to find most of the drivers I needed from the actual vendor sites (not from Asus).

    All was going well, I was many hours into building the guest servers and then I decided to shut the server down and move it to my lab (it was in my living room up until that point).  When I fired it back up it was dead.  I was getting a BSOD and reboot so fast, I had to film it then step through the video one frame at a time to see that it was a Stop Error 7B (hardware, likely disk problem).

    Chkdsk /F or /R ended up being completely useless because of a memory leak in 2008 R2 Chkdsk when encountering very large files (VLFs) that is apparently a known issue but not being worked on because it doesn't affect too many people.  Yeah right!  Only those with large VHDs and databases need worry!  The Chkdsks mostly failed and what I ended up with was a trashed system/boot partition, a completely wiped out Apps partition but my Data partition with my VMs and VHDs appeared, for the most part, intact.  I tend to blame a crappy Marvell storage driver for this corruption but the jury is still out.

    I know this is getting long winded but bear with me:  I'm documenting this for myself and anyone else that's Googling this problem because there is a lot of conflicting crap out there regarding this issue(s).

    I lost 1/2 a day trying to resolve this issue before giving up and starting over.  This time I was going to be diligent and back up as I went along (see my comment about the plumbers/mechanics/electricians).  I rebuilt the server, formatting (not quick - never quick) the partitions and when I had a bare bones, patched, service packed, and repatched server I backed it up to an empty (thanks to Chkdsk) partition on the server.  The next morning I went to copy the backups to my XP workstation and that's when I was struck by the poor performance issues again.  Keep in mind, Hyper-V wasn't even installed yet.  The server had no roles or features installed yet.

    I ran into all the problems mentioned above: slow network, all memory going to cache until it's at 0MB free and never being relinquished even after the transfer is aborted - making the server very sluggish to completely unresponsive, network activity dropping to 0 for periods of time in the middle of a transfer, transfers of large files taking forever or being unable to complete, etc.

    When looking in Perfmon, I was getting around 6MB per interval (MBpI) on a GB network; it didn't matter if it was push or pull.  From the XP workstation to the server I was getting about triple that performance, push or pull.  Not great but better.  Win7Pro on my laptop was getting comparable results over 130Mbps wireless.  But as I'd be watching the copy progress in Perfmon, I'd see the odd peak of 20 MB per interval, an average of 6 MBpI, but then periods of complete flat-line 0.  There was no other traffic on the GB network.  If I browsed the mapped drive suddenly I'd get activity on the transfer again or if there already was activity, I'd get a 20 MBpI peak.  So in response to Gwaters' question, I don't think it matters what IO Meter is doing so much as the fact that it is doing "something" on the server which keeps the NIC and stack awake.

    I downloaded DiskBench and was getting decent performance (1.2 GBps) from disk to disk on the local server - nothing near the 5 or 6 GBps the controller / drives are set for but decent.  Everyone on the net was saying to run network captures to provide them more info so I ran Wireshark and besides a few lost segments and a whole bunch of SMB2 traffic that I wouldn't have expected since XP was the destination, I didn't see anything bad.

    I researched the problem some more and ran into all kinds of conflicting information regarding NIC and TCP settings and registry tweaks for LANMAN Server, and workstation and TCPIP.  I have honestly tried almost every combination over the last several days.  And it's completely hit and miss.  This morning I had a pleasant surprise:  Microsoft had released an update for:  Slow performance in applications that use the DirectWrite API on a computer that is running Windows 7 or Windows Server 2008 R2.  (http://support.microsoft.com/kb/2505438)  Surely this couldn't be related.  Surely a font issue wouldn't cause all kinds of performance problems.  Guess what?  It resolved some of the issues.  Performance increased about 1MBpI on the backup copy to the XP box.  There are no longer any dead periods of 0 network activity on a large transfer.  Memory on the server isn't dwindling down to 0 MB free and never being relinquished.  Hooray!  Oh snap, overall performance still sucks.

    So I kept going.  I downloaded LanSpeedTest.  It flushes the caches and removes the hard drives from the equation to do a test of the network transfer.  I ran a bunch of tests before and after applying all of the http://www.starwindsoftware.com/forums/starwind-f5/recommended-tcp-settings-t2293.html tweaks.

    Before:

    Copy 20MB file to XP from 2008 = 143Mbps Writing, 223Mbps Reading

    Copy 3GB file to XP from 2008 = 109Mbps Writing, 219Mbps Reading

    After Starwind's Tweaks:

    Copy 20MB file to XP from 2008 = 90-171Mbps Writing, 189-215Mbps Reading (ran multiple times because of a poor first run attributable to the server possibly still settling down after a reboot)

    Copy 3GB file to XP from 2008 = 112Mbps Writing, 221Mbps Reading


    So I'm not seeing anything like the gains that Simon saw.  Re-reading his post I see there's a reference to another link at Starwind specific to Hyper-V that I should investigate further.  But so far, it doesn't seem to matter what I do to the stack, everything still works about the same.  I may never see that kind of performance since I'm running a cheapola Netgear gigabit switch.  The items that are having a positive effect are: 

    1. Disable all offloading on the advanced NIC settings - especially to get Hyper-V to even work.
    2. The MS performance patch: http://support.microsoft.com/kb/2505438
    3. Disabling virusscan on the client = 1 MBps for VLF copies.

    One thing I forgot to mention is that I have the exact same performance from the built-in Marvell Yukon NIC with the latest drivers and the Intel NIC with the MS drivers.  I was hoping the Intel NIC would save the day.  It didn't.

    If anyone has any other recommendations I'll be willing to try them.  Now wish me luck.  I'm going to resume my work with Hyper-V and hopefully I don't lose the whole server again.

    Cheers,

    Lazarus

     

    Wednesday, March 23, 2011 9:06 PM
  • Unfortunately I'm now experiencing the same issue on a different Server 2008 R2 machine, but this one doesn't have Hyper-V or any additional server roles.  Also, just as a point to Lazarus's post about KB2505438, it doesn't seem to have any impact on the issue on either of the machines that I see the problem on. 

    Hopefully Microsoft will address this issue at some point.

    Thanks.

    Thursday, March 31, 2011 5:48 PM
  • Add 1 to the list of people who are experiencing this same problem. My environment doesn't include a SAN or iSCSI volumes, just standard SAS drive volumes. Am I correct in assuming this is a Server 2008 R2 issue, and not a Hyper-V issue?

     

    Each night we copy:

    * FROM: About 100GB of SQL backups from a physical machine (SQL Database server)

    * TO: the HOST machine of a Hyper-V machine (we do this because our backup drive is a USB external HDD, which none of the virtuals can access)

    Anyone have success with software such as ViceVersa PRO, or are we likely to experience the same situation with that (where it consumes all available memory on the target machine)?

    -Rich

    Tuesday, April 26, 2011 2:57 PM
  • Rich - are you doing local SQL backups and then copying the .bak files in a second step? If so, one possible workaround might be to execute the SQL backups directly to the destination server in one step rather than doing them locally and having a second step to copy them over. I don't know for sure if this would work, but I think it might. -Doug
    Tuesday, April 26, 2011 3:09 PM
  • I am also seeing this problem copying large files (exported VMs) from a SAN to the local SAS RAID on a Windows 2008 R2 server. I will follow the thread, thanks for your work.

     


    CarolChi
    Thursday, April 28, 2011 9:21 AM
  • I’ve had my hyper-v server (Windows Server 2008 R2) attached to my SAN for about one year without any performance issues. Normally I get about 100MB/s read and write performance. My server is a Dell R710, 4 NICs dedicated to iSCSI with Round Robin.

    Two weeks ago I upgraded the server hardware and also did a complete OS reinstall. After the upgrade read performance has dropped to about 30MB/sec. Except for the difference in hardware (additional CPU and RAM) the configuration is exactly the same.

    KB2505438 did not resolve my problem.

    Wednesday, May 11, 2011 10:08 AM
  • We have a few VM hosts HP DL585G2 and a DL385 G5 both having the same issue. After making some changes I have been able to get a 75GB file to copy from a Physical server to a VM guest.

    After making the following changes on the VM host and some on the VM guests, I have had better luck copying large files.

    1) applied all the latest windows updates (including sp1 for 2008 R2) Including KB2505438 everything but IE9.

    2) My config has a nic for each vm, currently 3 vm per host and the HOST has a total of 6 interfaces.

    2a) disable all un-used interfaces.

    2b) disable power mgmt to turn off each interface.

    2c) In the NIC properties I only checked Microsoft Virtual Network Switch Protocol and HP Network Configuration Utility.

    - On the VM host I do not have IPv4 or IPv6 checked; everything is unchecked. The only things checked are those listed in 2c.

    3) On the VM guest, the network interface has everything checked.

    3a) I have seen articles that talk about unchecking TCP Checksum Offload; however, I have not made any changes there.

    3b) Install all the Windows updates on the VM guest.

    4) Shut down vm guest

    5) reboot VM host

    6) power up VM guest and try the copy again.

    During this copy my disk write avg was 36 MB/s

     

     5/13/2011 ---

    I am two for two now on large file copies after making these changes. On Monday I will try again, and then try on my other VM server that was having issues.




    • Edited by ntschultz Friday, May 13, 2011 6:43 PM updated with additional test results
    • Proposed as answer by ntschultz Friday, May 13, 2011 6:48 PM
    • Unproposed as answer by DougZuck Friday, May 13, 2011 7:15 PM
    Thursday, May 12, 2011 5:34 PM
  • Thanks for testing and posting, but I'm not sure the problem you're describing is really the same as the problem being discussed in most of the rest of this thread.  For now I've "unproposed" your answer.

    Your issue, if I'm understanding correctly, is with copying large files from inside VM guests to other physical servers.  The problem on this thread is more about file copy performance, in general, from physical server to physical server. Additionally, it's not a question of being unable to perform the copy.  The question on the table is "why is performance so bad when doing a large file copy, and why does it start good and then gradually get worse as the free memory available in the machine is used up?"

    Additionally, you mention that you are able to get 36MB/sec.  When copying large files from one physical server to another physical server on a gigabit network, we should be able to get upwards of 125MB/sec.  36MB/sec simply isn't acceptable for our purposes.
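
To put the two figures above side by side, here is a rough back-of-the-envelope sketch in Python. It assumes decimal units (1 GB = 1000 MB, gigabit = 1000 Mbps) and ignores protocol overhead; the function name is made up for this illustration.

```python
# Theoretical ceiling of a gigabit link in MB/s (8 bits per byte, no overhead).
GIGABIT_MB_PER_S = 1000 / 8  # 125.0


def copy_minutes(size_gb, rate_mb_per_s):
    """Minutes needed to move size_gb at a sustained rate (decimal units)."""
    return size_gb * 1000 / rate_mb_per_s / 60


if __name__ == "__main__":
    # 75 GB is the file size mentioned in the post being replied to.
    for rate in (36, GIGABIT_MB_PER_S):
        print(f"75 GB at {rate:g} MB/s: {copy_minutes(75, rate):.1f} minutes")
```

Under these assumptions, the 75GB file takes roughly 35 minutes at 36MB/sec versus about 10 minutes at the gigabit ceiling.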

     

    -Doug

    Friday, May 13, 2011 7:21 PM
  • To be more clear on what my issue was, which I believe is what others are experiencing:

    When copying large files, the copy fails and all memory marked as FREE becomes used.

    The error I would get when the large file copy fails:

    An unexpected error is keeping you from copying this file. If you continue to receive this error, you can use the error code to search for help with this problem.

    Error 0x8007046A: Not enough server storage is available to process this command.

    How I copy files: I RDP into the VM, then use a UNC path or map a drive to a physical server and copy the file over.

    Additionally, when watching the Task Manager Performance tab I see all memory listed under FREE get used up.  This sounds like the issue others are experiencing.  

    Today's test failed... calling MS to open a ticket.
    • Edited by ntschultz Monday, May 16, 2011 3:33 PM more testing results
    Monday, May 16, 2011 2:14 PM
  • My case was solved using Tiger Li’s suggestions:

    netsh interface tcp set global rss=disabled

    netsh interface tcp set global autotuninglevel=disabled

    Reboot the server

    http://support.microsoft.com/kb/951037

     

    The server uses Broadcom NICs for iSCSI.

    Friday, May 27, 2011 8:53 AM
  • Hi,

    Have you managed to solve the issue you are having?

    I am getting exactly the same issue on all of our 2008R2 /Win 7 vm's

    I do not get any problems with 2003 servers though.

    I thought it was to do with the Virtual NIC that it was assigned - all of the 2003 boxes have "Flexible" as the network adapter type in vSphere - the 2008 R2 and Win 7 vm's have E1000 as their network adapter?

    Simon

    Tuesday, July 05, 2011 1:19 PM
  • Same problem here on W7 and W2k8R2...

    Is there a fix out?


    Thanks
    Matias

    Wednesday, July 13, 2011 1:55 AM
  • I have the same thing happening where the copy takes all RAM and then stalls out.  Furthermore, if it is running on a server that is also running a process that takes a lot of RAM, such as SQL server, it will choke SQL out and cause it to stop responding as well.

    Here is the hitch:

    I can watch the RAM usage grow to 100% in the task manager if I copy the file to a clustered drive from Windows 2003 to Windows 2008 R2.  However, if I copy the file to a non-clustered drive (i.e. C:\), the cache usage goes up, but the RAM usage DOES NOT GROW and it behaves properly.

    If I copy the same file from a Microsoft Server 2008 R2 instead of 2003, the RAM usage will grow no matter which drive I copy to.

    So if I copy a file to my C:\ drive, why does the RAM meter deplete when copying from Windows 2008 R2, yet it does NOT deplete when copying from Windows 2003?  The cache shows the same behavior in both cases.

    Furthermore, once the RAM reaches a certain point, the network traffic will STOP and the RAM will slowly creep back down.  Then it will start again.  This repeats over and over.

    If I use XCOPY /J, the cache never rises, and neither does the RAM meter.  But this is not an acceptable workaround for two reasons.  One, it won't buffer anything.  Two, it doesn't solve the problem, as one of my new techs might copy a file to a production system by accident and take it down by accident in the middle of the day.

    By the way, I modified the DynCache source so it would run on Windows 2008 R2, but it has ZERO EFFECT.  The cache will continue to go way past what SetSystemFileCacheSize() is set to during the copy, and release the cache back to the correct size after the file copy has finished.

    This is becoming a serious problem for our production systems.  Time to contact Microsoft. 

    Thursday, July 14, 2011 9:31 PM
  • This issue is going to cause me some headaches in the next month or two.  We currently have a 2008 (non-R2) file cluster which requires the DynCache service to be running due to a few users who occasionally process large datafiles over the network.  Without DynCache we start seeing VSS failures due to insufficient memory which causes all sorts of issues including DPM sync failures, etc.  Not to mention the server performance grinds to a halt.  (two-node cluster w/ 16GB RAM in each node)

    We are planning to migrate to Storage Server 2008 R2 soon which I'm assuming would be affected by the same issue. 

     

    Friday, July 15, 2011 11:04 PM
  • After sending the following to Tiger Li my account seems to be OK, so I
    am sending you an idea about the above-mentioned problem too.
    In my case I have to copy a directory tree with a large number of files (more than a million small files) from a PC on the network to Server 2008 R2.
    The transfer becomes slower and slower, and after a while all the RAM is used and the system crashes.
    The same happens with bigger files (about 500 files of 1-2 GB each).
    In short: I changed the hard drive in the server from a new 'Advanced Format' drive to an older drive with standard format, and all problems are gone.
    Maybe this helps in finding a solution.
    Regards
    Saturday, July 23, 2011 12:16 AM
  • hey,

    I dunno if I understand your scenario, but if I got you right, all you need to do is configure a new disk in the guest VM as an IDE drive and not as SCSI, and it will work perfectly.

    I dunno why, but I'm having the same problem when the disk is configured as SCSI on Hyper-V VMs.

     

    let me know if it worked for you.

     

    cheers
    Tuesday, August 02, 2011 3:00 PM
  • I have the same problem on one single server. Just to be sure, symptoms are unlimited cache growth, even when reading large files. After the file is read, cache memory is freed, but if its size is more than available memory... everything hangs and all I can do is to kill the process (takes a few minutes to even show the taskmgr screen, everything is madly swapped to pagefile).
    Since the server's role is pretty unimportant, I can afford to test some unofficial programs and fixes.

    This is what worked for me - http://www.uwe-sieber.de/ntcacheset_e.html
    I didn't use old dyncache but suppose this program does the same. Takes two command-line parameters, min and max values of permitted cache size.

    At least after running it with 1 2048 parameters I do not see something like 'winlogon could not show you ctrl+alt+del window. you can try pressing reset to reboot the computer' anymore, and there is always 'free' memory in Task Manager's Performance tab.

    System - Win2008R2 SP1

    Tuesday, August 09, 2011 4:38 AM
  • Just reset from SCSI to IDE on the VM settings and it made no difference.

     

    Increased memory for the VM from 2048MB -> 6144MB. This allows bigger files to be copied but after the 6GB is filled, the file copy slows to a crawl. So, copying any file that is larger than the VM's memory allocation is impossible.

     

    Can anyone from MSFT comment and provide a fix. This is clearly a repeatable problem and seems to be affecting many people.

    Saturday, August 13, 2011 5:05 PM
  • After all this time, I finally "solved" the problem I was having. It seems like in this posting there are possibly multiple issues being discussed, because it's not clear that everyone's situation is the same as mine, so keep that in mind when you read what I changed to make the problem go away in our environment.

    In summary, it all came down to the write-caching policy on the RAID controller. We are dealing with Dell servers and Dell controllers, and I have reproduced the issue on both embedded/internal RAID controllers as well as their external RAID controllers for direct-attached storage arrays.

    When the RAID5 virtual disk on the controller is set to write-through, the copy performance issue exists. When the virtual disk is switched to write-back, the problem disappears. By default we always use write-back caching, but when the RAID battery fails (or while the battery is charging), the controller automatically switches the virtual disk back to write-through until the battery is replaced or charged. You can select to "force write-back" when the battery is dead, but the consequence is possible data loss if the server were to crash or lose power during a write operation.

    Interestingly, the performance difference of write-through vs write-back caching often seems negligible. However, for certain operations, including large file copies across a network, there are clearly issues. I have also not been able to reproduce the problem when copying files on the same server from one array to another. It only seems to exist when copying across a network where the destination drive has a write-through caching policy.

    I hope this info helps some other people with similar problems. I can't believe that after all this time and so much testing that it all came down to a single setting. It's disappointing to discover that the write-caching policy can have such a large impact for some operations and a nearly non-existent impact for others.

    • Marked as answer by DougZuck Thursday, August 18, 2011 5:54 PM
    Thursday, August 18, 2011 5:29 PM
  • I'm not 100% sure whether my problem is exactly the same as yours since this is a long thread, but your solution does not work for me. RAID controller write-caching policy seems to have no effect either way.

    The specifics of my problem are:

    Windows 7 x64 Ultimate with SP1: All latest updates as at time of posting. Copying a 72GB file from RAID10 array (4x WD Velociraptors). RAID controller is on-board Intel C600 series. Copying to (in the same physical machine) a Seagate Barracuda 2TB green drive. When I start the copy I get ridiculously high transfer rates, but I see that it's not really copying that is happening since my cached RAM just goes up until the "Free" RAM reports 0. At that point the file copy slows right down (sitting on 20 MB/s as I speak). I know that even the 2TB green drive can do a lot better than 20MB/s; I'm sure network to network copy with the 2TB green drives involved was closer to 100MB/s. I might as well be trying to copy this to a USB 2.0 external drive, it's going so slow. I'd expect with no network involved there should be no bottleneck, but clearly Win 7 has managed to implement one.

    For now I am just going to live with this (40 minutes left to go, with 36.6GB to go) and hopefully all that cached memory will be freed at the end, but this seems like a major problem in Windows file copy. :-(

    You say that your problem is copying over a network, and that you cannot reproduce from array to array. My case is a little different; no network, but I am trying to copy from fast array to single drive. I wonder if the slower destination is the thing that causes the issue? i.e., if my destination was also of similar speed to my source, would I have this caching / slow down issue? Either way, it would be easier to take if the performance of the copy wasn't 25% of what I know the slowest drive can support.


    Wayne H

    Saturday, March 31, 2012 6:20 AM
  • I agree that it doesn't sound like exactly the same issue, but it certainly does sound similar.  In my case, it was the caching policy set on the destination/target array that made a difference.  It had nothing to do with the source.  In your case, you're saying that you have an on-board Intel C600 RAID controller, but I take it to mean that your source drive (the 4x WD Velociraptor RAID10) is using the Intel C600?  What kind of controller is handling the destination/target Seagate drive?  Maybe see if there's a controller update available or see if there are any controller settings to tweak on the Seagate drive's controller?  You should also do a test to confirm what you said about the network copy to the Seagate being faster, just to be sure it only exists when copying internal to internal.  I hear ya though that it's totally frustrating when things like this just seem to not make much sense.  I don't have any great suggestions beyond what I've just said. 
    Saturday, March 31, 2012 3:35 PM
  • I was searching for my problem using google and the very first result was this page. My problem is very similar to the one described by DougZuck.

    When I copy huge files (4 GB or more) over the network the transfer begins very fast (38 MB/s) and then after a few seconds it slows down to a pathetic level (less than 1 MB/s) and makes the computer unresponsive.

    Server 1 (source) is a Dell PowerEdge 840 running Windows 2008 R2 Standard SP1

    Server 2 (destination) is an HP Microserver running Windows 2003 R2 Standard SP2

    The problem was caused because the disk write cache was disabled on the destination server (HP Microserver).

    Simply enabling the disk write cache using the windows device manager solves the problem! Now the transfers are always fast at 38 MB/s and sustained, with no slow downs.

    Thank you very much, DougZuck, for giving me the solution!!

    Monday, April 02, 2012 5:34 AM
  • Right, the source (RAID10 Velociraptor array) is under the C600 series RAID controller as is the single 2TB destination drive. At the time I hit this problem that destination was a single 2TB non-RAID drive, but it is just in the final stages of mirroring to become a RAID1 array under the same controller. I'm not sure if having multiple arrays under the same controller is a good idea or not, but since this is just a home desktop system, I don't have much choice!

    Last night I did have to copy some data from an external SATA drive (eSATA) to the single 2TB drive and it was also slower than I expected (avg 25MB/s) so maybe that speed is what I should expect (although it seems slow for a 7200rpm drive). As mentioned above, in my old system I was able to get 100MB/s+ copying to the same 7200 rpm drive, although that was in a different system and under a different RAID controller (and running as RAID1). I'm not sure why it would be so slow in this new system. In fact the weirdest thing about it is that it seems random access (lots of small files) is just as quick as very large files, which feels suspect to say the least!

    Anyway, I don't usually have a huge use case for copying these large files around; this was more or less part of staging my new system, so I'll probably forget about this issue now, until the next time I have to recover data or stage a new system!

    Thanks for taking the time to reply.


    Wayne H

    Tuesday, April 03, 2012 10:00 AM
  • I was also experiencing this issue on a Dell R720 running Windows 2008 R2 with only the Hyper-V role installed.  The server has 32 GB of RAM and during large file transfers (tens of thousands of files using about 3TB of storage in total) all of the Free Memory would become Standby Memory.  Once the Free Memory Pool had dropped to zero, the server would become sluggish and the copy would fail.

    Using the registry modification suggested by InformationOverload fixed the issue for me.

    HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management
    LargeSystemCache=0    (DWord decimal)

    After modifying this key and rebooting, the Free Memory Pool would never drop below 5 GB during file copy operations and seemed to be trying to stay around 6 GB (18% of total RAM).

    Wednesday, May 16, 2012 3:29 PM
  • So, it seems that Hyper-v Role (or maybe file server role) installation disables write-cache on the hard drive. In my case all those nightmares disappeared once I re-enabled write caching on the devices from the Policies of the Harddisk device property. At least temporarily as it is risky in case of power loss.

    Joseph Saad - SharePoint 2010 MCITP, SharePoint 2010 MCPD, MCSE, CCIE RS 20243

    • Proposed as answer by PLT11 Friday, December 21, 2012 2:36 PM
    Friday, June 15, 2012 6:20 PM
  • You are correct. I have faced similar issues with one of my W2k3 servers: while copying (writing) huge amounts of data, disk performance went below 2 Mbps because of a battery issue that left the cache unusable. Replacing the battery resolved this.

    • Proposed as answer by G. Hanna Tuesday, October 02, 2012 12:53 AM
    • Unproposed as answer by G. Hanna Tuesday, October 02, 2012 12:54 AM
    Friday, June 22, 2012 4:32 PM
  • Synackz, I am experiencing the same issue with Server 2012 and Hyper-V.  

    I tried your LargeSystemCache=0 solution.  The transfer speed of large file copies dropped dramatically so either I'm missing something or this will not solve my problem.

    gbs

    Thursday, January 10, 2013 2:10 PM
  • Hello Gregg,
    If you cancel the copy action, does your Server 2012 freeze, or take like 10 minutes before Explorer (CSVs) responds again?

    I'm copying a large file (1.8Tb) from a volume1 to volume2, performance is very bad 8Mb/s but what is worse, when I cancel the copy my explorer freezes and the second time my cluster service crashed. I'm going to try this autotuning=disabled option and see what happens.

    (Server 2012 3-node cluster, fully patched, with Hyper-V using Cluster Shared Volumes)

    Greetz,
    Peter

    Friday, January 11, 2013 3:11 PM


  • It seems that starting with Vista/2008 there has not been a method of restricting the I/O Buffer used during a copy process if the system was running another process that required constant disk access (file server shares / service, Hyper-V, etc.). The larger the I/O Buffer grows, the more RAM it uses. DynCache was released when this problem was first discovered to restrict how much memory the I/O Buffer can use; however, with 7/R2 it is no longer effective.

    Interestingly, this has NOT been repaired in 2k12 either. Although Robocopy has more features in 2012 that might be able to assist with this.

    I was able to resolve this on several of my systems by changing the write cache policy as well. However, I have 4 or 5 servers that are using controllers that don't allow this. Additionally, it is advisable to keep disk caching disabled on some servers (Hyper-V hosts, file servers, etc.) to avoid data corruption. 

    Although I don't know exactly why enabling the write cache works, I would assume that since it is caching it on the controller, it comes across the wire already "buffered" so the OS doesn't need to. 

    Obviously there is no real solution here. It is a bug with the OS and MS needs to fix it with a patch. However, for systems that won't allow me to, or where I don't wish to, enable the write cache on the controller, I have found that I can still perform those copies. I do so by using any copy program that doesn't use the OS cache. FastCopy and TeraCopy both come to mind. 

    You are able to configure TeraCopy to use the OS I/O Buffer if you so desire, but obviously that wouldn't be helpful. TeraCopy does not allow you to change the size of its I/O Buffer. However, since it is not using the one built into the OS, and it has its own restriction on how much RAM its I/O Buffer can utilize, I was able to copy 2-400GB files without raising my RAM usage more than 1GB.

    FastCopy is more of a barebones program and doesn't have as many features as TeraCopy does (file queues, checksums, etc.). However, it also uses its own buffer, thereby bypassing this issue. It also allows you to customize the buffer (128MB to 1024MB).
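
For anyone who wants to experiment with the general idea in a script, here is a minimal Python sketch. It is not how FastCopy or TeraCopy are implemented, and it does not do true unbuffered I/O the way xcopy /J does. It only approximates the effect by reading a fixed-size chunk and forcing each chunk to disk with os.fsync before reading the next, so the amount of unwritten (dirty) data stays bounded. The function name and the 64MB default are assumptions chosen for illustration, and whether this avoids the slowdown discussed in this thread has not been verified here.

```python
import os


def copy_with_bounded_buffer(src, dst, chunk_bytes=64 * 1024 * 1024):
    """Copy src to dst one chunk at a time, syncing each chunk to disk.

    Returns the number of bytes copied. The OS may still cache the data
    that was read, but at most one chunk of written data is ever waiting
    to be flushed.
    """
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_bytes)
            if not chunk:
                break
            fout.write(chunk)
            fout.flush()              # push Python's buffer to the OS
            os.fsync(fout.fileno())   # ask the OS to write it to disk now
            copied += len(chunk)
    return copied
```

Syncing after every chunk trades peak throughput for predictable memory use, so a larger chunk size will usually copy faster at the cost of more data in flight.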




    ---------------------------------------- The key to success is many failures.


    • Edited by Avi Green Wednesday, January 23, 2013 5:22 PM 2012
    • Proposed as answer by Avi Green Wednesday, January 23, 2013 5:26 PM
    Wednesday, January 23, 2013 5:17 PM
  • I just have to add to this since I seem to be running a very different scenario than anyone else here. I too have this issue, but ONLY with SQL backup files once they reach roughly 50GB in size.

    I have a machine that was recently replaced by some nice new hardware. This machine previously ran almost our entire business aside from our SQL DBs. File server, print server, AD DC, two Hyper-V VMs running Exchange and TFS. Machine had 3-1TB drives in it in RAID5 setup using crappy embedded Intel RAID controller. I also had a standalone drive specifically for some backups. Those backups would then be pushed to the azure cloud. I would run robocopy on a nightly basis to copy local files to the backup disk, and that wouldn't be a problem. The second I decided to pull the DB backups from the DB machine to this machine, all hell would break loose. IO would eventually lock up to the point where I couldn't do anything on the VMs... Needless to say, I just stopped doing backups over the network. I left them on the DB machine, and that was good enough. I figured I'd try again once we got our new server.

    Now, with a brand new server in place, I have taken this old machine and demoted it to be a backup server. Turned off RAID, and am using only standalone disks in AHCI configuration. A 500GB OS drive, a 1.5TB drive for the backups, then 2-1TB drives for mirroring the 1.5TB, essentially creating my own mirroring process using robocopy. These backups then get pushed to azure cloud on a regular basis. This backup server has 8GB of ram in it. Our large DB backup file can be anywhere from 25GB to 90GB depending on the day of the week, and how many differential backups have been done, since we do multiples per day (up to 3). Keep in mind, I also did a fresh install of Windows server 2008 R2 with literally NOTHING on it. No other roles, no software... just windows itself. I installed all updates I could before turning it into the official backup server.

    Watching this robocopy process has been interesting. The system takes this memory and puts it into a modified state, meaning that it must be written to disk before it will release that memory for other use. robocopy locks up, the task engine locks up, the cmd.exe process locks up. Try to end them, and it just doesn't work. The results in this new machine setup are identical to what they were before I eliminated RAID. It made no difference at all.

    Going with the write-caching theory, I decided to turn off write caching on the 1.5TB drive, just to see what would happen. The result was identical, it just happened sooner. Instead of the server rapidly copying large amounts of data, it started copying this one file (53.8g)... never saw any percentages. It just locked up and is hanging. I killed the processes associated, like 30 minutes ago, and it still hasn't killed itself. If I try to restart the server, it will hang on shutdown. I must forcefully power off the server and restart. robocopy.exe is using roughly 25% of overall processing (quad core proc with hyper threading)... not flatline, it's ALL OVER THE PLACE. However, there is nearly zero disk IO, no network activity, and no memory activity since there is no write caching.

    However, and here's the kicker, use robocopy from the DB server, to push to the backup server, and it fails at 68.3% (roughly the same spot as where the backup server eventually stops with write caching enabled): "ERROR 665 (0x00000299) Copying file... The requested operation could not be completed due to a file system limitation". What file system limitation? WTF does that even mean? 

    It's extremely frustrating that I've been using robocopy for a very long time, and it's starting to become an unreliable tool for doing backups. I feel like I need to buy backup software or start using windows backup. That would be stupid.
    Thursday, January 24, 2013 4:27 PM
  • I have heard from Rahul Sharma at MS Support that http://support.microsoft.com/kb/2564236/EN-US?wa=wsignin1.0 will fix it. I am trying it tonight and will post the result.

    ---------------------------------------- The key to success is many failures.

    Tuesday, January 29, 2013 6:25 PM
  • It seems as if this is not quite the issue, as we're not talking about a read problem, but rather a write problem. Did you have any luck with this hot fix?
    • Edited by UNWebman Wednesday, January 30, 2013 6:11 PM so many spaces
    Wednesday, January 30, 2013 3:45 PM
  • This issue has been around since Windows 2000. I found it doing large SQL Server backups over the network.  By default Windows uses as much RAM as it can to cache an incoming file (minus the kernel limit set by Microsoft). The memory is not released until the EOF is reached.  So if the file is bigger than all the RAM, the cache process will use virtual memory until that is gone as well.  If the copy is using the same controller, then you get contention between the incoming file I/O and virtual memory I/O, which makes the issue worse and can crash some servers.   I worked with Microsoft on this issue for months and I was finally told they “might fix” the issue in Windows 2008; guess not.  Anyway, if you add a RAID controller with onboard cache (512 MB or more), Windows disables its incoming cache process in favor of the RAID card’s cache and the issue should clear up.  I finally added a Linux box to our network to dump our SQL files via Samba to fix the issue without the RAID upgrade.
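
The behaviour described above can be pictured with a toy model: data arrives from the network faster than the disk can absorb it, and the difference piles up in the cache until available memory runs out. The Python sketch below is only that simple model, not a description of how the Windows cache manager actually works, and the rates in the example are hypothetical numbers picked for illustration.

```python
def seconds_until_cache_full(available_mb, incoming_mb_per_s, disk_mb_per_s):
    """Seconds until available memory is consumed by unwritten data.

    Returns None when the disk keeps up with the network, in which case
    the cache never fills under this model.
    """
    growth = incoming_mb_per_s - disk_mb_per_s
    if growth <= 0:
        return None
    return available_mb / growth


if __name__ == "__main__":
    # Hypothetical: 16000 MB available, 110 MB/s arriving, disk writing 30 MB/s.
    t = seconds_until_cache_full(16000, 110, 30)
    print(f"cache full after {t:.0f} s, {110 * t / 1000:.0f} GB received so far")
```

In this model the slowdown point depends on the gap between network and disk speed, which would fit the reports in this thread that a faster destination (for example, write-back caching enabled) makes the problem go away.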

    • Proposed as answer by jbrooks702 Tuesday, February 26, 2013 3:38 PM
    Tuesday, February 26, 2013 3:34 PM
  • It seems to me that DougZuck's solution of using Linux is the only real solution to writing large files to CIFS / SMB / Windows file shares.

    I've experienced this issue whenever I have used Windows XP / Vista / 7 or Windows Server 2003 / 2008 / 2008 R2 as the location to copy the files to over the file share.  To reproduce the problem, simply:

    • obtain a file that is at least 2 times the size of the physical RAM in the destination server (typical of backup files and backup servers)
    • have a faster source HDD than destination (typical of production server -> backup server HDD speeds)
    • have a slower destination HDD than network link speed (again typical of 1 Gbps networks and backup server HDDs)
    • drag and drop the file to the destination file share

    None of these are difficult to obtain, and occur frequently.  I haven't been able to test Server 2012 with this issue yet, as our current 2012 servers are too fast compared to the source servers and network we have available.
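
    As a rough illustration of why those conditions matter, here is a minimal sketch in Python. It assumes a deliberately simple model that is not from this thread's measurements: the cache fills at the difference between the rate data arrives over the network and the rate the destination disk can write it out. The function name and the example figures are hypothetical.

```python
def seconds_until_cache_full(available_ram_gb, network_mb_s, disk_mb_s):
    """Rough time until the write cache fills, or None if the disk keeps up.

    Assumed model: cached data grows at (network rate - disk rate).
    """
    surplus = network_mb_s - disk_mb_s  # MB/s arriving faster than the disk writes
    if surplus <= 0:
        return None  # the destination disk is at least as fast as the link
    return available_ram_gb * 1024 / surplus


# Hypothetical figures: 16 GB of free RAM, ~110 MB/s arriving, disk sustaining 60 MB/s.
estimate = seconds_until_cache_full(16, 110, 60)
print(f"cache would fill after roughly {estimate:.0f} seconds")
```

    Under this model a destination disk that keeps up with the link never fills the cache, which fits the note above about servers that are too fast to reproduce the problem.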

    Friday, March 15, 2013 1:25 AM
  • I just ran into this problem myself copying a rather large VMDK file.

    Thinking outside of the box a bit for a workaround, has anyone tried chunking (dividing into smaller files), performing the copy, then unchunking files at the destination?
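
    For anyone who wants to try it, the chunking idea could be sketched like this in Python. The helper names `split_file` and `join_files` and the `.partNNNN` naming are made up for illustration, and whether copying smaller pieces actually avoids the slowdown is untested here.

```python
import os
import shutil


def split_file(path, chunk_size):
    """Split path into numbered part files and return their paths in order.

    Each chunk is read into memory in one go, so keep chunk_size modest.
    """
    parts = []
    index = 0
    with open(path, "rb") as src:
        while True:
            data = src.read(chunk_size)
            if not data:
                break
            part = f"{path}.part{index:04d}"  # hypothetical naming scheme
            with open(part, "wb") as dst:
                dst.write(data)
            parts.append(part)
            index += 1
    return parts


def join_files(parts, out_path):
    """Concatenate the part files, in the given order, into out_path."""
    with open(out_path, "wb") as dst:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, dst)
    return os.path.getsize(out_path)
```

    The parts would be copied across one at a time and `join_files` run on the destination. Note that this needs roughly twice the file's size in free disk space on both ends while the parts exist.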

    Microsoft - come on, if you can't fix the underlying problem, at least build a workaround (like this) into your copy commands - make it transparent to the user, that way at least your mess is hidden and people don't resort to using SAMBA file-servers.  How embarrassing...

    Saturday, April 06, 2013 7:44 PM
  • Up to now I only know TWO workarounds for that "caching beyond reason" issue:

    xcopy /J

    and http://files.first-world.info/supercopier/2.2/  - do NOT use the newer versions, they use buffered write again. I recommend setting the buffer to 262144 or 1048576.

    That issue is still the same with Server 2012... *sigh*.
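
    A related idea, sketched in Python: copy in fixed-size blocks and force the written data to disk at intervals, so that unwritten data cannot pile up in the cache without limit. This is not the same thing as the unbuffered I/O that xcopy /J uses; it is only a portable approximation. The function name and the 64 MB flush interval are assumptions for illustration, the 1048576-byte block size is the one recommended above, and how `os.fsync` behaves when the destination is a network share may vary.

```python
import os


def copy_with_periodic_flush(src_path, dst_path,
                             buffer_size=1048576,
                             flush_every=64 * 1048576):
    """Copy in buffer_size blocks, syncing to disk every flush_every bytes."""
    written_since_flush = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(buffer_size)
            if not block:
                break
            dst.write(block)
            written_since_flush += len(block)
            if written_since_flush >= flush_every:
                dst.flush()             # push Python's buffer to the OS
                os.fsync(dst.fileno())  # ask the OS to write it to disk
                written_since_flush = 0
        # Final sync so the tail of the file is on disk before returning.
        dst.flush()
        os.fsync(dst.fileno())
    return os.path.getsize(dst_path)
```

    Each sync stalls the copy until the disk has caught up, so a smaller `flush_every` trades throughput for a tighter bound on cached data.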

    Wednesday, May 08, 2013 12:02 PM
  • Server 2012 still has the issue.

    In my case, using Server 2012 as a VM host, I tried copying a large file from an iSCSI target to an external USB drive for archiving. It would take all available memory and cause the server and VMs to respond very slowly. I transferred the file by manually pausing the copy every time memory was about full, letting it flush, and continuing.


    KTSaved

    Thursday, August 01, 2013 11:26 PM
  • Well, after all this time you just solved my 2 weeks of working way too hard at this. The fancy Adaptec 5805 controller was choking on files of 256 MB or bigger in both Windows 7 and openSUSE 12.2 as VMs. I tried all kinds of drives, stripes and mirrors, and in the end, turning on the write-back cache (without a battery backup on the controller) did the trick! I was doing this all under VMware ESXi 5.1, with the latest driver and firmware as well.

    However, under straight openSUSE 12.2 I had no write degradation on any file transfers without write-back cache enabled - interesting.

    One other note, SSD drives did not give me any issues under ESXi and openSUSE.

    Thanks for finally helping me put this one to rest.

    Good Night.


    always interested in learning something

    Tuesday, August 13, 2013 3:17 AM
  • Does anyone have any suggestions for resolving this in a virtualized environment with a SAN?  Since the RAID card of the hosts does not come into play, adjusting the write back cache will not apply.

    I tried all of the other options everyone has contributed (Receive side scaling, Large System Cache, Chimney offload, etc) without any changes.

    I've spent about 40 hours on this.  My solution is Dell from top to bottom (R620 hosts, EqualLogic SAN, PC6224 switches) and after extensive work with them they are simply saying the performance drop is expected.

    Thanks

    Friday, November 08, 2013 9:15 PM
  • I question why you say that adjusting the cache policy doesn't apply in a virtualized environment.  To the contrary, I think it *does* apply since the hardware layer with the RAID configuration is below the virtual layer where the disk is presented.  It's certainly possible that in your scenario adjusting the cache policy on the controller won't solve the problem, but I think it's something you should definitely try.  You might be surprised at the results.  Good luck.
    Friday, November 08, 2013 9:50 PM
  • Hi Doug,

    Very tired so I probably wasn't clear, but I was referring to the cache policy of the RAID card on the host servers.  This is a VMware (ESXi 5.1) environment with an EqualLogic SAN.  I believe the thinking is that the RAID cards on the ESXi hosts don't even come into play since I'm not using any local storage.  Quite a few techs at Dell felt strongly about it, but I'm happy to hear otherwise if you think that is false.

    I've tried everything else to fix this, so I certainly would give it a shot if I could, but the RAID card in my hosts is a H310 that doesn't have a cache or battery. 

    Friday, November 08, 2013 9:57 PM
  • Is the issue happening on a domain controller? If it is, I would recommend against enabling the write cache if you do not have batteries. But I would question why you are trying to copy to/from a domain controller.

    Joseph Saad - SharePoint 2010 Microsoft Certified Master, SharePoint Microsoft Certified Solutions Master, MCSE, CCIE RS 20243, PMP, MBA.


    • Edited by JosephSaad Saturday, November 09, 2013 12:44 AM Signature update
    Friday, November 08, 2013 10:00 PM
  • Agreed.  The cache on the controllers would only be a factor for disks that are connected to those controllers.  I didn't realize that's what you were saying in your previous posting.  :)

    Friday, November 08, 2013 10:09 PM
  • There are other scenarios where enabling write cache is not an option: for example, a shared disk in a cluster. The lesson learnt is: do not copy large files when write cache cannot or should not be enabled.

    That is tough for home labs or single-host environments where the server hosts multiple services and at least one of them has to be a domain controller.


    Joseph Saad - SharePoint 2010 MCITP, SharePoint 2010 MCPD, MCSE, CCIE RS 20243


    • Edited by JosephSaad Saturday, November 09, 2013 12:54 AM grammar correction.
    Saturday, November 09, 2013 12:50 AM
  • Another alternative solution.  I had this issue on Server 2008 R2 VMs (running on VMware 5.1).  I tried every fix in this thread (except for the write caching since it wasn't applicable in my host/SAN environment) and unfortunately none of it worked.  I believe it has been resolved now, however.

    It turned out to be an issue with the Windows Dynamic Cache.  You need to contact MS and request the hotfix for Server 2008 R2, as the download available on the web doesn't support 2008 R2.

    http://support.microsoft.com/kb/976618/en-us

    The hotfix essentially has you create a service and set some parameters in the registry.  Once I got it running, the network transfer speed drop offs stopped happening.

    Monday, November 11, 2013 10:09 PM
  • Another alternative solution.

    http://support.microsoft.com/kb/976618/en-us

    The hotfix essentially has you create a service and set some parameters in the registry.  Once I got it running, the network transfer speed drop offs stopped happening.

    THANK YOU for pointing to that article. Finally something I can bash MS with. I hope they will be forced to do the same for Server 2012 where I have the same issue (though not as often).

    Has anyone had the chance to test whether 2012 R2 still has the issue, or do they finally have it under control?

    Monday, November 11, 2013 10:36 PM
  • Just struck this problem on Server 2012. Just to reiterate, the write caching setting in the Device Manager policies tab for the destination drive was unticked. This makes sense, I suppose, as the default, to avoid data loss on power failures. Turning write caching on resolved the issue.

    So it's "by design", not an issue with the OS itself.


    G H

    P.S. When researching this issue, I found that if a server is promoted to be a domain controller (AD DS role), the write caching setting is reset on every reboot.

    http://blog.sharepointsaigon.com/2013/02/how-to-enable-write-caching-on-domain.html

    https://communities.intel.com/thread/44184


    • Edited by greenhart Friday, November 22, 2013 2:27 AM
    Thursday, November 21, 2013 10:35 PM
  • Greenhart,

    Write cache being disabled by default is "by design" in some instances (for instance, on DCs), and yes, I would agree that it alone is not an issue with the OS. However, the greater issue of poor network performance during large file copies is an issue with the OS itself.  The write cache change is just a workaround, and a poor one at that given the potential for corrupt data.  The Dynamic Cache service is the best way to address it, but that too has its drawbacks which, depending on the workload, could be just as harmful as changing write cache settings.

    All of that said, I have come across yet another cause of this same problem (poor network performance on file transfers after the RAM cache is used up) with another solution.  In short, if you are using VMware, don't provision disks with VMware's "Lazy Zero" option.  If you do, you will run into the exact performance problems listed here.

    You can check if your disks were provisioned with Lazy Zero by running this command on your host:

    vmkfstools -D

    There is a guide to changing them over from Lazy Zero to Eager Zero here:

    http://ryanmangansitblog.wordpress.com/2013/03/19/converting-a-lazy-zeroed-disk-to-eagerly/

     
    Friday, November 22, 2013 2:14 PM
  • Hi everybody, I had exactly the same issue. We have a Dell PowerEdge R710 with a PERC H700 Integrated and 132 GB RAM, running ESXi 5.1, with 6x1TB SAS drives organized in RAID 5 as 1 logical drive. We have 6 VMs, one of which is a Windows Server 2008 R2 file server. When I tried to copy big files (2 GB, for example), the transfer started normally at 26-27Mb/s, but then dropped to 3-5 Mb/sec and finally raised an error. I checked the RAID settings and found that the write policy was Write-back. I changed it to Write-through and this helped a little: the copy no longer crashed, but it was still slow (note that I have 1 Gb network connectivity on both the server and PC side), while at the same time I could copy the same file to a different VM on the same host at 100 Mb/sec. The cause on the file server was compression on the NTFS volume :) When I disabled compression for one folder, it started working well at ~80-90 Mb/sec.
    Tuesday, January 14, 2014 9:19 AM
  • Thanks this worked for me.
    Tuesday, January 28, 2014 9:29 PM