DPM 2016 MBS Performance downward spiral

    Question

  • Hi Guys!
    I'm so pissed; really, I'm at a point where I would like to just uninstall and go home.
    Why? Since we upgraded to DPM 2016 and switched to MBS we have had massive problems.
    The System
    We have 3 independent locations.
    At every location there is one physical machine with Windows Server 2016 and DPM 2016 UR4.
    At 2 locations we have RAID 6 storage attached via Fibre Channel.

    At 1 location we use a Buffalo NAS attached via iSCSI over a 1 Gigabit connection.
    At all 3 locations we also use tapes.

    The Problem:
    On all 3 locations the performance is gradually going down. If something took 15 minutes at the start, it took 1 hour and 50 minutes a few weeks later.
    The performance goes down until every single backup takes so long that newly scheduled backups pile up in the queue.
    So, after 2 to 3 months my backups basically stop working.

    The current "Workaround":

    1. Take the Storage offline.
    2. Violently delete it from DPM.
    3. Take the storage online and reattach it to DPM.
    4. DPM now formats it with ReFS.
    5. DPM Shell: DpmSync.exe -ReallocateReplica
    6. Get-DPMProtectionGroup -DPMServerName scdpm | Get-DPMDatasource | Start-DPMDatasourceConsistencyCheck
    7. Wait until everything is finished.

    Now backups will work fine for a few weeks, then they work for a few weeks, then they work, then they kind of work, then you start to pull your hair out and then I start at 0 again.

    At the moment it's round number 6.


    I also had a very long conversation with the storage manufacturer and I'm sure: it's not the storage.
    The problem is ReFS and MBS.

    I'm not the first one with this problem, but please: is there anyone who has found a real solution for this? Yes? Can you explain it to me as simply as possible? Not because I would not understand, but I'm already at a point where I need easy words to be sure there is absolutely nothing I am doing wrong. ^^
    Oh, and it's not the Windows Defender / antivirus problem.


    Thanks in advance. I will also provide a screenshot where I write to tape from the storage. The first 3 entries are from before my last rebuild and the last one is from today. I rebuilt yesterday.
    I also did not wait as long as usual, so if you think 1:53 is not so much longer than 00:19… if I waited longer it would be 5+ hours and then 8+ hours and then …

    https://imgur.com/ejf7bJm










    • Edited by Intirius Thursday, 25 January 2018 12:50
    Thursday, 25 January 2018 12:42

All replies

  • You are not alone. The problem in this case is mostly ReFS, the file system behind MBS. DPM is doing things with ReFS that fill up the server's RAM and also cause slower backups.

    Things that helped, but didn't solve the problem:

    1. Install the latest Cumulative Update for Windows Server 2016. Somewhere in there is an updated ReFS driver that helps a bit with the memory consumption. See: https://support.microsoft.com/en-us/help/4016173/fix-heavy-memory-usage-in-refs-on-windows-server-2016-and-windows-10

    2. There is also this KB about the server becoming unresponsive: https://support.microsoft.com/en-us/help/4035951/refs-volume-using-dpm-becomes-unresponsive-on-windows-server-2016

    You can also play around with the registry settings in the two KBs. With these articles I was able to get our server stable for the most part. Some backups are still super slow, but at least it doesn't completely freeze anymore.

    There is also a lengthy discussion in the Veeam Forums about this: https://forums.veeam.com/veeam-backup-replication-f2/refs-4k-horror-story-t40629-780.html

    According to the Veeam forums another update to the ReFS problems will be included in the February 2018 CU for Windows Server 2016. Let's hope this fixes this for good. I am tired of seeing a backup of a 6TB fileserver running for almost 22 hours with one third of the backups never completing.

    Thursday, 25 January 2018 14:13
  • Hi,

    there is a known bug with ReFS on Server 2016; it is not DPM related.

    We have tested a private fix, and the results have been very positive. We heard from Microsoft that the RTM version of this fix should be released at the end of February.

    So the only thing you can do is keep waiting, sorry.


    Michael Seidl (MVP)

    SYSCTR Senior Consultant, Blogger, CEO

    Blog | Twitter | Facebook | LinkedIn | Xing | Youtube

    Note: Posts are provided "AS IS" without warranty of any kind, either expressed or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose.

    Friday, 26 January 2018 09:06
  • Another thing that helped, at least in the short term, is to put more RAM into your DPM server and reboot it regularly. Not a permanent fix, but for us some backups run a lot faster, at least for a couple of days.
    Friday, 26 January 2018 12:39
  • Hey guys, Microsoft just released an optional patch that includes a new ReFS driver that reportedly improves performance: https://support.microsoft.com/en-us/help/4074590/windows-10-update-kb4074590

    I'm installing it now and testing it out on my server. I'm going to do the "workaround" to start from scratch after I install it.

    Thank the guys over at https://forums.veeam.com/veeam-backup-replication-f2/refs-4k-horror-story-t40629-810.html (I've been following that thread for a while).

    Friday, 23 February 2018 15:56
  • Brilliant - after playing around with those "tuneable parameters" I don't get "vhdmp" error messages anymore in System log but backup itself is as slow and buggy as before.

    Microsoft, fix this issue ASAP!

    Wednesday, 28 February 2018 12:48
  • The fix is here: https://support.microsoft.com/en-us/help/4077525

    Michael Seidl (MVP)

    Wednesday, 28 February 2018 15:04
  • Thanks @Michael - it seems I can't find that update within SCCM/WSUS right now?
    Has it been pulled (I read about some errors caused by that update)?

    Thursday, 1 March 2018 10:17
  • I have no info about that.

    Michael Seidl (MVP)

    Thursday, 1 March 2018 12:43
  • Is anyone here using LTO7 tape (SAS connection) and having slowness issues as well?  Would 400 GB per hour with compression be a tad slow, even from a SAS 12 Gbps RAID 6 array?

    The specs on our library state min 2.2 TB to 5.4 TB / hour though.
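
    To put those two figures on the same scale, here is a small sketch that converts them to MB/s. It only uses the numbers quoted in this post (400 GB/hour observed, 2.2 to 5.4 TB/hour from the library spec) and assumes decimal units (1 TB = 1000 GB, 1 GB = 1000 MB); the function name is just illustrative.

```python
def gb_per_hour_to_mb_per_s(gb_per_hour: float) -> float:
    """Convert GB/hour to MB/s, assuming decimal units (1 GB = 1000 MB)."""
    return gb_per_hour * 1000 / 3600


# Figures quoted in the post above.
observed = gb_per_hour_to_mb_per_s(400)    # 400 GB/hour with compression
spec_min = gb_per_hour_to_mb_per_s(2200)   # 2.2 TB/hour, lower end of the spec
spec_max = gb_per_hour_to_mb_per_s(5400)   # 5.4 TB/hour, upper end of the spec

print(f"observed: {observed:.0f} MB/s")
print(f"spec:     {spec_min:.0f} to {spec_max:.0f} MB/s")
print(f"observed is {observed / spec_min:.0%} of the spec minimum")
```

    By this arithmetic the observed rate is a bit under a fifth of the lowest figure in the spec sheet, so "a tad slow" seems like an understatement.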


    Tech, the Universe, Everything: http://tech-stew.com Just Plane Crazy http://flight-stew.com



    • Edited by techfun89 Thursday, 1 March 2018 16:37
    Thursday, 1 March 2018 16:36
  • It is an optional update.
    Thursday, 1 March 2018 21:03
  • Totally weird stuff again - none of my 2016 servers finds this update, neither on SCCM/WSUS nor directly via Windows Update (and yes, I check for optional updates as well). By manually downloading the update from the Microsoft Update Catalog one can clearly see that the update is only available for Windows 10 systems.

    Nonetheless, I manually installed the update on my 2016 DPM server, at least that worked flawlessly. Now let's see how DPM and MBS behave...

    Friday, 2 March 2018 08:40
  • Unfortunately, I still have problems with BMRs with MBS after this update.  After more than a year of waiting and many $$$ spent on Premier Support, it would be good to finally get these to work.
    Friday, 2 March 2018 16:35
  • Installed the update linked by Michael - absolutely no difference, backups are slow and unreliable. I know we're a bit old-school running our Disk2Disk storage on Synology NAS systems connected via iSCSI. But that worked flawlessly with DPM2012 - could have been faster (how ironic, compared to the situation now) but it simply worked.

    What I see now is DPM reading with around 10MB/s when doing Tape backups where it was reading with ten times the speed before on DPM2012. Write speed drops below 1MB/s - ridiculous.

    I played around a bit with the Synos, switching LUNs from block-level to file-level, using RAID10 instead of RAID5 and so on. No changes to be seen, absolutely no changes.

    I'm now gonna open a ticket with MS but from what I read in several other threads I might already know their answer...

    Monday, 5 March 2018 12:23
  • Good luck with that.  Maybe you'll get someone who actually cares about getting these issues resolved, but I doubt it.
    Monday, 5 March 2018 14:21
  • Have you also made changes to the registry as described in the KB article?

    Michael Seidl (MVP)

    Monday, 5 March 2018 19:38
  • Only tried the timeout registry value without success.  Anyone have luck with the others?  Don't really have time to diagnose all of them, esp. if it turns out to be a waste of time.  Some guidance from MS would be helpful here.  I expect DPM to work without all of this effort as this is not my job.

    Monday, 5 March 2018 21:24
  • Hi, we are having this issue on about 10 DPM servers in our environment. I've installed the February hotfix that is supposed to solve the performance issues with ReFS but I see no improvement whatsoever. I've configured one of those servers with the registry changes but no improvement either.

    Tape backups go below 1GB/min, where we should be able to achieve 10GB/min.

    However it looks like disk based backup and console responsiveness are better with the hotfix.

    Marc

    Tuesday, 6 March 2018 11:22
  • Hello guys,

    we have also had performance problems since we started using DPM 2016.
    Does anybody have a recommendation for the "tuneable ReFS settings"?

    The truth about MBS is: 3x slower backups with a (big) pinch of unreliability.

    regards
    Stefan

    Wednesday, 7 March 2018 09:13
  • Hi,

    here are the recommended REG Settings from MS Support Case.


    Michael Seidl (MVP)

    Wednesday, 7 March 2018 13:12
  • Michael:

    Did those recommended settings work?  I've had many recommendations for more than a year of various support cases from Microsoft and none have worked.  I'm highly skeptical about wasting more time.

    Thanks.


    • Edited by simdoc Wednesday, 7 March 2018 14:25
    Wednesday, 7 March 2018 14:25
  • At the moment we are not seeing any trouble, but the customer our support case relates to has not received the official fix yet, because we haven't had time until now.

    Michael Seidl (MVP)

    Wednesday, 7 March 2018 16:28
  • Hi,

    Good news, maybe.

    I'm not sure if you have had a fix, but I experienced the same issues: it worked fine for a while, but then backups started to hang, finally to the point where I had to delete the storage and re-create it. This put us at risk with our clients as we had agreements for 28 days retention. Logged it with Microsoft; ReFS is the issue.

    First suggestion: install KB4077525. It didn't make any difference.

    Second suggestion: install reg keys for optimal performance. These are:

    Windows Registry Editor Version 5.00

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem]
    "RefsEnableLargeWorkingSetTrim"=dword:00000001
    "RefsNumberOfChunksToTrim"=dword:00000032
    "RefsDisableCachedPins"=dword:00000001
    "RefsProcessedDeleteQueueEntryCountThreshold"=dword:00002048
    "RefsEnableInlineTrim"=dword:00000001

    Windows Registry Editor Version 5.00

    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Data Protection Manager\Configuration\DiskStorage]
    "DuplicateExtentBatchSizeinMB"=dword:00000100

    Windows Registry Editor Version 5.00

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk]
    "TimeOutValue"=dword:00000120

    This did improve the backup speed, but still not enough: backups went from finishing at 09:00 to finishing at 07:00.
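
    One thing worth checking before importing keys like these: `dword:` values in a .reg file are written in hexadecimal, so `dword:00002048` is 8264 in decimal, not 2048. The post does not say whether the values were meant as hex or decimal, so the sketch below only decodes what is literally written above; the variable and function names are illustrative.

```python
import re

# dword lines copied verbatim from the .reg snippets in this post.
REG_TEXT = '''
"RefsEnableLargeWorkingSetTrim"=dword:00000001
"RefsNumberOfChunksToTrim"=dword:00000032
"RefsDisableCachedPins"=dword:00000001
"RefsProcessedDeleteQueueEntryCountThreshold"=dword:00002048
"RefsEnableInlineTrim"=dword:00000001
"DuplicateExtentBatchSizeinMB"=dword:00000100
"TimeOutValue"=dword:00000120
'''

# A .reg dword is exactly eight hexadecimal digits.
DWORD_LINE = re.compile(r'^"(?P<name>[^"]+)"=dword:(?P<hex>[0-9a-fA-F]{8})$')


def parse_dwords(text: str) -> dict:
    """Return {value name: decimal value} for every dword line in text."""
    values = {}
    for line in text.splitlines():
        match = DWORD_LINE.match(line.strip())
        if match:
            values[match.group("name")] = int(match.group("hex"), 16)
    return values


decimal_values = parse_dwords(REG_TEXT)
for name, value in decimal_values.items():
    print(f"{name}: {value}")
```

    If a decimal value such as 2048 was intended, the .reg line would need to read `dword:00000800` instead; comparing the decoded numbers with the KB text before importing is probably a good idea.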

    Microsoft has now stated that it is a known issue and that the ReFS storage team is working on a fix.

    The good part now, well for me: going back to basics, I went back to the start to check all config. I found that the RAID card was set to 'Write Through' and changed this to 'Write Back'. Magic... a backup which was taking 8-9 hours to complete now completes in 20-40 minutes.

    I made the change 2 days ago and it is working fine now. Obviously the issues could come back, but for now I'm happy.

    John


    Friday, 9 March 2018 09:16
  • Hi JRH81,

    Did you only change the setting or did you recreate the volume as well? Some RAID Controllers change from Write Back to Write Through if they detect an issue with the battery.

    Our servers are already on write back, so not a solution for us.

    Marc

    Friday, 9 March 2018 09:29
  • Hi Marc,

    Just changed the RAID setting, battery is fine, I think it was just missed during the config.

    Sorry to hear it hasn't helped, could be a temp solution and could be back to square one next week.

    John

    Friday, 9 March 2018 09:35
  • Installed the mentioned update and also used the reg keys but nothing helps. The backups just get slower and slower every time... The DPM server is only using 10% CPU and 20% memory so that is not the problem. Has anybody already tried DPM 1801 to see if it brings some improvement?

    Monday, 12 March 2018 11:32
  • Hi,

    as this is related to ReFS and not to DPM, 1801 will not change anything.


    Michael Seidl (MVP)

    Monday, 12 March 2018 13:00
  • OK, thanks for the reply. Is anyone in contact with Microsoft regarding this issue? Tomorrow the new patches will be released; is there maybe a new update for this?
    Monday, 12 March 2018 13:51
  • We were having an issue with 2 DPM2016 servers at 2 different sites where during a scheduled backup or online recovery point upload the jobs would hang, and when you tried to log in to the server all you got was a black screen - resetting the machine was the only way to get it back.  Installed the February 2018 updates and it helped a little bit, but then eventually the same thing would happen after a few days.  Played with the registry settings a bit - not really a lot of science here, just increasing values.

    DPM server has 4 cores and 64 GB of RAM - never shows any signs of being stressed at all, but it seems like ReFS must do something with the memory that doesn't show in normal tools.

    Right now I have these registry settings, and I've had successful runs (knock on wood) for the past week straight, which is the longest it's gone in a while without intervention:

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem]
    "RefsDisableLastAccessUpdate"=dword:00000001
    "RefsEnableInlineTrim"=dword:00000001
    "RefsDisableCachedPins"=dword:00000001
    "RefsProcessedDeleteQueueEntryCountThreshold"=dword:00010000
    "RefsNumberOfChunksToTrim"=dword:00000020
    "RefsEnableLargeWorkingSetTrim"=dword:00000001

    Hope this helps...
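
    For anyone comparing these values with the set from the Microsoft support case posted earlier in the thread (the 9 March reply), here is a quick sketch that diffs the two `FileSystem` key listings. The numbers are transcribed from the two posts and read as hexadecimal, since that is how .reg `dword:` values are encoded; the dictionary and function names are only illustrative.

```python
# FileSystem values from the earlier reply (9 March), as written in that post.
earlier_post = {
    "RefsEnableLargeWorkingSetTrim": 0x00000001,
    "RefsNumberOfChunksToTrim": 0x00000032,
    "RefsDisableCachedPins": 0x00000001,
    "RefsProcessedDeleteQueueEntryCountThreshold": 0x00002048,
    "RefsEnableInlineTrim": 0x00000001,
}

# FileSystem values from this post.
this_post = {
    "RefsDisableLastAccessUpdate": 0x00000001,
    "RefsEnableInlineTrim": 0x00000001,
    "RefsDisableCachedPins": 0x00000001,
    "RefsProcessedDeleteQueueEntryCountThreshold": 0x00010000,
    "RefsNumberOfChunksToTrim": 0x00000020,
    "RefsEnableLargeWorkingSetTrim": 0x00000001,
}


def diff_settings(a: dict, b: dict) -> list:
    """List (name, value_in_a, value_in_b) for every name whose value differs.

    A name missing on one side is reported with None on that side.
    """
    rows = []
    for name in sorted(set(a) | set(b)):
        left, right = a.get(name), b.get(name)
        if left != right:
            rows.append((name, left, right))
    return rows


differences = diff_settings(earlier_post, this_post)
for name, left, right in differences:
    print(f"{name}: earlier={left} this={right}")
```

    Only three entries differ: this post adds `RefsDisableLastAccessUpdate`, and uses different numbers for `RefsNumberOfChunksToTrim` and `RefsProcessedDeleteQueueEntryCountThreshold`.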

    Monday, 12 March 2018 14:09
  • Hi,

    ReFS does not like many "cheap" RAID controllers (including those in SAN storage) and the way they behave.

    I would be really interested in the target storage you guys are using.

    With high-end storage we do not experience those massive impacts.

    Furthermore, DPM 2016 is even more picky about shared storage than 2012R2; it does not like it when you share the backend disks (not LUNs!) with other DPM servers or other servers.
    But that depends on the storage, of course.

    There is a difference between a 500k NetApp FAS and a QNAP or Synology.

    regards

    /bkpfast


    My postings are provided "AS IS" with no warranties and confer no rights

    Tuesday, 13 March 2018 10:55
  • Tried all the reg keys, but no luck.

    We are not using any RAID controllers. We have made a storage pool from the physical disks in Windows itself. It is a physical server so no shared storage. Disk cabinets are connected with SAS HBAs.

    So still very poor performance. 



    • Edited by Lucas-076 Wednesday, 14 March 2018 12:37
    Wednesday, 14 March 2018 12:20
  • That's us too. No RAID but JBOD configured as Windows storage pools.
    Wednesday, 14 March 2018 14:44
  • I have to revise my last entry, at least a little bit:
    Changing the LUNs on my Synologys from block-level to file-level seems to do a bit of the trick; read/write speed has indeed increased. According to advice from several forums on the net, only file-level LUNs support caching whereas block-level LUNs don't.

    Did not make any difference on DPM2012 with NTFS-formatted storage but for DPM2016 using ReFS it obviously does. At least from what I've seen during the last two weeks while changing the LUNs (not completed yet), this seems to solve the issues with Disk2Disk-Backups.

    Nonetheless, read speed while doing tape backup is still slow as hell. Maybe this also gets solved once all LUNs are file-based LUNs....

    Thursday, 15 March 2018 07:43
  • Hi,

    First of all congratulations bkpfast on your expensive storage.

    I don't read in the hardware requirements of DPM that we need an all-flash array to get decent performance. And it also does not explain the fact that if we destroy the filesystem and recreate all replicas, the same system is up to 20-25 times faster and DOES perform at decent speed, that is 10GB/min tape backups to LTO6.

    Only after several weeks/months (or a certain number of iterations of the tape backup) we see a severe regression in tape backup speed. This is what the initial poster of this thread is reporting btw.

    For the record, we have a mix of Dell T620 and T630 servers running DPM. They all have a PERC RAID controller with 1GB cache.

    Marc

    Thursday, 15 March 2018 08:34
  • Hi all,

    please, is there a solution for slow backups in DPM 2016 UR4? I have installed the 2018-02 update, but it had no effect.

    OS: 2016

    DPM 2016 UR4

    Backup storage is a Dell sc2000 SAN.....

    Is the solution to create an NTFS volume and back up to that?


    Falcon

    Sunday, 18 March 2018 13:25
  • Hi all.

    I have the same problem with DPM 2016 on ReFS. Installing KB4077525 did not help, and neither did the registry settings. DPM 2012 R2 had no performance problems on this server and storage. The DPM storage is an IBM Storwize V3700 attached via SAS.

    I think we will start looking for another backup solution for our company.

    Tuesday, 20 March 2018 08:46
  • Hi guys,

    I think we are seeing some improvements after applying KB4077525.
    The registry settings were already applied.
    Still, the backups are not 100% reliable.

    We are using direct-attached 3.5" 7200 rpm disks in RAID 6 (HP RAID controller).

    regards
    Stefan

    Tuesday, 20 March 2018 09:58
  • What do you use? Is the storage connected to a storage pool that is then added to DPM? Or did you add the volume directly with Disk Management and connect it to DPM?

    Falcon

    Tuesday, 20 March 2018 13:33
  • Hi,

    we have 3 disk enclosures (each enclosure with 12x SATA/Midline SAS disks, arranged as one RAID 6 array), resulting in 3 LUNs on the RAID controller.

    I've added these 3 LUNs to one storage pool and created one big vdisk.

    The big vdisk was added to DPM.

    regards

    Tuesday, 20 March 2018 13:50
  • Hi guys,

    today seems like a little miracle: all backups finished with no errors and in an acceptable to fast time.
    10TB file server recovery point in around 30 minutes.

    Could it be that the update does some secret optimization in the background?

    regards

    Stefan

    Wednesday, 21 March 2018 06:58
  • Applied last week's patch (KB4088787), still no performance improvement. After 17 hours only 28 GB copied.
    Thursday, 22 March 2018 07:02
  • "Applied the patch of last week (KB4088787) still no performance improvement. After 17hours only 28GB copied."

    You are talking about tape performance, right?

    Thursday, 22 March 2018 07:10
  • I'm sorry, no, just a simple disk-to-disk recovery point creation.
    Thursday, 22 March 2018 07:23
  • We've been having exactly the same problem.

    Initially we see great performance.  The disks sit at around 1.2 GB/s transfer speed during replica creation, and then slowly as the weeks go by performance gets worse and worse, until tape jobs that normally finish before we get into work on Monday are finishing on Thursday morning!

    I've been following the Veeam thread and have all the patches and tried all the registry keys.  The memory issue has gone away, but ReFS performance still degrades over time.

    From what I can see it appears to be a fragmentation issue.  Initially most of the reading and writing is sequential, but as time goes on it gets more and more random, which is why performance drops.  Blowing the disk away and recreating the replicas lays everything out nicely for sequential access again.  An allocate-on-write side effect or something...
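
    A rough way to see why fragmentation alone could produce this kind of slowdown on spinning disks is the toy model below. It is not a description of ReFS internals, and every number in it is a made-up illustrative assumption: a single disk streaming at 150 MB/s, 10 ms per seek, a 500 GB replica, and the two extent counts.

```python
def read_time_seconds(size_gb: float, extents: int,
                      seq_mb_per_s: float = 150.0,
                      seek_ms: float = 10.0) -> float:
    """Estimate the time to read size_gb stored as `extents` contiguous runs.

    Toy model: each run costs one seek, and the data itself streams at the
    sequential rate. All default parameters are hypothetical.
    """
    transfer = size_gb * 1000 / seq_mb_per_s   # seconds spent streaming data
    seeking = extents * seek_ms / 1000         # seconds spent repositioning
    return transfer + seeking


# Freshly created replica: few, large extents (hypothetical count).
fresh = read_time_seconds(500, extents=1_000)
# Same data after weeks of incremental changes: many small extents
# (hypothetical count, roughly 100 KB per extent on average).
aged = read_time_seconds(500, extents=5_000_000)

print(f"fresh: {fresh / 3600:.1f} h")
print(f"aged:  {aged / 3600:.1f} h")
print(f"slowdown: {aged / fresh:.1f}x")
```

    With these assumed numbers the same 500 GB goes from under an hour to most of a day, purely from seek overhead. Real arrays with caches and many spindles will behave differently, but the direction matches what is described above.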


    • Edited by DJL Thursday, 22 March 2018 09:26
    Thursday, 22 March 2018 09:24
  • For us the latest Windows Server patches fixed most of the performance issues we had. DPM still uses a lot more memory than the system requirements suggest, but at least the backups are stable now. We had a fileserver protection point that ran for ~36 hours; it's down to around 3 hours now.
    Thursday, 22 March 2018 14:32
  • Well, today I installed the March updates (KB4088787, KB4089510) and did a reboot.

    Now the problem is back: 500 GB recovery point -> now over 3 hours (it was down to 30 minutes).

    Now the question is: did the update change something, or is ReFS getting faster with longer uptime?

    Will monitor this again after the weekend.

    Regards

    Stefan

    Friday, 30 March 2018 13:12
  • I think that after a restart the situation is better, but after a few days.....

    Falcon

    Monday, 2 April 2018 20:07
  • After the March updates we are also seeing slower backups after a couple of days. Not nearly as bad as before, but still noticeable. My guess is that ReFS is still using too much memory on our really large backups (6 TB fileserver) and that eventually slows things down a bit. But for us it's manageable; I am just happy that our server doesn't completely lock up anymore.
    Friday, 6 April 2018 13:13
  • Months have passed, situation still unchanged. Installed May updates, DPM UR5, ... - nothing has changed, disk2disk backups take forever, tape backups will never finish, whole system gets blocked..

    Microsoft, get these issues sorted out IMMEDIATELY!

    Monday, 14 May 2018 06:37
  • Hello Joerg,

    we also had very big performance problems with DPM and Modern Backup Storage.

    But since the ReFS driver was updated (March update, if I remember correctly) the performance is now pretty good.

    I don't know if this observation is correct, but to me it seems that after a reboot the performance is slower, and after 1 or 2 days the performance is very satisfying for us.

    We are using DAS storage (local RAID6 arrays with HP RAID controller). So maybe there's something special about your iSCSI setup.

    Do you have the possibility to do a test backup to a local disk? If the local backup performs better, you know that you need to dig into your iSCSI setup.

    regards

    Stefan


    Monday, 14 May 2018 07:30
  • Thanks Stefan - we see the opposite behaviour: after a reboot the system runs pretty well and gets slower and slower over days until it gets stuck completely. Had > 500 recovery points in queue this morning, where >400 were in pending state and around 100 were "In Progress" but not doing anything (network usage on my DPM server <1Mbps).

    Nonetheless I will (again) try your suggestion and use a local disk as target....
    It can't be a "general iSCSI issue", as jobs work when newly created and start getting slow after 2 or 3 weeks. From my point of view there still is a big, big bug within the implementation of ReFS.

    Maybe you can gimme some details on your setup?
    DPM-Database on same machine? How much RAM is it allowed to use?
    How much RAM is installed on your system in total?
    Stuff like that... Thx. ;-)



    • Edited by Joerg Ott Monday, 14 May 2018 11:14
    Monday, 14 May 2018 08:08
  • Horrible performance for us too--much, much worse than 2012 R2.  We're not using iSCSI.  We are using JBOD configurations with SATA3 in Windows Storage Pools for MBS.  We have 12 physical cores (not using VM any more because performance was even worse) and 80 GB RAM in each box, much more hardware than 2012 R2 for much less performance. 

    While all types of backups are slower and more unreliable, it is almost impossible to get BMRs to work.

    Monday, 14 May 2018 13:31
  • We have the same trouble, DPM 1801 connected to SAN storage... In the System event log there is:
    The description for Event ID 129 from source vhdmp cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.
    If the event originated on another computer, the display information had to be saved with the event.
    The following information was included with the event:
    \Device\RaidPort4.....

    Falcon

    Monday, 14 May 2018 15:31
  • Still the same here. Horrendous performance.  

    We've just bought a new server to try and get things moving again.

    We'll be looking at alternative solutions soon as DPM is almost useless now.

    Monday, 21 May 2018 18:55
  • Good morning.  DPM 1801 on W2016 here as well.  Backing up to local VHD.  Performance is horrible and now we cannot even remove protection.  MMC will sit for days on "Processing" trying to remove protection from a very small backup set.   I've tried all the mentioned fixes through this thread.  Any updates from anyone else?
    Wednesday, 23 May 2018 12:52
  • Hey guys,

    how much memory do your servers have?

    What I forgot to mention: when the problems with our DPM 2016 started, I upgraded the RAM a few times.

    With DPM 2012R2 we were running on 16 GB of RAM.
    Now the same server is running DPM 2016 with 128 GB.

    Maybe that's the reason why my server runs well.

    regards
    Stefan

    Wednesday, 23 May 2018 13:54
  • Update on our end.  Our DPM server is a Hyper-V VM.  I found an obscure article about poor performance when "Enable virtual machine queue" was checked on certain servers with certain NICs.  UNCHECKING this box in Hyper-V fixed all our issues.  Our DPM server is now running better than it ever has.

    I will mention that I had also tried nearly all (if not all) suggestions in this thread, so it is possible that a combination of these things resolved our issue.
    Thursday, 24 May 2018 17:18
  • Would you mind sharing the link you found with the suggested fix of unchecking "enable virtual machine queue"?
    Thursday, 24 May 2018 21:27
  • We're running Dell servers and guess what, they are using Broadcom NICs that are affected by this VMQ bug... never had an issue with that before, so I'm wondering if that really can be crucial now after migrating DPM from 2012R2 to 2016. Nonetheless, I will give it a try...

    Here's some stuff to read:
    https://www.dell.com/support/article/de/de/debsdt1/sln132131/windows-server-slow-network-performance-on-hyper-v-virtual-machines-with-virtual-machine-queue-vmq-enabled?lang=en

    Don't go too deep with the firmware/driver versions; there are other threads across the net stating this also happens with different Broadcom NICs that already have the latest firmware and drivers.

    And here's what MS has to say:
    https://support.microsoft.com/en-us/help/2902166/poor-network-performance-on-virtual-machines-on-a-windows-server-2012

    Cheers,
    Joerg

    Tuesday, 29 May 2018 12:31
  • Having the performance issues myself, and I find it unfathomable how bad it is.

    I'm using a Dell MD3060e with 12GB/s redundant HBAs and 20 8TB drives. Server is physical with 160GB of RAM, 2 CPUs and 16 cores, and the storage is set up as a Windows Storage Spaces pool.  This server is well equipped for this task. I'm also on the 2018-06 cumulative update and have enabled the registry entries listed in the KBs from the previous posts. The server was also rebooted yesterday to test performance again.

    My Hyper-V replicas (as an example) are taking forever to back up. Currently I have a job that has taken 16.5 hours and the backup snapshot size is only 70GB (verified by checking the avhd created during the volume snapshot). I just manually copied the snapshot avhd over to my DPM server and it took about 15 minutes (not bad over a Gigabit network), so I know it's not a network issue.
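
    Putting the two timings from this paragraph side by side, here is a small sketch using only the figures quoted above (70 GB, 16.5 hours so far for the DPM job, about 15 minutes for the manual copy). It assumes decimal units, and since the DPM job had not finished, its rate is an upper bound.

```python
def mb_per_s(size_gb: float, hours: float) -> float:
    """Average throughput in MB/s, assuming decimal units (1 GB = 1000 MB)."""
    return size_gb * 1000 / (hours * 3600)


dpm_job = mb_per_s(70, 16.5)       # DPM job: 70 GB snapshot, 16.5 hours and counting
manual_copy = mb_per_s(70, 0.25)   # manual copy of the same avhd: about 15 minutes

# 1 Gbit/s is 125 MB/s before protocol overhead.
gigabit_line_rate = 1000 / 8

print(f"DPM job:     {dpm_job:.2f} MB/s (at best)")
print(f"manual copy: {manual_copy:.1f} MB/s "
      f"({manual_copy / gigabit_line_rate:.0%} of Gigabit line rate)")
print(f"ratio:       {manual_copy / dpm_job:.0f}x")
```

    So the same data over the same network moved roughly 66 times faster outside DPM, which supports the conclusion that the network is not the bottleneck.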

    My ReFS volume currently has a queue of 100. Can't say I've ever seen it that high when the same storage space was formatted as NTFS for the legacy storage setup, so I'll add another confirmation that it's definitely a ReFS problem.

    I have tried everything I can think of, so once again it's back to waiting on Microsoft (and my backups to complete...maybe by Sunday?). I really don't want to roll back, as despite the performance issues the space used by replicas in MBS is 1/3 to 1/2 of what it was on legacy storage replicas.

    Wednesday, 20 June 2018 15:16
  • We have applied all the reg fixes mentioned on here and all patches, and even contacted Microsoft tech support.  Even worse, when you do a VM backup using VMware backups in DPM it only does 1 VM.  On a daily basis we spend so much time just tweaking DPM to make sure it will run as expected.   We are still waiting for Microsoft tech support to get back to us with a fix for the VMware backup issues and the overall slowness.  It's been about 3 months and still no updates from the tech support team.  DPM in my opinion has just gotten worse.  It's the worst backup software I have ever experienced.  At this point it looks like robocopy would work better for getting daily backups than this bug-ridden DPM.

    Ishan

    Thursday, 28 June 2018 13:16
  • Similar problem here. DPM 1801 with MBS on WS 2016 running in Azure IaaS. Straight file copy between DPM and File Server completes with expected IOPS and throughput based on VM size (300 MB/s avg). DPM backups get 10 MB/s avg.

    It doesn't seem like ReFS is the culprit since both my file server and the DPM volume are formatted ReFS, and like I said I can copy content just using Windows Explorer and performance is great. It's only the DPM backups that have problems.

    Wednesday, 11 July 2018 03:07
  • Ok, here's a little write-up that seems to point in the right direction:
    After adding some more RAM to my physical DPM2016 server we now have 128GB available. Indeed, just increasing the RAM solved some of the problems, but not all of them.

    We're using Synology NAS systems (3 at this time) as short term storage and back with DPM2012 it was fine (and recommended by Syno users around the globe) to use block-level LUNs on the NAS, connect (MPIO for sure) them via ISCSI to DPM and just add these NTFS volumes into DPM.

    Things (kinda?) changed with DPM2016/ReFS... Me and my friend "Trial&Error" now re-configured two LUNs on one of the Synos from block level to file based LUNs with advanced features (guess what: With DSM6.2 and up Synology stripped out the block level LUNs, you can only use file based LUNs now); in fact, I just killed and re-created the LUNs after doing the update to DSM6.2.

    While doing Exchange backups from my DAG it is now more than obvious that "old" block level LUNs (even on updated Synos) seem to be the major issue: After re-creating the LUNs I removed two Exchange DBs from the PG and re-added them using the new LUNs as storage target. Now while backups run, DBs located on new LUNs are backed up within minutes whereas DBs on old LUNs stay at 0MB transferred for one hour or even longer.

    Same goes for Hyper-V VM backups from my Hyper-V clusters and I guess it will be the same for all other backup types as well. Main difference I can see here: The ability to read/write simultaneously is pretty bad on old, block level LUNs whereas it gets dramatically better on new, file based LUNs.

    For the new LUNs I also decided to set the volumes up as NTFS volumes and then put VHDX files on those NTFS volumes. After that I created a Storage Pool using those VHDX files, basically according to Charbel's instructions available here: https://charbelnemnom.com/2016/10/how-to-reduce-dpm-2016-storage-consumption-by-enabling-deduplication-on-modern-backup-storage/

    So still on our way but for now it looks promising... ;-)

    Thursday, 12 July 2018 08:56
  • Hi Cory,

    I wouldn't jump to the conclusion that this is not a ReFS problem. You cannot compare a file copy job with the way DPM syncs data, because DPM uses the copy-on-write feature in ReFS, which could be the cause of the problem.

    A DPM support engineer confirmed to me that the issue is with ReFS and that the Windows team is supposed to fix this. I wonder what is taking them so long to do so.

    Best regards,

    Marc

    Friday, 13 July 2018 08:27
  • Hi Joerg,

    The thing is that (see the initial post of this thread) when you (re-)create your backup storage, backups run fast but performance degrades over time.

    So you really have to see this over a period of one/two months from the time you (re-)created the ReFS volume in DPM.

    Marc

    Friday, 13 July 2018 08:32
  • I agree with Marc because we have a case open with Microsoft as well, and the support engineer told us that ReFS is the issue, that the ReFS team is aware of this, and that they are trying to fix it.

    Ishan

    Friday, 13 July 2018 13:35
  • I had several support cases open with Microsoft using a premier account on DPM 2016.  Some lasted almost 2 years starting near the time of its initial release.  I got little for the cost of those cases and the cost of my time.  Don't expect much.
    Friday, 13 July 2018 13:55
  • Simdoc,

    Yup, I have the same feeling too. It's been 3 months and they haven't even updated me. Not even anything positive like "oh hey, we might have a fix". I am not expecting anything from them. It's coming to the point that we might drop DPM if their new release doesn't have all the fixes. Looks like DPM needs a dedicated admin who is hired only to babysit DPM all day.


    Ishan

    Friday, 13 July 2018 14:28
  • Hi Guys

    I just wanted to add that I have been following this post closely for a while to help us out with similar issues.

    We have a number of backup servers that are running Veeam and Synology iSCSI LUNs with 50TB+ drives. We upgraded them to Windows Server 2016 with ReFS and backups ran so slowly that they couldn't complete.

    We changed the config on the Synology away from normal iSCSI block LUNs and ended up formatting our LUNs as file level on EXT4 on the Synology side and NTFS on the Windows side. We have also downgraded any of the backup servers that we upgraded to Server 2016 back to 2012R2, as we got much better performance on 2012R2.

    The downgrade to 2012R2 was a nightmare because 2012R2 can't read an ReFS drive that was formatted by 2016.

    Good luck with ReFS, not going to touch that for at least a couple of years.

    Andrew

    Thursday, 16 August 2018 06:14
  • Hi guys,

    after several weeks of running DPM 2016 connected to Synology with volume-based LUNs using Advanced LUN features I can't see any performance issues anymore. But be aware that you should use Storage Pools as described in Charbel's instructions (see my previous post for the link). If you just mount the LUN directly into DPM it may lead to data loss in case your Synology gets a hiccup on the iSCSI connections (I suffered from that; ReFS was not capable of repairing the file system and I lost around 15TB of data).

    I just skipped all the dedup-steps Charbel describes and since then read and write performance are no longer an issue, also my tape backups don't get blocked by running disk-2-disk jobs anymore.

    To sum up:
    - increase memory on your DPM machine, we went from 64GB to 128GB
    - use "intelligent" LUNs on your storage
    - mount your LUNs using Storage Pools on Server 2016

    After having solved this we can now take care of the "real" problems like MSDPM crashing around midnight or not being able to add new VMs to my protection group when the VMs are located on a 2016 Hyper-V Cluster, running version 8... Sigh...

    Thursday, 6 September 2018 12:26
  • Hi,

    performance is an issue with MBS - we mostly see it while migrating replicas / RPs.

    But some other ReFS related issues in our environments disappeared after we switched from mount points to drive letters.

    Perhaps it helps in other environments, too.

    regards

    /bkpfast


    My postings are provided "AS IS" with no warranties and confer no rights

    Thursday, 18 October 2018 15:23
  • Hello guys, I have experienced the same issues everyone else here has, but I was able to resolve them completely. The only thing I can tell that I have different from you is a registry entry I made in my tunable parameters section. I don't see that anyone on this thread has disabled the DPM storage calculation. This alone has had a major impact on my DPM performance. I am pasting all my parameters below, but pay particular attention to the "DisableReFSStorageComputation"="1" entry.

    This will break the computed storage size value as reported on your protection group status detail screen, but I didn't care as long as DPM functioned without my constant intervention. Quickly here, my DPM server is a physical, dedicated OSE with DAS: 8 cores, 80GB RAM, and 12 7.2K SAS2 4TB drives in RAID 10. DPM 1801 and Server 2016, latest build.

    You will notice after making this entry and rebooting DPM that the long delay at the start of a job where your disk queue length shoots up and DPM seemingly is processing the meaning of life for 20 mins before transferring the first byte of data across the wire goes away. Good luck everybody

    Windows Registry Editor Version 5.00
    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk]
    "TimeOutValue"=dword:00000078
    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Data Protection Manager\Configuration\DiskStorage]
    "DuplicateExtentBatchSizeinMB"=dword:00000064
    "DisableReFSStorageComputation"="1"
    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem]
    "RefsDisableLastAccessUpdate"=dword:00000001
    "RefsEnableLargeWorkingSetTrim"=dword:00000001
    "RefsNumberOfChunksToTrim"=dword:00000020
    "RefsDisableCachedPins"=dword:00000001
    "RefsProcessedDeleteQueueEntryCountThreshold"=dword:00000200
    "RefsEnableInlineTrim"=dword:00000001
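
    For anyone reading the listing above: the dword values are written in hexadecimal. A minimal Python sketch that decodes them to decimal follows. The parser (parse_reg_values) is a hypothetical helper written for this illustration; it only handles the two value forms used in this post, not the full .reg syntax. It reports "DisableReFSStorageComputation" as the string it is written as here and makes no claim about which value type DPM expects.

```python
import re

# The listing from the post above, copied verbatim.
REG_TEXT = r'''
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk]
"TimeOutValue"=dword:00000078
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Data Protection Manager\Configuration\DiskStorage]
"DuplicateExtentBatchSizeinMB"=dword:00000064
"DisableReFSStorageComputation"="1"
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem]
"RefsDisableLastAccessUpdate"=dword:00000001
"RefsEnableLargeWorkingSetTrim"=dword:00000001
"RefsNumberOfChunksToTrim"=dword:00000020
"RefsDisableCachedPins"=dword:00000001
"RefsProcessedDeleteQueueEntryCountThreshold"=dword:00000200
"RefsEnableInlineTrim"=dword:00000001
'''


def parse_reg_values(text):
    """Return {name: value}; dword values become ints, quoted values stay strings."""
    values = {}
    for line in text.splitlines():
        match = re.match(r'^\s*"([^"]+)"=(.*)$', line)
        if not match:
            continue  # skip key headers like [HKEY_...] and blank lines
        name, raw = match.group(1), match.group(2).strip()
        if raw.lower().startswith("dword:"):
            values[name] = int(raw[len("dword:"):], 16)  # hex -> decimal
        else:
            values[name] = raw.strip('"')
    return values


values = parse_reg_values(REG_TEXT)
for name, value in values.items():
    print(f"{name} = {value!r}")
```

    So, for example, TimeOutValue 00000078 is 120 and DuplicateExtentBatchSizeinMB 00000064 is 100 in decimal.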


    -Jason

    Sunday, 11 November 2018 16:29