DPM 2016 UR2 extreme slowdown

  • Question

  • Over time our DPM server has slowed down to the point of being almost unusable.  When it was first set up, sync times were around 2-4 minutes depending on data churn and other factors, but they have since increased to anywhere from 20 minutes to 4 hours for a sync of less than a GB.  The underlying hardware and software haven't changed, so I'm not sure where to begin.  Below is a screenshot showing the extremely long synchronizations we are experiencing; it's just a random sample of syncs that actually completed today.

    

    There's not much in the event log that appears related, just the random smattering of errors usually seen on the server, examples include:

    Filter Manager failed to attach to volume '\Device\Harddisk123\DR214'.  This volume will be unavailable for filtering until a reboot.  The final status was 0xC03A001C.


    The description for Event ID 999 from source MSDPM cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer. If the event originated on another computer, the display information had to be saved with the event. The following information was included with the event: An unexpected error caused a failure for process 'CPWrapperServiceHost'. Restart the DPM process 'CPWrapperServiceHost'. Problem Details: <FatalServiceError><__System><ID>19</ID><Seq>0</Seq><TimeCreated>6/15/2017 3:06:01 PM</TimeCreated><Source>DpmThreadPool.cs</Source><Line>163</Line><HasError>True</HasError></__System><ExceptionType>NullReferenceException</ExceptionType><ExceptionMessage>Object reference not set to an instance of an object.</ExceptionMessage><ExceptionDetails>System.NullReferenceException: Object reference not set to an instance of an object. at Microsoft.Internal.EnterpriseStorage.Dpm.CPWrapperService.CPWrapperServiceWCFHost.GetCertificateCheckTimerInterval() at Microsoft.Internal.EnterpriseStorage.Dpm.CPWrapperService.CPWrapperServiceWCFHost..ctor() at Microsoft.Internal.EnterpriseStorage.Dpm.CPWrapperService.CPWrapperService..ctor() at Microsoft.Internal.EnterpriseStorage.Dpm.CPWrapperService.CPWrapperService.Main()</ExceptionDetails></FatalServiceError> the message resource is present but the message is not found in the string/message table
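    The "Problem Details" part of the MSDPM event above is an XML fragment, so the useful fields can be pulled out with a short script rather than read by eye. The following is only a sketch: the function name `summarize_fatal_error` is made up for illustration, and the payload is a trimmed copy of the event text with the long stack trace left out.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the "Problem Details" payload from the event above
# (ExceptionDetails omitted for brevity).
PROBLEM_DETAILS = (
    "<FatalServiceError><__System><ID>19</ID><Seq>0</Seq>"
    "<TimeCreated>6/15/2017 3:06:01 PM</TimeCreated>"
    "<Source>DpmThreadPool.cs</Source><Line>163</Line>"
    "<HasError>True</HasError></__System>"
    "<ExceptionType>NullReferenceException</ExceptionType>"
    "<ExceptionMessage>Object reference not set to an instance of an object."
    "</ExceptionMessage></FatalServiceError>"
)


def summarize_fatal_error(xml_text):
    """Return the key fields of a FatalServiceError payload as a dict."""
    root = ET.fromstring(xml_text)
    system = root.find("__System")
    return {
        "time": system.findtext("TimeCreated"),
        "source": system.findtext("Source"),
        "line": int(system.findtext("Line")),
        "exception": root.findtext("ExceptionType"),
        "message": root.findtext("ExceptionMessage"),
    }


if __name__ == "__main__":
    for key, value in summarize_fatal_error(PROBLEM_DETAILS).items():
        print(f"{key}: {value}")
```

    Run over a batch of exported events, a summary like this makes it easier to see whether the same exception keeps recurring or whether the crashes are unrelated.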

    There are a few others, mostly just some occasional crashes of DPM itself.  I'd be happy to provide any information that would be useful.  I have run some generic disk benchmarks, and the performance of the hardware itself does not appear to have degraded.  We are running DPM 2016 UR2 inside a virtual machine that is the sole VM on a DL 380 with 128GB of RAM (32GB for the VM); dedup runs on the host, and the disks are passed through to the DPM VM.

    Edit:  Just wanted to add that this is a transient issue.  Server performance sometimes improves, particularly after a reboot once the server has cleared out all the failed jobs (caused by the reboot).  Having just rebooted the server, I can say that syncs are currently taking 5-30 minutes about an hour after the reboot.  This is still a bit higher than when the server was first stood up, but it's usable.  All dedup jobs on the host have throttling enabled in an attempt to prevent disk contention.  The dedup schedules haven't changed since the server was set up, aside from the addition of 30% dedup job throttling.
    • Edited by JN1226 Thursday, June 15, 2017 4:56 PM
    Thursday, June 15, 2017 4:06 PM

All replies

  • If this is running on Windows Server 2016, make sure you install all the latest updates. Windows Server 2016 had a memory leak/bug with ReFS that eventually used up all RAM on the DPM server. When that happened DPM 2016 slowed to a crawl, which sounds like what you are seeing.
    Thursday, June 29, 2017 9:58 AM
  • I have the same problem. It just started in the past week. See my post here; your description is identical to my issue, except that I'm not using dedupe. I've tried all the latest Server 2016 CUs and regressed back to the 2017-04 CU with no luck.

    As I've been down for almost a week now, I either need to open a case with MS or do a rebuild. Does anyone know if you can rebuild and import the database without rebuilding the PG disk replicas, like you could with prior versions of DPM? K

    Thursday, June 29, 2017 12:48 PM
  • I think we've patched past that, but I'm currently working on getting the absolute latest patches possible.  We've had other issues with ReFS to be honest.  One of our servers at a secondary site would blue screen and reboot twice every morning at 1:30AM until one day when it would just blue screen and reboot every 5 minutes.  I'm not sure what fixed it as I didn't really make any changes that should have, but it seems to have gone away on its own.

    I'm currently missing the absolute latest May patches on this server and I will apply every possible update and check back.

    Thursday, June 29, 2017 2:33 PM
  • From what I've read, the best way to re-import a database is to simply back it up in SQL with 2016.  There was a specific application for DPM database backup in DPM 2012 R2 and earlier, but it seems to be gone in 2016.  I was attempting to "split" my disks, as I believe part of my problem is my replica size (currently ~100TB), which can be split into 50TB on one VM and 50TB on another without much issue.  My problem was that this involved a server with a different name, which DPM does not appear to support at all.
    Thursday, June 29, 2017 2:36 PM
  • I've spent the entire day on the phone with MS Support. I was told from the start that support will help resolve ONLY one replica issue, meaning if they fix that one and the rest still fail, I have to open another ticket. WTF??? Anyway, so far they have pointed the finger at SQL 2016 SP1 (SP1 is not supported). OK, fine, I uninstalled that. Same problem. Then they said the disk enclosure management console was the problem. OK, uninstalled that. Same problem. After several hours of digging in SQL and then looking at event logs, they said the "Filter Manager" error is caused by 3rd-party software (EMC PowerPath). OK, uninstalled that. This required a reboot, and it seemed to start working (key word: seemed). Because of the reboot and because only one replica job was running, backup speed returned, but as soon as the 82 other jobs kicked off, we were back to the same problem.

    So I'm throwing in the towel for today and will resume the case with MS tomorrow. K

    Thursday, June 29, 2017 9:04 PM
  • We are also experiencing the same issues with SCDPM, which began about a week ago.  Recovery points and consistency checks that previously took minutes to complete now run for hours with little to no data transferred.  After rebooting the server, performance improves, but once a set of scheduled jobs begins the same degraded performance returns.  When logging on to the server there is sometimes a blank screen for minutes until the desktop appears.  The SCDPM console becomes completely unresponsive after initiating a consistency check.  Typically it must be closed or ended via Task Manager and reopened.  We have three SCDPM servers, one located at each site, that have all begun to exhibit this behavior, all within the last week.  We are running SCDPM 2016 with the latest UR on Windows Server 2016 VMs with the latest updates.  Two of the physical storage servers are running Windows Server 2016 with MBS and one is running Server 2012 R2.  I'm most likely going to open a Microsoft case as well, since I've been trying all week to return our organization's backups to a fully functional state without success.

    Friday, June 30, 2017 3:37 PM
  • After working with Microsoft Support on 2017-07-03, we uninstalled all of the June 2017 Windows updates for Server 2016. This made no improvement.  They acknowledged that other customers were seeing the same issues, but they were still in an investigative stage so they wouldn't confirm this as a bug.  I was provided three options that we were not excited about.

    1. Wait until Microsoft had an answer.

    2. Rebuild the SCDPM server using the existing database.

    3. Rebuild the SCDPM server using a new database.

    The rebuild options meant completely reinstalling Windows, not simply reinstalling SCDPM.  I waited until today to discuss our options with other staff.  It just so happened that when I logged on to the SCDPM server I noticed Windows Defender was using around 50 percent CPU.  I had not seen this behavior previously and there wasn't a scheduled scan running.  This led us to configure a SCEP policy to disable scheduled scans and real-time protection.  After doing so the server has been performing as it did in the past.  Jobs are completing in a timely fashion.  I believe this could be the issue others are seeing.  Because Windows Defender is installed by default on Windows Server 2016, we didn't look at it as a possible cause until seeing the high CPU utilization by chance.  If you're running Windows Server 2016, try disabling Windows Defender scheduled scans and real-time protection and let us know your results.  As I mentioned, we seem to be fully operational again.

    • Proposed as answer by DJL Wednesday, July 5, 2017 10:22 PM
    Wednesday, July 5, 2017 5:34 PM
  • I have a support case open with Microsoft for exactly the same issues - no joy with them yet either.  It's affecting our primary and secondary DPM servers - it's so bad we've just blown away our secondary server and started using Hyper-V replica to get offsite backups working again.

    Will try disabling defender and see how that goes.

    Wednesday, July 5, 2017 7:58 PM
  • Glen - thank you! Disabling Defender seems to have resolved the problem instantly!  Jobs that had been stuck for hours completed within seconds and disk transfer rates are back up to 1.2GB/s, which I haven't seen for ages!

    I'm using "Turn off Windows Defender" under Computer Configuration | Administrative Templates | Windows Components | Windows Defender

    I had path exclusions set for the DPM ReFS drives and DPM directory, and process exclusions for all the DPM processes, SQL, csc etc.

    Wednesday, July 5, 2017 8:12 PM
  • Can confirm that it's Windows Defender.

    Had this issue for 2 weeks. Reinstalled our servers with a blank DPM database and all was OK. Now, after 1 week and 4 days, the first server I reinstalled is having the same issue.

    As soon as I turn off Defender, BOOM, it's back in business. Working with MS on this as well. We have another issue that is causing backups to go into a loop on some VMs. Once I turned off Defender on that server, one VM finished almost instantly. Hoping it will solve it for the rest.

    Jan-Tore Pedersen

    Thursday, July 6, 2017 8:23 AM
  • Great to hear the issue seems to be resolved.  It's unfortunate that rebuilding the entire server doesn't work and we end up where we started after doing so.  I guess the SCDPM and Defender teams will have to get to work on a solution.
    Thursday, July 6, 2017 7:46 PM
  • It looks like others had their issue resolved by turning off Windows Defender, but that did not fix the issue for us.  We already had exclusions in place beforehand, so I'm not sure if that makes any difference.  After putting in some time on the weekend, I believe our issue has finally been traced to a corrupt NTFS logical volume on our JBOD server, which passes deduped VHDXs to the VM that contains DPM.  Originally we split our environment in half, spreading it across 2 VMs to see if this would alleviate the issue.

    After the split the new server was performing exceptionally well, whereas the old one had gotten so bad it was essentially hard locking every 10-15 minutes.  The old server was rebuilt but the issues persisted: the server event log was still full of Event ID 129 entries every 30 seconds, and we eventually had issues mounting one of the volumes.  It was actually very similar to what would happen back in the day when a computer would lock up while trying to slowly read data from a floppy disk.  Running chkdsk on the NTFS volumes showed that one had errors in the bitmap file, but even after it repaired the issue the ReFS volume inside the VM still would not mount or be read from or written to properly.

    We eventually just wiped the logical drive, and after the reformat and getting the storage added back in the issue seems to be resolved for the time being.

    Monday, July 10, 2017 3:31 PM
  • Good to hear your issue seems resolved as well, JN1226!

    For our DPM 2016 UR2 servers, disabling Defender made a big difference on System State/BMR jobs. Most of these jobs did not finish within 1 day. With Defender disabled I see 40MB/s or more on the same jobs.

    Hopefully MS finds a solution that gives the same performance with Defender enabled.


    Kind regards, Mark

    Tuesday, July 11, 2017 8:53 PM
  • Just for info: Microsoft Support have said the product team are aware, and a Windows Defender update will be available shortly to fix the issue.


    • Edited by DJL Friday, July 14, 2017 12:53 PM
    Friday, July 14, 2017 12:52 PM
  • Thank you!  I just wanted to chime in and echo the frustration with DPM 2016. I've been on DPM since the 2007 days.  While there has been an issue here or there, this one seems to be the worst I have experienced.  I disabled Defender on both of my DPM servers and it seemed to work at first, but performance has started to degrade again.

    The only difference is that I can manage the server when Defender is off, but disk throughput is still horrible. Protected VMs go into inconsistent states weekly.  Consistency checks take 40 hours in certain cases.

    Monday, July 17, 2017 4:49 PM
  • I was quite optimistic in an earlier post, but the performance drops dramatically when more System State/BMR jobs are running. It seems to stall the whole DPM server, with a total throughput of 3MB/s. Consistency Checks are also indeed really slow.

    Kind regards, Mark

    Monday, July 17, 2017 5:13 PM
  • Also keep in mind that in order to disable Defender completely, you need to use the local GPO. Just disabling real-time protection does not solve the problem:

    • Start the Local Group Policy Editor
    • Go to Computer Configuration | Administrative Templates | Windows Components | Windows Defender
    • Enable the "Turn off Windows Defender" setting.

    Full story at Jan-Tore Pedersen's blog: http://jtpedersen.com/index.php/2017/07/06/windows-defender-issue-on-windows-server-2016-with-dpm/
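    For servers where clicking through the policy editor is impractical, the same setting can be expressed as a registry import. This is a hedged sketch, not something taken from this thread: it assumes the "Turn off Windows Defender" policy is backed by the `DisableAntiSpyware` DWORD under the Defender policy key, and the helper name `build_reg_file` is invented for illustration. Verify against your OS build on a test server before importing anything.

```python
# ASSUMPTION: the "Turn off Windows Defender" policy corresponds to this
# policy key and value; confirm on your own build before relying on it.
DEFENDER_POLICY_KEY = r"HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows Defender"
DEFENDER_POLICY_VALUE = "DisableAntiSpyware"


def build_reg_file(key_path, value_name, dword):
    """Return the text of a .reg file that sets one DWORD value."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key_path}]",
        f'"{value_name}"=dword:{dword:08x}',
        "",
    ]
    # .reg files conventionally use Windows (CRLF) line endings.
    return "\r\n".join(lines)


if __name__ == "__main__":
    # Print the file for review; save and import it manually if it looks right.
    print(build_reg_file(DEFENDER_POLICY_KEY, DEFENDER_POLICY_VALUE, 1))
```

    The script only generates text and never touches the registry itself, so the result can be reviewed before it is applied.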


    Kind regards, Mark

    Monday, July 17, 2017 10:16 PM
  • Does anyone have an update on this issue? A lot of parallel System State jobs result in hardly any throughput. Since a System State backup is performed using a temporary share on the DPM server, I thought I'd have a look at the SMBServer event log.

    I encountered a lot of 1020 warnings in the SMBServer Operational event log (Event Viewer -> Applications and Services -> Microsoft -> Windows -> SMBServer -> Operational):

    Log Name:      Microsoft-Windows-SMBServer/Operational
    Source:        Microsoft-Windows-SMBServer
    Date:          8-8-2017 10:18:22
    Event ID:      1020
    Task Category: (1020)
    Level:         Warning
    Keywords:      (8)
    User:          SYSTEM
    Computer:      <DPM 2016 SERVER>
    Description:
    File system operation has taken longer than expected.
    
    Client Name: \\**.***.*.**
    Client Address: **.***.*.**:57066
    User Name: <DOMAIN>\<SERVER>$
    Session ID: 0x94002C000049
    Share Name: \\*\197f9b999a49477db79d4e4c9f314761
    File Name: WINDOWSIMAGEBACKUP\<SERVER>\BACKUP 2017-08-08 061157\7AE90BC5-06F6-11E6-80B5-806E6F6E6963.VHDX
    Command: 17
    Duration (in milliseconds): 19846
    Warning Threshold (in milliseconds): 15000
    
    Guidance:
    
    The underlying file system has taken too long to respond to an operation. This typically indicates a problem with the storage and not SMB.
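    To get a feel for how far over the threshold these 1020 warnings are, the two millisecond fields can be extracted from the event description text. The snippet below is a sketch under assumptions: `parse_slow_io` is a made-up helper name, and it expects the description in the same plain-text layout as the event shown above.

```python
import re

# Excerpt of the event description shown above.
SAMPLE_DESCRIPTION = """File system operation has taken longer than expected.
Command: 17
Duration (in milliseconds): 19846
Warning Threshold (in milliseconds): 15000
"""

# Matches the "Duration" and "Warning Threshold" lines and captures the number.
FIELD_RE = re.compile(
    r"^\s*(Duration|Warning Threshold) \(in milliseconds\):\s*(\d+)\s*$",
    re.MULTILINE,
)


def parse_slow_io(description):
    """Return duration, threshold and the overshoot (all in ms) for one event."""
    fields = {name: int(value) for name, value in FIELD_RE.findall(description)}
    duration = fields["Duration"]
    threshold = fields["Warning Threshold"]
    return {
        "duration_ms": duration,
        "threshold_ms": threshold,
        "over_by_ms": duration - threshold,
    }


if __name__ == "__main__":
    print(parse_slow_io(SAMPLE_DESCRIPTION))
```

    Applied to a day of exported warnings, the `over_by_ms` values would show whether the storage is only slightly slow or stalling for long stretches when many System State jobs run in parallel.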
    

    Does anyone else see these as well?


    Kind regards, Mark

    Tuesday, August 8, 2017 12:43 PM
  • I had pretty much the same. The only way around it was to avoid using MBS and use legacy storage.
    Tuesday, August 15, 2017 6:38 AM
  • Is anyone using Windows Defender successfully with DPM?  I'm seeing an unacceptable performance impact with the recommended exclusions compared to turning it off:  https://docs.microsoft.com/en-us/previous-versions/system-center/system-center-2012-R2/hh757911(v=sc.12).  Using DPM 2019 on Windows Server 2019.

    Monday, May 13, 2019 9:51 PM