Multiple DPM 2019 servers with Storage Pools on Server 2016 VM, hosted on Server 2016 Hyper-V getting VHDMP errors and delayed write

  • Question

  • Hi

    In my environment, with the move to DPM 2016 two years ago, we upgraded our production setup in line with MS recommendations to a virtual DPM 2016 server. The storage is provided by passing through multiple VHDX files as 1TB drives to the virtual DPM server. These are then all attached to the VM as four separate storage pools and presented to DPM. The idea is that the storage will be deduplicated, expandable and portable. We also use a Virtual Tape Library for long-term storage.

    These servers back up our production clusters cross-site, so DPMSite1 backs up ClusterSite2 and vice versa. With the virtual DPM 2016 servers we had been having some problems with jobs overrunning and not completing overnight.

    We also have two physical DPM servers set up cross-site for our test cluster, not virtualised in any way. They had to be physical at the time due to the use of a physical tape drive.

    I upgraded the test servers to DPM 2019 about a month ago with no problems. They are running perfectly, backing up a large workload successfully every night. With those servers running well, I took the plunge and upgraded the production servers to DPM 2019.

    The update went fine, but the failed and overrunning jobs have now got a lot worse. I'm finding 40 jobs still running in the morning on both servers, and often 20 to 30 failures. I initially put this down to network congestion, but as my test servers are performing perfectly, also with a large load and over the same links, I suspect the issue is with the virtual DPM servers themselves, not the links. The errors I'm getting on them would confirm this.

    I checked the logs on the virtual servers and I get multiple instances of the following on both:

    The description for Event ID 129 from source vhdmp cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

    If the event originated on another computer, the display information had to be saved with the event.

    The following information was included with the event: 

    \Device\RaidPort7

    I also get the following highlighted via Server Manager on both

    Error 50  NTFS {Delayed Write Failed} Windows was unable to save all the data for the file . The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.

    and

    Error 9 : Virtual Disk Service: Unexpected provider failure. Restarting the service may fix the problem. Error code: 8007001F@02000014

    and 

    Error 1 : Virtual Disk Service : Unexpected failure. Error code: C000000E@020A0007

    So obviously this makes me think there's a serious latency issue in the communication between the host storage and the VMs. Is there anything obvious I can do to improve this? I have tried updating the drivers and firmware on the host server, but to no avail.
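For anyone reading the Virtual Disk Service codes above, here is a minimal Python sketch that splits one of them into its parts. It assumes (this is not stated in the event text) that the value before the "@" follows the standard HRESULT bit layout; the function names are made up for illustration.

```python
def split_vds_code(text):
    """Split a 'RESULT@LOCATION' string into two integers (both parsed as hex)."""
    result_hex, _, location_hex = text.partition("@")
    return int(result_hex, 16), int(location_hex, 16)


def decode_hresult(value):
    """Break a 32-bit HRESULT into severity, facility and code fields."""
    return {
        "failure": bool(value >> 31),       # bit 31: 1 means failure
        "facility": (value >> 16) & 0x7FF,  # bits 16-26: facility
        "code": value & 0xFFFF,             # bits 0-15: facility-specific code
    }


result, location = split_vds_code("8007001F@02000014")
print(decode_hresult(result))
```

Under that assumption, 8007001F decodes to a failure in facility 7 (Win32) with code 31, which is the Win32 error ERROR_GEN_FAILURE ("A device attached to the system is not functioning"). That would fit a storage problem rather than a network one.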

    Thursday, June 6, 2019 9:53 AM

All replies

  • Hello,

    These events are usually caused by backup software, in your case DPM 2019. It does seem that the disks may be the culprit here and aren't performing well enough for the backup (I have experienced this many times).

    Can you tell us a bit more about your DPM environment? How is the network configured, and do you have a separate backup network?

    Are the cluster sites far away geographically? If they are, then there will be some latency.

    What storage system are you using? What kind of disks: FC / SCSI? Are they SAS / SATA / SSD?

    Best regards,
    Leon


    Blog: https://thesystemcenterblog.com

    Thursday, June 6, 2019 12:13 PM
  • Hi, thanks for the reply and sorry about the delay. I have brought in MS Premier support to look at it, as it was causing us a lot of problems. It does look like the storage is the problem. Support have given me some registry changes to make, which appear to have improved things enormously. Once I can verify they have worked, I'll post them back here for anyone else who finds this.

    We have no separate backup network, but the backups only occur overnight. The two sites are not far apart geographically and are connected by 10GB links. The storage is direct attached storage, fed into the DPM VM via VHDXs and Storage Pools. ReFS appears to be where the issue may have been.

    Will post back when I'm more certain it's the fix.

    Monday, June 10, 2019 11:02 AM
  • Thanks for getting back, I'm glad to hear that you have progress, let's hope all goes well!

    With 10GB links there shouldn't be any issues in the network communication as I see it.



    Monday, June 10, 2019 9:27 PM
  • Hi,

    Just checking to see if you have any update?



    Monday, June 17, 2019 10:09 PM
  • Hi, yes, the problem is much better now thanks to the tweaks below:

    I think the first one made the most difference to the performance, as it stops DPM calculating the sizes for each protected item.

    'Unless this is extremely necessary for your company, we suggest disabling it on the DPM server. We are working on a way to get these details in the background in a more efficient way. We can enable this in the future.

    To disable this size calculation background operation:

    Open DPM Powershell and run: .\Manage-DPMDSStorageSizeUpdate.ps1 StopSizeAutoUpdate

     

    Functionality of the Script: Manage-DPMDSStorageSizeUpdate.ps1

    ManageStorageInfo – Mandatory – one of 'StartSizeAutoUpdate','StopSizeAutoUpdate','GetSizeAutoUpdateStatus','UpdateSizeInfo'

    1. GetSizeAutoUpdateStatus – Tells whether automatic size calculation at backup and pruning is on or not. Default is ON.

    2. StopSizeAutoUpdate – Stops size calculation at backup and pruning. Size will be shown as a '-' in UI after this.

    3. UpdateSizeInfo:

          - Updates the sizes of all data sources in the DB and outputs the result to a CSV file (the path can be given as input in the "UpdatedDSSizeReport" parameter).

          - The user can also selectively update sizes for specific data sources by specifying a file in the "UpdateSizeForDS" parameter. This file should contain the data source IDs of the data sources to be updated, one per line.

          - If there are data sources which could not be updated with the correct size (e.g. because a backup is running for them), these are output to the file specified in the "FailedDSSizeUpdateFile" parameter. This file can then be passed as the "UpdateSizeForDS" parameter in another call to the script, which would update sizes for only these data sources.

    4. StartSizeAutoUpdate – Sets automatic size calculation at backup and pruning to be ON. If it was OFF previously then user should also run this script with UpdateSizeInfo to ensure that the UI shows the correct value.
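The UpdateSizeInfo retry flow described in point 3 can be sketched roughly as below. This is only an illustration in Python that builds the argument lists; it assumes the parameters are passed as ordinary named PowerShell parameters, and the file names (`sizes.csv`, `failed.txt`, and so on) are hypothetical. It does not run anything.

```python
SCRIPT = r".\Manage-DPMDSStorageSizeUpdate.ps1"


def build_size_update_command(report_csv, only_ids_file=None, failed_file=None):
    """Build the argument list for one UpdateSizeInfo run of the script."""
    args = [SCRIPT,
            "-ManageStorageInfo", "UpdateSizeInfo",
            "-UpdatedDSSizeReport", report_csv]
    if only_ids_file:
        # Restrict the run to the data source IDs listed in this file.
        args += ["-UpdateSizeForDS", only_ids_file]
    if failed_file:
        # Data sources that could not be updated are written here.
        args += ["-FailedDSSizeUpdateFile", failed_file]
    return args


# First pass over everything, recording the data sources that failed.
first = build_size_update_command("sizes.csv", failed_file="failed.txt")
# Second pass: feed the failed list back in so only those are retried.
retry = build_size_update_command("sizes_retry.csv",
                                  only_ids_file="failed.txt",
                                  failed_file="failed_again.txt")
print(" ".join(first))
print(" ".join(retry))
```

The retry pass is best run once the backups that blocked the first pass have finished, otherwise the same data sources are likely to fail again.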

    ----------------------------------------------------------------------------------------------------------------

    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Data Protection Manager\Configuration\DiskStorage]

    "DuplicateExtentBatchSizeinMB"=dword:00000064 -> Change from 100 to 64 (note that dword:00000064 is hexadecimal, which is 100 in decimal)

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem]

    "RefsEnableLargeWorkingSetTrim"=dword:00000001

    "RefsDisableCachedPins"=dword:00000001

    "RefsProcessedDeleteQueueEntryCountThreshold"=dword:00000800

    "RefsNumberOfChunksToTrim"=dword:00000020

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk]

    "TimeOutValue"=dword:00000078 -> Change from 60 to hex 78 (dword:00000078 is 120 in decimal)

    Created the key ParallelMountDismount under HKLM\Software\Microsoft\Microsoft Data Protection Manager\Configuration

    Created DWORD with name Enable with value 1

    Created DWORD with name ParallelMountDismountLimit and value 5
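The ParallelMountDismount key is described in words rather than .reg syntax, so here is a small Python sketch that renders it in the same form as the other entries. It assumes both DWORDs sit directly inside the new ParallelMountDismount key; `render_reg` is a made-up helper, not part of DPM.

```python
def render_reg(key_path, dwords):
    """Render one registry key and its DWORD values as .reg file text."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{key_path}]"]
    for name, value in dwords.items():
        # .reg files store DWORDs as 8 hexadecimal digits.
        lines.append(f'"{name}"=dword:{value:08x}')
    return "\n".join(lines) + "\n"


KEY = (r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft"
       r"\Microsoft Data Protection Manager\Configuration\ParallelMountDismount")

reg_text = render_reg(KEY, {"Enable": 1, "ParallelMountDismountLimit": 5})
print(reg_text)
```

Check the output against your own server before importing anything, and export the existing Configuration key first so the change can be rolled back.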

    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Wbem\CIMOM]

    "High Threshold On Client Objects (B)"="60000000"

    "High Threshold On Events (B)"="60000000"
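One thing to watch with the entries above is that `dword:` values in .reg syntax are hexadecimal. A short Python sketch for converting them to decimal before comparing against documented defaults; the sample text is copied from the entries above and `dword_values` is a hypothetical helper.

```python
import re

# Matches lines such as "TimeOutValue"=dword:00000078
DWORD_LINE = re.compile(r'"(?P<name>[^"]+)"=dword:(?P<hex>[0-9a-fA-F]{8})')


def dword_values(reg_text):
    """Return {value name: decimal value} for every dword line in the text."""
    return {m["name"]: int(m["hex"], 16) for m in DWORD_LINE.finditer(reg_text)}


sample = '''
"DuplicateExtentBatchSizeinMB"=dword:00000064
"RefsProcessedDeleteQueueEntryCountThreshold"=dword:00000800
"RefsNumberOfChunksToTrim"=dword:00000020
"TimeOutValue"=dword:00000078
'''

for name, value in dword_values(sample).items():
    print(f"{name} = {value}")
```

So the disk TimeOutValue of 00000078 is 120 seconds, and DuplicateExtentBatchSizeinMB of 00000064 is 100 MB.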

    Hope this helps someone

    Alig69

    • Marked as answer by alig69 Tuesday, June 18, 2019 2:42 PM
    Tuesday, June 18, 2019 7:15 AM