Host-level backup taking too long

  • Question

  • Hi. DPM 2016 is protecting a couple of standalone Hyper-V Server 2016 servers. There's one particular VM, which is relatively large (1900GB of provisioned space, of which over 1600GB is allocated). This VM is protected using host-level backups. The problem is that these jobs take an extremely long time.

    There's daily churn of 300-800GB and the jobs in DPM run for 6-16 hours, which is not acceptable. Backup is done using RCT (guest OS is WS2016). I understand that the daily churn does not account for the whole backup process (and time), but this is extremely slow - for example one job took 7hrs 41mins and transferred 297747MB - which makes it ~10MB/s. This seems to be a problem with this one particular VM, because there are other larger VMs protected and the backups take considerably less time (for example a job that transferred 680GB took 3.5hrs, another 1.3TB in 2.75hrs etc.).
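To make the arithmetic in the paragraph above easy to re-check, here is a minimal Python sketch that recomputes the average rate of each quoted job. It assumes decimal units (1GB = 1000MB, 1TB = 1000GB), which the post does not specify; the helper name `throughput_mb_per_s` is made up for illustration.

```python
def throughput_mb_per_s(megabytes: float, hours: float = 0, minutes: float = 0) -> float:
    """Average rate of a job, given the data transferred and the elapsed time."""
    seconds = hours * 3600 + minutes * 60
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return megabytes / seconds


# Assumption: decimal units, since the post does not say which convention DPM reports.
GB = 1000        # MB per GB
TB = 1000 * GB   # MB per TB

# Figures quoted in the post above.
jobs = {
    "problematic VM (297747MB in 7hrs 41mins)": throughput_mb_per_s(297747, hours=7, minutes=41),
    "other VM (680GB in 3.5hrs)": throughput_mb_per_s(680 * GB, hours=3.5),
    "other VM (1.3TB in 2.75hrs)": throughput_mb_per_s(1.3 * TB, hours=2.75),
}

for name, rate in jobs.items():
    print(f"{name}: {rate:.1f} MB/s")
```

Under that assumption the three jobs come out at roughly 10.8, 54 and 131 MB/s, so the slow job is several times slower than either of the faster ones.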

    Is there a way to troubleshoot/investigate this?

    Can host-level backup be affected by guest load (cpu/disk/network)?

    Monday, September 3, 2018 11:36 AM

All replies

  • Hello!

    Are there any other backups running at the same time as the slow backup of this particular VM?

    Is there anything else going on on the network during this time?

    Do you have a separate network for backup or how is it configured?

    I would monitor the network and the disks; usually one of these two is the bottleneck in backups in general.

    Best regards,
    Leon


    Blog: https://thesystemcenterblog.com

    Monday, September 3, 2018 12:32 PM
  • Yes, other backups run simultaneously, since this job is taking this long :)

    No separate network for backup, it's not practical in our setup.

    Network/disks are not under (significant) load. I've done basic performance troubleshooting (monitoring hosts/DPM server), and it's this one particular VM exhibiting this extreme behavior. Since the job usually runs this long, it overlaps many other backups, but backups of other VMs in the same protection group and on the same host also run a lot faster than this one. The hosts use local SSDs as storage (capable of 700MB/s sequential reads), and the backup storage is capable of ~300MB/s sequential writes (depending on block size). The backup storage is loaded (slower access times and higher queue depth) during the nightly backup cycle, but the above note still stands (same PG, same hosts, faster backups).
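To put those storage figures in perspective, here is a rough Python sketch of how long the 297747MB job from the original question would take if it ran at the quoted sequential rates. It is only a best-case lower bound: it assumes purely sequential I/O at the full quoted rate with nothing else running, which a real nightly cycle will not achieve. The function name `ideal_minutes` is hypothetical.

```python
def ideal_minutes(megabytes: float, rate_mb_per_s: float) -> float:
    """Best-case duration in minutes if the whole transfer ran at the given rate."""
    return megabytes / rate_mb_per_s / 60


transferred_mb = 297747          # job size quoted in the original question
observed_minutes = 7 * 60 + 41   # the job actually took 7hrs 41mins

# Sequential rates quoted in the post above.
for label, rate in (("host SSD reads", 700), ("backup storage writes", 300)):
    ideal = ideal_minutes(transferred_mb, rate)
    print(f"{label} at {rate}MB/s: ~{ideal:.1f} min ideal, "
          f"observed job was {observed_minutes / ideal:.0f}x longer")
```

Even at the slower 300MB/s write rate the ideal time is on the order of a quarter of an hour, so raw sequential bandwidth alone does not account for a job of almost eight hours.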

    I've looked into DPM logs, but they don't make it easy to find progress of given backup job (if the information is even there). What also doesn't help is that DPM doesn't give you any indication of the job progress, but I don't want to derail this topic right away myself.

    Do you know if the DPM logs might contain any clues or human-readable information on what it is actually doing this long?

    Monday, September 3, 2018 12:47 PM
  • The DPM logs are unfortunately not very detailed, at least for this matter.

    What kind of storage is the backup storage? 

    There's also a possibility that the virtual NIC of the VM is overwhelmed. Have you monitored the NICs of the DPM server and the virtual NIC(s) of the problematic virtual machine?


    Also do you have any antivirus that could be slowing down the backups?



    Monday, September 3, 2018 1:09 PM
  • Backup storage is an EqualLogic group of 3 SANs. The DPM server is a VM, so it has 2 iSCSI vNICs utilizing MPIO.

    The job is currently the only one running on the DPM server. What I see is up to 500Mbit/s reads (receive), averaging about 300Mbit/s, on the LAN NIC (reading data from the host) and roughly the same write (send) rate on the iSCSI NIC.

    The DPM vNICs (LAN/iSCSI) are on a 2x10Gbit vSwitch. The vNICs are QoSed, but the host's pNICs are not under any significant load and the QoS is high enough (like 4Gbit/s max), so that shouldn't be an issue.

    So what I see is that none of the infrastructure is overloaded, yet the job runs slow.
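For comparison with the MB/s figures used elsewhere in the thread, the NIC rates above can be converted with a short Python sketch. It assumes 8 bits per byte and ignores protocol overhead, and it treats the "like 4Gbit/s" QoS limit as exactly 4000Mbit/s, which is only an approximation of what was described.

```python
def mbit_to_mbyte(mbit_per_s: float) -> float:
    """Convert Mbit/s to MB/s (8 bits per byte, protocol overhead ignored)."""
    return mbit_per_s / 8


observed = {"average": 300, "peak": 500}   # Mbit/s seen on the LAN NIC
qos_cap_mbit = 4000                        # assumption: "like 4Gbit/s max" taken literally

for label, mbit in observed.items():
    share = mbit / qos_cap_mbit
    print(f"{label}: {mbit}Mbit/s = {mbit_to_mbyte(mbit):.1f}MB/s "
          f"({share:.1%} of the QoS cap)")
```

That works out to 37.5MB/s on average and 62.5MB/s at peak, both well under the assumed QoS cap, which is consistent with the observation that the network is not saturated.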

    The DPM server has the Hyper-V role installed and runs the default Windows Defender, so the automatic .VHDX exclusions are in place. The Hyper-V hosts also run Windows Defender.

    The vNIC of the problematic VM should play no role in this, right? We're talking about a host-level (block) backup of the VM; there's no DPM agent in the VM.

    Monday, September 3, 2018 1:28 PM
  • True, the vNIC plays no role here, got lost in my own thoughts there :-)

    Can you tell us how long this problem has existed? Also, have you installed the latest update rollup to see whether it helps?



    Monday, September 3, 2018 2:51 PM
  • Just to be sure, I've updated DPM to UR5 and updated the OS. It seems to have no impact on the job performance.

    For comparison, I've collected elapsed times and transferred data from jobs for the problematic VM and for another VM (different host and PG, but even larger: 4400GB, of which ~1800GB is allocated/used). I took a sample of the last ~30 jobs and here's the average backup performance (note that I avoid saying transfer rate performance):

    Problematic VM: 14.7 MB/s

    Other VM: 40.9 MB/s

    The daily churn for this other VM also varies a lot, but the jobs with big changes perform a lot better than the average (140-180MB/s; for example, 2.6TB transferred in a job that took 4hrs 3mins), which is sky-high compared to the ~13MB/s of the problematic VM.
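As a quick check of this comparison, the following Python sketch computes the ratio between the two averages and the rate of the 2.6TB job. It assumes decimal units (1TB = 1,000,000MB), which the post does not state; the variable names are illustrative only.

```python
# Averages over the last ~30 jobs, as reported above (MB/s).
problematic_avg = 14.7
other_avg = 40.9

ratio = other_avg / problematic_avg
print(f"other VM is {ratio:.2f}x faster on average")

# The large job quoted above: 2.6TB in 4hrs 3mins.
# Assumption: decimal units, 1TB = 1,000,000MB.
big_job_mb = 2.6 * 1_000_000
big_job_seconds = 4 * 3600 + 3 * 60
big_job_rate = big_job_mb / big_job_seconds
print(f"2.6TB job: {big_job_rate:.1f} MB/s")
```

With decimal units the large job lands at about 178MB/s, inside the quoted 140-180MB/s range, while the averages differ by a factor of about 2.8.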

    Of course, the backups run at different times for these 2 VMs, but since the problematic VM's backup usually overlaps all other backups and is usually the only job still running at the end, there's nothing that could hinder it at those times, so I'm fairly certain this is not a disk/network bottleneck.

    I've also tried moving the VM to another host; the backup performance didn't change.

    Wednesday, September 5, 2018 12:07 PM
  • I cannot really come up with anything other than opening a ticket with Microsoft about this issue (if you haven't already).


    Monday, September 10, 2018 9:45 PM
  • Thanks anyway for your input, Leon.

    We're currently doing some tests using another (new) DPM server and trying to diagnose if it's a storage issue or not (switching between local SSD storage and EQL SAN as backup targets).

    Wednesday, September 12, 2018 9:21 AM
  • I would be very interested to hear if you do find a solution to your problem; it would also help the community!

    Good luck!



    Wednesday, September 12, 2018 9:23 AM