Live Migration Fails at 3% - SOLVED

  • Question

  • We recently set up a Hyper-V failover cluster using two Dell R610 servers and a direct-attached (SAS) Dell MD3200 storage array. The servers are identical: Xeon E5630 processors, 64 GB of RAM. The hosts are running Windows Hyper-V 2016 (so we are heavily PowerShell dependent). There are two network cards, one integrated and one add-on. Four NICs (two from each card) are teamed for production network traffic, and one NIC on each server's integrated card is dedicated to Live Migration traffic. The Live Migration subnet is on its own physically separate 1 Gb network hardware (we are not 10 Gb capable). We are running four production VMs: DC, Appserver, Backup Server, File Server. We also have some test VMs.

    When we first set the system up, Live Migration ran perfectly. We started having issues once we put load on the VMs. The exact issue is that when we attempt Live Migration of the production servers, they fail at 3%. We receive a timeout error message (see below). Every now and then the DC, Appserver, and Backup Server will Live Migrate, but never the File Server (with a 3 TB VHD). The inconsistency is killing us. The test servers, with no real load, migrate fine.

    We have plenty of available hardware resources, and have even dropped the specs of the VMs to the lowest possible (1 GB RAM and 1 processor), but they still time out. We use Failover Cluster Manager and Hyper-V Manager to administer the cluster; we are not using SCVMM.

    Error Message:

    Live migration of 'Virtual Machine PRGFILESHARE' failed.

     

    Virtual machine migration operation for 'PRGFILESHARE' failed at migration source 'PRGHYPERV1'. (Virtual machine ID F6F0B8CA-D100-4A7C-8115-AC09FC47125A)

     

    Planned virtual machine creation failed for virtual machine 'PRGFILESHARE': This operation returned because the timeout period expired. (0x800705B4). (Virtual Machine ID F6F0B8CA-D100-4A7C-8115-AC09FC47125A).

     

    Failed to receive data for a Virtual Machine migration: This operation returned because the timeout period expired. (0x800705B4).
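
    For readers decoding the error code: the sketch below assumes the standard HRESULT bit layout (failure bit, 11-bit facility, 16-bit code). The helper name decode_hresult is hypothetical. It suggests that 0x800705B4 wraps Win32 error 1460 (ERROR_TIMEOUT), which matches the "timeout period expired" text. That only confirms a generic timeout; it does not point to a cause.

```python
def decode_hresult(hresult: int) -> dict:
    """Split a 32-bit HRESULT into its failure bit, facility, and code."""
    return {
        "failure": bool((hresult >> 31) & 0x1),  # top bit set means failure
        "facility": (hresult >> 16) & 0x7FF,     # 11-bit facility; 7 is FACILITY_WIN32
        "code": hresult & 0xFFFF,                # low 16 bits hold the wrapped error code
    }


# The code reported in the migration error above.
fields = decode_hresult(0x800705B4)
print(fields)
```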

    • Edited by Brad Fitz Thursday, August 3, 2017 9:08 PM Solved
    Thursday, July 20, 2017 7:11 PM

Answers

  • RESOLUTION - Delete Checkpoints

    It appears that the checkpoints we were taking daily on this server (and others) were causing the live migration issues. We had a script to take a checkpoint (formerly called a snapshot) every morning between 7am and 8am and keep each one for 5 days. During our troubleshooting we deleted all of the existing checkpoints, took a fresh one, and ran the live migration on the server that wasn't working. It worked perfectly. For servers that were successfully live migrating, but at a slow rate, this trick increased their live migration speed dramatically. While we're happy that everything works, we don't understand why these checkpoints were causing the issue. We never tried to migrate while a checkpoint was being created, and the checkpoint files were stored on the cluster shared volume, not the host servers. If our understanding of live migration is correct, only the CPU and RAM information is being copied over. I have read that a shallow copy of the VM is copied over, but I don't see how a static snapshot file would factor in.
    • Proposed as answer by Nedim Mehic Friday, August 4, 2017 6:47 AM
    • Marked as answer by Brad Fitz Friday, August 4, 2017 3:27 PM
    Thursday, August 3, 2017 9:08 PM
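
    The daily checkpoint script itself is not shown in the thread. As a rough illustration of the "keep for 5 days" rule described above, here is a minimal sketch of the retention logic only. The function name, the sample timestamps, and the 07:30 schedule are assumptions for illustration. On a real host the listing and removal would go through the Hyper-V PowerShell cmdlets (Get-VMSnapshot, Remove-VMSnapshot), which this sketch does not call.

```python
from datetime import datetime, timedelta


def checkpoints_to_delete(created_times, now, keep_days=5):
    """Return the creation times of checkpoints older than the retention window."""
    cutoff = now - timedelta(days=keep_days)
    return [t for t in created_times if t < cutoff]


# Hypothetical data: one checkpoint at 07:30 on each of the last eight mornings.
now = datetime(2017, 8, 3, 8, 0)
daily = [datetime(2017, 8, 3, 7, 30) - timedelta(days=d) for d in range(8)]

stale = checkpoints_to_delete(daily, now)
print(f"{len(stale)} of {len(daily)} checkpoints fall outside the 5-day window")
```

    With a rule like this, several checkpoints exist at any given time. That is the state the posters cleared when they deleted everything and took a single fresh checkpoint.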

All replies

  • Hi,

    Can you check that the authentication setting under Hyper-V Settings -> Live Migrations -> Advanced Features -> Authentication protocol is the same on both servers?

    Do you have any backups running that might be causing the issue? If so, try stopping the backup service.

    Friday, July 21, 2017 12:43 AM
  • Hi,

    What authentication type are the source and destination servers using?

    If it is Kerberos authentication, then try changing the delegation setting to "Use any authentication protocol" under "Trust this computer for delegation to specified services only".

    Hope this helps!!!

    Regards,

    Bala 

    • Proposed as answer by _Namor_ Thursday, March 15, 2018 6:31 AM
    Friday, July 21, 2017 12:32 PM
  • Thanks Michael and Bala.

    Authentication is set to use Kerberos under the advanced features of the live migration settings. We have tried both "Use Kerberos only" and "Use any authentication protocol" in the Delegation tab of the server properties in AD. Both servers have cifs and Microsoft Virtual System Migration Service delegations set up for each other. (We followed this TechNet article on setting up Kerberos constrained delegation: https://blogs.technet.microsoft.com/matthts/2012/06/10/configuring-kerberos-constrained-delegation-for-hyper-v-management/)

    We don't think it's the authentication because, as we mentioned, Live Migration still works sometimes. It is "fast" for machines with no load, and inconsistent for machines with light to moderate load (it will either take minutes or just time out).

    The main issue is the file server, which will not Live Migrate at all. It is currently set up for DFS replication, which only runs at night. It has no backup software on it. Essentially it is idle during the day but still won't Live Migrate. It has a 3 TB VHD with 1 TB used. We are not Live Migrating the storage, so we can't figure out why this machine consistently times out. From what we've read, the only thing migrating is the RAM and CPU cache.

    We are thinking of NIC teaming an additional port to raise the bandwidth to 2 Gb.

    - Brad & Nate

    Friday, July 21, 2017 4:00 PM
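
    On the bandwidth idea in the post above, a back-of-the-envelope sketch may help. It computes only a lower bound on the time to copy a VM's RAM across the dedicated link. It assumes the full nominal line rate and ignores protocol overhead and memory pages dirtied during the copy, so real transfers will be slower. The function name is hypothetical.

```python
def min_transfer_seconds(memory_gib: float, link_gbps: float) -> float:
    """Lower bound on the time to send memory_gib of RAM over a link_gbps link."""
    memory_bits = memory_gib * 1024**3 * 8   # GiB -> bits
    return memory_bits / (link_gbps * 1e9)   # nominal line rate, no overhead


# Compare the existing 1 Gb link with the proposed 2 Gb team for a 1 GiB VM.
for link in (1.0, 2.0):
    seconds = min_transfer_seconds(1, link)
    print(f"{link:.0f} Gb/s link: at least {seconds:.1f} s for 1 GiB of RAM")
```

    Under these assumptions, a VM cut down to 1 GB of RAM needs only seconds of raw transfer time on the existing link. That suggests extra bandwidth alone is unlikely to explain a timeout at 3%.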
  • Hi,

    Will it work when the server is shut down? Anything in the event log?

    It should not time out unless there's a network outage or a space issue at the destination.

    Tuesday, July 25, 2017 4:11 AM
  • When the server is shut down, failover clustering only allows Quick Migration, which works, but it's Live Migration we want to be utilizing. As far as the event logs go, all we are getting is the errors stated in the original post.
    Wednesday, July 26, 2017 1:07 PM
  • Hi Brad,

    This issue may also occur if there are expired certificates in the machines' certificate store. Open mmc and add the Certificates snap-in for the local machine (I assume you are logged on to the problematic machine; you can also add another computer here).

    Expand the Personal store and click Certificates. In the certificates pane, you can check whether any of the certificates has expired.

    Regards,

    Bala

    Thursday, July 27, 2017 5:35 AM
  • Bala, thank you for your answer.

    I had two Hyper-V hosts in a cluster and they were renamed. After the servers were renamed, live migrations stopped working.


    Felippe

    Thursday, November 1, 2018 8:32 PM
  • It's great that you found a solution, but honestly this solution is terrible. When dealing with a pooled VDI collection that is set to automatically roll back on sign-out, a checkpoint is mandatory. It is also automatically created. This isn't a solution; it is a band-aid for the real issue. We should be able to migrate with snapshots/checkpoints. I know this is asking a lot of Microsoft, but it is something that needed to be addressed before they started offering Hyper-V VDI.
    Wednesday, December 12, 2018 11:41 PM