Migrate VM from 2012R2 Cluster to 2016 Cluster causes outage

    Question

  • Good Afternoon:

    We are replacing a 2012 R2 cluster and migrating to new hardware running 2016.  These clusters will host about a dozen nodes and a few hundred virtual machines.

    The hosts have four 10Gb ports (two teamed into a switch for the VMs, and two for storage using MPIO).  Storage is on an SMB3 share hosted on a 2016 Scale-Out File Server.  VMM 2016 manages both clusters in the same instance.

    When moving the VM workload from the 2012 R2 cluster to the 2016 cluster, the host reports the following errors:

    '<vmname>': Virtual hard disk '<pathTo.vhdx>' has detected a recoverable error. Current status: Disconnected. (Virtual machine ID <vmid>)

    '<vmname>': Virtual hard disk '<pathTo.vhdx>' received a resiliency status notification. Current status: Permanent Failure. (Virtual machine ID <vmid>)

    '<vmname>': Virtual hard disk resiliency failed to recover the drive '<pathTo.vhdx>'. The virtual machine will be powered off. Current status: Permanent Failure. (Virtual machine ID <vmid>)

    '<vmname>' was paused for critical error. (Virtual machine ID <vmid>)
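
    With a few hundred VMs, it can help to group these messages per VM to see which disks only disconnected and which reached a permanent failure. The following is a minimal Python sketch, assuming the event text has already been exported from the host as one message per string; the function names and the sample VM names in the usage note are hypothetical, and the pattern only covers the message shapes shown above.

```python
import re
from collections import defaultdict

# Matches messages of the form "'<vmname>': ... Current status: <status>."
# Messages without a "Current status" fragment (e.g. the "paused" one) are skipped.
STATUS_RE = re.compile(r"^'(?P<vm>[^']+)':.*?Current status: (?P<status>[^.]+)\.")


def statuses_by_vm(messages):
    """Return {vm_name: [status, ...]} in the order the messages appear."""
    result = defaultdict(list)
    for msg in messages:
        match = STATUS_RE.search(msg)
        if match:
            result[match.group("vm")].append(match.group("status"))
    return dict(result)


def hit_permanent_failure(messages):
    """Return the sorted names of VMs whose disk reached 'Permanent Failure'."""
    return sorted(
        vm for vm, seen in statuses_by_vm(messages).items()
        if "Permanent Failure" in seen
    )
```

    Calling hit_permanent_failure() on the exported messages gives the list of VMs that were actually powered off, as opposed to those that only logged a recoverable disconnect.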

    The machine ends up powered off.  After a few moments, you can just power the machine on and all appears to be well.  The event logs on the guest look as if the power was pulled from the machine and it experienced an improper shutdown.

    Once the machine is on the 2016 cluster, it can move between the hosts all day long with no problems.  It's only the initial migration from the 2012 R2 to the 2016 cluster.  The disk and network are the same -- the only thing that is migrating is the workload.

    Any thoughts?

    Wednesday, March 08, 2017 8:56 PM

All replies

  • Hi Sir,

    Have you tried migrating a problematic VM to other nodes of the 2012 R2 cluster?  Do you get the same error?

    Have you tried taking the VM offline and then migrating it?  Does that offline (quick) migration succeed?

     

    Best Regards,

    Elton



    Thursday, March 09, 2017 10:18 AM
    Moderator
  • It's not one VM -- it's any VM from the 2012 R2 cluster moving to the 2016 cluster.  It moves successfully, but loses its disk and the host powers the VM off.  It even runs for a few seconds on the new cluster without noticeable issues, then the machine just powers off.  All I need to do to complete the move is to turn the VM back on.

    I can move the VMs within the 2012 R2 cluster to any node without issues.  It's just that initial move to the 2016 cluster that has problems.

    Moving them back to the 2012 R2 cluster doesn't seem to have a problem.  And it's strange, but it seems that moving them a second time to the 2016 cluster also doesn't have the problem.  It's almost like the first move to a new cluster causes the issue.

    I have not tried moving a machine that is off.  This is an HA production environment and I can't really take the downtime for any of these machines.  I can spin up some test VMs if I have to, but I'm sure the transition to the new cluster would be fine since the move does actually complete.  The host just thinks the guest lost the disk (I'm not convinced it did, as there isn't any evidence of that in the guest event logs -- though, I know, if it did lose the disk it would be hard to log that in the event logs!).

    Thursday, March 09, 2017 3:00 PM
  • I've discovered some more information.  The outage only occurs when the storage is migrated.  This is a cluster-to-cluster live migration using SMB3 and the storage is available to both hosts -- it shouldn't need to migrate the storage, but it does.  The disk ends up in the same place.  I didn't realize this until I migrated a VM with a 260GB disk and it took a long time.

    Friday, March 10, 2017 6:35 PM
  • "The hosts have 4 10GB ports (2 teamed into a switch for the VMS, and two for storage using MPIO ).  Storage is on an SMB3 "

    How did you configure MPIO for SMB storage?  SMB does not support MPIO.  It does use multi-channel, though.

    "The outage only occurs when the storage is migrated.  This is a Cluster > Cluster live migration using SMB3 and the storage is available to both hosts -- it shouldn't need to migrate the storage"

    It will not move storage when it does a live migration between nodes of a cluster.  When it is going outside the cluster, the assumption is that storage has to move.  I understand your point that it should not need to, and it would not if you simply stopped it on the original cluster and restarted it on the second, but since the live migration is determining that it is moving off the cluster, it makes the assumption that it has to move the storage, too. 

    If you must perform a live migration instead of briefly turning the VM off on the original and starting it on the second, you could create a mixed mode 2012 R2 and 2016 cluster using the common shared storage.  Then a live migration from any 2012 R2 node to any 2016 node should not have any issue.  The mixed cluster is a new feature of 2016 and is meant for migration purposes.  You should not leave it in the mixed state for more than maybe a month.


    . : | : . : | : . tim

    Friday, March 10, 2017 8:40 PM
  • You're right -- it's not MPIO.

    I've discovered the problem, though -- the new cluster had some networking problems.  After resolving these issues, migrations between the clusters are completing without issues.

    Thursday, March 16, 2017 12:51 AM
  • Hi Joe,

    Could you elaborate on the "new cluster had some networking problems"? I am currently working on a VERY similar issue (VMM 2016, SMB shares, inter-cluster migrations fail, intra-cluster migrations do not fail). Any details are greatly appreciated.

    Thanks!

    Wednesday, May 03, 2017 2:32 AM
  • "Could you elaborate on the 'new cluster had some networking problems'? ... Any details are greatly appreciated."

    The problem I found had to do with Switch Independent Teaming.   It seemed to have problems in our environment, so I worked with our network team and had the switch ports teamed on the switch instead of relying on the OS to do it.

    You might want to check the MTU as well -- make sure everything is the same.  I found some mismatched hosts (old servers didn't have jumbo frames enabled, new ones did).  Worth checking!  Don't forget to look at your file servers, too.
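
    As a rough illustration of the MTU check, here is a minimal Python sketch. It assumes you have already collected each host's configured MTU by whatever means you prefer; the host names in the usage note are made up, and "most common value" is only a heuristic for spotting the odd ones out, not a statement of which value is correct. The ping flags are the standard Windows ones (-f sets Don't Fragment, -l sets the payload size).

```python
from collections import Counter

IP_HEADER = 20    # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def mtu_outliers(mtu_by_host):
    """Return {host: mtu} for hosts whose MTU differs from the most common value."""
    if not mtu_by_host:
        return {}
    expected, _count = Counter(mtu_by_host.values()).most_common(1)[0]
    return {host: mtu for host, mtu in mtu_by_host.items() if mtu != expected}


def df_ping_command(target, mtu):
    """Build a Windows ping command that sends the largest unfragmented payload for this MTU."""
    payload = mtu - IP_HEADER - ICMP_HEADER
    return f"ping -f -l {payload} {target}"
```

    For example, mtu_outliers({"hv16-01": 9000, "hv16-02": 9000, "sofs-01": 9000, "hv12-01": 1500}) flags only hv12-01, and df_ping_command("sofs-01", 9000) gives a ping with an 8972-byte payload.  If that ping fails between two hosts while a 1472-byte one works, something on the path is not passing jumbo frames.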

    Wednesday, May 03, 2017 5:28 PM