Live migrations fail during drain from Cluster-Aware Updating

  • Question

  • We're trying to implement Cluster-Aware Updating, but we keep running into issues where virtual machines fail to live migrate during the drain.

    Our clusters have plenty of memory; frequently two nodes of a six-node cluster are completely devoid of roles. When we kick off CAU, it patches and reboots the nodes with no roles first and then moves on to the others. While draining one of the remaining nodes, it kicks off live migrations (there are no low-priority roles). Since our maximum simultaneous migration value is 2, we continually get 21501 warnings as it works through the list. Towards the end, and only occasionally, the last few migrations fail with a 21502 due to not enough memory. This then hangs the drain until someone intervenes manually.

    21501
    Live migration of 'SCVMM BRMWD-SPDEV02' failed.
    Virtual machine migration operation for 'BRMWD-SPDEV02' failed at migration destination 'BRMWD-HYPV02'. (Virtual machine ID 2A4EC899-079C-4355-A503-F097FAF33E2B)
    Failed to perform migration on virtual machine 'BRMWD-SPDEV02' because virtual machine migration limit '2' was reached, please wait for completion of an ongoing migration operation. (Virtual machine ID 2A4EC899-079C-4355-A503-F097FAF33E2B)

    21502
    Live migration of 'Virtual Machine BRMWT-FE01' failed.
    Virtual machine migration operation for 'BRMWT-FE01' failed at migration destination 'BRMWD-HYPV02'. (Virtual machine ID 385026E5-7B2F-46EA-ADFE-EF854F76A4FE)
    'BRMWT-FE01' could not initialize. (Virtual machine ID 385026E5-7B2F-46EA-ADFE-EF854F76A4FE)
    Not enough memory in the system to start the virtual machine BRMWT-FE01 with ram size 2048 megabytes. (Virtual machine ID 385026E5-7B2F-46EA-ADFE-EF854F76A4FE)

    I know we could likely just increase the number of simultaneous live migrations to get around this, or even assign all VMs preferred owners to keep the cluster more balanced (a rough sketch of the former is below). I have no hard evidence for this, but it seems like when a CAU drain is initiated it picks a static host to move all VMs to, rather than choosing the best possible node for each migration.

    Can someone confirm whether this is accurate, or whether there is any way of changing it?
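
    For reference, bumping the simultaneous live migration limit would just be a per-host Hyper-V setting. A rough, untested sketch of what I mean (the value of 4 is only an example, not something we've settled on):

    # Check, then raise, the simultaneous live migration limit on every node.
    foreach ($node in Get-ClusterNode) {
        Get-VMHost -ComputerName $node.Name |
            Select-Object ComputerName, MaximumVirtualMachineMigrations
        Set-VMHost -ComputerName $node.Name -MaximumVirtualMachineMigrations 4
    }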


    Wednesday, September 28, 2016 3:19 PM

All replies


  • 21501
    Live migration of 'SCVMM BRMWD-SPDEV02' failed.
    Virtual machine migration operation for 'BRMWD-SPDEV02' failed at migration destination 'BRMWD-HYPV02'. (Virtual machine ID 2A4EC899-079C-4355-A503-F097FAF33E2B)
    Failed to perform migration on virtual machine 'BRMWD-SPDEV02' because virtual machine migration limit '2' was reached, please wait for completion of an ongoing migration operation. (Virtual machine ID 2A4EC899-079C-4355-A503-F097FAF33E2B)

    Hi,

    Did you try to increase the timeout for migration jobs?

    https://support.microsoft.com/en-us/kb/2790310

    FrenchITGuy.com

    Wednesday, September 28, 2016 3:35 PM
  • I'm not using VMM Fabric Patching; I'm using good ol' Failover Clustering with CAU. I've looked everywhere for something similar to LiveMigrationQueueTimeoutSecs but wasn't able to find a matching registry value for FCM.
    Wednesday, September 28, 2016 8:56 PM
  • Hi Benpross,

    >>Can someone confirm for me if this is accurate or if there is any way of changing this

    As far as I know, CAU is an automated process and we cannot change its migration behavior manually.

    What is the result if you live migrate the VMs manually?

    Also, is there any related information in the cluster validation report?

    Best Regards,

    Leo


    Please remember to mark the replies as answers if they help and unmark them if they provide no help. If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    Thursday, September 29, 2016 2:30 AM
    Moderator
  • When the migrations fail and the drain process pauses, if we manually live migrate the failed VMs using Best Possible Node, they move without a single issue and the drain process completes. When I look at the host that failed the migration, it is indeed out of memory, or usually close to it.

    I had thought about creating a disabled scheduled task that is only enabled when CAU kicks off (via a pre-patching script). It would watch for Event ID 21502 and, when it occurs, run Move-ClusterVirtualMachineRole (roughly along the lines of the sketch at the end of this post), but this seems like a kludgy Band-Aid type fix.

    Cluster validation shows all green and happy. The only information given on the failed migrations is the 21501/21502 events.
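
    Something like this is what I had in mind for the band-aid. It assumes the 21502s land in the Microsoft-Windows-Hyper-V-VMMS-Admin log and that the quoted name in the message matches the cluster group name, so it would need verifying before being trusted:

    # Find recent failed live migrations and retry them, letting the cluster
    # pick the destination node this time instead of the one CAU chose.
    $failed = Get-WinEvent -FilterHashtable @{
        LogName   = 'Microsoft-Windows-Hyper-V-VMMS-Admin'
        Id        = 21502
        StartTime = (Get-Date).AddMinutes(-5)
    } -ErrorAction SilentlyContinue

    foreach ($evt in $failed) {
        # Message format: "Live migration of 'BRMWT-FE01' failed."
        if ($evt.Message -match "Live migration of '([^']+)' failed") {
            Move-ClusterVirtualMachineRole -Name $Matches[1] -MigrationType Live
        }
    }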


    Thursday, September 29, 2016 1:50 PM
  • Hi Benpross,

    >>When I look at the host that had failed the migration, it is indeed out of memory or usually close to it.

    If there is not enough memory on the host, have you tried assigning less memory to the VM?

    Best Regards,

    Leo



    Monday, October 3, 2016 1:47 AM
    Moderator
  • There was an entire unused node with 144 GB of memory on it. Why was it trying to move all roles to this other node, eventually filling it up?
    Monday, October 3, 2016 12:59 PM
  • The cluster simply performs a round-robin update to ensure every node gets updated.  You have to ensure that a node can take the load that will eventually end up on it.  Even if the live migrations had gone to the node with 144 GB of memory, CAU will eventually come back around to the server that is giving you the issue.

    There is the ability to run pre- and post-update scripts.  If there are some VMs that you can turn off (so they are not using any memory), you could do that, or you could move them to another host.  But it sounds like your configuration might be a little undersized for what you are trying to do.
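
    A rough sketch of wiring that in through CAU's hooks (the cluster name, script paths, and the '*DEV*' name filter are placeholders, not a tested recipe):

    # Run a pre-update script before each node is drained and a post-update
    # script after it finishes patching.
    Invoke-CauRun -ClusterName 'YourCluster' `
        -PreUpdateScript  'C:\CAU\Stop-LowPriorityVMs.ps1' `
        -PostUpdateScript 'C:\CAU\Start-LowPriorityVMs.ps1' `
        -MaxFailedNodes 1 -RequireAllNodesOnline -Force

    # Stop-LowPriorityVMs.ps1 (example): take dev/test roles offline so they
    # are not holding memory during the drain.
    Get-ClusterGroup | Where-Object { $_.Name -like '*DEV*' } | Stop-ClusterGroup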


    . : | : . : | : . tim


    Monday, October 3, 2016 8:07 PM
  • I thought CAU worked through the nodes in order of fewest roles.

    I'm not understanding how it can be undersized, though, or how shrinking VM memory is going to solve any of this. I have two servers without a single role on them, and every node in the cluster has 144 GB of memory. It's not migrating VMs using Best Possible Node; it seems to be migrating in some other fashion.

    I have plenty of resources to spread all roles across the cluster while leaving a full node (even two, depending) completely empty, but not enough resources for all roles to sit on a single node. Telling me to shrink the VMs or add memory to the hosts isn't addressing the issue. I'm trying to figure out why it doesn't seem to be migrating them using Best Possible Node.

    I have thought about pre/post scripts to do some moving, but I wouldn't think you should have to re-balance your cluster between patching each node.


    Wednesday, October 5, 2016 6:58 PM
  • Hi Benpross,

    >>When I look at the host that had failed the migration, it is indeed out of memory or usually close to it.

    That's why we suggest you try to assign less memory.

    Best Regards,

    Leo



    Monday, October 10, 2016 6:16 AM
    Moderator
  • But why would I want to reduce the memory assigned to VMs when there are hosts that still have plenty of resources available to take more VMs?
    Monday, October 10, 2016 1:37 PM
  • "I'm trying to figure out why it doesn't seem to be migrating them using Best Possible Node."

    Because it isn't.  The base capability of CAU does not handle that.  If you want more intelligence, you use something like SCVMM.  It's not the first time something has been implemented in SCVMM that is not in the base product.

     

    . : | : . : | : . tim

    Monday, October 10, 2016 3:19 PM
  • Well, I tested out SCVMM and the same thing happened. I'll just learn to live with it, or wait until the next SCCM version comes out that hopefully handles cluster patching better. Thanks everyone for the thoughts.
    Friday, October 14, 2016 7:35 PM