1808 Cluster AutoBalancer event invoked when host resources not taxed

  • Question

  • Hello,

    I have a two-node Hyper-V HA cluster running on Server 2016.  This is in a test environment with very little load on it now.  There are a total of 5 VMs (all running server 2016), with three (DB1, DB2, and App) on node1 and two on node2.  I have not set any preferred nodes for the VMs, and Balancer settings are default for the cluster.  The three VMs on node1 communicate with each other most, which is why they are together.  The two hosts are identical with 16 cores (2 sockets) and 64GB of RAM each.  Hyperthreading is disabled and virtualization is enabled in the BIOS.  Each VM in the cluster has 1 vCPU and 4GB of RAM, with the exception of DB1, which has 2 vCPUs.

    The cluster node assignments have been in this configuration for a while, working fine.  I've always been able to live migrate them between the nodes without issue when needed for maintenance, but they are always put back on their respective nodes for normal operation.

    A couple days ago, McAfee on App went nuts and consumed a lot of CPU for a sustained period.  This appears to have been the catalyst for the 1808 event and the cluster moving both that server and DB2 to the other node.  We were able to quiet down McAfee later so it wouldn't consume so much CPU on App, but didn't see anything on DB2 that would increase utilization.

Knowing all that, my question is: why would the cluster move a server from one node to the other when the physical resources of the original node were not even close to hitting any limits?  I could have had all five VMs on one node, for a total of 6 vCPUs and 20GB of RAM, and even if all the VMs combined were using all of their virtual resources, that would still be much lower than what the physical host has.  So, to ask my question another way: shouldn't an AutoBalancer event be triggered by collective host utilization, and not by the utilization of the individual VMs per se?

Please explain why this VM moved on its own when host resources were not taxed.

    Thank You


    Monday, June 29, 2020 9:00 PM

All replies

  • Hi beverlyvg,

Generally, a VM will fail over to another node when it fails to come online on the current node.

Please open Event Viewer and check "Windows Logs > System" and "Applications and Services Logs > Microsoft > Windows > FailoverClustering" for entries related to the VM failover. If there are any warnings or errors, please provide them for analysis.
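The same check can be scripted with Get-WinEvent (a sketch; run on each cluster node, and note the operational log name matches what Event Viewer shows):

```powershell
# List recent warnings (level 3) and errors (level 2) from the
# FailoverClustering operational log.
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Windows-FailoverClustering/Operational'
    Level   = 2, 3
} -MaxEvents 50

# List any 1808 load-balancer events logged to the System log.
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    Id           = 1808
}
```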

    Thanks for your time!

    Best Regards,

    Anne 


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    Tuesday, June 30, 2020 7:04 AM
I agree with Anne that you need to check something else about the VMs.  You say that the App VM is configured with 1 vCPU and 4 GB of RAM.  On a host with 16 cores and 64 GB of physical memory, even if App were running at 100% of its vCPU, that would still be only 1/16 of the host's CPU capacity.

    tim

    Tuesday, June 30, 2020 1:53 PM
  • There are no warnings or errors in the FailoverClustering event log in either node of the cluster. As noted, everything has been running fine on the cluster until these VMs moved over. The only entry in the log of note is the "Informational" entry with ID 1808. It reads: "Cluster load balancer has identified node 'Node1' is exceeding CPU or memory usage threshold. Cluster group 'DB2' will be moved to node 'Node2' to balance the cluster." There is an identical entry for server "App", which we know had high vCPU utilization. I'm wondering where these thresholds are configured and why the cluster is triggering them based on VM utilization as opposed to Host utilization.
    Tuesday, June 30, 2020 2:52 PM
  • Thank you for the articles, but they are not much of a "deep dive". There really isn't much in those at all. There is certainly nothing there to help identify why the 1808 event occurred in my environment.  Anything else you can find that has more detail?

    Thank You,

    Vinnie

    Wednesday, July 1, 2020 2:15 PM
  • Hi beverlyvg,

    > I'm wondering where these thresholds are configured and why the cluster is triggering them based on VM utilization as opposed to Host utilization.

VM Load Balancing is enabled by default, and when balancing occurs is controlled by the cluster common property 'AutoBalancerMode'. To configure when Node Fairness balances the cluster:

Using Failover Cluster Manager:
Right-click on your cluster name and select the "Properties" option.
Select the "Balancer" pane.

    Please check the following article about the Virtual Machine Load Balancing deep-dive:

    https://docs.microsoft.com/en-us/windows-server/failover-clustering/vm-load-balancing-deep-dive
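The same properties can also be queried and set from PowerShell on a cluster node (a sketch; the values shown are the defaults documented in the deep-dive article):

```powershell
# Inspect the current balancer settings (requires the FailoverClusters module).
(Get-Cluster).AutoBalancerMode    # 0 = disabled, 1 = balance on node join,
                                  # 2 = on node join and every 30 minutes (default)
(Get-Cluster).AutoBalancerLevel   # 1 = Low (move when a host exceeds 80% load, default),
                                  # 2 = Medium (70%), 3 = High (60%)

# Example: keep balancing enabled but make it less frequent than the default.
(Get-Cluster).AutoBalancerMode  = 1   # only balance when a node joins
(Get-Cluster).AutoBalancerLevel = 1   # Low aggressiveness (80% threshold)
```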

    Thanks for your time!

    Best Regards,

    Anne




    Thursday, July 2, 2020 2:07 AM
  • "There is certainly nothing there to help identify why the 1808 event occurred in my environment."

    You already identified why the event occurred.  In your problem description, you state "McAfee on App went nuts and consumed a lot of CPU for a sustained period."  Use of a lot of CPU for a sustained period of time is one reason for the load balancer to kick in.

    The articles tell you how to configure your system for different reactions from the balancer.

One thing I would consider for all your VMs is how many virtual CPUs you have defined for each.  Windows runs better with a minimum of 2 CPUs.  You have only one VM configured with 2, and since that appears to be a DB VM, it may be aided by having more.  That's a determination you need to make.  But on the other VMs with one vCPU, that CPU will get hit harder because the OS has only one place to schedule work.  Windows runs lots of processes simultaneously to run the OS, and I have found that it is much more efficient when it has at least two CPUs on which to schedule those processes.  For many VMs, two is all that are needed.  If you are running multi-threaded applications like a database server, and there is a reasonable load on it, then that VM would benefit from more CPUs.

You state your hosts have 16 cores.  That means you can define up to 16 vCPUs in each VM running on the host.  The sum of the vCPUs assigned to the VMs does not have to be less than the number of physical cores on the host.  16 vCPUs are not going to help a domain controller, but a domain controller definitely runs better with two CPUs instead of one.  With more than one vCPU in a VM, a single runaway single-threaded process will not drive the VM to 100% of its available CPU, because a single process will use the capacity of a single CPU, not both.  That may prevent the load balancer from triggering.
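If you do decide to give one of the single-vCPU VMs a second virtual processor, it can be done from PowerShell on the owning host (a sketch using the "App" VM name from this thread; the VM must be shut down before the processor count can change):

```powershell
# Stop the VM, add a second virtual processor, and start it again.
Stop-VM -Name "App"
Set-VMProcessor -VMName "App" -Count 2
Start-VM -Name "App"
```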


    tim

    Thursday, July 2, 2020 1:56 PM