Cluster node - Event 252 - cluster service crashed

  • Question

  • We have a 4-node Windows Server 2016 Hyper-V Cluster.  Over the weekend, one node of the cluster reported this in the System event log:

    "Memory allocated for packets in a vRss queue (on CPU 14) on switch 620355C8-4D29-4D91-BE80-B840921EDC4A (Friendly Name: Team_Trunked) due to low resource on the physical NIC has increased to 256MB. Packets will be dropped once queue size reaches 512MB."

    Seven seconds later it reported that the queue had increased to 512MB; two seconds later the LiveMigration NIC reported it had begun resetting, and a few seconds after that a reset was issued to \device\raidport1 (source: ql2300).  After two minutes of this and a few other repeats, I started getting warnings that CSVs could no longer access this cluster node, and then the Cluster service shut down on this node and all VMs were shut down and migrated to other nodes in the Cluster.

    Our weekly DPM backups of the VMs had started about 1.5 hours before this occurred, so there was some additional strain on the NICs at the time.  That traffic should have gone through the NIC the OS is running on, though, so I don't know why it would have affected the other NICs that the VMs' general data goes through (Team_Trunked) and the LM NIC.

    Does it make sense that the first warning about Event 252 would have caused all this, or is there more to this?


    Monday, August 12, 2019 3:57 PM

All replies

  • Hi,

    Some network adapters set their receive buffers low to conserve allocated memory from the host. The low value results in dropped packets and decreased performance. Therefore, for receive-intensive scenarios, we recommend that you increase the receive buffer value to the maximum.

    Please increase the "Receive Buffers" value on the physical NICs on the Hyper-V hosts and check whether it helps. The default value is 256; you could increase it to 1024 and then check the result.
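
    For example, a minimal sketch in PowerShell (the adapter name is a placeholder and the exact registry keyword can vary by driver, so check what Get-NetAdapterAdvancedProperty reports first):

    # Show the current receive-buffer setting; "*ReceiveBuffers" is the common standardized keyword
    Get-NetAdapterAdvancedProperty -Name "PhysicalNIC" -RegistryKeyword "*ReceiveBuffers"

    # Raise the value, then re-run the Get above to confirm it took effect
    Set-NetAdapterAdvancedProperty -Name "PhysicalNIC" -RegistryKeyword "*ReceiveBuffers" -RegistryValue 1024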

    Best Regards,

    Candy


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com   

    Tuesday, August 13, 2019 3:40 AM
  • Hi,

    Just want to confirm the current situation.

    Please feel free to let us know if you need further assistance.                   

    Best Regards,

    Candy


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com   

    Thursday, August 15, 2019 7:30 AM
  • On the node of the cluster that had the issue, Virtualsrv1, there are 3 teams.

    Team_LM – Live Migration – Switch Independent – Dynamic – two 1Gb adapters
    (HPE Ethernet 1 Gb 4-port 366T Adapter #6 & HPE Ethernet 1 Gb 4-port 366T Adapter #5)
    (Cluster Only)

    Team_OS – OS Team – Switch Independent – Address Hash – one 1Gb and one 100Mb adapter
    (HPE Ethernet 1 Gb 4-port 366T Adapter #3 & HPE Ethernet 1 Gb 4-port 366T Adapter #7)
    (Cluster & Client)

    Team_Trunked – VM traffic – Switch Independent – Dynamic – two 10Gb adapters
    (HPE Ethernet 10Gb 2-port 560FLR-SFP+ Adapter & HPE Ethernet 10Gb 2-port 560FLR-SFP+ Adapter #2)

    There is also the HPE Ethernet 1 Gb 4-port 366T Adapter that is specifically for cluster communication.
    (Cluster Only)

    Running this PowerShell command gives this:  Get-ClusterNetwork | ft Name, Metric, AutoMetric, Role

    Name              Metric  AutoMetric  Role
    ----              ------  ----------  ----
    Clust_Mgmt          1000  False       Cluster
    Clust_Mgmt_100MB   80000  True        None
    Live_Migration      2000  False       Cluster
    Team_OS            70385  True        ClusterAndClient

    With the metric set as it is, Cluster communication (pings to ensure all nodes are there) should occur through “Clust_Mgmt”, then if not available “Live_Migration”, then “Team_OS”.  Correct?
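
    For reference, the manual metrics above were set by assigning the Metric property directly (a sketch of the standard approach; assigning a value is what flips AutoMetric to False):

    # Lower metric = preferred for cluster communication
    (Get-ClusterNetwork -Name "Clust_Mgmt").Metric = 1000
    (Get-ClusterNetwork -Name "Live_Migration").Metric = 2000

    # To hand control back to the cluster later:
    # (Get-ClusterNetwork -Name "Clust_Mgmt").AutoMetric = $true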

    The big question is why this node of the cluster lost communication when the HPE Ethernet 1 Gb 4-port 366T Adapter #6 (one of the Live Migration adapters) lost connectivity.  That alone shouldn't cause the cluster to lose communication with this node.

    I want to understand this issue to make sure this doesn’t happen again.

    I have changed the “Receive Buffers” from 256 to 2048 on all of the 10Gb adapters on all of the nodes, which is where all the VM traffic occurs.  That should fix that issue.  I still don't see how this Event ID 252 issue caused this whole mess, though.


    Friday, August 16, 2019 6:09 PM
  • Hi,

    Sorry for the delayed response.

    >>I want to understand this issue to make sure this doesn’t happen again.

    >>I have changed the “Receive Buffers” from 256 to 2048 on all of the 10 Gb adapters on all of the nodes which is where all the VM traffic occurs on.  That should fix that issue.

    In order to narrow down the issue, I would suggest you increase the “Receive Buffers” and check. If the issue occurs again, we need to do further troubleshooting.

    If the issue doesn't occur again, I would think the problem was caused by event 252.
    Since I did not find any related resource covering this situation in the official Microsoft documentation, I would suggest you open a case with Microsoft; a more in-depth investigation can be done so that you get a more satisfying explanation and solution to this issue.

    Here is the link:
    https://support.microsoft.com/en-us/gp/support-options-for-business

    Best Regards,

    Candy



    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com   


    Wednesday, August 21, 2019 1:35 AM
  • I have opened a support ticket with Microsoft about this issue and I am awaiting a response.  I will post the details when I receive them.
    Monday, August 26, 2019 2:57 PM
  • Hi,

    Thanks for the effort you have put into this case.

    By sharing your experience you can help other community members facing similar problems. Thanks for your understanding.

    I will wait for your good news.

    Best Regards,

    Candy


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com   

    Tuesday, August 27, 2019 1:43 AM
  • So after 2.5 months working on a support call with Microsoft, I thought I had this fixed.  I showed them the specific settings that were changed and they confirmed they were correct, but that is not the case: Event 252 occurred over the weekend, one node lost cluster membership, and VMs went into varying problem states.

    This issue was deemed to be "since having both VMQ and RSS configured can cause an overlapping, this is because the default configuration will put the NICs to use processor 0, which is also the one being used by System."  To fix this, alleviate the processor overlaps, and reduce NUMA node distance, I ran the following commands:

    Set-NetAdapterVmq -Name "10GB_HV1_SW1" -BaseProcessorNumber 2
    Set-NetAdapterVmq -Name "10GB_HV2_SW2" -BaseProcessorNumber 30

    Set-NetAdapterRss -Name "10GB_HV1_SW1" -BaseProcessorGroup 0 -BaseProcessorNumber 2 -MaxProcessorGroup 1 -MaxProcessorNumber 26

    Set-NetAdapterRss -Name "10GB_HV2_SW2" -BaseProcessorGroup 2 -BaseProcessorNumber 0 -MaxProcessorGroup 3 -MaxProcessorNumber 26

    Set-NetAdapterAdvancedProperty -Name "10GB_HV1_SW1" -RegistryKeyword '*NumaNodeId' -RegistryValue '0'

    Set-NetAdapterAdvancedProperty -Name "10GB_HV2_SW2" -RegistryKeyword '*NumaNodeId' -RegistryValue '2'

    =================================================================================

    Running this command gives the following results:

    Get-NetAdapterRss -Name "*" | Where-Object -FilterScript { $_.Enabled }

    Name                                            : 10GB_HV1_SW1
    InterfaceDescription                            : HPE Ethernet 10Gb 2-port 560FLR-SFP+ Adapter
    Enabled                                         : True
    NumberOfReceiveQueues                           : 128
    Profile                                         : NUMAStatic
    BaseProcessor: [Group:Number]                   : 0:2
    MaxProcessor: [Group:Number]                    : 1:26
    MaxProcessors                                   : 16
    RssProcessorArray: [Group:Number/NUMA Distance] : 0:2/0  0:4/0  0:6/0  0:8/0  0:10/0  0:12/0  0:14/0  0:16/0
                                                      0:18/0  0:20/0  0:22/0  0:24/0  0:26/0
                                                      1:0/31534  1:2/31534  1:4/31534  1:6/31534  1:8/31534
                                                      1:10/31534  1:12/31534  1:14/31534  1:16/31534  1:18/31534
                                                      1:20/31534  1:22/31534  1:24/31534  1:26/31534

    Name                                            : 10GB_HV2_SW2
    InterfaceDescription                            : HPE Ethernet 10Gb 2-port 560FLR-SFP+ Adapter #2
    Enabled                                         : True
    NumberOfReceiveQueues                           : 128
    Profile                                         : NUMAStatic
    BaseProcessor: [Group:Number]                   : 2:0
    MaxProcessor: [Group:Number]                    : 3:26
    MaxProcessors                                   : 16
    RssProcessorArray: [Group:Number/NUMA Distance] : 2:0/0  2:2/0  2:4/0  2:6/0  2:8/0  2:10/0  2:12/0  2:14/0
                                                      2:16/0  2:18/0  2:20/0  2:22/0  2:24/0  2:26/0
                                                      3:0/32546  3:2/32546  3:4/32546  3:6/32546  3:8/32546
                                                      3:10/32546  3:12/32546  3:14/32546  3:16/32546  3:18/32546
                                                      3:20/32546  3:22/32546  3:24/32546  3:26/32546

    ===================================================================

    Microsoft support said this was the correct configuration, but it did not fix the issue.  These are the two NICs that are part of the Hyper-V switch.  The thought, as I understood it, was to move this traffic off of CPU 0 on NUMA node 0, which it did.  Since there are 4 NUMA nodes, I am now wondering if I should move traffic off of CPU 0 on the other 3 NUMA nodes as well.  Though that would mean picking one NUMA node for each of the two 10Gb NICs, and then the other two NUMA nodes would not be used at all for the VMs, which seems like wasting them.

    Does anyone have any help or advice?


    Monday, November 25, 2019 9:30 PM
  • In addition, here are the results of where VMQ is enabled:

    PS C:\Windows\system32> Get-NetAdapterVmq -Name "*" | Where-Object -FilterScript { $_.Enabled }

    Name          InterfaceDescription               Enabled  BaseVmqProcessor  MaxProcessors  NumberOfReceiveQueues
    ----          --------------------               -------  ----------------  -------------  ---------------------
    10GB_HV1_SW1  HPE Ethernet 10Gb 2-port 560...#2  True     0:2               16             31
    10GB_HV2_SW2  HPE Ethernet 10Gb 2-port 560...#1  True     2:0               16             31
    Team_Trunked  Microsoft Network Adapter Mu...#2  True     0:0                              62


    Tuesday, November 26, 2019 3:04 PM
  • I recommend that you completely tear down this configuration and start over with something simpler.

    1. Disconnect all virtual machines from their virtual switch.
    2. Document the TCP/IP configuration of all virtual and team adapters in the management operating system.
    3. Destroy all virtual switches.
    4. Destroy all teams.
    5. Disable all gigabit adapters.
    6. Donate the 100mb adapter to an electronics recycler.
    7. If your environment uses VLANs, configure the physical switch ports for your 10GB adapters to carry all VLANs applicable to your nodes, your cluster, and your virtual machines.
    8. Create a switch-embedded team on the 10GB adapters. Make certain that it uses the Weight mode for QoS.
    9. Create two virtual NICs in the management operating system: one for management and one for cluster traffic. Make the management adapter routable (give it a gateway). Make the cluster adapter non-routable and do not allow it to register in DNS. Make sure these adapters have IPs in distinct subnets. I personally would instruct the cluster to prioritize the cluster adapters for Live Migration, but it will never practically matter.
    10. Re-attach the virtual machine vNICs to your new vSwitch.

    Problems that I see in your existing build:

    • 100mb adapters have not been supported in clustering for as long as I have been clustering (2010). I don't know why Microsoft support did not flag that.
    • Active/active teaming does not prioritize faster team members to carry traffic. Your hybrid gigabit/100mb team treats both members as equal partners and load balances evenly across them.
    • WS2016 made the cluster network metrics effectively meaningless. You should assume that the cluster will treat all networks marked to allow cluster traffic as equal members when distributing inter-node traffic. As with teaming, it does not even check the speed of the member adapters first. The full explanation is more complicated, but this description is correct for all practical purposes.
    • Multiple teams and switches means lots of packet thrashing.

    Basically, you have a lot of places for traffic to back up while waiting for slower members. Does that cause the error that you receive? I have no idea. But, none of this excessive engineering does anything positive. The 10GB adapters outclass the gigabit and 100mb adapters to the point of reducing their contributions to nuisance levels.

    IF, and ONLY IF, you can produce usage charts that show network contention, you can consider implementing network QoS to prevent your Live Migrations or backups from drowning out VM traffic. Unless you have really snappy disks or an environment with highly unusual traffic characteristics, I don't think you'll have a problem.

    A switch-embedded team will automatically configure VMQ and RSS for you. It does not skip over CPU 0:0 though, so you can tweak that later -- again, ONLY if you can prove that you have a problem.


    Eric Siron
    Altaro Hyper-V Blog
    I am an independent contributor, not an Altaro employee. I accept all responsibility for the content of my posts. You accept all responsibility for any actions that you take based on the content of my posts.

    Tuesday, November 26, 2019 10:49 PM
  • Thank you for your thoughts and action plan.  Feel free to critique me as needed.  I value your input.  

    On a different node, our newest in the Cluster, I did something similar but with LBFO teaming of the two 10Gb NICs.  I then created a Virtual Switch using weight mode and then created vCluster, vLiveMigration, and vManagement adapters, each on a separate subnet.  I did apply QoS on them just to ensure live migrations between nodes with 10Gb cards didn't saturate anything, using:

    Set-VMSwitch -Name Team_Trunked -DefaultFlowMinimumBandwidthWeight 50
    Set-VMNetworkAdapter -ManagementOS -Name "LiveMigration" -MinimumBandwidthWeight 20
    Set-VMNetworkAdapter -ManagementOS -Name "Cluster" -MinimumBandwidthWeight 20
    Set-VMNetworkAdapter -ManagementOS -Name "Management" -MinimumBandwidthWeight 10


    I plan to tear down everything like you suggested on the server in question with the issues and rebuild the networking using SET like you recommend but have a few questions.

    1.) You don't recommend having separate subnets, one for LiveMigration and one for dedicated Cluster communication, correct?  My previous understanding was that by having a dedicated Cluster subnet for communication, and setting the cluster metric to prioritize that subnet, there would never be an issue with communication between the nodes if the LiveMigration traffic was high.  Sounds like things have changed with cluster metrics in WS2016 so that doesn't matter anymore?

    2.) Down the road, if I had some kind of network contention, can I use the QoS settings I used above?  Or is that not valid with SET?

    3.) Do I need to revert the VMQ, RSS, and Numanode settings I mentioned in the previous post that I did on the 10Gb Nics or is it ok to leave it the way it is?

    4.) I have been looking over your post here about SET.  https://www.altaro.com/hyper-v/complete-guide-hyper-v-networking/  Looks like that should cover most of the configuration questions.  Anything else I should be aware of?

    Wednesday, November 27, 2019 9:48 PM
  • You don't recommend having separate subnets for LiveMigration and one for dedicated Cluster communication, correct?

    Correct. Just one for cluster communications, distinct from the management network.

    My previous understanding was that

    All of that goes back to the pre-2012 days before we could leverage convergence to dissociate layer 7 from layers 1-3. Back then, we had single gigabit and no teaming. So, to keep everyone happy, we had to manually balance traffic. So, we (somewhat arbitrarily) drew lines around CSV/cluster traffic, Live Migration traffic, host management traffic, and VM traffic.

    Now, we have teaming and load-balancing at layer 2. We let that figure out all the nasty parts of where to put traffic. We only need to make certain that it has sufficient tools. You have two physical 10GB cards in your team, so I ask you to make two logical pathways. I call them "Management" and "Cluster" because they will kind of do that. But really, I want to ensure that only one can route traffic because having a multi-homed system with routing choices never leads to anything but pain and suffering.

    If you follow the directions exactly as I laid out, your "Management" adapter and your "Cluster" adapter will both allow for inter-node communications and Live Migration traffic. Two logical pathways, two physical pathways, and cluster traffic and Live Migration traffic can use all of it. VMs connect to the SET, so each of their network adapters can send on both physical pathways and receive on one. The only thing in this build that can't use both adapters is inbound traffic to the management OS and inbound traffic to individual VM adapters. With 10GB, I doubt that will ever be a problem. The "Switch Independent+Hyper-V Port" mode does that anyway.
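
    If it helps, here is a rough PowerShell sketch of step 9 (the switch name, vNIC names, and addresses are placeholders, not your real values):

    Add-VMNetworkAdapter -ManagementOS -Name "Management" -SwitchName "SETSwitch"
    Add-VMNetworkAdapter -ManagementOS -Name "Cluster" -SwitchName "SETSwitch"

    # Management vNIC is routable: it gets the gateway
    New-NetIPAddress -InterfaceAlias "vEthernet (Management)" -IPAddress 192.168.50.11 -PrefixLength 24 -DefaultGateway 192.168.50.1

    # Cluster vNIC is non-routable: distinct subnet, no gateway, no DNS registration
    New-NetIPAddress -InterfaceAlias "vEthernet (Cluster)" -IPAddress 10.99.50.11 -PrefixLength 24
    Set-DnsClient -InterfaceAlias "vEthernet (Cluster)" -RegisterThisConnectionsAddress $false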

    SET configures VMQ and RSS when installed on a system that has never had those options changed. I would expect it to reconfigure yours as well. Clean that NUMA stuff off, though.

    Networking QoS is probably the biggest timewaster in Hyper-V. Organizations pushing enough traffic to make use of it usually have enough money and incentive to jump up to 40GB or 100GB adapters or scale out. For everybody else, it's CPU time that doesn't make any better decisions than just not having QoS at all. But yes, you can put the settings back later if you want.


    Eric Siron
    Altaro Hyper-V Blog
    I am an independent contributor, not an Altaro employee. I accept all responsibility for the content of my posts. You accept all responsibility for any actions that you take based on the content of my posts.

    Wednesday, November 27, 2019 10:14 PM
  • Sounds good.  I will tackle this next week after Thanksgiving.  Thanks so much again for sharing your expertise.
    Wednesday, November 27, 2019 10:24 PM
  • I have an all-flash SAN connected to the Hyper-V Cluster via FC, so CSVs are the storage that is present.  If I tear down all the networking for this node, other nodes will not be able to communicate with it, so will CSVs that are "owned" by this node automatically move to another node without any disruption?  Or should I manually move the CSVs to other nodes and evict this node from the cluster before I tear down the networking, to avoid any issues?
    Tuesday, December 3, 2019 5:33 PM
  • Just pause the node in Failover Cluster Manager and resume it after you're done. Or Suspend-ClusterNode/Resume-ClusterNode in PowerShell.
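
    Something like this, where "NodeName" is a placeholder for your node:

    # Drain roles and CSV ownership off the node before you start
    Suspend-ClusterNode -Name "NodeName" -Drain

    # ...rebuild the networking...

    Resume-ClusterNode -Name "NodeName" -Failback Immediate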

    Eric Siron
    Altaro Hyper-V Blog
    I am an independent contributor, not an Altaro employee. I accept all responsibility for the content of my posts. You accept all responsibility for any actions that you take based on the content of my posts.

    Tuesday, December 3, 2019 5:45 PM
  • Got it.  Thanks!
    Tuesday, December 3, 2019 5:51 PM
  • When creating the new SET, all I need to do is run this command, as it will create both net adapters, correct?

    New-VMSwitch -Name SETSwitch -NetAdapterName Management, Cluster -EnableEmbeddedTeaming $true  -MinimumBandwidthMode Weight

    What about setting the load balancing mode?  Should it be Dynamic or Hyper-V Port?
    I have 200 VMs spread across 4 nodes.

    Also, what about EnableIov?  I confirmed that this 10Gb card supports SR-IOV so it should be possible to enable it, though I do not want to complicate anything if it is unnecessary.  

    Tuesday, December 3, 2019 9:48 PM
  • My bad, it should have been the following, correct?

    New-VMSwitch -Name SETSwitch -NetAdapterName 10GB_HV1_SW1, 10GB_HV2_SW2 -EnableEmbeddedTeaming $true  -MinimumBandwidthMode Weight

    Add-VMNetworkAdapter -ManagementOS -Name "Cluster" -SwitchName "SETSwitch"

    Add-VMNetworkAdapter -ManagementOS -Name "Management" -SwitchName "SETSwitch"

    What about setting the load balancing mode?  Should it be Dynamic or Hyper-V Port?
    I have 200 VMs spread across 4 nodes.

    Also, what about EnableIov?  I confirmed that this 10Gb card supports SR-IOV so it should be possible to enable it, though I do not want to complicate anything if it is unnecessary.  

    Tuesday, December 3, 2019 10:03 PM
  • Close.

    Add "-AllowManagementOS $false" to your New-VMSwitch statement or you'll get an extraneous virtual adapter.

    It will default to Dynamic. Dynamic gives better overall load balancing than Hyper-V Port. Only use Hyper-V Port when you can't use Dynamic.

    As far as I know, IOV and teaming are still incompatible. You can try it. It will error if it won't work and you can just resubmit without it.
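
    Put together, the switch-creation line would look something like this (using your adapter names from above):

    New-VMSwitch -Name SETSwitch -NetAdapterName 10GB_HV1_SW1, 10GB_HV2_SW2 -EnableEmbeddedTeaming $true -MinimumBandwidthMode Weight -AllowManagementOS $false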


    Eric Siron
    Altaro Hyper-V Blog
    I am an independent contributor, not an Altaro employee. I accept all responsibility for the content of my posts. You accept all responsibility for any actions that you take based on the content of my posts.

    Tuesday, December 3, 2019 10:10 PM
  • You are correct.  I found a few articles confirming that IOV and teaming are incompatible.  

    I just created a SET, and when looking at the Virtual Switch from the Virtual Switch Manager in Hyper-V Manager, the drop-down list for "External Network" has each of the 10Gb NICs listed separately, with one selected by default.  This is different from LBFO teaming, where there is a teamed option to select from the list, such as "Microsoft Network Adapter Multiplexor Driver".  Does that mean that the Virtual Switch is only using one of the NICs?  Or is that just the way it looks with SET?

    Wednesday, December 4, 2019 4:51 PM
  • Just the way it looks. Use "Get-VMSwitchTeam" to verify.
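
    Something like this; if both physical adapters appear under NetAdapterInterfaceDescription, the switch is using both of them:

    Get-VMSwitchTeam -Name "SETSwitch" | Format-List Name, NetAdapterInterfaceDescription, TeamingMode, LoadBalancingAlgorithm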

    Eric Siron
    Altaro Hyper-V Blog
    I am an independent contributor, not an Altaro employee. I accept all responsibility for the content of my posts. You accept all responsibility for any actions that you take based on the content of my posts.

    Wednesday, December 4, 2019 4:54 PM
  • Things are looking good now. 

    For the vEthernet (Cluster) adapter, I know I need to have "Client for Microsoft Networks", "File and Printer Sharing for Microsoft Networks", and "Internet Protocol Version 4 (TCP/IPv4)" checked.  The following are also checked by default: QoS Packet Scheduler, Microsoft LLDP Protocol Driver, Link-Layer Topology Discovery Responder, and Link-Layer Topology Discovery Mapper I/O Driver.  Do I need all of these or should some be unchecked?

    Wednesday, December 4, 2019 5:40 PM
  • I don't usually spend time on any of that. You'd probably quiet the network down a tiny bit by disabling it all, but not likely enough to matter.
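
    If you do want to trim them, it's quick in PowerShell (a sketch; verify the component IDs on your system with Get-NetAdapterBinding first):

    # List everything bound to the cluster vNIC, with component IDs
    Get-NetAdapterBinding -Name "vEthernet (Cluster)"

    # Example: disable the Link-Layer Topology Discovery Responder on that vNIC
    Disable-NetAdapterBinding -Name "vEthernet (Cluster)" -ComponentID ms_rspndr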

    Eric Siron
    Altaro Hyper-V Blog
    I am an independent contributor, not an Altaro employee. I accept all responsibility for the content of my posts. You accept all responsibility for any actions that you take based on the content of my posts.

    Wednesday, December 4, 2019 5:46 PM
  • Got it.  I'll leave that the way it is.

    Regarding the metrics we briefly discussed earlier: the Clust_Mgmt_100MB network will disappear when I finish up the other nodes, and Live_Migration will disappear as well.  Should I set AutoMetric back to True for Clust_Mgmt?  Or doesn't it matter?

    Name              Metric  AutoMetric  Role
    ----              ------  ----------  ----
    Clust_Mgmt          1000  False       Cluster
    Clust_Mgmt_100MB   80000  True        None
    Live_Migration      2000  False       Cluster
    Team_OS            70385  True        ClusterAndClient

    I did leave the setting that keeps network processing off of Core 0, as that dropped CPU utilization of the Hyper-V nodes significantly.  It also resolved an issue with a VM running SQL Server that had very high CPU usage; I had an MS case open for that at the same time as the other one.  As soon as I moved network processing off of Core 0 on one node and migrated the VM to that node, the high CPU usage for that VM was gone immediately.  Also, I ran the command Get-NetAdapterAdvancedProperty -Name "*" and noticed that the starting RSS CPU (RssBaseProcNumber) was 0 for the 2nd 10Gb NIC, so I changed that to 2 as well.  I think that may have been part of why we got the Event 252 issue that crashed that node last time.  I guess we will see what happens going forward.
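
    For reference, that RSS change was along these lines (a sketch; I'm assuming the second 10Gb NIC's name here, and the keyword is the one Get-NetAdapterAdvancedProperty reports):

    # Move the starting RSS CPU off core 0 for the second 10Gb NIC
    Set-NetAdapterAdvancedProperty -Name "10GB_HV2_SW2" -RegistryKeyword '*RssBaseProcNumber' -RegistryValue '2'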

    Wednesday, December 4, 2019 6:41 PM
  • I don't think anything will meaningfully change no matter what you do with the metric. I personally would set it back to auto as that's the default. Maintaining documentation for overrides and ensuring that they survive upgrades and migrations is a pain. My overall process starts with everything at its simplest and only adds complications to solve problems.


    Eric Siron
    Altaro Hyper-V Blog
    I am an independent contributor, not an Altaro employee. I accept all responsibility for the content of my posts. You accept all responsibility for any actions that you take based on the content of my posts.

    Wednesday, December 4, 2019 6:52 PM
  • Got it.

    To start the network changes, I put the node into Maintenance Mode with VMM.  I then stopped Maintenance Mode, got the following warning, and VMM automatically paused the Cluster service on the node.

    Warning (50079)
    Highly available virtual switch 'Team_Trunked' designated for cluster use, is not configured correctly on host 'Virtualsrv2'.

    Recommended Action
    Reconfigure the virtual switch to have same connectivity as other hosts in the cluster.


    Then I refreshed the node again in VMM and got a similar message and it paused the cluster service.

    Warning (50077)
    Virtual switch 'Team_Trunked' may become not highly available if the host 'Virtualsrv2' is brought online or moved out of maintenance mode as connectivity does not match with other hosts.

    Recommended Action
    Reconfigure the virtual switch to have same connectivity as other hosts in the cluster, before taking the host out of maintenance mode.

    This is a network map from VMM.  Is it incompatible to have SET on one node and LBFO on the others?  That kind of makes it impossible to transition if that's the case.  It doesn't show the SET on Virtualsrv2 as a logical network, which must be what it is complaining about.  How do I fix that?

    Wednesday, December 4, 2019 7:52 PM
  • Hmm. Well, having VMM is kind of an important detail that I didn't know about. It always makes Hyper-V networking a lot harder than necessary.

    I think probably the easiest thing is to just remove the whole cluster from VMM management until you're done. When you re-add it, the switches won't be "logical switches". Unless you're doing something with SDN or VMM-specific networking plugins, you won't miss it.


    Eric Siron
    Altaro Hyper-V Blog
    I am an independent contributor, not an Altaro employee. I accept all responsibility for the content of my posts. You accept all responsibility for any actions that you take based on the content of my posts.

    Wednesday, December 4, 2019 8:02 PM
  • Sorry I didn't mention that.  I agree it makes Hyper-V networking harder than it needs to be.  I always try to do any cluster/networking work directly from the nodes and with Failover Cluster Manager.  I wouldn't even use VMM, but it gives me a way to allow users that need console access to certain VMs without letting them see all the VMs, and it prevents unnecessary access to the Cluster nodes themselves.  The Clouds allow a nice way to delegate access to certain groups with only certain permissions.  If I remove the whole cluster from VMM, I'm assuming it will remove all the delegated permissions to VMs that I have granted, as well as removing the VMs from the clouds they are in, so I would have to redo all of that when I add it back in, correct?  I have 53 VMs as members of clouds for people to be able to access.

    Any other way to fix this that you know of?

    Wednesday, December 4, 2019 8:35 PM
  • I have never found a way to get VMM to ignore things it doesn't like. I assume it will retain the cloud itself, but it will probably lose everything else. Maybe it's best to re-establish a management connection over a gigabit adapter and have it remediate the switch to the old way.


    Eric Siron
    Altaro Hyper-V Blog
    I am an independent contributor, not an Altaro employee. I accept all responsibility for the content of my posts. You accept all responsibility for any actions that you take based on the content of my posts.

    Wednesday, December 4, 2019 8:53 PM
  • So when you say re-establish a management connection, what exactly do you mean?  Do you mean creating a temporary "dummy" 1Gb LBFO NIC team with the name "Team_Trunked" just to kind of trick VMM until I get the rest of the nodes switched over?
    Wednesday, December 4, 2019 9:15 PM
  • Unless something has changed, VMM can't lose connectivity to the host while it's making changes to a virtual switch. As in, it won't just ship the commands to the agent and let it do the work. So, you can use the gigabit as a temporary management link or rebuild the team or whatever works best for you to get back to a place of normalcy. I think that since your logical switch didn't have a management component, you'll need to keep using the gigabit for management to prevent VMM from panicking.

    Eric Siron
    Altaro Hyper-V Blog
    I am an independent contributor, not an Altaro employee. I accept all responsibility for the content of my posts. You accept all responsibility for any actions that you take based on the content of my posts.

    Wednesday, December 4, 2019 9:26 PM
  • So basically it won't work to move to SET, correct? 

    Then I would propose doing what I did on my newest node: create an LBFO team out of the two 10Gb NICs, then create the Hyper-V virtual switch that binds to this team, then create the virtual NIC adapters, one for Management and one for Cluster, leaving out the LiveMigration one as it isn't necessary.  This would get rid of all the 1Gb adapters, and VMM didn't complain about it on Virtualsrv4.

    Wednesday, December 4, 2019 9:38 PM
  • I tried creating a temporary 1Gb Management link, then tore down my SET and recreated it, but VMM still did not like it.  I then tore down everything and created an LBFO team, vswitch, and Management and Cluster vNICs.  I would have preferred to go with SET, but since I am not able to bring VMM down entirely for an extended period of time, this was the next best thing I could do.  All 1Gb NICs are disabled and everything is running over the 10Gb NICs now, so that is good.

    Thank you for all your time and assistance.  It is greatly appreciated.  I will be building a Hyper-Converged Cluster in the not too distant future and will certainly be using SET for that; now that I know how to do it, it will be much easier to roll out.

    Friday, December 6, 2019 3:52 PM
  • Sorry I didn't get back to you on Wednesday, got tied up and then forgot about the thread.

    Glad that you got it all worked out. You've made a lot of improvements even without going all the way to SET. Hopefully this makes things better. Don't forget to check on your VMQ/RSS distributions.

    If you won't use SDN or any VMM-specific networking plugins, I would not use logical switches on future builds. After a great deal of trial and error, I now deploy Hyper-V hosts outside of VMM, configure them the way that I want, then add them under VMM management. I don't allow VMM to control networking anymore. I believe that it CAN do SET, but I don't know how and don't see any value-add.

    You might look into scripting the more intricate parts of your setup. I'm sure there's a way to preserve those security assignments so that you can quickly re-apply them if necessary.


    Eric Siron
    Altaro Hyper-V Blog
    I am an independent contributor, not an Altaro employee. I accept all responsibility for the content of my posts. You accept all responsibility for any actions that you take based on the content of my posts.

    Friday, December 6, 2019 4:02 PM
  • What is your recommended way and commands to check and monitor VMQ/RSS distributions?  Anything in particular you keep an eye out for?

    With VMM, I never created any logical switches within VMM.  I always deploy and configure my hosts outside of VMM as well and then add them in.  It looks like it just sucks in the LBFO teamed interface of the Hyper-V Cluster into VMM when you add the host(s).  I have two other stand alone Hyper-V hosts that it did the same thing with.  The one thing in VMM I did was add Network Sites to the "Team_Trunked" logical switch so I can assign VLANs to VMs from VMM.  

    I will have to look into scripting out those security assignments when I get a chance.

    Friday, December 6, 2019 5:40 PM
  • I don't have any special tricks for VMQ/RSS. Same thing you've been doing.
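
    For a quick look at where queues actually land, something like:

    # Which VMQ queues exist and which processor each one is pinned to
    Get-NetAdapterVmqQueue | Format-Table Name, QueueID, Processor, VmFriendlyName

    # Current RSS processor spread on the enabled adapters
    Get-NetAdapterRss -Name "*" | Where-Object -FilterScript { $_.Enabled }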


    Eric Siron
    Altaro Hyper-V Blog
    I am an independent contributor, not an Altaro employee. I accept all responsibility for the content of my posts. You accept all responsibility for any actions that you take based on the content of my posts.

    Friday, December 6, 2019 6:57 PM
  • My update for anyone following this thread:

    I got all the nodes in my cluster switched over to use just the 10Gb NICs, and performance seems to be doing well.  Then on 12/10 I got the dreaded 252 event again.  Luckily it didn't crash any nodes this time; I'm assuming it didn't hit the 512MB queue maximum it refers to: "due to low resource on the physical NIC has increased to 256MB. Packets will be dropped once queue size reaches 512MB"

    Baffled and frustrated, I started reviewing all my NIC properties again, looking for something with a 512 setting, and I came across the "Transmit Buffers" property, which was set at 512.  On 12/12 I increased that to 2048 on all the nodes in hopes that it is the source of this issue.  I had already increased the "Receive Buffers" property from 256 to 2048 back at the beginning of this mess, and that didn't fix it.  Time will tell if this fixes it, but at least I have some hope, so we'll see what happens.
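
    For anyone who wants the command form, the change was along these lines (a sketch; run it per 10Gb NIC on each node, and note the keyword name can vary by driver):

    # Raise the transmit-side buffers to match the receive side
    Set-NetAdapterAdvancedProperty -Name "10GB_HV1_SW1" -RegistryKeyword '*TransmitBuffers' -RegistryValue 2048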

    Tuesday, December 17, 2019 4:40 PM