Cluster node - Event 252 - cluster service crashed

  • Question

  • We have a 4-node Windows Server 2016 Hyper-V cluster.  Over the weekend, one node of the cluster reported this in the System event log:

    "Memory allocated for packets in a vRss queue (on CPU 14) on switch 620355C8-4D29-4D91-BE80-B840921EDC4A (Friendly Name: Team_Trunked) due to low resource on the physical NIC has increased to 256MB. Packets will be dropped once queue size reaches 512MB."

    Seven seconds later it reported that the queue had increased to 512MB; two seconds later the Live Migration NIC reported that it had begun resetting; and a few seconds later a reset was issued to \device\raidport1 (source: ql2300).  After two minutes of this, and a few other repeats, I started getting warnings that CSVs were no longer able to access this cluster node, and then the Cluster service shut down on this node and all VMs were shut down and migrated to the other nodes in the cluster.

    Our weekly DPM backups of the VMs had started about 1.5 hours before this occurred, so there was some additional strain on the NICs at the time.  That backup traffic should have gone through the NIC the OS runs on, though, so I don't know why it would have affected the LM NIC or the NICs carrying the VMs' general traffic (Team_Trunked).

    Does it make sense that the first Event 252 warning could have caused all of this, or is there more to it?


    Monday, August 12, 2019 3:57 PM

Answers

  • Hi,

    Sorry for the delayed response.

    >>I want to understand this issue to make sure this doesn’t happen again. I have changed the “Receive Buffers” from 256 to 2048 on all of the 10 Gb adapters on all of the nodes, which is where all the VM traffic occurs. That should fix that issue.

    To narrow down the issue, I would suggest you increase the “Receive Buffers” value and check the result. If the issue occurs again, we need to do further troubleshooting.
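    If it does recur, a useful first step is to pull the cluster debug log from every node around the failure window. A minimal sketch (the destination folder is just an example; -TimeSpan is in minutes):

    # Generate the cluster log for the last 60 minutes from all nodes
    Get-ClusterLog -Destination C:\Temp -TimeSpan 60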

    If the issue doesn't occur, I would think the problem was caused by event 252.
    Since I did not find any Microsoft documentation describing this situation, I would suggest you open a case with Microsoft so that a more in-depth investigation can be done and you can get a more satisfying explanation and solution to this issue.

    Here is the link:
    https://support.microsoft.com/en-us/gp/support-options-for-business

    Best Regards,

    Candy



    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com   


    Wednesday, August 21, 2019 1:35 AM

All replies

  • Hi,

    Some network adapters set their receive buffers low to conserve allocated memory from the host. The low value results in dropped packets and decreased performance. Therefore, for receive-intensive scenarios, we recommend that you increase the receive buffer value to the maximum.

    Please increase the "Receive Buffers" setting on the physical NICs on the Hyper-V hosts and check whether it helps. The default value is 256; you could increase it to 1024 and then check the result.
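    A minimal PowerShell sketch for checking and changing the setting (the adapter name below is a placeholder, and the "Receive Buffers" display name varies by driver, so verify it on your hardware first):

    # Show the current Receive Buffers value on every adapter
    Get-NetAdapterAdvancedProperty -Name "*" -DisplayName "Receive Buffers"

    # Raise it on one adapter (adapter name is an example)
    Set-NetAdapterAdvancedProperty -Name "Ethernet 2" -DisplayName "Receive Buffers" -DisplayValue 1024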

    Best Regards,

    Candy


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com   

    Tuesday, August 13, 2019 3:40 AM
  • Hi,

    Just want to confirm the current situation.

    Please feel free to let us know if you need further assistance.                   

    Best Regards,

    Candy


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com   

    Thursday, August 15, 2019 7:30 AM
  • On the node of the cluster that had the issue, Virtualsrv1, there are 3 teams.

    Team_LM – Live Migration – Switch Independent – Dynamic – two 1 Gb adapters
    (HPE Ethernet 1 Gb 4-port 366T Adapter #6 & HPE Ethernet 1 Gb 4-port 366T Adapter #5)
    (Cluster Only)

    Team_OS – OS Team – Switch Independent – Address Hash – one 1 Gb and one 100 Mb adapter
    (HPE Ethernet 1 Gb 4-port 366T Adapter #3 & HPE Ethernet 1 Gb 4-port 366T Adapter #7)
    (Cluster & Client)

    Team_Trunked – VM traffic – Switch Independent – Dynamic – two 10 Gb adapters
    (HPE Ethernet 10Gb 2-port 560FLR-SFP+ Adapter & HPE Ethernet 10Gb 2-port 560FLR-SFP+ Adapter #2)

    There is also the HPE Ethernet 1 Gb 4-port 366T Adapter that is specifically for cluster communication.
    (Cluster Only)
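    Assuming these are standard LBFO teams (which the Switch Independent/Dynamic settings suggest), the layout above can be confirmed with the NetLbfo cmdlets:

    # Show each team with its teaming mode and load-balancing algorithm
    Get-NetLbfoTeam | ft Name, TeamingMode, LoadBalancingAlgorithm

    # Show which physical adapters belong to each team
    Get-NetLbfoTeamMember | ft Name, Team, OperationalStatus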

    Running this PowerShell command gives the following:  Get-ClusterNetwork | ft Name, Metric, AutoMetric, Role

    Name              Metric  AutoMetric  Role
    ----              ------  ----------  ----
    Clust_Mgmt          1000  False       Cluster
    Clust_Mgmt_100MB   80000  True        None
    Live_Migration      2000  False       Cluster
    Team_OS            70385  True        ClusterAndClient

    With the metrics set as they are, cluster communication (heartbeat pings to ensure all nodes are present) should occur through “Clust_Mgmt”, then through “Live_Migration” if that is unavailable, then “Team_OS”.  Correct?
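    For reference, the False AutoMetric values above mean those metrics were assigned manually; that is done like this (a sketch using the names from the output):

    # Pin cluster network priorities by hand (lower metric = preferred)
    (Get-ClusterNetwork "Clust_Mgmt").Metric = 1000
    (Get-ClusterNetwork "Live_Migration").Metric = 2000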

    The big question is why this node of the cluster lost communication when the HPE Ethernet 1 Gb 4-port 366T Adapter #6 (one of the Live Migration adapters) lost connectivity.  That alone shouldn’t cause the cluster to lose communication with this node.

    I want to understand this issue to make sure this doesn’t happen again.

    I have changed the “Receive Buffers” from 256 to 2048 on all of the 10 Gb adapters on all of the nodes, which is where all the VM traffic occurs.  That should fix that issue, but I don’t see how this Event ID 252 issue caused this whole mess.
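    For anyone making the same change, a sketch of applying it to every node in one pass (assumes the FailoverClusters module is available and that the driver exposes the “Receive Buffers” keyword):

    # Push the new Receive Buffers value to the 10 Gb NICs on all cluster nodes
    $nodes = (Get-ClusterNode).Name
    Invoke-Command -ComputerName $nodes -ScriptBlock {
        # LinkSpeed is a formatted string; adjust the match to your adapters
        $tenGig = Get-NetAdapter -Physical | Where-Object { $_.LinkSpeed -eq "10 Gbps" }
        foreach ($nic in $tenGig) {
            Set-NetAdapterAdvancedProperty -Name $nic.Name -DisplayName "Receive Buffers" -DisplayValue 2048
        }
    }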


    Friday, August 16, 2019 6:09 PM
  • I have opened a support ticket with Microsoft about this issue and I am awaiting a response.  I will post the details when I receive them.
    Monday, August 26, 2019 2:57 PM
  • Hi,

    Thanks for the effort you have put into this case.

    By sharing your experience, you can help other community members facing similar problems. Thanks for your understanding.

    I will wait for your good news.

    Best Regards,

    Candy


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com   

    Tuesday, August 27, 2019 1:43 AM