Network performance issues (crashes) in Hyper-V 2016

    Question

  • Hello folks,

    We have a problem. We have created a Hyper-V 2016 cluster with three HP ProLiant DL380 G8 servers. We are using two Intel X540-T2 (dual-port 10 Gb/s NICs) in every server. The drivers are all up to date.

    I have created two NIC teams, named "Ockero Virtual Switch 1" and "Team01". "Ockero Virtual Switch 1" is dedicated to VM traffic and "Team01" is dedicated to management and host traffic. They have the following configuration:

    Name: Team01
    Members: Ethernet 16, Ethernet 15
    Teaming mode: SwitchIndependent
    Load balancing algorithm: Dynamic
    ------------------------------------------------
    Name: Ockero Logical Switch 1
    Members: Ethernet 13, Ethernet 14
    Teaming mode: SwitchIndependent
    Load balancing algorithm: Dynamic
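
    (For reference, an LBFO team with these settings can be created from PowerShell roughly like this; a sketch using the member and team names above, not necessarily how we built ours:)

        New-NetLbfoTeam -Name "Team01" -TeamMembers "Ethernet 15","Ethernet 16" -TeamingMode SwitchIndependent -LoadBalancingAlgorithm Dynamic
        New-NetLbfoTeam -Name "Ockero Logical Switch 1" -TeamMembers "Ethernet 13","Ethernet 14" -TeamingMode SwitchIndependent -LoadBalancingAlgorithm Dynamic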

    We have had some major crashes where not only the VMs lose network connectivity but also the hosts. And you know what happens when a host loses its connection to the other hosts... well, that's something you don't want in your production environment. We contacted Microsoft and they told us to do the following:

    1. Disable TCP Chimney Offload, Receive Side Scaling and NetDMA on the server.
      1. Disable TCP Chimney Offload:
        • Use administrative credentials to open a command prompt.
        • At the command prompt, type the following command and then press Enter: netsh interface tcp set global chimney=disabled
      2. Disable Receive Side Scaling:
        • Use administrative credentials to open a command prompt.
        • At the command prompt, type the following command and then press Enter: netsh interface tcp set global rss=disabled
      3. Disable NetDMA:
        • Use administrative credentials to open a command prompt.
        • At the command prompt, type the following command and then press Enter: netsh interface tcp set global netdma=disabled

    2. Disable VMQ on the physical adapters.
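
    For step 2, VMQ can be switched off per adapter from PowerShell, roughly like this (a sketch assuming the physical adapter names above):

        # disable VMQ on the team members, then confirm the state
        Disable-NetAdapterVmq -Name "Ethernet 13","Ethernet 14","Ethernet 15","Ethernet 16"
        Get-NetAdapterVmq | Format-Table Name, Enabled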

    We did the above and we haven't had any crashes since, but we have strange performance issues, and I think this is related to the host traffic rather than the VM traffic. Every now and then we get traffic stalls. If I copy a file from the file server to my computer, I first get 1 Gb/s, then it drops to 0 b/s and stays there for 10-15 seconds, and then it's back to 1 Gb/s again. But this also occurs in a VM when copying from the local hard drive, from C:\ to C:\. I don't know if these are two different things, but I have an idea that it could be the network load balancing algorithm. Has anyone out there seen this before?

     

    Friday, March 10, 2017 7:34 AM

All replies

  • Hi,

    Are there any other applications running?

    >> But this also occurs in a VM when copying from the local hard drive, from C:\ to C:\

    Do you mean from one VM to another?

    When copying, check the hardware usage in Task Manager; is it normal?

    Best Regards,

    Leo


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    Monday, March 13, 2017 4:27 AM
    Moderator
  • Hi,
    Are there any updates on the issue?
    You could mark the reply as an answer if it is helpful.
    Best Regards,
    Leo

    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    Monday, March 27, 2017 2:05 AM
    Moderator
  • Hello,

    I've been pulling my hair out for the last month with exactly the same issue. We've got Dell servers with teamed 10 Gbps NICs (~20 Gbps) for management traffic, cluster and Live Migration. This issue started appearing after we applied the April patches, at least that's when we noticed. My investigation is pointing to RSS being broken: when we start a Live Migration the server crashes, never able to reach more than 4 Gbps on a link that used to run fine at 20 Gbps for years. Opened a case with MS but not getting much love other than "have you turned on Jumbo Frames, LSO", etc... blah, blah. The reason you don't get high network speed anymore is that RSS is disabled, and RSS is what distributes traffic across multiple logical CPUs; each one can handle no more than 4-5 Gbps. As I type this, I've taken one of the hosts out of the cluster and rebuilt it, but applied no patches. I've been testing all morning and achieving 19.7 Gbps consistently with RSS on. With RSS off it drops to 3.8 Gbps... Will update.

    Alek

    Tuesday, May 09, 2017 12:54 AM
  • First and foremost, don't use a file copy to test network throughput. You should be using ntttcp, which is a Microsoft network performance profiling tool built for exactly this use case.

    https://gallery.technet.microsoft.com/NTttcp-Version-528-Now-f8b12769
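
    A basic run looks roughly like this (a sketch; 192.168.1.10 stands in for the receiver's IP, and the thread count and duration are illustrative):

        Receiver:  ntttcp.exe -r -m 8,*,192.168.1.10 -t 30
        Sender:    ntttcp.exe -s -m 8,*,192.168.1.10 -t 30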

    Second, as Alek mentioned, RSS distributes network load across the CPU cores, thereby increasing potential throughput. Each CPU core can process roughly 4 Gbps, so if you disable RSS you lose all of those performance benefits.
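
    To confirm the current state on a host, something along these lines should do it (a sketch):

        netsh interface tcp show global
        Get-NetAdapterRss | Format-Table Name, Enabled, MaxProcessors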

    If I were you, I'd review the Hyper-V VMMS and other Hyper-V logs around the time the issue last occurred to see if any relevant event data is present. Also, if you are running a cluster, I'd recommend dumping the cluster log via PowerShell and scanning it for events. This data will be helpful in determining root cause, and it's information you could supply to a vendor (e.g. HP or Intel) if needed in the future.
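
    For example (a sketch; C:\Temp is just a placeholder destination folder):

        # recent Hyper-V VMMS admin events on the host
        Get-WinEvent -LogName "Microsoft-Windows-Hyper-V-VMMS-Admin" -MaxEvents 200 | Format-Table TimeCreated, Id, LevelDisplayName, Message
        # dump the cluster log for all nodes in local time
        Get-ClusterLog -Destination C:\Temp -UseLocalTime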

    With that said, I'd review the HP/Intel websites for firmware updates and look over the release notes/change logs.

    Last, you might want to work through a process of elimination, re-enabling each of the above features one at a time until you find the culprit. Once you've identified the issue and have gathered logs that point to the root cause (e.g. the NIC), that's data you can use in a support case with HP/Intel.
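
    If you go the elimination route, the inverse of the earlier commands would be roughly this (a sketch; re-enable one item at a time and test in between):

        netsh interface tcp set global rss=enabled
        netsh interface tcp set global chimney=enabled
        netsh interface tcp set global netdma=enabled
        Enable-NetAdapterVmq -Name "Ethernet 13","Ethernet 14","Ethernet 15","Ethernet 16"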


    Tuesday, May 09, 2017 1:22 AM
  • Update:

    Confirmed that the April 2017 CU and the May 2017 CU break RSS. On the other hand, the March 2017 CU does NOT break it. The trick to spotting broken RSS is to use the Get-NetAdapterRss cmdlet and pay specific attention to IndirectionTable: [Group:Number]. On a broken system, RSS will show Enabled : True but IndirectionTable: [Group:Number] will be blank:

    Name                                            : NIC2
    InterfaceDescription                            : QLogic BCM57800 10 Gigabit Ethernet (NDIS VBD Client) #43
    Enabled                                         : True
    NumberOfReceiveQueues                           : 8
    Profile                                         : NUMA
    BaseProcessor: [Group:Number]                   : 0:0
    MaxProcessor: [Group:Number]                    : 0:46
    MaxProcessors                                   : 16
    RssProcessorArray: [Group:Number/NUMA Distance] : 0:0/0  0:2/0  0:4/0  0:6/0  0:8/0  0:10/0  0:12/0  0:14/0
                                                      0:16/0  0:18/0  0:20/0  0:22/0  0:24/32767  0:26/32767  0:28/32767
                                                      0:30/32767
                                                      0:32/32767  0:34/32767  0:36/32767  0:38/32767  0:40/32767
                                                      0:42/32767  0:44/32767  0:46/32767
    IndirectionTable: [Group:Number]                :

    whereas normally functioning RSS looks like this:

    Name                                            : NIC2
    InterfaceDescription                            : QLogic BCM57800 10 Gigabit Ethernet (NDIS VBD Client) #43
    Enabled                                         : True
    NumberOfReceiveQueues                           : 8
    Profile                                         : NUMA
    BaseProcessor: [Group:Number]                   : 0:0
    MaxProcessor: [Group:Number]                    : 0:46
    MaxProcessors                                   : 16
    RssProcessorArray: [Group:Number/NUMA Distance] : 0:0/0  0:2/0  0:4/0  0:6/0  0:8/0  0:10/0  0:12/0  0:14/0
                                                      0:16/0  0:18/0  0:20/0  0:22/0  0:24/32767  0:26/32767  0:28/32767
                                                      0:30/32767
                                                      0:32/32767  0:34/32767  0:36/32767  0:38/32767  0:40/32767
                                                      0:42/32767  0:44/32767  0:46/32767
    IndirectionTable: [Group:Number]                : 0:0   0:24    0:2     0:26    0:4     0:28    0:6     0:30
                                                      0:0   0:24    0:2     0:26    0:4     0:28    0:6     0:30
                                                      0:0   0:24    0:2     0:26    0:4     0:28    0:6     0:30
                                                      0:0   0:24    0:2     0:26    0:4     0:28    0:6     0:30
                                                      0:0   0:24    0:2     0:26    0:4     0:28    0:6     0:30
                                                      0:0   0:24    0:2     0:26    0:4     0:28    0:6     0:30
                                                      0:0   0:24    0:2     0:26    0:4     0:28    0:6     0:30
                                                      0:0   0:24    0:2     0:26    0:4     0:28    0:6     0:30
                                                      0:0   0:24    0:2     0:26    0:4     0:28    0:6     0:30
                                                      0:0   0:24    0:2     0:26    0:4     0:28    0:6     0:30
                                                      0:0   0:24    0:2     0:26    0:4     0:28    0:6     0:30
                                                      0:0   0:24    0:2     0:26    0:4     0:28    0:6     0:30
                                                      0:0   0:24    0:2     0:26    0:4     0:28    0:6     0:30
                                                      0:0   0:24    0:2     0:26    0:4     0:28    0:6     0:30
                                                      0:0   0:24    0:2     0:26    0:4     0:28    0:6     0:30
                                                      0:0   0:24    0:2     0:26    0:4     0:28    0:6     0:30
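
    A quick way to check every RSS-enabled adapter for this symptom might be something like this (a sketch; it just counts IndirectionTable entries, which should be non-zero on a healthy adapter):

        Get-NetAdapterRss | Where-Object Enabled | Select-Object Name, @{ Name = 'IndirectionEntries'; Expression = { ($_.IndirectionTable | Measure-Object).Count } }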

    There is also another telltale sign of broken RSS when using ntttcp to test speed: even though on a 20 Gbps team the Ethernet adapter may show speeds up to 19 Gbps, the CPU tab of Task Manager shows only a single logical CPU being used. A few times I falsely believed all was good, joined a patched (broken) host back to the cluster, and the issue came forward once it was loaded with VMs: the link maxes out at 4 Gbps during Live Migration (the same link that had just shown 19 Gbps with ntttcp), the cluster heartbeat sharing the same physical NICs fails, and the server crashes as the host gets ejected.

    Repeated several tests with the same outcomes.

    Hopefully someone at MS will get alerted to this.

    Alek


    • Edited by alexser2006 Wednesday, May 10, 2017 2:42 AM Correction
    Wednesday, May 10, 2017 2:40 AM