Network glitch between the nodes of a 2-node 2016 cluster

    Question

  • Hi, I am experiencing a strange networking glitch on a 2-node hyper-converged 2016 cluster.  It's causing all of the hair I haven't pulled out to turn grey; this one has me stumped.  Let me describe the cluster:

    2 nodes, each with a Xeon v4, lots of RAM, and connectivity handled by an HP-branded Mellanox ConnectX-3 dual-port card.  The Mellanox has the latest firmware (2.40.5000; I think 5030 just came out but HP hasn't uploaded it yet) and I am running the latest 5.35 drivers.  Storage is handled by S2D using 3 Samsung PM863 enterprise SSDs per machine plus a single boot SSD.  There are about two dozen VMs running on the nodes.  The cluster has two networks:

    Port 1, front-end network, 10 Gbps, connected to a switch; handles all VM/host communication to the world

    Port 2, back-end network, 40 Gbps, connected directly via a fiber DAC; handles S2D, heartbeat, and migrations.  I've tried two different DACs, no change.
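
    For reference, the layout as Windows and the cluster see it can be pulled on either node with something like the following (adapter names elsewhere in this thread are placeholders for whatever these commands actually report):

        # List the physical adapters and how the cluster classifies each network
        Get-NetAdapter     | Format-Table Name, InterfaceDescription, Status, LinkSpeed
        Get-ClusterNetwork | Format-Table Name, Role, Address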

    So now the symptoms:

    1. Storage performance is variable.  One second a VM will get 30k read / 7k write IOPS, the next it will be 60k read / 12k write.  Sequential throughput fluctuates between 200 and 300+ MB/sec.
    2. Backups fail with random timeout messages (I'm using Altaro VM Backup), which the Altaro team has traced back to the network glitch using a tool they built; I'll post the results in a bit.
    3. The CyberPower UPS agent software randomly loses its connection.
    4. Other random glitches: stopping/starting services over the network can be instant or can take two minutes.

    The support team over at Altaro has a great tool that I've been using to test this problem.  It consists of a server and a client that pass traffic between each other and verify the packets have arrived properly (not just a raw bandwidth test).  I've also used Wireshark to verify the issue.  I'll post screenshots of both below.
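
    For anyone wondering what that kind of test looks like conceptually, a minimal sketch is below.  This is not the Altaro tool, just a rough illustration of the idea; the port, the receiver address, and the block layout are all placeholders:

        # --- receiver (run on one host): read fixed-size blocks and verify their contents ---
        $port     = 5201                                                  # placeholder test port
        $listener = [System.Net.Sockets.TcpListener]::new([System.Net.IPAddress]::Any, $port)
        $listener.Start()
        $stream   = $listener.AcceptTcpClient().GetStream()
        $buffer   = New-Object byte[] 4096
        $expected = 0
        while ($true) {
            $got = 0
            while ($got -lt $buffer.Length) {                             # assemble a full 4 KB block
                $n = $stream.Read($buffer, $got, $buffer.Length - $got)
                if ($n -eq 0) { break }
                $got += $n
            }
            if ($got -lt $buffer.Length) { break }                        # sender closed the connection
            $seq = [System.BitConverter]::ToInt32($buffer, 0)             # first 4 bytes = sequence number
            if ($seq -ne $expected)     { Write-Warning "expected block $expected, got $seq" }
            if ($buffer[4095] -ne 0xAB) { Write-Warning "payload corruption in block $seq" }
            $expected++
        }
        $listener.Stop()

        # --- sender (run on the other host): stream numbered, pattern-filled blocks ---
        $client = [System.Net.Sockets.TcpClient]::new("192.168.1.2", 5201)  # placeholder receiver address
        $stream = $client.GetStream()
        $buffer = New-Object byte[] 4096
        for ($i = 4; $i -lt $buffer.Length; $i++) { $buffer[$i] = 0xAB }    # known fill pattern
        for ($i = 0; $i -lt 100000; $i++) {
            [System.BitConverter]::GetBytes([int]$i).CopyTo($buffer, 0)     # stamp the sequence number
            $stream.Write($buffer, 0, $buffer.Length)
        }
        $client.Close()

    Since this is TCP, a true drop shows up as a stall/retransmit rather than a missing block; the content check is there because broken checksum offload can, in principle, let corrupted payloads through.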

    Now for the testing I've already done.  When using the Altaro tool the following situations work, and by work I mean I can let the tool run forever at any setting without a failure:

    • VM to local Host
    • VM to remote Host
    • VM to VM
    • Workstation to either Host

    And with a near immediate failure:

    • Host to Host (either direction, either network, front end or back end).

    So I can run this test to either host over the 10 Gbps front-end network from ANY machine (VM or physical) without issue.  If I run the test host to host using the same network, I get a failure...  I also get a failure if I run the test between the hosts using the 40 Gbps back-end network; changing the DAC doesn't affect it.

    My first thought was that the tool itself had an issue, so I checked it with Wireshark.  If I capture packets while the tool runs between the hosts, I see missing packets; they don't appear in any of the non-failure scenarios.

    So far I've tried adjusting settings like MTU and anything else that looks like it might have an effect, with no luck.  I've swapped DACs with no change as well.  The only thing I've done that *seems* to have made a difference is updating the Mellanox driver from 5.25 to 5.35.  Before the change the Altaro tool would error out when run from VM to host; now that works fine.  It also seemed to completely fix the problem for a short period of time: VM benchmarks became solid at 100k read / 40k write IOPS with 400+ MB/sec sequential, and it was consistent (I believe this is how the cluster should perform all the time).  It didn't last, however, and I am back to having the random glitches.  Any thoughts on the matter?
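
    In case it helps, the jumbo frame setting and driver version per adapter can be compared across the two nodes with something like this (not claiming these are the exact commands I ran, just the idea):

        # Compare jumbo frame settings and driver versions on each node
        Get-NetAdapterAdvancedProperty -Name * -RegistryKeyword "*JumboPacket" |
            Format-Table Name, DisplayName, DisplayValue
        Get-NetAdapter | Format-List Name, InterfaceDescription, DriverVersionString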

    A screen shot of a working test, sequential counters highlighted:

    REDACTED.  MS hasn't verified my ancient account yet, so no screenshots...  I'll see what I can do to get verified and post them.



    • Edited by mdchaser Wednesday, March 08, 2017 6:45 PM
    Wednesday, March 08, 2017 6:44 PM

All replies

  • Hi Sir,

    >>If I run the test host to host using the same network I get a failure...  I also get a failure if I run the test between the hosts using the 40Gbps back end network, changing the DAC doesn't affect it.  

    >>The only thing I've done that *seems* to have made a difference is updating the Mellanox driver from 5.25 to 5.35. 

    >>HP branded Mellanox Connectx-3 dual port card.

    Have you checked whether or not VMQ is enabled for the Mellanox adapters?

    If yes, I'd suggest you try disabling VMQ and then test again.
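
    One way to check and toggle this per physical adapter (the adapter name below is just an example):

        # Check whether VMQ is currently enabled on each physical adapter
        Get-NetAdapterVmq | Format-Table Name, InterfaceDescription, Enabled

        # Disable it on one adapter for testing; Enable-NetAdapterVmq reverts the change
        Disable-NetAdapterVmq -Name "Frontend-10G"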

    Best Regards,

    Elton


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    Thursday, March 09, 2017 11:38 AM
    Moderator
  • Morning, thanks for getting back to me!  I've verified VMQ was disabled on the back-end network and enabled on the front end.  I seem to receive the error no matter the VMQ setting.  I've also tried disabling flow control with no luck; the test runs for less than 10 seconds and then errors out, showing dropped frames in Wireshark.
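
    (For completeness, flow control is exposed through the standardized *FlowControl keyword; roughly the equivalent of what I changed, with a placeholder adapter name, and the same setting is also in the adapter's Advanced properties in Device Manager:)

        # *FlowControl is a standardized keyword: 0 = disabled, 3 = RX & TX enabled
        Get-NetAdapterAdvancedProperty -Name "Backend-40G" -RegistryKeyword "*FlowControl"
        Set-NetAdapterAdvancedProperty -Name "Backend-40G" -RegistryKeyword "*FlowControl" -RegistryValue "0"
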
    • Edited by mdchaser Thursday, March 09, 2017 7:58 PM
    Thursday, March 09, 2017 7:52 PM
  • Hi Sir,

    Based on my experience, I'd suggest narrowing it down by:

    1. Disabling all offload features on the physical NICs (a rough sketch is below).

    2. Replacing cables / changing switch ports for the physical host NICs.
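
    For step 1, something like the following per adapter (adapter names are examples; re-enable the features one at a time afterwards to find the specific culprit):

        foreach ($nic in "Backend-40G", "Frontend-10G") {          # example adapter names
            Disable-NetAdapterChecksumOffload -Name $nic           # TX/RX checksum offload
            Disable-NetAdapterLso -Name $nic -IPv4 -IPv6           # large send offload
            Get-NetAdapterAdvancedProperty -Name $nic |
                Format-Table DisplayName, DisplayValue             # review what else the driver exposes
        }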

     

    Best Regards,

    Elton


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    • Proposed as answer by JustbobChico Tuesday, March 14, 2017 6:46 PM
    Monday, March 13, 2017 3:16 AM
    Moderator
  • Elton, you are a genius!  I disabled TX/RX offload and haven't had a dropped packet since!  Now I can start re-growing my hair...  Any thoughts as to why that caused issues?  Will I see a performance drop with offloading disabled?
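
    For anyone who finds this thread later, the gist of the change was the equivalent of the following on both nodes (adapter names are placeholders, and the same toggles exist in the adapter's Advanced properties):

        # Disable TX/RX checksum offload on both adapters, then confirm the state
        Disable-NetAdapterChecksumOffload -Name "Backend-40G", "Frontend-10G"
        Get-NetAdapterChecksumOffload     -Name "Backend-40G", "Frontend-10G" |
            Format-Table Name, TcpIPv4Enabled, TcpIPv6Enabled, UdpIPv4Enabled, UdpIPv6Enabled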

    Thanks!

    Jeff R.

    Monday, March 13, 2017 10:57 PM
  • Hi Sir,

    >>Any thoughts as to why that caused issues? 

    This is most likely a compatibility issue in how the NIC driver handles network packets when TX/RX checksum offloading is enabled.

    I'd suggest involving the hardware vendor to investigate this behavior.

    Hope this is helpful to you.

    Best Regards,

    Elton


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    Tuesday, March 14, 2017 2:43 AM
    Moderator