Intermittent connection reset on 2012 R2 Hyper-V with a CentOS 6.5 client VM

  • Question

  • I am really at a loss as to what is causing random TCP connection resets on our CentOS 6.5 client VMs. This is mostly seen on HTTP connections when end users are using the Apache-based apps. The error shows in their browsers as "The connection to the server was reset while the page was loading". They simply hit refresh and the page loads fine. I have been able to reproduce the error while running Wireshark on my system, and it shows "Acknowledgment number: Broken TCP. The acknowledge field is nonzero while the ACK flag is not set". We also see a MySQL connection error that we believe has the same cause.

    The VMs are running on an HP ProLiant BL460c G6 blade with an NC360m mezzanine NIC. The driver is up to date. We are running the standalone ("free version") Hyper-V Server 2012 R2. The CentOS clients use the standard "network adapter", connected to a virtual switch that is attached to the NC360m NIC. No logs on either the CentOS clients or the Hyper-V server show any errors. I wanted to try using a legacy network adapter, but it does not seem to work. The CentOS VMs are Gen1, so I am not sure why the legacy adapter will not work.

    Any ideas or suggestions on what to do next to resolve this? I have searched the net and forums for answers and solutions but have come up with nothing.
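
    The Wireshark warning quoted above describes segments whose ACK flag is clear while the acknowledgment number is nonzero. A minimal sketch of a capture filter for exactly that condition follows; the helper name capture_broken_ack, the default interface eth0, and the port 80 restriction are assumptions for illustration, not something from the original setup.

```shell
# pcap filter: TCP segments with the ACK flag clear but a nonzero
# 32-bit acknowledgment number (TCP header bytes 8-11).
# Note: tcp[...] offsets are evaluated for IPv4 traffic.
BROKEN_ACK_FILTER='tcp[tcpflags] & tcp-ack == 0 and tcp[8:4] != 0'

# Hypothetical helper: print matching HTTP segments on one interface.
# Needs root; interface name and port are assumptions.
capture_broken_ack() {
    iface="${1:-eth0}"
    tcpdump -nn -i "$iface" "($BROKEN_ACK_FILTER) and port 80"
}
```

    Running the same capture on the client, the Hyper-V host, and the guest would show at which hop the malformed segments first appear.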

    • Moved by BrianEhMVP Wednesday, October 15, 2014 8:24 PM
    Wednesday, October 15, 2014 7:50 PM

Answers

  • Finally got to the bottom of this. It ended up being our Barracuda Web Filter, which was set to scan traffic to destinations on our internal network. Once I created a rule to ignore internal server addresses, the resets went away. It appears the Barracuda scan sometimes changed the packet payload, causing the handshake to fail as described above.
    Thursday, December 4, 2014 9:15 PM

All replies

  • This is a stab in the dark, but could you try without the firewall and SELinux?

    # For now, let's disable SELinux
    # (permissive immediately, disabled after the next reboot)

    setenforce 0
    sed -i "s|SELINUX=enforcing|SELINUX=disabled|g" /etc/selinux/config


    # For now, let's turn the firewall off and keep it off at boot

    service iptables stop
    chkconfig iptables off

    Thursday, October 16, 2014 1:56 PM
  • Unfortunately, I already have those two suggestions implemented. I did think that maybe the VM template I generated the servers from was corrupt somehow, so I built a fresh VM from an ISO image using CentOS 7, but got the same error this morning. I am now working on building a Gen2 VM to test this out.

    This really is baffling.

    Thursday, October 16, 2014 2:02 PM
  • What's the frequency of the problem?

    Does it work alright for a while after a reboot and then get worse?

    I know it's frustrating to get these open-ended questions, but we have to start somewhere.

    Thursday, October 16, 2014 2:07 PM
  • It really is random. It can happen when a user first connects to an app on the VM or after they have already connected, for example when submitting a form or navigating to a different page. The VMs we see this error on are the ones with the highest traffic. Reboots do not seem to change the frequency of the error. It looks like it fails at the vNIC, since the request never makes it to the application layer.
    Thursday, October 16, 2014 2:13 PM
  • Just as a test, you could try the latest version of Fedora (21), which should be similar enough for your app to install on and has most of the upstream Hyper-V driver changes.

    I know it's a pain, but you could probably set that up fairly quickly.  The Hyper-V drivers in CentOS 6.5 and 7 are a bit behind, and there has been some work done on the network side.

    Having said that, I am running CentOS 6.5 with a MySQL/Apache app with no problems.

    Thursday, October 16, 2014 2:22 PM
  • Are you running it as a Gen1 or Gen2 VM? Also, what kind of hardware are you on? Fedora or SUSE was on my list of OS's to test out and I might very well be doing that soon.
    Thursday, October 16, 2014 2:26 PM
  • It's CentOS 6.5 on 2008 R2 on a Dell PowerEdge R710, so I guess Gen1?  So it's not the same host environment as yours, and you may be hitting some 2012 R2 issues with LIS.  It seems like a pretty fundamental problem though, so I would be surprised if no one had seen it before.  Maybe someone from Microsoft should chime in here.

    I suggest trying Fedora 21, as it's closest to CentOS (yum etc.) and should be familiar.  You should be up and running pretty quickly.

    Let us know how you go.


    Thursday, October 16, 2014 2:31 PM
  • Also, does dmesg show anything relating to hv_netvsc?
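
    A quick way to pull only the Hyper-V driver messages out of the kernel log, as a minimal sketch. The function name hv_messages is hypothetical, and the pattern covers the common hv_* driver names, so it may need widening on other kernels.

```shell
# Keep only kernel log lines mentioning the Hyper-V drivers.
# Reads stdin, so it works on live dmesg output or on a saved log file.
hv_messages() {
    grep -iE 'hv_(vmbus|netvsc|utils?|storvsc)|hyperv'
}

# Typical use on the guest:
#   dmesg | hv_messages
```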
    Thursday, October 16, 2014 2:36 PM
  • DMESG only shows the following after a boot, but I think it is typical:

    hv_vmbus: registering driver hyperv_fb
    hyperv_fb: Screen resolution: 1152x864, Color depth: 32
    Console: switching to colour frame buffer device 144x54
    hv_utils: Registering HyperV Utility Driver
    hv_vmbus: registering driver hv_util
    sd 0:0:0:0: Attached scsi generic sg0 type 0
    hv_vmbus: registering driver hv_netvsc
    hv_netvsc: hv_netvsc channel opened successfully
    hv_netvsc vmbus_0_14: Device MAC 00:15:5d:64:4c:1a link state up

    Also, I should note that these reset errors do not occur with persistent connections. I can have an SSH session open all day long and not have a problem. I should also say that back when we were on 2012 R1, I do not think this occurred either. I would have to check, because between R1 and R2 we moved all our apps to AWS, but then decided to bring them back in-house.

    Thursday, October 16, 2014 2:46 PM
  • Hmm....

    I think at this stage you should try fedora 21 and report back.

    Just as an elimination step.

    Thursday, October 16, 2014 3:00 PM
  • Reporting back that I have this resolved.

    I did install Fedora 20, and it still had the issue when using the Hyper-V network adapter. I switched it over to the legacy network adapter and could not reproduce the error. The trick then was figuring out whether I could get CentOS 6.5 to work with the legacy network adapter, which I have not been able to do. I found a post on a separate forum mentioning that when more than one vCPU is allocated to a CentOS VM, you need to disable irqbalance via "chkconfig irqbalance off" and reboot.

    After doing this on a production CentOS box, I have been unable to reproduce the error. Prior to this fix this morning, I could trigger the error on the same VM pretty consistently.
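
    The irqbalance step above can be sketched as follows, assuming the SysV tools that ship with CentOS 6 (service and chkconfig). The wrapper name disable_irqbalance is hypothetical; it must be run as root on the guest.

```shell
# Stop irqbalance now and keep it from starting at boot.
# A reboot afterwards matches the steps described in the post.
disable_irqbalance() {
    service irqbalance stop
    chkconfig irqbalance off
    # Show the per-runlevel state to confirm it is off everywhere.
    chkconfig --list irqbalance
}
```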

    Friday, October 17, 2014 7:47 PM
  • Why didn't you want to try 21? That would have incorporated more patches.

    Anyway, glad the workaround works for you, but it would be nice to clear this up.  It seems a pretty fundamental problem, and since legacy network adapters are no longer available in Gen2 VMs, we really need to get to the bottom of this.

    I have business-critical systems running CentOS 6.5 on 2008 R2, and it doesn't leave me with much confidence when I hear of this problem and the legacy network adapter workaround.  BTW, performance is greatly reduced on legacy, if I remember correctly from many years ago (the last time I used a legacy adapter), but I guess that isn't an issue for an internal web app.

    I really think Microsoft should weigh in on this.


    Friday, October 17, 2014 9:01 PM
  • I did not want to run an alpha release for a production business app if it had turned out to be the only working solution for us, since that would have meant switching the servers from CentOS to Fedora.

    I agree that Microsoft should chime in on this. As you mentioned, these are internal web apps, so the reduced performance is no big deal. That is also the reason my important stuff is not virtualized at this point.
    Friday, October 17, 2014 11:14 PM
  • Update: I just had a connection reset, so it's back to the drawing board on this. I really hope someone at Microsoft will read this and at least point me in a direction to investigate.
    Saturday, October 18, 2014 2:13 AM
  • Finally got to the bottom of this. It ended up being our Barracuda Web Filter, which was set to scan traffic to destinations on our internal network. Once I created a rule to ignore internal server addresses, the resets went away. It appears the Barracuda scan sometimes changed the packet payload, causing the handshake to fail as described above.
    Thursday, December 4, 2014 9:15 PM
  • Thanks for posting what the solution is.  I'm glad to know it wasn't a problem with Linux itself on Hyper-V. :-)

    Michael Kelley, Lead Program Manager, Open Source Technology Center

    Saturday, December 6, 2014 1:08 AM
    Moderator
  • Thank you so much for posting this solution. You helped me fix an issue I'd been struggling with for longer than I'd like to admit.
    Wednesday, May 8, 2019 6:58 PM