VMs have no network connectivity after reboot

    Question

  • Hello,

    We've recently had a problem appear across our Hyper-V 2008 R2 and 2012 clusters. When a virtual machine is rebooted, some of its virtual network cards will have no network connection. The VMs run 2008 R2 and 2012, both of which are affected.

    Some VMs have 4 virtual NICs, and different ones will lose connection at random; it can take 4 reboots before all of them are working. The problem happens as soon as the VM comes up after a reboot: the network shows as connected in Windows, yet no traffic will pass through it. Running ipconfig on an affected VM shows 2 IP addresses for each NIC with no connection - the normal static address and an autoconfiguration 169.254.x.x address.
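    Whether a NIC has fallen back to autoconfiguration can be checked programmatically. A minimal sketch (the function names are illustrative; 169.254.0.0/16 is the APIPA/link-local block defined in RFC 3927):

```python
import ipaddress

# The APIPA / link-local block (RFC 3927).
APIPA_NET = ipaddress.ip_network("169.254.0.0/16")

def is_apipa(addr: str) -> bool:
    """True if addr falls inside the autoconfiguration (APIPA) range."""
    return ipaddress.ip_address(addr) in APIPA_NET

def flag_broken_nics(nics: dict) -> list:
    """Given {nic_name: [addresses]}, return the NICs that have picked up
    an APIPA address alongside (or instead of) their static one."""
    return [name for name, addrs in nics.items()
            if any(is_apipa(a) for a in addrs)]
```

    For example, flag_broken_nics({"LAN": ["192.168.1.10", "169.254.33.7"], "iSCSI": ["10.0.0.5"]}) returns ["LAN"].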

    The hosts and VMs are all up to date with the latest patches, integration services have been updated on all the VMs, and there are no errors showing in the event logs. I've tried reinstalling the virtual NICs, creating a new virtual switch, disabling offloading and disabling VMQ, but nothing has helped. Once all the virtual NICs are connected, they don't have any more problems until the next reboot.

    I'm really stuck with this one now.  Any advice would be appreciated.

    Thursday, December 19, 2013 12:32 PM

Answers

  • We have determined the problem to be caused by a recent Cisco Catalyst IOS upgrade that globally turns on 'ip device tracking'. This happens in Cisco IOS version 15.X.

    IP device tracking, in combination with how Windows 2008 and higher handles gratuitous ARP, is what causes the issue.
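    Roughly what happens: on boot, Windows performs duplicate address detection by sending an ARP probe for its own address, and treats a probe or announcement for that address arriving from a different MAC as a conflict, falling back to an APIPA address. The device tracking probes are ARP requests for the host's address sent from the switch's MAC with a sender IP of 0.0.0.0, which can collide with that check. A toy model of the detection logic (names and the exact rules are illustrative, not the actual Windows implementation):

```python
from dataclasses import dataclass

@dataclass
class ArpProbe:
    sender_mac: str   # source MAC of the frame
    sender_ip: str    # sender protocol address ("0.0.0.0" for probe packets)
    target_ip: str    # the address being probed for

def sees_conflict(own_mac: str, own_ip: str, observed: list) -> bool:
    """Toy model of duplicate address detection: any probe for our own IP
    arriving from a different MAC is treated as a rival claimant."""
    return any(p.target_ip == own_ip and p.sender_mac != own_mac
               for p in observed)

# A device-tracking keepalive from the switch: an ARP probe for the VM's
# address, sent from the switch's MAC with sender IP 0.0.0.0.
ipdt_probe = ArpProbe(sender_mac="00:00:0c:aa:bb:cc",
                      sender_ip="0.0.0.0",
                      target_ip="192.168.1.10")
```

    In this model, sees_conflict("00:15:5d:01:02:03", "192.168.1.10", [ipdt_probe]) returns True - the guest treats the switch's probe as a rival claimant and abandons its static address.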

    You can check whether ip device tracking is enabled by running the command below on your switch; if it is, the output will include 'ip device tracking'.

    show config | inc tracking

    Cisco References

    http://www.cisco.com/en/US/products/ps6638/products_white_paper09186a0080c1dbb1.shtml

    https://supportforums.cisco.com/thread/2244042

    Cisco recommends that you modify the probe delay, which we tried, but this did not fix our situation.

    ip device tracking probe delay 10

    We had to run the following commands on each switch.

    int range gig0/1 - 24
    ip device track max 0

    You can confirm you have fixed the problem by downloading XArp and running it on a host or VM connected to your network; once fixed, the alerts will be gone.

    • Proposed as answer by ColeThompson Monday, January 06, 2014 6:58 AM
    • Marked as answer by bestseany Monday, January 06, 2014 10:46 AM
    Monday, January 06, 2014 6:58 AM

All replies

  • Could do with a little more information. Are these clusters managed by SCVMM 2012? If so, have switches been assigned to hosts using fabric management?

    How have you assigned the virtual network on the hosts - single NIC or teamed? VLAN tagging?

    Microsoft teaming, or vendor teaming, e.g. HP?

    If you were to live migrate one of these guests with a failed network, will it start to work once migrated?

    It seems a bit odd that it's happening to virtual servers on separate clusters. I would look more at your network connectivity to the Hyper-V host servers - it could be an issue with spanning tree, or VLAN tagging at your corporate switch level.

    thanks

    Mark

    Thursday, December 19, 2013 3:02 PM
  • Generally when I have seen a NIC with both a fixed IP address and an APIPA address it has come as a result of an address conflict on the network.  Machine comes up with its defined address, finds something else on the network with that address, so reverts to APIPA.  Any messages in the event log?

    .:|:.:|:. tim

    Thursday, December 19, 2013 4:17 PM
  • I'm having the exact same issue on an 8-node Windows 2012 Hyper-V cluster. It happens very randomly when rebooting the VMs. I have had this happen to dozens of different VMs on different VLANs, and it does not seem to be caused by IP address conflicts. I can easily recreate the issue by rebooting the VM until it gets a 169.254.x.x address, which usually takes between 2 and 13 tries.

    We are using NIC Teaming with 2 separate Teams (Management, Hyper-V Network).

    Servers are using Broadcom 5709C network adapters with the latest firmware and drivers, and Cisco Catalyst switches which have been recently updated.

    Disabling IPv6 on the VMs had no impact on the issue. Neither did disabling 'Large Send Offload Version 2' for both IPv4 and IPv6 on the VMs.  

    No errors in the event logs on the physical hosts. The only warning that shows right after the issue occurs on the VM is Event ID 1014, DNS Client Events: "Name resolution for the name teredo.ipv6.microsoft.com timed out after none of the configured DNS servers responded."


    • Edited by ColeThompson Monday, January 06, 2014 6:56 AM Too much info.
    Thursday, December 19, 2013 9:09 PM
  • Could do with a little more information. Are these clusters managed by SCVMM 2012? If so, have switches been assigned to hosts using fabric management?

    How have you assigned the virtual network on the hosts - single NIC or teamed? VLAN tagging?

    Microsoft teaming, or vendor teaming, e.g. HP?

    If you were to live migrate one of these guests with a failed network, will it start to work once migrated?

    It seems a bit odd that it's happening to virtual servers on separate clusters. I would look more at your network connectivity to the Hyper-V host servers - it could be an issue with spanning tree, or VLAN tagging at your corporate switch level.

    thanks

    Mark

    We're not using SCVMM to manage the clusters. We use NIC Teaming for the VM LAN access, but we also have some non-teamed NICs for iSCSI access within the VMs too. Both types of NIC are affected.

    We don't have any VLAN configuration within Hyper-V or the host, the only VLAN configuration is within the switches themselves. We have 2 Cisco switches with an even split between the 3 cluster nodes and IP SAN. 

    I've not tried live migrating but will give it a go today.

    We've definitely not got any address conflicts.
    • Edited by bestseany Friday, December 20, 2013 9:49 AM
    Friday, December 20, 2013 9:46 AM
  • I'm having the exact same issue on an 8-node Windows 2012 Hyper-V cluster. It happens very randomly when rebooting the VMs. I have had this happen to dozens of different VMs on different VLANs, and it does not seem to be caused by IP address conflicts. I can easily recreate the issue by rebooting the VM until it gets a 169.254.x.x address, which usually takes between 2 and 13 tries.

    We are using NIC Teaming with 2 separate Teams (Management, Hyper-V Network) using independent switches for redundancy. Management uses Hash and Hyper-V Network is using Hyper-V Port.

    I even had the issue happen to one of the Physical hosts where the network did not come back up after the reboot.

    These are IBM x3650 M3 servers using Broadcom 5709C network adapters, with the latest firmware and drivers from IBM, and Cisco Catalyst switches which have been recently updated.

    Disabling IPv6 on the VMs had no impact on the issue. Neither did disabling 'Large Send Offload Version 2' for both IPv4 and IPv6 on the VMs.  

    We use standard VLANs for all VMs with SCVMM 2012 SP1 versus VM Networks.

    No errors in the event logs on the physical hosts. The only warning that shows right after the issue occurs on the VM is Event ID 1014, DNS Client Events: "Name resolution for the name teredo.ipv6.microsoft.com timed out after none of the configured DNS servers responded."

    It looks like you have a similar issue to me. We're using HP DL380p servers, but also have Broadcom network cards and Cisco Catalyst switches. I wonder if there's a common link....
    Friday, December 20, 2013 9:48 AM
  • Ok, another update...

    After a live migrate the NICs start working again! 

    Also, something I didn't notice before: when I do an ipconfig /all, the affected NICs with the 169.x address show '(duplicate)' next to the real static IP. So it must be seeing a conflict somehow....

    Friday, December 20, 2013 10:07 AM
  • It sounds similar to something we had, but the issue I had was VMs randomly losing network connectivity under high load. That was a known issue on 2008 R2 which was fixed, but it seems to have come back in 2012.

    I am currently running 2012 R2 and haven't seen it happen yet.

    To be honest, we use static IP addresses on all our servers; your hosts should really have static IPs - best practice.

    You could run something like Wireshark to look at the network traffic on your DC and DHCP servers when the virtual servers or hosts boot up and request DHCP addresses, and see what is happening.

    It could still be switch port configuration. I would get help from your network guys too.

    Sorry I couldn't be more helpful

    Cheers

    Mark

    Friday, December 20, 2013 10:29 AM
  • I did see the problem about high load causing network loss, but this is different.

    We do have static IPs on all our servers - VMs and hosts. That's what makes this problem strange: it's detecting an IP conflict that doesn't exist.

    I am one of the network guys :-) However, I think this is a Hyper-V/Windows Server problem rather than an actual network problem.

    Friday, December 20, 2013 10:50 AM
  • LOL, cool. Are your host and guest networks on separate VLANs? It's usually good practice to do this.

    Is the NIC binding order set correctly on the hosts?

    Do you have PortFast enabled on your switch ports?

    There are some good articles online for best-practice configuration of Hyper-V hosts. It's worth double-checking that your hosts meet these requirements.

    Are your DC and DHCP/DNS servers separate physical servers from the Hyper-V estate?

    Sorry, so many questions - just trying to build a mental picture of your infrastructure to better understand.

    Cheers

    Mark


    Friday, December 20, 2013 11:19 AM
  • LOL, cool. Are your host and guest networks on separate VLANs? It's usually good practice to do this.

    Is the NIC binding order set correctly on the hosts?

    Do you have PortFast enabled on your switch ports?

    There are some good articles online for best-practice configuration of Hyper-V hosts. It's worth double-checking that your hosts meet these requirements.

    Are your DC and DHCP/DNS servers separate physical servers from the Hyper-V estate?

    Sorry, so many questions - just trying to build a mental picture of your infrastructure to better understand.

    Cheers

    Mark


    The hosts and guests are on different VLANs, as the hosts are on our main LAN and the guests are in a DMZ. We have a couple of other VLANs for iSCSI and cluster communications too. We don't have PortFast enabled on the switches.

    I haven't touched the NIC binding order on the hosts, as it's never been something that needed changing in the past. The hosts use our main LAN DCs for AD and DNS etc., but the VMs use DCs running as VMs in the cluster, as they're on a self-contained network in the DMZ.

    Looking through the event logs again, I've noticed Event ID 4199 from Tcpip saying there was an address conflict for 0.0.0.0, and it gives the MAC address of the switch port.
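    Incidentally, the MAC address in a 4199 event can be used to work out which device is claiming the address. A quick sketch of an OUI (vendor prefix) lookup - the prefix table here is a tiny illustrative sample (00:00:0c is one of Cisco's well-known OUIs, 00:15:5d is Microsoft's Hyper-V range); a real check would use the full IEEE registry:

```python
# Minimal OUI lookup to attribute the MAC from a Tcpip 4199 event.
# Only a couple of sample prefixes are included here.
SAMPLE_OUIS = {
    "00:00:0c": "Cisco",
    "00:15:5d": "Microsoft (Hyper-V)",
}

def vendor_of(mac: str) -> str:
    """Normalise the MAC and look up its 3-byte vendor prefix."""
    prefix = mac.lower().replace("-", ":")[:8]
    return SAMPLE_OUIS.get(prefix, "unknown")
```

    For example, vendor_of("00:00:0C:12:34:56") returns "Cisco", pointing the finger at a switch rather than another host.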


    Friday, December 20, 2013 2:53 PM
  • In our case all hosts have separate VLANs and there are separate VLANs for everything.

    The virtual switches do not have IP addresses.

    PortFast is 'enabled when static access' on the MGMT network. It is set to 'enabled' on the Hyper-V network. All servers and VMs have static IP addresses.

    Interestingly enough this also happened to a client's physical server which is using Intel NICs and Windows Server 2012. In this case it was the management network that had the problem. They were connected to a Catalyst switch as well but with port fast disabled.

    I suspect the problem is Windows 2012 networking as I never had this problem with any VMs on the Windows 2008 R2 Hyper-V platform that were also using the same hardware, switches and VLANs. The only difference now is Windows 2012 Hyper-V hosts and NIC teaming.

    Also, there are no problems with Live Migrations. The only time this happens is when rebooting the VMs and I have also seen this happen to 2 physical Windows 2012 Hyper-V hosts.  

    Friday, December 20, 2013 3:24 PM
  • Please try enabling PortFast and re-try your diagnostics. I have had nothing but problems with networks when PortFast is disabled. I think this will resolve your issues.

    Cheers

    Mark

    Friday, December 20, 2013 5:48 PM
  • Similar problem here. Server 2012 on DL360p G8, 4 NIC ports: 2 dedicated to iSCSI, 1 for the management network and 1 for VMs. No teaming. The VM NIC is a trunk port connected to a Cisco 3750 switch stack. No PortFast.

    Oddly, I stumbled upon this while configuring edge switches (Cisco Small Business SG300 with uplinks to the 3750 stack). Each switch reported itself as a duplicate on the uplink port. Today I noticed the Windows guest event message after rebooting and losing connectivity, even though the vNIC has a statically assigned address. Switching to DHCP, the problem goes away; back to static, the problem randomly returns, and the APIPA address is given along with the static gateway and DNS settings.

    I have tried resetting to DHCP, removing the NIC within the VM, shutting down, and removing/recreating the vNIC. Still no joy after applying a static address. Been doing virtualization for a very long time, mostly VMware, and never have I seen such a pesky issue of this kind. One step away from calling Microsoft. Anyone else had any success fixing this? -SH

    SH

    Tuesday, December 31, 2013 2:02 AM
  • Still troubleshooting the problem. This is not a PortFast issue for us, as far as we can tell. Interestingly enough, we have a pair of 3750s on the network and an SG200.
    Tuesday, December 31, 2013 2:10 AM
  • Hi,

    I've only just come back to look at this due to the Christmas break. I'll see if enabling portfast helps.

    Thursday, January 02, 2014 8:58 AM
  • Any word on this? Our issue is not PortFast-related. I suspect it is a NIC teaming configuration issue, which I can start a new post on.
    Saturday, January 04, 2014 5:14 PM
  • We have determined the problem to be caused by a recent Cisco Catalyst IOS upgrade that globally turns on 'ip device tracking'. This happens in Cisco IOS version 15.X.

    IP device tracking, in combination with how Windows 2008 and higher handles gratuitous ARP, is what causes the issue.

    You can check whether ip device tracking is enabled by running the command below on your switch; if it is, the output will include 'ip device tracking'.

    show config | inc tracking

    Cisco References

    http://www.cisco.com/en/US/products/ps6638/products_white_paper09186a0080c1dbb1.shtml

    https://supportforums.cisco.com/thread/2244042

    Cisco recommends that you modify the probe delay, which we tried, but this did not fix our situation.

    ip device tracking probe delay 10

    We had to run the following commands on each switch.

    int range gig0/1 - 24
    ip device track max 0

    You can confirm you have fixed the problem by downloading XArp and running it on a host or VM connected to your network; once fixed, the alerts will be gone.

    I've just tried this change on our switches and it appears to have fixed it! I've rebooted a few different VMs several times and their virtual NICs stay connected every time.

    Thanks for your solution - it is appreciated. This problem has been driving me mad!

    One question, though: I think one of our 2008 R2 Hyper-V clusters that also has the same problem is connected to an HP switch. Would having a Cisco switch somewhere else downstream still be able to affect those hosts?

    Monday, January 06, 2014 10:46 AM
  • Yes, having a Cisco switch with ip device tracking enabled anywhere on your network, connected with a trunk port, would cause the problem, and it would affect any host or VM on the network.

    You can test by installing XArp which will detect the problem as an ARP attack and will provide you with the MAC address of the problem switch or switches.

    You can download a free version of XArp from http://www.chrismc.de/development/xarp/

    Monday, January 06, 2014 2:50 PM
  • Yes, having a Cisco switch with ip device tracking enabled anywhere on your network, connected with a trunk port, would cause the problem, and it would affect any host or VM on the network.

    You can test by installing XArp which will detect the problem as an ARP attack and will provide you with the MAC address of the problem switch or switches.

    You can download a free version of XArp from http://www.chrismc.de/development/xarp/

    That's great. Thanks for your help!
    Monday, January 06, 2014 3:12 PM