VMs will get stuck stopping and unable to migrate servers from that host

    Question

  • We've implemented a 4-server failover cluster on Windows Server 2016 Datacenter. Three times over the last 6 months or so we've had the same incident: the VMs on one host get stuck in a stopping state when they're shut down. We're also unable to migrate VMs to or from that host, or to drain it via Failover Cluster Manager.

    In the Event Viewer under Hyper-V-VMMS Admin we'll see events 19060 for the VMs that are stuck:

    'VM' failed to perform the 'Unregister Virtual Machine Configuration' operation. The virtual machine is currently performing the following operation: 'Turning Off'. (Virtual machine ID IDGUID)

    In the Failover Cluster Manager cluster events, we'll see events 1205, 1254 and 1069 for the VMs.

    To resolve it we have to shut down the host, which then hangs because it tries to drain its roles to the other hosts but can't, so we end up forcibly powering off the host.

    While it is stuck I've tried to forcibly stop the Hyper-V Virtual Machine Management service, but the stop hangs completely. The same happens when I try to kill the affected VM processes. We also see host status warnings in SCVMM, generally saying the host isn't responding.

    I'm looking at logging a case for it now. I've already updated all of the hosts in the cluster to the latest updates (about a month ago). They also have the AD DS role installed (not set up) because of the Host Guardian role, but I've been removing it since we haven't used Host Guardian yet.

    Has anyone out there experienced similar issues and been able to resolve them?

    Wednesday, May 03, 2017 11:06 PM

All replies

  • Hi,

    Have you seen any further issues since your post?

    Wednesday, May 17, 2017 9:14 AM
  • Hi Sir,

    >>The VMs on one host will get stuck in a stopping state when they're shutdown

    Does the issue always happen on the same Hyper-V host?

    If so, I'd suggest evicting that host from the cluster and reinstalling the Hyper-V role.

    (As you mentioned, the VMMS service also gets stuck when you try to stop it.)

    Best Regards,

    Elton



    Friday, May 19, 2017 8:16 AM
    Moderator
  • I had a ticket lodged with Microsoft support. While they didn't fix the issue, I ended up finding the root cause. One of the SFP+ adapters was generating NDIS event 10400, stating that the driver detected the hardware wasn't responding to instructions, so Windows would reset the adapter. The adapter was part of a NIC team that backed a vSwitch in Hyper-V. For some reason, when the adapter gets reset it generates an error on the vSwitch, which then seems to completely break the VMMS service.

    Microsoft has offered no explanation as to why this happens. The point of NIC teaming is that if one adapter drops, everything keeps working. We ended up updating drivers, and I logged a call with the OEM to get firmware and other updates done. All we can do now is cross our fingers that it doesn't error again.
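
    For anyone wanting to check their own hosts for the same NDIS event, a minimal sketch follows. It assumes the event is logged to the System log under the Microsoft-Windows-NDIS provider and that it is run in an elevated session on the suspect host; the 30-day window is an arbitrary choice.

    ```powershell
    # Look back 30 days (arbitrary window) for NDIS adapter-reset events.
    $filter = @{
        LogName      = 'System'
        ProviderName = 'Microsoft-Windows-NDIS'
        Id           = 10400
        StartTime    = (Get-Date).AddDays(-30)
    }

    # SilentlyContinue suppresses the error Get-WinEvent raises when nothing matches.
    Get-WinEvent -FilterHashtable $filter -ErrorAction SilentlyContinue |
        Sort-Object TimeCreated |
        Select-Object TimeCreated, MachineName, Message
    ```

    If this returns entries whose timestamps line up with the VMs getting stuck, the adapter reset is a likely suspect.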

    Tuesday, May 23, 2017 12:09 PM
  • cdry_10,

    We are seeing the exact same scenario with our hypervisor.  It runs Server 2016, has two Intel X710 10G network cards that are teamed, and after seeing your post, we noticed it is also getting event ID 10400 in the event viewer close to when we notice the issue.  We'll update the drivers as soon as possible.  In the meantime, we've created a scheduled task that fires if event ID 10400 is logged, which sends us an email, and schedules a reboot of the server during off-hours. We'll be happy to provide a sample of the script if you are interested.

    I will continue to update this post with information I learn about this issue, and would very much appreciate if you would do the same.  Thanks!

    Tuesday, May 23, 2017 10:30 PM
  • We too just had this same issue occur with an Intel(R) Ethernet 10G 4P X540/I350 on Server 2016. It logged Event ID 10400 and all the VMs on that server eventually ended up in a "stopping" state (while still responding).

    We have redundant switches set up and we had powered off a switch during maintenance (something we've done many times with no issue on Server 2012 R2). So shutting off the switch connected to that particular port was the catalyst in our case. After reading your post (THANK YOU), we discovered the 10400 event and it correlated with the switch shutdown time.

    Did MS ever solve this issue?

    @ACT1  Can you provide that script?

    Monday, June 19, 2017 8:02 PM
  • @Seth H.

    I apologize for the delay, I didn't see any notifications of a reply on the thread.  We have two servers with the exact same hardware, and only one of them is experiencing this issue.  The one without the issue did have a slightly older firmware on the Intel X710 NIC, so we downgraded the firmware on the problem server with no luck.  We also updated drivers, then rolled back to older versions, and still had the issue each time.  Even worse, when the script that I mentioned earlier would pick up Event ID 10400 and schedule a reboot, the server was unable to kill the VM service and would hang on shutdown.  The only way we are able to get the server functional again is to log into the Dell iDRAC interface and perform a cold boot, since the server is at a remote datacenter.  Our next idea is to scrap the Intel cards and replace them with another brand...although we went to Intel because of constant issues with Broadcoms.  Honestly I'd rather not provide the script as is at this point because it may actually harm your server in its current state - however, here's a modified version that will at least notify you. 

    In Task Scheduler, set the trigger to "On an event" with the details "Log: System, Source: Microsoft-Windows-NDIS, Event ID: 10400".

    # Full timestamp for the alert (the original captured only the hour).
    $time = Get-Date -Format 'yyyy-MM-dd HH:mm'

    # Send-MailMessage uses $PSEmailServer as its default SMTP server.
    $PSEmailServer = "SMTP_SERVER_NAME"

    $body = "Event ID 10400 logged on AFFECTED_SERVER_NAME.  This event has been connected to an issue that hoses the VMMS service and makes it unable to manage virtual machines.  Please don't attempt to restart, shutdown, or pause any VMs on this hypervisor.  Error logged at $time."

    # The body is plain text, so -BodyAsHtml is not needed.
    Send-MailMessage -To "YOUR_EMAIL@YOUR_DOMAIN.COM" -From "SOME_ADDRESS@YOUR_DOMAIN.COM" -Subject "AFFECTED_SERVER_NAME: EVENT ID 10400 Logged at $time" -Body $body
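
    The same event trigger can also be created from the command line instead of the Task Scheduler GUI. This is only a sketch: the script path and task name are hypothetical placeholders, and it assumes the notification script has been saved to disk and that the command is run from an elevated PowerShell prompt.

    ```powershell
    # Hypothetical location of the notification script; adjust to wherever you saved it.
    $scriptPath = 'C:\Scripts\Notify-Ndis10400.ps1'

    # XPath filter matching NDIS event 10400 in the System log.
    $query = "*[System[Provider[@Name='Microsoft-Windows-NDIS'] and EventID=10400]]"

    # /SC ONEVENT fires the task on a matching event; /EC names the event channel.
    schtasks.exe /Create /TN 'Notify-NDIS-10400' /SC ONEVENT /EC System /MO $query `
        /TR "powershell.exe -NoProfile -File $scriptPath" `
        /RU SYSTEM /F
    ```

    Running the task as SYSTEM means it fires even when nobody is logged on, but the SMTP server then has to accept mail from the host's machine account.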

    Tuesday, July 25, 2017 5:44 PM
  • Hello all,

    We are facing the same issues after upgrading to 2016. Did you find any solution for this? In addition, our MMC consoles (Failover Cluster Manager and Hyper-V Manager) crash.

    Thank you.

    Friday, August 11, 2017 6:43 AM
  • This has been seen time and time again, and 99% of the time it comes back to bad NIC drivers or firmware. First things first (and I hate this as a response): turn off VMQ if you can. Personally I really like VMQ and do not recommend turning it off, but the issue is normally around this area, where the NIC vendor has a bad implementation of it.
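
    As a rough sketch of how to check and toggle VMQ from PowerShell: the adapter name below is a hypothetical placeholder, so substitute the physical team members shown by the first command. Changing the setting may briefly interrupt traffic on that adapter, so treat this as something to try in a maintenance window.

    ```powershell
    # Show the current VMQ state for every adapter that supports it.
    Get-NetAdapterVmq | Format-Table Name, InterfaceDescription, Enabled -AutoSize

    # Hypothetical adapter name; replace with a physical NIC from the list above.
    $adapterName = 'NIC1'

    # Turn VMQ off on that adapter.
    Disable-NetAdapterVmq -Name $adapterName

    # To revert later:
    # Enable-NetAdapterVmq -Name $adapterName
    ```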

    Friday, August 11, 2017 8:22 AM
  • We are dealing with the same thing recently on our 2016 cluster with 10 nodes. It really is KILLING us. I have a support ticket with Microsoft open right now but so far they are not being helpful.

    The Hyper-V Virtual Machine Management service gets stuck as well.

    It also isn't just when shutting down VMs; it happens when they are being live migrated too.

    Monday, February 19, 2018 6:57 AM
  • I have the same issue, but ours occurs while creating checkpoints during backups.

    https://social.technet.microsoft.com/Forums/en-US/0d99f310-77cf-43b8-b20b-1f5b1388a787/hyperv-2016-vms-stuck-creating-checkpoint-9-while-starting-backups?forum=winserverhyperv

    The issue went away for about 6 months, but has come back after the last lot of updates.

    Ours are Dell R730s with Intel X710 NICs in a team.

    Did anyone find a solution, please?

    Tuesday, March 13, 2018 3:48 PM
  • Hi.

    We have hit another variant of this bug. Our situation is described here: https://social.technet.microsoft.com/Forums/windowsserver/en-US/2729e5a4-4810-457e-a917-3ce48c10cf73/hyperv-2016-unable-to-expand-vhdx-in-windows-2016-cluster-and-while-powering-down-vm-it-gets?forum=winserverhyperv

    Do you use Veeam B&R for backup? Or another software with CBT/RCT?

    Tuesday, March 13, 2018 4:11 PM
  • It seems to be a similar issue. Yes, I do use Veeam, but in this lab environment I didn't have Veeam installed.

    I was able to narrow it down to Hyper-V replication. If it's enabled I can easily reproduce the problem; once I disable it, the problem stops.

    If anyone is willing to put in the effort, I would love it if someone could set up a lab cluster with Hyper-V replication to another cluster via the broker, to see if they experience the same issue.

    Tuesday, March 13, 2018 4:22 PM
  • We use Hyper-V replication between standalone hosts and a cluster (using the broker). The problem also exists, for example, on standalone hosts that hold replicas.
    Tuesday, March 13, 2018 7:32 PM
  • Hello,

    We are experiencing the same issue on multiple clusters and multiple hosts. We also see the NDIS 10400 errors approximately 6 hours before the VMs run into the 'stopping' state.

    Was your issue solved by updating the NIC driver? We are using Intel NICs in Dell R730 servers.

    Any suggestions on how to solve this are very welcome,


    Jan

    Friday, June 01, 2018 5:29 AM
  • We were having this issue, but in our case it was 100% related to cluster-to-cluster Hyper-V replication. A bug fix was published for Windows 10 (the KB also mentions 2016), but there were no downloads for Server 2016. The fix does appear to have been rolled into a CU, because after we recently applied 2018-04 and 2018-05 the issue went away.

    The bug is documented here for Windows 10, but the downloads don't include a fix for Server 2016: https://support.microsoft.com/en-us/help/4077525/windows-10-update-kb4077525

    Addresses issue that causes Hyper-V VMs that are replicated using Hyper-V Replica or Azure Site Recovery to stop responding at 92% if a Windows Server 2016 Failover Cluster is set up with NIC Teaming enabled.  The issue also occurs while stopping the VM, during Live Migration, while stopping the VMMS service, or during Host node shutdown. The user must then use a hard restart on the host machine to recover.

    This was my exact issue minus the 92% (mine was 84%). I believe they rolled this into another CU but didn't document it because after I applied all updates to all my hosts the issue is no longer present. I wish I knew which CU they rolled it into.
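
    To compare patch levels across nodes, something like the following can help. It is a sketch that assumes the FailoverClusters module is installed and that you have remote WMI access to every node; note that cumulative updates supersede one another, so only the most recent CU may be listed.

    ```powershell
    # For each cluster node, list the five most recently installed updates.
    Get-ClusterNode | ForEach-Object {
        Get-HotFix -ComputerName $_.Name |
            Sort-Object InstalledOn -Descending |
            Select-Object -First 5 CSName, HotFixID, Description, InstalledOn
    }
    ```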

    My Microsoft case number was 118021317638159. 


    • Proposed as answer by Quadrantids Friday, June 01, 2018 1:15 PM
    • Edited by Quadrantids Friday, June 01, 2018 1:16 PM
    Friday, June 01, 2018 1:15 PM
  • We have a 3-node 2016 cluster and have similar issues. Live migration can fail at 84%. When this happens on a host, the host is lost; only a hard reset brings the host and its VMs back. Frustrating. We are on the May 2018 updates. We do NOT use replicas; it's just a 3-node cluster with central FC storage. The VMMS service gets stuck and cannot be stopped any more. A shutdown fails because the migration of the hosted VMs fails, so only the BMC helps. We have an MS case open but haven't heard from them in days. After the great backup bug (backups of 2012 R2 DCs hang, hard reset of the host needed), which took 8 months to get solved, I don't expect a solution from MS in the next 6 months. Since we host a multi-tenant environment, we cannot wait. We are evaluating VMware now, and if it works out we will move the Hyper-V machines of our other 1000 on-premises customers to VMware too. This product called Server 2016 is one of the most unreliable server platforms I've ever seen.
    Friday, June 01, 2018 6:20 PM