Hyper-V 2016 VMs stuck 'Creating checkpoint 9%' while starting backups

  • Question

  • We have two clustered W2016 Hyper-V hosts. Every couple of days, one of the hosts gets stuck when the backup kicks off. In Hyper-V Manager the VMs all say 'Creating checkpoint 9%'; it's always the same percentage, 9%. You can't cancel the operation and the VMMS service refuses to stop. The only way out of the mess is to shut down the VMs and hard-reset the affected node. The backup then works for a few days before it all starts again.

    The only event I can see on the affected node is:

    Event ID: 19060 source: Hyper-V-VMMS

    'VMName' failed to perform the 'Creating Checkpoint' operation. The virtual machine is currently performing the following operation: 'Creating Checkpoint'.

    Can anybody help please? Cluster validation is clean. Hosts and guests are patched up.

    Tuesday, March 28, 2017 8:40 PM

All replies

  • Hi Sir,

    Are you using Windows Server Backup to back up these VMs?

    How long is the backup interval?

    Does the backup get stuck at "9%" every time?

     

    Best Regards,

    Elton


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    Wednesday, March 29, 2017 8:46 AM
    Moderator
  • Thanks for responding,

    The backup is Veeam 9.5

    Backup interval is configured to run each night at 23:00, it normally takes 20 minutes to finish.

    When the backup job starts it calls the Hyper-V host to take a checkpoint; it's this checkpoint that stalls at 9% each time, not the backup itself.

    VMMS becomes unresponsive on the host and I can't even Live Migrate VMs; hard-resetting the host seems to be the only way out of the situation.

    Wednesday, March 29, 2017 9:10 AM
  • @AustinT have you tried restarting VMMS?

    restart-service VMMS

    What does the job information tell you about this job ?

    gwmi -namespace root\virtualization\v2 -class msvm_concretejob

    gwmi -namespace root\virtualization\v2 -class msvm_migrationjob

    gwmi -namespace root\virtualization\v2 -class msvm_storagejob
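    If VMMS is still partly responsive, you could also try asking the stuck job to cancel itself. This is only a sketch: RequestStateChange is inherited from the CIM_ConcreteJob class (4 = Terminate), and a hung VMMS may simply ignore the request.

    ```powershell
    # Sketch: list running virtualization jobs and request termination.
    # JobState 4 = Running; RequestStateChange(4) = Terminate (per CIM_ConcreteJob).
    $jobs = Get-WmiObject -Namespace root\virtualization\v2 -Class Msvm_ConcreteJob |
        Where-Object { $_.JobState -eq 4 }
    foreach ($job in $jobs) {
        "{0}: {1}% complete, started {2}" -f $job.ElementName, $job.PercentComplete, $job.StartTime
        if ($job.Cancellable) {
            $job.RequestStateChange(4) | Out-Null   # may fail if VMMS is hung
        }
    }
    ```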

    Wednesday, March 29, 2017 4:15 PM
  • Hi, 

    The VMMS service is unresponsive and won't restart.

    Here is the output from the msvm_concretejob....the other two returned nothing.

    PS C:\> gwmi -namespace root\virtualization\v2 -class msvm_concretejob 

    __GENUS                 : 2
    __CLASS                 : Msvm_ConcreteJob
    __SUPERCLASS            : CIM_ConcreteJob
    __DYNASTY               : CIM_ManagedElement
    __RELPATH               : Msvm_ConcreteJob.InstanceID="8C874D34-8F00-412F-8F92-056A62C954CB"
    __PROPERTY_COUNT        : 41
    __DERIVATION            : {CIM_ConcreteJob, CIM_Job, CIM_LogicalElement, CIM_ManagedSystemElement...}
    __SERVER                : HV2
    __NAMESPACE             : root\virtualization\v2
    __PATH                  : \\HV2\root\virtualization\v2:Msvm_ConcreteJob.InstanceID="8C874D34-8F00-412F-8F92-056A62C954CB"
    Cancellable             : True
    Caption                 : Creating Checkpoint
    CommunicationStatus     :
    DeleteOnCompletion      : False
    Description             : Creating Virtual Machine Checkpoint
    DetailedStatus          :
    ElapsedTime             : 00000001183119.965512:000
    ElementName             : Creating Checkpoint
    ErrorCode               : 0
    ErrorDescription        :
    ErrorSummaryDescription :
    HealthState             : 5
    InstallDate             : 16010101000000.000000-000
    InstanceID              : 8C874D34-8F00-412F-8F92-056A62C954CB
    JobRunTimes             : 1
    JobState                : 4
    JobStatus               : Job is running
    JobType                 : 70
    LocalOrUtcTime          : 2
    Name                    : Creating Checkpoint
    Notify                  :
    OperatingStatus         :
    OperationalStatus       : {2}
    OtherRecoveryAction     :
    Owner                   : NT AUTHORITY\SYSTEM
    PercentComplete         : 9
    PrimaryStatus           :
    Priority                : 0
    RecoveryAction          : 2
    RunDay                  :
    RunDayOfWeek            :
    RunMonth                :
    RunStartInterval        :
    ScheduledStartTime      : 20170327231150.000000-000
    StartTime               : 20170327231150.000000-000
    Status                  : OK
    StatusDescriptions      : {Job is running}
    TimeBeforeRemoval       : 00000000000500.000000:000
    TimeOfLastStateChange   : 20170327231150.000000-000
    TimeSubmitted           : 20170327231150.000000-000
    UntilTime               :
    PSComputerName          : HV2

    __GENUS                 : 2
    __CLASS                 : Msvm_ConcreteJob
    __SUPERCLASS            : CIM_ConcreteJob
    __DYNASTY               : CIM_ManagedElement
    __RELPATH               : Msvm_ConcreteJob.InstanceID="C4CA521A-3309-48D8-B659-F7EF6B95615B"
    __PROPERTY_COUNT        : 41
    __DERIVATION            : {CIM_ConcreteJob, CIM_Job, CIM_LogicalElement, CIM_ManagedSystemElement...}
    __SERVER                : HV2
    __NAMESPACE             : root\virtualization\v2
    __PATH                  : \\HV2\root\virtualization\v2:Msvm_ConcreteJob.InstanceID="C4CA521A-3309-48D8-B659-F7EF6B95615B"
    Cancellable             : True
    Caption                 : Creating Checkpoint
    CommunicationStatus     :
    DeleteOnCompletion      : False
    Description             : Creating Virtual Machine Checkpoint
    DetailedStatus          :
    ElapsedTime             : 00000001184028.776685:000
    ElementName             : Creating Checkpoint
    ErrorCode               : 0
    ErrorDescription        :
    ErrorSummaryDescription :
    HealthState             : 5
    InstallDate             : 16010101000000.000000-000
    InstanceID              : C4CA521A-3309-48D8-B659-F7EF6B95615B
    JobRunTimes             : 1
    JobState                : 4
    JobStatus               : Job is running
    JobType                 : 70
    LocalOrUtcTime          : 2
    Name                    : Creating Checkpoint
    Notify                  :
    OperatingStatus         :
    OperationalStatus       : {2}
    OtherRecoveryAction     :
    Owner                   : NT AUTHORITY\SYSTEM
    PercentComplete         : 9
    PrimaryStatus           :
    Priority                : 0
    RecoveryAction          : 2
    RunDay                  :
    RunDayOfWeek            :
    RunMonth                :
    RunStartInterval        :
    ScheduledStartTime      : 20170327230241.000000-000
    StartTime               : 20170327230241.000000-000
    Status                  : OK
    StatusDescriptions      : {Job is running}
    TimeBeforeRemoval       : 00000000000500.000000:000
    TimeOfLastStateChange   : 20170327230241.000000-000
    TimeSubmitted           : 20170327230241.000000-000
    UntilTime               :



    • Edited by AustinT Wednesday, March 29, 2017 5:48 PM
    Wednesday, March 29, 2017 5:47 PM
  • The issue came back even with CSV cache disabled. See this thread for people having similar issues.

    All posters have Dell-based Intel 10GbE cards, with Windows Server 2016 installed.

    https://social.technet.microsoft.com/Forums/WINDOWS/en-US/7b95bc5b-02d1-4dbb-a341-0517ae30cd9e/vms-will-get-stuck-stopping-and-unable-to-migrate-servers-from-that-host?forum=winserverhyperv

    Tuesday, March 13, 2018 3:53 PM
  • It appears that I have the same issue... but this is on a single HP server, Windows 2012 R2, using StorageCraft ShadowProtect for backup.

    Even the 9% part. 


     
    Thursday, March 29, 2018 10:18 AM
  • @seriouslytho Do you have VMQ enabled? Do you have Intel network adapters with the Microsoft driver installed?

    • Proposed as answer by salasidis Sunday, August 25, 2019 10:17 PM
    Thursday, March 29, 2018 11:29 AM
  • Austin,

    Did you ever find a resolution?  I'm having the issue with a Dell EMC XC640 Cluster.  Is VEEAM the culprit?

    Regards,

    John

    Wednesday, October 17, 2018 7:01 PM
  • Did anyone find a resolution? I currently have the same issue with Dell M630 cluster, attached to SAN.
    Friday, November 16, 2018 4:18 PM
  • Hi,

    The solution for me was to update the network adapter firmware and use the latest Intel drivers from the Dell support website.

    IMHO it was the inbox drivers from Microsoft causing the issues; the Intel drivers seem to have fixed it.

    All the best

    Friday, November 16, 2018 4:28 PM
  • Hey guys,

    the solution for me was to rename a broken Hyper-V VM's folder so it was no longer loaded.

    The VM was all in one folder, including its disks and machine configuration, so I stopped and killed the Hyper-V services and simply renamed that folder. All services started up and all the other virtual machines worked fine after this.

    Best regards,

    Daniel Hebel

    Thursday, February 21, 2019 10:55 AM
  • I've got that same issue... stuck at 9% creating a checkpoint on one node of a 2-node Hyper-V cluster. I'm also using Veeam 9.5u4.

    I've got the latest NIC drivers installed and the latest updates on Windows Server 2019.

    Daniel, what do you mean by "I renamed the folder"?

    I can't restart the node right now or do anything drastic to try to recover during production hours. I tried restarting one VM and now it's stuck, unable to fully turn off. I assume from the above that I can't simply taskkill the vmms or vmwp processes.


    I notice two events in the VMMS admin event log which seem to be at the start of the issue: error 32587 and 32510. Both say "the description of event ID #### from source cannot be found..." They referenced a VM that was in the off state. I was also using Hyper-V replication for a few key VMs and have turned that off per https://social.technet.microsoft.com/Forums/WINDOWS/en-US/7b95bc5b-02d1-4dbb-a341-0517ae30cd9e/vms-will-get-stuck-stopping-and-unable-to-migrate-servers-from-that-host?forum=winserverhyperv
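    For reference, the raw events can also be pulled with Get-WinEvent when Event Viewer can't render the descriptions (a sketch, using the event IDs above):

    ```powershell
    # Sketch: pull the VMMS admin events around the time the checkpoint hung.
    Get-WinEvent -FilterHashtable @{
        LogName = 'Microsoft-Windows-Hyper-V-VMMS-Admin'
        Id      = 32587, 32510
    } -ErrorAction SilentlyContinue |
        Format-List TimeCreated, Id, Message
    ```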
    • Edited by trump26901 Wednesday, March 13, 2019 2:35 PM
    Wednesday, March 13, 2019 2:17 PM
  • Hi trump26901,

    What kind of checkpoint are your guests configured for: Production or Standard?

    mlavie 

    Wednesday, March 13, 2019 3:09 PM
  • Production, with the check-box option to create a standard checkpoint if the guest does not support creation of production checkpoints.

    There are about 30 VMs in the backup job, all of which are in the same cluster. Most use the Veeam app-aware processing function, but a few are copy-only. The VMs that are stuck are all on the same host and are a mixture of app-aware and copy-only in Veeam. I had three VMs that were set to replicate to an off-cluster host, and one of those was on the problem host as well.

    I got the same outputs as AustinT did back in his old post with the WMI calls: only gwmi -namespace root\virtualization\v2 -class msvm_concretejob gives a response, and my response looks almost identical. I didn't try killing vmms since he said that didn't help, and at least for the moment I don't want to upset the cluster until I can allow for a more controlled potential outage.

    Is there a way to force those Msvm_ConcreteJob jobs to stop?

    Wednesday, March 13, 2019 3:40 PM
  • Hi trump26901,

    Could you please try setting the checkpoints explicitly to Standard (and not as a sub-option under Production)?
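    If it's easier, the same change can be made from PowerShell (a sketch; 'MyVM' is a placeholder name):

    ```powershell
    # Sketch: switch a VM to standard checkpoints, avoiding the
    # Production/VSS checkpoint path. 'MyVM' is a placeholder.
    Set-VM -Name 'MyVM' -CheckpointType Standard

    # Or for every VM on the host:
    Get-VM | Set-VM -CheckpointType Standard
    ```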

    Please let me know if that helps.

    mlavie

    Wednesday, March 13, 2019 3:49 PM
  • I can't... it's stuck creating a snapshot at 9%.

    There are 7 VMs on the same host and they are all affected. The other 25 VMs on the other host were perfectly fine and completed their backups and snapshots without issue.

    Wednesday, March 13, 2019 3:54 PM
  • It was 100% drivers for me. Is this a Dell server with Intel NICs?

    Use this PowerShell to see the driver version and more importantly the driver provider.  It should say Intel and not Microsoft.

    Get-NetAdapter | where {$_.InterfaceDescription -notlike "Hyper-V*"} | where {$_.InterfaceDescription -notlike "Microsoft*"}| FT Name, DriverInformation, DriverProvider -AutoSize

    Wednesday, March 13, 2019 3:59 PM
  • Supermicro server with Intel NICs.  

    Intel driver provider... driver date 2018-12-06, version 1.9.230.0, NDIS 6.82. It's an X722 card.

    Wednesday, March 13, 2019 4:02 PM
  • All I can say is that I had the same problem, and changing the checkpoint to explicitly Standard solved it.
    Wednesday, March 13, 2019 4:19 PM
  • I'll keep that in mind. Maybe the production checkpoint increases the chances of the problem coming up, but for now all I can say is that something on the host broke.

    I should be able to get in a full cluster shutdown tonight, so that should hopefully give me some more breathing room to troubleshoot.

    Wednesday, March 13, 2019 4:34 PM
  • I am having the exact same issue: Dell XC 630 servers (essentially R730s), Intel X520 cards, and using Veeam; drivers are Intel 4.1.4.0. Checkpoints are set to use Production with standard as fallback. Did you manage to find anything in your troubleshooting, trump26901?

    At the moment I do not appear to be able to stop the checkpoint creation; VMs on this host will not shut down or migrate off, so it looks like I'll have to do the forced host reboot. I could do without this happening again. Has anyone found any further resolution?

    Thursday, March 28, 2019 12:45 PM
  • Nothing yet. I did a "controlled" shutdown of the cluster and all running VMs (any that would not shut off, I got stopped at the shutting-off phase; then I shut down the cluster and used taskkill to allow the nodes to reboot). I haven't run into the problem again yet, so I don't want to do anything too drastic and try to fix something that only happens under a very rare set of conditions which we might never see again. If it happens to me again I'll start looking into it more seriously.
    Friday, March 29, 2019 2:18 PM
  • We had the same issue today with Windows Server 2016 Datacenter running Hyper-V, and VM running Windows Server 2016 Standard, Exchange 2016 15.1.1415.2.

    Cancelled checkpoint creation from Hyper-V - no result.

    Graceful stop of Veeam backup job - no result

    Force stop Veeam backup job - after one minute the checkpoint creation stopped. VM was on, but terminal would not connect. Force shutdown VM and boot again - VM works as normal.

    Veeam log after stopped job: 

    02-04-2019 06:02:51 :: Failed to create VM recovery checkpoint (mode: Veeam application-aware processing) Details: Job failed ('Checkpoint operation for '<server name>' failed. (Virtual machine ID <VM GUID>)

    Checkpoint operation for '<server name>' was cancelled. (Virtual machine ID <VM GUID>)

    '<server name>' could not initiate a checkpoint operation: This operation returned because the timeout period expired. (0x800705B4). (Virtual machine ID <VM GUID>)  

    and

    02-04-2019 06:23:19 :: Retrying snapshot creation attempt (Failed to create production checkpoint.)

    I found this article too: http://www.checkyourlogs.net/?p=60293

     
    • Edited by PawnProxy Tuesday, April 2, 2019 8:18 AM
    Tuesday, April 2, 2019 8:16 AM
  • Was that host in a cluster or just standalone?

    It happened again to us the other night. I've changed all VMs to standard checkpoints now, as opposed to production, per mlavie's suggestion. It's nice to know that I don't have to totally reboot the host next time, as long as I can turn off the Veeam server.

    I didn't see anything particularly obvious in my logs for why it's getting stuck, so hopefully others are watching this thread and together we can work out a commonality.

    My environment:

    2-node S2D Cluster on Windows Server DC 2019 - latest updates

    Intel X722-4 NICs for hosts/VMs, Mellanox 4lx crossover cables for RDMA/cluster

    Veeam 9.5u4 (I just looked and there is now a u4a available that I will update to now).


    Tuesday, April 2, 2019 3:08 PM
  • Good Afternoon,

    Please confirm that the Security Policy "User Rights Assignment > Log on as a batch job" is not defined by Group Policy on any Hyper-V Hosts.

    You can confirm if it is by performing the following steps;

    • Start
    • Run
    • "Secpol.msc", click OK
    • Local Policies > User Rights Assignment > Log on as a batch job
    • Confirm that "Add User or Group..." is not greyed-out.

    If this has been defined, remove the group policy from the host - update group policy, then reboot the host.

    Confirm if the issue still exists.

    Thanks.

    Thursday, April 4, 2019 2:45 PM
  • I checked both my cluster hosts and both are NOT greyed out. Both contain "Administrators", "Backup Operators", and "Performance Log Users".

    I'm not sure if the timing is repeatable or just random, but it appears to have happened roughly three weeks apart... I installed and set up the system; roughly three weeks later the issue happened, and then roughly three weeks after that it happened again. If it repeats at the same pace, it will happen again in roughly two weeks.

    Thursday, April 4, 2019 2:57 PM
  • I have a host in the same boat: not clustered, using Quest Rapid Recovery, Windows Server 2019 Standard. VSS gets whacked and then VMs are stuck at the 9% checkpoint.

    The host is an Intel S2600WFT board with X722 10GBASE-T built-in adapters.

    PowerShell does report they are using Microsoft drivers, per one of the comments in this thread.

    I will see about updating to the Intel drivers after hours.

    https://downloadcenter-origin.intel.com/product/89015/Intel-Server-Board-S2600WFT

    Jason

    Monday, April 8, 2019 9:03 PM
  • We also have a customer who is affected by this. 

    Their setup:

    • 3 Node Hyper-V Cluster running Server 2019 Datacenter, all with the latest updates/firmware/drivers
    • Replication server at remote site
    • HPE MSA2052 with the latest firmware
    • Altaro VM Backup 8.2.1.3 

    So far this has affected VMs on different hosts, so it is not specific to one host. There appears to be no pattern to this. The backups run as scheduled each night, and one VM will get stuck on 'Creating Checkpoint (9%)'. Following this, the VMs on that node can no longer be managed. You cannot:

    • Create checkpoints
    • Live migrate to another host
    • Quick migrate to another host
    • Manage replication (if replication is enabled for that VM)

    The following does continue to work:

    • You can log in to each VM on the affected host, either via RDP or Hyper-V Console
    • Replication-enabled VMs continue to replicate as normal, except for the affected VM. The affected VM stops replicating.

    Essentially, VMs on the affected host respond as though there is no issue with this host.

    Please allow me to re-iterate that this has affected VMs which are configured for replication, and VMs which are not configured for replication. A different VM has been affected, on a different host, each time this has happened.


    Previously, the following recovered the host:

    • Restart VMMS (this took a very long time, and crashed out another VM)
    • Reboot the affected node

    Following _Dickins' suggestion, we removed the GPO from the hosts which defines users in 'Log on as a service'. This restored the default account(s) which require this access, but unfortunately made no difference.

    This is a brand new Infrastructure Refresh, which was only installed in 2019. Everything is fully up to date in terms of drivers/firmware.

    Each VM has the following Integration Service enabled:

    • Operating system shutdown
    • Data Exchange
    • Heartbeat
    • Backup (volume shadow copy)

    And the following disabled:

    • Time synchronization
    • Guest services
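    For anyone comparing setups, these toggles can be listed from PowerShell (a sketch; 'MyVM' is a placeholder):

    ```powershell
    # Sketch: show which integration services are enabled for a VM.
    Get-VMIntegrationService -VMName 'MyVM' |
        Format-Table Name, Enabled, PrimaryStatusDescription -AutoSize
    ```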

    Our customer is using 'Production checkpoints', but I'm considering changing these to 'Standard checkpoints' as a test, as others have said this has resolved the issue.


    Tuesday, April 9, 2019 9:10 AM
  • Well, the standard checkpoint didn't fix it for me. I've seen a bunch of "Cluster Shared Volume 'CSV#' has entered a paused state because of '(c0e7000b)'. All I/O will temporarily be queued until a path to the volume is reestablished." messages: event 5120 and event 5142. The odd thing is that the error is on the node that owned the CSV resource disk, which also happened to be the node that has the affected VMs. Not sure if it's a pattern yet, but the last two, and possibly three, times this happened it has been the same host with the same problems.
    Monday, April 15, 2019 4:12 PM
  • Since my response on Tuesday, April 9 - the issue has so far not reoccurred. If this does reoccur, I will post here again with more information.
    Wednesday, April 17, 2019 11:48 AM
  • Just to note, having the same issue with Acronis Advanced Backup on Windows Server 2019.

    Stuck at creating checkpoint 9%.

    Cannot stop Hyper-V; it's stuck at stopping. The only solution is to reboot the server (which promptly hangs at stopping Hyper-V), and then, after a few seconds of that, remotely process-kill vmms.exe. The server will then reboot.
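    For the remote process-kill step, something along these lines works (a sketch; 'HV2' is a placeholder host name, and killing vmms.exe is strictly a last resort):

    ```powershell
    # Sketch: forcefully kill VMMS on a hung host from another machine.
    # 'HV2' is a placeholder; the VMs on the host will be orphaned until reboot.
    taskkill /S HV2 /F /IM vmms.exe

    # Or via PowerShell remoting:
    Invoke-Command -ComputerName HV2 -ScriptBlock { Stop-Process -Name vmms -Force }
    ```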

    My NICs are 40G Intel XL710-QDA2. The drivers are Microsoft, but their date is later than the one on the Intel web site, and both are version 6.80.

    The "Log on as a batch job" local security policy mentioned above is OK.

    This only occurs on 2019; the 2012 R2 servers have no problems.

    Friday, April 19, 2019 5:02 PM
  • Hello,

    I'm facing a similar issue while trying to create checkpoints of multiple VMs: the VMs become unresponsive.

    A single-VM checkpoint works fine without any issue.

    The issue started on 10th April. Earlier it was fine.

    I applied the following patches to the server:

    KB4091664
    KB4480961
    KB4489882
    KB4487026

    KB4489882 targets the following components:

    Security updates to Microsoft Edge, Microsoft Scripting Engine, Internet Explorer, Windows Shell, Windows App Platform and Frameworks, Windows Kernel-Mode Drivers, Windows Hyper-V, Windows Datacenter Networking, Windows Fundamentals, Windows Server, Windows Kernel, Windows MSXML, and the Microsoft JET Database Engine.

    Unfortunately, the customer refuses to take a simultaneous backup of the Hyper-V VMs using the Windows Server Backup tool.

    Does anybody have a suggestion on this? Please help.


    Jaril Nambiar


    • Edited by Jaril Nambiar Saturday, April 20, 2019 5:27 PM
    Saturday, April 20, 2019 5:24 PM
  • Same problem here on a Hyper-V 2019 failover cluster, on both nodes.
    Friday, April 26, 2019 6:53 AM
  • Also having the same issue with a newly installed 2-node failover cluster running Server 2019 with the Hyper-V role, CSV over iSCSI. The latest Intel drivers/firmware, Windows updates, etc. are applied. I left the default checkpoint setting of production, then standard if not supported.

    I'm certain our issues are related to the backups, which had run successfully for the past 2 weeks but hung on week 3, which others have mentioned above. We use Veritas Backup Exec 20.3 Rev 1188, which is also backing up a 2012 R2 Hyper-V cluster without problems.

    After noticing our backup job was hung and a VM status showing Creating Checkpoint (9%), I stopped the Backup Exec agent on the Hyper-V host, which took forever to stop. Now the Volume Shadow Copy (VSS) service is in a stopping state. I can't manage/migrate/shut down VMs, but they are thankfully still running. I will have to do a forceful shutdown tonight.

    Since this issue is happening with both Server 2016 and 2019 Hyper-V, and with various backup software, I'd say this is a Microsoft problem?

    Wednesday, May 8, 2019 4:03 PM
  • The problem continues. It seems I can get one backup done OK before the problem occurs; after that it is stuck eternally at Creating Checkpoint 9%. This is with the latest Acronis backup software.

    If I turn off the Volume Shadow Copy Service and the Volume Shadow Copy Service for virtual machines, the problem does not occur, but then the backups would be unreliable (so not an option).

    Monday, May 13, 2019 2:02 AM
  • I haven't seen this error come back since April 15th, and that breaks the rough pattern of every two weeks I had experienced prior to that. Some additional notes:

    I originally created nested mirror-accelerated parity drives AND enabled deduplication. I then added a number of additional physical disks to each node, optimizing the storage pool each time. I have since moved everything to simple two-way mirror drives WITHOUT deduplication. Eventually I'll try adding the nested mirror setup back, as I'm guessing it was either the dedup chunks or the mirror-accelerated parity overhead that might have been contributing to the problem.

    Monday, May 13, 2019 1:39 PM
  • I also have the same issue on a brand new Hyper-V host running Windows Server 2019 Standard with 5 VMs: 2 x Windows Server 2019, 2 x Windows Server 2008 R2 and 1 x Windows 7 Pro.

    The backup software is Veritas Backup Exec v20.3, but I noticed the issue already occurs when creating a manual checkpoint myself from the Hyper-V console on one of the 2008 R2 servers; it also gets stuck at 9%. There is no way to recover without rebooting the Hyper-V host. As I'm a Microsoft partner with 10 Action Pack incidents, I will contact Microsoft Support about this issue.

    The specific server is a Dell PowerEdge R740 with quad-port Intel I350 Gigabit NICs...

    Tuesday, May 14, 2019 1:50 PM
  • Thanks for the update. Good luck, and please let us know what shakes out of it.

    Tuesday, May 14, 2019 2:35 PM
  • I am having the same issue.

    Server 2019 STD with Hyper-V. Veeam 9.5 U4a running on a separate box. The host is a Dell PowerEdge R730XD with Intel 10GbE and 1GbE cards. The only fix is to hard-reboot the server.

    The problem started for me with a Veeam backup failing on a guest that suddenly acted as though it had lost its network connection. The Veeam backup fails. I try to shut down the guest and it hangs at "shutting down." The next scheduled Veeam backup then gets stuck at 9% on a fully functioning VM; the VM being backed up runs just fine. Trying to shut down guests from the host, they get stuck at "stopping". The fix is to hard-reboot the box. I try to shut down the guests via RDP to hopefully get them to stop somewhat gracefully, then I shut down the host and wait a good while, about an hour, and then I have to reset the server.

    From all these postings, it really feels like an Intel card/driver issue. Could this be resolved by changing out the Intel cards for Broadcom? On the PowerEdges, the 10GbE/1GbE part code is Y36FR, I believe. I am checking with my reseller now to find the availability of that part to ship me one. I will post my findings.

    Wednesday, May 15, 2019 7:54 PM
  • My server has Intel 40G QSFP+ network cards, and this problem occurs.

    Even the reboot is slow; it takes a while on stopping Hyper-V. To speed up the reboot I have to remotely process-kill the Hyper-V process, which then allows the boot to proceed in about 10-15 seconds.

    Wednesday, May 15, 2019 11:30 PM
  • Just to confirm, since changing the Checkpoint method from 'Production checkpoints' to 'Standard checkpoints' - this has continued to work, without fail, for over 5 weeks.

    • Proposed as answer by DEmms Thursday, May 16, 2019 8:55 AM
    Thursday, May 16, 2019 8:55 AM
  • My server has Intel 40G QSFP+ network cards, and this problem occurs.

    Even the reboot is slow; it takes a while on stopping Hyper-V. To speed up the reboot I have to remotely process-kill the Hyper-V process, which then allows the boot to proceed in about 10-15 seconds.

    I do not believe the reboot itself is slow; rather, it's the hung VM that won't shut down. If you kill those tasks first, the reboot would have been normal.

    That being said, I was (fingers crossed it's in the past) having the issue, and my production NICs are Intel X722 series cards. I switched to standard checkpoints, but I definitely had one additional recurrence after that. I then moved the VMs to a new CSV (I'm using an S2D cluster) and haven't had a problem since. I'm not sure which was the fix, but so far it's been running for over a month without issue, and it used to happen roughly every two weeks prior to that.

    Thursday, May 16, 2019 12:47 PM
  • I managed to get 21 hours from a reboot before another problem with backups. After the reboot last night, Veeam was able to get several backups out of the host with no issues. I have migrated the other guests off the host and all I have left is Exchange 2013. Veeam gets stuck at 9% and won't go any further. I killed the VMMS task and it wouldn't register as stopped. I logged in via RDP and shut down Exchange. Waited. I was able to get a successful reboot out of it, which is a good thing. Then I have to retry my Veeam backups for Exchange. I also have checkpoints disabled.

    Looking at this, and seeing that Intel networking is a fairly common thread, I made a change: I stopped/disabled the Intel management service to see if it has any impact.

    I have the Broadcom 10GbE/1GbE daughtercard coming tomorrow. I will drop it in and we will see how it does.


    • Edited by joeatmjp Friday, May 17, 2019 12:17 PM Spelling
    Friday, May 17, 2019 1:31 AM
  • I sadly also have these issues at a customer. This is a brand new 2-node S2D cluster running Server 2019 DC on two HP DL380 G10s, with the latest Intel drivers and latest HP drivers. It has 10Gb Intel NICs.

    We noticed the problem first in Veeam (9.5 U4), but I'm almost certain it has another cause.
    Same error as the others: "Creating Checkpoint 9%". VMs are responsive as long as you don't want to move them or shut them down.

    The only fix is to hard-reboot the host.

    Monday, June 3, 2019 8:38 AM
  • I sadly also have these issues at a customer. This is a brand new 2-node S2D cluster running Server 2019 DC on two HP DL380 G10s, with the latest Intel drivers and latest HP drivers. It has 10Gb Intel NICs.

    We noticed the problem first in Veeam (9.5 U4), but I'm almost certain it has another cause.
    Same error as the others: "Creating Checkpoint 9%". VMs are responsive as long as you don't want to move them or shut them down.

    The only fix is to hard-reboot the host.

    Out of curiosity, did your S2D volumes include a nested-resiliency volume, and/or did you add drives to the system after creating any volumes?

    I haven't had the problem in two months now; it was happening on a pretty regular 2-week schedule for me in the past, and the last thing I did was to move all my data from the original CSVs I created to new ones that were not nested resiliency.

    Oh yeah... I also had dedup turned on for some of those volumes, and that is now off.

    Tuesday, June 4, 2019 1:22 PM
  • No, it is (or should be) a fairly simple setup in that regard. Just a couple of regular CSVs, no dedup. It's been 13 days since the last issue, but I really don't trust it.

    We are going to add extra 10Gb NICs to the hosts soon to separate data streams (production data vs. migration and S2D sync data). This is something an MS tech told us might help.

    Tuesday, June 11, 2019 8:50 AM
  • I have the same problem.

    Here is the environment:

    2 x Windows Server 2019 Datacenter (1809) with the Hyper-V role

    Veeam Backup & Replication 9.5.4.2753

    Veeam replica job HyperV01 to HyperV02

    The servers are directly connected with Broadcom NetXtreme E-Series Advanced dual-port 10GBASE-T for the replication.

    I first met the problem which led me to this TechNet post:

    "Creating snapshot at 9%"

    Backup or replication job failed

    Cannot launch a new backup job or cancel snapshot creation

    Event ID 19060 Hyper-V VMMS

    I did a hard reboot of the server one night. I also disabled the every-15-minutes replication job and changed production snapshots to standard snapshots.

    I didn't have any problem with backups for one week, until today.

    Tonight I will hard-reboot the server, update Veeam, and update the 10Gb NIC drivers...

    Has anyone found the solution?

    Thanks for the help


    • Edited by techad51 Wednesday, June 12, 2019 7:21 AM
    Wednesday, June 12, 2019 7:19 AM
  • I have the same problem.

    How did you solve the problem?

    Thanks!


    • Edited by sas434343 Tuesday, June 25, 2019 8:04 AM
    Tuesday, June 25, 2019 8:04 AM
  • Hi

    I applied the latest versions of everything: Veeam, WS2019, the 10Gb NIC driver, and the Dell R540s' firmware.

    In Hyper-V Manager, I changed the checkpoint type from production to standard.

    I split Veeam's jobs by OS, activated Hyper-V tools quiescence for the older OSes in the job configuration, and now take one snapshot per VM instead of one volume snapshot.

    I have had no problem for 3 weeks. But I also had 5 weeks without problems between the initial setup and the first snapshot problem, so I have no explanation.

    Alexandre

    Thursday, July 4, 2019 6:11 AM
  • OK, I spoke too soon. The same problem occurred tonight during the backup.

    Alexandre

    Friday, July 5, 2019 6:44 AM
  • We have now encountered the problem on a third system. On the other two systems, however, everything has been fine for several weeks now.

    All the systems involved have one thing in common: they were converted from VMware to Hyper-V.
    What does that look like for you? Can the problem be narrowed down to that?
    Tuesday, July 9, 2019 12:29 PM
  • To anyone having this problem: do you have older OSes? (2000, 2003, 2008, XP)

    Do you have the Backup (VSS) integration service enabled in the Hyper-V VM properties?
    Tuesday, July 9, 2019 12:53 PM
  • We have also had this problem since the start (4 months ago) of our new HP DL380 Gen10 Hyper-V 2019 cluster, with an MSA 2052, 10Gb Intel NICs, and Altaro software. It happens every 5-10 days.

    We have all 2012 R2 VMs, and one 2008 R2. I removed the 2008 R2 from backup; maybe it helps. All other VMs were migrated from an older 2012 R2 Hyper-V host (not converted from VMware).

    Tried the option to set production to standard. No luck: the API Altaro uses forces a production snapshot. The VSS integration is checked (default).

    Wednesday, July 10, 2019 2:37 PM
  • OK, I think Veeam also tries to take a production checkpoint instead of a standard checkpoint, even though I changed it in the Hyper-V settings of the virtual machine.

     Event ID 18016, Hyper-V-VMMS:

    Cannot create production checkpoints for VM01. (Virtual Machine ID: E9E041FE-8C34-494B-83AF-4FE43D58D063)

    And in the Veeam log, I have event ID 150, VEEAM MP:

    VM VM01 task has finished with 'Failed' state.
    Task details: Failed to create VM recovery checkpoint (mode: Hyper-V child partition snapshot) Details: Failed to call wmi method 'CreateSnapshot'. Wmi error: '32775'
    Failed to create VM recovery snapshot, VM ID '260fa868-64f9-418f-a90a-d833bc7ec409'.
    Retrying snapshot creation attempt (Failed to create production checkpoint.)

    I have more logs since I disabled VSS integration yesterday (necessary for production checkpoints), if I'm not mistaken.




    • Edited by techad51, Wednesday, July 10, 2019 3:56 PM
    Wednesday, July 10, 2019 3:50 PM
  • We have the exact same problem. Two Hyper-V 2019 servers, fully patched, replicating to each other without problems. Using the latest Veeam Backup 9.5 on a separate server. We have converted a few VMs (Server 2012 and Server 2016) from VMware to Hyper-V. We have done this with multiple servers: every week we converted a couple, then let it run, checked backups, etc. If all was good for a few days, we continued with the next one, to migrate everything off an old VMware cluster to a new Hyper-V server.

    Now this weekend we converted/moved a 2012 R2 Exchange 2013 server to Hyper-V. Set up Veeam to back up this server as well; all went OK, no problem. One day later, without restarts or anything, the same backup job causes Hyper-V to hang on the snapshot at 9%. Also unable to create snapshots using Hyper-V Manager. We have had this before at a different customer, where we were able to fix it by updating an Intel X710 network card to the latest Intel drivers (not the Microsoft ones). But now I'm not sure, as we have already tried that here and it makes no difference.

    At this moment I don't know what is causing this. The only thing that works is rebooting / forcing a reboot of the Hyper-V host. In one case at a different client this led to massive data corruption, because the Hyper-V Management Service then seems to hang, losing track of current snapshots. We want to be extra careful with this now. Hoping for any extra clues from you guys on what to look for next. I will try to set checkpoints to Standard. I need a daily backup that is OK before being able to troubleshoot a live VM. This causes so much after-hours overtime it's not funny.

    Monday, July 22, 2019 1:52 PM
  • I've started a call with Microsoft regarding this issue.

    At the moment the Microsoft rep has asked that we uninstall Sophos AV from the host machine, which we've done.

    We've got one VM set to Standard checkpoints, one to Production checkpoints.

    Backup software is Altaro.

    Now awaiting the issue reoccurring so we can get Microsoft back on the case. Will update when I know more.

    Monday, July 22, 2019 2:24 PM
  • Hello

    The problem happened again four days ago. I had to hard reboot the Hyper-V host again.

    In your case, is it a fresh install? When did the problem appear? Did you upgrade Veeam to the latest version, or install the latest version directly?

    I opened a case with Microsoft. I'm waiting for their response.

    Tuesday, July 23, 2019 9:05 AM
  • Can anyone confirm this only happens with 10Gb NICs?
    • Edited by JeffT79, Tuesday, July 23, 2019 4:08 PM
    Tuesday, July 23, 2019 4:04 PM
  • For us, for now, upgrading the drivers (from Microsoft to Broadcom) and installing the latest Broadcom firmware did the trick on Hyper-V Server 2019.

    We have a combination of Broadcom 1Gb and 10Gb NICs in our Dell PowerEdge R740xd server.

    Tuesday, July 23, 2019 6:14 PM
  • Our problem seems to have been fixed by one of two things: fully patching the Hyper-V host, and also making sure that the VM itself is updated. Other than that, we have updated our Intel X710 10Gb fiber cards to the latest Intel driver instead of the Microsoft one.

    This has solved our backups hanging at 9%, but now we have new problems that might be unrelated to this topic. When we connect virtual machines to the Hyper-V virtual switch on the 10Gb network card, the host disconnects from the network every 30 seconds with a 10400 event in the logs. When we switch the VMs to a Hyper-V virtual switch on the regular RJ45 port, everything is OK, and network connectivity on the 10Gb card itself is otherwise stable. Really weird; one problem after the other.
    Wednesday, July 24, 2019 7:56 AM
  • Same problem with a Windows Server 2019 S2D cluster on HP DL380 Gen10 servers with 10Gb Mellanox 640 network adapters. The driver was Microsoft's; updated to the newest. One week OK, now the same problems again. Backup software is Acronis.
    Thursday, July 25, 2019 2:49 PM
  • Count me in on this... brand new Server 2019 Core Hyper-V cluster (Microsoft OS up to date), HPE 460 G10 blades using HPE QLogic NX2 10Gb drivers, and the latest rev of Veeam (9.5 U4b). It happens sporadically, about once every week to week and a half, on any one of my 6 nodes. Just got off the phone with HPE support and confirmed all hardware, firmware, and drivers were up to date and came back green on a health check. I was going to open a ticket with Veeam until I ran across this thread. I might go ahead and do it anyway just to make sure, but I am not sure what else to do, to be honest.

    Update... After a little more research, I ran across this forum post about Windows Defender. It looks like this fixed it for some users. I have it implemented now and will see what happens.

     https://social.technet.microsoft.com/Forums/en-US/dc125221-824e-46ad-955e-8cdaaa66dec7/hyperv-live-mitration-fail-when-hyperv-replica-is-enabled-in-virtual-machines?forum=winserverhyperv

    • Edited by Tankster, Thursday, July 25, 2019 11:44 PM - New info found
    Thursday, July 25, 2019 5:31 PM
  • I THINK WE FOUND THE SMOKING GUN!


    Could it all be related to 10Gb NICs + teaming, caused by the Virtual Machine Queue (VMQ)??? Our Intel 10Gb 562SFP+ NICs are in a team, and it seems that you have to configure each of the NICs in the team so they do not overlap on the same CPU cores.

    Enter the following command to check your VMQ settings ("FIBER*" = NIC name):

    Get-NetAdapterVmq | Sort Name | ? Name -Like "FIBER*" | FT -A

    The outcome: BaseVmqProcessor was 0 on both. So they overlapped!

    With the following commands we configured RSS and VMQ for each adapter. The settings depend on the number of CPUs/cores in your server; we have 2 x 8, no Hyper-Threading.

    Set-NetAdapterRss "Fiber01" -BaseProcessorNumber 0 -MaxProcessors 4

    Set-NetAdapterRss "Fiber02" -BaseProcessorNumber 4 -MaxProcessors 4
    Set-NetAdapterVmq "Fiber01" -BaseProcessorNumber 8 -MaxProcessors 4
    Set-NetAdapterVmq "Fiber02" -BaseProcessorNumber 12 -MaxProcessors 4

    Charbel Nemnom has a great article with more info about VMQ (search for "Charbel Nemnom vmq-rss").

    After this change we rebooted all VMs, and backup has been running smoothly for a couple of days now.
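    If you want to sanity-check your own team for the same overlap, here is a rough sketch. Hedged: the property names are the ones Get-NetAdapterVmq exposes, and MaxProcessorNumber can be blank when driver defaults are in play, so verify on your own build before relying on it.

    ```powershell
    # List VMQ-enabled adapters sorted by their base processor, and warn
    # whenever one adapter's range starts before the previous one ends.
    $adapters = @(Get-NetAdapterVmq | Where-Object Enabled | Sort-Object BaseProcessorNumber)
    for ($i = 1; $i -lt $adapters.Count; $i++) {
        $prev = $adapters[$i - 1]
        $curr = $adapters[$i]
        if ($curr.BaseProcessorNumber -le $prev.MaxProcessorNumber) {
            Write-Warning "VMQ processor ranges overlap: $($prev.Name) and $($curr.Name)"
        }
    }
    ```

    On a correctly split team (like the Fiber01/Fiber02 assignment above) this should print nothing.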




    • Edited by JeffT79, Saturday, July 27, 2019 9:39 PM
    Thursday, July 25, 2019 9:59 PM
  • UNFORTUNATELY, THIS EVENING IT HAPPENED AGAIN....
    Saturday, July 27, 2019 9:40 PM
  • We have converted a few VMs (Server 2012 and Server 2016) from VMware to Hyper-V.

    Which software did you use to migrate the VMs?
    It looks like only systems that we migrated from VMware to Hyper-V (with the 5nine converter) are affected.

    Or can someone disprove that? Does anyone have the problem with freshly installed Windows servers?

    Monday, July 29, 2019 8:46 AM
  • Opened a Microsoft support call. It is a known issue, but no solution or hotfix is available at this time. Microsoft told us to make sure no Microsoft drivers are used for the network cards, and to switch to standard checkpoints until a hotfix is available. I will test this over the next days.
    Tuesday, July 30, 2019 11:52 AM
  • Do you mind sharing your support case ID?  I'd like to open a "me too" case so I can get on the notification list for the fix.  And since we're in the process of commissioning the new infrastructure that exhibits this problem, I might be in a better position in terms of being able to get outages scheduled for installation of any hotfix and/or to down-n-up my Hyper-V hosts as sometimes seems to be necessary to resolve the situation after a VM gets into this 'stuck' state. 
    Tuesday, July 30, 2019 11:53 PM
  • I have the same problem. Brand new Dell blade servers with W2019 and all the latest drivers. I've also tried the MS basic drivers. We're using NetVault backup. I have no migrated VMs, just clean, fully patched 2019 virtuals. I created a PS script that takes checkpoints every 10 minutes, and after a while some VMs were stuck at 9%. So my conclusion is that something is fundamentally wrong with Hyper-V 2019. Using production or standard checkpoints, the results were the same.

    I also managed to get Hyper-V stuck when trying (unsuccessfully) to storage-migrate VMs from another host. The VMs that were running on the destination (W2019) ended up in the same state as when stuck at the 9% checkpoint: you can't perform any operation on them, restarting the Hyper-V services doesn't help, and only rebooting the host helps.


    • Edited by Chrlie, Wednesday, July 31, 2019 10:35 AM
    Wednesday, July 31, 2019 10:33 AM
  • Update from my case with Microsoft:

    Microsoft has asked for the removal of Altaro Backup and Sophos Anti-Virus.

    They've then asked for a clean boot of all VMs on the host, and of the host itself, which I've done.

    I've now got a script creating checkpoints of VMs and removing them; waiting for it to happen again while the host and VMs are in clean boot.

    Wednesday, July 31, 2019 12:30 PM
  • Hello

    Microsoft asked me to:

    - Update the firmware and drivers of the network cards

    - Update Windows Server

    I already did this last month. The problem happened again 10 days ago.

    Friday, August 2, 2019 8:26 AM
  • Update from my case with Microsoft;

    Microsoft have asked for the removal of Altaro backup and Sophos Anti-Virus.

    We have the same problem with Veeam Backup and Kaspersky Antivirus, so there is no correlation there either.

    Does Microsoft know about this thread?
    Monday, August 5, 2019 10:19 AM
  • Afraid so; I started the case with a link to this thread to begin with.

    I followed what they were after, which was the following:

    1. Windows Update the hypervisor.
    2. Remove Sophos from the hypervisors & VMs.
    3. Clean boot all VMs and the hypervisor.

    After following this, I managed to get the error to reoccur by simply checkpointing the VMs every 10 minutes, then merging the checkpoints. Got Microsoft back on the phone, and he stated the following:

    The checkpoint is stuck at 9% because the internal VSS of the guest VM is stuck in a "Timed Out" state. Please can you run Windows Update on the VMs along with the hypervisor.

    I know that I'm going to Windows Update the VMs and the issue is going to reoccur; I'll jump through their hoops a final time to prove once more that this is a problem with Windows Server 2019.

    FYI: this is a completely standalone host. We have 4 HP ProLiant DL380 Gen10 servers, no clustering, no 10Gb/s NICs, running the VMs on local, full-SSD storage.

    I have the following script, which runs every 10 minutes to cause the issue, even with the hypervisor in a clean-boot state.

    # Checkpoint every VM that has no existing checkpoint, then remove all checkpoints.
    $Vms = Get-VM
    foreach ($Vm in $Vms) {
        $Snaps = @(Get-VMSnapshot -VM $Vm)
        if ($Snaps.Count -eq 0) {
            Checkpoint-VM -VM $Vm
        }
    }
    foreach ($Vm in $Vms) {
        Remove-VMSnapshot -VM $Vm
    }
    
    # Log each completed run.
    $Today = Get-Date
    Add-Content C:\X\Snapshot-All-Vms.txt "Completed: $Today"
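    Separately, to see the guest-side VSS state Microsoft referred to, something like this can be run inside an affected VM. A sketch only: vssadmin's text output varies slightly by guest OS, so the parsing here is approximate.

    ```powershell
    # Inside the guest VM: list VSS writers and flag any that are not Stable.
    $raw = vssadmin list writers | Out-String
    $blocks = $raw -split 'Writer name:' | Select-Object -Skip 1
    foreach ($b in $blocks) {
        $name  = ($b -split "'")[1]   # writer name sits between single quotes
        $state = [regex]::Match($b, 'State:\s*\[\d+\]\s*(\S.*)').Groups[1].Value.Trim()
        if ($state -and $state -ne 'Stable') {
            Write-Warning "$name is '$state'"
        }
    }
    ```

    Writers reported as "Timed Out" or "Failed" here would line up with the 9% hang described above.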

    Monday, August 5, 2019 10:33 AM
  • @_Dickens - Just to chime in here that we are seeing exactly the same issue on one of our W2019 servers. Some observations below:

    We manage one, maybe two good backups after the host is rebooted (normally forcibly, as the VMMS service hangs on reboot) before we hit the 9% checkpoint issue again.

    HP DL380 Gen10. 10Gb NIC (but it does the same when using the 1Gb NIC), Sophos in the VM (not the host), Veeam B&R 9.5 Update 4.

    Hyper-V replica running to an identical host.

    Several of the VSS writers in the VM go "Timed Out" after the failed Checkpoint. 

    Both VM and host are up to date via Windows Update. Latest HP SSP in the host.

    Even though the checkpoint says 9%, there's only ever a 4 MB AVHDX created, which seemingly has no relationship with the parent VHDX: it's never merged and can be safely deleted after shutting the VM down.

    The VMMS service crashes when creating the Checkpoint.

    When shutting down the VMs from within the guest, the VM always sticks at "Shutting Down" in Hyper-V Manager even though it has fully shut down. I invariably end up trying to stop the VMMS and Data Sharing services on the host in a desperate attempt to allow a clean(ish) shutdown of the host.

    I've performed CHKDSK /f on VM and host.

    Moved the Checkpoint location to another volume.

    Specified Standard Checkpoints if Production Checkpoints fail.

    Disabled RSS and offloading on all NICs.

    All in all, this has been a nightmare. I initially suspected Veeam was the issue, but having read several accounts from users with Altaro, SolarWinds and various other backup vendors, I've come around to the notion that there's something inherently wrong with the W2019 hypervisor.

    Tonight I will change the backup to run within the VM rather than at the container level, as I can't afford to spend any more time on this issue.

    Hopefully, MS are monitoring this thread and putting some energy into recreating the issue and eventually offering a resolution.




    • Edited by MDS_UK, Monday, August 5, 2019 2:04 PM
    Monday, August 5, 2019 12:29 PM
  • We had the problem with three customers in the meantime:

    Customer 1:
    Windows Server 2019
    HyperV Standalone
    Veeam 9.5.4.2753
    Kaspersky Security 10.1.1 for Windows Servers
    Proliant DL380 Gen10
    - HPE Ethernet 1Gb 4-port 331i (Driver: Hewlett-Packard 214.0.0.0)
    - HPE Ethernet 10Gb 2-port 530T (Driver: Cavium 7.13.150.0)


    Customer 2:
    Windows Server 2019
    HyperV Failover Cluster with NetApp E-Series
    Veeam 9.5.4.2753
    No virus protection
    ProLiant DL360 Gen10
    - HPE Ethernet 1Gb 4-port 331i (Driver: Hewlett-Packard 214.0.0.0)
    - HPE Ethernet 10Gb 2-port 530T (Driver: Cavium 7.13.145.0)


    Customer 3:
    Windows Server 2016
    HyperV Failover Cluster with NetApp E-Series
    Veeam 9.5.4.2753
    No virus protection
    ProLiant DL360 Gen9
    - HPE Ethernet 1Gb 4-port 331i (Driver: Hewlett-Packard 214.0.0.0)
    - HPE Ethernet 10Gb 2-port 561T (Driver: Intel 4.1.76.0)


    All hosts and VMs have the latest Windows updates.

    • Edited by Dennis_K5121, Tuesday, August 6, 2019 6:45 AM
    Monday, August 5, 2019 1:34 PM
  • Hello, I had this problem on an in-place-upgraded Windows Server 2012 R2 Hyper-V host, which is now Windows Server 2019. After updating the LAN and disk controller drivers, the problem disappeared.
    Monday, August 19, 2019 10:02 AM
  • Worked for me in Acronis. I was getting the same issue on Server 2019 with 40Gb NICs; disabling VMQ both on the server and the VMs made the problem go away.

    Not sure how much of a hit this causes to network throughput. I tried testing, but it's difficult to get an accurate assessment.

    Sunday, August 25, 2019 10:19 PM
  • After adjusting the VMQ settings we had only one failure. Our 2019 cluster has been running problem-free for the past 4 weeks now (knock on wood).
    Tuesday, August 27, 2019 7:52 AM
  • I had just been dealing with it for the past couple of weeks, but I too disabled VMQ on the servers and VMs just yesterday and was finally able to at least get a good backup of everything again. Hoping this "resolves" the issue until Microsoft can address whatever this ultimately turns out to be.
    • Edited by Tankster, Thursday, August 29, 2019 12:07 PM
    Thursday, August 29, 2019 12:07 PM
  • So we have had a case open with Microsoft for 3 months now. We have 3 clusters, with 2 now having the issue; initially it was only 1. The 2nd one started having the issue about 2-3 weeks ago. The first 2 clusters didn't have the issue; these were configured back in March and April with Server 2019. The third cluster, which has had the issue since the beginning, was installed in May-June with Server 2019. I have a feeling one of the newer updates is causing the issue; the 1st cluster, which doesn't have the problem, has not been patched since.

    To this day nothing has been resolved and they have no idea what it might be. Now they are closing the case on us because the issue moved from one host in our cluster to another host, and our scope was the first Hyper-V host having the issue. Unbelievable. The issue is still there, just happening on another host in the cluster.

    The clusters experiencing the issues have the latest-generation Dell servers in them (PE 640s), while the one not having the issue only has older-generation PE 520, PE 630, etc.

    The way we notice the issue is that we have a PRTG sensor checking our hosts for responsiveness. At some random point in the day or night, PRTG will report that the sensor is not responding to general Hyper-V host checks (WMI). After this, no checkpoints, backups, migrations, or setting changes can happen, because everything is stuck. Can't restart the VMMS service or kill it.
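    For what it's worth, a probe similar to that PRTG check can be approximated with a timed WMI query against the Hyper-V namespace (a sketch; the 15-second timeout is an arbitrary choice):

    ```powershell
    # A hung VMMS usually makes this query hang too, so a timeout is a
    # reasonable proxy for the stuck state described in this thread.
    $job = Start-Job {
        Get-CimInstance -Namespace root\virtualization\v2 -ClassName Msvm_ComputerSystem |
            Select-Object -ExpandProperty ElementName
    }
    if (-not (Wait-Job $job -Timeout 15)) {
        Write-Warning "Hyper-V WMI did not answer within 15 seconds - host may be stuck"
        Stop-Job $job
    }
    Remove-Job $job -Force
    ```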

    Here is what we have tested with no solution yet:

    • Remove all 3rd party applications - BitDefender (AV), Backup Software (Backup Exec 20.4), SupportAssist, WinDirStat, etc. - Didn't fix it.

    • Make sure all VMSwitches and Network adapters were identical in the whole cluster, with identical driver versions (Tried Intel, and Microsoft drivers on all hosts) - Didn't fix it.

    • Check each worker process for the VM - When a VM got stuck during a checkpoint or migration. - Didn't fix it.

      • get-vm | ft name, vmid

        • compare vmid to vmworkerprocess.exe seen in details -> Task Manager

        • kill process

        • Hyper-V showed VM running as Running-Critical

        • Restart VMMS service (didn't work)

        • net stop vmms (didn't work)

        • Restart Server -> VMs went unmonitored

        • After restart everything works fine as expected

    • Evict the server experiencing issues from the cluster -> This just causes the issue to move to another host; the issue is still there. - Didn't fix it.

      • Created two VMs (one from template, one new) on the evicted host -> No issues here; they never get stuck, but other hosts still experience the issue.

    • Install latest drivers, updates, BIOS, firmware for all hardware in all the hosts of the cluster. - didn't fix it.

    • We migrated our hosts to a new Datacenter, running up to date switches (old Datacenter - HP Switches, new Datacenter - Dell Switches), and the issue still continues.

    • New Cat6 wiring was put in place for all the hosts - Issue still continues.

    • Disable "Allow management operating system to share this network adapter" on all VMSwitches - issue still continues

    • Disable VMQ and IPSec offloading on all Hyper-V VMs and adapters - issue still continues

    • We're currently patched all the way to the August 2019 patches - issue still continues.

    We asked Microsoft to assign us a higher-tier technician to do a deep dive into kernel dumps and process dumps, but they would not do it until we had exhausted all the basic troubleshooting steps. Now they are not willing to work further because the issue moved from one host to another after we moved from one datacenter to another. So it seems that how the cluster comes up, and which host owns the disks and network, may determine which host has the issue.

    Also, our validation testing passes for all the hosts, aside from minor warnings due to CPU differences.

    Any ideas would be appreciated.

    Wednesday, September 4, 2019 6:26 PM
  • petyurkutyur11

    I had this on a newly built Windows 2019 Dell R430 with Intel 10G X520 adapters. It was causing an issue with Hyper-V replication and production checkpoints. I had two previous R430s in the cluster, and I noticed that the two existing servers were running the Microsoft NIC driver while the new one was running the Intel NIC driver. So I removed the Hyper-V virtual NIC, broke the team, and reverted the new server's NIC driver to the MS version 3.12.11.1 driver for the X520. I re-established the team, recreated the Hyper-V virtual switch, and the problem has not reoccurred since.

    Friday, September 6, 2019 9:18 AM
  • Queeg505, thanks for the response. I wish it were that simple for us. I did the same a couple of months ago, to see if going from the newly installed Intel drivers to the old Microsoft drivers would help. It did not really help for us.

    For the different NIC types that we have, these are the current driver versions we are running.

    Intel Gigabit 4P i350-t Adapter - 12.15.22.6 (Previously 12.15.184.0 Intel Driver)

    Intel Ethernet 10G 2P x540-t Adapter - 3.12.11.1 (Previously 4.1.4.0 Intel Driver)

    Broadcom NetXtreme Gigabit Ethernet - 214.0.0.0 (Previously 17.2.1.0)

    Intel Ethernet Converged Network Adapter X710-t - 1.8.103.2

    I have verified that all servers are running identical driver versions across these adapters.

    Friday, September 6, 2019 4:12 PM
  • Just to add our experience until Microsoft actually acknowledges this problem and looks at a fix! We have two Server 2019 Hyper-V clusters.

    Cluster 1 – Brand new 6-node S2D cluster (all flash).
    Intel 10Gb NICs, 100Gb Mellanox NICs for storage; S2D storage and ReFS volumes.
    Veeam backups failed nearly every day with checkpoints hanging at 9%. The VMMS service would not stop, even with Process Explorer; only a hard reset would fix the node. We updated the Intel network drivers and disabled Receive Segment Coalescing (the driver introduced a new problem with RSC enabled), changed checkpoints to standard, and disabled the Backup (VSS) integration service on each VM. All nodes were rebooted, and backups have now succeeded for two weeks. However, we are now too worried to live migrate or take checkpoints in case anything breaks again!

    Cluster 2 – Old 6-node 2012 R2 cluster with FC SAN storage (which had run for 5 years prior to the 2019 upgrade), reinstalled as Server 2019 (fresh install).
    Emulex 10Gb NICs, NTFS CSV volumes on a Hitachi SAN.

    Veeam is replicating some VMs to this cluster. It currently hangs at the 9% checkpoint nearly every day, requiring a hard reboot of the node. We have tried drivers and disabling RSC with no change, and have just disabled RSS and VMQ and are awaiting results.

    I have noticed that when the VMMS service is locked up, you cannot use PowerShell to make any changes to the network adapters (it hangs). Device Manager also hangs when making any changes to the network adapters.

    Based on the above, it leads me to believe the problem is related to the network adapters, as they become unmanageable at the same time. I just wish we knew the cause, or that MS would take some interest in fixing it!




    • Edited by CEvans2008, Monday, September 9, 2019 10:39 PM
    Monday, September 9, 2019 10:23 PM
  • Same here.

    A single Hyper-V server with Windows Server 2019 Standard, with only two VMs running on it. Backup with Arcserve UDP, without any antivirus or any other software. Intel 10Gb X722 NICs with Microsoft drivers 1.8.103.2.

    After a few backups, the system got both VMs stuck at 9%... the VMs do not respond, I'm unable to cleanly reboot the host server, and the only solution is a hard reset.

    I will update the drivers and post the result here.

    Tuesday, September 10, 2019 5:51 PM
  • I am glad to report that since I disabled VMQ on my adapters and VMs two weeks ago, I have not had a single issue. I empathize with those who have tried this and are still having problems. I wish Microsoft would hurry up and get a fix for this soon.
    Wednesday, September 11, 2019 7:17 PM
  • We had this exact same issue on non-clustered Dell servers with Intel X520 10G NICs, but we also had success with the following:

    - Updated the NIC drivers, using Intel's latest X520 drivers (4.1.143.0)

    - Set Jumbo Packet enabled on the NIC at 9014 bytes (this was set previously, when we had the Microsoft NIC drivers, so it was not a new change)

    - Disabled VMQ on all VMs that had it enabled

    After that, our weekly VM export backups have worked for 2 weeks without issue.
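    For anyone wanting to script the VMQ part across a whole host, a sketch along the lines of what several posters here did (the cmdlets are standard Hyper-V/NetAdapter ones, but scope them to the adapters you actually mean):

    ```powershell
    # Turn VMQ off for every VM's virtual NICs (VmqWeight 0 = disabled)...
    Get-VM | Get-VMNetworkAdapter | Set-VMNetworkAdapter -VmqWeight 0

    # ...and on the physical adapters as well.
    Disable-NetAdapterVmq -Name *
    ```

    Per several reports in this thread, the VMs (and possibly the host) need a restart afterwards for the change to fully take effect.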

    Monday, September 16, 2019 4:43 PM
  • Hi.

    I just want to share my experience: updating the Intel NIC driver manually did the trick.

    Although the newest driver that existed for this NIC was only 3 months newer, and I really doubted it would work, it did, for now (really crossing my fingers that the problem went away).

    SuperMicro motherboard, NIC information:

    NIC driver model: Intel 82579LM

    BEFORE THE UPDATE
    Driver date: 05.04.2016
    Driver Version: 12.15.22.6

    AFTER
    Driver date: 25.07.2016
    Driver Version: 12.15.31.4

    with best regards

    B



    Tuesday, September 17, 2019 4:20 AM
  • Found a solution in another thread. The issue is related to VMQ, but in order for the changes to work, you most likely have to disable it on all the VMs (in the VM's advanced network settings) across your cluster, restart the VMs, and also restart the hosts. This is probably why the fix didn't work the first time we disabled VMQ; after the host froze and we restarted it, it didn't happen again.

    Another solution someone else posted was the following:

    From Microsoft Support we received a powershell command for hyper-v 2019 and the issue is gone ;)

    Set-VMNetworkAdapter -ManagementOS -VrssQueueSchedulingMode StaticVrss

    It is a bug from Windows Server 2019 and Hyper-V
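    If you try this, you can read the setting back to confirm it took (a sketch; it assumes the VrssQueueSchedulingMode property is exposed on the adapter objects, as it is on 2019):

    ```powershell
    # Apply the mode Microsoft suggested, then read it back to confirm.
    Set-VMNetworkAdapter -ManagementOS -VrssQueueSchedulingMode StaticVrss
    Get-VMNetworkAdapter -ManagementOS |
        Select-Object Name, VrssEnabled, VrssQueueSchedulingMode
    ```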
    Wednesday, September 18, 2019 5:12 PM
  • Thanks B. I did the same, and it worked for about 10 days, so definitely keep a close eye on it. The temporary fix for me was to change to standard checkpoints, but obviously that's not a good long-term plan, considering the need for app-aware features.
    Friday, September 20, 2019 3:59 PM
  • Got a fix from Microsoft. At this time the Server 2019 host is in test mode because of an unsigned file from Microsoft, but since applying it the errors have been gone. So I think Microsoft will include the fix in a future update.
    Sunday, September 22, 2019 4:58 PM
  • Hello.

    Disabling VMQ seems to have worked for me, for now.

    Please can you elaborate on this unsigned file? Is it a driver which you've been asked to install that is somehow unsigned? Could you share your Microsoft case reference number, so I can give it to my Microsoft support advisor?

    Thanks.

    Oliver.

    Tuesday, September 24, 2019 3:54 PM

  • Thank you very much. On our Windows Server 2019 hosts we have set the parameter and are waiting to see whether the problem is solved.

    Is there a similar command for Windows Server 2016? The "VrssQueueSchedulingMode" parameter is not available on Windows Server 2016.
    Thursday, September 26, 2019 12:36 PM
  • Hello,

    the following solution has worked on a Lenovo SR650 host with Windows Server 2019 (only the Hyper-V role) and a 4 x 10Gb Intel X722 LOM:

    We changed the driver of the Intel X722 4 x 10Gb LOM card from the Microsoft drivers to the newest Intel drivers (1.10.130.0, 9.5.2019). After this change the network performance of the VMs was absolutely BAD! We then switched the driver setting "Recv. Segment Coalescing" (IPv4 and IPv6) from active to inactive. Network speed was then normal on the host (X722 LOM NIC 10Gb port 1) and the VMs (X722 LOM NIC 10Gb port 2).
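    The same RSC change can also be made from PowerShell instead of the driver property pages (a sketch; the adapter name is an example, use Get-NetAdapter to find yours):

    ```powershell
    # Turn off Receive Segment Coalescing for IPv4 and IPv6 on one adapter, then verify.
    Disable-NetAdapterRsc -Name "X722 Port 2" -IPv4 -IPv6
    Get-NetAdapterRsc -Name "X722 Port 2" | Format-Table Name, IPv4Enabled, IPv6Enabled
    ```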

    After two weeks of testing, creating more than 50 checkpoints each day (using the Microsoft script), we now assume that the problem has been solved.


    One more interesting note: the checkpoint stuck at 9% was just a symptom of the problem at hand. Even without checkpoint creation, after a few days there could be a problem where the host no longer had control over the VMs. Neither the Hyper-V MMC nor Hyper-V PowerShell could be used to apply commands such as restart, save, etc. to the VMs. The VMs continued to run, and Hyper-V replication also worked. However, a VM could not restart if the restart was initiated from within the VM; the VM simply did not start anymore, and in Hyper-V Manager the status remained at "Shutting Down". In this case the host always had to be switched off using the power switch!

    Hopefully this information will help other users solve such a "miracle" problem.

    Thank you to everyone in this thread who helped us solve this problem!

    St. Reppien

    Monday, October 14, 2019 8:54 AM