S2D IO TIMEOUT when rebooting node

    Question

  • I am building a 6-node cluster: 12× 6 TB drives, 2× 4 TB Intel P4600 PCIe NVMe drives, Xeon Platinum 8168 / 768 GB RAM, LSI 9008 HBA.

    The cluster passes all tests, the switches are properly configured, and the cluster performs well, exceeding 1.1 million IOPS with VMFleet. However, at the current patch level as of today (April 18, 2018), I am experiencing the following scenario:

    When no storage job is running and all vdisks are listed as healthy, I pause a node and drain it, and all is well until the server is actually rebooted or taken offline. At that point a repair job is initiated and IO suffers badly; it can even stop altogether, causing vdisks to go into a paused state due to IO timeout (listed as the reason in cluster events). Exacerbating the issue, when the paused node reboots and rejoins, it causes the repair job to suspend, stop, then restart (it seems - tracking this is hard, as all storage commands become unresponsive while the node is joining). At this point IO is guaranteed to stop on all vdisks at some point, for long enough to cause problems, including VM reboots.

    The cluster was initially formed using VMM 2016. I have tried manually creating the vdisks, using single resiliency (3-way mirror) and multi-tier resiliency, with the same effect. This behavior was not observed when I did my POC testing last year. It is frankly a deal breaker and unusable: if I cannot reboot a single node without stopping my entire workload, I cannot deploy. I'm hoping someone has some info.

    I'm going to reinstall with Server 2016 RTM media, keep it unpatched, and see if the problem remains. However, it would be desirable to at least start the cluster fully patched. Any help appreciated. Thanks


    • Edited by James Canon Wednesday, April 18, 2018 08:00
    Wednesday, April 18, 2018 07:52

Answers

All replies

  • OK, I cleaned the servers, reinstalled Server 2016 version 10.0.14393, and the cluster is handling pauses as expected. I am guessing that KB4038782 is the culprit, as it changed logic related to suspend/resume and no longer puts disks in maintenance mode when suspending a cluster node. I will patch up to August 2017 and see if the cluster behaves as expected. Then, until I can get something from Microsoft on this, I'm not likely to patch beyond that for a while.

    If anyone knows anything, I'm happy to hear it!

    Thanks

    Thursday, April 19, 2018 01:27
  • Hi,

    Sorry for the delayed response.

    This is a quick note to let you know that I am currently performing research on this issue and will get back to you as soon as possible. I appreciate your patience.

    If you have any updates during this process, please feel free to let me know.

    Best Regards,

    Candy


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com   

    Tuesday, May 1, 2018 15:57
  • Hey James, I am experiencing the exact same issue. I have a fully functioning 4-node S2D HCI cluster that I have used for a while now. Fully in production, rock solid. Never saw any IO pausing during node updates and moving workloads around. I am building another 4-node HCI cluster for another data center with the same configuration, but since I am deploying it new I grabbed the latest OS build, KB4103720, and updated the servers. I have been pulling my hair out since last Thursday with this. I have been combing over my configuration and comparing it to what I have in production now. The servers are now on KB4284880 (14393.2312) and the IO still drops.

    What I am seeing sounds exactly the same as yours. Running a disk IO test on the nodes and pausing a node, all is well. Reboot a node, and within a few seconds of initiating the reboot the IO on the other nodes comes to a complete stop. Sometimes it stops long enough for the test software to throw an IO error; other times it stops for 15-20 seconds and then goes back to full speed. It will keep churning along at full speed until that node finishes rebooting and starts to rejoin the cluster, at which point the IO completely stops on the nodes again, but for less time, maybe 10 seconds, and then it's full speed ahead. Then the VD repair job fires off as expected.

    I am semi-glad to read that it's not some hardware thing. I am going to reimage my servers with an earlier build of SRV16 and see if I can get on down the road with this thing. Thanks for the post.


    -Jason

    Friday, June 15, 2018 19:58
  • James, I can verify that after wiping the operating system off the cluster nodes and reimaging them with Server 2016 Datacenter [March 29, 2018, KB4096309 (OS Build 14393.2156)], I am no longer experiencing the issues. So some update between 2156 and 2312 breaks S2D resiliency. This is much further along than the August 2017 patch.

    -Jason

    Friday, June 22, 2018 19:59
  • What if you put all the disks in maintenance mode prior to rebooting the node? Then take the disks out of maintenance mode after the node reboots?

    PowerShell code to do this can be found here: http://kreelbits.blogspot.com/2018/07/the-proper-way-to-take-storage-spaces.html
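
    The gist of that post is roughly the following (untested here, so treat it as a sketch; <NodeName> is the node you are about to reboot):

    # Drain the node first, then put its disks into storage maintenance mode
    Suspend-ClusterNode -Name "<NodeName>" -Drain
    Get-StorageFaultDomain -Type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Enable-StorageMaintenanceMode

    # ...reboot the node and wait for it to come back...

    # Then reverse the process once it has rejoined
    Get-StorageFaultDomain -Type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Disable-StorageMaintenanceMode
    Resume-ClusterNode -Name "<NodeName>"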

    Tuesday, July 10, 2018 20:42
  • Following with great interest.

    We are experiencing the same issues on a 9-node HCI cluster where all node-draining procedures are followed (one node at a time), and as soon as the machine reboots, catastrophe strikes. The last time we saw the symptoms described here, we lost 7 of 9 CSVs and 420 production VDIs fell over.

    If anyone has a definite "this KB breaks S2D resilience during a node reboot" cause then I'd love to hear it.

    Thanks a lot, great thread.
    P

    Update - 

    Following on from a conversation with a vendor, we've been urged to install the latest KB/CU for Server 2016.

    https://support.microsoft.com/en-gb/help/4345418

    Word is there are unlisted fixes in this CU that potentially alleviate the I/O timeout issues being felt by people running HCI/S2D platforms.

    I've applied this update to all 9 nodes in our cluster and will perform some further testing, but so far it looks fairly positive as no issues have been felt whilst downing any of the nodes after patching.

    The patching process was not so smooth; the cluster went into a very odd, inaccessible yet "running" state during patching, but I digress.


    • Edited by Paul May Monday, July 23, 2018 10:20
    Friday, July 20, 2018 10:26
  • Can confirm the latest CU does not fix the issue.

    Just drained a node of all roles, paused it, then rebooted.
    Instantly: STATUS_IO_TIMEOUT (c00000b5).

    Four cluster nodes were evicted from the cluster and all storage went into regeneration.


    • Edited by Paul May Wednesday, July 25, 2018 16:38
    Monday, July 23, 2018 09:35
  • I apologize for the lack of direct version numbers.

    I can say that applying up to the last CU in April 2018 DOES still cause this problem. I reinstalled completely with 2016 RTM and deployed VMFleet, then used it to stress my cluster while I paused and drained a node. I was able to go through all six nodes on the last April CU with VMFleet running and saw IO pauses of only two seconds or less on the fleet during the transition. That said, I deployed. Today I experienced a failure during a Veeam backup of one mirror-accelerated parity disk, which I traced back to an issue with node 4. After mucking about a bit, I paused and drained the node and decided to reboot it.

    Instantly all 5 vdisks went into a degraded state, a repair job started, and IO was severely limited. To top it all off, the Veeam VSS issue that initially caused the problem triggered the cluster to rapidly move one of my VM roles to other nodes, which corrupted its OS VHDX. I had to restore it from backup as it was beyond repair. This has cost me a TON of time. I'm calling in on Monday to open a ticket, or whatever Microsoft's process is for SA customers, and they will either produce a good answer or I will have to call this technology not ready for prime time and flee back to VMware.

    I must add, I now know what the folks at the Harland and Wolff shipyards must have felt like when they finished building the ship, just for it to sink before it even got where it was going. Thank God my hardware is also compliant for vSAN...

    Anyone got anything? Cause I'm tapped out.

    --Frustrated 

    Saturday, July 28, 2018 04:49
  • If I put the disks into maintenance mode it removes the CSV and all the VMs go offline anyway. I wouldn't be able to tell the difference.

    -Jason

    Monday, July 30, 2018 21:30
  • So I called Microsoft. Our SA benefits include what they say is a "2-hour" SLA. Fourteen hours later I got a call back, and really got nowhere. So I started thinking about the network stack. Anyone here still watching this thread? I'm wondering if there is some PFC issue with my switch. Does anyone here know what effect a malfunctioning or unconfigured PFC setup would have on S2D on a non-congested network segment (FAR below the link rate of the network card)?
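
    For anyone who wants to compare notes, this is roughly what I'm checking on each node with the in-box DCB cmdlets (just a sketch, nothing exotic):

    # Which priorities have Priority Flow Control enabled (SMB Direct traffic is usually tagged priority 3)
    Get-NetQosFlowControl

    # What the adapters have actually negotiated / are operating with
    Get-NetAdapterQos | Format-List Name, Enabled, OperationalFlowControl, OperationalTrafficClasses

    # Bandwidth reservation per traffic class
    Get-NetQosTrafficClass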

    Monday, July 30, 2018 23:40
  • Could be something with the network. Check out this link: https://social.technet.microsoft.com/Forums/lync/en-US/7050ddc3-fc25-4487-819e-e36f609b9005/s2d-disk-performance-problem-grinding-to-a-halt?forum=winserverClustering Do you have the latest NIC drivers and firmware? There are some threads about certain drivers not working with certain firmware. There are also some issues with Intel NVMe drives...
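
    A quick way to pull the driver versions and confirm SMB is actually using RDMA (firmware itself still needs the vendor tool, e.g. Mellanox's mlxfwmanager) is something like:

    # Driver version and date for each adapter
    Get-NetAdapter | Format-Table Name, InterfaceDescription, DriverVersion, DriverDate, LinkSpeed

    # Is RDMA enabled, and does SMB see the interfaces as RDMA-capable?
    Get-NetAdapterRdma | Format-Table Name, Enabled
    Get-SmbClientNetworkInterface | Format-Table FriendlyName, RdmaCapable, LinkSpeed
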
    Tuesday, July 31, 2018 03:35
  • Hi everyone

    We are engaging directly with Microsoft on this one as well.

    For perspective, I've set up two identical 9-node clusters: Dell hardware, Cisco switches.
    When I say identical, I mean identical down to the firmware of every single device, switch configuration, internal components and so on.

    Cluster A is running KB4088787 from 13th March 2018 and is experiencing no issues.
    I can drain, failover, reboot and bring nodes back up without any major impact to the CSVs.

    Cluster B is running KB4345418 from 16th June 2018. This configuration has the issues listed in this thread, becoming far more prevalent when the cluster is running a simple workload (around 30 W10 VDIs).

    All drivers are up to date, all firmware is up to date, everything we have is in a supportable and validated configuration. Our cluster validation tests on both clusters are absolutely clean, not even a warning.

    I am now working with another tech supplier who is using HP kit with Cisco 9k switches and is seeing exactly the same issue as we are with the Dell. This customer has also broken their clusters down into smaller node counts and eliminated RDMA completely by running S2D over TCP. The problem still persists. So in that case, anything networking-related, specifically around PFC and QoS/CoS, isn't relevant.

    We've gone further and run network traces and analysed switch error counters whilst simulating the error and have seen no issues when the STATUS_IO_TIMEOUT occurs. Our network validation for RDMA/RoCE has been performed by the vendor and another technology supplier - no issues seen at all with out of order packets, CRC errors and so on.

    The evidence at the moment points strongly to an S2D / clustering code update that seems to have occurred in an April CU, so we are going through the painstaking process of having to patch up to future CU's one at a time to recreate the issue and work out which CU may be causing unforeseen problems.

    I'm not sure how that genuinely helps us work toward a fix, and this has been extremely slow going so far.



    • Edited by Paul May Tuesday, July 31, 2018 10:38
    Tuesday, July 31, 2018 10:30
  • Please open a case with Microsoft support; we need to capture some dumps and debug this issue. If you have a case open, please tell your case owner to reach out to me so that I can give them instructions on what data I want collected.

    Thanks!
    Elden

    Tuesday, July 31, 2018 18:58
    Owner
  • Hi Elden, you're speaking to me daily at the moment in the UK :)

    James Canon - a new bit of information and something for you to try.
    Check your SMB signing settings either in local or group policy on your HCI nodes.

    I'm not stating this will work for you, but it may be worth taking a look - we've disabled SMB signing on both the client and server side and the issues have disappeared.

    To be specific in GPO that applies to our HCI cluster nodes:

    Computer Configuration | Policies | Windows Settings | Security Settings | Local Policies | Security Options

    Microsoft Network Server - Digitally sign communications (always) - DISABLED

    Microsoft Network Server - Digitally sign communications (if client agrees) - DISABLED

    Microsoft Network Client - Digitally sign communications (always) - DISABLED

    Microsoft Network Client - Digitally sign communications (if server agrees) - DISABLED
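
    If you'd rather check or flip these from PowerShell than hunt through GPO, the equivalent in-box SMB settings should look something like this (just a sketch - and bear in mind a GPO that still applies will win over local changes):

    # Current effective values - signing is forced when RequireSecuritySignature is True
    Get-SmbServerConfiguration | Select-Object RequireSecuritySignature, EnableSecuritySignature
    Get-SmbClientConfiguration | Select-Object RequireSecuritySignature, EnableSecuritySignature

    # Turn signing off on both sides (mirrors the four policy settings above)
    Set-SmbServerConfiguration -RequireSecuritySignature $false -EnableSecuritySignature $false -Force
    Set-SmbClientConfiguration -RequireSecuritySignature $false -EnableSecuritySignature $false -Force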

    If you do decide to take a look and try this, consider briefly how it may impact connectivity between other things in your infrastructure or DCs before going ahead.

    I'm slowly going through the process of testing these one by one to see if it's a single setting or a combination of the above, but disabling all of them has worked for me as a sledgehammer approach.

    So far the only setting I've tested in isolation is the server (always) setting, set to disabled... and it seems very positive. No more timeouts or CSV replication pauses on drain/reboot; it's currently behaving just fine.

    I'm just about to drop 2 drained nodes at once with the above setting for a little deeper testing.

    More when I have it.

    Edit - dropped 2 drained nodes with the setting above set to DISABLED... fine. No timeouts. My next step is to reverse that back out and re-test.
    • Edited by Paul May Wednesday, August 1, 2018 17:53
    Wednesday, August 1, 2018 17:50
  • Hi Paul

    Are you still experiencing the same issues after you disabled SMB signing on all nodes?

    I am considering doing the same on my cluster and would appreciate feedback on this.

    Thanks

    Martin


    Wednesday, August 8, 2018 07:20
  • Hi Martin

    In short... yes, the problems still occurred, but it turns out turning SMB signing off was an efficiency gain rather than a fix for the issue. (But you should still force it off.)

    It seems that the issue described in this thread is now a somewhat recognised/accepted issue in the Software Bus Layer (SBL), causing timeouts when nodes attempt to write to disks on a machine that has been rebooted.

    From what I understand, the SMB timeout is shorter than the SBL timeout, and basically the SBL is "unaware" that a node is going down for a reboot and tries to write to disks on nodes that are no longer active.

    I believe this will have to be fixed in a patch.

    Until that happens there is something you can do - the disks need to be put into maintenance mode on the node prior to dropping it, and you should then avoid any timeout errors.

    Drain the node as normal and wait for it to enter the paused state.

    PowerShell -

    Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Enable-StorageMaintenanceMode

    Wait for it to complete fully. Check the status of the disks -

    Get-PhysicalDisk | Where-Object {$_.OperationalStatus -ne "OK"}

    Drop the node when that has completed; when it comes back up, allow the S2D storage repair jobs to run as normal.
    When the storage jobs are complete -

    Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Disable-StorageMaintenanceMode

    To bring it out of maintenance mode.

    I've found during testing that this avoided any 5120 timeout errors (in any shape or form), but it does take a fair few minutes to run as it puts the SBL into the correct state to handle a reboot.
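
    To see when the repair jobs have actually finished before taking the node out of maintenance mode, I just keep polling Get-StorageJob, something along these lines:

    # Rough polling loop - exits once no storage job is left running
    while (Get-StorageJob | Where-Object { $_.JobState -ne 'Completed' }) {
        Get-StorageJob | Format-Table Name, JobState, PercentComplete, BytesProcessed, BytesTotal
        Start-Sleep -Seconds 30
    }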


    • Edited by Paul May Sunday, August 12, 2018 20:48
    Sunday, August 12, 2018 20:45
  • Paul - Did MS and Elden advise that you put the disks in maintenance mode before rebooting? Just wondering if it's something they're officially advising or something that you've found to work. Likewise, is the SMB and SBL timeout explanation something coming from them? I'm just curious whether we can expect a patch anytime soon. -Scott
    Sunday, August 12, 2018 21:02
  • Hi Paul

    Thanks for the feedback!

    I will try the above when I need to reboot one of my nodes.

    Still don't know why Microsoft hasn't released a patch for this; it is quite a big issue, especially on S2D, which they are trying to promote quite a lot.

    Monday, August 13, 2018 07:02
  • Yes, I can confirm this is coming directly from the Microsoft PG and from really smart people working for a hardware vendor I will not name :)

    The patch - I'm not sure. I'm aware of some information suggesting this has already been... "fixed" in Server 2019, so I would hope something is coming soon for 2016 to address the timeout issues.

    From a service / SLA / KPI perspective, our commercial entities aren't keen on adopting this onto contract until said fix has been provided, but that's not something I generally concern myself with. From a technical perspective, the above workaround does exactly what it needs to in order to prevent any issues.

    The only thing you're not guarded against without said patch is node failure.

    Monday, August 13, 2018 15:16
  • Directly to Martin - 

    "Still don't know why Microsoft hasn't released a patch on this issue as this is quite a big issue especially on S2D which they are trying to promote quite a lot"

    Without meaning to stir up trouble here - the reason we were given was that "no one else is seeing it". Make of that what you will.

    I think it's fair to say that in the vast majority of cases seen by their tech support, this issue is caused by RoCE being misconfigured and network issues occurring as a result.

    However, simply saying "this isn't happening anywhere else" appears inaccurate. Everyone here is running the same code and appears to be experiencing the same issue. Worse still, if there is a genuine bug in the code that has been addressed in a Server 2019 build (only what I've been told), this must have been known.


    • Edited by Paul May Monday, August 13, 2018 15:21
    Monday, August 13, 2018 15:20
  • Update:  A KB article has been published that discusses this issue:

    https://support.microsoft.com/en-us/help/4462487/event-5120-with-status-io-timeout-c00000b5-after-an-s2d-node-restart-o


    Thanks!
    Elden




    Monday, August 13, 2018 19:56
    Owner
  • Hello there,

    We are having the same issues, and the mentioned command is not a solution...

    Our environment is a 5-node disaggregated S2D cluster (SOFS) with HDDs and SSDs. We are using RoCE v2 with Mellanox switches.

    Yesterday I started to install the newest Windows Updates on our S2D Cluster and used the mentioned command: 

    Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Enable-StorageMaintenanceMode

    Right before this command I used Suspend-ClusterNode -Drain, which moved the CSVs off the node correctly.

    After both commands, I checked the state and found the physical disks in maintenance mode, but the StorageEnclosure and the SSU were still healthy and OK - in other words, not in maintenance mode. We monitored the RDMA traffic on all nodes and saw the suspended node's traffic drop off.

    In the meantime the virtual disks changed state to Degraded as well as Incomplete, and storage jobs started running.

    It was time to restart the node. When I restart a node I run Get-VirtualDisk from another node; the command hangs for a few moments until the restarting node has stopped all its services. Most disks were then detached... VMs were restarting... and the related StorageFaultDomain types showed LostCommunication.

    To make everything worse, on the third node (yeah... we kept going with patching the nodes...) an Optimize and Rebalance job started after Enable-StorageMaintenanceMode. I had to wait until those jobs completed before I could restart the node. Now I am about to patch the last two...

    This is a very critical issue and I appreciate any help. Has anyone had success with the mentioned command???

    Thanks

    Friday, August 24, 2018 09:54
  • Hi

    Yes, I had kind of the same issues on my cluster as well.

    I have a 5-node cluster (SSD, HDD), Mellanox switches, SuperMicro nodes.

    This past weekend I patched the nodes and did the reboots afterwards.

    I used the below command to place all my disks in maintenance before the reboots:

    Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Enable-StorageMaintenanceMode

    I made sure all physical disks and virtual disks were in a healthy state before rebooting the nodes, and also made sure all storage jobs had finished syncing.

    Four of my nodes rebooted without any issues, but when rebooting the last node my whole cluster went into a hung state, and CSVs and VMs went down in the production environment.

    I left the cluster for about an hour and all the CSVs came online again, but they are now in a degraded state.

    Now I have the same issue with "Lost Communication" on HPV03.

    Storage jobs are extremely slow now after the failures on the cluster, and the rebalance is only at 30% after leaving it to sync for almost a week (the storage jobs almost seem stuck).

    Any advice would be appreciated.

    Thanks


    Friday, August 24, 2018 12:04
  • PWarPF and Martin Le Roux - what firmware do you have on your Mellanox cards? Also, what driver version?
    Friday, August 24, 2018 12:56
  • Hello scottkreel,

    Thank you for your response.

    We have Mellanox ConnectX-3 Pro Ethernet adapters (part number MCX314A-BCCT). Every node in the cluster has been running firmware version 2.40.5032 and driver version 5.35.12978.0 since setup in early 2017. The Hyper-V nodes are on the same build as well.

    Friday, August 24, 2018 14:00
  • What if you upgrade the firmware to 02.42.5000? I thought I read somewhere that was the minimum supported? I think the driver version is fine. Can anyone else confirm this?
    Friday, August 24, 2018 14:17
  • Thanks again for your response. I was out of the office for a few days.

    I will try the firmware update in our test environment. I hope I can reproduce the failure behaviour.

    Wednesday, August 29, 2018 07:21
  • I wasn't able to reproduce the failure in our test environment... but I successfully updated the Mellanox card firmware. Unfortunately I didn't find a way to reload the firmware without rebooting the host.

    My next step will be to upgrade the firmware on the remaining two nodes while updating Windows...

    We will see if it helps with the next Windows updates in September...

    Wednesday, August 29, 2018 12:31
  • Does your test environment mirror your production environment verbatim? 

    Can you confirm PFC/CoS and pause frames are working on the nodes? https://blog.workinghardinit.work/tag/pause-frames/ Each priority should have counters for sent and received pause frames. How many classes are you using?
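
    The per-priority pause-frame counters show up in perfmon via the NIC vendor's counter sets; the set names vary by vendor and driver, so something like this should show what your nodes actually expose:

    # List the vendor QoS / priority counter sets available on this node
    Get-Counter -ListSet *Mellanox*, *QoS*, *Priority* -ErrorAction SilentlyContinue | Select-Object CounterSetName

    # Then sample the pause-frame counters from whichever set applies, for example:
    # Get-Counter -Counter (Get-Counter -ListSet "<CounterSetName>").Paths -SampleInterval 5 -MaxSamples 3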

    Would you say using the mentioned storage maintenance mode for the disks makes the situation worse? I.e., are you better off not enabling storage maintenance mode on the nodes before rebooting?

    Wednesday, August 29, 2018 13:11
  • No, not really... only the number of nodes, Mellanox cards, mainboard and LSI adapter are the same...

    Yes, the counters are increasing on the switch side as well.

    I am sorry, what do you mean by "How many classes are you using?"?

    We had the same problems with the Windows updates in July without using StorageMaintenanceMode, so no, it actually makes no difference for us...

    Thursday, August 30, 2018 08:18
  • The traffic class. The DCB priority. Get-NetQosTrafficClass
    Thursday, August 30, 2018 14:22
  • scottkreel, you also need to update the Mellanox driver to version 5.50.
    Thursday, August 30, 2018 17:04
    Owner
  • A KB article has been written discussing the impact of enabling SMB signing or SMB encryption on an RDMA-enabled NIC, which was also discussed in this thread. Here's the KB:

    https://support.microsoft.com/en-us/help/4458042/reduced-performance-after-smb-encryption-or-smb-signing-is-enabled

    Thanks!
    Elden

    Thursday, August 30, 2018 17:06
    Owner
  • In the May cumulative update we introduced SMB Resilient Handles for the S2D intra-cluster network to improve resiliency to transient network failures (and specifically to better handle RoCE congestion). This has had the side effect of increasing timeouts when a node is rebooted, which can affect a system under stress. Symptoms include event ID 5120s with a status code of STATUS_IO_TIMEOUT or STATUS_CONNECTION_DISCONNECTED when a different node in the cluster is rebooted.
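
    To check whether a cluster is hitting this, the 5120 events are logged to the System log by the FailoverClustering provider; a quick query along these lines should surface them (adjust as needed):

    # Pull recent 5120 events and show which status code they carry
    Get-WinEvent -FilterHashtable @{ LogName = 'System'; ProviderName = 'Microsoft-Windows-FailoverClustering'; Id = 5120 } -MaxEvents 50 |
        Select-Object TimeCreated, Message | Format-List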

    Symptoms can sometimes be more extreme on systems with large amounts of memory when a Live Dump is created, triggered by the 5120 being logged. This can cause nodes to fall out of cluster membership or volumes to fail. Disabling Live Dumps is another way to help mitigate the impact when the issue occurs.

    We are working on a fix. Until it is available, a workaround that addresses the issue is to invoke Storage Maintenance Mode prior to rebooting a node in a Storage Spaces Direct cluster - when patching, for example.

    So, first drain the node, then invoke Storage Maintenance Mode, then reboot. Here's the syntax:

    Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Enable-StorageMaintenanceMode

    Once the node is back online, disable Storage Maintenance Mode with this syntax:

    Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Disable-StorageMaintenanceMode

    End-to-end shutdown process goes:

    1. Run the Get-VirtualDisk cmdlet and ensure the HealthStatus shows 'Healthy'.

    2. Drain the node by running:
       Suspend-ClusterNode -Drain

    3. Invoke Storage Maintenance Mode:
       Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Enable-StorageMaintenanceMode

    4. Run the Get-PhysicalDisk cmdlet and ensure the OperationalStatus shows 'In Maintenance Mode'.

    5. Reboot:
       Restart-Computer
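
    Pulled together for one node, the sequence looks roughly like this (treat it as a sketch - substitute the real node name and run the storage commands from a node that stays up):

    $node = "<NodeName>"

    # 1. Storage must be healthy before starting
    if (Get-VirtualDisk | Where-Object { $_.HealthStatus -ne 'Healthy' }) { throw "Not all virtual disks are Healthy - stopping." }

    # 2. Drain the roles off the node
    Suspend-ClusterNode -Name $node -Drain -Wait

    # 3. Put the node's disks into storage maintenance mode
    Get-StorageFaultDomain -Type StorageScaleUnit | Where-Object { $_.FriendlyName -eq $node } | Enable-StorageMaintenanceMode

    # 4. Confirm the disks report 'In Maintenance Mode'
    Get-PhysicalDisk | Format-Table FriendlyName, OperationalStatus

    # 5. Reboot the node (can also be run locally on the node itself)
    Restart-Computer -ComputerName $node -Force

    # Once it is back online and the storage jobs have settled, reverse the process:
    Get-StorageFaultDomain -Type StorageScaleUnit | Where-Object { $_.FriendlyName -eq $node } | Disable-StorageMaintenanceMode
    Resume-ClusterNode -Name $node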


    Thanks!
    Elden




    I am glad to read that you guys worked out the problem and a fix is under way, but I am not sure this should be marked as the "answer" on this thread because this workaround only applies to scheduled, planned server maintenance and not a failure. The answer should be a link to a KB article for a hotfix when it arrives.

    -Jason

    Thursday, August 30, 2018 17:32
  • Thank you for all the support.

    We only configured Priority 3 for SMB.

    Name                      Algorithm Bandwidth(%) Priority                  PolicySet        IfIndex IfAlias
    ----                      --------- ------------ --------                  ---------        ------- -------
    [Default]                 ETS       50           0-2,4-7                   Global
    SMB                       ETS       50           3                         Global

    We will try a driver update in the next maintenance window... this will probably coincide with the next round of Windows updates.

    We don't have any of the reported event log messages regarding SMB signing or SMB encryption; the two settings are not enabled either.

    Get-SmbServerConfiguration | Select EncryptData, RequireSecuritySignature | fl
    EncryptData              : False
    RequireSecuritySignature : False

    Friday, August 31, 2018 07:09
  • Does anyone from Microsoft have an update on this?

    This thread started in April and it's now mid-September; clearly a number of production users are suffering severe issues with their S2D clusters.

    We too have a production cluster which is experiencing exactly the issues described. We've tried all the suggestions above, and yet yesterday during patching we suffered more random VM reboots when a supposedly paused node was rebooted.

    (BTW, in our cluster the issue also seems to kill Hyper-V Replica to a DR site - resyncs required all round.)

    From this post we thought it might be Windows Defender:

    https://social.technet.microsoft.com/Forums/ie/en-US/dc125221-824e-46ad-955e-8cdaaa66dec7/hyperv-live-mitration-fail-when-hyperv-replica-is-enabled-in-virtual-machines?forum=winserverhyperv

    But that doesn't seem to have resolved it (it's slightly better), and Microsoft *STILL* cannot agree on the official exclusions list: VMSP.EXE is missing from the official Windows Defender exclusions.
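
    If anyone wants to add the missing process exclusion manually in the meantime, it should just be the following on each host (check the current Hyper-V exclusion guidance before relying on it):

    # Add the missing Hyper-V process to the Defender exclusions on this host
    Add-MpPreference -ExclusionProcess "vmsp.exe"

    # Verify what is currently excluded
    Get-MpPreference | Select-Object -ExpandProperty ExclusionProcess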

    It's unpleasant to explain to clients that, after months, Microsoft don't seem to be focused on this.

    (We too have an open case, but it's like walking through treacle.)

    Any luck from anyone experimenting?

    Many thanks!

    Sunday, September 16, 2018 17:07
  • Hi Mark

    We logged a call with Microsoft a while ago and received a "private hotfix" from them, which we applied to all hosts.

    Microsoft stated in the mail that they will be releasing this hotfix in the next rollup.

    We rebooted all our hosts this past weekend, and this was the first time we didn't have any issues during the reboots.

    I really hope they release this patch soon, as I know a lot of people are struggling with this issue.

    Thanks


    Monday, September 17, 2018 04:05
  • Hi Martin,

    Thanks for sharing the update; we have a case open but they've not shared that information.

    I don't suppose there is an ID number for the fix you can share? :)

    Many thanks

    Mark.

    Tuesday, September 18, 2018 08:35