S2D IO TIMEOUT when rebooting node

    Question

  • I am building a 6-node cluster: 12 x 6TB drives, 2 x 4TB Intel P4600 PCIe NVMe drives - Xeon Platinum 8168 / 768GB RAM, LSI 9008 HBA.

    The cluster passes all tests, the switches are properly configured, and the cluster works well, exceeding 1.1 million IOPS with VMFleet. However, at the current patch level (as of April 18, 2018) I am experiencing the following scenario:

    When no storage job is running and all vdisks are listed as healthy, and I pause a node and drain it, all is well, right up until the server is actually rebooted or taken offline. At that point a repair job is initiated, IO suffers badly and can even stop altogether, causing vdisks to go into a paused state due to IO timeout (listed as the reason in the cluster events).

    Exacerbating this issue, when the paused node reboots and rejoins, it causes the repair job to suspend, stop, then restart (it seems... tracking this is hard as all storage commands become unresponsive while the node is joining). At this point IO is guaranteed to stop on all vdisks at some point, for long enough to cause problems, including VM reboots.

    The cluster was initially formed using VMM 2016. I have tried manually creating the vdisks, using single resiliency (3-way mirror) and multi-tier resiliency, with the same effect. This behavior was not observed when I did my POC testing last year. It's frankly a deal breaker and unusable: if I cannot reboot a single node without stopping my workload entirely, I cannot deploy. I'm hoping someone has some info. I'm going to re-install with Server 2016 RTM media, keep it unpatched, and see if the problem remains. However, it would be desirable to at least start the cluster at full patch. Any help appreciated. Thanks


    Wednesday, April 18, 2018 7:52 AM

Answers

All replies

  • OK, I cleaned the servers, reinstalled Server 2016 version 10.0.14393, and the cluster is handling pauses as expected. I am taking a guess that KB4038782 is the culprit, as that changed logic related to suspend/resume and now no longer puts disks in maintenance mode when suspending a cluster node. I will patch up to August 2017 and see if the cluster behaves as expected. Then, until I can get something from Microsoft on this, I'm not likely to patch beyond that for a while.

    If anyone knows anything, I'm happy to hear it!

    Thanks

    Thursday, April 19, 2018 1:27 AM
  • Hi,

    Sorry for the delayed response.

    This is a quick note to let you know that I am currently performing research on this issue and will get back to you as soon as possible. I appreciate your patience.

    If you have any updates during this process, please feel free to let me know.

    Best Regards,

    Candy


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com   

    Tuesday, May 1, 2018 3:57 PM
  • Hey James, I am experiencing the exact same issue. I have a fully functioning 4-node S2D HCI cluster that I have used for a while now. Fully in production, rock solid. Never saw any IO pausing during node updates and moving workloads around. I am building another 4-node HCI cluster for another data center, same configuration, but since I am deploying it new I grabbed the latest OS build (KB4103720) and updated the servers. I have been pulling my hair out since last Thursday with this. I have been combing over my configuration and comparing it to what I have in production now. The servers are now on KB4284880 (14393.2312) and the IO still drops.

    What I am seeing sounds exactly the same as you. Running a disk IO test on the nodes, pause a node and all is well. Reboot a node, and within a few seconds of initiating the node reboot the IO on the other nodes comes to a complete stop. Sometimes it stops long enough for the test software to throw an IO error; other times it stops for 15~20 seconds and then goes back to full speed. It will stay churning along at full speed until that node starts to reboot and rejoin the cluster, and then the IO will completely stop on the nodes again, but for less time, maybe 10 seconds, and then full speed ahead. Then the VD repair job fires off as expected.

    I am semi-glad to read that it's not some hardware thing. I am going to reimage my servers with an earlier version of SRV16 and see if I can get on down the road with this thing. Thanks for the post.


    -Jason

    Friday, June 15, 2018 7:58 PM
  • James, I can verify that after wiping the operating system off the cluster nodes and reimaging them with Server 2016 Datacenter [March 29, 2018 - KB4096309 (OS Build 14393.2156)], I am not experiencing the issues any longer. So some update between 2156 and 2312 breaks the S2D resiliency. This is much further along than the August 2017 patch.

    -Jason

    Friday, June 22, 2018 7:59 PM
  • What if you put all the disks in maintenance mode prior to rebooting the node? Then take the disks out of maintenance mode after the node reboots?

    PowerShell code to do this can be found here: http://kreelbits.blogspot.com/2018/07/the-proper-way-to-take-storage-spaces.html
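
    For reference, the core of that approach (the linked post has the full, tested version) is roughly along these lines; <NodeName> is a placeholder for the node being serviced:

    # After draining the node, put all of its drives into storage maintenance mode
    Get-StorageFaultDomain -Type StorageScaleUnit |
        Where-Object FriendlyName -eq '<NodeName>' |
        Enable-StorageMaintenanceMode

    # ...reboot the node and let it rejoin the cluster...

    # Then take the drives back out of maintenance mode
    Get-StorageFaultDomain -Type StorageScaleUnit |
        Where-Object FriendlyName -eq '<NodeName>' |
        Disable-StorageMaintenanceMode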

    Tuesday, July 10, 2018 8:42 PM
  • Following with great interest.

    We are experiencing the same issues on a 9-node HCI cluster where all of the node-draining steps are followed (one node at a time), and as soon as the machine reboots, catastrophe occurs. The last time we saw the symptoms described here we lost 7 of 9 CSVs and 420 production VDIs fell over.

    If anyone has a definite "this KB breaks S2D resilience during a node reboot" cause then I'd love to hear it.

    Thanks a lot, great thread.
    P

    Update - 

    Following on from a conversation with a vendor we've been urged to install the latest KB/CU update for Server 2016. 

    https://support.microsoft.com/en-gb/help/4345418

    Word is there are unlisted changes in this CU that potentially alleviate the I/O timeout issues being felt by people running HCI/S2D platforms.

    I've applied this update to all 9 nodes in our cluster and will perform some further testing, but so far it looks fairly positive as no issues have been felt whilst downing any of the nodes after patching.

    The patching process was not so smooth; the cluster went into a very odd/inaccessible yet "running" state during patching, but I digress.


    Friday, July 20, 2018 10:26 AM
  • Can confirm the latest CU does not fix the issue.

    Just drained a node of all roles, paused it then rebooted.
    Instantly STATUS_IO_TIMEOUT c00000b5.

    4 cluster nodes experienced eviction from the cluster and all storage has gone into regenerating.


    Monday, July 23, 2018 9:35 AM
  • I apologize for the lack of direct version numbers.

    I can say that applying up to the last CU in April 2018 DOES still cause this problem. I reinstalled completely with 2016 RTM and deployed VMFleet, then used it to stress my cluster while I paused and drained a node. I was able to go through all six nodes with the last April CU with VMFleet running, and saw IO pauses of only 2 seconds or less during the transition on the fleet. That said, I deployed. Today I experienced a failure during a Veeam backup of one mirror-accelerated parity disk, which I traced back to an issue with node 4. After mucking about a bit, I paused and drained the node and decided to reboot it.

    Instantly all 5 vdisks went into a degraded state, a repair job started and IO was severely limited. To top it all off, the Veeam issue with VSS that initially caused the problem triggered the cluster to move one of my VM roles to other nodes rapidly, which corrupted its OS VHDX. I had to go to the backup to restore it as it was beyond repair. This has cost me a TON of time. I'm calling in to open a ticket, or whatever the hell Microsoft's process is for SA customers, on Monday, and they will either produce a good answer or I have to call this technology not ready for prime time and flee back to VMware.

    I must add, I now know what the folks at the Harland and Wolff shipyard must have felt like when they finished building the ship, only for it to sink before it even got where it was going... Thank God my hardware is also compliant for vSAN...

    Anyone got anything? Cause I'm tapped out.

    --Frustrated 

    Saturday, July 28, 2018 4:49 AM
  • If I put the disks into maintenance mode it removes the CSV and all the VMs go offline anyway. I wouldn't be able to tell the difference.

    -Jason

    Monday, July 30, 2018 9:30 PM
  • So I called Microsoft... Our SA benefits include what they say is a "2 hour" SLA. 14 hours later I got a call back, and really got nowhere. So I started thinking about the network stack. Anyone here still watching this thread? I'm wondering if there is some PFC issue with my switch. Does anyone here know what effect a malfunctioning or unconfigured PFC setup would have on S2D on a non-congested (far below the link rate of the network card) network segment?
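
    For anyone wanting to sanity-check that side of things, the node's DCB/PFC state can be read from PowerShell with something like the following (a sketch only; the output still needs to be compared against your switch configuration):

    # Which priorities have Priority Flow Control enabled on this node
    Get-NetQosFlowControl

    # Traffic classes and bandwidth reservations (SMB Direct is commonly on priority 3)
    Get-NetQosTrafficClass

    # What the NIC itself reports as its operational DCB/QoS state
    Get-NetAdapterQos

    # Confirm RDMA is actually enabled on the storage adapters
    Get-NetAdapterRdma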

    Monday, July 30, 2018 11:40 PM
  • Could be something with the network. Check out this link: https://social.technet.microsoft.com/Forums/lync/en-US/7050ddc3-fc25-4487-819e-e36f609b9005/s2d-disk-performance-problem-grinding-to-a-halt?forum=winserverClustering Do you have the latest NIC drivers and firmware? There are some threads about certain drivers not working with certain firmware. There are also some issues with Intel NVMe drives...
    Tuesday, July 31, 2018 3:35 AM
  • Hi everyone

    We are engaging directly with Microsoft on this one as well.

    For perspective, I've set up two identical 9-node clusters: Dell hardware, Cisco switches.
    When I say identical, I mean identical down to the firmware of every single device, switch configuration, internal components and so on.

    Cluster A is running KB4088787 from 13th March 2018 and is experiencing no issues.
    I can drain, failover, reboot and bring nodes back up without any major impact to the CSVs.

    Cluster B is running KB4345418 from 16th June 2018. This configuration has the issues listed in this thread, becoming far more prevalent when the cluster is running a simple workload (around 30 W10 VDIs).

    All drivers are up to date, all firmware is up to date, everything we have is in a supportable and validated configuration. Our cluster validation tests on both clusters are absolutely clean, not even a warning.

    I am now working with another tech supplier who is using HP kit with Cisco 9k switches and is seeing exactly the same issue as we are with the Dell. This customer has also broken their clusters down into smaller node counts and eliminated RDMA completely by running S2D over TCP. The problem still persists. So in that case, anything networking related, specifically around PFC and QoS/CoS, isn't relevant.

    We've gone further and run network traces and analysed switch error counters whilst simulating the error and have seen no issues when the STATUS_IO_TIMEOUT occurs. Our network validation for RDMA/RoCE has been performed by the vendor and another technology supplier - no issues seen at all with out of order packets, CRC errors and so on.

    The evidence at the moment points strongly to an S2D / clustering code change that seems to have occurred in an April CU, so we are going through the painstaking process of patching up through later CUs one at a time to recreate the issue and work out which CU may be causing unforeseen problems.

    I'm not  sure how that genuinely helps us work toward a fix and this has been extremely slow going so far.



    Tuesday, July 31, 2018 10:30 AM
  • Please open a case with Microsoft support; we need to capture some dumps and debug this issue. If you have a case open, please tell your case owner to reach out to me so that I can give them instructions on what data I want collected.

    Thanks!
    Elden

    Tuesday, July 31, 2018 6:58 PM
    Owner
  • Hi Elden, you're speaking to me daily at the moment in the UK :)

    James Canon - a new bit of information and something for you to try.
    Check your SMB signing settings either in local or group policy on your HCI nodes.

    I'm not stating this will work for you, but it may be worth taking a look - we've disabled SMB signing on both the client and server side and the issues have disappeared.

    To be specific in GPO that applies to our HCI cluster nodes:

    Computer Configuration | Policies | Windows Settings | Security Settings | Local Policies | Security Options

    Microsoft Network Server - Digitally sign communications (always) - DISABLED

    Microsoft Network Server - Digitally sign communications (if client agrees) - DISABLED

    Microsoft Network Client - Digitally sign communications (always) - DISABLED

    Microsoft Network Client - Digitally sign communications (if client agrees) - DISABLED

    If you do decide to take a look and try this, consider briefly how it may impact connectivity between other things in your infrastructure or DCs before going ahead.

    I'm slowly going through the process of testing this 1 by 1 to see if it's a single setting or a combination of the above, but disabling all of these for me has worked as a sledgehammer approach.

    So far the only setting I've tested in isolation is the server (always) setting to disabled... and it seems very positive. No more timeouts or CSV replication pauses on drain/reboot, it's currently behaving just fine.

    I'm just about to drop 2 drained nodes at once with the above setting for a little deeper testing.

    More when I have it.

    Edit - dropped 2 drained nodes with the setting above set to DISABLED... fine. No timeouts. My next step is to reverse that back out and re-test.
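
    If you want to check what the nodes currently have in effect before touching GPO, the signing settings can be read with something like this (a sketch; run it on each node):

    # Server-side SMB signing settings
    Get-SmbServerConfiguration | Select-Object EnableSecuritySignature, RequireSecuritySignature

    # Client-side SMB signing settings
    Get-SmbClientConfiguration | Select-Object EnableSecuritySignature, RequireSecuritySignature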
    Wednesday, August 1, 2018 5:50 PM
  • Hi Paul

    Are you still experiencing the same issues after you disabled SMB signing on all nodes?

    I am considering doing the same on my cluster and would appreciate feedback on this.

    Thanks

    Martin


    Wednesday, August 8, 2018 7:20 AM
  • Hi Martin

    In short... yes, the problems still occurred but it turns out turning the SMB signing off was an efficiency gain rather than a fix to the issue. (But you should still force it off).

    It seems that the issue described in this post is now a somewhat recognised/accepted issue in the Storage Bus Layer (SBL), causing timeouts when nodes attempt to write to disks on a machine that has been rebooted.

    From what I understand, the SMB timeout is shorter than the SBL timeout, and basically the SBL is "unaware" that a node is going down for a reboot and tries to write to disks on nodes that are no longer active.

    I believe this will have to be fixed in a patch.

    Until that happens there is something you can do - the disks need to be put into maintenance mode on the node prior to dropping it and you should avoid any timeout errors.

    Drain the node as normal and wait for it to enter the paused state.

    PowerShell -

    Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Enable-StorageMaintenanceMode

    Wait for it to complete fully. Check the status of the disks -

    Get-PhysicalDisk | Where-Object {$_.OperationalStatus -ne "OK"}

    Drop the node when completed; when it comes back up, allow the S2D storage repair jobs to run as normal.
    When storage jobs are complete -

    Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Disable-StorageMaintenanceMode

    To bring it out of maintenance mode.

    I've found during testing this avoided any 5120 timeout errors (in any shape or form) but it does take a fair few minutes to run as it puts the SBL into the correct state to handle a reboot.
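
    To know when the storage jobs are complete before disabling maintenance mode, a crude wait loop along these lines can help (a sketch; the polling interval is arbitrary):

    # Block until no storage repair/rebalance jobs are still running
    while (Get-StorageJob | Where-Object JobState -eq 'Running') {
        Start-Sleep -Seconds 60
    }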


    Sunday, August 12, 2018 8:45 PM
  • Paul- Did MS and Elden advise that you put the disks in maintenance mode before rebooting? Just wondering if it’s something they’re officially advising or if it’s something that you’ve found to work. Likewise, is the SMB and SBL time-out stuff something coming from them? I’m just curious if we can expect a patch anytime soon? -Scott
    Sunday, August 12, 2018 9:02 PM
  • Hi Paul

    Thanks for the feedback!

    I will try the above when I need to reboot one of my nodes.

    Still don't know why Microsoft hasn't released a patch on this issue as this is quite a big issue especially on S2D which they are trying to promote quite a lot

    Monday, August 13, 2018 7:02 AM
  • Yes I can confirm this is coming directly from Microsoft PG and really smart people working for a hardware vendor I will not name :)

    The patch - I'm not sure. I'm aware of some information around the fact that this has already been ... "fixed" in server 2019 so I would hope something would be coming soon for 2016 to address the timeout issues.

    From a service / SLA / KPI perspective our commercial entities aren't keen on adopting this onto contract until said fix has been provided, but that's not something I generally concern myself with. From a technical perspective the above workaround does exactly what it needs to in order to prevent any issues.

    The only thing you're not guarded against without said patch is node failure.

    Monday, August 13, 2018 3:16 PM
  • Directly to Martin - 

    "Still don't know why Microsoft hasn't released a patch on this issue as this is quite a big issue especially on S2D which they are trying to promote quite a lot"

    Without meaning to stir trouble here - the reason we received was because "no one else is seeing it". Make of that what you will.

    I think it's fair to say that, across the vast number of cases seen by their tech support, this issue is usually caused by RoCE being misconfigured and network issues occurring as a result.

    However, simply saying "this isn't happening anywhere else" appears inaccurate. Everyone here is running the same code and appears to be experiencing the same issue. Worse still, if there is a genuine bug in the code that has been addressed in a Server 2019 variant (only what I've been told), this must have been known.


    Monday, August 13, 2018 3:20 PM
  • Update:  A KB article has been published that discusses this issue:

    https://support.microsoft.com/en-us/help/4462487/event-5120-with-status-io-timeout-c00000b5-after-an-s2d-node-restart-o


    Thanks!
    Elden




    Monday, August 13, 2018 7:56 PM
    Owner
  • Hello there,

    we are having the same issues and the mentioned command is not a solution...

    Our environment is a 5-node disaggregated S2D cluster (SOFS), with HDDs and SSDs. We are using RoCE v2 with Mellanox switches.

    Yesterday I started to install the newest Windows Updates on our S2D Cluster and used the mentioned command: 

    Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Enable-StorageMaintenanceMode

    Right before this command I used Suspend-ClusterNode -Drain, which moved the CSVs correctly from the node.

    After both commands, I checked the state and found the physical disks in maintenance mode, but the StorageEnclosure and the SSU were still healthy and OK, in other words not in maintenance mode. We monitored the RDMA traffic on all nodes and saw the suspended node's traffic drop off.

    In the meantime the virtual disks changed state to Degraded as well as Incomplete, and storage jobs began running.

    It was time to restart the node. When I restart a node I run Get-VirtualDisk from another one. The command hangs for a few moments until the restarting node has stopped all its services. Most disks are now detached... VMs are restarting... the related StorageFaultDomain types are now in LostCommunication.

    To make everything worse, on the third node (yeah... we are still going further with patching the nodes...) an Optimize and Rebalance started after Enable-StorageMaintenanceMode. I had to wait until those tasks were completed before I could restart the node. Now I am about to patch the last two...

    This is a very critical issue and I appreciate any help. Has anyone had success with the mentioned command?

    Thanks

    Friday, August 24, 2018 9:54 AM
  • Hi

    Yes, I kind of had the same issues on my cluster as well.

    I have a 5-node cluster (SSD, HDD), Mellanox switches, SuperMicro nodes.

    This past weekend I patched the nodes and did the reboots afterwards.

    I used the below command to place all my disks in maintenance before the reboots:

    Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Enable-StorageMaintenanceMode

    I made sure all physical disks and virtual disks were in a healthy state before rebooting nodes, and also made sure all storage jobs were in sync.

    4 of my nodes rebooted without any issues, but when rebooting the last node my whole cluster went into a hung state and CSVs and VMs went down in the production environment.

    I left the cluster for about 1 hour and all CSVs came online again, but all CSVs are now in a degraded state.

    Now I have the same issue with "Lost Communication" on HPV03.

    Storage jobs are extremely slow now after the failures on the cluster, and the rebalance is only at 30% after leaving it to sync for almost a week (the storage jobs almost seem to be stuck).

    Any advice would be appreciated.

    Thanks


    Friday, August 24, 2018 12:04 PM
  • PWarPF and Martin Le Roux - what firmware do you have on your Mellanox cards? Also, what driver version?
    Friday, August 24, 2018 12:56 PM
  • Hello scottkreel,

    Thank you for your response.

    We have Mellanox ConnectX-3 Pro Ethernet adapters (Part Number: MCX314A-BCCT). Every node in the cluster has been running firmware version 2.40.5032 and driver version 5.35.12978.0 since setup in early 2017. The Hyper-V nodes also have the same build.

    Friday, August 24, 2018 2:00 PM
  • What if you upgrade the firmware to 02.42.5000? I thought I read somewhere that was the minimum supported? I think driver version is fine. Can anyone else confirm this?
    Friday, August 24, 2018 2:17 PM
    Thanks again for your response. I was out of the office for a few days.

    I will try the firmware update in our test environment. I hope I can reproduce the failure behaviour.

    Wednesday, August 29, 2018 7:21 AM
  • I wasn't able to reproduce the failure in our test environment... but I successfully updated the Mellanox card firmware. Unfortunately I didn't find a way to reload the firmware without rebooting the host.

    My next steps will be to upgrade the firmware on the remaining two nodes while updating Windows...

    We will see if it was helpful with the next Windows updates in September...

    Wednesday, August 29, 2018 12:31 PM
  • Does your test environment mirror your production environment verbatim? 

    Can you confirm PFC/CoS and pause frames are working on the nodes? https://blog.workinghardinit.work/tag/pause-frames/ Each priority should have counters for sent and received pause frames. How many classes are you using?

    Would you say using the mentioned storage maintenance mode for the disks makes the situation worse? i.e. you're better off not enabling storage maintenance mode on the nodes before rebooting.

    Wednesday, August 29, 2018 1:11 PM
  • No, not really... only the number of nodes, the Mellanox cards, the mainboard and the LSI adapter are the same...

    Yes, the counters are rising, on the switch as well.

    I am sorry, what do you mean by "How many classes are you using?"?

    We had the same problems with the Windows updates in July without using StorageMaintenanceMode. So no, actually there is no difference for us...

    Thursday, August 30, 2018 8:18 AM
  • The traffic class. The DCB priority. Get-NetQosTrafficClass
    Thursday, August 30, 2018 2:22 PM
  • scottkreel you also need to update the Mellanox driver to version 5.50
    Thursday, August 30, 2018 5:04 PM
    Owner
  • A KB article has been written discussing the impact of enabling SMB Signing or SMB Encryption on an RDMA-enabled NIC, which was also discussed in this thread. Here's the KB:

    https://support.microsoft.com/en-us/help/4458042/reduced-performance-after-smb-encryption-or-smb-signing-is-enabled

    Thanks!
    Elden

    Thursday, August 30, 2018 5:06 PM
    Owner
  • In the May cumulative update we introduced SMB Resilient Handles for the S2D intra-cluster network to improve resiliency to transient network failures (and specifically to better handle RoCE congestion). This has had the side effect of increasing timeouts when a node is rebooted, which can affect a system under stress. Symptoms include event ID 5120s with a status code of STATUS_IO_TIMEOUT or STATUS_CONNECTION_DISCONNECTED when a different node in the cluster is rebooted.

    Symptoms can sometimes be more extreme on systems with large amounts of memory when a Live Dump is created, triggered by the 5120 being logged. This can cause nodes to fall out of cluster membership or volumes to fail. Disabling Live Dumps is another way to help mitigate the impact when the issue occurs.

    We are working on a fix; until it is available, a workaround that addresses the issue is to invoke Storage Maintenance Mode prior to rebooting a node in a Storage Spaces Direct cluster, when patching for example.

    So, first drain the node, then invoke Storage Maintenance Mode, then reboot.  Here’s the syntax:

    Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Enable-StorageMaintenanceMode

    Once the node is back online, disable Storage Maintenance Mode with this syntax:

    Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Disable-StorageMaintenanceMode

    End-to-end shutdown process goes:

    1. Run the Get-VirtualDisk cmdlet and ensure the HealthStatus shows 'Healthy'

    2. Drain the node by running:
       Suspend-ClusterNode -Drain

    3. Invoke Storage Maintenance Mode:
       Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Enable-StorageMaintenanceMode

    4. Run the Get-PhysicalDisk cmdlet and ensure the OperationalStatus shows 'In Maintenance Mode'

    5. Reboot:
       Restart-Computer
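
    Putting those steps together, a rough end-to-end sketch (not an official script; run locally on the node being serviced, with the health check and error handling simplified) could look like this:

    # Only proceed if every virtual disk is currently healthy
    if (Get-VirtualDisk | Where-Object HealthStatus -ne 'Healthy') {
        throw "One or more virtual disks are not healthy - aborting."
    }

    # Drain roles off this node and wait for the drain to finish
    Suspend-ClusterNode -Name $env:COMPUTERNAME -Drain -Wait

    # Put this node's drives into storage maintenance mode
    Get-StorageFaultDomain -Type StorageScaleUnit |
        Where-Object FriendlyName -eq $env:COMPUTERNAME |
        Enable-StorageMaintenanceMode

    # Confirm the drives now report 'In Maintenance Mode', then reboot
    Get-PhysicalDisk | Select-Object FriendlyName, OperationalStatus
    Restart-Computer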


    Thanks!
    Elden




    I am glad to read that you guys worked out the problem and a fix is under way, but I am not sure this should be marked as the "answer" on this thread because this workaround only applies to scheduled, planned server maintenance and not a failure. The answer should be a link to a KB article for a hotfix when it arrives.

    -Jason

    Thursday, August 30, 2018 5:32 PM
  • Thank you for all the support.

    We only configured Priority 3 for SMB.

    Name                      Algorithm Bandwidth(%) Priority                  PolicySet        IfIndex IfAlias
    ----                      --------- ------------ --------                  ---------        ------- -------
    [Default]                 ETS       50           0-2,4-7                   Global
    SMB                       ETS       50           3                         Global

    We will try a driver update in the next maintenance window..... this will probably be the case with the next Windows updates.

    We don't have any of the reported event log messages regarding SMB signing or SMB encryption; also, the two settings are not enabled.

    Get-SmbServerConfiguration | Select EncryptData,RequireSecuritySignature | fl
    EncryptData                    : False
    RequireSecuritySignature : False

    Friday, August 31, 2018 7:09 AM
  • Does anyone from Microsoft have an update on this?

    This post started in April; it's now mid-September, and clearly a number of production users are suffering severe issues with their S2D clusters.

    We too have a production cluster which is experiencing exactly the issues described. We've tried all the suggestions above, and yet yesterday during patching we suffered more random VM reboots as a supposedly paused node was rebooted.

    (BTW, in our cluster the issue also seems to kill Hyper-V Replica to a DR site - resyncs required all round)

    From this post we thought it might be Windows Defender:

    https://social.technet.microsoft.com/Forums/ie/en-US/dc125221-824e-46ad-955e-8cdaaa66dec7/hyperv-live-mitration-fail-when-hyperv-replica-is-enabled-in-virtual-machines?forum=winserverhyperv

    But that doesn't seem to have resolved it (slightly better), and Microsoft *STILL* cannot agree on the official exclusions list (VMSP.EXE is missing from the official WD exclusions).

    It's unpleasant to explain to clients that after months Microsoft don't seem to be focused on this.

    (We too have an open case, but it's like walking through treacle)

    Any luck from anyone experimenting?

    Many thanks!

    Sunday, September 16, 2018 5:07 PM
  • Hi Mark

    We logged a call with Microsoft a while ago and we received a "Private Hotfix" from Microsoft, which we applied on all hosts.

    Microsoft stated in the mail that they will be releasing this hotfix in the next rollup.

    We did a reboot of all our hosts this past weekend, and this is the first time we didn't have any issues during the reboots.

    I really hope they release this patch soon, as I know a lot of people are struggling with this issue.

    Thanks


    Monday, September 17, 2018 4:05 AM
  • Hi Martin,

    Thanks for sharing the update, we have a case open but they've not shared that information.

    I don't suppose there is an ID number for the fix you can share? :)

    Many thanks

    Mark.

    Tuesday, September 18, 2018 8:35 AM
  • Hello everyone,

    the new KB (https://support.microsoft.com/en-au/help/4457127/windows-10-update-kb4457127) contains:

    (...)Addresses an issue that causes many input and output (I/O) failures when QoS is enabled. The system does not attempt a retry, and the error code is "STATUS_Device_Busy". This occurs during the periodic failover if Windows Cluster uses storage pool and Multipath I/O (MPIO) is enabled. After installing this update, you can create a registry key (REG_DWORD) with the value "0x1" to allow a retry. The registry path is "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\StorPort\QoSFlags"(...)
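
    For reference, creating that value from PowerShell would look roughly like this (a sketch based only on the KB text quoted above; double-check the path and value against the KB before applying it):

    # Create the QoSFlags DWORD (value 0x1) under the StorPort key to allow the retry behaviour
    $path = 'HKLM:\SYSTEM\CurrentControlSet\Control\StorPort'
    if (-not (Test-Path $path)) { New-Item -Path $path -Force | Out-Null }
    New-ItemProperty -Path $path -Name 'QoSFlags' -PropertyType DWord -Value 0x1 -Force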


    Has anyone installed the new KB and set the Regkey? 

    Thanks in advance

    Monday, September 24, 2018 1:28 PM
  • Is anyone still experiencing this issue even after applying the latest patch? (KB4462928)

    Friday, October 26, 2018 2:28 PM
  • So we just finished applying the update to the last 3 hosts in our 8-node cluster last night and had no issues with the pool or the CSVs going into an offline/pending online state after the host reboot like we were having before. You still have to run the extra script though, to enable storage maintenance mode after you put the host in maintenance mode.

    Friday, October 26, 2018 2:59 PM
  • Thank you for getting back to me... what happens if you reboot a node without running the extra script?
    Friday, October 26, 2018 3:21 PM
  • So after the May CU, before we knew about the Enable-StorageMaintenanceMode script, we had corruption on some of our VMs and actually had to rebuild a few. SQL cluster nodes would fail over, and we had all kinds of problems. Once we knew about the secondary script the issues were mostly the pool temporarily going offline (for a few seconds), as well as the CSVs. After the 2928 update we didn't experience that. We have another 7-node cluster we will be updating next week. The main issue is that we lost the business's trust to patch hosts during the day, forcing us to do it at night, which nobody wants.
    Friday, October 26, 2018 3:34 PM
  • Unfortunately I have to report that, for us at least, the issue is not fixed!

    We have 'aggressively' applied both the September and October patches to the 4 nodes of our client's S2D cluster... all 4 nodes are showing as up to date for patches.

    That was a week ago.

    At 3.30am this morning, the cluster logged a 'paused I/O' error and the cluster died in a heap: lots of crashed VMs, and VMs stuck that won't respond, won't live migrate, won't shut down.

    95% of the Hyper-V replication for those VMs to the DR site is also now wrecked, awaiting resync.

    This is a shockingly bad issue to still be going on after so many months; as per the comments above, this is literally at risk of killing servers and losing data.

    Microsoft when are you going to FIX it??

    Sunday, October 28, 2018 9:24 AM
  • MarkCIT - Was this after you rebooted a node? Did one of your nodes die? 
    Sunday, October 28, 2018 2:08 PM
  • Hi Sakreel,

    This cluster was patched in September with the 'QoSFlags' registry entry as well. It had been relatively stable, no random reboots, just a bit 'touchy' when rebooting for patching. 

    When we patched it in October (I'm still not entirely clear whether Sep or Oct had the patch in), this was the first time in months that we've patched the cluster without some kind of issue (replicas killed, etc). 

    So we were feeling a *little* more confident.

    That was a week ago, then randomly (for the first time in a couple of months), 3.30am this morning, nothing happening (no backups etc), just a 'paused I/O' issue logged, then the cluster dived. Alerting notified us and we picked it up a few hours later.

    To be honest it was a complete mess: VMs that wouldn't live migrate, and rebooting one node (to try and clear things up) resulted in the entire CSV going offline (which of course instantly killed EVERY VM).

    We've got VMs with CHKDSK errors and a DC that keeps rebooting. The whole thing is a disaster area.

    I'm struggling to understand why Microsoft is being so quiet and so slow to fix this issue when it is clearly extremely serious.

    Thanks
    Mark


    Sunday, October 28, 2018 3:20 PM
  • This was the first event logged at 3.26am:

    Cluster Shared Volume 'S2D Volume' ('Cluster Virtual Disk (S2D Volume)') has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.

    Then 18 seconds later 3 of the 4 nodes were removed from the 'active cluster', which was of course followed by absolute chaos...

    This has been a pattern since earlier in the year... we've tried configuring Windows Defender in a special way (following posts elsewhere about this)... we've made sure everything is updated... we've applied every monthly patch and configured the QoSFlags...

    Which has amounted to... still having the issue...

    Sunday, October 28, 2018 4:14 PM
  • Just curious if you installed the later patch, KB4462928, because we also installed the October CU, but it wasn't until after the 18th of October that the new patch came out which superseded the previous October update, so we had to install that one too.
    Tuesday, October 30, 2018 12:23 AM
  • Actually we are starting to think this was a very unlucky coincidence now.

    This cluster has dedicated management NICs on each node, and it looks as though the management NICs lost connectivity for a few minutes.

    We didn't expect the I/O to pause and nodes to be evicted should that happen, but clearly we've misunderstood something about the critical nature of the management NICs.

    We are reviewing the configuration now to see if we have misunderstood something, so apologies for the probable red herring. 

    Hopefully the KB issue is actually resolved.


    Tuesday, October 30, 2018 10:42 AM