S2D IO TIMEOUT when rebooting node

    Question

  • I am building a 6-node cluster: 12 6TB drives, 2 4TB Intel P4600 PCIe NVMe drives, Xeon Platinum 8168, 768GB RAM, LSI9008 HBA.

    The cluster passes all tests, the switches are properly configured, and the cluster works well, exceeding 1.1 million IOPS with VMFleet. However, at the current patch level (as of April 18, 2018) I am experiencing the following scenario:

    When no storage job is running and all vdisks are listed as healthy, I can pause a node and drain it and all is well, until the server is actually rebooted or taken offline. At that point a repair job is initiated and IO suffers badly; it can even stop altogether, causing vdisks to go into a paused state due to IO timeout (listed as the reason in the cluster events).

    Exacerbating the issue, when the paused node reboots and rejoins, it seems to cause the repair job to suspend, stop, then restart. Tracking this is hard, as all storage commands become unresponsive while the node is joining. At this point IO is guaranteed to stop on all vdisks for long enough to cause problems, including VM reboots.

    The cluster was initially formed using VMM 2016. I have tried manually creating the vdisks with single resiliency (3-way mirror) and with multi-tier resiliency, with the same effect. This behavior was not observed when I did my POC testing last year. It's frankly a deal breaker: if I cannot reboot a single node without stopping my workload entirely, I cannot deploy.

    I'm hoping someone has some info. I'm going to reinstall with Server 2016 RTM media, keep it unpatched, and see if the problem remains. However, it would be desirable to at least start the cluster at full patch. Any help appreciated. Thanks


    • Edited by James Canon, Wednesday, April 18, 2018 08:00
    Wednesday, April 18, 2018 07:52

All replies

  • OK, I cleaned the servers and reinstalled Server 2016 version 10.0.14393, and the cluster is handling pauses as expected. My guess is that KB4038782 is the culprit, as it changed the logic related to suspend/resume and no longer puts disks in maintenance mode when a cluster node is suspended. I will patch up to August 2017 and see if the cluster still behaves as expected. Until I can get something from Microsoft on this, I'm not likely to patch beyond that for a while.

    If anyone knows anything, I'm happy to hear it!

    Thanks

    Thursday, April 19, 2018 01:27
  • Hi,

    Sorry for the delayed response.

    This is a quick note to let you know that I am currently performing research on this issue and will get back to you as soon as possible. I appreciate your patience.

    If you have any updates during this process, please feel free to let me know.

    Best Regards,

    Candy


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com

    Tuesday, May 1, 2018 15:57
  • Hey James, I am experiencing the exact same issue. I have a fully functioning 4-node S2D HCI cluster that I have used for a while now. It is fully in production and rock solid; I never saw any IO pausing during node updates or while moving workloads around. I am building another 4-node HCI cluster for another data center with the same configuration, but since I am deploying it new I grabbed the latest OS build (KB4103720) and updated the servers. I have been pulling my hair out over this since last Thursday, combing over my configuration and comparing it to what I have in production now. The servers are now on KB4284880 (14393.2312) and the IO still drops.

    What I am seeing sounds exactly like what you describe. With a disk IO test running on the nodes, I pause a node and all is well. When I reboot a node, within a few seconds of initiating the reboot the IO on the other nodes comes to a complete stop. Sometimes it stops long enough for the test software to throw an IO error; other times it stops for 15 to 20 seconds and then returns to full speed. It keeps churning along at full speed until that node starts to come back up and rejoin the cluster, at which point IO completely stops on the nodes again, but for less time, maybe 10 seconds, and then it is full speed ahead. Then the VD repair job fires off as expected.

    I am semi-glad to read that it's not some hardware thing. I am going to reimage my servers with an earlier version of Server 2016 and see if I can get on down the road with this thing. Thanks for the post.


    -Jason

    Friday, June 15, 2018 19:58
  • James, I can verify that after wiping the operating system off the cluster nodes and reimaging them with Server 2016 Datacenter [March 29, 2018, KB4096309 (OS Build 14393.2156)] I am no longer experiencing the issue. So some update between builds 2156 and 2312 breaks the S2D resiliency. That is much further along than the August 2017 patch.

    -Jason
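
    When comparing a working cluster with a failing one, it may help to confirm which build each node is actually running. The following is a minimal sketch, assuming PowerShell remoting is enabled between the nodes and the FailoverClusters module is available; it reads the build and update revision (for example 14393.2156 or 14393.2312) from the registry on every node.

```powershell
# Sketch only: report the OS build and update revision of every cluster node.
# Assumes it is run on a cluster node with PowerShell remoting enabled.
Invoke-Command -ComputerName (Get-ClusterNode).Name -ScriptBlock {
    $cv = Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion'
    [pscustomobject]@{
        # CurrentBuild is the base build (14393 for Server 2016);
        # UBR is the update revision added by each cumulative update.
        Build = "$($cv.CurrentBuild).$($cv.UBR)"
    }
} | Sort-Object PSComputerName | Format-Table PSComputerName, Build
```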

    Friday, June 22, 2018 19:59
  • What if you put all the disks in maintenance mode prior to rebooting the node? Then take the disks out of maintenance mode after the node reboots?

    PowerShell code to do this can be found here: http://kreelbits.blogspot.com/2018/07/the-proper-way-to-take-storage-spaces.html
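
    As a rough illustration of that idea (this is not the code from the linked post), the sequence might look like the sketch below. It assumes the script runs on a different cluster node from the one being rebooted, that the storage scale unit's friendly name matches the node name, and that `Node01` is a placeholder to replace with the real node name.

```powershell
# Sketch only: manually place a node's disks in storage maintenance mode
# around a reboot. 'Node01' is a placeholder node name.
$NodeName = 'Node01'

# 1. Confirm the virtual disks are healthy and no storage jobs are running.
Get-VirtualDisk | Format-Table FriendlyName, HealthStatus, OperationalStatus
Get-StorageJob

# 2. Pause the node and drain its roles.
Suspend-ClusterNode -Name $NodeName -Drain -Wait

# 3. Put every disk behind that node into storage maintenance mode.
#    Assumes the scale unit's FriendlyName equals the node name.
Get-StorageFaultDomain -Type StorageScaleUnit |
    Where-Object FriendlyName -eq $NodeName |
    Enable-StorageMaintenanceMode

# 4. Check that the node's disks now report "In Maintenance Mode".
Get-PhysicalDisk | Format-Table FriendlyName, OperationalStatus, HealthStatus

# 5. Reboot the node remotely and wait until it answers PowerShell remoting.
Restart-Computer -ComputerName $NodeName -Wait -For PowerShell -Force

# 6. Wait for the cluster service on the node to finish rejoining.
while ((Get-ClusterNode -Name $NodeName).State -in 'Down', 'Joining') {
    Start-Sleep -Seconds 5
}

# 7. Take the disks out of maintenance mode, then resume the node.
Get-StorageFaultDomain -Type StorageScaleUnit |
    Where-Object FriendlyName -eq $NodeName |
    Disable-StorageMaintenanceMode
Resume-ClusterNode -Name $NodeName -Failback Immediate

# 8. Watch the resync jobs until they complete.
Get-StorageJob
```

    Whether this avoids the IO stall on the affected builds is untested here; it only restores by hand the maintenance-mode step that, per the earlier post, the suspend operation no longer performs.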

    Tuesday, July 10, 2018 20:42