Storage Spaces Direct / Cluster Virtual Disk goes offline when rebooting a node

    Question

  • Hello

    We have several hyper-converged environments based on HP ProLiant DL360/DL380.
    We have 3-node and 2-node clusters running Windows Server 2016 with current patches, firmware updates done, and a witness configured.

    The following issue occurs with at least one 3-node and one 2-node cluster:
    When we put one node into maintenance mode (correctly, as described in the Microsoft docs, and after checking that everything is fine) and reboot that node, it can happen that one of the Cluster Virtual Disks goes offline. It is always the disk "Performance" with the SSD-only storage, in each environment. The issue occurs only sometimes, not always: sometimes I can reboot the nodes one after the other several times in a row and everything is fine, but sometimes the disk "Performance" goes offline. I cannot bring this disk back online until the rebooted node comes back online. After the node that was down for maintenance is back online, the virtual disk can be brought online without any issues.
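
    For reference, the maintenance sequence described above roughly corresponds to this PowerShell (a minimal sketch; the node name is a placeholder and the exact procedure is the one from the Microsoft docs):

        Suspend-ClusterNode -Name "Node1" -Drain -Wait    # drain roles before maintenance
        Restart-Computer -ComputerName "Node1" -Force      # reboot the drained node
        Resume-ClusterNode -Name "Node1"                   # resume the node once it is back

        # wait until all storage repair jobs have finished before touching the next node
        while (Get-StorageJob | Where-Object JobState -ne 'Completed') {
            Start-Sleep -Seconds 60
        }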

    We have created 3 Cluster Virtual Disks & CSV Volumes on these clusters:
    1x Volume with only SSD Storage, called Performance
    1x Volume with Mixed Storage (SSD, HDD), called Mixed
    1x Volume with Capacity Storage (HDD only), called Capacity

    Disk Setup for Storage Spaces Direct (per Host):
    - P440ar Raid Controller
    - 2 x HP 800 GB NVMe (803200-B21)
    - 2 x HP 1.6 TB 6G SATA SSD (804631-B21)
    - 4 x HP 2 TB 12G SAS HDD (765466-B21)
    - No spare Disks
    - Network Adapter for Storage: HP 10 GBit/s 546FLR-SFP+ (2 storage networks for redundancy)
    - 3 Node Cluster Storage Network Switch: HPE FlexFabric 5700 40XG 2QSFP+ (JG896A), 2 Node Cluster directly connected with each other

    The cluster event log shows the following errors when the issue occurs:

    Error 1069 FailoverClustering
    Cluster resource 'Cluster Virtual Disk (Performance)' of type 'Physical Disk' in clustered role '6ca63b55-1a16-4bb2-ac53-2b23619e258a' failed.

    Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

    Warning 5120 FailoverClustering
    Cluster Shared Volume 'Performance' ('Cluster Virtual Disk (Performance)') has entered a paused state because of 'STATUS_NO_SUCH_DEVICE(c000000e)'. All I/O will temporarily be queued until a path to the volume is reestablished.

    Error 5150 FailoverClustering
    Cluster physical disk resource 'Cluster Virtual Disk (Performance)' failed.  The Cluster Shared Volume was put in failed state with the following error: 'Failed to get the volume number for \\?\GLOBALROOT\Device\Harddisk10\ClusterPartition2\ (error 2)'

    Error 1205 FailoverClustering
    The Cluster service failed to bring clustered role '6ca63b55-1a16-4bb2-ac53-2b23619e258a' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

    Error 1254 FailoverClustering
    Clustered role '6ca63b55-1a16-4bb2-ac53-2b23619e258a' has exceeded its failover threshold.  It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state.  No additional attempts will be made to bring the role online or fail it over to another node in the cluster.  Please check the events associated with the failure.  After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.

    Error 5142 FailoverClustering
    Cluster Shared Volume 'Performance' ('Cluster Virtual Disk (Performance)') is no longer accessible from this cluster node because of error '(1460)'. Please troubleshoot this node's connectivity to the storage device and network connectivity.
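
    For reference, the state of the failed resources and the events above can be pulled with PowerShell like this (a minimal sketch; the event IDs are the ones from the list above):

        # disk resources that are not online, and the CSV states
        Get-ClusterResource | Where-Object { $_.ResourceType -eq 'Physical Disk' -and $_.State -ne 'Online' }
        Get-ClusterSharedVolume | Format-Table Name, State, OwnerNode

        # recent FailoverClustering events matching the IDs listed above
        Get-WinEvent -FilterHashtable @{
            LogName      = 'System'
            ProviderName = 'Microsoft-Windows-FailoverClustering'
            Id           = 1069, 1205, 1254, 5120, 5142, 5150
        } -MaxEvents 50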

    Any hints / input appreciated. Has anyone seen something similar?

    Thanks in advance

    Philippe



    Thursday, March 1, 2018 12:36

All replies

  • Is this a configuration that has been validated for S2D by HP?  I see that there is a RAID controller in your configuration.  Has HP certified that particular RAID controller for use in S2D?  Hardware requirements for S2D are much more stringent than for other Server configurations.  If you are not using configurations in which all hardware components have been certified by the hardware vendor for use in S2D, it would not be uncommon to run into issues.

    tim

    Thursday, March 1, 2018 13:21
  • Make sure you have the latest monthly update installed, and then run the Cluster Validation tool.  You can run it from Failover Cluster Manager or PowerShell with the Test-Cluster cmdlet.  Usually the root cause of this issue is that you are using non-SES compliant hardware, and Validate will check for that.
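
    A minimal example of running the validation from PowerShell, including the Storage Spaces Direct specific tests (node names are placeholders):

        Test-Cluster -Node Node1, Node2, Node3 `
            -Include 'Storage Spaces Direct', 'Inventory', 'Network', 'System Configuration'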

    For the P440ar, ensure the following:

    1. Put in HBA Mode
    2. Upgrade the HPE Smart Array Firmware to version 4.52 (or higher)
    3. Install the HPE October 2016 Service Pack for ProLiant (SPP), or later, to get the proper driver

    Thanks!
    Elden


    Saturday, March 3, 2018 21:05
    Owner
  • Hello Tim

    At the time of purchase, HP had no validated configuration for S2D, but as mentioned in this technical whitepaper it is validated: https://h20195.www2.hpe.com/V2/GetDocument.aspx?docname=4aa6-8953enw - Firmware is up to date.

    Monday, March 5, 2018 20:02
  • Hello Elden

    The P440ar is in HBA mode, firmware is 6.06; in general, firmware updates have been done recently (December 2017).

    Cluster validation goes through fine. I just get some warnings because not all VMs are running and I don't have the recommended reserve capacity. Unsigned drivers: only some USB devices attached to a host.

    Windows Update Level: January 2018 (February not yet done)

    Tuesday, March 6, 2018 07:48
  • We have a three-node and a two-node hyper-converged cluster, and we're facing the same issue.

    Microsoft closed a support case after a month of analysis because the NVMes used were not in the Windows Server Catalog.

    Our configurations are these:

    1st:
    - 3 x Lenovo ThinkSystem SR650
    - 4 x 6 TB SAS HDD per node
    - 2 x 900 GB NVMe per node

    2nd:
    - 2 x Lenovo ThinkSystem SR650
    - 4 x 2 TB SAS HDD per node
    - 2 x 800 GB SSD per node (these are listed in the Windows Server Catalog)

    When I switch off one node, after some time (or sooner in case of high I/O) the CSV - and only the CSV, not the cluster itself - goes offline, crashing all running VMs.

    Alex

    Wednesday, April 4, 2018 15:15
  • Hi Alex,

    I am sorry to hear that happened to you with your support case.  Please re-open your case and provide my name as a contact if they have any questions or concerns, and I will ensure you get the assistance you need.

    Thanks!!
    Elden Christensen
    Principal PM Manager
    Windows Server - High Availability & Storage

    Thursday, April 5, 2018 03:42
    Owner
  • Hello Elden

    With us it was the same. Microsoft Partner Support closed the case twice very quickly without any help. They don't feel responsible and want us to contact Premier Support, which we don't have.

    In the meantime we have updated our ProLiant servers to the February 2018 firmware from HP and also installed the Windows updates from March. Nothing has changed.

    What I can say is that we didn't have the issues with our 3-node cluster the last few times we rebooted the nodes.

    For the two-node clusters - both of them - the behaviour remains, and it is exactly as follows:

    1. Maintenance mode on first node (drain roles) - everything OK
    2. Reboot of first node - everything OK
    3. Wait for disk repairs to complete, check that everything is OK & healthy (checks sketched below) - all seems OK
    4. Maintenance mode on second node (drain roles) - everything OK
    5. Reboot of second node - FAIL: as soon as it is unreachable, the virtual disk(s) go down and therefore the VMs crash
    6. No way to bring the volumes back online until the second node is back online. As soon as it's available we can bring everything online.
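
    The checks from step 3 can be scripted roughly like this before moving on to the next node (a minimal sketch):

        Get-StorageJob                                            # should show no running repair jobs
        Get-VirtualDisk | Format-Table FriendlyName, HealthStatus, OperationalStatus
        Get-PhysicalDisk | Where-Object HealthStatus -ne 'Healthy'
        Get-ClusterSharedVolume | Format-Table Name, State, OwnerNode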

    Philippe


    Thursday, April 5, 2018 06:33
  • Hello Philippe,

    For us it was similar; it looked like this was happening only when node 1 or node 2 was off.

    But that was true only when we had just a few VMs on it. Now it always happens: to test it, we only have to power on all VMs while one node is down, and the CSV goes offline.

    Alex

    Thursday, April 5, 2018 07:00
  • Hi Elden,

    Thanks for your answer. We have opened a fresh case for the two-node cluster; I'll give your name as soon as they call me.

    Thank you again,

    Alex

    Thursday, April 5, 2018 07:27
  • Hello Alex

    You mean it's related to the load on the Virtual Disk?

    I can confirm that. When we do the maintenance with only very few VMs running on the CSV (the others shut down beforehand), we can do the maintenance without any resource failures on the cluster side.

    Thursday, April 5, 2018 07:54
  • Support gave directions to download, install and launch the Windows Hardware Lab Kit; we will see where this leads us...

    Alex

    Thursday, April 5, 2018 18:57
  • Hello Philippe,

    can you tell me how you have connected the two-node cluster? With a DAC cable (Twinax) or with SFP+ transceivers and fibre cable?

    And what about the three-node one?

    Thank you,

    Alex

    Friday, April 6, 2018 12:45
  • Hello Alex

    We are using the following cable: HPE X240 10G SFP+ SFP+ 1.2m DAC Cable 

    For both the 2-node and 3-node clusters.
    For the 3-node cluster we have the following switch: HPE FlexFabric 5700-40XG-2QSFP+ Switch

    So we are using the cable which is recommended by HP in the technical white paper.

    Philippe

    Friday, April 6, 2018 13:14
  • Did any resolution come of this? I am having the same issue: all VMs on node 1, reboot node 2, and node 1 crashes.
    Tuesday, June 5, 2018 20:34
  • Not for us. We have a new technician connected to our cluster these days; let's see what he finds out.

    The case was previously closed because he couldn't find our NVMe disks in the Windows Server Catalog; now that Lenovo has completed the SDDC certification of all the hardware we have implemented, they reopened the case.

    I'll keep you informed,

    Alex

    Wednesday, June 6, 2018 06:51
  • Hello Alex, 

    Hopefully the issue you are facing will be fixed. 

    Anyway, consider building at least a 4-node S2D cluster to safely tolerate more hardware failures.

    Alternatively, you can check StarWind Virtual SAN to build 2- or 3-node clusters. Virtual SAN can utilize any type of local storage to provide a single fault-tolerant pool and build a hyper-converged cluster.

    Check the info here: https://www.starwindsoftware.com/resource-library/starwind-virtual-san-hyperconverged-2-node-scenario-with-hyper-v-cluster-on-windows-server-2016


    Cheers,

    Alex Bykvoskyi

    StarWind Software


    Note: Posts are provided “AS IS” without warranty of any kind, either expressed or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose.

    Monday, June 11, 2018 13:17
  • I'm afraid our problem does not depend on the number of nodes in the S2D cluster but is more likely due to a hardware incompatibility. We are facing the same issue (the CSV goes offline when one node gets shut down) on a two-node cluster, on a three-node cluster and on a four-node cluster.

    Microsoft is working actively on the issue, but we don't have a clue what the cause is; all hardware is SDDC certified.

    Cheers,

    Monday, June 11, 2018 13:24
  • Hello Alessio

    Are you investigating with Microsoft Premier Support?

    Unfortunately, with us Microsoft Partner Support was not willing to help and wanted us to use Premier Support. Microsoft Partner Support is pretty much useless, but we don't have Premier Support, so Microsoft does not help.

    We're still affected by the issue described and have not yet found a resolution.

    We update the servers (firmware and Windows) frequently; the hardware is fully supported as mentioned in the HP whitepaper.

    Unfortunately we cannot play around much as these environments are in production.

    Tuesday, June 12, 2018 09:55
  • Hi, I'm unfortunately also seeing something similar to your problems. We're also on HPs, DL380 with FlexFabric 556FLR-SFP+ adapters. We're running 2012 R2.

    I'm not sure this is so helpful to you, but one thing you could try:

    One way to work around the problem of not being able to bring the disks online until the previous owner node was back up was, in Failover Cluster Manager, to select the disk, right-click and choose "Bring Online", then quickly right-click again and choose "More Actions" > "Turn On Maintenance Mode".

    For some strange reason the second node can now own the disk and it runs just fine. It will, however, not fail back to the first node unless I turn off maintenance mode.
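
    A rough PowerShell equivalent of that Failover Cluster Manager workaround (a sketch; the resource name is the one from earlier in the thread, adjust to your own disk resource):

        Start-ClusterResource   -Name 'Cluster Virtual Disk (Performance)'   # "Bring Online"
        Suspend-ClusterResource -Name 'Cluster Virtual Disk (Performance)'   # "Turn On Maintenance Mode"

        # later, to leave maintenance mode so the disk can fail back:
        Resume-ClusterResource  -Name 'Cluster Virtual Disk (Performance)'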

    I created two test disks and tried changing the owner back and forth, and it's still not working on node 2 unless I put them in maintenance mode.

    This is running in production, so I haven't dared to restart any of the nodes yet; we've already had two big outages due to the disks going offline.

    Thursday, June 21, 2018 09:04
  • Hi Elden, how can we get any support from Microsoft?

    Now we sometimes have another volume down, mainly during the evening but sometimes during work hours. So production environments go down and there is no solution and no support from Microsoft. We use Barracuda Backup, which I think has an influence on I/O timeouts with the CSV volumes.

    So, original issue: volume down when rebooting a node (still persistent) --> always the same volume

    Now an additional issue: volume down without rebooting a node. Everything self-corrects after some time (the volume comes back online, VMs restart) --> mostly the same volume, but a different one than with the reboot issue

    Wednesday, June 27, 2018 12:43
  • Hi Elden,

    if you can still help us, what if I give you my case number (118022517703686) and you contact the technicians?

    I gave your name, but I suppose nobody contacted you.

    From April until today nothing has changed; they continue to collect logs (last time yesterday) but don't find any issue.

    Thank you.

    Regards,

    Alex

    Wednesday, June 27, 2018 12:52
  • Hi,

    I have the same issue with a two-node cluster of HP DL380 Gen9 servers and HP 546SFP+ NICs as the S2D link.

    In our experience the cluster can survive a node reboot, but when I shut down a node and then remove the power or unplug the S2D link, the cluster virtual disk goes offline immediately.
    It looks like it somehow depends on the link status of the network card.
    We also started an investigation with a 3rd-party professional, but with no result so far.

    If you make any progress please keep us posted!

    Thanks, Attila

    Tuesday, July 3, 2018 14:49
  • I'm having the same issue(s) with three 10-node all-SSD clusters I've set up using DL380 Gen9 servers and Chelsio NIC cards.  I'm wondering if anyone else also has the following symptoms, in addition to the CSV disks going offline during a reboot.

    1. Running the 'Test-Cluster' command from PowerShell sometimes reports a storage test error: "The I/O operation has been aborted because of either a thread exit or an application request." I see this error about 30% of the time when running cluster validation.

    2. Monitoring storage jobs with 'Get-StorageJob' seems to randomly show disk repair jobs starting even though all the nodes have been online & stable (a quick way to check for this is sketched below).
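
    One way to check for this (a sketch; the Health Service cmdlets assume a Windows Server 2016 S2D cluster):

        Get-StorageJob | Where-Object JobState -ne 'Completed'    # any repair jobs still running?
        Get-StorageSubSystem Cluster* | Debug-StorageSubSystem    # current storage health faults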


    Friday, July 27, 2018 20:36
  • Hi Brian

    1. Test-Cluster works fine with us all the time, no similar issues there.

    2. Yes, we already observed disk repairs in that scenario but only once.

    Monday, July 30, 2018 05:57
  • Hello

    We opened another support case as we are not getting anywhere, and we also get 5142 events and a volume going down sometimes when more than one host is backed up at the same time.

    Now the Microsoft Support engineer told us this is a known issue with ReFS. But his statement was not in written form and not backed up by any Microsoft KB, so I asked for confirmation.

    Maybe Elden can say something about that.

    In general, what I read is that ReFS is recommended with S2D. Because we have the CSV volumes for VMs we used a 64K cluster size, but from what I read 4K is recommended in most other use cases. (https://blogs.technet.microsoft.com/filecab/2017/01/13/cluster-size-recommendations-for-refs-and-ntfs/)

    So for now we are waiting for Microsoft's answer, and if this is not clarified by Microsoft we will try to figure out the behaviour with the following options (volume creation sketched below):

    - Volumes formatted with ReFS & 4k
    - Volumes formatted with NTFS & 4k
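
    If we go that route, recreating a volume with an explicit allocation unit size would look roughly like this (a sketch; pool name, volume names and sizes are placeholders):

        New-Volume -StoragePoolFriendlyName 'S2D*' -FriendlyName 'Performance' `
            -FileSystem CSVFS_ReFS -AllocationUnitSize 4096 -Size 1TB

        New-Volume -StoragePoolFriendlyName 'S2D*' -FriendlyName 'PerformanceNTFS' `
            -FileSystem CSVFS_NTFS -AllocationUnitSize 4096 -Size 1TB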



    Monday, July 30, 2018 06:12
  • Same problem here, Herr Wirth: DL380 Gen9, 4K ReFS. So there is still no solution for this. Please keep us up to date.
    Wednesday, August 1, 2018 19:24
  • Hello

    Same issue here with a couple of 2-node Dell R730xd clusters, SSD and HDD, on 2016.

    All hardware is in the Windows Server Catalog, and all software and firmware are up to date.


    I have had an open support ticket with Microsoft for more than a year now, and they recently sent me a private patch, but for the moment it does not correct the problem. :-(

    I confirm too that it's related to the load on the virtual disk.

    If I have no I/O on the CSV it stays online much longer than when the CSV is in production.


    They found that:
    1. The DRT is full and the write fails when we try to write to a region which would require a new entry to be added to the DRT.

    Event 506 - Completing a failed Write SCSI SRB request

    Irp Status - -1073740692 (STATUS_FT_DI_SCAN_REQUIRED - One or more copies of data on this device may be out of sync. No writes may be performed until a data integrity scan is completed.)


    2. When writes start failing with these errors, the cluster will take the disk offline and the vdisk will be detached.


    They are currently looking at the data integrity logs to find out why the DRT was full.

    So it's related to the load on the virtual disk: when the DRT is full, the CSV goes offline.


    Wait and see for a future patch... Microsoft is working on it...

    For information: sometimes (5%) when I restart one node the CSV stays offline and sometimes I lose my data. I managed to find a solution with issue 2 of this link to bring the CSV online and rescue my data:

    support.microsoft.com/en-gb/help/4294480/virtual-disks-resources-are-in-no-redundancy-or-detached-status

    Another cool link:

    jtpedersen.com/2018/08/storage-issues-when-rebooting-a-s2d-node-after-may-patches/#more-2256

    Happy to see that I'm not alone ;-)



    • Edited by sxma2, Wednesday, August 22, 2018 09:23
    Wednesday, August 22, 2018 08:47