Storage Spaces Direct / Cluster Virtual Disk goes offline when rebooting a node

    Question

  • Hello

    We have several hyper-converged environments based on HP ProLiant DL360/DL380 servers.
    We have 3-node and 2-node clusters running Windows Server 2016 with current patches; firmware updates are done and a witness is configured.

    The following issue occurs on at least one 3-node and one 2-node cluster:
    When we put one node into maintenance mode (correctly, as described in the Microsoft docs, and after checking that everything is fine) and reboot that node, it can happen that one of the cluster virtual disks goes offline. It is always the "Performance" disk with the SSD-only storage in each environment. The issue occurs only sometimes, not always: sometimes I can reboot the nodes one after the other several times in a row and everything is fine, but sometimes the "Performance" disk goes offline. I cannot bring this disk back online until the rebooted node comes back online. Once the node that was down for maintenance is back online, the virtual disk can be brought online without any issues.
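
    For clarity, the maintenance sequence we follow is roughly this (a PowerShell sketch of the documented drain procedure; the node name is just an example):

    # Drain roles off the node and verify it is paused before rebooting it
    Suspend-ClusterNode -Name "Node1" -Drain -Wait
    Get-ClusterNode -Name "Node1"            # should report State = Paused
    # ... reboot Node1 and wait for it to come back ...
    Resume-ClusterNode -Name "Node1" -Failback Immediate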

    We have created 3 Cluster Virtual Disks & CSV Volumes on these clusters:
    1x Volume with only SSD Storage, called Performance
    1x Volume with Mixed Storage (SSD, HDD), called Mixed
    1x Volume with Capacity Storage (HDD only), called Capacity

    Disk Setup for Storage Spaces Direct (per Host):
    - P440ar Raid Controller
    - 2 x HP 800 GB NVMe (803200-B21)
    - 2 x HP 1.6 TB 6G SATA SSD (804631-B21)
    - 4 x HP 2 TB 12G SAS HDD (765466-B21)
    - No spare Disks
    - Network Adapter for Storage: HP 10 GBit/s 546FLR-SFP+ (2 storage networks for redundancy)
    - 3 Node Cluster Storage Network Switch: HPE FlexFabric 5700 40XG 2QSFP+ (JG896A), 2 Node Cluster directly connected with each other

    The Cluster Events log shows the following errors when the issue occurs:

    Error 1069 FailoverClustering
    Cluster resource 'Cluster Virtual Disk (Performance)' of type 'Physical Disk' in clustered role '6ca63b55-1a16-4bb2-ac53-2b23619e258a' failed.

    Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

    Warning 5120 FailoverClustering
    Cluster Shared Volume 'Performance' ('Cluster Virtual Disk (Performance)') has entered a paused state because of 'STATUS_NO_SUCH_DEVICE(c000000e)'. All I/O will temporarily be queued until a path to the volume is reestablished.

    Error 5150 FailoverClustering
    Cluster physical disk resource 'Cluster Virtual Disk (Performance)' failed.  The Cluster Shared Volume was put in failed state with the following error: 'Failed to get the volume number for \\?\GLOBALROOT\Device\Harddisk10\ClusterPartition2\ (error 2)'

    Error 1205 FailoverClustering
    The Cluster service failed to bring clustered role '6ca63b55-1a16-4bb2-ac53-2b23619e258a' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

    Error 1254 FailoverClustering
    Clustered role '6ca63b55-1a16-4bb2-ac53-2b23619e258a' has exceeded its failover threshold.  It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state.  No additional attempts will be made to bring the role online or fail it over to another node in the cluster.  Please check the events associated with the failure.  After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.

    Error 5142 FailoverClustering
    Cluster Shared Volume 'Performance' ('Cluster Virtual Disk (Performance)') is no longer accessible from this cluster node because of error '(1460)'. Please troubleshoot this node's connectivity to the storage device and network connectivity.
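
    For reference, this is roughly how we check the resource and CSV state when it happens (a sketch; the resource name matches our "Performance" volume):

    # State of the failed physical disk resource and the CSV
    Get-ClusterResource -Name "Cluster Virtual Disk (Performance)" |
        Format-List Name, State, OwnerNode, OwnerGroup
    Get-ClusterSharedVolume -Name "Cluster Virtual Disk (Performance)" |
        Select-Object Name, State, OwnerNode
    # Recent failover clustering events around the failure
    Get-WinEvent -LogName "Microsoft-Windows-FailoverClustering/Operational" -MaxEvents 50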

    Any hints / input appreciated. Has anyone seen something similar?

    Thanks in advance

    Philippe



    1 March 2018 12:36

All replies

  • Is this a configuration that has been validated for S2D by HP? I see that there is a RAID controller in your configuration. Has HP certified that particular RAID controller for use in S2D? Hardware requirements for S2D are much more stringent than for other server configurations. If you are not using a configuration in which all hardware components have been certified by the hardware vendor for use in S2D, it is not uncommon to run into issues.

    tim

    1 March 2018 13:21
  • Make sure you have the latest monthly update installed, and then run the Cluster Validation tool. You can run it from Failover Cluster Manager or from PowerShell with the Test-Cluster cmdlet. Usually the root cause of this issue is non-SES-compliant hardware, and Validate will check for that.
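
    For example, something along these lines (node names are placeholders):

    # Run validation including the Storage Spaces Direct tests
    Test-Cluster -Node "Node1","Node2","Node3" `
        -Include "Storage Spaces Direct","Inventory","Network","System Configuration"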

    For the P440ar, ensure the following:

    1. Put in HBA Mode
    2. Upgrade the HPE Smart Array Firmware to version 4.52 (or higher)
    3. Install the HPE October 2016 Service pack for ProLiant SPP (or later) to get the proper driver
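
    Once the controller is in HBA mode, a quick way to sanity-check how the disks are being presented (just a sketch):

    # Disks should show the expected bus/media type, be healthy, and be poolable (or already pooled)
    Get-PhysicalDisk |
        Select-Object FriendlyName, BusType, MediaType, CanPool, HealthStatus |
        Sort-Object MediaType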

    Thanks!
    Elden


    3 March 2018 21:05
    Owner
  • Hello Tim

    At the time of purchase HP had no validated configuration for S2D, but as mentioned in this technical white paper it is now validated: https://h20195.www2.hpe.com/V2/GetDocument.aspx?docname=4aa6-8953enw. Firmware is up to date.

    5 March 2018 20:02
  • Hello Elden

    The P440ar is in HBA mode and its firmware is 6.06; in general, firmware updates were done recently (December 2017).

    Cluster validation goes through fine. I just get some warnings because not all VMs are running and I don't have the recommended reserve capacity. Unsigned drivers: only some USB devices attached to a host.

    Windows Update Level: January 2018 (February not yet done)

    6 March 2018 07:48
  • We have a three-node and a two-node hyper-converged cluster and we're facing the same issue.

    Microsoft closed a support case after a month of analysis because the NVMe drives used were not in the Windows Server Catalog.

    Our configurations are as follows:

    1st

                3 x Lenovo ThinkSystem SR650

                4 x 6 TB SAS HDD per node

                2 x 900 GB NVMe per node

    2nd

                2 x Lenovo ThinkSystem SR650

                4 x 2 TB SAS HDD per node

                2 x 800 GB SSD per node (these are listed in the Windows Server Catalog)

    When I switch off one node, after some time (or sooner in case of high I/O) the CSV, and only the CSV, not the cluster itself, goes offline, crashing all running VMs.

    Alex

    4 April 2018 15:15
  • Hi Alex,

    I am sorry to hear that happened with your support case. Please re-open your case and provide my name as a contact if they have any questions or concerns, and I will make sure you get the assistance you need.

    Thanks!!
    Elden Christensen
    Principal PM Manager
    Windows Server - High Availability & Storage

    5 April 2018 03:42
    Owner
  • Hello Elden

    It was the same for us. Microsoft Partner Support closed the case twice very quickly without any help. They don't feel responsible and want us to contact Premier Support, which we don't have.

    In the meantime we have updated our ProLiant servers to the February 2018 firmware from HP and also installed the Windows updates from March. Nothing has changed.

    What I can say is that we haven't had the issue on our 3-node cluster the last few times we rebooted it.

    For the two-node clusters (both of them) the behaviour remains, and it is exactly as follows:

    1. Maintenance mode on the first node (drain roles) - everything OK
    2. Reboot of the first node - everything OK
    3. Wait for disk repairs to complete, check that everything is OK & healthy (see the sketch after this list) - all seems OK
    4. Maintenance mode on the second node (drain roles) - everything OK
    5. Reboot of the second node - FAILS: as soon as it is unreachable, the virtual disk(s) go down and the VMs crash
    6. No way to bring the volumes back online until the second node is back online. As soon as it's available we can bring everything online.
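
    The checks in step 3 are roughly the following (a sketch; this is what we look at before touching the second node):

    # All repair/rebalance jobs must be finished
    Get-StorageJob
    # Virtual disks, CSVs and the pool should be healthy and online
    Get-VirtualDisk | Select-Object FriendlyName, HealthStatus, OperationalStatus
    Get-ClusterSharedVolume | Select-Object Name, State, OwnerNode
    Get-StoragePool -IsPrimordial $false | Select-Object FriendlyName, HealthStatus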

    Philippe


    5 April 2018 06:33
  • Hello Philippe,

    For us it was similar; it looked like it was happening only when node 1 or node 2 was off.

    But that was only true when we had just a few VMs on it. Now it happens every time: to reproduce it, we only have to power on all the VMs while one node is down, and the CSV goes offline.

    Alex

    5 April 2018 07:00
  • Hi Elden,

    Thanks for your answer. We have opened a fresh case for the two-node cluster; I'll give your name as soon as they call me.

    Thank you again,

    Alex

    5 April 2018 07:27
  • Hello Alex

    You mean it's related to the load on the Virtual Disk?

    I can confirm that. When we do maintenance with only very few VMs running on the CSV (the others shut down beforehand), we can do it without any resource failures on the cluster side.
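
    To see which VMs still live on the affected CSV before maintenance, we do something like this (a sketch; the path assumes the default mount point for our "Performance" volume):

    # List VMs whose configuration path is on the Performance CSV
    # (run on each node; Get-VM only lists VMs registered locally)
    Get-VM | Where-Object { $_.Path -like "C:\ClusterStorage\Performance*" } |
        Select-Object Name, State, Path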

    5 April 2018 07:54
  • Support gave us directions to download, install and run the Windows Hardware Lab Kit; we will see where this leads us...

    Alex

    5 April 2018 18:57
  • Hello Philippe,

    Can you tell me how you have connected the two-node cluster? With a DAC cable (or Twinax), or with SFP+ transceivers and fibre cable?

    And what about the three-node one?

    Thank you,

    Alex

    6 April 2018 12:45
  • Hello Alex

    We are using the following cable for both the 2-node and 3-node clusters: HPE X240 10G SFP+ SFP+ 1.2m DAC Cable.
    For the 3-node cluster we have the following switch: HPE FlexFabric 5700-40XG-2QSFP+ Switch.

    So we are using the cable recommended by HP in the technical white paper.

    Philippe

    6 April 2018 13:14
  • Did any resolution come of this? I am having the same issue: with all VMs on node 1, rebooting node 2 causes node 1 to crash.
    5 June 2018 20:34
  • Not for us. We have a new technician connected to our cluster these days; let's see what he finds out.

    The case was previously closed because they couldn't find our NVMe disks in the Windows Server Catalog. Lenovo has now completed SDDC certification of all the hardware we have implemented, so they reopened the case.

    I'll keep you informed,

    Alex

    6 June 2018 06:51
  • Hello Alex, 

    Hopefully the issue you are facing will be fixed. 

    Anyway, if possible, consider building at least a 4-node S2D cluster to safely tolerate more hardware failures.

    Alternatively, you can check out StarWind Virtual SAN to build 2- or 3-node clusters. Virtual SAN can use any type of local storage to provide a single fault-tolerant pool and build a hyper-converged cluster.

    Check the info here: https://www.starwindsoftware.com/resource-library/starwind-virtual-san-hyperconverged-2-node-scenario-with-hyper-v-cluster-on-windows-server-2016


    Cheers,

    Alex Bykvoskyi

    StarWind Software


    Note: Posts are provided “AS IS” without warranty of any kind, either expressed or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose.

    11 June 2018 13:17
  • I'm afraid our problem does not depend on the number of nodes in the S2D cluster, but is more likely due to a hardware incompatibility. We are facing the same issue (the CSV goes offline when one node gets shut down) on a two-node cluster, on a three-node cluster and on a four-node cluster.

    Microsoft is working actively on the issue, but we don't have a clue what the cause is; all hardware is SDDC certified.

    Cheers,

    11 June 2018 13:24
  • Hello Alessio

    Are you investigating with Microsoft Premier Support?

    Unfortunately, in our case Microsoft Partner Support was not willing to help and wanted us to use Premier Support instead. Partner Support has been pretty much useless, but we don't have Premier Support, so Microsoft is not helping.

    We're still affected by the issue described and have not yet found a resolution.

    We update the servers (firmware and Windows) frequently, and the hardware is fully supported as stated in the HP white paper.

    Unfortunately we cannot experiment much, as these environments are in production.

    12 June 2018 09:55
  • Hi, unfortunately I'm also seeing something similar to your problems. We're also on HPE DL380s with FlexFabric 556FLR-SFP+ adapters. We're running 2012 R2.

    I'm not sure how helpful this is to you, but one thing you could try:

    One way we found to work around not being able to bring the disks online until the previous owner node was back up: in Failover Cluster Manager, select the disk, right-click and choose "Bring Online", then quickly right-click again and choose "More Actions" > "Turn On Maintenance Mode".

    For some strange reason the second node can now own the disk and it runs just fine. It will, however, not fail back to the first node unless I turn off maintenance mode.
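
    I think the rough PowerShell equivalent of that sequence would be something like this (a sketch; the resource name is just an example):

    # "Bring Online" followed immediately by "Turn On Maintenance Mode"
    Start-ClusterResource -Name "Cluster Virtual Disk (Performance)"
    Suspend-ClusterResource -Name "Cluster Virtual Disk (Performance)"
    # Later, to leave maintenance mode again
    Resume-ClusterResource -Name "Cluster Virtual Disk (Performance)"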

    I created two test disks and tried changing the owner back and forth; it still doesn't work on node 2 unless I put them in maintenance mode.

    This is running in production, so I haven't dared to restart any of the nodes yet; we've already had two big outages due to the disks going offline.

    9 hours and 11 minutes ago