Two Node S2D cluster - Disk Volume fails when one host goes down

  • Question

  • Hello,

    I have a two-node S2D cluster; each server has 4 HDDs and 2 SSDs. The servers are connected using 2 NICs each (back-to-back connection) and an additional 2 NICs for the external network. Quorum is configured as a file share witness.

    The cluster works fine when both nodes are up and running, but if either one of the nodes goes down, all the S2D volumes go down as well. This obviously should not happen.

    I see the following errors in the event log:

    event ID 1069

    Cluster resource 'Cluster Pool 1' of type 'Storage Pool' in clustered role '790e54e8-fe11-4198-b2f7-833cad5bcb8d' failed.

    Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

    event ID 1792

    Cluster physical disk resource failed periodic health check.

    Physical Disk resource name: Cluster Virtual Disk (Volume)
    Device Number: 8
    Device Guid: {41b93dc5-7fa3-4c26-b6e2-e76c9e3e6509}
    Error Code: 0
    Additional reason: ClusDiskReportedFailure

    If the reason is ReattachTimeout, it means attaching a new RHS process to the disk resource took too long.
    If the reason is ClusDiskReportedFailure, it means the underlying disk device was removed from the system.
    If the reason is QuorumResourceFailure, it means this is a Spaces quorum resource.
    If the reason is VolumeNotHealthy, it means one of the volumes is not healthy and may need repair.

    event ID 1038

    Ownership of cluster disk 'Cluster Virtual Disk (Volume)' has been unexpectedly lost by this node. Run the Validate a Configuration wizard to check your storage configuration.

    Cluster validation does not show any major errors.

    Has anyone seen this issue before? Please let me know.


    Sunday, January 15, 2017 5:09 PM

Answers

  • Symptom: In a 2-node Storage Spaces Direct (S2D) cluster, the entire cluster or all storage becomes unavailable when one node is rebooted or fails.

    Cause:

    This is commonly caused by the cluster losing quorum, by pool data that is not synchronized across all nodes, or by hardware that is not compatible with Storage Spaces Direct.

    Resolution:

    Witness:

    When deploying a Failover Cluster it is recommended to always configure a witness; this is critical for a 2-node cluster.  If a 2-node cluster has no witness, the cluster may lose quorum in the event of a failure, and the cluster will stop, making storage unavailable. (A minimal configuration sketch follows the list below.)

    For Storage Spaces Direct deployments there are two supported witness types:

    • File Share Witness
    • Cloud Witness
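
    As a rough illustration, either witness type can be configured with the Set-ClusterQuorum cmdlet. This is only a sketch; the share path, storage account name, and access key are placeholders.

    # File share witness hosted on a third server (placeholder path)
    Set-ClusterQuorum -FileShareWitness '\\fileserver\witness$'

    # Or a cloud witness (placeholder storage account name and key)
    Set-ClusterQuorum -CloudWitness -AccountName 'mystorageaccount' -AccessKey '<storage-account-key>'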

    Resync Still in Progress

    When a node in an S2D cluster is unavailable, the changes made to the overall system are tracked while that node is unavailable and then resynchronized once the node becomes available again.  This happens after planned and unplanned downtime, such as after rebooting to apply an update or after a server failure. Once the server is available, data will begin to resynchronize, and for a 2-node S2D deployment this must complete before making another node unavailable.  If the only remaining nodes hold a stale copy of the data that has not yet synchronized, the S2D cluster will be stopped to ensure there is no data loss.

    Open PowerShell and run the Get-StorageJob cmdlet to verify that all rebuild jobs have completed before making another node unavailable.  See this document for the process of bringing nodes in a Storage Spaces Direct cluster down for maintenance:

    https://docs.microsoft.com/en-us/windows-server/storage/storage-spaces/maintain-servers
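
    For example, a quick check before taking the second node down might look like this (a minimal sketch; any job still listed as Running means the resync has not finished):

    # List all storage jobs and their progress; wait until none remain Running
    Get-StorageJob | Select-Object Name, JobState, PercentComplete, BytesTotal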

    Hardware does not support SCSI Enclosure Services (SES):

    Storage Spaces Direct takes data, breaks it up into smaller chunks (slabs), and then distributes copies of those slabs across different fault domains to achieve resiliency.  For a 2-node cluster, the default is a node-level fault domain.  Storage Spaces Direct leverages SCSI Enclosure Services (SES) mapping to ensure slabs of data and the metadata are spread across the fault domains in a resilient fashion. If the hardware does not support SES, there is no mapping of the enclosures, and the data may not be correctly placed across fault domains in a resilient way.  This can result in all copies of a slab being lost in the event of a loss of a single fault domain.

    Consult the hardware vendor to verify compatibility with Storage Spaces Direct.

    When deploying a Windows Server Failover Cluster the first action is to run the cluster Validate tool; on a Storage Spaces Direct cluster this includes a special set of tests to verify compatibility. This can be done from Failover Cluster Manager or with the Test-Cluster cmdlet.
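
    For reference, a validation run that includes the S2D tests could look like this (a sketch; the node names are placeholders):

    # Run cluster validation, including the Storage Spaces Direct test category
    Test-Cluster -Node 'Node01','Node02' -Include 'Storage Spaces Direct','Inventory','Network','System Configuration'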

    A test has been added to the cluster Validate tool to verify if the hardware is compatible with SES.  Download the latest monthly update or the following update:
    https://support.microsoft.com/en-us/help/4025339/windows-10-update-kb4025339

    Additionally, you can view Storage Spaces Direct's enclosure-to-disk mappings by running the following PowerShell cmdlets (a short sketch of what to look for follows the list):

    • Get-StorageEnclosure | Get-PhysicalDisk
    • Get-StorageEnclosure
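
    As a rough sketch of what to look for, every cache and capacity drive should appear under an enclosure in this output; drives that are missing suggest an incomplete SES mapping:

    # List each enclosure followed by the physical disks it maps to
    Get-StorageEnclosure | ForEach-Object {
        $_.FriendlyName
        $_ | Get-PhysicalDisk | Select-Object FriendlyName, SerialNumber, MediaType
    }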

    The overall list of Storage Spaces Direct hardware requirements can be found at the following link:
    https://docs.microsoft.com/en-us/windows-server/storage/storage-spaces/storage-spaces-direct-hardware-requirements

    Sunday, March 26, 2017 3:28 PM
    Owner

All replies

  • When a node in a 2-node S2D cluster goes down, you need to wait for it to come up and resync again.  So it's easy when doing testing in a lab to invoke a failure, then quickly invoke another one, then another, then reboot, then a...

    So run the Get-StorageJob cmdlet and make sure that all rebuilds / resyncs have completed before invoking the next failure.
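
    For example, something like this could be used to wait for that condition (a rough sketch; the polling interval is arbitrary):

    # Poll until no storage jobs remain in a Running state
    while (Get-StorageJob | Where-Object JobState -eq 'Running') {
        Start-Sleep -Seconds 60
    }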

    Thanks!
    Elden

    Monday, January 16, 2017 3:32 AM
    Owner
  • This is not the issue. The storage pool fails even when there are no storage jobs running.

    This storage pool has no data on it, and it still goes down.


    Monday, January 16, 2017 8:10 AM
  • I did some more testing and noticed that the pool fails only if node #1 goes down.

    In case node #1 is up and #2 goes down, the pool keeps working.

    Does this help somehow?

    Monday, January 16, 2017 10:16 AM
  • I suspect I've got the same issue in my lab, but for me it's not just volumes; it's the whole pool that goes offline.  I have one server I can shut down, but I can't shut down the other.  I've been working on it for several days now.  I rebuilt the whole two-node cluster today, but now the other server can't be shut down without bringing the pool offline... which leads me to believe that my issue may be permission related or something.  Any recommendations for further troubleshooting?
    Monday, January 16, 2017 11:22 AM
  • I just want to confirm...  you have a witness configured? (either Cloud Witness or File Share Witness).  That's really important for a 2-node cluster.

    You stated that you have a File Share Witness configured...  is the resource Online?  When you shutdown the one node, does the cluster maintain quorum and the cluster as a whole stay up?
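
    A quick way to check this from PowerShell might be the following (a sketch; it assumes the witness resource type name contains "Witness"):

    # Show the quorum configuration and the state of the witness resource
    Get-ClusterQuorum
    Get-ClusterResource | Where-Object ResourceType -like '*Witness*' | Format-Table Name, State, OwnerNode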

    You might also want to try running the Optimize-StoragePool cmdlet, to ensure the 2-way mirror extents are properly placed on both nodes.  You can check progress with the Get-StorageJob cmdlet.
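
    For reference, something along these lines (a sketch, assuming the default pool name starts with "S2D"):

    # Rebalance the pool, then watch the resulting optimization job
    Get-StoragePool -FriendlyName S2D* | Optimize-StoragePool
    Get-StorageJob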

    Thanks!
    Elden

    Tuesday, January 17, 2017 6:30 PM
    Owner
  • I've got a file share witness running on a third server. The witness resource is online when the storage pool fails.

    This happens only when node #1 fails; when node #2 goes down, everything keeps working. Optimize-StoragePool did not solve the issue.

    The errors I get in the event logs are in my first message.

    Wednesday, January 18, 2017 1:05 PM
  • I've tried to re-install the entire environment and I still get the same issue. When server #1 fails, the entire storage pool fails. If server #2 fails, it works fine.

    The errors in the event log are as follows:

    Majority of the pool disks hosting space meta-data for virtual disk {UUID} failed a space meta-data update, which caused the storage pool to go in failed state. Return Code: The pack does not have a quorum of healthy disks.

    Virtual disk {UUID} has failed a write operation to all its copies.             

    Majority of the pool disks hosting space meta-data for virtual disk {UUID} failed a space meta-data update, which caused the storage pool to go in failed state. Return Code: A majority of disks failed to be updated with the new configuration.

    It seems like the metadata is not written to the disks in node #2, or it's not written correctly. I tried to run the Optimize-StoragePool cmdlet but it did not help. Can you help me figure out what's wrong with my deployment?

    Thursday, January 19, 2017 7:53 AM
  • Hi,

    I'm afraid I could not find much related information about the issue.

    I suggest you open a case with Microsoft; a more in-depth investigation can be done there, so that you get a more satisfying explanation and solution to this issue.
    Here is the link:
    https://support.microsoft.com/en-us/gp/contactus81?Audience=Commercial&wa=wsignin1.0

    Best Regards,
    Leo


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    Monday, January 23, 2017 1:38 AM
  • FYI, I am having this same issue and have had a case open with MS since mid-December. I am now working with a Tier 2 Support Engineer (mostly running commands and sending logs), but my confidence in a solution is waning.

    What hardware are you using?

    Tuesday, February 14, 2017 2:15 PM
  • hi,

    I still have this issue. I've also had a case open with Microsoft for 3 weeks, and I still haven't got any lead on what the problem could be.

    I use Dell R730 servers with the HBA330 controller, but even when I installed this same architecture in a nested environment I still had the same issue. I guess this is not related to the hardware.

    Please let me know if you find any solution to this issue.

    Wednesday, February 15, 2017 9:29 AM
  • I'm getting the same with two Dell T630s and an HBA330 controller. MS have taken away a bunch of logs and also run a cluster troubleshooter on the 2-node cluster to get data for analysis, but if other people are saying that they've been dealing with this since before Christmas and they're still not up and running...

    To me it feels like the CSV volumes get marooned on the node they were created on. So if you create two volumes on node 1 and then node 2 goes down, the CSV folders are still accessible by node 1 (no matter the ownership of the volume). But if you take node 1 down, then node 2 loses visibility of both its volumes. Or something crazy like that.

    I've rebuilt this 2-node S2D cluster three times now and each time it's behaved exactly the same way. I'm considering doing it a fourth time this weekend, but using Server 2012 R2 or 2016 as the physical DC to see if that makes a difference.

    Thursday, February 16, 2017 9:09 PM
  • Did you find any solutions? Did it help when you used different DC servers?

    I tried to install the same architecture in a nested virtual environment and managed to make it work.

    This makes me think the issue might be related to the hardware.

    Monday, February 20, 2017 8:48 AM
  • No solutions from me yet. MS were meant to get back to me on Friday and didn't. Spent a bit of time chasing them today but nothing. I didn't rebuild over the weekend because I want the setup to not change whilst MS investigate. Hopefully, tomorrow, otherwise I'm going to be building something with hyper-v replication and a DAG for exchange.
    Monday, February 20, 2017 6:28 PM
  • I see a couple people saying they have a case open, but aren't getting traction on this issue.  If so, shoot me an email     EldenC@Microsoft

    Thanks!
    Elden

    Tuesday, February 21, 2017 3:22 AM
    Owner
  • OK, so today another 2-3 hours with an MS support engineer. It feels like we're getting closer, with some evidence pointing to the Dell-branded Mellanox ConnectX-3 EN 10 GbE SFP cards, although there are also other possibilities at this time.

    The Dell Mellanox cards in my servers are using v5.25.12665.0 drivers and v2.36.5080 firmware, but the Mellanox site at http://www.mellanox.com/page/winof_matrix?mtag=windows_sw_drivers explicitly mentions that for the 5.25 drivers the firmware must be v2.36.5150 or higher.

    Unfortunately Dell don't have any newer firmware for this card for me to test, so at this point it is just a possible suspicion that this is related to the CSV problem. They say that the last number in the version is just a minor revision number and won't have any effect on S2D. I have emailed Mellanox to ask them if they have anything to suggest, but since the cards are Dell branded Mellanox I'm not sure how much they will be able to help.

    Are other people who see this problem also using Mellanox cards that aren't up to the recommended driver / firmware revision?

    MS have also said they have managed to reproduce the problem and that it may be due to something else, so I'll wait and see the analysis of the latest logs that were taken away, and then hopefully a conclusion to this giant pain in the, er, neck.

    Wednesday, February 22, 2017 8:40 PM
  • Having the exact same issue here. I have a 2 node Hyper-converged cluster using S2D and when node 1 is restarted the CSV fails with the same "Network Path not found". The actual storage pool remains online as does the failover cluster itself. When node 2 is restarted things work fine. Using a File Share Witness on a third server. The two nodes are SuperMicro SS 6027R-3RF4+ servers with 2 system drives, 2 SSDs and 5 HDDs each. They do not have Mellanox cards in them. I am about to add a third node to this cluster and will see if that helps the situation at all.
    Thursday, February 23, 2017 2:32 AM
  • I am using Mellanox Branded cards: Driver v5.25.12665.0 drivers and v2.36.5080 firmware. So I don't think that is it either. I have also tried with a file share and a cloud witness. It changes nothing.
    Friday, February 24, 2017 9:00 PM
  • You could try cumulative update https://support.microsoft.com/en-us/help/4010672 to see if it solves this issue.
    Monday, February 27, 2017 3:10 PM
  • I am using Mellanox Branded cards: Driver v5.25.12665.0 drivers and v2.36.5080 firmware. So I don't think that is it either. I have also tried with a file share and a cloud witness. It changes nothing.

    Sorry but I don't understand your point.

    I have heard back from Mellanox, and they explicitly state that whilst their card's firmware v2.36.5080 is supported on Windows Server 2016, for S2D they recommend firmware v2.36.5150. Unfortunately, (1) the Dell rebadged Mellanox cards I have only have firmware v2.36.5080 on them (and there are no firmware updates available), and (2) my hardware has been deployed to a customer site, which means I can't test whether newer Mellanox firmware would fix my problem.

    Their exact words were:

    1. After consulting with our engineers for further clarification regarding the required firmware level please note that Version 2.36.5080 is good for Server 2016 but for Storages Spaces Direct specifically we recommend Driver 5.25 and Firmware 2.36.5150 and above.

    I have to say that, given these cards are rebadged for Dell, Mellanox support has been fantastic in answering my questions, and promptly; the firmware on these Dell Mellanox 10 GbE cards isn't their responsibility, but they've talked with me and advised me what I can do (not much) and what I can't do (flash their firmware onto the Dell branded Mellanox cards). Whereas Dell's response has been a shrug of the shoulders, an 'it should work', and a 'we will look into it'.

    However, I do not know for sure if the new firmware would have fixed the problem; there may be a Microsoft issue that they are not aware of, or have not told me about, that was causing this issue. But for me I think it is an interesting data point.

    rdlenk_wsu: Did your third node help at all?
    Friday, March 3, 2017 6:50 PM
  • Some follow-up...  two of the cases in this thread were due to using the Dell R620, which does not support SCSI Enclosure Services (SES) and is not compatible hardware for Storage Spaces Direct (S2D).

    Dell has many solutions which support S2D (but not the R620); please contact Dell about what hardware they support with S2D.

    More information on S2D hardware requirements can be found here:
    https://technet.microsoft.com/windows-server-docs/storage/storage-spaces/storage-spaces-direct-hardware-requirements

    Thanks!
    Elden


    Sunday, March 5, 2017 4:30 PM
    Owner
  • As Elden has stated, my problem seems to be that the Dell R620 does not support SCSI Enclosure Services (SES). I also got a call from the support engineer stating that a bug exists in the validation routine that should have caught the lack of SES support.

    Monday, March 6, 2017 7:29 PM
  • Interesting. I am going to assume that my two Dell T630's also don't support SES. Now to find my Dell account manager and insert two servers into him..

    Tuesday, March 7, 2017 9:55 PM
  • I'm getting the same issue with two HP DL380 G7 servers. I've tried everything for 3 months, no luck.


    Friday, March 24, 2017 2:17 PM
  • If you are seeing a 2-node S2D solution go down when you reboot a node, we are seeing two reasons that are likely to be your issue:

    Witness:

    • When deploying a cluster, we recommend you always deploy a Witness, and this is especially critical for a 2-node cluster.
    • For S2D you can leverage a File Share Witness or a Cloud Witness. So the first thing to verify is that you have a Witness resource. If you look in the System event log, you will see errors associated with the cluster losing quorum if that's your issue.

    Hardware does not support SCSI Enclosure Services (SES):

    When deploying a Windows Server Failover Cluster, the first thing you always want to do is run the Validate tool (including S2D). This can be done from Failover Cluster Manager or with the Test-Cluster cmdlet.

    In the Validate report you will see a "List Storage Enclosures" test. Do you see anything listed there? If not, then it's likely that your hardware does not support SES and is not compatible with S2D.  We are working to add a test to Validate to catch this condition and raise an error to make it clear.

    You should also run the following cmdlets to verify; they will list the enclosures and the disk mappings:

    • Get-StorageEnclosure | Get-PhysicalDisk
    • Get-StorageEnclosure | fl *

    S2D uses SCSI Enclosure Services (SES) mapping to ensure slabs of data and the metadata are spread across the fault domains in a resilient fashion. If the hardware does not support SES, there is no mapping of the enclosures and the data placement is not resilient... which can be why the volumes are going offline when a node is rebooted.
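
    One way to spot-check this (a sketch; it assumes the EnclosureNumber and SlotNumber properties exposed by Get-PhysicalDisk on Windows Server 2016) is to look at each disk's enclosure mapping, where a blank EnclosureNumber on a capacity drive would point to missing SES support:

    # Show which enclosure and slot each physical disk maps to
    Get-PhysicalDisk | Sort-Object MediaType |
        Select-Object FriendlyName, MediaType, EnclosureNumber, SlotNumber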

    All the OEMs have solutions which support S2D. Make sure you talk to your hardware vendor of choice to ensure compatibility.
    Also, hot off the press, we just published a doc which outlines the process for safely rebooting a S2D node: https://technet.microsoft.com/windows-server-docs/storage/storage-spaces/maintain-servers?f=255&MSPPError=-2147217396

    Thanks!!
    Elden


    Hi Elden

    In the Cluster Validation Report, under the "List Storage Enclosures" test, I see on both nodes:

    - 1 Enclosure "HP ProLiant DL380 G7" 

    - 2 NVMe SSD Disks (Cache)

    -  None of my 4 SAS HDD (Capacity)

    In the same report, under "List Storage Pools", the 4 SAS HDD drives are reported, and the whole report has no errors either.

    Is that the problem with SES that you mean? 

    Thank you

    Mike

    Wednesday, March 29, 2017 11:30 AM
  • Exact same problem here. Get-StorageEnclosure shows my 2 NVMe SSDs but none of my 4 SAS HDDs (connected to an Avago/Broadcom 9300-8i).

    When one node goes down, the volume is offline. Any solution? Can SES be added with a firmware update on the Avago?

    Monday, April 24, 2017 10:07 AM
  • Exact same problem here. Mike, did you get any further?
    Monday, April 24, 2017 12:57 PM
  • Hi,

    I've got the same issue here with a 2-node S2D cluster (2 SSDs, 4 HDDs), but Elden's suggestions do not work in my case.

    Have you finally solved the problem? What solution did Microsoft support provide?

    Thanks, John

    Monday, July 16, 2018 1:34 PM
  • Elden-

    How can we confirm that we have SES if we've already created the S2D cluster prior to KB4025339? I ran the cluster validation tool after KB4025339 but I didn't see anything referencing SES. Which test is it under?

    If I run the following PowerShell, I see all the disks separated out by enclosure:

    Get-StorageEnclosure | %{$_.UniqueId; $_.FriendlyName;$_ | Get-PhysicalDisk }

    Does that mean SES should be working and the slabs of data and metadata will be spread across the fault domains (nodes in this case) in a resilient fashion?

    -Scott

    Tuesday, July 31, 2018 2:22 PM
  • Ok, I *might* have found a solution to this issue; it works great for me.  I was having the same problem: two Supermicro servers in a cluster, and pausing/rebooting one would lead to the disk going offline and breaking everything, while pausing/rebooting the other node worked every time.

    So I paused/patched a node and watched the RDMA traffic through perfmon.  Sure enough, it was shuttling a decent amount of data to the machine that is supposed to be paused.  So this is what I did: I loaded Failover Cluster Manager, went to Enclosures, and clicked both the disk and server tabs on one of the enclosures, with the goal of correlating one of the disk serial numbers to the node name (my two enclosures are named exactly the same, making it impossible to know which enclosure belongs to which node by name alone).  Using that info I ran (thanks Scott):

    Get-StorageEnclosure | %{$_.UniqueId; $_.FriendlyName;$_ | Get-PhysicalDisk }

    This command gives you the UniqueID number for each enclosure along with the serial numbers for all drives broken out by enclosure.  Using this info I figured out which UniqueID applies to which node and documented it for future use.  Once you know which ID goes to which node you can run this command to put that enclosure into maintenance mode (use your UniqueID in place of mine):

    Get-StorageEnclosure -UniqueId "5DCB4C41F7806500" | Enable-StorageMaintenanceMode

    All of a sudden the RDMA traffic stopped, and I was able to reboot with nothing breaking!  Unpausing the rebooted node caused the disk to regenerate as expected, and all is now right in the world.

    Hope this helps!
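
    As a follow-up note, once the rebooted node is back online and Get-StorageJob shows no running jobs, the enclosure presumably needs to be taken back out of maintenance mode with the matching cmdlet (the UniqueId is the same placeholder value as above):

    # Re-enable the enclosure once the node is back and resync has finished
    Get-StorageEnclosure -UniqueId "5DCB4C41F7806500" | Disable-StorageMaintenanceMode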





    Friday, August 3, 2018 9:32 PM
  • Does it work if you enable maintenance mode by node? http://kreelbits.blogspot.com/2018/07/the-proper-way-to-take-storage-spaces.html Then you don't have to remember enclosure IDs or find them each time...

    You might have more than one enclosure per node too... http://kreelbits.blogspot.com/2018/07/hierarchical-view-of-s2d-storage-fault.html
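
    For reference, a minimal sketch of that node-level approach (assuming one storage scale unit per node; 'Node01' is a placeholder, and the linked post has the full procedure):

    # Drain roles off the node, then put its storage fault domain into maintenance mode
    Suspend-ClusterNode -Name 'Node01' -Drain
    Get-StorageFaultDomain -Type StorageScaleUnit |
        Where-Object FriendlyName -eq 'Node01' |
        Enable-StorageMaintenanceMode

    # After the reboot, reverse the steps and let the resync finish
    Get-StorageFaultDomain -Type StorageScaleUnit |
        Where-Object FriendlyName -eq 'Node01' |
        Disable-StorageMaintenanceMode
    Resume-ClusterNode -Name 'Node01' -Failback Immediate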
    Friday, August 3, 2018 10:06 PM
  • That's a very good question.  I tried a couple iterations of putting a node directly into maintenance mode but all of them failed.  I'll try the link you sent next time as that would definitely be a better method!  The method I used may only work for simple clusters with a single enclosure per node.  
    Friday, August 3, 2018 10:53 PM
  • Ok, you talked me into testing it out.  You are right, that method is just much simpler.  Ignore everything I said and follow the directions in scottkreel's post :).
    Friday, August 3, 2018 11:02 PM
  • I get why this works and we will be trying it out in our environments, but the whole point of a 2 node cluster is to be able to survive if a whole node goes offline. 

    We should be able to pull the power on 1 host and the cluster should stay up without having to put anything in maintenance mode. 

    Friday, August 10, 2018 2:38 PM
  • I have the same issue, and it mostly happens with generic file shares (drive-letter shares).

    CSVs rarely have this issue in a 2-node cluster.

    What I saw, which intrigues me, is that if you share a folder on a disk, let's say Q:\MyFileShare, there is also the administrative share Q$ for Q:\, and the strange thing is that the latter cannot be set to Continuous Availability while the share can be.

    Also, on this share "Allow caching of share" is enabled, which to me is absolutely bogus, because why would you enable caching on an admin share while you do not enable it on the shares active on that volume? It looks like a developer at MS just made a mistake and enabled the wrong checkbox, when it should enable Continuous Availability for the whole disk if you choose it for a share.

    Of course that can cause issues.

    Of course we could also ask why there is still a generic file share when we have S2D and use it to create a Scale-Out File Server with CSVs. Does someone still do NetBIOS?

    Adding a file share to a SQL Server FCI automatically chooses a Generic File Server, while all the disks used are on CSVs. Hello SQL Server? Are you aware of what is going on around you?

    Definitely something the SQL Server team should work on.

    Conclusion: no solution, and Microsoft Support has no idea either.

    Tuesday, October 16, 2018 1:21 PM
  • Our Windows 2016 S2D Cluster appears stable and we are able to reboot nodes without any CSVs failing following the install of https://support.microsoft.com/en-gb/help/4480977

    From the changelog:

    January 17, 2019—KB4480977 (OS Build 14393.2759)

    Addresses boot failure issues that occur when you restart certain hyperconverged infrastructure (HCI) virtual machines.
    Addresses issues with taking snapshots on hyperconverged Storage Spaces Direct (S2D) cluster nodes.
    Addresses an issue that prevents volumes from going online as expected when you add back drain nodes during maintenance.
    Addresses an issue that fails to decrement the dirty region tracking reference count when a storage repair job is running on hyperconverged Storage Spaces Direct (S2D) cluster nodes

    Saturday, January 19, 2019 10:29 AM
  • I have a 2-node Windows cluster and a SQL Server cluster in VMs. One night a mirror disk on physical node 1 failed and failed over to node 2, and suddenly the server automatically restarted.

    Is it possible that when one node's mirror drive fails, the other node can restart?

    Please help me figure out the cause of the restart.

    Thanks, and looking forward to a prompt response.


    Friday, June 14, 2019 2:37 PM