2 node Storage Spaces Direct, vDisk goes offline when second node reboots

  • Question

  • I have a 2-node Storage Spaces Direct cluster with 2 identical Dell R730XD servers. I used all default settings and did not customize any storage settings. I can reboot the first node with no issue (pause the node, wait for the drain, reboot, then resume). The vDisk stays up and the VMs stay running. However, if I reboot the second node, following the same sequence, the vDisk goes offline, which of course kills the VMs. Once the second node comes back online, the vDisk returns to Online status and the VMs resume.

    This is obviously not the behavior I was anticipating. Each server has 12 spindle disks (7,200 rpm), plus an NVMe drive. The OS is installed on two SSD drives at the back. The cluster passes the validation tests.

    Should I have customized the storage instead of letting the system define it for me?

    Thursday, June 22, 2017 2:25 AM

Answers

  • Symptom:

    In a 2-node Storage Spaces Direct (S2D) cluster, the entire cluster or all of the storage becomes unavailable when one node is rebooted or fails.

    Cause:

    This is commonly caused by the cluster losing quorum, the pool data not being synchronized across all nodes, or hardware that is not compatible with Storage Spaces Direct.

    Resolution:

    Witness:

    When deploying a Failover Cluster it is recommended to always configure a witness; this is critical for a 2-node cluster. If a 2-node cluster has no witness, it may lose quorum in the event of unplanned downtime, and the cluster will stop, making storage unavailable. Example cmdlets for configuring a witness are shown after the list below.

    For Storage Spaces Direct deployments there are two supported witness types:

    • File Share Witness
    • Cloud Witness
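
    For example, a witness can be configured with the Set-ClusterQuorum cmdlet; the share path and storage account details below are placeholders, not values from this thread:

    # File share witness (share path is a placeholder)
    Set-ClusterQuorum -FileShareWitness \\fileserver\S2DWitness

    # Cloud witness (Azure storage account name and access key are placeholders)
    Set-ClusterQuorum -CloudWitness -AccountName "mystorageacct" -AccessKey "<storage-account-key>"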

    Resync Still in Progress

    When a node in an S2D cluster is unavailable, the changes made to the overall system are tracked while that node is unavailable and then resynchronized once the node becomes available again.  This happens after planned and unplanned downtime, such as after rebooting to apply an update or after a server failure. Once the server is available again, the data begins to resynchronize, and for a 2-node S2D deployment this must complete before making another node unavailable.  If the only remaining nodes hold a stale copy of the data that has not yet synchronized, the S2D cluster will be stopped to ensure there is no data loss.

    Open PowerShell and run the Get-StorageJob cmdlet to verify that all rebuild jobs have completed before making another node unavailable.  See this document for the process of bringing nodes in a Storage Spaces Direct cluster down for maintenance:

    https://docs.microsoft.com/en-us/windows-server/storage/storage-spaces/maintain-servers  
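
    For example, before pausing the next node, a loop along these lines (a sketch, not from the original answer) can be used to wait until the resync has finished:

    # Keep checking until no storage repair/resync jobs remain
    while (Get-StorageJob | Where-Object JobState -ne 'Completed') {
        Get-StorageJob | Format-Table Name, JobState, PercentComplete
        Start-Sleep -Seconds 30
    }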

    Hardware does not support SCSI Enclosure Services (SES):

    Storage Spaces Direct breaks data up into smaller chunks (slabs) and then distributes copies of the slabs across different fault domains to achieve resiliency.  The default for a 2-node cluster is a node-level fault domain.  Storage Spaces Direct leverages SCSI Enclosure Services (SES) mapping to ensure that the slabs of data and the metadata are spread across the fault domains in a resilient fashion. If the hardware does not support SES, there is no mapping of the enclosures, and the data may not be correctly placed across fault domains in a resilient way.  This can result in all copies of a slab being lost in the event of the loss of a single fault domain.

    Consult the hardware vendor to verify compatibility with Storage Spaces Direct.

    When deploying a Windows Server Failover Cluster, the first action is to run the cluster Validate tool; on a Storage Spaces Direct cluster this includes a special set of tests to verify compatibility. This can be done from Failover Cluster Manager or with the Test-Cluster cmdlet.

    A test has been added to the cluster Validate tool to verify that the hardware is compatible with SES.  Download the latest monthly update, or at a minimum the July update, at the following link: https://support.microsoft.com/en-us/help/4025339/windows-10-update-kb4025339
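
    For example, validation including the Storage Spaces Direct tests can be run as follows (the node names are placeholders):

    # Run the S2D-specific validation tests; the report includes a "List Storage Enclosures" section
    Test-Cluster -Node Node1, Node2 -Include "Storage Spaces Direct", Inventory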

    Additionally, you can view Storage Spaces Direct’s enclosure-to-disk mappings by running the following PowerShell cmdlets (a fuller example follows the list):

    • Get-StorageEnclosure | Get-PhysicalDisk 
    • Get-StorageEnclosure
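
    As a rough check (a sketch, not part of the original answer): on SES-capable hardware the pool's SSDs and HDDs should be mapped to an enclosure, not only the NVMe devices.

    # For each enclosure, list the disks it reports
    foreach ($enclosure in Get-StorageEnclosure) {
        Write-Output "Enclosure: $($enclosure.FriendlyName)"
        $enclosure | Get-PhysicalDisk | Format-Table FriendlyName, MediaType, BusType, HealthStatus
    }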

    The overall list of Storage Spaces Direct hardware requirements can be found at the following link: https://docs.microsoft.com/en-us/windows-server/storage/storage-spaces/storage-spaces-direct-hardware-requirements  

    Saturday, June 24, 2017 2:08 AM
  • Yup, looks like that is your issue.  I am not sure about the compatibility of the LSI 9211-8i.  You might want to verify that you have the latest firmware/driver and that it's in HBA mode.
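
    As a quick sanity check (a sketch, not from the original reply), you can look at how the controller presents the disks; if the controller is in RAID mode the disks typically show a BusType of RAID and cannot be pooled:

    # Pool-eligible disks should show BusType SAS/SATA/NVMe and CanPool True (or already be in the pool)
    Get-PhysicalDisk | Sort-Object MediaType | Format-Table FriendlyName, BusType, MediaType, CanPool, CannotPoolReason
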
    Saturday, June 24, 2017 3:21 AM
  • Also... my understanding is that the Dell R730 / R730xd / R630 servers with more than 8 drive slots can support SES, and in turn can be compatible with S2D.  Additionally, the HBA Dell supports with S2D is the HBA330.

    Dell is the official source on which of their hardware they support with S2D; make sure you talk to them.

    Thanks!
    Elden

    Saturday, June 24, 2017 4:16 AM

All replies

  • Did you make sure every cluster disk's Health Status is Healthy and Operational Status is OK before rebooting the second node?

    Double-check the PowerShell cmdlet you used to create the volume; here is a cmdlet for your reference:

    New-Volume -StoragePoolFriendlyName S2DPool -FriendlyName vDisk01 -FileSystem CSVFS_REFS -Size 2048GB -PhysicalDiskRedundancy 1


    • Edited by Sifusun Thursday, June 22, 2017 2:51 AM
    Thursday, June 22, 2017 2:50 AM
  • Thank you Sifusun for the reply!

    Here is the command that I ran to create the volume:

    New-Volume -StoragePoolFriendlyName TestPool -FriendlyName "Test-Resilient1" -FileSystem CSVFS_ReFS -StorageTierFriendlyNames Performance, Capacity -StorageTierSizes 1200GB, 87994GB

    It seems that I am missing the -PhysicalDiskRedundancy 1 switch. What does this switch do, and what does the 1 signify? Should I delete the volume and recreate it with that switch to get the independent storage on each node?

    Thank you again!!

    Thursday, June 22, 2017 3:02 AM
  • PhysicalDiskRedundancy 1 is a two-way mirror and 2 is a three-way mirror; you have only 2 nodes, so it should be 1.

    You can create a new volume with this cmdlet and then try again.
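
    If you keep the tiered layout instead, one way to check how the default tier templates were defined (a sketch, assuming Windows Server 2016 property names) is:

    # Check whether the tiers ended up as two-way mirror (PhysicalDiskRedundancy = 1), which is what a 2-node cluster needs
    Get-StorageTier | Format-Table FriendlyName, ResiliencySettingName, PhysicalDiskRedundancy, MediaType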


    • Edited by Sifusun Thursday, June 22, 2017 3:20 AM
    Thursday, June 22, 2017 3:19 AM
  • I have a 2-node Storage Spaces Direct cluster with 2 identical Dell R730XD servers. I used all default settings and did not customize any storage settings. I can reboot the first node with no issue (pause the node, wait for the drain, reboot, then resume). The vDisk stays up and the VMs stay running.

    Hi,

    The important settings are the fault domains (here, the 2 hardware nodes) and SES support on your HBA. Additionally, with a mix of NVMe and HDD, the minimum hardware requirement is 2 NVMe and 4 HDD.

    Check your volume configuration again against the tiering guidance:

    https://docs.microsoft.com/en-us/windows-server/storage/storage-spaces/create-volumes

    Example: Using storage tiers

    In deployments with three types of drives, one volume can span the SSD and HDD tiers to reside partially on each. Likewise, in deployments with four or more servers, one volume can mix mirroring and dual parity to reside partially on each.

    To help you create such volumes, Storage Spaces Direct provides default tier templates called Performance and Capacity. They encapsulate definitions for three-way mirroring on the faster capacity drives (if applicable), and dual parity on the slower capacity drives (if applicable).
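
    To see which fault domains the cluster and the pool are actually using, something along these lines should work (a sketch; output will vary by deployment):

    # The cluster's known fault domains (for a 2-node cluster, the two server nodes)
    Get-ClusterFaultDomain

    # The S2D pool and its virtual disks are normally fault-domain aware at the StorageScaleUnit (node) level
    Get-StoragePool -IsPrimordial $false | Format-List FriendlyName, FaultDomainAwarenessDefault
    Get-VirtualDisk | Format-Table FriendlyName, FaultDomainAwareness, ResiliencySettingName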

    bye,
    Marcel


    https://www.windowspro.de/marcel-kueppers

    I write here only in private interest

    Disclaimer: This posting is provided AS IS with no warranties or guarantees, and confers no rights.

    Thursday, June 22, 2017 6:38 AM
  • Thank you!

    Does the SES support get checked during cluster validation tests? It seems I recall reading that somewhere, but now can't find the source. How do I manually validate SES support?

    As for the fault domain, if I set PhysicalDiskRedundancy to 1 when I create the volume, will that correctly configure the fault domain to be at the server level?

    Thursday, June 22, 2017 4:01 PM
  • So I removed the virtual disk, but prior to creating a new disk, I rebooted both nodes to see what the behavior was with the Storage Pool. Just like the vDisk, the storage pool goes offline when the second node reboots. But the SP stays online when the first node reboots. Is there any specific configuration for this storage pool? Is this typical?

    Thank you all!

    RTS

    Friday, June 23, 2017 12:02 AM
  • Thank you!

    Does the SES support get checked during cluster validation tests? It seems I recall reading that somewhere, but now can't find the source. How do I manually validate SES support?

    In the validation report, under the Storage Spaces Direct section, look at List Storage Enclosures. The enclosures must be recognized there.

    https://www.windowspro.de/marcel-kueppers

    I write here only in private interest

    Disclaimer: This posting is provided AS IS with no warranties or guarantees, and confers no rights.

    Friday, June 23, 2017 6:03 AM
  • On the validation report, the Dell R720XD shows up and is identified under Storage Enclosures. It includes the model number, serial number, etc. I assume this means that the Dell R720XD chassis is SES capable.

    However, on another 3-node S2D deployment (with SuperMicro, not Dell), under Storage Enclosures on the validation report, it lists BOTH the chassis (SuperMicro) as well as the HBA (LSI). The SuperMicro chassis has the NVMes listed under it, and the LSI "enclosure" has the SSDs/HDDs listed.

    The reason I mention this is that on my Dell R720XDs, ONLY the R720XD chassis shows under the Storage Enclosure section of the report, with the NVMes. But the LSI HBA does not show up, nor do the HDDs show up in that section.

    So my question is: should the Enclosure Awareness extend to the HBA, and should that be separately listed in the 'List Storage Enclosure' section of the report, like it is on my SuperMicro S2D clusters?


    Friday, June 23, 2017 11:56 AM
  • Thank you Mr. Christensen!!

    On my two node cluster, when I run the command 

    Get-StorageEnclosure | Get-PhysicalDisk

    then all I see are my four NVMes! No SSDs and no HDDs listed!

    When I run the command

    Get-StorageEnclosure | fl *

    it lists lots of details on both the Dell 720XD servers. 

    So that leads me to this last question: is it possible for the chassis to be SES compatible, but not the HBA? In these two servers, I'm running an older HBA, an LSI 9211-8i. I'm wondering if that is the issue. Clearly the SSDs/HDDs are not mapped to an enclosure. Thank you!

    RTS

     

    Saturday, June 24, 2017 2:43 AM