Failed node in S2D cluster leaves Enclosure behind in StorageSubsystem

  • Question

  • Hi.

    We have (had) a 4 node S2D cluster running Hyperconverged.

    One of the nodes had a critical failure, which resulted in a complete system rebuild of that node (don't ask me why). We knew we had to do some manual cleanup in the cluster. We know the drill (I think).

    - Evict the failed node from the cluster
    - After the storage jobs finish, remove the missing disks from the S2D pool. Pool size shrinks.
    - Reinstall the failed node, but do not join it to the cluster yet. On this node, first remove the S2D pool and virtual disk and reset the S2D drives so they can be pooled again (roughly as sketched below).
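
    A rough PowerShell sketch of those three steps (node, pool and disk names are illustrative, not taken from the actual cluster):

    # 1. Evict the failed node (run on a surviving node).
    Remove-ClusterNode -Name "Node02" -Force

    # 2. Check that the repair jobs are done, then remove the now-missing disks from the pool.
    Get-StorageJob
    $missing = Get-PhysicalDisk | Where-Object OperationalStatus -eq 'Lost Communication'
    Remove-PhysicalDisk -PhysicalDisks $missing -StoragePoolFriendlyName "S2D Pool"

    # 3. On the rebuilt node, before it rejoins the cluster: remove the leftover pool and
    #    virtual disk metadata, then clean the data disks so they can be pooled again.
    Get-StoragePool | Where-Object IsPrimordial -eq $false | Set-StoragePool -IsReadOnly:$false
    Get-StoragePool | Where-Object IsPrimordial -eq $false | Get-VirtualDisk | Remove-VirtualDisk -Confirm:$false
    Get-StoragePool | Where-Object IsPrimordial -eq $false | Remove-StoragePool -Confirm:$false
    Get-Disk | Where-Object { -not $_.IsBoot -and -not $_.IsSystem -and $_.PartitionStyle -ne 'RAW' } |
        Clear-Disk -RemoveData -RemoveOEM -Confirm:$false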

    When we try to re-add the node, it fails. The cluster report shows only one warning that interests me: on the 3 remaining nodes, something from the old failed node is still left behind in the storage layer.

    ********************************************************************************************

    Get-StorageSubSystem -FriendlyName *clu* | Debug-StorageSubSystem

    Severity: Critical

    Reason         : Communication has been lost to the x-xxx-xxxx-02, Enclosure #: ઈ, DP, BP13G+EXP.
    Recommendation : Start or replace the storage enclosure.
    Location       : x-xxx-xxxx-02, Enclosure #: ઈ, DP, BP13G+EXP
    Description    : Enclosure #: ઈ, DP, BP13G+EXP

    ********************************************************************************************

    x-xxx-xxxx-02 is the name of the old failed node. The new node has the same name but has of course been re-installed. The error is not on the new node but on the 3 remaining nodes; they still seem to be looking for, and waiting on, Node 2 to return. I doubt this will ever resolve itself, and I would rather get rid of the error and treat the new node as if it had never been part of the cluster and is simply a new node.
    Even if I give the new node a different name, the error about the old node remains on the 3 remaining nodes.

    So, how do I get rid of the old error on the remaining 3-node cluster?

    This cluster is doing just fine with the 3 remaining nodes, regardless of the old error. It runs fine, the pool is healthy, and the disks are healthy. Nothing else is left over from the old Node 2, except for the error in the StorageSubsystem as stated above.
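
    One way to double-check what the surviving nodes still have registered (a rough check, run on one of the remaining nodes) is to list the enclosures the storage subsystem knows about; the stale entry from the failed node typically shows up as unhealthy:

    Get-StorageEnclosure | Select-Object FriendlyName, SerialNumber, HealthStatus, NumberOfSlots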

    Anyone ?

    Greetings,
    Richard

    Thursday, October 4, 2018 8:44 AM

Answers

  • Hi, I've had the same issue in a number of S2D cluster setups during the last year. The error always occurs after updating HBA firmware on a cluster node, which leaves behind a duplicate "ghost" HBA. It can easily be verified by looking in Failover Cluster Manager under Storage\Enclosures, where you will see a duplicate HBA
    (two HBAs with the same serial number, but one without any disks attached).
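
    Roughly the same check from PowerShell (a sketch, run on any cluster node): group the enclosures by serial number and look for serials that appear more than once; the "ghost" entry is usually the one with no disks behind it.

    Get-StorageEnclosure |
        Group-Object SerialNumber |
        Where-Object Count -gt 1 |
        ForEach-Object { $_.Group | Select-Object FriendlyName, SerialNumber, HealthStatus }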

    Rebooting all cluster nodes helped in most cases. In some cases I also had to move cluster core resources around before the warning went away.
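
    For the core resources part, a minimal sketch (the target node name is illustrative):

    # Move the cluster core resources (cluster name, IP address, witness) to another node.
    Get-ClusterGroup -Name "Cluster Group" | Move-ClusterGroup -Node "Node01"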

    /C

    • Proposed as answer by Clab79 Friday, October 5, 2018 3:19 PM
    • Marked as answer by Richard Willkomm Tuesday, October 9, 2018 8:51 AM
    Friday, October 5, 2018 3:14 PM

All replies

  • Small update.

    Looks like I already found something. I rebooted the remaining 3 nodes in the cluster (one by one),
    and after the last node was rebooted, the old error in the debug output was gone. The 3 nodes had not been rebooted since the node failed 3 weeks ago. Sounds solid.

    I have not re-added the fresh node yet to see if it succeeds now. First dinner and some sleep.

    Greetz

    Thursday, October 4, 2018 3:33 PM
  • Hi,

    Just want to confirm the current situation.

    Please feel free to let me know if you need further assistance.

    Best regards,

    Michael


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com

    Sunday, October 7, 2018 4:07 AM
  • Hi,

    I managed to add the failed node now. The enclosure error is gone, so rebooting definitely works. Rebooting nodes in S2D is tricky though, with the storage jobs kicking in and needing to finish before moving on to the next node. I wish there were another way to clean up the subsystem.
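
    For anyone doing the same, the rolling-reboot pattern I mean looks roughly like this (node name is illustrative); the key point is waiting for the storage jobs and virtual disks to settle before touching the next node:

    # Drain roles off the node, then reboot it.
    Suspend-ClusterNode -Name "Node01" -Drain -Wait
    Restart-Computer -ComputerName "Node01" -Wait -For PowerShell -Force

    # Bring the node back, then wait until repairs are finished and the virtual disks are healthy.
    Resume-ClusterNode -Name "Node01" -Failback Immediate
    do {
        Start-Sleep -Seconds 60
        $jobs  = Get-StorageJob | Where-Object JobState -ne 'Completed'
        $disks = Get-VirtualDisk | Where-Object HealthStatus -ne 'Healthy'
    } while ($jobs -or $disks)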

    But after re-adding the node to the cluster, for some reason not all of the 12 disks (4 SSDs and 8 spindles) in the failed node were re-added to the pool, and 2 disks are set to the Auto-Select usage type instead of Journal, even though all 4 servers are identical (12 disks each). I used the 'add all eligible storage' option when adding the node in FCM.

    I can probably fix the pool manually by adding the remaining 3 disks, which are now in the primordial pool, and changing the usage type of some disks. But the whole idea behind S2D is that the system figures out what layout the pool should have, how many disks, and which disk gets which usage type, etc. And with 4 identical servers, how hard can it be? It worked the first time when we built the cluster.
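
    If I do fix it by hand, it would be roughly along these lines (pool and disk names are illustrative, and I would double-check which disks S2D actually expects as cache before changing the usage type):

    # Add the disks that are still sitting in the primordial pool.
    $eligible = Get-PhysicalDisk -CanPool $true
    Add-PhysicalDisk -PhysicalDisks $eligible -StoragePoolFriendlyName "S2D Pool"

    # Flip a specific disk from Auto-Select to Journal usage.
    Set-PhysicalDisk -FriendlyName "PhysicalDisk12" -Usage Journal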

    Oh well. You never stop learning I guess ;)

    Greets
    Richard

    Tuesday, October 9, 2018 8:51 AM