Up But Isolated Cluster Node

    Question

  • I'm running Server 2016, fully patched, in a 5-node cluster: Hyper-V and S2D as a hyper-converged solution running a few hundred VMs.  Two days ago one of my nodes decided that it wanted to be cranky.  This caused the roles to rearrange on the systems and ended up putting one of my healthy nodes in the "Isolated" state.  The root cause for the other node that went out to lunch is still unknown and is being researched separately.  However, this other healthy node has been stuck in an online but isolated state.  See screenshot.  I've seen plenty of examples where the node is offline and isolated, typically due to a network problem (my network looks fine; I have three separate NICs with separate switches/VLANs/IP space).  I can live migrate VMs, and my S2D storage is fully healthy on the cluster.  No issues using this node, but I don't like the "isolated" state.  I ran the cluster validation test for networking and it returns healthy, with no warnings or errors.  Event logs show that the node went isolated, but in the same second I have a follow-up event saying that it's no longer isolated.  These events exist on all nodes in the cluster, so there is no reason why it should be isolated.  I'm sure that if I rebooted this node (or even restarted the cluster service) it would come back online as healthy, but another node in the cluster is having hardware issues, so that's not an option at the moment.  Any thoughts on how to remove the isolated state would be appreciated.  The end of the PowerShell command shows it all.
    State is Up. StatusInformation is Isolated...
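
For reference, the State and StatusInformation values quoted above can be listed for every node at once. This is only a minimal sketch, assuming the FailoverClusters PowerShell module is available and the commands are run on a cluster node with sufficient rights; the original post does not show the exact command that was used.

```powershell
# Assumption: the FailoverClusters module is installed (it ships with the
# Failover Clustering feature and its management tools).
Import-Module FailoverClusters

# List every node with its membership state and its status detail.
Get-ClusterNode |
    Format-Table -Property Name, State, StatusInformation -AutoSize

# Show only the nodes whose status detail is something other than Normal
# (for example Isolated or Quarantined).
Get-ClusterNode |
    Where-Object { $_.StatusInformation -ne 'Normal' } |
    Select-Object -Property Name, State, StatusInformation
```

Running the same query from more than one node may help confirm whether all nodes report the same status for the affected node.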


    May 4, 2018 14:20

All replies

  • Hi,

    This is a quick note to let you know that I am currently performing research on this issue and will get back to you as soon as possible. I appreciate your patience.
    If you have any updates during this process, please feel free to let me know.

    Best Regards,

    Candy


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com   

    May 7, 2018 09:15
  • I'd like to report a very similar issue we encountered just 15 minutes ago.

    We have a 10 Node cluster, Server 2016 fully patched, where 3 Nodes had their Status set to "Up/Isolated".

    A colleague emptied one of the 3 affected nodes and restarted its cluster service, which changed the status from Up/Isolated to Up on all 3 nodes.

    @OP: I see that may not be an option since you have that host with the hardware issue; I just wanted to let you know.

    May 7, 2018 10:44
  • Hi Manuel Schenkel,

    I appreciate your comments.  Not sure how I got into this state, but it's annoying.  Everything I can tell about this issue says that it's a bug.  I'm working on the hardware fix this morning and will then hopefully get to restart the cluster service on the "isolated" node.

    Are you guys also running S2D and Hyper-V?

    Thanks,
    Brad

    May 7, 2018 12:27
  • Is the S2D configuration a WSSD certified solution? https://www.microsoft.com/en-us/cloud-platform/software-defined-datacenter  Certified S2D hardware is required to pass additional qualifying (AQ) tests that are above and beyond normal Windows Server certification due to the extra strain S2D puts on cluster components.  If you are not using a certified platform from one of the vendors listed in the above link, it is almost like you are going to go through the same steps the vendor went through to get their configurations certified.

    tim

    May 7, 2018 13:14
  • Tim,

    Thanks for your post.  It is not a WSSD certified solution.  They didn't have these back in 2016 when we first set up our cluster.  I've been running S2D with Hyper-V for almost 2 years on 2016 and I've never seen this state before.  Also, the "requirements" for S2D are pretty simple and don't require WSSD certified solutions.

    Thanks,
    Brad


    May 7, 2018 13:37
  • "They didn't have these back in 2016 when we first set up our cluster. "

    Looks like you are using Grammarly - it tends to mess up posts.  I understand that there were no certified solutions back when Windows Server 2016 was released.  That's because the certifications were new and it was taking the vendors about six months to work through all the tests for their first systems.

    "I've never seen this state before. "

    Famous last words. <grin>

    "Also, the "requirements" for S2D are pretty simple and don't require WSSD certified solutions."

    True, it does not require a WSSD certified solution, but that certification was put in place for a reason.  This is very similar to the way regular failover clusters used to have a separate list of supported configurations.  One could put together a cluster using their own components and it worked, but if something went wrong, support was pretty much up to the person who put the system together.  S2D is in its initial release, and while it is a great product, it does place a lot more stress on the components of a system than a regular failover cluster does.  The AQ tests are designed to stress the individual components and the combined configuration to ensure that they can take the load.  Also, as a result of the 'newness' of S2D, expert support is a little harder to come by.  By using certified solutions, you are assured that the components have been strenuously tested and are less likely to 'break' on any updates because the vendors are testing.  But it is still not a guarantee.  And then troubleshooting, even on a certified solution, requires additional expertise that is generally more readily available to people who have paid for premier support, as the basic support channels do not have the training on this product and it takes a long time to get an issue escalated through the channels.

    I've worked with clusters since the NT days.  My recommendation for any production system would be to work with a vendor, or at least a partner that specializes in S2D, to configure the system and then sign up for premier support.  For lab work, non-certified systems and components are fine, but not for production.


    tim

    May 7, 2018 14:09
  • Tim,

    I agree with all of your wisdom.  I'm still green behind the ears.  Sadly I'm in a quasi dev/semi-prod environment, so my budget is limited.

    Thanks,
    Brad

    May 7, 2018 14:13
  • Hi Brad

    "Are you guys also running S2D and Hyper-V?"

    That particular cluster is running Hyper-V, but it's not an S2D cluster. However, we have 2 S2D clusters running Hyper-V as well, and when I checked those yesterday they were looking great. No strange Up/Isolated status anywhere.

    Regards,

    Manuel


    May 8, 2018 10:46
  • Hey Brad,

    Did you ever figure out if this is a true Windows Server 2016 bug?  I have the same issue as well.

    Thanks,
    -Steven

    May 10, 2018 08:25
  • Hi Steven,

    Sadly, no.  I've restarted the cluster service on my "Isolated" node, but that didn't fix it for me.  I even tried rebooting the server, but that still didn't fix it.  As far as I can tell, my "up" and "isolated" node is actually fully online and healthy in the cluster.  It has replicated the S2D storage and I can live migrate VMs.  It even has a "vote" in the cluster.  If it was actually isolated, it shouldn't have a vote and I shouldn't be able to manage it properly.  So, I'm sticking with it being a bug for now.

    Brad

    May 10, 2018 12:27
  • If you believe it to be a bug, you should report it as such.  'Reporting' it in a forum is not the way bugs get fixed because this is not an official bug reporting mechanism.  If it truly is a bug, opening a case with Microsoft support does not cost anything.

    tim

    May 11, 2018 12:17
  • Tim,

    I actually already have a ticket open with MS.  Waiting for their official reply on the topic.

    Thanks,
    Brad

    May 11, 2018 14:34