Server 2016 S2D Cluster unable to Drain Role (incomplete drain - paused)

  • Question

  • Hi everyone,

    I am experiencing strange behavior while trying to drain one node of a 2-node S2D cluster on Server 2016:

    In Failover Cluster Manager -> Nodes -> right-click the node -> Pause -> Drain Roles.

    The status changes to "draining", but then it shows "drain failed" with the following information:

    One or more roles were not moved from this node. Use the Roles tab to see these roles, and view their critical events to determine why they were not moved from this node.

    And under Show Critical Events:

    Node drain failed on Cluster node <NodeName>.

    Reference the node's System and Application event logs and cluster logs to investigate the cause of the drain failure. When the problem is resolved, you can retry the drain operation.
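
    For reference, the PowerShell equivalent of the pause/drain I am running looks roughly like this (just a sketch, assuming node#1 is the node being drained):

    # Pause the node and drain all roles off it (same as Pause -> Drain Roles in the GUI)
    Suspend-ClusterNode -Name "node#1" -Drain -Wait

    # Check the result of the drain attempt
    Get-ClusterNode -Name "node#1" | Format-List Name, State, DrainStatus

    # Resume the node again afterwards
    Resume-ClusterNode -Name "node#1" -Failback Immediate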

    But I could not find any errors in any of these event logs...
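
    This is roughly the kind of query I used to search them (a sketch; the channel and provider names below are the Server 2016 defaults):

    # Recent warnings/errors from the FailoverClustering operational channel
    Get-WinEvent -LogName "Microsoft-Windows-FailoverClustering/Operational" -MaxEvents 200 |
        Where-Object { $_.LevelDisplayName -in "Error","Warning" } |
        Format-Table TimeCreated, Id, LevelDisplayName, Message -AutoSize

    # Cluster service entries in the System log
    Get-WinEvent -FilterHashtable @{ LogName = "System"; ProviderName = "Microsoft-Windows-FailoverClustering" } -MaxEvents 100 |
        Format-Table TimeCreated, Id, Message -AutoSize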

    To check whether some cluster resources had not been moved, I executed Get-ClusterResource, but all cluster resources were on the other (running) node (node#2):

    PS C:\Windows\system32> Get-ClusterResource | select Name, IsCoreResource, State, OwnerNode

    Name                                   IsCoreResource State  OwnerNode
    ----                                   -------------- -----  ---------
    Cluster IP Address                              False Online node#2
    Cluster Name                                     True Online node#2
    Cluster Pool 1                                  False Online node#2
    File Share Witness                               True Online node#2
    Health                                          False Online node#2
    Storage Qos Resource                            False Online node#2
    Virtual Machine <testVM>                        False Online node#2
    Virtual Machine Cluster WMI                     False Online node#2
    Virtual Machine Configuration <testVM>          False Online node#2
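
    In case it is the clustered roles (groups) rather than the individual resources that are stuck, the owners and the node drain status can be checked the same way (a sketch):

    # Clustered roles (groups) and their current owners
    Get-ClusterGroup | Select-Object Name, State, OwnerNode

    # Node state plus the outcome of the last drain attempt
    Get-ClusterNode | Select-Object Name, State, DrainStatus, DrainTarget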

    So I don't understand why the drain fails... any suggestions?

    Thanks!

    Wednesday, June 14, 2017 10:47 AM

Answers

  • Hi,

    Thanks for the additional information... we continued to investigate the issue and ended up recreating the whole storage pool and the virtual disks.

    I am convinced that some SMB settings were set up incorrectly and this caused the whole system to become unbalanced. I will write again after these steps - let's see.

    • Proposed as answer by Mary Dong (Moderator) Wednesday, June 28, 2017 1:36 AM
    • Marked as answer by YankeeP Wednesday, September 18, 2019 2:54 PM
    Tuesday, June 27, 2017 5:33 PM

All replies

  • Hi,

    Where is this testVM located, and what happens if you live migrate it?

    bye,
    Marcel


    https://www.windowspro.de/marcel-kueppers

    I write here only in a private capacity

    Disclaimer: This posting is provided AS IS with no warranties or guarantees, and confers no rights.

    Wednesday, June 14, 2017 11:32 AM
  • Hi,

    The testVM is owned by node#2 (I'm draining node#1). Live migration is working flawlessly...

    Even when I drain node#1 while the VM is running on node#1, it is live migrated correctly. Afterwards, the same error is shown.

    Wednesday, June 14, 2017 11:34 AM
  • Is the file share witness accessible from both nodes?
    Try to take over the resources manually and check step by step.
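
    For example, something along these lines (the group, node, and path names below are only placeholders - take the real names from Get-ClusterGroup):

    # Move the core cluster group (cluster name/IP, witness) to node#1
    Move-ClusterGroup -Name "Cluster Group" -Node "node#1"

    # Live migrate the VM role to node#1
    Move-ClusterVirtualMachineRole -Name "testVM" -Node "node#1" -MigrationType Live

    # Check that the witness share is reachable from the node you are testing
    Test-Path "\\dc01\Witness"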

    https://www.windowspro.de/marcel-kueppers

    I write here only in a private capacity

    Disclaimer: This posting is provided AS IS with no warranties or guarantees, and confers no rights.


    Wednesday, June 14, 2017 11:45 AM
    When I tested a two-node S2D cluster in my home lab, I ended up with almost the same issue as this guy: except in my case, I managed to bring the pool back online.

    My failure started during planned maintenance when I tried to drain the node exactly as you did. I was using an Azure Cloud Witness, and it seemed that latency caused the issue. Where is your File Share Witness located?
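
    If you want to double-check how your witness is configured, something like this should show it (just a sketch):

    # Current quorum configuration and witness resource
    Get-ClusterQuorum | Format-List *

    # Share path behind the File Share Witness resource
    Get-ClusterResource "File Share Witness" | Get-ClusterParameter SharePath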

    Also, based on my experience, a two-node S2D cluster is a pretty weak setup. From what I can tell, S2D does a great job starting from 4 nodes. Consider using StarWind Free or HPE VSA Free. For example, StarWind was initially designed for use in 2-node deployments, while HPE VSA works great in "2+witness" or 3-node configurations.


    • Edited by Russel H Thursday, June 15, 2017 5:06 PM
    Thursday, June 15, 2017 5:05 PM
    The file share is accessible - pingable and reachable via SMB.

    Is there any cluster log file where I could find more details about what could possibly be wrong?

    Thursday, June 15, 2017 7:15 PM
    The file share witness is a simple share on my physical 2008 R2 DC. Both cluster nodes and the DC are on the same switch; ping < 1 ms.

    Thursday, June 15, 2017 7:17 PM
  • Hi,

    You could also follow the blog below to troubleshoot from the logs in a Server 2016 cluster.

    https://blogs.msdn.microsoft.com/clustering/2015/05/14/windows-server-2016-failover-cluster-troubleshooting-enhancements-cluster-log/
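
    In short, the cluster debug log can be generated on every node with Get-ClusterLog, for example (the destination and time span below are only examples):

    # Write the last 15 minutes of the cluster debug log for all nodes to C:\temp, using local time stamps
    Get-ClusterLog -Destination C:\temp -TimeSpan 15 -UseLocalTime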

    Best Regards,

    Mary


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    Monday, June 19, 2017 2:46 AM
    Moderator
    Hi, thank you for that hint!

    In the Get-ClusterLog log files there are several occurrences of these errors:


    WARN  [RHS] Error 50 from resource type control for restype Storage Replica.

    ERR   [API] ApipGetLocalCallerInfo: Error 3221356570 calling RpcBindingInqLocalClientPID.


    And this error seems to be the reason why the drain roles command fails:


    ERR [RCM] rcm::DrainMgr::SetStorageMaintenanceMode: [DrainMgr] Storage Maintenance Mode enable:true fail. Error 0x9 - One or more physical disks host data for virtual disks that have a lower fault domain awareness than the fault domain object specified.
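
    For reference, this is the kind of check I am running to compare the fault domain awareness of the virtual disks with the storage fault domains (a sketch; node#1 stands in for my real node name):

    # Fault domain awareness of each virtual disk in the pool
    Get-VirtualDisk | Format-Table FriendlyName, FaultDomainAwareness, ResiliencySettingName

    # Storage fault domains as the cluster sees them (one StorageScaleUnit per node in S2D)
    Get-StorageFaultDomain -Type StorageScaleUnit | Format-Table FriendlyName, HealthStatus

    # The drain apparently tries to enable storage maintenance mode for the node; the manual equivalent is roughly:
    Get-StorageFaultDomain -Type StorageScaleUnit |
        Where-Object FriendlyName -eq "node#1" |
        Enable-StorageMaintenanceMode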


    Any idea on that?


    • Edited by YankeeP Monday, June 19, 2017 3:32 PM
    Monday, June 19, 2017 2:59 PM
  • Hi YankeeP,

    For now, I couldn't find any official Microsoft documentation that describes this error.

    For more professional support with log analysis, I suggest you contact CSS, as this is likely to require deeper technical analysis beyond the scope of the forums. In addition, please check the article about configuring S2D with 2 nodes.

    Please Note: Since the web site is not hosted by Microsoft, the link may change without notice. Microsoft does not guarantee the accuracy of this information.

    Best Regards,

    Mary


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    Tuesday, June 20, 2017 2:08 AM
    Moderator
  • Hi,

    Thanks for the additional information... we continued to investigate the issue and ended up recreating the whole storage pool and the virtual disks.

    I am convinced that some SMB settings were set up incorrectly and this caused the whole system to become unbalanced. I will write again after these steps - let's see.

    • Proposed as answer by Mary Dong (Moderator) Wednesday, June 28, 2017 1:36 AM
    • Marked as answer by YankeeP Wednesday, September 18, 2019 2:54 PM
    Tuesday, June 27, 2017 5:33 PM
  • Hi YankeeP,

    Glad that it worked out, and thanks for sharing your workaround.

    Best Regards,

    Mary


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    Wednesday, June 28, 2017 1:37 AM
    Moderator