Cluster issue

    Question

  • Dear All,

    Good day,

    I need help analyzing a cluster issue that happened in our environment last week. We have a 2-node cluster with 10 cluster resources. One morning all the resources had problems, and when we checked we found them in 'Failed' or 'Pending Online' status. The resources are SAN-based storage, and the storage team found no issues on the storage side. Connectivity was also confirmed OK.

    During this time, one of the nodes was rebooting.

    When we checked the logs we found an error with event ID 1230, which reads:

    "A component on the server did not respond in a timely fashion. This caused the cluster resource 'Cluster Disk 2' (resource type 'Physical Disk', DLL 'clusres.dll') ..."

    After various checks we identified that the time zone on one of the nodes was different. The Volume Shadow Copy Service (VSS) was also in a stopped state. In the meantime we tried to bring the resources online manually. Some came up, but the others went back to "Failed" status.

    Later the time zone was corrected, and both cluster nodes have been stable and working since.

    I would like more input here: are the above the real root cause?

    I have now checked the servers again and everything is fine, but the VSS service was set to Manual and snapshots were not taken today. I have changed it to Automatic.


    Arun


    • Edited by AadhiArun, 20 November 2017 09:28 (spell correction)
    20 November 2017 08:20

All replies

  • Hi Arun,

    Please generate the cluster logs and go through them. To get the cluster logs, use the following command:

    Get-ClusterLog -UseLocalTime

    "UseLocalTime" writes the log timestamps in the server's local time; otherwise they are in UTC. The logs will be in C:\Windows\Cluster\Reports\cluster
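
    As a minimal sketch of the step above: the log from every node can be collected into one folder and searched for the disk named in the event. The folder C:\Temp\ClusterLogs is only a hypothetical example path, and the search string is taken from the event text quoted in the question.

```powershell
# Assumption: run from an elevated PowerShell session on one of the cluster
# nodes, with the FailoverClusters module installed.
Import-Module FailoverClusters

# Hypothetical output folder; any local path will do.
$dest = 'C:\Temp\ClusterLogs'
New-Item -ItemType Directory -Path $dest -Force | Out-Null

# Generate the log on each node and copy it to $dest, stamped in local time.
# (-TimeSpan <minutes> can be added to limit the log to a recent window.)
Get-ClusterLog -UseLocalTime -Destination $dest

# Show the first lines that mention the disk named in event 1230.
Select-String -Path (Join-Path $dest '*.log') -Pattern 'Cluster Disk 2' -SimpleMatch |
    Select-Object -First 50
```

    Having both nodes' logs side by side makes it easier to line up what each node saw at the time of the failure.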

    Other things to check:

    1. Was there any recent change in the environment?

    2. Were there any specific tasks running during or just before the outage?

    3. Did you see any 'Is Alive' check failures in the cluster events? From the event it is clear that Cluster Disk 2 went into some kind of unresponsive state.
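
    For point 3, one possible way to pull the failover clustering errors and warnings from the System log of each node is sketched below. The 14-day window is an assumption; adjust it so that it covers the morning of the outage.

```powershell
# Assumption: run on a cluster node with rights to read the event logs
# of both nodes remotely.
$since = (Get-Date).AddDays(-14)   # assumed window, adjust as needed

foreach ($node in (Get-ClusterNode).Name) {
    # Level 2 = Error, 3 = Warning; this includes event ID 1230.
    Get-WinEvent -ComputerName $node -ErrorAction SilentlyContinue -FilterHashtable @{
        LogName      = 'System'
        ProviderName = 'Microsoft-Windows-FailoverClustering'
        Level        = 2, 3
        StartTime    = $since
    } | Sort-Object TimeCreated |
        Format-Table TimeCreated, MachineName, Id, Message -Wrap
}
```

    Reading the events in time order on both nodes shows what was logged just before the first 1230 event.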

    So, to get to the root cause, the cluster logs have to be analysed.

    Regards,
    Bala

    20 November 2017 09:29
  • Thank you Bala, I will do it and post the results here.

    Arun

    20 November 2017 11:50
  • Also, Bala, it was not a single cluster resource. We had 10 resources that went to Failed status. Can you also help me with the question about the time zone?

    Arun

    20 November 2017 11:59
  • Hi Arun,

    When RHS (the Resource Hosting Subsystem) is reset, every resource hosted in that RHS process is affected, so all the clustered resources can go offline or fail. If your cluster is on Windows Server 2012 or later, there are enhancements in how RHS is handled, which are detailed in the following article:

    https://blogs.msdn.microsoft.com/clustering/2013/01/24/understanding-how-failover-clustering-recovers-from-unresponsive-resources/
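
    While reading the article, it may help to see which resources are currently not online and which node owns them. A small sketch, assuming it is run on one of the cluster nodes:

```powershell
# Assumption: the FailoverClusters module is available on this node.
# Lists every clustered resource that is not in the Online state.
Get-ClusterResource |
    Where-Object { $_.State -ne 'Online' } |
    Format-Table Name, State, OwnerGroup, OwnerNode, ResourceType -AutoSize
```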

    20 November 2017 12:45
  • Hi AadhiArun,

    1. Please run the cluster validation wizard to check whether there are any warnings or errors.

    2. As for the time change, you can enable the related audit policy to audit system time changes. See the following article for details:

    https://docs.microsoft.com/en-us/windows/device-security/auditing/event-4616

    3. What is the OS of the cluster nodes? Check whether they are patched with the latest Windows updates.
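
    For points 2 and 3, a rough sketch of how the time zone, OS build, most recent hotfix and VSS service state could be compared across the nodes in one go. It assumes PowerShell remoting is enabled on both nodes; the column names are only illustrative.

```powershell
# Assumption: run on a cluster node, with PowerShell remoting enabled
# on every node in the cluster.
$nodes = (Get-ClusterNode).Name

Invoke-Command -ComputerName $nodes -ScriptBlock {
    $os  = Get-CimInstance Win32_OperatingSystem
    $vss = Get-CimInstance Win32_Service -Filter "Name='VSS'"
    [pscustomobject]@{
        TimeZone   = [System.TimeZoneInfo]::Local.Id
        LocalTime  = Get-Date
        OS         = $os.Caption
        Build      = $os.Version
        LastHotfix = (Get-HotFix | Sort-Object InstalledOn | Select-Object -Last 1).HotFixID
        VssState   = $vss.State
        VssStart   = $vss.StartMode
    }
} | Format-Table PSComputerName, TimeZone, LocalTime, OS, Build, LastHotfix, VssState, VssStart -AutoSize

# To audit future time changes (event 4616), the "Security State Change"
# audit subcategory can be enabled on each node, for example:
#   auditpol /set /subcategory:"Security State Change" /success:enable
```

    Any column that differs between the two rows is worth a closer look.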

    Best Regards,

    Anne


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    21 November 2017 08:57
    Moderator
  • Hi AadhiArun,

    Just checking whether the above reply was of help. If yes, please mark the useful reply as the answer; if the issue remains, feel free to follow up.

    Best Regards,

    Anne



    23 November 2017 04:37
    Moderator
  • Thank you Anne for the feedback. But can we run the cluster validation during production hours? Will it cause any problems in the current setup?

    Arun

    30 May 2018 13:42
  • Hi Arun,

    Cluster validation will try to move disk resources from one node to the other, so it is recommended to run it during a downtime window to be safe.
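
    If a downtime window is not available right away, one option is to run only the non-storage test categories first. This is a sketch, not a full validation: it leaves the storage tests for the maintenance window.

```powershell
# Assumption: run on one of the cluster nodes, so the local cluster is tested.
# Only the categories listed here are run; the Storage category is left out.
Test-Cluster -Include 'Inventory', 'Network', 'System Configuration'
```

    The report it produces still lists warnings and errors for the categories that were run.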

    Regards,
    Bala N

    4 June 2018 12:50