Cluster failover: failed to bring secondary node online

    Question

  • Hi

    Server : Windows server 2008

    DB Server : SQL Server 2008 (SP1)

     

    Here are the series of events which happened.

    1.) Event ID: 1135

    Cluster node 'XYZ' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

    2.) Event ID: 1049

         Cluster IP address resource 'SQL IP Address 1 (XYZ)' cannot be brought online because a duplicate IP address '10.9.8.113' was detected on the network.  Please ensure all IP addresses are unique.

    3.) Event ID: 1069

         Cluster resource 'SQL IP Address 1 (XYZ)' in clustered service or application 'SQL Server (MSSQLSERVER)' failed.

    4.) Event ID: 1049

         Cluster IP address resource 'Cluster IP Address' cannot be brought online because a duplicate IP address '10.9.8.112' was detected on the network.  Please ensure all IP addresses are unique.

    5.) Event ID: 1069

        Cluster resource 'Cluster IP Address' in clustered service or application 'Cluster Group' failed.

    6.) Event ID: 1066

    Cluster disk resource 'Cluster Disk 25' indicates corruption for volume '\\?\Volume{88552e6f-aea2-11df-9790-0026b92fffa7}'. Chkdsk is being run to repair problems. The disk will be unavailable until Chkdsk completes. Chkdsk output will be logged to file 'C:\Windows\Cluster\Reports\ChkDsk_ResCluster Disk 25_Disk16Part1.log'. Chkdsk may also write information to the Application Event Log.

    7.) Event ID: 1066

    Cluster disk resource 'Cluster Disk 26' indicates corruption for volume '\\?\Volume{88552e05-aea2-11df-9790-0026b92fffa7}'. Chkdsk is being run to repair problems. The disk will be unavailable until Chkdsk completes. Chkdsk output will be logged to file 'C:\Windows\Cluster\Reports\ChkDsk_ResCluster Disk 26_Disk4Part1.log'. Chkdsk may also write information to the Application Event Log.

    8.) Event ID: 1049

      (Same message as point 2)

    9.) Event ID: 1069

         (Same message as point 3)

    10.) Event ID: 1049

    (same message as point 4)

    11.) Event ID: 1069

           (same message as point 5)

    12.) Event ID: 1205

        The Cluster service failed to bring clustered service or application 'Cluster Group' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.

    13.) Event ID: 1069

          Cluster resource 'Cluster Disk 17' in clustered service or application 'SQL Server (MSSQLSERVER)' failed.

    14.) Event ID: 1049

          (same message as point 2)

    15.) Event ID: 1069

    Cluster resource 'SQL IP Address 1 (XYZ)' in clustered service or application 'SQL Server (MSSQLSERVER)' failed.

    16.) Event ID: 1205

     The Cluster service failed to bring clustered service or application 'SQL Server (MSSQLSERVER)' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.

     

    First of all, I went through all the logs and could not find the reason the failover was initiated. Shouldn't something be logged explaining why the failover happened? Secondly, after the failover the service did not come online because a duplicate IP address was detected. Later, when we manually brought the service online from cluster management, it came online successfully. I don't understand how the duplicate IP address conflict gets resolved when we start the service manually.

    Lastly, we see a few errors related to the physical disk resources between the failover retries. Could these be correlated with the failover error? Please help me troubleshoot these errors; I am not very experienced with clustering. Thanks in advance for your help. :)

    Thanks

    Mushtaq

    Tuesday, October 19, 2010 5:06 PM

Answers

  • Starting with event 1.  It looks like your cluster lost network communications.  At that point, cluster arbitration took over.  One node "won" control of the Witness disk and established a quorum.  The other node was removed from participating in the cluster.  At that time, the cluster tried to take over the SQL Server resource group and move it to the node that had established quorum.  That didn't work because the old node still showed the IP address as active.

    The disk corruption is not good.  If you are using iSCSI disks for your cluster, then it is a side effect of the network failure.

    Definitely something wrong in the network stack. 
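
To make the sequence above easier to follow, here is a minimal Python sketch that tallies the sixteen event IDs from the question in the order they were posted. The stage descriptions paraphrase the quoted event messages; the grouping and the names `EVENTS`, `STAGES`, and `summarize` are assumptions made for this illustration, not an official classification.

```python
# Event IDs in the order they were listed in the question (items 1 to 16).
EVENTS = [1135, 1049, 1069, 1049, 1069, 1066, 1066, 1049,
          1069, 1049, 1069, 1205, 1069, 1049, 1069, 1205]

# Illustrative grouping only (an assumption for this sketch); the wording
# paraphrases the event messages quoted in the question.
STAGES = {
    1135: "node removed from cluster membership (lost communication)",
    1049: "IP address resource blocked by duplicate-address detection",
    1069: "cluster resource failed",
    1066: "disk flagged as corrupt, Chkdsk started",
    1205: "group could not be brought completely online or offline",
}


def summarize(event_ids):
    """Return (stage of the first event, per-stage counts in first-seen order)."""
    counts = {}
    for event_id in event_ids:
        stage = STAGES.get(event_id, "unclassified")
        counts[stage] = counts.get(stage, 0) + 1
    first = STAGES.get(event_ids[0], "unclassified") if event_ids else None
    return first, counts


first_stage, stage_counts = summarize(EVENTS)
print("First event:", first_stage)
for stage, count in stage_counts.items():
    print(f"{count:2d} x {stage}")
```

The first event in the list is the membership loss (1135); the duplicate-address and resource-failure events only appear after it, which fits a network problem as the starting point rather than the disk or IP errors.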


    Geoff N. Hiten Principal Consultant Microsoft SQL Server MVP
    Wednesday, October 20, 2010 4:28 PM
    Moderator

All replies

  • Has a solution been identified? I am facing the same problem, with exactly the sequence of event log entries described above; the only difference is that there are no issues with the disks.

    Monday, November 11, 2013 3:38 AM
  • Gokul,

    Please start a new question, even though it is very similar to this one. When you post it, please include the cluster.log output for that time period, along with any errors that exist.
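
One possible way to cut cluster.log down to the relevant time period before posting is sketched below in Python. It assumes each log line has the shape `<pid>.<tid>::<YYYY/MM/DD-HH:MM:SS.mmm> <LEVEL> <message>` and that the levels of interest are `ERR` and `WARN`; both are assumptions, so adjust the pattern if your log looks different. The function name `filter_window` is made up for this example.

```python
import re
from datetime import datetime

# Assumed line shape: "<pid>.<tid>::<YYYY/MM/DD-HH:MM:SS.mmm> <LEVEL> <message>"
LINE_RE = re.compile(
    r"^[0-9a-fA-F]+\.[0-9a-fA-F]+::"
    r"(\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+(\S+)\s+(.*)$"
)
TS_FORMAT = "%Y/%m/%d-%H:%M:%S.%f"


def filter_window(lines, start, end, levels=("ERR", "WARN")):
    """Keep lines whose timestamp is within [start, end] and whose level matches.

    Lines that do not match the assumed shape are skipped.
    """
    selected = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        match = LINE_RE.match(line)
        if match is None:
            continue
        stamp = datetime.strptime(match.group(1), TS_FORMAT)
        if start <= stamp <= end and match.group(2) in levels:
            selected.append(line)
    return selected
```

To use it, read the log file into a list of lines and call `filter_window(lines, start, end)` with two `datetime` values that bracket the failover. The timestamps in the log may be in UTC rather than local time, so check that before choosing the window.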

    -Sean


    Sean Gallardy

    Tuesday, November 12, 2013 5:50 PM
    Answerer