Question: Inexplainable 2008 Failover Cluster Issues

  • Wednesday, October 14, 2009 8:53 AM, Scott Eggleston
     
    Hi,

    We have a SQL Server 2005 failover cluster on Windows Server 2008, using the Node and Disk Majority quorum model.

    The cluster has 2 nodes, both running Windows Server 2008 Enterprise 64-bit SP2.

    At around 00:20 each morning we see various FailoverClustering errors in the event logs on both servers.

    EventID: 1135, 1069, 1177

    Before the FailoverClustering events appear, 2 informational events are logged for the 'Microsoft Failover Cluster Virtual Adapter':

    EventID: 4201 'The system detected that network adapter Local Area Connection* 9 was connected to the network, and has initiated normal operation.'

    This is causing the resources to fail over to the secondary node.

    I have run the Cluster Validation Wizard and everything passes. I have disabled the Windows Firewall service on both nodes.

    The storage is presented from NetApp, and each node has 3 NICs installed:

    NIC1 - Server Vlan - Speed/Duplex Set to 1000Mb Full
    NIC2 - Storage Vlan - Speed/Duplex Set to 1000Mb Full
    NIC3 - Heartbeat - Speed/Duplex Set to 100Mb Full

    Can anyone help me troubleshoot these issues?

    Thanks

    Scott

All Replies

  • Wednesday, October 14, 2009 12:18 PM, Edwin vMierlo (MVP, Moderator)
     
    Can you post the full details of the 3 events?

    Thanks,
    Edwin.
  • Wednesday, October 14, 2009 1:16 PM, omril
     
    Hi Scott and Edwin,

    I have the same issue and have been struggling with it for a month now.

    A 2008 failover cluster node has been failing at random every 30-90 minutes since SP2 was installed;
    before the update, failovers were much more frequent.


    I think I can see a pattern in the events on the node:
    the event Scott mentioned is logged twice before the service crashes.

    This is the order in which they appear. (I'm including the time service event because it happens every time, so I can't tell whether it's related or not.)


    event id 27 -

    The time provider NtpClient is currently receiving valid time data from <DomainController> (ntp.d|0.0.0.0:123->10.0.0.40:123).

    event id 4201 (logged twice) -

    The system detected that network adapter Local Area Connection* was connected to the network, and has initiated normal operation.

    event id 1135 -

    Cluster node 'MYMCSQL2' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

    event id 1069 -

    Cluster resource 'Quorum' in clustered service or application 'Cluster Group' failed.


    That is pretty much the pattern.
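A minimal sketch of how that ordering could be checked against an exported event list, rather than by eye. The timestamps in `EVENTS` are hypothetical sample data, not taken from either cluster, and the 2-minute window is an assumption to tune; only the event IDs (27, 4201, 1135, 1069) come from the posts above.

```python
from datetime import datetime, timedelta

# Hypothetical sample: (timestamp, event_id) pairs as exported from the System log.
EVENTS = [
    (datetime(2009, 10, 14, 0, 19, 40), 27),
    (datetime(2009, 10, 14, 0, 19, 58), 4201),
    (datetime(2009, 10, 14, 0, 20, 1), 4201),
    (datetime(2009, 10, 14, 0, 20, 5), 1135),
    (datetime(2009, 10, 14, 0, 20, 9), 1069),
    (datetime(2009, 10, 14, 9, 0, 0), 4201),
]


def find_precursors(events, trigger_id=1135, precursor_id=4201,
                    window=timedelta(minutes=2)):
    """For each trigger event, list precursor events logged within `window` before it."""
    ordered = sorted(events)
    matches = []
    for when, event_id in ordered:
        if event_id != trigger_id:
            continue
        before = [
            t for t, e in ordered
            if e == precursor_id and timedelta(0) <= when - t <= window
        ]
        matches.append((when, before))
    return matches


if __name__ == "__main__":
    for when, before in find_precursors(EVENTS):
        print(f"1135 at {when}: {len(before)} x 4201 in the preceding 2 minutes")
```

If every 1135 is preceded by the two 4201 events and no 4201 pair ever appears without a failure following it, that would support treating the adapter reconnect as part of the failure rather than as noise.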

    The hardware is identical on both nodes,
    and both NIC drivers are up to date.

    Cluster validation gives no clue.


    Thanks,
    Omri.


    Edit:

    BTW,

    Scott, do you have the SCOM agent installed?
  • Thursday, October 15, 2009 4:26 AM, Tim Quan - MSFT (MSFT, Moderator)
     

    Hi Scott,

     

    Event ID 1135 — Cluster Service Startup

    http://technet.microsoft.com/en-us/library/dd353973(WS.10).aspx

     

    Event ID 1069 — Clustered Service or Application Availability

    http://technet.microsoft.com/en-us/library/dd353893(WS.10).aspx

     

    Event ID 1177 — Quorum and Connectivity Needed for Quorum

    http://technet.microsoft.com/en-us/library/dd353872(WS.10).aspx

     

    Event ID 4201 — TCP/IP Network Interface Connectivity

    http://technet.microsoft.com/en-us/library/dd392958(WS.10).aspx

     

    Hope it helps.

     

    Tim Quan - MSFT

  • Friday, October 16, 2009 10:59 AM, Scott Eggleston
     

    Hi Guys,

    Thanks for the replies.

    These are the details of the events:

    1069:  Cluster resource 'Disk :\' in clustered service or application 'Cluster Group Name' failed. (This occurs for each disk.)

    For some unknown reason, someone has added a File Server resource on the disk that holds the SQL data. I cannot pinpoint in the logs when this was done. Which event ID should I look for when a cluster resource has been added?

    4201: The system detected that network adapter Local Area Connection* 12 was connected to the network, and has initiated normal operation.

    7024: The SQL Server (SQL Cluster Instance Name) service terminated with service-specific error 3449 (0xD79).

    57: The system failed to flush data to the transaction log. Corruption may occur.

    1073: The Cluster service was halted to prevent an inconsistency within the failover cluster. The error code was '1359'.

    7034: The Distributed Transaction Coordinator (03fcbf57-73fd-4375-8243-35af385ea1a2) service terminated unexpectedly. It has done this 4 time(s).

    1135:  Cluster node 'NODE 2' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

    1126: Cluster network interface 'NODE 2 & 1" - Heartbeat - "IP"' for cluster node 'NODE 2' on network 'Heartbeat' is unreachable by at least one other cluster node attached to the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

    7031: The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service.

    7024: The Cluster Service service terminated with service-specific error 1359 (0x54F).

    And the list goes on...

    We have logged a call with VMware, and they advise that it is network related. From what you know so far, do you agree?

    Thanks

    Scott

  • Friday, October 16, 2009 11:05 AM, Scott Eggleston
     

    Omril,

    I am also receiving the events related to NTP.

    I do not have MOM/SCOM agents installed.

    Thanks

    Scott

  • Sunday, October 18, 2009 11:56 AM, omril
     
    Hi Scott,


    I've removed the MOM agent as well as the Trend Micro AV client,

    and I haven't had a failure in 3 days.

    Are you using Trend Micro?

    Thanks,

    Omri
  • Monday, October 19, 2009 7:43 AM, Jesper Arnecke
     
    Hiya,

    Since you mention AV, make sure that everything that needs to be excluded from scanning is actually excluded.

    Note that this article is not marked as applying to Windows Server 2008, but from what I can see the same principles apply:
    http://support.microsoft.com/default.aspx/kb/250355

  • Tuesday, October 20, 2009 8:55 AM, Scott Eggleston
     
    Hey Guys,

    We are having the same issue in our development environment as well, which does not have AV installed.

    There must be something more sinister than AV at work.

    Scott
  • Wednesday, October 21, 2009 9:11 AM, Edwin vMierlo (MVP, Moderator)
     
    1069:  Cluster resource 'Disk :\' in clustered service or application 'Cluster Group Name' failed. (This occurs for each disk.)

    That is the message that worries me.
    A 1069 here means that the physical disk resources in the cluster have failed.

    I would do three things:
    1) check the system/app event log in the period prior to this 1069 to see if there are clues
    2) check the cluster.log file for the same period and the period prior to it (note: cluster.log is timestamped in UTC, i.e. GMT with no daylight saving adjustment)
    3) open a support case with your storage vendor and/or Microsoft to investigate further
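Because the event log shows local time while cluster.log is in UTC, it is easy to read the wrong part of the log. A minimal sketch of the conversion for step 2, assuming the servers run at UTC+1 (the thread does not state the time zone, so `LOCAL_TZ` is an assumption to change) and that the ten minutes before the event are of interest:

```python
from datetime import datetime, timedelta, timezone

# Assumption: the server's local time is UTC+1; adjust to the real offset.
LOCAL_TZ = timezone(timedelta(hours=1))


def cluster_log_window(local_event_time, minutes_before=10, tz=LOCAL_TZ):
    """Return the (start, end) UTC window to read in cluster.log for a local event time."""
    end_utc = local_event_time.replace(tzinfo=tz).astimezone(timezone.utc)
    return end_utc - timedelta(minutes=minutes_before), end_utc


if __name__ == "__main__":
    # 00:20 local is the failure time reported in the first post.
    start, end = cluster_log_window(datetime(2009, 10, 14, 0, 20))
    print(f"Search cluster.log from {start:%Y/%m/%d-%H:%M} to {end:%Y/%m/%d-%H:%M} UTC")
```

With a UTC+1 offset, a failure at 00:20 local time falls on the previous calendar day in cluster.log, which is the kind of mismatch this is meant to catch.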

    Rgds,
    edwin.
  • Friday, October 23, 2009 6:53 PM, mark987654321
     
    We're having the same issue; it happens a couple of times a week. Unfortunately MSFT, VMware and EMC all point fingers at each other and we are getting nowhere. Does anyone have any further updates?
  • Wednesday, October 28, 2009 7:45 PM, Sam Tech
     
    Disable TCP Chimney or TCP offload: http://support.microsoft.com/kb/951037
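A minimal sketch of scripting that change across nodes, assuming the `netsh int tcp set global chimney=disabled` syntax described in the KB article; check it against the article for your OS version before relying on it. The helper names are hypothetical, and the script only prints the command unless `dry_run=False` is passed on a Windows host with an elevated prompt.

```python
import platform
import subprocess


def chimney_command(state="disabled"):
    """Build the netsh command that sets the global TCP Chimney Offload state."""
    if state not in ("enabled", "disabled"):
        raise ValueError(f"unsupported state: {state!r}")
    return ["netsh", "int", "tcp", "set", "global", f"chimney={state}"]


def apply(command, dry_run=True):
    """Print the command; run it only when dry_run is False and the host is Windows."""
    if dry_run or platform.system() != "Windows":
        print("Would run:", " ".join(command))
        return None
    done = subprocess.run(command, check=True, capture_output=True, text=True)
    return done.stdout


if __name__ == "__main__":
    apply(chimney_command("disabled"))
```

Keeping the default as a dry run makes it possible to review the exact command on each node first; `netsh int tcp show global` can then be used to confirm the resulting state.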
  • Thursday, November 05, 2009 9:36 AM, omril
     
    Hi Sam,

    I've tried disabling TCP offload and it didn't do the trick.

    BTW, the NICs are Broadcom BCM5708S.


  • Thursday, November 05, 2009 11:19 AM, Jesper Arnecke
     
    Hiya,

    I've heard from quite a few places that Broadcom NICs are producing some very inexplicable errors, although only in relation to Hyper-V as far as I've heard.
    If you have the possibility, try moving your connections to a NIC from another vendor. Just for the sake of testing, it might be worth trying.
  • Sunday, November 08, 2009 8:36 AM, omril
     
    Thanks for that reply,
    but alas it's not an option right now.

    I think I'll just try evicting the failing node and re-adding it, although I know
    most of you have tried that already.

    I'm just at my wits' end here.