Inexplainable 2008 Failover Cluster Issues
- Hi,
We have a Windows Server 2008 failover cluster (Node and Disk Majority quorum) running SQL Server 2005.
There are 2 nodes in the cluster, both with Windows Server 2008 Enterprise 64-bit SP2 installed.
At around 00:20 each morning we see various FailoverClustering errors in the event logs on both servers.
EventID: 1135, 1069, 1177
Before the FailoverClustering events are seen, 2 informational events appear regarding the 'Microsoft Failover Cluster Virtual Adapter':
EventID: 4201 'The system detected that network adapter Local Area Connection* 9 was connected to the network, and has initiated normal operation.'
This is causing the resources to fail over to the secondary node.
I have run the Cluster Validation Wizard and everything passes. I have disabled the Windows Firewall service on both nodes.
The storage is presented via NetApp, and each node has 3 NICs installed:
NIC1 - Server Vlan - Speed/Duplex Set to 1000Mb Full
NIC2 - Storage Vlan - Speed/Duplex Set to 1000Mb Full
NIC3 - Heartbeat - Speed/Duplex Set to 100Mb Full
Can anyone please help me troubleshoot these issues?
Thanks
Scott
All Replies
- Can you post the actual details of the 3 events?
thanks,
Edwin.
- Hi Scott and Edwin,
I have the same issue and I've been struggling for a month now.
Since SP2 was installed, one node of our 2008 failover cluster has been failing at random every 30-90 minutes;
before the update, failovers were even more frequent.
I think I can see a pattern in the events on the node:
the event Scott mentioned is logged twice before the service crashes.
This is the order in which they appear. (I'm including the time service event because it happens every time, so I can't tell whether it's related or not.)
event id 27 -
The time provider NtpClient is currently receiving valid time data from <DomainController> (ntp.d|0.0.0.0:123->10.0.0.40:123).
event id 4201 (logged twice) -
The system detected that network adapter Local Area Connection* was connected to the network, and has initiated normal operation.
event id 1135 -
Cluster node 'MYMCSQL2' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
event id 1069 -
Cluster resource 'Quorum' in clustered service or application 'Cluster Group' failed.
That is pretty much the pattern.
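For anyone who wants to check their own logs for the same ordering, here is a rough sketch in Python. It assumes the System log has been exported to (timestamp, event ID) pairs; the sample rows, the function name `find_failover_sequences`, and the 2-minute window are all made up for illustration and are not taken from the real logs in this thread.

```python
from datetime import datetime, timedelta

# Hypothetical sample rows (timestamp, event_id); real data would come from
# an export of the System event log.
events = [
    (datetime(2009, 10, 16, 0, 19, 58), 27),
    (datetime(2009, 10, 16, 0, 20, 1), 4201),
    (datetime(2009, 10, 16, 0, 20, 2), 4201),
    (datetime(2009, 10, 16, 0, 20, 5), 1135),
    (datetime(2009, 10, 16, 0, 20, 9), 1069),
    (datetime(2009, 10, 16, 3, 0, 0), 4201),
]

def find_failover_sequences(events, window=timedelta(minutes=2)):
    """Return timestamps of 1135 events that had at least two 4201 events
    in the `window` immediately before them."""
    ordered = sorted(events)
    hits = []
    for ts, event_id in ordered:
        if event_id != 1135:
            continue
        preceding = [t for t, e in ordered
                     if e == 4201 and ts - window <= t < ts]
        if len(preceding) >= 2:
            hits.append(ts)
    return hits

for ts in find_failover_sequences(events):
    print("1135 preceded by two 4201 events at", ts)
```

If every 1135 shows up in the result, that supports the idea that the virtual adapter reconnect always comes first; it does not by itself say what causes it.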
The hardware is exactly the same on both nodes,
both NIC drivers are up to date,
and cluster validation gives no clue.
thanks,
Omri.
Edit:
BTW,
Scott, do you have the SCOM agent installed?
- Hi Scott,
Event ID 1135 — Cluster Service Startup
http://technet.microsoft.com/en-us/library/dd353973(WS.10).aspx
Event ID 1069 — Clustered Service or Application Availability
http://technet.microsoft.com/en-us/library/dd353893(WS.10).aspx
Event ID 1177 — Quorum and Connectivity Needed for Quorum
http://technet.microsoft.com/en-us/library/dd353872(WS.10).aspx
Event ID 4201 — TCP/IP Network Interface Connectivity
http://technet.microsoft.com/en-us/library/dd392958(WS.10).aspx
Hope it helps.
Tim Quan - MSFT
- Hi Guys,
Thanks for the replies.
These are the details of the events:
1069: Cluster resource 'Disk :\' in clustered service or application 'Cluster Group Name' failed. (This is occurring for each disk.)
For some unknown reason, someone has added a FileServer resource on the disk that holds the SQL data. I cannot pinpoint in the logs when this was done. Which Event ID should I be looking for when a cluster resource has been added?
4201: The system detected that network adapter Local Area Connection* 12 was connected to the network, and has initiated normal operation.
7024: The SQL Server (SQL Cluster Instance Name) service terminated with service-specific error 3449 (0xD79).
57: The system failed to flush data to the transaction log. Corruption may occur.
1073: The Cluster service was halted to prevent an inconsistency within the failover cluster. The error code was '1359'.
7034: The Distributed Transaction Coordinator (03fcbf57-73fd-4375-8243-35af385ea1a2) service terminated unexpectedly. It has done this 4 time(s).
We have logged a call with VMware and they have confirmed that they think it is a networking issue. Do you agree, from what you know here?
Thanks
Scott
1135: Cluster node 'NODE 2' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
1126: Cluster network interface 'NODE 2 & 1" - Heartbeat - "IP"' for cluster node 'NODE 2' on network 'Heartbeat' is unreachable by at least one other cluster node attached to the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
7031: The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service.
7024: The Cluster Service service terminated with service-specific error 1359 (0x54F).
And the list goes on........
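A small aside on the error codes: events 1073 and 7024 quote the same Cluster service code, once in decimal only ('1359') and once with its hex form (0x54F). The Python sketch below just confirms the decimal and hex values line up; the helper name `codes_match` is made up for illustration. As far as I know, Win32 error 1359 is ERROR_INTERNAL_ERROR, which is generic and does not point to a root cause on its own.

```python
# Error codes as quoted in the events above: (event ID, decimal, hex text).
logged_codes = [
    (7024, 3449, "0xD79"),  # SQL Server service-specific error
    (7024, 1359, "0x54F"),  # Cluster Service service-specific error
]

def codes_match(decimal_code, hex_text):
    """True if the hex string denotes the same number as the decimal code."""
    return int(hex_text, 16) == decimal_code

for event_id, decimal_code, hex_text in logged_codes:
    print(event_id, decimal_code, hex_text, codes_match(decimal_code, hex_text))

# Event 1073 logged only the decimal form; its hex equivalent:
print(hex(1359))
```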
We have logged a call with VMware and they are advising that it is network related. From what you know so far, do you agree?
Thanks
Scott
- Edited by Scott Eggleston, Friday, October 16, 2009 11:12 AM
- Edited by Scott Eggleston, Friday, October 16, 2009 11:13 AM
- Omri,
I am also receiving the events related to NTP.
I do not have MOM/SCOM agents installed.
Thanks
Scott
- Hi Scott,
I've removed the MOM agent as well as the Trend Micro AV client,
and I haven't had a failure in 3 days.
Are you using Trend Micro?
thanks,
Omri
- Hiya,
Since you mention AV, just make sure that you have excluded everything that needs to be excluded.
Note that this article is not marked as applying to Windows Server 2008, but from what I can see the same principles would apply:
http://support.microsoft.com/default.aspx/kb/250355
- Hey Guys,
We are having the same issue in our development environment as well, which does not have AV installed.
There must be something more sinister than AV.
Scott
- 1069: Cluster resource 'Disk :\' in clustered service or application 'Cluster Group Name' failed. (Which is occurring for each Disk)
That is the message that worries me.
A 1069 here means that the physical disk resources in the cluster have failed.
I would do three things:
1) Check the System and Application event logs for the period prior to this 1069 to see if there are clues.
2) Check the cluster.log file for the same period and the period prior to it (note: cluster.log is timestamped in UTC, which is GMT without daylight saving adjustment).
3) Open a support case with your storage vendor and/or Microsoft to investigate further.
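On the UTC point in step 2, a quick way to work out where to look in cluster.log is to convert the local event time to UTC. The sketch below is only an illustration: the helper name `local_to_utc` is made up, and the +1 hour offset is an assumption, since the servers' time zone is not stated in this thread.

```python
from datetime import datetime, timedelta, timezone

def local_to_utc(local_naive, utc_offset_hours):
    """Attach a fixed UTC offset to a naive local time and convert to UTC."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    return local_naive.replace(tzinfo=local_tz).astimezone(timezone.utc)

# Event seen at 00:20 local time; offset of +1 hour is an assumed example.
event_local = datetime(2009, 10, 16, 0, 20)
print(local_to_utc(event_local, 1))  # 2009-10-15 23:20:00+00:00
```

With a positive offset, an event shortly after midnight local time lands on the previous calendar day in cluster.log, which is easy to miss.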
Rgds,
edwin.
- Unmarked As Answer by Tim Quan - MSFT, Moderator, Thursday, November 05, 2009 10:54 AM
- Marked As Answer by Tim Quan - MSFT, Moderator, Monday, November 02, 2009 1:59 AM
- We're having the same issue; it happens a couple of times a week. Unfortunately Microsoft, VMware and EMC all point fingers at each other and we are getting nowhere. Does anyone have any further updates?
- Disable TCP Chimney or TCP offload: http://support.microsoft.com/kb/951037
- Hi Sam,
I've tried disabling TCP offload and it didn't do the trick.
BTW, the NICs are Broadcom BCM5708s.
- Hiya,
I've heard in quite a few places that Broadcom NICs are giving some very inexplicable errors, although only in relation to Hyper-V as far as I've heard.
If you have the possibility, try moving your connections to a NIC from another vendor. Just for the sake of testing, it might be worth trying.
- Thanks for that reply,
but alas, it's not an option right now.
I think I'll just try to evict the failing node and re-add it, although I know
most of you have tried that already.
I'm just at my wits end here.

