Possible reasons for automatic failover not succeeding in Exchange 2007 SP3 (CCR)

  • Question

  • Hello Experts,

    I wonder if you could help me with an issue we are currently having. All my CCR nodes are physical servers, the FSW is hosted on the HUBs (the HUBs run on VMs), and manual failover completes successfully. All cluster tests pass, but every time there is an issue (network, hardware, etc.), automatic failover does not work properly and I have to fail over manually.

    Can anyone explain the different reasons why an automatic failover might not happen in Exchange 2007 SP3?

    All nodes and FSW are running Windows 2008 Ent SP2.

    Thank you in advance


    Franki
    Wednesday, January 19, 2011 2:53 PM

Answers

  • The times I've seen behavior like this are when the disk controller for the OS drive hangs. This causes a lot of processes to hang as well (if for nothing else, because they're page faulting). What's sad is that the cluster service doesn't hang. It's able to maintain its cluster heartbeat connection to the other node. The end result is that, according to the cluster, the node is still up, but no new connections can be made.
    • Marked as answer by Jason Patrick Monday, January 31, 2011 5:47 PM
    Wednesday, January 19, 2011 7:28 PM

All replies

  • In your cases, was the node considered down according to Windows Failover Clustering?

    Wednesday, January 19, 2011 3:20 PM
  • Hi jader,

    I did not understand your question. Can you be more specific?


    Franki
    Wednesday, January 19, 2011 3:29 PM
  • Hi,

    Do you have 2 nodes in your CCR cluster? If so, where is your file share witness?

    Possibly you are experiencing a situation where the votes in the cluster cannot decide where the services should be active, so the services are stopped?

    Wednesday, January 19, 2011 4:09 PM
  • Yes, we do have 2 nodes in my CCR, and the FSW is hosted on a HUB (a VM).

    Just to clarify: both CCR nodes are physical servers, and the HUB hosting the FSW is a VM.

    Lafontma, any ideas?


    Franki
    Wednesday, January 19, 2011 4:16 PM
  • I think you need to describe in more detail the situation in which the automatic failover doesn't work.

    For example, if a core network issue occurs that prevents node 1 and node 2 from talking to each other AND they also can't talk to the FSW, services will go down, since quorum is lost and the cluster has to avoid a split-brain situation.

    A hardware issue on one of the nodes should not prevent automatic failover. In that case you would need to investigate the logs and see why the second node couldn't come online.

    I can't really help you more without understanding exactly what the issue was when the failover did not work.

    Wednesday, January 19, 2011 4:23 PM
  • Thanks for the quick answer.

    Issue Details:

    The issue starts with being unable to reach either the active or the passive node via RDP or ILO, and automatic failover does not work properly. From what I have seen, once an active node becomes unresponsive (frozen), we are unable to RDP into the box and the Exchange resources go offline. The quickest fix we have found is to RDP into the passive node, manually fail over to it, and reboot the previously active node.

    Can you help me understand why automatic failover does not work properly when the active node becomes unresponsive?

    Why does the manual failover work fine even when the active node is inaccessible via RDP or ILO?

    Franki
    Wednesday, January 19, 2011 4:41 PM
  • For the automatic failover process to occur successfully, the cluster service needs to be able to achieve quorum. To do that in your situation, 2 out of 3 "nodes" need to be able to communicate with each other. The FSW counts as a "node".

    Your first sentence mentions that both the active and passive nodes are frozen. In that case it is quite natural that services are taken down; both servers are technically down. Once that occurs, you need to intervene manually.

    So from what I understand of your issue, both servers become unresponsive, then at some point your passive node becomes available again, but by that time it's too late: you are in a split-brain situation. When you do the manual failover, it seems that the FSW and the passive node are both available, so the cluster is able to work.

    I guess your main question is why those 2 servers become unresponsive at the same time. That will always cause a split-brain situation, bring the services down, AND require manual intervention to bring them back up.

    Correct me if I'm wrong in assuming that both servers become unresponsive at the same time.

    Wednesday, January 19, 2011 4:54 PM
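As a rough illustration of the quorum rule described in the reply above (2 of 3 votes, with the FSW counting as one vote), here is a minimal Python sketch. The function name `has_quorum` is made up for illustration, and it models simple majority voting only, not the actual cluster service.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """Majority rule: strictly more than half of all votes must be reachable."""
    return reachable_votes > total_votes // 2


# Two CCR nodes plus the file share witness = 3 votes, as in this thread.
TOTAL_VOTES = 3

# The passive node and the FSW can still see each other: 2 of 3 votes.
print(has_quorum(2, TOTAL_VOTES))  # True

# A node that can reach neither its partner nor the FSW has only its own vote.
print(has_quorum(1, TOTAL_VOTES))  # False
```

Under this model, a frozen active node on its own should not cost the cluster its quorum, as long as the passive node and the FSW can still reach each other.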
  • I believe there was a typo in my initial description.

    Incident 1:

    The active node becomes frozen, but the passive node and FSW are online. Even so, a manual failover must be run.

    Incident 2:

    The passive node becomes frozen, but the active node and FSW are online. Again, I am not sure why this affects my CCR.

    Each incident took place on a different date and time, and I do not know why it affects the entire CCR.

    Are there any specific cluster settings I should check?


    Franki
    Wednesday, January 19, 2011 5:05 PM
  • The phrase "failover not working" could cover a lot of scenarios. Since it's so fuzzy, we're now going to play twenty questions to figure out exactly what you're experiencing. For example, you might think that something bad happened on the computer that should result in a failover, but if clustering keeps heartbeating successfully with the machine, nothing produces a signal telling the cluster to fail over.

    Wednesday, January 19, 2011 5:27 PM
  • Hi lafontma,

    Any other suggestions?

    Thank you in advance


    Franki
    Wednesday, January 19, 2011 6:14 PM
  • It does come down to the fact that your "unresponsiveness" doesn't seem to translate into a failover.

    For the failover to occur, one of the resources in the cluster group needs to fail.

    By default, some of the resources in the Exchange cluster group do NOT affect the group and do not initiate a failover.

    The only resources that affect group failover are the IP address and computer name resources. You might be experiencing a failure that impacts the Information Store, but this resource does NOT cause a failover :S

    Besides trying to identify what is really happening when it happens, you can probably bet that the IP and computer name resources are not affected, hence no failover.

    Wednesday, January 19, 2011 6:31 PM
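The point made in the reply above (and in the heartbeat reply before it) can be sketched as a small decision function. This is a deliberately simplified model, not how the cluster service is implemented: the `Resource` class, the `affects_group` flag, and `should_fail_over` are hypothetical names, and real clusters also apply restart attempts and thresholds before moving a group.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    online: bool
    affects_group: bool  # hypothetical stand-in for "failure affects the group"


def should_fail_over(heartbeat_ok: bool, resources: list[Resource]) -> bool:
    """Return True only when the cluster receives a signal it can act on."""
    if not heartbeat_ok:
        return True  # the node is considered down
    # Otherwise only a failed resource that affects the group triggers a move.
    return any(not r.online and r.affects_group for r in resources)


# A hung node whose cluster service still heartbeats and whose resources
# still report online gives the cluster nothing to react to.
hung_node = [
    Resource("IP Address", online=True, affects_group=True),
    Resource("Network Name", online=True, affects_group=True),
    Resource("Information Store", online=True, affects_group=False),
]
print(should_fail_over(heartbeat_ok=True, resources=hung_node))  # False
```

In this model, a frozen server that keeps heartbeating looks healthy, which matches the symptom described in the thread.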
  • I completely agree with your statement. However, in the Cluster Management tool we recently changed all the Exchange cluster dependencies to fail over in case of any failure (IS, SA, SG), and I still do not understand why these failures do not cause an automatic failover.

    Any other ideas or suggestions?

    Are there any settings you think I should check?


    Franki
    Wednesday, January 19, 2011 6:39 PM
  • It just means that, service-wise, your Exchange services are not failing. Even if the server is unresponsive, those resources are probably still in a functional state.

    So again, I don't think the cluster is working incorrectly. It's just that the unresponsiveness does not affect the cluster resources, so the cluster still thinks the node is alive and well.

    I think you need to go through the system logs for those downtimes and try to identify what is failing. Maybe some application is using 100% CPU, which technically doesn't make the cluster unavailable in terms of the cluster resources.

    I think you're at the point where this isn't a cluster/Exchange issue. It's more likely a server or application issue that makes the server unresponsive without affecting the cluster resources.

    Sorry, I don't think I can help you much more. You need to identify the source of the issue rather than why the cluster is not failing over. The cluster not failing over is a symptom, not the cause.

    Wednesday, January 19, 2011 6:45 PM
  • Thanks for the update,

    Am I correct in saying that "if your active node goes offline, but the passive node and FSW are online", that should not affect your environment, and an automatic failover would take place?

    Each time an incident happens, users are unable to access their mailboxes. The funny part of the story is that when you RDP into the passive node and run all the PowerShell commands, all tests pass successfully and there are no errors in the event logs. In the end, the only solution available is to manually fail over to the passive node and reboot the other node.

    Correct me if I am wrong, but isn't that strange behaviour?


    Franki
    Wednesday, January 19, 2011 6:56 PM
  • You are correct in saying that.

    Since the active node is the one not serving up connections, wouldn't it be better to TS into the active node, and run the different Test-* cmdlets to see what's up?

    Wednesday, January 19, 2011 7:12 PM
  • "If your active node goes offline, but passive node and FSW are online" that should not affect your environment and an automatic failover would take place? - Correct, but your active node is NOT going offline; it's unresponsive. Big difference.

    "Correct me if I am wrong, but is not an strange behaviour?" - Yes.

    But again, it can be as simple as some other application running on the server that is causing the issue. It can be network related. It can be all kinds of issues that won't cause the cluster service to initiate a failover.

    You basically confirmed that, Exchange- and cluster-wise, everything is fine even though the mailboxes are unavailable.

    My only suggestion is to configure performance counters to log the important components, so that you can see whether resource utilization is very high when the unresponsiveness occurs.

    Wednesday, January 19, 2011 7:16 PM
  • All the different Test-* cmdlets passed successfully from the passive node. In fact, once I manually fail over to the passive node, users get access to their mailboxes again.

    So, at this point, should I start looking at Task Manager for any CPU/memory spikes?

    Any other suggestions?


    Franki
    Wednesday, January 19, 2011 7:17 PM
  • Yes, and the most important one for Exchange, imo, is disk latency/performance.

    So CPU, memory, disk, and network. It can basically be any of those.

    Wednesday, January 19, 2011 7:20 PM
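To make the suggestion above concrete, here is a minimal Python sketch that checks one logged sample against per-counter limits. The counter paths are standard Windows performance counter names, but the limits are illustrative assumptions rather than official guidance, and `THRESHOLDS` and `flag_sample` are hypothetical names. Network counters could be added the same way.

```python
# Illustrative limits only (assumptions); tune them for your own environment.
THRESHOLDS = {
    r"\Processor(_Total)\% Processor Time": ("above", 90.0),       # percent
    r"\Memory\Available MBytes": ("below", 100.0),                 # megabytes
    r"\PhysicalDisk(_Total)\Avg. Disk sec/Read": ("above", 0.020),   # seconds
    r"\PhysicalDisk(_Total)\Avg. Disk sec/Write": ("above", 0.020),  # seconds
}


def flag_sample(sample: dict[str, float]) -> list[str]:
    """Return the counters in one sample that cross their assumed limit."""
    flagged = []
    for counter, value in sample.items():
        rule = THRESHOLDS.get(counter)
        if rule is None:
            continue  # no limit defined for this counter
        direction, limit = rule
        too_high = direction == "above" and value > limit
        too_low = direction == "below" and value < limit
        if too_high or too_low:
            flagged.append(counter)
    return flagged


# Hypothetical sample taken while a node was unresponsive.
sample = {
    r"\Processor(_Total)\% Processor Time": 12.0,
    r"\PhysicalDisk(_Total)\Avg. Disk sec/Read": 0.450,
}
print(flag_sample(sample))
```

With a sample like this one, low CPU and very high disk read latency would point at the disk subsystem rather than at Exchange or the cluster.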
  • What type of performance counters would you recommend monitoring?

    Any specific perf counters?


    Franki
    Wednesday, January 19, 2011 7:21 PM
  • Wednesday, January 19, 2011 7:25 PM
  • Quote: “no errors from eventlogs”

    Is there no error event in the Application log on the active node after you reboot it following the freeze?

    Have you run ExBPA against the CMS as a health check?

    James Luo

    Thursday, January 20, 2011 5:27 AM
  • Hi Jader3rd,

    Will the CMS fail over from the active node to the passive node if it cannot see the file share witness, or if there is a problem in the private heartbeat network?


    /* Server Support Specialist */

    Tuesday, August 19, 2014 6:43 AM