using DAG for database replication only and total manual operation


  • April 16, 2012 1:32 PM
     
     

    I have two mailbox servers in two sites, both hosting active mailboxes and users.  After looking over what it would take to set up a two-member DAG, I think it is not for us.  My reasoning is the following:

    It seems that if the WAN goes out for whatever reason, two things would happen:

    1) The server at the site with no FSW (secondary site) would go offline.
    2) The server at the site with the FSW (primary site) would want to activate the database copies of the other server.
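
    The first point follows from majority voting. Below is a minimal sketch of that arithmetic in Python, assuming a plain "strict majority of all votes" rule with three votes (two servers plus the FSW). The function name `has_quorum` and the vote counts are illustrative only and are not part of any Exchange or cluster API.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """A partition keeps quorum only with a strict majority of all votes."""
    return reachable_votes > total_votes // 2


# Assumed layout: primary server + secondary server + file share witness.
TOTAL_VOTES = 3

# WAN down: the primary site still sees itself and the FSW (2 votes).
primary_up = has_quorum(reachable_votes=2, total_votes=TOTAL_VOTES)

# WAN down: the secondary site sees only itself (1 vote).
secondary_up = has_quorum(reachable_votes=1, total_votes=TOTAL_VOTES)

print(primary_up, secondary_up)  # True False
```

    Under these assumptions the site that holds the FSW keeps running, while the other site drops out even though nothing is wrong locally.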

    In our case the sites are fairly independent, and there really is no reason why the users at the secondary site should be inconvenienced by a WAN outage.  With the DAG, if the WAN goes out I would need to force the secondary server online (using an alternate FSW, I think) and I would also have to make sure the primary server does not mount/activate the secondary server's databases.  To make things worse, I would then have to undo all this when the WAN comes back up!  All this seems like too much hassle, since WAN outages are infrequent and, when they do happen, they only last a few minutes or hours.  In most cases I would rather wait it out than do any type of switchover.


    What I want is for the DAG to be a totally manual operation.  Basically, when the WAN goes out I want absolutely nothing to happen.  No servers going offline, no databases trying to activate, nothing.  I want the DAG to merely replicate databases, and then I would manually activate only when I felt it was necessary.  I know it kind of goes against what most people want out of a DAG, but I think this should be doable?  No?  Any and all ideas would be greatly appreciated.

    Thanks,


    Rgds, Diego

All Replies

  • April 16, 2012 5:02 PM


    Continuous replication uses the cluster database to communicate state among the nodes. It depends on the DAG members being in the same Windows failover cluster. It would not be possible to do what you want.
  • April 16, 2012 6:17 PM
     
     

    Hello Jared, and thanks for your response.  Maybe I didn't phrase my question as clearly as I should have.  I realize that I have to use a DAG and I am OK with that.  I would just like to modify its behaviour in a way that suits my particular situation.

    I believe that there is a cmdlet that can be used to block any database from being activated or being considered for activation.  If it works like I think it does, that would resolve half the problem.

    The other half would be to somehow prevent a node from dismounting databases and going offline when the DAG loses quorum.  If there isn't a way to force this behaviour, maybe there is a way to quickly get it back online without quorum?

    Hopefully there aren't any other issues that I am overlooking.


    Rgds, Diego

  • April 16, 2012 6:26 PM

     Answered

    Hi Diego

    I think this is along the lines of what you are looking for: http://eightwone.com/2010/08/24/blocking-automatic-activation-in-dags/

    Cheers, Steve

  • April 16, 2012 6:47 PM
     
     
    That definitely will help with part of the problem!  Now I need to find a way to keep the secondary server from going offline.  Thank you.

    Rgds, Diego

  • April 16, 2012 6:51 PM


    >>That definitely will help with part of the problem!  Now I need to find a way to keep the secondary server from going offline.  Thank you.

    >>Rgds, Diego


    Using a single DAG I don't think it is possible.
  • April 18, 2012 4:46 PM


    What about a cluster communications timeout setting?  This should exist, no?  If I can find this and set it to a long period of, say, 10-12 hours, then I think I would in effect keep the nodes up during a WAN failure.

    Rgds, Diego

  • April 18, 2012 5:16 PM



    You want the stores to dismount and go offline when quorum is lost, to protect against split-brain.

    I'm not sure what you are trying to accomplish, but you should be using DAC mode and doing a datacenter switchover if required. This is a manual process.

    The timeouts are listed here: 

  • April 19, 2012 5:34 PM
     
     

    I am simply trying to leverage the database replication features of a DAG without the hassles and ramifications that cluster failovers would bring.  I figure that most failures can be resolved within a few hours, which is well within our SLA targets.  This way we can still have offsite data copies and backup server capability without having to deal with the server and site failovers that might occur when running the DAG.

    I figure I can use this to prevent split-brain: "Set-MailboxServer -Identity <ServerID> -DatabaseCopyAutoActivationPolicy Blocked".

    Then I need a way to block a server from going offline if DAG loses quorum but it seems like that is not possible.  So I thought a long timeout would do it but from what you provided it seems like the max is 40 seconds which is way too short.

    What I am thinking now is to maybe drop the FSW into a third location which both DAG members can reach independently of the link they use to talk to each other.  Maybe that way even if the DAG members can't talk to each other they can each talk to the FSW and that will keep them from going offline.   

     

    Rgds, Diego


    • Edited by tato386, April 19, 2012 5:35 PM
  • April 19, 2012 6:17 PM



    >>What I am thinking now is to maybe drop the FSW into a third location which both DAG members can reach independently of the link they use to talk to each other.  Maybe that way even if the DAG members can't talk to each other they can each talk to the FSW and that will keep them from going offline.

    This would work if you could find a way to get it going.  You could use an alternate FSW in your main site to keep those databases online in the event of a massive WAN outage.

  • April 19, 2012 6:34 PM
     
     

    >>I am simply trying to leverage the database replication features of a DAG without the hassles and ramifications that cluster failovers would bring.  I figure that most failures can be resolved within a few hours, which is well within our SLA targets.  This way we can still have offsite data copies and backup server capability without having to deal with the server and site failovers that might occur when running the DAG.

    >>I figure I can use this to prevent split-brain: "Set-MailboxServer -Identity <ServerID> -DatabaseCopyAutoActivationPolicy Blocked".

    >>Then I need a way to block a server from going offline if DAG loses quorum but it seems like that is not possible.  So I thought a long timeout would do it but from what you provided it seems like the max is 40 seconds which is way too short.

    >>What I am thinking now is to maybe drop the FSW into a third location which both DAG members can reach independently of the link they use to talk to each other.  Maybe that way even if the DAG members can't talk to each other they can each talk to the FSW and that will keep them from going offline.

    >>Rgds, Diego



    To what end? I still don't get it.  The stores won't be automatically mounted in the other site anyway if you have disabled automatic activation and/or they don't have quorum.


    • Edited by Andy D-MVP, April 19, 2012 7:12 PM
    • Edited by Andy D-MVP, April 19, 2012 7:13 PM
  • April 19, 2012 7:06 PM
     
     

    Let me try to explain.  Site A has active users and an Exchange server.  Site B has active users and an Exchange server.  The sites are geographically separated by 1000 miles, one in the North and one in the South.  There is a WAN that connects the sites for inter-site communication, but nothing business critical runs on the WAN.  Now, since both sites are running Exchange 2010, we think, hey, let's take advantage of the DAG features and replicate our Exchange data to the other site.  This way, if there is ever a catastrophic issue on one side, we can use the other as a temporary disaster recovery site.  Sounds reasonable, right?

    Now let's say we use one DAG, 2 servers and one FSW because we don't want to implement (and maintain) half a dozen servers, CAS arrays, hub transport and all that jazz. Let's say the North site has a WAN outage.  It's rare, but when it does happen, the telco sends a tech out and within a few hours the site is back up.  Or maybe Cisco sends out another router in a few hours.  Users in the South get kicked off Exchange and ask (rightly so, I may add), "Why do we go offline if the North has an issue? There is nothing wrong down here!"

    Now we have to explain about quorums and so on.  We have to explain that we have to reconfigure our DAG for an alternate FSW, run a few EMS commands and so on.  Maybe the one or two guys that know how to do this are not instantly available.  So now we have incurred downtime at both sites instead of just one.  Then the following day, when the WAN is back up or the Cisco part has arrived, we have to undo all this!  This is what I want to avoid.

    Seems like the only piece I need now is to keep the servers from going offline if WAN goes out.  Since the cluster timeout is too short I guess I need to study having the FSW at a third site which will be available to both sites via alternate link.  Maybe a VPN tunnel over Internet.  Suggestions?


    Rgds, Diego

  • April 19, 2012 7:08 PM



    Also, I think it's worth mentioning that putting the FSW in a third location is generally a bad idea. In fact:

    Misconception Number 2: Microsoft recommends that you deploy the Witness Server in a third datacenter when extending a two-member DAG across two datacenters
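
    To make the reasoning concrete, here is a rough Python sketch of a two-member DAG whose members cannot see each other but can both reach a third-site witness. It rests on a simplifying assumption: the witness vote goes only to the single member that wins the lock on the witness, so it cannot be counted by both sides at once. The function `surviving_nodes`, the site names, and the `lock_winner` parameter are hypothetical and exist only for illustration.

```python
def surviving_nodes(nodes, witness_reachable_by, lock_winner):
    """Return the nodes that keep quorum when no node can reach another node.

    Each isolated node counts its own vote. The witness adds one vote,
    but only for the single node that holds the witness lock.
    """
    total_votes = len(nodes) + 1  # every node plus the witness
    survivors = []
    for node in nodes:
        votes = 1  # the node's own vote
        if node == lock_winner and node in witness_reachable_by:
            votes += 1  # witness vote, granted to the lock holder only
        if votes > total_votes // 2:  # strict majority required
            survivors.append(node)
    return survivors


nodes = ["north", "south"]

# Both sites can reach the third-site witness, but only one wins the lock.
print(surviving_nodes(nodes, witness_reachable_by={"north", "south"},
                      lock_winner="north"))  # ['north']
```

    Under that assumption, one member still loses quorum during the WAN outage, so the third-site witness does not keep both sites online.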


    • Edited by Andy D-MVP, April 19, 2012 7:08 PM
  • April 19, 2012 7:46 PM
     
     

    Honestly? It sounds like you need 2 DAGs, one in each location. Along with the HTs and CAS in each site so each can operate independently of each other.

    For DR, add an additional node to each DAG in the other respective site, replicate the stores across and disable automatic activation if required.

    Otherwise, trying to game the cluster and DAG logic will lead only to heartache I suspect.

    Just my thoughts.

  • April 19, 2012 8:44 PM


    It is clear from the misconception article that an FSW at a third site won't get me what I am looking for.  I will focus my attention on setting up redundancy on the WAN, which seems to be the weak link.  No appetite on my end for setting up so many servers.  Thanks.

    Rgds, Diego

  • April 20, 2012 6:25 AM
     
     

    Hi,

    I recommend keeping the active databases on one server. If it loses the site connection, we can manually use an alternate FSW to bring the cluster online.

    After the connection comes back, we can switch over to the primary datacenter.


    Xiu Zhang

    TechNet Community Support

  • April 24, 2012 7:21 AM
     
     
    Any update?

    Xiu Zhang

    TechNet Community Support

  • April 24, 2012 12:55 PM

     Answered

    Hello Xiu,

    I did not find a perfect solution, but I did find something that seems workable.  For the problem of databases going active when I don't want them to, I am using the "Suspend-MailboxDatabaseCopy -Identity dbname\servername -ActivationOnly" command.  The only drawback is that while this suspension is in effect it cannot be overridden from the GUI.  I must first use "Resume-MailboxDatabaseCopy -Identity dbname\servername".

    Afterwards I found that databases will still not activate if they can't communicate with the other server.  This seems odd, since you want to activate databases when servers go down!  Since communication between the servers is broken, I need to activate using this: "Move-ActiveMailboxDatabase -Identity dbname -ActivateOnServer servername -MountDialOverride BestEffort -SkipHealthChecks -SkipLagChecks -SkipActiveCopyChecks".

    I would have to do this only in situations where we determined that an outage is going to last long enough that we want to move clients over to the alternate server.  Remember we don't have a dedicated DR server, we have two active mailbox servers, and we do not want to implement a CAS array. We don't want to reconfigure clients for short term outages which in our case short term could be up to one full working day. Anything less than that and we would rather focus on fixing the outage than moving clients around.

    Secondly I found that I can use "net start clussvc /forcequorum" on the secondary server when it does not have communication to either the primary server or the FSW.  When I do this I can mount the databases and client access and mail flow seem to work.

    So, bottom line: if the WAN goes out I will have to take some manual actions to get the secondary server online, but it is not too bad.  Nevertheless, I will work on setting up a secondary backup WAN connection so that the chances of having to go through these procedures are minimized.  In the future I hope that Microsoft will add some features to the DAG that allow for easier management of outages in DAG environments that do not have CAS arrays and dedicated DR servers.

    Thanks to all for your help and suggestions.


    Rgds, Diego

  • April 24, 2012 1:54 PM
     
     

    Just one more thing before you go  :)

    I would implement DAC mode; that way you can use the built-in PowerShell cmdlets to "clean up" and handle the quorum if you need to force things.