AlwaysOn Disaster/recovery Testing

    Question

  • Hi,

    I have an AlwaysOn environment with the following setup:

    Node        Site                Quorum vote
    srv-1       primary             1
    srv-2       primary             1
    srv-3       primary             1
    drsrv-1     secondary           1
    drsrv-2     secondary           1
    fileshare   primary/secondary   1

    I have synchronous replication with automatic failover between the hosts on the primary site, and asynchronous replication with manual failover to the secondary site.

    This works well until we do DR tests, then we get into trouble.

    Failing over the AGs to DR works fine, but when we cut the line between the two sites, the WFC fails with a lost-quorum message. That is correct, since three votes out of six is not a majority. So I then have to start the cluster in forced quorum mode. When failing back to the primary site, I once again need to take down the cluster and start it in normal mode.
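
    The majority arithmetic behind that lost-quorum message can be sketched in a few lines. This is only an illustrative Python model, not anything the cluster service exposes: the function name `has_quorum` is hypothetical, and the vote counts are taken from the setup listed above, assuming the DR site can still reach the file share after the cut.

```python
def has_quorum(votes_reachable: int, total_votes: int) -> bool:
    """Return True if the reachable votes form a strict majority of all votes."""
    return 2 * votes_reachable > total_votes


# Votes from the setup above: 3 primary nodes, 2 DR nodes, 1 file share witness.
TOTAL_VOTES = 3 + 2 + 1

# DR site after the line is cut: 2 DR nodes plus, at best, the file share.
dr_votes = 2 + 1

# 3 of 6 is exactly half, which is not a strict majority.
print("DR site keeps quorum:", has_quorum(dr_votes, TOTAL_VOTES))
```

    With an odd total, for example five votes, the same three reachable votes would be a majority, which is why adding or removing a single vote changes the outcome.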

    Is there any way to get this working without the cluster going down when I cut the line? Can I in some way "freeze" the cluster on the secondary site before I cut the line, so that the cluster doesn't care about the lost quorum?

    I know that adding another node to the secondary site would solve the problem, and I also know that Windows 2012 is the recommended solution since it has dynamic quorum configuration. But until I have the possibility to do either of those, I would like an interim solution if possible.

    Best Regards,

    Staffan

    Monday, September 30, 2013 7:56 AM

All replies

  • Since you already have 3 nodes on the primary site, you can remove the file share witness altogether. You want to protect the primary site, so that if the connection between the two sites goes down, you still have quorum on the primary site.

    Edwin Sarmiento, SQL Server MVP
    SQL Server High Availability and Disaster Recovery Deep Dive Course

    Monday, September 30, 2013 4:31 PM
  • Yes.  You don't want the secondary site to come online automatically when the link between the primary and secondary site is lost. 

    David http://blogs.msdn.com/b/dbrowne/

    Monday, September 30, 2013 4:36 PM
  • Yes, if the connection fails while my AGs are running on the primary side, that would work without the file share witness. But my problem is when I have manually failed over to the secondary site for a DR test and then cut the line. The secondary site then no longer has a majority, so the cluster goes down and my AGs go down with it. I have to start the cluster in forced quorum mode, and then shut it down again when the line is back up so that I can start it in normal mode and fail back my AGs.

    Monday, September 30, 2013 5:40 PM
  • No, that's correct, but I also want my secondary site to stay online after a manual failover when the line is cut.

    /Staffan

    Monday, September 30, 2013 5:41 PM
  • For manual DR, you should remove the quorum votes from the DR nodes, and reconfigure quorum voting during a DR failover.

    From: Quorum considerations for disaster recovery configurations

    For Manual Failover:

    Node vote assignment

    • Node votes should not be removed from nodes at the primary site, SiteA
    • Node votes should be removed from nodes at the backup site, SiteB
    • If a long-term outage occurs at SiteA, votes must be assigned to nodes at SiteB to enable a quorum majority at that site as part of recovery
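
    The effect of that vote assignment can be illustrated with a small model. This is a sketch under stated assumptions, not cluster tooling: the helper `holds_majority` is hypothetical, the node names come from the original question, and the file share witness is left out, as suggested earlier in the thread.

```python
def holds_majority(votes, reachable):
    """True if the reachable members hold more than half of all assigned votes."""
    total = sum(votes.values())
    held = sum(vote for name, vote in votes.items() if name in reachable)
    return 2 * held > total


SITE_A = {"srv-1", "srv-2", "srv-3"}   # primary site
SITE_B = {"drsrv-1", "drsrv-2"}        # backup site

# Normal operation: SiteA keeps its votes, SiteB votes are removed.
normal = {**{n: 1 for n in SITE_A}, **{n: 0 for n in SITE_B}}

# Recovery after a long-term SiteA outage: votes are assigned to SiteB instead.
recovery = {**{n: 0 for n in SITE_A}, **{n: 1 for n in SITE_B}}

print("SiteA alone, normal votes:  ", holds_majority(normal, SITE_A))
print("SiteB alone, normal votes:  ", holds_majority(normal, SITE_B))
print("SiteB alone, recovery votes:", holds_majority(recovery, SITE_B))
```

    With the normal assignment, losing the link never costs SiteA its quorum, and SiteB cannot come online on its own until votes are assigned to it.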

    David http://blogs.msdn.com/b/dbrowne/

    Monday, September 30, 2013 7:08 PM
  • Hi again,

    I think I'm starting to get the picture now.

    For my manual failover and DR test, the best solution is probably to remove the quorum votes on the primary site after I have failed over everything but before pulling the plug. Then the cluster on the secondary site will stay online during testing. The primary site will then fail, but that's no problem.
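
    As a rough check of that test sequence, the steps can be walked through in a small Python model. Everything here is an assumption made for illustration: the helper names (`majority`, `assign`, `side_view`) are hypothetical, the file share witness is ignored, and a cut link is modelled as each site seeing only its own nodes.

```python
PRIMARY = ("srv-1", "srv-2", "srv-3")
DR = ("drsrv-1", "drsrv-2")


def majority(votes, reachable):
    """True if the reachable nodes hold a strict majority of all assigned votes."""
    total = sum(votes.values())
    held = sum(votes[node] for node in reachable)
    return 2 * held > total


def assign(primary_vote, dr_vote):
    """Build a node -> vote mapping for the two sites."""
    votes = {node: primary_vote for node in PRIMARY}
    votes.update({node: dr_vote for node in DR})
    return votes


def side_view(votes, site, link_up):
    """Quorum as seen from one site: all nodes if the link is up, else only its own."""
    reachable = PRIMARY + DR if link_up else site
    return majority(votes, reachable)


# (step description, vote assignment, is the inter-site link up?)
steps = [
    ("AGs failed over to DR, votes unchanged", assign(1, 1), True),
    ("votes removed on primary site", assign(0, 1), True),
    ("line cut for the DR test", assign(0, 1), False),
    ("line restored, primary votes reassigned", assign(1, 1), True),
]

for name, votes, link_up in steps:
    primary_ok = side_view(votes, PRIMARY, link_up)
    dr_ok = side_view(votes, DR, link_up)
    print(f"{name}: primary quorum={primary_ok}, DR quorum={dr_ok}")
```

    In this model the DR side keeps quorum while the line is cut only because the primary votes were removed first; with the votes unchanged, the DR side would hold two of five votes and lose quorum.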

    After I have connected the line again, I can bring those cluster nodes up again before failing back.

    But I do have some services on the secondary site that are only local, and if the line between the sites is cut by an operator for maintenance or something, I still have a problem because those services will fail.

    I'm beginning to think that maybe two separate clusters with AlwaysOn replicas in between would be a better solution.

    I will have a chat with someone at SQL PASS Summit in Charlotte next week and see if I can sort it out.

    Thanks,

    Staffan

    Wednesday, October 09, 2013 8:37 AM
  • Be sure to check out this session

    CAT: AlwaysOn Customer Panel – Lessons Learned & Best Practices [DBA-303-M]

    http://www.sqlpass.org/summit/2013/Sessions/SessionDetails.aspx?sid=5220


    Edwin Sarmiento, SQL Server MVP

    Wednesday, October 09, 2013 7:16 PM