Exchange 2013 DAG – Datacenter Failover and Disaster Recovery

Although there are many articles on the internet about datacenter failover and site resilience, I thought I would summarize them in a short note covering what is needed during a failover, instead of spending 2 to 3 hours reading to get the concepts.

Exchange 2013 Terminology

A few terms every Exchange administrator should know about their environment:

Primary Active Manager (PAM) runs inside the Microsoft Exchange Replication service and is used to detect and react to server failures. The PAM role is held by the DAG member that owns the cluster quorum resource, and it holds the information about active, passive and mounted databases.

Standby Active Manager (SAM) tells the Client Access and Transport services which server hosts the active copy of a mailbox database.

Datacenter Activation Coordination (DAC) mode uses the Datacenter Activation Coordination Protocol (DACP) to avoid split brain. When a DAG is running in DAC mode and a server reboots, Active Manager starts with the DACP bit set to 0 (databases stay dismounted). It then communicates with the other members of the DAG. When one of them responds with its bit set to 1, the restarted server sets its own bit to 1 and is allowed to mount databases.
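The DACP startup decision described above can be pictured with a small function. This is only an illustrative model and not Exchange code: the name Test-DacpCanMount and its parameter are made up for this note, and the real protocol runs inside Active Manager.

```powershell
# Illustrative model only: Test-DacpCanMount is a hypothetical helper, not an Exchange cmdlet.
function Test-DacpCanMount {
    param(
        # DACP bits reported by the other DAG members that answered (0 or 1 each).
        [int[]]$PeerBits = @()
    )

    # After a restart the local bit always begins at 0 (databases stay dismounted).
    $localBit = 0

    # If any reachable member already has its bit set to 1, adopt 1 locally.
    if ($PeerBits -contains 1) {
        $localBit = 1
    }

    # Mounting is allowed only once the local bit is 1.
    return ($localBit -eq 1)
}

Test-DacpCanMount -PeerBits @(0, 1)   # a member with bit 1 answered
Test-DacpCanMount -PeerBits @()       # nobody answered, for example an isolated site
```

The first call returns $true and the second returns $false, which is the behaviour that keeps an isolated site from mounting databases on its own.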

Quorum Details

Odd number of nodes --->  Node Majority

Even number of nodes (but not a multi-site cluster) --->  Node and Disk Majority

Even number of nodes, multi-site cluster --->  Node and File Share Majority

Even number of nodes, no shared storage  ---> Node and File Share Majority
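The list above can be written down as a small lookup function. This is a rough sketch: Get-ExpectedQuorumModel and its parameters are hypothetical names used only for illustration, and the function does not query a real cluster.

```powershell
# Hypothetical helper that encodes the list above; it does not talk to a cluster.
function Get-ExpectedQuorumModel {
    param(
        [Parameter(Mandatory=$true)][int]$NodeCount,
        [switch]$MultiSite,
        [switch]$SharedStorage
    )

    # Odd number of nodes: the nodes alone can always form a majority.
    if (($NodeCount % 2) -eq 1) { return 'Node Majority' }

    # Even number of nodes: a disk witness only applies to a single-site
    # cluster with shared storage; otherwise a file share witness is used.
    if ($SharedStorage -and -not $MultiSite) { return 'Node and Disk Majority' }
    return 'Node and File Share Majority'
}

Get-ExpectedQuorumModel -NodeCount 3
Get-ExpectedQuorumModel -NodeCount 4 -MultiSite
```

DAG members do not use shared storage, so for a DAG the outcome is either Node Majority (odd member count) or Node and File Share Majority (even member count).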

Continuous replication starts in file mode, where each closed 1 MB transaction log file is copied to the passive database copy. Once the passive copy has caught up, replication switches to block mode, where updates are sent to the passive copy immediately.

Port 3343 is used by the cluster nodes to listen for incoming connections from the other DAG members.

I believe that is enough on definitions, so let us move on to what we actually do in our Exchange infrastructure. It is always good to keep documentation of the component information below, which will help if our servers are hit by a disaster.

Verification of Exchange 2013 DAG Components:

Primary Active Manager:

To verify PAM

Get-DatabaseAvailabilityGroup <DAGName> -Status | FL Name, PrimaryActiveManager

To move the PAM to a different DAG member

cluster group "Cluster Group" /MoveTo:<DAGServerName>

AutoDatabaseMountDial: 

Get-MailboxServer <MailboxServerName> | FL Name, AutoDatabaseMountDial

 

BestAvailability - Copy queue length ≤ 12 logs

GoodAvailability (default) - Copy queue length ≤ 6 logs

Lossless - Copy queue length of zero logs
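To make the thresholds concrete, here is a small sketch that answers "would this copy be allowed to mount automatically?" for a given dial setting and copy queue length. Test-AutoMountAllowed is a hypothetical helper written for this note. It only encodes the three limits listed above and ignores everything else Active Manager looks at.

```powershell
# Hypothetical helper; the limits are the ones listed above.
function Test-AutoMountAllowed {
    param(
        [Parameter(Mandatory=$true)]
        [ValidateSet('BestAvailability', 'GoodAvailability', 'Lossless')]
        [string]$Dial,

        [Parameter(Mandatory=$true)]
        [int]$CopyQueueLength
    )

    # Maximum number of missing logs tolerated by each dial setting.
    $maxLogs = @{ BestAvailability = 12; GoodAvailability = 6; Lossless = 0 }

    return ($CopyQueueLength -le $maxLogs[$Dial])
}

Test-AutoMountAllowed -Dial GoodAvailability -CopyQueueLength 4
Test-AutoMountAllowed -Dial Lossless -CopyQueueLength 1
```

The first call returns $true (4 is within 6 logs) and the second returns $false (Lossless tolerates no missing logs).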

 

Datacenter Activation Coordination (DAC) 

Get-DatabaseAvailabilityGroup -Identity <DAGName> | FL Name, DatacenterActivationMode

To verify Quorum

cluster /quorum

 

To verify Continuous Replication Mode 

Get-Counter -ComputerName <ServerName> -Counter "\MSExchange Replication(*)\Continuous replication - block mode Active"

 

To check replication network 

Get-MailboxDatabaseCopyStatus -Server <ServerName> -ConnectionStatus | FL Name, IncomingLogCopyingNetwork, SeedingNetwork

To Check DagNetworkConfiguration

Get-DatabaseAvailabilityGroup | FL Name, ManualDagNetworkConfiguration

Check the Exchange server location in AD site 

Get-ExchangeServer -Identity <ServerName> -Status | FL Name, Site
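Since the point of these checks is to have the information written down before a disaster, the individual commands can be gathered into one baseline report. The following is only a sketch, assuming it runs in the Exchange Management Shell. The DAG name, the report path and the choice of properties are placeholders to adapt, and the target folder must already exist.

```powershell
# Sketch only: run in the Exchange Management Shell. $DagName and $ReportPath are placeholders.
$DagName    = 'DAG01'
$ReportPath = 'C:\Temp\DAG-Baseline.txt'

$dag = Get-DatabaseAvailabilityGroup -Identity $DagName -Status

# DAG-level settings: PAM, DAC mode, witness servers, network configuration mode.
$dag | Format-List Name, PrimaryActiveManager, DatacenterActivationMode,
    WitnessServer, AlternateWitnessServer, ManualDagNetworkConfiguration |
    Out-File -FilePath $ReportPath

# Per-member settings: mount dial, AD site, and the networks used for log copying and seeding.
foreach ($member in $dag.Servers) {
    Get-MailboxServer -Identity $member.Name |
        Format-List Name, AutoDatabaseMountDial |
        Out-File -FilePath $ReportPath -Append

    Get-ExchangeServer -Identity $member.Name |
        Format-List Name, Site |
        Out-File -FilePath $ReportPath -Append

    Get-MailboxDatabaseCopyStatus -Server $member.Name -ConnectionStatus |
        Format-List Name, Status, IncomingLogCopyingNetwork, SeedingNetwork |
        Out-File -FilePath $ReportPath -Append
}
```

Keeping a copy of the report somewhere outside the Exchange servers makes sense, because it is needed precisely when those servers are unavailable.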

Datacenter SwitchOver 

When a disaster such as a power outage or server failure takes down the primary site, where the majority of the DAG nodes sit, follow the steps below.

  • Verify the started and stopped servers in the DAG

Get-DatabaseAvailabilityGroup <DAGName>  -Status | FL Name, *Servers

  • Use Stop-DatabaseAvailabilityGroup to mark the primary site DAG members as stopped (failed)

Stop-DatabaseAvailabilityGroup -Identity <DAGName> -ActiveDirectorySite PrimarySite

  • Verify the started and stopped servers in the DAG again

Get-DatabaseAvailabilityGroup <DAGName>  -Status | FL Name, *Servers

  • Stop the Cluster service on every DAG member in the secondary site

Stop-Service ClusSvc

  • Use Restore-DatabaseAvailabilityGroup to remove the stopped mailbox servers from the DAG and re-establish quorum using the alternate witness server

Restore-DatabaseAvailabilityGroup <DAGName> -ActiveDirectorySite DR

  • When service or power is restored in the primary site, run Start-DatabaseAvailabilityGroup to bring its servers back into the DAG

Start-DatabaseAvailabilityGroup <DAGName> -ActiveDirectorySite PrimarySite

  • Check the quorum model

Get-ClusterQuorum | fl

  • If it still shows the older quorum model, run the PowerShell cmdlet below

Set-DatabaseAvailabilityGroup -Identity DAG01
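The activation steps above can be put together as one rough sequence. This is a sketch under assumptions, not a tested runbook: the DAG and site names are placeholders, and it assumes the primary site servers are completely unreachable. That is why Stop-DatabaseAvailabilityGroup is shown with -ConfigurationOnly, which records the stopped state in Active Directory without trying to contact those servers.

```powershell
# Sketch only: placeholders throughout; review against your own environment before running.
$DagName     = 'DAG01'
$PrimarySite = 'PrimarySite'   # AD site that failed
$DrSite      = 'DR'            # AD site being activated

# 1. Record which members the DAG considers started and stopped.
Get-DatabaseAvailabilityGroup -Identity $DagName -Status |
    Format-List Name, StartedMailboxServers, StoppedMailboxServers

# 2. Mark the primary site members as stopped. -ConfigurationOnly updates
#    Active Directory only, which fits the case where those servers are down.
Stop-DatabaseAvailabilityGroup -Identity $DagName -ActiveDirectorySite $PrimarySite -ConfigurationOnly

# 3. Stop the Cluster service (repeat on each surviving DAG member in the DR site).
Stop-Service -Name ClusSvc

# 4. Evict the stopped members and re-establish quorum in the DR site.
Restore-DatabaseAvailabilityGroup -Identity $DagName -ActiveDirectorySite $DrSite

# 5. Confirm the result: quorum model and database copy state on this server.
Get-ClusterQuorum | Format-List
Get-MailboxDatabaseCopyStatus -Server $env:COMPUTERNAME |
    Format-Table Name, Status, CopyQueueLength
```

If no alternate witness server was configured on the DAG beforehand, Restore-DatabaseAvailabilityGroup also accepts -AlternateWitnessServer and -AlternateWitnessDirectory to supply one at this point.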