Replication writer failing

    Question

  • I have a performance issue that I believe, without prejudicing feedback, is going to be related to NetApp SnapManager for Exchange. Just in case this isn’t related to this backup software I wanted to get feedback from other Exchange experts. Our environment, at a Mailbox level, consists of two Mailbox servers in a production datacentre and one Mailbox server in an offsite datacentre comprising a 3 node dag. All databases are mounted on the two production servers and are roughly evenly split in terms of which server has mounted copies. The server OS is Server 2008 R2 Enterprise SP1, Exchange is 2010 SP2 RU3.

    With no distinct pattern I see the performance of either production Mailbox server drop to a point that the end user experience is impacted. On investigation I can see that the databases that are mounted by the server experiencing the issue are in a healthy state and mounted. The passive copies of those databases, held on the two remaining servers, will have high queue lengths. Navigating round the server is painfully slow. If I try and put the server with an issue into maintenance mode (to move the active database gracefully) it will fail with “…WARNING: An error occurred while communicating with the Microsoft Exchange Replication service…”. Running “vssadmin list writers” will show the writer isn’t healthy. If I try and activate the databases from another server via the EMC this will fail with a similar error too. The only option I have is to force the server to shut down.

    I can see errors that I have researched outlined below. The repetition of these doesn’t always link to a performance hit. The articles I can see aren’t necessarily related to NetApp but do indicate backup issues.

    Log Name:      Application
    Source:        ESE BACKUP
    Date:          07/11/2012 07:07:56
    Event ID:      914
    Task Category: General
    Level:         Warning
    Keywords:      Classic
    User:          N/A
    Computer:      XXXXXX
    Description: Information Store (1508) The surrogate backup by XXXXXX has stopped with error 0xFFFFFFFF.

    http://social.technet.microsoft.com/Forums/lt/exchange2010/thread/5f57f48c-9e65-4252-afd2-67c7ebd75a3c

    “…Problem solved we switched to stream backup instead of a back with one session with net backup…”

    Log Name:      Application
    Source:        ESE
    Date:          07/11/2012 07:07:56
    Event ID:      215
    Task Category: Logging/Recovery
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:      XXXXXXXXXXX
    Description: Information Store (1508) *********: The backup has been stopped because it was halted by the client or the connection with the client failed. http://www.microsoft.com/technet/support/ee/transform.aspx?ProdName=Exchange&ProdVer=6.5.6940.0&EvtID=215&EvtSrc=ESE&LCID=1033
    “…The most frequent cause of the error is that a 3rd party online Exchange-aware backup software program has problems…”

    The third party backup software will use the Microsoft Exchange Replication service. Backup failures may/may not be related to the failure of the writer – I can’t pinpoint an exact cause and effect relationship there. What I believe is happening is that the backup software uses the writer and on occasion it’s left in a state that Exchange can’t recover from. As such logs can’t be shipped by this service to the other dag members. This accounts for the high queue lengths on the other two servers.

    I have had the issue before at the company I currently work for and it was resolved by an update to NetApp SnapManager for Exchange. I have also had similar issues in the past working for a managed service company and ended up playing vendor tennis between Microsoft and NetApp. I am mindful though that at some stage in the future there could be a RU or SP that could resolve this. I recall having numerous problems in the past that were resolved by Exchange 2010 SP1 RU4.

    Any thoughts? The storage team who look after the NetApp and backup side of things don’t think there are any issues.

    Tuesday, November 20, 2012 3:53 PM

All replies

  • Hi,

    Thank you for your post.

    This is a quick note to let you know that we are performing research on this issue.

    Thanks,

    Simon

    Thursday, November 22, 2012 2:57 AM
    Moderator
  • Hi,

    From your description, it seems the database copies are not healthy. You can run this cmdlet in EMS to check them: Get-MailboxDatabase | Get-MailboxDatabaseCopyStatus

    In general, if the database copies have failed, you can try a reseed operation to fix them.

    On the other hand, I'd like to know what your primary concern is: the VSS writer or the state of the database copies?

    Thursday, November 22, 2012 4:53 AM
    Moderator
  • Can you post the output of vssadmin list writers
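
    If it helps when posting, a saved copy of that output can be filtered down to the writers that look unhealthy. This is only a rough Python sketch: it assumes the usual English-language layout of vssadmin output (a "Writer name:" line followed by "State:" and "Last error:" lines), and the function name unhealthy_writers is made up for illustration.

```python
import re
from typing import List, Tuple


def unhealthy_writers(vssadmin_output: str) -> List[Tuple[str, str, str]]:
    """Return (name, state, last_error) for every writer that is not
    both 'Stable' and 'No error'.

    Assumes the English-language layout of `vssadmin list writers`.
    """
    found = []
    # Each writer section starts with a "Writer name:" line, so split there.
    for block in re.split(r"(?m)^(?=Writer name:)", vssadmin_output):
        name = re.search(r"Writer name:\s*'([^']*)'", block)
        state = re.search(r"State:\s*\[\d+\]\s*(.+)", block)
        error = re.search(r"Last error:\s*(.+)", block)
        if not (name and state and error):
            continue  # banner text or an incomplete section
        state_text = state.group(1).strip()
        error_text = error.group(1).strip()
        if state_text != "Stable" or error_text != "No error":
            found.append((name.group(1), state_text, error_text))
    return found
```

    The text passed in would be whatever was saved from an elevated prompt, for example with vssadmin list writers > writers.txt.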

    ***Don't forget to mark helpful or answer***

    **Note:(My posts are provided “AS IS” without warranty of any kind)

    Thursday, November 22, 2012 5:15 AM
  • Thanks for the feedback and looking into this further.

    I don’t think the root cause is a problem with the health of the database copies in the first instance. Here is a scenario: server A has 5 mounted databases that replicate to servers B and C. If the replica writer is unhealthy or pauses on server A for whatever reason, then I would expect log shipping to servers B and C to fail yet the active copies on A to remain healthy and mounted – this is exactly what I see. I believe the increased queue lengths and unhealthy copies on the other dag members are a symptom not a cause. Although their passive copies are out of sync, it’s not the destination servers receiving the shipped logs that have the issue; it’s the server with the mounted databases, which is unable to ship the logs. I’ve based the logic on http://msdn.microsoft.com/en-us/library/bb204080.aspx.

    I don’t have the output of the vssadmin list writers at the time of the issue. Invariably when this has happened the server is unresponsive. What might help is the output generated when I tried to put the server with the issue into maintenance mode. This was run from another server as the problem server was unresponsive:

    “WARNING: An error occurred while communicating with the Microsoft Exchange Replication service. The task will mark the database copy SERVER A to suspend it when the service is available. Verify that the service is running. Error: A server-side administrative operation has failed. The Microsoft Exchange Replication service may not be running on server SERVER A. Specific RPC error message: Error 0x6d9 (There are no more endpoints available from the endpoint mapper) from cli_RequestSuspend3”.

    We did have the issue previously and it was resolved by a NetApp SME update.

    Thursday, November 22, 2012 9:59 AM
  • Hi,

    Please let us know:

    1. Did the issue occur on the fix node?

    2. When the database is activated on the affected node, what's the result of Get-MailboxDatabase | Get-MailboxDatabaseCopyStatus? Are all copies healthy? Are there high values for the copy queue and replay queue?

    3. How are things on the affected node when the mailbox databases are activated on a working node?

    In general, if we see a high value for the copy queue, the possible cause is a network issue; if we see a high value for the replay queue, the possible cause is related to the hard disk. We can enable Performance Monitor to find the cause.

    Note: if this is a performance issue, the result of Get-MailboxDatabaseCopyStatus should come back healthy.
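
    The rule of thumb above could be written down roughly like this. It is a Python sketch for illustration only: the threshold of 10 logs and the names triage and QUEUE_THRESHOLD are assumptions, and the two numbers would be read from the CopyQueueLength and ReplayQueueLength columns of Get-MailboxDatabaseCopyStatus.

```python
# Hypothetical threshold (number of log files); not an Exchange default.
QUEUE_THRESHOLD = 10


def triage(copy_queue: int, replay_queue: int,
           threshold: int = QUEUE_THRESHOLD) -> str:
    """Suggest where to look first for one passive database copy."""
    suspects = []
    if copy_queue > threshold:
        # Logs are not reaching the passive copy: the rule of thumb
        # points at the network between the DAG members.
        suspects.append("network")
    if replay_queue > threshold:
        # Logs arrive but are not replayed quickly enough: the rule of
        # thumb points at the disk holding the passive copy.
        suspects.append("disk")
    return "+".join(suspects) if suspects else "ok"
```

    A result of "network" or "disk" is only a starting point for Performance Monitor, not a diagnosis.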

    Monday, November 26, 2012 1:04 AM
    Moderator
  • Hi

    I can’t quite understand your response but will try and interpret this as best I can.

    1-When the issue occurs on the problem node the server will be unresponsive. Its active/mounted copies of databases are healthy. These active copies will be in an unhealthy state on the other two DAG members.
    2-Approximately half of the 10 databases will be active on the server with the issue. The server is unresponsive so I don’t have the information in terms of copystatus etc. However if it’s server A that is having the issue and this server has databases 1-5 mounted then these will be healthy on server A. Users with mailboxes on databases 1-5 will have at best poor Outlook connectivity and more than likely no connectivity. The queue length on Servers B and C for databases 1-5 will be high.

    3-In order to activate the databases that are mounted on the problem server onto another server you will have to power it off. If you try and activate the databases from another server via the EMC it will fail. If you are lucky enough to get the EMS to respond, putting the server in maintenance mode fails with the error: “…WARNING: An error occurred while communicating with the Microsoft Exchange Replication service…”.

    At the time of the issue the memory utilisation will be held by store.exe and the replication writer. I have read that when a snapshot is taken for a backup, new transactions are written to memory which accounts for the elevated store.exe memory utilisation and the poor performance of the server. I have also read that sometimes the flag that is generated to indicate a snapshot is taken doesn’t always clear, which could account for the problem I have.

    Tuesday, November 27, 2012 10:23 AM
  • hi,

    From your description, there are high queue values on five databases on the affected node, so you can't switch over these databases manually from the EMC; this is by design. If you want to, you have to activate these databases in EMS with a special parameter.

    Obviously, we can see there is something wrong on the affected node. Please let us know the result of the cmdlet Get-MailboxDatabase | Get-MailboxDatabaseCopyStatus so that we can see the queue values. You can also run Get-MailboxServer | where {$_.DatabaseAvailabilityGroup -ne $null} | Test-ReplicationHealth | fl to check the state of your DAG nodes.

    In general, you may need to check the application log and system log on the affected node to find the cause. In my experience, backup should not be the cause: it may lock transaction log files, but it won't cause an issue with the MSExchange Replication service.

    On the other hand, the behavior you see on the mailbox servers for store.exe is completely normal. This is by design: ESE (store.exe) can allocate the memory it needs dynamically. ESE will grow the cache to consume almost all available RAM on the server if there is no other memory pressure on the system.

    Wednesday, November 28, 2012 1:58 AM
    Moderator
  • Hi Sun

    Thanks for the reply. I know the switchover functionality is by design but it illustrates the responsiveness of the server, the health of the passive copies and vss writers on the problem node. Running powershell on the affected node will not work as it’s unresponsive. I have checked the application log and researched this extensively. I have even copied in my first post some pertinent logs. If you look at event 215 you will see I reference a technet thread showing the impact of a failed backup on Exchange. When I research this error, as well as other errors, the common factor is the backup software. When I have had this issue in the past, both at my present company and working for a managed service, the issue was resolved by a NetApp SME update. I can find plenty of Symantec KB articles to support this cause and effect relationship but not many for NetApp. Not only from research but from experience, backup software does use and have an impact on the MSExchange Replication Service as previously outlined: “…Exchange Writers coordinate with the Exchange services (operating on behalf of the requestor) to prepare the database files for backups, freeze the IO activity resulting from Exchange transactions before backing up the database, and then to unfreeze and truncate log files after the backup is complete… The Store Writer supports both backups and restores, while the Replication Writer supports only backups…” (http://msdn.microsoft.com/en-us/library/bb204080.aspx)

    I understand that when the backup software uses the replication writer it sets a flag to mark the backup as active that may not always be cleared, which could be my issue. At this point when the writer is being used, new transactions are written to memory and not a log file. This behaviour ties in with high memory usage of the replication writer which is why I mentioned that point. I know it’s normal for store.exe to use as much memory as it can or needs and that it will release it accordingly, but it is not normal to see the replication writer using so much.

    Wednesday, November 28, 2012 11:28 AM
  • Hi,

    So far, we can try to disable/stop all services related to the backup software to narrow down the issue. You can also enable a performance log to monitor the state. In my experience, backup software may lock transaction log files or database files, but it won't cause the Microsoft Exchange Replication service to stop responding.

    Thursday, November 29, 2012 9:06 AM
    Moderator
  • Hello,


    Did the answers help you? Let us know if you need further assistance from us.

    Thursday, December 6, 2012 1:49 AM
    Moderator