DPM 2010 - BMR Consistency Checks and Recovery Points Stalling
-
Wednesday, August 01, 2012 2:14 PM
Hi,
I have DPM 2010 QFE2 (3.0.7707.0) which is successfully backing up 16 client servers. However I have 3 (maybe 4) client servers which are failing on BMR backups and I cannot get to the bottom of it.
When a consistency check or recovery point backup is performed on these servers, the local Windows Server Backup job on the client stalls. It gets as far as copying data and then gets stuck at some random point (usually quite early on). The local job continues to run, doing nothing, and after the 24-hour timeout period the DPM server reports it as timed out. Even then the local backup job is still attempting to run. Cancelling a stalled job via DPM or the WSB GUI has no effect, and the only way to stop the job is to stop the Block Level Backup Engine service, at which point WSB won't run again until the client server has been restarted.
There are no errors that I can find. In the client application event log there is an entry for the backup starting and nothing else relevant after that; there are no errors for DPMRA or VSS. On the DPM server it only errors after 24 hours, creating a timeout message (Failed to create the System State backup within the timeout period. (ID: 30215)). I've looked at the trace logs on the client (WbadminUI.0.etl) and the DPM error logs (in c:\Program Files\Microsoft DPM\DPM\Temp) and whilst I don't fully understand them, there was nothing obviously wrong there that I could see (though the latter is very difficult to read).
Environment:
DPM Server: Physical server - 8 core 2.4Ghz, 12GB RAM backing up to a disk SAN, Windows 2008 R2 SP1
DPM Version: 2010 QFE2 (3.0.7707.0)
Client 1: Windows 2008 R2 SP1 virtual server running on Citrix Xenserver, 2 Cores, 4GB Ram, Role: File Server
Client 2: Windows 2008 Std SP2 virtual server running on Citrix Xenserver, 2 Cores, 3GB Ram, Role: Database Server (SQL 2008)
Client 3: Windows 2008 R2 SP1 virtual server running on Citrix Xenserver, 2 Cores, 2GB Ram, Role: Citrix Web Interface
Client 4: Windows 2008 R2 SP1 Physical server, 8 core 2.53Ghz, 12GB Ram, Role: Domain Controller (I have not done any troubleshooting on this one yet, but it is exhibiting the same symptoms.)
Things I have tried:
I've tried removing the BMR from the protection group and adding it back in again. This triggers a consistency check which does work. Any further backups after that will fail, however. I have done this several times and it's always the same: the initial consistency check works, but any further ones stall.
I've worked through the system state/BMR troubleshooting article on the technet blog:
- Removing BMR from the protection group so that it's only doing a System State backup works OK.
- Adding BMR back to the protection group passes the initial consistency check and then fails any further backups. This suggests that the problem lies somewhere with the BMR backup.
- Performing a manual System State backup using Wbadmin start systemstatebackup -backuptarget:e: works fine
- Performing a manual BMR backup using wbadmin start backup -allcritical -backupTarget:e: works fine. This suggests to me that the issue is not with WSB but something to do with DPM.
- vssadmin list writers whilst the backup is stalled shows no writers in an errored state, but does show several as "waiting for completion"
- vssadmin list shadows shows that a shadow copy has been created at the start of the backup
- I've checked the page file on both the client servers and the DPM server, and they are already set appropriately for DPM
- I've checked the local disk space on the client servers and there is plenty of room
- I've checked the disk allocation on the DPM server and it is fine.
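The VSS checks in the list above can be captured to files in one pass while a backup is stalled, which makes it easier to compare a hung run with a working one. The following is a minimal sketch, assuming it is run from an elevated command prompt on the affected client; the output folder C:\Temp\bmr-diag is a hypothetical name and not something taken from this setup.

```bat
@echo off
rem Snapshot of VSS and backup engine state while a BMR job is stalled.
rem Assumption: C:\Temp\bmr-diag is a hypothetical output folder; change as needed.
set OUT=C:\Temp\bmr-diag
if not exist "%OUT%" mkdir "%OUT%"

rem VSS writer states (check for any writer reporting an error)
vssadmin list writers > "%OUT%\writers.txt"

rem Shadow copies currently present (one should exist for the running backup)
vssadmin list shadows > "%OUT%\shadows.txt"

rem State of the Block Level Backup Engine service (service name: wbengine)
sc query wbengine > "%OUT%\wbengine.txt"
```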
I've tried removing the DPM Remote Agent from the client and reinstalling it - this did not help
I've tried removing a client from the protection group and scratching the old data so that when I added it back in it got a fresh allocation of disk space - this did not help.
I've read the release notes for the later updates to DPM2010 and none of them appear to relate to the issue I have.
Without any errors in the logs to indicate what the problem is, I am now stuck. I find it odd that the BMR does work when it's first added to the protection group, but then fails consistently afterwards. To me this suggests that the system is capable of taking a BMR backup, but something is causing it to hang.
As an interim fix, I'm going to remove the BMR protection for all the affected servers so that I at least get System State backups; however, I need to get the BMR working. Any ideas, thoughts and suggestions would be much appreciated.
Thanks in advance.
All Replies
-
Wednesday, August 01, 2012 5:48 PM Moderator
Hi
This sounds like a TCP Chimney issue. Please disable Chimney and RSS on both the DPM server and the affected servers.
951037 Information about the TCP Chimney Offload, Receive Side Scaling, and Network Direct Memory Access features in Windows Server 2008
http://support.microsoft.com/default.aspx?scid=kb;EN-US;951037
To determine the current status of TCP Chimney Offload and RSS: netsh int tcp show global
To disable chimney: netsh int tcp set global chimney=disabled
To enable chimney: netsh int tcp set global chimney=enabled
To disable RSS: netsh int tcp set global rss=disabled
To enable RSS: netsh int tcp set global rss=enabled
Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.
- Proposed As Answer by Mike Jacquet (Microsoft Employee, Moderator) Wednesday, August 01, 2012 5:49 PM
-
Thursday, August 02, 2012 8:40 AM
Hi,
Thanks for the quick response. I've read the linked article and have tried disabling Chimney and RSS on the DPM server and one of the affected servers. Unfortunately it hasn't helped; the BMR starts OK and then gets stuck, each time at a different point.
Anything else to try?
Thanks again.
-
Thursday, August 02, 2012 9:26 AM
What is the free disk space on each server?
ITLAAL
-
Friday, August 03, 2012 9:23 AM
Hi,
Client HD Space free
Client 1: C:\ 23GB free, D:\ 318GB free
Client 2: C:\ 18GB free, D:\24GB free
Client 3: C:\ 32GB free
Client 4: C:\ 20GB free, D:\ 25GB free
DPM Disk Allocation
Client 1: Replica volume: 30GB allocated, 11.77GB used | Recovery point volume: 99.57GB allocated, 35.34GB used
Client 2: Replica volume: 50GB allocated, 8.36GB used | Recovery point volume: 148.56GB allocated, 66.57GB used
Client 3: Replica volume: 30GB allocated, 90.76MB used | Recovery point volume: 99.57GB allocated, 2.33GB used
Client 4: Replica volume: 40GB allocated, 11.81GB used | Recovery point volume: 99.57GB allocated, 26.33GB used
DPM server Disk Space free
C:\ 11GB free, D:\116GB free
Is it likely to be a disk space issue? I'm not sure, because if I remove the server from the protection group and put it back again, the initial consistency check works OK, proving that it can work, but the subsequent checks fail. Also, as I understand it, BMR copies the data directly to the DPM server, whereas system state backups save to the local storage before copying the data to the DPM server. The system state backups, which require local storage, are working, but the BMR, which does not, is failing.
I think Mike is likely on the right track with it being something to do with data transfer between the client and the server. Unfortunately disabling Chimney and RSS didn't help. I didn't restart the DPM server after disabling them on it (it has other services on it, so I can't reboot it within working hours). Is it possible that it requires a restart?
Thanks
-
Friday, August 03, 2012 9:58 AM
Try to restart the server.
Have you tested running the BMR locally from the server using wbadmin? Does it finish?
http://technet.microsoft.com/en-us/library/cc742083(v=ws.10).aspx
ITLAAL
-
Friday, August 03, 2012 2:43 PM
We have some down time scheduled for next Tuesday, I'll reboot the server then.
Yes I have tested the BMR with wbadmin following the command given in the troubleshooting guide (wbadmin start backup -allcritical -backupTarget:e:) it worked fine.
Thanks
-
Friday, August 03, 2012 8:41 PM
One more question. What are the servers roles? What is installed on them?
I remember an old thread where someone had the exact same problem with SQL DBs on C:\
ITLAAL
-
Monday, August 06, 2012 9:14 AM
Hi,
Client 1: File Services Role
Client 2: Easytrace Cashless Catering running a SQL 2008 database on drive c:\
Client 3: Citrix Web Interface, IIS
Client 4: Domain Controller: Active Directory, DHCP & DNS
I've read the thread about the problem with SQL DBs on C:\ but have not investigated that yet, as only one of the affected servers has a SQL database. Moving it is likely to be a pain, so I want to resolve this issue on the other servers first; then, if the database server is still not working, I will have to plan the move.
At the moment I have successful system state backups (not BMR) and will be rebooting the DPM server tomorrow, hoping that the Chimney and RSS may still be the issue.
Thanks
-
Wednesday, August 08, 2012 12:47 PM
Hi again,
I have been able to restart the DPM server after disabling Chimney and RSS as suggested by Mike. Unfortunately this has not fixed it and the same problem remains. I have since disabled TCP Connection Offload and RSS in the network interface properties on the DPM server and have disabled Large Receive Offload and Large Send Offload in the virtual NIC properties on the client server. This has not helped either.
I'm now going to try using wbadmin to perform a BMR backup directly to the share on the DPM server that is allocated to this client to see if that works. I'll report back how this goes.
Does anyone have any other troubleshooting or solution ideas?
Thanks
-
Wednesday, August 08, 2012 5:26 PM Moderator
Hi
That sounds like a good idea. The BMR volume on the DPM server is really locked down from a security standpoint, so to write to that share use this procedure.
1) Download psexec.exe from www.sysinternals.com
2) Run PSEXEC -s cmd.exe (this will switch window to system context)
3) Afterwards, type WHOAMI; it should return: nt authority\system
4) Run wbadmin.exe command from there and use the BMR share on the DPM Server as the target location.
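As a rough sketch, the four steps above might look like this when typed at the command line. The server name DPMSERVER and the share name BMRSHARE are placeholders, since the thread does not give the actual share path, and psexec.exe is assumed to be in the current directory or on the PATH. The commands are meant to be entered one at a time rather than saved as a script, because psexec opens a new prompt.

```bat
rem Run from an elevated command prompt on the protected client.

rem Step 2: open a new command prompt running as Local System
psexec -s cmd.exe

rem Step 3: in the new prompt, confirm the security context
rem (should return: nt authority\system)
whoami

rem Step 4: run the BMR backup against the DPM server share
rem Assumption: \\DPMSERVER\BMRSHARE is a placeholder for the real share path
wbadmin start backup -allcritical -backupTarget:\\DPMSERVER\BMRSHARE
```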
Regards, Mike J. [MSFT]
-
Thursday, August 09, 2012 9:17 AM
Hi Mike,
Thanks for that. I haven't got around to testing that as I had a breakthrough with another line of enquiry. I ran Process Monitor to see what was happening when the backup stalled and found that, at the exact moment wbengine.exe stopped copying data, smc.exe (Symantec Endpoint Protection (SEP)) became active. I then found that by stopping SEP the stalled backup would resume. So the problem is definitely something to do with the antivirus protection.
I've tried setting the SEP policy to not scan during backup, but this did not work. So now I am looking at stopping SEP using a pre-backup script and starting it afterwards with a post-backup script. However, I am having difficulty getting my scriptingconfig.xml file correct. I don't know what the DataSourceName value needs to be set to for a BMR backup. I've tried C: but that doesn't seem to be working. I've googled etc. but haven't found an answer. Can you tell me what I need to put as the DataSourceName?
Here is my scriptingconfig.xml file:
<?xml version="1.0" encoding="utf-8"?>
<ScriptConfiguration xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                     xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                     xmlns="http://schemas.microsoft.com/2003/dls/ScriptingConfig.xsd">
  <DatasourceScriptConfig DataSourceName="C:">
    <PreBackupScript>"C:\_Novus\dpm\stopsmc.cmd" </PreBackupScript>
    <PostBackupScript>"C:\_Novus\dpm\startsmc.cmd" </PostBackupScript>
    <TimeOut>30</TimeOut>
  </DatasourceScriptConfig>
</ScriptConfiguration>
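The contents of stopsmc.cmd and startsmc.cmd are not shown in the thread. A minimal sketch of what such scripts could contain is below. It assumes smc.exe lives in the SEP folder shown (the path is an assumption, not taken from the thread, and will differ on 32-bit servers or custom installs) and that the SEP client is not configured to require a password before it can be stopped.

```bat
rem ---- stopsmc.cmd (pre-backup): stop the Symantec Endpoint Protection client ----
rem Assumption: install path below; adjust to the actual location of smc.exe
"C:\Program Files (x86)\Symantec\Symantec Endpoint Protection\smc.exe" -stop

rem ---- startsmc.cmd (post-backup): start the SEP client again ----
"C:\Program Files (x86)\Symantec\Symantec Endpoint Protection\smc.exe" -start
```

The two sections belong in separate files with the names referenced in the XML above. Since the server is unprotected while SEP is stopped, it is worth confirming after each backup that the post-backup script actually ran.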
Thanks again
- Marked As Answer by Intraclast Friday, August 10, 2012 8:27 AM
-
Thursday, August 09, 2012 9:56 AM
Hi Again,
I've worked out what to put for the datasourcename using a SQL script posted by Mike on another thread, just took a bit more digging to find it. It had to be "System Protection".
So the pre & post backup script seem to be working ok, stopping and starting Symantec Endpoint Protection and allowing BMR to complete successfully. Hoorah!
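For reference, applying that change to the file posted earlier would give something like the following. This is a sketch assembled from the original file and the "System Protection" value described above; the final file itself was not posted.

```xml
<?xml version="1.0" encoding="utf-8"?>
<ScriptConfiguration xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                     xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                     xmlns="http://schemas.microsoft.com/2003/dls/ScriptingConfig.xsd">
  <!-- DataSourceName changed from "C:" to "System Protection" for BMR -->
  <DatasourceScriptConfig DataSourceName="System Protection">
    <PreBackupScript>"C:\_Novus\dpm\stopsmc.cmd" </PreBackupScript>
    <PostBackupScript>"C:\_Novus\dpm\startsmc.cmd" </PostBackupScript>
    <TimeOut>30</TimeOut>
  </DatasourceScriptConfig>
</ScriptConfiguration>
```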
This is only on one of the four servers (the file server) so I will now have to try this on the rest of them.
I also still have network offloading disabled on the client and server. I'll try setting that back to how it was before, as I think it's likely that the problem was all to do with the antivirus.
I'll report back to confirm whether this has resolved the whole problem or not in a day or so.
Thanks again!
-
Thursday, August 09, 2012 2:33 PM Moderator
Hi, that's great news.
Can you please share with the community the specific version of Symantec Endpoint Protection (SEP) that is giving you problems? Is there an update available that you can try to see if that helps?
Regards, Mike J. [MSFT]
-
Friday, August 10, 2012 8:26 AM
Hi,
This problem is with Symantec Endpoint Protection version 11.0.6005.562, which I think is quite out of date. I have an update of our SEP system on my to-do list; when I've done that I'll try it again and report back, though it's likely to be a while.
I've pushed out my pre and post processing scripts to the four affected servers and it seems to have resolved the problem on all of them.
Thanks again
-
Friday, August 10, 2012 2:28 PM Moderator
Hi,
Thanks kindly for identifying Symantec Endpoint Protection version 11.0.6005.562 as the root cause of your BMR hang issue; that may save others from long troubleshooting sessions. 8-) Please update us after upgrading to a newer version.
Regards, Mike J. [MSFT]
-
Wednesday, August 15, 2012 2:51 PM
Hi again,
I've updated SEP to 11 RU7 MP2 (11.0.7200.1147), and after testing it a couple of times it appears to have resolved the issue. I'm going to disable the pre & post processing scripts on the other servers and will report back if the problem does occur again. For now though it seems that upgrading SEP11 to the latest version has resolved the issue.

