Intermittent failures with Protected 2008R2 Workgroup Agents
-
Monday, August 15, 2011 18:22
Hello,
In DPM 2010, we are having intermittent failures with protection agents installed on Windows 2008 R2 with SP1 computers that are not members of the domain (untrusted/workgroup). They all fail at different times, partway through a consistency check, after some data has already been transferred as part of the job.
We have 12 domain servers that have no problems (all 2008 R2 with SP1) and another 7 workgroup/untrusted servers (5 running 2008 R2 with SP1 and 2 running 2003). The domain computers and the 2 computers running Server 2003 back up without any issues.
The protected servers are on the same subnet as the single DPM server and separated by only 1 network switch. We have disabled the Windows firewall on both the protected server and DPM server. We do not see any errors on the protected servers in the event logs and can run a continuous ping between servers without losing any while the job runs.
Here is the error we get on all of the servers...
The replica of Volume C:\ on SERVER1 is inconsistent with the protected data source. All protection activities for data source will fail until the replica is synchronized with consistency check. You can recover data from existing recovery points, but new recovery points cannot be created until the replica is consistent.
For SharePoint farm, recovery points will continue getting created with the databases that are consistent. To backup inconsistent databases, run a consistency check on the farm. (ID 3106)
DPM failed to communicate with SERVER1 because the computer is unreachable. (ID 41 Details: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond (0x8007274C))
More information
Recommended action: 1) Make sure that SERVER1 is online and remotely accessible from the DPM server.
2) If a firewall is enabled on SERVER1, make sure that it is not blocking requests from the DPM server.
3) If you are using backup LAN, make sure that the backup LAN settings are valid.
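For reference, the code in the ID 41 details can be decoded by hand. The sketch below splits an HRESULT into its standard fields; for 0x8007274C it gives facility 7 (Win32) and code 10060, which is the Winsock timeout error WSAETIMEDOUT and matches the "did not properly respond after a period of time" wording. The function name is illustrative and not part of DPM.

```python
def decode_hresult(hresult: int) -> dict:
    """Split a 32-bit HRESULT into its severity, facility, and code fields."""
    return {
        "failure": bool((hresult >> 31) & 0x1),  # top bit set means failure
        "facility": (hresult >> 16) & 0x7FF,     # 11-bit facility (7 = Win32)
        "code": hresult & 0xFFFF,                # low 16 bits: the Win32 error code
    }


fields = decode_hresult(0x8007274C)
print(fields)  # {'failure': True, 'facility': 7, 'code': 10060}
```

A timeout of this kind means the peer stopped answering on an established or attempted TCP connection, which is a different symptom from a refused or blocked port.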
Any suggestions to correct this?
Nick Dorak
All replies
-
Monday, August 22, 2011 16:45 Moderator
Hi Nick,
A couple of things to try:
1. Do the jobs fail at any point if you run them manually?
2. How did you disable Windows Firewall? If you have not done so already, test stopping the Base Filtering Engine (net stop BFE). If this works, then it would appear Windows Firewall is still blocking the DPM traffic. You can reboot to re-enable BFE and all the dependent components, then try running the setdpmserver commands for workgroup/untrusted domain servers.
Thanks,
Marc
-
Tuesday, August 23, 2011 13:30
Thanks Marc,
1. The jobs fail the same way when run manually as when scheduled.
2. I disabled the firewall as you suggested (by stopping the BFE service) and the issue remains the same.
Any further suggestions?
Nick Dorak
-
Monday, September 12, 2011 7:45 Moderator
What data sources are you trying to protect on untrusted/workgroup computers?
This posting is provided "AS IS" with no warranties, and confers no rights
-
Tuesday, October 11, 2011 14:33 Moderator
Nick,
Are you still experiencing this problem? If so, can you answer Prateek's question about the type of data source(s) you are trying to protect?
Thanks,
Marc
-
Tuesday, October 11, 2011 14:56
Sorry, I didn't notice the previous reply.
I currently have a case open with Microsoft support and am working with them to get this corrected.
If you have some insight, we are mainly trying to back up "All Volumes" and the "System State" for each server. Some also have SQL and Exchange databases.
Nick Dorak
-
Wednesday, October 12, 2011 17:08
Have you tried deleting the protection group (PG) while retaining the data, and then re-protecting it?
// Laith
-
Wednesday, October 12, 2011 20:12
Yes, we have.
We have several different protection groups. A protection group can contain a mix of protected servers that are working and protected servers with intermittent backups, as described in my first message.
Thanks,
Nick Dorak
-
Thursday, October 13, 2011 4:59
Is communication up between the DPM server and the protected server?
-
Thursday, October 13, 2011 12:27
Yes, I can continually ping between servers without any failures.
I have disabled Windows Firewalls and Antivirus software for testing with no difference in behavior.
Again, it's the "intermittent" factor that has me stumped. If the issue was "always", I would assume a communication error.
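A continuous ping only exercises ICMP, so it can stay clean while a TCP session to the agent stalls. A minimal sketch of a TCP-level check follows, assuming the agent listens on TCP 5718 and 5719 (the ports DPM documentation lists for the protection agent; confirm them for your setup). The `tcp_probe` and `watch` names are illustrative and not part of DPM.

```python
import socket
import time


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def watch(host: str, ports, rounds: int = 10, interval: float = 5.0):
    """Probe each port once per round; return (round_index, port) for every failure."""
    failures = []
    for round_index in range(rounds):
        for port in ports:
            if not tcp_probe(host, port):
                failures.append((round_index, port))
        time.sleep(interval)
    return failures
```

Calling `watch("SERVER1", [5718, 5719], rounds=720, interval=5.0)` from the DPM server during a consistency check would record any round in which a new connection could not be opened. It does not test the health of the session DPM already holds, so a clean result narrows the problem rather than ruling out the network.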
Nick Dorak
-
Saturday, October 15, 2011 8:09
Have you tried using throttling and enabling on-the-wire compression?
-
Thursday, October 20, 2011 19:24
Yes, we have set throttling within DPM; however, I am not sure about "on-the-wire compression". Where is that enabled?
Nick Dorak
-
Tuesday, October 25, 2011 5:35
If you right-click your PG and choose "Optimize performance", you will be able to enable on-the-wire compression under Network. That moves load off your network and onto the servers (CPU and RAM), but since the backup runs at night it should not affect anyone.
Hope that helps,
Laith.
-
Friday, October 28, 2011 19:00
I have enabled this but it didn't seem to help as the same error is still occurring.
Nick Dorak
-
Sunday, October 30, 2011 17:46
Hi Nick,
You said that the servers are in an untrusted domain. Is there any firewall in the middle?
// Laith.
-
Tuesday, November 1, 2011 20:00
There are no firewalls in the middle. They are on the same LAN, separated by one switch, and the Windows Firewalls are also disabled.
Nick Dorak
-
Monday, November 14, 2011 18:18
Nick, I've found that sometimes this is due to a problem with the local account used by DPM 2010 on the protected workgroup server. If you have a policy in place that forces a password reset after a set amount of time, you'll lose the connection. Fixing the password on the protected server and refreshing the agent state from the DPM admin console usually does the trick.
-
Wednesday, November 16, 2011 14:20
There are no password reset policies in place for the DPM accounts that are created. I have tried re-creating the accounts in the past; however, it doesn't correct the issue. We always set our DPM account to not expire passwords. Thanks for the suggestion.
Nick Dorak
-
Monday, November 28, 2011 14:22
Download Netmon.
Capture which ports are open while the connection between the two servers is up (you can trigger that by running a simple agent refresh). Then check which ports are open from the protected server (PS) while communication is not working. If the ports are not open, then it's a communication issue.
Please get back to me with any info!
// Laith.
-
Friday, December 16, 2011 8:30
Any news?
-
Friday, December 16, 2011 13:25
I'm not a big Netmon user. I had someone else examine the results and was informed that ports get opened and data starts transferring, then keep-alive messages are seen.
I see DPM sending the necessary keep-alive to keep the session open.
1513660 1:02:55 PM 12/5/2011 0.0500194 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x00000000133EF5C0 send keep-alive at SndUna = 1047656124 (0x3E71F6BC).
1513661 1:02:55 PM 12/5/2011 0.0000175 Idle (0) DPM-SERVER 5718 (0x1656) SERVER1 54894 (0xD66E) TCP TCP:[Keep alive]Flags=...A...., SrcPort=5718, DstPort=54894, PayloadLen=1, Seq=1047656123 - 1047656124, Ack=3363045343, Win=254
1513662 1:02:55 PM 12/5/2011 0.0002181 Idle (0) SERVER1 54894 (0xD66E) DPM-SERVER 5718 (0x1656) TCP TCP:[Keep alive ack]Flags=...A...., SrcPort=54894, DstPort=5718, PayloadLen=0, Seq=3363045343, Ack=1047656124, Win=251
1513663 1:02:55 PM 12/5/2011 0.0000146 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x00000000133EF5C0: Received data with number of bytes = 0 (0x0). ThSeq = 3363045343 (0xC873FFDF).
1513664 1:02:56 PM 12/5/2011 0.2643508 Idle (0) SERVER1 54894 (0xD66E) DPM-SERVER 5718 (0x1656) TCP TCP:[Keep alive]Flags=...A...., SrcPort=54894, DstPort=5718, PayloadLen=1, Seq=3363045342 - 3363045343, Ack=1047656124, Win=251
1513665 1:02:56 PM 12/5/2011 0.0000265 Idle (0) DPM-SERVER 5718 (0x1656) SERVER1 54894 (0xD66E) TCP TCP:[Keep alive ack]Flags=...A...., SrcPort=5718, DstPort=54894, PayloadLen=0, Seq=1047656124, Ack=3363045343, Win=254
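In the excerpt above, each side sends a 1-byte keep-alive probe and receives a 0-byte ack, so at that moment the TCP session is idle but alive in both directions. For a longer capture exported as text, a rough sketch like the one below could list the probes and acks so a missing ack stands out. It assumes the Netmon text layout shown in this post; the regular expression and function name are illustrative.

```python
import re

# Matches the TCP description field of a keep-alive or keep-alive-ack frame.
KEEPALIVE = re.compile(
    r"\[(Keep alive(?: ack)?)\].*?SrcPort=(\d+), DstPort=(\d+), PayloadLen=(\d+)"
)


def parse_keepalives(lines):
    """Yield (kind, src_port, dst_port, payload_len) for each keep-alive frame."""
    for line in lines:
        match = KEEPALIVE.search(line)
        if match:
            kind, src, dst, length = match.groups()
            yield kind, int(src), int(dst), int(length)


sample = [
    "1513661 1:02:55 PM 12/5/2011 0.0000175 Idle (0) DPM-SERVER 5718 (0x1656) "
    "SERVER1 54894 (0xD66E) TCP TCP:[Keep alive]Flags=...A...., SrcPort=5718, "
    "DstPort=54894, PayloadLen=1, Seq=1047656123 - 1047656124, Ack=3363045343, Win=254",
    "1513662 1:02:55 PM 12/5/2011 0.0002181 Idle (0) SERVER1 54894 (0xD66E) "
    "DPM-SERVER 5718 (0x1656) TCP TCP:[Keep alive ack]Flags=...A...., SrcPort=54894, "
    "DstPort=5718, PayloadLen=0, Seq=3363045343, Ack=1047656124, Win=251",
]

for event in parse_keepalives(sample):
    print(event)
# ('Keep alive', 5718, 54894, 1)
# ('Keep alive ack', 54894, 5718, 0)
```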
Nick Dorak
-
Saturday, December 17, 2011 22:43
Hi Nick,
Did you run Netmon while the problem was occurring and compare the results?
What I suspect is that when the DPM agent loses its connection to the DPM server, it is because some ports are being used by another program or blocked by some rules. That is why I am asking you to compare the results.
/ Laith
-
Monday, January 23, 2012 19:28 Moderator
Hi Nick,
You mentioned in an earlier post you had opened a support case for this issue. Has it been resolved?
Thanks,
Marc
-
Monday, January 23, 2012 19:36
Marc,
The Microsoft phone support case is still open and has NOT been resolved.
We are working to evaluate alternatives to DPM as we have not been able to get this issue corrected.
Nick Dorak
-
Monday, February 27, 2012 21:06
Hi Nick,
Is there any update on this case?
-
Tuesday, February 28, 2012 19:25
No, the issue still exists. We are using another backup solution for the time being, just on the computers with issues. Hoping DPM 2012 comes out sooner rather than later.
Nick Dorak
- Edited by NickDorak, Tuesday, February 28, 2012 19:27 (Spelling)
-
Tuesday, May 1, 2012 11:49
Nick,
I know I'm a little late to the party, but I thought I'd throw my ten cents in to help you out. In our multi-tenant datacentre we had similar issues to what you are describing, and in most cases the cause was one of the following: an outdated DPM agent (the latest QFE rollup for DPM 2010 needs to be installed on both the DPM server and all the agents); anti-virus exclusions not being set properly; or protection groups running over each other's time, i.e. more than 3 jobs running at the same time. In one case we had an issue with a dodgy network switch which needed a firmware upgrade.
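To check the "more than 3 jobs at the same time" condition against a schedule, a small sweep over job start and end times is enough. This is only a sketch: the job windows below are hypothetical minutes after midnight, not taken from any real DPM schedule, and the function name is illustrative.

```python
def max_concurrent(jobs):
    """Return the peak number of overlapping (start, end) job windows."""
    events = []
    for start, end in jobs:
        events.append((start, 1))   # a job begins
        events.append((end, -1))    # a job ends
    # At equal times, ends (-1) sort before starts (+1), so back-to-back
    # jobs are not counted as overlapping.
    events.sort()
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


# Hypothetical schedule: four job windows, in minutes after midnight.
schedule = [(0, 90), (30, 120), (60, 150), (80, 100)]
print(max_concurrent(schedule))  # 4
```

A result above 3 would point at overlapping protection group schedules as a candidate cause worth staggering.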
Also, check out this post on untrusted domain account passwords resetting - even with password expiry turned off:
http://kevingreeneitblog.blogspot.com/2011/10/resolving-dpm-2012-untrusted-domain.html
As you probably know by now, DPM 2012 has been released, and it supports full certificate-based authentication for untrusted domain clients. Since I deployed this into our datacentre, we haven't had any of the intermittent problems with untrusted backups failing, so hopefully if you deploy it, it will solve the issues for you!
Hope this helps,
Kevin.
-
Monday, May 7, 2012 17:27
Kevin,
We have made the change to DPM 2012 in hopes it would help with some of our DPM issues (which it hasn't). We have confirmed that the passwords are correct as the backups do start and Agents do update their status. I would be interested in reading a blog post for "Converting your Untrusted DPM Agents to a PKI setup" if you have one upcoming.
It keeps sounding like a switch issue, but our switches are all up to date and I can't find any proof that this may be the cause.
I have had a case open with Microsoft Partner/Phone support this entire time, but they have been unhelpful, take a long time to reply, and keep changing engineers, so we have to start each test all over again from the beginning (basically too laborious on our part). We found it simpler to back up these machines to our NAS using a 3rd-party image-based product. One day I hope to get everything working on DPM.
Nick Dorak

