Sporadic failure to backup to tape, computer unreachable.

    Question

  • DPM Version 2010, 3.0.7706.0

    OS: Server 2008 R2 Enterprise

    Environment: 5 total servers on a local domain behind NAT/FW. 2 DCs, with services spread across the others, such as DHCP, DNS, NAS, Print Server, Hyper-V, and DPM. DPM is scheduled to back up to tape after hours (11pm) 5 nights a week.

    Detailed description of issue: DPM sporadically fails to back up to tape, and it appears to be a random server each time.

    E-mail alert error:  Description: Backup to tape failed.

    DPM failed to communicate with server01.domain.local because the computer is unreachable.

     

    For more information, open DPM Administrator Console and review the alert details in the Monitoring task area.

     

    DPM admin console error: The back up to tape job failed for the following reason: (ID 3311)

    DPM failed to communicate with the protection agent on server01.domain.local because the agent is not responding. (ID 43 Details: Internal error code: 0x8099090E)

    Attempted resolutions:  I have Googled and read a lot of information on all errors received, including this forum. To eliminate DNS from the equation I added all of the servers to the hosts file, but the sporadic backup failure continues. I learned that DPM uses DCOM to communicate with the DPM agents. I ran a DCOM test as per the instructions here - http://support.microsoft.com/kb/259011 - and the test succeeded in both directions. The server running DPM does have CAPI2 errors in the event log, specifically:

     

    Failed extract of third-party root list from auto update cab at: <http://www.download.windowsupdate.com/msdownload/update/v3/static/trustedr/en/authrootstl.cab> with error: A required certificate is not within its validity period when verifying against the current system clock or the timestamp in the signed file.

     

    That error led me to - http://support.microsoft.com/kb/2328240 - whose instructions I followed to attempt a repair, both manually and via the “fix-it”; both attempts failed to resolve the error. I then found this - http://www.petenetlive.com/KB/Article/0000304.htm - which helped me identify the specific process that was causing the CAPI2 error. Here is the result for that CAPI2 error:

     

    ProcessName: TriggerJob.exe

    Result: A required certificate is not within its validity period when verifying against the current system clock or the timestamp in the signed file.

     

    I have verified that the system time is accurate on all servers. I do believe TriggerJob.exe is a DPM service, but at this point I am unsure what can be done to fix the problem. Any input would be appreciated.

    Friday, February 24, 2012 11:48 PM

All replies

  • Triggerjob isn't the problem.

    Triggerjob.exe is a process used by SQL Agent to fire up a DPM scheduled job.

    If triggerjob.exe never runs (or starts but fails to do what it was supposed to do), DPM would never know about the scheduled job, and you would not see any error at all in the DPM console. Because you are getting an error, triggerjob.exe was launched and contacted DPM successfully.

    So my question to you is: are the DPM server, DCs, and protected servers on the same network, so that they don't need to go through NAT or a firewall to talk to each other?


    Thanks, Wilson Souza - MSFT This posting is provided "AS IS" with no warranties, and confers no rights

    Saturday, February 25, 2012 12:20 AM
  • They are on the same network and the same domain, yes. Symantec has no firewall enabled and the group policy managed Windows firewall has exclusions for server communication. Also, if it were a firewall issue, why would it work for three days and fail on the fourth? (honest question).
    Saturday, February 25, 2012 12:23 AM
  • "...would it work for three days and fail on the fourth?"

    Is this what really happens? Is there any step you need to take to resolve the communication issue, or does it clear up by itself?


    Thanks, Wilson Souza - MSFT This posting is provided "AS IS" with no warranties, and confers no rights

    Saturday, February 25, 2012 12:39 AM
  • Yes. I could look through my e-mail logs to confirm, but it is very random: sometimes it fails two days in a row, works fine for three, then fails for one.

    The issue clears up on its own, and returns on its own too! :(

    Saturday, February 25, 2012 12:41 AM
  • Ok... this is kinda tough to troubleshoot.

    You might need to pick one of these servers and have a network trace running (one netmon on the DPM server and another one on one of the servers that is having this issue).

    Agent communication refresh runs every 30 minutes (at the top and bottom of the hour). Another question: when you get the error you posted in this thread, does it happen after DPM was able to transfer some data, or right after the job starts?


    Thanks, Wilson Souza - MSFT This posting is provided "AS IS" with no warranties, and confers no rights

    Saturday, February 25, 2012 12:46 AM
  • "Ok... this is kinda tough to troubleshoot." Yea welcome to my world! :)

    "You might need to pick one of these servers and have a network trace running  (one netmon on the DPM server and another one in one of the servers that is having this issue)." What would I be looking for specifically in netmon? just to see if DPM's attempt at communication with the agent actually makes it to the agent server?

    "Another question is when you get the error you post on this thread. It happens after DPM was able to transfer data or right after the job started?" That seems to be somewhat random too. I'm looking at the last 4 errors: (the backup starts at 11:00 PM) 11:08 PM, 3:30 AM, 2:32 AM, and 4:31 AM.

    I also forgot to include some other e-mail errors, I am not sure if they are relevant:

    (gotten a couple of these)

    Description: Backup to tape failed.

    The DPM service terminated unexpectedly during completion of the job. The termination may have been caused by a system reboot.

    (and some of these, though this could be related to a separate space issue I am working on on this server)

    Description: Backup to tape failed.

    DPM failed to synchronize changes for  C:\ on server03.domain.local because the snapshot volume did not have sufficient storage space to hold the churn on the protected computer
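
    Going back to the four failure times listed above, it can help to express them both as minutes elapsed since the 11:00 PM job start and as distance from the nearest half-hour mark, since the agent communication refresh is said to run at the top and bottom of the hour. The Python sketch below is only an illustration; the function names are made up for this example, and four samples are far too few to draw a conclusion from.

```python
from datetime import datetime, timedelta

JOB_START = "11:00 PM"                                   # scheduled start from the thread
FAILURES = ["11:08 PM", "3:30 AM", "2:32 AM", "4:31 AM"]  # failure times from the thread

TIME_FORMAT = "%I:%M %p"


def minutes_since_start(start, failure):
    """Minutes between the job start and a failure, allowing for midnight rollover."""
    t0 = datetime.strptime(start, TIME_FORMAT)
    t1 = datetime.strptime(failure, TIME_FORMAT)
    if t1 < t0:
        # A time earlier than the start belongs to the following calendar day.
        t1 += timedelta(days=1)
    return int((t1 - t0).total_seconds() // 60)


def minutes_from_half_hour(failure):
    """Distance in minutes to the nearest :00 or :30 mark."""
    minute = datetime.strptime(failure, TIME_FORMAT).minute % 30
    return min(minute, 30 - minute)


for f in FAILURES:
    print(f, minutes_since_start(JOB_START, f), minutes_from_half_hour(f))
```

    For these values the elapsed times are 8, 270, 212, and 331 minutes, and the distances from a half-hour mark are 8, 0, 2, and 1 minutes. So the failures occur both shortly after the start and hours into the job, and three of the four fall close to a half-hour mark, which may or may not be a coincidence.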

    Saturday, February 25, 2012 12:58 AM
  • Can you look for the 3311 error under Monitoring/Jobs and post it here?

    The other two errors you posted are not relevant to the agent communication failure.

    The first one tells me that the DPM service crashed....

    The second one means the protected server wasn't able to expand the snapshot area in time, so the operation was aborted...


    Thanks, Wilson Souza - MSFT This posting is provided "AS IS" with no warranties, and confers no rights

    Saturday, February 25, 2012 1:02 AM
  • Monitoring/jobs is completely blank, I don't recall ever seeing anything here.

    @ other comments, yes they didn't seem to be relevant, but I figured I would include them anyway.

    Also, I currently have 10 inactive alerts for "agent not reachable".

    The DPM protection agent on server04.domain.local could not be contacted. Subsequent protection activities for this computer may fail if the connection is not established. The attempted contact failed for the following reason: (ID 3122)

    The protection agent operation failed because it could not communicate with server04.domain.local. (ID 300 Details: The RPC server is unavailable (0x800706BA))

    Saturday, February 25, 2012 1:16 AM
  • Correction, I found some info after changing the filter to failed jobs for yesterday and today.

    DPM failed to communicate with server01.domain.local because the computer is unreachable. (ID 41 Details: No such host is known (0x80072AF9))

    and

    DPM failed to communicate with server01.domain.local because the computer is unreachable. (ID 41 Details: The RPC server is unavailable (0x800706BA))

    and

    DPM failed to communicate with server01.domain.local because the computer is unreachable. (ID 41 Details: The RPC server is unavailable (0x800706BA))
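
    For reference, the hexadecimal values in these details are HRESULTs that wrap ordinary Win32 error codes. The Python sketch below, which assumes the standard HRESULT bit layout (failure bit, 11-bit facility, 16-bit code), splits them apart. The helper names are made up for this example, and the name table covers only the two codes seen in this thread.

```python
# Win32 error codes seen in this thread (names as defined in the Windows SDK headers).
KNOWN_WIN32_CODES = {
    1722: "RPC_S_SERVER_UNAVAILABLE",
    11001: "WSAHOST_NOT_FOUND",
}
FACILITY_WIN32 = 7  # facility used when an HRESULT wraps a Win32 error code


def decode_hresult(value):
    """Split a 32-bit HRESULT into (is_failure, facility, code)."""
    is_failure = bool((value >> 31) & 0x1)  # bit 31: severity
    facility = (value >> 16) & 0x7FF        # bits 16-26: facility
    code = value & 0xFFFF                   # bits 0-15: error code
    return is_failure, facility, code


def describe(value):
    """Return a one-line description of an HRESULT."""
    _is_failure, facility, code = decode_hresult(value)
    if facility == FACILITY_WIN32:
        name = KNOWN_WIN32_CODES.get(code, "unknown Win32 code")
        return f"0x{value:08X}: Win32 error {code} ({name})"
    return f"0x{value:08X}: facility {facility}, code {code}"


for hresult in (0x80072AF9, 0x800706BA):
    print(describe(hresult))
```

    Decoded this way, 0x80072AF9 is Win32 error 11001 (a name lookup that found no host) and 0x800706BA is Win32 error 1722 (RPC server unavailable), which matches the text of the two messages.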

    Saturday, February 25, 2012 1:22 AM
  • Under Monitoring/Jobs you will need to create a filter to see errors that happened on previous days (up to 30 days back)...

    One thing we can try right away is to check whether TCP Chimney is enabled on your DPM server and protected servers. If it is, would it be possible for you to disable it?

    If your protected servers are running Windows 2008/2008 R2, you can follow these steps to disable Chimney/RSS.

     

    1.  Disable TCP Chimney and RSS by running these two commands from CMD.EXE

     netsh int tcp set global chimney=disabled

     netsh int tcp set global rss=disabled

     

    2.  Disable these same settings at the driver level.

    For that, go to Start -> Run and type in NCPA.CPL

    Right-click the physical adapter attached to your network and select Properties.

    Click on Configure.

    Go to the Advanced tab and set every option that mentions Offload or Receive Side Scaling (RSS) to None/Disabled.

    More information about RSS and TCP Chimney can be found in the article below.

    http://support.microsoft.com/kb/951037


    Thanks, Wilson Souza - MSFT This posting is provided "AS IS" with no warranties, and confers no rights

    Saturday, February 25, 2012 1:24 AM
  • Thank you.

    RPC errors can happen if we can't get a valid Kerberos ticket, or if the remote (protected) server can't connect back to the DPM server for whatever reason.

    "No such host is known" suggests some problem connecting to the DC/DNS....


    Thanks, Wilson Souza - MSFT This posting is provided "AS IS" with no warranties, and confers no rights

    Saturday, February 25, 2012 1:27 AM
  • Wow thanks for the help Wilson! I will have to try your last suggestions and get back to you later.
    Saturday, February 25, 2012 1:27 AM
  • Wilson, I have completed the changes that you recommended. As the issue is sporadic I won't know if it had any effect until the error occurs again. Are there some steps we can take to test/troubleshoot DNS and the DC as you mentioned above? I have already tested pinging between servers fine.
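
    One way to test name resolution and RPC reachability over time, rather than with a one-off ping, is a small probe run from the DPM server at regular intervals. The Python sketch below is only an illustration and not part of DPM: `probe` and `log_line` are hypothetical helper names, and the only protocol fact it relies on is that the RPC endpoint mapper listens on TCP port 135. It resolves the name and then attempts a TCP connection, so a failure can be attributed to name resolution or to connectivity.

```python
import socket
from datetime import datetime

RPC_ENDPOINT_MAPPER_PORT = 135  # well-known TCP port of the RPC endpoint mapper


def probe(host, port=RPC_ENDPOINT_MAPPER_PORT, timeout=3.0):
    """Make one name-resolution plus TCP-connect attempt; return (ok, detail)."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return False, f"name resolution failed: {exc}"

    # Use the first address the resolver returned.
    family, socktype, proto, _canonname, sockaddr = infos[0]
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
    except OSError as exc:
        return False, f"TCP connect to {sockaddr[0]}:{port} failed: {exc}"
    return True, f"resolved to {sockaddr[0]}, port {port} reachable"


def log_line(host, ok, detail, now=None):
    """Format one timestamped result line suitable for appending to a log file."""
    now = now or datetime.now()
    status = "OK" if ok else "FAIL"
    return f"{now:%Y-%m-%d %H:%M:%S} {host} {status} {detail}"
```

    Calling `probe("server01.domain.local")` every few minutes from a scheduled task and appending the `log_line(...)` output to a file would give timestamps to compare against the failed jobs. A successful TCP connection to port 135 does not prove the DPM agent itself is healthy; it only narrows down whether the name and the network path were available at that moment.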
    Monday, February 27, 2012 8:18 PM
  • I wanted to add a screenshot of the error log, you can see how inconsistently the errors occur.

    Tuesday, March 13, 2012 12:14 AM