DPM Communication Error

  • Question

  • We upgraded from DPM 2010 to 2012. When trying to restore to either a physical or virtual location, I get this error after about 10 GB have been restored. I have tried rebooting all machines, restarting the services, and throttling the client.

    The recovery jobs for S:\ that started at Tuesday, June 26, 2012 2:04:18 PM, with the destination of Server1, have completed. Most or all jobs failed to recover the requested data. (ID: 3111)

    The DPM service was unable to communicate with the protection agent on Server1. (ID: 52)

    Thank you.


    JD Young

    Tuesday, June 26, 2012 9:36 PM

All replies

  • Hi,

    There should be a hex code in the failed recovery job details (not in the alert). Can you supply that detail?


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.

    Tuesday, June 26, 2012 10:31 PM
    Moderator
  • Here is the hex code

    (ID 52 Details: The semaphore timeout period has expired (0x80070079))

    Thank you.


    JD Young

    Tuesday, June 26, 2012 11:29 PM
  • Hi,

    Thanks for the code - that helps.
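
For reference, the hex code can be unpacked by hand. The sketch below splits an HRESULT into its standard fields; the function name `decode_hresult` is made up for illustration. For 0x80070079 it gives facility 7 (Win32) and code 121, which is the Win32 error ERROR_SEM_TIMEOUT, matching the "semaphore timeout" text in the job details.

```python
def decode_hresult(hresult: int) -> dict:
    """Split a 32-bit HRESULT into its failure bit, facility and code."""
    hresult &= 0xFFFFFFFF  # treat the value as unsigned 32-bit
    return {
        "failure": bool(hresult >> 31),       # top bit set means failure
        "facility": (hresult >> 16) & 0x7FF,  # 11-bit facility field
        "code": hresult & 0xFFFF,             # low 16 bits: the error code
    }


fields = decode_hresult(0x80070079)
print(fields)
```

A facility of 7 means the low 16 bits are an ordinary Win32 error number, so the code can be looked up with `net helpmsg 121` on any Windows machine.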

    Diagnostic steps when "Semaphore timeout" is hit during network transfer:

    1. Check whether the protected server (sender) or the DPM server (receiver) was under stress or inaccessible at the time of the failure, using the event logs on both machines. A retry should work if the packet loss was caused by either server being inaccessible or under stress for a period.

    2. Check whether the network between the protected server (PS) and the DPM server is flaky. The retransmit count from 'netstat -s' or the perfmon counters can give an idea.

    3. If the network is expected to be flaky, raising the TCP/IP maximum retransmission count as described in
    http://support.microsoft.com/kb/170359 might help: increase TcpMaxDataRetransmissions to 10 or more.

    4. Otherwise, contact a network support engineer to diagnose the packet loss. Netmon captures from both machines, the packet route, and the network layout/devices will be required to start the investigation.

    5. Take performance monitor logs on both the DPM server and the protected server.
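
As a rough way to act on the retransmit check in step 2, the sketch below computes the share of TCP segments that were retransmitted from saved 'netstat -s' output. It assumes the Windows output format, where the TCP sections contain lines such as "Segments Sent = N" and "Segments Retransmitted = N"; the function name and the sample numbers are invented for illustration, and no particular ratio is claimed as a pass/fail threshold.

```python
import re


def retransmit_ratio(netstat_output: str) -> float:
    """Return retransmitted / sent TCP segments, summed over all TCP sections."""
    sent = sum(
        int(n) for n in re.findall(r"Segments Sent\s*=\s*(\d+)", netstat_output)
    )
    retransmitted = sum(
        int(n)
        for n in re.findall(r"Segments Retransmitted\s*=\s*(\d+)", netstat_output)
    )
    # Avoid dividing by zero when the text holds no TCP statistics.
    return retransmitted / sent if sent else 0.0


# Invented sample in the assumed Windows 'netstat -s' layout.
sample = """TCP Statistics for IPv4

  Segments Received                   = 1000
  Segments Sent                       = 2000
  Segments Retransmitted              = 50
"""
print(f"{retransmit_ratio(sample):.1%} of sent segments were retransmitted")
```

Comparing the ratio before and after a failed restore, on both machines, says more than a single reading, because the counters are cumulative since boot.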

    Some good, basic perfmon counters for checking whether the servers are under stress are listed below.

    Logical Disk/Physical Disk
    ******************
    \% Idle Time
    • 100% idle to 50% idle = Healthy
    • 49% idle to 20% idle = Warning or Monitor
    • 19% idle to 0% idle = Critical or Out of Spec
    \Avg. Disk sec/Read or \Avg. Disk sec/Write
    • .001 to .015 (1 ms to 15 ms) = Healthy
    • .015 to .025 (15 ms to 25 ms) = Warning or Monitor
    • .026 (26 ms) or greater = Critical or Out of Spec
    \Current Disk Queue Length (for all instances)
    • 80 requests for more than 6 minutes indicates a possibly excessive disk queue length.
    Memory
    *******
    \Pool Nonpaged Bytes*
    • Less than 60% of pool consumed = Healthy
    • 61% - 80% of pool consumed = Warning or Monitor
    • Greater than 80% of pool consumed = Critical or Out of Spec
    \Pool Paged Bytes*
    • Less than 60% of pool consumed = Healthy
    • 61% - 80% of pool consumed = Warning or Monitor
    • Greater than 80% of pool consumed = Critical or Out of Spec
    \Available MBytes
    • 50% of free memory available or more = Healthy
    • 25% of free memory available = Monitor
    • 10% of free memory available = Warning
    • Less than 100 MB or 5% of free memory available = Critical or Out of Spec
    Processor
    *******
    \% Processor Time (all instances)
    • Less than 60% consumed = Healthy
    • 51% - 90% consumed = Monitor or Caution
    • 91% - 100% consumed = Critical
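
When reviewing a long perfmon log, the disk and pool thresholds above can be applied mechanically. The sketch below is one possible encoding, not an official tool: the function names are invented, the latency values are assumed to be the raw counter readings in seconds (0.015 = 15 ms), and the handling of exact boundary values such as 60% is an assumption.

```python
def classify_disk_idle(idle_percent: float) -> str:
    """Classify a disk '% Idle Time' sample."""
    if idle_percent >= 50:
        return "healthy"
    if idle_percent >= 20:
        return "warning"
    return "critical"


def classify_disk_latency(seconds: float) -> str:
    """Classify an 'Avg. Disk sec/Read' or 'Avg. Disk sec/Write' sample."""
    if seconds <= 0.015:
        return "healthy"
    if seconds <= 0.025:
        return "warning"
    return "critical"


def classify_pool_usage(percent_consumed: float) -> str:
    """Classify paged or nonpaged pool usage as a percentage of the pool."""
    if percent_consumed <= 60:  # boundary handling is an assumption
        return "healthy"
    if percent_consumed <= 80:
        return "warning"
    return "critical"


# Illustrative samples, not measurements from the servers in this thread.
print(classify_disk_idle(30), classify_disk_latency(0.030), classify_pool_usage(45))
```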


    Regards, Mike J. [MSFT]

    Tuesday, June 26, 2012 11:47 PM
    Moderator
  • At times the CPU load can be high, but I am not sure that is the cause of the issue. Originally, the restore task would error out at around 10 GB; when restoring individual subfolders I haven't had this error. I have been able to restore most subfolders without issue; the problem occurs when trying to restore the entire contents of all the subfolders at once. There is one subfolder that fails during the restore process, but I haven't been able to determine the exact cause.

    On the subfolder that wouldn't restore successfully, I tried to delete all of the contents and remove the directories. I wasn't able to delete all of the files and folders because some file names were too long. I am wondering if this is what is causing the problem with restoring the entire directory.
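
One way to test the long-file-name suspicion is to list every path under the folder whose full length reaches the classic Windows MAX_PATH limit of 260 characters. The sketch below does that; the function name is invented, and whether long paths are actually what breaks the restore is not confirmed anywhere in this thread. Python may also be unable to descend into a directory it cannot open, so entries below such a directory could be missed.

```python
import os
from typing import List

MAX_PATH = 260  # classic Windows API path limit


def find_long_paths(root: str, limit: int = MAX_PATH) -> List[str]:
    """Return files and folders under root whose full path length is >= limit."""
    long_paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full_path = os.path.join(dirpath, name)
            if len(full_path) >= limit:
                long_paths.append(full_path)
    return long_paths
```

Calling `find_long_paths` on the failing subfolder on the original volume, or on a copy of it, would show whether that subfolder is the only one containing over-long paths.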


    JD Young

    Friday, June 29, 2012 4:42 PM
  • Hi,

    That is certainly strange. If you have this registry setting, please ensure that it is set to 3 or higher, or even delete it as a test. I've seen lower values cause restores to hang or fail.

    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Data Protection Manager\Agent]
    "BufferQueueSize"=dword:00000003

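The registry check above can be scripted when several servers need to be inspected. This is a minimal sketch using Python's standard winreg module, with the key path and value name taken from the snippet above; the function names and the "ok" / "too low" labels are invented for illustration, and the script only reads the value without changing it.

```python
import sys
from typing import Optional

AGENT_KEY = r"SOFTWARE\Microsoft\Microsoft Data Protection Manager\Agent"
VALUE_NAME = "BufferQueueSize"


def read_buffer_queue_size() -> Optional[int]:
    """Return the BufferQueueSize value, or None if absent or not on Windows."""
    if sys.platform != "win32":
        return None
    import winreg  # Windows-only standard library module

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, AGENT_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
            return int(value)
    except FileNotFoundError:
        # Either the key or the value does not exist.
        return None


def assess_buffer_queue_size(value: Optional[int]) -> str:
    """Apply the 'set to 3 or higher, or not set at all' advice."""
    if value is None:
        return "not set"
    return "ok" if value >= 3 else "too low"


print(assess_buffer_queue_size(read_buffer_queue_size()))
```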

    Regards, Mike J. [MSFT]

    Friday, June 29, 2012 10:57 PM
    Moderator