The operation failed because of a protection agent failure. (ID 998 Details: The semaphore timeout period has expired (0x80070079))

  • Question

  • Recently we got this error while backing up to tape, and we have no idea what it means.

    What causes this error?

    The operation failed because of a protection agent failure. (ID 998 Details: The semaphore timeout period has expired (0x80070079))


    DPM 2010 with latest roll-up (KB2250444) | DELL Server R710 (Windows 2008 R2 SP1) RAM: 24GB PF: 36-60GB | DELL TL2000 (2 Drives) | And still struggling and monitoring... :(

    Thursday, May 31, 2012 10:18 PM

Answers

  • Hi,

    Try this workaround: increase TcpMaxDataRetransmissions to 10 or more.

    How to modify the TCP/IP maximum retransmission timeout
    http://support.microsoft.com/kb/170359


    Diagnostic steps when "Semaphore timeout" is hit during network transfer:

    1. Check the event logs on both machines to see whether the protected server (sender) or the DPM server (receiver) was under stress or inaccessible at the time of the failure. If the packet loss was caused by either server being inaccessible or under stress for a period, a retry should work.

    2. Check whether the network between the protected server (PS) and the DPM server is flaky. The retransmit count from 'netstat -s' or from perfmon counters can give an idea.

    3. Otherwise, contact a network support engineer to diagnose the packet loss. Netmon captures from both machines, the packet route, and the network layout/devices will be required to start the investigation.

    4. Take performance monitor logs on both the DPM server and the protected server.
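As a rough aid for step 2, the retransmit count can be turned into a ratio. The sketch below assumes the text layout that Windows 'netstat -s' prints for its first TCP statistics section (lines such as "Segments Sent = N"); the sample text and the function names are illustrative, not taken from the thread.

```python
import re

# Illustrative excerpt in the layout of Windows `netstat -s` (values are made up).
SAMPLE_NETSTAT = """
TCP Statistics for IPv4

  Active Opens                        = 1234
  Segments Received                   = 100000
  Segments Sent                       = 90000
  Segments Retransmitted              = 450

TCP Statistics for IPv6

  Segments Sent                       = 10
  Segments Retransmitted              = 0
"""

def parse_tcp_stats(netstat_output):
    """Return the 'name = number' pairs of the first TCP statistics section."""
    stats = {}
    in_tcp = False
    for line in netstat_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("TCP Statistics"):
            if stats:          # a second TCP section (IPv6) begins: stop
                break
            in_tcp = True
            continue
        if not in_tcp or not stripped:
            continue
        match = re.match(r"^(.+?)\s*=\s*(\d+)$", stripped)
        if match:
            stats[match.group(1)] = int(match.group(2))
        elif stats:            # some other section header: stop
            break
    return stats

def retransmit_ratio(stats):
    """Fraction of sent TCP segments that were retransmitted."""
    sent = stats.get("Segments Sent", 0)
    retransmitted = stats.get("Segments Retransmitted", 0)
    return retransmitted / sent if sent else 0.0

stats = parse_tcp_stats(SAMPLE_NETSTAT)
print(f"retransmit ratio: {retransmit_ratio(stats):.4%}")
```

The counters are cumulative since boot, so comparing two snapshots taken before and after a backup job is likely more telling than a single reading.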

    Some good and basic perfmon counters to take to see if the servers are under stress are below.

    Logical Disk/Physical Disk
    ******************
    \% Idle Time
    • 100% idle to 50% idle = Healthy
    • 49% idle to 20% idle = Warning or Monitor
    • 19% idle to 0% idle = Critical or Out of Spec
    \Avg. Disk sec/Read or \Avg. Disk sec/Write (reported in seconds)
    • .001 to .015 (1 ms to 15 ms) = Healthy
    • .015 to .025 (15 ms to 25 ms) = Warning or Monitor
    • .026 (26 ms) or greater = Critical or Out of Spec
    \Current Disk Queue Length (for all instances)
    • 80 requests for more than 6 minutes = Possibly excessive disk queue length
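The disk thresholds above can be written down as a small classifier. This is only a sketch: the function names are made up, latency is taken in seconds (the unit perfmon uses for the Avg. Disk sec counters), values that fall between the listed bands are assigned to the worse band as an assumption, and the duration of a run of samples is approximated as sample count times sample interval.

```python
def classify_disk_idle(idle_pct):
    """Map % Idle Time to the bands listed above."""
    if idle_pct >= 50:
        return "Healthy"
    if idle_pct >= 20:
        return "Warning"
    return "Critical"

def classify_disk_latency(seconds):
    """Map Avg. Disk sec/Read or sec/Write (in seconds) to the bands above."""
    if seconds <= 0.015:
        return "Healthy"
    if seconds <= 0.025:
        return "Warning"
    return "Critical"

def queue_length_excessive(samples, interval_s, limit=80, duration_s=360):
    """True if the queue length stays at or above `limit` for more than
    `duration_s` seconds of consecutive samples taken every `interval_s`."""
    run = 0
    for queue_length in samples:
        run = run + 1 if queue_length >= limit else 0
        if run * interval_s > duration_s:
            return True
    return False

print(classify_disk_idle(35), classify_disk_latency(0.030))
print(queue_length_excessive([85] * 7, interval_s=60))
```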
    Memory
    *******
    \Pool Nonpaged Bytes
    • Less than 60% of pool consumed = Healthy
    • 61% - 80% of pool consumed = Warning or Monitor
    • Greater than 80% of pool consumed = Critical or Out of Spec
    \Pool Paged Bytes
    • Less than 60% of pool consumed = Healthy
    • 61% - 80% of pool consumed = Warning or Monitor
    • Greater than 80% of pool consumed = Critical or Out of Spec
    \Available MBytes
    • 50% of free memory available or more = Healthy
    • 25% of free memory available = Monitor
    • 10% of free memory available = Warning
    • Less than 100 MB or 5% of free memory available = Critical or Out of Spec
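For the available-memory counter, the guidance above gives single points (50%, 25%, 10%, 5%) rather than full ranges. The sketch below fills the gaps with an assumption: 25% up to 50% counts as Monitor and 5% up to 25% counts as Warning. The function name is hypothetical, and the total memory figure has to come from elsewhere because the counter only reports free megabytes.

```python
def classify_available_memory(available_mb, total_mb):
    """Classify free memory using the thresholds above (gap handling is an assumption)."""
    if total_mb <= 0:
        raise ValueError("total_mb must be positive")
    free_pct = 100.0 * available_mb / total_mb
    if available_mb < 100 or free_pct < 5:
        return "Critical"
    if free_pct >= 50:
        return "Healthy"
    if free_pct >= 25:
        return "Monitor"
    return "Warning"

# Example with 24 GB of RAM, as on the server described in the question.
TOTAL_MB = 24 * 1024
for free_mb in (13000, 7000, 2000, 1000):
    print(free_mb, classify_available_memory(free_mb, TOTAL_MB))
```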
    Processor
    *******
    \% Processor Time (all instances)
    • Less than 60% consumed = Healthy
    • 60% - 90% consumed = Monitor or Caution
    • 91% - 100% consumed = Critical
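One way to collect such a log is the Windows typeperf tool, for example `typeperf "\Processor(_Total)\% Processor Time" -si 5 -sc 60 -f CSV -o cpu.csv`. The sketch below reads a CSV of that shape and classifies the average, taking "less than 60%" as the Healthy boundary. The sample data, the host name DPM01, and the function names are made up for illustration.

```python
import csv
import io
from statistics import mean

# Made-up sample in the two-column layout typeperf writes (timestamp, value).
SAMPLE_CSV = r'''"(PDH-CSV 4.0) (Pacific Daylight Time)(420)","\\DPM01\Processor(_Total)\% Processor Time"
"05/31/2012 23:20:05.000","42.5"
"05/31/2012 23:20:10.000"," "
"05/31/2012 23:20:15.000","95.0"
"05/31/2012 23:20:20.000","72.5"
'''

def read_counter_samples(csv_text):
    """Return the numeric values of the first counter column, skipping blanks."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row with the counter path
    samples = []
    for row in reader:
        if len(row) < 2:
            continue
        try:
            samples.append(float(row[1]))
        except ValueError:
            continue  # empty or non-numeric sample
    return samples

def classify_cpu(avg_pct):
    """Map average % Processor Time to the bands listed above."""
    if avg_pct < 60:
        return "Healthy"
    if avg_pct <= 90:
        return "Monitor"
    return "Critical"

samples = read_counter_samples(SAMPLE_CSV)
print(classify_cpu(mean(samples)))
```

Averaging hides short spikes, so looking at the maximum or at how long the value stays above 90% may be a useful complement.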


    Regards, Mike J. [MSFT]


    Thursday, May 31, 2012 11:20 PM
    Moderator

All replies

  • Hi Mike,

    I faced the same issue with DPM 2012 R2 installed on Windows Server 2012 R2 while taking long-term backups to tape. We made the following changes on the DPM server to resolve it.

    We added the following DWORD value to the registry:

    HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters

    Value Name:  TcpMaxDataRetransmissions
    Data Type:   REG_DWORD - Number
    Valid Range: 0 - 0xFFFFFFFF
    Default:     5

    Set TcpMaxDataRetransmissions to 10
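For repeatability, the same value can be expressed as a .reg file to import with regedit. The sketch below only builds the file text; the function name is hypothetical, and writing the file in an encoding regedit accepts, plus any restart needed for the TCP/IP setting to take effect, is left to the reader.

```python
TCPIP_KEY = r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters"

def build_reg_file(key, name, value):
    """Return .reg file text that sets one REG_DWORD value under `key`."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("a REG_DWORD must fit in 32 bits")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        f'"{name}"=dword:{value:08x}',  # DWORD data is written as 8 hex digits
        "",
    ]
    return "\r\n".join(lines)

print(build_reg_file(TCPIP_KEY, "TcpMaxDataRetransmissions", 10))
```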

    We set the following properties on the network adapter:

    ARP Offload - Disabled
    Large Send Offload V2 (IPv4) - Disabled
    Large Send Offload V2 (IPv6) - Disabled
    NS Offload - Disabled
    Receive Side Scaling - Disabled
    TCP/UDP Checksum Offload (IPv4) - Disabled
    TCP/UDP Checksum Offload (IPv6) - Disabled
    Transmit Buffers - 600

    The backup now works fine.

    Additional information for fine-tuning

    1. For the tape short-erase setting, add the following DWORD value to the registry:

    HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Data Protection Manager\Agent

    Name: UseShortErase
    Type: REG_DWORD
    Data: 0x00000000 (0)

    2. If you run into a suspect tape, the following link is very helpful:

    http://itsalllegit.wordpress.com/2013/10/08/dpm-2012-sp1-suspect-tape/

    For performance and other issues with a DELL tape library, download and install the ITDT tool and share the results with DELL Support.

    TIP: Don't start multiple jobs at the same time when backing up to tape. Plan the schedule when creating the protection group, or modify it later.

    We have also upgraded the firmware of the tape drive.


    Kirpal Singh

    Wednesday, July 30, 2014 11:44 AM